linux-kernel.vger.kernel.org archive mirror
* [patch 00/35] my inode scaling series for review
@ 2010-10-19  3:42 npiggin
  2010-10-19  3:42 ` [patch 01/35] bit_spinlock: add required includes npiggin
                   ` (36 more replies)
  0 siblings, 37 replies; 70+ messages in thread
From: npiggin @ 2010-10-19  3:42 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel

Here is my famously tardy inode scaling patch set, for review and merging.
Yes, it is a lot of patches, but it is very well broken out. It is not
rocket science -- if you don't understand something, please ask me to add
comments.

Patches 1-13 incrementally take over inode_lock in small, conservative and
(as far as possible) obvious steps. Subsequent patches improve the code and
its performance.

The only significant changes from the inode scaling work in the vfs-scale
tree are merging up to mainline, incorporating review suggestions, splitting
the patch set up better, and improving comments and changelogs.

This is compatible with the rest of the dcache scaling improvements in my
tree, including the store-free path walking (rcu-walk).

I don't think Dave Chinner's approach is the way to go for a number of
reasons.

* My locking design allows i_lock to lock the entire state of the icache
  for a particular inode. Dave's does not, and he had to add code that is
  not required with inode_lock synchronisation or with my i_lock
  synchronisation. I prefer to be very conservative about making changes,
  especially before inode_lock is lifted (which is the end-point of
  bisection for any locking breakage introduced before that point).

* As far as I can tell, I have addressed all of Dave's and Christoph's real
  concerns. The disagreement about the i_lock locking model can easily be
  resolved if they post a couple of small incremental patches at the end of
  the series, making i_lock locking less regular so that it no longer
  protects the icache state of a given inode (as inode_lock did before this
  patch set). I have repeatedly disagreed with that approach, however.

* I have used RCU for inodes, and structured a lot of the locking around that.
  RCU is required for store-free path walking, so IMO it makes more sense to
  implement it now rather than in a subsequent release (and then rework inode
  locking to take advantage of it). I have a design sketched for using slab
  RCU freeing, which is a little more complex, but it should be able to take
  care of any real-workload regressions if we do discover them. (A rough
  sketch of RCU-deferred inode freeing appears after this list.)

* I implement per-zone LRU lists and locking, which are desperately required
  for reasonable NUMA performance, and are a first step towards proper
  memory-controller control of the vfs caches (Google have a similar
  per-zone LRU patch they need for their fakenuma-based memory control, I
  believe).

* I implemented per-cpu locking for the inode sb lists. The scalability and
  single threaded performance of the full vfs-scale stack have been tested
  quite well. Most of the vfs scales pretty linearly up to several hundred
  sockets at least. I have counted cycles on various x86 and POWER
  architectures to improve single threaded performance. It's an ongoing
  process, but a lot of work has already been done there.

  We want all these things ASAP, so it doesn't make sense to me to stage
  significant locking changes in the icache code out over several releases.
  The series is bisectable and reviewable, so I think getting it all out of
  the way now will reduce churn and headache for everyone.
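
To illustrate the RCU bullet above, here is a minimal sketch of RCU-deferred
inode freeing. The i_rcu field and the function names are assumptions made
for this sketch only, not necessarily what the patches end up using:

/*
 * Sketch only: free inodes via call_rcu() so that lock-free (rcu-walk)
 * lookups running under rcu_read_lock() can safely dereference an inode
 * that is concurrently on its way to being freed.
 */
static void inode_free_rcu(struct rcu_head *head)
{
	/* i_rcu: hypothetical rcu_head embedded in struct inode */
	struct inode *inode = container_of(head, struct inode, i_rcu);

	kmem_cache_free(inode_cachep, inode);
}

static void my_destroy_inode(struct inode *inode)
{
	call_rcu(&inode->i_rcu, inode_free_rcu);
}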




* [patch 01/35] bit_spinlock: add required includes
  2010-10-19  3:42 [patch 00/35] my inode scaling series for review npiggin
@ 2010-10-19  3:42 ` npiggin
  2010-10-19  3:42 ` [patch 02/35] kernel: add bl_list npiggin
                   ` (35 subsequent siblings)
  36 siblings, 0 replies; 70+ messages in thread
From: npiggin @ 2010-10-19  3:42 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel

[-- Attachment #1: bit_spinlock-includes.patch --]
[-- Type: text/plain, Size: 607 bytes --]

Signed-off-by: Nick Piggin <npiggin@kernel.dk>

---
 include/linux/bit_spinlock.h |    4 ++++
 1 file changed, 4 insertions(+)

Index: linux-2.6/include/linux/bit_spinlock.h
===================================================================
--- linux-2.6.orig/include/linux/bit_spinlock.h	2010-10-19 14:17:28.000000000 +1100
+++ linux-2.6/include/linux/bit_spinlock.h	2010-10-19 14:18:58.000000000 +1100
@@ -1,6 +1,10 @@
 #ifndef __LINUX_BIT_SPINLOCK_H
 #define __LINUX_BIT_SPINLOCK_H
 
+#include <linux/kernel.h>
+#include <linux/preempt.h>
+#include <asm/atomic.h>
+
 /*
  *  bit-based spin_lock()
  *




* [patch 02/35] kernel: add bl_list
  2010-10-19  3:42 [patch 00/35] my inode scaling series for review npiggin
  2010-10-19  3:42 ` [patch 01/35] bit_spinlock: add required includes npiggin
@ 2010-10-19  3:42 ` npiggin
  2010-10-19  3:42 ` [patch 03/35] mm: implement per-zone shrinker npiggin
                   ` (34 subsequent siblings)
  36 siblings, 0 replies; 70+ messages in thread
From: npiggin @ 2010-10-19  3:42 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel

[-- Attachment #1: list-bitlock.patch --]
[-- Type: text/plain, Size: 9418 bytes --]

Introduce a type of hlist that can support the use of the lowest bit in the
hlist_head as a lock. This will subsequently be used to implement per-bucket
bit spinlocks for the inode and dentry hashes, and may be useful in other
cases such as network hashes. A short usage sketch follows.
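
As a usage sketch (the object type and hash table here are hypothetical;
the list calls are the API added below, and the lock bit is taken with the
existing bit_spin_lock primitives):

#include <linux/list_bl.h>

struct my_obj {				/* hypothetical hashed object */
	struct hlist_bl_node node;
};

#define MY_HASH_BITS	10		/* hypothetical table size */
static struct hlist_bl_head my_hash[1 << MY_HASH_BITS];

static void my_hash_insert(struct my_obj *obj, unsigned int bucket)
{
	struct hlist_bl_head *b = &my_hash[bucket];

	/* bit 0 of ->first is the per-bucket lock */
	bit_spin_lock(0, (unsigned long *)&b->first);
	hlist_bl_add_head(&obj->node, b);
	bit_spin_unlock(0, (unsigned long *)&b->first);
}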

Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Nick Piggin <npiggin@kernel.dk>

---
 include/linux/list_bl.h    |  141 +++++++++++++++++++++++++++++++++++++++++++++
 include/linux/rculist_bl.h |  128 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 269 insertions(+)

Index: linux-2.6/include/linux/list_bl.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/include/linux/list_bl.h	2010-10-19 14:18:58.000000000 +1100
@@ -0,0 +1,141 @@
+#ifndef _LINUX_LIST_BL_H
+#define _LINUX_LIST_BL_H
+
+#include <linux/list.h>
+#include <linux/bit_spinlock.h>
+
+/*
+ * Special version of lists, where head of the list has a bit spinlock
+ * in the lowest bit. This is useful for scalable hash tables without
+ * increasing memory footprint overhead.
+ *
+ * For modification operations, the 0 bit of hlist_bl_head->first
+ * pointer must be set.
+ */
+
+#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
+#define LIST_BL_LOCKMASK	1UL
+#else
+#define LIST_BL_LOCKMASK	0UL
+#endif
+
+#ifdef CONFIG_DEBUG_LIST
+#define LIST_BL_BUG_ON(x) BUG_ON(x)
+#else
+#define LIST_BL_BUG_ON(x)
+#endif
+
+
+struct hlist_bl_head {
+	struct hlist_bl_node *first;
+};
+
+struct hlist_bl_node {
+	struct hlist_bl_node *next, **pprev;
+};
+#define INIT_HLIST_BL_HEAD(ptr) \
+	((ptr)->first = NULL)
+
+static inline void INIT_HLIST_BL_NODE(struct hlist_bl_node *h)
+{
+	h->next = NULL;
+	h->pprev = NULL;
+}
+
+#define hlist_bl_entry(ptr, type, member) container_of(ptr,type,member)
+
+static inline int hlist_bl_unhashed(const struct hlist_bl_node *h)
+{
+	return !h->pprev;
+}
+
+static inline struct hlist_bl_node *hlist_bl_first(struct hlist_bl_head *h)
+{
+	return (struct hlist_bl_node *)
+		((unsigned long)h->first & ~LIST_BL_LOCKMASK);
+}
+
+static inline void hlist_bl_set_first(struct hlist_bl_head *h,
+					struct hlist_bl_node *n)
+{
+	LIST_BL_BUG_ON((unsigned long)n & LIST_BL_LOCKMASK);
+	LIST_BL_BUG_ON(!bit_spin_is_locked(0, (unsigned long *)&h->first));
+	h->first = (struct hlist_bl_node *)((unsigned long)n | LIST_BL_LOCKMASK);
+}
+
+static inline int hlist_bl_empty(const struct hlist_bl_head *h)
+{
+	return !((unsigned long)h->first & ~LIST_BL_LOCKMASK);
+}
+
+static inline void hlist_bl_add_head(struct hlist_bl_node *n,
+					struct hlist_bl_head *h)
+{
+	struct hlist_bl_node *first = hlist_bl_first(h);
+
+	n->next = first;
+	if (first)
+		first->pprev = &n->next;
+	n->pprev = &h->first;
+	hlist_bl_set_first(h, n);
+}
+
+static inline void __hlist_bl_del(struct hlist_bl_node *n)
+{
+	struct hlist_bl_node *next = n->next;
+	struct hlist_bl_node **pprev = n->pprev;
+
+	LIST_BL_BUG_ON((unsigned long)n & LIST_BL_LOCKMASK);
+
+	/* pprev may point at the head's ->first, so take care to keep the lock bit */
+	*pprev = (struct hlist_bl_node *)
+			((unsigned long)next |
+			 ((unsigned long)*pprev & LIST_BL_LOCKMASK));
+	if (next)
+		next->pprev = pprev;
+}
+
+static inline void hlist_bl_del(struct hlist_bl_node *n)
+{
+	__hlist_bl_del(n);
+	n->next = LIST_POISON1;
+	n->pprev = LIST_POISON2;
+}
+
+static inline void hlist_bl_del_init(struct hlist_bl_node *n)
+{
+	if (!hlist_bl_unhashed(n)) {
+		__hlist_bl_del(n);
+		INIT_HLIST_BL_NODE(n);
+	}
+}
+
+/**
+ * hlist_bl_for_each_entry	- iterate over list of given type
+ * @tpos:	the type * to use as a loop cursor.
+ * @pos:	the &struct hlist_bl_node to use as a loop cursor.
+ * @head:	the head for your list.
+ * @member:	the name of the hlist_bl_node within the struct.
+ *
+ */
+#define hlist_bl_for_each_entry(tpos, pos, head, member)		\
+	for (pos = hlist_bl_first(head);				\
+	     pos &&							\
+		({ tpos = hlist_bl_entry(pos, typeof(*tpos), member); 1;}); \
+	     pos = pos->next)
+
+/**
+ * hlist_bl_for_each_entry_safe - iterate over list of given type safe against removal of list entry
+ * @tpos:	the type * to use as a loop cursor.
+ * @pos:	the &struct hlist_bl_node to use as a loop cursor.
+ * @n:		another &struct hlist_bl_node to use as temporary storage
+ * @head:	the head for your list.
+ * @member:	the name of the hlist_bl_node within the struct.
+ */
+#define hlist_bl_for_each_entry_safe(tpos, pos, n, head, member)	 \
+	for (pos = hlist_bl_first(head);				 \
+	     pos && ({ n = pos->next; 1; }) && 				 \
+		({ tpos = hlist_bl_entry(pos, typeof(*tpos), member); 1;}); \
+	     pos = n)
+
+#endif
Index: linux-2.6/include/linux/rculist_bl.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/include/linux/rculist_bl.h	2010-10-19 14:18:58.000000000 +1100
@@ -0,0 +1,128 @@
+#ifndef _LINUX_RCULIST_BL_H
+#define _LINUX_RCULIST_BL_H
+
+/*
+ * RCU-protected bl list version. See include/linux/list_bl.h.
+ */
+#include <linux/list_bl.h>
+#include <linux/rcupdate.h>
+#include <linux/bit_spinlock.h>
+
+static inline void hlist_bl_set_first_rcu(struct hlist_bl_head *h,
+					struct hlist_bl_node *n)
+{
+	LIST_BL_BUG_ON((unsigned long)n & LIST_BL_LOCKMASK);
+	LIST_BL_BUG_ON(!bit_spin_is_locked(0, (unsigned long *)&h->first));
+	rcu_assign_pointer(h->first,
+		(struct hlist_bl_node *)((unsigned long)n | LIST_BL_LOCKMASK));
+}
+
+static inline struct hlist_bl_node *hlist_bl_first_rcu(struct hlist_bl_head *h)
+{
+	return (struct hlist_bl_node *)
+		((unsigned long)rcu_dereference(h->first) & ~LIST_BL_LOCKMASK);
+}
+
+/**
+ * hlist_bl_del_init_rcu - deletes entry from hash list with re-initialization
+ * @n: the element to delete from the hash list.
+ *
+ * Note: hlist_bl_unhashed() on the node returns true after this. It is
+ * useful for RCU based read lockfree traversal if the writer side
+ * must know if the list entry is still hashed or already unhashed.
+ *
+ * In particular, it means that we can not poison the forward pointers
+ * that may still be used for walking the hash list and we can only
+ * zero the pprev pointer so list_unhashed() will return true after
+ * this.
+ *
+ * The caller must take whatever precautions are necessary (such as
+ * holding appropriate locks) to avoid racing with another
+ * list-mutation primitive, such as hlist_bl_add_head_rcu() or
+ * hlist_bl_del_rcu(), running on this same list.  However, it is
+ * perfectly legal to run concurrently with the _rcu list-traversal
+ * primitives, such as hlist_bl_for_each_entry_rcu().
+ */
+static inline void hlist_bl_del_init_rcu(struct hlist_bl_node *n)
+{
+	if (!hlist_bl_unhashed(n)) {
+		__hlist_bl_del(n);
+		n->pprev = NULL;
+	}
+}
+
+/**
+ * hlist_bl_del_rcu - deletes entry from hash list without re-initialization
+ * @n: the element to delete from the hash list.
+ *
+ * Note: hlist_bl_unhashed() on entry does not return true after this,
+ * the entry is in an undefined state. It is useful for RCU based
+ * lockfree traversal.
+ *
+ * In particular, it means that we can not poison the forward
+ * pointers that may still be used for walking the hash list.
+ *
+ * The caller must take whatever precautions are necessary
+ * (such as holding appropriate locks) to avoid racing
+ * with another list-mutation primitive, such as hlist_bl_add_head_rcu()
+ * or hlist_bl_del_rcu(), running on this same list.
+ * However, it is perfectly legal to run concurrently with
+ * the _rcu list-traversal primitives, such as
+ * hlist_bl_for_each_entry().
+ */
+static inline void hlist_bl_del_rcu(struct hlist_bl_node *n)
+{
+	__hlist_bl_del(n);
+	n->pprev = LIST_POISON2;
+}
+
+/**
+ * hlist_bl_add_head_rcu
+ * @n: the element to add to the hash list.
+ * @h: the list to add to.
+ *
+ * Description:
+ * Adds the specified element to the specified hlist_bl,
+ * while permitting racing traversals.
+ *
+ * The caller must take whatever precautions are necessary
+ * (such as holding appropriate locks) to avoid racing
+ * with another list-mutation primitive, such as hlist_bl_add_head_rcu()
+ * or hlist_bl_del_rcu(), running on this same list.
+ * However, it is perfectly legal to run concurrently with
+ * the _rcu list-traversal primitives, such as
+ * hlist_bl_for_each_entry_rcu(), used to prevent memory-consistency
+ * problems on Alpha CPUs.  Regardless of the type of CPU, the
+ * list-traversal primitive must be guarded by rcu_read_lock().
+ */
+static inline void hlist_bl_add_head_rcu(struct hlist_bl_node *n,
+					struct hlist_bl_head *h)
+{
+	struct hlist_bl_node *first;
+
+	/* don't need hlist_bl_first_rcu because we're under lock */
+	first = hlist_bl_first(h);
+
+	n->next = first;
+	if (first)
+		first->pprev = &n->next;
+	n->pprev = &h->first;
+
+	/* need _rcu because we can have concurrent lock free readers */
+	hlist_bl_set_first_rcu(h, n);
+}
+/**
+ * hlist_bl_for_each_entry_rcu - iterate over rcu list of given type
+ * @tpos:	the type * to use as a loop cursor.
+ * @pos:	the &struct hlist_bl_node to use as a loop cursor.
+ * @head:	the head for your list.
+ * @member:	the name of the hlist_bl_node within the struct.
+ *
+ */
+#define hlist_bl_for_each_entry_rcu(tpos, pos, head, member)		\
+	for (pos = hlist_bl_first_rcu(head);				\
+		pos &&							\
+		({ tpos = hlist_bl_entry(pos, typeof(*tpos), member); 1; }); \
+		pos = rcu_dereference_raw(pos->next))
+
+#endif




* [patch 03/35] mm: implement per-zone shrinker
  2010-10-19  3:42 [patch 00/35] my inode scaling series for review npiggin
  2010-10-19  3:42 ` [patch 01/35] bit_spinlock: add required includes npiggin
  2010-10-19  3:42 ` [patch 02/35] kernel: add bl_list npiggin
@ 2010-10-19  3:42 ` npiggin
  2010-10-19  4:49   ` KOSAKI Motohiro
  2010-10-19  3:42 ` [patch 04/35] vfs: convert inode and dentry caches to " npiggin
                   ` (33 subsequent siblings)
  36 siblings, 1 reply; 70+ messages in thread
From: npiggin @ 2010-10-19  3:42 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel; +Cc: linux-mm

[-- Attachment #1: mm-zone-shrinker.patch --]
[-- Type: text/plain, Size: 23488 bytes --]

Allow the shrinker to do per-zone shrinking. This requires adding a zone
argument to the shrinker callback and calling shrinkers for each zone
scanned. The logic in the vmscan code gets somewhat simpler: the shrinkers
are invoked for each zone, around the same time as the pagecache scanner.
Zone reclaim needed a bit of surgery to cope with the change, but the
idea is the same.

But all shrinkers are currently global, so they need a way to convert
per-zone scan ratios into global ones. Seeing as we are changing the
shrinker API anyway, let's reorganise it to make it saner.

So the shrinker callback is passed:
- the number of pagecache pages scanned in this zone
- the number of pagecache pages in this zone
- the total number of pagecache pages in all zones to be scanned

The shrinker is now completely responsible for calculating and batching
its scanning, which provides better flexibility. vmscan helper functions
are provided to accumulate these ratios and to help with batching.

Finally, add some fixed-point scaling to the ratio, which helps rounding.

The old shrinker API remains for unconverted code. There is no urgency
to convert it all at once; a sketch of a shrinker using the new API follows.
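
As a sketch, a cache converted to the new API looks like this (my_nr_unused
and my_prune() are stand-ins for a real cache's bookkeeping; patch 04
converts the dcache and icache in exactly this shape):

static void my_shrink_zone(struct shrinker *shrink,
		struct zone *zone, unsigned long scanned,
		unsigned long total, unsigned long global,
		unsigned long flags, gfp_t gfp_mask)
{
	static unsigned long nr_to_scan;	/* fixed-point accumulator */
	unsigned long nr;

	/* accumulate a proportional amount of scanning work for this zone */
	shrinker_add_scan(&nr_to_scan, scanned, global, my_nr_unused,
			  SHRINK_DEFAULT_SEEKS);

	/* drain the accumulator in SHRINK_BATCH sized chunks */
	while ((nr = shrinker_do_scan(&nr_to_scan, SHRINK_BATCH))) {
		my_prune(nr);
		count_vm_events(SLABS_SCANNED, nr);
		cond_resched();
	}
}

static struct shrinker my_shrinker = {
	.shrink_zone = my_shrink_zone,	/* .shrink stays NULL: new API */
};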

Cc: linux-mm@kvack.org
Signed-off-by: Nick Piggin <npiggin@kernel.dk>

---
 fs/drop_caches.c    |    6 
 include/linux/mm.h  |   43 ++++++
 mm/memory-failure.c |   10 -
 mm/vmscan.c         |  327 +++++++++++++++++++++++++++++++++++++---------------
 4 files changed, 279 insertions(+), 107 deletions(-)

Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h	2010-10-19 14:19:40.000000000 +1100
+++ linux-2.6/include/linux/mm.h	2010-10-19 14:36:48.000000000 +1100
@@ -997,6 +997,10 @@
 /*
  * A callback you can register to apply pressure to ageable caches.
  *
+ * 'shrink_zone' is the new shrinker API. It is to be used in preference
+ * to 'shrink'. One must point to a shrinker function, the other must
+ * be NULL. See 'shrink_slab' for details about the shrink_zone API.
+ *
  * 'shrink' is passed a count 'nr_to_scan' and a 'gfpmask'.  It should
  * look through the least-recently-used 'nr_to_scan' entries and
  * attempt to free them up.  It should return the number of objects
@@ -1013,13 +1017,53 @@
 	int (*shrink)(struct shrinker *, int nr_to_scan, gfp_t gfp_mask);
 	int seeks;	/* seeks to recreate an obj */
 
+	/*
+	 * shrink_zone - slab shrinker callback for reclaimable objects
+	 * @shrink: this struct shrinker
+	 * @zone: zone to scan
+	 * @scanned: pagecache lru pages scanned in zone
+	 * @total: total pagecache lru pages in zone
+	 * @global: global pagecache lru pages (for zone-unaware shrinkers)
+	 * @flags: shrinker flags
+	 * @gfp_mask: gfp context we are operating within
+	 *
+	 * The shrinkers are responsible for calculating the appropriate
+	 * pressure to apply, batching up scanning (and cond_resched,
+	 * cond_resched_lock etc), and updating events counters including
+	 * count_vm_event(SLABS_SCANNED, nr).
+	 *
+	 * This approach gives flexibility to the shrinkers. They know best how
+	 * to do batching, how much time between cond_resched is appropriate,
+	 * what statistics to increment, etc.
+	 */
+	void (*shrink_zone)(struct shrinker *shrink,
+		struct zone *zone, unsigned long scanned,
+		unsigned long total, unsigned long global,
+		unsigned long flags, gfp_t gfp_mask);
+
 	/* These are for internal use */
 	struct list_head list;
 	long nr;	/* objs pending delete */
 };
+
+/* Constants for use by old shrinker API */
 #define DEFAULT_SEEKS 2 /* A good number if you don't know better. */
+
+/* Constants for use by new shrinker API */
+/*
+ * SHRINK_DEFAULT_SEEKS is divided by 4 to match the arbitrary factor
+ * of 4 used in the old shrinker code (see shrink_slab_old).
+ */
+#define SHRINK_FACTOR	(128UL) /* Fixed point shift */
+#define SHRINK_DEFAULT_SEEKS	(SHRINK_FACTOR*DEFAULT_SEEKS/4)
+#define SHRINK_BATCH	128	/* A good number if you don't know better */
+
 extern void register_shrinker(struct shrinker *);
 extern void unregister_shrinker(struct shrinker *);
+extern void shrinker_add_scan(unsigned long *dst,
+				unsigned long scanned, unsigned long total,
+				unsigned long objects, unsigned int ratio);
+extern unsigned long shrinker_do_scan(unsigned long *dst, unsigned long batch);
 
 int vma_wants_writenotify(struct vm_area_struct *vma);
 
@@ -1443,8 +1487,7 @@
 
 int drop_caches_sysctl_handler(struct ctl_table *, int,
 					void __user *, size_t *, loff_t *);
-unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask,
-			unsigned long lru_pages);
+void shrink_all_slab(void);
 
 #ifndef CONFIG_MMU
 #define randomize_va_space 0
Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c	2010-10-19 14:19:40.000000000 +1100
+++ linux-2.6/mm/vmscan.c	2010-10-19 14:33:38.000000000 +1100
@@ -74,6 +74,9 @@
 	/* Can pages be swapped as part of reclaim? */
 	int may_swap;
 
+	/* Can slab pages be reclaimed? */
+	int may_reclaim_slab;
+
 	int swappiness;
 
 	int order;
@@ -163,6 +166,8 @@
  */
 void register_shrinker(struct shrinker *shrinker)
 {
+	BUG_ON(shrinker->shrink && shrinker->shrink_zone);
+	BUG_ON(!shrinker->shrink && !shrinker->shrink_zone);
 	shrinker->nr = 0;
 	down_write(&shrinker_rwsem);
 	list_add_tail(&shrinker->list, &shrinker_list);
@@ -181,43 +186,101 @@
 }
 EXPORT_SYMBOL(unregister_shrinker);
 
-#define SHRINK_BATCH 128
 /*
- * Call the shrink functions to age shrinkable caches
+ * shrinker_add_scan - accumulate shrinker scan
+ * @dst: scan counter variable
+ * @scanned: pagecache pages scanned
+ * @total: total pagecache pages
+ * @objects: total objects in this cache
+ * @ratio: ratio of pagecache value to object value
  *
- * Here we assume it costs one seek to replace a lru page and that it also
- * takes a seek to recreate a cache object.  With this in mind we age equal
- * percentages of the lru and ageable caches.  This should balance the seeks
- * generated by these structures.
+ * shrinker_add_scan accumulates a number of objects to scan into @dst,
+ * based on the following ratio:
  *
- * If the vm encountered mapped pages on the LRU it increase the pressure on
- * slab to avoid swapping.
+ * proportion = scanned / total        // proportion of pagecache scanned
+ * obj_prop   = objects * proportion   // same proportion of objects
+ * to_scan    = obj_prop / ratio       // modify by ratio
+ * *dst += to_scan                    // accumulate to dst (SHRINK_FACTOR units)
  *
- * We do weird things to avoid (scanned*seeks*entries) overflowing 32 bits.
+ * The ratio is a fixed point integer with a factor SHRINK_FACTOR.
+ * Higher ratios give objects higher value.
  *
- * `lru_pages' represents the number of on-LRU pages in all the zones which
- * are eligible for the caller's allocation attempt.  It is used for balancing
- * slab reclaim versus page reclaim.
+ * @dst is also fixed point, so cannot be used as a simple count.
+ * shrinker_do_scan will take care of that for us.
  *
- * Returns the number of slab objects which we shrunk.
+ * There is no synchronisation here, which is fine really. A rare lost
+ * update is no huge deal in reclaim code.
  */
-unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask,
-			unsigned long lru_pages)
+void shrinker_add_scan(unsigned long *dst,
+			unsigned long scanned, unsigned long total,
+			unsigned long objects, unsigned int ratio)
 {
-	struct shrinker *shrinker;
-	unsigned long ret = 0;
+	unsigned long long delta;
 
-	if (scanned == 0)
-		scanned = SWAP_CLUSTER_MAX;
+	delta = (unsigned long long)scanned * objects;
+	delta *= SHRINK_FACTOR;
+	do_div(delta, total + 1);
+	delta *= SHRINK_FACTOR; /* ratio is also in SHRINK_FACTOR units */
+	do_div(delta, ratio + 1);
 
-	if (!down_read_trylock(&shrinker_rwsem))
-		return 1;	/* Assume we'll be able to shrink next time */
+	/*
+	 * Avoid the risk of looping forever due to an overly large nr
+	 * value: never accumulate more scan work than the estimated
+	 * number of freeable entries.
+	 */
+	*dst += delta;
+
+	if (*dst / SHRINK_FACTOR > objects)
+		*dst = objects * SHRINK_FACTOR;
+}
+EXPORT_SYMBOL(shrinker_add_scan);
+
+/*
+ * shrinker_do_scan - scan a batch of objects
+ * @dst: scan counter
+ * @batch: number of objects to scan in this batch
+ * Returns: number of objects to scan
+ *
+ * shrinker_do_scan takes the scan counter accumulated by shrinker_add_scan,
+ * and decrements it by @batch if it is greater than batch and returns batch.
+ * Otherwise returns 0. The caller should use the return value as the number
+ * of objects to scan next.
+ *
+ * Between shrinker_do_scan calls, the caller should drop locks if possible
+ * and call cond_resched.
+ *
+ * Note, @dst is a fixed point scaled integer. See shrinker_add_scan.
+ *
+ * Like shrinker_add_scan, shrinker_do_scan is not SMP safe, but it doesn't
+ * really need to be.
+ */
+unsigned long shrinker_do_scan(unsigned long *dst, unsigned long batch)
+{
+	unsigned long nr = ACCESS_ONCE(*dst);
+	if (nr < batch * SHRINK_FACTOR)
+		return 0;
+	*dst = nr - batch * SHRINK_FACTOR;
+	return batch;
+}
+EXPORT_SYMBOL(shrinker_do_scan);
+
+#define SHRINK_BATCH 128
+/*
+ * Scan the deprecated shrinkers. This will go away soon in favour of
+ * converting everybody to new shrinker API.
+ */
+static void shrink_slab_old(unsigned long scanned, gfp_t gfp_mask,
+			unsigned long lru_pages)
+{
+	struct shrinker *shrinker;
 
 	list_for_each_entry(shrinker, &shrinker_list, list) {
 		unsigned long long delta;
 		unsigned long total_scan;
 		unsigned long max_pass;
 
+		if (!shrinker->shrink)
+			continue;
 		max_pass = (*shrinker->shrink)(shrinker, 0, gfp_mask);
 		delta = (4 * scanned) / shrinker->seeks;
 		delta *= max_pass;
@@ -244,15 +307,11 @@
 		while (total_scan >= SHRINK_BATCH) {
 			long this_scan = SHRINK_BATCH;
 			int shrink_ret;
-			int nr_before;
 
-			nr_before = (*shrinker->shrink)(shrinker, 0, gfp_mask);
 			shrink_ret = (*shrinker->shrink)(shrinker, this_scan,
 								gfp_mask);
 			if (shrink_ret == -1)
 				break;
-			if (shrink_ret < nr_before)
-				ret += nr_before - shrink_ret;
 			count_vm_events(SLABS_SCANNED, this_scan);
 			total_scan -= this_scan;
 
@@ -261,8 +320,75 @@
 
 		shrinker->nr += total_scan;
 	}
+}
+/*
+ * shrink_slab - Call the shrink functions to age shrinkable caches
+ * @zone: the zone we are currently reclaiming from
+ * @scanned: how many pagecache pages were scanned in this zone
+ * @total: total number of reclaimable pagecache pages in this zone
+ * @global: total number of reclaimable pagecache pages in the system
+ * @gfp_mask: gfp context that we are in
+ *
+ * Slab shrinkers should scan their objects in proportion to the ratio of
+ * scanned to total pagecache pages in this zone, modified by a "cost"
+ * constant.
+ *
+ * For example, we have a slab cache with 100 reclaimable objects in a
+ * particular zone, and the cost of reclaiming an object is determined to be
+ * twice as expensive as reclaiming a pagecache page (due to likelihood and
+ * cost of reconstruction). If we have 200 reclaimable pagecache pages in that
+ * particular zone, and scan 20 of them (10%), we should scan 5% (5) of
+ * the objects in our slab cache.
+ *
+ * If we have a single global list of objects and no per-zone lists, the
+ * global count of objects can be used to find the correct ratio to scan.
+ *
+ * See shrinker_add_scan and shrinker_do_scan for helper functions and
+ * details on how to calculate these numbers.
+ */
+static void shrink_slab(struct zone *zone, unsigned long scanned,
+			unsigned long total, unsigned long global,
+			gfp_t gfp_mask)
+{
+	struct shrinker *shrinker;
+
+	if (scanned == 0)
+		scanned = SWAP_CLUSTER_MAX;
+
+	if (!down_read_trylock(&shrinker_rwsem))
+		return;
+
+	/* do a global shrink with the old shrinker API */
+	shrink_slab_old(scanned, gfp_mask, global);
+
+	list_for_each_entry(shrinker, &shrinker_list, list) {
+		if (!shrinker->shrink_zone)
+			continue;
+		(*shrinker->shrink_zone)(shrinker, zone, scanned,
+					total, global, 0, gfp_mask);
+	}
 	up_read(&shrinker_rwsem);
-	return ret;
+}
+
+void shrink_all_slab(void)
+{
+	struct zone *zone;
+	struct reclaim_state reclaim_state;
+
+	current->reclaim_state = &reclaim_state;
+	do {
+		reclaim_state.reclaimed_slab = 0;
+		/*
+		 * Use "100" for "scanned", "total", and "global", so
+		 * that shrinkers scan a large proportion of their
+		 * objects. 100 rather than 1 in order to reduce rounding
+		 * errors.
+		 */
+		for_each_populated_zone(zone)
+			shrink_slab(zone, 100, 100, 100, GFP_KERNEL);
+	} while (reclaim_state.reclaimed_slab);
+
+	current->reclaim_state = NULL;
 }
 
 static inline int is_page_cache_freeable(struct page *page)
@@ -1740,18 +1866,24 @@
  * This is a basic per-zone page freer.  Used by both kswapd and direct reclaim.
  */
 static void shrink_zone(int priority, struct zone *zone,
-				struct scan_control *sc)
+		struct scan_control *sc, unsigned long global_lru_pages)
 {
 	unsigned long nr[NR_LRU_LISTS];
 	unsigned long nr_to_scan;
 	enum lru_list l;
 	unsigned long nr_reclaimed = sc->nr_reclaimed;
 	unsigned long nr_to_reclaim = sc->nr_to_reclaim;
+	unsigned long nr_scanned = sc->nr_scanned;
+	unsigned long lru_pages = 0;
 
 	get_scan_count(zone, sc, nr, priority);
 
 	set_lumpy_reclaim_mode(priority, sc);
 
+	/* Used by slab shrinking, below */
+	if (sc->may_reclaim_slab)
+		lru_pages = zone_reclaimable_pages(zone);
+
 	while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
 					nr[LRU_INACTIVE_FILE]) {
 		for_each_evictable_lru(l) {
@@ -1776,8 +1908,6 @@
 			break;
 	}
 
-	sc->nr_reclaimed = nr_reclaimed;
-
 	/*
 	 * Even if we did not try to evict anon pages at all, we want to
 	 * rebalance the anon lru active/inactive ratio.
@@ -1785,6 +1915,23 @@
 	if (inactive_anon_is_low(zone, sc) && nr_swap_pages > 0)
 		shrink_active_list(SWAP_CLUSTER_MAX, zone, sc, priority, 0);
 
+	/*
+	 * Don't shrink slabs when reclaiming memory from
+	 * over limit cgroups
+	 */
+	if (sc->may_reclaim_slab) {
+		struct reclaim_state *reclaim_state = current->reclaim_state;
+
+		shrink_slab(zone, sc->nr_scanned - nr_scanned,
+			lru_pages, global_lru_pages, sc->gfp_mask);
+		if (reclaim_state) {
+			nr_reclaimed += reclaim_state->reclaimed_slab;
+			reclaim_state->reclaimed_slab = 0;
+		}
+	}
+
+	sc->nr_reclaimed = nr_reclaimed;
+
 	throttle_vm_writeout(sc->gfp_mask);
 }
 
@@ -1805,7 +1952,7 @@
  * scan then give up on it.
  */
 static void shrink_zones(int priority, struct zonelist *zonelist,
-					struct scan_control *sc)
+		struct scan_control *sc, unsigned long global_lru_pages)
 {
 	struct zoneref *z;
 	struct zone *zone;
@@ -1825,7 +1972,7 @@
 				continue;	/* Let kswapd poll it */
 		}
 
-		shrink_zone(priority, zone, sc);
+		shrink_zone(priority, zone, sc, global_lru_pages);
 	}
 }
 
@@ -1882,7 +2029,6 @@
 {
 	int priority;
 	unsigned long total_scanned = 0;
-	struct reclaim_state *reclaim_state = current->reclaim_state;
 	struct zoneref *z;
 	struct zone *zone;
 	unsigned long writeback_threshold;
@@ -1894,30 +2040,20 @@
 		count_vm_event(ALLOCSTALL);
 
 	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
-		sc->nr_scanned = 0;
-		if (!priority)
-			disable_swap_token();
-		shrink_zones(priority, zonelist, sc);
-		/*
-		 * Don't shrink slabs when reclaiming memory from
-		 * over limit cgroups
-		 */
-		if (scanning_global_lru(sc)) {
-			unsigned long lru_pages = 0;
-			for_each_zone_zonelist(zone, z, zonelist,
-					gfp_zone(sc->gfp_mask)) {
-				if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
-					continue;
+		unsigned long lru_pages = 0;
 
-				lru_pages += zone_reclaimable_pages(zone);
-			}
+		for_each_zone_zonelist(zone, z, zonelist,
+				gfp_zone(sc->gfp_mask)) {
+			if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
+				continue;
 
-			shrink_slab(sc->nr_scanned, sc->gfp_mask, lru_pages);
-			if (reclaim_state) {
-				sc->nr_reclaimed += reclaim_state->reclaimed_slab;
-				reclaim_state->reclaimed_slab = 0;
-			}
+			lru_pages += zone_reclaimable_pages(zone);
 		}
+
+		sc->nr_scanned = 0;
+		if (!priority)
+			disable_swap_token();
+		shrink_zones(priority, zonelist, sc, lru_pages);
 		total_scanned += sc->nr_scanned;
 		if (sc->nr_reclaimed >= sc->nr_to_reclaim)
 			goto out;
@@ -1975,6 +2111,7 @@
 		.nr_to_reclaim = SWAP_CLUSTER_MAX,
 		.may_unmap = 1,
 		.may_swap = 1,
+		.may_reclaim_slab = 1,
 		.swappiness = vm_swappiness,
 		.order = order,
 		.mem_cgroup = NULL,
@@ -2004,6 +2141,7 @@
 		.may_writepage = !laptop_mode,
 		.may_unmap = 1,
 		.may_swap = !noswap,
+		.may_reclaim_slab = 0,
 		.swappiness = swappiness,
 		.order = 0,
 		.mem_cgroup = mem,
@@ -2022,7 +2160,7 @@
 	 * will pick up pages from other mem cgroup's as well. We hack
 	 * the priority and make it zero.
 	 */
-	shrink_zone(0, zone, &sc);
+	shrink_zone(0, zone, &sc, zone_reclaimable_pages(zone));
 
 	trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed);
 
@@ -2040,6 +2178,7 @@
 		.may_writepage = !laptop_mode,
 		.may_unmap = 1,
 		.may_swap = !noswap,
+		.may_reclaim_slab = 0,
 		.nr_to_reclaim = SWAP_CLUSTER_MAX,
 		.swappiness = swappiness,
 		.order = 0,
@@ -2117,11 +2256,11 @@
 	int priority;
 	int i;
 	unsigned long total_scanned;
-	struct reclaim_state *reclaim_state = current->reclaim_state;
 	struct scan_control sc = {
 		.gfp_mask = GFP_KERNEL,
 		.may_unmap = 1,
 		.may_swap = 1,
+		.may_reclaim_slab = 1,
 		/*
 		 * kswapd doesn't want to be bailed out while reclaim. because
 		 * we want to put equal scanning pressure on each zone.
@@ -2195,7 +2334,6 @@
 		 */
 		for (i = 0; i <= end_zone; i++) {
 			struct zone *zone = pgdat->node_zones + i;
-			int nr_slab;
 
 			if (!populated_zone(zone))
 				continue;
@@ -2217,15 +2355,11 @@
 			 */
 			if (!zone_watermark_ok(zone, order,
 					8*high_wmark_pages(zone), end_zone, 0))
-				shrink_zone(priority, zone, &sc);
-			reclaim_state->reclaimed_slab = 0;
-			nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL,
-						lru_pages);
-			sc.nr_reclaimed += reclaim_state->reclaimed_slab;
+				shrink_zone(priority, zone, &sc, lru_pages);
 			total_scanned += sc.nr_scanned;
 			if (zone->all_unreclaimable)
 				continue;
-			if (nr_slab == 0 && !zone_reclaimable(zone))
+			if (!zone_reclaimable(zone))
 				zone->all_unreclaimable = 1;
 			/*
 			 * If we've done a decent amount of scanning and
@@ -2482,6 +2616,7 @@
 		.may_swap = 1,
 		.may_unmap = 1,
 		.may_writepage = 1,
+		.may_reclaim_slab = 1,
 		.nr_to_reclaim = nr_to_reclaim,
 		.hibernation_mode = 1,
 		.swappiness = vm_swappiness,
@@ -2665,13 +2800,14 @@
 		.may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
 		.may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
 		.may_swap = 1,
+		.may_reclaim_slab = 0,
 		.nr_to_reclaim = max_t(unsigned long, nr_pages,
 				       SWAP_CLUSTER_MAX),
 		.gfp_mask = gfp_mask,
 		.swappiness = vm_swappiness,
 		.order = order,
 	};
-	unsigned long nr_slab_pages0, nr_slab_pages1;
+	unsigned long lru_pages, slab_pages;
 
 	cond_resched();
 	/*
@@ -2684,51 +2820,61 @@
 	reclaim_state.reclaimed_slab = 0;
 	p->reclaim_state = &reclaim_state;
 
+	lru_pages = zone_reclaimable_pages(zone);
+	slab_pages = zone_page_state(zone, NR_SLAB_RECLAIMABLE);
+
 	if (zone_pagecache_reclaimable(zone) > zone->min_unmapped_pages) {
+		if (slab_pages > zone->min_slab_pages)
+			sc.may_reclaim_slab = 1;
 		/*
 		 * Free memory by calling shrink zone with increasing
 		 * priorities until we have enough memory freed.
 		 */
 		priority = ZONE_RECLAIM_PRIORITY;
 		do {
-			shrink_zone(priority, zone, &sc);
+			shrink_zone(priority, zone, &sc, lru_pages);
 			priority--;
 		} while (priority >= 0 && sc.nr_reclaimed < nr_pages);
-	}
 
-	nr_slab_pages0 = zone_page_state(zone, NR_SLAB_RECLAIMABLE);
-	if (nr_slab_pages0 > zone->min_slab_pages) {
+	} else if (slab_pages > zone->min_slab_pages) {
 		/*
-		 * shrink_slab() does not currently allow us to determine how
-		 * many pages were freed in this zone. So we take the current
-		 * number of slab pages and shake the slab until it is reduced
-		 * by the same nr_pages that we used for reclaiming unmapped
-		 * pages.
-		 *
-		 * Note that shrink_slab will free memory on all zones and may
-		 * take a long time.
+		 * Scanning slab without pagecache; we have to open code the
+		 * call to shrink_slab (shrink_zone drives slab reclaim via
+		 * pagecache scanning, so it isn't set up to shrink slab
+		 * without scanning pagecache).
 		 */
-		for (;;) {
-			unsigned long lru_pages = zone_reclaimable_pages(zone);
-
-			/* No reclaimable slab or very low memory pressure */
-			if (!shrink_slab(sc.nr_scanned, gfp_mask, lru_pages))
-				break;
 
-			/* Freed enough memory */
-			nr_slab_pages1 = zone_page_state(zone,
-							NR_SLAB_RECLAIMABLE);
-			if (nr_slab_pages1 + nr_pages <= nr_slab_pages0)
-				break;
-		}
+		/*
+		 * lru_pages / 10  -- put a 10% pressure on the slab
+		 * which roughly corresponds to ZONE_RECLAIM_PRIORITY
+		 * scanning 1/16th of pagecache.
+		 *
+		 * Global slabs will be shrunk at a relatively more
+		 * aggressive rate because, for speed, we don't
+		 * calculate the global lru size. But they really should
+		 * be converted to per-zone slabs if they are important.
+		 */
+		shrink_slab(zone, lru_pages / 10, lru_pages, lru_pages,
+				gfp_mask);
 
 		/*
-		 * Update nr_reclaimed by the number of slab pages we
-		 * reclaimed from this zone.
+		 * Although we have a zone based slab shrinker API, some slabs
+		 * are still scanned globally. This means we can't quite
+		 * determine how many pages were freed in this zone by
+		 * checking reclaimed_slab. However the regular shrink_zone
+		 * paths have exactly the same problem that they largely
+		 * ignore. So don't be different.
+		 *
+		 * The situation will improve dramatically once the
+		 * important slabs are converted to per zone shrinkers
+		 * and switched over to using reclaimed_slab.
+		 *
+		 * Note that shrink_slab may free memory on all zones and may
+		 * take a long time, but again switching important slabs to
+		 * zone based shrinkers will solve this problem.
 		 */
-		nr_slab_pages1 = zone_page_state(zone, NR_SLAB_RECLAIMABLE);
-		if (nr_slab_pages1 < nr_slab_pages0)
-			sc.nr_reclaimed += nr_slab_pages0 - nr_slab_pages1;
+		sc.nr_reclaimed += reclaim_state.reclaimed_slab;
+		reclaim_state.reclaimed_slab = 0;
 	}
 
 	p->reclaim_state = NULL;
Index: linux-2.6/fs/drop_caches.c
===================================================================
--- linux-2.6.orig/fs/drop_caches.c	2010-10-19 14:19:40.000000000 +1100
+++ linux-2.6/fs/drop_caches.c	2010-10-19 14:20:01.000000000 +1100
@@ -35,11 +35,7 @@
 
 static void drop_slab(void)
 {
-	int nr_objects;
-
-	do {
-		nr_objects = shrink_slab(1000, GFP_KERNEL, 1000);
-	} while (nr_objects > 10);
+	shrink_all_slab();
 }
 
 int drop_caches_sysctl_handler(ctl_table *table, int write,
Index: linux-2.6/mm/memory-failure.c
===================================================================
--- linux-2.6.orig/mm/memory-failure.c	2010-10-19 14:19:40.000000000 +1100
+++ linux-2.6/mm/memory-failure.c	2010-10-19 14:20:01.000000000 +1100
@@ -231,14 +231,8 @@
 	 * Only call shrink_slab here (which would also
 	 * shrink other caches) if access is not potentially fatal.
 	 */
-	if (access) {
-		int nr;
-		do {
-			nr = shrink_slab(1000, GFP_KERNEL, 1000);
-			if (page_count(p) == 1)
-				break;
-		} while (nr > 10);
-	}
+	if (access)
+		shrink_all_slab();
 }
 EXPORT_SYMBOL_GPL(shake_page);
 




* [patch 04/35] vfs: convert inode and dentry caches to per-zone shrinker
  2010-10-19  3:42 [patch 00/35] my inode scaling series for review npiggin
                   ` (2 preceding siblings ...)
  2010-10-19  3:42 ` [patch 03/35] mm: implement per-zone shrinker npiggin
@ 2010-10-19  3:42 ` npiggin
  2010-10-19  3:42 ` [patch 05/35] fs: icache lock s_inodes list npiggin
                   ` (32 subsequent siblings)
  36 siblings, 0 replies; 70+ messages in thread
From: npiggin @ 2010-10-19  3:42 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel; +Cc: linux-mm

[-- Attachment #1: vfs-zone-shrinker.patch --]
[-- Type: text/plain, Size: 4924 bytes --]

Convert the inode and dentry caches to the per-zone shrinker API, in
preparation for proper per-zone cache LRU lists. These two caches tend to
be the most important in the system after the pagecache LRUs, so making
them per-zone will help to fix the quirks in the vmscan code that tries to
reconcile zone-driven scanning with global slab reclaim.
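
For a concrete feel of the vfs_cache_pressure arithmetic used in the
conversion below (assumed values; simple arithmetic on the constants from
patch 03):

/*
 * SHRINK_DEFAULT_SEEKS = SHRINK_FACTOR * DEFAULT_SEEKS / 4
 *                      = 128 * 2 / 4 = 64
 *
 * ratio = SHRINK_DEFAULT_SEEKS * 100 / sysctl_vfs_cache_pressure
 *
 * pressure = 100 (default) -> ratio = 64:  baseline scanning
 * pressure = 50            -> ratio = 128: dentries/inodes look twice
 *                             as expensive, so are scanned half as hard
 * pressure = 200           -> ratio = 32:  scanned twice as hard
 */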

Cc: linux-mm@kvack.org
Signed-off-by: Nick Piggin <npiggin@kernel.dk>

---
 fs/dcache.c |   31 ++++++++++++++++++++-----------
 fs/inode.c  |   39 ++++++++++++++++++++++++---------------
 2 files changed, 44 insertions(+), 26 deletions(-)

Index: linux-2.6/fs/dcache.c
===================================================================
--- linux-2.6.orig/fs/dcache.c	2010-10-19 14:35:42.000000000 +1100
+++ linux-2.6/fs/dcache.c	2010-10-19 14:36:53.000000000 +1100
@@ -534,7 +534,7 @@
  *
  * This function may fail to free any resources if all the dentries are in use.
  */
-static void prune_dcache(int count)
+static void prune_dcache(unsigned long count)
 {
 	struct super_block *sb, *p = NULL;
 	int w_count;
@@ -887,7 +887,8 @@
 EXPORT_SYMBOL(shrink_dcache_parent);
 
 /*
- * Scan `nr' dentries and return the number which remain.
+ * shrink_dcache_memory scans and reclaims unused dentries. This function
+ * is defined according to the shrinker API described in linux/mm.h.
  *
  * We need to avoid reentering the filesystem if the caller is performing a
  * GFP_NOFS allocation attempt.  One example deadlock is:
@@ -895,22 +896,30 @@
  * ext2_new_block->getblk->GFP->shrink_dcache_memory->prune_dcache->
  * prune_one_dentry->dput->dentry_iput->iput->inode->i_sb->s_op->put_inode->
  * ext2_discard_prealloc->ext2_free_blocks->lock_super->DEADLOCK.
- *
- * In this case we return -1 to tell the caller that we baled.
  */
-static int shrink_dcache_memory(struct shrinker *shrink, int nr, gfp_t gfp_mask)
+static void shrink_dcache_memory(struct shrinker *shrink,
+		struct zone *zone, unsigned long scanned,
+		unsigned long total, unsigned long global,
+		unsigned long flags, gfp_t gfp_mask)
 {
-	if (nr) {
-		if (!(gfp_mask & __GFP_FS))
-			return -1;
+	static unsigned long nr_to_scan;
+	unsigned long nr;
+
+	shrinker_add_scan(&nr_to_scan, scanned, global,
+			dentry_stat.nr_unused,
+			SHRINK_DEFAULT_SEEKS * 100 / sysctl_vfs_cache_pressure);
+	if (!(gfp_mask & __GFP_FS))
+	       return;
+
+	while ((nr = shrinker_do_scan(&nr_to_scan, SHRINK_BATCH))) {
 		prune_dcache(nr);
+		count_vm_events(SLABS_SCANNED, nr);
+		cond_resched();
 	}
-	return (dentry_stat.nr_unused / 100) * sysctl_vfs_cache_pressure;
 }
 
 static struct shrinker dcache_shrinker = {
-	.shrink = shrink_dcache_memory,
-	.seeks = DEFAULT_SEEKS,
+	.shrink_zone = shrink_dcache_memory,
 };
 
 /**
Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c	2010-10-19 14:35:42.000000000 +1100
+++ linux-2.6/fs/inode.c	2010-10-19 14:37:05.000000000 +1100
@@ -445,7 +445,7 @@
  * If the inode has metadata buffers attached to mapping->private_list then
  * try to remove them.
  */
-static void prune_icache(int nr_to_scan)
+static void prune_icache(unsigned long nr_to_scan)
 {
 	LIST_HEAD(freeable);
 	int nr_pruned = 0;
@@ -503,27 +503,36 @@
  * not open and the dcache references to those inodes have already been
  * reclaimed.
  *
- * This function is passed the number of inodes to scan, and it returns the
- * total number of remaining possibly-reclaimable inodes.
+ * This function is defined according to shrinker API described in linux/mm.h.
  */
-static int shrink_icache_memory(struct shrinker *shrink, int nr, gfp_t gfp_mask)
+static void shrink_icache_memory(struct shrinker *shrink,
+		struct zone *zone, unsigned long scanned,
+		unsigned long total, unsigned long global,
+		unsigned long flags, gfp_t gfp_mask)
 {
-	if (nr) {
-		/*
-		 * Nasty deadlock avoidance.  We may hold various FS locks,
-		 * and we don't want to recurse into the FS that called us
-		 * in clear_inode() and friends..
-		 */
-		if (!(gfp_mask & __GFP_FS))
-			return -1;
+	static unsigned long nr_to_scan;
+	unsigned long nr;
+
+	shrinker_add_scan(&nr_to_scan, scanned, global,
+			inodes_stat.nr_unused,
+			SHRINK_DEFAULT_SEEKS * 100 / sysctl_vfs_cache_pressure);
+	/*
+	 * Nasty deadlock avoidance.  We may hold various FS locks,
+	 * and we don't want to recurse into the FS that called us
+	 * in clear_inode() and friends..
+	 */
+	if (!(gfp_mask & __GFP_FS))
+	       return;
+
+	while ((nr = shrinker_do_scan(&nr_to_scan, SHRINK_BATCH))) {
 		prune_icache(nr);
+		count_vm_events(SLABS_SCANNED, nr);
+		cond_resched();
 	}
-	return (inodes_stat.nr_unused / 100) * sysctl_vfs_cache_pressure;
 }
 
 static struct shrinker icache_shrinker = {
-	.shrink = shrink_icache_memory,
-	.seeks = DEFAULT_SEEKS,
+	.shrink_zone = shrink_icache_memory,
 };
 
 static void __wait_on_freeing_inode(struct inode *inode);




* [patch 05/35] fs: icache lock s_inodes list
  2010-10-19  3:42 [patch 00/35] my inode scaling series for review npiggin
                   ` (3 preceding siblings ...)
  2010-10-19  3:42 ` [patch 04/35] vfs: convert inode and dentry caches to " npiggin
@ 2010-10-19  3:42 ` npiggin
  2010-10-19  3:42 ` [patch 06/35] fs: icache lock inode hash npiggin
                   ` (31 subsequent siblings)
  36 siblings, 0 replies; 70+ messages in thread
From: npiggin @ 2010-10-19  3:42 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel

[-- Attachment #1: fs-inode_lock-scale.patch --]
[-- Type: text/plain, Size: 9245 bytes --]

Protect sb->s_inodes with a new lock, sb_inode_list_lock.

***
This is the first patch in the inode lock scaling series, and as such
I will provide an overview of the structure of the patchset and its goals.

Firstly, the overall locking design is to move from a global lock for the
inode cache to a per-inode lock to protect the icache state of an inode;
and other fine grained locks to protect access to various icache data
structures.

The per-inode lock used is i_lock, but it isn't tied to the other users of
i_lock, so if a problem shows up with this usage of the lock in future,
another lock can easily be added to the inode for icache locking.

Using the per-inode lock to lock the icache state of a given inode works
nicely as a natural replacement for inode_lock, without significantly
changing the rest of the code. Most icache operations work on one
particular inode at a time (after finding it via some data structure or
reference), so in those places, where we once took inode_lock to lock
icache state, we can now take i_lock instead without introducing new
concurrency into the icache.

Secondly, the inode scaling patchset is broken into 3 parts:
1: add global locks to take over protection of various data structures from
   the inode_lock, and protect per-inode state with i_lock.
2: remove the global inode_lock.
3: apply various strategies to improve scalability of the newly added locks
   and to streamline the locking.

This approach has several benefits. Firstly, the steps to add locking and
take over inode_lock are very conservative and as simple as realistically
possible to review and audit for correctness. Secondly, removing inode_lock
before more complex code and locking transforms allows for much better
bisectability for bugs and performance regressions than if we lift the
inode_lock as a final step after the more complex transformations are done.
Lastly, small, reviewable and often independent changes to improve locking
can be reviewed, reverted, bisected and tested after inode_lock is gone.

There is also a disadvantage in that the conservative and clunky locking
built up in the first step is not performant, often looks ugly, and
contains more nesting and lock ordering problems than necessary. However
these steps are very much intended only to be intermediate and obvious
small steps along the way.

The choice to nest icache structure locks inside i_lock was made because
that ended up working the best for paths such as inode insertion or removal
from data structures, when several locks would otherwise need to be held at
once. The icache structure locks are all private to icache, so it is possible
to reduce the width of i_lock or change the nesting rules in small obvious
steps, if any real problems are found with this nesting scheme. 
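
To make the intermediate locking pattern concrete, here is a minimal sketch
(the shape comes from the hunks below; not complete code). While inode_lock
still exists, the new structure locks simply nest inside it:

	spin_lock(&inode_lock);
	spin_lock(&sb_inode_list_lock);
	list_del_init(&inode->i_sb_list);
	spin_unlock(&sb_inode_list_lock);
	spin_unlock(&inode_lock);

and by the end of the series, per-inode icache state is covered by i_lock
alone:

	spin_lock(&inode->i_lock);
	inode->i_state |= I_FREEING;
	spin_unlock(&inode->i_lock);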

Signed-off-by: Nick Piggin <npiggin@kernel.dk>

---
 fs/drop_caches.c          |    4 ++++
 fs/fs-writeback.c         |    4 ++++
 fs/inode.c                |   19 +++++++++++++++++++
 fs/notify/inode_mark.c    |    2 ++
 fs/quota/dquot.c          |    6 ++++++
 include/linux/writeback.h |    1 +
 6 files changed, 36 insertions(+)

Index: linux-2.6/fs/drop_caches.c
===================================================================
--- linux-2.6.orig/fs/drop_caches.c	2010-10-19 14:18:58.000000000 +1100
+++ linux-2.6/fs/drop_caches.c	2010-10-19 14:19:38.000000000 +1100
@@ -17,18 +17,22 @@
 	struct inode *inode, *toput_inode = NULL;
 
 	spin_lock(&inode_lock);
+	spin_lock(&sb_inode_list_lock);
 	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
 			continue;
 		if (inode->i_mapping->nrpages == 0)
 			continue;
 		__iget(inode);
+		spin_unlock(&sb_inode_list_lock);
 		spin_unlock(&inode_lock);
 		invalidate_mapping_pages(inode->i_mapping, 0, -1);
 		iput(toput_inode);
 		toput_inode = inode;
 		spin_lock(&inode_lock);
+		spin_lock(&sb_inode_list_lock);
 	}
+	spin_unlock(&sb_inode_list_lock);
 	spin_unlock(&inode_lock);
 	iput(toput_inode);
 }
Index: linux-2.6/fs/fs-writeback.c
===================================================================
--- linux-2.6.orig/fs/fs-writeback.c	2010-10-19 14:17:27.000000000 +1100
+++ linux-2.6/fs/fs-writeback.c	2010-10-19 14:19:38.000000000 +1100
@@ -1029,6 +1029,7 @@
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
 	spin_lock(&inode_lock);
+	spin_lock(&sb_inode_list_lock);
 
 	/*
 	 * Data integrity sync. Must wait for all pages under writeback,
@@ -1046,6 +1047,7 @@
 		if (mapping->nrpages == 0)
 			continue;
 		__iget(inode);
+		spin_unlock(&sb_inode_list_lock);
 		spin_unlock(&inode_lock);
 		/*
 		 * We hold a reference to 'inode' so it couldn't have
@@ -1063,7 +1065,9 @@
 		cond_resched();
 
 		spin_lock(&inode_lock);
+		spin_lock(&sb_inode_list_lock);
 	}
+	spin_unlock(&sb_inode_list_lock);
 	spin_unlock(&inode_lock);
 	iput(old_inode);
 }
Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c	2010-10-19 14:18:58.000000000 +1100
+++ linux-2.6/fs/inode.c	2010-10-19 14:19:38.000000000 +1100
@@ -26,6 +26,15 @@
 #include <linux/posix_acl.h>
 
 /*
+ * Usage:
+ * sb_inode_list_lock protects:
+ *   s_inodes, i_sb_list
+ *
+ * Ordering:
+ * inode_lock
+ *   sb_inode_list_lock
+ */
+/*
  * This is needed for the following functions:
  *  - inode_has_buffers
  *  - invalidate_inode_buffers
@@ -83,6 +92,7 @@
  * the i_state of an inode while it is in use..
  */
 DEFINE_SPINLOCK(inode_lock);
+DEFINE_SPINLOCK(sb_inode_list_lock);
 
 /*
  * iprune_sem provides exclusion between the kswapd or try_to_free_pages
@@ -339,7 +349,9 @@
 
 		spin_lock(&inode_lock);
 		hlist_del_init(&inode->i_hash);
+		spin_lock(&sb_inode_list_lock);
 		list_del_init(&inode->i_sb_list);
+		spin_unlock(&sb_inode_list_lock);
 		spin_unlock(&inode_lock);
 
 		wake_up_inode(inode);
@@ -371,6 +383,7 @@
 		 * shrink_icache_memory() away.
 		 */
 		cond_resched_lock(&inode_lock);
+		cond_resched_lock(&sb_inode_list_lock);
 
 		next = next->next;
 		if (tmp == head)
@@ -408,8 +421,10 @@
 
 	down_write(&iprune_sem);
 	spin_lock(&inode_lock);
+	spin_lock(&sb_inode_list_lock);
 	fsnotify_unmount_inodes(&sb->s_inodes);
 	busy = invalidate_list(&sb->s_inodes, &throw_away);
+	spin_unlock(&sb_inode_list_lock);
 	spin_unlock(&inode_lock);
 
 	dispose_list(&throw_away);
@@ -606,7 +621,9 @@
 {
 	inodes_stat.nr_inodes++;
 	list_add(&inode->i_list, &inode_in_use);
+	spin_lock(&sb_inode_list_lock);
 	list_add(&inode->i_sb_list, &sb->s_inodes);
+	spin_unlock(&sb_inode_list_lock);
 	if (head)
 		hlist_add_head(&inode->i_hash, head);
 }
@@ -1240,7 +1257,9 @@
 		hlist_del_init(&inode->i_hash);
 	}
 	list_del_init(&inode->i_list);
+	spin_lock(&sb_inode_list_lock);
 	list_del_init(&inode->i_sb_list);
+	spin_unlock(&sb_inode_list_lock);
 	WARN_ON(inode->i_state & I_NEW);
 	inode->i_state |= I_FREEING;
 	inodes_stat.nr_inodes--;
Index: linux-2.6/fs/quota/dquot.c
===================================================================
--- linux-2.6.orig/fs/quota/dquot.c	2010-10-19 14:17:27.000000000 +1100
+++ linux-2.6/fs/quota/dquot.c	2010-10-19 14:19:38.000000000 +1100
@@ -897,6 +897,7 @@
 #endif
 
 	spin_lock(&inode_lock);
+	spin_lock(&sb_inode_list_lock);
 	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
 			continue;
@@ -910,6 +911,7 @@
 			continue;
 
 		__iget(inode);
+		spin_unlock(&sb_inode_list_lock);
 		spin_unlock(&inode_lock);
 
 		iput(old_inode);
@@ -921,7 +923,9 @@
 		 * keep the reference and iput it later. */
 		old_inode = inode;
 		spin_lock(&inode_lock);
+		spin_lock(&sb_inode_list_lock);
 	}
+	spin_unlock(&sb_inode_list_lock);
 	spin_unlock(&inode_lock);
 	iput(old_inode);
 
@@ -1004,6 +1008,7 @@
 	int reserved = 0;
 
 	spin_lock(&inode_lock);
+	spin_lock(&sb_inode_list_lock);
 	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 		/*
 		 *  We have to scan also I_NEW inodes because they can already
@@ -1017,6 +1022,7 @@
 			remove_inode_dquot_ref(inode, type, tofree_head);
 		}
 	}
+	spin_unlock(&sb_inode_list_lock);
 	spin_unlock(&inode_lock);
 #ifdef CONFIG_QUOTA_DEBUG
 	if (reserved) {
Index: linux-2.6/include/linux/writeback.h
===================================================================
--- linux-2.6.orig/include/linux/writeback.h	2010-10-19 14:17:27.000000000 +1100
+++ linux-2.6/include/linux/writeback.h	2010-10-19 14:19:34.000000000 +1100
@@ -10,6 +10,7 @@
 struct backing_dev_info;
 
 extern spinlock_t inode_lock;
+extern spinlock_t sb_inode_list_lock;
 extern struct list_head inode_in_use;
 extern struct list_head inode_unused;
 
Index: linux-2.6/fs/notify/inode_mark.c
===================================================================
--- linux-2.6.orig/fs/notify/inode_mark.c	2010-10-19 14:17:27.000000000 +1100
+++ linux-2.6/fs/notify/inode_mark.c	2010-10-19 14:19:38.000000000 +1100
@@ -283,6 +283,7 @@
 		 * will be added since the umount has begun.  Finally,
 		 * iprune_mutex keeps shrink_icache_memory() away.
 		 */
+		spin_unlock(&sb_inode_list_lock);
 		spin_unlock(&inode_lock);
 
 		if (need_iput_tmp)
@@ -296,5 +297,6 @@
 		iput(inode);
 
 		spin_lock(&inode_lock);
+		spin_lock(&sb_inode_list_lock);
 	}
 }




* [patch 06/35] fs: icache lock inode hash
  2010-10-19  3:42 [patch 00/35] my inode scaling series for review npiggin
                   ` (4 preceding siblings ...)
  2010-10-19  3:42 ` [patch 05/35] fs: icache lock s_inodes list npiggin
@ 2010-10-19  3:42 ` npiggin
  2010-10-19  3:42 ` [patch 07/35] fs: icache lock i_state npiggin
                   ` (30 subsequent siblings)
  36 siblings, 0 replies; 70+ messages in thread
From: npiggin @ 2010-10-19  3:42 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel

[-- Attachment #1: fs-inode_lock-scale-2.patch --]
[-- Type: text/plain, Size: 4582 bytes --]

Add a new lock, inode_hash_lock, to protect the inode hash table lists.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>

---
 fs/inode.c |   33 ++++++++++++++++++++++++++++++++-
 1 file changed, 32 insertions(+), 1 deletion(-)

Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c	2010-10-19 14:18:58.000000000 +1100
+++ linux-2.6/fs/inode.c	2010-10-19 14:19:38.000000000 +1100
@@ -29,10 +29,14 @@
  * Usage:
  * sb_inode_list_lock protects:
  *   s_inodes, i_sb_list
+ * inode_hash_lock protects:
+ *   inode hash table, i_hash
  *
  * Ordering:
  * inode_lock
  *   sb_inode_list_lock
+ * inode_lock
+ *   inode_hash_lock
  */
 /*
  * This is needed for the following functions:
@@ -93,6 +97,7 @@
  */
 DEFINE_SPINLOCK(inode_lock);
 DEFINE_SPINLOCK(sb_inode_list_lock);
+DEFINE_SPINLOCK(inode_hash_lock);
 
 /*
  * iprune_sem provides exclusion between the kswapd or try_to_free_pages
@@ -348,7 +353,9 @@
 		evict(inode);
 
 		spin_lock(&inode_lock);
+		spin_lock(&inode_hash_lock);
 		hlist_del_init(&inode->i_hash);
+		spin_unlock(&inode_hash_lock);
 		spin_lock(&sb_inode_list_lock);
 		list_del_init(&inode->i_sb_list);
 		spin_unlock(&sb_inode_list_lock);
@@ -566,17 +573,20 @@
 	struct inode *inode = NULL;
 
 repeat:
+	spin_lock(&inode_hash_lock);
 	hlist_for_each_entry(inode, node, head, i_hash) {
 		if (inode->i_sb != sb)
 			continue;
 		if (!test(inode, data))
 			continue;
 		if (inode->i_state & (I_FREEING|I_WILL_FREE)) {
+			spin_unlock(&inode_hash_lock);
 			__wait_on_freeing_inode(inode);
 			goto repeat;
 		}
 		break;
 	}
+	spin_unlock(&inode_hash_lock);
 	return node ? inode : NULL;
 }
 
@@ -591,17 +601,20 @@
 	struct inode *inode = NULL;
 
 repeat:
+	spin_lock(&inode_hash_lock);
 	hlist_for_each_entry(inode, node, head, i_hash) {
 		if (inode->i_ino != ino)
 			continue;
 		if (inode->i_sb != sb)
 			continue;
 		if (inode->i_state & (I_FREEING|I_WILL_FREE)) {
+			spin_unlock(&inode_hash_lock);
 			__wait_on_freeing_inode(inode);
 			goto repeat;
 		}
 		break;
 	}
+	spin_unlock(&inode_hash_lock);
 	return node ? inode : NULL;
 }
 
@@ -624,8 +637,11 @@
 	spin_lock(&sb_inode_list_lock);
 	list_add(&inode->i_sb_list, &sb->s_inodes);
 	spin_unlock(&sb_inode_list_lock);
-	if (head)
+	if (head) {
+		spin_lock(&inode_hash_lock);
 		hlist_add_head(&inode->i_hash, head);
+		spin_unlock(&inode_hash_lock);
+	}
 }
 
 /**
@@ -1103,7 +1119,9 @@
 	while (1) {
 		struct hlist_node *node;
 		struct inode *old = NULL;
+
 		spin_lock(&inode_lock);
+		spin_lock(&inode_hash_lock);
 		hlist_for_each_entry(old, node, head, i_hash) {
 			if (old->i_ino != ino)
 				continue;
@@ -1115,9 +1133,11 @@
 		}
 		if (likely(!node)) {
 			hlist_add_head(&inode->i_hash, head);
+			spin_unlock(&inode_hash_lock);
 			spin_unlock(&inode_lock);
 			return 0;
 		}
+		spin_unlock(&inode_hash_lock);
 		__iget(old);
 		spin_unlock(&inode_lock);
 		wait_on_inode(old);
@@ -1143,6 +1163,7 @@
 		struct inode *old = NULL;
 
 		spin_lock(&inode_lock);
+		spin_lock(&inode_hash_lock);
 		hlist_for_each_entry(old, node, head, i_hash) {
 			if (old->i_sb != sb)
 				continue;
@@ -1154,9 +1175,11 @@
 		}
 		if (likely(!node)) {
 			hlist_add_head(&inode->i_hash, head);
+			spin_unlock(&inode_hash_lock);
 			spin_unlock(&inode_lock);
 			return 0;
 		}
+		spin_unlock(&inode_hash_lock);
 		__iget(old);
 		spin_unlock(&inode_lock);
 		wait_on_inode(old);
@@ -1181,7 +1204,9 @@
 {
 	struct hlist_head *head = inode_hashtable + hash(inode->i_sb, hashval);
 	spin_lock(&inode_lock);
+	spin_lock(&inode_hash_lock);
 	hlist_add_head(&inode->i_hash, head);
+	spin_unlock(&inode_hash_lock);
 	spin_unlock(&inode_lock);
 }
 EXPORT_SYMBOL(__insert_inode_hash);
@@ -1195,7 +1220,9 @@
 void remove_inode_hash(struct inode *inode)
 {
 	spin_lock(&inode_lock);
+	spin_lock(&inode_hash_lock);
 	hlist_del_init(&inode->i_hash);
+	spin_unlock(&inode_hash_lock);
 	spin_unlock(&inode_lock);
 }
 EXPORT_SYMBOL(remove_inode_hash);
@@ -1254,7 +1281,9 @@
 		WARN_ON(inode->i_state & I_NEW);
 		inode->i_state &= ~I_WILL_FREE;
 		inodes_stat.nr_unused--;
+		spin_lock(&inode_hash_lock);
 		hlist_del_init(&inode->i_hash);
+		spin_unlock(&inode_hash_lock);
 	}
 	list_del_init(&inode->i_list);
 	spin_lock(&sb_inode_list_lock);
@@ -1266,7 +1295,9 @@
 	spin_unlock(&inode_lock);
 	evict(inode);
 	spin_lock(&inode_lock);
+	spin_lock(&inode_hash_lock);
 	hlist_del_init(&inode->i_hash);
+	spin_unlock(&inode_hash_lock);
 	spin_unlock(&inode_lock);
 	wake_up_inode(inode);
 	BUG_ON(inode->i_state != (I_FREEING | I_CLEAR));



^ permalink raw reply	[flat|nested] 70+ messages in thread

* [patch 07/35] fs: icache lock i_state
  2010-10-19  3:42 [patch 00/35] my inode scaling series for review npiggin
                   ` (5 preceding siblings ...)
  2010-10-19  3:42 ` [patch 06/35] fs: icache lock inode hash npiggin
@ 2010-10-19  3:42 ` npiggin
  2010-10-19 10:47   ` Miklos Szeredi
  2010-10-19  3:42 ` [patch 08/35] fs: icache lock i_count npiggin
                   ` (29 subsequent siblings)
  36 siblings, 1 reply; 70+ messages in thread
From: npiggin @ 2010-10-19  3:42 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel

[-- Attachment #1: fs-inode_lock-scale-3.patch --]
[-- Type: text/plain, Size: 17388 bytes --]

Protect i_state updates with i_lock.
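
The rule: any update of i_state, and any test of it that must be stable,
now happens under inode->i_lock, nested inside the locks used to find the
inode. A minimal sketch of the check-then-pin pattern this introduces in
the per-sb inode list walkers (drop_caches, quota and fsnotify below):

	spin_lock(&inode->i_lock);
	if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
		spin_unlock(&inode->i_lock);
		continue;
	}
	__iget(inode);		/* __iget() now asserts i_lock is held */
	spin_unlock(&inode->i_lock);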

Signed-off-by: Nick Piggin <npiggin@kernel.dk>

---
 fs/drop_caches.c       |    9 +++--
 fs/fs-writeback.c      |   35 +++++++++++++++++---
 fs/inode.c             |   83 ++++++++++++++++++++++++++++++++++++++++++-------
 fs/notify/inode_mark.c |   23 +++++++++----
 fs/quota/dquot.c       |   28 ++++++++++++++--
 5 files changed, 148 insertions(+), 30 deletions(-)

Index: linux-2.6/fs/drop_caches.c
===================================================================
--- linux-2.6.orig/fs/drop_caches.c	2010-10-19 14:18:58.000000000 +1100
+++ linux-2.6/fs/drop_caches.c	2010-10-19 14:19:32.000000000 +1100
@@ -19,11 +19,14 @@
 	spin_lock(&inode_lock);
 	spin_lock(&sb_inode_list_lock);
 	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
-		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
-			continue;
-		if (inode->i_mapping->nrpages == 0)
+		spin_lock(&inode->i_lock);
+		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)
+				|| inode->i_mapping->nrpages == 0) {
+			spin_unlock(&inode->i_lock);
 			continue;
+		}
 		__iget(inode);
+		spin_unlock(&inode->i_lock);
 		spin_unlock(&sb_inode_list_lock);
 		spin_unlock(&inode_lock);
 		invalidate_mapping_pages(inode->i_mapping, 0, -1);
Index: linux-2.6/fs/fs-writeback.c
===================================================================
--- linux-2.6.orig/fs/fs-writeback.c	2010-10-19 14:18:58.000000000 +1100
+++ linux-2.6/fs/fs-writeback.c	2010-10-19 14:19:36.000000000 +1100
@@ -288,10 +288,12 @@
 	wait_queue_head_t *wqh;
 
 	wqh = bit_waitqueue(&inode->i_state, __I_SYNC);
-	 while (inode->i_state & I_SYNC) {
+	while (inode->i_state & I_SYNC) {
+		spin_unlock(&inode->i_lock);
 		spin_unlock(&inode_lock);
 		__wait_on_bit(wqh, &wq, inode_wait, TASK_UNINTERRUPTIBLE);
 		spin_lock(&inode_lock);
+		spin_lock(&inode->i_lock);
 	}
 }
 
@@ -345,6 +347,7 @@
 	/* Set I_SYNC, reset I_DIRTY_PAGES */
 	inode->i_state |= I_SYNC;
 	inode->i_state &= ~I_DIRTY_PAGES;
+	spin_unlock(&inode->i_lock);
 	spin_unlock(&inode_lock);
 
 	ret = do_writepages(mapping, wbc);
@@ -366,8 +369,10 @@
 	 * write_inode()
 	 */
 	spin_lock(&inode_lock);
+	spin_lock(&inode->i_lock);
 	dirty = inode->i_state & I_DIRTY;
 	inode->i_state &= ~(I_DIRTY_SYNC | I_DIRTY_DATASYNC);
+	spin_unlock(&inode->i_lock);
 	spin_unlock(&inode_lock);
 	/* Don't write the inode if only I_DIRTY_PAGES was set */
 	if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
@@ -377,6 +382,7 @@
 	}
 
 	spin_lock(&inode_lock);
+	spin_lock(&inode->i_lock);
 	inode->i_state &= ~I_SYNC;
 	if (!(inode->i_state & I_FREEING)) {
 		if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
@@ -487,7 +493,9 @@
 			return 0;
 		}
 
+		spin_lock(&inode->i_lock);
 		if (inode->i_state & (I_NEW | I_WILL_FREE)) {
+			spin_unlock(&inode->i_lock);
 			requeue_io(inode);
 			continue;
 		}
@@ -495,8 +503,10 @@
 		 * Was this inode dirtied after sync_sb_inodes was called?
 		 * This keeps sync from extra jobs and livelock.
 		 */
-		if (inode_dirtied_after(inode, wbc->wb_start))
+		if (inode_dirtied_after(inode, wbc->wb_start)) {
+			spin_unlock(&inode->i_lock);
 			return 1;
+		}
 
 		BUG_ON(inode->i_state & I_FREEING);
 		__iget(inode);
@@ -509,6 +519,7 @@
 			 */
 			redirty_tail(inode);
 		}
+		spin_unlock(&inode->i_lock);
 		spin_unlock(&inode_lock);
 		iput(inode);
 		cond_resched();
@@ -944,6 +955,7 @@
 		block_dump___mark_inode_dirty(inode);
 
 	spin_lock(&inode_lock);
+	spin_lock(&inode->i_lock);
 	if ((inode->i_state & flags) != flags) {
 		const int was_dirty = inode->i_state & I_DIRTY;
 
@@ -994,6 +1006,7 @@
 		}
 	}
 out:
+	spin_unlock(&inode->i_lock);
 	spin_unlock(&inode_lock);
 
 	if (wakeup_bdi)
@@ -1041,12 +1054,20 @@
 	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 		struct address_space *mapping;
 
-		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
+		spin_lock(&inode->i_lock);
+		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
+			spin_unlock(&inode->i_lock);
 			continue;
+		}
+
 		mapping = inode->i_mapping;
-		if (mapping->nrpages == 0)
+		if (mapping->nrpages == 0) {
+			spin_unlock(&inode->i_lock);
 			continue;
-		__iget(inode);
+		}
+
+		__iget(inode);
+		spin_unlock(&inode->i_lock);
 		spin_unlock(&sb_inode_list_lock);
 		spin_unlock(&inode_lock);
 		/*
@@ -1173,7 +1194,9 @@
 
 	might_sleep();
 	spin_lock(&inode_lock);
+	spin_lock(&inode->i_lock);
 	ret = writeback_single_inode(inode, &wbc);
+	spin_unlock(&inode->i_lock);
 	spin_unlock(&inode_lock);
 	if (sync)
 		inode_sync_wait(inode);
@@ -1197,7 +1220,9 @@
 	int ret;
 
 	spin_lock(&inode_lock);
+	spin_lock(&inode->i_lock);
 	ret = writeback_single_inode(inode, wbc);
+	spin_unlock(&inode->i_lock);
 	spin_unlock(&inode_lock);
 	return ret;
 }
Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c	2010-10-19 14:18:58.000000000 +1100
+++ linux-2.6/fs/inode.c	2010-10-19 14:19:37.000000000 +1100
@@ -31,10 +31,13 @@
  *   s_inodes, i_sb_list
  * inode_hash_lock protects:
  *   inode hash table, i_hash
+ * inode->i_lock protects:
+ *   i_state
  *
  * Ordering:
  * inode_lock
  *   sb_inode_list_lock
+ *     inode->i_lock
  * inode_lock
  *   inode_hash_lock
  */
@@ -97,7 +100,7 @@
  */
 DEFINE_SPINLOCK(inode_lock);
 DEFINE_SPINLOCK(sb_inode_list_lock);
-DEFINE_SPINLOCK(inode_hash_lock);
+static DEFINE_SPINLOCK(inode_hash_lock);
 
 /*
  * iprune_sem provides exclusion between the kswapd or try_to_free_pages
@@ -296,6 +299,8 @@
  */
 void __iget(struct inode *inode)
 {
+	assert_spin_locked(&inode->i_lock);
+
 	if (atomic_inc_return(&inode->i_count) != 1)
 		return;
 
@@ -396,16 +401,21 @@
 		if (tmp == head)
 			break;
 		inode = list_entry(tmp, struct inode, i_sb_list);
-		if (inode->i_state & I_NEW)
+		spin_lock(&inode->i_lock);
+		if (inode->i_state & I_NEW) {
+			spin_unlock(&inode->i_lock);
 			continue;
+		}
 		invalidate_inode_buffers(inode);
 		if (!atomic_read(&inode->i_count)) {
 			list_move(&inode->i_list, dispose);
 			WARN_ON(inode->i_state & I_NEW);
 			inode->i_state |= I_FREEING;
+			spin_unlock(&inode->i_lock);
 			count++;
 			continue;
 		}
+		spin_unlock(&inode->i_lock);
 		busy = 1;
 	}
 	/* only unused inodes may be cached with i_count zero */
@@ -484,12 +494,15 @@
 
 		inode = list_entry(inode_unused.prev, struct inode, i_list);
 
+		spin_lock(&inode->i_lock);
 		if (inode->i_state || atomic_read(&inode->i_count)) {
 			list_move(&inode->i_list, &inode_unused);
+			spin_unlock(&inode->i_lock);
 			continue;
 		}
 		if (inode_has_buffers(inode) || inode->i_data.nrpages) {
 			__iget(inode);
+			spin_unlock(&inode->i_lock);
 			spin_unlock(&inode_lock);
 			if (remove_inode_buffers(inode))
 				reap += invalidate_mapping_pages(&inode->i_data,
@@ -500,12 +513,16 @@
 			if (inode != list_entry(inode_unused.next,
 						struct inode, i_list))
 				continue;	/* wrong inode or list_empty */
-			if (!can_unuse(inode))
+			spin_lock(&inode->i_lock);
+			if (!can_unuse(inode)) {
+				spin_unlock(&inode->i_lock);
 				continue;
+			}
 		}
 		list_move(&inode->i_list, &freeable);
 		WARN_ON(inode->i_state & I_NEW);
 		inode->i_state |= I_FREEING;
+		spin_unlock(&inode->i_lock);
 		nr_pruned++;
 	}
 	inodes_stat.nr_unused -= nr_pruned;
@@ -577,8 +594,14 @@
 	hlist_for_each_entry(inode, node, head, i_hash) {
 		if (inode->i_sb != sb)
 			continue;
-		if (!test(inode, data))
+		if (!spin_trylock(&inode->i_lock)) {
+			spin_unlock(&inode_hash_lock);
+			goto repeat;
+		}
+		if (!test(inode, data)) {
+			spin_unlock(&inode->i_lock);
 			continue;
+		}
 		if (inode->i_state & (I_FREEING|I_WILL_FREE)) {
 			spin_unlock(&inode_hash_lock);
 			__wait_on_freeing_inode(inode);
@@ -607,6 +630,10 @@
 			continue;
 		if (inode->i_sb != sb)
 			continue;
+		if (!spin_trylock(&inode->i_lock)) {
+			spin_unlock(&inode_hash_lock);
+			goto repeat;
+		}
 		if (inode->i_state & (I_FREEING|I_WILL_FREE)) {
 			spin_unlock(&inode_hash_lock);
 			__wait_on_freeing_inode(inode);
@@ -633,10 +660,10 @@
 			struct inode *inode)
 {
 	inodes_stat.nr_inodes++;
-	list_add(&inode->i_list, &inode_in_use);
 	spin_lock(&sb_inode_list_lock);
 	list_add(&inode->i_sb_list, &sb->s_inodes);
 	spin_unlock(&sb_inode_list_lock);
+	list_add(&inode->i_list, &inode_in_use);
 	if (head) {
 		spin_lock(&inode_hash_lock);
 		hlist_add_head(&inode->i_hash, head);
@@ -693,9 +720,9 @@
 	inode = alloc_inode(sb);
 	if (inode) {
 		spin_lock(&inode_lock);
-		__inode_add_to_lists(sb, NULL, inode);
 		inode->i_ino = ++last_ino;
 		inode->i_state = 0;
+		__inode_add_to_lists(sb, NULL, inode);
 		spin_unlock(&inode_lock);
 	}
 	return inode;
@@ -762,8 +789,8 @@
 			if (set(inode, data))
 				goto set_failed;
 
-			__inode_add_to_lists(sb, head, inode);
 			inode->i_state = I_NEW;
+			__inode_add_to_lists(sb, head, inode);
 			spin_unlock(&inode_lock);
 
 			/* Return the locked inode with I_NEW set, the
@@ -778,6 +805,7 @@
 		 * allocated.
 		 */
 		__iget(old);
+		spin_unlock(&old->i_lock);
 		spin_unlock(&inode_lock);
 		destroy_inode(inode);
 		inode = old;
@@ -786,6 +814,7 @@
 	return inode;
 
 set_failed:
+	spin_unlock(&inode->i_lock);
 	spin_unlock(&inode_lock);
 	destroy_inode(inode);
 	return NULL;
@@ -809,8 +838,8 @@
 		old = find_inode_fast(sb, head, ino);
 		if (!old) {
 			inode->i_ino = ino;
-			__inode_add_to_lists(sb, head, inode);
 			inode->i_state = I_NEW;
+			__inode_add_to_lists(sb, head, inode);
 			spin_unlock(&inode_lock);
 
 			/* Return the locked inode with I_NEW set, the
@@ -825,6 +854,7 @@
 		 * allocated.
 		 */
 		__iget(old);
+		spin_unlock(&old->i_lock);
 		spin_unlock(&inode_lock);
 		destroy_inode(inode);
 		inode = old;
@@ -866,6 +896,8 @@
 		res = counter++;
 		head = inode_hashtable + hash(sb, res);
 		inode = find_inode_fast(sb, head, res);
+		if (inode)
+			spin_unlock(&inode->i_lock);
 	} while (inode != NULL);
 	spin_unlock(&inode_lock);
 
@@ -875,7 +906,10 @@
 
 struct inode *igrab(struct inode *inode)
 {
+	struct inode *ret = inode;
+
 	spin_lock(&inode_lock);
+	spin_lock(&inode->i_lock);
 	if (!(inode->i_state & (I_FREEING|I_WILL_FREE)))
 		__iget(inode);
 	else
@@ -884,9 +918,11 @@
 		 * called yet, and somebody is calling igrab
 		 * while the inode is getting freed.
 		 */
-		inode = NULL;
+		ret = NULL;
+	spin_unlock(&inode->i_lock);
 	spin_unlock(&inode_lock);
-	return inode;
+
+	return ret;
 }
 EXPORT_SYMBOL(igrab);
 
@@ -919,6 +955,7 @@
 	inode = find_inode(sb, head, test, data);
 	if (inode) {
 		__iget(inode);
+		spin_unlock(&inode->i_lock);
 		spin_unlock(&inode_lock);
 		if (likely(wait))
 			wait_on_inode(inode);
@@ -952,6 +989,7 @@
 	inode = find_inode_fast(sb, head, ino);
 	if (inode) {
 		__iget(inode);
+		spin_unlock(&inode->i_lock);
 		spin_unlock(&inode_lock);
 		wait_on_inode(inode);
 		return inode;
@@ -1121,6 +1159,7 @@
 		struct inode *old = NULL;
 
 		spin_lock(&inode_lock);
+repeat:
 		spin_lock(&inode_hash_lock);
 		hlist_for_each_entry(old, node, head, i_hash) {
 			if (old->i_ino != ino)
@@ -1129,6 +1168,10 @@
 				continue;
 			if (old->i_state & (I_FREEING|I_WILL_FREE))
 				continue;
+			if (!spin_trylock(&old->i_lock)) {
+				spin_unlock(&inode_hash_lock);
+				goto repeat;
+			}
 			break;
 		}
 		if (likely(!node)) {
@@ -1139,6 +1182,7 @@
 		}
 		spin_unlock(&inode_hash_lock);
 		__iget(old);
+		spin_unlock(&old->i_lock);
 		spin_unlock(&inode_lock);
 		wait_on_inode(old);
 		if (unlikely(!hlist_unhashed(&old->i_hash))) {
@@ -1163,6 +1207,7 @@
 		struct inode *old = NULL;
 
 		spin_lock(&inode_lock);
+repeat:
 		spin_lock(&inode_hash_lock);
 		hlist_for_each_entry(old, node, head, i_hash) {
 			if (old->i_sb != sb)
@@ -1171,6 +1216,10 @@
 				continue;
 			if (old->i_state & (I_FREEING|I_WILL_FREE))
 				continue;
+			if (!spin_trylock(&old->i_lock)) {
+				spin_unlock(&inode_hash_lock);
+				goto repeat;
+			}
 			break;
 		}
 		if (likely(!node)) {
@@ -1181,6 +1230,7 @@
 		}
 		spin_unlock(&inode_hash_lock);
 		__iget(old);
+		spin_unlock(&old->i_lock);
 		spin_unlock(&inode_lock);
 		wait_on_inode(old);
 		if (unlikely(!hlist_unhashed(&old->i_hash))) {
@@ -1260,6 +1310,8 @@
 	const struct super_operations *op = inode->i_sb->s_op;
 	int drop;
 
+	spin_lock(&sb_inode_list_lock);
+	spin_lock(&inode->i_lock);
 	if (op && op->drop_inode)
 		drop = op->drop_inode(inode);
 	else
@@ -1270,14 +1322,20 @@
 			list_move(&inode->i_list, &inode_unused);
 		inodes_stat.nr_unused++;
 		if (sb->s_flags & MS_ACTIVE) {
+			spin_unlock(&inode->i_lock);
+			spin_unlock(&sb_inode_list_lock);
 			spin_unlock(&inode_lock);
 			return;
 		}
 		WARN_ON(inode->i_state & I_NEW);
 		inode->i_state |= I_WILL_FREE;
+		spin_unlock(&inode->i_lock);
+		spin_unlock(&sb_inode_list_lock);
 		spin_unlock(&inode_lock);
 		write_inode_now(inode, 1);
 		spin_lock(&inode_lock);
+		spin_lock(&sb_inode_list_lock);
+		spin_lock(&inode->i_lock);
 		WARN_ON(inode->i_state & I_NEW);
 		inode->i_state &= ~I_WILL_FREE;
 		inodes_stat.nr_unused--;
@@ -1286,12 +1344,12 @@
 		spin_unlock(&inode_hash_lock);
 	}
 	list_del_init(&inode->i_list);
-	spin_lock(&sb_inode_list_lock);
 	list_del_init(&inode->i_sb_list);
 	spin_unlock(&sb_inode_list_lock);
 	WARN_ON(inode->i_state & I_NEW);
 	inode->i_state |= I_FREEING;
 	inodes_stat.nr_inodes--;
+	spin_unlock(&inode->i_lock);
 	spin_unlock(&inode_lock);
 	evict(inode);
 	spin_lock(&inode_lock);
@@ -1500,6 +1558,8 @@
  * wake_up_inode() after removing from the hash list will DTRT.
  *
  * This is called with inode_lock held.
+ *
+ * Called with i_lock held and returns with it dropped.
  */
 static void __wait_on_freeing_inode(struct inode *inode)
 {
@@ -1507,6 +1567,7 @@
 	DEFINE_WAIT_BIT(wait, &inode->i_state, __I_NEW);
 	wq = bit_waitqueue(&inode->i_state, __I_NEW);
 	prepare_to_wait(wq, &wait.wait, TASK_UNINTERRUPTIBLE);
+	spin_unlock(&inode->i_lock);
 	spin_unlock(&inode_lock);
 	schedule();
 	finish_wait(wq, &wait.wait);
Index: linux-2.6/fs/quota/dquot.c
===================================================================
--- linux-2.6.orig/fs/quota/dquot.c	2010-10-19 14:18:58.000000000 +1100
+++ linux-2.6/fs/quota/dquot.c	2010-10-19 14:19:32.000000000 +1100
@@ -246,6 +246,7 @@
 EXPORT_SYMBOL(dqstats);
 
 static qsize_t inode_get_rsv_space(struct inode *inode);
+static qsize_t __inode_get_rsv_space(struct inode *inode);
 static void __dquot_initialize(struct inode *inode, int type);
 
 static inline unsigned int
@@ -899,18 +900,26 @@
 	spin_lock(&inode_lock);
 	spin_lock(&sb_inode_list_lock);
 	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
-		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
+		spin_lock(&inode->i_lock);
+		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
+			spin_unlock(&inode->i_lock);
 			continue;
+		}
 #ifdef CONFIG_QUOTA_DEBUG
-		if (unlikely(inode_get_rsv_space(inode) > 0))
+		if (unlikely(__inode_get_rsv_space(inode) > 0))
 			reserved = 1;
 #endif
-		if (!atomic_read(&inode->i_writecount))
+		if (!atomic_read(&inode->i_writecount)) {
+			spin_unlock(&inode->i_lock);
 			continue;
-		if (!dqinit_needed(inode, type))
+		}
+		if (!dqinit_needed(inode, type)) {
+			spin_unlock(&inode->i_lock);
 			continue;
+		}
 
 		__iget(inode);
+		spin_unlock(&inode->i_lock);
 		spin_unlock(&sb_inode_list_lock);
 		spin_unlock(&inode_lock);
 
@@ -1494,6 +1503,17 @@
 }
 EXPORT_SYMBOL(inode_sub_rsv_space);
 
+/* no i_lock variant of inode_get_rsv_space */
+static qsize_t __inode_get_rsv_space(struct inode *inode)
+{
+	qsize_t ret;
+
+	if (!inode->i_sb->dq_op->get_reserved_space)
+		return 0;
+	ret = *inode_reserved_space(inode);
+	return ret;
+}
+
 static qsize_t inode_get_rsv_space(struct inode *inode)
 {
 	qsize_t ret;
Index: linux-2.6/fs/notify/inode_mark.c
===================================================================
--- linux-2.6.orig/fs/notify/inode_mark.c	2010-10-19 14:18:58.000000000 +1100
+++ linux-2.6/fs/notify/inode_mark.c	2010-10-19 14:19:37.000000000 +1100
@@ -243,13 +243,16 @@
 	list_for_each_entry_safe(inode, next_i, list, i_sb_list) {
 		struct inode *need_iput_tmp;
 
+		spin_lock(&inode->i_lock);
 		/*
 		 * We cannot __iget() an inode in state I_FREEING,
 		 * I_WILL_FREE, or I_NEW which is fine because by that point
 		 * the inode cannot have any associated watches.
 		 */
-		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
+		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
+			spin_unlock(&inode->i_lock);
 			continue;
+		}
 
 		/*
 		 * If i_count is zero, the inode cannot have any watches and
@@ -257,8 +260,10 @@
 		 * evict all inodes with zero i_count from icache which is
 		 * unnecessarily violent and may in fact be illegal to do.
 		 */
-		if (!atomic_read(&inode->i_count))
+		if (!atomic_read(&inode->i_count)) {
+			spin_unlock(&inode->i_lock);
 			continue;
+		}
 
 		need_iput_tmp = need_iput;
 		need_iput = NULL;
@@ -268,13 +273,17 @@
 			__iget(inode);
 		else
 			need_iput_tmp = NULL;
+		spin_unlock(&inode->i_lock);
 
 		/* In case the dropping of a reference would nuke next_i. */
-		if ((&next_i->i_sb_list != list) &&
-		    atomic_read(&next_i->i_count) &&
-		    !(next_i->i_state & (I_FREEING | I_WILL_FREE))) {
-			__iget(next_i);
-			need_iput = next_i;
+		if (&next_i->i_sb_list != list) {
+			spin_lock(&next_i->i_lock);
+			if (atomic_read(&next_i->i_count) &&
+			    !(next_i->i_state & (I_FREEING | I_WILL_FREE))) {
+				__iget(next_i);
+				need_iput = next_i;
+			}
+			spin_unlock(&next_i->i_lock);
 		}
 
 		/*



^ permalink raw reply	[flat|nested] 70+ messages in thread

* [patch 08/35] fs: icache lock i_count
  2010-10-19  3:42 [patch 00/35] my inode scaling series for review npiggin
                   ` (6 preceding siblings ...)
  2010-10-19  3:42 ` [patch 07/35] fs: icache lock i_state npiggin
@ 2010-10-19  3:42 ` npiggin
  2010-10-19 10:16   ` Boaz Harrosh
  2010-10-19  3:42 ` [patch 09/35] fs: icache lock lru/writeback lists npiggin
                   ` (28 subsequent siblings)
  36 siblings, 1 reply; 70+ messages in thread
From: npiggin @ 2010-10-19  3:42 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel

[-- Attachment #1: fs-inode_lock-scale-4.patch --]
[-- Type: text/plain, Size: 40361 bytes --]

Protect inode->i_count with i_lock, rather than having it atomic.
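
With every i_count update made under i_lock, the counter no longer needs
to be atomic_t. A minimal sketch of the open-coded reference grab that
replaces atomic_inc() at each call site (and that __iget() now does
internally):

	spin_lock(&inode->i_lock);
	inode->i_count++;
	spin_unlock(&inode->i_lock);

Unlocked reads of i_count (the WARN_ON and printk checks) remain
unsynchronised snapshots, exactly as atomic_read() was before.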

Signed-off-by: Nick Piggin <npiggin@kernel.dk>

---
 arch/powerpc/platforms/cell/spufs/file.c |    2 -
 drivers/staging/pohmelfs/inode.c         |   10 ++++----
 fs/9p/vfs_inode.c                        |    4 ++-
 fs/affs/inode.c                          |    4 ++-
 fs/afs/dir.c                             |    4 ++-
 fs/anon_inodes.c                         |    4 ++-
 fs/bfs/dir.c                             |    4 ++-
 fs/block_dev.c                           |   15 ++++++++++--
 fs/btrfs/inode.c                         |   17 ++++++++++----
 fs/ceph/mds_client.c                     |    2 -
 fs/cifs/inode.c                          |    2 -
 fs/coda/dir.c                            |    4 ++-
 fs/exofs/inode.c                         |   12 +++++++---
 fs/exofs/namei.c                         |    4 ++-
 fs/ext2/namei.c                          |    4 ++-
 fs/ext3/ialloc.c                         |    4 +--
 fs/ext3/namei.c                          |    4 ++-
 fs/ext4/ialloc.c                         |    4 +--
 fs/ext4/namei.c                          |    4 ++-
 fs/fs-writeback.c                        |    4 +--
 fs/gfs2/ops_inode.c                      |    4 ++-
 fs/hfsplus/dir.c                         |    4 ++-
 fs/hpfs/inode.c                          |    2 -
 fs/inode.c                               |   36 +++++++++++++++++++++----------
 fs/jffs2/dir.c                           |    8 +++++-
 fs/jfs/jfs_txnmgr.c                      |    4 ++-
 fs/jfs/namei.c                           |    4 ++-
 fs/libfs.c                               |    4 ++-
 fs/locks.c                               |    4 +--
 fs/logfs/dir.c                           |    4 ++-
 fs/logfs/readwrite.c                     |    2 -
 fs/minix/namei.c                         |    4 ++-
 fs/namei.c                               |    7 ++++--
 fs/nfs/dir.c                             |    4 ++-
 fs/nfs/getroot.c                         |    4 ++-
 fs/nfs/inode.c                           |    4 +--
 fs/nfs/nfs4state.c                       |    2 -
 fs/nfs/write.c                           |    2 -
 fs/nilfs2/mdt.c                          |    2 -
 fs/nilfs2/namei.c                        |    4 ++-
 fs/notify/inode_mark.c                   |    4 +--
 fs/ntfs/super.c                          |    4 ++-
 fs/ocfs2/namei.c                         |    4 ++-
 fs/reiserfs/namei.c                      |    4 ++-
 fs/reiserfs/stree.c                      |    2 -
 fs/sysv/namei.c                          |    4 ++-
 fs/ubifs/dir.c                           |    4 ++-
 fs/ubifs/super.c                         |    2 -
 fs/udf/namei.c                           |    4 ++-
 fs/ufs/namei.c                           |    4 ++-
 fs/xfs/linux-2.6/xfs_iops.c              |    4 ++-
 fs/xfs/linux-2.6/xfs_trace.h             |    2 -
 fs/xfs/xfs_inode.h                       |    6 +++--
 include/linux/fs.h                       |    2 -
 ipc/mqueue.c                             |    7 ++++--
 kernel/futex.c                           |    4 ++-
 mm/shmem.c                               |    4 ++-
 net/socket.c                             |    4 ++-
 58 files changed, 200 insertions(+), 90 deletions(-)

Index: linux-2.6/arch/powerpc/platforms/cell/spufs/file.c
===================================================================
--- linux-2.6.orig/arch/powerpc/platforms/cell/spufs/file.c	2010-10-19 14:17:25.000000000 +1100
+++ linux-2.6/arch/powerpc/platforms/cell/spufs/file.c	2010-10-19 14:19:16.000000000 +1100
@@ -1549,7 +1549,7 @@
 	if (ctx->owner != current->mm)
 		return -EINVAL;
 
-	if (atomic_read(&inode->i_count) != 1)
+	if (inode->i_count != 1)
 		return -EBUSY;
 
 	mutex_lock(&ctx->mapping_lock);
Index: linux-2.6/fs/affs/inode.c
===================================================================
--- linux-2.6.orig/fs/affs/inode.c	2010-10-19 14:17:25.000000000 +1100
+++ linux-2.6/fs/affs/inode.c	2010-10-19 14:19:18.000000000 +1100
@@ -388,7 +388,9 @@
 		affs_adjust_checksum(inode_bh, block - be32_to_cpu(chain));
 		mark_buffer_dirty_inode(inode_bh, inode);
 		inode->i_nlink = 2;
-		atomic_inc(&inode->i_count);
+		spin_lock(&inode->i_lock);
+		inode->i_count++;
+		spin_unlock(&inode->i_lock);
 	}
 	affs_fix_checksum(sb, bh);
 	mark_buffer_dirty_inode(bh, inode);
Index: linux-2.6/fs/afs/dir.c
===================================================================
--- linux-2.6.orig/fs/afs/dir.c	2010-10-19 14:17:26.000000000 +1100
+++ linux-2.6/fs/afs/dir.c	2010-10-19 14:19:19.000000000 +1100
@@ -1045,7 +1045,9 @@
 	if (ret < 0)
 		goto link_error;
 
-	atomic_inc(&vnode->vfs_inode.i_count);
+	spin_lock(&vnode->vfs_inode.i_lock);
+	vnode->vfs_inode.i_count++;
+	spin_unlock(&vnode->vfs_inode.i_lock);
 	d_instantiate(dentry, &vnode->vfs_inode);
 	key_put(key);
 	_leave(" = 0");
Index: linux-2.6/fs/anon_inodes.c
===================================================================
--- linux-2.6.orig/fs/anon_inodes.c	2010-10-19 14:17:26.000000000 +1100
+++ linux-2.6/fs/anon_inodes.c	2010-10-19 14:19:22.000000000 +1100
@@ -114,7 +114,9 @@
 	 * so we can avoid doing an igrab() and we can use an open-coded
 	 * atomic_inc().
 	 */
-	atomic_inc(&anon_inode_inode->i_count);
+	spin_lock(&anon_inode_inode->i_lock);
+	anon_inode_inode->i_count++;
+	spin_unlock(&anon_inode_inode->i_lock);
 
 	path.dentry->d_op = &anon_inodefs_dentry_operations;
 	d_instantiate(path.dentry, anon_inode_inode);
Index: linux-2.6/fs/block_dev.c
===================================================================
--- linux-2.6.orig/fs/block_dev.c	2010-10-19 14:17:25.000000000 +1100
+++ linux-2.6/fs/block_dev.c	2010-10-19 14:19:28.000000000 +1100
@@ -550,7 +550,12 @@
  */
 struct block_device *bdgrab(struct block_device *bdev)
 {
-	atomic_inc(&bdev->bd_inode->i_count);
+	struct inode *inode = bdev->bd_inode;
+
+	spin_lock(&inode->i_lock);
+	inode->i_count++;
+	spin_unlock(&inode->i_lock);
+
 	return bdev;
 }
 
@@ -580,7 +585,9 @@
 	spin_lock(&bdev_lock);
 	bdev = inode->i_bdev;
 	if (bdev) {
-		atomic_inc(&bdev->bd_inode->i_count);
+		spin_lock(&bdev->bd_inode->i_lock);
+		bdev->bd_inode->i_count++;
+		spin_unlock(&bdev->bd_inode->i_lock);
 		spin_unlock(&bdev_lock);
 		return bdev;
 	}
@@ -596,7 +603,9 @@
 			 * So, we can access it via ->i_mapping always
 			 * without igrab().
 			 */
-			atomic_inc(&bdev->bd_inode->i_count);
+			spin_lock(&bdev->bd_inode->i_lock);
+			bdev->bd_inode->i_count++;
+			spin_unlock(&bdev->bd_inode->i_lock);
 			inode->i_bdev = bdev;
 			inode->i_mapping = bdev->bd_inode->i_mapping;
 			list_add(&inode->i_devices, &bdev->bd_inodes);
Index: linux-2.6/fs/ext2/namei.c
===================================================================
--- linux-2.6.orig/fs/ext2/namei.c	2010-10-19 14:17:26.000000000 +1100
+++ linux-2.6/fs/ext2/namei.c	2010-10-19 14:19:18.000000000 +1100
@@ -206,7 +206,9 @@
 
 	inode->i_ctime = CURRENT_TIME_SEC;
 	inode_inc_link_count(inode);
-	atomic_inc(&inode->i_count);
+	spin_lock(&inode->i_lock);
+	inode->i_count++;
+	spin_unlock(&inode->i_lock);
 
 	err = ext2_add_link(dentry, inode);
 	if (!err) {
Index: linux-2.6/fs/ext3/ialloc.c
===================================================================
--- linux-2.6.orig/fs/ext3/ialloc.c	2010-10-19 14:17:26.000000000 +1100
+++ linux-2.6/fs/ext3/ialloc.c	2010-10-19 14:19:16.000000000 +1100
@@ -100,9 +100,9 @@
 	struct ext3_sb_info *sbi;
 	int fatal = 0, err;
 
-	if (atomic_read(&inode->i_count) > 1) {
+	if (inode->i_count > 1) {
 		printk ("ext3_free_inode: inode has count=%d\n",
-					atomic_read(&inode->i_count));
+					inode->i_count);
 		return;
 	}
 	if (inode->i_nlink) {
Index: linux-2.6/fs/ext3/namei.c
===================================================================
--- linux-2.6.orig/fs/ext3/namei.c	2010-10-19 14:17:26.000000000 +1100
+++ linux-2.6/fs/ext3/namei.c	2010-10-19 14:19:19.000000000 +1100
@@ -2260,7 +2260,9 @@
 
 	inode->i_ctime = CURRENT_TIME_SEC;
 	inc_nlink(inode);
-	atomic_inc(&inode->i_count);
+	spin_lock(&inode->i_lock);
+	inode->i_count++;
+	spin_unlock(&inode->i_lock);
 
 	err = ext3_add_entry(handle, dentry, inode);
 	if (!err) {
Index: linux-2.6/fs/fs-writeback.c
===================================================================
--- linux-2.6.orig/fs/fs-writeback.c	2010-10-19 14:18:58.000000000 +1100
+++ linux-2.6/fs/fs-writeback.c	2010-10-19 14:19:34.000000000 +1100
@@ -317,7 +317,7 @@
 	unsigned dirty;
 	int ret;
 
-	if (!atomic_read(&inode->i_count))
+	if (!inode->i_count)
 		WARN_ON(!(inode->i_state & (I_WILL_FREE|I_FREEING)));
 	else
 		WARN_ON(inode->i_state & I_WILL_FREE);
@@ -414,7 +414,7 @@
 			 * completion.
 			 */
 			redirty_tail(inode);
-		} else if (atomic_read(&inode->i_count)) {
+		} else if (inode->i_count) {
 			/*
 			 * The inode is clean, inuse
 			 */
Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c	2010-10-19 14:18:58.000000000 +1100
+++ linux-2.6/fs/inode.c	2010-10-19 14:19:34.000000000 +1100
@@ -32,14 +32,13 @@
  * inode_hash_lock protects:
  *   inode hash table, i_hash
  * inode->i_lock protects:
- *   i_state
+ *   i_state, i_count
  *
  * Ordering:
  * inode_lock
  *   sb_inode_list_lock
  *     inode->i_lock
- * inode_lock
- *   inode_hash_lock
+ *       inode_hash_lock
  */
 /*
  * This is needed for the following functions:
@@ -150,7 +149,7 @@
 	inode->i_sb = sb;
 	inode->i_blkbits = sb->s_blocksize_bits;
 	inode->i_flags = 0;
-	atomic_set(&inode->i_count, 1);
+	inode->i_count = 1;
 	inode->i_op = &empty_iops;
 	inode->i_fop = &empty_fops;
 	inode->i_nlink = 1;
@@ -301,7 +300,8 @@
 {
 	assert_spin_locked(&inode->i_lock);
 
-	if (atomic_inc_return(&inode->i_count) != 1)
+	inode->i_count++;
+	if (inode->i_count > 1)
 		return;
 
 	if (!(inode->i_state & (I_DIRTY|I_SYNC)))
@@ -407,7 +407,7 @@
 			continue;
 		}
 		invalidate_inode_buffers(inode);
-		if (!atomic_read(&inode->i_count)) {
+		if (!inode->i_count) {
 			list_move(&inode->i_list, dispose);
 			WARN_ON(inode->i_state & I_NEW);
 			inode->i_state |= I_FREEING;
@@ -457,7 +457,7 @@
 		return 0;
 	if (inode_has_buffers(inode))
 		return 0;
-	if (atomic_read(&inode->i_count))
+	if (inode->i_count)
 		return 0;
 	if (inode->i_data.nrpages)
 		return 0;
@@ -495,7 +495,7 @@
 		inode = list_entry(inode_unused.prev, struct inode, i_list);
 
 		spin_lock(&inode->i_lock);
-		if (inode->i_state || atomic_read(&inode->i_count)) {
+		if (inode->i_state || inode->i_count) {
 			list_move(&inode->i_list, &inode_unused);
 			spin_unlock(&inode->i_lock);
 			continue;
@@ -1310,8 +1310,6 @@
 	const struct super_operations *op = inode->i_sb->s_op;
 	int drop;
 
-	spin_lock(&sb_inode_list_lock);
-	spin_lock(&inode->i_lock);
 	if (op && op->drop_inode)
 		drop = op->drop_inode(inode);
 	else
@@ -1376,8 +1374,24 @@
 	if (inode) {
 		BUG_ON(inode->i_state & I_CLEAR);
 
-		if (atomic_dec_and_lock(&inode->i_count, &inode_lock))
+retry:
+		spin_lock(&inode->i_lock);
+		if (inode->i_count == 1) {
+			if (!spin_trylock(&inode_lock)) {
+				spin_unlock(&inode->i_lock);
+				goto retry;
+			}
+			if (!spin_trylock(&sb_inode_list_lock)) {
+				spin_unlock(&inode_lock);
+				spin_unlock(&inode->i_lock);
+				goto retry;
+			}
+			inode->i_count--;
 			iput_final(inode);
+		} else {
+			inode->i_count--;
+			spin_unlock(&inode->i_lock);
+		}
 	}
 }
 EXPORT_SYMBOL(iput);
Index: linux-2.6/fs/libfs.c
===================================================================
--- linux-2.6.orig/fs/libfs.c	2010-10-19 14:17:28.000000000 +1100
+++ linux-2.6/fs/libfs.c	2010-10-19 14:19:18.000000000 +1100
@@ -255,7 +255,9 @@
 
 	inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
 	inc_nlink(inode);
-	atomic_inc(&inode->i_count);
+	spin_lock(&inode->i_lock);
+	inode->i_count++;
+	spin_unlock(&inode->i_lock);
 	dget(dentry);
 	d_instantiate(dentry, inode);
 	return 0;
Index: linux-2.6/fs/locks.c
===================================================================
--- linux-2.6.orig/fs/locks.c	2010-10-19 14:17:25.000000000 +1100
+++ linux-2.6/fs/locks.c	2010-10-19 14:19:16.000000000 +1100
@@ -1375,8 +1375,8 @@
 		if ((arg == F_RDLCK) && (atomic_read(&inode->i_writecount) > 0))
 			goto out;
 		if ((arg == F_WRLCK)
-		    && ((atomic_read(&dentry->d_count) > 1)
-			|| (atomic_read(&inode->i_count) > 1)))
+		    && (atomic_read(&dentry->d_count) > 1
+			|| inode->i_count > 1))
 			goto out;
 	}
 
Index: linux-2.6/fs/namei.c
===================================================================
--- linux-2.6.orig/fs/namei.c	2010-10-19 14:17:25.000000000 +1100
+++ linux-2.6/fs/namei.c	2010-10-19 14:19:18.000000000 +1100
@@ -2290,8 +2290,11 @@
 		if (nd.last.name[nd.last.len])
 			goto slashes;
 		inode = dentry->d_inode;
-		if (inode)
-			atomic_inc(&inode->i_count);
+		if (inode) {
+			spin_lock(&inode->i_lock);
+			inode->i_count++;
+			spin_unlock(&inode->i_lock);
+		}
 		error = mnt_want_write(nd.path.mnt);
 		if (error)
 			goto exit2;
Index: linux-2.6/fs/nfs/dir.c
===================================================================
--- linux-2.6.orig/fs/nfs/dir.c	2010-10-19 14:17:26.000000000 +1100
+++ linux-2.6/fs/nfs/dir.c	2010-10-19 14:19:18.000000000 +1100
@@ -1580,7 +1580,9 @@
 	d_drop(dentry);
 	error = NFS_PROTO(dir)->link(inode, dir, &dentry->d_name);
 	if (error == 0) {
-		atomic_inc(&inode->i_count);
+		spin_lock(&inode->i_lock);
+		inode->i_count++;
+		spin_unlock(&inode->i_lock);
 		d_add(dentry, inode);
 	}
 	return error;
Index: linux-2.6/fs/xfs/linux-2.6/xfs_iops.c
===================================================================
--- linux-2.6.orig/fs/xfs/linux-2.6/xfs_iops.c	2010-10-19 14:17:26.000000000 +1100
+++ linux-2.6/fs/xfs/linux-2.6/xfs_iops.c	2010-10-19 14:19:18.000000000 +1100
@@ -352,7 +352,9 @@
 	if (unlikely(error))
 		return -error;
 
-	atomic_inc(&inode->i_count);
+	spin_lock(&inode->i_lock);
+	inode->i_count++;
+	spin_unlock(&inode->i_lock);
 	d_instantiate(dentry, inode);
 	return 0;
 }
Index: linux-2.6/fs/xfs/xfs_inode.h
===================================================================
--- linux-2.6.orig/fs/xfs/xfs_inode.h	2010-10-19 14:17:26.000000000 +1100
+++ linux-2.6/fs/xfs/xfs_inode.h	2010-10-19 14:19:18.000000000 +1100
@@ -481,8 +481,10 @@
 
 #define IHOLD(ip) \
 do { \
-	ASSERT(atomic_read(&VFS_I(ip)->i_count) > 0) ; \
-	atomic_inc(&(VFS_I(ip)->i_count)); \
+	spin_lock(&VFS_I(ip)->i_lock); \
+	ASSERT(VFS_I(ip)->i_count > 0) ; \
+	VFS_I(ip)->i_count++; \
+	spin_unlock(&VFS_I(ip)->i_lock); \
 	trace_xfs_ihold(ip, _THIS_IP_); \
 } while (0)
 
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h	2010-10-19 14:17:25.000000000 +1100
+++ linux-2.6/include/linux/fs.h	2010-10-19 14:19:33.000000000 +1100
@@ -728,7 +728,7 @@
 	struct list_head	i_sb_list;
 	struct list_head	i_dentry;
 	unsigned long		i_ino;
-	atomic_t		i_count;
+	unsigned int		i_count;
 	unsigned int		i_nlink;
 	uid_t			i_uid;
 	gid_t			i_gid;
Index: linux-2.6/ipc/mqueue.c
===================================================================
--- linux-2.6.orig/ipc/mqueue.c	2010-10-19 14:17:25.000000000 +1100
+++ linux-2.6/ipc/mqueue.c	2010-10-19 14:19:28.000000000 +1100
@@ -768,8 +768,11 @@
 	}
 
 	inode = dentry->d_inode;
-	if (inode)
-		atomic_inc(&inode->i_count);
+	if (inode) {
+		spin_lock(&inode->i_lock);
+		inode->i_count++;
+		spin_unlock(&inode->i_lock);
+	}
 	err = mnt_want_write(ipc_ns->mq_mnt);
 	if (err)
 		goto out_err;
Index: linux-2.6/kernel/futex.c
===================================================================
--- linux-2.6.orig/kernel/futex.c	2010-10-19 14:17:25.000000000 +1100
+++ linux-2.6/kernel/futex.c	2010-10-19 14:19:18.000000000 +1100
@@ -168,7 +168,9 @@
 
 	switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) {
 	case FUT_OFF_INODE:
-		atomic_inc(&key->shared.inode->i_count);
+		spin_lock(&key->shared.inode->i_lock);
+		key->shared.inode->i_count++;
+		spin_unlock(&key->shared.inode->i_lock);
 		break;
 	case FUT_OFF_MMSHARED:
 		atomic_inc(&key->private.mm->mm_count);
Index: linux-2.6/mm/shmem.c
===================================================================
--- linux-2.6.orig/mm/shmem.c	2010-10-19 14:17:25.000000000 +1100
+++ linux-2.6/mm/shmem.c	2010-10-19 14:19:31.000000000 +1100
@@ -1903,7 +1903,9 @@
 	dir->i_size += BOGO_DIRENT_SIZE;
 	inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
 	inc_nlink(inode);
-	atomic_inc(&inode->i_count);	/* New dentry reference */
+	spin_lock(&inode->i_lock);
+	inode->i_count++;	/* New dentry reference */
+	spin_unlock(&inode->i_lock);
 	dget(dentry);		/* Extra pinning count for the created dentry */
 	d_instantiate(dentry, inode);
 out:
Index: linux-2.6/fs/bfs/dir.c
===================================================================
--- linux-2.6.orig/fs/bfs/dir.c	2010-10-19 14:17:26.000000000 +1100
+++ linux-2.6/fs/bfs/dir.c	2010-10-19 14:19:18.000000000 +1100
@@ -176,7 +176,9 @@
 	inc_nlink(inode);
 	inode->i_ctime = CURRENT_TIME_SEC;
 	mark_inode_dirty(inode);
-	atomic_inc(&inode->i_count);
+	spin_lock(&inode->i_lock);
+	inode->i_count++;
+	spin_unlock(&inode->i_lock);
 	d_instantiate(new, inode);
 	mutex_unlock(&info->bfs_lock);
 	return 0;
Index: linux-2.6/fs/btrfs/inode.c
===================================================================
--- linux-2.6.orig/fs/btrfs/inode.c	2010-10-19 14:17:26.000000000 +1100
+++ linux-2.6/fs/btrfs/inode.c	2010-10-19 14:19:31.000000000 +1100
@@ -1964,8 +1964,13 @@
 	struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
 	struct delayed_iput *delayed;
 
-	if (atomic_add_unless(&inode->i_count, -1, 1))
+	spin_lock(&inode->i_lock);
+	if (inode->i_count > 1) {
+		inode->i_count--;
+		spin_unlock(&inode->i_lock);
 		return;
+	}
+	spin_unlock(&inode->i_lock);
 
 	delayed = kmalloc(sizeof(*delayed), GFP_NOFS | __GFP_NOFAIL);
 	delayed->inode = inode;
@@ -2718,10 +2723,10 @@
 		return ERR_PTR(-ENOSPC);
 
 	/* check if there is someone else holds reference */
-	if (S_ISDIR(inode->i_mode) && atomic_read(&inode->i_count) > 1)
+	if (S_ISDIR(inode->i_mode) && inode->i_count > 1)
 		return ERR_PTR(-ENOSPC);
 
-	if (atomic_read(&inode->i_count) > 2)
+	if (inode->i_count > 2)
 		return ERR_PTR(-ENOSPC);
 
 	if (xchg(&root->fs_info->enospc_unlink, 1))
@@ -3939,7 +3944,7 @@
 		inode = igrab(&entry->vfs_inode);
 		if (inode) {
 			spin_unlock(&root->inode_lock);
-			if (atomic_read(&inode->i_count) > 1)
+			if (inode->i_count > 1)
 				d_prune_aliases(inode);
 			/*
 			 * btrfs_drop_inode will have it removed from
@@ -4758,7 +4763,9 @@
 	}
 
 	btrfs_set_trans_block_group(trans, dir);
-	atomic_inc(&inode->i_count);
+	spin_lock(&inode->i_lock);
+	inode->i_count++;
+	spin_unlock(&inode->i_lock);
 
 	err = btrfs_add_nondir(trans, dentry, inode, 1, index);
 
Index: linux-2.6/fs/coda/dir.c
===================================================================
--- linux-2.6.orig/fs/coda/dir.c	2010-10-19 14:17:25.000000000 +1100
+++ linux-2.6/fs/coda/dir.c	2010-10-19 14:19:18.000000000 +1100
@@ -303,7 +303,9 @@
 	}
 
 	coda_dir_update_mtime(dir_inode);
-	atomic_inc(&inode->i_count);
+	spin_lock(&inode->i_lock);
+	inode->i_count++;
+	spin_unlock(&inode->i_lock);
 	d_instantiate(de, inode);
 	inc_nlink(inode);
 
Index: linux-2.6/fs/exofs/inode.c
===================================================================
--- linux-2.6.orig/fs/exofs/inode.c	2010-10-19 14:17:26.000000000 +1100
+++ linux-2.6/fs/exofs/inode.c	2010-10-19 14:19:18.000000000 +1100
@@ -1107,7 +1107,9 @@
 
 	set_obj_created(oi);
 
-	atomic_dec(&inode->i_count);
+	spin_lock(&inode->i_lock);
+	inode->i_count--;
+	spin_unlock(&inode->i_lock);
 	wake_up(&oi->i_wq);
 }
 
@@ -1160,14 +1162,18 @@
 	/* increment the refcount so that the inode will still be around when we
 	 * reach the callback
 	 */
-	atomic_inc(&inode->i_count);
+	spin_lock(&inode->i_lock);
+	inode->i_count++;
+	spin_unlock(&inode->i_lock);
 
 	ios->done = create_done;
 	ios->private = inode;
 	ios->cred = oi->i_cred;
 	ret = exofs_sbi_create(ios);
 	if (ret) {
-		atomic_dec(&inode->i_count);
+		spin_lock(&inode->i_lock);
+		inode->i_count--;
+		spin_unlock(&inode->i_lock);
 		exofs_put_io_state(ios);
 		return ERR_PTR(ret);
 	}
Index: linux-2.6/fs/exofs/namei.c
===================================================================
--- linux-2.6.orig/fs/exofs/namei.c	2010-10-19 14:17:26.000000000 +1100
+++ linux-2.6/fs/exofs/namei.c	2010-10-19 14:19:18.000000000 +1100
@@ -153,7 +153,9 @@
 
 	inode->i_ctime = CURRENT_TIME;
 	inode_inc_link_count(inode);
-	atomic_inc(&inode->i_count);
+	spin_lock(&inode->i_lock);
+	inode->i_count++;
+	spin_unlock(&inode->i_lock);
 
 	return exofs_add_nondir(dentry, inode);
 }
Index: linux-2.6/fs/ext4/ialloc.c
===================================================================
--- linux-2.6.orig/fs/ext4/ialloc.c	2010-10-19 14:17:25.000000000 +1100
+++ linux-2.6/fs/ext4/ialloc.c	2010-10-19 14:19:16.000000000 +1100
@@ -189,9 +189,9 @@
 	struct ext4_sb_info *sbi;
 	int fatal = 0, err, count, cleared;
 
-	if (atomic_read(&inode->i_count) > 1) {
+	if (inode->i_count > 1) {
 		printk(KERN_ERR "ext4_free_inode: inode has count=%d\n",
-		       atomic_read(&inode->i_count));
+		       inode->i_count);
 		return;
 	}
 	if (inode->i_nlink) {
Index: linux-2.6/fs/ext4/namei.c
===================================================================
--- linux-2.6.orig/fs/ext4/namei.c	2010-10-19 14:17:25.000000000 +1100
+++ linux-2.6/fs/ext4/namei.c	2010-10-19 14:19:18.000000000 +1100
@@ -2312,7 +2312,9 @@
 
 	inode->i_ctime = ext4_current_time(inode);
 	ext4_inc_count(handle, inode);
-	atomic_inc(&inode->i_count);
+	spin_lock(&inode->i_lock);
+	inode->i_count++;
+	spin_unlock(&inode->i_lock);
 
 	err = ext4_add_entry(handle, dentry, inode);
 	if (!err) {
Index: linux-2.6/fs/gfs2/ops_inode.c
===================================================================
--- linux-2.6.orig/fs/gfs2/ops_inode.c	2010-10-19 14:17:26.000000000 +1100
+++ linux-2.6/fs/gfs2/ops_inode.c	2010-10-19 14:19:19.000000000 +1100
@@ -253,7 +253,9 @@
 	gfs2_holder_uninit(ghs);
 	gfs2_holder_uninit(ghs + 1);
 	if (!error) {
-		atomic_inc(&inode->i_count);
+		spin_lock(&inode->i_lock);
+		inode->i_count++;
+		spin_unlock(&inode->i_lock);
 		d_instantiate(dentry, inode);
 		mark_inode_dirty(inode);
 	}
Index: linux-2.6/fs/hfsplus/dir.c
===================================================================
--- linux-2.6.orig/fs/hfsplus/dir.c	2010-10-19 14:17:26.000000000 +1100
+++ linux-2.6/fs/hfsplus/dir.c	2010-10-19 14:19:18.000000000 +1100
@@ -301,7 +301,9 @@
 
 	inc_nlink(inode);
 	hfsplus_instantiate(dst_dentry, inode, cnid);
-	atomic_inc(&inode->i_count);
+	spin_lock(&inode->i_lock);
+	inode->i_count++;
+	spin_unlock(&inode->i_lock);
 	inode->i_ctime = CURRENT_TIME_SEC;
 	mark_inode_dirty(inode);
 	HFSPLUS_SB(sb).file_count++;
Index: linux-2.6/fs/hpfs/inode.c
===================================================================
--- linux-2.6.orig/fs/hpfs/inode.c	2010-10-19 14:17:25.000000000 +1100
+++ linux-2.6/fs/hpfs/inode.c	2010-10-19 14:19:16.000000000 +1100
@@ -183,7 +183,7 @@
 	struct hpfs_inode_info *hpfs_inode = hpfs_i(i);
 	struct inode *parent;
 	if (i->i_ino == hpfs_sb(i->i_sb)->sb_root) return;
-	if (hpfs_inode->i_rddir_off && !atomic_read(&i->i_count)) {
+	if (hpfs_inode->i_rddir_off && !i->i_count) {
 		if (*hpfs_inode->i_rddir_off) printk("HPFS: write_inode: some position still there\n");
 		kfree(hpfs_inode->i_rddir_off);
 		hpfs_inode->i_rddir_off = NULL;
Index: linux-2.6/fs/jffs2/dir.c
===================================================================
--- linux-2.6.orig/fs/jffs2/dir.c	2010-10-19 14:17:25.000000000 +1100
+++ linux-2.6/fs/jffs2/dir.c	2010-10-19 14:19:18.000000000 +1100
@@ -289,7 +289,9 @@
 		mutex_unlock(&f->sem);
 		d_instantiate(dentry, old_dentry->d_inode);
 		dir_i->i_mtime = dir_i->i_ctime = ITIME(now);
-		atomic_inc(&old_dentry->d_inode->i_count);
+		spin_lock(&old_dentry->d_inode->i_lock);
+		old_dentry->d_inode->i_count++;
+		spin_unlock(&old_dentry->d_inode->i_lock);
 	}
 	return ret;
 }
@@ -864,7 +866,9 @@
 		printk(KERN_NOTICE "jffs2_rename(): Link succeeded, unlink failed (err %d). You now have a hard link\n", ret);
 		/* Might as well let the VFS know */
 		d_instantiate(new_dentry, old_dentry->d_inode);
-		atomic_inc(&old_dentry->d_inode->i_count);
+		spin_lock(&old_dentry->d_inode->i_lock);
+		old_dentry->d_inode->i_count++;
+		spin_unlock(&old_dentry->d_inode->i_lock);
 		new_dir_i->i_mtime = new_dir_i->i_ctime = ITIME(now);
 		return ret;
 	}
Index: linux-2.6/fs/jfs/jfs_txnmgr.c
===================================================================
--- linux-2.6.orig/fs/jfs/jfs_txnmgr.c	2010-10-19 14:17:26.000000000 +1100
+++ linux-2.6/fs/jfs/jfs_txnmgr.c	2010-10-19 14:19:19.000000000 +1100
@@ -1279,7 +1279,9 @@
 	 * lazy commit thread finishes processing
 	 */
 	if (tblk->xflag & COMMIT_DELETE) {
-		atomic_inc(&tblk->u.ip->i_count);
+		spin_lock(&tblk->u.ip->i_lock);
+		tblk->u.ip->i_count++;
+		spin_unlock(&tblk->u.ip->i_lock);
 		/*
 		 * Avoid a rare deadlock
 		 *
Index: linux-2.6/fs/jfs/namei.c
===================================================================
--- linux-2.6.orig/fs/jfs/namei.c	2010-10-19 14:17:26.000000000 +1100
+++ linux-2.6/fs/jfs/namei.c	2010-10-19 14:19:19.000000000 +1100
@@ -839,7 +839,9 @@
 	ip->i_ctime = CURRENT_TIME;
 	dir->i_ctime = dir->i_mtime = CURRENT_TIME;
 	mark_inode_dirty(dir);
-	atomic_inc(&ip->i_count);
+	spin_lock(&ip->i_lock);
+	ip->i_count++;
+	spin_unlock(&ip->i_lock);
 
 	iplist[0] = ip;
 	iplist[1] = dir;
Index: linux-2.6/fs/minix/namei.c
===================================================================
--- linux-2.6.orig/fs/minix/namei.c	2010-10-19 14:17:26.000000000 +1100
+++ linux-2.6/fs/minix/namei.c	2010-10-19 14:19:19.000000000 +1100
@@ -101,7 +101,9 @@
 
 	inode->i_ctime = CURRENT_TIME_SEC;
 	inode_inc_link_count(inode);
-	atomic_inc(&inode->i_count);
+	spin_lock(&inode->i_lock);
+	inode->i_count++;
+	spin_unlock(&inode->i_lock);
 	return add_nondir(dentry, inode);
 }
 
Index: linux-2.6/fs/nfs/inode.c
===================================================================
--- linux-2.6.orig/fs/nfs/inode.c	2010-10-19 14:17:26.000000000 +1100
+++ linux-2.6/fs/nfs/inode.c	2010-10-19 14:19:29.000000000 +1100
@@ -384,7 +384,7 @@
 	dprintk("NFS: nfs_fhget(%s/%Ld ct=%d)\n",
 		inode->i_sb->s_id,
 		(long long)NFS_FILEID(inode),
-		atomic_read(&inode->i_count));
+		inode->i_count);
 
 out:
 	return inode;
@@ -1190,7 +1190,7 @@
 
 	dfprintk(VFS, "NFS: %s(%s/%ld ct=%d info=0x%x)\n",
 			__func__, inode->i_sb->s_id, inode->i_ino,
-			atomic_read(&inode->i_count), fattr->valid);
+			inode->i_count, fattr->valid);
 
 	if ((fattr->valid & NFS_ATTR_FATTR_FILEID) && nfsi->fileid != fattr->fileid)
 		goto out_fileid;
Index: linux-2.6/fs/nilfs2/mdt.c
===================================================================
--- linux-2.6.orig/fs/nilfs2/mdt.c	2010-10-19 14:17:25.000000000 +1100
+++ linux-2.6/fs/nilfs2/mdt.c	2010-10-19 14:19:21.000000000 +1100
@@ -480,7 +480,7 @@
 		inode->i_sb = sb; /* sb may be NULL for some meta data files */
 		inode->i_blkbits = nilfs->ns_blocksize_bits;
 		inode->i_flags = 0;
-		atomic_set(&inode->i_count, 1);
+		inode->i_count = 1;
 		inode->i_nlink = 1;
 		inode->i_ino = ino;
 		inode->i_mode = S_IFREG;
Index: linux-2.6/fs/nilfs2/namei.c
===================================================================
--- linux-2.6.orig/fs/nilfs2/namei.c	2010-10-19 14:17:25.000000000 +1100
+++ linux-2.6/fs/nilfs2/namei.c	2010-10-19 14:19:18.000000000 +1100
@@ -219,7 +219,9 @@
 
 	inode->i_ctime = CURRENT_TIME;
 	inode_inc_link_count(inode);
-	atomic_inc(&inode->i_count);
+	spin_lock(&inode->i_lock);
+	inode->i_count++;
+	spin_unlock(&inode->i_lock);
 
 	err = nilfs_add_nondir(dentry, inode);
 	if (!err)
Index: linux-2.6/fs/ocfs2/namei.c
===================================================================
--- linux-2.6.orig/fs/ocfs2/namei.c	2010-10-19 14:17:26.000000000 +1100
+++ linux-2.6/fs/ocfs2/namei.c	2010-10-19 14:19:19.000000000 +1100
@@ -741,7 +741,9 @@
 		goto out_commit;
 	}
 
-	atomic_inc(&inode->i_count);
+	spin_lock(&inode->i_lock);
+	inode->i_count++;
+	spin_unlock(&inode->i_lock);
 	dentry->d_op = &ocfs2_dentry_ops;
 	d_instantiate(dentry, inode);
 
Index: linux-2.6/fs/reiserfs/namei.c
===================================================================
--- linux-2.6.orig/fs/reiserfs/namei.c	2010-10-19 14:17:26.000000000 +1100
+++ linux-2.6/fs/reiserfs/namei.c	2010-10-19 14:19:18.000000000 +1100
@@ -1156,7 +1156,9 @@
 	inode->i_ctime = CURRENT_TIME_SEC;
 	reiserfs_update_sd(&th, inode);
 
-	atomic_inc(&inode->i_count);
+	spin_lock(&inode->i_lock);
+	inode->i_count++;
+	spin_unlock(&inode->i_lock);
 	d_instantiate(dentry, inode);
 	retval = journal_end(&th, dir->i_sb, jbegin_count);
 	reiserfs_write_unlock(dir->i_sb);
Index: linux-2.6/fs/reiserfs/stree.c
===================================================================
--- linux-2.6.orig/fs/reiserfs/stree.c	2010-10-19 14:17:26.000000000 +1100
+++ linux-2.6/fs/reiserfs/stree.c	2010-10-19 14:19:16.000000000 +1100
@@ -1477,7 +1477,7 @@
 	 ** reading in the last block.  The user will hit problems trying to
 	 ** read the file, but for now we just skip the indirect2direct
 	 */
-	if (atomic_read(&inode->i_count) > 1 ||
+	if (inode->i_count > 1 ||
 	    !tail_has_to_be_packed(inode) ||
 	    !page || (REISERFS_I(inode)->i_flags & i_nopack_mask)) {
 		/* leave tail in an unformatted node */
Index: linux-2.6/fs/sysv/namei.c
===================================================================
--- linux-2.6.orig/fs/sysv/namei.c	2010-10-19 14:17:26.000000000 +1100
+++ linux-2.6/fs/sysv/namei.c	2010-10-19 14:19:19.000000000 +1100
@@ -126,7 +126,9 @@
 
 	inode->i_ctime = CURRENT_TIME_SEC;
 	inode_inc_link_count(inode);
-	atomic_inc(&inode->i_count);
+	spin_lock(&inode->i_lock);
+	inode->i_count++;
+	spin_unlock(&inode->i_lock);
 
 	return add_nondir(dentry, inode);
 }
Index: linux-2.6/fs/ubifs/dir.c
===================================================================
--- linux-2.6.orig/fs/ubifs/dir.c	2010-10-19 14:17:26.000000000 +1100
+++ linux-2.6/fs/ubifs/dir.c	2010-10-19 14:19:18.000000000 +1100
@@ -550,7 +550,9 @@
 
 	lock_2_inodes(dir, inode);
 	inc_nlink(inode);
-	atomic_inc(&inode->i_count);
+	spin_lock(&inode->i_lock);
+	inode->i_count++;
+	spin_unlock(&inode->i_lock);
 	inode->i_ctime = ubifs_current_time(inode);
 	dir->i_size += sz_change;
 	dir_ui->ui_size = dir->i_size;
Index: linux-2.6/fs/ubifs/super.c
===================================================================
--- linux-2.6.orig/fs/ubifs/super.c	2010-10-19 14:17:26.000000000 +1100
+++ linux-2.6/fs/ubifs/super.c	2010-10-19 14:19:29.000000000 +1100
@@ -342,7 +342,7 @@
 		goto out;
 
 	dbg_gen("inode %lu, mode %#x", inode->i_ino, (int)inode->i_mode);
-	ubifs_assert(!atomic_read(&inode->i_count));
+	ubifs_assert(!inode->i_count);
 
 	truncate_inode_pages(&inode->i_data, 0);
 
Index: linux-2.6/fs/udf/namei.c
===================================================================
--- linux-2.6.orig/fs/udf/namei.c	2010-10-19 14:17:25.000000000 +1100
+++ linux-2.6/fs/udf/namei.c	2010-10-19 14:19:18.000000000 +1100
@@ -1101,7 +1101,9 @@
 	inc_nlink(inode);
 	inode->i_ctime = current_fs_time(inode->i_sb);
 	mark_inode_dirty(inode);
-	atomic_inc(&inode->i_count);
+	spin_lock(&inode->i_lock);
+	inode->i_count++;
+	spin_unlock(&inode->i_lock);
 	d_instantiate(dentry, inode);
 	unlock_kernel();
 
Index: linux-2.6/fs/ufs/namei.c
===================================================================
--- linux-2.6.orig/fs/ufs/namei.c	2010-10-19 14:17:25.000000000 +1100
+++ linux-2.6/fs/ufs/namei.c	2010-10-19 14:19:18.000000000 +1100
@@ -180,7 +180,9 @@
 
 	inode->i_ctime = CURRENT_TIME_SEC;
 	inode_inc_link_count(inode);
-	atomic_inc(&inode->i_count);
+	spin_lock(&inode->i_lock);
+	inode->i_count++;
+	spin_unlock(&inode->i_lock);
 
 	error = ufs_add_nondir(dentry, inode);
 	unlock_kernel();
Index: linux-2.6/fs/notify/inode_mark.c
===================================================================
--- linux-2.6.orig/fs/notify/inode_mark.c	2010-10-19 14:18:58.000000000 +1100
+++ linux-2.6/fs/notify/inode_mark.c	2010-10-19 14:19:32.000000000 +1100
@@ -260,7 +260,7 @@
 		 * evict all inodes with zero i_count from icache which is
 		 * unnecessarily violent and may in fact be illegal to do.
 		 */
-		if (!atomic_read(&inode->i_count)) {
+		if (!inode->i_count) {
 			spin_unlock(&inode->i_lock);
 			continue;
 		}
@@ -278,7 +278,7 @@
 		/* In case the dropping of a reference would nuke next_i. */
 		if (&next_i->i_sb_list != list) {
 			spin_lock(&next_i->i_lock);
-			if (atomic_read(&next_i->i_count) &&
+			if (next_i->i_count &&
 			    !(next_i->i_state & (I_FREEING | I_WILL_FREE))) {
 				__iget(next_i);
 				need_iput = next_i;
Index: linux-2.6/fs/ntfs/super.c
===================================================================
--- linux-2.6.orig/fs/ntfs/super.c	2010-10-19 14:17:25.000000000 +1100
+++ linux-2.6/fs/ntfs/super.c	2010-10-19 14:19:18.000000000 +1100
@@ -2930,7 +2930,9 @@
 	}
 	if ((sb->s_root = d_alloc_root(vol->root_ino))) {
 		/* We increment i_count simulating an ntfs_iget(). */
-		atomic_inc(&vol->root_ino->i_count);
+		spin_lock(&vol->root_ino->i_lock);
+		vol->root_ino->i_count++;
+		spin_unlock(&vol->root_ino->i_lock);
 		ntfs_debug("Exiting, status successful.");
 		/* Release the default upcase if it has no users. */
 		mutex_lock(&ntfs_lock);
Index: linux-2.6/fs/cifs/inode.c
===================================================================
--- linux-2.6.orig/fs/cifs/inode.c	2010-10-19 14:17:25.000000000 +1100
+++ linux-2.6/fs/cifs/inode.c	2010-10-19 14:19:16.000000000 +1100
@@ -1641,7 +1641,7 @@
 	}
 
 	cFYI(1, "Revalidate: %s inode 0x%p count %d dentry: 0x%p d_time %ld "
-		 "jiffies %ld", full_path, inode, inode->i_count.counter,
+		 "jiffies %ld", full_path, inode, inode->i_count,
 		 dentry, dentry->d_time, jiffies);
 
 	if (CIFS_SB(sb)->tcon->unix_ext)
Index: linux-2.6/fs/xfs/linux-2.6/xfs_trace.h
===================================================================
--- linux-2.6.orig/fs/xfs/linux-2.6/xfs_trace.h	2010-10-19 14:17:26.000000000 +1100
+++ linux-2.6/fs/xfs/linux-2.6/xfs_trace.h	2010-10-19 14:19:16.000000000 +1100
@@ -599,7 +599,7 @@
 	TP_fast_assign(
 		__entry->dev = VFS_I(ip)->i_sb->s_dev;
 		__entry->ino = ip->i_ino;
-		__entry->count = atomic_read(&VFS_I(ip)->i_count);
+		__entry->count = VFS_I(ip)->i_count;
 		__entry->pincount = atomic_read(&ip->i_pincount);
 		__entry->caller_ip = caller_ip;
 	),
Index: linux-2.6/net/socket.c
===================================================================
--- linux-2.6.orig/net/socket.c	2010-10-19 14:17:26.000000000 +1100
+++ linux-2.6/net/socket.c	2010-10-19 14:19:29.000000000 +1100
@@ -377,7 +377,9 @@
 		  &socket_file_ops);
 	if (unlikely(!file)) {
 		/* drop dentry, keep inode */
-		atomic_inc(&path.dentry->d_inode->i_count);
+		spin_lock(&path.dentry->d_inode->i_lock);
+		path.dentry->d_inode->i_count++;
+		spin_unlock(&path.dentry->d_inode->i_lock);
 		path_put(&path);
 		put_unused_fd(fd);
 		return -ENFILE;
Index: linux-2.6/fs/nfs/nfs4state.c
===================================================================
--- linux-2.6.orig/fs/nfs/nfs4state.c	2010-10-19 14:17:26.000000000 +1100
+++ linux-2.6/fs/nfs/nfs4state.c	2010-10-19 14:18:58.000000000 +1100
@@ -506,8 +506,8 @@
 		state->owner = owner;
 		atomic_inc(&owner->so_count);
 		list_add(&state->inode_states, &nfsi->open_states);
-		state->inode = igrab(inode);
 		spin_unlock(&inode->i_lock);
+		state->inode = igrab(inode);
 		/* Note: The reclaim code dictates that we add stateless
 		 * and read-only stateids to the end of the list */
 		list_add_tail(&state->open_states, &owner->so_states);
Index: linux-2.6/fs/nfs/write.c
===================================================================
--- linux-2.6.orig/fs/nfs/write.c	2010-10-19 14:17:26.000000000 +1100
+++ linux-2.6/fs/nfs/write.c	2010-10-19 14:19:18.000000000 +1100
@@ -390,7 +390,7 @@
 	error = radix_tree_insert(&nfsi->nfs_page_tree, req->wb_index, req);
 	BUG_ON(error);
 	if (!nfsi->npages) {
-		igrab(inode);
+		__iget(inode);
 		if (nfs_have_delegation(inode, FMODE_WRITE))
 			nfsi->change_attr++;
 	}
Index: linux-2.6/fs/nfs/getroot.c
===================================================================
--- linux-2.6.orig/fs/nfs/getroot.c	2010-10-19 14:17:26.000000000 +1100
+++ linux-2.6/fs/nfs/getroot.c	2010-10-19 14:19:18.000000000 +1100
@@ -55,7 +55,9 @@
 			return -ENOMEM;
 		}
 		/* Circumvent igrab(): we know the inode is not being freed */
-		atomic_inc(&inode->i_count);
+		spin_lock(&inode->i_lock);
+		inode->i_count++;
+		spin_unlock(&inode->i_lock);
 		/*
 		 * Ensure that this dentry is invisible to d_find_alias().
 		 * Otherwise, it may be spliced into the tree by
Index: linux-2.6/drivers/staging/pohmelfs/inode.c
===================================================================
--- linux-2.6.orig/drivers/staging/pohmelfs/inode.c	2010-10-19 14:17:25.000000000 +1100
+++ linux-2.6/drivers/staging/pohmelfs/inode.c	2010-10-19 14:19:28.000000000 +1100
@@ -1289,11 +1289,11 @@
 		dprintk("%s: ino: %llu, pi: %p, inode: %p, count: %u.\n",
 				__func__, pi->ino, pi, inode, count);
 
-		if (atomic_read(&inode->i_count) != count) {
+		if (inode->i_count != count) {
 			printk("%s: ino: %llu, pi: %p, inode: %p, count: %u, i_count: %d.\n",
 					__func__, pi->ino, pi, inode, count,
-					atomic_read(&inode->i_count));
-			count = atomic_read(&inode->i_count);
+					inode->i_count);
+			count = inode->i_count;
 			in_drop_list++;
 		}
 
@@ -1305,7 +1305,7 @@
 		pi = POHMELFS_I(inode);
 
 		dprintk("%s: ino: %llu, pi: %p, inode: %p, i_count: %u.\n",
-				__func__, pi->ino, pi, inode, atomic_read(&inode->i_count));
+				__func__, pi->ino, pi, inode, inode->i_count);
 
 		/*
 		 * These are special inodes, they were created during
@@ -1313,7 +1313,7 @@
 		 * so they live here with reference counter being 1 and prevent
 		 * umount from succeed since it believes that they are busy.
 		 */
-		count = atomic_read(&inode->i_count);
+		count = inode->i_count;
 		if (count) {
 			list_del_init(&inode->i_sb_list);
 			while (count--)
Index: linux-2.6/fs/9p/vfs_inode.c
===================================================================
--- linux-2.6.orig/fs/9p/vfs_inode.c	2010-10-19 14:17:25.000000000 +1100
+++ linux-2.6/fs/9p/vfs_inode.c	2010-10-19 14:19:28.000000000 +1100
@@ -1791,7 +1791,9 @@
 		/* Caching disabled. No need to get upto date stat info.
 		 * This dentry will be released immediately. So, just i_count++
 		 */
-		atomic_inc(&old_dentry->d_inode->i_count);
+		spin_lock(&old_dentry->d_inode->i_lock);
+		old_dentry->d_inode->i_count++;
+		spin_unlock(&old_dentry->d_inode->i_lock);
 	}
 
 	dentry->d_op = old_dentry->d_op;
Index: linux-2.6/fs/ceph/mds_client.c
===================================================================
--- linux-2.6.orig/fs/ceph/mds_client.c	2010-10-19 14:17:26.000000000 +1100
+++ linux-2.6/fs/ceph/mds_client.c	2010-10-19 14:19:16.000000000 +1100
@@ -1102,7 +1102,7 @@
 		spin_unlock(&inode->i_lock);
 		d_prune_aliases(inode);
 		dout("trim_caps_cb %p cap %p  pruned, count now %d\n",
-		     inode, cap, atomic_read(&inode->i_count));
+		     inode, cap, inode->i_count);
 		return 0;
 	}
 
Index: linux-2.6/fs/logfs/dir.c
===================================================================
--- linux-2.6.orig/fs/logfs/dir.c	2010-10-19 14:17:25.000000000 +1100
+++ linux-2.6/fs/logfs/dir.c	2010-10-19 14:19:18.000000000 +1100
@@ -569,7 +569,9 @@
 		return -EMLINK;
 
 	inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
-	atomic_inc(&inode->i_count);
+	spin_lock(&inode->i_lock);
+	inode->i_count++;
+	spin_unlock(&inode->i_lock);
 	inode->i_nlink++;
 	mark_inode_dirty_sync(inode);
 
Index: linux-2.6/fs/logfs/readwrite.c
===================================================================
--- linux-2.6.orig/fs/logfs/readwrite.c	2010-10-19 14:17:25.000000000 +1100
+++ linux-2.6/fs/logfs/readwrite.c	2010-10-19 14:19:16.000000000 +1100
@@ -1002,7 +1002,7 @@
 {
 	struct logfs_inode *li = logfs_inode(inode);
 
-	if ((inode->i_nlink == 0) && atomic_read(&inode->i_count) == 1)
+	if ((inode->i_nlink == 0) && inode->i_count == 1)
 		return 0;
 
 	if (bix < I0_BLOCKS)



^ permalink raw reply	[flat|nested] 70+ messages in thread

* [patch 09/35] fs: icache lock lru/writeback lists
  2010-10-19  3:42 [patch 00/35] my inode scaling series for review npiggin
                   ` (7 preceding siblings ...)
  2010-10-19  3:42 ` [patch 08/35] fs: icache lock i_count npiggin
@ 2010-10-19  3:42 ` npiggin
  2010-10-19  3:42 ` [patch 10/35] fs: icache atomic inodes_stat npiggin
                   ` (27 subsequent siblings)
  36 siblings, 0 replies; 70+ messages in thread
From: npiggin @ 2010-10-19  3:42 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel

[-- Attachment #1: fs-inode_lock-scale-6.patch --]
[-- Type: text/plain, Size: 11749 bytes --]

Add a new lock, wb_inode_list_lock, to protect i_list and the various
writeback/LRU lists that an inode can be put on.
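
The locking subtlety this introduces: i_lock nests outside
wb_inode_list_lock, but the writeback paths walk the lists while
holding only wb_inode_list_lock, so they have to take i_lock with a
trylock and back off on failure. A minimal userspace sketch of that
pattern using pthread spinlocks (inode_t, b_io_head and
pick_first_locked are illustrative stand-ins, not kernel code):

#include <pthread.h>
#include <stddef.h>

typedef struct inode {
	pthread_spinlock_t i_lock;
	struct inode *next;		/* stands in for i_list */
} inode_t;

/* both initialised elsewhere with pthread_spin_init() */
static pthread_spinlock_t wb_inode_list_lock;
static inode_t *b_io_head;		/* stands in for wb->b_io */

/* Return the head inode with the list lock and its i_lock both held. */
static inode_t *pick_first_locked(void)
{
	inode_t *inode;

	pthread_spin_lock(&wb_inode_list_lock);
again:
	inode = b_io_head;
	if (inode && pthread_spin_trylock(&inode->i_lock) != 0) {
		/*
		 * The i_lock holder may be spinning on the list lock
		 * (i_lock nests outside it), so drop the list lock to
		 * let them through, retake it and restart.
		 */
		pthread_spin_unlock(&wb_inode_list_lock);
		pthread_spin_lock(&wb_inode_list_lock);
		goto again;
	}
	return inode;	/* NULL, or inode with both locks held */
}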

Signed-off-by: Nick Piggin <npiggin@kernel.dk>

---
 fs/fs-writeback.c         |   48 ++++++++++++++++++++++++++++++++++++++++++++--
 fs/inode.c                |   44 ++++++++++++++++++++++++++++++++++--------
 include/linux/writeback.h |    1 
 mm/backing-dev.c          |    4 +++
 4 files changed, 87 insertions(+), 10 deletions(-)

Index: linux-2.6/fs/fs-writeback.c
===================================================================
--- linux-2.6.orig/fs/fs-writeback.c	2010-10-19 14:18:58.000000000 +1100
+++ linux-2.6/fs/fs-writeback.c	2010-10-19 14:19:33.000000000 +1100
@@ -169,6 +169,7 @@
 {
 	struct bdi_writeback *wb = &inode_to_bdi(inode)->wb;
 
+	assert_spin_locked(&wb_inode_list_lock);
 	if (!list_empty(&wb->b_dirty)) {
 		struct inode *tail;
 
@@ -186,6 +187,7 @@
 {
 	struct bdi_writeback *wb = &inode_to_bdi(inode)->wb;
 
+	assert_spin_locked(&wb_inode_list_lock);
 	list_move(&inode->i_list, &wb->b_more_io);
 }
 
@@ -226,6 +228,7 @@
 	struct inode *inode;
 	int do_sb_sort = 0;
 
+	assert_spin_locked(&wb_inode_list_lock);
 	while (!list_empty(delaying_queue)) {
 		inode = list_entry(delaying_queue->prev, struct inode, i_list);
 		if (older_than_this &&
@@ -289,11 +292,13 @@
 
 	wqh = bit_waitqueue(&inode->i_state, __I_SYNC);
 	while (inode->i_state & I_SYNC) {
+		spin_unlock(&wb_inode_list_lock);
 		spin_unlock(&inode->i_lock);
 		spin_unlock(&inode_lock);
 		__wait_on_bit(wqh, &wq, inode_wait, TASK_UNINTERRUPTIBLE);
 		spin_lock(&inode_lock);
 		spin_lock(&inode->i_lock);
+		spin_lock(&wb_inode_list_lock);
 	}
 }
 
@@ -347,6 +352,7 @@
 	/* Set I_SYNC, reset I_DIRTY_PAGES */
 	inode->i_state |= I_SYNC;
 	inode->i_state &= ~I_DIRTY_PAGES;
+	spin_unlock(&wb_inode_list_lock);
 	spin_unlock(&inode->i_lock);
 	spin_unlock(&inode_lock);
 
@@ -383,6 +389,7 @@
 
 	spin_lock(&inode_lock);
 	spin_lock(&inode->i_lock);
+	spin_lock(&wb_inode_list_lock);
 	inode->i_state &= ~I_SYNC;
 	if (!(inode->i_state & I_FREEING)) {
 		if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
@@ -469,11 +476,18 @@
 static int writeback_sb_inodes(struct super_block *sb, struct bdi_writeback *wb,
 		struct writeback_control *wbc, bool only_this_sb)
 {
+again:
 	while (!list_empty(&wb->b_io)) {
 		long pages_skipped;
 		struct inode *inode = list_entry(wb->b_io.prev,
 						 struct inode, i_list);
 
+		if (!spin_trylock(&inode->i_lock)) {
+			spin_unlock(&wb_inode_list_lock);
+			spin_lock(&wb_inode_list_lock);
+			goto again;
+		}
+
 		if (inode->i_sb != sb) {
 			if (only_this_sb) {
 				/*
@@ -482,9 +496,12 @@
 				 * to it back onto the dirty list.
 				 */
 				redirty_tail(inode);
+				spin_unlock(&inode->i_lock);
 				continue;
 			}
 
+			spin_unlock(&inode->i_lock);
+
 			/*
 			 * The inode belongs to a different superblock.
 			 * Bounce back to the caller to unpin this and
@@ -493,10 +510,9 @@
 			return 0;
 		}
 
-		spin_lock(&inode->i_lock);
 		if (inode->i_state & (I_NEW | I_WILL_FREE)) {
-			spin_unlock(&inode->i_lock);
 			requeue_io(inode);
+			spin_unlock(&inode->i_lock);
 			continue;
 		}
 		/*
@@ -519,11 +535,13 @@
 			 */
 			redirty_tail(inode);
 		}
+		spin_unlock(&wb_inode_list_lock);
 		spin_unlock(&inode->i_lock);
 		spin_unlock(&inode_lock);
 		iput(inode);
 		cond_resched();
 		spin_lock(&inode_lock);
+		spin_lock(&wb_inode_list_lock);
 		if (wbc->nr_to_write <= 0) {
 			wbc->more_io = 1;
 			return 1;
@@ -543,6 +561,9 @@
 	if (!wbc->wb_start)
 		wbc->wb_start = jiffies; /* livelock avoidance */
 	spin_lock(&inode_lock);
+again:
+	spin_lock(&wb_inode_list_lock);
+
 	if (!wbc->for_kupdate || list_empty(&wb->b_io))
 		queue_io(wb, wbc->older_than_this);
 
@@ -552,7 +573,12 @@
 		struct super_block *sb = inode->i_sb;
 
 		if (!pin_sb_for_writeback(sb)) {
+			if (!spin_trylock(&inode->i_lock)) {
+				spin_unlock(&wb_inode_list_lock);
+				goto again;
+			}
 			requeue_io(inode);
+			spin_unlock(&inode->i_lock);
 			continue;
 		}
 		ret = writeback_sb_inodes(sb, wb, wbc, false);
@@ -561,6 +587,7 @@
 		if (ret)
 			break;
 	}
+	spin_unlock(&wb_inode_list_lock);
 	spin_unlock(&inode_lock);
 	/* Leave any unwritten inodes on b_io */
 }
@@ -571,9 +598,11 @@
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
 	spin_lock(&inode_lock);
+	spin_lock(&wb_inode_list_lock);
 	if (!wbc->for_kupdate || list_empty(&wb->b_io))
 		queue_io(wb, wbc->older_than_this);
 	writeback_sb_inodes(sb, wb, wbc, true);
+	spin_unlock(&wb_inode_list_lock);
 	spin_unlock(&inode_lock);
 }
 
@@ -684,13 +713,22 @@
 		 * become available for writeback. Otherwise
 		 * we'll just busyloop.
 		 */
+retry:
 		spin_lock(&inode_lock);
+		spin_lock(&wb_inode_list_lock);
 		if (!list_empty(&wb->b_more_io))  {
 			inode = list_entry(wb->b_more_io.prev,
 						struct inode, i_list);
+			if (!spin_trylock(&inode->i_lock)) {
+				spin_unlock(&wb_inode_list_lock);
+				spin_unlock(&inode_lock);
+				goto retry;
+			}
 			trace_wbc_writeback_wait(&wbc, wb->bdi);
 			inode_wait_for_writeback(inode);
+			spin_unlock(&inode->i_lock);
 		}
+		spin_unlock(&wb_inode_list_lock);
 		spin_unlock(&inode_lock);
 	}
 
@@ -1002,7 +1040,9 @@
 			}
 
 			inode->dirtied_when = jiffies;
+			spin_lock(&wb_inode_list_lock);
 			list_move(&inode->i_list, &bdi->wb.b_dirty);
+			spin_unlock(&wb_inode_list_lock);
 		}
 	}
 out:
@@ -1195,7 +1235,9 @@
 	might_sleep();
 	spin_lock(&inode_lock);
 	spin_lock(&inode->i_lock);
+	spin_lock(&wb_inode_list_lock);
 	ret = writeback_single_inode(inode, &wbc);
+	spin_unlock(&wb_inode_list_lock);
 	spin_unlock(&inode->i_lock);
 	spin_unlock(&inode_lock);
 	if (sync)
@@ -1221,7 +1263,9 @@
 
 	spin_lock(&inode_lock);
 	spin_lock(&inode->i_lock);
+	spin_lock(&wb_inode_list_lock);
 	ret = writeback_single_inode(inode, wbc);
+	spin_unlock(&wb_inode_list_lock);
 	spin_unlock(&inode->i_lock);
 	spin_unlock(&inode_lock);
 	return ret;
Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c	2010-10-19 14:18:58.000000000 +1100
+++ linux-2.6/fs/inode.c	2010-10-19 14:19:33.000000000 +1100
@@ -31,6 +31,8 @@
  *   s_inodes, i_sb_list
  * inode_hash_lock protects:
  *   inode hash table, i_hash
+ * wb_inode_list_lock protects:
+ *   inode_in_use, inode_unused, b_io, b_more_io, b_dirty, i_list
  * inode->i_lock protects:
  *   i_state, i_count
  *
@@ -38,6 +40,7 @@
  * inode_lock
  *   sb_inode_list_lock
  *     inode->i_lock
+ *       wb_inode_list_lock
  *       inode_hash_lock
  */
 /*
@@ -99,6 +102,7 @@
  */
 DEFINE_SPINLOCK(inode_lock);
 DEFINE_SPINLOCK(sb_inode_list_lock);
+DEFINE_SPINLOCK(wb_inode_list_lock);
 static DEFINE_SPINLOCK(inode_hash_lock);
 
 /*
@@ -304,8 +308,11 @@
 	if (inode->i_count > 1)
 		return;
 
-	if (!(inode->i_state & (I_DIRTY|I_SYNC)))
+	if (!(inode->i_state & (I_DIRTY|I_SYNC))) {
+		spin_lock(&wb_inode_list_lock);
 		list_move(&inode->i_list, &inode_in_use);
+		spin_unlock(&wb_inode_list_lock);
+	}
 	inodes_stat.nr_unused--;
 }
 
@@ -408,7 +415,9 @@
 		}
 		invalidate_inode_buffers(inode);
 		if (!inode->i_count) {
+			spin_lock(&wb_inode_list_lock);
 			list_move(&inode->i_list, dispose);
+			spin_unlock(&wb_inode_list_lock);
 			WARN_ON(inode->i_state & I_NEW);
 			inode->i_state |= I_FREEING;
 			spin_unlock(&inode->i_lock);
@@ -486,6 +495,8 @@
 
 	down_read(&iprune_sem);
 	spin_lock(&inode_lock);
+again:
+	spin_lock(&wb_inode_list_lock);
 	for (nr_scanned = 0; nr_scanned < nr_to_scan; nr_scanned++) {
 		struct inode *inode;
 
@@ -494,13 +505,17 @@
 
 		inode = list_entry(inode_unused.prev, struct inode, i_list);
 
-		spin_lock(&inode->i_lock);
+		if (!spin_trylock(&inode->i_lock)) {
+			spin_unlock(&wb_inode_list_lock);
+			goto again;
+		}
 		if (inode->i_state || inode->i_count) {
 			list_move(&inode->i_list, &inode_unused);
 			spin_unlock(&inode->i_lock);
 			continue;
 		}
 		if (inode_has_buffers(inode) || inode->i_data.nrpages) {
+			spin_unlock(&wb_inode_list_lock);
 			__iget(inode);
 			spin_unlock(&inode->i_lock);
 			spin_unlock(&inode_lock);
@@ -509,11 +524,16 @@
 								0, -1);
 			iput(inode);
 			spin_lock(&inode_lock);
+again2:
+			spin_lock(&wb_inode_list_lock);
 
 			if (inode != list_entry(inode_unused.next,
 						struct inode, i_list))
 				continue;	/* wrong inode or list_empty */
-			spin_lock(&inode->i_lock);
+			if (!spin_trylock(&inode->i_lock)) {
+				spin_unlock(&wb_inode_list_lock);
+				goto again2;
+			}
 			if (!can_unuse(inode)) {
 				spin_unlock(&inode->i_lock);
 				continue;
@@ -531,6 +551,7 @@
 	else
 		__count_vm_events(PGINODESTEAL, reap);
 	spin_unlock(&inode_lock);
+	spin_unlock(&wb_inode_list_lock);
 
 	dispose_list(&freeable);
 	up_read(&iprune_sem);
@@ -663,7 +684,9 @@
 	spin_lock(&sb_inode_list_lock);
 	list_add(&inode->i_sb_list, &sb->s_inodes);
 	spin_unlock(&sb_inode_list_lock);
+	spin_lock(&wb_inode_list_lock);
 	list_add(&inode->i_list, &inode_in_use);
+	spin_unlock(&wb_inode_list_lock);
 	if (head) {
 		spin_lock(&inode_hash_lock);
 		hlist_add_head(&inode->i_hash, head);
@@ -1316,8 +1339,11 @@
 		drop = generic_drop_inode(inode);
 
 	if (!drop) {
-		if (!(inode->i_state & (I_DIRTY|I_SYNC)))
+		if (!(inode->i_state & (I_DIRTY|I_SYNC))) {
+			spin_lock(&wb_inode_list_lock);
 			list_move(&inode->i_list, &inode_unused);
+			spin_unlock(&wb_inode_list_lock);
+		}
 		inodes_stat.nr_unused++;
 		if (sb->s_flags & MS_ACTIVE) {
 			spin_unlock(&inode->i_lock);
@@ -1341,7 +1367,9 @@
 		hlist_del_init(&inode->i_hash);
 		spin_unlock(&inode_hash_lock);
 	}
+	spin_lock(&wb_inode_list_lock);
 	list_del_init(&inode->i_list);
+	spin_unlock(&wb_inode_list_lock);
 	list_del_init(&inode->i_sb_list);
 	spin_unlock(&sb_inode_list_lock);
 	WARN_ON(inode->i_state & I_NEW);
@@ -1374,17 +1402,17 @@
 	if (inode) {
 		BUG_ON(inode->i_state & I_CLEAR);
 
-retry:
+retry1:
 		spin_lock(&inode->i_lock);
 		if (inode->i_count == 1) {
 			if (!spin_trylock(&inode_lock)) {
+retry2:
 				spin_unlock(&inode->i_lock);
-				goto retry;
+				goto retry1;
 			}
 			if (!spin_trylock(&sb_inode_list_lock)) {
 				spin_unlock(&inode_lock);
-				spin_unlock(&inode->i_lock);
-				goto retry;
+				goto retry2;
 			}
 			inode->i_count--;
 			iput_final(inode);
Index: linux-2.6/include/linux/writeback.h
===================================================================
--- linux-2.6.orig/include/linux/writeback.h	2010-10-19 14:18:58.000000000 +1100
+++ linux-2.6/include/linux/writeback.h	2010-10-19 14:19:32.000000000 +1100
@@ -11,6 +11,7 @@
 
 extern spinlock_t inode_lock;
 extern spinlock_t sb_inode_list_lock;
+extern spinlock_t wb_inode_list_lock;
 extern struct list_head inode_in_use;
 extern struct list_head inode_unused;
 
Index: linux-2.6/mm/backing-dev.c
===================================================================
--- linux-2.6.orig/mm/backing-dev.c	2010-10-19 14:17:24.000000000 +1100
+++ linux-2.6/mm/backing-dev.c	2010-10-19 14:19:32.000000000 +1100
@@ -74,12 +74,14 @@
 
 	nr_wb = nr_dirty = nr_io = nr_more_io = 0;
 	spin_lock(&inode_lock);
+	spin_lock(&wb_inode_list_lock);
 	list_for_each_entry(inode, &wb->b_dirty, i_list)
 		nr_dirty++;
 	list_for_each_entry(inode, &wb->b_io, i_list)
 		nr_io++;
 	list_for_each_entry(inode, &wb->b_more_io, i_list)
 		nr_more_io++;
+	spin_unlock(&wb_inode_list_lock);
 	spin_unlock(&inode_lock);
 
 	global_dirty_limits(&background_thresh, &dirty_thresh);
@@ -683,9 +685,11 @@
 		struct bdi_writeback *dst = &default_backing_dev_info.wb;
 
 		spin_lock(&inode_lock);
+		spin_lock(&wb_inode_list_lock);
 		list_splice(&bdi->wb.b_dirty, &dst->b_dirty);
 		list_splice(&bdi->wb.b_io, &dst->b_io);
 		list_splice(&bdi->wb.b_more_io, &dst->b_more_io);
+		spin_unlock(&wb_inode_list_lock);
 		spin_unlock(&inode_lock);
 	}
 



^ permalink raw reply	[flat|nested] 70+ messages in thread

* [patch 10/35] fs: icache atomic inodes_stat
  2010-10-19  3:42 [patch 00/35] my inode scaling series for review npiggin
                   ` (8 preceding siblings ...)
  2010-10-19  3:42 ` [patch 09/35] fs: icache lock lru/writeback lists npiggin
@ 2010-10-19  3:42 ` npiggin
  2010-10-19  3:42 ` [patch 11/35] fs: icache lock inode state npiggin
                   ` (26 subsequent siblings)
  36 siblings, 0 replies; 70+ messages in thread
From: npiggin @ 2010-10-19  3:42 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel

[-- Attachment #1: fs-inode_lock-scale-5.patch --]
[-- Type: text/plain, Size: 5402 bytes --]

Protect the inodes_stat statistics with atomic ops rather than
inode_lock. Also move the nr_inodes statistic into the low-level inode
init and teardown functions, in anticipation of future patches that
skip putting some inodes (e.g. sockets) onto the sb list.
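
The upshot is that the counters need no lock at all; they are adjusted
where the inode is constructed and destroyed. A minimal C11 model of
the resulting scheme (the function names echo the kernel's, but this
is a userspace sketch, not the kernel code):

#include <stdatomic.h>

struct inodes_stat_t {
	atomic_int nr_inodes;
	atomic_int nr_unused;
};

static struct inodes_stat_t inodes_stat;

/* counted at construction, not when linked onto the sb list */
static void inode_init_always(void)
{
	atomic_fetch_add(&inodes_stat.nr_inodes, 1);
}

/* uncounted at teardown, with no global-lock round trip */
static void destroy_inode(void)
{
	atomic_fetch_sub(&inodes_stat.nr_inodes, 1);
}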

Signed-off-by: Nick Piggin <npiggin@kernel.dk>

---
 fs/fs-writeback.c  |    6 ++++--
 fs/inode.c         |   28 +++++++++++++---------------
 include/linux/fs.h |   11 ++++++++---
 3 files changed, 25 insertions(+), 20 deletions(-)

Index: linux-2.6/fs/fs-writeback.c
===================================================================
--- linux-2.6.orig/fs/fs-writeback.c	2010-10-19 14:37:09.000000000 +1100
+++ linux-2.6/fs/fs-writeback.c	2010-10-19 14:37:31.000000000 +1100
@@ -772,7 +772,8 @@
 	wb->last_old_flush = jiffies;
 	nr_pages = global_page_state(NR_FILE_DIRTY) +
 			global_page_state(NR_UNSTABLE_NFS) +
-			(inodes_stat.nr_inodes - inodes_stat.nr_unused);
+			(atomic_read(&inodes_stat.nr_inodes) -
+			atomic_read(&inodes_stat.nr_unused));
 
 	if (nr_pages) {
 		struct wb_writeback_work work = {
@@ -1156,7 +1157,8 @@
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
 	work.nr_pages = nr_dirty + nr_unstable +
-			(inodes_stat.nr_inodes - inodes_stat.nr_unused);
+			(atomic_read(&inodes_stat.nr_inodes) -
+			atomic_read(&inodes_stat.nr_unused));
 
 	bdi_queue_work(sb->s_bdi, &work);
 	wait_for_completion(&done);
Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c	2010-10-19 14:37:09.000000000 +1100
+++ linux-2.6/fs/inode.c	2010-10-19 14:37:56.000000000 +1100
@@ -122,7 +122,10 @@
 /*
  * Statistics gathering..
  */
-struct inodes_stat_t inodes_stat;
+struct inodes_stat_t inodes_stat = {
+	.nr_inodes = ATOMIC_INIT(0),
+	.nr_unused = ATOMIC_INIT(0),
+};
 
 static struct kmem_cache *inode_cachep __read_mostly;
 
@@ -213,6 +216,8 @@
 	inode->i_fsnotify_mask = 0;
 #endif
 
+	atomic_inc(&inodes_stat.nr_inodes);
+
 	return 0;
 out:
 	return -ENOMEM;
@@ -253,6 +258,7 @@
 	if (inode->i_default_acl && inode->i_default_acl != ACL_NOT_CACHED)
 		posix_acl_release(inode->i_default_acl);
 #endif
+	atomic_dec(&inodes_stat.nr_inodes);
 }
 EXPORT_SYMBOL(__destroy_inode);
 
@@ -313,7 +319,7 @@
 		list_move(&inode->i_list, &inode_in_use);
 		spin_unlock(&wb_inode_list_lock);
 	}
-	inodes_stat.nr_unused--;
+	atomic_dec(&inodes_stat.nr_unused);
 }
 
 void end_writeback(struct inode *inode)
@@ -354,8 +360,6 @@
  */
 static void dispose_list(struct list_head *head)
 {
-	int nr_disposed = 0;
-
 	while (!list_empty(head)) {
 		struct inode *inode;
 
@@ -375,11 +379,7 @@
 
 		wake_up_inode(inode);
 		destroy_inode(inode);
-		nr_disposed++;
 	}
-	spin_lock(&inode_lock);
-	inodes_stat.nr_inodes -= nr_disposed;
-	spin_unlock(&inode_lock);
 }
 
 /*
@@ -428,7 +428,7 @@
 		busy = 1;
 	}
 	/* only unused inodes may be cached with i_count zero */
-	inodes_stat.nr_unused -= count;
+	atomic_sub(count, &inodes_stat.nr_unused);
 	return busy;
 }
 
@@ -545,7 +545,7 @@
 		spin_unlock(&inode->i_lock);
 		nr_pruned++;
 	}
-	inodes_stat.nr_unused -= nr_pruned;
+	atomic_sub(nr_pruned, &inodes_stat.nr_unused);
 	if (current_is_kswapd())
 		__count_vm_events(KSWAPD_INODESTEAL, reap);
 	else
@@ -574,7 +574,7 @@
 	unsigned long nr;
 
 	shrinker_add_scan(&nr_to_scan, scanned, global,
-			inodes_stat.nr_unused,
+			atomic_read(&inodes_stat.nr_unused),
 			SHRINK_DEFAULT_SEEKS * 100 / sysctl_vfs_cache_pressure);
 	/*
 	 * Nasty deadlock avoidance.  We may hold various FS locks,
@@ -680,7 +680,6 @@
 __inode_add_to_lists(struct super_block *sb, struct hlist_head *head,
 			struct inode *inode)
 {
-	inodes_stat.nr_inodes++;
 	spin_lock(&sb_inode_list_lock);
 	list_add(&inode->i_sb_list, &sb->s_inodes);
 	spin_unlock(&sb_inode_list_lock);
@@ -1344,7 +1343,7 @@
 			list_move(&inode->i_list, &inode_unused);
 			spin_unlock(&wb_inode_list_lock);
 		}
-		inodes_stat.nr_unused++;
+		atomic_inc(&inodes_stat.nr_unused);
 		if (sb->s_flags & MS_ACTIVE) {
 			spin_unlock(&inode->i_lock);
 			spin_unlock(&sb_inode_list_lock);
@@ -1362,10 +1361,10 @@
 		spin_lock(&inode->i_lock);
 		WARN_ON(inode->i_state & I_NEW);
 		inode->i_state &= ~I_WILL_FREE;
-		inodes_stat.nr_unused--;
 		spin_lock(&inode_hash_lock);
 		hlist_del_init(&inode->i_hash);
 		spin_unlock(&inode_hash_lock);
+		atomic_dec(&inodes_stat.nr_unused);
 	}
 	spin_lock(&wb_inode_list_lock);
 	list_del_init(&inode->i_list);
@@ -1374,7 +1373,6 @@
 	spin_unlock(&sb_inode_list_lock);
 	WARN_ON(inode->i_state & I_NEW);
 	inode->i_state |= I_FREEING;
-	inodes_stat.nr_inodes--;
 	spin_unlock(&inode->i_lock);
 	spin_unlock(&inode_lock);
 	evict(inode);
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h	2010-10-19 14:37:09.000000000 +1100
+++ linux-2.6/include/linux/fs.h	2010-10-19 14:37:31.000000000 +1100
@@ -40,12 +40,17 @@
 };
 
 struct inodes_stat_t {
-	int nr_inodes;
-	int nr_unused;
+	/*
+	 * Using atomics here is a hack which should just happen to
+	 * work on all architectures today. Not a big deal though,
+	 * because it goes away and gets fixed properly later in the
+	 * inode scaling series.
+	 */
+	atomic_t nr_inodes;
+	atomic_t nr_unused;
 	int dummy[5];		/* padding for sysctl ABI compatibility */
 };
 
-
 #define NR_FILE  8192	/* this can well be larger on a larger system */
 
 #define MAY_EXEC 1



^ permalink raw reply	[flat|nested] 70+ messages in thread

* [patch 11/35] fs: icache lock inode state
  2010-10-19  3:42 [patch 00/35] my inode scaling series for review npiggin
                   ` (9 preceding siblings ...)
  2010-10-19  3:42 ` [patch 10/35] fs: icache atomic inodes_stat npiggin
@ 2010-10-19  3:42 ` npiggin
  2010-10-19  3:42 ` [patch 12/35] fs: inode atomic last_ino, iunique lock npiggin
                   ` (25 subsequent siblings)
  36 siblings, 0 replies; 70+ messages in thread
From: npiggin @ 2010-10-19  3:42 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel

[-- Attachment #1: fs-inode_lock-scale-6b.patch --]
[-- Type: text/plain, Size: 4168 bytes --]

Protect the remaining unprotected members (i_hash, i_sb_list, etc.)
with i_lock.
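
After this, everything hanging off the inode is reachable only with
i_lock held, and insertion takes the locks strictly in the documented
order. A sketch of that order (pthread spinlocks, illustrative names):

#include <pthread.h>

typedef struct inode { pthread_spinlock_t i_lock; } inode_t;

static pthread_spinlock_t sb_inode_list_lock;
static pthread_spinlock_t inode_hash_lock;

/* order: sb_inode_list_lock -> i_lock -> inode_hash_lock */
static void add_to_lists_sketch(inode_t *inode)
{
	pthread_spin_lock(&sb_inode_list_lock);
	pthread_spin_lock(&inode->i_lock);
	/* link i_sb_list here */
	pthread_spin_lock(&inode_hash_lock);
	/* link i_hash here */
	pthread_spin_unlock(&inode_hash_lock);
	pthread_spin_unlock(&inode->i_lock);
	pthread_spin_unlock(&sb_inode_list_lock);
}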

Signed-off-by: Nick Piggin <npiggin@kernel.dk>

---
 fs/inode.c |   31 ++++++++++++++++++++++++++++---
 1 file changed, 28 insertions(+), 3 deletions(-)

Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c	2010-10-19 14:18:59.000000000 +1100
+++ linux-2.6/fs/inode.c	2010-10-19 14:19:33.000000000 +1100
@@ -34,7 +34,11 @@
  * wb_inode_list_lock protects:
  *   inode_in_use, inode_unused, b_io, b_more_io, b_dirty, i_list
  * inode->i_lock protects:
- *   i_state, i_count
+ *   i_state
+ *   i_count
+ *   i_hash
+ *   i_list
+ *   i_sb_list
  *
  * Ordering:
  * inode_lock
@@ -369,12 +373,14 @@
 		evict(inode);
 
 		spin_lock(&inode_lock);
+		spin_lock(&sb_inode_list_lock);
+		spin_lock(&inode->i_lock);
 		spin_lock(&inode_hash_lock);
 		hlist_del_init(&inode->i_hash);
 		spin_unlock(&inode_hash_lock);
-		spin_lock(&sb_inode_list_lock);
 		list_del_init(&inode->i_sb_list);
 		spin_unlock(&sb_inode_list_lock);
+		spin_unlock(&inode->i_lock);
 		spin_unlock(&inode_lock);
 
 		wake_up_inode(inode);
@@ -680,7 +686,6 @@
 __inode_add_to_lists(struct super_block *sb, struct hlist_head *head,
 			struct inode *inode)
 {
-	spin_lock(&sb_inode_list_lock);
 	list_add(&inode->i_sb_list, &sb->s_inodes);
 	spin_unlock(&sb_inode_list_lock);
 	spin_lock(&wb_inode_list_lock);
@@ -710,7 +715,10 @@
 	struct hlist_head *head = inode_hashtable + hash(sb, inode->i_ino);
 
 	spin_lock(&inode_lock);
+	spin_lock(&sb_inode_list_lock);
+	spin_lock(&inode->i_lock);
 	__inode_add_to_lists(sb, head, inode);
+	spin_unlock(&inode->i_lock);
 	spin_unlock(&inode_lock);
 }
 EXPORT_SYMBOL_GPL(inode_add_to_lists);
@@ -742,9 +750,12 @@
 	inode = alloc_inode(sb);
 	if (inode) {
 		spin_lock(&inode_lock);
+		spin_lock(&sb_inode_list_lock);
+		spin_lock(&inode->i_lock);
 		inode->i_ino = ++last_ino;
 		inode->i_state = 0;
 		__inode_add_to_lists(sb, NULL, inode);
+		spin_unlock(&inode->i_lock);
 		spin_unlock(&inode_lock);
 	}
 	return inode;
@@ -808,11 +819,14 @@
 		/* We released the lock, so.. */
 		old = find_inode(sb, head, test, data);
 		if (!old) {
+			spin_lock(&sb_inode_list_lock);
+			spin_lock(&inode->i_lock);
 			if (set(inode, data))
 				goto set_failed;
 
 			inode->i_state = I_NEW;
 			__inode_add_to_lists(sb, head, inode);
+			spin_unlock(&inode->i_lock);
 			spin_unlock(&inode_lock);
 
 			/* Return the locked inode with I_NEW set, the
@@ -837,6 +851,7 @@
 
 set_failed:
 	spin_unlock(&inode->i_lock);
+	spin_unlock(&sb_inode_list_lock);
 	spin_unlock(&inode_lock);
 	destroy_inode(inode);
 	return NULL;
@@ -859,9 +874,12 @@
 		/* We released the lock, so.. */
 		old = find_inode_fast(sb, head, ino);
 		if (!old) {
+			spin_lock(&sb_inode_list_lock);
+			spin_lock(&inode->i_lock);
 			inode->i_ino = ino;
 			inode->i_state = I_NEW;
 			__inode_add_to_lists(sb, head, inode);
+			spin_unlock(&inode->i_lock);
 			spin_unlock(&inode_lock);
 
 			/* Return the locked inode with I_NEW set, the
@@ -1275,10 +1293,13 @@
 void __insert_inode_hash(struct inode *inode, unsigned long hashval)
 {
 	struct hlist_head *head = inode_hashtable + hash(inode->i_sb, hashval);
+
 	spin_lock(&inode_lock);
+	spin_lock(&inode->i_lock);
 	spin_lock(&inode_hash_lock);
 	hlist_add_head(&inode->i_hash, head);
 	spin_unlock(&inode_hash_lock);
+	spin_unlock(&inode->i_lock);
 	spin_unlock(&inode_lock);
 }
 EXPORT_SYMBOL(__insert_inode_hash);
@@ -1292,9 +1313,11 @@
 void remove_inode_hash(struct inode *inode)
 {
 	spin_lock(&inode_lock);
+	spin_lock(&inode->i_lock);
 	spin_lock(&inode_hash_lock);
 	hlist_del_init(&inode->i_hash);
 	spin_unlock(&inode_hash_lock);
+	spin_unlock(&inode->i_lock);
 	spin_unlock(&inode_lock);
 }
 EXPORT_SYMBOL(remove_inode_hash);
@@ -1377,9 +1400,11 @@
 	spin_unlock(&inode_lock);
 	evict(inode);
 	spin_lock(&inode_lock);
+	spin_lock(&inode->i_lock);
 	spin_lock(&inode_hash_lock);
 	hlist_del_init(&inode->i_hash);
 	spin_unlock(&inode_hash_lock);
+	spin_unlock(&inode->i_lock);
 	spin_unlock(&inode_lock);
 	wake_up_inode(inode);
 	BUG_ON(inode->i_state != (I_FREEING | I_CLEAR));



^ permalink raw reply	[flat|nested] 70+ messages in thread

* [patch 12/35] fs: inode atomic last_ino, iunique lock
  2010-10-19  3:42 [patch 00/35] my inode scaling series for review npiggin
                   ` (10 preceding siblings ...)
  2010-10-19  3:42 ` [patch 11/35] fs: icache lock inode state npiggin
@ 2010-10-19  3:42 ` npiggin
  2010-10-19  3:42 ` [patch 13/35] fs: icache remove inode_lock npiggin
                   ` (24 subsequent siblings)
  36 siblings, 0 replies; 70+ messages in thread
From: npiggin @ 2010-10-19  3:42 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel

[-- Attachment #1: fs-inode_lock-scale-6c.patch --]
[-- Type: text/plain, Size: 2507 bytes --]

Make last_ino atomic in preparation for removing inode_lock.
Add a new lock for the iunique counter, also in preparation for
removing inode_lock.
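
For last_ino the change reduces ino allocation to a single atomic
increment. A C11 sketch of the equivalent (next_ino is an illustrative
name):

#include <stdatomic.h>

static atomic_uint last_ino;

/* models the patch's (unsigned int)atomic_inc_return(&last_ino) */
static unsigned int next_ino(void)
{
	return atomic_fetch_add(&last_ino, 1) + 1;
}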

Signed-off-by: Nick Piggin <npiggin@kernel.dk>

---
 fs/inode.c |   37 +++++++++++++++++++++++++++++--------
 1 file changed, 29 insertions(+), 8 deletions(-)

Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c	2010-10-19 14:18:59.000000000 +1100
+++ linux-2.6/fs/inode.c	2010-10-19 14:19:32.000000000 +1100
@@ -742,7 +742,7 @@
 	 * error if st_ino won't fit in target struct field. Use 32bit counter
 	 * here to attempt to avoid that.
 	 */
-	static unsigned int last_ino;
+	static atomic_t last_ino = ATOMIC_INIT(0);
 	struct inode *inode;
 
 	spin_lock_prefetch(&inode_lock);
@@ -752,7 +752,7 @@
 		spin_lock(&inode_lock);
 		spin_lock(&sb_inode_list_lock);
 		spin_lock(&inode->i_lock);
-		inode->i_ino = ++last_ino;
+		inode->i_ino = (unsigned int)atomic_inc_return(&last_ino);
 		inode->i_state = 0;
 		__inode_add_to_lists(sb, NULL, inode);
 		spin_unlock(&inode->i_lock);
@@ -903,6 +903,29 @@
 	return inode;
 }
 
+/* Is the ino for this sb hashed right now? */
+static int is_ino_hashed(struct super_block *sb, unsigned long ino)
+{
+	struct hlist_node *node;
+	struct inode *inode = NULL;
+	struct hlist_head *head = inode_hashtable + hash(sb, ino);
+
+	spin_lock(&inode_hash_lock);
+	hlist_for_each_entry(inode, node, head, i_hash) {
+		if (inode->i_ino == ino && inode->i_sb == sb) {
+			spin_unlock(&inode_hash_lock);
+			return 0;
+		}
+		/*
+		 * Don't bother checking for I_FREEING etc., because
+		 * we don't want iunique to wait on freeing inodes. Just
+		 * skip it and get the next one.
+		 */
+	}
+	spin_unlock(&inode_hash_lock);
+	return 1;
+}
+
 /**
  *	iunique - get a unique inode number
  *	@sb: superblock
@@ -924,20 +947,18 @@
 	 * error if st_ino won't fit in target struct field. Use 32bit counter
 	 * here to attempt to avoid that.
 	 */
+	static DEFINE_SPINLOCK(unique_lock);
 	static unsigned int counter;
-	struct inode *inode;
-	struct hlist_head *head;
 	ino_t res;
 
 	spin_lock(&inode_lock);
+	spin_lock(&unique_lock);
 	do {
 		if (counter <= max_reserved)
 			counter = max_reserved + 1;
 		res = counter++;
-		head = inode_hashtable + hash(sb, res);
-		inode = find_inode_fast(sb, head, res);
-		spin_unlock(&inode->i_lock);
-	} while (inode != NULL);
+	} while (!is_ino_hashed(sb, res));
+	spin_unlock(&unique_lock);
 	spin_unlock(&inode_lock);
 
 	return res;



^ permalink raw reply	[flat|nested] 70+ messages in thread

* [patch 13/35] fs: icache remove inode_lock
  2010-10-19  3:42 [patch 00/35] my inode scaling series for review npiggin
                   ` (11 preceding siblings ...)
  2010-10-19  3:42 ` [patch 12/35] fs: inode atomic last_ino, iunique lock npiggin
@ 2010-10-19  3:42 ` npiggin
  2010-10-19  3:42 ` [patch 14/35] fs: icache factor hash lock into functions npiggin
                   ` (23 subsequent siblings)
  36 siblings, 0 replies; 70+ messages in thread
From: npiggin @ 2010-10-19  3:42 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel

[-- Attachment #1: fs-inode_lock-scale-7.patch --]
[-- Type: text/plain, Size: 33088 bytes --]

Remove the global inode_lock; it has been made redundant by the
preceding lock breakup.
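
With the global lock gone, the fast paths run under i_lock alone, and
only the final reference drop has to trylock outward to respect the
sb_inode_list_lock -> i_lock ordering. A userspace sketch modeled on
this patch's iput() hunk (pthread spinlocks, illustrative names):

#include <pthread.h>

typedef struct inode {
	pthread_spinlock_t i_lock;
	int i_count;
} inode_t;

static pthread_spinlock_t sb_inode_list_lock;

static void iput_sketch(inode_t *inode)
{
retry:
	pthread_spin_lock(&inode->i_lock);
	if (inode->i_count == 1) {
		/*
		 * sb_inode_list_lock nests outside i_lock, so it can
		 * only be trylocked here; back off fully on failure.
		 */
		if (pthread_spin_trylock(&sb_inode_list_lock) != 0) {
			pthread_spin_unlock(&inode->i_lock);
			goto retry;
		}
		inode->i_count--;
		/* iput_final() would run here with both locks held */
		pthread_spin_unlock(&sb_inode_list_lock);
		pthread_spin_unlock(&inode->i_lock);
		return;
	}
	inode->i_count--;
	pthread_spin_unlock(&inode->i_lock);
}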

Signed-off-by: Nick Piggin <npiggin@kernel.dk>

---
 Documentation/filesystems/Locking |    2 
 Documentation/filesystems/porting |   10 +++-
 Documentation/filesystems/vfs.txt |    2 
 fs/buffer.c                       |    2 
 fs/drop_caches.c                  |    4 -
 fs/fs-writeback.c                 |   47 ++++--------------
 fs/inode.c                        |   95 +++++++-------------------------------
 fs/notify/inode_mark.c            |   11 +---
 fs/ntfs/inode.c                   |    4 -
 fs/ocfs2/inode.c                  |    2 
 fs/quota/dquot.c                  |   16 ++----
 include/linux/fs.h                |    2 
 include/linux/writeback.h         |    1 
 mm/backing-dev.c                  |    4 -
 mm/filemap.c                      |    6 +-
 mm/rmap.c                         |    6 +-
 16 files changed, 60 insertions(+), 154 deletions(-)

Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c	2010-10-19 14:17:23.000000000 +1100
+++ linux-2.6/fs/buffer.c	2010-10-19 14:18:59.000000000 +1100
@@ -1145,7 +1145,7 @@
  * inode list.
  *
  * mark_buffer_dirty() is atomic.  It takes bh->b_page->mapping->private_lock,
- * mapping->tree_lock and the global inode_lock.
+ * and mapping->tree_lock.
  */
 void mark_buffer_dirty(struct buffer_head *bh)
 {
Index: linux-2.6/fs/drop_caches.c
===================================================================
--- linux-2.6.orig/fs/drop_caches.c	2010-10-19 14:18:58.000000000 +1100
+++ linux-2.6/fs/drop_caches.c	2010-10-19 14:19:25.000000000 +1100
@@ -16,7 +16,6 @@
 {
 	struct inode *inode, *toput_inode = NULL;
 
-	spin_lock(&inode_lock);
 	spin_lock(&sb_inode_list_lock);
 	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 		spin_lock(&inode->i_lock);
@@ -28,15 +27,12 @@
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
 		spin_unlock(&sb_inode_list_lock);
-		spin_unlock(&inode_lock);
 		invalidate_mapping_pages(inode->i_mapping, 0, -1);
 		iput(toput_inode);
 		toput_inode = inode;
-		spin_lock(&inode_lock);
 		spin_lock(&sb_inode_list_lock);
 	}
 	spin_unlock(&sb_inode_list_lock);
-	spin_unlock(&inode_lock);
 	iput(toput_inode);
 }
 
Index: linux-2.6/fs/fs-writeback.c
===================================================================
--- linux-2.6.orig/fs/fs-writeback.c	2010-10-19 14:18:59.000000000 +1100
+++ linux-2.6/fs/fs-writeback.c	2010-10-19 14:19:31.000000000 +1100
@@ -194,7 +194,7 @@
 static void inode_sync_complete(struct inode *inode)
 {
 	/*
-	 * Prevent speculative execution through spin_unlock(&inode_lock);
+	 * Prevent speculative execution through spin_unlock(&inode->i_lock);
 	 */
 	smp_mb();
 	wake_up_bit(&inode->i_state, __I_SYNC);
@@ -294,18 +294,16 @@
 	while (inode->i_state & I_SYNC) {
 		spin_unlock(&wb_inode_list_lock);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&inode_lock);
 		__wait_on_bit(wqh, &wq, inode_wait, TASK_UNINTERRUPTIBLE);
-		spin_lock(&inode_lock);
 		spin_lock(&inode->i_lock);
 		spin_lock(&wb_inode_list_lock);
 	}
 }
 
 /*
- * Write out an inode's dirty pages.  Called under inode_lock.  Either the
- * caller has ref on the inode (either via __iget or via syscall against an fd)
- * or the inode has I_WILL_FREE set (via generic_forget_inode)
+ * Write out an inode's dirty pages. Either the caller has ref on the inode
+ * (either via __iget or via syscall against an fd) or the inode has
+ * I_WILL_FREE set (via generic_forget_inode)
  *
  * If `wait' is set, wait on the writeout.
  *
@@ -313,7 +311,8 @@
  * starvation of particular inodes when others are being redirtied, prevent
  * livelocks, etc.
  *
- * Called under inode_lock.
+ * Called under wb_inode_list_lock and i_lock. May drop the locks but returns
+ * with them locked.
  */
 static int
 writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
@@ -354,7 +353,6 @@
 	inode->i_state &= ~I_DIRTY_PAGES;
 	spin_unlock(&wb_inode_list_lock);
 	spin_unlock(&inode->i_lock);
-	spin_unlock(&inode_lock);
 
 	ret = do_writepages(mapping, wbc);
 
@@ -374,12 +372,10 @@
 	 * due to delalloc, clear dirty metadata flags right before
 	 * write_inode()
 	 */
-	spin_lock(&inode_lock);
 	spin_lock(&inode->i_lock);
 	dirty = inode->i_state & I_DIRTY;
 	inode->i_state &= ~(I_DIRTY_SYNC | I_DIRTY_DATASYNC);
 	spin_unlock(&inode->i_lock);
-	spin_unlock(&inode_lock);
 	/* Don't write the inode if only I_DIRTY_PAGES was set */
 	if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
 		int err = write_inode(inode, wbc);
@@ -387,7 +383,6 @@
 			ret = err;
 	}
 
-	spin_lock(&inode_lock);
 	spin_lock(&inode->i_lock);
 	spin_lock(&wb_inode_list_lock);
 	inode->i_state &= ~I_SYNC;
@@ -537,10 +532,8 @@
 		}
 		spin_unlock(&wb_inode_list_lock);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&inode_lock);
 		iput(inode);
 		cond_resched();
-		spin_lock(&inode_lock);
 		spin_lock(&wb_inode_list_lock);
 		if (wbc->nr_to_write <= 0) {
 			wbc->more_io = 1;
@@ -560,7 +553,6 @@
 
 	if (!wbc->wb_start)
 		wbc->wb_start = jiffies; /* livelock avoidance */
-	spin_lock(&inode_lock);
 again:
 	spin_lock(&wb_inode_list_lock);
 
@@ -588,7 +580,6 @@
 			break;
 	}
 	spin_unlock(&wb_inode_list_lock);
-	spin_unlock(&inode_lock);
 	/* Leave any unwritten inodes on b_io */
 }
 
@@ -597,13 +588,11 @@
 {
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
-	spin_lock(&inode_lock);
 	spin_lock(&wb_inode_list_lock);
 	if (!wbc->for_kupdate || list_empty(&wb->b_io))
 		queue_io(wb, wbc->older_than_this);
 	writeback_sb_inodes(sb, wb, wbc, true);
 	spin_unlock(&wb_inode_list_lock);
-	spin_unlock(&inode_lock);
 }
 
 /*
@@ -714,14 +703,12 @@
 		 * we'll just busyloop.
 		 */
 retry:
-		spin_lock(&inode_lock);
 		spin_lock(&wb_inode_list_lock);
 		if (!list_empty(&wb->b_more_io))  {
 			inode = list_entry(wb->b_more_io.prev,
 						struct inode, i_list);
 			if (!spin_trylock(&inode->i_lock)) {
 				spin_unlock(&wb_inode_list_lock);
-				spin_unlock(&inode_lock);
 				goto retry;
 			}
 			trace_wbc_writeback_wait(&wbc, wb->bdi);
@@ -729,7 +716,6 @@
 			spin_unlock(&inode->i_lock);
 		}
 		spin_unlock(&wb_inode_list_lock);
-		spin_unlock(&inode_lock);
 	}
 
 	return wrote;
@@ -993,7 +979,6 @@
 	if (unlikely(block_dump))
 		block_dump___mark_inode_dirty(inode);
 
-	spin_lock(&inode_lock);
 	spin_lock(&inode->i_lock);
 	if ((inode->i_state & flags) != flags) {
 		const int was_dirty = inode->i_state & I_DIRTY;
@@ -1048,7 +1033,6 @@
 	}
 out:
 	spin_unlock(&inode->i_lock);
-	spin_unlock(&inode_lock);
 
 	if (wakeup_bdi)
 		bdi_wakeup_thread_delayed(bdi);
@@ -1082,7 +1066,6 @@
 	 */
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
-	spin_lock(&inode_lock);
 	spin_lock(&sb_inode_list_lock);
 
 	/*
@@ -1110,14 +1093,12 @@
  		__iget(inode);
 		spin_unlock(&inode->i_lock);
 		spin_unlock(&sb_inode_list_lock);
-		spin_unlock(&inode_lock);
 		/*
-		 * We hold a reference to 'inode' so it couldn't have
-		 * been removed from s_inodes list while we dropped the
-		 * inode_lock.  We cannot iput the inode now as we can
-		 * be holding the last reference and we cannot iput it
-		 * under inode_lock. So we keep the reference and iput
-		 * it later.
+		 * We hold a reference to 'inode' so it couldn't have been
+		 * removed from s_inodes list while we dropped the
+		 * sb_inode_list_lock.  We cannot iput the inode now as we can
+		 * be holding the last reference and we cannot iput it under
+		 * spinlock. So we keep the reference and iput it later.
 		 */
 		iput(old_inode);
 		old_inode = inode;
@@ -1126,11 +1107,9 @@
 
 		cond_resched();
 
-		spin_lock(&inode_lock);
 		spin_lock(&sb_inode_list_lock);
 	}
 	spin_unlock(&sb_inode_list_lock);
-	spin_unlock(&inode_lock);
 	iput(old_inode);
 }
 
@@ -1235,13 +1214,11 @@
 		wbc.nr_to_write = 0;
 
 	might_sleep();
-	spin_lock(&inode_lock);
 	spin_lock(&inode->i_lock);
 	spin_lock(&wb_inode_list_lock);
 	ret = writeback_single_inode(inode, &wbc);
 	spin_unlock(&wb_inode_list_lock);
 	spin_unlock(&inode->i_lock);
-	spin_unlock(&inode_lock);
 	if (sync)
 		inode_sync_wait(inode);
 	return ret;
@@ -1263,13 +1240,11 @@
 {
 	int ret;
 
-	spin_lock(&inode_lock);
 	spin_lock(&inode->i_lock);
 	spin_lock(&wb_inode_list_lock);
 	ret = writeback_single_inode(inode, wbc);
 	spin_unlock(&wb_inode_list_lock);
 	spin_unlock(&inode->i_lock);
-	spin_unlock(&inode_lock);
 	return ret;
 }
 EXPORT_SYMBOL(sync_inode);
Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c	2010-10-19 14:18:59.000000000 +1100
+++ linux-2.6/fs/inode.c	2010-10-19 14:19:31.000000000 +1100
@@ -41,11 +41,10 @@
  *   i_sb_list
  *
  * Ordering:
- * inode_lock
- *   sb_inode_list_lock
- *     inode->i_lock
- *       wb_inode_list_lock
- *       inode_hash_lock
+ * sb_inode_list_lock
+ *   inode->i_lock
+ *     wb_inode_list_lock
+ *     inode_hash_lock
  */
 /*
  * This is needed for the following functions:
@@ -104,7 +103,6 @@
  * NOTE! You also have to own the lock if you change
  * the i_state of an inode while it is in use..
  */
-DEFINE_SPINLOCK(inode_lock);
 DEFINE_SPINLOCK(sb_inode_list_lock);
 DEFINE_SPINLOCK(wb_inode_list_lock);
 static DEFINE_SPINLOCK(inode_hash_lock);
@@ -136,7 +134,7 @@
 static void wake_up_inode(struct inode *inode)
 {
 	/*
-	 * Prevent speculative execution through spin_unlock(&inode_lock);
+	 * Prevent speculative execution through spin_unlock(&inode->i_lock);
 	 */
 	smp_mb();
 	wake_up_bit(&inode->i_state, __I_NEW);
@@ -308,7 +306,7 @@
 }
 
 /*
- * inode_lock must be held
+ * i_lock must be held
  */
 void __iget(struct inode *inode)
 {
@@ -372,16 +370,14 @@
 
 		evict(inode);
 
-		spin_lock(&inode_lock);
 		spin_lock(&sb_inode_list_lock);
 		spin_lock(&inode->i_lock);
 		spin_lock(&inode_hash_lock);
 		hlist_del_init(&inode->i_hash);
 		spin_unlock(&inode_hash_lock);
 		list_del_init(&inode->i_sb_list);
-		spin_unlock(&sb_inode_list_lock);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&inode_lock);
+		spin_unlock(&sb_inode_list_lock);
 
 		wake_up_inode(inode);
 		destroy_inode(inode);
@@ -407,7 +403,6 @@
 		 * change during umount anymore, and because iprune_sem keeps
 		 * shrink_icache_memory() away.
 		 */
-		cond_resched_lock(&inode_lock);
 		cond_resched_lock(&sb_inode_list_lock);
 
 		next = next->next;
@@ -452,12 +447,10 @@
 	LIST_HEAD(throw_away);
 
 	down_write(&iprune_sem);
-	spin_lock(&inode_lock);
 	spin_lock(&sb_inode_list_lock);
 	fsnotify_unmount_inodes(&sb->s_inodes);
 	busy = invalidate_list(&sb->s_inodes, &throw_away);
 	spin_unlock(&sb_inode_list_lock);
-	spin_unlock(&inode_lock);
 
 	dispose_list(&throw_away);
 	up_write(&iprune_sem);
@@ -481,7 +474,7 @@
 
 /*
  * Scan `goal' inodes on the unused list for freeable ones. They are moved to
- * a temporary list and then are freed outside inode_lock by dispose_list().
+ * a temporary list and then are freed outside LRU lock by dispose_list().
  *
  * Any inodes which are pinned purely because of attached pagecache have their
  * pagecache removed.  We expect the final iput() on that inode to add it to
@@ -500,7 +493,6 @@
 	unsigned long reap = 0;
 
 	down_read(&iprune_sem);
-	spin_lock(&inode_lock);
 again:
 	spin_lock(&wb_inode_list_lock);
 	for (nr_scanned = 0; nr_scanned < nr_to_scan; nr_scanned++) {
@@ -524,12 +516,10 @@
 			spin_unlock(&wb_inode_list_lock);
 			__iget(inode);
 			spin_unlock(&inode->i_lock);
-			spin_unlock(&inode_lock);
 			if (remove_inode_buffers(inode))
 				reap += invalidate_mapping_pages(&inode->i_data,
 								0, -1);
 			iput(inode);
-			spin_lock(&inode_lock);
 again2:
 			spin_lock(&wb_inode_list_lock);
 
@@ -556,7 +546,6 @@
 		__count_vm_events(KSWAPD_INODESTEAL, reap);
 	else
 		__count_vm_events(PGINODESTEAL, reap);
-	spin_unlock(&inode_lock);
 	spin_unlock(&wb_inode_list_lock);
 
 	dispose_list(&freeable);
@@ -704,9 +693,9 @@
  * @inode: inode to mark in use
  *
  * When an inode is allocated it needs to be accounted for, added to the in use
- * list, the owning superblock and the inode hash. This needs to be done under
- * the inode_lock, so export a function to do this rather than the inode lock
- * itself. We calculate the hash list to add to here so it is all internal
+ * list, the owning superblock and the inode hash.
+ *
+ * We calculate the hash list to add to here so it is all internal
  * which requires the caller to have already set up the inode number in the
  * inode to add.
  */
@@ -714,12 +703,10 @@
 {
 	struct hlist_head *head = inode_hashtable + hash(sb, inode->i_ino);
 
-	spin_lock(&inode_lock);
 	spin_lock(&sb_inode_list_lock);
 	spin_lock(&inode->i_lock);
 	__inode_add_to_lists(sb, head, inode);
 	spin_unlock(&inode->i_lock);
-	spin_unlock(&inode_lock);
 }
 EXPORT_SYMBOL_GPL(inode_add_to_lists);
 
@@ -745,18 +732,14 @@
 	static atomic_t last_ino = ATOMIC_INIT(0);
 	struct inode *inode;
 
-	spin_lock_prefetch(&inode_lock);
-
 	inode = alloc_inode(sb);
 	if (inode) {
-		spin_lock(&inode_lock);
 		spin_lock(&sb_inode_list_lock);
 		spin_lock(&inode->i_lock);
 		inode->i_ino = (unsigned int)atomic_inc_return(&last_ino);
 		inode->i_state = 0;
 		__inode_add_to_lists(sb, NULL, inode);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&inode_lock);
 	}
 	return inode;
 }
@@ -815,7 +798,6 @@
 	if (inode) {
 		struct inode *old;
 
-		spin_lock(&inode_lock);
 		/* We released the lock, so.. */
 		old = find_inode(sb, head, test, data);
 		if (!old) {
@@ -827,7 +809,6 @@
 			inode->i_state = I_NEW;
 			__inode_add_to_lists(sb, head, inode);
 			spin_unlock(&inode->i_lock);
-			spin_unlock(&inode_lock);
 
 			/* Return the locked inode with I_NEW set, the
 			 * caller is responsible for filling in the contents
@@ -842,7 +823,6 @@
 		 */
 		__iget(old);
 		spin_unlock(&old->i_lock);
-		spin_unlock(&inode_lock);
 		destroy_inode(inode);
 		inode = old;
 		wait_on_inode(inode);
@@ -852,7 +832,6 @@
 set_failed:
 	spin_unlock(&inode->i_lock);
 	spin_unlock(&sb_inode_list_lock);
-	spin_unlock(&inode_lock);
 	destroy_inode(inode);
 	return NULL;
 }
@@ -870,7 +849,6 @@
 	if (inode) {
 		struct inode *old;
 
-		spin_lock(&inode_lock);
 		/* We released the lock, so.. */
 		old = find_inode_fast(sb, head, ino);
 		if (!old) {
@@ -880,7 +858,6 @@
 			inode->i_state = I_NEW;
 			__inode_add_to_lists(sb, head, inode);
 			spin_unlock(&inode->i_lock);
-			spin_unlock(&inode_lock);
 
 			/* Return the locked inode with I_NEW set, the
 			 * caller is responsible for filling in the contents
@@ -895,7 +872,6 @@
 		 */
 		__iget(old);
 		spin_unlock(&old->i_lock);
-		spin_unlock(&inode_lock);
 		destroy_inode(inode);
 		inode = old;
 		wait_on_inode(inode);
@@ -951,7 +927,6 @@
 	static unsigned int counter;
 	ino_t res;
 
-	spin_lock(&inode_lock);
 	spin_lock(&unique_lock);
 	do {
 		if (counter <= max_reserved)
@@ -959,7 +934,6 @@
 		res = counter++;
 	} while (!is_ino_hashed(sb, res));
 	spin_unlock(&unique_lock);
-	spin_unlock(&inode_lock);
 
 	return res;
 }
@@ -969,7 +943,6 @@
 {
 	struct inode *ret = inode;
 
-	spin_lock(&inode_lock);
 	spin_lock(&inode->i_lock);
 	if (!(inode->i_state & (I_FREEING|I_WILL_FREE)))
 		__iget(inode);
@@ -981,7 +954,6 @@
 		 */
 		ret = NULL;
 	spin_unlock(&inode->i_lock);
-	spin_unlock(&inode_lock);
 
 	return ret;
 }
@@ -1004,7 +976,7 @@
  *
  * Otherwise NULL is returned.
  *
- * Note, @test is called with the inode_lock held, so can't sleep.
+ * Note, @test is called with the i_lock held, so can't sleep.
  */
 static struct inode *ifind(struct super_block *sb,
 		struct hlist_head *head, int (*test)(struct inode *, void *),
@@ -1012,17 +984,14 @@
 {
 	struct inode *inode;
 
-	spin_lock(&inode_lock);
 	inode = find_inode(sb, head, test, data);
 	if (inode) {
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&inode_lock);
 		if (likely(wait))
 			wait_on_inode(inode);
 		return inode;
 	}
-	spin_unlock(&inode_lock);
 	return NULL;
 }
 
@@ -1046,16 +1015,13 @@
 {
 	struct inode *inode;
 
-	spin_lock(&inode_lock);
 	inode = find_inode_fast(sb, head, ino);
 	if (inode) {
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&inode_lock);
 		wait_on_inode(inode);
 		return inode;
 	}
-	spin_unlock(&inode_lock);
 	return NULL;
 }
 
@@ -1078,7 +1044,7 @@
  *
  * Otherwise NULL is returned.
  *
- * Note, @test is called with the inode_lock held, so can't sleep.
+ * Note, @test is called with the i_lock held, so can't sleep.
  */
 struct inode *ilookup5_nowait(struct super_block *sb, unsigned long hashval,
 		int (*test)(struct inode *, void *), void *data)
@@ -1106,7 +1072,7 @@
  *
  * Otherwise NULL is returned.
  *
- * Note, @test is called with the inode_lock held, so can't sleep.
+ * Note, @test is called with the i_lock held, so can't sleep.
  */
 struct inode *ilookup5(struct super_block *sb, unsigned long hashval,
 		int (*test)(struct inode *, void *), void *data)
@@ -1157,7 +1123,7 @@
  * inode and this is returned locked, hashed, and with the I_NEW flag set. The
  * file system gets to fill it in before unlocking it via unlock_new_inode().
  *
- * Note both @test and @set are called with the inode_lock held, so can't sleep.
+ * Note both @test and @set are called with the i_lock held, so can't sleep.
  */
 struct inode *iget5_locked(struct super_block *sb, unsigned long hashval,
 		int (*test)(struct inode *, void *),
@@ -1219,7 +1185,6 @@
 		struct hlist_node *node;
 		struct inode *old = NULL;
 
-		spin_lock(&inode_lock);
 repeat:
 		spin_lock(&inode_hash_lock);
 		hlist_for_each_entry(old, node, head, i_hash) {
@@ -1238,13 +1203,11 @@
 		if (likely(!node)) {
 			hlist_add_head(&inode->i_hash, head);
 			spin_unlock(&inode_hash_lock);
-			spin_unlock(&inode_lock);
 			return 0;
 		}
 		spin_unlock(&inode_hash_lock);
 		__iget(old);
 		spin_unlock(&old->i_lock);
-		spin_unlock(&inode_lock);
 		wait_on_inode(old);
 		if (unlikely(!hlist_unhashed(&old->i_hash))) {
 			iput(old);
@@ -1267,7 +1230,6 @@
 		struct hlist_node *node;
 		struct inode *old = NULL;
 
-		spin_lock(&inode_lock);
 repeat:
 		spin_lock(&inode_hash_lock);
 		hlist_for_each_entry(old, node, head, i_hash) {
@@ -1286,13 +1248,11 @@
 		if (likely(!node)) {
 			hlist_add_head(&inode->i_hash, head);
 			spin_unlock(&inode_hash_lock);
-			spin_unlock(&inode_lock);
 			return 0;
 		}
 		spin_unlock(&inode_hash_lock);
 		__iget(old);
 		spin_unlock(&old->i_lock);
-		spin_unlock(&inode_lock);
 		wait_on_inode(old);
 		if (unlikely(!hlist_unhashed(&old->i_hash))) {
 			iput(old);
@@ -1315,13 +1275,11 @@
 {
 	struct hlist_head *head = inode_hashtable + hash(inode->i_sb, hashval);
 
-	spin_lock(&inode_lock);
 	spin_lock(&inode->i_lock);
 	spin_lock(&inode_hash_lock);
 	hlist_add_head(&inode->i_hash, head);
 	spin_unlock(&inode_hash_lock);
 	spin_unlock(&inode->i_lock);
-	spin_unlock(&inode_lock);
 }
 EXPORT_SYMBOL(__insert_inode_hash);
 
@@ -1333,13 +1291,11 @@
  */
 void remove_inode_hash(struct inode *inode)
 {
-	spin_lock(&inode_lock);
 	spin_lock(&inode->i_lock);
 	spin_lock(&inode_hash_lock);
 	hlist_del_init(&inode->i_hash);
 	spin_unlock(&inode_hash_lock);
 	spin_unlock(&inode->i_lock);
-	spin_unlock(&inode_lock);
 }
 EXPORT_SYMBOL(remove_inode_hash);
 
@@ -1391,16 +1347,13 @@
 		if (sb->s_flags & MS_ACTIVE) {
 			spin_unlock(&inode->i_lock);
 			spin_unlock(&sb_inode_list_lock);
-			spin_unlock(&inode_lock);
 			return;
 		}
 		WARN_ON(inode->i_state & I_NEW);
 		inode->i_state |= I_WILL_FREE;
 		spin_unlock(&inode->i_lock);
 		spin_unlock(&sb_inode_list_lock);
-		spin_unlock(&inode_lock);
 		write_inode_now(inode, 1);
-		spin_lock(&inode_lock);
 		spin_lock(&sb_inode_list_lock);
 		spin_lock(&inode->i_lock);
 		WARN_ON(inode->i_state & I_NEW);
@@ -1418,15 +1371,12 @@
 	WARN_ON(inode->i_state & I_NEW);
 	inode->i_state |= I_FREEING;
 	spin_unlock(&inode->i_lock);
-	spin_unlock(&inode_lock);
 	evict(inode);
-	spin_lock(&inode_lock);
 	spin_lock(&inode->i_lock);
 	spin_lock(&inode_hash_lock);
 	hlist_del_init(&inode->i_hash);
 	spin_unlock(&inode_hash_lock);
 	spin_unlock(&inode->i_lock);
-	spin_unlock(&inode_lock);
 	wake_up_inode(inode);
 	BUG_ON(inode->i_state != (I_FREEING | I_CLEAR));
 	destroy_inode(inode);
@@ -1446,17 +1396,12 @@
 	if (inode) {
 		BUG_ON(inode->i_state & I_CLEAR);
 
-retry1:
+retry:
 		spin_lock(&inode->i_lock);
 		if (inode->i_count == 1) {
-			if (!spin_trylock(&inode_lock)) {
-retry2:
-				spin_unlock(&inode->i_lock);
-				goto retry1;
-			}
 			if (!spin_trylock(&sb_inode_list_lock)) {
-				spin_unlock(&inode_lock);
-				goto retry2;
+				spin_unlock(&inode->i_lock);
+				goto retry;
 			}
 			inode->i_count--;
 			iput_final(inode);
@@ -1643,8 +1588,6 @@
  * It doesn't matter if I_NEW is not set initially, a call to
  * wake_up_inode() after removing from the hash list will DTRT.
  *
- * This is called with inode_lock held.
- *
  * Called with i_lock held and returns with it dropped.
  */
 static void __wait_on_freeing_inode(struct inode *inode)
@@ -1654,10 +1597,8 @@
 	wq = bit_waitqueue(&inode->i_state, __I_NEW);
 	prepare_to_wait(wq, &wait.wait, TASK_UNINTERRUPTIBLE);
 	spin_unlock(&inode->i_lock);
-	spin_unlock(&inode_lock);
 	schedule();
 	finish_wait(wq, &wait.wait);
-	spin_lock(&inode_lock);
 }
 
 static __initdata unsigned long ihash_entries;
Index: linux-2.6/include/linux/writeback.h
===================================================================
--- linux-2.6.orig/include/linux/writeback.h	2010-10-19 14:18:58.000000000 +1100
+++ linux-2.6/include/linux/writeback.h	2010-10-19 14:19:30.000000000 +1100
@@ -9,7 +9,6 @@
 
 struct backing_dev_info;
 
-extern spinlock_t inode_lock;
 extern spinlock_t sb_inode_list_lock;
 extern spinlock_t wb_inode_list_lock;
 extern struct list_head inode_in_use;
Index: linux-2.6/fs/quota/dquot.c
===================================================================
--- linux-2.6.orig/fs/quota/dquot.c	2010-10-19 14:18:58.000000000 +1100
+++ linux-2.6/fs/quota/dquot.c	2010-10-19 14:19:25.000000000 +1100
@@ -76,7 +76,7 @@
 #include <linux/buffer_head.h>
 #include <linux/capability.h>
 #include <linux/quotaops.h>
-#include <linux/writeback.h> /* for inode_lock, oddly enough.. */
+#include <linux/writeback.h>
 
 #include <asm/uaccess.h>
 
@@ -897,7 +897,6 @@
 	int reserved = 0;
 #endif
 
-	spin_lock(&inode_lock);
 	spin_lock(&sb_inode_list_lock);
 	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 		spin_lock(&inode->i_lock);
@@ -921,21 +920,18 @@
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
 		spin_unlock(&sb_inode_list_lock);
-		spin_unlock(&inode_lock);
 
 		iput(old_inode);
 		__dquot_initialize(inode, type);
 		/* We hold a reference to 'inode' so it couldn't have been
-		 * removed from s_inodes list while we dropped the inode_lock.
-		 * We cannot iput the inode now as we can be holding the last
-		 * reference and we cannot iput it under inode_lock. So we
-		 * keep the reference and iput it later. */
+		 * removed from s_inodes list while we dropped the
+		 * sb_inode_list_lock.  We cannot iput the inode now as we can
+		 * be holding the last reference and we cannot iput it under
+		 * lock. So we keep the reference and iput it later. */
 		old_inode = inode;
-		spin_lock(&inode_lock);
 		spin_lock(&sb_inode_list_lock);
 	}
 	spin_unlock(&sb_inode_list_lock);
-	spin_unlock(&inode_lock);
 	iput(old_inode);
 
 #ifdef CONFIG_QUOTA_DEBUG
@@ -1016,7 +1012,6 @@
 	struct inode *inode;
 	int reserved = 0;
 
-	spin_lock(&inode_lock);
 	spin_lock(&sb_inode_list_lock);
 	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 		/*
@@ -1032,7 +1027,6 @@
 		}
 	}
 	spin_unlock(&sb_inode_list_lock);
-	spin_unlock(&inode_lock);
 #ifdef CONFIG_QUOTA_DEBUG
 	if (reserved) {
 		printk(KERN_WARNING "VFS (%s): Writes happened after quota"
Index: linux-2.6/fs/notify/inode_mark.c
===================================================================
--- linux-2.6.orig/fs/notify/inode_mark.c	2010-10-19 14:18:58.000000000 +1100
+++ linux-2.6/fs/notify/inode_mark.c	2010-10-19 14:19:25.000000000 +1100
@@ -22,7 +22,7 @@
 #include <linux/module.h>
 #include <linux/mutex.h>
 #include <linux/spinlock.h>
-#include <linux/writeback.h> /* for inode_lock */
+#include <linux/writeback.h>
 
 #include <asm/atomic.h>
 
@@ -232,9 +232,8 @@
  * fsnotify_unmount_inodes - an sb is unmounting.  handle any watched inodes.
  * @list: list of inodes being unmounted (sb->s_inodes)
  *
- * Called with inode_lock held, protecting the unmounting super block's list
- * of inodes, and with iprune_mutex held, keeping shrink_icache_memory() at bay.
- * We temporarily drop inode_lock, however, and CAN block.
+ * Called with iprune_mutex held, keeping shrink_icache_memory() at bay,
+ * and with sb_inode_list_lock held to protect the super block's list of inodes.
  */
 void fsnotify_unmount_inodes(struct list_head *list)
 {
@@ -287,13 +286,12 @@
 		}
 
 		/*
-		 * We can safely drop inode_lock here because we hold
+		 * We can safely drop sb_inode_list_lock here because we hold
 		 * references on both inode and next_i.  Also no new inodes
 		 * will be added since the umount has begun.  Finally,
 		 * iprune_mutex keeps shrink_icache_memory() away.
 		 */
 		spin_unlock(&sb_inode_list_lock);
-		spin_unlock(&inode_lock);
 
 		if (need_iput_tmp)
 			iput(need_iput_tmp);
@@ -305,7 +303,6 @@
 
 		iput(inode);
 
-		spin_lock(&inode_lock);
 		spin_lock(&sb_inode_list_lock);
 	}
 }
Index: linux-2.6/mm/backing-dev.c
===================================================================
--- linux-2.6.orig/mm/backing-dev.c	2010-10-19 14:18:58.000000000 +1100
+++ linux-2.6/mm/backing-dev.c	2010-10-19 14:19:21.000000000 +1100
@@ -73,7 +73,6 @@
 	struct inode *inode;
 
 	nr_wb = nr_dirty = nr_io = nr_more_io = 0;
-	spin_lock(&inode_lock);
 	spin_lock(&wb_inode_list_lock);
 	list_for_each_entry(inode, &wb->b_dirty, i_list)
 		nr_dirty++;
@@ -82,7 +81,6 @@
 	list_for_each_entry(inode, &wb->b_more_io, i_list)
 		nr_more_io++;
 	spin_unlock(&wb_inode_list_lock);
-	spin_unlock(&inode_lock);
 
 	global_dirty_limits(&background_thresh, &dirty_thresh);
 	bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
@@ -684,13 +682,11 @@
 	if (bdi_has_dirty_io(bdi)) {
 		struct bdi_writeback *dst = &default_backing_dev_info.wb;
 
-		spin_lock(&inode_lock);
 		spin_lock(&wb_inode_list_lock);
 		list_splice(&bdi->wb.b_dirty, &dst->b_dirty);
 		list_splice(&bdi->wb.b_io, &dst->b_io);
 		list_splice(&bdi->wb.b_more_io, &dst->b_more_io);
 		spin_unlock(&wb_inode_list_lock);
-		spin_unlock(&inode_lock);
 	}
 
 	bdi_unregister(bdi);
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c	2010-10-19 14:17:28.000000000 +1100
+++ linux-2.6/mm/filemap.c	2010-10-19 14:18:59.000000000 +1100
@@ -80,7 +80,7 @@
  *  ->i_mutex
  *    ->i_alloc_sem             (various)
  *
- *  ->inode_lock
+ *  ->i_lock
  *    ->sb_lock			(fs/fs-writeback.c)
  *    ->mapping->tree_lock	(__sync_single_inode)
  *
@@ -98,8 +98,8 @@
  *    ->zone.lru_lock		(check_pte_range->isolate_lru_page)
  *    ->private_lock		(page_remove_rmap->set_page_dirty)
  *    ->tree_lock		(page_remove_rmap->set_page_dirty)
- *    ->inode_lock		(page_remove_rmap->set_page_dirty)
- *    ->inode_lock		(zap_pte_range->set_page_dirty)
+ *    ->i_lock			(page_remove_rmap->set_page_dirty)
+ *    ->i_lock			(zap_pte_range->set_page_dirty)
  *    ->private_lock		(zap_pte_range->__set_page_dirty_buffers)
  *
  *  ->task->proc_lock
Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c	2010-10-19 14:17:23.000000000 +1100
+++ linux-2.6/mm/rmap.c	2010-10-19 14:18:59.000000000 +1100
@@ -31,11 +31,11 @@
  *             swap_lock (in swap_duplicate, swap_info_get)
  *               mmlist_lock (in mmput, drain_mmlist and others)
  *               mapping->private_lock (in __set_page_dirty_buffers)
- *               inode_lock (in set_page_dirty's __mark_inode_dirty)
- *                 sb_lock (within inode_lock in fs/fs-writeback.c)
+ *               i_lock (in set_page_dirty's __mark_inode_dirty)
+ *                 sb_lock (within i_lock in fs/fs-writeback.c)
  *                 mapping->tree_lock (widely used, in set_page_dirty,
  *                           in arch-dependent flush_dcache_mmap_lock,
- *                           within inode_lock in __sync_single_inode)
+ *                           within i_lock in __sync_single_inode)
  *
  * (code doesn't rely on that order so it could be switched around)
  * ->tasklist_lock
Index: linux-2.6/Documentation/filesystems/Locking
===================================================================
--- linux-2.6.orig/Documentation/filesystems/Locking	2010-10-19 14:17:22.000000000 +1100
+++ linux-2.6/Documentation/filesystems/Locking	2010-10-19 14:19:25.000000000 +1100
@@ -114,7 +114,7 @@
 destroy_inode:
 dirty_inode:				(must not sleep)
 write_inode:
-drop_inode:				!!!inode_lock!!!
+drop_inode:				!!!i_lock, sb_inode_list_lock!!!
 evict_inode:
 put_super:		write
 write_super:		read
Index: linux-2.6/Documentation/filesystems/vfs.txt
===================================================================
--- linux-2.6.orig/Documentation/filesystems/vfs.txt	2010-10-19 14:17:22.000000000 +1100
+++ linux-2.6/Documentation/filesystems/vfs.txt	2010-10-19 14:19:25.000000000 +1100
@@ -246,7 +246,7 @@
 	should be synchronous or not, not all filesystems check this flag.
 
   drop_inode: called when the last access to the inode is dropped,
-	with the inode_lock spinlock held.
	with the i_lock and sb_inode_list_lock spinlocks held.
 
 	This method should be either NULL (normal UNIX filesystem
 	semantics) or "generic_delete_inode" (for filesystems that do not
Index: linux-2.6/fs/ntfs/inode.c
===================================================================
--- linux-2.6.orig/fs/ntfs/inode.c	2010-10-19 14:17:23.000000000 +1100
+++ linux-2.6/fs/ntfs/inode.c	2010-10-19 14:19:28.000000000 +1100
@@ -54,7 +54,7 @@
  *
  * Return 1 if the attributes match and 0 if not.
  *
- * NOTE: This function runs with the inode_lock spin lock held so it is not
+ * NOTE: This function runs with the i_lock spin lock held so it is not
  * allowed to sleep.
  */
 int ntfs_test_inode(struct inode *vi, ntfs_attr *na)
@@ -98,7 +98,7 @@
  *
  * Return 0 on success and -errno on error.
  *
- * NOTE: This function runs with the inode_lock spin lock held so it is not
+ * NOTE: This function runs with the i_lock spin lock held so it is not
  * allowed to sleep. (Hence the GFP_ATOMIC allocation.)
  */
 static int ntfs_init_locked_inode(struct inode *vi, ntfs_attr *na)
Index: linux-2.6/fs/ocfs2/inode.c
===================================================================
--- linux-2.6.orig/fs/ocfs2/inode.c	2010-10-19 14:17:23.000000000 +1100
+++ linux-2.6/fs/ocfs2/inode.c	2010-10-19 14:18:59.000000000 +1100
@@ -1195,7 +1195,7 @@
 	ocfs2_clear_inode(inode);
 }
 
-/* Called under inode_lock, with no more references on the
+/* Called under i_lock, with no more references on the
  * struct inode, so it's safe here to check the flags field
  * and to manipulate i_nlink without any other locks. */
 int ocfs2_drop_inode(struct inode *inode)
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h	2010-10-19 14:18:59.000000000 +1100
+++ linux-2.6/include/linux/fs.h	2010-10-19 14:19:31.000000000 +1100
@@ -1589,7 +1589,7 @@
 };
 
 /*
- * Inode state bits.  Protected by inode_lock.
+ * Inode state bits.  Protected by i_lock.
  *
  * Three bits determine the dirty state of the inode, I_DIRTY_SYNC,
  * I_DIRTY_DATASYNC and I_DIRTY_PAGES.
Index: linux-2.6/Documentation/filesystems/porting
===================================================================
--- linux-2.6.orig/Documentation/filesystems/porting	2010-10-19 14:17:22.000000000 +1100
+++ linux-2.6/Documentation/filesystems/porting	2010-10-19 14:19:28.000000000 +1100
@@ -299,7 +299,7 @@
 remaining links or not.  Caller does *not* evict the pagecache or inode-associated
 metadata buffers; getting rid of those is responsibility of method, as it had
 been for ->delete_inode().
-	->drop_inode() returns int now; it's called on final iput() with inode_lock
+	->drop_inode() returns int now; it's called on final iput() with i_lock
 held and it returns true if filesystems wants the inode to be dropped.  As before,
 generic_drop_inode() is still the default and it's been updated appropriately.
 generic_delete_inode() is also alive and it consists simply of return 1.  Note that
@@ -318,3 +318,11 @@
 may happen while the inode is in the middle of ->write_inode(); e.g. if you blindly
 free the on-disk inode, you may end up doing that while ->write_inode() is writing
 to it.
+
+--
+[mandatory]
+	inode_lock is gone, replaced by fine grained locks. See fs/inode.c
+for details of what locks to replace inode_lock with in order to protect
+particular things. Most of the time, a filesystem only needs ->i_lock, which
+protects *all* of the inode state and list membership that inode_lock
+previously protected.
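
As a purely illustrative sketch of such a conversion (editorial example,
not taken from any real filesystem), a private i_state update changes from

	spin_lock(&inode_lock);
	inode->i_state |= I_DIRTY_SYNC;
	spin_unlock(&inode_lock);

to

	spin_lock(&inode->i_lock);
	inode->i_state |= I_DIRTY_SYNC;
	spin_unlock(&inode->i_lock);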



^ permalink raw reply	[flat|nested] 70+ messages in thread

* [patch 14/35] fs: icache factor hash lock into functions
  2010-10-19  3:42 [patch 00/35] my inode scaling series for review npiggin
                   ` (12 preceding siblings ...)
  2010-10-19  3:42 ` [patch 13/35] fs: icache remove inode_lock npiggin
@ 2010-10-19  3:42 ` npiggin
  2010-10-19  3:42 ` [patch 15/35] fs: icache per-bucket inode hash locks npiggin
                   ` (22 subsequent siblings)
  36 siblings, 0 replies; 70+ messages in thread
From: npiggin @ 2010-10-19  3:42 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel

[-- Attachment #1: fs-inode_lock-scale-8.patch --]
[-- Type: text/plain, Size: 2596 bytes --]

Add a function __remove_inode_hash that can be called with i_lock held, and
use it to factor out the open-coded hash removals in fs/inode.c.
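
After this patch every unhash site takes the same shape (condensed from the
diff below; the caller holds i_lock, and __remove_inode_hash takes the hash
lock internally):

	spin_lock(&inode->i_lock);
	__remove_inode_hash(inode);
	spin_unlock(&inode->i_lock);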

Signed-off-by: Nick Piggin <npiggin@kernel.dk>

---
 fs/inode.c |   38 ++++++++++++++++++++++++--------------
 1 file changed, 24 insertions(+), 14 deletions(-)

Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c	2010-10-19 14:18:59.000000000 +1100
+++ linux-2.6/fs/inode.c	2010-10-19 14:19:31.000000000 +1100
@@ -353,6 +353,8 @@
 		cd_forget(inode);
 }
 
+static void __remove_inode_hash(struct inode *inode);
+
 /*
  * dispose_list - dispose of the contents of a local list
  * @head: the head of the list to free
@@ -372,9 +374,7 @@
 
 		spin_lock(&sb_inode_list_lock);
 		spin_lock(&inode->i_lock);
-		spin_lock(&inode_hash_lock);
-		hlist_del_init(&inode->i_hash);
-		spin_unlock(&inode_hash_lock);
+		__remove_inode_hash(inode);
 		list_del_init(&inode->i_sb_list);
 		spin_unlock(&inode->i_lock);
 		spin_unlock(&sb_inode_list_lock);
@@ -1284,6 +1284,20 @@
 EXPORT_SYMBOL(__insert_inode_hash);
 
 /**
+ *	__remove_inode_hash - remove an inode from the hash
+ *	@inode: inode to unhash
+ *
+ *	Remove an inode from the inode hash table. inode->i_lock must be
+ *	held.
+ */
+static void __remove_inode_hash(struct inode *inode)
+{
+	spin_lock(&inode_hash_lock);
+	hlist_del_init(&inode->i_hash);
+	spin_unlock(&inode_hash_lock);
+}
+
+/**
  *	remove_inode_hash - remove an inode from the hash
  *	@inode: inode to unhash
  *
@@ -1292,9 +1306,7 @@
 void remove_inode_hash(struct inode *inode)
 {
 	spin_lock(&inode->i_lock);
-	spin_lock(&inode_hash_lock);
-	hlist_del_init(&inode->i_hash);
-	spin_unlock(&inode_hash_lock);
+	__remove_inode_hash(inode);
 	spin_unlock(&inode->i_lock);
 }
 EXPORT_SYMBOL(remove_inode_hash);
@@ -1358,9 +1370,7 @@
 		spin_lock(&inode->i_lock);
 		WARN_ON(inode->i_state & I_NEW);
 		inode->i_state &= ~I_WILL_FREE;
-		spin_lock(&inode_hash_lock);
-		hlist_del_init(&inode->i_hash);
-		spin_unlock(&inode_hash_lock);
+		__remove_inode_hash(inode);
 		atomic_dec(&inodes_stat.nr_unused);
 	}
 	spin_lock(&wb_inode_list_lock);
@@ -1372,11 +1382,11 @@
 	inode->i_state |= I_FREEING;
 	spin_unlock(&inode->i_lock);
 	evict(inode);
-	spin_lock(&inode->i_lock);
-	spin_lock(&inode_hash_lock);
-	hlist_del_init(&inode->i_hash);
-	spin_unlock(&inode_hash_lock);
-	spin_unlock(&inode->i_lock);
+	/*
+	 * i_lock is required to delete from hash because find_inode_fast
+	 * might find us but go to sleep before we run wake_up_inode.
+	 */
+	remove_inode_hash(inode);
 	wake_up_inode(inode);
 	BUG_ON(inode->i_state != (I_FREEING | I_CLEAR));
 	destroy_inode(inode);



^ permalink raw reply	[flat|nested] 70+ messages in thread

* [patch 15/35] fs: icache per-bucket inode hash locks
  2010-10-19  3:42 [patch 00/35] my inode scaling series for review npiggin
                   ` (13 preceding siblings ...)
  2010-10-19  3:42 ` [patch 14/35] fs: icache factor hash lock into functions npiggin
@ 2010-10-19  3:42 ` npiggin
  2010-10-19  3:42 ` [patch 16/35] fs: icache lazy inode lru npiggin
                   ` (21 subsequent siblings)
  36 siblings, 0 replies; 70+ messages in thread
From: npiggin @ 2010-10-19  3:42 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel

[-- Attachment #1: fs-inode_lock-scale-9.patch --]
[-- Type: text/plain, Size: 24189 bytes --]

Remove the global inode_hash_lock and replace it with per-hash-bucket locks.
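
For illustration, with the bucket type and helpers defined in this patch, a
hash insertion now serialises on a single bucket only (the lock is bit 0 of
the bucket head pointer, so the table stays one word per bucket):

	struct inode_hash_bucket *b = inode_hashtable + hash(sb, ino);

	spin_lock_bucket(b);
	hlist_bl_add_head(&inode->i_hash, &b->head);
	spin_unlock_bucket(b);

Two CPUs inserting into different buckets no longer contend on a global
inode_hash_lock.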

Signed-off-by: Nick Piggin <npiggin@kernel.dk>

---
 fs/btrfs/inode.c        |    2 
 fs/fs-writeback.c       |    2 
 fs/hfs/hfs_fs.h         |    2 
 fs/hfs/inode.c          |    2 
 fs/hfsplus/hfsplus_fs.h |    2 
 fs/hfsplus/inode.c      |    2 
 fs/inode.c              |  189 ++++++++++++++++++++++++++----------------------
 fs/nilfs2/gcinode.c     |   21 ++---
 fs/nilfs2/segment.c     |    2 
 fs/nilfs2/the_nilfs.h   |    2 
 fs/reiserfs/xattr.c     |    2 
 include/linux/fs.h      |    3 
 mm/shmem.c              |    4 -
 13 files changed, 129 insertions(+), 106 deletions(-)

Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c	2010-10-19 14:18:59.000000000 +1100
+++ linux-2.6/fs/inode.c	2010-10-19 14:19:30.000000000 +1100
@@ -24,12 +24,13 @@
 #include <linux/mount.h>
 #include <linux/async.h>
 #include <linux/posix_acl.h>
+#include <linux/bit_spinlock.h>
 
 /*
  * Usage:
  * sb_inode_list_lock protects:
  *   s_inodes, i_sb_list
- * inode_hash_lock protects:
+ * inode_hash_bucket lock protects:
  *   inode hash table, i_hash
  * wb_inode_list_lock protects:
  *   inode_in_use, inode_unused, b_io, b_more_io, b_dirty, i_list
@@ -44,7 +45,7 @@
  * sb_inode_list_lock
  *   inode->i_lock
  *     wb_inode_list_lock
- *     inode_hash_lock
+ *     inode_hash_bucket lock
  */
 /*
  * This is needed for the following functions:
@@ -95,7 +96,22 @@
 
 LIST_HEAD(inode_in_use);
 LIST_HEAD(inode_unused);
-static struct hlist_head *inode_hashtable __read_mostly;
+
+struct inode_hash_bucket {
+	struct hlist_bl_head head;
+};
+
+static inline void spin_lock_bucket(struct inode_hash_bucket *b)
+{
+	bit_spin_lock(0, (unsigned long *)b);
+}
+
+static inline void spin_unlock_bucket(struct inode_hash_bucket *b)
+{
+	__bit_spin_unlock(0, (unsigned long *)b);
+}
+
+static struct inode_hash_bucket *inode_hashtable __read_mostly;
 
 /*
  * A simple spinlock to protect the list manipulations.
@@ -105,7 +121,6 @@
  */
 DEFINE_SPINLOCK(sb_inode_list_lock);
 DEFINE_SPINLOCK(wb_inode_list_lock);
-static DEFINE_SPINLOCK(inode_hash_lock);
 
 /*
  * iprune_sem provides exclusion between the kswapd or try_to_free_pages
@@ -281,7 +296,7 @@
 void inode_init_once(struct inode *inode)
 {
 	memset(inode, 0, sizeof(*inode));
-	INIT_HLIST_NODE(&inode->i_hash);
+	INIT_HLIST_BL_NODE(&inode->i_hash);
 	INIT_LIST_HEAD(&inode->i_dentry);
 	INIT_LIST_HEAD(&inode->i_devices);
 	INIT_RADIX_TREE(&inode->i_data.page_tree, GFP_ATOMIC);
@@ -598,20 +613,21 @@
  * add any additional branch in the common code.
  */
 static struct inode *find_inode(struct super_block *sb,
-				struct hlist_head *head,
+				struct inode_hash_bucket *b,
 				int (*test)(struct inode *, void *),
 				void *data)
 {
-	struct hlist_node *node;
+	struct hlist_bl_node *node;
 	struct inode *inode = NULL;
 
 repeat:
-	spin_lock(&inode_hash_lock);
-	hlist_for_each_entry(inode, node, head, i_hash) {
+	spin_lock_bucket(b);
+	hlist_bl_for_each_entry(inode, node, &b->head, i_hash) {
 		if (inode->i_sb != sb)
 			continue;
 		if (!spin_trylock(&inode->i_lock)) {
-			spin_unlock(&inode_hash_lock);
+			spin_unlock_bucket(b);
+			cpu_relax();
 			goto repeat;
 		}
 		if (!test(inode, data)) {
@@ -619,13 +635,13 @@
 			continue;
 		}
 		if (inode->i_state & (I_FREEING|I_WILL_FREE)) {
-			spin_unlock(&inode_hash_lock);
+			spin_unlock_bucket(b);
 			__wait_on_freeing_inode(inode);
 			goto repeat;
 		}
 		break;
 	}
-	spin_unlock(&inode_hash_lock);
+	spin_unlock_bucket(b);
 	return node ? inode : NULL;
 }
 
@@ -634,30 +650,32 @@
  * iget_locked for details.
  */
 static struct inode *find_inode_fast(struct super_block *sb,
-				struct hlist_head *head, unsigned long ino)
+				struct inode_hash_bucket *b,
+				unsigned long ino)
 {
-	struct hlist_node *node;
+	struct hlist_bl_node *node;
 	struct inode *inode = NULL;
 
 repeat:
-	spin_lock(&inode_hash_lock);
-	hlist_for_each_entry(inode, node, head, i_hash) {
+	spin_lock_bucket(b);
+	hlist_bl_for_each_entry(inode, node, &b->head, i_hash) {
 		if (inode->i_ino != ino)
 			continue;
 		if (inode->i_sb != sb)
 			continue;
 		if (!spin_trylock(&inode->i_lock)) {
-			spin_unlock(&inode_hash_lock);
+			spin_unlock_bucket(b);
+			cpu_relax();
 			goto repeat;
 		}
 		if (inode->i_state & (I_FREEING|I_WILL_FREE)) {
-			spin_unlock(&inode_hash_lock);
+			spin_unlock_bucket(b);
 			__wait_on_freeing_inode(inode);
 			goto repeat;
 		}
 		break;
 	}
-	spin_unlock(&inode_hash_lock);
+	spin_unlock_bucket(b);
 	return node ? inode : NULL;
 }
 
@@ -672,7 +690,7 @@
 }
 
 static inline void
-__inode_add_to_lists(struct super_block *sb, struct hlist_head *head,
+__inode_add_to_lists(struct super_block *sb, struct inode_hash_bucket *b,
 			struct inode *inode)
 {
 	list_add(&inode->i_sb_list, &sb->s_inodes);
@@ -680,10 +698,10 @@
 	spin_lock(&wb_inode_list_lock);
 	list_add(&inode->i_list, &inode_in_use);
 	spin_unlock(&wb_inode_list_lock);
-	if (head) {
-		spin_lock(&inode_hash_lock);
-		hlist_add_head(&inode->i_hash, head);
-		spin_unlock(&inode_hash_lock);
+	if (b) {
+		spin_lock_bucket(b);
+		hlist_bl_add_head(&inode->i_hash, &b->head);
+		spin_unlock_bucket(b);
 	}
 }
 
@@ -701,11 +719,11 @@
  */
 void inode_add_to_lists(struct super_block *sb, struct inode *inode)
 {
-	struct hlist_head *head = inode_hashtable + hash(sb, inode->i_ino);
+	struct inode_hash_bucket *b = inode_hashtable + hash(sb, inode->i_ino);
 
 	spin_lock(&sb_inode_list_lock);
 	spin_lock(&inode->i_lock);
-	__inode_add_to_lists(sb, head, inode);
+	__inode_add_to_lists(sb, b, inode);
 	spin_unlock(&inode->i_lock);
 }
 EXPORT_SYMBOL_GPL(inode_add_to_lists);
@@ -787,7 +805,7 @@
  *	-- rmk@arm.uk.linux.org
  */
 static struct inode *get_new_inode(struct super_block *sb,
-				struct hlist_head *head,
+				struct inode_hash_bucket *b,
 				int (*test)(struct inode *, void *),
 				int (*set)(struct inode *, void *),
 				void *data)
@@ -799,7 +817,7 @@
 		struct inode *old;
 
 		/* We released the lock, so.. */
-		old = find_inode(sb, head, test, data);
+		old = find_inode(sb, b, test, data);
 		if (!old) {
 			spin_lock(&sb_inode_list_lock);
 			spin_lock(&inode->i_lock);
@@ -807,7 +825,7 @@
 				goto set_failed;
 
 			inode->i_state = I_NEW;
-			__inode_add_to_lists(sb, head, inode);
+			__inode_add_to_lists(sb, b, inode);
 			spin_unlock(&inode->i_lock);
 
 			/* Return the locked inode with I_NEW set, the
@@ -841,7 +859,7 @@
  * comment at iget_locked for details.
  */
 static struct inode *get_new_inode_fast(struct super_block *sb,
-				struct hlist_head *head, unsigned long ino)
+				struct inode_hash_bucket *b, unsigned long ino)
 {
 	struct inode *inode;
 
@@ -850,13 +868,13 @@
 		struct inode *old;
 
 		/* We released the lock, so.. */
-		old = find_inode_fast(sb, head, ino);
+		old = find_inode_fast(sb, b, ino);
 		if (!old) {
 			spin_lock(&sb_inode_list_lock);
 			spin_lock(&inode->i_lock);
 			inode->i_ino = ino;
 			inode->i_state = I_NEW;
-			__inode_add_to_lists(sb, head, inode);
+			__inode_add_to_lists(sb, b, inode);
 			spin_unlock(&inode->i_lock);
 
 			/* Return the locked inode with I_NEW set, the
@@ -882,14 +900,14 @@
 /* Is the ino for this sb hashed right now? */
 static int is_ino_hashed(struct super_block *sb, unsigned long ino)
 {
-	struct hlist_node *node;
+	struct hlist_bl_node *node;
 	struct inode *inode = NULL;
-	struct hlist_head *head = inode_hashtable + hash(sb, ino);
+	struct inode_hash_bucket *b = inode_hashtable + hash(sb, ino);
 
-	spin_lock(&inode_hash_lock);
-	hlist_for_each_entry(inode, node, head, i_hash) {
+	spin_lock_bucket(b);
+	hlist_bl_for_each_entry(inode, node, &b->head, i_hash) {
 		if (inode->i_ino == ino && inode->i_sb == sb) {
-			spin_unlock(&inode_hash_lock);
+			spin_unlock_bucket(b);
 			return 0;
 		}
 		/*
@@ -898,7 +916,7 @@
 		 * skip it and get the next one.
 		 */
 	}
-	spin_unlock(&inode_hash_lock);
+	spin_unlock_bucket(b);
 	return 1;
 }
 
@@ -979,12 +997,13 @@
  * Note, @test is called with the i_lock held, so can't sleep.
  */
 static struct inode *ifind(struct super_block *sb,
-		struct hlist_head *head, int (*test)(struct inode *, void *),
+		struct inode_hash_bucket *b,
+		int (*test)(struct inode *, void *),
 		void *data, const int wait)
 {
 	struct inode *inode;
 
-	inode = find_inode(sb, head, test, data);
+	inode = find_inode(sb, b, test, data);
 	if (inode) {
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
@@ -1011,11 +1030,12 @@
  * Otherwise NULL is returned.
  */
 static struct inode *ifind_fast(struct super_block *sb,
-		struct hlist_head *head, unsigned long ino)
+		struct inode_hash_bucket *b,
+		unsigned long ino)
 {
 	struct inode *inode;
 
-	inode = find_inode_fast(sb, head, ino);
+	inode = find_inode_fast(sb, b, ino);
 	if (inode) {
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
@@ -1049,9 +1069,9 @@
 struct inode *ilookup5_nowait(struct super_block *sb, unsigned long hashval,
 		int (*test)(struct inode *, void *), void *data)
 {
-	struct hlist_head *head = inode_hashtable + hash(sb, hashval);
+	struct inode_hash_bucket *b = inode_hashtable + hash(sb, hashval);
 
-	return ifind(sb, head, test, data, 0);
+	return ifind(sb, b, test, data, 0);
 }
 EXPORT_SYMBOL(ilookup5_nowait);
 
@@ -1077,9 +1097,9 @@
 struct inode *ilookup5(struct super_block *sb, unsigned long hashval,
 		int (*test)(struct inode *, void *), void *data)
 {
-	struct hlist_head *head = inode_hashtable + hash(sb, hashval);
+	struct inode_hash_bucket *b = inode_hashtable + hash(sb, hashval);
 
-	return ifind(sb, head, test, data, 1);
+	return ifind(sb, b, test, data, 1);
 }
 EXPORT_SYMBOL(ilookup5);
 
@@ -1099,9 +1119,9 @@
  */
 struct inode *ilookup(struct super_block *sb, unsigned long ino)
 {
-	struct hlist_head *head = inode_hashtable + hash(sb, ino);
+	struct inode_hash_bucket *b = inode_hashtable + hash(sb, ino);
 
-	return ifind_fast(sb, head, ino);
+	return ifind_fast(sb, b, ino);
 }
 EXPORT_SYMBOL(ilookup);
 
@@ -1129,17 +1149,17 @@
 		int (*test)(struct inode *, void *),
 		int (*set)(struct inode *, void *), void *data)
 {
-	struct hlist_head *head = inode_hashtable + hash(sb, hashval);
+	struct inode_hash_bucket *b = inode_hashtable + hash(sb, hashval);
 	struct inode *inode;
 
-	inode = ifind(sb, head, test, data, 1);
+	inode = ifind(sb, b, test, data, 1);
 	if (inode)
 		return inode;
 	/*
 	 * get_new_inode() will do the right thing, re-trying the search
 	 * in case it had to block at any point.
 	 */
-	return get_new_inode(sb, head, test, set, data);
+	return get_new_inode(sb, b, test, set, data);
 }
 EXPORT_SYMBOL(iget5_locked);
 
@@ -1160,17 +1180,17 @@
  */
 struct inode *iget_locked(struct super_block *sb, unsigned long ino)
 {
-	struct hlist_head *head = inode_hashtable + hash(sb, ino);
+	struct inode_hash_bucket *b = inode_hashtable + hash(sb, ino);
 	struct inode *inode;
 
-	inode = ifind_fast(sb, head, ino);
+	inode = ifind_fast(sb, b, ino);
 	if (inode)
 		return inode;
 	/*
 	 * get_new_inode_fast() will do the right thing, re-trying the search
 	 * in case it had to block at any point.
 	 */
-	return get_new_inode_fast(sb, head, ino);
+	return get_new_inode_fast(sb, b, ino);
 }
 EXPORT_SYMBOL(iget_locked);
 
@@ -1178,16 +1198,16 @@
 {
 	struct super_block *sb = inode->i_sb;
 	ino_t ino = inode->i_ino;
-	struct hlist_head *head = inode_hashtable + hash(sb, ino);
+	struct inode_hash_bucket *b = inode_hashtable + hash(sb, ino);
 
 	inode->i_state |= I_NEW;
 	while (1) {
-		struct hlist_node *node;
+		struct hlist_bl_node *node;
 		struct inode *old = NULL;
 
 repeat:
-		spin_lock(&inode_hash_lock);
-		hlist_for_each_entry(old, node, head, i_hash) {
+		spin_lock_bucket(b);
+		hlist_bl_for_each_entry(old, node, &b->head, i_hash) {
 			if (old->i_ino != ino)
 				continue;
 			if (old->i_sb != sb)
@@ -1195,21 +1215,21 @@
 			if (old->i_state & (I_FREEING|I_WILL_FREE))
 				continue;
 			if (!spin_trylock(&old->i_lock)) {
-				spin_unlock(&inode_hash_lock);
+				spin_unlock_bucket(b);
 				goto repeat;
 			}
 			break;
 		}
 		if (likely(!node)) {
-			hlist_add_head(&inode->i_hash, head);
-			spin_unlock(&inode_hash_lock);
+			hlist_bl_add_head(&inode->i_hash, &b->head);
+			spin_unlock_bucket(b);
 			return 0;
 		}
-		spin_unlock(&inode_hash_lock);
+		spin_unlock_bucket(b);
 		__iget(old);
 		spin_unlock(&old->i_lock);
 		wait_on_inode(old);
-		if (unlikely(!hlist_unhashed(&old->i_hash))) {
+		if (unlikely(!hlist_bl_unhashed(&old->i_hash))) {
 			iput(old);
 			return -EBUSY;
 		}
@@ -1222,17 +1242,17 @@
 		int (*test)(struct inode *, void *), void *data)
 {
 	struct super_block *sb = inode->i_sb;
-	struct hlist_head *head = inode_hashtable + hash(sb, hashval);
+	struct inode_hash_bucket *b = inode_hashtable + hash(sb, hashval);
 
 	inode->i_state |= I_NEW;
 
 	while (1) {
-		struct hlist_node *node;
+		struct hlist_bl_node *node;
 		struct inode *old = NULL;
 
 repeat:
-		spin_lock(&inode_hash_lock);
-		hlist_for_each_entry(old, node, head, i_hash) {
+		spin_lock_bucket(b);
+		hlist_bl_for_each_entry(old, node, &b->head, i_hash) {
 			if (old->i_sb != sb)
 				continue;
 			if (!test(old, data))
@@ -1240,21 +1260,21 @@
 			if (old->i_state & (I_FREEING|I_WILL_FREE))
 				continue;
 			if (!spin_trylock(&old->i_lock)) {
-				spin_unlock(&inode_hash_lock);
+				spin_unlock_bucket(b);
 				goto repeat;
 			}
 			break;
 		}
 		if (likely(!node)) {
-			hlist_add_head(&inode->i_hash, head);
-			spin_unlock(&inode_hash_lock);
+			hlist_bl_add_head(&inode->i_hash, &b->head);
+			spin_unlock_bucket(b);
 			return 0;
 		}
-		spin_unlock(&inode_hash_lock);
+		spin_unlock_bucket(b);
 		__iget(old);
 		spin_unlock(&old->i_lock);
 		wait_on_inode(old);
-		if (unlikely(!hlist_unhashed(&old->i_hash))) {
+		if (unlikely(!hlist_bl_unhashed(&old->i_hash))) {
 			iput(old);
 			return -EBUSY;
 		}
@@ -1273,12 +1293,12 @@
  */
 void __insert_inode_hash(struct inode *inode, unsigned long hashval)
 {
-	struct hlist_head *head = inode_hashtable + hash(inode->i_sb, hashval);
+	struct inode_hash_bucket *b = inode_hashtable + hash(inode->i_sb, hashval);
 
 	spin_lock(&inode->i_lock);
-	spin_lock(&inode_hash_lock);
-	hlist_add_head(&inode->i_hash, head);
-	spin_unlock(&inode_hash_lock);
+	spin_lock_bucket(b);
+	hlist_bl_add_head(&inode->i_hash, &b->head);
+	spin_unlock_bucket(b);
 	spin_unlock(&inode->i_lock);
 }
 EXPORT_SYMBOL(__insert_inode_hash);
@@ -1292,9 +1312,10 @@
  */
 static void __remove_inode_hash(struct inode *inode)
 {
-	spin_lock(&inode_hash_lock);
-	hlist_del_init(&inode->i_hash);
-	spin_unlock(&inode_hash_lock);
+	struct inode_hash_bucket *b = inode_hashtable + hash(inode->i_sb, inode->i_ino);
+	spin_lock_bucket(b);
+	hlist_bl_del_init(&inode->i_hash);
+	spin_unlock_bucket(b);
 }
 
 /**
@@ -1324,7 +1345,7 @@
  */
 int generic_drop_inode(struct inode *inode)
 {
-	return !inode->i_nlink || hlist_unhashed(&inode->i_hash);
+	return !inode->i_nlink || hlist_bl_unhashed(&inode->i_hash);
 }
 EXPORT_SYMBOL_GPL(generic_drop_inode);
 
@@ -1636,7 +1657,7 @@
 
 	inode_hashtable =
 		alloc_large_system_hash("Inode-cache",
-					sizeof(struct hlist_head),
+					sizeof(struct inode_hash_bucket),
 					ihash_entries,
 					14,
 					HASH_EARLY,
@@ -1645,7 +1666,7 @@
 					0);
 
 	for (loop = 0; loop < (1 << i_hash_shift); loop++)
-		INIT_HLIST_HEAD(&inode_hashtable[loop]);
+		INIT_HLIST_BL_HEAD(&inode_hashtable[loop].head);
 }
 
 void __init inode_init(void)
@@ -1667,7 +1688,7 @@
 
 	inode_hashtable =
 		alloc_large_system_hash("Inode-cache",
-					sizeof(struct hlist_head),
+					sizeof(struct inode_hash_bucket),
 					ihash_entries,
 					14,
 					0,
@@ -1676,7 +1697,7 @@
 					0);
 
 	for (loop = 0; loop < (1 << i_hash_shift); loop++)
-		INIT_HLIST_HEAD(&inode_hashtable[loop]);
+		INIT_HLIST_BL_HEAD(&inode_hashtable[loop].head);
 }
 
 void init_special_inode(struct inode *inode, umode_t mode, dev_t rdev)
Index: linux-2.6/fs/fs-writeback.c
===================================================================
--- linux-2.6.orig/fs/fs-writeback.c	2010-10-19 14:18:59.000000000 +1100
+++ linux-2.6/fs/fs-writeback.c	2010-10-19 14:19:30.000000000 +1100
@@ -998,7 +998,7 @@
 		 * dirty list.  Add blockdev inodes as well.
 		 */
 		if (!S_ISBLK(inode->i_mode)) {
-			if (hlist_unhashed(&inode->i_hash))
+			if (hlist_bl_unhashed(&inode->i_hash))
 				goto out;
 		}
 		if (inode->i_state & I_FREEING)
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h	2010-10-19 14:18:59.000000000 +1100
+++ linux-2.6/include/linux/fs.h	2010-10-19 14:19:30.000000000 +1100
@@ -380,6 +380,7 @@
 #include <linux/cache.h>
 #include <linux/kobject.h>
 #include <linux/list.h>
+#include <linux/list_bl.h>
 #include <linux/radix-tree.h>
 #include <linux/prio_tree.h>
 #include <linux/init.h>
@@ -728,7 +729,7 @@
 #define ACL_NOT_CACHED ((void *)(-1))
 
 struct inode {
-	struct hlist_node	i_hash;
+	struct hlist_bl_node	i_hash;
 	struct list_head	i_list;		/* backing dev IO list */
 	struct list_head	i_sb_list;
 	struct list_head	i_dentry;
Index: linux-2.6/mm/shmem.c
===================================================================
--- linux-2.6.orig/mm/shmem.c	2010-10-19 14:18:58.000000000 +1100
+++ linux-2.6/mm/shmem.c	2010-10-19 14:19:28.000000000 +1100
@@ -2148,7 +2148,7 @@
 	if (*len < 3)
 		return 255;
 
-	if (hlist_unhashed(&inode->i_hash)) {
+	if (hlist_bl_unhashed(&inode->i_hash)) {
 		/* Unfortunately insert_inode_hash is not idempotent,
 		 * so as we hash inodes here rather than at creation
 		 * time, we need a lock to ensure we only try
@@ -2156,7 +2156,7 @@
 		 */
 		static DEFINE_SPINLOCK(lock);
 		spin_lock(&lock);
-		if (hlist_unhashed(&inode->i_hash))
+		if (hlist_bl_unhashed(&inode->i_hash))
 			__insert_inode_hash(inode,
 					    inode->i_ino + inode->i_generation);
 		spin_unlock(&lock);
Index: linux-2.6/fs/btrfs/inode.c
===================================================================
--- linux-2.6.orig/fs/btrfs/inode.c	2010-10-19 14:18:58.000000000 +1100
+++ linux-2.6/fs/btrfs/inode.c	2010-10-19 14:19:29.000000000 +1100
@@ -3854,7 +3854,7 @@
 	p = &root->inode_tree.rb_node;
 	parent = NULL;
 
-	if (hlist_unhashed(&inode->i_hash))
+	if (hlist_bl_unhashed(&inode->i_hash))
 		return;
 
 	spin_lock(&root->inode_lock);
Index: linux-2.6/fs/reiserfs/xattr.c
===================================================================
--- linux-2.6.orig/fs/reiserfs/xattr.c	2010-10-19 14:17:22.000000000 +1100
+++ linux-2.6/fs/reiserfs/xattr.c	2010-10-19 14:18:59.000000000 +1100
@@ -424,7 +424,7 @@
 static void update_ctime(struct inode *inode)
 {
 	struct timespec now = current_fs_time(inode->i_sb);
-	if (hlist_unhashed(&inode->i_hash) || !inode->i_nlink ||
+	if (hlist_bl_unhashed(&inode->i_hash) || !inode->i_nlink ||
 	    timespec_equal(&inode->i_ctime, &now))
 		return;
 
Index: linux-2.6/fs/hfs/hfs_fs.h
===================================================================
--- linux-2.6.orig/fs/hfs/hfs_fs.h	2010-10-19 14:17:22.000000000 +1100
+++ linux-2.6/fs/hfs/hfs_fs.h	2010-10-19 14:18:59.000000000 +1100
@@ -148,7 +148,7 @@
 
 	int fs_div;
 
-	struct hlist_head rsrc_inodes;
+	struct hlist_bl_head rsrc_inodes;
 };
 
 #define HFS_FLG_BITMAP_DIRTY	0
Index: linux-2.6/fs/hfs/inode.c
===================================================================
--- linux-2.6.orig/fs/hfs/inode.c	2010-10-19 14:17:22.000000000 +1100
+++ linux-2.6/fs/hfs/inode.c	2010-10-19 14:18:59.000000000 +1100
@@ -524,7 +524,7 @@
 	HFS_I(inode)->rsrc_inode = dir;
 	HFS_I(dir)->rsrc_inode = inode;
 	igrab(dir);
-	hlist_add_head(&inode->i_hash, &HFS_SB(dir->i_sb)->rsrc_inodes);
+	hlist_bl_add_head(&inode->i_hash, &HFS_SB(dir->i_sb)->rsrc_inodes);
 	mark_inode_dirty(inode);
 out:
 	d_add(dentry, inode);
Index: linux-2.6/fs/hfsplus/hfsplus_fs.h
===================================================================
--- linux-2.6.orig/fs/hfsplus/hfsplus_fs.h	2010-10-19 14:17:22.000000000 +1100
+++ linux-2.6/fs/hfsplus/hfsplus_fs.h	2010-10-19 14:18:59.000000000 +1100
@@ -144,7 +144,7 @@
 
 	unsigned long flags;
 
-	struct hlist_head rsrc_inodes;
+	struct hlist_bl_head rsrc_inodes;
 };
 
 #define HFSPLUS_SB_WRITEBACKUP	0x0001
Index: linux-2.6/fs/hfsplus/inode.c
===================================================================
--- linux-2.6.orig/fs/hfsplus/inode.c	2010-10-19 14:17:22.000000000 +1100
+++ linux-2.6/fs/hfsplus/inode.c	2010-10-19 14:18:59.000000000 +1100
@@ -202,7 +202,7 @@
 	HFSPLUS_I(inode).rsrc_inode = dir;
 	HFSPLUS_I(dir).rsrc_inode = inode;
 	igrab(dir);
-	hlist_add_head(&inode->i_hash, &HFSPLUS_SB(sb).rsrc_inodes);
+	hlist_bl_add_head(&inode->i_hash, &HFSPLUS_SB(sb).rsrc_inodes);
 	mark_inode_dirty(inode);
 out:
 	d_add(dentry, inode);
Index: linux-2.6/fs/nilfs2/gcinode.c
===================================================================
--- linux-2.6.orig/fs/nilfs2/gcinode.c	2010-10-19 14:17:22.000000000 +1100
+++ linux-2.6/fs/nilfs2/gcinode.c	2010-10-19 14:18:59.000000000 +1100
@@ -45,6 +45,7 @@
 #include <linux/buffer_head.h>
 #include <linux/mpage.h>
 #include <linux/hash.h>
+#include <linux/list_bl.h>
 #include <linux/slab.h>
 #include <linux/swap.h>
 #include "nilfs.h"
@@ -196,13 +197,13 @@
 	INIT_LIST_HEAD(&nilfs->ns_gc_inodes);
 
 	nilfs->ns_gc_inodes_h =
-		kmalloc(sizeof(struct hlist_head) * NILFS_GCINODE_HASH_SIZE,
+		kmalloc(sizeof(struct hlist_bl_head) * NILFS_GCINODE_HASH_SIZE,
 			GFP_NOFS);
 	if (nilfs->ns_gc_inodes_h == NULL)
 		return -ENOMEM;
 
 	for (loop = 0; loop < NILFS_GCINODE_HASH_SIZE; loop++)
-		INIT_HLIST_HEAD(&nilfs->ns_gc_inodes_h[loop]);
+		INIT_HLIST_BL_HEAD(&nilfs->ns_gc_inodes_h[loop]);
 	return 0;
 }
 
@@ -254,18 +255,18 @@
  */
 struct inode *nilfs_gc_iget(struct the_nilfs *nilfs, ino_t ino, __u64 cno)
 {
-	struct hlist_head *head = nilfs->ns_gc_inodes_h + ihash(ino, cno);
-	struct hlist_node *node;
+	struct hlist_bl_head *head = nilfs->ns_gc_inodes_h + ihash(ino, cno);
+	struct hlist_bl_node *node;
 	struct inode *inode;
 
-	hlist_for_each_entry(inode, node, head, i_hash) {
+	hlist_bl_for_each_entry(inode, node, head, i_hash) {
 		if (inode->i_ino == ino && NILFS_I(inode)->i_cno == cno)
 			return inode;
 	}
 
 	inode = alloc_gcinode(nilfs, ino, cno);
 	if (likely(inode)) {
-		hlist_add_head(&inode->i_hash, head);
+		hlist_bl_add_head(&inode->i_hash, head);
 		list_add(&NILFS_I(inode)->i_dirty, &nilfs->ns_gc_inodes);
 	}
 	return inode;
@@ -284,14 +285,14 @@
  */
 void nilfs_remove_all_gcinode(struct the_nilfs *nilfs)
 {
-	struct hlist_head *head = nilfs->ns_gc_inodes_h;
-	struct hlist_node *node, *n;
+	struct hlist_bl_head *head = nilfs->ns_gc_inodes_h;
+	struct hlist_bl_node *node, *n;
 	struct inode *inode;
 	int loop;
 
 	for (loop = 0; loop < NILFS_GCINODE_HASH_SIZE; loop++, head++) {
-		hlist_for_each_entry_safe(inode, node, n, head, i_hash) {
-			hlist_del_init(&inode->i_hash);
+		hlist_bl_for_each_entry_safe(inode, node, n, head, i_hash) {
+			hlist_bl_del_init(&inode->i_hash);
 			list_del_init(&NILFS_I(inode)->i_dirty);
 			nilfs_clear_gcinode(inode); /* might sleep */
 		}
Index: linux-2.6/fs/nilfs2/segment.c
===================================================================
--- linux-2.6.orig/fs/nilfs2/segment.c	2010-10-19 14:17:22.000000000 +1100
+++ linux-2.6/fs/nilfs2/segment.c	2010-10-19 14:18:59.000000000 +1100
@@ -2452,7 +2452,7 @@
 	list_for_each_entry_safe(ii, n, head, i_dirty) {
 		if (!test_bit(NILFS_I_UPDATED, &ii->i_state))
 			continue;
-		hlist_del_init(&ii->vfs_inode.i_hash);
+		hlist_bl_del_init(&ii->vfs_inode.i_hash);
 		list_del_init(&ii->i_dirty);
 		nilfs_clear_gcinode(&ii->vfs_inode);
 	}
Index: linux-2.6/fs/nilfs2/the_nilfs.h
===================================================================
--- linux-2.6.orig/fs/nilfs2/the_nilfs.h	2010-10-19 14:17:22.000000000 +1100
+++ linux-2.6/fs/nilfs2/the_nilfs.h	2010-10-19 14:18:59.000000000 +1100
@@ -167,7 +167,7 @@
 
 	/* GC inode list and hash table head */
 	struct list_head	ns_gc_inodes;
-	struct hlist_head      *ns_gc_inodes_h;
+	struct hlist_bl_head      *ns_gc_inodes_h;
 
 	/* Disk layout information (static) */
 	unsigned int		ns_blocksize_bits;



^ permalink raw reply	[flat|nested] 70+ messages in thread

* [patch 16/35] fs: icache lazy inode lru
  2010-10-19  3:42 [patch 00/35] my inode scaling series for review npiggin
                   ` (14 preceding siblings ...)
  2010-10-19  3:42 ` [patch 15/35] fs: icache per-bucket inode hash locks npiggin
@ 2010-10-19  3:42 ` npiggin
  2010-10-19  3:42 ` [patch 17/35] fs: icache RCU free inodes npiggin
                   ` (20 subsequent siblings)
  36 siblings, 0 replies; 70+ messages in thread
From: npiggin @ 2010-10-19  3:42 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel

[-- Attachment #1: fs-inode_lock-scale-10.patch --]
[-- Type: text/plain, Size: 8964 bytes --]

Implement lazy inode lru similarly to dcache. That is, avoid moving inodes
around the LRU list in iget/iput operations and defer the refcount check
to reclaim-time. Use a flag, I_REFERENCED, to tell reclaim that iget has
touched the inode in the past.

This will reduce lock acquisition, and will also improve lock ordering
with subsequent patches.

The global inode_in_use list goes away, and the
!list_empty(&inode->i_list) invariant goes away.
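
The heart of the reclaim-side second chance, condensed from the
prune_icache() hunk below (wb_inode_list_lock and i_lock held):

	if (inode->i_state & I_REFERENCED) {
		/* iput() touched it since it went on the LRU: spare it once */
		list_move(&inode->i_list, &inode_unused);
		inode->i_state &= ~I_REFERENCED;
		spin_unlock(&inode->i_lock);
		continue;
	}
	/* not referenced since the last scan: candidate for freeing */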

Signed-off-by: Nick Piggin <npiggin@kernel.dk>

---
 fs/fs-writeback.c         |    7 ---
 fs/inode.c                |   98 ++++++++++++++++++++++------------------------
 include/linux/fs.h        |   20 ++++++---
 include/linux/writeback.h |    1 
 4 files changed, 61 insertions(+), 65 deletions(-)

Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c	2010-10-19 14:18:59.000000000 +1100
+++ linux-2.6/fs/inode.c	2010-10-19 14:19:29.000000000 +1100
@@ -94,7 +94,6 @@
  * allowing for low-overhead inode sync() operations.
  */
 
-LIST_HEAD(inode_in_use);
 LIST_HEAD(inode_unused);
 
 struct inode_hash_bucket {
@@ -299,6 +298,7 @@
 	INIT_HLIST_BL_NODE(&inode->i_hash);
 	INIT_LIST_HEAD(&inode->i_dentry);
 	INIT_LIST_HEAD(&inode->i_devices);
+	INIT_LIST_HEAD(&inode->i_list);
 	INIT_RADIX_TREE(&inode->i_data.page_tree, GFP_ATOMIC);
 	spin_lock_init(&inode->i_data.tree_lock);
 	spin_lock_init(&inode->i_data.i_mmap_lock);
@@ -320,25 +320,6 @@
 	inode_init_once(inode);
 }
 
-/*
- * i_lock must be held
- */
-void __iget(struct inode *inode)
-{
-	assert_spin_locked(&inode->i_lock);
-
-	inode->i_count++;
-	if (inode->i_count > 1)
-		return;
-
-	if (!(inode->i_state & (I_DIRTY|I_SYNC))) {
-		spin_lock(&wb_inode_list_lock);
-		list_move(&inode->i_list, &inode_in_use);
-		spin_unlock(&wb_inode_list_lock);
-	}
-	atomic_dec(&inodes_stat.nr_unused);
-}
-
 void end_writeback(struct inode *inode)
 {
 	might_sleep();
@@ -383,7 +364,7 @@
 		struct inode *inode;
 
 		inode = list_first_entry(head, struct inode, i_list);
-		list_del(&inode->i_list);
+		list_del_init(&inode->i_list);
 
 		evict(inode);
 
@@ -432,11 +413,12 @@
 		invalidate_inode_buffers(inode);
 		if (!inode->i_count) {
 			spin_lock(&wb_inode_list_lock);
-			list_move(&inode->i_list, dispose);
+			list_del(&inode->i_list);
 			spin_unlock(&wb_inode_list_lock);
 			WARN_ON(inode->i_state & I_NEW);
 			inode->i_state |= I_FREEING;
 			spin_unlock(&inode->i_lock);
+			list_add(&inode->i_list, dispose);
 			count++;
 			continue;
 		}
@@ -476,7 +458,7 @@
 
 static int can_unuse(struct inode *inode)
 {
-	if (inode->i_state)
+	if (inode->i_state & ~I_REFERENCED)
 		return 0;
 	if (inode_has_buffers(inode))
 		return 0;
@@ -504,13 +486,12 @@
 {
 	LIST_HEAD(freeable);
 	int nr_pruned = 0;
-	int nr_scanned;
 	unsigned long reap = 0;
 
 	down_read(&iprune_sem);
 again:
 	spin_lock(&wb_inode_list_lock);
-	for (nr_scanned = 0; nr_scanned < nr_to_scan; nr_scanned++) {
+	for (; nr_to_scan; nr_to_scan--) {
 		struct inode *inode;
 
 		if (list_empty(&inode_unused))
@@ -522,34 +503,47 @@
 			spin_unlock(&wb_inode_list_lock);
 			goto again;
 		}
-		if (inode->i_state || inode->i_count) {
+		if (inode->i_count || (inode->i_state & ~I_REFERENCED)) {
+			list_del_init(&inode->i_list);
+			spin_unlock(&inode->i_lock);
+			atomic_dec(&inodes_stat.nr_unused);
+			continue;
+		}
+		if (inode->i_state & I_REFERENCED) {
 			list_move(&inode->i_list, &inode_unused);
+			inode->i_state &= ~I_REFERENCED;
 			spin_unlock(&inode->i_lock);
 			continue;
 		}
 		if (inode_has_buffers(inode) || inode->i_data.nrpages) {
+			/*
+			 * Move back to the head of the unused list in case the
+			 * invalidations failed. Could improve this by going to
+			 * the head of the list only if invalidation fails.
+			 *
+			 * We'll try to get it back if it becomes freeable.
+			 */
+			list_move(&inode->i_list, &inode_unused);
 			spin_unlock(&wb_inode_list_lock);
 			__iget(inode);
 			spin_unlock(&inode->i_lock);
+
 			if (remove_inode_buffers(inode))
 				reap += invalidate_mapping_pages(&inode->i_data,
 								0, -1);
 			iput(inode);
-again2:
 			spin_lock(&wb_inode_list_lock);
-
-			if (inode != list_entry(inode_unused.next,
-						struct inode, i_list))
-				continue;	/* wrong inode or list_empty */
-			if (!spin_trylock(&inode->i_lock)) {
-				spin_unlock(&wb_inode_list_lock);
-				goto again2;
-			}
-			if (!can_unuse(inode)) {
-				spin_unlock(&inode->i_lock);
-				continue;
+			if (inode == list_entry(inode_unused.next,
+						struct inode, i_list)) {
+				if (spin_trylock(&inode->i_lock)) {
+					if (can_unuse(inode))
+						goto freeable;
+					spin_unlock(&inode->i_lock);
+				}
 			}
+			continue;
 		}
+freeable:
 		list_move(&inode->i_list, &freeable);
 		WARN_ON(inode->i_state & I_NEW);
 		inode->i_state |= I_FREEING;
@@ -695,9 +689,6 @@
 {
 	list_add(&inode->i_sb_list, &sb->s_inodes);
 	spin_unlock(&sb_inode_list_lock);
-	spin_lock(&wb_inode_list_lock);
-	list_add(&inode->i_list, &inode_in_use);
-	spin_unlock(&wb_inode_list_lock);
 	if (b) {
 		spin_lock_bucket(b);
 		hlist_bl_add_head(&inode->i_hash, &b->head);
@@ -1371,13 +1362,15 @@
 		drop = generic_drop_inode(inode);
 
 	if (!drop) {
-		if (!(inode->i_state & (I_DIRTY|I_SYNC))) {
-			spin_lock(&wb_inode_list_lock);
-			list_move(&inode->i_list, &inode_unused);
-			spin_unlock(&wb_inode_list_lock);
-		}
-		atomic_inc(&inodes_stat.nr_unused);
 		if (sb->s_flags & MS_ACTIVE) {
+			inode->i_state |= I_REFERENCED;
+			if (!(inode->i_state & (I_DIRTY|I_SYNC)) &&
+					list_empty(&inode->i_list)) {
+				spin_lock(&wb_inode_list_lock);
+				list_add(&inode->i_list, &inode_unused);
+				spin_unlock(&wb_inode_list_lock);
+				atomic_inc(&inodes_stat.nr_unused);
+			}
 			spin_unlock(&inode->i_lock);
 			spin_unlock(&sb_inode_list_lock);
 			return;
@@ -1392,11 +1385,14 @@
 		WARN_ON(inode->i_state & I_NEW);
 		inode->i_state &= ~I_WILL_FREE;
 		__remove_inode_hash(inode);
-		atomic_dec(&inodes_stat.nr_unused);
 	}
-	spin_lock(&wb_inode_list_lock);
-	list_del_init(&inode->i_list);
-	spin_unlock(&wb_inode_list_lock);
+	if (!list_empty(&inode->i_list)) {
+		spin_lock(&wb_inode_list_lock);
+		list_del_init(&inode->i_list);
+		spin_unlock(&wb_inode_list_lock);
+		if (!inode->i_state)
+			atomic_dec(&inodes_stat.nr_unused);
+	}
 	list_del_init(&inode->i_sb_list);
 	spin_unlock(&sb_inode_list_lock);
 	WARN_ON(inode->i_state & I_NEW);
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h	2010-10-19 14:18:59.000000000 +1100
+++ linux-2.6/include/linux/fs.h	2010-10-19 14:19:28.000000000 +1100
@@ -1637,16 +1637,17 @@
  *
  * Q: What is the difference between I_WILL_FREE and I_FREEING?
  */
-#define I_DIRTY_SYNC		1
-#define I_DIRTY_DATASYNC	2
-#define I_DIRTY_PAGES		4
+#define I_DIRTY_SYNC		0x01
+#define I_DIRTY_DATASYNC	0x02
+#define I_DIRTY_PAGES		0x04
 #define __I_NEW			3
 #define I_NEW			(1 << __I_NEW)
-#define I_WILL_FREE		16
-#define I_FREEING		32
-#define I_CLEAR			64
+#define I_WILL_FREE		0x10
+#define I_FREEING		0x20
+#define I_CLEAR			0x40
 #define __I_SYNC		7
 #define I_SYNC			(1 << __I_SYNC)
+#define I_REFERENCED		0x100
 
 #define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES)
 
@@ -2187,7 +2188,6 @@
 extern int insert_inode_locked(struct inode *);
 extern void unlock_new_inode(struct inode *);
 
-extern void __iget(struct inode * inode);
 extern void iget_failed(struct inode *);
 extern void end_writeback(struct inode *);
 extern void destroy_inode(struct inode *);
@@ -2401,6 +2401,12 @@
 extern void save_mount_options(struct super_block *sb, char *options);
 extern void replace_mount_options(struct super_block *sb, char *options);
 
+static inline void __iget(struct inode *inode)
+{
+	assert_spin_locked(&inode->i_lock);
+	inode->i_count++;
+}
+
 static inline ino_t parent_ino(struct dentry *dentry)
 {
 	ino_t res;
Index: linux-2.6/fs/fs-writeback.c
===================================================================
--- linux-2.6.orig/fs/fs-writeback.c	2010-10-19 14:18:59.000000000 +1100
+++ linux-2.6/fs/fs-writeback.c	2010-10-19 14:19:25.000000000 +1100
@@ -416,14 +416,9 @@
 			 * completion.
 			 */
 			redirty_tail(inode);
-		} else if (inode->i_count) {
-			/*
-			 * The inode is clean, inuse
-			 */
-			list_move(&inode->i_list, &inode_in_use);
 		} else {
 			/*
-			 * The inode is clean, unused
+			 * The inode is clean
 			 */
 			list_move(&inode->i_list, &inode_unused);
 		}
Index: linux-2.6/include/linux/writeback.h
===================================================================
--- linux-2.6.orig/include/linux/writeback.h	2010-10-19 14:18:59.000000000 +1100
+++ linux-2.6/include/linux/writeback.h	2010-10-19 14:19:23.000000000 +1100
@@ -11,7 +11,6 @@
 
 extern spinlock_t sb_inode_list_lock;
 extern spinlock_t wb_inode_list_lock;
-extern struct list_head inode_in_use;
 extern struct list_head inode_unused;
 
 /*



^ permalink raw reply	[flat|nested] 70+ messages in thread

* [patch 17/35] fs: icache RCU free inodes
  2010-10-19  3:42 [patch 00/35] my inode scaling series for review npiggin
                   ` (15 preceding siblings ...)
  2010-10-19  3:42 ` [patch 16/35] fs: icache lazy inode lru npiggin
@ 2010-10-19  3:42 ` npiggin
  2010-10-19  3:42 ` [patch 18/35] fs: avoid inode RCU freeing for pseudo fs npiggin
                   ` (19 subsequent siblings)
  36 siblings, 0 replies; 70+ messages in thread
From: npiggin @ 2010-10-19  3:42 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel

[-- Attachment #1: fs-inode_rcu.patch --]
[-- Type: text/plain, Size: 55462 bytes --]

RCU free the struct inode. This will allow:

- Subsequent store-free path walking patch. The inode must be consulted for
  permissions when walking, so an RCU inode reference is a must.
- sb_inode_list_lock to be moved inside i_lock because sb list walkers who want
  to take i_lock no longer need to take sb_inode_list_lock to walk the list in
  the first place. This will simplify and optimize locking.
- Could remove some nested trylock loops in dcache code.
- Could potentially simplify things a bit in VM land. Do not need to take the
  page lock to follow page->mapping.

The downside of this is the performance cost of using RCU. In a simple
creat/unlink microbenchmark, performance drops by about 10% due to inability to
reuse cache-hot slab objects. As iterations increase and RCU freeing starts
kicking over, this increases to about 20%.

In cases where inode lifetimes are longer (ie. many inodes may be allocated
during the average life span of a single inode), a lot of this cache reuse is
not applicable, so the regression caused by this patch is smaller.

The cache-hot regression could largely be avoided by using SLAB_DESTROY_BY_RCU;
however, this adds some complexity to list walking and store-free path walking,
so I prefer to implement this at a later date, if it is shown to be a win in
real situations. I haven't found a regression in any non-micro benchmark so I
doubt it will be a problem.
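
As an illustration of what RCU freeing buys, here is a sketch of the kind of
store-free lookup that later patches can do. It is NOT part of this patch;
b, node and inode are as declared in find_inode_fast():

	rcu_read_lock();
	hlist_bl_for_each_entry_rcu(inode, node, &b->head, i_hash) {
		if (inode->i_ino != ino || inode->i_sb != sb)
			continue;
		spin_lock(&inode->i_lock);
		/* must revalidate: inode may be concurrently being freed */
		if (inode->i_state & (I_FREEING|I_WILL_FREE)) {
			spin_unlock(&inode->i_lock);
			continue;
		}
		__iget(inode);
		spin_unlock(&inode->i_lock);
		break;
	}
	rcu_read_unlock();

The struct inode memory cannot be returned to the allocator while the walker
is inside the RCU read-side critical section, which is what makes the
lockless list traversal safe.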

Signed-off-by: Nick Piggin <npiggin@kernel.dk>

---
 Documentation/filesystems/porting         |    4 
 arch/powerpc/platforms/cell/spufs/inode.c |   10 +-
 drivers/staging/pohmelfs/inode.c          |   11 ++
 fs/9p/vfs_inode.c                         |    9 +-
 fs/adfs/super.c                           |    9 +-
 fs/affs/super.c                           |    9 +-
 fs/afs/super.c                            |   10 ++
 fs/befs/linuxvfs.c                        |   10 +-
 fs/bfs/inode.c                            |    9 +-
 fs/block_dev.c                            |    9 +-
 fs/btrfs/inode.c                          |    9 +-
 fs/ceph/inode.c                           |   11 ++
 fs/cifs/cifsfs.c                          |    9 +-
 fs/coda/inode.c                           |    9 +-
 fs/ecryptfs/super.c                       |   12 ++
 fs/efs/super.c                            |    9 +-
 fs/exofs/super.c                          |    9 +-
 fs/ext2/super.c                           |    9 +-
 fs/ext3/super.c                           |    9 +-
 fs/ext4/super.c                           |    9 +-
 fs/fat/inode.c                            |    9 +-
 fs/freevxfs/vxfs_inode.c                  |    9 +-
 fs/fuse/inode.c                           |    9 +-
 fs/gfs2/super.c                           |    9 +-
 fs/hfs/super.c                            |    9 +-
 fs/hfsplus/super.c                        |    9 +-
 fs/hostfs/hostfs_kern.c                   |    9 +-
 fs/hpfs/super.c                           |    9 +-
 fs/hppfs/hppfs.c                          |    9 +-
 fs/hugetlbfs/inode.c                      |    9 +-
 fs/inode.c                                |  129 +++++++++++++++---------------
 fs/isofs/inode.c                          |    9 +-
 fs/jffs2/super.c                          |    9 +-
 fs/jfs/super.c                            |   10 ++
 fs/logfs/inode.c                          |    9 +-
 fs/minix/inode.c                          |    9 +-
 fs/ncpfs/inode.c                          |    9 +-
 fs/nfs/inode.c                            |    9 +-
 fs/nilfs2/super.c                         |    9 +-
 fs/ntfs/inode.c                           |    9 +-
 fs/ocfs2/dlmfs/dlmfs.c                    |    9 +-
 fs/ocfs2/super.c                          |    9 +-
 fs/openpromfs/inode.c                     |    9 +-
 fs/proc/inode.c                           |    9 +-
 fs/qnx4/inode.c                           |    9 +-
 fs/reiserfs/super.c                       |    9 +-
 fs/romfs/super.c                          |    9 +-
 fs/smbfs/inode.c                          |    9 +-
 fs/squashfs/super.c                       |    9 +-
 fs/sysv/inode.c                           |    9 +-
 fs/ubifs/super.c                          |   10 ++
 fs/udf/super.c                            |    9 +-
 fs/ufs/super.c                            |    9 +-
 fs/xfs/xfs_iget.c                         |   13 ++-
 include/linux/fs.h                        |    7 +
 include/linux/net.h                       |    1 
 ipc/mqueue.c                              |    9 +-
 mm/shmem.c                                |    9 +-
 net/socket.c                              |   16 +--
 net/sunrpc/rpc_pipe.c                     |   10 ++
 60 files changed, 539 insertions(+), 130 deletions(-)

Index: linux-2.6/fs/ext2/super.c
===================================================================
--- linux-2.6.orig/fs/ext2/super.c	2010-10-19 14:17:20.000000000 +1100
+++ linux-2.6/fs/ext2/super.c	2010-10-19 14:18:59.000000000 +1100
@@ -161,11 +161,18 @@
 	return &ei->vfs_inode;
 }
 
-static void ext2_destroy_inode(struct inode *inode)
+static void ext2_i_callback(struct rcu_head *head)
 {
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
 	kmem_cache_free(ext2_inode_cachep, EXT2_I(inode));
 }
 
+static void ext2_destroy_inode(struct inode *inode)
+{
+	call_rcu(&inode->i_rcu, ext2_i_callback);
+}
+
 static void init_once(void *foo)
 {
 	struct ext2_inode_info *ei = (struct ext2_inode_info *) foo;
Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c	2010-10-19 14:18:59.000000000 +1100
+++ linux-2.6/fs/inode.c	2010-10-19 14:19:26.000000000 +1100
@@ -278,13 +278,20 @@
 }
 EXPORT_SYMBOL(__destroy_inode);
 
+static void i_callback(struct rcu_head *head)
+{
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
+	kmem_cache_free(inode_cachep, inode);
+}
+
 void destroy_inode(struct inode *inode)
 {
 	__destroy_inode(inode);
 	if (inode->i_sb->s_op->destroy_inode)
 		inode->i_sb->s_op->destroy_inode(inode);
 	else
-		kmem_cache_free(inode_cachep, (inode));
+		call_rcu(&inode->i_rcu, i_callback);
 }
 
 /*
@@ -328,6 +335,7 @@
 	BUG_ON(!(inode->i_state & I_FREEING));
 	BUG_ON(inode->i_state & I_CLEAR);
 	inode_sync_wait(inode);
+	/* don't need i_lock here, no concurrent mods to i_state */
 	inode->i_state = I_FREEING | I_CLEAR;
 }
 EXPORT_SYMBOL(end_writeback);
@@ -691,7 +699,7 @@
 	spin_unlock(&sb_inode_list_lock);
 	if (b) {
 		spin_lock_bucket(b);
-		hlist_bl_add_head(&inode->i_hash, &b->head);
+		hlist_bl_add_head_rcu(&inode->i_hash, &b->head);
 		spin_unlock_bucket(b);
 	}
 }
@@ -1190,42 +1198,41 @@
 	struct super_block *sb = inode->i_sb;
 	ino_t ino = inode->i_ino;
 	struct inode_hash_bucket *b = inode_hashtable + hash(sb, ino);
+	struct hlist_bl_node *node;
+	struct inode *old;
 
 	inode->i_state |= I_NEW;
-	while (1) {
-		struct hlist_bl_node *node;
-		struct inode *old = NULL;
 
 repeat:
-		spin_lock_bucket(b);
-		hlist_bl_for_each_entry(old, node, &b->head, i_hash) {
-			if (old->i_ino != ino)
-				continue;
-			if (old->i_sb != sb)
-				continue;
-			if (old->i_state & (I_FREEING|I_WILL_FREE))
-				continue;
-			if (!spin_trylock(&old->i_lock)) {
-				spin_unlock_bucket(b);
-				goto repeat;
-			}
-			break;
-		}
-		if (likely(!node)) {
-			hlist_bl_add_head(&inode->i_hash, &b->head);
+	spin_lock_bucket(b);
+	hlist_bl_for_each_entry(old, node, &b->head, i_hash) {
+		if (old->i_ino != ino)
+			continue;
+		if (old->i_sb != sb)
+			continue;
+		if (old->i_state & (I_FREEING|I_WILL_FREE))
+			continue;
+		if (!spin_trylock(&old->i_lock)) {
 			spin_unlock_bucket(b);
-			return 0;
-		}
-		spin_unlock_bucket(b);
-		__iget(old);
-		spin_unlock(&old->i_lock);
-		wait_on_inode(old);
-		if (unlikely(!hlist_bl_unhashed(&old->i_hash))) {
-			iput(old);
-			return -EBUSY;
+			goto repeat;
 		}
+		goto found_old;
+	}
+	hlist_bl_add_head_rcu(&inode->i_hash, &b->head);
+	spin_unlock_bucket(b);
+	return 0;
+
+found_old:
+	spin_unlock_bucket(b);
+	__iget(old);
+	spin_unlock(&old->i_lock);
+	wait_on_inode(old);
+	if (unlikely(!hlist_bl_unhashed(&old->i_hash))) {
 		iput(old);
+		return -EBUSY;
 	}
+	iput(old);
+	goto repeat;
 }
 EXPORT_SYMBOL(insert_inode_locked);
 
@@ -1234,43 +1241,43 @@
 {
 	struct super_block *sb = inode->i_sb;
 	struct inode_hash_bucket *b = inode_hashtable + hash(sb, hashval);
+	struct hlist_bl_node *node;
+	struct inode *old;
 
 	inode->i_state |= I_NEW;
 
-	while (1) {
-		struct hlist_bl_node *node;
-		struct inode *old = NULL;
-
 repeat:
-		spin_lock_bucket(b);
-		hlist_bl_for_each_entry(old, node, &b->head, i_hash) {
-			if (old->i_sb != sb)
-				continue;
-			if (!test(old, data))
-				continue;
-			if (old->i_state & (I_FREEING|I_WILL_FREE))
-				continue;
-			if (!spin_trylock(&old->i_lock)) {
-				spin_unlock_bucket(b);
-				goto repeat;
-			}
-			break;
-		}
-		if (likely(!node)) {
-			hlist_bl_add_head(&inode->i_hash, &b->head);
+	spin_lock_bucket(b);
+	hlist_bl_for_each_entry(old, node, &b->head, i_hash) {
+		if (old->i_sb != sb)
+			continue;
+		/* XXX: audit put test outside i_lock? */
+		if (!test(old, data))
+			continue;
+		if (old->i_state & (I_FREEING|I_WILL_FREE))
+			continue;
+		if (!spin_trylock(&old->i_lock)) {
 			spin_unlock_bucket(b);
-			return 0;
-		}
-		spin_unlock_bucket(b);
-		__iget(old);
-		spin_unlock(&old->i_lock);
-		wait_on_inode(old);
-		if (unlikely(!hlist_bl_unhashed(&old->i_hash))) {
-			iput(old);
-			return -EBUSY;
+			cpu_relax();
+			goto repeat;
 		}
+		goto found_old;
+	}
+	hlist_bl_add_head_rcu(&inode->i_hash, &b->head);
+	spin_unlock_bucket(b);
+	return 0;
+
+found_old:
+	spin_unlock_bucket(b);
+	__iget(old);
+	spin_unlock(&old->i_lock);
+	wait_on_inode(old);
+	if (unlikely(!hlist_bl_unhashed(&old->i_hash))) {
 		iput(old);
+		return -EBUSY;
 	}
+	iput(old);
+	goto repeat;
 }
 EXPORT_SYMBOL(insert_inode_locked4);
 
@@ -1288,7 +1295,7 @@
 
 	spin_lock(&inode->i_lock);
 	spin_lock_bucket(b);
-	hlist_bl_add_head(&inode->i_hash, &b->head);
+	hlist_bl_add_head_rcu(&inode->i_hash, &b->head);
 	spin_unlock_bucket(b);
 	spin_unlock(&inode->i_lock);
 }
@@ -1305,7 +1312,7 @@
 {
 	struct inode_hash_bucket *b = inode_hashtable + hash(inode->i_sb, inode->i_ino);
 	spin_lock_bucket(b);
-	hlist_bl_del_init(&inode->i_hash);
+	hlist_bl_del_init_rcu(&inode->i_hash);
 	spin_unlock_bucket(b);
 }
 
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h	2010-10-19 14:18:59.000000000 +1100
+++ linux-2.6/include/linux/fs.h	2010-10-19 14:19:26.000000000 +1100
@@ -380,7 +380,7 @@
 #include <linux/cache.h>
 #include <linux/kobject.h>
 #include <linux/list.h>
-#include <linux/list_bl.h>
+#include <linux/rculist_bl.h>
 #include <linux/radix-tree.h>
 #include <linux/prio_tree.h>
 #include <linux/init.h>
@@ -732,7 +732,10 @@
 	struct hlist_bl_node	i_hash;
 	struct list_head	i_list;		/* backing dev IO list */
 	struct list_head	i_sb_list;
-	struct list_head	i_dentry;
+	union {
+		struct list_head	i_dentry;
+		struct rcu_head		i_rcu;
+	};
 	unsigned long		i_ino;
 	unsigned int		i_count;
 	unsigned int		i_nlink;
Index: linux-2.6/fs/block_dev.c
===================================================================
--- linux-2.6.orig/fs/block_dev.c	2010-10-19 14:18:58.000000000 +1100
+++ linux-2.6/fs/block_dev.c	2010-10-19 14:19:18.000000000 +1100
@@ -395,13 +395,20 @@
 	return &ei->vfs_inode;
 }
 
-static void bdev_destroy_inode(struct inode *inode)
+static void bdev_i_callback(struct rcu_head *head)
 {
+	struct inode *inode = container_of(head, struct inode, i_rcu);
 	struct bdev_inode *bdi = BDEV_I(inode);
 
+	INIT_LIST_HEAD(&inode->i_dentry);
 	kmem_cache_free(bdev_cachep, bdi);
 }
 
+static void bdev_destroy_inode(struct inode *inode)
+{
+	call_rcu(&inode->i_rcu, bdev_i_callback);
+}
+
 static void init_once(void *foo)
 {
 	struct bdev_inode *ei = (struct bdev_inode *) foo;
Index: linux-2.6/fs/ext3/super.c
===================================================================
--- linux-2.6.orig/fs/ext3/super.c	2010-10-19 14:17:21.000000000 +1100
+++ linux-2.6/fs/ext3/super.c	2010-10-19 14:18:59.000000000 +1100
@@ -485,6 +485,13 @@
 	return &ei->vfs_inode;
 }
 
+static void ext3_i_callback(struct rcu_head *head)
+{
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
+	kmem_cache_free(ext3_inode_cachep, EXT3_I(inode));
+}
+
 static void ext3_destroy_inode(struct inode *inode)
 {
 	if (!list_empty(&(EXT3_I(inode)->i_orphan))) {
@@ -495,7 +502,7 @@
 				false);
 		dump_stack();
 	}
-	kmem_cache_free(ext3_inode_cachep, EXT3_I(inode));
+	call_rcu(&inode->i_rcu, ext3_i_callback);
 }
 
 static void init_once(void *foo)
Index: linux-2.6/fs/hugetlbfs/inode.c
===================================================================
--- linux-2.6.orig/fs/hugetlbfs/inode.c	2010-10-19 14:17:20.000000000 +1100
+++ linux-2.6/fs/hugetlbfs/inode.c	2010-10-19 14:18:59.000000000 +1100
@@ -648,11 +648,18 @@
 	return &p->vfs_inode;
 }
 
+static void hugetlbfs_i_callback(struct rcu_head *head)
+{
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
+	kmem_cache_free(hugetlbfs_inode_cachep, HUGETLBFS_I(inode));
+}
+
 static void hugetlbfs_destroy_inode(struct inode *inode)
 {
 	hugetlbfs_inc_free_inodes(HUGETLBFS_SB(inode->i_sb));
 	mpol_free_shared_policy(&HUGETLBFS_I(inode)->policy);
-	kmem_cache_free(hugetlbfs_inode_cachep, HUGETLBFS_I(inode));
+	call_rcu(&inode->i_rcu, hugetlbfs_i_callback);
 }
 
 static const struct address_space_operations hugetlbfs_aops = {
Index: linux-2.6/fs/proc/inode.c
===================================================================
--- linux-2.6.orig/fs/proc/inode.c	2010-10-19 14:17:20.000000000 +1100
+++ linux-2.6/fs/proc/inode.c	2010-10-19 14:18:59.000000000 +1100
@@ -66,11 +66,18 @@
 	return inode;
 }
 
-static void proc_destroy_inode(struct inode *inode)
+static void proc_i_callback(struct rcu_head *head)
 {
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
 	kmem_cache_free(proc_inode_cachep, PROC_I(inode));
 }
 
+static void proc_destroy_inode(struct inode *inode)
+{
+	call_rcu(&inode->i_rcu, proc_i_callback);
+}
+
 static void init_once(void *foo)
 {
 	struct proc_inode *ei = (struct proc_inode *) foo;
Index: linux-2.6/ipc/mqueue.c
===================================================================
--- linux-2.6.orig/ipc/mqueue.c	2010-10-19 14:18:58.000000000 +1100
+++ linux-2.6/ipc/mqueue.c	2010-10-19 14:19:18.000000000 +1100
@@ -236,11 +236,18 @@
 	return &ei->vfs_inode;
 }
 
-static void mqueue_destroy_inode(struct inode *inode)
+static void mqueue_i_callback(struct rcu_head *head)
 {
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
 	kmem_cache_free(mqueue_inode_cachep, MQUEUE_I(inode));
 }
 
+static void mqueue_destroy_inode(struct inode *inode)
+{
+	call_rcu(&inode->i_rcu, mqueue_i_callback);
+}
+
 static void mqueue_evict_inode(struct inode *inode)
 {
 	struct mqueue_inode_info *info;
Index: linux-2.6/net/socket.c
===================================================================
--- linux-2.6.orig/net/socket.c	2010-10-19 14:18:58.000000000 +1100
+++ linux-2.6/net/socket.c	2010-10-19 14:19:26.000000000 +1100
@@ -262,20 +262,20 @@
 }
 
 
-static void wq_free_rcu(struct rcu_head *head)
+static void sock_free_rcu(struct rcu_head *head)
 {
-	struct socket_wq *wq = container_of(head, struct socket_wq, rcu);
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	struct socket_alloc *ei = container_of(inode, struct socket_alloc,
+								vfs_inode);
 
-	kfree(wq);
+	kfree(ei->socket.wq);
+	INIT_LIST_HEAD(&inode->i_dentry);
+	kmem_cache_free(sock_inode_cachep, ei);
 }
 
 static void sock_destroy_inode(struct inode *inode)
 {
-	struct socket_alloc *ei;
-
-	ei = container_of(inode, struct socket_alloc, vfs_inode);
-	call_rcu(&ei->socket.wq->rcu, wq_free_rcu);
-	kmem_cache_free(sock_inode_cachep, ei);
+	call_rcu(&inode->i_rcu, sock_free_rcu);
 }
 
 static void init_once(void *foo)
Index: linux-2.6/fs/fat/inode.c
===================================================================
--- linux-2.6.orig/fs/fat/inode.c	2010-10-19 14:17:20.000000000 +1100
+++ linux-2.6/fs/fat/inode.c	2010-10-19 14:18:59.000000000 +1100
@@ -519,11 +519,18 @@
 	return &ei->vfs_inode;
 }
 
-static void fat_destroy_inode(struct inode *inode)
+static void fat_i_callback(struct rcu_head *head)
 {
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
 	kmem_cache_free(fat_inode_cachep, MSDOS_I(inode));
 }
 
+static void fat_destroy_inode(struct inode *inode)
+{
+	call_rcu(&inode->i_rcu, fat_i_callback);
+}
+
 static void init_once(void *foo)
 {
 	struct msdos_inode_info *ei = (struct msdos_inode_info *)foo;
Index: linux-2.6/fs/nfs/inode.c
===================================================================
--- linux-2.6.orig/fs/nfs/inode.c	2010-10-19 14:18:58.000000000 +1100
+++ linux-2.6/fs/nfs/inode.c	2010-10-19 14:19:16.000000000 +1100
@@ -1434,11 +1434,18 @@
 	return &nfsi->vfs_inode;
 }
 
-void nfs_destroy_inode(struct inode *inode)
+static void nfs_i_callback(struct rcu_head *head)
 {
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
 	kmem_cache_free(nfs_inode_cachep, NFS_I(inode));
 }
 
+void nfs_destroy_inode(struct inode *inode)
+{
+	call_rcu(&inode->i_rcu, nfs_i_callback);
+}
+
 static inline void nfs4_init_once(struct nfs_inode *nfsi)
 {
 #ifdef CONFIG_NFS_V4
Index: linux-2.6/mm/shmem.c
===================================================================
--- linux-2.6.orig/mm/shmem.c	2010-10-19 14:18:59.000000000 +1100
+++ linux-2.6/mm/shmem.c	2010-10-19 14:19:18.000000000 +1100
@@ -2416,13 +2416,20 @@
 	return &p->vfs_inode;
 }
 
+static void shmem_i_callback(struct rcu_head *head)
+{
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
+	kmem_cache_free(shmem_inode_cachep, SHMEM_I(inode));
+}
+
 static void shmem_destroy_inode(struct inode *inode)
 {
 	if ((inode->i_mode & S_IFMT) == S_IFREG) {
 		/* only struct inode is valid if it's an inline symlink */
 		mpol_free_shared_policy(&SHMEM_I(inode)->policy);
 	}
-	kmem_cache_free(shmem_inode_cachep, SHMEM_I(inode));
+	call_rcu(&inode->i_rcu, shmem_i_callback);
 }
 
 static void init_once(void *foo)
Index: linux-2.6/net/sunrpc/rpc_pipe.c
===================================================================
--- linux-2.6.orig/net/sunrpc/rpc_pipe.c	2010-10-19 14:17:21.000000000 +1100
+++ linux-2.6/net/sunrpc/rpc_pipe.c	2010-10-19 14:18:59.000000000 +1100
@@ -163,11 +163,19 @@
 }
 
 static void
-rpc_destroy_inode(struct inode *inode)
+rpc_i_callback(struct rcu_head *head)
 {
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
 	kmem_cache_free(rpc_inode_cachep, RPC_I(inode));
 }
 
+static void
+rpc_destroy_inode(struct inode *inode)
+{
+	call_rcu(&inode->i_rcu, rpc_i_callback);
+}
+
 static int
 rpc_pipe_open(struct inode *inode, struct file *filp)
 {
Index: linux-2.6/include/linux/net.h
===================================================================
--- linux-2.6.orig/include/linux/net.h	2010-10-19 14:17:19.000000000 +1100
+++ linux-2.6/include/linux/net.h	2010-10-19 14:19:26.000000000 +1100
@@ -120,7 +120,6 @@
 struct socket_wq {
 	wait_queue_head_t	wait;
 	struct fasync_struct	*fasync_list;
-	struct rcu_head		rcu;
 } ____cacheline_aligned_in_smp;
 
 /**
Index: linux-2.6/fs/9p/vfs_inode.c
===================================================================
--- linux-2.6.orig/fs/9p/vfs_inode.c	2010-10-19 14:18:58.000000000 +1100
+++ linux-2.6/fs/9p/vfs_inode.c	2010-10-19 14:19:18.000000000 +1100
@@ -231,10 +231,17 @@
  *
  */
 
-void v9fs_destroy_inode(struct inode *inode)
+static void v9fs_i_callback(struct rcu_head *head)
 {
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
 	kmem_cache_free(vcookie_cache, v9fs_inode2cookie(inode));
 }
+
+void v9fs_destroy_inode(struct inode *inode)
+{
+	call_rcu(&inode->i_rcu, v9fs_i_callback);
+}
 #endif
 
 /**
Index: linux-2.6/fs/adfs/super.c
===================================================================
--- linux-2.6.orig/fs/adfs/super.c	2010-10-19 14:17:20.000000000 +1100
+++ linux-2.6/fs/adfs/super.c	2010-10-19 14:18:59.000000000 +1100
@@ -240,11 +240,18 @@
 	return &ei->vfs_inode;
 }
 
-static void adfs_destroy_inode(struct inode *inode)
+static void adfs_i_callback(struct rcu_head *head)
 {
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
 	kmem_cache_free(adfs_inode_cachep, ADFS_I(inode));
 }
 
+static void adfs_destroy_inode(struct inode *inode)
+{
+	call_rcu(&inode->i_rcu, adfs_i_callback);
+}
+
 static void init_once(void *foo)
 {
 	struct adfs_inode_info *ei = (struct adfs_inode_info *) foo;
Index: linux-2.6/fs/affs/super.c
===================================================================
--- linux-2.6.orig/fs/affs/super.c	2010-10-19 14:17:20.000000000 +1100
+++ linux-2.6/fs/affs/super.c	2010-10-19 14:18:59.000000000 +1100
@@ -100,11 +100,18 @@
 	return &i->vfs_inode;
 }
 
-static void affs_destroy_inode(struct inode *inode)
+static void affs_i_callback(struct rcu_head *head)
 {
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
 	kmem_cache_free(affs_inode_cachep, AFFS_I(inode));
 }
 
+static void affs_destroy_inode(struct inode *inode)
+{
+	call_rcu(&inode->i_rcu, affs_i_callback);
+}
+
 static void init_once(void *foo)
 {
 	struct affs_inode_info *ei = (struct affs_inode_info *) foo;
Index: linux-2.6/fs/afs/super.c
===================================================================
--- linux-2.6.orig/fs/afs/super.c	2010-10-19 14:17:21.000000000 +1100
+++ linux-2.6/fs/afs/super.c	2010-10-19 14:18:59.000000000 +1100
@@ -508,6 +508,14 @@
 	return &vnode->vfs_inode;
 }
 
+static void afs_i_callback(struct rcu_head *head)
+{
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	struct afs_vnode *vnode = AFS_FS_I(inode);
+	INIT_LIST_HEAD(&inode->i_dentry);
+	kmem_cache_free(afs_inode_cachep, vnode);
+}
+
 /*
  * destroy an AFS inode struct
  */
@@ -521,7 +529,7 @@
 
 	ASSERTCMP(vnode->server, ==, NULL);
 
-	kmem_cache_free(afs_inode_cachep, vnode);
+	call_rcu(&inode->i_rcu, afs_i_callback);
 	atomic_dec(&afs_count_active_inodes);
 }
 
Index: linux-2.6/fs/befs/linuxvfs.c
===================================================================
--- linux-2.6.orig/fs/befs/linuxvfs.c	2010-10-19 14:17:20.000000000 +1100
+++ linux-2.6/fs/befs/linuxvfs.c	2010-10-19 14:18:59.000000000 +1100
@@ -284,12 +284,18 @@
         return &bi->vfs_inode;
 }
 
-static void
-befs_destroy_inode(struct inode *inode)
+static void befs_i_callback(struct rcu_head *head)
 {
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
         kmem_cache_free(befs_inode_cachep, BEFS_I(inode));
 }
 
+static void befs_destroy_inode(struct inode *inode)
+{
+	call_rcu(&inode->i_rcu, befs_i_callback);
+}
+
 static void init_once(void *foo)
 {
         struct befs_inode_info *bi = (struct befs_inode_info *) foo;
Index: linux-2.6/fs/bfs/inode.c
===================================================================
--- linux-2.6.orig/fs/bfs/inode.c	2010-10-19 14:17:20.000000000 +1100
+++ linux-2.6/fs/bfs/inode.c	2010-10-19 14:18:59.000000000 +1100
@@ -253,11 +253,18 @@
 	return &bi->vfs_inode;
 }
 
-static void bfs_destroy_inode(struct inode *inode)
+static void bfs_i_callback(struct rcu_head *head)
 {
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
 	kmem_cache_free(bfs_inode_cachep, BFS_I(inode));
 }
 
+static void bfs_destroy_inode(struct inode *inode)
+{
+	call_rcu(&inode->i_rcu, bfs_i_callback);
+}
+
 static void init_once(void *foo)
 {
 	struct bfs_inode_info *bi = foo;
Index: linux-2.6/fs/btrfs/inode.c
===================================================================
--- linux-2.6.orig/fs/btrfs/inode.c	2010-10-19 14:18:59.000000000 +1100
+++ linux-2.6/fs/btrfs/inode.c	2010-10-19 14:19:18.000000000 +1100
@@ -6286,6 +6286,13 @@
 	return inode;
 }
 
+static void btrfs_i_callback(struct rcu_head *head)
+{
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
+	kmem_cache_free(btrfs_inode_cachep, BTRFS_I(inode));
+}
+
 void btrfs_destroy_inode(struct inode *inode)
 {
 	struct btrfs_ordered_extent *ordered;
@@ -6340,7 +6347,7 @@
 	inode_tree_del(inode);
 	btrfs_drop_extent_cache(inode, 0, (u64)-1, 0);
 free:
-	kmem_cache_free(btrfs_inode_cachep, BTRFS_I(inode));
+	call_rcu(&inode->i_rcu, btrfs_i_callback);
 }
 
 int btrfs_drop_inode(struct inode *inode)
Index: linux-2.6/fs/ceph/inode.c
===================================================================
--- linux-2.6.orig/fs/ceph/inode.c	2010-10-19 14:17:20.000000000 +1100
+++ linux-2.6/fs/ceph/inode.c	2010-10-19 14:18:59.000000000 +1100
@@ -368,6 +368,15 @@
 	return &ci->vfs_inode;
 }
 
+static void ceph_i_callback(struct rcu_head *head)
+{
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	struct ceph_inode_info *ci = ceph_inode(inode);
+
+	INIT_LIST_HEAD(&inode->i_dentry);
+	kmem_cache_free(ceph_inode_cachep, ci);
+}
+
 void ceph_destroy_inode(struct inode *inode)
 {
 	struct ceph_inode_info *ci = ceph_inode(inode);
@@ -407,7 +416,7 @@
 	if (ci->i_xattrs.prealloc_blob)
 		ceph_buffer_put(ci->i_xattrs.prealloc_blob);
 
-	kmem_cache_free(ceph_inode_cachep, ci);
+	call_rcu(&inode->i_rcu, ceph_i_callback);
 }
 
 
Index: linux-2.6/fs/cifs/cifsfs.c
===================================================================
--- linux-2.6.orig/fs/cifs/cifsfs.c	2010-10-19 14:17:20.000000000 +1100
+++ linux-2.6/fs/cifs/cifsfs.c	2010-10-19 14:18:59.000000000 +1100
@@ -322,10 +322,17 @@
 	return &cifs_inode->vfs_inode;
 }
 
+static void cifs_i_callback(struct rcu_head *head)
+{
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
+	kmem_cache_free(cifs_inode_cachep, CIFS_I(inode));
+}
+
 static void
 cifs_destroy_inode(struct inode *inode)
 {
-	kmem_cache_free(cifs_inode_cachep, CIFS_I(inode));
+	call_rcu(&inode->i_rcu, cifs_i_callback);
 }
 
 static void
Index: linux-2.6/fs/coda/inode.c
===================================================================
--- linux-2.6.orig/fs/coda/inode.c	2010-10-19 14:17:20.000000000 +1100
+++ linux-2.6/fs/coda/inode.c	2010-10-19 14:18:59.000000000 +1100
@@ -54,11 +54,18 @@
 	return &ei->vfs_inode;
 }
 
-static void coda_destroy_inode(struct inode *inode)
+static void coda_i_callback(struct rcu_head *head)
 {
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
 	kmem_cache_free(coda_inode_cachep, ITOC(inode));
 }
 
+static void coda_destroy_inode(struct inode *inode)
+{
+	call_rcu(&inode->i_rcu, coda_i_callback);
+}
+
 static void init_once(void *foo)
 {
 	struct coda_inode_info *ei = (struct coda_inode_info *) foo;
Index: linux-2.6/fs/ecryptfs/super.c
===================================================================
--- linux-2.6.orig/fs/ecryptfs/super.c	2010-10-19 14:17:21.000000000 +1100
+++ linux-2.6/fs/ecryptfs/super.c	2010-10-19 14:18:59.000000000 +1100
@@ -63,6 +63,16 @@
 	return inode;
 }
 
+static void ecryptfs_i_callback(struct rcu_head *head)
+{
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	struct ecryptfs_inode_info *inode_info;
+	inode_info = ecryptfs_inode_to_private(inode);
+
+	INIT_LIST_HEAD(&inode->i_dentry);
+	kmem_cache_free(ecryptfs_inode_info_cache, inode_info);
+}
+
 /**
  * ecryptfs_destroy_inode
  * @inode: The ecryptfs inode
@@ -89,7 +99,7 @@
 		}
 	}
 	ecryptfs_destroy_crypt_stat(&inode_info->crypt_stat);
-	kmem_cache_free(ecryptfs_inode_info_cache, inode_info);
+	call_rcu(&inode->i_rcu, ecryptfs_i_callback);
 }
 
 /**
Index: linux-2.6/fs/efs/super.c
===================================================================
--- linux-2.6.orig/fs/efs/super.c	2010-10-19 14:17:20.000000000 +1100
+++ linux-2.6/fs/efs/super.c	2010-10-19 14:18:59.000000000 +1100
@@ -65,11 +65,18 @@
 	return &ei->vfs_inode;
 }
 
-static void efs_destroy_inode(struct inode *inode)
+static void efs_i_callback(struct rcu_head *head)
 {
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
 	kmem_cache_free(efs_inode_cachep, INODE_INFO(inode));
 }
 
+static void efs_destroy_inode(struct inode *inode)
+{
+	call_rcu(&inode->i_rcu, efs_i_callback);
+}
+
 static void init_once(void *foo)
 {
 	struct efs_inode_info *ei = (struct efs_inode_info *) foo;
Index: linux-2.6/fs/exofs/super.c
===================================================================
--- linux-2.6.orig/fs/exofs/super.c	2010-10-19 14:17:20.000000000 +1100
+++ linux-2.6/fs/exofs/super.c	2010-10-19 14:18:59.000000000 +1100
@@ -150,12 +150,19 @@
 	return &oi->vfs_inode;
 }
 
+static void exofs_i_callback(struct rcu_head *head)
+{
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
+	kmem_cache_free(exofs_inode_cachep, exofs_i(inode));
+}
+
 /*
  * Remove an inode from the cache
  */
 static void exofs_destroy_inode(struct inode *inode)
 {
-	kmem_cache_free(exofs_inode_cachep, exofs_i(inode));
+	call_rcu(&inode->i_rcu, exofs_i_callback);
 }
 
 /*
Index: linux-2.6/fs/ext4/super.c
===================================================================
--- linux-2.6.orig/fs/ext4/super.c	2010-10-19 14:17:20.000000000 +1100
+++ linux-2.6/fs/ext4/super.c	2010-10-19 14:18:59.000000000 +1100
@@ -825,6 +825,13 @@
 	return &ei->vfs_inode;
 }
 
+static void ext4_i_callback(struct rcu_head *head)
+{
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
+	kmem_cache_free(ext4_inode_cachep, EXT4_I(inode));
+}
+
 static void ext4_destroy_inode(struct inode *inode)
 {
 	if (!list_empty(&(EXT4_I(inode)->i_orphan))) {
@@ -836,7 +843,7 @@
 				true);
 		dump_stack();
 	}
-	kmem_cache_free(ext4_inode_cachep, EXT4_I(inode));
+	call_rcu(&inode->i_rcu, ext4_i_callback);
 }
 
 static void init_once(void *foo)
Index: linux-2.6/fs/freevxfs/vxfs_inode.c
===================================================================
--- linux-2.6.orig/fs/freevxfs/vxfs_inode.c	2010-10-19 14:17:20.000000000 +1100
+++ linux-2.6/fs/freevxfs/vxfs_inode.c	2010-10-19 14:18:59.000000000 +1100
@@ -336,6 +336,13 @@
 	return ip;
 }
 
+static void vxfs_i_callback(struct rcu_head *head)
+{
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
+	kmem_cache_free(vxfs_inode_cachep, inode->i_private);
+}
+
 /**
  * vxfs_evict_inode - remove inode from main memory
  * @ip:		inode to discard.
@@ -349,5 +356,5 @@
 {
 	truncate_inode_pages(&ip->i_data, 0);
 	end_writeback(ip);
-	kmem_cache_free(vxfs_inode_cachep, ip->i_private);
+	call_rcu(&ip->i_rcu, vxfs_i_callback);
 }
Index: linux-2.6/fs/fuse/inode.c
===================================================================
--- linux-2.6.orig/fs/fuse/inode.c	2010-10-19 14:17:21.000000000 +1100
+++ linux-2.6/fs/fuse/inode.c	2010-10-19 14:18:59.000000000 +1100
@@ -99,6 +99,13 @@
 	return inode;
 }
 
+static void fuse_i_callback(struct rcu_head *head)
+{
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
+	kmem_cache_free(fuse_inode_cachep, inode);
+}
+
 static void fuse_destroy_inode(struct inode *inode)
 {
 	struct fuse_inode *fi = get_fuse_inode(inode);
@@ -106,7 +113,7 @@
 	BUG_ON(!list_empty(&fi->queued_writes));
 	if (fi->forget_req)
 		fuse_request_free(fi->forget_req);
-	kmem_cache_free(fuse_inode_cachep, inode);
+	call_rcu(&inode->i_rcu, fuse_i_callback);
 }
 
 void fuse_send_forget(struct fuse_conn *fc, struct fuse_req *req,
Index: linux-2.6/fs/gfs2/super.c
===================================================================
--- linux-2.6.orig/fs/gfs2/super.c	2010-10-19 14:17:21.000000000 +1100
+++ linux-2.6/fs/gfs2/super.c	2010-10-19 14:18:59.000000000 +1100
@@ -1414,11 +1414,18 @@
 	return &ip->i_inode;
 }
 
-static void gfs2_destroy_inode(struct inode *inode)
+static void gfs2_i_callback(struct rcu_head *head)
 {
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
 	kmem_cache_free(gfs2_inode_cachep, inode);
 }
 
+static void gfs2_destroy_inode(struct inode *inode)
+{
+	call_rcu(&inode->i_rcu, gfs2_i_callback);
+}
+
 const struct super_operations gfs2_super_ops = {
 	.alloc_inode		= gfs2_alloc_inode,
 	.destroy_inode		= gfs2_destroy_inode,
Index: linux-2.6/fs/hfs/super.c
===================================================================
--- linux-2.6.orig/fs/hfs/super.c	2010-10-19 14:17:20.000000000 +1100
+++ linux-2.6/fs/hfs/super.c	2010-10-19 14:18:59.000000000 +1100
@@ -172,11 +172,18 @@
 	return i ? &i->vfs_inode : NULL;
 }
 
-static void hfs_destroy_inode(struct inode *inode)
+static void hfs_i_callback(struct rcu_head *head)
 {
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
 	kmem_cache_free(hfs_inode_cachep, HFS_I(inode));
 }
 
+static void hfs_destroy_inode(struct inode *inode)
+{
+	call_rcu(&inode->i_rcu, hfs_i_callback);
+}
+
 static const struct super_operations hfs_super_operations = {
 	.alloc_inode	= hfs_alloc_inode,
 	.destroy_inode	= hfs_destroy_inode,
Index: linux-2.6/fs/hfsplus/super.c
===================================================================
--- linux-2.6.orig/fs/hfsplus/super.c	2010-10-19 14:17:20.000000000 +1100
+++ linux-2.6/fs/hfsplus/super.c	2010-10-19 14:18:59.000000000 +1100
@@ -484,11 +484,18 @@
 	return i ? &i->vfs_inode : NULL;
 }
 
-static void hfsplus_destroy_inode(struct inode *inode)
+static void hfsplus_i_callback(struct rcu_head *head)
 {
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
 	kmem_cache_free(hfsplus_inode_cachep, &HFSPLUS_I(inode));
 }
 
+static void hfsplus_destroy_inode(struct inode *inode)
+{
+	call_rcu(&inode->i_rcu, hfsplus_i_callback);
+}
+
 #define HFSPLUS_INODE_SIZE	sizeof(struct hfsplus_inode_info)
 
 static int hfsplus_get_sb(struct file_system_type *fs_type,
Index: linux-2.6/fs/hpfs/super.c
===================================================================
--- linux-2.6.orig/fs/hpfs/super.c	2010-10-19 14:17:20.000000000 +1100
+++ linux-2.6/fs/hpfs/super.c	2010-10-19 14:18:59.000000000 +1100
@@ -177,11 +177,18 @@
 	return &ei->vfs_inode;
 }
 
-static void hpfs_destroy_inode(struct inode *inode)
+static void hpfs_i_callback(struct rcu_head *head)
 {
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
 	kmem_cache_free(hpfs_inode_cachep, hpfs_i(inode));
 }
 
+static void hpfs_destroy_inode(struct inode *inode)
+{
+	call_rcu(&inode->i_rcu, hpfs_i_callback);
+}
+
 static void init_once(void *foo)
 {
 	struct hpfs_inode_info *ei = (struct hpfs_inode_info *) foo;
Index: linux-2.6/fs/isofs/inode.c
===================================================================
--- linux-2.6.orig/fs/isofs/inode.c	2010-10-19 14:17:21.000000000 +1100
+++ linux-2.6/fs/isofs/inode.c	2010-10-19 14:18:59.000000000 +1100
@@ -70,11 +70,18 @@
 	return &ei->vfs_inode;
 }
 
-static void isofs_destroy_inode(struct inode *inode)
+static void isofs_i_callback(struct rcu_head *head)
 {
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
 	kmem_cache_free(isofs_inode_cachep, ISOFS_I(inode));
 }
 
+static void isofs_destroy_inode(struct inode *inode)
+{
+	call_rcu(&inode->i_rcu, isofs_i_callback);
+}
+
 static void init_once(void *foo)
 {
 	struct iso_inode_info *ei = foo;
Index: linux-2.6/fs/jffs2/super.c
===================================================================
--- linux-2.6.orig/fs/jffs2/super.c	2010-10-19 14:17:20.000000000 +1100
+++ linux-2.6/fs/jffs2/super.c	2010-10-19 14:18:59.000000000 +1100
@@ -41,11 +41,18 @@
 	return &f->vfs_inode;
 }
 
-static void jffs2_destroy_inode(struct inode *inode)
+static void jffs2_i_callback(struct rcu_head *head)
 {
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
 	kmem_cache_free(jffs2_inode_cachep, JFFS2_INODE_INFO(inode));
 }
 
+static void jffs2_destroy_inode(struct inode *inode)
+{
+	call_rcu(&inode->i_rcu, jffs2_i_callback);
+}
+
 static void jffs2_i_init_once(void *foo)
 {
 	struct jffs2_inode_info *f = foo;
Index: linux-2.6/fs/jfs/super.c
===================================================================
--- linux-2.6.orig/fs/jfs/super.c	2010-10-19 14:17:21.000000000 +1100
+++ linux-2.6/fs/jfs/super.c	2010-10-19 14:18:59.000000000 +1100
@@ -116,6 +116,14 @@
 	return &jfs_inode->vfs_inode;
 }
 
+static void jfs_i_callback(struct rcu_head *head)
+{
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	struct jfs_inode_info *ji = JFS_IP(inode);
+	INIT_LIST_HEAD(&inode->i_dentry);
+	kmem_cache_free(jfs_inode_cachep, ji);
+}
+
 static void jfs_destroy_inode(struct inode *inode)
 {
 	struct jfs_inode_info *ji = JFS_IP(inode);
@@ -129,7 +137,7 @@
 		ji->active_ag = -1;
 	}
 	spin_unlock_irq(&ji->ag_lock);
-	kmem_cache_free(jfs_inode_cachep, ji);
+	call_rcu(&inode->i_rcu, jfs_i_callback);
 }
 
 static int jfs_statfs(struct dentry *dentry, struct kstatfs *buf)
Index: linux-2.6/fs/logfs/inode.c
===================================================================
--- linux-2.6.orig/fs/logfs/inode.c	2010-10-19 14:17:20.000000000 +1100
+++ linux-2.6/fs/logfs/inode.c	2010-10-19 14:18:59.000000000 +1100
@@ -141,13 +141,20 @@
 	return __logfs_iget(sb, ino);
 }
 
+static void logfs_i_callback(struct rcu_head *head)
+{
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
+	kmem_cache_free(logfs_inode_cache, logfs_inode(inode));
+}
+
 static void __logfs_destroy_inode(struct inode *inode)
 {
 	struct logfs_inode *li = logfs_inode(inode);
 
 	BUG_ON(li->li_block);
 	list_del(&li->li_freeing_list);
-	kmem_cache_free(logfs_inode_cache, li);
+	call_rcu(&inode->i_rcu, logfs_i_callback);
 }
 
 static void logfs_destroy_inode(struct inode *inode)
Index: linux-2.6/fs/minix/inode.c
===================================================================
--- linux-2.6.orig/fs/minix/inode.c	2010-10-19 14:17:21.000000000 +1100
+++ linux-2.6/fs/minix/inode.c	2010-10-19 14:18:59.000000000 +1100
@@ -68,11 +68,18 @@
 	return &ei->vfs_inode;
 }
 
-static void minix_destroy_inode(struct inode *inode)
+static void minix_i_callback(struct rcu_head *head)
 {
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
 	kmem_cache_free(minix_inode_cachep, minix_i(inode));
 }
 
+static void minix_destroy_inode(struct inode *inode)
+{
+	call_rcu(&inode->i_rcu, minix_i_callback);
+}
+
 static void init_once(void *foo)
 {
 	struct minix_inode_info *ei = (struct minix_inode_info *) foo;
Index: linux-2.6/fs/ncpfs/inode.c
===================================================================
--- linux-2.6.orig/fs/ncpfs/inode.c	2010-10-19 14:17:20.000000000 +1100
+++ linux-2.6/fs/ncpfs/inode.c	2010-10-19 14:18:59.000000000 +1100
@@ -59,11 +59,18 @@
 	return &ei->vfs_inode;
 }
 
-static void ncp_destroy_inode(struct inode *inode)
+static void ncp_i_callback(struct rcu_head *head)
 {
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
 	kmem_cache_free(ncp_inode_cachep, NCP_FINFO(inode));
 }
 
+static void ncp_destroy_inode(struct inode *inode)
+{
+	call_rcu(&inode->i_rcu, ncp_i_callback);
+}
+
 static void init_once(void *foo)
 {
 	struct ncp_inode_info *ei = (struct ncp_inode_info *) foo;
Index: linux-2.6/fs/nilfs2/super.c
===================================================================
--- linux-2.6.orig/fs/nilfs2/super.c	2010-10-19 14:17:20.000000000 +1100
+++ linux-2.6/fs/nilfs2/super.c	2010-10-19 14:18:59.000000000 +1100
@@ -166,11 +166,18 @@
 	return nilfs_alloc_inode_common(NILFS_SB(sb)->s_nilfs);
 }
 
-void nilfs_destroy_inode(struct inode *inode)
+static void nilfs_i_callback(struct rcu_head *head)
 {
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
 	kmem_cache_free(nilfs_inode_cachep, NILFS_I(inode));
 }
 
+void nilfs_destroy_inode(struct inode *inode)
+{
+	call_rcu(&inode->i_rcu, nilfs_i_callback);
+}
+
 static int nilfs_sync_super(struct nilfs_sb_info *sbi, int flag)
 {
 	struct the_nilfs *nilfs = sbi->s_nilfs;
Index: linux-2.6/fs/ntfs/inode.c
===================================================================
--- linux-2.6.orig/fs/ntfs/inode.c	2010-10-19 14:18:59.000000000 +1100
+++ linux-2.6/fs/ntfs/inode.c	2010-10-19 14:19:16.000000000 +1100
@@ -332,6 +332,13 @@
 	return NULL;
 }
 
+static void ntfs_i_callback(struct rcu_head *head)
+{
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
+	kmem_cache_free(ntfs_big_inode_cache, NTFS_I(inode));
+}
+
 void ntfs_destroy_big_inode(struct inode *inode)
 {
 	ntfs_inode *ni = NTFS_I(inode);
@@ -340,7 +347,7 @@
 	BUG_ON(ni->page);
 	if (!atomic_dec_and_test(&ni->count))
 		BUG();
-	kmem_cache_free(ntfs_big_inode_cache, NTFS_I(inode));
+	call_rcu(&inode->i_rcu, ntfs_i_callback);
 }
 
 static inline ntfs_inode *ntfs_alloc_extent_inode(void)
Index: linux-2.6/fs/ocfs2/dlmfs/dlmfs.c
===================================================================
--- linux-2.6.orig/fs/ocfs2/dlmfs/dlmfs.c	2010-10-19 14:17:21.000000000 +1100
+++ linux-2.6/fs/ocfs2/dlmfs/dlmfs.c	2010-10-19 14:18:59.000000000 +1100
@@ -351,11 +351,18 @@
 	return &ip->ip_vfs_inode;
 }
 
-static void dlmfs_destroy_inode(struct inode *inode)
+static void dlmfs_i_callback(struct rcu_head *head)
 {
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
 	kmem_cache_free(dlmfs_inode_cache, DLMFS_I(inode));
 }
 
+static void dlmfs_destroy_inode(struct inode *inode)
+{
+	call_rcu(&inode->i_rcu, dlmfs_i_callback);
+}
+
 static void dlmfs_evict_inode(struct inode *inode)
 {
 	int status;
Index: linux-2.6/fs/ocfs2/super.c
===================================================================
--- linux-2.6.orig/fs/ocfs2/super.c	2010-10-19 14:17:21.000000000 +1100
+++ linux-2.6/fs/ocfs2/super.c	2010-10-19 14:18:59.000000000 +1100
@@ -550,11 +550,18 @@
 	return &oi->vfs_inode;
 }
 
-static void ocfs2_destroy_inode(struct inode *inode)
+static void ocfs2_i_callback(struct rcu_head *head)
 {
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
 	kmem_cache_free(ocfs2_inode_cachep, OCFS2_I(inode));
 }
 
+static void ocfs2_destroy_inode(struct inode *inode)
+{
+	call_rcu(&inode->i_rcu, ocfs2_i_callback);
+}
+
 static unsigned long long ocfs2_max_file_offset(unsigned int bbits,
 						unsigned int cbits)
 {
Index: linux-2.6/fs/openpromfs/inode.c
===================================================================
--- linux-2.6.orig/fs/openpromfs/inode.c	2010-10-19 14:17:20.000000000 +1100
+++ linux-2.6/fs/openpromfs/inode.c	2010-10-19 14:18:59.000000000 +1100
@@ -343,11 +343,18 @@
 	return &oi->vfs_inode;
 }
 
-static void openprom_destroy_inode(struct inode *inode)
+static void openprom_i_callback(struct rcu_head *head)
 {
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
 	kmem_cache_free(op_inode_cachep, OP_I(inode));
 }
 
+static void openprom_destroy_inode(struct inode *inode)
+{
+	call_rcu(&inode->i_rcu, openprom_i_callback);
+}
+
 static struct inode *openprom_iget(struct super_block *sb, ino_t ino)
 {
 	struct inode *inode;
Index: linux-2.6/fs/qnx4/inode.c
===================================================================
--- linux-2.6.orig/fs/qnx4/inode.c	2010-10-19 14:17:20.000000000 +1100
+++ linux-2.6/fs/qnx4/inode.c	2010-10-19 14:18:59.000000000 +1100
@@ -431,11 +431,18 @@
 	return &ei->vfs_inode;
 }
 
-static void qnx4_destroy_inode(struct inode *inode)
+static void qnx4_i_callback(struct rcu_head *head)
 {
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
 	kmem_cache_free(qnx4_inode_cachep, qnx4_i(inode));
 }
 
+static void qnx4_destroy_inode(struct inode *inode)
+{
+	call_rcu(&inode->i_rcu, qnx4_i_callback);
+}
+
 static void init_once(void *foo)
 {
 	struct qnx4_inode_info *ei = (struct qnx4_inode_info *) foo;
Index: linux-2.6/fs/reiserfs/super.c
===================================================================
--- linux-2.6.orig/fs/reiserfs/super.c	2010-10-19 14:17:20.000000000 +1100
+++ linux-2.6/fs/reiserfs/super.c	2010-10-19 14:18:59.000000000 +1100
@@ -530,11 +530,18 @@
 	return &ei->vfs_inode;
 }
 
-static void reiserfs_destroy_inode(struct inode *inode)
+static void reiserfs_i_callback(struct rcu_head *head)
 {
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
 	kmem_cache_free(reiserfs_inode_cachep, REISERFS_I(inode));
 }
 
+static void reiserfs_destroy_inode(struct inode *inode)
+{
+	call_rcu(&inode->i_rcu, reiserfs_i_callback);
+}
+
 static void init_once(void *foo)
 {
 	struct reiserfs_inode_info *ei = (struct reiserfs_inode_info *)foo;
Index: linux-2.6/fs/romfs/super.c
===================================================================
--- linux-2.6.orig/fs/romfs/super.c	2010-10-19 14:17:20.000000000 +1100
+++ linux-2.6/fs/romfs/super.c	2010-10-19 14:18:59.000000000 +1100
@@ -399,11 +399,18 @@
 /*
  * return a spent inode to the slab cache
  */
-static void romfs_destroy_inode(struct inode *inode)
+static void romfs_i_callback(struct rcu_head *head)
 {
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
 	kmem_cache_free(romfs_inode_cachep, ROMFS_I(inode));
 }
 
+static void romfs_destroy_inode(struct inode *inode)
+{
+	call_rcu(&inode->i_rcu, romfs_i_callback);
+}
+
 /*
  * get filesystem statistics
  */
Index: linux-2.6/fs/smbfs/inode.c
===================================================================
--- linux-2.6.orig/fs/smbfs/inode.c	2010-10-19 14:17:20.000000000 +1100
+++ linux-2.6/fs/smbfs/inode.c	2010-10-19 14:19:16.000000000 +1100
@@ -62,11 +62,18 @@
 	return &ei->vfs_inode;
 }
 
-static void smb_destroy_inode(struct inode *inode)
+static void smb_i_callback(struct rcu_head *head)
 {
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
 	kmem_cache_free(smb_inode_cachep, SMB_I(inode));
 }
 
+static void smb_destroy_inode(struct inode *inode)
+{
+	call_rcu(&inode->i_rcu, smb_i_callback);
+}
+
 static void init_once(void *foo)
 {
 	struct smb_inode_info *ei = (struct smb_inode_info *) foo;
Index: linux-2.6/fs/squashfs/super.c
===================================================================
--- linux-2.6.orig/fs/squashfs/super.c	2010-10-19 14:17:20.000000000 +1100
+++ linux-2.6/fs/squashfs/super.c	2010-10-19 14:18:59.000000000 +1100
@@ -447,11 +447,18 @@
 }
 
 
-static void squashfs_destroy_inode(struct inode *inode)
+static void squashfs_i_callback(struct rcu_head *head)
 {
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
 	kmem_cache_free(squashfs_inode_cachep, squashfs_i(inode));
 }
 
+static void squashfs_destroy_inode(struct inode *inode)
+{
+	call_rcu(&inode->i_rcu, squashfs_i_callback);
+}
+
 
 static struct file_system_type squashfs_fs_type = {
 	.owner = THIS_MODULE,
Index: linux-2.6/fs/sysv/inode.c
===================================================================
--- linux-2.6.orig/fs/sysv/inode.c	2010-10-19 14:17:21.000000000 +1100
+++ linux-2.6/fs/sysv/inode.c	2010-10-19 14:18:59.000000000 +1100
@@ -333,11 +333,18 @@
 	return &si->vfs_inode;
 }
 
-static void sysv_destroy_inode(struct inode *inode)
+static void sysv_i_callback(struct rcu_head *head)
 {
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
 	kmem_cache_free(sysv_inode_cachep, SYSV_I(inode));
 }
 
+static void sysv_destroy_inode(struct inode *inode)
+{
+	call_rcu(&inode->i_rcu, sysv_i_callback);
+}
+
 static void init_once(void *p)
 {
 	struct sysv_inode_info *si = (struct sysv_inode_info *)p;
Index: linux-2.6/fs/ubifs/super.c
===================================================================
--- linux-2.6.orig/fs/ubifs/super.c	2010-10-19 14:18:58.000000000 +1100
+++ linux-2.6/fs/ubifs/super.c	2010-10-19 14:19:16.000000000 +1100
@@ -272,12 +272,20 @@
 	return &ui->vfs_inode;
 };
 
+static void ubifs_i_callback(struct rcu_head *head)
+{
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	struct ubifs_inode *ui = ubifs_inode(inode);
+	INIT_LIST_HEAD(&inode->i_dentry);
+	kmem_cache_free(ubifs_inode_slab, ui);
+}
+
 static void ubifs_destroy_inode(struct inode *inode)
 {
 	struct ubifs_inode *ui = ubifs_inode(inode);
 
 	kfree(ui->data);
-	kmem_cache_free(ubifs_inode_slab, inode);
+	call_rcu(&inode->i_rcu, ubifs_i_callback);
 }
 
 /*
Index: linux-2.6/fs/udf/super.c
===================================================================
--- linux-2.6.orig/fs/udf/super.c	2010-10-19 14:17:20.000000000 +1100
+++ linux-2.6/fs/udf/super.c	2010-10-19 14:18:59.000000000 +1100
@@ -140,11 +140,18 @@
 	return &ei->vfs_inode;
 }
 
-static void udf_destroy_inode(struct inode *inode)
+static void udf_i_callback(struct rcu_head *head)
 {
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
 	kmem_cache_free(udf_inode_cachep, UDF_I(inode));
 }
 
+static void udf_destroy_inode(struct inode *inode)
+{
+	call_rcu(&inode->i_rcu, udf_i_callback);
+}
+
 static void init_once(void *foo)
 {
 	struct udf_inode_info *ei = (struct udf_inode_info *)foo;
Index: linux-2.6/fs/ufs/super.c
===================================================================
--- linux-2.6.orig/fs/ufs/super.c	2010-10-19 14:17:20.000000000 +1100
+++ linux-2.6/fs/ufs/super.c	2010-10-19 14:18:59.000000000 +1100
@@ -1407,11 +1407,18 @@
 	return &ei->vfs_inode;
 }
 
-static void ufs_destroy_inode(struct inode *inode)
+static void ufs_i_callback(struct rcu_head *head)
 {
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
 	kmem_cache_free(ufs_inode_cachep, UFS_I(inode));
 }
 
+static void ufs_destroy_inode(struct inode *inode)
+{
+	call_rcu(&inode->i_rcu, ufs_i_callback);
+}
+
 static void init_once(void *foo)
 {
 	struct ufs_inode_info *ei = (struct ufs_inode_info *) foo;
Index: linux-2.6/Documentation/filesystems/porting
===================================================================
--- linux-2.6.orig/Documentation/filesystems/porting	2010-10-19 14:18:59.000000000 +1100
+++ linux-2.6/Documentation/filesystems/porting	2010-10-19 14:18:59.000000000 +1100
@@ -326,3 +326,7 @@
 particular things. Most of the time, a filesystem only needs ->i_lock, which
 protects *all* the inode state and its membership on lists that was
 previously protected with inode_lock.
+
+--
+[mandatory]
+	Filesystems must RCU-free their inodes. Lots of examples.
Index: linux-2.6/arch/powerpc/platforms/cell/spufs/inode.c
===================================================================
--- linux-2.6.orig/arch/powerpc/platforms/cell/spufs/inode.c	2010-10-19 14:17:20.000000000 +1100
+++ linux-2.6/arch/powerpc/platforms/cell/spufs/inode.c	2010-10-19 14:18:59.000000000 +1100
@@ -71,12 +71,18 @@
 	return &ei->vfs_inode;
 }
 
-static void
-spufs_destroy_inode(struct inode *inode)
+static void spufs_i_callback(struct rcu_head *head)
 {
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
 	kmem_cache_free(spufs_inode_cache, SPUFS_I(inode));
 }
 
+static void spufs_destroy_inode(struct inode *inode)
+{
+	call_rcu(&inode->i_rcu, spufs_i_callback);
+}
+
 static void
 spufs_init_once(void *p)
 {
Index: linux-2.6/drivers/staging/pohmelfs/inode.c
===================================================================
--- linux-2.6.orig/drivers/staging/pohmelfs/inode.c	2010-10-19 14:18:58.000000000 +1100
+++ linux-2.6/drivers/staging/pohmelfs/inode.c	2010-10-19 14:19:16.000000000 +1100
@@ -826,6 +826,13 @@
 	.set_page_dirty 	= __set_page_dirty_nobuffers,
 };
 
+static void pohmelfs_i_callback(struct rcu_head *head)
+{
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
+	kmem_cache_free(pohmelfs_inode_cache, POHMELFS_I(inode));
+}
+
 /*
  * ->detroy_inode() callback. Deletes inode from the caches
  *  and frees private data.
@@ -842,8 +849,8 @@
 
 	dprintk("%s: pi: %p, inode: %p, ino: %llu.\n",
 		__func__, pi, &pi->vfs_inode, pi->ino);
-	kmem_cache_free(pohmelfs_inode_cache, pi);
 	atomic_long_dec(&psb->total_inodes);
+	call_rcu(&inode->i_rcu, pohmelfs_i_callback);
 }
 
 /*
Index: linux-2.6/fs/hostfs/hostfs_kern.c
===================================================================
--- linux-2.6.orig/fs/hostfs/hostfs_kern.c	2010-10-19 14:17:20.000000000 +1100
+++ linux-2.6/fs/hostfs/hostfs_kern.c	2010-10-19 14:18:59.000000000 +1100
@@ -251,11 +251,18 @@
 	}
 }
 
-static void hostfs_destroy_inode(struct inode *inode)
+static void hostfs_i_callback(struct rcu_head *head)
 {
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
 	kfree(HOSTFS_I(inode));
 }
 
+static void hostfs_destroy_inode(struct inode *inode)
+{
+	call_rcu(&inode->i_rcu, hostfs_i_callback);
+}
+
 static int hostfs_show_options(struct seq_file *seq, struct vfsmount *vfs)
 {
 	const char *root_path = vfs->mnt_sb->s_fs_info;
Index: linux-2.6/fs/hppfs/hppfs.c
===================================================================
--- linux-2.6.orig/fs/hppfs/hppfs.c	2010-10-19 14:17:20.000000000 +1100
+++ linux-2.6/fs/hppfs/hppfs.c	2010-10-19 14:18:59.000000000 +1100
@@ -631,11 +631,18 @@
 	mntput(ino->i_sb->s_fs_info);
 }
 
-static void hppfs_destroy_inode(struct inode *inode)
+static void hppfs_i_callback(struct rcu_head *head)
 {
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
 	kfree(HPPFS_I(inode));
 }
 
+static void hppfs_destroy_inode(struct inode *inode)
+{
+	call_rcu(&inode->i_rcu, hppfs_i_callback);
+}
+
 static const struct super_operations hppfs_sbops = {
 	.alloc_inode	= hppfs_alloc_inode,
 	.destroy_inode	= hppfs_destroy_inode,
Index: linux-2.6/fs/xfs/xfs_iget.c
===================================================================
--- linux-2.6.orig/fs/xfs/xfs_iget.c	2010-10-19 14:17:20.000000000 +1100
+++ linux-2.6/fs/xfs/xfs_iget.c	2010-10-19 14:18:59.000000000 +1100
@@ -91,6 +91,17 @@
 	return ip;
 }
 
+STATIC void
+xfs_inode_free_callback(
+	struct rcu_head		*head)
+{
+	struct inode		*inode = container_of(head, struct inode, i_rcu);
+	struct xfs_inode	*ip = XFS_I(inode);
+
+	INIT_LIST_HEAD(&inode->i_dentry);
+	kmem_zone_free(xfs_inode_zone, ip);
+}
+
 void
 xfs_inode_free(
 	struct xfs_inode	*ip)
@@ -134,7 +145,7 @@
 	ASSERT(!spin_is_locked(&ip->i_flags_lock));
 	ASSERT(completion_done(&ip->i_flush));
 
-	kmem_zone_free(xfs_inode_zone, ip);
+	call_rcu(&ip->i_vnode.i_rcu, xfs_inode_free_callback);
 }
 
 /*


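The per-filesystem conversions above all follow the same recipe. As a
sketch, for a hypothetical filesystem (myfs_inode_cachep and MYFS_I()
are placeholder names): the freeing is split into a small RCU callback
plus a ->destroy_inode() that only queues it. The INIT_LIST_HEAD() is
needed because i_rcu overlays i_dentry in the new union, and the inode
slab caches set up i_dentry in a once-per-object constructor, so
objects must go back to the cache with i_dentry reconstructed.

	static void myfs_i_callback(struct rcu_head *head)
	{
		struct inode *inode = container_of(head, struct inode, i_rcu);

		/* i_rcu clobbered i_dentry (they share a union) */
		INIT_LIST_HEAD(&inode->i_dentry);
		kmem_cache_free(myfs_inode_cachep, MYFS_I(inode));
	}

	static void myfs_destroy_inode(struct inode *inode)
	{
		/* fs-specific teardown that needs the inode intact goes here */
		call_rcu(&inode->i_rcu, myfs_i_callback);
	}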

^ permalink raw reply	[flat|nested] 70+ messages in thread

* [patch 18/35] fs: avoid inode RCU freeing for pseudo fs
  2010-10-19  3:42 [patch 00/35] my inode scaling series for review npiggin
                   ` (16 preceding siblings ...)
  2010-10-19  3:42 ` [patch 17/35] fs: icache RCU free inodes npiggin
@ 2010-10-19  3:42 ` npiggin
  2010-10-19  3:42 ` [patch 19/35] fs: icache remove redundant i_sb_list umount locking npiggin
                   ` (18 subsequent siblings)
  36 siblings, 0 replies; 70+ messages in thread
From: npiggin @ 2010-10-19  3:42 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel

[-- Attachment #1: fs-inode-rcu-reduce.patch --]
[-- Type: text/plain, Size: 3698 bytes --]

Pseudo filesystems whose inodes are never put on RCU-visible lists and
are never reachable by rcu-walk dentries do not need to RCU-free their
inodes.
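
As a sketch, such a filesystem opts out by pointing ->destroy_inode at
the plain slab-freeing helper this patch adds (pipefs below does
exactly this; "examplefs" is a placeholder name):

	static const struct super_operations examplefs_ops = {
		/* inodes never hashed, never rcu-walked: no grace period needed */
		.destroy_inode	= free_inode_nonrcu,
	};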

Signed-off-by: Nick Piggin <npiggin@kernel.dk>

---
 fs/inode.c          |    6 ++++++
 fs/pipe.c           |    6 +++++-
 include/linux/fs.h  |    1 +
 include/linux/net.h |    1 +
 net/socket.c        |   17 +++++++++--------
 5 files changed, 22 insertions(+), 9 deletions(-)

Index: linux-2.6/net/socket.c
===================================================================
--- linux-2.6.orig/net/socket.c	2010-10-19 14:18:59.000000000 +1100
+++ linux-2.6/net/socket.c	2010-10-19 14:19:22.000000000 +1100
@@ -262,20 +262,20 @@
 }
 
 
-static void sock_free_rcu(struct rcu_head *head)
+static void wq_free_rcu(struct rcu_head *head)
 {
-	struct inode *inode = container_of(head, struct inode, i_rcu);
-	struct socket_alloc *ei = container_of(inode, struct socket_alloc,
-								vfs_inode);
+	struct socket_wq *wq = container_of(head, struct socket_wq, rcu);
 
-	kfree(ei->socket.wq);
-	INIT_LIST_HEAD(&inode->i_dentry);
-	kmem_cache_free(sock_inode_cachep, ei);
+	kfree(wq);
 }
 
 static void sock_destroy_inode(struct inode *inode)
 {
-	call_rcu(&inode->i_rcu, sock_free_rcu);
+	struct socket_alloc *ei;
+
+	ei = container_of(inode, struct socket_alloc, vfs_inode);
+	call_rcu(&ei->socket.wq->rcu, wq_free_rcu);
+	kmem_cache_free(sock_inode_cachep, ei);
 }
 
 static void init_once(void *foo)
Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c	2010-10-19 14:18:59.000000000 +1100
+++ linux-2.6/fs/inode.c	2010-10-19 14:19:25.000000000 +1100
@@ -263,6 +263,12 @@
 	return inode;
 }
 
+void free_inode_nonrcu(struct inode *inode)
+{
+	kmem_cache_free(inode_cachep, inode);
+}
+EXPORT_SYMBOL(free_inode_nonrcu);
+
 void __destroy_inode(struct inode *inode)
 {
 	BUG_ON(inode_has_buffers(inode));
Index: linux-2.6/fs/pipe.c
===================================================================
--- linux-2.6.orig/fs/pipe.c	2010-10-19 14:17:18.000000000 +1100
+++ linux-2.6/fs/pipe.c	2010-10-19 14:19:22.000000000 +1100
@@ -1239,6 +1239,10 @@
 	return ret;
 }
 
+static const struct super_operations pipefs_ops = {
+	.destroy_inode = free_inode_nonrcu,
+};
+
 /*
  * pipefs should _never_ be mounted by userland - too much of security hassle,
  * no real gain from having the whole whorehouse mounted. So we don't need
@@ -1249,7 +1253,7 @@
 			 int flags, const char *dev_name, void *data,
 			 struct vfsmount *mnt)
 {
-	return get_sb_pseudo(fs_type, "pipe:", NULL, PIPEFS_MAGIC, mnt);
+	return get_sb_pseudo(fs_type, "pipe:", &pipefs_ops, PIPEFS_MAGIC, mnt);
 }
 
 static struct file_system_type pipe_fs_type = {
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h	2010-10-19 14:18:59.000000000 +1100
+++ linux-2.6/include/linux/fs.h	2010-10-19 14:19:24.000000000 +1100
@@ -2196,6 +2196,7 @@
 extern void destroy_inode(struct inode *);
 extern void __destroy_inode(struct inode *);
 extern struct inode *new_inode(struct super_block *);
+extern void free_inode_nonrcu(struct inode *inode);
 extern int should_remove_suid(struct dentry *);
 extern int file_remove_suid(struct file *);
 
Index: linux-2.6/include/linux/net.h
===================================================================
--- linux-2.6.orig/include/linux/net.h	2010-10-19 14:18:59.000000000 +1100
+++ linux-2.6/include/linux/net.h	2010-10-19 14:18:59.000000000 +1100
@@ -120,6 +120,7 @@
 struct socket_wq {
 	wait_queue_head_t	wait;
 	struct fasync_struct	*fasync_list;
+	struct rcu_head		rcu;
 } ____cacheline_aligned_in_smp;
 
 /**



^ permalink raw reply	[flat|nested] 70+ messages in thread

* [patch 19/35] fs: icache remove redundant i_sb_list umount locking
  2010-10-19  3:42 [patch 00/35] my inode scaling series for review npiggin
                   ` (17 preceding siblings ...)
  2010-10-19  3:42 ` [patch 18/35] fs: avoid inode RCU freeing for pseudo fs npiggin
@ 2010-10-19  3:42 ` npiggin
  2010-10-20 12:46   ` Al Viro
  2010-10-19  3:42 ` [patch 20/35] fs: icache rcu walk for i_sb_list npiggin
                   ` (17 subsequent siblings)
  36 siblings, 1 reply; 70+ messages in thread
From: npiggin @ 2010-10-19  3:42 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel

[-- Attachment #1: fs-inode_lock-scale-11a.patch --]
[-- Type: text/plain, Size: 2868 bytes --]

In preparation for RCU-walking the inode sb lists, remove some
locking from the inode umount path. As the existing comments note,
we already rely on the list not changing at these points.
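
In sketch form, the resulting invalidate_inodes() flow and the
invariant it relies on (this just restates the hunks below in one
place):

	down_write(&iprune_sem);
	/*
	 * No list lock needed: umount guarantees sb->s_inodes has stopped
	 * changing, and holding iprune_sem keeps shrink_icache_memory()
	 * away.
	 */
	fsnotify_unmount_inodes(&sb->s_inodes);
	busy = invalidate_list(&sb->s_inodes, &throw_away);

	dispose_list(&throw_away);
	up_write(&iprune_sem);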

Signed-off-by: Nick Piggin <npiggin@kernel.dk>

---
 fs/inode.c             |   15 +++++----------
 fs/notify/inode_mark.c |   15 +++------------
 2 files changed, 8 insertions(+), 22 deletions(-)

Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c	2010-10-19 14:18:59.000000000 +1100
+++ linux-2.6/fs/inode.c	2010-10-19 14:19:25.000000000 +1100
@@ -407,14 +407,6 @@
 		struct list_head *tmp = next;
 		struct inode *inode;
 
-		/*
-		 * We can reschedule here without worrying about the list's
-		 * consistency because the per-sb list of inodes must not
-		 * change during umount anymore, and because iprune_sem keeps
-		 * shrink_icache_memory() away.
-		 */
-		cond_resched_lock(&sb_inode_list_lock);
-
 		next = next->next;
 		if (tmp == head)
 			break;
@@ -458,10 +450,13 @@
 	LIST_HEAD(throw_away);
 
 	down_write(&iprune_sem);
-	spin_lock(&sb_inode_list_lock);
+	/*
+	 * We can walk the per-sb list of inodes here without worrying about
+	 * its consistency, because the list must not change during umount
+	 * anymore, and because iprune_sem keeps shrink_icache_memory() away.
+	 */
 	fsnotify_unmount_inodes(&sb->s_inodes);
 	busy = invalidate_list(&sb->s_inodes, &throw_away);
-	spin_unlock(&sb_inode_list_lock);
 
 	dispose_list(&throw_away);
 	up_write(&iprune_sem);
Index: linux-2.6/fs/notify/inode_mark.c
===================================================================
--- linux-2.6.orig/fs/notify/inode_mark.c	2010-10-19 14:18:59.000000000 +1100
+++ linux-2.6/fs/notify/inode_mark.c	2010-10-19 14:19:24.000000000 +1100
@@ -232,8 +232,9 @@
  * fsnotify_unmount_inodes - an sb is unmounting.  handle any watched inodes.
  * @list: list of inodes being unmounted (sb->s_inodes)
  *
- * Called with iprune_mutex held, keeping shrink_icache_memory() at bay.
- * sb_inode_list_lock to protect the super block's list of inodes.
+ * Called with iprune_mutex held, keeping shrink_icache_memory() at bay,
+ * and with the sb going away, no new inodes will appear or be referenced
+ * from other paths.
  */
 void fsnotify_unmount_inodes(struct list_head *list)
 {
@@ -285,14 +286,6 @@
 			spin_unlock(&next_i->i_lock);
 		}
 
-		/*
-		 * We can safely drop sb_inode_list_lock here because we hold
-		 * references on both inode and next_i.  Also no new inodes
-		 * will be added since the umount has begun.  Finally,
-		 * iprune_mutex keeps shrink_icache_memory() away.
-		 */
-		spin_unlock(&sb_inode_list_lock);
-
 		if (need_iput_tmp)
 			iput(need_iput_tmp);
 
@@ -302,7 +295,5 @@
 		fsnotify_inode_delete(inode);
 
 		iput(inode);
-
-		spin_lock(&sb_inode_list_lock);
 	}
 }



^ permalink raw reply	[flat|nested] 70+ messages in thread

* [patch 20/35] fs: icache rcu walk for i_sb_list
  2010-10-19  3:42 [patch 00/35] my inode scaling series for review npiggin
                   ` (18 preceding siblings ...)
  2010-10-19  3:42 ` [patch 19/35] fs: icache remove redundant i_sb_list umount locking npiggin
@ 2010-10-19  3:42 ` npiggin
  2010-10-19  3:42 ` [patch 21/35] fs: icache per-cpu nr_inodes, non-atomic nr_unused counters npiggin
                   ` (16 subsequent siblings)
  36 siblings, 0 replies; 70+ messages in thread
From: npiggin @ 2010-10-19  3:42 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel

[-- Attachment #1: fs-inode_lock-scale-11.patch --]
[-- Type: text/plain, Size: 9865 bytes --]

Walking i_sb_list under RCU enables locking to be reduced and the
lock ordering to be simplified.
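
The conversion pattern, sketched from the drop_caches and quota hunks
below (do_something() stands in for the per-inode work): pin the inode
with __iget() under i_lock before leaving the RCU read-side section,
do the sleeping work outside it, and defer the iput() so a reference
is always held on the inode the list cursor will advance from.

	struct inode *inode, *toput_inode = NULL;

	rcu_read_lock();
	list_for_each_entry_rcu(inode, &sb->s_inodes, i_sb_list) {
		spin_lock(&inode->i_lock);
		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
			spin_unlock(&inode->i_lock);
			continue;
		}
		__iget(inode);
		spin_unlock(&inode->i_lock);
		rcu_read_unlock();

		do_something(inode);	/* may sleep, outside RCU section */
		iput(toput_inode);	/* drop the previous inode's ref */
		toput_inode = inode;	/* keeps the cursor pinned */

		rcu_read_lock();
	}
	rcu_read_unlock();
	iput(toput_inode);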

Signed-off-by: Nick Piggin <npiggin@kernel.dk>

---
 Documentation/filesystems/Locking |    2 -
 Documentation/filesystems/vfs.txt |    2 -
 fs/drop_caches.c                  |   10 ++++-----
 fs/fs-writeback.c                 |   19 ++++++++---------
 fs/inode.c                        |   41 ++++++++++++--------------------------
 fs/quota/dquot.c                  |   18 ++++++++--------
 6 files changed, 39 insertions(+), 53 deletions(-)

Index: linux-2.6/fs/drop_caches.c
===================================================================
--- linux-2.6.orig/fs/drop_caches.c	2010-10-19 14:18:59.000000000 +1100
+++ linux-2.6/fs/drop_caches.c	2010-10-19 14:19:24.000000000 +1100
@@ -16,8 +16,8 @@
 {
 	struct inode *inode, *toput_inode = NULL;
 
-	spin_lock(&sb_inode_list_lock);
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+	rcu_read_lock();
+	list_for_each_entry_rcu(inode, &sb->s_inodes, i_sb_list) {
 		spin_lock(&inode->i_lock);
 		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)
 				|| inode->i_mapping->nrpages == 0) {
@@ -26,13 +26,13 @@
 		}
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&sb_inode_list_lock);
+		rcu_read_unlock();
 		invalidate_mapping_pages(inode->i_mapping, 0, -1);
 		iput(toput_inode);
 		toput_inode = inode;
-		spin_lock(&sb_inode_list_lock);
+		rcu_read_lock();
 	}
-	spin_unlock(&sb_inode_list_lock);
+	rcu_read_unlock();
 	iput(toput_inode);
 }
 
Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c	2010-10-19 14:18:59.000000000 +1100
+++ linux-2.6/fs/inode.c	2010-10-19 14:19:24.000000000 +1100
@@ -42,10 +42,10 @@
  *   i_sb_list
  *
  * Ordering:
- * sb_inode_list_lock
- *   inode->i_lock
- *     wb_inode_list_lock
- *     inode_hash_bucket lock
+ * inode->i_lock
+ *   sb_inode_list_lock
+ *   wb_inode_list_lock
+ *   inode_hash_bucket lock
  */
 /*
  * This is needed for the following functions:
@@ -382,12 +382,12 @@
 
 		evict(inode);
 
-		spin_lock(&sb_inode_list_lock);
 		spin_lock(&inode->i_lock);
 		__remove_inode_hash(inode);
-		list_del_init(&inode->i_sb_list);
-		spin_unlock(&inode->i_lock);
+		spin_lock(&sb_inode_list_lock);
+		list_del_rcu(&inode->i_sb_list);
 		spin_unlock(&sb_inode_list_lock);
+		spin_unlock(&inode->i_lock);
 
 		wake_up_inode(inode);
 		destroy_inode(inode);
@@ -696,7 +696,8 @@
 __inode_add_to_lists(struct super_block *sb, struct inode_hash_bucket *b,
 			struct inode *inode)
 {
-	list_add(&inode->i_sb_list, &sb->s_inodes);
+	spin_lock(&sb_inode_list_lock);
+	list_add_rcu(&inode->i_sb_list, &sb->s_inodes);
 	spin_unlock(&sb_inode_list_lock);
 	if (b) {
 		spin_lock_bucket(b);
@@ -721,7 +722,6 @@
 {
 	struct inode_hash_bucket *b = inode_hashtable + hash(sb, inode->i_ino);
 
-	spin_lock(&sb_inode_list_lock);
 	spin_lock(&inode->i_lock);
 	__inode_add_to_lists(sb, b, inode);
 	spin_unlock(&inode->i_lock);
@@ -752,7 +752,6 @@
 
 	inode = alloc_inode(sb);
 	if (inode) {
-		spin_lock(&sb_inode_list_lock);
 		spin_lock(&inode->i_lock);
 		inode->i_ino = (unsigned int)atomic_inc_return(&last_ino);
 		inode->i_state = 0;
@@ -819,7 +818,6 @@
 		/* We released the lock, so.. */
 		old = find_inode(sb, b, test, data);
 		if (!old) {
-			spin_lock(&sb_inode_list_lock);
 			spin_lock(&inode->i_lock);
 			if (set(inode, data))
 				goto set_failed;
@@ -849,7 +847,6 @@
 
 set_failed:
 	spin_unlock(&inode->i_lock);
-	spin_unlock(&sb_inode_list_lock);
 	destroy_inode(inode);
 	return NULL;
 }
@@ -870,7 +867,6 @@
 		/* We released the lock, so.. */
 		old = find_inode_fast(sb, b, ino);
 		if (!old) {
-			spin_lock(&sb_inode_list_lock);
 			spin_lock(&inode->i_lock);
 			inode->i_ino = ino;
 			inode->i_state = I_NEW;
@@ -1380,15 +1376,12 @@
 				atomic_inc(&inodes_stat.nr_unused);
 			}
 			spin_unlock(&inode->i_lock);
-			spin_unlock(&sb_inode_list_lock);
 			return;
 		}
 		WARN_ON(inode->i_state & I_NEW);
 		inode->i_state |= I_WILL_FREE;
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&sb_inode_list_lock);
 		write_inode_now(inode, 1);
-		spin_lock(&sb_inode_list_lock);
 		spin_lock(&inode->i_lock);
 		WARN_ON(inode->i_state & I_NEW);
 		inode->i_state &= ~I_WILL_FREE;
@@ -1401,7 +1394,8 @@
 		if (!inode->i_state)
 			atomic_dec(&inodes_stat.nr_unused);
 	}
-	list_del_init(&inode->i_sb_list);
+	spin_lock(&sb_inode_list_lock);
+	list_del_rcu(&inode->i_sb_list);
 	spin_unlock(&sb_inode_list_lock);
 	WARN_ON(inode->i_state & I_NEW);
 	inode->i_state |= I_FREEING;
@@ -1431,19 +1425,12 @@
 	if (inode) {
 		BUG_ON(inode->i_state & I_CLEAR);
 
-retry:
 		spin_lock(&inode->i_lock);
-		if (inode->i_count == 1) {
-			if (!spin_trylock(&sb_inode_list_lock)) {
-				spin_unlock(&inode->i_lock);
-				goto retry;
-			}
-			inode->i_count--;
+		inode->i_count--;
+		if (inode->i_count == 0)
 			iput_final(inode);
-		} else {
-			inode->i_count--;
+		else
 			spin_unlock(&inode->i_lock);
-		}
 	}
 }
 EXPORT_SYMBOL(iput);
Index: linux-2.6/fs/quota/dquot.c
===================================================================
--- linux-2.6.orig/fs/quota/dquot.c	2010-10-19 14:18:59.000000000 +1100
+++ linux-2.6/fs/quota/dquot.c	2010-10-19 14:19:23.000000000 +1100
@@ -897,8 +897,8 @@
 	int reserved = 0;
 #endif
 
-	spin_lock(&sb_inode_list_lock);
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+	rcu_read_lock();
+	list_for_each_entry_rcu(inode, &sb->s_inodes, i_sb_list) {
 		spin_lock(&inode->i_lock);
 		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
 			spin_unlock(&inode->i_lock);
@@ -919,19 +919,19 @@
 
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&sb_inode_list_lock);
+		rcu_read_unlock();
 
 		iput(old_inode);
 		__dquot_initialize(inode, type);
 		/* We hold a reference to 'inode' so it couldn't have been
 		 * removed from s_inodes list while we dropped the
-		 * sb_inode_list_lock.  We cannot iput the inode now as we can
+		 * i_lock.  We cannot iput the inode now as we can
 		 * be holding the last reference and we cannot iput it under
 		 * lock. So we keep the reference and iput it later. */
 		old_inode = inode;
-		spin_lock(&sb_inode_list_lock);
+		rcu_read_lock();
 	}
-	spin_unlock(&sb_inode_list_lock);
+	rcu_read_unlock();
 	iput(old_inode);
 
 #ifdef CONFIG_QUOTA_DEBUG
@@ -1012,8 +1012,8 @@
 	struct inode *inode;
 	int reserved = 0;
 
-	spin_lock(&sb_inode_list_lock);
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+	rcu_read_lock();
+	list_for_each_entry_rcu(inode, &sb->s_inodes, i_sb_list) {
 		/*
 		 *  We have to scan also I_NEW inodes because they can already
 		 *  have quota pointer initialized. Luckily, we need to touch
@@ -1026,7 +1026,7 @@
 			remove_inode_dquot_ref(inode, type, tofree_head);
 		}
 	}
-	spin_unlock(&sb_inode_list_lock);
+	rcu_read_unlock();
 #ifdef CONFIG_QUOTA_DEBUG
 	if (reserved) {
 		printk(KERN_WARNING "VFS (%s): Writes happened after quota"
Index: linux-2.6/Documentation/filesystems/Locking
===================================================================
--- linux-2.6.orig/Documentation/filesystems/Locking	2010-10-19 14:18:59.000000000 +1100
+++ linux-2.6/Documentation/filesystems/Locking	2010-10-19 14:18:59.000000000 +1100
@@ -114,7 +114,7 @@
 destroy_inode:
 dirty_inode:				(must not sleep)
 write_inode:
-drop_inode:				!!!i_lock, sb_inode_list_lock!!!
+drop_inode:				!!!i_lock!!!
 evict_inode:
 put_super:		write
 write_super:		read
Index: linux-2.6/Documentation/filesystems/vfs.txt
===================================================================
--- linux-2.6.orig/Documentation/filesystems/vfs.txt	2010-10-19 14:18:59.000000000 +1100
+++ linux-2.6/Documentation/filesystems/vfs.txt	2010-10-19 14:19:16.000000000 +1100
@@ -246,7 +246,7 @@
 	should be synchronous or not, not all filesystems check this flag.
 
   drop_inode: called when the last access to the inode is dropped,
-	with the i_lock and sb_inode_list_lock spinlock held.
+	with the i_lock spinlock held.
 
 	This method should be either NULL (normal UNIX filesystem
 	semantics) or "generic_delete_inode" (for filesystems that do not
Index: linux-2.6/fs/fs-writeback.c
===================================================================
--- linux-2.6.orig/fs/fs-writeback.c	2010-10-19 14:18:59.000000000 +1100
+++ linux-2.6/fs/fs-writeback.c	2010-10-19 14:19:24.000000000 +1100
@@ -1061,8 +1061,6 @@
 	 */
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
-	spin_lock(&sb_inode_list_lock);
-
 	/*
 	 * Data integrity sync. Must wait for all pages under writeback,
 	 * because there may have been pages dirtied before our sync
@@ -1070,7 +1068,8 @@
 	 * In which case, the inode may not be on the dirty list, but
 	 * we still have to wait for that writeout.
 	 */
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+	rcu_read_lock();
+	list_for_each_entry_rcu(inode, &sb->s_inodes, i_sb_list) {
 		struct address_space *mapping;
 
 		spin_lock(&inode->i_lock);
@@ -1087,13 +1086,13 @@
 
  		__iget(inode);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&sb_inode_list_lock);
+		rcu_read_unlock();
 		/*
 		 * We hold a reference to 'inode' so it couldn't have been
-		 * removed from s_inodes list while we dropped the
-		 * sb_inode_list_lock.  We cannot iput the inode now as we can
-		 * be holding the last reference and we cannot iput it under
-		 * spinlock. So we keep the reference and iput it later.
+		 * removed from s_inodes list while we dropped the i_lock.  We
+		 * cannot iput the inode now as we can be holding the last
+		 * reference and we cannot iput it under spinlock. So we keep
+		 * the reference and iput it later.
 		 */
 		iput(old_inode);
 		old_inode = inode;
@@ -1102,9 +1101,9 @@
 
 		cond_resched();
 
-		spin_lock(&sb_inode_list_lock);
+		rcu_read_lock();
 	}
-	spin_unlock(&sb_inode_list_lock);
+	rcu_read_unlock();
 	iput(old_inode);
 }
 



^ permalink raw reply	[flat|nested] 70+ messages in thread

* [patch 21/35] fs: icache per-cpu nr_inodes, non-atomic nr_unused counters
  2010-10-19  3:42 [patch 00/35] my inode scaling series for review npiggin
                   ` (19 preceding siblings ...)
  2010-10-19  3:42 ` [patch 20/35] fs: icache rcu walk for i_sb_list npiggin
@ 2010-10-19  3:42 ` npiggin
  2010-10-19  3:42 ` [patch 22/35] fs: icache per-cpu last_ino allocator npiggin
                   ` (15 subsequent siblings)
  36 siblings, 0 replies; 70+ messages in thread
From: npiggin @ 2010-10-19  3:42 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel; +Cc: Eric Dumazet

[-- Attachment #1: fs-inode-nr_inodes.patch --]
[-- Type: text/plain, Size: 8057 bytes --]

From: Eric Dumazet <eric.dumazet@gmail.com>

[Eric Dumazet]
Make nr_inodes a per-cpu counter to avoid cache line ping pongs between cpus.

[Nick Piggin]
Make nr_unused non-atomic and protected by wb_inode_list_lock.
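
The scheme in a minimal sketch (the call sites named in the comments
are where the hunks below place them):

	static DEFINE_PER_CPU(unsigned int, nr_inodes);

	/*
	 * Fast path: inode_init_always() does this_cpu_inc(nr_inodes),
	 * __destroy_inode() does this_cpu_dec(nr_inodes). No shared
	 * cache line is written on either path.
	 */

	int get_nr_inodes(void)		/* slow path: rare sysctl reads */
	{
		int i, sum = 0;

		for_each_possible_cpu(i)
			sum += per_cpu(nr_inodes, i);
		/*
		 * The inc and dec for one inode may land on different
		 * CPUs, so the instantaneous sum can be negative; clamp.
		 */
		return sum < 0 ? 0 : sum;
	}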

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Nick Piggin <npiggin@kernel.dk>

---
 fs/fs-writeback.c  |   18 +++++++++++-----
 fs/inode.c         |   58 ++++++++++++++++++++++++++++++++++++++---------------
 include/linux/fs.h |   15 +++++--------
 kernel/sysctl.c    |    4 +--
 4 files changed, 63 insertions(+), 32 deletions(-)

Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c	2010-10-19 14:38:03.000000000 +1100
+++ linux-2.6/fs/inode.c	2010-10-19 14:38:27.000000000 +1100
@@ -139,12 +139,42 @@
  * Statistics gathering..
  */
 struct inodes_stat_t inodes_stat = {
-	.nr_inodes = ATOMIC_INIT(0),
-	.nr_unused = ATOMIC_INIT(0),
+	.nr_inodes = 0,
+	.nr_unused = 0,
 };
 
+static DEFINE_PER_CPU(unsigned int, nr_inodes);
+
 static struct kmem_cache *inode_cachep __read_mostly;
 
+int get_nr_inodes(void)
+{
+	int i;
+	int sum = 0;
+	for_each_possible_cpu(i)
+		sum += per_cpu(nr_inodes, i);
+	return sum < 0 ? 0 : sum;
+}
+
+int get_nr_inodes_unused(void)
+{
+	return inodes_stat.nr_unused;
+}
+
+/*
+ * Handle nr_inodes sysctl
+ */
+int proc_nr_inodes(ctl_table *table, int write,
+		   void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+#if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS)
+	inodes_stat.nr_inodes = get_nr_inodes();
+	return proc_dointvec(table, write, buffer, lenp, ppos);
+#else
+	return -ENOSYS;
+#endif
+}
+
 static void wake_up_inode(struct inode *inode)
 {
 	/*
@@ -232,7 +262,7 @@
 	inode->i_fsnotify_mask = 0;
 #endif
 
-	atomic_inc(&inodes_stat.nr_inodes);
+	this_cpu_inc(nr_inodes);
 
 	return 0;
 out:
@@ -280,7 +310,7 @@
 	if (inode->i_default_acl && inode->i_default_acl != ACL_NOT_CACHED)
 		posix_acl_release(inode->i_default_acl);
 #endif
-	atomic_dec(&inodes_stat.nr_inodes);
+	this_cpu_dec(nr_inodes);
 }
 EXPORT_SYMBOL(__destroy_inode);
 
@@ -400,7 +430,7 @@
 static int invalidate_list(struct list_head *head, struct list_head *dispose)
 {
 	struct list_head *next;
-	int busy = 0, count = 0;
+	int busy = 0;
 
 	next = head->next;
 	for (;;) {
@@ -420,19 +450,17 @@
 		if (!inode->i_count) {
 			spin_lock(&wb_inode_list_lock);
 			list_del(&inode->i_list);
+			inodes_stat.nr_unused--;
 			spin_unlock(&wb_inode_list_lock);
 			WARN_ON(inode->i_state & I_NEW);
 			inode->i_state |= I_FREEING;
 			spin_unlock(&inode->i_lock);
 			list_add(&inode->i_list, dispose);
-			count++;
 			continue;
 		}
 		spin_unlock(&inode->i_lock);
 		busy = 1;
 	}
-	/* only unused inodes may be cached with i_count zero */
-	atomic_sub(count, &inodes_stat.nr_unused);
 	return busy;
 }
 
@@ -494,7 +522,6 @@
 static void prune_icache(unsigned long nr_to_scan)
 {
 	LIST_HEAD(freeable);
-	int nr_pruned = 0;
 	unsigned long reap = 0;
 
 	down_read(&iprune_sem);
@@ -515,7 +542,7 @@
 		if (inode->i_count || (inode->i_state & ~I_REFERENCED)) {
 			list_del_init(&inode->i_list);
 			spin_unlock(&inode->i_lock);
-			atomic_dec(&inodes_stat.nr_unused);
+			inodes_stat.nr_unused--;
 			continue;
 		}
 		if (inode->i_state & I_REFERENCED) {
@@ -557,9 +584,8 @@
 		WARN_ON(inode->i_state & I_NEW);
 		inode->i_state |= I_FREEING;
 		spin_unlock(&inode->i_lock);
-		nr_pruned++;
+		inodes_stat.nr_unused--;
 	}
-	atomic_sub(nr_pruned, &inodes_stat.nr_unused);
 	if (current_is_kswapd())
 		__count_vm_events(KSWAPD_INODESTEAL, reap);
 	else
@@ -587,7 +613,7 @@
 	unsigned long nr;
 
 	shrinker_add_scan(&nr_to_scan, scanned, global,
-			atomic_read(&inodes_stat.nr_unused),
+			inodes_stat.nr_unused,
 			SHRINK_DEFAULT_SEEKS * 100 / sysctl_vfs_cache_pressure);
 	/*
 	 * Nasty deadlock avoidance.  We may hold various FS locks,
@@ -1372,8 +1398,8 @@
 					list_empty(&inode->i_list)) {
 				spin_lock(&wb_inode_list_lock);
 				list_add(&inode->i_list, &inode_unused);
+				inodes_stat.nr_unused++;
 				spin_unlock(&wb_inode_list_lock);
-				atomic_inc(&inodes_stat.nr_unused);
 			}
 			spin_unlock(&inode->i_lock);
 			return;
@@ -1390,9 +1416,9 @@
 	if (!list_empty(&inode->i_list)) {
 		spin_lock(&wb_inode_list_lock);
 		list_del_init(&inode->i_list);
-		spin_unlock(&wb_inode_list_lock);
 		if (!inode->i_state)
-			atomic_dec(&inodes_stat.nr_unused);
+			inodes_stat.nr_unused--;
+		spin_unlock(&wb_inode_list_lock);
 	}
 	spin_lock(&sb_inode_list_lock);
 	list_del_rcu(&inode->i_sb_list);
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h	2010-10-19 14:38:03.000000000 +1100
+++ linux-2.6/include/linux/fs.h	2010-10-19 14:38:05.000000000 +1100
@@ -40,14 +40,8 @@
 };
 
 struct inodes_stat_t {
-	/*
-	 * Using atomics here is a hack which should just happen to
-	 * work on all architectures today. Not a big deal though,
-	 * because it goes away and gets fixed properly later in the
-	 * inode scaling series.
-	 */
-	atomic_t nr_inodes;
-	atomic_t nr_unused;
+	int nr_inodes;
+	int nr_unused;
 	int dummy[5];		/* padding for sysctl ABI compatibility */
 };
 
@@ -413,6 +407,8 @@
 extern int get_max_files(void);
 extern int sysctl_nr_open;
 extern struct inodes_stat_t inodes_stat;
+extern int get_nr_inodes(void);
+extern int get_nr_inodes_unused(void);
 extern int leases_enable, lease_break_time;
 
 struct buffer_head;
@@ -2490,7 +2486,8 @@
 struct ctl_table;
 int proc_nr_files(struct ctl_table *table, int write,
 		  void __user *buffer, size_t *lenp, loff_t *ppos);
-
+int proc_nr_inodes(struct ctl_table *table, int write,
+		   void __user *buffer, size_t *lenp, loff_t *ppos);
 int __init get_filesystem_list(char *buf);
 
 #define ACC_MODE(x) ("\004\002\006\006"[(x)&O_ACCMODE])
Index: linux-2.6/fs/fs-writeback.c
===================================================================
--- linux-2.6.orig/fs/fs-writeback.c	2010-10-19 14:38:03.000000000 +1100
+++ linux-2.6/fs/fs-writeback.c	2010-10-19 14:38:05.000000000 +1100
@@ -738,6 +738,7 @@
 {
 	unsigned long expired;
 	long nr_pages;
+	int nr_dirty_inodes;
 
 	/*
 	 * When set to zero, disable periodic writeback
@@ -750,11 +751,15 @@
 	if (time_before(jiffies, expired))
 		return 0;
 
+	/* approximate dirty inodes */
+	nr_dirty_inodes = get_nr_inodes() - get_nr_inodes_unused();
+	if (nr_dirty_inodes < 0)
+		nr_dirty_inodes = 0;
+
 	wb->last_old_flush = jiffies;
 	nr_pages = global_page_state(NR_FILE_DIRTY) +
 			global_page_state(NR_UNSTABLE_NFS) +
-			(atomic_read(&inodes_stat.nr_inodes) -
-			atomic_read(&inodes_stat.nr_unused));
+			nr_dirty_inodes;
 
 	if (nr_pages) {
 		struct wb_writeback_work work = {
@@ -1120,6 +1125,7 @@
 {
 	unsigned long nr_dirty = global_page_state(NR_FILE_DIRTY);
 	unsigned long nr_unstable = global_page_state(NR_UNSTABLE_NFS);
+	int nr_dirty_inodes;
 	DECLARE_COMPLETION_ONSTACK(done);
 	struct wb_writeback_work work = {
 		.sb		= sb,
@@ -1129,9 +1135,11 @@
 
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
-	work.nr_pages = nr_dirty + nr_unstable +
-			(atomic_read(&inodes_stat.nr_inodes) -
-			atomic_read(&inodes_stat.nr_unused));
+	nr_dirty_inodes = get_nr_inodes() - get_nr_inodes_unused();
+	if (nr_dirty_inodes < 0)
+		nr_dirty_inodes = 0;
+
+	work.nr_pages = nr_dirty + nr_unstable + nr_dirty_inodes;
 
 	bdi_queue_work(sb->s_bdi, &work);
 	wait_for_completion(&done);
Index: linux-2.6/kernel/sysctl.c
===================================================================
--- linux-2.6.orig/kernel/sysctl.c	2010-10-19 14:19:24.000000000 +1100
+++ linux-2.6/kernel/sysctl.c	2010-10-19 14:38:05.000000000 +1100
@@ -1340,14 +1340,14 @@
 		.data		= &inodes_stat,
 		.maxlen		= 2*sizeof(int),
 		.mode		= 0444,
-		.proc_handler	= proc_dointvec,
+		.proc_handler	= proc_nr_inodes,
 	},
 	{
 		.procname	= "inode-state",
 		.data		= &inodes_stat,
 		.maxlen		= 7*sizeof(int),
 		.mode		= 0444,
-		.proc_handler	= proc_dointvec,
+		.proc_handler	= proc_nr_inodes,
 	},
 	{
 		.procname	= "file-nr",



^ permalink raw reply	[flat|nested] 70+ messages in thread

* [patch 22/35] fs: icache per-cpu last_ino allocator
  2010-10-19  3:42 [patch 00/35] my inode scaling series for review npiggin
                   ` (20 preceding siblings ...)
  2010-10-19  3:42 ` [patch 21/35] fs: icache per-cpu nr_inodes, non-atomic nr_unused counters npiggin
@ 2010-10-19  3:42 ` npiggin
  2010-10-19  3:42 ` [patch 23/35] fs: icache use per-CPU lists and locks for sb inode lists npiggin
                   ` (14 subsequent siblings)
  36 siblings, 0 replies; 70+ messages in thread
From: npiggin @ 2010-10-19  3:42 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel; +Cc: Eric Dumazet

[-- Attachment #1: fs-last_ino-percpu-2.patch --]
[-- Type: text/plain, Size: 2785 bytes --]

From: Eric Dumazet <eric.dumazet@gmail.com>

new_inode() dirties a contended cache line to get increasing
inode numbers.

Solve this problem by giving each cpu a per_cpu variable, fed by the
shared last_ino but refilled only once every 1024 allocations. This
reduces contention on the shared last_ino and gives the same spread of
inode numbers as before (i.e. the same wraparound after 2^32
allocations).
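
For concreteness, the worst case worked through (these numbers restate
the comment in the patch, they are not new measurements): each CPU can
strand at most LAST_INO_BATCH - 1 numbers, so with NR_CPUS = 4096 and
LAST_INO_BATCH = 1024 the wastage is bounded by 4096 * 1023 =
4,190,208 ~= 2^22 inode numbers, i.e. about 0.1% of the 2^32 space.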

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
---
 fs/inode.c |   47 ++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 40 insertions(+), 7 deletions(-)

Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c	2010-10-19 14:18:59.000000000 +1100
+++ linux-2.6/fs/inode.c	2010-10-19 14:19:24.000000000 +1100
@@ -754,6 +754,45 @@
 }
 EXPORT_SYMBOL_GPL(inode_add_to_lists);
 
+#define LAST_INO_BATCH 1024
+
+/*
+ * Each cpu owns a range of LAST_INO_BATCH numbers.
+ * 'shared_last_ino' is dirtied only once out of LAST_INO_BATCH allocations,
+ * to renew the exhausted range.
+ *
+ * This does not significantly increase overflow rate because every CPU can
+ * consume at most LAST_INO_BATCH-1 unused inode numbers. So there is
+ * NR_CPUS*(LAST_INO_BATCH-1) wastage. At 4096 and 1024, this is ~0.1% of the
+ * 2^32 range, and is a worst-case. Even a 50% wastage would only increase
+ * overflow rate by 2x, which does not seem too significant.
+ *
+ * On a 32bit, non LFS stat() call, glibc will generate an EOVERFLOW
+ * error if st_ino won't fit in target struct field. Use 32bit counter
+ * here to attempt to avoid that.
+ */
+static DEFINE_PER_CPU(unsigned int, last_ino);
+
+static unsigned int get_next_ino(void)
+{
+	unsigned int res;
+
+	get_cpu();
+	res = __this_cpu_read(last_ino);
+#ifdef CONFIG_SMP
+	if (unlikely((res & (LAST_INO_BATCH - 1)) == 0)) {
+		static atomic_t shared_last_ino;
+		int next = atomic_add_return(LAST_INO_BATCH, &shared_last_ino);
+
+		res = next - LAST_INO_BATCH;
+	}
+#endif
+	res++;
+	__this_cpu_write(last_ino, res);
+	put_cpu();
+	return res;
+}
+
 /**
  *	new_inode 	- obtain an inode
  *	@sb: superblock
@@ -768,18 +807,12 @@
  */
 struct inode *new_inode(struct super_block *sb)
 {
-	/*
-	 * On a 32bit, non LFS stat() call, glibc will generate an EOVERFLOW
-	 * error if st_ino won't fit in target struct field. Use 32bit counter
-	 * here to attempt to avoid that.
-	 */
-	static atomic_t last_ino = ATOMIC_INIT(0);
 	struct inode *inode;
 
 	inode = alloc_inode(sb);
 	if (inode) {
 		spin_lock(&inode->i_lock);
-		inode->i_ino = (unsigned int)atomic_inc_return(&last_ino);
+		inode->i_ino = get_next_ino();
 		inode->i_state = 0;
 		__inode_add_to_lists(sb, NULL, inode);
 		spin_unlock(&inode->i_lock);



^ permalink raw reply	[flat|nested] 70+ messages in thread

* [patch 23/35] fs: icache use per-CPU lists and locks for sb inode lists
  2010-10-19  3:42 [patch 00/35] my inode scaling series for review npiggin
                   ` (21 preceding siblings ...)
  2010-10-19  3:42 ` [patch 22/35] fs: icache per-cpu last_ino allocator npiggin
@ 2010-10-19  3:42 ` npiggin
  2010-10-19 15:33   ` Miklos Szeredi
  2010-10-19  3:42 ` [patch 24/35] fs: icache use RCU to avoid locking in hash lookups npiggin
                   ` (13 subsequent siblings)
  36 siblings, 1 reply; 70+ messages in thread
From: npiggin @ 2010-10-19  3:42 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel

[-- Attachment #1: fs-sb-inodes-percpu.patch --]
[-- Type: text/plain, Size: 15026 bytes --]

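In outline, a sketch distilled from the hunks below (not a verbatim
excerpt, CONFIG_SMP case shown): the per-sb inode list becomes one
list per CPU, insertions and removals take only the owning CPU's
nested lock via the lglock, and full walks visit every CPU's list
under RCU:

	/* insertion: lock only this CPU's list */
	lg_local_lock(inode_list_lglock);
	inode->i_sb_list_cpu = smp_processor_id();
	list_add_rcu(&inode->i_sb_list,
		     per_cpu_ptr(sb->s_inodes, inode->i_sb_list_cpu));
	lg_local_unlock(inode_list_lglock);

	/* full walk: iterate every CPU's list under RCU */
	rcu_read_lock();
	do_inode_list_for_each_entry_rcu(sb, inode) {
		/* inspect inode under inode->i_lock */
	} while_inode_list_for_each_entry_rcu
	rcu_read_unlock();
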
Signed-off-by: Nick Piggin <npiggin@kernel.dk>

---
 fs/drop_caches.c                 |    4 -
 fs/fs-writeback.c                |   15 +++--
 fs/inode.c                       |   99 ++++++++++++++++++++++++++++-----------
 fs/notify/inode_mark.c           |    6 +-
 fs/quota/dquot.c                 |    8 +--
 fs/super.c                       |   16 +++++-
 include/linux/fs.h               |   58 ++++++++++++++++++++++
 include/linux/fsnotify_backend.h |    4 -
 include/linux/writeback.h        |    1 
 9 files changed, 164 insertions(+), 47 deletions(-)

Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c	2010-10-19 14:18:59.000000000 +1100
+++ linux-2.6/fs/inode.c	2010-10-19 14:19:23.000000000 +1100
@@ -25,10 +25,11 @@
 #include <linux/async.h>
 #include <linux/posix_acl.h>
 #include <linux/bit_spinlock.h>
+#include <linux/lglock.h>
 
 /*
  * Usage:
- * sb_inode_list_lock protects:
+ * inode_list_lglock protects:
  *   s_inodes, i_sb_list
  * inode_hash_bucket lock protects:
  *   inode hash table, i_hash
@@ -43,7 +44,7 @@
  *
  * Ordering:
  * inode->i_lock
- *   sb_inode_list_lock
+ *   inode_list_lglock
  *   wb_inode_list_lock
  *   inode_hash_bucket lock
  */
@@ -118,7 +119,9 @@
  * NOTE! You also have to own the lock if you change
  * the i_state of an inode while it is in use..
  */
-DEFINE_SPINLOCK(sb_inode_list_lock);
+DECLARE_LGLOCK(inode_list_lglock);
+DEFINE_LGLOCK(inode_list_lglock);
+
 DEFINE_SPINLOCK(wb_inode_list_lock);
 
 /*
@@ -395,6 +398,8 @@
 
 static void __remove_inode_hash(struct inode *inode);
 
+static void inode_sb_list_del(struct inode *inode);
+
 /*
  * dispose_list - dispose of the contents of a local list
  * @head: the head of the list to free
@@ -414,9 +419,7 @@
 
 		spin_lock(&inode->i_lock);
 		__remove_inode_hash(inode);
-		spin_lock(&sb_inode_list_lock);
-		list_del_rcu(&inode->i_sb_list);
-		spin_unlock(&sb_inode_list_lock);
+		inode_sb_list_del(inode);
 		spin_unlock(&inode->i_lock);
 
 		wake_up_inode(inode);
@@ -427,20 +430,12 @@
 /*
  * Invalidate all inodes for a device.
  */
-static int invalidate_list(struct list_head *head, struct list_head *dispose)
+static int invalidate_sb_inodes(struct super_block *sb, struct list_head *dispose)
 {
-	struct list_head *next;
+	struct inode *inode;
 	int busy = 0;
 
-	next = head->next;
-	for (;;) {
-		struct list_head *tmp = next;
-		struct inode *inode;
-
-		next = next->next;
-		if (tmp == head)
-			break;
-		inode = list_entry(tmp, struct inode, i_sb_list);
+	do_inode_list_for_each_entry_rcu(sb, inode) {
 		spin_lock(&inode->i_lock);
 		if (inode->i_state & I_NEW) {
 			spin_unlock(&inode->i_lock);
@@ -460,7 +455,8 @@
 		}
 		spin_unlock(&inode->i_lock);
 		busy = 1;
-	}
+	} while_inode_list_for_each_entry_rcu
+
 	return busy;
 }
 
@@ -483,8 +479,8 @@
 	 * its consistency, because the list must not change during umount
 	 * anymore, and because iprune_sem keeps shrink_icache_memory() away.
 	 */
-	fsnotify_unmount_inodes(&sb->s_inodes);
-	busy = invalidate_list(&sb->s_inodes, &throw_away);
+	fsnotify_unmount_inodes(sb);
+	busy = invalidate_sb_inodes(sb, &throw_away);
 
 	dispose_list(&throw_away);
 	up_write(&iprune_sem);
@@ -718,13 +714,63 @@
 	return tmp & I_HASHMASK;
 }
 
+static inline int inode_list_cpu(struct inode *inode)
+{
+#ifdef CONFIG_SMP
+	return inode->i_sb_list_cpu;
+#else
+	return smp_processor_id();
+#endif
+}
+
+/* helper for inode_sb_list_add to reduce ifdefs */
+static inline void __inode_sb_list_add(struct inode *inode, struct super_block *sb)
+{
+	struct list_head *list;
+#ifdef CONFIG_SMP
+	int cpu;
+	cpu = smp_processor_id();
+	inode->i_sb_list_cpu = cpu;
+	list = per_cpu_ptr(sb->s_inodes, cpu);
+#else
+	list = &sb->s_inodes;
+#endif
+	list_add_rcu(&inode->i_sb_list, list);
+}
+
+/**
+ * inode_sb_list_add - add an inode to the sb's inode list
+ * @inode: inode to add
+ * @sb: sb to add it to
+ *
+ * Use this function to associate an inode with the superblock it belongs to.
+ */
+static void inode_sb_list_add(struct inode *inode, struct super_block *sb)
+{
+	lg_local_lock(inode_list_lglock);
+	__inode_sb_list_add(inode, sb);
+	lg_local_unlock(inode_list_lglock);
+}
+
+/**
+ * inode_sb_list_del - remove an inode from the sb's inode list
+ * @inode: inode to remove
+ *
+ *
+ * Use this function to remove an inode from its superblock.
+ */
+static void inode_sb_list_del(struct inode *inode)
+{
+	lg_local_lock_cpu(inode_list_lglock, inode_list_cpu(inode));
+	list_del_rcu(&inode->i_sb_list);
+	lg_local_unlock_cpu(inode_list_lglock, inode_list_cpu(inode));
+}
+
 static inline void
 __inode_add_to_lists(struct super_block *sb, struct inode_hash_bucket *b,
 			struct inode *inode)
 {
-	spin_lock(&sb_inode_list_lock);
-	list_add_rcu(&inode->i_sb_list, &sb->s_inodes);
-	spin_unlock(&sb_inode_list_lock);
+	inode_sb_list_add(inode, sb);
 	if (b) {
 		spin_lock_bucket(b);
 		hlist_bl_add_head_rcu(&inode->i_hash, &b->head);
@@ -1270,6 +1316,7 @@
 			continue;
 		if (!spin_trylock(&old->i_lock)) {
 			spin_unlock_bucket(b);
+			cpu_relax();
 			goto repeat;
 		}
 		goto found_old;
@@ -1453,9 +1500,7 @@
 			inodes_stat.nr_unused--;
 		spin_unlock(&wb_inode_list_lock);
 	}
-	spin_lock(&sb_inode_list_lock);
-	list_del_rcu(&inode->i_sb_list);
-	spin_unlock(&sb_inode_list_lock);
+	inode_sb_list_del(inode);
 	WARN_ON(inode->i_state & I_NEW);
 	inode->i_state |= I_FREEING;
 	spin_unlock(&inode->i_lock);
@@ -1732,6 +1777,8 @@
 					 init_once);
 	register_shrinker(&icache_shrinker);
 
+	lg_lock_init(inode_list_lglock);
+
 	/* Hash may have been set up in inode_init_early */
 	if (!hashdist)
 		return;
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h	2010-10-19 14:18:59.000000000 +1100
+++ linux-2.6/include/linux/fs.h	2010-10-19 14:19:22.000000000 +1100
@@ -374,6 +374,7 @@
 #include <linux/cache.h>
 #include <linux/kobject.h>
 #include <linux/list.h>
+#include <linux/rculist.h>
 #include <linux/rculist_bl.h>
 #include <linux/radix-tree.h>
 #include <linux/prio_tree.h>
@@ -733,6 +734,9 @@
 		struct rcu_head		i_rcu;
 	};
 	unsigned long		i_ino;
+#ifdef CONFIG_SMP
+	int			i_sb_list_cpu;
+#endif
 	unsigned int		i_count;
 	unsigned int		i_nlink;
 	uid_t			i_uid;
@@ -1344,11 +1348,12 @@
 #endif
 	const struct xattr_handler **s_xattr;
 
-	struct list_head	s_inodes;	/* all inodes */
 	struct hlist_head	s_anon;		/* anonymous dentries for (nfs) exporting */
 #ifdef CONFIG_SMP
+	struct list_head __percpu *s_inodes;
 	struct list_head __percpu *s_files;
 #else
+	struct list_head	s_inodes;	/* all inodes */
 	struct list_head	s_files;
 #endif
 	/* s_dentry_lru and s_nr_dentry_unused are protected by dcache_lock */
@@ -2202,6 +2207,57 @@
 	__insert_inode_hash(inode, inode->i_ino);
 }
 
+#ifdef CONFIG_SMP
+/*
+ * These macros iterate all inodes on all CPUs for a given superblock.
+ * rcu_read_lock must be held.
+ */
+#define do_inode_list_for_each_entry_rcu(__sb, __inode)		\
+{								\
+	int i;							\
+	for_each_possible_cpu(i) {				\
+		struct list_head *list;				\
+		list = per_cpu_ptr((__sb)->s_inodes, i);	\
+		list_for_each_entry_rcu((__inode), list, i_sb_list)
+
+#define while_inode_list_for_each_entry_rcu			\
+	}							\
+}
+
+#define do_inode_list_for_each_entry_safe(__sb, __inode, __tmp)	\
+{								\
+	int i;							\
+	for_each_possible_cpu(i) {				\
+		struct list_head *list;				\
+		list = per_cpu_ptr((__sb)->s_inodes, i);	\
+		list_for_each_entry_safe((__inode), (__tmp), list, i_sb_list)
+
+#define while_inode_list_for_each_entry_safe			\
+	}							\
+}
+
+#else
+
+#define do_inode_list_for_each_entry_rcu(__sb, __inode)		\
+{								\
+	struct list_head *list;					\
+	list = &(__sb)->s_inodes;				\
+	list_for_each_entry_rcu((__inode), list, i_sb_list)
+
+#define while_inode_list_for_each_entry_rcu			\
+}
+
+#define do_inode_list_for_each_entry_safe(__sb, __inode, __tmp)	\
+{								\
+	struct list_head *list;					\
+	list = &(__sb)->s_inodes;				\
+	list_for_each_entry_safe((__inode), (__tmp), list, i_sb_list)
+
+#define while_inode_list_for_each_entry_safe			\
+}
+
+#endif
+
 #ifdef CONFIG_BLOCK
 extern void submit_bio(int, struct bio *);
 extern int bdev_read_only(struct block_device *);
Index: linux-2.6/fs/super.c
===================================================================
--- linux-2.6.orig/fs/super.c	2010-10-19 14:17:17.000000000 +1100
+++ linux-2.6/fs/super.c	2010-10-19 14:18:59.000000000 +1100
@@ -67,12 +67,25 @@
 			for_each_possible_cpu(i)
 				INIT_LIST_HEAD(per_cpu_ptr(s->s_files, i));
 		}
+		s->s_inodes = alloc_percpu(struct list_head);
+		if (!s->s_inodes) {
+			free_percpu(s->s_files);
+			security_sb_free(s);
+			kfree(s);
+			s = NULL;
+			goto out;
+		} else {
+			int i;
+
+			for_each_possible_cpu(i)
+				INIT_LIST_HEAD(per_cpu_ptr(s->s_inodes, i));
+		}
 #else
 		INIT_LIST_HEAD(&s->s_files);
+		INIT_LIST_HEAD(&s->s_inodes);
 #endif
 		INIT_LIST_HEAD(&s->s_instances);
 		INIT_HLIST_HEAD(&s->s_anon);
-		INIT_LIST_HEAD(&s->s_inodes);
 		INIT_LIST_HEAD(&s->s_dentry_lru);
 		init_rwsem(&s->s_umount);
 		mutex_init(&s->s_lock);
@@ -124,6 +137,7 @@
 static inline void destroy_super(struct super_block *s)
 {
 #ifdef CONFIG_SMP
+	free_percpu(s->s_inodes);
 	free_percpu(s->s_files);
 #endif
 	security_sb_free(s);
Index: linux-2.6/fs/drop_caches.c
===================================================================
--- linux-2.6.orig/fs/drop_caches.c	2010-10-19 14:18:59.000000000 +1100
+++ linux-2.6/fs/drop_caches.c	2010-10-19 14:19:18.000000000 +1100
@@ -17,7 +17,7 @@
 	struct inode *inode, *toput_inode = NULL;
 
 	rcu_read_lock();
-	list_for_each_entry_rcu(inode, &sb->s_inodes, i_sb_list) {
+	do_inode_list_for_each_entry_rcu(sb, inode) {
 		spin_lock(&inode->i_lock);
 		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)
 				|| inode->i_mapping->nrpages == 0) {
@@ -31,7 +31,7 @@
 		iput(toput_inode);
 		toput_inode = inode;
 		rcu_read_lock();
-	}
+	} while_inode_list_for_each_entry_rcu
 	rcu_read_unlock();
 	iput(toput_inode);
 }
Index: linux-2.6/fs/fs-writeback.c
===================================================================
--- linux-2.6.orig/fs/fs-writeback.c	2010-10-19 14:18:59.000000000 +1100
+++ linux-2.6/fs/fs-writeback.c	2010-10-19 14:19:22.000000000 +1100
@@ -1074,7 +1074,7 @@
 	 * we still have to wait for that writeout.
 	 */
 	rcu_read_lock();
-	list_for_each_entry_rcu(inode, &sb->s_inodes, i_sb_list) {
+	do_inode_list_for_each_entry_rcu(sb, inode) {
 		struct address_space *mapping;
 
 		spin_lock(&inode->i_lock);
@@ -1093,11 +1093,12 @@
 		spin_unlock(&inode->i_lock);
 		rcu_read_unlock();
 		/*
-		 * We hold a reference to 'inode' so it couldn't have been
-		 * removed from s_inodes list while we dropped the i_lock.  We
-		 * cannot iput the inode now as we can be holding the last
-		 * reference and we cannot iput it under spinlock. So we keep
-		 * the reference and iput it later.
+		 * We hold a reference to 'inode' so it couldn't have
+		 * been removed from s_inodes list while we dropped the
+		 * i_lock.  We cannot iput the inode now as we can be
+		 * holding the last reference and we cannot iput it
+		 * under spinlock. So we keep the reference and iput it
+		 * later.
 		 */
 		iput(old_inode);
 		old_inode = inode;
@@ -1107,7 +1108,7 @@
 		cond_resched();
 
 		rcu_read_lock();
-	}
+	} while_inode_list_for_each_entry_rcu
 	rcu_read_unlock();
 	iput(old_inode);
 }
Index: linux-2.6/fs/notify/inode_mark.c
===================================================================
--- linux-2.6.orig/fs/notify/inode_mark.c	2010-10-19 14:18:59.000000000 +1100
+++ linux-2.6/fs/notify/inode_mark.c	2010-10-19 14:19:18.000000000 +1100
@@ -236,11 +236,11 @@
  * and with the sb going away, no new inodes will appear or be referenced
  * from other paths.
  */
-void fsnotify_unmount_inodes(struct list_head *list)
+void fsnotify_unmount_inodes(struct super_block *sb)
 {
 	struct inode *inode, *next_i, *need_iput = NULL;
 
-	list_for_each_entry_safe(inode, next_i, list, i_sb_list) {
+	do_inode_list_for_each_entry_safe(sb, inode, next_i) {
 		struct inode *need_iput_tmp;
 
 		spin_lock(&inode->i_lock);
@@ -295,5 +295,5 @@
 		fsnotify_inode_delete(inode);
 
 		iput(inode);
-	}
+	} while_inode_list_for_each_entry_safe
 }
Index: linux-2.6/fs/quota/dquot.c
===================================================================
--- linux-2.6.orig/fs/quota/dquot.c	2010-10-19 14:18:59.000000000 +1100
+++ linux-2.6/fs/quota/dquot.c	2010-10-19 14:19:18.000000000 +1100
@@ -898,7 +898,7 @@
 #endif
 
 	rcu_read_lock();
-	list_for_each_entry_rcu(inode, &sb->s_inodes, i_sb_list) {
+	do_inode_list_for_each_entry_rcu(sb, inode) {
 		spin_lock(&inode->i_lock);
 		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
 			spin_unlock(&inode->i_lock);
@@ -930,7 +930,7 @@
 		 * lock. So we keep the reference and iput it later. */
 		old_inode = inode;
 		rcu_read_lock();
-	}
+	} while_inode_list_for_each_entry_rcu
 	rcu_read_unlock();
 	iput(old_inode);
 
@@ -1013,7 +1013,7 @@
 	int reserved = 0;
 
 	rcu_read_lock();
-	list_for_each_entry_rcu(inode, &sb->s_inodes, i_sb_list) {
+	do_inode_list_for_each_entry_rcu(sb, inode) {
 		/*
 		 *  We have to scan also I_NEW inodes because they can already
 		 *  have quota pointer initialized. Luckily, we need to touch
@@ -1025,7 +1025,7 @@
 				reserved = 1;
 			remove_inode_dquot_ref(inode, type, tofree_head);
 		}
-	}
+	} while_inode_list_for_each_entry_rcu
 	rcu_read_unlock();
 #ifdef CONFIG_QUOTA_DEBUG
 	if (reserved) {
Index: linux-2.6/include/linux/fsnotify_backend.h
===================================================================
--- linux-2.6.orig/include/linux/fsnotify_backend.h	2010-10-19 14:17:17.000000000 +1100
+++ linux-2.6/include/linux/fsnotify_backend.h	2010-10-19 14:18:59.000000000 +1100
@@ -402,7 +402,7 @@
 extern void fsnotify_clear_marks_by_group(struct fsnotify_group *group);
 extern void fsnotify_get_mark(struct fsnotify_mark *mark);
 extern void fsnotify_put_mark(struct fsnotify_mark *mark);
-extern void fsnotify_unmount_inodes(struct list_head *list);
+extern void fsnotify_unmount_inodes(struct super_block *sb);
 
 /* put here because inotify does some weird stuff when destroying watches */
 extern struct fsnotify_event *fsnotify_create_event(struct inode *to_tell, __u32 mask,
@@ -443,7 +443,7 @@
 	return 0;
 }
 
-static inline void fsnotify_unmount_inodes(struct list_head *list)
+static inline void fsnotify_unmount_inodes(struct super_block *sb)
 {}
 
 #endif	/* CONFIG_FSNOTIFY */
Index: linux-2.6/include/linux/writeback.h
===================================================================
--- linux-2.6.orig/include/linux/writeback.h	2010-10-19 14:18:59.000000000 +1100
+++ linux-2.6/include/linux/writeback.h	2010-10-19 14:19:21.000000000 +1100
@@ -9,7 +9,6 @@
 
 struct backing_dev_info;
 
-extern spinlock_t sb_inode_list_lock;
 extern spinlock_t wb_inode_list_lock;
 extern struct list_head inode_unused;
 



^ permalink raw reply	[flat|nested] 70+ messages in thread

* [patch 24/35] fs: icache use RCU to avoid locking in hash lookups
  2010-10-19  3:42 [patch 00/35] my inode scaling series for review npiggin
                   ` (22 preceding siblings ...)
  2010-10-19  3:42 ` [patch 23/35] fs: icache use per-CPU lists and locks for sb inode lists npiggin
@ 2010-10-19  3:42 ` npiggin
  2010-10-19  3:42 ` [patch 25/35] fs: icache reduce some locking overheads npiggin
                   ` (12 subsequent siblings)
  36 siblings, 0 replies; 70+ messages in thread
From: npiggin @ 2010-10-19  3:42 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel

[-- Attachment #1: fs-inode-hash-rcu.patch --]
[-- Type: text/plain, Size: 2000 bytes --]

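Both lookup paths convert to the same pattern, sketched here from the
hunks below: walk the hash chain under rcu_read_lock() with no bucket
lock at all, then take i_lock and re-check the hash linkage to catch a
concurrent unhash:

	rcu_read_lock();
	hlist_bl_for_each_entry_rcu(inode, node, &b->head, i_hash) {
		if (inode->i_sb != sb)
			continue;
		spin_lock(&inode->i_lock);
		if (hlist_bl_unhashed(&inode->i_hash)) {
			/* raced with removal; skip this entry */
			spin_unlock(&inode->i_lock);
			continue;
		}
		/* ... match found: break with i_lock held ... */
	}
	rcu_read_unlock();

This also removes the spin_trylock()/cpu_relax() retry dance that the
bucket-lock-then-i_lock ordering previously forced on lookups.
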
Signed-off-by: Nick Piggin <npiggin@kernel.dk>

---
 fs/inode.c |   32 ++++++++++++++++----------------
 1 file changed, 16 insertions(+), 16 deletions(-)

Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c	2010-10-19 14:18:59.000000000 +1100
+++ linux-2.6/fs/inode.c	2010-10-19 14:19:22.000000000 +1100
@@ -646,27 +646,27 @@
 	struct inode *inode = NULL;
 
 repeat:
-	spin_lock_bucket(b);
-	hlist_bl_for_each_entry(inode, node, &b->head, i_hash) {
+	rcu_read_lock();
+	hlist_bl_for_each_entry_rcu(inode, node, &b->head, i_hash) {
 		if (inode->i_sb != sb)
 			continue;
-		if (!spin_trylock(&inode->i_lock)) {
-			spin_unlock_bucket(b);
-			cpu_relax();
-			goto repeat;
+		spin_lock(&inode->i_lock);
+		if (hlist_bl_unhashed(&inode->i_hash)) {
+			spin_unlock(&inode->i_lock);
+			continue;
 		}
 		if (!test(inode, data)) {
 			spin_unlock(&inode->i_lock);
 			continue;
 		}
 		if (inode->i_state & (I_FREEING|I_WILL_FREE)) {
-			spin_unlock_bucket(b);
+			rcu_read_unlock();
 			__wait_on_freeing_inode(inode);
 			goto repeat;
 		}
 		break;
 	}
-	spin_unlock_bucket(b);
+	rcu_read_unlock();
 	return node ? inode : NULL;
 }
 
@@ -682,25 +682,25 @@
 	struct inode *inode = NULL;
 
 repeat:
-	spin_lock_bucket(b);
-	hlist_bl_for_each_entry(inode, node, &b->head, i_hash) {
+	rcu_read_lock();
+	hlist_bl_for_each_entry_rcu(inode, node, &b->head, i_hash) {
 		if (inode->i_ino != ino)
 			continue;
 		if (inode->i_sb != sb)
 			continue;
-		if (!spin_trylock(&inode->i_lock)) {
-			spin_unlock_bucket(b);
-			cpu_relax();
-			goto repeat;
+		spin_lock(&inode->i_lock);
+		if (hlist_bl_unhashed(&inode->i_hash)) {
+			spin_unlock(&inode->i_lock);
+			continue;
 		}
 		if (inode->i_state & (I_FREEING|I_WILL_FREE)) {
-			spin_unlock_bucket(b);
+			rcu_read_unlock();
 			__wait_on_freeing_inode(inode);
 			goto repeat;
 		}
 		break;
 	}
-	spin_unlock_bucket(b);
+	rcu_read_unlock();
 	return node ? inode : NULL;
 }
 



^ permalink raw reply	[flat|nested] 70+ messages in thread

* [patch 25/35] fs: icache reduce some locking overheads
  2010-10-19  3:42 [patch 00/35] my inode scaling series for review npiggin
                   ` (23 preceding siblings ...)
  2010-10-19  3:42 ` [patch 24/35] fs: icache use RCU to avoid locking in hash lookups npiggin
@ 2010-10-19  3:42 ` npiggin
  2010-10-19  3:42 ` [patch 26/35] fs: icache alloc anonymous inode allocation npiggin
                   ` (11 subsequent siblings)
  36 siblings, 0 replies; 70+ messages in thread
From: npiggin @ 2010-10-19  3:42 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel

[-- Attachment #1: fs-inode-reduce-locks.patch --]
[-- Type: text/plain, Size: 1577 bytes --]

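In outline (a sketch of the fs-writeback change below, not a verbatim
excerpt): i_lock is now held across the I_DIRTY test-and-clear and
only dropped around ->write_inode(), which may sleep:

	spin_lock(&inode->i_lock);
	dirty = inode->i_state & I_DIRTY;
	inode->i_state &= ~(I_DIRTY_SYNC | I_DIRTY_DATASYNC);
	if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
		spin_unlock(&inode->i_lock);	/* write_inode() may sleep */
		err = write_inode(inode, wbc);
		if (ret == 0)
			ret = err;
		spin_lock(&inode->i_lock);
	}

This saves an unlock/lock pair in the common case where only
I_DIRTY_PAGES was set.
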
Signed-off-by: Nick Piggin <npiggin@kernel.dk>

---
 fs/fs-writeback.c |    8 +++++---
 fs/inode.c        |    5 ++++-
 2 files changed, 9 insertions(+), 4 deletions(-)

Index: linux-2.6/fs/fs-writeback.c
===================================================================
--- linux-2.6.orig/fs/fs-writeback.c	2010-10-19 14:18:59.000000000 +1100
+++ linux-2.6/fs/fs-writeback.c	2010-10-19 14:19:21.000000000 +1100
@@ -375,15 +375,17 @@
 	spin_lock(&inode->i_lock);
 	dirty = inode->i_state & I_DIRTY;
 	inode->i_state &= ~(I_DIRTY_SYNC | I_DIRTY_DATASYNC);
-	spin_unlock(&inode->i_lock);
 	/* Don't write the inode if only I_DIRTY_PAGES was set */
 	if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
-		int err = write_inode(inode, wbc);
+		int err;
+
+		spin_unlock(&inode->i_lock);
+		err = write_inode(inode, wbc);
 		if (ret == 0)
 			ret = err;
+		spin_lock(&inode->i_lock);
 	}
 
-	spin_lock(&inode->i_lock);
 	spin_lock(&wb_inode_list_lock);
 	inode->i_state &= ~I_SYNC;
 	if (!(inode->i_state & I_FREEING)) {
Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c	2010-10-19 14:18:59.000000000 +1100
+++ linux-2.6/fs/inode.c	2010-10-19 14:19:22.000000000 +1100
@@ -857,9 +857,12 @@
 
 	inode = alloc_inode(sb);
 	if (inode) {
-		spin_lock(&inode->i_lock);
 		inode->i_ino = get_next_ino();
 		inode->i_state = 0;
+		/*
+		 * We could init inode locked here, to improve performance.
+		 */
+		spin_lock(&inode->i_lock);
 		__inode_add_to_lists(sb, NULL, inode);
 		spin_unlock(&inode->i_lock);
 	}



^ permalink raw reply	[flat|nested] 70+ messages in thread

* [patch 26/35] fs: icache alloc anonymous inode allocation
  2010-10-19  3:42 [patch 00/35] my inode scaling series for review npiggin
                   ` (24 preceding siblings ...)
  2010-10-19  3:42 ` [patch 25/35] fs: icache reduce some locking overheads npiggin
@ 2010-10-19  3:42 ` npiggin
  2010-10-19 15:50   ` Miklos Szeredi
  2010-10-19 16:33   ` Christoph Hellwig
  2010-10-19  3:42 ` [patch 27/35] fs: icache split IO and LRU lists npiggin
                   ` (10 subsequent siblings)
  36 siblings, 2 replies; 70+ messages in thread
From: npiggin @ 2010-10-19  3:42 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel

[-- Attachment #1: fs-inode-anon-inodes.patch --]
[-- Type: text/plain, Size: 5067 bytes --]

Provide a new_anon_inode() function for inodes that are given no
default inode number and are not put on the per-sb inode list. This
enables filesystems to reduce locking. "Real" filesystems can also
reduce locking by allocating an anonymous inode first, then adding it
to the lists once the inode number has been found.
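
Callers that still want a st_ino for stat(2) assign one explicitly, as
in the pipe and socket conversions below; in sketch:

	inode = new_anon_inode(mnt->mnt_sb);
	if (!inode)
		return NULL;
	inode->i_ino = get_next_ino();	/* only if a number is needed */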

Signed-off-by: Nick Piggin <npiggin@kernel.dk>

---
 fs/anon_inodes.c   |    2 +-
 fs/inode.c         |   32 +++++++++++++++++++++++++++++++-
 fs/pipe.c          |    3 ++-
 include/linux/fs.h |    2 ++
 net/socket.c       |    3 ++-
 5 files changed, 38 insertions(+), 4 deletions(-)

Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c	2010-10-19 14:18:59.000000000 +1100
+++ linux-2.6/fs/inode.c	2010-10-19 14:19:22.000000000 +1100
@@ -219,6 +219,7 @@
 #ifdef CONFIG_QUOTA
 	memset(&inode->i_dquot, 0, sizeof(inode->i_dquot));
 #endif
+	INIT_LIST_HEAD(&inode->i_sb_list);
 	inode->i_pipe = NULL;
 	inode->i_bdev = NULL;
 	inode->i_cdev = NULL;
@@ -761,6 +762,8 @@
  */
 static void inode_sb_list_del(struct inode *inode)
 {
+	if (list_empty(&inode->i_sb_list))
+		return;
 	lg_local_lock_cpu(inode_list_lglock, inode_list_cpu(inode));
 	list_del_rcu(&inode->i_sb_list);
 	lg_local_unlock_cpu(inode_list_lglock, inode_list_cpu(inode));
@@ -819,7 +822,7 @@
  */
 static DEFINE_PER_CPU(unsigned int, last_ino);
 
-static unsigned int get_next_ino(void)
+unsigned int get_next_ino(void)
 {
 	unsigned int res;
 
@@ -838,6 +841,7 @@
 	put_cpu();
 	return res;
 }
+EXPORT_SYMBOL(get_next_ino);
 
 /**
  *	new_inode 	- obtain an inode
@@ -870,6 +874,32 @@
 }
 EXPORT_SYMBOL(new_inode);
 
+/**
+ *	new_anon_inode 	- obtain an anonymous inode
+ *	@sb: superblock
+ *
+ *	Similar to new_inode, however the inode is not given an inode
+ *	number, and is not added to the sb's list of inodes, to reduce
+ *	overheads.
+ *
+ *	A filesystem which needs an inode number must subsequently
+ *	assign one to i_ino. A filesystem which needs inodes to be on the
+ *	per-sb list (currently only used by the vfs for umount or remount)
+ *	must add the inode to that list.
+ */
+struct inode *new_anon_inode(struct super_block *sb)
+{
+	struct inode *inode;
+
+	inode = alloc_inode(sb);
+	if (inode) {
+		inode->i_ino = ULONG_MAX;
+		inode->i_state = 0;
+	}
+	return inode;
+}
+EXPORT_SYMBOL(new_anon_inode);
+
 void unlock_new_inode(struct inode *inode)
 {
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
Index: linux-2.6/fs/pipe.c
===================================================================
--- linux-2.6.orig/fs/pipe.c	2010-10-19 14:18:59.000000000 +1100
+++ linux-2.6/fs/pipe.c	2010-10-19 14:19:00.000000000 +1100
@@ -948,7 +948,7 @@
 
 static struct inode * get_pipe_inode(void)
 {
-	struct inode *inode = new_inode(pipe_mnt->mnt_sb);
+	struct inode *inode = new_anon_inode(pipe_mnt->mnt_sb);
 	struct pipe_inode_info *pipe;
 
 	if (!inode)
@@ -962,6 +962,7 @@
 	pipe->readers = pipe->writers = 1;
 	inode->i_fop = &rdwr_pipefifo_fops;
 
+	inode->i_ino = get_next_ino();
 	/*
 	 * Mark the inode dirty from the very beginning,
 	 * that way it will never be moved to the dirty
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h	2010-10-19 14:18:59.000000000 +1100
+++ linux-2.6/include/linux/fs.h	2010-10-19 14:19:21.000000000 +1100
@@ -2192,11 +2192,13 @@
 extern int insert_inode_locked(struct inode *);
 extern void unlock_new_inode(struct inode *);
 
+extern unsigned int get_next_ino(void);
 extern void iget_failed(struct inode *);
 extern void end_writeback(struct inode *);
 extern void destroy_inode(struct inode *);
 extern void __destroy_inode(struct inode *);
 extern struct inode *new_inode(struct super_block *);
+extern struct inode *new_anon_inode(struct super_block *);
 extern void free_inode_nonrcu(struct inode *inode);
 extern int should_remove_suid(struct dentry *);
 extern int file_remove_suid(struct file *);
Index: linux-2.6/net/socket.c
===================================================================
--- linux-2.6.orig/net/socket.c	2010-10-19 14:18:59.000000000 +1100
+++ linux-2.6/net/socket.c	2010-10-19 14:19:19.000000000 +1100
@@ -476,13 +476,14 @@
 	struct inode *inode;
 	struct socket *sock;
 
-	inode = new_inode(sock_mnt->mnt_sb);
+	inode = new_anon_inode(sock_mnt->mnt_sb);
 	if (!inode)
 		return NULL;
 
 	sock = SOCKET_I(inode);
 
 	kmemcheck_annotate_bitfield(sock, type);
+	inode->i_ino = get_next_ino();
 	inode->i_mode = S_IFSOCK | S_IRWXUGO;
 	inode->i_uid = current_fsuid();
 	inode->i_gid = current_fsgid();
Index: linux-2.6/fs/anon_inodes.c
===================================================================
--- linux-2.6.orig/fs/anon_inodes.c	2010-10-19 14:18:58.000000000 +1100
+++ linux-2.6/fs/anon_inodes.c	2010-10-19 14:19:19.000000000 +1100
@@ -191,7 +191,7 @@
  */
 static struct inode *anon_inode_mkinode(void)
 {
-	struct inode *inode = new_inode(anon_inode_mnt->mnt_sb);
+	struct inode *inode = new_anon_inode(anon_inode_mnt->mnt_sb);
 
 	if (!inode)
 		return ERR_PTR(-ENOMEM);



^ permalink raw reply	[flat|nested] 70+ messages in thread

* [patch 27/35] fs: icache split IO and LRU lists
  2010-10-19  3:42 [patch 00/35] my inode scaling series for review npiggin
                   ` (25 preceding siblings ...)
  2010-10-19  3:42 ` [patch 26/35] fs: icache alloc anonymous inode allocation npiggin
@ 2010-10-19  3:42 ` npiggin
  2010-10-19 16:12   ` Miklos Szeredi
  2010-10-19  3:42 ` [patch 28/35] fs: icache split writeback and lru locks npiggin
                   ` (9 subsequent siblings)
  36 siblings, 1 reply; 70+ messages in thread
From: npiggin @ 2010-10-19  3:42 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel

[-- Attachment #1: fs-inode-split-lists.patch --]
[-- Type: text/plain, Size: 9980 bytes --]

Split the inode reclaim and writeback lists in preparation for scaling
them up (per-bdi locking for i_io, per-zone locking for i_lru).
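
The new steady-state transition when writeback finds an inode clean
(from the fs-writeback hunk below) shows why two list heads are
needed: the inode leaves the per-bdi IO list and joins the LRU
independently:

	/* inode written back and now clean */
	list_del_init(&inode->i_io);
	if (list_empty(&inode->i_lru)) {
		list_add(&inode->i_lru, &inode_unused);
		inodes_stat.nr_unused++;
	}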

Signed-off-by: Nick Piggin <npiggin@kernel.dk>

---
 fs/fs-writeback.c  |   30 +++++++++++++++++-------------
 fs/inode.c         |   46 +++++++++++++++++++++++++++-------------------
 fs/nilfs2/mdt.c    |    3 ++-
 include/linux/fs.h |    3 ++-
 mm/backing-dev.c   |    6 +++---
 5 files changed, 51 insertions(+), 37 deletions(-)

Index: linux-2.6/fs/fs-writeback.c
===================================================================
--- linux-2.6.orig/fs/fs-writeback.c	2010-10-19 14:18:59.000000000 +1100
+++ linux-2.6/fs/fs-writeback.c	2010-10-19 14:19:21.000000000 +1100
@@ -173,11 +173,11 @@
 	if (!list_empty(&wb->b_dirty)) {
 		struct inode *tail;
 
-		tail = list_entry(wb->b_dirty.next, struct inode, i_list);
+		tail = list_entry(wb->b_dirty.next, struct inode, i_io);
 		if (time_before(inode->dirtied_when, tail->dirtied_when))
 			inode->dirtied_when = jiffies;
 	}
-	list_move(&inode->i_list, &wb->b_dirty);
+	list_move(&inode->i_io, &wb->b_dirty);
 }
 
 /*
@@ -188,7 +188,7 @@
 	struct bdi_writeback *wb = &inode_to_bdi(inode)->wb;
 
 	assert_spin_locked(&wb_inode_list_lock);
-	list_move(&inode->i_list, &wb->b_more_io);
+	list_move(&inode->i_io, &wb->b_more_io);
 }
 
 static void inode_sync_complete(struct inode *inode)
@@ -230,14 +230,14 @@
 
 	assert_spin_locked(&wb_inode_list_lock);
 	while (!list_empty(delaying_queue)) {
-		inode = list_entry(delaying_queue->prev, struct inode, i_list);
+		inode = list_entry(delaying_queue->prev, struct inode, i_io);
 		if (older_than_this &&
 		    inode_dirtied_after(inode, *older_than_this))
 			break;
 		if (sb && sb != inode->i_sb)
 			do_sb_sort = 1;
 		sb = inode->i_sb;
-		list_move(&inode->i_list, &tmp);
+		list_move(&inode->i_io, &tmp);
 	}
 
 	/* just one sb in list, splice to dispatch_queue and we're done */
@@ -248,12 +248,12 @@
 
 	/* Move inodes from one superblock together */
 	while (!list_empty(&tmp)) {
-		inode = list_entry(tmp.prev, struct inode, i_list);
+		inode = list_entry(tmp.prev, struct inode, i_io);
 		sb = inode->i_sb;
 		list_for_each_prev_safe(pos, node, &tmp) {
-			inode = list_entry(pos, struct inode, i_list);
+			inode = list_entry(pos, struct inode, i_io);
 			if (inode->i_sb == sb)
-				list_move(&inode->i_list, dispatch_queue);
+				list_move(&inode->i_io, dispatch_queue);
 		}
 	}
 }
@@ -422,7 +422,11 @@
 			/*
 			 * The inode is clean
 			 */
-			list_move(&inode->i_list, &inode_unused);
+			list_del_init(&inode->i_io);
+			if (list_empty(&inode->i_lru)) {
+				list_add(&inode->i_lru, &inode_unused);
+				inodes_stat.nr_unused++;
+			}
 		}
 	}
 	inode_sync_complete(inode);
@@ -472,7 +476,7 @@
 	while (!list_empty(&wb->b_io)) {
 		long pages_skipped;
 		struct inode *inode = list_entry(wb->b_io.prev,
-						 struct inode, i_list);
+						 struct inode, i_io);
 
 		if (!spin_trylock(&inode->i_lock)) {
 			spin_unlock(&wb_inode_list_lock);
@@ -558,7 +562,7 @@
 
 	while (!list_empty(&wb->b_io)) {
 		struct inode *inode = list_entry(wb->b_io.prev,
-						 struct inode, i_list);
+						 struct inode, i_io);
 		struct super_block *sb = inode->i_sb;
 
 		if (!pin_sb_for_writeback(sb)) {
@@ -703,7 +707,7 @@
 		spin_lock(&wb_inode_list_lock);
 		if (!list_empty(&wb->b_more_io))  {
 			inode = list_entry(wb->b_more_io.prev,
-						struct inode, i_list);
+						struct inode, i_io);
 			if (!spin_trylock(&inode->i_lock)) {
 				spin_unlock(&wb_inode_list_lock);
 				goto retry;
@@ -1029,7 +1033,7 @@
 
 			inode->dirtied_when = jiffies;
 			spin_lock(&wb_inode_list_lock);
-			list_move(&inode->i_list, &bdi->wb.b_dirty);
+			list_move(&inode->i_io, &bdi->wb.b_dirty);
 			spin_unlock(&wb_inode_list_lock);
 		}
 	}
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h	2010-10-19 14:19:00.000000000 +1100
+++ linux-2.6/include/linux/fs.h	2010-10-19 14:19:21.000000000 +1100
@@ -727,7 +727,8 @@
 
 struct inode {
 	struct hlist_bl_node	i_hash;
-	struct list_head	i_list;		/* backing dev IO list */
+	struct list_head	i_io;		/* backing dev IO list */
+	struct list_head	i_lru;		/* inode LRU list */
 	struct list_head	i_sb_list;
 	union {
 		struct list_head	i_dentry;
Index: linux-2.6/mm/backing-dev.c
===================================================================
--- linux-2.6.orig/mm/backing-dev.c	2010-10-19 14:18:59.000000000 +1100
+++ linux-2.6/mm/backing-dev.c	2010-10-19 14:19:20.000000000 +1100
@@ -74,11 +74,11 @@
 
 	nr_wb = nr_dirty = nr_io = nr_more_io = 0;
 	spin_lock(&wb_inode_list_lock);
-	list_for_each_entry(inode, &wb->b_dirty, i_list)
+	list_for_each_entry(inode, &wb->b_dirty, i_io)
 		nr_dirty++;
-	list_for_each_entry(inode, &wb->b_io, i_list)
+	list_for_each_entry(inode, &wb->b_io, i_io)
 		nr_io++;
-	list_for_each_entry(inode, &wb->b_more_io, i_list)
+	list_for_each_entry(inode, &wb->b_more_io, i_io)
 		nr_more_io++;
 	spin_unlock(&wb_inode_list_lock);
 
Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c	2010-10-19 14:19:00.000000000 +1100
+++ linux-2.6/fs/inode.c	2010-10-19 14:19:21.000000000 +1100
@@ -34,12 +34,13 @@
  * inode_hash_bucket lock protects:
  *   inode hash table, i_hash
  * wb_inode_list_lock protects:
- *   inode_in_use, inode_unused, b_io, b_more_io, b_dirty, i_list
+ *   inode_in_use, inode_unused, b_io, b_more_io, b_dirty, i_io, i_lru
  * inode->i_lock protects:
  *   i_state
  *   i_count
  *   i_hash
- *   i_list
+ *   i_io
+ *   i_lru
  *   i_sb_list
  *
  * Ordering:
@@ -327,6 +328,7 @@
 
 void destroy_inode(struct inode *inode)
 {
+	BUG_ON(!list_empty(&inode->i_io));
 	__destroy_inode(inode);
 	if (inode->i_sb->s_op->destroy_inode)
 		inode->i_sb->s_op->destroy_inode(inode);
@@ -345,7 +347,8 @@
 	INIT_HLIST_BL_NODE(&inode->i_hash);
 	INIT_LIST_HEAD(&inode->i_dentry);
 	INIT_LIST_HEAD(&inode->i_devices);
-	INIT_LIST_HEAD(&inode->i_list);
+	INIT_LIST_HEAD(&inode->i_io);
+	INIT_LIST_HEAD(&inode->i_lru);
 	INIT_RADIX_TREE(&inode->i_data.page_tree, GFP_ATOMIC);
 	spin_lock_init(&inode->i_data.tree_lock);
 	spin_lock_init(&inode->i_data.i_mmap_lock);
@@ -413,8 +416,8 @@
 	while (!list_empty(head)) {
 		struct inode *inode;
 
-		inode = list_first_entry(head, struct inode, i_list);
-		list_del_init(&inode->i_list);
+		inode = list_first_entry(head, struct inode, i_lru);
+		list_del_init(&inode->i_lru);
 
 		evict(inode);
 
@@ -445,13 +448,14 @@
 		invalidate_inode_buffers(inode);
 		if (!inode->i_count) {
 			spin_lock(&wb_inode_list_lock);
-			list_del(&inode->i_list);
+			list_del_init(&inode->i_io);
+			list_del(&inode->i_lru);
 			inodes_stat.nr_unused--;
 			spin_unlock(&wb_inode_list_lock);
 			WARN_ON(inode->i_state & I_NEW);
 			inode->i_state |= I_FREEING;
 			spin_unlock(&inode->i_lock);
-			list_add(&inode->i_list, dispose);
+			list_add(&inode->i_lru, dispose);
 			continue;
 		}
 		spin_unlock(&inode->i_lock);
@@ -530,20 +534,20 @@
 		if (list_empty(&inode_unused))
 			break;
 
-		inode = list_entry(inode_unused.prev, struct inode, i_list);
+		inode = list_entry(inode_unused.prev, struct inode, i_lru);
 
 		if (!spin_trylock(&inode->i_lock)) {
 			spin_unlock(&wb_inode_list_lock);
 			goto again;
 		}
 		if (inode->i_count || (inode->i_state & ~I_REFERENCED)) {
-			list_del_init(&inode->i_list);
+			list_del_init(&inode->i_lru);
 			spin_unlock(&inode->i_lock);
 			inodes_stat.nr_unused--;
 			continue;
 		}
 		if (inode->i_state & I_REFERENCED) {
-			list_move(&inode->i_list, &inode_unused);
+			list_move(&inode->i_lru, &inode_unused);
 			inode->i_state &= ~I_REFERENCED;
 			spin_unlock(&inode->i_lock);
 			continue;
@@ -556,7 +560,7 @@
 			 *
 			 * We'll try to get it back if it becomes freeable.
 			 */
-			list_move(&inode->i_list, &inode_unused);
+			list_move(&inode->i_lru, &inode_unused);
 			spin_unlock(&wb_inode_list_lock);
 			__iget(inode);
 			spin_unlock(&inode->i_lock);
@@ -567,7 +571,7 @@
 			iput(inode);
 			spin_lock(&wb_inode_list_lock);
 			if (inode == list_entry(inode_unused.next,
-						struct inode, i_list)) {
+						struct inode, i_lru)) {
 				if (spin_trylock(&inode->i_lock)) {
 					if (can_unuse(inode))
 						goto freeable;
@@ -577,7 +581,7 @@
 			continue;
 		}
 freeable:
-		list_move(&inode->i_list, &freeable);
+		list_move(&inode->i_lru, &freeable);
 		WARN_ON(inode->i_state & I_NEW);
 		inode->i_state |= I_FREEING;
 		spin_unlock(&inode->i_lock);
@@ -1508,9 +1512,9 @@
 		if (sb->s_flags & MS_ACTIVE) {
 			inode->i_state |= I_REFERENCED;
 			if (!(inode->i_state & (I_DIRTY|I_SYNC)) &&
-					list_empty(&inode->i_list)) {
+					list_empty(&inode->i_lru)) {
 				spin_lock(&wb_inode_list_lock);
-				list_add(&inode->i_list, &inode_unused);
+				list_add(&inode->i_lru, &inode_unused);
 				inodes_stat.nr_unused++;
 				spin_unlock(&wb_inode_list_lock);
 			}
@@ -1526,11 +1530,15 @@
 		inode->i_state &= ~I_WILL_FREE;
 		__remove_inode_hash(inode);
 	}
-	if (!list_empty(&inode->i_list)) {
+	if (!list_empty(&inode->i_lru)) {
 		spin_lock(&wb_inode_list_lock);
-		list_del_init(&inode->i_list);
-		if (!inode->i_state)
-			inodes_stat.nr_unused--;
+		list_del_init(&inode->i_lru);
+		inodes_stat.nr_unused--;
+		spin_unlock(&wb_inode_list_lock);
+	}
+	if (!list_empty(&inode->i_io)) {
+		spin_lock(&wb_inode_list_lock);
+		list_del_init(&inode->i_io);
 		spin_unlock(&wb_inode_list_lock);
 	}
 	inode_sb_list_del(inode);
Index: linux-2.6/fs/nilfs2/mdt.c
===================================================================
--- linux-2.6.orig/fs/nilfs2/mdt.c	2010-10-19 14:18:58.000000000 +1100
+++ linux-2.6/fs/nilfs2/mdt.c	2010-10-19 14:19:16.000000000 +1100
@@ -504,7 +504,8 @@
 #endif
 		inode->dirtied_when = 0;
 
-		INIT_LIST_HEAD(&inode->i_list);
+		INIT_LIST_HEAD(&inode->i_io);
+		INIT_LIST_HEAD(&inode->i_lru);
 		INIT_LIST_HEAD(&inode->i_sb_list);
 		inode->i_state = 0;
 #endif



^ permalink raw reply	[flat|nested] 70+ messages in thread

* [patch 28/35] fs: icache split writeback and lru locks
  2010-10-19  3:42 [patch 00/35] my inode scaling series for review npiggin
                   ` (26 preceding siblings ...)
  2010-10-19  3:42 ` [patch 27/35] fs: icache split IO and LRU lists npiggin
@ 2010-10-19  3:42 ` npiggin
  2010-10-19  3:42 ` [patch 29/35] fs: icache per-bdi writeback list locking npiggin
                   ` (8 subsequent siblings)
  36 siblings, 0 replies; 70+ messages in thread
From: npiggin @ 2010-10-19  3:42 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel

[-- Attachment #1: fs-inode-split-wb-lru-locks.patch --]
[-- Type: text/plain, Size: 7307 bytes --]

Split the global wb_inode_list_lock into two locks: a new inode_lru_lock to
protect the inode LRU list, while wb_inode_list_lock continues to protect the
inode writeback lists (it becomes a per-bdi lock in the next patch).
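
The key invariant is the lock ordering: inode->i_lock nests outside the new
inode_lru_lock. A caller such as iput_final() then follows this pattern (a
simplified sketch of what the hunks below implement, not a literal excerpt):

	spin_lock(&inode->i_lock);
	if (!(inode->i_state & (I_DIRTY|I_SYNC)) && list_empty(&inode->i_lru))
		__inode_lru_list_add(inode);	/* takes inode_lru_lock inside */
	spin_unlock(&inode->i_lock);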

Signed-off-by: Nick Piggin <npiggin@kernel.dk>

---
 fs/fs-writeback.c         |    6 +--
 fs/inode.c                |   73 ++++++++++++++++++++++++++++------------------
 include/linux/fs.h        |    2 +
 include/linux/writeback.h |    1 -
 4 files changed, 50 insertions(+), 32 deletions(-)

Index: linux-2.6/fs/fs-writeback.c
===================================================================
--- linux-2.6.orig/fs/fs-writeback.c	2010-10-19 14:19:00.000000000 +1100
+++ linux-2.6/fs/fs-writeback.c	2010-10-19 14:19:20.000000000 +1100
@@ -423,10 +423,8 @@
 			 * The inode is clean
 			 */
 			list_del_init(&inode->i_io);
-			if (list_empty(&inode->i_lru)) {
-				list_add(&inode->i_lru, &inode_unused);
-				inodes_stat.nr_unused++;
-			}
+			if (list_empty(&inode->i_lru))
+				__inode_lru_list_add(inode);
 		}
 	}
 	inode_sync_complete(inode);
Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c	2010-10-19 14:19:00.000000000 +1100
+++ linux-2.6/fs/inode.c	2010-10-19 14:19:20.000000000 +1100
@@ -33,8 +33,10 @@
  *   s_inodes, i_sb_list
  * inode_hash_bucket lock protects:
  *   inode hash table, i_hash
+ * inode_lru_lock protects:
+ *   inode_lru, i_lru
  * wb_inode_list_lock protects:
- *   inode_in_use, inode_unused, b_io, b_more_io, b_dirty, i_io, i_lru
+ *   b_io, b_more_io, b_dirty, i_io
  * inode->i_lock protects:
  *   i_state
  *   i_count
@@ -46,6 +48,7 @@
  * Ordering:
  * inode->i_lock
  *   inode_list_lglock
+ *   inode_lru_lock
  *   wb_inode_list_lock
  *   inode_hash_bucket lock
  */
@@ -96,7 +99,7 @@
  * allowing for low-overhead inode sync() operations.
  */
 
-LIST_HEAD(inode_unused);
+static LIST_HEAD(inode_lru);
 
 struct inode_hash_bucket {
 	struct hlist_bl_head head;
@@ -124,6 +127,7 @@
 DEFINE_LGLOCK(inode_list_lglock);
 
 DEFINE_SPINLOCK(wb_inode_list_lock);
+static DEFINE_SPINLOCK(inode_lru_lock);
 
 /*
  * iprune_sem provides exclusion between the kswapd or try_to_free_pages
@@ -432,6 +436,28 @@
 }
 
 /*
+ * Add an inode to the LRU list. i_lock must be held.
+ */
+void __inode_lru_list_add(struct inode *inode)
+{
+	spin_lock(&inode_lru_lock);
+	list_add(&inode->i_lru, &inode_lru);
+	inodes_stat.nr_unused++;
+	spin_unlock(&inode_lru_lock);
+}
+
+/*
+ * Remove an inode from the LRU list. i_lock must be held.
+ */
+void __inode_lru_list_del(struct inode *inode)
+{
+	spin_lock(&inode_lru_lock);
+	list_del_init(&inode->i_lru);
+	inodes_stat.nr_unused--;
+	spin_unlock(&inode_lru_lock);
+}
+
+/*
  * Invalidate all inodes for a device.
  */
 static int invalidate_sb_inodes(struct super_block *sb, struct list_head *dispose)
@@ -449,9 +475,10 @@
 		if (!inode->i_count) {
 			spin_lock(&wb_inode_list_lock);
 			list_del_init(&inode->i_io);
-			list_del(&inode->i_lru);
-			inodes_stat.nr_unused--;
 			spin_unlock(&wb_inode_list_lock);
+
+			__inode_lru_list_del(inode);
+
 			WARN_ON(inode->i_state & I_NEW);
 			inode->i_state |= I_FREEING;
 			spin_unlock(&inode->i_lock);
@@ -513,7 +540,7 @@
  *
  * Any inodes which are pinned purely because of attached pagecache have their
  * pagecache removed.  We expect the final iput() on that inode to add it to
- * the front of the inode_unused list.  So look for it there and if the
+ * the front of the inode_lru list.  So look for it there and if the
  * inode is still freeable, proceed.  The right inode is found 99.9% of the
  * time in testing on a 4-way.
  *
@@ -527,17 +554,17 @@
 
 	down_read(&iprune_sem);
 again:
-	spin_lock(&wb_inode_list_lock);
+	spin_lock(&inode_lru_lock);
 	for (; nr_to_scan; nr_to_scan--) {
 		struct inode *inode;
 
-		if (list_empty(&inode_unused))
+		if (list_empty(&inode_lru))
 			break;
 
-		inode = list_entry(inode_unused.prev, struct inode, i_lru);
+		inode = list_entry(inode_lru.prev, struct inode, i_lru);
 
 		if (!spin_trylock(&inode->i_lock)) {
-			spin_unlock(&wb_inode_list_lock);
+			spin_unlock(&inode_lru_lock);
 			goto again;
 		}
 		if (inode->i_count || (inode->i_state & ~I_REFERENCED)) {
@@ -547,7 +574,7 @@
 			continue;
 		}
 		if (inode->i_state & I_REFERENCED) {
-			list_move(&inode->i_lru, &inode_unused);
+			list_move(&inode->i_lru, &inode_lru);
 			inode->i_state &= ~I_REFERENCED;
 			spin_unlock(&inode->i_lock);
 			continue;
@@ -560,8 +587,8 @@
 			 *
 			 * We'll try to get it back if it becomes freeable.
 			 */
-			list_move(&inode->i_lru, &inode_unused);
-			spin_unlock(&wb_inode_list_lock);
+			list_move(&inode->i_lru, &inode_lru);
+			spin_unlock(&inode_lru_lock);
 			__iget(inode);
 			spin_unlock(&inode->i_lock);
 
@@ -569,8 +596,8 @@
 				reap += invalidate_mapping_pages(&inode->i_data,
 								0, -1);
 			iput(inode);
-			spin_lock(&wb_inode_list_lock);
-			if (inode == list_entry(inode_unused.next,
+			spin_lock(&inode_lru_lock);
+			if (inode == list_entry(inode_lru.next,
 						struct inode, i_lru)) {
 				if (spin_trylock(&inode->i_lock)) {
 					if (can_unuse(inode))
@@ -591,7 +618,7 @@
 		__count_vm_events(KSWAPD_INODESTEAL, reap);
 	else
 		__count_vm_events(PGINODESTEAL, reap);
-	spin_unlock(&wb_inode_list_lock);
+	spin_unlock(&inode_lru_lock);
 
 	dispose_list(&freeable);
 	up_read(&iprune_sem);
@@ -1512,12 +1539,8 @@
 		if (sb->s_flags & MS_ACTIVE) {
 			inode->i_state |= I_REFERENCED;
 			if (!(inode->i_state & (I_DIRTY|I_SYNC)) &&
-					list_empty(&inode->i_lru)) {
-				spin_lock(&wb_inode_list_lock);
-				list_add(&inode->i_lru, &inode_unused);
-				inodes_stat.nr_unused++;
-				spin_unlock(&wb_inode_list_lock);
-			}
+					list_empty(&inode->i_lru))
+				__inode_lru_list_add(inode);
 			spin_unlock(&inode->i_lock);
 			return;
 		}
@@ -1530,12 +1553,8 @@
 		inode->i_state &= ~I_WILL_FREE;
 		__remove_inode_hash(inode);
 	}
-	if (!list_empty(&inode->i_lru)) {
-		spin_lock(&wb_inode_list_lock);
-		list_del_init(&inode->i_lru);
-		inodes_stat.nr_unused--;
-		spin_unlock(&wb_inode_list_lock);
-	}
+	if (!list_empty(&inode->i_lru))
+		__inode_lru_list_del(inode);
 	if (!list_empty(&inode->i_io)) {
 		spin_lock(&wb_inode_list_lock);
 		list_del_init(&inode->i_io);
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h	2010-10-19 14:19:00.000000000 +1100
+++ linux-2.6/include/linux/fs.h	2010-10-19 14:19:18.000000000 +1100
@@ -2088,6 +2088,8 @@
 extern int __invalidate_device(struct block_device *);
 extern int invalidate_partition(struct gendisk *, int);
 #endif
+extern void __inode_lru_list_add(struct inode *inode);
+extern void __inode_lru_list_del(struct inode *inode);
 extern int invalidate_inodes(struct super_block *);
 unsigned long invalidate_mapping_pages(struct address_space *mapping,
 					pgoff_t start, pgoff_t end);
Index: linux-2.6/include/linux/writeback.h
===================================================================
--- linux-2.6.orig/include/linux/writeback.h	2010-10-19 14:18:59.000000000 +1100
+++ linux-2.6/include/linux/writeback.h	2010-10-19 14:19:20.000000000 +1100
@@ -10,7 +10,6 @@
 struct backing_dev_info;
 
 extern spinlock_t wb_inode_list_lock;
-extern struct list_head inode_unused;
 
 /*
  * fs/fs-writeback.c



^ permalink raw reply	[flat|nested] 70+ messages in thread

* [patch 29/35] fs: icache per-bdi writeback list locking
  2010-10-19  3:42 [patch 00/35] my inode scaling series for review npiggin
                   ` (27 preceding siblings ...)
  2010-10-19  3:42 ` [patch 28/35] fs: icache split writeback and lru locks npiggin
@ 2010-10-19  3:42 ` npiggin
  2010-10-19  3:42 ` [patch 30/35] fs: icache lazy LRU avoid LRU locking after IO operation npiggin
                   ` (7 subsequent siblings)
  36 siblings, 0 replies; 70+ messages in thread
From: npiggin @ 2010-10-19  3:42 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel

[-- Attachment #1: fs-inode-scale-wb-2.patch --]
[-- Type: text/plain, Size: 15072 bytes --]

Scale inode writeback lists by breaking the global writeback list lock
into per-bdi locks.
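
Each inode maps to the writeback structure of its backing device via
inode_to_wb(), and list manipulation takes that bdi's b_lock in place of the
global lock. Sketched, with names as introduced below:

	struct bdi_writeback *wb = inode_to_wb(inode);

	spin_lock(&inode->i_lock);
	spin_lock(&wb->b_lock);		/* was wb_inode_list_lock */
	list_move(&inode->i_io, &wb->b_more_io);
	spin_unlock(&wb->b_lock);
	spin_unlock(&inode->i_lock);

When two bdis must be held at once (bdi_destroy() migrating its lists to the
default bdi), bdi_lock_two() acquires the two b_locks in address order so the
lock ordering stays consistent.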

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
---
 fs/fs-writeback.c           |  110 ++++++++++++++++++++------------------------
 fs/inode.c                  |   17 ++++--
 fs/internal.h               |   12 ++++
 include/linux/backing-dev.h |    2 +
 include/linux/writeback.h   |    2 -
 mm/backing-dev.c            |   28 +++++++++--
 6 files changed, 100 insertions(+), 71 deletions(-)

Index: linux-2.6/fs/fs-writeback.c
===================================================================
--- linux-2.6.orig/fs/fs-writeback.c	2010-10-19 14:19:00.000000000 +1100
+++ linux-2.6/fs/fs-writeback.c	2010-10-19 14:19:20.000000000 +1100
@@ -69,16 +69,6 @@
 	return test_bit(BDI_writeback_running, &bdi->state);
 }
 
-static inline struct backing_dev_info *inode_to_bdi(struct inode *inode)
-{
-	struct super_block *sb = inode->i_sb;
-
-	if (strcmp(sb->s_type->name, "bdev") == 0)
-		return inode->i_mapping->backing_dev_info;
-
-	return sb->s_bdi;
-}
-
 static void bdi_queue_work(struct backing_dev_info *bdi,
 		struct wb_writeback_work *work)
 {
@@ -165,11 +155,9 @@
  * the case then the inode must have been redirtied while it was being written
  * out and we don't reset its dirtied_when.
  */
-static void redirty_tail(struct inode *inode)
+static void redirty_tail(struct bdi_writeback *wb, struct inode *inode)
 {
-	struct bdi_writeback *wb = &inode_to_bdi(inode)->wb;
-
-	assert_spin_locked(&wb_inode_list_lock);
+	assert_spin_locked(&wb->b_lock);
 	if (!list_empty(&wb->b_dirty)) {
 		struct inode *tail;
 
@@ -183,11 +171,9 @@
 /*
  * requeue inode for re-scanning after bdi->b_io list is exhausted.
  */
-static void requeue_io(struct inode *inode)
+static void requeue_io(struct bdi_writeback *wb, struct inode *inode)
 {
-	struct bdi_writeback *wb = &inode_to_bdi(inode)->wb;
-
-	assert_spin_locked(&wb_inode_list_lock);
+	assert_spin_locked(&wb->b_lock);
 	list_move(&inode->i_io, &wb->b_more_io);
 }
 
@@ -228,7 +214,6 @@
 	struct inode *inode;
 	int do_sb_sort = 0;
 
-	assert_spin_locked(&wb_inode_list_lock);
 	while (!list_empty(delaying_queue)) {
 		inode = list_entry(delaying_queue->prev, struct inode, i_io);
 		if (older_than_this &&
@@ -285,18 +270,19 @@
 /*
  * Wait for writeback on an inode to complete.
  */
-static void inode_wait_for_writeback(struct inode *inode)
+static void inode_wait_for_writeback(struct bdi_writeback *wb,
+					struct inode *inode)
 {
 	DEFINE_WAIT_BIT(wq, &inode->i_state, __I_SYNC);
 	wait_queue_head_t *wqh;
 
 	wqh = bit_waitqueue(&inode->i_state, __I_SYNC);
 	while (inode->i_state & I_SYNC) {
-		spin_unlock(&wb_inode_list_lock);
+		spin_unlock(&wb->b_lock);
 		spin_unlock(&inode->i_lock);
 		__wait_on_bit(wqh, &wq, inode_wait, TASK_UNINTERRUPTIBLE);
 		spin_lock(&inode->i_lock);
-		spin_lock(&wb_inode_list_lock);
+		spin_lock(&wb->b_lock);
 	}
 }
 
@@ -315,7 +301,8 @@
  * with them locked.
  */
 static int
-writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
+writeback_single_inode(struct bdi_writeback *wb, struct inode *inode,
+			struct writeback_control *wbc)
 {
 	struct address_space *mapping = inode->i_mapping;
 	unsigned dirty;
@@ -336,14 +323,14 @@
 		 * completed a full scan of b_io.
 		 */
 		if (wbc->sync_mode != WB_SYNC_ALL) {
-			requeue_io(inode);
+			requeue_io(wb, inode);
 			return 0;
 		}
 
 		/*
 		 * It's a data-integrity sync.  We must wait.
 		 */
-		inode_wait_for_writeback(inode);
+		inode_wait_for_writeback(wb, inode);
 	}
 
 	BUG_ON(inode->i_state & I_SYNC);
@@ -351,7 +338,7 @@
 	/* Set I_SYNC, reset I_DIRTY_PAGES */
 	inode->i_state |= I_SYNC;
 	inode->i_state &= ~I_DIRTY_PAGES;
-	spin_unlock(&wb_inode_list_lock);
+	spin_unlock(&wb->b_lock);
 	spin_unlock(&inode->i_lock);
 
 	ret = do_writepages(mapping, wbc);
@@ -386,7 +373,7 @@
 		spin_lock(&inode->i_lock);
 	}
 
-	spin_lock(&wb_inode_list_lock);
+	spin_lock(&wb->b_lock);
 	inode->i_state &= ~I_SYNC;
 	if (!(inode->i_state & I_FREEING)) {
 		if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
@@ -399,7 +386,7 @@
 				/*
 				 * slice used up: queue for next turn
 				 */
-				requeue_io(inode);
+				requeue_io(wb, inode);
 			} else {
 				/*
 				 * Writeback blocked by something other than
@@ -408,7 +395,7 @@
 				 * retrying writeback of the dirty page/inode
 				 * that cannot be performed immediately.
 				 */
-				redirty_tail(inode);
+				redirty_tail(wb, inode);
 			}
 		} else if (inode->i_state & I_DIRTY) {
 			/*
@@ -417,7 +404,7 @@
 			 * submission or metadata updates after data IO
 			 * completion.
 			 */
-			redirty_tail(inode);
+			redirty_tail(wb, inode);
 		} else {
 			/*
 			 * The inode is clean
@@ -477,8 +464,9 @@
 						 struct inode, i_io);
 
 		if (!spin_trylock(&inode->i_lock)) {
-			spin_unlock(&wb_inode_list_lock);
-			spin_lock(&wb_inode_list_lock);
+			spin_unlock(&wb->b_lock);
+			cpu_relax();
+			spin_lock(&wb->b_lock);
 			goto again;
 		}
 
@@ -489,7 +477,7 @@
 				 * superblock, move all inodes not belonging
 				 * to it back onto the dirty list.
 				 */
-				redirty_tail(inode);
+				redirty_tail(wb, inode);
 				spin_unlock(&inode->i_lock);
 				continue;
 			}
@@ -505,7 +493,7 @@
 		}
 
 		if (inode->i_state & (I_NEW | I_WILL_FREE)) {
-			requeue_io(inode);
+			requeue_io(wb, inode);
 			spin_unlock(&inode->i_lock);
 			continue;
 		}
@@ -521,19 +509,19 @@
 		BUG_ON(inode->i_state & I_FREEING);
 		__iget(inode);
 		pages_skipped = wbc->pages_skipped;
-		writeback_single_inode(inode, wbc);
+		writeback_single_inode(wb, inode, wbc);
 		if (wbc->pages_skipped != pages_skipped) {
 			/*
 			 * writeback is not making progress due to locked
 			 * buffers.  Skip this inode for now.
 			 */
-			redirty_tail(inode);
+			redirty_tail(wb, inode);
 		}
-		spin_unlock(&wb_inode_list_lock);
+		spin_unlock(&wb->b_lock);
 		spin_unlock(&inode->i_lock);
 		iput(inode);
 		cond_resched();
-		spin_lock(&wb_inode_list_lock);
+		spin_lock(&wb->b_lock);
 		if (wbc->nr_to_write <= 0) {
 			wbc->more_io = 1;
 			return 1;
@@ -553,7 +541,7 @@
 	if (!wbc->wb_start)
 		wbc->wb_start = jiffies; /* livelock avoidance */
 again:
-	spin_lock(&wb_inode_list_lock);
+	spin_lock(&wb->b_lock);
 
 	if (!wbc->for_kupdate || list_empty(&wb->b_io))
 		queue_io(wb, wbc->older_than_this);
@@ -565,10 +553,11 @@
 
 		if (!pin_sb_for_writeback(sb)) {
 			if (!spin_trylock(&inode->i_lock)) {
-				spin_unlock(&wb_inode_list_lock);
+				spin_unlock(&wb->b_lock);
+				cpu_relax();
 				goto again;
 			}
-			requeue_io(inode);
+			requeue_io(wb, inode);
 			spin_unlock(&inode->i_lock);
 			continue;
 		}
@@ -578,7 +567,7 @@
 		if (ret)
 			break;
 	}
-	spin_unlock(&wb_inode_list_lock);
+	spin_unlock(&wb->b_lock);
 	/* Leave any unwritten inodes on b_io */
 }
 
@@ -587,11 +576,11 @@
 {
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
-	spin_lock(&wb_inode_list_lock);
+	spin_lock(&wb->b_lock);
 	if (!wbc->for_kupdate || list_empty(&wb->b_io))
 		queue_io(wb, wbc->older_than_this);
 	writeback_sb_inodes(sb, wb, wbc, true);
-	spin_unlock(&wb_inode_list_lock);
+	spin_unlock(&wb->b_lock);
 }
 
 /*
@@ -702,19 +691,19 @@
 		 * we'll just busyloop.
 		 */
 retry:
-		spin_lock(&wb_inode_list_lock);
+		spin_lock(&wb->b_lock);
 		if (!list_empty(&wb->b_more_io))  {
 			inode = list_entry(wb->b_more_io.prev,
 						struct inode, i_io);
 			if (!spin_trylock(&inode->i_lock)) {
-				spin_unlock(&wb_inode_list_lock);
+				spin_unlock(&wb->b_lock);
 				goto retry;
 			}
 			trace_wbc_writeback_wait(&wbc, wb->bdi);
-			inode_wait_for_writeback(inode);
+			inode_wait_for_writeback(wb, inode);
 			spin_unlock(&inode->i_lock);
 		}
-		spin_unlock(&wb_inode_list_lock);
+		spin_unlock(&wb->b_lock);
 	}
 
 	return wrote;
@@ -1013,7 +1002,9 @@
 		 * reposition it (that would break b_dirty time-ordering).
 		 */
 		if (!was_dirty) {
-			bdi = inode_to_bdi(inode);
+			struct bdi_writeback *wb;
+			bdi = inode_to_bdi(inode);
+			wb = inode_to_wb(inode);
 
 			if (bdi_cap_writeback_dirty(bdi)) {
 				WARN(!test_bit(BDI_registered, &bdi->state),
@@ -1030,9 +1021,10 @@
 			}
 
 			inode->dirtied_when = jiffies;
-			spin_lock(&wb_inode_list_lock);
-			list_move(&inode->i_io, &bdi->wb.b_dirty);
-			spin_unlock(&wb_inode_list_lock);
+			spin_lock(&wb->b_lock);
+			BUG_ON(!list_empty(&inode->i_io));
+			list_add(&inode->i_io, &wb->b_dirty);
+			spin_unlock(&wb->b_lock);
 		}
 	}
 out:
@@ -1209,6 +1201,7 @@
  */
 int write_inode_now(struct inode *inode, int sync)
 {
+	struct bdi_writeback *wb = inode_to_wb(inode);
 	int ret;
 	struct writeback_control wbc = {
 		.nr_to_write = LONG_MAX,
@@ -1222,9 +1215,9 @@
 
 	might_sleep();
 	spin_lock(&inode->i_lock);
-	spin_lock(&wb_inode_list_lock);
-	ret = writeback_single_inode(inode, &wbc);
-	spin_unlock(&wb_inode_list_lock);
+	spin_lock(&wb->b_lock);
+	ret = writeback_single_inode(wb, inode, &wbc);
+	spin_unlock(&wb->b_lock);
 	spin_unlock(&inode->i_lock);
 	if (sync)
 		inode_sync_wait(inode);
@@ -1245,12 +1238,13 @@
  */
 int sync_inode(struct inode *inode, struct writeback_control *wbc)
 {
+	struct bdi_writeback *wb = inode_to_wb(inode);
 	int ret;
 
 	spin_lock(&inode->i_lock);
-	spin_lock(&wb_inode_list_lock);
-	ret = writeback_single_inode(inode, wbc);
-	spin_unlock(&wb_inode_list_lock);
+	spin_lock(&wb->b_lock);
+	ret = writeback_single_inode(wb, inode, wbc);
+	spin_unlock(&wb->b_lock);
 	spin_unlock(&inode->i_lock);
 	return ret;
 }
Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c	2010-10-19 14:19:00.000000000 +1100
+++ linux-2.6/fs/inode.c	2010-10-19 14:19:19.000000000 +1100
@@ -26,6 +26,7 @@
 #include <linux/posix_acl.h>
 #include <linux/bit_spinlock.h>
 #include <linux/lglock.h>
+#include "internal.h"
 
 /*
  * Usage:
@@ -35,7 +36,7 @@
  *   inode hash table, i_hash
  * inode_lru_lock protects:
  *   inode_lru, i_lru
- * wb_inode_list_lock protects:
+ * wb->b_lock protects:
 *   b_io, b_more_io, b_dirty, i_io
  * inode->i_lock protects:
  *   i_state
@@ -49,7 +50,7 @@
  * inode->i_lock
  *   inode_list_lglock
  *   inode_lru_lock
- *   wb_inode_list_lock
+ *   wb->b_lock
  *   inode_hash_bucket lock
  */
 /*
@@ -126,7 +127,6 @@
 DECLARE_LGLOCK(inode_list_lglock);
 DEFINE_LGLOCK(inode_list_lglock);
 
-DEFINE_SPINLOCK(wb_inode_list_lock);
 static DEFINE_SPINLOCK(inode_lru_lock);
 
 /*
@@ -473,9 +473,11 @@
 		}
 		invalidate_inode_buffers(inode);
 		if (!inode->i_count) {
-			spin_lock(&wb_inode_list_lock);
+			struct bdi_writeback *wb = inode_to_wb(inode);
+
+			spin_lock(&wb->b_lock);
 			list_del_init(&inode->i_io);
-			spin_unlock(&wb_inode_list_lock);
+			spin_unlock(&wb->b_lock);
 
 			__inode_lru_list_del(inode);
 
@@ -1556,9 +1558,10 @@
 	if (!list_empty(&inode->i_lru))
 		__inode_lru_list_del(inode);
 	if (!list_empty(&inode->i_io)) {
-		spin_lock(&wb_inode_list_lock);
+		struct bdi_writeback *wb = inode_to_wb(inode);
+		spin_lock(&wb->b_lock);
 		list_del_init(&inode->i_io);
-		spin_unlock(&wb_inode_list_lock);
+		spin_unlock(&wb->b_lock);
 	}
 	inode_sb_list_del(inode);
 	WARN_ON(inode->i_state & I_NEW);
Index: linux-2.6/fs/internal.h
===================================================================
--- linux-2.6.orig/fs/internal.h	2010-10-19 14:17:28.000000000 +1100
+++ linux-2.6/fs/internal.h	2010-10-19 14:19:00.000000000 +1100
@@ -15,6 +15,18 @@
 struct linux_binprm;
 struct path;
 
+static inline struct backing_dev_info *inode_to_bdi(struct inode *inode)
+{
+	struct super_block *sb = inode->i_sb;
+
+	if (strcmp(sb->s_type->name, "bdev") == 0)
+		return inode->i_mapping->backing_dev_info;
+
+	return sb->s_bdi;
+}
+
+#define inode_to_wb(inode)   (&inode_to_bdi(inode)->wb)
+
 /*
  * block_dev.c
  */
Index: linux-2.6/include/linux/backing-dev.h
===================================================================
--- linux-2.6.orig/include/linux/backing-dev.h	2010-10-19 14:17:15.000000000 +1100
+++ linux-2.6/include/linux/backing-dev.h	2010-10-19 14:19:00.000000000 +1100
@@ -16,6 +16,7 @@
 #include <linux/sched.h>
 #include <linux/timer.h>
 #include <linux/writeback.h>
+#include <linux/spinlock.h>
 #include <asm/atomic.h>
 
 struct page;
@@ -54,6 +55,7 @@
 
 	struct task_struct *task;	/* writeback thread */
 	struct timer_list wakeup_timer; /* used for delayed bdi thread wakeup */
+	spinlock_t b_lock;		/* lock for inode lists */
 	struct list_head b_dirty;	/* dirty inodes */
 	struct list_head b_io;		/* parked for writeback */
 	struct list_head b_more_io;	/* parked for more writeback */
Index: linux-2.6/include/linux/writeback.h
===================================================================
--- linux-2.6.orig/include/linux/writeback.h	2010-10-19 14:19:00.000000000 +1100
+++ linux-2.6/include/linux/writeback.h	2010-10-19 14:19:00.000000000 +1100
@@ -9,8 +9,6 @@
 
 struct backing_dev_info;
 
-extern spinlock_t wb_inode_list_lock;
-
 /*
  * fs/fs-writeback.c
  */
Index: linux-2.6/mm/backing-dev.c
===================================================================
--- linux-2.6.orig/mm/backing-dev.c	2010-10-19 14:19:00.000000000 +1100
+++ linux-2.6/mm/backing-dev.c	2010-10-19 14:19:00.000000000 +1100
@@ -73,14 +73,14 @@
 	struct inode *inode;
 
 	nr_wb = nr_dirty = nr_io = nr_more_io = 0;
-	spin_lock(&wb_inode_list_lock);
+	spin_lock(&wb->b_lock);
 	list_for_each_entry(inode, &wb->b_dirty, i_io)
 		nr_dirty++;
 	list_for_each_entry(inode, &wb->b_io, i_io)
 		nr_io++;
 	list_for_each_entry(inode, &wb->b_more_io, i_io)
 		nr_more_io++;
-	spin_unlock(&wb_inode_list_lock);
+	spin_unlock(&wb->b_lock);
 
 	global_dirty_limits(&background_thresh, &dirty_thresh);
 	bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
@@ -631,6 +631,7 @@
 
 	wb->bdi = bdi;
 	wb->last_old_flush = jiffies;
+	spin_lock_init(&wb->b_lock);
 	INIT_LIST_HEAD(&wb->b_dirty);
 	INIT_LIST_HEAD(&wb->b_io);
 	INIT_LIST_HEAD(&wb->b_more_io);
@@ -671,6 +672,17 @@
 }
 EXPORT_SYMBOL(bdi_init);
 
+static void bdi_lock_two(struct backing_dev_info *bdi1, struct backing_dev_info *bdi2)
+{
+	if (bdi1 < bdi2) {
+		spin_lock(&bdi1->wb.b_lock);
+		spin_lock_nested(&bdi2->wb.b_lock, 1);
+	} else {
+		spin_lock(&bdi2->wb.b_lock);
+		spin_lock_nested(&bdi1->wb.b_lock, 1);
+	}
+}
+
 void bdi_destroy(struct backing_dev_info *bdi)
 {
 	int i;
@@ -682,11 +694,19 @@
 	if (bdi_has_dirty_io(bdi)) {
 		struct bdi_writeback *dst = &default_backing_dev_info.wb;
 
-		spin_lock(&wb_inode_list_lock);
+		bdi_lock_two(bdi, &default_backing_dev_info);
+		/*
+		 * It's OK to move inodes between different wb lists without
+		 * locking the individual inodes. i_lock will still protect
+		 * whether or not an inode is on a writeback list. However this
+		 * is a little quirky; it may be better to lock all inodes in
+		 * this uncommon case just to keep the locking very regular.
+		 */
 		list_splice(&bdi->wb.b_dirty, &dst->b_dirty);
 		list_splice(&bdi->wb.b_io, &dst->b_io);
 		list_splice(&bdi->wb.b_more_io, &dst->b_more_io);
-		spin_unlock(&wb_inode_list_lock);
+		spin_unlock(&bdi->wb.b_lock);
+		spin_unlock(&dst->b_lock);
 	}
 
 	bdi_unregister(bdi);



^ permalink raw reply	[flat|nested] 70+ messages in thread

* [patch 30/35] fs: icache lazy LRU avoid LRU locking after IO operation
  2010-10-19  3:42 [patch 00/35] my inode scaling series for review npiggin
                   ` (28 preceding siblings ...)
  2010-10-19  3:42 ` [patch 29/35] fs: icache per-bdi writeback list locking npiggin
@ 2010-10-19  3:42 ` npiggin
  2010-10-19  3:42 ` [patch 31/35] fs: icache per-zone inode LRU npiggin
                   ` (6 subsequent siblings)
  36 siblings, 0 replies; 70+ messages in thread
From: npiggin @ 2010-10-19  3:42 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel

[-- Attachment #1: fs-inode-lazy-lru-wb-avoid.patch --]
[-- Type: text/plain, Size: 875 bytes --]

Now that the inode LRU and writeback lists have separate locking, it makes
sense to avoid LRU manipulation after completing writeback, and instead to
rely on the lazy LRU algorithm to do the work for us.
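
"Lazy" means a clean inode that later gains a reference is simply left on the
LRU; the shrinker already culls such stale entries when it scans past them,
along these lines (a sketch of the existing prune_icache() check):

	if (inode->i_count || (inode->i_state & ~I_REFERENCED)) {
		/* re-referenced or busy: drop the stale LRU entry lazily */
		list_del_init(&inode->i_lru);
		spin_unlock(&inode->i_lock);
		inodes_stat.nr_unused--;
		continue;
	}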

Signed-off-by: Nick Piggin <npiggin@kernel.dk>

---
 fs/fs-writeback.c |    6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

Index: linux-2.6/fs/fs-writeback.c
===================================================================
--- linux-2.6.orig/fs/fs-writeback.c	2010-10-19 14:19:00.000000000 +1100
+++ linux-2.6/fs/fs-writeback.c	2010-10-19 14:19:18.000000000 +1100
@@ -410,7 +410,11 @@
 			 * The inode is clean
 			 */
 			list_del_init(&inode->i_io);
-			if (list_empty(&inode->i_lru))
+
+			/*
+			 * Put it on the LRU if unused; otherwise handle it lazily.
+			 */
+			if (!inode->i_count && list_empty(&inode->i_lru))
 				__inode_lru_list_add(inode);
 		}
 	}



^ permalink raw reply	[flat|nested] 70+ messages in thread

* [patch 31/35] fs: icache per-zone inode LRU
  2010-10-19  3:42 [patch 00/35] my inode scaling series for review npiggin
                   ` (29 preceding siblings ...)
  2010-10-19  3:42 ` [patch 30/35] fs: icache lazy LRU avoid LRU locking after IO operation npiggin
@ 2010-10-19  3:42 ` npiggin
  2010-10-19 12:38   ` Dave Chinner
  2010-10-19  3:42 ` [patch 32/35] fs: icache minimise I_FREEING latency npiggin
                   ` (5 subsequent siblings)
  36 siblings, 1 reply; 70+ messages in thread
From: npiggin @ 2010-10-19  3:42 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel

[-- Attachment #1: fs-inode-per-zone-lru.patch --]
[-- Type: text/plain, Size: 6854 bytes --]

Per-zone LRUs and shrinkers for the inode cache.
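
An inode's LRU is chosen from the zone backing its memory, so every LRU
operation becomes (a sketch of the pattern the hunks below implement):

	struct zone *z = page_zone(virt_to_page(inode));

	spin_lock(&z->inode_lru_lock);
	list_add(&inode->i_lru, &z->inode_lru);
	z->inode_nr_lru++;
	spin_unlock(&z->inode_lru_lock);

The shrinker is likewise invoked per zone, so reclaim pressure against one
zone scans only the inodes whose memory belongs to that zone.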

Signed-off-by: Nick Piggin <npiggin@kernel.dk>

---
 fs/inode.c             |   84 ++++++++++++++++++++++++++++---------------------
 include/linux/mmzone.h |    7 ++++
 2 files changed, 56 insertions(+), 35 deletions(-)

Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c	2010-10-19 14:38:31.000000000 +1100
+++ linux-2.6/fs/inode.c	2010-10-19 14:39:04.000000000 +1100
@@ -34,7 +34,7 @@
  *   s_inodes, i_sb_list
  * inode_hash_bucket lock protects:
  *   inode hash table, i_hash
- * inode_lru_lock protects:
+ * zone->inode_lru_lock protects:
  *   inode_lru, i_lru
  * wb->b_lock protects:
 *   b_io, b_more_io, b_dirty, i_io
@@ -49,7 +49,7 @@
  * Ordering:
  * inode->i_lock
  *   inode_list_lglock
- *   inode_lru_lock
+ *   zone->inode_lru_lock
  *   wb->b_lock
  *   inode_hash_bucket lock
  */
@@ -100,8 +100,6 @@
  * allowing for low-overhead inode sync() operations.
  */
 
-static LIST_HEAD(inode_lru);
-
 struct inode_hash_bucket {
 	struct hlist_bl_head head;
 };
@@ -127,8 +125,6 @@
 DECLARE_LGLOCK(inode_list_lglock);
 DEFINE_LGLOCK(inode_list_lglock);
 
-static DEFINE_SPINLOCK(inode_lru_lock);
-
 /*
  * iprune_sem provides exclusion between the kswapd or try_to_free_pages
  * icache shrinking path, and the umount path.  Without this exclusion,
@@ -166,7 +162,12 @@
 
 int get_nr_inodes_unused(void)
 {
-	return inodes_stat.nr_unused;
+	int nr = 0;
+	struct zone *z;
+
+	for_each_populated_zone(z)
+		nr += z->inode_nr_lru;
+	return nr;
 }
 
 /*
@@ -177,6 +178,7 @@
 {
 #if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS)
 	inodes_stat.nr_inodes = get_nr_inodes();
+	inodes_stat.nr_unused = get_nr_inodes_unused();
 	return proc_dointvec(table, write, buffer, lenp, ppos);
 #else
 	return -ENOSYS;
@@ -440,10 +442,12 @@
  */
 void __inode_lru_list_add(struct inode *inode)
 {
-	spin_lock(&inode_lru_lock);
-	list_add(&inode->i_lru, &inode_lru);
-	inodes_stat.nr_unused++;
-	spin_unlock(&inode_lru_lock);
+	struct zone *z = page_zone(virt_to_page(inode));
+
+	spin_lock(&z->inode_lru_lock);
+	list_add(&inode->i_lru, &z->inode_lru);
+	z->inode_nr_lru++;
+	spin_unlock(&z->inode_lru_lock);
 }
 
 /*
@@ -451,10 +455,12 @@
  */
 void __inode_lru_list_del(struct inode *inode)
 {
-	spin_lock(&inode_lru_lock);
+	struct zone *z = page_zone(virt_to_page(inode));
+
+	spin_lock(&z->inode_lru_lock);
 	list_del_init(&inode->i_lru);
-	inodes_stat.nr_unused--;
-	spin_unlock(&inode_lru_lock);
+	z->inode_nr_lru--;
+	spin_unlock(&z->inode_lru_lock);
 }
 
 /*
@@ -549,34 +555,35 @@
  * If the inode has metadata buffers attached to mapping->private_list then
  * try to remove them.
  */
-static void prune_icache(unsigned long nr_to_scan)
+static void prune_icache(struct zone *zone, unsigned long nr_to_scan)
 {
 	LIST_HEAD(freeable);
 	unsigned long reap = 0;
 
 	down_read(&iprune_sem);
 again:
-	spin_lock(&inode_lru_lock);
+	spin_lock(&zone->inode_lru_lock);
 	for (; nr_to_scan; nr_to_scan--) {
 		struct inode *inode;
 
-		if (list_empty(&inode_lru))
+		if (list_empty(&zone->inode_lru))
 			break;
 
-		inode = list_entry(inode_lru.prev, struct inode, i_lru);
+		inode = list_entry(zone->inode_lru.prev, struct inode, i_lru);
 
 		if (!spin_trylock(&inode->i_lock)) {
-			spin_unlock(&inode_lru_lock);
+			spin_unlock(&zone->inode_lru_lock);
+			cpu_relax();
 			goto again;
 		}
 		if (inode->i_count || (inode->i_state & ~I_REFERENCED)) {
 			list_del_init(&inode->i_lru);
 			spin_unlock(&inode->i_lock);
-			inodes_stat.nr_unused--;
+			zone->inode_nr_lru--;
 			continue;
 		}
 		if (inode->i_state & I_REFERENCED) {
-			list_move(&inode->i_lru, &inode_lru);
+			list_move(&inode->i_lru, &zone->inode_lru);
 			inode->i_state &= ~I_REFERENCED;
 			spin_unlock(&inode->i_lock);
 			continue;
@@ -589,8 +596,8 @@
 			 *
 			 * We'll try to get it back if it becomes freeable.
 			 */
-			list_move(&inode->i_lru, &inode_lru);
-			spin_unlock(&inode_lru_lock);
+			list_move(&inode->i_lru, &zone->inode_lru);
+			spin_unlock(&zone->inode_lru_lock);
 			__iget(inode);
 			spin_unlock(&inode->i_lock);
 
@@ -598,8 +605,8 @@
 				reap += invalidate_mapping_pages(&inode->i_data,
 								0, -1);
 			iput(inode);
-			spin_lock(&inode_lru_lock);
-			if (inode == list_entry(inode_lru.next,
+			spin_lock(&zone->inode_lru_lock);
+			if (inode == list_entry(zone->inode_lru.next,
 						struct inode, i_lru)) {
 				if (spin_trylock(&inode->i_lock)) {
 					if (can_unuse(inode))
@@ -614,13 +621,13 @@
 		WARN_ON(inode->i_state & I_NEW);
 		inode->i_state |= I_FREEING;
 		spin_unlock(&inode->i_lock);
-		inodes_stat.nr_unused--;
+		zone->inode_nr_lru--;
 	}
 	if (current_is_kswapd())
 		__count_vm_events(KSWAPD_INODESTEAL, reap);
 	else
 		__count_vm_events(PGINODESTEAL, reap);
-	spin_unlock(&inode_lru_lock);
+	spin_unlock(&zone->inode_lru_lock);
 
 	dispose_list(&freeable);
 	up_read(&iprune_sem);
@@ -639,11 +646,10 @@
 		unsigned long total, unsigned long global,
 		unsigned long flags, gfp_t gfp_mask)
 {
-	static unsigned long nr_to_scan;
 	unsigned long nr;
 
-	shrinker_add_scan(&nr_to_scan, scanned, global,
-			inodes_stat.nr_unused,
+	shrinker_add_scan(&zone->inode_nr_scan, scanned, total,
+			zone->inode_nr_lru,
 			SHRINK_DEFAULT_SEEKS * 100 / sysctl_vfs_cache_pressure);
 	/*
 	 * Nasty deadlock avoidance.  We may hold various FS locks,
@@ -653,11 +659,12 @@
 	if (!(gfp_mask & __GFP_FS))
 	       return;
 
-	while ((nr = shrinker_do_scan(&nr_to_scan, SHRINK_BATCH))) {
-		prune_icache(nr);
-		count_vm_events(SLABS_SCANNED, nr);
-		cond_resched();
-	}
+	nr = ACCESS_ONCE(zone->inode_nr_scan);
+	if (nr < SHRINK_BATCH)
+		return;
+	zone->inode_nr_scan = 0;
+	prune_icache(zone, nr);
+	count_vm_events(SLABS_SCANNED, nr);
 }
 
 static struct shrinker icache_shrinker = {
@@ -1830,6 +1837,7 @@
 void __init inode_init(void)
 {
 	int loop;
+	struct zone *zone;
 
 	/* inode slab cache */
 	inode_cachep = kmem_cache_create("inode_cache",
@@ -1838,6 +1846,12 @@
 					 (SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|
 					 SLAB_MEM_SPREAD),
 					 init_once);
+	for_each_zone(zone) {
+		spin_lock_init(&zone->inode_lru_lock);
+		INIT_LIST_HEAD(&zone->inode_lru);
+		zone->inode_nr_lru = 0;
+		zone->inode_nr_scan = 0;
+	}
 	register_shrinker(&icache_shrinker);
 
 	lg_lock_init(inode_list_lglock);
Index: linux-2.6/include/linux/mmzone.h
===================================================================
--- linux-2.6.orig/include/linux/mmzone.h	2010-10-19 14:19:19.000000000 +1100
+++ linux-2.6/include/linux/mmzone.h	2010-10-19 14:38:33.000000000 +1100
@@ -362,6 +362,13 @@
 
 
 	ZONE_PADDING(_pad2_)
+
+	spinlock_t inode_lru_lock;
+	struct list_head inode_lru;
+	unsigned long inode_nr_lru;
+	unsigned long inode_nr_scan;
+
+	ZONE_PADDING(_pad3_)
 	/* Rarely used or read-mostly fields */
 
 	/*



^ permalink raw reply	[flat|nested] 70+ messages in thread

* [patch 32/35] fs: icache minimise I_FREEING latency
  2010-10-19  3:42 [patch 00/35] my inode scaling series for review npiggin
                   ` (30 preceding siblings ...)
  2010-10-19  3:42 ` [patch 31/35] fs: icache per-zone inode LRU npiggin
@ 2010-10-19  3:42 ` npiggin
  2010-10-19  3:42 ` [patch 33/35] fs: icache introduce inode_get/inode_get_ilock npiggin
                   ` (4 subsequent siblings)
  36 siblings, 0 replies; 70+ messages in thread
From: npiggin @ 2010-10-19  3:42 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel

[-- Attachment #1: fs-inode-less-i_freeing.patch --]
[-- Type: text/plain, Size: 2331 bytes --]

The problem with inode reclaim is that it puts inodes into I_FREEING state
and then continues to gather more, during which it may call iput(),
invalidate_mapping_pages(), be preempted, etc. Holding these inodes in
I_FREEING for that long can cause pauses.

After the inode scalability work, there is no longer a big reason to batch
up inodes for reclaim, so dispose of them as they are found on the LRU.
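
With this change an inode is held in I_FREEING only for the duration of its
own eviction; the reclaim loop becomes, schematically (a condensed sketch of
the hunks below):

	list_del_init(&inode->i_lru);
	inode->i_state |= I_FREEING;
	spin_unlock(&inode->i_lock);
	zone->inode_nr_lru--;
	spin_unlock(&zone->inode_lru_lock);
	dispose_one_inode(inode);	/* evict, unhash, destroy now */
	cond_resched();
	spin_lock(&zone->inode_lru_lock);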

Signed-off-by: Nick Piggin <npiggin@kernel.dk>

---
 fs/inode.c |   32 ++++++++++++++++++++------------
 1 file changed, 20 insertions(+), 12 deletions(-)

Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c	2010-10-19 14:19:00.000000000 +1100
+++ linux-2.6/fs/inode.c	2010-10-19 14:19:18.000000000 +1100
@@ -410,6 +410,19 @@
 
 static void inode_sb_list_del(struct inode *inode);
 
+static void dispose_one_inode(struct inode *inode)
+{
+	evict(inode);
+
+	spin_lock(&inode->i_lock);
+	__remove_inode_hash(inode);
+	inode_sb_list_del(inode);
+	spin_unlock(&inode->i_lock);
+
+	wake_up_inode(inode);
+	destroy_inode(inode);
+}
+
 /*
  * dispose_list - dispose of the contents of a local list
  * @head: the head of the list to free
@@ -425,15 +438,8 @@
 		inode = list_first_entry(head, struct inode, i_lru);
 		list_del_init(&inode->i_lru);
 
-		evict(inode);
-
-		spin_lock(&inode->i_lock);
-		__remove_inode_hash(inode);
-		inode_sb_list_del(inode);
-		spin_unlock(&inode->i_lock);
-
-		wake_up_inode(inode);
-		destroy_inode(inode);
+		dispose_one_inode(inode);
+		cond_resched();
 	}
 }
 
@@ -557,7 +563,6 @@
  */
 static void prune_icache(struct zone *zone, unsigned long nr_to_scan)
 {
-	LIST_HEAD(freeable);
 	unsigned long reap = 0;
 
 	down_read(&iprune_sem);
@@ -617,11 +622,15 @@
 			continue;
 		}
 freeable:
-		list_move(&inode->i_lru, &freeable);
+		list_del_init(&inode->i_lru);
 		WARN_ON(inode->i_state & I_NEW);
 		inode->i_state |= I_FREEING;
 		spin_unlock(&inode->i_lock);
 		zone->inode_nr_lru--;
+		spin_unlock(&zone->inode_lru_lock);
+		dispose_one_inode(inode);
+		cond_resched();
+		spin_lock(&zone->inode_lru_lock);
 	}
 	if (current_is_kswapd())
 		__count_vm_events(KSWAPD_INODESTEAL, reap);
@@ -629,7 +638,6 @@
 		__count_vm_events(PGINODESTEAL, reap);
 	spin_unlock(&zone->inode_lru_lock);
 
-	dispose_list(&freeable);
 	up_read(&iprune_sem);
 }
 



^ permalink raw reply	[flat|nested] 70+ messages in thread

* [patch 33/35] fs: icache introduce inode_get/inode_get_ilock
  2010-10-19  3:42 [patch 00/35] my inode scaling series for review npiggin
                   ` (31 preceding siblings ...)
  2010-10-19  3:42 ` [patch 32/35] fs: icache minimise I_FREEING latency npiggin
@ 2010-10-19  3:42 ` npiggin
  2010-10-19 10:17   ` Boaz Harrosh
  2010-10-19  3:42 ` [patch 34/35] fs: inode rename i_count to i_refs npiggin
                   ` (3 subsequent siblings)
  36 siblings, 1 reply; 70+ messages in thread
From: npiggin @ 2010-10-19  3:42 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel

[-- Attachment #1: fs-inode-rationalize-iget.patch --]
[-- Type: text/plain, Size: 29757 bytes --]

Factor the open-coded inode lock, increment, unlock sequence into a new
function, inode_get(). Rename __iget() to inode_get_ilock().
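
The call-site conversion is mechanical, e.g.:

	-	spin_lock(&inode->i_lock);
	-	inode->i_count++;
	-	spin_unlock(&inode->i_lock);
	+	inode_get(inode);

inode_get_ilock() also gains a BUG_ON against I_FREEING/I_CLEAR/I_WILL_FREE
inodes, since taking a new reference to an inode being torn down is always a
bug.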

Signed-off-by: Nick Piggin <npiggin@kernel.dk>

---
 fs/9p/vfs_inode.c           |    4 +---
 fs/affs/inode.c             |    4 +---
 fs/afs/dir.c                |    4 +---
 fs/anon_inodes.c            |    4 +---
 fs/bfs/dir.c                |    4 +---
 fs/block_dev.c              |   14 +++-----------
 fs/btrfs/inode.c            |    4 +---
 fs/coda/dir.c               |    4 +---
 fs/drop_caches.c            |    2 +-
 fs/exofs/inode.c            |    4 +---
 fs/exofs/namei.c            |    4 +---
 fs/ext2/namei.c             |    4 +---
 fs/ext3/namei.c             |    4 +---
 fs/ext4/namei.c             |    4 +---
 fs/fs-writeback.c           |    6 +++---
 fs/gfs2/ops_inode.c         |    4 +---
 fs/hfsplus/dir.c            |    4 +---
 fs/inode.c                  |   18 +++++++++---------
 fs/jffs2/dir.c              |    8 ++------
 fs/jfs/jfs_txnmgr.c         |    4 +---
 fs/jfs/namei.c              |    4 +---
 fs/libfs.c                  |    4 +---
 fs/logfs/dir.c              |    4 +---
 fs/minix/namei.c            |    4 +---
 fs/namei.c                  |    7 ++-----
 fs/nfs/dir.c                |    4 +---
 fs/nfs/getroot.c            |    6 ++----
 fs/nfs/write.c              |    2 +-
 fs/nilfs2/namei.c           |    4 +---
 fs/notify/inode_mark.c      |    8 ++++----
 fs/ntfs/super.c             |    4 +---
 fs/ocfs2/namei.c            |    4 +---
 fs/quota/dquot.c            |    2 +-
 fs/reiserfs/namei.c         |    4 +---
 fs/sysv/namei.c             |    4 +---
 fs/ubifs/dir.c              |    4 +---
 fs/udf/namei.c              |    4 +---
 fs/ufs/namei.c              |    4 +---
 fs/xfs/linux-2.6/xfs_iops.c |    4 +---
 fs/xfs/xfs_inode.h          |    4 +---
 include/linux/fs.h          |   10 +++++++++-
 ipc/mqueue.c                |    7 ++-----
 kernel/futex.c              |    4 +---
 mm/shmem.c                  |    4 +---
 net/socket.c                |    4 +---
 45 files changed, 72 insertions(+), 150 deletions(-)

Index: linux-2.6/fs/drop_caches.c
===================================================================
--- linux-2.6.orig/fs/drop_caches.c	2010-10-19 14:18:59.000000000 +1100
+++ linux-2.6/fs/drop_caches.c	2010-10-19 14:19:00.000000000 +1100
@@ -24,7 +24,7 @@
 			spin_unlock(&inode->i_lock);
 			continue;
 		}
-		__iget(inode);
+		inode_get_ilock(inode);
 		spin_unlock(&inode->i_lock);
 		rcu_read_unlock();
 		invalidate_mapping_pages(inode->i_mapping, 0, -1);
Index: linux-2.6/fs/fs-writeback.c
===================================================================
--- linux-2.6.orig/fs/fs-writeback.c	2010-10-19 14:19:00.000000000 +1100
+++ linux-2.6/fs/fs-writeback.c	2010-10-19 14:19:16.000000000 +1100
@@ -288,7 +288,7 @@
 
 /*
  * Write out an inode's dirty pages. Either the caller has ref on the inode
- * (either via __iget or via syscall against an fd) or the inode has
+ * (either via inode_get or via syscall against an fd) or the inode has
  * I_WILL_FREE set (via generic_forget_inode)
  *
  * If `wait' is set, wait on the writeout.
@@ -511,7 +511,7 @@
 		}
 
 		BUG_ON(inode->i_state & I_FREEING);
-		__iget(inode);
+		inode_get_ilock(inode);
 		pages_skipped = wbc->pages_skipped;
 		writeback_single_inode(wb, inode, wbc);
 		if (wbc->pages_skipped != pages_skipped) {
@@ -1089,7 +1089,7 @@
 			continue;
 		}
 
- 		__iget(inode);
+ 		inode_get_ilock(inode);
 		spin_unlock(&inode->i_lock);
 		rcu_read_unlock();
 		/*
Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c	2010-10-19 14:19:00.000000000 +1100
+++ linux-2.6/fs/inode.c	2010-10-19 14:19:16.000000000 +1100
@@ -603,7 +603,7 @@
 			 */
 			list_move(&inode->i_lru, &zone->inode_lru);
 			spin_unlock(&zone->inode_lru_lock);
-			__iget(inode);
+			inode_get_ilock(inode);
 			spin_unlock(&inode->i_lock);
 
 			if (remove_inode_buffers(inode))
@@ -682,7 +682,7 @@
 static void __wait_on_freeing_inode(struct inode *inode);
 /*
  * Called with the inode lock held.
- * NOTE: we are not increasing the inode-refcount, you must call __iget()
+ * NOTE: we are not increasing the inode-refcount, you must call inode_get_ilock()
  * by hand after calling find_inode now! This simplifies iunique and won't
  * add any additional branch in the common code.
  */
@@ -1023,7 +1023,7 @@
 		 * us. Use the old inode instead of the one we just
 		 * allocated.
 		 */
-		__iget(old);
+		inode_get_ilock(old);
 		spin_unlock(&old->i_lock);
 		destroy_inode(inode);
 		inode = old;
@@ -1070,7 +1070,7 @@
 		 * us. Use the old inode instead of the one we just
 		 * allocated.
 		 */
-		__iget(old);
+		inode_get_ilock(old);
 		spin_unlock(&old->i_lock);
 		destroy_inode(inode);
 		inode = old;
@@ -1145,7 +1145,7 @@
 
 	spin_lock(&inode->i_lock);
 	if (!(inode->i_state & (I_FREEING|I_WILL_FREE)))
-		__iget(inode);
+		inode_get_ilock(inode);
 	else
 		/*
 		 * Handle the case where s_op->clear_inode is not been
@@ -1187,7 +1187,7 @@
 
 	inode = find_inode(sb, b, test, data);
 	if (inode) {
-		__iget(inode);
+		inode_get_ilock(inode);
 		spin_unlock(&inode->i_lock);
 		if (likely(wait))
 			wait_on_inode(inode);
@@ -1219,7 +1219,7 @@
 
 	inode = find_inode_fast(sb, b, ino);
 	if (inode) {
-		__iget(inode);
+		inode_get_ilock(inode);
 		spin_unlock(&inode->i_lock);
 		wait_on_inode(inode);
 		return inode;
@@ -1408,7 +1408,7 @@
 
 found_old:
 	spin_unlock_bucket(b);
-	__iget(old);
+	inode_get_ilock(old);
 	spin_unlock(&old->i_lock);
 	wait_on_inode(old);
 	if (unlikely(!hlist_bl_unhashed(&old->i_hash))) {
@@ -1453,7 +1453,7 @@
 
 found_old:
 	spin_unlock_bucket(b);
-	__iget(old);
+	inode_get_ilock(old);
 	spin_unlock(&old->i_lock);
 	wait_on_inode(old);
 	if (unlikely(!hlist_bl_unhashed(&old->i_hash))) {
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h	2010-10-19 14:19:00.000000000 +1100
+++ linux-2.6/include/linux/fs.h	2010-10-19 14:19:15.000000000 +1100
@@ -2462,12 +2462,20 @@
 extern void save_mount_options(struct super_block *sb, char *options);
 extern void replace_mount_options(struct super_block *sb, char *options);
 
-static inline void __iget(struct inode *inode)
+static inline void inode_get_ilock(struct inode *inode)
 {
 	assert_spin_locked(&inode->i_lock);
+	BUG_ON(inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE));
 	inode->i_count++;
 }
 
+static inline void inode_get(struct inode *inode)
+{
+	spin_lock(&inode->i_lock);
+	inode_get_ilock(inode);
+	spin_unlock(&inode->i_lock);
+}
+
 static inline ino_t parent_ino(struct dentry *dentry)
 {
 	ino_t res;
Index: linux-2.6/fs/nfs/write.c
===================================================================
--- linux-2.6.orig/fs/nfs/write.c	2010-10-19 14:18:58.000000000 +1100
+++ linux-2.6/fs/nfs/write.c	2010-10-19 14:19:00.000000000 +1100
@@ -390,7 +390,7 @@
 	error = radix_tree_insert(&nfsi->nfs_page_tree, req->wb_index, req);
 	BUG_ON(error);
 	if (!nfsi->npages) {
-		__iget(inode);
+		inode_get_ilock(inode);
 		if (nfs_have_delegation(inode, FMODE_WRITE))
 			nfsi->change_attr++;
 	}
Index: linux-2.6/fs/notify/inode_mark.c
===================================================================
--- linux-2.6.orig/fs/notify/inode_mark.c	2010-10-19 14:18:59.000000000 +1100
+++ linux-2.6/fs/notify/inode_mark.c	2010-10-19 14:19:16.000000000 +1100
@@ -245,7 +245,7 @@
 
 		spin_lock(&inode->i_lock);
 		/*
-		 * We cannot __iget() an inode in state I_FREEING,
+		 * We cannot inode_get() an inode in state I_FREEING,
 		 * I_WILL_FREE, or I_NEW which is fine because by that point
 		 * the inode cannot have any associated watches.
 		 */
@@ -256,7 +256,7 @@
 
 		/*
 		 * If i_count is zero, the inode cannot have any watches and
-		 * doing an __iget/iput with MS_ACTIVE clear would actually
+		 * doing an inode_get/iput with MS_ACTIVE clear would actually
 		 * evict all inodes with zero i_count from icache which is
 		 * unnecessarily violent and may in fact be illegal to do.
 		 */
@@ -270,7 +270,7 @@
 
 		/* In case fsnotify_inode_delete() drops a reference. */
 		if (inode != need_iput_tmp)
-			__iget(inode);
+			inode_get_ilock(inode);
 		else
 			need_iput_tmp = NULL;
 		spin_unlock(&inode->i_lock);
@@ -280,7 +280,7 @@
 			spin_lock(&next_i->i_lock);
 			if (next_i->i_count &&
 			    !(next_i->i_state & (I_FREEING | I_WILL_FREE))) {
-				__iget(next_i);
+				inode_get_ilock(next_i);
 				need_iput = next_i;
 			}
 			spin_unlock(&next_i->i_lock);
Index: linux-2.6/fs/quota/dquot.c
===================================================================
--- linux-2.6.orig/fs/quota/dquot.c	2010-10-19 14:18:59.000000000 +1100
+++ linux-2.6/fs/quota/dquot.c	2010-10-19 14:19:00.000000000 +1100
@@ -917,7 +917,7 @@
 			continue;
 		}
 
-		__iget(inode);
+		inode_get_ilock(inode);
 		spin_unlock(&inode->i_lock);
 		rcu_read_unlock();
 
Index: linux-2.6/fs/affs/inode.c
===================================================================
--- linux-2.6.orig/fs/affs/inode.c	2010-10-19 14:18:58.000000000 +1100
+++ linux-2.6/fs/affs/inode.c	2010-10-19 14:19:00.000000000 +1100
@@ -388,9 +388,7 @@
 		affs_adjust_checksum(inode_bh, block - be32_to_cpu(chain));
 		mark_buffer_dirty_inode(inode_bh, inode);
 		inode->i_nlink = 2;
-		spin_lock(&inode->i_lock);
-		inode->i_count++;
-		spin_unlock(&inode->i_lock);
+		inode_get(inode);
 	}
 	affs_fix_checksum(sb, bh);
 	mark_buffer_dirty_inode(bh, inode);
Index: linux-2.6/fs/anon_inodes.c
===================================================================
--- linux-2.6.orig/fs/anon_inodes.c	2010-10-19 14:19:00.000000000 +1100
+++ linux-2.6/fs/anon_inodes.c	2010-10-19 14:19:00.000000000 +1100
@@ -114,9 +114,7 @@
 	 * so we can avoid doing an igrab() and we can use an open-coded
 	 * atomic_inc().
 	 */
-	spin_lock(&anon_inode_inode->i_lock);
-	anon_inode_inode->i_count++;
-	spin_unlock(&anon_inode_inode->i_lock);
+	inode_get(anon_inode_inode);
 
 	path.dentry->d_op = &anon_inodefs_dentry_operations;
 	d_instantiate(path.dentry, anon_inode_inode);
Index: linux-2.6/fs/bfs/dir.c
===================================================================
--- linux-2.6.orig/fs/bfs/dir.c	2010-10-19 14:18:58.000000000 +1100
+++ linux-2.6/fs/bfs/dir.c	2010-10-19 14:19:00.000000000 +1100
@@ -176,9 +176,7 @@
 	inc_nlink(inode);
 	inode->i_ctime = CURRENT_TIME_SEC;
 	mark_inode_dirty(inode);
-	spin_lock(&inode->i_lock);
-	inode->i_count++;
-	spin_unlock(&inode->i_lock);
+	inode_get(inode);
 	d_instantiate(new, inode);
 	mutex_unlock(&info->bfs_lock);
 	return 0;
Index: linux-2.6/fs/block_dev.c
===================================================================
--- linux-2.6.orig/fs/block_dev.c	2010-10-19 14:18:59.000000000 +1100
+++ linux-2.6/fs/block_dev.c	2010-10-19 14:19:16.000000000 +1100
@@ -557,11 +557,7 @@
  */
 struct block_device *bdgrab(struct block_device *bdev)
 {
-	struct inode *inode = bdev->bd_inode;
-
-	spin_lock(&inode->i_lock);
-	inode->i_count++;
-	spin_unlock(&inode->i_lock);
+	inode_get(bdev->bd_inode);
 
 	return bdev;
 }
@@ -592,9 +588,7 @@
 	spin_lock(&bdev_lock);
 	bdev = inode->i_bdev;
 	if (bdev) {
-		spin_lock(&inode->i_lock);
-		bdev->bd_inode->i_count++;
-		spin_unlock(&inode->i_lock);
+		bdgrab(bdev);
 		spin_unlock(&bdev_lock);
 		return bdev;
 	}
@@ -610,9 +604,7 @@
 			 * So, we can access it via ->i_mapping always
 			 * without igrab().
 			 */
-			spin_lock(&inode->i_lock);
-			bdev->bd_inode->i_count++;
-			spin_unlock(&inode->i_lock);
+			inode_get(bdev->bd_inode);
 			inode->i_bdev = bdev;
 			inode->i_mapping = bdev->bd_inode->i_mapping;
 			list_add(&inode->i_devices, &bdev->bd_inodes);
Index: linux-2.6/fs/btrfs/inode.c
===================================================================
--- linux-2.6.orig/fs/btrfs/inode.c	2010-10-19 14:18:59.000000000 +1100
+++ linux-2.6/fs/btrfs/inode.c	2010-10-19 14:19:16.000000000 +1100
@@ -4763,9 +4763,7 @@
 	}
 
 	btrfs_set_trans_block_group(trans, dir);
-	spin_lock(&inode->i_lock);
-	inode->i_count++;
-	spin_unlock(&inode->i_lock);
+	inode_get(inode);
 
 	err = btrfs_add_nondir(trans, dentry, inode, 1, index);
 
Index: linux-2.6/fs/coda/dir.c
===================================================================
--- linux-2.6.orig/fs/coda/dir.c	2010-10-19 14:18:58.000000000 +1100
+++ linux-2.6/fs/coda/dir.c	2010-10-19 14:19:00.000000000 +1100
@@ -303,9 +303,7 @@
 	}
 
 	coda_dir_update_mtime(dir_inode);
-	spin_lock(&inode->i_lock);
-	inode->i_count++;
-	spin_unlock(&inode->i_lock);
+	inode_get(inode);
 	d_instantiate(de, inode);
 	inc_nlink(inode);
 
Index: linux-2.6/fs/exofs/inode.c
===================================================================
--- linux-2.6.orig/fs/exofs/inode.c	2010-10-19 14:18:58.000000000 +1100
+++ linux-2.6/fs/exofs/inode.c	2010-10-19 14:19:16.000000000 +1100
@@ -1162,9 +1162,7 @@
 	/* increment the refcount so that the inode will still be around when we
 	 * reach the callback
 	 */
-	spin_lock(&inode->i_lock);
-	inode->i_count++;
-	spin_unlock(&inode->i_lock);
+	inode_get(inode);
 
 	ios->done = create_done;
 	ios->private = inode;
Index: linux-2.6/fs/exofs/namei.c
===================================================================
--- linux-2.6.orig/fs/exofs/namei.c	2010-10-19 14:18:58.000000000 +1100
+++ linux-2.6/fs/exofs/namei.c	2010-10-19 14:19:00.000000000 +1100
@@ -153,9 +153,7 @@
 
 	inode->i_ctime = CURRENT_TIME;
 	inode_inc_link_count(inode);
-	spin_lock(&inode->i_lock);
-	inode->i_count++;
-	spin_unlock(&inode->i_lock);
+	inode_get(inode);
 
 	return exofs_add_nondir(dentry, inode);
 }
Index: linux-2.6/fs/ext2/namei.c
===================================================================
--- linux-2.6.orig/fs/ext2/namei.c	2010-10-19 14:18:58.000000000 +1100
+++ linux-2.6/fs/ext2/namei.c	2010-10-19 14:19:00.000000000 +1100
@@ -206,9 +206,7 @@
 
 	inode->i_ctime = CURRENT_TIME_SEC;
 	inode_inc_link_count(inode);
-	spin_lock(&inode->i_lock);
-	inode->i_count++;
-	spin_unlock(&inode->i_lock);
+	inode_get(inode);
 
 	err = ext2_add_link(dentry, inode);
 	if (!err) {
Index: linux-2.6/fs/ext3/namei.c
===================================================================
--- linux-2.6.orig/fs/ext3/namei.c	2010-10-19 14:18:58.000000000 +1100
+++ linux-2.6/fs/ext3/namei.c	2010-10-19 14:19:00.000000000 +1100
@@ -2260,9 +2260,7 @@
 
 	inode->i_ctime = CURRENT_TIME_SEC;
 	inc_nlink(inode);
-	spin_lock(&inode->i_lock);
-	inode->i_count++;
-	spin_unlock(&inode->i_lock);
+	inode_get(inode);
 
 	err = ext3_add_entry(handle, dentry, inode);
 	if (!err) {
Index: linux-2.6/fs/ext4/namei.c
===================================================================
--- linux-2.6.orig/fs/ext4/namei.c	2010-10-19 14:18:58.000000000 +1100
+++ linux-2.6/fs/ext4/namei.c	2010-10-19 14:19:00.000000000 +1100
@@ -2312,9 +2312,7 @@
 
 	inode->i_ctime = ext4_current_time(inode);
 	ext4_inc_count(handle, inode);
-	spin_lock(&inode->i_lock);
-	inode->i_count++;
-	spin_unlock(&inode->i_lock);
+	inode_get(inode);
 
 	err = ext4_add_entry(handle, dentry, inode);
 	if (!err) {
Index: linux-2.6/fs/gfs2/ops_inode.c
===================================================================
--- linux-2.6.orig/fs/gfs2/ops_inode.c	2010-10-19 14:18:58.000000000 +1100
+++ linux-2.6/fs/gfs2/ops_inode.c	2010-10-19 14:19:00.000000000 +1100
@@ -253,9 +253,7 @@
 	gfs2_holder_uninit(ghs);
 	gfs2_holder_uninit(ghs + 1);
 	if (!error) {
-		spin_lock(&inode->i_lock);
-		inode->i_count++;
-		spin_unlock(&inode->i_lock);
+		inode_get(inode);
 		d_instantiate(dentry, inode);
 		mark_inode_dirty(inode);
 	}
Index: linux-2.6/fs/hfsplus/dir.c
===================================================================
--- linux-2.6.orig/fs/hfsplus/dir.c	2010-10-19 14:18:58.000000000 +1100
+++ linux-2.6/fs/hfsplus/dir.c	2010-10-19 14:19:00.000000000 +1100
@@ -301,9 +301,7 @@
 
 	inc_nlink(inode);
 	hfsplus_instantiate(dst_dentry, inode, cnid);
-	spin_lock(&inode->i_lock);
-	inode->i_count++;
-	spin_unlock(&inode->i_lock);
+	inode_get(inode);
 	inode->i_ctime = CURRENT_TIME_SEC;
 	mark_inode_dirty(inode);
 	HFSPLUS_SB(sb).file_count++;
Index: linux-2.6/fs/jffs2/dir.c
===================================================================
--- linux-2.6.orig/fs/jffs2/dir.c	2010-10-19 14:18:58.000000000 +1100
+++ linux-2.6/fs/jffs2/dir.c	2010-10-19 14:19:00.000000000 +1100
@@ -289,9 +289,7 @@
 		mutex_unlock(&f->sem);
 		d_instantiate(dentry, old_dentry->d_inode);
 		dir_i->i_mtime = dir_i->i_ctime = ITIME(now);
-		spin_lock(&old_dentry->d_inode->i_lock);
-		old_dentry->d_inode->i_count++;
-		spin_unlock(&old_dentry->d_inode->i_lock);
+		inode_get(old_dentry->d_inode);
 	}
 	return ret;
 }
@@ -866,9 +864,7 @@
 		printk(KERN_NOTICE "jffs2_rename(): Link succeeded, unlink failed (err %d). You now have a hard link\n", ret);
 		/* Might as well let the VFS know */
 		d_instantiate(new_dentry, old_dentry->d_inode);
-		spin_lock(&old_dentry->d_inode->i_lock);
-		old_dentry->d_inode->i_count++;
-		spin_unlock(&old_dentry->d_inode->i_lock);
+		inode_get(old_dentry->d_inode);
 		new_dir_i->i_mtime = new_dir_i->i_ctime = ITIME(now);
 		return ret;
 	}
Index: linux-2.6/fs/jfs/jfs_txnmgr.c
===================================================================
--- linux-2.6.orig/fs/jfs/jfs_txnmgr.c	2010-10-19 14:18:58.000000000 +1100
+++ linux-2.6/fs/jfs/jfs_txnmgr.c	2010-10-19 14:19:00.000000000 +1100
@@ -1279,9 +1279,7 @@
 	 * lazy commit thread finishes processing
 	 */
 	if (tblk->xflag & COMMIT_DELETE) {
-		spin_lock(&tblk->u.ip->i_lock);
-		tblk->u.ip->i_count++;
-		spin_unlock(&tblk->u.ip->i_lock);
+		inode_get(tblk->u.ip);
 		/*
 		 * Avoid a rare deadlock
 		 *
Index: linux-2.6/fs/jfs/namei.c
===================================================================
--- linux-2.6.orig/fs/jfs/namei.c	2010-10-19 14:18:58.000000000 +1100
+++ linux-2.6/fs/jfs/namei.c	2010-10-19 14:19:00.000000000 +1100
@@ -839,9 +839,7 @@
 	ip->i_ctime = CURRENT_TIME;
 	dir->i_ctime = dir->i_mtime = CURRENT_TIME;
 	mark_inode_dirty(dir);
-	spin_lock(&ip->i_lock);
-	ip->i_count++;
-	spin_unlock(&ip->i_lock);
+	inode_get(ip);
 
 	iplist[0] = ip;
 	iplist[1] = dir;
Index: linux-2.6/fs/libfs.c
===================================================================
--- linux-2.6.orig/fs/libfs.c	2010-10-19 14:18:58.000000000 +1100
+++ linux-2.6/fs/libfs.c	2010-10-19 14:19:00.000000000 +1100
@@ -255,9 +255,7 @@
 
 	inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
 	inc_nlink(inode);
-	spin_lock(&inode->i_lock);
-	inode->i_count++;
-	spin_unlock(&inode->i_lock);
+	inode_get(inode);
 	dget(dentry);
 	d_instantiate(dentry, inode);
 	return 0;
Index: linux-2.6/fs/logfs/dir.c
===================================================================
--- linux-2.6.orig/fs/logfs/dir.c	2010-10-19 14:18:58.000000000 +1100
+++ linux-2.6/fs/logfs/dir.c	2010-10-19 14:19:00.000000000 +1100
@@ -569,9 +569,7 @@
 		return -EMLINK;
 
 	inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
-	spin_lock(&inode->i_lock);
-	inode->i_count++;
-	spin_unlock(&inode->i_lock);
+	inode_get(inode);
 	inode->i_nlink++;
 	mark_inode_dirty_sync(inode);
 
Index: linux-2.6/fs/minix/namei.c
===================================================================
--- linux-2.6.orig/fs/minix/namei.c	2010-10-19 14:18:58.000000000 +1100
+++ linux-2.6/fs/minix/namei.c	2010-10-19 14:19:00.000000000 +1100
@@ -101,9 +101,7 @@
 
 	inode->i_ctime = CURRENT_TIME_SEC;
 	inode_inc_link_count(inode);
-	spin_lock(&inode->i_lock);
-	inode->i_count++;
-	spin_unlock(&inode->i_lock);
+	inode_get(inode);
 	return add_nondir(dentry, inode);
 }
 
Index: linux-2.6/fs/namei.c
===================================================================
--- linux-2.6.orig/fs/namei.c	2010-10-19 14:18:58.000000000 +1100
+++ linux-2.6/fs/namei.c	2010-10-19 14:19:00.000000000 +1100
@@ -2290,11 +2290,8 @@
 		if (nd.last.name[nd.last.len])
 			goto slashes;
 		inode = dentry->d_inode;
-		if (inode) {
-			spin_lock(&inode->i_lock);
-			inode->i_count++;
-			spin_unlock(&inode->i_lock);
-		}
+		if (inode)
+			inode_get(inode);
 		error = mnt_want_write(nd.path.mnt);
 		if (error)
 			goto exit2;
Index: linux-2.6/fs/nfs/dir.c
===================================================================
--- linux-2.6.orig/fs/nfs/dir.c	2010-10-19 14:18:58.000000000 +1100
+++ linux-2.6/fs/nfs/dir.c	2010-10-19 14:19:00.000000000 +1100
@@ -1580,9 +1580,7 @@
 	d_drop(dentry);
 	error = NFS_PROTO(dir)->link(inode, dir, &dentry->d_name);
 	if (error == 0) {
-		spin_lock(&inode->i_lock);
-		inode->i_count++;
-		spin_unlock(&inode->i_lock);
+		inode_get(inode);
 		d_add(dentry, inode);
 	}
 	return error;
Index: linux-2.6/fs/nfs/getroot.c
===================================================================
--- linux-2.6.orig/fs/nfs/getroot.c	2010-10-19 14:18:58.000000000 +1100
+++ linux-2.6/fs/nfs/getroot.c	2010-10-19 14:19:00.000000000 +1100
@@ -54,10 +54,8 @@
 			iput(inode);
 			return -ENOMEM;
 		}
-		/* Circumvent igrab(): we know the inode is not being freed */
-		spin_lock(&inode->i_lock);
-		inode->i_count++;
-		spin_unlock(&inode->i_lock);
+		/* We know the inode is not being freed */
+		inode_get(inode);
 		/*
 		 * Ensure that this dentry is invisible to d_find_alias().
 		 * Otherwise, it may be spliced into the tree by
Index: linux-2.6/fs/nilfs2/namei.c
===================================================================
--- linux-2.6.orig/fs/nilfs2/namei.c	2010-10-19 14:18:58.000000000 +1100
+++ linux-2.6/fs/nilfs2/namei.c	2010-10-19 14:19:00.000000000 +1100
@@ -219,9 +219,7 @@
 
 	inode->i_ctime = CURRENT_TIME;
 	inode_inc_link_count(inode);
-	spin_lock(&inode->i_lock);
-	inode->i_count++;
-	spin_unlock(&inode->i_lock);
+	inode_get(inode);
 
 	err = nilfs_add_nondir(dentry, inode);
 	if (!err)
Index: linux-2.6/fs/ntfs/super.c
===================================================================
--- linux-2.6.orig/fs/ntfs/super.c	2010-10-19 14:18:58.000000000 +1100
+++ linux-2.6/fs/ntfs/super.c	2010-10-19 14:19:16.000000000 +1100
@@ -2930,9 +2930,7 @@
 	}
 	if ((sb->s_root = d_alloc_root(vol->root_ino))) {
 		/* We increment i_count simulating an ntfs_iget(). */
-		spin_lock(&vol->root_ino->i_lock);
-		vol->root_ino->i_count++;
-		spin_unlock(&vol->root_ino->i_lock);
+		inode_get(vol->root_ino);
 		ntfs_debug("Exiting, status successful.");
 		/* Release the default upcase if it has no users. */
 		mutex_lock(&ntfs_lock);
Index: linux-2.6/fs/ocfs2/namei.c
===================================================================
--- linux-2.6.orig/fs/ocfs2/namei.c	2010-10-19 14:18:58.000000000 +1100
+++ linux-2.6/fs/ocfs2/namei.c	2010-10-19 14:19:00.000000000 +1100
@@ -741,9 +741,7 @@
 		goto out_commit;
 	}
 
-	spin_lock(&inode->i_lock);
-	inode->i_count++;
-	spin_unlock(&inode->i_lock);
+	inode_get(inode);
 	dentry->d_op = &ocfs2_dentry_ops;
 	d_instantiate(dentry, inode);
 
Index: linux-2.6/fs/reiserfs/namei.c
===================================================================
--- linux-2.6.orig/fs/reiserfs/namei.c	2010-10-19 14:18:58.000000000 +1100
+++ linux-2.6/fs/reiserfs/namei.c	2010-10-19 14:19:00.000000000 +1100
@@ -1156,9 +1156,7 @@
 	inode->i_ctime = CURRENT_TIME_SEC;
 	reiserfs_update_sd(&th, inode);
 
-	spin_lock(&inode->i_lock);
-	inode->i_count++;
-	spin_unlock(&inode->i_lock);
+	inode_get(inode);
 	d_instantiate(dentry, inode);
 	retval = journal_end(&th, dir->i_sb, jbegin_count);
 	reiserfs_write_unlock(dir->i_sb);
Index: linux-2.6/fs/sysv/namei.c
===================================================================
--- linux-2.6.orig/fs/sysv/namei.c	2010-10-19 14:18:58.000000000 +1100
+++ linux-2.6/fs/sysv/namei.c	2010-10-19 14:19:00.000000000 +1100
@@ -126,9 +126,7 @@
 
 	inode->i_ctime = CURRENT_TIME_SEC;
 	inode_inc_link_count(inode);
-	spin_lock(&inode->i_lock);
-	inode->i_count++;
-	spin_unlock(&inode->i_lock);
+	inode_get(inode);
 
 	return add_nondir(dentry, inode);
 }
Index: linux-2.6/fs/ubifs/dir.c
===================================================================
--- linux-2.6.orig/fs/ubifs/dir.c	2010-10-19 14:18:58.000000000 +1100
+++ linux-2.6/fs/ubifs/dir.c	2010-10-19 14:19:00.000000000 +1100
@@ -550,9 +550,7 @@
 
 	lock_2_inodes(dir, inode);
 	inc_nlink(inode);
-	spin_lock(&inode->i_lock);
-	inode->i_count++;
-	spin_unlock(&inode->i_lock);
+	inode_get(inode);
 	inode->i_ctime = ubifs_current_time(inode);
 	dir->i_size += sz_change;
 	dir_ui->ui_size = dir->i_size;
Index: linux-2.6/fs/udf/namei.c
===================================================================
--- linux-2.6.orig/fs/udf/namei.c	2010-10-19 14:18:58.000000000 +1100
+++ linux-2.6/fs/udf/namei.c	2010-10-19 14:19:00.000000000 +1100
@@ -1101,9 +1101,7 @@
 	inc_nlink(inode);
 	inode->i_ctime = current_fs_time(inode->i_sb);
 	mark_inode_dirty(inode);
-	spin_lock(&inode->i_lock);
-	inode->i_count++;
-	spin_unlock(&inode->i_lock);
+	inode_get(inode);
 	d_instantiate(dentry, inode);
 	unlock_kernel();
 
Index: linux-2.6/fs/ufs/namei.c
===================================================================
--- linux-2.6.orig/fs/ufs/namei.c	2010-10-19 14:18:58.000000000 +1100
+++ linux-2.6/fs/ufs/namei.c	2010-10-19 14:19:00.000000000 +1100
@@ -180,9 +180,7 @@
 
 	inode->i_ctime = CURRENT_TIME_SEC;
 	inode_inc_link_count(inode);
-	spin_lock(&inode->i_lock);
-	inode->i_count++;
-	spin_unlock(&inode->i_lock);
+	inode_get(inode);
 
 	error = ufs_add_nondir(dentry, inode);
 	unlock_kernel();
Index: linux-2.6/fs/xfs/linux-2.6/xfs_iops.c
===================================================================
--- linux-2.6.orig/fs/xfs/linux-2.6/xfs_iops.c	2010-10-19 14:18:58.000000000 +1100
+++ linux-2.6/fs/xfs/linux-2.6/xfs_iops.c	2010-10-19 14:19:00.000000000 +1100
@@ -352,9 +352,7 @@
 	if (unlikely(error))
 		return -error;
 
-	spin_lock(&inode->i_lock);
-	inode->i_count++;
-	spin_unlock(&inode->i_lock);
+	inode_get(inode);
 	d_instantiate(dentry, inode);
 	return 0;
 }
Index: linux-2.6/fs/xfs/xfs_inode.h
===================================================================
--- linux-2.6.orig/fs/xfs/xfs_inode.h	2010-10-19 14:18:58.000000000 +1100
+++ linux-2.6/fs/xfs/xfs_inode.h	2010-10-19 14:19:16.000000000 +1100
@@ -481,10 +481,8 @@
 
 #define IHOLD(ip) \
 do { \
-	spin_lock(&VFS_I(ip)->i_lock); \
 	ASSERT(VFS_I(ip)->i_count > 0) ; \
-	VFS_I(ip)->i_count++; \
-	spin_unlock(&VFS_I(ip)->i_lock); \
+	inode_get(VFS_I(ip)); \
 	trace_xfs_ihold(ip, _THIS_IP_); \
 } while (0)
 
Index: linux-2.6/ipc/mqueue.c
===================================================================
--- linux-2.6.orig/ipc/mqueue.c	2010-10-19 14:18:59.000000000 +1100
+++ linux-2.6/ipc/mqueue.c	2010-10-19 14:19:00.000000000 +1100
@@ -775,11 +775,8 @@
 	}
 
 	inode = dentry->d_inode;
-	if (inode) {
-		spin_lock(&inode->i_lock);
-		inode->i_count++;
-		spin_unlock(&inode->i_lock);
-	}
+	if (inode)
+		inode_get(inode);
 	err = mnt_want_write(ipc_ns->mq_mnt);
 	if (err)
 		goto out_err;
Index: linux-2.6/kernel/futex.c
===================================================================
--- linux-2.6.orig/kernel/futex.c	2010-10-19 14:18:58.000000000 +1100
+++ linux-2.6/kernel/futex.c	2010-10-19 14:19:00.000000000 +1100
@@ -168,9 +168,7 @@
 
 	switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) {
 	case FUT_OFF_INODE:
-		spin_lock(&key->shared.inode->i_lock);
-		key->shared.inode->i_count++;
-		spin_unlock(&key->shared.inode->i_lock);
+		inode_get(key->shared.inode);
 		break;
 	case FUT_OFF_MMSHARED:
 		atomic_inc(&key->private.mm->mm_count);
Index: linux-2.6/mm/shmem.c
===================================================================
--- linux-2.6.orig/mm/shmem.c	2010-10-19 14:18:59.000000000 +1100
+++ linux-2.6/mm/shmem.c	2010-10-19 14:19:00.000000000 +1100
@@ -1903,9 +1903,7 @@
 	dir->i_size += BOGO_DIRENT_SIZE;
 	inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
 	inc_nlink(inode);
-	spin_lock(&inode->i_lock);
-	inode->i_count++;	/* New dentry reference */
-	spin_unlock(&inode->i_lock);
+	inode_get(inode);
 	dget(dentry);		/* Extra pinning count for the created dentry */
 	d_instantiate(dentry, inode);
 out:
Index: linux-2.6/net/socket.c
===================================================================
--- linux-2.6.orig/net/socket.c	2010-10-19 14:19:00.000000000 +1100
+++ linux-2.6/net/socket.c	2010-10-19 14:19:00.000000000 +1100
@@ -378,9 +378,7 @@
 		  &socket_file_ops);
 	if (unlikely(!file)) {
 		/* drop dentry, keep inode */
-		spin_lock(&path.dentry->d_inode->i_lock);
-		path.dentry->d_inode->i_count++;
-		spin_unlock(&path.dentry->d_inode->i_lock);
+		inode_get(path.dentry->d_inode);
 		path_put(&path);
 		put_unused_fd(fd);
 		return -ENFILE;
Index: linux-2.6/fs/9p/vfs_inode.c
===================================================================
--- linux-2.6.orig/fs/9p/vfs_inode.c	2010-10-19 14:18:59.000000000 +1100
+++ linux-2.6/fs/9p/vfs_inode.c	2010-10-19 14:19:16.000000000 +1100
@@ -1798,9 +1798,7 @@
 		/* Caching disabled. No need to get upto date stat info.
 		 * This dentry will be released immediately. So, just i_count++
 		 */
-		spin_lock(&old_dentry->d_inode->i_lock);
-		old_dentry->d_inode->i_count++;
-		spin_unlock(&old_dentry->d_inode->i_lock);
+		inode_get(old_dentry->d_inode);
 	}
 
 	dentry->d_op = old_dentry->d_op;
Index: linux-2.6/fs/afs/dir.c
===================================================================
--- linux-2.6.orig/fs/afs/dir.c	2010-10-19 14:18:58.000000000 +1100
+++ linux-2.6/fs/afs/dir.c	2010-10-19 14:19:00.000000000 +1100
@@ -1045,9 +1045,7 @@
 	if (ret < 0)
 		goto link_error;
 
-	spin_lock(&vnode->vfs_inode.i_lock);
-	vnode->vfs_inode.i_count++;
-	spin_unlock(&vnode->vfs_inode.i_lock);
+	inode_get(&vnode->vfs_inode);
 	d_instantiate(dentry, &vnode->vfs_inode);
 	key_put(key);
 	_leave(" = 0");
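
All of the hunks above make the same transformation: the open-coded
lock/increment/unlock of i_count becomes a call to inode_get(). For
reference, the helpers this patch introduces presumably look like the
sketch below; the inode_get_ilock() body is reconstructed from the
include/linux/fs.h hunk quoted in the next patch (with the i_refs
rename undone), while the inode_get() body is cut off there, so its
delegation to inode_get_ilock() is an assumption:

	/* sketch, not verbatim from the patch */
	static inline void inode_get_ilock(struct inode *inode)
	{
		/* caller already holds i_lock */
		assert_spin_locked(&inode->i_lock);
		BUG_ON(inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE));
		inode->i_count++;
	}

	static inline void inode_get(struct inode *inode)
	{
		spin_lock(&inode->i_lock);
		inode_get_ilock(inode);
		spin_unlock(&inode->i_lock);
	}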



^ permalink raw reply	[flat|nested] 70+ messages in thread

* [patch 34/35] fs: inode rename i_count to i_refs
  2010-10-19  3:42 [patch 00/35] my inode scaling series for review npiggin
                   ` (32 preceding siblings ...)
  2010-10-19  3:42 ` [patch 33/35] fs: icache introduce inode_get/inode_get_ilock npiggin
@ 2010-10-19  3:42 ` npiggin
  2010-10-19  3:42 ` [patch 35/35] fs: icache document more lock orders npiggin
                   ` (2 subsequent siblings)
  36 siblings, 0 replies; 70+ messages in thread
From: npiggin @ 2010-10-19  3:42 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel

[-- Attachment #1: fs-rename.patch --]
[-- Type: text/plain, Size: 24716 bytes --]

Signed-off-by: Nick Piggin <npiggin@kernel.dk>

---
 Documentation/filesystems/vfs.txt        |    4 ++--
 arch/powerpc/platforms/cell/spufs/file.c |    2 +-
 drivers/staging/pohmelfs/inode.c         |   14 +++++++-------
 fs/9p/vfs_inode.c                        |    2 +-
 fs/block_dev.c                           |    2 +-
 fs/btrfs/inode.c                         |   10 +++++-----
 fs/ceph/mds_client.c                     |    2 +-
 fs/cifs/inode.c                          |    2 +-
 fs/exofs/inode.c                         |    4 ++--
 fs/ext3/ialloc.c                         |    4 ++--
 fs/ext4/ialloc.c                         |    4 ++--
 fs/fs-writeback.c                        |    4 ++--
 fs/hpfs/inode.c                          |    2 +-
 fs/inode.c                               |   18 +++++++++---------
 fs/locks.c                               |    2 +-
 fs/logfs/readwrite.c                     |    2 +-
 fs/nfs/inode.c                           |    4 ++--
 fs/nilfs2/mdt.c                          |    2 +-
 fs/notify/inode_mark.c                   |    8 ++++----
 fs/ntfs/inode.c                          |    6 +++---
 fs/ntfs/super.c                          |    4 ++--
 fs/reiserfs/stree.c                      |    2 +-
 fs/smbfs/inode.c                         |    2 +-
 fs/squashfs/dir.c                        |    8 ++++----
 fs/squashfs/inode.c                      |    2 +-
 fs/ubifs/super.c                         |    2 +-
 fs/udf/inode.c                           |    2 +-
 fs/xfs/linux-2.6/xfs_trace.h             |    2 +-
 fs/xfs/xfs_inode.h                       |    2 +-
 include/linux/fs.h                       |    6 +++---
 30 files changed, 65 insertions(+), 65 deletions(-)

Index: linux-2.6/Documentation/filesystems/vfs.txt
===================================================================
--- linux-2.6.orig/Documentation/filesystems/vfs.txt	2010-10-19 14:18:59.000000000 +1100
+++ linux-2.6/Documentation/filesystems/vfs.txt	2010-10-19 14:19:00.000000000 +1100
@@ -347,7 +347,7 @@
   lookup: called when the VFS needs to look up an inode in a parent
 	directory. The name to look for is found in the dentry. This
 	method must call d_add() to insert the found inode into the
-	dentry. The "i_count" field in the inode structure should be
+	dentry. The "i_refs" field in the inode structure should be
 	incremented. If the named inode does not exist a NULL inode
 	should be inserted into the dentry (this is called a negative
 	dentry). Returning an error code from this routine must only
@@ -926,7 +926,7 @@
 	d_instantiate()
 
   d_instantiate: add a dentry to the alias hash list for the inode and
-	updates the "d_inode" member. The "i_count" member in the
+	updates the "d_inode" member. The "i_refs" member in the
 	inode structure should be set/incremented. If the inode
 	pointer is NULL, the dentry is called a "negative
 	dentry". This function is commonly called when an inode is
Index: linux-2.6/fs/9p/vfs_inode.c
===================================================================
--- linux-2.6.orig/fs/9p/vfs_inode.c	2010-10-19 14:19:00.000000000 +1100
+++ linux-2.6/fs/9p/vfs_inode.c	2010-10-19 14:19:00.000000000 +1100
@@ -1796,7 +1796,7 @@
 		kfree(st);
 	} else {
 		/* Caching disabled. No need to get upto date stat info.
-		 * This dentry will be released immediately. So, just i_count++
+		 * This dentry will be released immediately. So, just i_refs++
 		 */
 		inode_get(old_dentry->d_inode);
 	}
Index: linux-2.6/fs/block_dev.c
===================================================================
--- linux-2.6.orig/fs/block_dev.c	2010-10-19 14:19:00.000000000 +1100
+++ linux-2.6/fs/block_dev.c	2010-10-19 14:19:00.000000000 +1100
@@ -599,7 +599,7 @@
 		spin_lock(&bdev_lock);
 		if (!inode->i_bdev) {
 			/*
-			 * We take an additional bd_inode->i_count for inode,
+			 * We take an additional bd_inode->i_refs for inode,
 			 * and it's released in clear_inode() of inode.
 			 * So, we can access it via ->i_mapping always
 			 * without igrab().
Index: linux-2.6/fs/btrfs/inode.c
===================================================================
--- linux-2.6.orig/fs/btrfs/inode.c	2010-10-19 14:19:00.000000000 +1100
+++ linux-2.6/fs/btrfs/inode.c	2010-10-19 14:19:00.000000000 +1100
@@ -1965,8 +1965,8 @@
 	struct delayed_iput *delayed;
 
 	spin_lock(&inode->i_lock);
-	if (inode->i_count > 1) {
-		inode->i_count--;
+	if (inode->i_refs > 1) {
+		inode->i_refs--;
 		spin_unlock(&inode->i_lock);
 		return;
 	}
@@ -2723,10 +2723,10 @@
 		return ERR_PTR(-ENOSPC);
 
 	/* check if there is someone else holds reference */
-	if (S_ISDIR(inode->i_mode) && inode->i_count > 1)
+	if (S_ISDIR(inode->i_mode) && inode->i_refs > 1)
 		return ERR_PTR(-ENOSPC);
 
-	if (inode->i_count > 2)
+	if (inode->i_refs > 2)
 		return ERR_PTR(-ENOSPC);
 
 	if (xchg(&root->fs_info->enospc_unlink, 1))
@@ -3944,7 +3944,7 @@
 		inode = igrab(&entry->vfs_inode);
 		if (inode) {
 			spin_unlock(&root->inode_lock);
-			if (inode->i_count > 1)
+			if (inode->i_refs > 1)
 				d_prune_aliases(inode);
 			/*
 			 * btrfs_drop_inode will have it removed from
Index: linux-2.6/fs/ceph/mds_client.c
===================================================================
--- linux-2.6.orig/fs/ceph/mds_client.c	2010-10-19 14:18:58.000000000 +1100
+++ linux-2.6/fs/ceph/mds_client.c	2010-10-19 14:19:00.000000000 +1100
@@ -1102,7 +1102,7 @@
 		spin_unlock(&inode->i_lock);
 		d_prune_aliases(inode);
 		dout("trim_caps_cb %p cap %p  pruned, count now %d\n",
-		     inode, cap, inode->i_count);
+		     inode, cap, inode->i_refs);
 		return 0;
 	}
 
Index: linux-2.6/fs/cifs/inode.c
===================================================================
--- linux-2.6.orig/fs/cifs/inode.c	2010-10-19 14:18:58.000000000 +1100
+++ linux-2.6/fs/cifs/inode.c	2010-10-19 14:19:00.000000000 +1100
@@ -1641,7 +1641,7 @@
 	}
 
 	cFYI(1, "Revalidate: %s inode 0x%p count %d dentry: 0x%p d_time %ld "
-		 "jiffies %ld", full_path, inode, inode->i_count,
+		 "jiffies %ld", full_path, inode, inode->i_refs,
 		 dentry, dentry->d_time, jiffies);
 
 	if (CIFS_SB(sb)->tcon->unix_ext)
Index: linux-2.6/fs/exofs/inode.c
===================================================================
--- linux-2.6.orig/fs/exofs/inode.c	2010-10-19 14:19:00.000000000 +1100
+++ linux-2.6/fs/exofs/inode.c	2010-10-19 14:19:00.000000000 +1100
@@ -1108,7 +1108,7 @@
 	set_obj_created(oi);
 
 	spin_lock(&inode->i_lock);
-	inode->i_count--;
+	inode->i_refs--;
 	spin_unlock(&inode->i_lock);
 	wake_up(&oi->i_wq);
 }
@@ -1170,7 +1170,7 @@
 	ret = exofs_sbi_create(ios);
 	if (ret) {
 		spin_lock(&inode->i_lock);
-		inode->i_count--;
+		inode->i_refs--;
 		spin_unlock(&inode->i_lock);
 		exofs_put_io_state(ios);
 		return ERR_PTR(ret);
Index: linux-2.6/fs/ext3/ialloc.c
===================================================================
--- linux-2.6.orig/fs/ext3/ialloc.c	2010-10-19 14:18:58.000000000 +1100
+++ linux-2.6/fs/ext3/ialloc.c	2010-10-19 14:19:00.000000000 +1100
@@ -100,9 +100,9 @@
 	struct ext3_sb_info *sbi;
 	int fatal = 0, err;
 
-	if (inode->i_count > 1) {
+	if (inode->i_refs > 1) {
 		printk ("ext3_free_inode: inode has count=%d\n",
-					inode->i_count);
+					inode->i_refs);
 		return;
 	}
 	if (inode->i_nlink) {
Index: linux-2.6/fs/ext4/ialloc.c
===================================================================
--- linux-2.6.orig/fs/ext4/ialloc.c	2010-10-19 14:18:58.000000000 +1100
+++ linux-2.6/fs/ext4/ialloc.c	2010-10-19 14:19:00.000000000 +1100
@@ -189,9 +189,9 @@
 	struct ext4_sb_info *sbi;
 	int fatal = 0, err, count, cleared;
 
-	if (inode->i_count > 1) {
+	if (inode->i_refs > 1) {
 		printk(KERN_ERR "ext4_free_inode: inode has count=%d\n",
-		       inode->i_count);
+		       inode->i_refs);
 		return;
 	}
 	if (inode->i_nlink) {
Index: linux-2.6/fs/fs-writeback.c
===================================================================
--- linux-2.6.orig/fs/fs-writeback.c	2010-10-19 14:19:00.000000000 +1100
+++ linux-2.6/fs/fs-writeback.c	2010-10-19 14:19:00.000000000 +1100
@@ -308,7 +308,7 @@
 	unsigned dirty;
 	int ret;
 
-	if (!inode->i_count)
+	if (!inode->i_refs)
 		WARN_ON(!(inode->i_state & (I_WILL_FREE|I_FREEING)));
 	else
 		WARN_ON(inode->i_state & I_WILL_FREE);
@@ -414,7 +414,7 @@
 			/*
 			 * Put it on the LRU if it is unused, otherwise lazy.
 			 */
-			if (!inode->i_count && list_empty(&inode->i_lru))
+			if (!inode->i_refs && list_empty(&inode->i_lru))
 				__inode_lru_list_add(inode);
 		}
 	}
Index: linux-2.6/fs/hpfs/inode.c
===================================================================
--- linux-2.6.orig/fs/hpfs/inode.c	2010-10-19 14:18:58.000000000 +1100
+++ linux-2.6/fs/hpfs/inode.c	2010-10-19 14:19:00.000000000 +1100
@@ -183,7 +183,7 @@
 	struct hpfs_inode_info *hpfs_inode = hpfs_i(i);
 	struct inode *parent;
 	if (i->i_ino == hpfs_sb(i->i_sb)->sb_root) return;
-	if (hpfs_inode->i_rddir_off && !i->i_count) {
+	if (hpfs_inode->i_rddir_off && !i->i_refs) {
 		if (*hpfs_inode->i_rddir_off) printk("HPFS: write_inode: some position still there\n");
 		kfree(hpfs_inode->i_rddir_off);
 		hpfs_inode->i_rddir_off = NULL;
Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c	2010-10-19 14:19:00.000000000 +1100
+++ linux-2.6/fs/inode.c	2010-10-19 14:19:14.000000000 +1100
@@ -40,7 +40,7 @@
  *   b_io, b_more_io, b_dirty, i_io, i_lru
  * inode->i_lock protects:
  *   i_state
- *   i_count
+ *   i_refs
  *   i_hash
  *   i_io
  *   i_lru
@@ -92,9 +92,9 @@
  * Each inode can be on two separate lists. One is
  * the hash list of the inode, used for lookups. The
  * other linked list is the "type" list:
- *  "in_use" - valid inode, i_count > 0, i_nlink > 0
+ *  "in_use" - valid inode, i_refs > 0, i_nlink > 0
  *  "dirty"  - as "in_use" but also dirty
- *  "unused" - valid inode, i_count = 0
+ *  "unused" - valid inode, i_refs = 0
  *
  * A "dirty" list is maintained for each super block,
  * allowing for low-overhead inode sync() operations.
@@ -212,7 +212,7 @@
 	inode->i_sb = sb;
 	inode->i_blkbits = sb->s_blocksize_bits;
 	inode->i_flags = 0;
-	inode->i_count = 1;
+	inode->i_refs = 1;
 	inode->i_op = &empty_iops;
 	inode->i_fop = &empty_fops;
 	inode->i_nlink = 1;
@@ -484,7 +484,7 @@
 			continue;
 		}
 		invalidate_inode_buffers(inode);
-		if (!inode->i_count) {
+		if (!inode->i_refs) {
 			struct bdi_writeback *wb = inode_to_wb(inode);
 
 			spin_lock(&wb->b_lock);
@@ -541,7 +541,7 @@
 		return 0;
 	if (inode_has_buffers(inode))
 		return 0;
-	if (inode->i_count)
+	if (inode->i_refs)
 		return 0;
 	if (inode->i_data.nrpages)
 		return 0;
@@ -581,7 +581,7 @@
 			cpu_relax();
 			goto again;
 		}
-		if (inode->i_count || (inode->i_state & ~I_REFERENCED)) {
+		if (inode->i_refs || (inode->i_state & ~I_REFERENCED)) {
 			list_del_init(&inode->i_lru);
 			spin_unlock(&inode->i_lock);
 			zone->inode_nr_lru--;
@@ -1608,8 +1608,8 @@
 		BUG_ON(inode->i_state & I_CLEAR);
 
 		spin_lock(&inode->i_lock);
-		inode->i_count--;
-		if (inode->i_count == 0)
+		inode->i_refs--;
+		if (inode->i_refs == 0)
 			iput_final(inode);
 		else
 			spin_unlock(&inode->i_lock);
Index: linux-2.6/fs/locks.c
===================================================================
--- linux-2.6.orig/fs/locks.c	2010-10-19 14:18:58.000000000 +1100
+++ linux-2.6/fs/locks.c	2010-10-19 14:19:00.000000000 +1100
@@ -1376,7 +1376,7 @@
 			goto out;
 		if ((arg == F_WRLCK)
 		    && (atomic_read(&dentry->d_count) > 1
-			|| inode->i_count > 1))
+			|| inode->i_refs > 1))
 			goto out;
 	}
 
Index: linux-2.6/fs/logfs/readwrite.c
===================================================================
--- linux-2.6.orig/fs/logfs/readwrite.c	2010-10-19 14:18:58.000000000 +1100
+++ linux-2.6/fs/logfs/readwrite.c	2010-10-19 14:19:00.000000000 +1100
@@ -1002,7 +1002,7 @@
 {
 	struct logfs_inode *li = logfs_inode(inode);
 
-	if ((inode->i_nlink == 0) && inode->i_count == 1)
+	if ((inode->i_nlink == 0) && inode->i_refs == 1)
 		return 0;
 
 	if (bix < I0_BLOCKS)
Index: linux-2.6/fs/nfs/inode.c
===================================================================
--- linux-2.6.orig/fs/nfs/inode.c	2010-10-19 14:18:59.000000000 +1100
+++ linux-2.6/fs/nfs/inode.c	2010-10-19 14:19:00.000000000 +1100
@@ -384,7 +384,7 @@
 	dprintk("NFS: nfs_fhget(%s/%Ld ct=%d)\n",
 		inode->i_sb->s_id,
 		(long long)NFS_FILEID(inode),
-		inode->i_count);
+		inode->i_refs);
 
 out:
 	return inode;
@@ -1190,7 +1190,7 @@
 
 	dfprintk(VFS, "NFS: %s(%s/%ld ct=%d info=0x%x)\n",
 			__func__, inode->i_sb->s_id, inode->i_ino,
-			inode->i_count, fattr->valid);
+			inode->i_refs, fattr->valid);
 
 	if ((fattr->valid & NFS_ATTR_FATTR_FILEID) && nfsi->fileid != fattr->fileid)
 		goto out_fileid;
Index: linux-2.6/fs/notify/inode_mark.c
===================================================================
--- linux-2.6.orig/fs/notify/inode_mark.c	2010-10-19 14:19:00.000000000 +1100
+++ linux-2.6/fs/notify/inode_mark.c	2010-10-19 14:19:00.000000000 +1100
@@ -255,12 +255,12 @@
 		}
 
 		/*
-		 * If i_count is zero, the inode cannot have any watches and
+		 * If i_refs is zero, the inode cannot have any watches and
 		 * doing an inode_get/iput with MS_ACTIVE clear would actually
-		 * evict all inodes with zero i_count from icache which is
+		 * evict all inodes with zero i_refs from icache which is
 		 * unnecessarily violent and may in fact be illegal to do.
 		 */
-		if (!inode->i_count) {
+		if (!inode->i_refs) {
 			spin_unlock(&inode->i_lock);
 			continue;
 		}
@@ -278,7 +278,7 @@
 		/* In case the dropping of a reference would nuke next_i. */
 		if (&next_i->i_sb_list != list) {
 			spin_lock(&next_i->i_lock);
-			if (next_i->i_count &&
+			if (next_i->i_refs &&
 			    !(next_i->i_state & (I_FREEING | I_WILL_FREE))) {
 				inode_get_ilock(next_i);
 				need_iput = next_i;
Index: linux-2.6/fs/ntfs/inode.c
===================================================================
--- linux-2.6.orig/fs/ntfs/inode.c	2010-10-19 14:18:59.000000000 +1100
+++ linux-2.6/fs/ntfs/inode.c	2010-10-19 14:19:00.000000000 +1100
@@ -538,7 +538,7 @@
  *
  * Q: What locks are held when the function is called?
  * A: i_state has I_NEW set, hence the inode is locked, also
- *    i_count is set to 1, so it is not going to go away
+ *    i_refs is set to 1, so it is not going to go away
  *    i_flags is set to 0 and we have no business touching it.  Only an ioctl()
  *    is allowed to write to them. We should of course be honouring them but
  *    we need to do that using the IS_* macros defined in include/linux/fs.h.
@@ -1215,7 +1215,7 @@
  *
  * Q: What locks are held when the function is called?
  * A: i_state has I_NEW set, hence the inode is locked, also
- *    i_count is set to 1, so it is not going to go away
+ *    i_refs is set to 1, so it is not going to go away
  *
  * Return 0 on success and -errno on error.  In the error case, the inode will
  * have had make_bad_inode() executed on it.
@@ -1482,7 +1482,7 @@
  *
  * Q: What locks are held when the function is called?
  * A: i_state has I_NEW set, hence the inode is locked, also
- *    i_count is set to 1, so it is not going to go away
+ *    i_refs is set to 1, so it is not going to go away
  *
  * Return 0 on success and -errno on error.  In the error case, the inode will
  * have had make_bad_inode() executed on it.
Index: linux-2.6/fs/ntfs/super.c
===================================================================
--- linux-2.6.orig/fs/ntfs/super.c	2010-10-19 14:19:00.000000000 +1100
+++ linux-2.6/fs/ntfs/super.c	2010-10-19 14:19:00.000000000 +1100
@@ -2689,7 +2689,7 @@
 	//					   held. See fs/inode.c::
 	//					   generic_drop_inode(). */
 	//.delete_inode	= NULL,			/* VFS: Delete inode from disk.
-	//					   Called when i_count becomes
+	//					   Called when i_refs becomes
 	//					   0 and i_nlink is also 0. */
 	//.write_super	= NULL,			/* Flush dirty super block to
 	//					   disk. */
@@ -2929,7 +2929,7 @@
 		goto unl_upcase_iput_tmp_ino_err_out_now;
 	}
 	if ((sb->s_root = d_alloc_root(vol->root_ino))) {
-		/* We increment i_count simulating an ntfs_iget(). */
+		/* We increment i_refs simulating an ntfs_iget(). */
 		inode_get(vol->root_ino);
 		ntfs_debug("Exiting, status successful.");
 		/* Release the default upcase if it has no users. */
Index: linux-2.6/fs/reiserfs/stree.c
===================================================================
--- linux-2.6.orig/fs/reiserfs/stree.c	2010-10-19 14:18:58.000000000 +1100
+++ linux-2.6/fs/reiserfs/stree.c	2010-10-19 14:19:00.000000000 +1100
@@ -1477,7 +1477,7 @@
 	 ** reading in the last block.  The user will hit problems trying to
 	 ** read the file, but for now we just skip the indirect2direct
 	 */
-	if (inode->i_count > 1 ||
+	if (inode->i_refs > 1 ||
 	    !tail_has_to_be_packed(inode) ||
 	    !page || (REISERFS_I(inode)->i_flags & i_nopack_mask)) {
 		/* leave tail in an unformatted node */
Index: linux-2.6/fs/smbfs/inode.c
===================================================================
--- linux-2.6.orig/fs/smbfs/inode.c	2010-10-19 14:18:59.000000000 +1100
+++ linux-2.6/fs/smbfs/inode.c	2010-10-19 14:19:00.000000000 +1100
@@ -327,7 +327,7 @@
 }
 
 /*
- * This routine is called when i_nlink == 0 and i_count goes to 0.
+ * This routine is called when i_nlink == 0 and i_refs goes to 0.
  * All blocking cleanup operations need to go here to avoid races.
  */
 static void
Index: linux-2.6/fs/squashfs/dir.c
===================================================================
--- linux-2.6.orig/fs/squashfs/dir.c	2010-10-19 14:17:11.000000000 +1100
+++ linux-2.6/fs/squashfs/dir.c	2010-10-19 14:19:00.000000000 +1100
@@ -50,14 +50,14 @@
  */
 static int get_dir_index_using_offset(struct super_block *sb,
 	u64 *next_block, int *next_offset, u64 index_start, int index_offset,
-	int i_count, u64 f_pos)
+	int i_refs, u64 f_pos)
 {
 	struct squashfs_sb_info *msblk = sb->s_fs_info;
 	int err, i, index, length = 0;
 	struct squashfs_dir_index dir_index;
 
-	TRACE("Entered get_dir_index_using_offset, i_count %d, f_pos %lld\n",
-					i_count, f_pos);
+	TRACE("Entered get_dir_index_using_offset, i_refs %d, f_pos %lld\n",
+					i_refs, f_pos);
 
 	/*
 	 * Translate from external f_pos to the internal f_pos.  This
@@ -68,7 +68,7 @@
 		return f_pos;
 	f_pos -= 3;
 
-	for (i = 0; i < i_count; i++) {
+	for (i = 0; i < i_refs; i++) {
 		err = squashfs_read_metadata(sb, &dir_index, &index_start,
 				&index_offset, sizeof(dir_index));
 		if (err < 0)
Index: linux-2.6/fs/squashfs/inode.c
===================================================================
--- linux-2.6.orig/fs/squashfs/inode.c	2010-10-19 14:17:11.000000000 +1100
+++ linux-2.6/fs/squashfs/inode.c	2010-10-19 14:19:00.000000000 +1100
@@ -266,7 +266,7 @@
 		squashfs_i(inode)->offset = le16_to_cpu(sqsh_ino->offset);
 		squashfs_i(inode)->dir_idx_start = block;
 		squashfs_i(inode)->dir_idx_offset = offset;
-		squashfs_i(inode)->dir_idx_cnt = le16_to_cpu(sqsh_ino->i_count);
+		squashfs_i(inode)->dir_idx_cnt = le16_to_cpu(sqsh_ino->i_refs);
 		squashfs_i(inode)->parent = le32_to_cpu(sqsh_ino->parent_inode);
 
 		TRACE("Long directory inode %x:%x, start_block %llx, offset "
Index: linux-2.6/fs/ubifs/super.c
===================================================================
--- linux-2.6.orig/fs/ubifs/super.c	2010-10-19 14:18:59.000000000 +1100
+++ linux-2.6/fs/ubifs/super.c	2010-10-19 14:19:00.000000000 +1100
@@ -350,7 +350,7 @@
 		goto out;
 
 	dbg_gen("inode %lu, mode %#x", inode->i_ino, (int)inode->i_mode);
-	ubifs_assert(!inode->i_count);
+	ubifs_assert(!inode->i_refs);
 
 	truncate_inode_pages(&inode->i_data, 0);
 
Index: linux-2.6/fs/udf/inode.c
===================================================================
--- linux-2.6.orig/fs/udf/inode.c	2010-10-19 14:17:11.000000000 +1100
+++ linux-2.6/fs/udf/inode.c	2010-10-19 14:19:00.000000000 +1100
@@ -1071,7 +1071,7 @@
 	 *      i_flags = sb->s_flags
 	 *      i_state = 0
 	 * clean_inode(): zero fills and sets
-	 *      i_count = 1
+	 *      i_refs = 1
 	 *      i_nlink = 1
 	 *      i_op = NULL;
 	 */
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h	2010-10-19 14:19:00.000000000 +1100
+++ linux-2.6/include/linux/fs.h	2010-10-19 14:19:00.000000000 +1100
@@ -738,7 +738,7 @@
 #ifdef CONFIG_SMP
 	int			i_sb_list_cpu;
 #endif
-	unsigned int		i_count;
+	unsigned int		i_refs;
 	unsigned int		i_nlink;
 	uid_t			i_uid;
 	gid_t			i_gid;
@@ -1621,7 +1621,7 @@
  *			also cause waiting on I_NEW, without I_NEW actually
  *			being set.  find_inode() uses this to prevent returning
  *			nearly-dead inodes.
- * I_WILL_FREE		Must be set when calling write_inode_now() if i_count
+ * I_WILL_FREE		Must be set when calling write_inode_now() if i_refs
  *			is zero.  I_FREEING must be set when I_WILL_FREE is
  *			cleared.
  * I_FREEING		Set when inode is about to be freed but still has dirty
@@ -2466,7 +2466,7 @@
 {
 	assert_spin_locked(&inode->i_lock);
 	BUG_ON(inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE));
-	inode->i_count++;
+	inode->i_refs++;
 }
 
 static inline void inode_get(struct inode *inode)
Index: linux-2.6/fs/nilfs2/mdt.c
===================================================================
--- linux-2.6.orig/fs/nilfs2/mdt.c	2010-10-19 14:19:00.000000000 +1100
+++ linux-2.6/fs/nilfs2/mdt.c	2010-10-19 14:19:00.000000000 +1100
@@ -480,7 +480,7 @@
 		inode->i_sb = sb; /* sb may be NULL for some meta data files */
 		inode->i_blkbits = nilfs->ns_blocksize_bits;
 		inode->i_flags = 0;
-		inode->i_count = 1;
+		inode->i_refs = 1;
 		inode->i_nlink = 1;
 		inode->i_ino = ino;
 		inode->i_mode = S_IFREG;
Index: linux-2.6/fs/xfs/linux-2.6/xfs_trace.h
===================================================================
--- linux-2.6.orig/fs/xfs/linux-2.6/xfs_trace.h	2010-10-19 14:18:58.000000000 +1100
+++ linux-2.6/fs/xfs/linux-2.6/xfs_trace.h	2010-10-19 14:19:00.000000000 +1100
@@ -599,7 +599,7 @@
 	TP_fast_assign(
 		__entry->dev = VFS_I(ip)->i_sb->s_dev;
 		__entry->ino = ip->i_ino;
-		__entry->count = VFS_I(ip)->i_count;
+		__entry->count = VFS_I(ip)->i_refs;
 		__entry->pincount = atomic_read(&ip->i_pincount);
 		__entry->caller_ip = caller_ip;
 	),
Index: linux-2.6/fs/xfs/xfs_inode.h
===================================================================
--- linux-2.6.orig/fs/xfs/xfs_inode.h	2010-10-19 14:19:00.000000000 +1100
+++ linux-2.6/fs/xfs/xfs_inode.h	2010-10-19 14:19:00.000000000 +1100
@@ -481,7 +481,7 @@
 
 #define IHOLD(ip) \
 do { \
-	ASSERT(VFS_I(ip)->i_count > 0) ; \
+	ASSERT(VFS_I(ip)->i_refs > 0) ; \
 	inode_get(VFS_I(ip)); \
 	trace_xfs_ihold(ip, _THIS_IP_); \
 } while (0)
Index: linux-2.6/arch/powerpc/platforms/cell/spufs/file.c
===================================================================
--- linux-2.6.orig/arch/powerpc/platforms/cell/spufs/file.c	2010-10-19 14:18:58.000000000 +1100
+++ linux-2.6/arch/powerpc/platforms/cell/spufs/file.c	2010-10-19 14:19:00.000000000 +1100
@@ -1549,7 +1549,7 @@
 	if (ctx->owner != current->mm)
 		return -EINVAL;
 
-	if (inode->i_count != 1)
+	if (inode->i_refs != 1)
 		return -EBUSY;
 
 	mutex_lock(&ctx->mapping_lock);
Index: linux-2.6/drivers/staging/pohmelfs/inode.c
===================================================================
--- linux-2.6.orig/drivers/staging/pohmelfs/inode.c	2010-10-19 14:18:59.000000000 +1100
+++ linux-2.6/drivers/staging/pohmelfs/inode.c	2010-10-19 14:19:00.000000000 +1100
@@ -1296,11 +1296,11 @@
 		dprintk("%s: ino: %llu, pi: %p, inode: %p, count: %u.\n",
 				__func__, pi->ino, pi, inode, count);
 
-		if (inode->i_count != count) {
-			printk("%s: ino: %llu, pi: %p, inode: %p, count: %u, i_count: %d.\n",
+		if (inode->i_refs != count) {
+			printk("%s: ino: %llu, pi: %p, inode: %p, count: %u, i_refs: %d.\n",
 					__func__, pi->ino, pi, inode, count,
-					inode->i_count);
-			count = inode->i_count;
+					inode->i_refs);
+			count = inode->i_refs;
 			in_drop_list++;
 		}
 
@@ -1311,8 +1311,8 @@
 	list_for_each_entry_safe(inode, tmp, &sb->s_inodes, i_sb_list) {
 		pi = POHMELFS_I(inode);
 
-		dprintk("%s: ino: %llu, pi: %p, inode: %p, i_count: %u.\n",
-				__func__, pi->ino, pi, inode, inode->i_count);
+		dprintk("%s: ino: %llu, pi: %p, inode: %p, i_refs: %u.\n",
+				__func__, pi->ino, pi, inode, inode->i_refs);
 
 		/*
 		 * These are special inodes, they were created during
@@ -1320,7 +1320,7 @@
 		 * so they live here with reference counter being 1 and prevent
 		 * umount from succeed since it believes that they are busy.
 		 */
-		count = inode->i_count;
+		count = inode->i_refs;
 		if (count) {
 			list_del_init(&inode->i_sb_list);
 			while (count--)



^ permalink raw reply	[flat|nested] 70+ messages in thread

* [patch 35/35] fs: icache document more lock orders
  2010-10-19  3:42 [patch 00/35] my inode scaling series for review npiggin
                   ` (33 preceding siblings ...)
  2010-10-19  3:42 ` [patch 34/35] fs: inode rename i_count to i_refs npiggin
@ 2010-10-19  3:42 ` npiggin
  2010-10-19 16:22 ` [patch 00/35] my inode scaling series for review Christoph Hellwig
  2010-10-20 13:14 ` Al Viro
  36 siblings, 0 replies; 70+ messages in thread
From: npiggin @ 2010-10-19  3:42 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel

[-- Attachment #1: fs-inode-more-lock-doc.patch --]
[-- Type: text/plain, Size: 639 bytes --]

Add some more documentation of (existing) lock orders

Signed-off-by: Nick Piggin <npiggin@kernel.dk>

---
 fs/inode.c |    2 ++
 1 file changed, 2 insertions(+)

Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c	2010-10-19 14:19:00.000000000 +1100
+++ linux-2.6/fs/inode.c	2010-10-19 14:19:00.000000000 +1100
@@ -51,7 +51,9 @@
  *   inode_list_lglock
  *   zone->inode_lru_lock
  *   wb->b_lock
+ *     sb_lock (pin_sb_for_writeback)
  *   inode_hash_bucket lock
+ *   dentry->d_lock (alias management)
  */
 /*
  * This is needed for the following functions:
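
Reading the indentation in the comment as "may be taken while the lock
one level up is held", the first hunk documents that sb_lock nests
inside wb->b_lock, as pin_sb_for_writeback() does when invoked from
writeback. A sketch of one sequence that ordering permits (illustrative
only, assuming that reading; wb and sb come from the writeback context):

	spin_lock(&wb->b_lock);
	/* ... locate the inode's superblock ... */
	spin_lock(&sb_lock);		/* as in pin_sb_for_writeback() */
	sb->s_count++;			/* pin the superblock */
	spin_unlock(&sb_lock);
	spin_unlock(&wb->b_lock);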



^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [patch 03/35] mm: implement per-zone shrinker
  2010-10-19  3:42 ` [patch 03/35] mm: implement per-zone shrinker npiggin
@ 2010-10-19  4:49   ` KOSAKI Motohiro
  2010-10-19  5:33     ` Nick Piggin
  0 siblings, 1 reply; 70+ messages in thread
From: KOSAKI Motohiro @ 2010-10-19  4:49 UTC (permalink / raw)
  To: npiggin; +Cc: kosaki.motohiro, linux-kernel, linux-fsdevel, linux-mm

Hi

> Index: linux-2.6/include/linux/mm.h
> ===================================================================
> --- linux-2.6.orig/include/linux/mm.h	2010-10-19 14:19:40.000000000 +1100
> +++ linux-2.6/include/linux/mm.h	2010-10-19 14:36:48.000000000 +1100
> @@ -997,6 +997,10 @@
>  /*
>   * A callback you can register to apply pressure to ageable caches.
>   *
> + * 'shrink_zone' is the new shrinker API. It is to be used in preference
> + * to 'shrink'. One must point to a shrinker function, the other must
> + * be NULL. See 'shrink_slab' for details about the shrink_zone API.
> + *
>   * 'shrink' is passed a count 'nr_to_scan' and a 'gfpmask'.  It should
>   * look through the least-recently-used 'nr_to_scan' entries and
>   * attempt to free them up.  It should return the number of objects
> @@ -1013,13 +1017,53 @@
>  	int (*shrink)(struct shrinker *, int nr_to_scan, gfp_t gfp_mask);
>  	int seeks;	/* seeks to recreate an obj */
>  
> +	/*
> +	 * shrink_zone - slab shrinker callback for reclaimable objects
> +	 * @shrink: this struct shrinker
> +	 * @zone: zone to scan
> +	 * @scanned: pagecache lru pages scanned in zone
> +	 * @total: total pagecache lru pages in zone
> +	 * @global: global pagecache lru pages (for zone-unaware shrinkers)
> +	 * @flags: shrinker flags
> +	 * @gfp_mask: gfp context we are operating within
> +	 *
> +	 * The shrinkers are responsible for calculating the appropriate
> +	 * pressure to apply, batching up scanning (and cond_resched,
> +	 * cond_resched_lock etc), and updating events counters including
> +	 * count_vm_event(SLABS_SCANNED, nr).
> +	 *
> +	 * This approach gives flexibility to the shrinkers. They know best how
> +	 * to do batching, how much time between cond_resched is appropriate,
> +	 * what statistics to increment, etc.
> +	 */
> +	void (*shrink_zone)(struct shrinker *shrink,
> +		struct zone *zone, unsigned long scanned,
> +		unsigned long total, unsigned long global,
> +		unsigned long flags, gfp_t gfp_mask);

Now that we have decided not to remove the old (*shrink)() interface, and
zone-unaware slab users will continue to use it, why do we need the global
argument? If only zone-aware shrinkers use (*shrink_zone)(), we can remove it.

Personally I think we should remove it, because removing it sends a clear
message that all shrinkers eventually need to implement zone awareness.
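
For concreteness, a minimal zone-aware user of the quoted callback might
look like the sketch below. The my_cache_* names are hypothetical
stand-ins for a real subsystem's per-zone object counting and freeing,
and the pressure calculation is just the scanned/total proportion
suggested by the API comment:

	#include <linux/kernel.h>
	#include <linux/mm.h>
	#include <linux/sched.h>

	static unsigned long my_cache_count(struct zone *zone);		/* hypothetical */
	static void my_cache_scan(struct zone *zone, unsigned long nr);	/* hypothetical */

	static void my_cache_shrink_zone(struct shrinker *shrink,
			struct zone *zone, unsigned long scanned,
			unsigned long total, unsigned long global,
			unsigned long flags, gfp_t gfp_mask)
	{
		unsigned long nr = my_cache_count(zone);
		/* pressure proportional to pagecache scanning in this zone */
		unsigned long to_scan = nr * scanned / (total + 1);

		while (to_scan) {
			unsigned long batch = min(to_scan, 128UL);

			my_cache_scan(zone, batch);	/* frees up to batch objects */
			count_vm_events(SLABS_SCANNED, batch);
			to_scan -= batch;
			cond_resched();
		}
	}

	static struct shrinker my_cache_shrinker = {
		.shrink_zone	= my_cache_shrink_zone,
		.seeks		= DEFAULT_SEEKS,
	};

Registration would be unchanged from the old API:
register_shrinker(&my_cache_shrinker).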




^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [patch 03/35] mm: implement per-zone shrinker
  2010-10-19  4:49   ` KOSAKI Motohiro
@ 2010-10-19  5:33     ` Nick Piggin
  2010-10-19  5:40       ` KOSAKI Motohiro
  0 siblings, 1 reply; 70+ messages in thread
From: Nick Piggin @ 2010-10-19  5:33 UTC (permalink / raw)
  To: KOSAKI Motohiro; +Cc: npiggin, linux-kernel, linux-fsdevel, linux-mm

Hi,

On Tue, Oct 19, 2010 at 01:49:12PM +0900, KOSAKI Motohiro wrote:
> Hi
> 
> > Index: linux-2.6/include/linux/mm.h
> > ===================================================================
> > --- linux-2.6.orig/include/linux/mm.h	2010-10-19 14:19:40.000000000 +1100
> > +++ linux-2.6/include/linux/mm.h	2010-10-19 14:36:48.000000000 +1100
> > @@ -997,6 +997,10 @@
> >  /*
> >   * A callback you can register to apply pressure to ageable caches.
> >   *
> > + * 'shrink_zone' is the new shrinker API. It is to be used in preference
> > + * to 'shrink'. One must point to a shrinker function, the other must
> > + * be NULL. See 'shrink_slab' for details about the shrink_zone API.
> 
...

> Now that we have decided not to remove the old (*shrink)() interface, and
> zone-unaware slab users will continue to use it, why do we need the global
> argument? If only zone-aware shrinkers use (*shrink_zone)(), we can remove it.
> 
> Personally I think we should remove it, because removing it sends a clear
> message that all shrinkers eventually need to implement zone awareness.

I agree, I do want to remove the old API, but it's easier to merge if
I just start by adding the new API. It is split out from my previous
patch which does convert all users of the API. When this gets merged, I
will break those out and send them via respective maintainers, then
remove the old API when they're all converted upstream.

Thanks,
Nick


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [patch 03/35] mm: implement per-zone shrinker
  2010-10-19  5:33     ` Nick Piggin
@ 2010-10-19  5:40       ` KOSAKI Motohiro
  0 siblings, 0 replies; 70+ messages in thread
From: KOSAKI Motohiro @ 2010-10-19  5:40 UTC (permalink / raw)
  To: Nick Piggin; +Cc: kosaki.motohiro, linux-kernel, linux-fsdevel, linux-mm

> Hi,
> 
> On Tue, Oct 19, 2010 at 01:49:12PM +0900, KOSAKI Motohiro wrote:
> > Hi
> > 
> > > Index: linux-2.6/include/linux/mm.h
> > > ===================================================================
> > > --- linux-2.6.orig/include/linux/mm.h	2010-10-19 14:19:40.000000000 +1100
> > > +++ linux-2.6/include/linux/mm.h	2010-10-19 14:36:48.000000000 +1100
> > > @@ -997,6 +997,10 @@
> > >  /*
> > >   * A callback you can register to apply pressure to ageable caches.
> > >   *
> > > + * 'shrink_zone' is the new shrinker API. It is to be used in preference
> > > + * to 'shrink'. One must point to a shrinker function, the other must
> > > + * be NULL. See 'shrink_slab' for details about the shrink_zone API.
> > 
> ...
> 
> > Now that we have decided not to remove the old (*shrink)() interface, and
> > zone-unaware slab users will continue to use it, why do we need the global
> > argument? If only zone-aware shrinkers use (*shrink_zone)(), we can remove it.
> > 
> > Personally I think we should remove it, because removing it sends a clear
> > message that all shrinkers eventually need to implement zone awareness.
> 
> I agree, I do want to remove the old API, but it's easier to merge if
> I just start by adding the new API. It is split out from my previous
> patch which does convert all users of the API. When this gets merged, I
> will break those out and send them via respective maintainers, then
> remove the old API when they're all converted upstream.

OK, got it. I have no objection to this step-by-step development. Thanks
for the quick response!





^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [patch 08/35] fs: icache lock i_count
  2010-10-19  3:42 ` [patch 08/35] fs: icache lock i_count npiggin
@ 2010-10-19 10:16   ` Boaz Harrosh
  2010-10-20  2:14     ` Nick Piggin
  0 siblings, 1 reply; 70+ messages in thread
From: Boaz Harrosh @ 2010-10-19 10:16 UTC (permalink / raw)
  To: npiggin; +Cc: linux-kernel, linux-fsdevel

On 10/19/2010 05:42 AM, npiggin@kernel.dk wrote:
> Protect inode->i_count with i_lock, rather than having it atomic.
> 
> Signed-off-by: Nick Piggin <npiggin@kernel.dk>
> 
> ---
<>
>  fs/exofs/inode.c                         |   12 +++++++---
>  fs/exofs/namei.c                         |    4 ++-
<>
> Index: linux-2.6/fs/exofs/inode.c
> ===================================================================
> --- linux-2.6.orig/fs/exofs/inode.c	2010-10-19 14:17:26.000000000 +1100
> +++ linux-2.6/fs/exofs/inode.c	2010-10-19 14:19:18.000000000 +1100
> @@ -1107,7 +1107,9 @@
>  

Hi Nick, please use the -p option in your diff(s); it is a bit hard to follow
and review without the proper function context. These patches are in a git
tree, so why not use git to produce and send them?

>  	set_obj_created(oi);
>  
> -	atomic_dec(&inode->i_count);
> +	spin_lock(&inode->i_lock);
> +	inode->i_count--;
> +	spin_unlock(&inode->i_lock);

I've queued up a patch in Linux-next that will conflict with this.
The patch uses iput() instead.
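
(For illustration, the iput()-based form of this hunk would be roughly
the sketch below, mirroring the quoted create_done() completion context;
this is not the actual queued linux-next patch:)

	set_obj_created(oi);

	iput(inode);	/* drop the ref taken before exofs_sbi_create() */
	wake_up(&oi->i_wq);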

>  	wake_up(&oi->i_wq);
>  }
>  
> @@ -1160,14 +1162,18 @@
>  	/* increment the refcount so that the inode will still be around when we
>  	 * reach the callback
>  	 */
> -	atomic_inc(&inode->i_count);
> +	spin_lock(&inode->i_lock);
> +	inode->i_count++;
> +	spin_unlock(&inode->i_lock);
>  
>  	ios->done = create_done;
>  	ios->private = inode;
>  	ios->cred = oi->i_cred;
>  	ret = exofs_sbi_create(ios);
>  	if (ret) {
> -		atomic_dec(&inode->i_count);
> +		spin_lock(&inode->i_lock);
> +		inode->i_count--;
> +		spin_unlock(&inode->i_lock);

Here too. (iput)

>  		exofs_put_io_state(ios);
>  		return ERR_PTR(ret);
>  	}
> Index: linux-2.6/fs/exofs/namei.c
> ===================================================================
> --- linux-2.6.orig/fs/exofs/namei.c	2010-10-19 14:17:26.000000000 +1100
> +++ linux-2.6/fs/exofs/namei.c	2010-10-19 14:19:18.000000000 +1100
> @@ -153,7 +153,9 @@
>  
>  	inode->i_ctime = CURRENT_TIME;
>  	inode_inc_link_count(inode);
> -	atomic_inc(&inode->i_count);
> +	spin_lock(&inode->i_lock);
> +	inode->i_count++;

All these will change to inode_get(), right?

> +	spin_unlock(&inode->i_lock);
>  
>  	return exofs_add_nondir(dentry, inode);
>  }

Thanks
Boaz

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [patch 33/35] fs: icache introduce inode_get/inode_get_ilock
  2010-10-19  3:42 ` [patch 33/35] fs: icache introduce inode_get/inode_get_ilock npiggin
@ 2010-10-19 10:17   ` Boaz Harrosh
  2010-10-20  2:17     ` Nick Piggin
  0 siblings, 1 reply; 70+ messages in thread
From: Boaz Harrosh @ 2010-10-19 10:17 UTC (permalink / raw)
  To: npiggin; +Cc: linux-kernel, linux-fsdevel

On 10/19/2010 05:42 AM, npiggin@kernel.dk wrote:
> Factor open coded inode lock, increment, unlock into a function inode_get().
> Rename __iget to inode_get_ilock.
> 
> Signed-off-by: Nick Piggin <npiggin@kernel.dk>
> 
> ---

<>

> Index: linux-2.6/fs/exofs/inode.c
> ===================================================================
> --- linux-2.6.orig/fs/exofs/inode.c	2010-10-19 14:18:58.000000000 +1100
> +++ linux-2.6/fs/exofs/inode.c	2010-10-19 14:19:16.000000000 +1100
> @@ -1162,9 +1162,7 @@
>  	/* increment the refcount so that the inode will still be around when we
>  	 * reach the callback
>  	 */
> -	spin_lock(&inode->i_lock);
> -	inode->i_count++;
> -	spin_unlock(&inode->i_lock);
> +	inode_get(inode);
>  
>  	ios->done = create_done;
>  	ios->private = inode;
> Index: linux-2.6/fs/exofs/namei.c
> ===================================================================
> --- linux-2.6.orig/fs/exofs/namei.c	2010-10-19 14:18:58.000000000 +1100
> +++ linux-2.6/fs/exofs/namei.c	2010-10-19 14:19:00.000000000 +1100
> @@ -153,9 +153,7 @@
>  
>  	inode->i_ctime = CURRENT_TIME;
>  	inode_inc_link_count(inode);
> -	spin_lock(&inode->i_lock);
> -	inode->i_count++;
> -	spin_unlock(&inode->i_lock);
> +	inode_get(inode);
>  
>  	return exofs_add_nondir(dentry, inode);
>  }

Why not define an intermediate inode_get() in patch 08/35 and change
both the puts and the gets of all filesystems in one patch, instead of
two tree-sweeping patches? (At least for all the trivial places like
this one.)
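
(Something like the following sketch, defined once in patch 08/35 and
used by every trivial call site from then on; hypothetical, not code
from the posted series:)

	static inline void inode_get(struct inode *inode)
	{
		spin_lock(&inode->i_lock);
		inode->i_count++;
		spin_unlock(&inode->i_lock);
	}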

Thanks
Boaz



^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [patch 07/35] fs: icache lock i_state
  2010-10-19  3:42 ` [patch 07/35] fs: icache lock i_state npiggin
@ 2010-10-19 10:47   ` Miklos Szeredi
  2010-10-19 17:06     ` Peter Zijlstra
  0 siblings, 1 reply; 70+ messages in thread
From: Miklos Szeredi @ 2010-10-19 10:47 UTC (permalink / raw)
  To: npiggin; +Cc: linux-kernel, linux-fsdevel

On Tue, 19 Oct 2010, npiggin@kernel.dk wrote:
> Index: linux-2.6/fs/fs-writeback.c
> ===================================================================
> --- linux-2.6.orig/fs/fs-writeback.c	2010-10-19 14:18:58.000000000 +1100
> +++ linux-2.6/fs/fs-writeback.c	2010-10-19 14:19:36.000000000 +1100
> @@ -288,10 +288,12 @@

Having the function name here would help review a bit.  If you are using
quilt, add "export QUILT_DIFF_OPTS=-p" to your .quiltrc.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [patch 31/35] fs: icache per-zone inode LRU
  2010-10-19  3:42 ` [patch 31/35] fs: icache per-zone inode LRU npiggin
@ 2010-10-19 12:38   ` Dave Chinner
  2010-10-20  2:35     ` Nick Piggin
  2010-10-20  3:14     ` KOSAKI Motohiro
  0 siblings, 2 replies; 70+ messages in thread
From: Dave Chinner @ 2010-10-19 12:38 UTC (permalink / raw)
  To: npiggin; +Cc: linux-kernel, linux-fsdevel

On Tue, Oct 19, 2010 at 02:42:47PM +1100, npiggin@kernel.dk wrote:
> Per-zone LRUs and shrinkers for inode cache.

Regardless of whether this is the right way to scale or not, I don't
like the fact that this moves the cache LRUs into the memory
management structures, and expands the use of MM specific structures
throughout the code. It ties the cache implementation to the current
VM implementation. That, IMO, goes against all the principles of
modularisation at the source code level, and it means we have to tie
all shrinker implementations to the current internal implementation
of the VM. I don't think that is a wise thing to do because of the
dependencies and impedance mismatches it introduces.

As an example: XFS inodes to be reclaimed are simply tagged in a
radix tree so the shrinker can reclaim inodes in optimal IO order
rather than strict LRU order. It simply does not match a zone-based
shrinker implementation in any way, shape or form, nor does its
inherent parallelism match that of the way shrinkers are called.
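
(For readers who have not seen the XFS scheme: the idea is roughly the
sketch below, using the kernel's radix-tree tag API. Everything except
the radix_tree_* calls is a made-up stand-in, and all locking is
elided:)

	#include <linux/radix-tree.h>

	#define RECLAIM_TAG	0		/* stand-in for XFS's reclaim tag */

	struct my_inode {
		unsigned long ino;
		/* ... */
	};

	static void reclaim_one(struct my_inode *ip);	/* hypothetical */

	/* marking an inode reclaimable is just tagging its slot */
	static void mark_reclaimable(struct radix_tree_root *tree, unsigned long ino)
	{
		radix_tree_tag_set(tree, ino, RECLAIM_TAG);
	}

	/* the shrinker then walks tagged inodes in index (~disk) order */
	static void reclaim_walk(struct radix_tree_root *tree)
	{
		struct my_inode *batch[32];
		unsigned long first = 0;
		unsigned int nr, i;

		while ((nr = radix_tree_gang_lookup_tag(tree, (void **)batch,
							first, 32, RECLAIM_TAG))) {
			for (i = 0; i < nr; i++)
				reclaim_one(batch[i]);
			first = batch[nr - 1]->ino + 1;
		}
	}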

Any change in shrinker infrastructure needs to be able to handle
these sorts of impedance mismatches between the VM and the cache
subsystem. The current API doesn't handle this very well, either,
so it's something that we need to fix so that scalability is easy
for everyone.

Anyway, my main point is that tying the LRU and shrinker scaling to
the implementation of the VM is a one-off solution that doesn't work
for generic infrastructure. Other subsystems need the same
large-machine scaling treatment, and there's no way we should be
tying them all into the struct zone. It needs further abstraction.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [patch 23/35] fs: icache use per-CPU lists and locks for sb inode lists
  2010-10-19  3:42 ` [patch 23/35] fs: icache use per-CPU lists and locks for sb inode lists npiggin
@ 2010-10-19 15:33   ` Miklos Szeredi
  2010-10-20  2:37     ` Nick Piggin
  0 siblings, 1 reply; 70+ messages in thread
From: Miklos Szeredi @ 2010-10-19 15:33 UTC (permalink / raw)
  To: npiggin; +Cc: linux-kernel, linux-fsdevel

On Tue, 19 Oct 2010, npiggin@kernel.dk wrote:
> Signed-off-by: Nick Piggin <npiggin@kernel.dk>
> 
> ---
>  fs/drop_caches.c                 |    4 -
>  fs/fs-writeback.c                |   15 +++--
>  fs/inode.c                       |   99 ++++++++++++++++++++++++++++-----------
>  fs/notify/inode_mark.c           |    6 +-
>  fs/quota/dquot.c                 |    8 +--
>  fs/super.c                       |   16 +++++-
>  include/linux/fs.h               |   58 ++++++++++++++++++++++
>  include/linux/fsnotify_backend.h |    4 -
>  include/linux/writeback.h        |    1 
>  9 files changed, 164 insertions(+), 47 deletions(-)
> 
> Index: linux-2.6/fs/inode.c
> ===================================================================
> --- linux-2.6.orig/fs/inode.c	2010-10-19 14:18:59.000000000 +1100
> +++ linux-2.6/fs/inode.c	2010-10-19 14:19:23.000000000 +1100

[snip]

> @@ -718,13 +714,63 @@
>  	return tmp & I_HASHMASK;
>  }
>  
> +static inline int inode_list_cpu(struct inode *inode)
> +{
> +#ifdef CONFIG_SMP
> +	return inode->i_sb_list_cpu;
> +#else
> +	return smp_processor_id();
> +#endif
> +}
> +
> +/* helper for file_sb_list_add to reduce ifdefs */
> +static inline void __inode_sb_list_add(struct inode *inode, struct super_block *sb)
> +{
> +	struct list_head *list;
> +#ifdef CONFIG_SMP
> +	int cpu;
> +	cpu = smp_processor_id();
> +	inode->i_sb_list_cpu = cpu;
> +	list = per_cpu_ptr(sb->s_inodes, cpu);
> +#else
> +	list = &sb->s_inodes;
> +#endif
> +	list_add_rcu(&inode->i_sb_list, list);
> +}
> +
> +/**
> + * inode_sb_list_add - add an inode to the sb's file list
> + * @inode: inode to add
> + * @sb: sb to add it to
> + *
> + * Use this function to associate an with the superblock it belongs to.

                                       ^^^inode

> + */
> +static void inode_sb_list_add(struct inode *inode, struct super_block *sb)
> +{
> +	lg_local_lock(inode_list_lglock);
> +	__inode_sb_list_add(inode, sb);
> +	lg_local_unlock(inode_list_lglock);
> +}
> +
> +/**
> + * inode_sb_list_del - remove an inode from the sb's inode list
> + * @inode: inode to remove
> + * @sb: sb to remove it from
> + *
> + * Use this function to remove an inode from its superblock.
> + */
> +static void inode_sb_list_del(struct inode *inode)
> +{
> +	lg_local_lock_cpu(inode_list_lglock, inode_list_cpu(inode));
> +	list_del_rcu(&inode->i_sb_list);
> +	lg_local_unlock_cpu(inode_list_lglock, inode_list_cpu(inode));
> +}
> +
>  static inline void
>  __inode_add_to_lists(struct super_block *sb, struct inode_hash_bucket *b,
>  			struct inode *inode)
>  {
> -	spin_lock(&sb_inode_list_lock);
> -	list_add_rcu(&inode->i_sb_list, &sb->s_inodes);
> -	spin_unlock(&sb_inode_list_lock);
> +	inode_sb_list_add(inode, sb);
>  	if (b) {
>  		spin_lock_bucket(b);
>  		hlist_bl_add_head_rcu(&inode->i_hash, &b->head);
> @@ -1270,6 +1316,7 @@
>  			continue;
>  		if (!spin_trylock(&old->i_lock)) {
>  			spin_unlock_bucket(b);
> +			cpu_relax();

Doesn't this logically belong in a previous patch?

>  			goto repeat;
>  		}
>  		goto found_old;
> @@ -1453,9 +1500,7 @@
>  			inodes_stat.nr_unused--;
>  		spin_unlock(&wb_inode_list_lock);
>  	}
> -	spin_lock(&sb_inode_list_lock);
> -	list_del_rcu(&inode->i_sb_list);
> -	spin_unlock(&sb_inode_list_lock);
> +	inode_sb_list_del(inode);
>  	WARN_ON(inode->i_state & I_NEW);
>  	inode->i_state |= I_FREEING;
>  	spin_unlock(&inode->i_lock);
> @@ -1732,6 +1777,8 @@
>  					 init_once);
>  	register_shrinker(&icache_shrinker);
>  
> +	lg_lock_init(inode_list_lglock);
> +
>  	/* Hash may have been set up in inode_init_early */
>  	if (!hashdist)
>  		return;
> Index: linux-2.6/include/linux/fs.h
> ===================================================================
> --- linux-2.6.orig/include/linux/fs.h	2010-10-19 14:18:59.000000000 +1100
> +++ linux-2.6/include/linux/fs.h	2010-10-19 14:19:22.000000000 +1100
> @@ -374,6 +374,7 @@
>  #include <linux/cache.h>
>  #include <linux/kobject.h>
>  #include <linux/list.h>
> +#include <linux/rculist.h>
>  #include <linux/rculist_bl.h>
>  #include <linux/radix-tree.h>
>  #include <linux/prio_tree.h>
> @@ -733,6 +734,9 @@
>  		struct rcu_head		i_rcu;
>  	};
>  	unsigned long		i_ino;
> +#ifdef CONFIG_SMP
> +	int			i_sb_list_cpu;
> +#endif
>  	unsigned int		i_count;
>  	unsigned int		i_nlink;
>  	uid_t			i_uid;
> @@ -1344,11 +1348,12 @@
>  #endif
>  	const struct xattr_handler **s_xattr;
>  
> -	struct list_head	s_inodes;	/* all inodes */
>  	struct hlist_head	s_anon;		/* anonymous dentries for (nfs) exporting */
>  #ifdef CONFIG_SMP
> +	struct list_head __percpu *s_inodes;
>  	struct list_head __percpu *s_files;
>  #else
> +	struct list_head	s_inodes;	/* all inodes */
>  	struct list_head	s_files;
>  #endif
>  	/* s_dentry_lru and s_nr_dentry_unused are protected by dcache_lock */
> @@ -2202,6 +2207,57 @@
>  	__insert_inode_hash(inode, inode->i_ino);
>  }
>  
> +#ifdef CONFIG_SMP
> +/*
> + * These macros iterate all inodes on all CPUs for a given superblock.
> + * rcu_read_lock must be held.
> + */
> +#define do_inode_list_for_each_entry_rcu(__sb, __inode)		\
> +{								\
> +	int i;							\
> +	for_each_possible_cpu(i) {				\
> +		struct list_head *list;				\
> +		list = per_cpu_ptr((__sb)->s_inodes, i);	\
> +		list_for_each_entry_rcu((__inode), list, i_sb_list)
> +
> +#define while_inode_list_for_each_entry_rcu			\
> +	}							\
> +}
> +
> +#define do_inode_list_for_each_entry_safe(__sb, __inode, __tmp)	\
> +{								\
> +	int i;							\
> +	for_each_possible_cpu(i) {				\
> +		struct list_head *list;				\
> +		list = per_cpu_ptr((__sb)->s_inodes, i);	\
> +		list_for_each_entry_safe((__inode), (__tmp), list, i_sb_list)
> +
> +#define while_inode_list_for_each_entry_safe			\
> +	}							\
> +}
> +
> +#else
> +
> +#define do_inode_list_for_each_entry_rcu(__sb, __inode)		\
> +{								\
> +	struct list_head *list;					\
> +	list = &(__sb)->s_inodes;				\
> +	list_for_each_entry_rcu((__inode), list, i_sb_list)
> +
> +#define while_inode_list_for_each_entry_rcu			\
> +}
> +
> +#define do_inode_list_for_each_entry_safe(__sb, __inode, __tmp)	\
> +{								\
> +	struct list_head *list;					\
> +	list = &(__sb)->s_inodes;				\
> +	list_for_each_entry_safe((__inode), (__tmp), list, i_sb_list)
> +
> +#define while_inode_list_for_each_entry_safe			\
> +}
> +
> +#endif
> +
>  #ifdef CONFIG_BLOCK
>  extern void submit_bio(int, struct bio *);
>  extern int bdev_read_only(struct block_device *);
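
As an aside, the do/while macro pair above hides an opening and a
closing brace, so a caller looks like this (sketch, mirroring the
drop_caches.c conversion below):

	rcu_read_lock();
	do_inode_list_for_each_entry_rcu(sb, inode) {
		spin_lock(&inode->i_lock);
		/* examine inode state, take a reference, etc. */
		spin_unlock(&inode->i_lock);
	} while_inode_list_for_each_entry_rcu
	rcu_read_unlock();
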
> Index: linux-2.6/fs/super.c
> ===================================================================
> --- linux-2.6.orig/fs/super.c	2010-10-19 14:17:17.000000000 +1100
> +++ linux-2.6/fs/super.c	2010-10-19 14:18:59.000000000 +1100
> @@ -67,12 +67,25 @@
>  			for_each_possible_cpu(i)
>  				INIT_LIST_HEAD(per_cpu_ptr(s->s_files, i));
>  		}
> +		s->s_inodes = alloc_percpu(struct list_head);
> +		if (!s->s_inodes) {
> +			free_percpu(s->s_files);
> +			security_sb_free(s);
> +			kfree(s);
> +			s = NULL;
> +			goto out;

Factor out the error cleanups to separate labels?
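
Roughly like this, say (label names made up):

	s->s_files = alloc_percpu(struct list_head);
	if (!s->s_files)
		goto err_security;
	s->s_inodes = alloc_percpu(struct list_head);
	if (!s->s_inodes)
		goto err_files;
	/* ... per-cpu list head init, rest of alloc_super ... */
out:
	return s;

err_files:
	free_percpu(s->s_files);
err_security:
	security_sb_free(s);
	kfree(s);
	s = NULL;
	goto out;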

> +		} else {
> +			int i;
> +
> +			for_each_possible_cpu(i)
> +				INIT_LIST_HEAD(per_cpu_ptr(s->s_inodes, i));
> +		}
>  #else
>  		INIT_LIST_HEAD(&s->s_files);
> +		INIT_LIST_HEAD(&s->s_inodes);
>  #endif
>  		INIT_LIST_HEAD(&s->s_instances);
>  		INIT_HLIST_HEAD(&s->s_anon);
> -		INIT_LIST_HEAD(&s->s_inodes);
>  		INIT_LIST_HEAD(&s->s_dentry_lru);
>  		init_rwsem(&s->s_umount);
>  		mutex_init(&s->s_lock);
> @@ -124,6 +137,7 @@
>  static inline void destroy_super(struct super_block *s)
>  {
>  #ifdef CONFIG_SMP
> +	free_percpu(s->s_inodes);
>  	free_percpu(s->s_files);
>  #endif
>  	security_sb_free(s);
> Index: linux-2.6/fs/drop_caches.c
> ===================================================================
> --- linux-2.6.orig/fs/drop_caches.c	2010-10-19 14:18:59.000000000 +1100
> +++ linux-2.6/fs/drop_caches.c	2010-10-19 14:19:18.000000000 +1100
> @@ -17,7 +17,7 @@
>  	struct inode *inode, *toput_inode = NULL;
>  
>  	rcu_read_lock();
> -	list_for_each_entry_rcu(inode, &sb->s_inodes, i_sb_list) {
> +	do_inode_list_for_each_entry_rcu(sb, inode) {
>  		spin_lock(&inode->i_lock);
>  		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)
>  				|| inode->i_mapping->nrpages == 0) {
> @@ -31,7 +31,7 @@
>  		iput(toput_inode);
>  		toput_inode = inode;
>  		rcu_read_lock();
> -	}
> +	} while_inode_list_for_each_entry_rcu
>  	rcu_read_unlock();
>  	iput(toput_inode);
>  }
> Index: linux-2.6/fs/fs-writeback.c
> ===================================================================
> --- linux-2.6.orig/fs/fs-writeback.c	2010-10-19 14:18:59.000000000 +1100
> +++ linux-2.6/fs/fs-writeback.c	2010-10-19 14:19:22.000000000 +1100
> @@ -1074,7 +1074,7 @@
>  	 * we still have to wait for that writeout.
>  	 */
>  	rcu_read_lock();
> -	list_for_each_entry_rcu(inode, &sb->s_inodes, i_sb_list) {
> +	do_inode_list_for_each_entry_rcu(sb, inode) {
>  		struct address_space *mapping;
>  
>  		spin_lock(&inode->i_lock);
> @@ -1093,11 +1093,12 @@
>  		spin_unlock(&inode->i_lock);
>  		rcu_read_unlock();
>  		/*
> -		 * We hold a reference to 'inode' so it couldn't have been
> -		 * removed from s_inodes list while we dropped the i_lock.  We
> -		 * cannot iput the inode now as we can be holding the last
> -		 * reference and we cannot iput it under spinlock. So we keep
> -		 * the reference and iput it later.
> +		 * We hold a reference to 'inode' so it couldn't have
> +		 * been removed from s_inodes list while we dropped the
> +		 * i_lock.  We cannot iput the inode now as we can be
> +		 * holding the last reference and we cannot iput it
> +		 * under spinlock. So we keep the reference and iput it
> +		 * later.
>  		 */
>  		iput(old_inode);
>  		old_inode = inode;
> @@ -1107,7 +1108,7 @@
>  		cond_resched();
>  
>  		rcu_read_lock();
> -	}
> +	} while_inode_list_for_each_entry_rcu
>  	rcu_read_unlock();
>  	iput(old_inode);
>  }
> Index: linux-2.6/fs/notify/inode_mark.c
> ===================================================================
> --- linux-2.6.orig/fs/notify/inode_mark.c	2010-10-19 14:18:59.000000000 +1100
> +++ linux-2.6/fs/notify/inode_mark.c	2010-10-19 14:19:18.000000000 +1100
> @@ -236,11 +236,11 @@
>   * and with the sb going away, no new inodes will appear or be referenced
>   * from other paths.
>   */
> -void fsnotify_unmount_inodes(struct list_head *list)
> +void fsnotify_unmount_inodes(struct super_block *sb)
>  {
>  	struct inode *inode, *next_i, *need_iput = NULL;
>  
> -	list_for_each_entry_safe(inode, next_i, list, i_sb_list) {
> +	do_inode_list_for_each_entry_safe(sb, inode, next_i) {
>  		struct inode *need_iput_tmp;
>  
>  		spin_lock(&inode->i_lock);
> @@ -295,5 +295,5 @@
>  		fsnotify_inode_delete(inode);
>  
>  		iput(inode);
> -	}
> +	} while_inode_list_for_each_entry_safe
>  }
> Index: linux-2.6/fs/quota/dquot.c
> ===================================================================
> --- linux-2.6.orig/fs/quota/dquot.c	2010-10-19 14:18:59.000000000 +1100
> +++ linux-2.6/fs/quota/dquot.c	2010-10-19 14:19:18.000000000 +1100
> @@ -898,7 +898,7 @@
>  #endif
>  
>  	rcu_read_lock();
> -	list_for_each_entry_rcu(inode, &sb->s_inodes, i_sb_list) {
> +	do_inode_list_for_each_entry_rcu(sb, inode) {
>  		spin_lock(&inode->i_lock);
>  		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
>  			spin_unlock(&inode->i_lock);
> @@ -930,7 +930,7 @@
>  		 * lock. So we keep the reference and iput it later. */
>  		old_inode = inode;
>  		rcu_read_lock();
> -	}
> +	} while_inode_list_for_each_entry_rcu
>  	rcu_read_unlock();
>  	iput(old_inode);
>  
> @@ -1013,7 +1013,7 @@
>  	int reserved = 0;
>  
>  	rcu_read_lock();
> -	list_for_each_entry_rcu(inode, &sb->s_inodes, i_sb_list) {
> +	do_inode_list_for_each_entry_rcu(sb, inode) {
>  		/*
>  		 *  We have to scan also I_NEW inodes because they can already
>  		 *  have quota pointer initialized. Luckily, we need to touch
> @@ -1025,7 +1025,7 @@
>  				reserved = 1;
>  			remove_inode_dquot_ref(inode, type, tofree_head);
>  		}
> -	}
> +	} while_inode_list_for_each_entry_rcu
>  	rcu_read_unlock();
>  #ifdef CONFIG_QUOTA_DEBUG
>  	if (reserved) {
> Index: linux-2.6/include/linux/fsnotify_backend.h
> ===================================================================
> --- linux-2.6.orig/include/linux/fsnotify_backend.h	2010-10-19 14:17:17.000000000 +1100
> +++ linux-2.6/include/linux/fsnotify_backend.h	2010-10-19 14:18:59.000000000 +1100
> @@ -402,7 +402,7 @@
>  extern void fsnotify_clear_marks_by_group(struct fsnotify_group *group);
>  extern void fsnotify_get_mark(struct fsnotify_mark *mark);
>  extern void fsnotify_put_mark(struct fsnotify_mark *mark);
> -extern void fsnotify_unmount_inodes(struct list_head *list);
> +extern void fsnotify_unmount_inodes(struct super_block *sb);
>  
>  /* put here because inotify does some weird stuff when destroying watches */
>  extern struct fsnotify_event *fsnotify_create_event(struct inode *to_tell, __u32 mask,
> @@ -443,7 +443,7 @@
>  	return 0;
>  }
>  
> -static inline void fsnotify_unmount_inodes(struct list_head *list)
> +static inline void fsnotify_unmount_inodes(struct super_block *sb)
>  {}
>  
>  #endif	/* CONFIG_FSNOTIFY */
> Index: linux-2.6/include/linux/writeback.h
> ===================================================================
> --- linux-2.6.orig/include/linux/writeback.h	2010-10-19 14:18:59.000000000 +1100
> +++ linux-2.6/include/linux/writeback.h	2010-10-19 14:19:21.000000000 +1100
> @@ -9,7 +9,6 @@
>  
>  struct backing_dev_info;
>  
> -extern spinlock_t sb_inode_list_lock;
>  extern spinlock_t wb_inode_list_lock;
>  extern struct list_head inode_unused;
>  
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [patch 26/35] fs: icache alloc anonymous inode allocation
  2010-10-19  3:42 ` [patch 26/35] fs: icache alloc anonymous inode allocation npiggin
@ 2010-10-19 15:50   ` Miklos Szeredi
  2010-10-20  2:38     ` Nick Piggin
  2010-10-19 16:33   ` Christoph Hellwig
  1 sibling, 1 reply; 70+ messages in thread
From: Miklos Szeredi @ 2010-10-19 15:50 UTC (permalink / raw)
  To: npiggin; +Cc: linux-kernel, linux-fsdevel

On Tue, 19 Oct 2010, npiggin@kernel.dk wrote:
> Provide a new_anon_inode function for inodes that get no default inode number
> and are not put on the sb list. This can enable filesystems to reduce locking.
> "Real" filesystems can also reduce locking by allocating an anonymous inode
> first, then adding it to the lists once the inode number is known.
> 
> Signed-off-by: Nick Piggin <npiggin@kernel.dk>
> 
> ---
>  fs/anon_inodes.c   |    2 +-
>  fs/inode.c         |   32 +++++++++++++++++++++++++++++++-
>  fs/pipe.c          |    3 ++-
>  include/linux/fs.h |    2 ++
>  net/socket.c       |    3 ++-
>  5 files changed, 38 insertions(+), 4 deletions(-)
> 
> Index: linux-2.6/fs/inode.c
> ===================================================================
> --- linux-2.6.orig/fs/inode.c	2010-10-19 14:18:59.000000000 +1100
> +++ linux-2.6/fs/inode.c	2010-10-19 14:19:22.000000000 +1100
> @@ -219,6 +219,7 @@
>  #ifdef CONFIG_QUOTA
>  	memset(&inode->i_dquot, 0, sizeof(inode->i_dquot));
>  #endif
> +	INIT_LIST_HEAD(&inode->i_sb_list);
>  	inode->i_pipe = NULL;
>  	inode->i_bdev = NULL;
>  	inode->i_cdev = NULL;
> @@ -761,6 +762,8 @@
>   */
>  static void inode_sb_list_del(struct inode *inode)
>  {
> +	if (list_empty(&inode->i_sb_list))
> +		return;
>  	lg_local_lock_cpu(inode_list_lglock, inode_list_cpu(inode));
>  	list_del_rcu(&inode->i_sb_list);
>  	lg_local_unlock_cpu(inode_list_lglock, inode_list_cpu(inode));
> @@ -819,7 +822,7 @@
>   */
>  static DEFINE_PER_CPU(unsigned int, last_ino);
>  
> -static unsigned int get_next_ino(void)
> +unsigned int get_next_ino(void)
>  {
>  	unsigned int res;
>  
> @@ -838,6 +841,7 @@
>  	put_cpu();
>  	return res;
>  }
> +EXPORT_SYMBOL(get_next_ino);
>  
>  /**
>   *	new_inode 	- obtain an inode
> @@ -870,6 +874,32 @@
>  }
>  EXPORT_SYMBOL(new_inode);
>  
> +/**
> + *	new_anon_inode 	- obtain an anonymous inode
> + *	@sb: superblock
> + *
> + *	Similar to new_inode, however the inode is not given an inode
> + *	number, and is not added to the sb's list of inodes, to reduce
> + *	overheads.
> + *
> + *	A filesystem which needs an inode number must subsequently
> + *	assign one to i_ino. A filesystem which needs inodes to be on the
> + *	per-sb list (currently only used by the vfs for umount or remount)
> + *	must add the inode to that list.
> + */
> +struct inode *new_anon_inode(struct super_block *sb)
> +{
> +	struct inode *inode;
> +
> +	inode = alloc_inode(sb);
> +	if (inode) {
> +		inode->i_ino = ULONG_MAX;
> +		inode->i_state = 0;
> +	}
> +	return inode;
> +}
> +EXPORT_SYMBOL(new_anon_inode);
> +
>  void unlock_new_inode(struct inode *inode)
>  {
>  #ifdef CONFIG_DEBUG_LOCK_ALLOC
> Index: linux-2.6/fs/pipe.c
> ===================================================================
> --- linux-2.6.orig/fs/pipe.c	2010-10-19 14:18:59.000000000 +1100
> +++ linux-2.6/fs/pipe.c	2010-10-19 14:19:00.000000000 +1100
> @@ -948,7 +948,7 @@
>  
>  static struct inode * get_pipe_inode(void)
>  {
> -	struct inode *inode = new_inode(pipe_mnt->mnt_sb);
> +	struct inode *inode = new_anon_inode(pipe_mnt->mnt_sb);
>  	struct pipe_inode_info *pipe;
>  
>  	if (!inode)
> @@ -962,6 +962,7 @@
>  	pipe->readers = pipe->writers = 1;
>  	inode->i_fop = &rdwr_pipefifo_fops;
>  
> +	inode->i_ino = get_next_ino();
>  	/*
>  	 * Mark the inode dirty from the very beginning,
>  	 * that way it will never be moved to the dirty
> Index: linux-2.6/include/linux/fs.h
> ===================================================================
> --- linux-2.6.orig/include/linux/fs.h	2010-10-19 14:18:59.000000000 +1100
> +++ linux-2.6/include/linux/fs.h	2010-10-19 14:19:21.000000000 +1100
> @@ -2192,11 +2192,13 @@
>  extern int insert_inode_locked(struct inode *);
>  extern void unlock_new_inode(struct inode *);
>  
> +extern unsigned int get_next_ino(void);
>  extern void iget_failed(struct inode *);
>  extern void end_writeback(struct inode *);
>  extern void destroy_inode(struct inode *);
>  extern void __destroy_inode(struct inode *);
>  extern struct inode *new_inode(struct super_block *);
> +extern struct inode *new_anon_inode(struct super_block *);
>  extern void free_inode_nonrcu(struct inode *inode);
>  extern int should_remove_suid(struct dentry *);
>  extern int file_remove_suid(struct file *);
> Index: linux-2.6/net/socket.c
> ===================================================================
> --- linux-2.6.orig/net/socket.c	2010-10-19 14:18:59.000000000 +1100
> +++ linux-2.6/net/socket.c	2010-10-19 14:19:19.000000000 +1100
> @@ -476,13 +476,14 @@
>  	struct inode *inode;
>  	struct socket *sock;
>  
> -	inode = new_inode(sock_mnt->mnt_sb);
> +	inode = new_anon_inode(sock_mnt->mnt_sb);
>  	if (!inode)
>  		return NULL;
>  
>  	sock = SOCKET_I(inode);
>  
>  	kmemcheck_annotate_bitfield(sock, type);
> +	inode->i_ino = get_next_ino();
>  	inode->i_mode = S_IFSOCK | S_IRWXUGO;
>  	inode->i_uid = current_fsuid();
>  	inode->i_gid = current_fsgid();
> Index: linux-2.6/fs/anon_inodes.c
> ===================================================================
> --- linux-2.6.orig/fs/anon_inodes.c	2010-10-19 14:18:58.000000000 +1100
> +++ linux-2.6/fs/anon_inodes.c	2010-10-19 14:19:19.000000000 +1100
> @@ -191,7 +191,7 @@
>   */
>  static struct inode *anon_inode_mkinode(void)
>  {
> -	struct inode *inode = new_inode(anon_inode_mnt->mnt_sb);
> +	struct inode *inode = new_anon_inode(anon_inode_mnt->mnt_sb);
>  
>  	if (!inode)
>  		return ERR_PTR(-ENOMEM);

This too needs an inode->i_ino initialization (the default ULONG_MAX
will cause EOVERFLOW on 32-bit fstat, AFAIK), though it could just be a
constant, say 2.
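
I.e. something along these lines (the constant is arbitrary, as long
as it is nonzero):

	static struct inode *anon_inode_mkinode(void)
	{
		struct inode *inode = new_anon_inode(anon_inode_mnt->mnt_sb);

		if (!inode)
			return ERR_PTR(-ENOMEM);
		/* one shared inode for all users; a fixed number is enough */
		inode->i_ino = 2;
		/* ... rest of the setup unchanged ... */
		return inode;
	}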

Miklos

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [patch 27/35] fs: icache split IO and LRU lists
  2010-10-19  3:42 ` [patch 27/35] fs: icache split IO and LRU lists npiggin
@ 2010-10-19 16:12   ` Miklos Szeredi
  2010-10-20  2:41     ` Nick Piggin
  0 siblings, 1 reply; 70+ messages in thread
From: Miklos Szeredi @ 2010-10-19 16:12 UTC (permalink / raw)
  To: npiggin; +Cc: linux-kernel, linux-fsdevel

On Tue, 19 Oct 2010, npiggin@kernel.dk wrote:
> Split the inode reclaim and writeback lists in preparation for scaling them up
> (per-bdi locking for i_io and per-zone locking for i_lru).
> 
> Signed-off-by: Nick Piggin <npiggin@kernel.dk>
> 
> ---
>  fs/fs-writeback.c  |   30 +++++++++++++++++-------------
>  fs/inode.c         |   46 +++++++++++++++++++++++++++-------------------
>  fs/nilfs2/mdt.c    |    3 ++-
>  include/linux/fs.h |    3 ++-
>  mm/backing-dev.c   |    6 +++---
>  5 files changed, 51 insertions(+), 37 deletions(-)
> 
> Index: linux-2.6/fs/fs-writeback.c
> ===================================================================
> --- linux-2.6.orig/fs/fs-writeback.c	2010-10-19 14:18:59.000000000 +1100
> +++ linux-2.6/fs/fs-writeback.c	2010-10-19 14:19:21.000000000 +1100
> @@ -173,11 +173,11 @@
>  	if (!list_empty(&wb->b_dirty)) {
>  		struct inode *tail;
>  
> -		tail = list_entry(wb->b_dirty.next, struct inode, i_list);
> +		tail = list_entry(wb->b_dirty.next, struct inode, i_io);
>  		if (time_before(inode->dirtied_when, tail->dirtied_when))
>  			inode->dirtied_when = jiffies;
>  	}
> -	list_move(&inode->i_list, &wb->b_dirty);
> +	list_move(&inode->i_io, &wb->b_dirty);
>  }
>  
>  /*
> @@ -188,7 +188,7 @@
>  	struct bdi_writeback *wb = &inode_to_bdi(inode)->wb;
>  
>  	assert_spin_locked(&wb_inode_list_lock);
> -	list_move(&inode->i_list, &wb->b_more_io);
> +	list_move(&inode->i_io, &wb->b_more_io);
>  }
>  
>  static void inode_sync_complete(struct inode *inode)
> @@ -230,14 +230,14 @@
>  
>  	assert_spin_locked(&wb_inode_list_lock);
>  	while (!list_empty(delaying_queue)) {
> -		inode = list_entry(delaying_queue->prev, struct inode, i_list);
> +		inode = list_entry(delaying_queue->prev, struct inode, i_io);
>  		if (older_than_this &&
>  		    inode_dirtied_after(inode, *older_than_this))
>  			break;
>  		if (sb && sb != inode->i_sb)
>  			do_sb_sort = 1;
>  		sb = inode->i_sb;
> -		list_move(&inode->i_list, &tmp);
> +		list_move(&inode->i_io, &tmp);
>  	}
>  
>  	/* just one sb in list, splice to dispatch_queue and we're done */
> @@ -248,12 +248,12 @@
>  
>  	/* Move inodes from one superblock together */
>  	while (!list_empty(&tmp)) {
> -		inode = list_entry(tmp.prev, struct inode, i_list);
> +		inode = list_entry(tmp.prev, struct inode, i_io);
>  		sb = inode->i_sb;
>  		list_for_each_prev_safe(pos, node, &tmp) {
> -			inode = list_entry(pos, struct inode, i_list);
> +			inode = list_entry(pos, struct inode, i_io);
>  			if (inode->i_sb == sb)
> -				list_move(&inode->i_list, dispatch_queue);
> +				list_move(&inode->i_io, dispatch_queue);
>  		}
>  	}
>  }
> @@ -422,7 +422,11 @@
>  			/*
>  			 * The inode is clean
>  			 */
> -			list_move(&inode->i_list, &inode_unused);
> +			list_del_init(&inode->i_io);
> +			if (list_empty(&inode->i_lru)) {
> +				list_add(&inode->i_lru, &inode_unused);
> +				inodes_stat.nr_unused++;

It's not obvious where this came from.  How come nr_unused was
correctly accounted with the previous list_move() version?
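
For reference, the two paths side by side (as I read them):

	/* before: i_list did double duty, so a clean inode moved from
	 * b_io straight onto inode_unused without touching nr_unused */
	list_move(&inode->i_list, &inode_unused);

	/* after: separate fields, and the inode may not be on the LRU
	 * yet, so newly adding it has to be accounted */
	list_del_init(&inode->i_io);
	if (list_empty(&inode->i_lru)) {
		list_add(&inode->i_lru, &inode_unused);
		inodes_stat.nr_unused++;
	}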

Miklos


> +			}
>  		}
>  	}
>  	inode_sync_complete(inode);
> @@ -472,7 +476,7 @@
>  	while (!list_empty(&wb->b_io)) {
>  		long pages_skipped;
>  		struct inode *inode = list_entry(wb->b_io.prev,
> -						 struct inode, i_list);
> +						 struct inode, i_io);
>  
>  		if (!spin_trylock(&inode->i_lock)) {
>  			spin_unlock(&wb_inode_list_lock);
> @@ -558,7 +562,7 @@
>  
>  	while (!list_empty(&wb->b_io)) {
>  		struct inode *inode = list_entry(wb->b_io.prev,
> -						 struct inode, i_list);
> +						 struct inode, i_io);
>  		struct super_block *sb = inode->i_sb;
>  
>  		if (!pin_sb_for_writeback(sb)) {
> @@ -703,7 +707,7 @@
>  		spin_lock(&wb_inode_list_lock);
>  		if (!list_empty(&wb->b_more_io))  {
>  			inode = list_entry(wb->b_more_io.prev,
> -						struct inode, i_list);
> +						struct inode, i_io);
>  			if (!spin_trylock(&inode->i_lock)) {
>  				spin_unlock(&wb_inode_list_lock);
>  				goto retry;
> @@ -1029,7 +1033,7 @@
>  
>  			inode->dirtied_when = jiffies;
>  			spin_lock(&wb_inode_list_lock);
> -			list_move(&inode->i_list, &bdi->wb.b_dirty);
> +			list_move(&inode->i_io, &bdi->wb.b_dirty);
>  			spin_unlock(&wb_inode_list_lock);
>  		}
>  	}
> Index: linux-2.6/include/linux/fs.h
> ===================================================================
> --- linux-2.6.orig/include/linux/fs.h	2010-10-19 14:19:00.000000000 +1100
> +++ linux-2.6/include/linux/fs.h	2010-10-19 14:19:21.000000000 +1100
> @@ -727,7 +727,8 @@
>  
>  struct inode {
>  	struct hlist_bl_node	i_hash;
> -	struct list_head	i_list;		/* backing dev IO list */
> +	struct list_head	i_io;		/* backing dev IO list */
> +	struct list_head	i_lru;		/* inode LRU list */
>  	struct list_head	i_sb_list;
>  	union {
>  		struct list_head	i_dentry;
> Index: linux-2.6/mm/backing-dev.c
> ===================================================================
> --- linux-2.6.orig/mm/backing-dev.c	2010-10-19 14:18:59.000000000 +1100
> +++ linux-2.6/mm/backing-dev.c	2010-10-19 14:19:20.000000000 +1100
> @@ -74,11 +74,11 @@
>  
>  	nr_wb = nr_dirty = nr_io = nr_more_io = 0;
>  	spin_lock(&wb_inode_list_lock);
> -	list_for_each_entry(inode, &wb->b_dirty, i_list)
> +	list_for_each_entry(inode, &wb->b_dirty, i_io)
>  		nr_dirty++;
> -	list_for_each_entry(inode, &wb->b_io, i_list)
> +	list_for_each_entry(inode, &wb->b_io, i_io)
>  		nr_io++;
> -	list_for_each_entry(inode, &wb->b_more_io, i_list)
> +	list_for_each_entry(inode, &wb->b_more_io, i_io)
>  		nr_more_io++;
>  	spin_unlock(&wb_inode_list_lock);
>  
> Index: linux-2.6/fs/inode.c
> ===================================================================
> --- linux-2.6.orig/fs/inode.c	2010-10-19 14:19:00.000000000 +1100
> +++ linux-2.6/fs/inode.c	2010-10-19 14:19:21.000000000 +1100
> @@ -34,12 +34,13 @@
>   * inode_hash_bucket lock protects:
>   *   inode hash table, i_hash
>   * wb_inode_list_lock protects:
> - *   inode_in_use, inode_unused, b_io, b_more_io, b_dirty, i_list
> + *   inode_in_use, inode_unused, b_io, b_more_io, b_dirty, i_io, i_lru
>   * inode->i_lock protects:
>   *   i_state
>   *   i_count
>   *   i_hash
> - *   i_list
> + *   i_io
> + *   i_lru
>   *   i_sb_list
>   *
>   * Ordering:
> @@ -327,6 +328,7 @@
>  
>  void destroy_inode(struct inode *inode)
>  {
> +	BUG_ON(!list_empty(&inode->i_io));
>  	__destroy_inode(inode);
>  	if (inode->i_sb->s_op->destroy_inode)
>  		inode->i_sb->s_op->destroy_inode(inode);
> @@ -345,7 +347,8 @@
>  	INIT_HLIST_BL_NODE(&inode->i_hash);
>  	INIT_LIST_HEAD(&inode->i_dentry);
>  	INIT_LIST_HEAD(&inode->i_devices);
> -	INIT_LIST_HEAD(&inode->i_list);
> +	INIT_LIST_HEAD(&inode->i_io);
> +	INIT_LIST_HEAD(&inode->i_lru);
>  	INIT_RADIX_TREE(&inode->i_data.page_tree, GFP_ATOMIC);
>  	spin_lock_init(&inode->i_data.tree_lock);
>  	spin_lock_init(&inode->i_data.i_mmap_lock);
> @@ -413,8 +416,8 @@
>  	while (!list_empty(head)) {
>  		struct inode *inode;
>  
> -		inode = list_first_entry(head, struct inode, i_list);
> -		list_del_init(&inode->i_list);
> +		inode = list_first_entry(head, struct inode, i_lru);
> +		list_del_init(&inode->i_lru);
>  
>  		evict(inode);
>  
> @@ -445,13 +448,14 @@
>  		invalidate_inode_buffers(inode);
>  		if (!inode->i_count) {
>  			spin_lock(&wb_inode_list_lock);
> -			list_del(&inode->i_list);
> +			list_del_init(&inode->i_io);
> +			list_del(&inode->i_lru);
>  			inodes_stat.nr_unused--;
>  			spin_unlock(&wb_inode_list_lock);
>  			WARN_ON(inode->i_state & I_NEW);
>  			inode->i_state |= I_FREEING;
>  			spin_unlock(&inode->i_lock);
> -			list_add(&inode->i_list, dispose);
> +			list_add(&inode->i_lru, dispose);
>  			continue;
>  		}
>  		spin_unlock(&inode->i_lock);
> @@ -530,20 +534,20 @@
>  		if (list_empty(&inode_unused))
>  			break;
>  
> -		inode = list_entry(inode_unused.prev, struct inode, i_list);
> +		inode = list_entry(inode_unused.prev, struct inode, i_lru);
>  
>  		if (!spin_trylock(&inode->i_lock)) {
>  			spin_unlock(&wb_inode_list_lock);
>  			goto again;
>  		}
>  		if (inode->i_count || (inode->i_state & ~I_REFERENCED)) {
> -			list_del_init(&inode->i_list);
> +			list_del_init(&inode->i_lru);
>  			spin_unlock(&inode->i_lock);
>  			inodes_stat.nr_unused--;
>  			continue;
>  		}
>  		if (inode->i_state & I_REFERENCED) {
> -			list_move(&inode->i_list, &inode_unused);
> +			list_move(&inode->i_lru, &inode_unused);
>  			inode->i_state &= ~I_REFERENCED;
>  			spin_unlock(&inode->i_lock);
>  			continue;
> @@ -556,7 +560,7 @@
>  			 *
>  			 * We'll try to get it back if it becomes freeable.
>  			 */
> -			list_move(&inode->i_list, &inode_unused);
> +			list_move(&inode->i_lru, &inode_unused);
>  			spin_unlock(&wb_inode_list_lock);
>  			__iget(inode);
>  			spin_unlock(&inode->i_lock);
> @@ -567,7 +571,7 @@
>  			iput(inode);
>  			spin_lock(&wb_inode_list_lock);
>  			if (inode == list_entry(inode_unused.next,
> -						struct inode, i_list)) {
> +						struct inode, i_lru)) {
>  				if (spin_trylock(&inode->i_lock)) {
>  					if (can_unuse(inode))
>  						goto freeable;
> @@ -577,7 +581,7 @@
>  			continue;
>  		}
>  freeable:
> -		list_move(&inode->i_list, &freeable);
> +		list_move(&inode->i_lru, &freeable);
>  		WARN_ON(inode->i_state & I_NEW);
>  		inode->i_state |= I_FREEING;
>  		spin_unlock(&inode->i_lock);
> @@ -1508,9 +1512,9 @@
>  		if (sb->s_flags & MS_ACTIVE) {
>  			inode->i_state |= I_REFERENCED;
>  			if (!(inode->i_state & (I_DIRTY|I_SYNC)) &&
> -					list_empty(&inode->i_list)) {
> +					list_empty(&inode->i_lru)) {
>  				spin_lock(&wb_inode_list_lock);
> -				list_add(&inode->i_list, &inode_unused);
> +				list_add(&inode->i_lru, &inode_unused);
>  				inodes_stat.nr_unused++;
>  				spin_unlock(&wb_inode_list_lock);
>  			}
> @@ -1526,11 +1530,15 @@
>  		inode->i_state &= ~I_WILL_FREE;
>  		__remove_inode_hash(inode);
>  	}
> -	if (!list_empty(&inode->i_list)) {
> +	if (!list_empty(&inode->i_lru)) {
>  		spin_lock(&wb_inode_list_lock);
> -		list_del_init(&inode->i_list);
> -		if (!inode->i_state)
> -			inodes_stat.nr_unused--;
> +		list_del_init(&inode->i_lru);
> +		inodes_stat.nr_unused--;
> +		spin_unlock(&wb_inode_list_lock);
> +	}
> +	if (!list_empty(&inode->i_io)) {
> +		spin_lock(&wb_inode_list_lock);
> +		list_del_init(&inode->i_io);
>  		spin_unlock(&wb_inode_list_lock);
>  	}
>  	inode_sb_list_del(inode);
> Index: linux-2.6/fs/nilfs2/mdt.c
> ===================================================================
> --- linux-2.6.orig/fs/nilfs2/mdt.c	2010-10-19 14:18:58.000000000 +1100
> +++ linux-2.6/fs/nilfs2/mdt.c	2010-10-19 14:19:16.000000000 +1100
> @@ -504,7 +504,8 @@
>  #endif
>  		inode->dirtied_when = 0;
>  
> -		INIT_LIST_HEAD(&inode->i_list);
> +		INIT_LIST_HEAD(&inode->i_io);
> +		INIT_LIST_HEAD(&inode->i_lru);
>  		INIT_LIST_HEAD(&inode->i_sb_list);
>  		inode->i_state = 0;
>  #endif
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [patch 00/35] my inode scaling series for review
  2010-10-19  3:42 [patch 00/35] my inode scaling series for review npiggin
                   ` (34 preceding siblings ...)
  2010-10-19  3:42 ` [patch 35/35] fs: icache document more lock orders npiggin
@ 2010-10-19 16:22 ` Christoph Hellwig
  2010-10-20  3:05   ` Nick Piggin
  2010-10-20 13:14 ` Al Viro
  36 siblings, 1 reply; 70+ messages in thread
From: Christoph Hellwig @ 2010-10-19 16:22 UTC (permalink / raw)
  To: npiggin; +Cc: linux-kernel, linux-fsdevel

On Tue, Oct 19, 2010 at 02:42:16PM +1100, npiggin@kernel.dk wrote:
> * My locking design allows i_lock to lock the entire state of the icache
>   for a particular inode. Not so with Dave's, and he had to add code not
>   required with inode_lock synchronisation or my i_lock synchronisation.
>   I prefer being very conservative about making changes, especially before
>   inode_lock is lifted (which will be the end-point of bisection for any
>   locking breakage before it).

Which code exactly?  I've done a diff between his inode.c and yours -
and Dave's is a lot simpler.  Mostly due to the more regular and simpler
locking, but also because he did various cleanups before tackling the
actual locking.  See the diff at the end of this mail for a direct
comparison.

> * As far as I can tell, I have addressed all Dave and Christoph's real
>   concerns.  The disagreement about the i_lock locking model can easily be
>   solved if they post a couple of small incremental patches to the end of the
>   series, making i_lock locking less regular and no longer protecting icache
>   state of that given inode (like inode_lock was able to pre-patchset). I've
>   repeatedly disagreed with this approach, however.

The diff below and a look over the other patches don't make it seem
like you have actually picked up much at all of the feedback from me,
Dave, Andrew or Al.

Even worse than that, none of the sometimes quite major bug fixes were
picked up either.  The get_new_inode re-lookup locking is still wrong,
and the exofs fix is not there.  The fix for the mapping move of the
block devices, which we unfortunately still have, seems to be papered
over by passing the bdi_writeback to the requeuing helpers instead
of fixing it.  While this makes the assert_spin_locked panic go away,
it still leaves a really nasty race, as your version locks a different
bdi than the one that it actually modifies.

There's also another bug which was there in your very first version
with an XXX but that Dave AFAIK never picked up: invalidate_inodes is
called from a lot of places other than umount, and unlocked list
access is anything but safe there.

Anyway, below is the diff between the two trees.  I've cut down the
churn in the filesystems a bit - everything related to the gratuitous
i_refs vs i_ref and iref vs inode_get differences, as well as the
call_rcu boilerplate additions and get_next_ino calls, has been
removed to make it somewhat readable.

To me the inode.c and especially fs-writeback.c code in Dave's version
looks a lot more polished.

 b/Documentation/filesystems/porting |   12 
 b/Documentation/filesystems/vfs.txt |   16 
 b/fs/block_dev.c                    |   50 -
 b/fs/btrfs/inode.c                  |    2 
 b/fs/dcache.c                       |   31 -
 b/fs/drop_caches.c                  |   24 
 b/fs/fs-writeback.c                 |  308 ++++------
 b/fs/inode.c                        | 1095 +++++++++++++++---------------------
 b/fs/internal.h                     |   23 
 b/fs/nilfs2/gcdat.c                 |    1 
 b/fs/nilfs2/gcinode.c               |    7 
 b/fs/notify/inode_mark.c            |   41 -
 b/fs/notify/mark.c                  |    1 
 b/fs/notify/vfsmount_mark.c         |    1 
 b/fs/quota/dquot.c                  |   56 -
 b/fs/super.c                        |   17 
 b/include/linux/backing-dev.h       |    5 
 b/include/linux/bit_spinlock.h      |    4 
 b/include/linux/fs.h                |  107 ---
 b/include/linux/fsnotify_backend.h  |    4 
 b/include/linux/list_bl.h           |   41 -
 b/include/linux/poison.h            |    2 
 b/mm/backing-dev.c                  |   19 
 b/fs/xfs/linux-2.6/xfs_buf.c        |    4 
 b/include/linux/rculist_bl.h        |  128 ----
 25 files changed, 804 insertions(+), 1195 deletions(-)

diff --git a/Documentation/filesystems/porting b/Documentation/filesystems/porting
index 45160c4..f182795 100644
--- a/Documentation/filesystems/porting
+++ b/Documentation/filesystems/porting
@@ -319,14 +319,8 @@ may happen while the inode is in the middle of ->write_inode(); e.g. if you blindly
 free the on-disk inode, you may end up doing that while ->write_inode() is writing
 to it.
 
---
 [mandatory]
-	inode_lock is gone, replaced by fine grained locks. See fs/inode.c
-for details of what locks to replace inode_lock with in order to protect
-particular things. Most of the time, a filesystem only needs ->i_lock, which
-protects *all* the inode state and its membership on lists that was
-previously protected with inode_lock.
+        The i_count field in the inode has been replaced with i_ref, which is
+a regular integer instead of an atomic_t.  Filesystems should not manipulate
+it directly but use helpers like igrab(), iref() and iput().
 
---
-[mandatory]
-	Filesystems must RCU-free their inodes. Lots of examples.
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index f63b131..7ab923c 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -246,7 +246,7 @@ or bottom half).
 	should be synchronous or not, not all filesystems check this flag.
 
   drop_inode: called when the last access to the inode is dropped,
-	with the i_lock spinlock held.
+	with the i_lock and sb_inode_list_lock spinlocks held.
 
 	This method should be either NULL (normal UNIX filesystem
 	semantics) or "generic_delete_inode" (for filesystems that do not
@@ -347,8 +347,8 @@ otherwise noted.
   lookup: called when the VFS needs to look up an inode in a parent
 	directory. The name to look for is found in the dentry. This
 	method must call d_add() to insert the found inode into the
-	dentry. The "i_refs" field in the inode structure should be
-	incremented. If the named inode does not exist a NULL inode
+	dentry. A reference to the inode should be taken via the
+	iref() function.  If the named inode does not exist a NULL inode
 	should be inserted into the dentry (this is called a negative
 	dentry). Returning an error code from this routine must only
 	be done on a real error, otherwise creating inodes with system
@@ -926,11 +926,11 @@ manipulate dentries:
 	d_instantiate()
 
   d_instantiate: add a dentry to the alias hash list for the inode and
-	updates the "d_inode" member. The "i_refs" member in the
-	inode structure should be set/incremented. If the inode
-	pointer is NULL, the dentry is called a "negative
-	dentry". This function is commonly called when an inode is
-	created for an existing negative dentry
+	updates the "d_inode" member. A reference to the inode
+	should be taken via the iref() function.  If the inode
+	pointer is NULL, the dentry is called a "negative dentry".
+	This function is commonly called when an inode is created
+	for an existing negative dentry
 
   d_lookup: look up a dentry given its parent and path name component
 	It looks up the child of that given name from the dcache
diff --git a/fs/block_dev.c b/fs/block_dev.c
index a2de19e..dae9871 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -48,6 +48,24 @@ inline struct block_device *I_BDEV(struct inode *inode)
 
 EXPORT_SYMBOL(I_BDEV);
 
+/*
+ * move the inode from its current bdi to a new bdi. if the inode is dirty
+ * we need to move it onto the dirty list of @dst so that the inode is always
+ * on the right list.
+ */
+static void bdev_inode_switch_bdi(struct inode *inode,
+			struct backing_dev_info *dst)
+{
+	struct backing_dev_info *old = inode->i_data.backing_dev_info;
+
+	bdi_lock_two(old, dst);
+	inode->i_data.backing_dev_info = dst;
+	if (!list_empty(&inode->i_wb_list))
+		list_move(&inode->i_wb_list, &dst->wb.b_dirty);
+	spin_unlock(&old->wb.b_lock);
+	spin_unlock(&dst->wb.b_lock);
+}
+
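
bdi_lock_two() is not in this hunk; presumably it takes the two b_locks
in a stable order (e.g. by address) to avoid ABBA deadlocks between
concurrent switchers.  A sketch of that assumption:

	void bdi_lock_two(struct backing_dev_info *a,
			  struct backing_dev_info *b)
	{
		if (a < b) {
			spin_lock(&a->wb.b_lock);
			spin_lock_nested(&b->wb.b_lock, 1);
		} else {
			spin_lock(&b->wb.b_lock);
			spin_lock_nested(&a->wb.b_lock, 1);
		}
	}
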
 static sector_t max_block(struct block_device *bdev)
 {
 	sector_t retval = ~((sector_t)0);
@@ -395,20 +413,13 @@ static struct inode *bdev_alloc_inode(struct super_block *sb)
 	return &ei->vfs_inode;
 }
 
-static void bdev_i_callback(struct rcu_head *head)
+static void bdev_destroy_inode(struct inode *inode)
 {
-	struct inode *inode = container_of(head, struct inode, i_rcu);
 	struct bdev_inode *bdi = BDEV_I(inode);
 
-	INIT_LIST_HEAD(&inode->i_dentry);
 	kmem_cache_free(bdev_cachep, bdi);
 }
 
-static void bdev_destroy_inode(struct inode *inode)
-{
-	call_rcu(&inode->i_rcu, bdev_i_callback);
-}
-
 static void init_once(void *foo)
 {
 	struct bdev_inode *ei = (struct bdev_inode *) foo;
@@ -557,8 +568,7 @@ EXPORT_SYMBOL(bdget);
  */
 struct block_device *bdgrab(struct block_device *bdev)
 {
-	inode_get(bdev->bd_inode);
-
+	iref(bdev->bd_inode);
 	return bdev;
 }
 
@@ -599,12 +609,11 @@ static struct block_device *bd_acquire(struct inode *inode)
 		spin_lock(&bdev_lock);
 		if (!inode->i_bdev) {
 			/*
-			 * We take an additional bd_inode->i_refs for inode,
-			 * and it's released in clear_inode() of inode.
-			 * So, we can access it via ->i_mapping always
-			 * without igrab().
+			 * We take an additional bdev reference here so
+			 * we can access it via ->i_mapping always
+			 * without first needing to grab a reference.
 			 */
-			inode_get(bdev->bd_inode);
+			bdgrab(bdev);
 			inode->i_bdev = bdev;
 			inode->i_mapping = bdev->bd_inode->i_mapping;
 			list_add(&inode->i_devices, &bdev->bd_inodes);
@@ -1398,7 +1407,7 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
 				bdi = blk_get_backing_dev_info(bdev);
 				if (bdi == NULL)
 					bdi = &default_backing_dev_info;
-				bdev->bd_inode->i_data.backing_dev_info = bdi;
+				bdev_inode_switch_bdi(bdev->bd_inode, bdi);
 			}
 			if (bdev->bd_invalidated)
 				rescan_partitions(disk, bdev);
@@ -1413,8 +1422,8 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
 			if (ret)
 				goto out_clear;
 			bdev->bd_contains = whole;
-			bdev->bd_inode->i_data.backing_dev_info =
-			   whole->bd_inode->i_data.backing_dev_info;
+			bdev_inode_switch_bdi(bdev->bd_inode,
+				whole->bd_inode->i_data.backing_dev_info);
 			bdev->bd_part = disk_get_part(disk, partno);
 			if (!(disk->flags & GENHD_FL_UP) ||
 			    !bdev->bd_part || !bdev->bd_part->nr_sects) {
@@ -1447,7 +1456,7 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
 	disk_put_part(bdev->bd_part);
 	bdev->bd_disk = NULL;
 	bdev->bd_part = NULL;
-	bdev->bd_inode->i_data.backing_dev_info = &default_backing_dev_info;
+	bdev_inode_switch_bdi(bdev->bd_inode, &default_backing_dev_info);
 	if (bdev != bdev->bd_contains)
 		__blkdev_put(bdev->bd_contains, mode, 1);
 	bdev->bd_contains = NULL;
@@ -1541,7 +1550,8 @@ static int __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part)
 		disk_put_part(bdev->bd_part);
 		bdev->bd_part = NULL;
 		bdev->bd_disk = NULL;
-		bdev->bd_inode->i_data.backing_dev_info = &default_backing_dev_info;
+		bdev_inode_switch_bdi(bdev->bd_inode,
+					&default_backing_dev_info);
 		if (bdev != bdev->bd_contains)
 			victim = bdev->bd_contains;
 		bdev->bd_contains = NULL;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 4da677e..c7a2bef 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3854,7 +3855,7 @@ again:
 	p = &root->inode_tree.rb_node;
 	parent = NULL;
 
-	if (hlist_bl_unhashed(&inode->i_hash))
+	if (inode_unhashed(inode))
 		return;
 
 	spin_lock(&root->inode_lock);
diff --git a/fs/dcache.c b/fs/dcache.c
index e309f9b..83293be 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -534,7 +534,7 @@ restart:
  *
  * This function may fail to free any resources if all the dentries are in use.
  */
-static void prune_dcache(unsigned long count)
+static void prune_dcache(int count)
 {
 	struct super_block *sb, *p = NULL;
 	int w_count;
@@ -887,8 +887,7 @@ void shrink_dcache_parent(struct dentry * parent)
 EXPORT_SYMBOL(shrink_dcache_parent);
 
 /*
- * shrink_dcache_memory scans and reclaims unused dentries. This function
- * is defined according to the shrinker API described in linux/mm.h.
+ * Scan `nr' dentries and return the number which remain.
  *
  * We need to avoid reentering the filesystem if the caller is performing a
  * GFP_NOFS allocation attempt.  One example deadlock is:
@@ -896,30 +895,22 @@ EXPORT_SYMBOL(shrink_dcache_parent);
  * ext2_new_block->getblk->GFP->shrink_dcache_memory->prune_dcache->
  * prune_one_dentry->dput->dentry_iput->iput->inode->i_sb->s_op->put_inode->
  * ext2_discard_prealloc->ext2_free_blocks->lock_super->DEADLOCK.
+ *
+ * In this case we return -1 to tell the caller that we baled.
  */
-static void shrink_dcache_memory(struct shrinker *shrink,
-		struct zone *zone, unsigned long scanned,
-		unsigned long total, unsigned long global,
-		unsigned long flags, gfp_t gfp_mask)
+static int shrink_dcache_memory(struct shrinker *shrink, int nr, gfp_t gfp_mask)
 {
-	static unsigned long nr_to_scan;
-	unsigned long nr;
-
-	shrinker_add_scan(&nr_to_scan, scanned, global,
-			dentry_stat.nr_unused,
-			SHRINK_DEFAULT_SEEKS * 100 / sysctl_vfs_cache_pressure);
-	if (!(gfp_mask & __GFP_FS))
-	       return;
-
-	while ((nr = shrinker_do_scan(&nr_to_scan, SHRINK_BATCH))) {
+	if (nr) {
+		if (!(gfp_mask & __GFP_FS))
+			return -1;
 		prune_dcache(nr);
-		count_vm_events(SLABS_SCANNED, nr);
-		cond_resched();
 	}
+	return (dentry_stat.nr_unused / 100) * sysctl_vfs_cache_pressure;
 }
 
 static struct shrinker dcache_shrinker = {
-	.shrink_zone = shrink_dcache_memory,
+	.shrink = shrink_dcache_memory,
+	.seeks = DEFAULT_SEEKS,
 };
 
 /**
diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index 2c8b7df..bd39f65 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -16,29 +16,33 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused)
 {
 	struct inode *inode, *toput_inode = NULL;
 
-	rcu_read_lock();
-	do_inode_list_for_each_entry_rcu(sb, inode) {
+	spin_lock(&sb->s_inodes_lock);
+	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 		spin_lock(&inode->i_lock);
-		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)
-				|| inode->i_mapping->nrpages == 0) {
+		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
+		    (inode->i_mapping->nrpages == 0)) {
 			spin_unlock(&inode->i_lock);
 			continue;
 		}
-		inode_get_ilock(inode);
+		inode->i_ref++;
 		spin_unlock(&inode->i_lock);
-		rcu_read_unlock();
+		spin_unlock(&sb->s_inodes_lock);
 		invalidate_mapping_pages(inode->i_mapping, 0, -1);
 		iput(toput_inode);
 		toput_inode = inode;
-		rcu_read_lock();
-	} while_inode_list_for_each_entry_rcu
-	rcu_read_unlock();
+		spin_lock(&sb->s_inodes_lock);
+	}
+	spin_unlock(&sb->s_inodes_lock);
 	iput(toput_inode);
 }
 
 static void drop_slab(void)
 {
-	shrink_all_slab();
+	int nr_objects;
+
+	do {
+		nr_objects = shrink_slab(1000, GFP_KERNEL, 1000);
+	} while (nr_objects > 10);
 }
 
 int drop_caches_sysctl_handler(ctl_table *table, int write,
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 9b2e2c3..04e8dd5 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -69,6 +69,16 @@ int writeback_in_progress(struct backing_dev_info *bdi)
 	return test_bit(BDI_writeback_running, &bdi->state);
 }
 
+static inline struct backing_dev_info *inode_to_bdi(struct inode *inode)
+{
+	struct super_block *sb = inode->i_sb;
+
+	if (strcmp(sb->s_type->name, "bdev") == 0)
+		return inode->i_mapping->backing_dev_info;
+
+	return sb->s_bdi;
+}
+
 static void bdi_queue_work(struct backing_dev_info *bdi,
 		struct wb_writeback_work *work)
 {
@@ -147,6 +157,18 @@ void bdi_start_background_writeback(struct backing_dev_info *bdi)
 }
 
 /*
+ * Remove the inode from the writeback list it is on.
+ */
+void inode_wb_list_del(struct inode *inode)
+{
+	struct backing_dev_info *bdi = inode_to_bdi(inode);
+
+	spin_lock(&bdi->wb.b_lock);
+	list_del_init(&inode->i_wb_list);
+	spin_unlock(&bdi->wb.b_lock);
+}
+
+/*
  * Redirty an inode: set its when-it-was dirtied timestamp and move it to the
  * furthest end of its superblock's dirty-inode list.
  *
@@ -155,26 +177,30 @@ void bdi_start_background_writeback(struct backing_dev_info *bdi)
  * the case then the inode must have been redirtied while it was being written
  * out and we don't reset its dirtied_when.
  */
-static void redirty_tail(struct bdi_writeback *wb, struct inode *inode)
+static void redirty_tail(struct inode *inode)
 {
+	struct bdi_writeback *wb = &inode_to_bdi(inode)->wb;
+
 	assert_spin_locked(&wb->b_lock);
 	if (!list_empty(&wb->b_dirty)) {
 		struct inode *tail;
 
-		tail = list_entry(wb->b_dirty.next, struct inode, i_io);
+		tail = list_entry(wb->b_dirty.next, struct inode, i_wb_list);
 		if (time_before(inode->dirtied_when, tail->dirtied_when))
 			inode->dirtied_when = jiffies;
 	}
-	list_move(&inode->i_io, &wb->b_dirty);
+	list_move(&inode->i_wb_list, &wb->b_dirty);
 }
 
 /*
  * requeue inode for re-scanning after bdi->b_io list is exhausted.
  */
-static void requeue_io(struct bdi_writeback *wb, struct inode *inode)
+static void requeue_io(struct inode *inode)
 {
+	struct bdi_writeback *wb = &inode_to_bdi(inode)->wb;
+
 	assert_spin_locked(&wb->b_lock);
-	list_move(&inode->i_io, &wb->b_more_io);
+	list_move(&inode->i_wb_list, &wb->b_more_io);
 }
 
 static void inode_sync_complete(struct inode *inode)
@@ -215,14 +241,15 @@ static void move_expired_inodes(struct list_head *delaying_queue,
 	int do_sb_sort = 0;
 
 	while (!list_empty(delaying_queue)) {
-		inode = list_entry(delaying_queue->prev, struct inode, i_io);
+		inode = list_entry(delaying_queue->prev,
+						struct inode, i_wb_list);
 		if (older_than_this &&
 		    inode_dirtied_after(inode, *older_than_this))
 			break;
 		if (sb && sb != inode->i_sb)
 			do_sb_sort = 1;
 		sb = inode->i_sb;
-		list_move(&inode->i_io, &tmp);
+		list_move(&inode->i_wb_list, &tmp);
 	}
 
 	/* just one sb in list, splice to dispatch_queue and we're done */
@@ -233,12 +260,12 @@ static void move_expired_inodes(struct list_head *delaying_queue,
 
 	/* Move inodes from one superblock together */
 	while (!list_empty(&tmp)) {
-		inode = list_entry(tmp.prev, struct inode, i_io);
+		inode = list_entry(tmp.prev, struct inode, i_wb_list);
 		sb = inode->i_sb;
 		list_for_each_prev_safe(pos, node, &tmp) {
-			inode = list_entry(pos, struct inode, i_io);
+			inode = list_entry(pos, struct inode, i_wb_list);
 			if (inode->i_sb == sb)
-				list_move(&inode->i_io, dispatch_queue);
+				list_move(&inode->i_wb_list, dispatch_queue);
 		}
 	}
 }
@@ -256,6 +283,7 @@ static void move_expired_inodes(struct list_head *delaying_queue,
  */
 static void queue_io(struct bdi_writeback *wb, unsigned long *older_than_this)
 {
+	assert_spin_locked(&wb->b_lock);
 	list_splice_init(&wb->b_more_io, &wb->b_io);
 	move_expired_inodes(&wb->b_dirty, &wb->b_io, older_than_this);
 }
@@ -270,45 +298,46 @@ static int write_inode(struct inode *inode, struct writeback_control *wbc)
 /*
  * Wait for writeback on an inode to complete.
  */
-static void inode_wait_for_writeback(struct bdi_writeback *wb,
-					struct inode *inode)
+static void inode_wait_for_writeback(struct inode *inode)
 {
 	DEFINE_WAIT_BIT(wq, &inode->i_state, __I_SYNC);
 	wait_queue_head_t *wqh;
 
 	wqh = bit_waitqueue(&inode->i_state, __I_SYNC);
 	while (inode->i_state & I_SYNC) {
-		spin_unlock(&wb->b_lock);
 		spin_unlock(&inode->i_lock);
 		__wait_on_bit(wqh, &wq, inode_wait, TASK_UNINTERRUPTIBLE);
 		spin_lock(&inode->i_lock);
-		spin_lock(&wb->b_lock);
 	}
 }
 
-/*
- * Write out an inode's dirty pages. Either the caller has ref on the inode
- * (either via inode_get or via syscall against an fd) or the inode has
- * I_WILL_FREE set (via generic_forget_inode)
+/**
+ * sync_inode - write an inode and its pages to disk.
+ * @inode: the inode to sync
+ * @wbc: controls the writeback mode
  *
- * If `wait' is set, wait on the writeout.
+ * sync_inode() will write an inode and its pages to disk.  It will also
+ * correctly update the inode on its superblock's dirty inode lists and will
+ * update inode->i_state.
+ *
+ * The caller must have a ref on the inode or the inode has I_WILL_FREE set.
+ *
+ * If @wbc->sync_mode == WB_SYNC_ALL then we are doing a data integrity
+ * operation so we need to wait on the writeout.
  *
  * The whole writeout design is quite complex and fragile.  We want to avoid
  * starvation of particular inodes when others are being redirtied, prevent
  * livelocks, etc.
- *
- * Called under wb_inode_list_lock and i_lock. May drop the locks but returns
- * with them locked.
  */
-static int
-writeback_single_inode(struct bdi_writeback *wb, struct inode *inode,
-			struct writeback_control *wbc)
+int sync_inode(struct inode *inode, struct writeback_control *wbc)
 {
+	struct backing_dev_info *bdi = inode_to_bdi(inode);
 	struct address_space *mapping = inode->i_mapping;
 	unsigned dirty;
 	int ret;
 
-	if (!inode->i_refs)
+	spin_lock(&inode->i_lock);
+	if (!inode->i_ref)
 		WARN_ON(!(inode->i_state & (I_WILL_FREE|I_FREEING)));
 	else
 		WARN_ON(inode->i_state & I_WILL_FREE);
@@ -323,14 +352,17 @@ writeback_single_inode(struct bdi_writeback *wb, struct inode *inode,
 		 * completed a full scan of b_io.
 		 */
 		if (wbc->sync_mode != WB_SYNC_ALL) {
-			requeue_io(wb, inode);
+			spin_unlock(&inode->i_lock);
+			spin_lock(&bdi->wb.b_lock);
+			requeue_io(inode);
+			spin_unlock(&bdi->wb.b_lock);
 			return 0;
 		}
 
 		/*
 		 * It's a data-integrity sync.  We must wait.
 		 */
-		inode_wait_for_writeback(wb, inode);
+		inode_wait_for_writeback(inode);
 	}
 
 	BUG_ON(inode->i_state & I_SYNC);
@@ -338,7 +370,6 @@ writeback_single_inode(struct bdi_writeback *wb, struct inode *inode,
 	/* Set I_SYNC, reset I_DIRTY_PAGES */
 	inode->i_state |= I_SYNC;
 	inode->i_state &= ~I_DIRTY_PAGES;
-	spin_unlock(&wb->b_lock);
 	spin_unlock(&inode->i_lock);
 
 	ret = do_writepages(mapping, wbc);
@@ -362,18 +393,15 @@ writeback_single_inode(struct bdi_writeback *wb, struct inode *inode,
 	spin_lock(&inode->i_lock);
 	dirty = inode->i_state & I_DIRTY;
 	inode->i_state &= ~(I_DIRTY_SYNC | I_DIRTY_DATASYNC);
+	spin_unlock(&inode->i_lock);
 	/* Don't write the inode if only I_DIRTY_PAGES was set */
 	if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
-		int err;
-
-		spin_unlock(&inode->i_lock);
-		err = write_inode(inode, wbc);
+		int err = write_inode(inode, wbc);
 		if (ret == 0)
 			ret = err;
-		spin_lock(&inode->i_lock);
 	}
 
-	spin_lock(&wb->b_lock);
+	spin_lock(&inode->i_lock);
 	inode->i_state &= ~I_SYNC;
 	if (!(inode->i_state & I_FREEING)) {
 		if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
@@ -382,11 +410,13 @@ writeback_single_inode(struct bdi_writeback *wb, struct inode *inode,
 			 * sometimes bales out without doing anything.
 			 */
 			inode->i_state |= I_DIRTY_PAGES;
+			spin_unlock(&inode->i_lock);
+			spin_lock(&bdi->wb.b_lock);
 			if (wbc->nr_to_write <= 0) {
 				/*
 				 * slice used up: queue for next turn
 				 */
-				requeue_io(wb, inode);
+				requeue_io(inode);
 			} else {
 				/*
 				 * Writeback blocked by something other than
@@ -395,8 +425,9 @@ writeback_single_inode(struct bdi_writeback *wb, struct inode *inode,
 				 * retrying writeback of the dirty page/inode
 				 * that cannot be performed immediately.
 				 */
-				redirty_tail(wb, inode);
+				redirty_tail(inode);
 			}
+			spin_unlock(&bdi->wb.b_lock);
 		} else if (inode->i_state & I_DIRTY) {
 			/*
 			 * Filesystems can dirty the inode during writeback
@@ -404,23 +435,31 @@ writeback_single_inode(struct bdi_writeback *wb, struct inode *inode,
 			 * submission or metadata updates after data IO
 			 * completion.
 			 */
-			redirty_tail(wb, inode);
+			spin_unlock(&inode->i_lock);
+			spin_lock(&bdi->wb.b_lock);
+			redirty_tail(inode);
+			spin_unlock(&bdi->wb.b_lock);
 		} else {
 			/*
-			 * The inode is clean
+			 * The inode is clean. If it is unused, then make sure
+			 * that it is put on the LRU correctly as iput_final()
+			 * does not move dirty inodes to the LRU and dirty
+			 * inodes are removed from the LRU during scanning.
 			 */
-			list_del_init(&inode->i_io);
-
-			/*
-			 * Put it on the LRU if it is unused, otherwise lazy.
-			 */
-			if (!inode->i_refs && list_empty(&inode->i_lru))
-				__inode_lru_list_add(inode);
+			int unused = inode->i_ref == 0;
+			spin_unlock(&inode->i_lock);
+			inode_wb_list_del(inode);
+			if (unused)
+				inode_lru_list_add(inode);
 		}
+	} else {
+		/* freer will clean up */
+		spin_unlock(&inode->i_lock);
 	}
 	inode_sync_complete(inode);
 	return ret;
 }
+EXPORT_SYMBOL(sync_inode);
 
 /*
  * For background writeback the caller does not have the sb pinned
@@ -461,18 +500,11 @@ static bool pin_sb_for_writeback(struct super_block *sb)
 static int writeback_sb_inodes(struct super_block *sb, struct bdi_writeback *wb,
 		struct writeback_control *wbc, bool only_this_sb)
 {
-again:
+	assert_spin_locked(&wb->b_lock);
 	while (!list_empty(&wb->b_io)) {
 		long pages_skipped;
 		struct inode *inode = list_entry(wb->b_io.prev,
-						 struct inode, i_io);
-
-		if (!spin_trylock(&inode->i_lock)) {
-			spin_unlock(&wb->b_lock);
-			cpu_relax();
-			spin_lock(&wb->b_lock);
-			goto again;
-		}
+						 struct inode, i_wb_list);
 
 		if (inode->i_sb != sb) {
 			if (only_this_sb) {
@@ -481,13 +513,9 @@ again:
 				 * superblock, move all inodes not belonging
 				 * to it back onto the dirty list.
 				 */
-				redirty_tail(wb, inode);
-				spin_unlock(&inode->i_lock);
+				redirty_tail(inode);
 				continue;
 			}
-
-			spin_unlock(&inode->i_lock);
-
 			/*
 			 * The inode belongs to a different superblock.
 			 * Bounce back to the caller to unpin this and
@@ -496,9 +524,18 @@ again:
 			return 0;
 		}
 
-		if (inode->i_state & (I_NEW | I_WILL_FREE)) {
-			requeue_io(wb, inode);
+		/*
+		 * We can see I_FREEING here when the inode is in the process of
+		 * being reclaimed. In that case the freer is waiting on the
+		 * wb->b_lock that we currently hold to remove the inode from
+		 * the writeback list. So we don't spin on it here, requeue it
+		 * and move on to the next inode, which will allow the other
+		 * thread to free the inode when we drop the lock.
+		 */
+		spin_lock(&inode->i_lock);
+		if (inode->i_state & (I_NEW | I_WILL_FREE | I_FREEING)) {
 			spin_unlock(&inode->i_lock);
+			requeue_io(inode);
 			continue;
 		}
 		/*
@@ -510,19 +547,21 @@ again:
 			return 1;
 		}
 
-		BUG_ON(inode->i_state & I_FREEING);
-		inode_get_ilock(inode);
+		inode->i_ref++;
+		spin_unlock(&inode->i_lock);
+		spin_unlock(&wb->b_lock);
+
 		pages_skipped = wbc->pages_skipped;
-		writeback_single_inode(wb, inode, wbc);
+		sync_inode(inode, wbc);
 		if (wbc->pages_skipped != pages_skipped) {
 			/*
 			 * writeback is not making progress due to locked
 			 * buffers.  Skip this inode for now.
 			 */
-			redirty_tail(wb, inode);
+			spin_lock(&wb->b_lock);
+			redirty_tail(inode);
+			spin_unlock(&wb->b_lock);
 		}
-		spin_unlock(&wb->b_lock);
-		spin_unlock(&inode->i_lock);
 		iput(inode);
 		cond_resched();
 		spin_lock(&wb->b_lock);
@@ -544,25 +583,17 @@ void writeback_inodes_wb(struct bdi_writeback *wb,
 
 	if (!wbc->wb_start)
 		wbc->wb_start = jiffies; /* livelock avoidance */
-again:
 	spin_lock(&wb->b_lock);
-
 	if (!wbc->for_kupdate || list_empty(&wb->b_io))
 		queue_io(wb, wbc->older_than_this);
 
 	while (!list_empty(&wb->b_io)) {
 		struct inode *inode = list_entry(wb->b_io.prev,
-						 struct inode, i_io);
+						 struct inode, i_wb_list);
 		struct super_block *sb = inode->i_sb;
 
 		if (!pin_sb_for_writeback(sb)) {
-			if (!spin_trylock(&inode->i_lock)) {
-				spin_unlock(&wb->b_lock);
-				cpu_relax();
-				goto again;
-			}
-			requeue_io(wb, inode);
-			spin_unlock(&inode->i_lock);
+			requeue_io(inode);
 			continue;
 		}
 		ret = writeback_sb_inodes(sb, wb, wbc, false);
@@ -694,20 +725,16 @@ static long wb_writeback(struct bdi_writeback *wb,
 		 * become available for writeback. Otherwise
 		 * we'll just busyloop.
 		 */
-retry:
-		spin_lock(&wb->b_lock);
 		if (!list_empty(&wb->b_more_io))  {
+			spin_lock(&wb->b_lock);
 			inode = list_entry(wb->b_more_io.prev,
-						struct inode, i_io);
-			if (!spin_trylock(&inode->i_lock)) {
-				spin_unlock(&wb->b_lock);
-				goto retry;
-			}
+						struct inode, i_wb_list);
+			spin_lock(&inode->i_lock);
+			spin_unlock(&wb->b_lock);
 			trace_wbc_writeback_wait(&wbc, wb->bdi);
-			inode_wait_for_writeback(wb, inode);
+			inode_wait_for_writeback(inode);
 			spin_unlock(&inode->i_lock);
 		}
-		spin_unlock(&wb->b_lock);
 	}
 
 	return wrote;
@@ -735,7 +762,6 @@ static long wb_check_old_data_flush(struct bdi_writeback *wb)
 {
 	unsigned long expired;
 	long nr_pages;
-	int nr_dirty_inodes;
 
 	/*
 	 * When set to zero, disable periodic writeback
@@ -748,15 +774,10 @@ static long wb_check_old_data_flush(struct bdi_writeback *wb)
 	if (time_before(jiffies, expired))
 		return 0;
 
-	/* approximate dirty inodes */
-	nr_dirty_inodes = get_nr_inodes() - get_nr_inodes_unused();
-	if (nr_dirty_inodes < 0)
-		nr_dirty_inodes = 0;
-
 	wb->last_old_flush = jiffies;
 	nr_pages = global_page_state(NR_FILE_DIRTY) +
 			global_page_state(NR_UNSTABLE_NFS) +
-			nr_dirty_inodes;
+			get_nr_dirty_inodes();
 
 	if (nr_pages) {
 		struct wb_writeback_work work = {
@@ -988,27 +1009,25 @@ void __mark_inode_dirty(struct inode *inode, int flags)
 		 * superblock list, based upon its state.
 		 */
 		if (inode->i_state & I_SYNC)
-			goto out;
+			goto out_unlock;
 
 		/*
 		 * Only add valid (hashed) inodes to the superblock's
 		 * dirty list.  Add blockdev inodes as well.
 		 */
 		if (!S_ISBLK(inode->i_mode)) {
-			if (hlist_bl_unhashed(&inode->i_hash))
-				goto out;
+			if (inode_unhashed(inode))
+				goto out_unlock;
 		}
 		if (inode->i_state & I_FREEING)
-			goto out;
+			goto out_unlock;
 
 		/*
 		 * If the inode was already on b_dirty/b_io/b_more_io, don't
 		 * reposition it (that would break b_dirty time-ordering).
 		 */
 		if (!was_dirty) {
-			struct bdi_writeback *wb;
- 			bdi = inode_to_bdi(inode);
-			wb = inode_to_wb(inode);
+			bdi = inode_to_bdi(inode);
 
 			if (bdi_cap_writeback_dirty(bdi)) {
 				WARN(!test_bit(BDI_registered, &bdi->state),
@@ -1024,16 +1043,17 @@ void __mark_inode_dirty(struct inode *inode, int flags)
 					wakeup_bdi = true;
 			}
 
+			spin_unlock(&inode->i_lock);
+			spin_lock(&bdi->wb.b_lock);
 			inode->dirtied_when = jiffies;
-			spin_lock(&wb->b_lock);
-			BUG_ON(!list_empty(&inode->i_io));
-			list_add(&inode->i_io, &wb->b_dirty);
-			spin_unlock(&wb->b_lock);
+			list_move(&inode->i_wb_list, &bdi->wb.b_dirty);
+			spin_unlock(&bdi->wb.b_lock);
+			goto out;
 		}
 	}
-out:
+out_unlock:
 	spin_unlock(&inode->i_lock);
-
+out:
 	if (wakeup_bdi)
 		bdi_wakeup_thread_delayed(bdi);
 }
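
Note the inverted-order dance in the hunk above: __mark_inode_dirty holds
i_lock for the state checks, but i_lock nests inside wb->b_lock, so the
dirty-list insertion must drop i_lock and take b_lock fresh. A sketch of
that critical section (error paths elided):

	spin_unlock(&inode->i_lock);
	spin_lock(&bdi->wb.b_lock);
	inode->dirtied_when = jiffies;
	list_move(&inode->i_wb_list, &bdi->wb.b_dirty);
	spin_unlock(&bdi->wb.b_lock);

Using list_move() rather than list_add() plus a BUG_ON tolerates another
CPU having requeued the inode in the unlocked window between the locks.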
@@ -1066,6 +1086,8 @@ static void wait_sb_inodes(struct super_block *sb)
 	 */
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
+	spin_lock(&sb->s_inodes_lock);
+
 	/*
 	 * Data integrity sync. Must wait for all pages under writeback,
 	 * because there may have been pages dirtied before our sync
@@ -1073,32 +1095,25 @@ static void wait_sb_inodes(struct super_block *sb)
 	 * In which case, the inode may not be on the dirty list, but
 	 * we still have to wait for that writeout.
 	 */
-	rcu_read_lock();
-	do_inode_list_for_each_entry_rcu(sb, inode) {
+	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 		struct address_space *mapping;
 
 		spin_lock(&inode->i_lock);
-		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
-			spin_unlock(&inode->i_lock);
-			continue;
-		}
-
 		mapping = inode->i_mapping;
-		if (mapping->nrpages == 0) {
+		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
+		    mapping->nrpages == 0) {
 			spin_unlock(&inode->i_lock);
 			continue;
 		}
-
- 		inode_get_ilock(inode);
+		inode->i_ref++;
 		spin_unlock(&inode->i_lock);
-		rcu_read_unlock();
+		spin_unlock(&sb->s_inodes_lock);
 		/*
-		 * We hold a reference to 'inode' so it couldn't have
-		 * been removed from s_inodes list while we dropped the
-		 * i_lock.  We cannot iput the inode now as we can be
-		 * holding the last reference and we cannot iput it
-		 * under spinlock. So we keep the reference and iput it
-		 * later.
+		 * We hold a reference to 'inode' so it couldn't have been
+		 * removed from s_inodes list while we dropped the
+		 * s_inodes_lock.  We cannot iput the inode now as we can be
+		 * holding the last reference and we cannot iput it under
+		 * s_inodes_lock. So we keep the reference and iput it later.
 		 */
 		iput(old_inode);
 		old_inode = inode;
@@ -1107,9 +1122,9 @@ static void wait_sb_inodes(struct super_block *sb)
 
 		cond_resched();
 
-		rcu_read_lock();
-	} while_inode_list_for_each_entry_rcu
-	rcu_read_unlock();
+		spin_lock(&sb->s_inodes_lock);
+	}
+	spin_unlock(&sb->s_inodes_lock);
 	iput(old_inode);
 }
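
The loop above is the canonical pattern for walking s_inodes once blocking
work is involved: pin the cursor with a reference, drop the list lock, and
defer the iput() of the previous inode until no locks are held (iput can
sleep and re-enter the filesystem). Distilled, with the i_state and
mapping checks elided:

	struct inode *inode, *old_inode = NULL;

	spin_lock(&sb->s_inodes_lock);
	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
		spin_lock(&inode->i_lock);
		inode->i_ref++;			/* pin the cursor */
		spin_unlock(&inode->i_lock);
		spin_unlock(&sb->s_inodes_lock);

		/* ... blocking work on 'inode' ... */

		iput(old_inode);		/* safe: no locks held */
		old_inode = inode;
		spin_lock(&sb->s_inodes_lock);
	}
	spin_unlock(&sb->s_inodes_lock);
	iput(old_inode);

The same shape recurs in add_dquot_ref() and fsnotify_unmount_inodes()
further down.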
 
@@ -1126,7 +1141,6 @@ void writeback_inodes_sb(struct super_block *sb)
 {
 	unsigned long nr_dirty = global_page_state(NR_FILE_DIRTY);
 	unsigned long nr_unstable = global_page_state(NR_UNSTABLE_NFS);
-	int nr_dirty_inodes;
 	DECLARE_COMPLETION_ONSTACK(done);
 	struct wb_writeback_work work = {
 		.sb		= sb,
@@ -1136,11 +1150,7 @@ void writeback_inodes_sb(struct super_block *sb)
 
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
-	nr_dirty_inodes = get_nr_inodes() - get_nr_inodes_unused();
-	if (nr_dirty_inodes < 0)
-		nr_dirty_inodes = 0;
-
-	work.nr_pages = nr_dirty + nr_unstable + nr_dirty_inodes;
+	work.nr_pages = nr_dirty + nr_unstable + get_nr_dirty_inodes();
 
 	bdi_queue_work(sb->s_bdi, &work);
 	wait_for_completion(&done);
@@ -1205,7 +1215,6 @@ EXPORT_SYMBOL(sync_inodes_sb);
  */
 int write_inode_now(struct inode *inode, int sync)
 {
-	struct bdi_writeback *wb = inode_to_wb(inode);
 	int ret;
 	struct writeback_control wbc = {
 		.nr_to_write = LONG_MAX,
@@ -1218,38 +1227,9 @@ int write_inode_now(struct inode *inode, int sync)
 		wbc.nr_to_write = 0;
 
 	might_sleep();
-	spin_lock(&inode->i_lock);
-	spin_lock(&wb->b_lock);
-	ret = writeback_single_inode(wb, inode, &wbc);
-	spin_unlock(&wb->b_lock);
-	spin_unlock(&inode->i_lock);
+	ret = sync_inode(inode, &wbc);
 	if (sync)
 		inode_sync_wait(inode);
 	return ret;
 }
 EXPORT_SYMBOL(write_inode_now);
-
-/**
- * sync_inode - write an inode and its pages to disk.
- * @inode: the inode to sync
- * @wbc: controls the writeback mode
- *
- * sync_inode() will write an inode and its pages to disk.  It will also
- * correctly update the inode on its superblock's dirty inode lists and will
- * update inode->i_state.
- *
- * The caller must have a ref on the inode.
- */
-int sync_inode(struct inode *inode, struct writeback_control *wbc)
-{
-	struct bdi_writeback *wb = inode_to_wb(inode);
-	int ret;
-
-	spin_lock(&inode->i_lock);
-	spin_lock(&wb->b_lock);
-	ret = writeback_single_inode(wb, inode, wbc);
-	spin_unlock(&wb->b_lock);
-	spin_unlock(&inode->i_lock);
-	return ret;
-}
-EXPORT_SYMBOL(sync_inode);
diff --git a/fs/inode.c b/fs/inode.c
index c682715..6a9b1ea 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -24,36 +24,40 @@
 #include <linux/mount.h>
 #include <linux/async.h>
 #include <linux/posix_acl.h>
-#include <linux/bit_spinlock.h>
-#include <linux/lglock.h>
+
 #include "internal.h"
 
 /*
- * Usage:
- * inode_list_lglock protects:
- *   s_inodes, i_sb_list
- * inode_hash_bucket lock protects:
+ * Locking rules.
+ *
+ * inode->i_lock is *always* the innermost lock.
+ *
+ * inode->i_lock protects:
+ *   i_ref i_state
+ * inode hash lock protects:
  *   inode hash table, i_hash
- * zone->inode_lru_lock protects:
+ * sb inode lock protects:
+ *   s_inodes, i_sb_list
+ * bdi writeback lock protects:
+ *   b_io, b_more_io, b_dirty, i_wb_list
+ * inode_lru_lock protects:
  *   inode_lru, i_lru
- * wb->b_lock protects:
- *   b_io, b_more_io, b_dirty, i_io, i_lru
- * inode->i_lock protects:
- *   i_state
- *   i_refs
- *   i_hash
- *   i_io
- *   i_lru
- *   i_sb_list
  *
- * Ordering:
- * inode->i_lock
- *   inode_list_lglock
- *   zone->inode_lru_lock
+ * Lock orders
+ * inode hash bucket lock
+ *   inode->i_lock
+ *
+ * sb inode lock
+ *   inode_lru_lock
  *   wb->b_lock
- *     sb_lock (pin_sb_for_writeback)
- *   inode_hash_bucket lock
- *   dentry->d_lock (alias management)
+ *     inode->i_lock
+ *
+ * wb->b_lock
+ *   sb_lock (pin sb for writeback)
+ *   inode->i_lock
+ *
+ * inode_lru_lock
+ *   inode->i_lock
  */
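
As a worked example of the table above, the longest legal chain runs from
the per-sb list down to i_lock (sketch only; no single path needs all four
at once):

	spin_lock(&sb->s_inodes_lock);		/* sb inode lock */
	spin_lock(&inode_lru_lock);		/* nests inside it */
	spin_lock(&wb->b_lock);			/* writeback list lock */
	spin_lock(&inode->i_lock);		/* always innermost */

	/* ... manipulate list membership and i_state ... */

	spin_unlock(&inode->i_lock);
	spin_unlock(&wb->b_lock);
	spin_unlock(&inode_lru_lock);
	spin_unlock(&sb->s_inodes_lock);

Acquiring in any other order risks ABBA deadlock, which is why paths that
already hold i_lock (e.g. __mark_inode_dirty) must drop it before taking a
list lock.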
 /*
  * This is needed for the following functions:
@@ -89,43 +93,21 @@
 
 static unsigned int i_hash_mask __read_mostly;
 static unsigned int i_hash_shift __read_mostly;
+static struct hlist_bl_head *inode_hashtable __read_mostly;
 
 /*
  * Each inode can be on two separate lists. One is
  * the hash list of the inode, used for lookups. The
  * other linked list is the "type" list:
- *  "in_use" - valid inode, i_refs > 0, i_nlink > 0
+ *  "in_use" - valid inode, i_ref > 0, i_nlink > 0
  *  "dirty"  - as "in_use" but also dirty
- *  "unused" - valid inode, i_refs = 0
+ *  "unused" - valid inode, i_ref = 0
  *
  * A "dirty" list is maintained for each super block,
  * allowing for low-overhead inode sync() operations.
  */
-
-struct inode_hash_bucket {
-	struct hlist_bl_head head;
-};
-
-static inline void spin_lock_bucket(struct inode_hash_bucket *b)
-{
-	bit_spin_lock(0, (unsigned long *)b);
-}
-
-static inline void spin_unlock_bucket(struct inode_hash_bucket *b)
-{
-	__bit_spin_unlock(0, (unsigned long *)b);
-}
-
-static struct inode_hash_bucket *inode_hashtable __read_mostly;
-
-/*
- * A simple spinlock to protect the list manipulations.
- *
- * NOTE! You also have to own the lock if you change
- * the i_state of an inode while it is in use..
- */
-DECLARE_LGLOCK(inode_list_lglock);
-DEFINE_LGLOCK(inode_list_lglock);
+static LIST_HEAD(inode_lru);
+static DEFINE_SPINLOCK(inode_lru_lock);
 
 /*
  * iprune_sem provides exclusion between the kswapd or try_to_free_pages
@@ -144,48 +126,42 @@ static DECLARE_RWSEM(iprune_sem);
 /*
  * Statistics gathering..
  */
-struct inodes_stat_t inodes_stat = {
-	.nr_inodes = 0,
-	.nr_unused = 0,
-};
+struct inodes_stat_t inodes_stat;
 
-static DEFINE_PER_CPU(unsigned int, nr_inodes);
+static struct percpu_counter nr_inodes __cacheline_aligned_in_smp;
+static struct percpu_counter nr_inodes_unused __cacheline_aligned_in_smp;
 
 static struct kmem_cache *inode_cachep __read_mostly;
 
-int get_nr_inodes(void)
+static inline int get_nr_inodes(void)
+{
+	return percpu_counter_sum_positive(&nr_inodes);
+}
+
+static inline int get_nr_inodes_unused(void)
 {
-	int i;
-	int sum = 0;
-	for_each_possible_cpu(i)
-		sum += per_cpu(nr_inodes, i);
-	return sum < 0 ? 0 : sum;
+	return percpu_counter_sum_positive(&nr_inodes_unused);
 }
 
-int get_nr_inodes_unused(void)
+int get_nr_dirty_inodes(void)
 {
-	int nr = 0;
-	struct zone *z;
+	int nr_dirty = get_nr_inodes() - get_nr_inodes_unused();
+	return nr_dirty > 0 ? nr_dirty : 0;
 
-	for_each_populated_zone(z)
-		nr += z->inode_nr_lru;
-	return nr;
 }
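
The counter rework means writers only ever touch a cpu-local counter,
while the (rare) readers pay for an exact sum. A sketch of the two halves
as this patch uses them:

	percpu_counter_inc(&nr_inodes);		/* inode_init_always() */
	percpu_counter_dec(&nr_inodes);		/* __destroy_inode() */

	/* readers sum all cpus, clamping negative transients to zero */
	int nr = percpu_counter_sum_positive(&nr_inodes);

The clamp matters because concurrent per-cpu deltas can make the
instantaneous sum go negative, and it is also why get_nr_dirty_inodes()
clamps once more after subtracting the two counters.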
 
 /*
- * Handle nr_dentry sysctl
+ * Handle nr_inode sysctl
  */
+#if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS)
 int proc_nr_inodes(ctl_table *table, int write,
 		   void __user *buffer, size_t *lenp, loff_t *ppos)
 {
-#if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS)
 	inodes_stat.nr_inodes = get_nr_inodes();
 	inodes_stat.nr_unused = get_nr_inodes_unused();
 	return proc_dointvec(table, write, buffer, lenp, ppos);
-#else
-	return -ENOSYS;
-#endif
 }
+#endif
 
 static void wake_up_inode(struct inode *inode)
 {
@@ -214,7 +190,7 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
 	inode->i_sb = sb;
 	inode->i_blkbits = sb->s_blocksize_bits;
 	inode->i_flags = 0;
-	inode->i_refs = 1;
+	inode->i_ref = 1;
 	inode->i_op = &empty_iops;
 	inode->i_fop = &empty_fops;
 	inode->i_nlink = 1;
@@ -228,7 +204,6 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
 #ifdef CONFIG_QUOTA
 	memset(&inode->i_dquot, 0, sizeof(inode->i_dquot));
 #endif
-	INIT_LIST_HEAD(&inode->i_sb_list);
 	inode->i_pipe = NULL;
 	inode->i_bdev = NULL;
 	inode->i_cdev = NULL;
@@ -275,7 +250,7 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
 	inode->i_fsnotify_mask = 0;
 #endif
 
-	this_cpu_inc(nr_inodes);
+	percpu_counter_inc(&nr_inodes);
 
 	return 0;
 out:
@@ -306,12 +281,6 @@ static struct inode *alloc_inode(struct super_block *sb)
 	return inode;
 }
 
-void free_inode_nonrcu(struct inode *inode)
-{
-	kmem_cache_free(inode_cachep, inode);
-}
-EXPORT_SYMBOL(free_inode_nonrcu);
-
 void __destroy_inode(struct inode *inode)
 {
 	BUG_ON(inode_has_buffers(inode));
@@ -323,25 +292,18 @@ void __destroy_inode(struct inode *inode)
 	if (inode->i_default_acl && inode->i_default_acl != ACL_NOT_CACHED)
 		posix_acl_release(inode->i_default_acl);
 #endif
-	this_cpu_dec(nr_inodes);
+	percpu_counter_dec(&nr_inodes);
 }
 EXPORT_SYMBOL(__destroy_inode);
 
-static void i_callback(struct rcu_head *head)
-{
-	struct inode *inode = container_of(head, struct inode, i_rcu);
-	INIT_LIST_HEAD(&inode->i_dentry);
-	kmem_cache_free(inode_cachep, inode);
-}
-
 void destroy_inode(struct inode *inode)
 {
-	BUG_ON(!list_empty(&inode->i_io));
+	BUG_ON(!list_empty(&inode->i_lru));
 	__destroy_inode(inode);
 	if (inode->i_sb->s_op->destroy_inode)
 		inode->i_sb->s_op->destroy_inode(inode);
 	else
-		call_rcu(&inode->i_rcu, i_callback);
+		kmem_cache_free(inode_cachep, inode);
 }
 
 /*
@@ -352,10 +314,10 @@ void destroy_inode(struct inode *inode)
 void inode_init_once(struct inode *inode)
 {
 	memset(inode, 0, sizeof(*inode));
-	INIT_HLIST_BL_NODE(&inode->i_hash);
+	init_hlist_bl_node(&inode->i_hash);
 	INIT_LIST_HEAD(&inode->i_dentry);
 	INIT_LIST_HEAD(&inode->i_devices);
-	INIT_LIST_HEAD(&inode->i_io);
+	INIT_LIST_HEAD(&inode->i_wb_list);
 	INIT_LIST_HEAD(&inode->i_lru);
 	INIT_RADIX_TREE(&inode->i_data.page_tree, GFP_ATOMIC);
 	spin_lock_init(&inode->i_data.tree_lock);
@@ -378,6 +340,117 @@ static void init_once(void *foo)
 	inode_init_once(inode);
 }
 
+/**
+ * iref - increment the reference count on an inode
+ * @inode:	inode to take a reference on
+ *
+ * iref() should be called to take an extra reference to an inode. The inode
+ * must already have a reference count obtained via igrab() as iref() does not
+ * do checks for the inode being freed and hence cannot be used to initially
+ * obtain a reference to the inode.
+ */
+void iref(struct inode *inode)
+{
+	WARN_ON(inode->i_ref < 1);
+	spin_lock(&inode->i_lock);
+	inode->i_ref++;
+	spin_unlock(&inode->i_lock);
+}
+EXPORT_SYMBOL_GPL(iref);
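
A usage sketch, where some_inode and queue_for_writeback() are
placeholders for an inode obtained from a validating lookup and a
hypothetical second user of it:

	struct inode *inode = igrab(some_inode);	/* validated first ref */

	if (inode) {
		iref(inode);			/* extra ref, inode pinned */
		queue_for_writeback(inode);	/* consumer iputs its ref */
		iput(inode);			/* drop our own ref */
	}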
+
+/*
+ * check against I_FREEING as inode writeback completion could race with
+ * setting the I_FREEING and removing the inode from the LRU.
+ */
+void inode_lru_list_add(struct inode *inode)
+{
+	spin_lock(&inode_lru_lock);
+	if (list_empty(&inode->i_lru) && !(inode->i_state & I_FREEING)) {
+		list_add(&inode->i_lru, &inode_lru);
+		percpu_counter_inc(&nr_inodes_unused);
+	}
+	spin_unlock(&inode_lru_lock);
+}
+
+void inode_lru_list_del(struct inode *inode)
+{
+	spin_lock(&inode_lru_lock);
+	if (!list_empty(&inode->i_lru)) {
+		list_del_init(&inode->i_lru);
+		percpu_counter_dec(&nr_inodes_unused);
+	}
+	spin_unlock(&inode_lru_lock);
+}
+
+/**
+ * inode_sb_list_add - add inode to the superblock list of inodes
+ * @inode: inode to add
+ */
+void inode_sb_list_add(struct inode *inode)
+{
+	struct super_block *sb = inode->i_sb;
+
+	spin_lock(&sb->s_inodes_lock);
+	list_add(&inode->i_sb_list, &sb->s_inodes);
+	spin_unlock(&sb->s_inodes_lock);
+}
+EXPORT_SYMBOL_GPL(inode_sb_list_add);
+
+static void inode_sb_list_del(struct inode *inode)
+{
+	struct super_block *sb = inode->i_sb;
+
+	spin_lock(&sb->s_inodes_lock);
+	list_del_init(&inode->i_sb_list);
+	spin_unlock(&sb->s_inodes_lock);
+}
+
+static unsigned long hash(struct super_block *sb, unsigned long hashval)
+{
+	unsigned long tmp;
+
+	tmp = (hashval * (unsigned long)sb) ^ (GOLDEN_RATIO_PRIME + hashval) /
+			L1_CACHE_BYTES;
+	tmp = tmp ^ ((tmp ^ GOLDEN_RATIO_PRIME) >> I_HASHBITS);
+	return tmp & I_HASHMASK;
+}
+
+/**
+ *	__insert_inode_hash - hash an inode
+ *	@inode: unhashed inode
+ *	@hashval: unsigned long value used to locate this object in the
+ *		inode_hashtable.
+ *
+ *	Add an inode to the inode hash for this superblock.
+ */
+void __insert_inode_hash(struct inode *inode, unsigned long hashval)
+{
+	struct hlist_bl_head *b = inode_hashtable + hash(inode->i_sb, hashval);
+
+	hlist_bl_lock(b);
+	hlist_bl_add_head(&inode->i_hash, b);
+	hlist_bl_unlock(b);
+}
+EXPORT_SYMBOL(__insert_inode_hash);
+
+/**
+ *	remove_inode_hash - remove an inode from the hash
+ *	@inode: inode to unhash
+ *
+ *	Remove an inode from the superblock. inode->i_lock must be
+ *	held.
+ */
+void remove_inode_hash(struct inode *inode)
+{
+	struct hlist_bl_head *b;
+
+	b = inode_hashtable + hash(inode->i_sb, inode->i_ino);
+	hlist_bl_lock(b);
+	hlist_bl_del_init(&inode->i_hash);
+	hlist_bl_unlock(b);
+}
+EXPORT_SYMBOL(remove_inode_hash);
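
The hash buckets are now bare hlist_bl heads: bit zero of the head pointer
doubles as the bucket lock (hlist_bl_lock() is a bit spinlock on bit 0),
so per-bucket locking costs no memory beyond the pointer itself. A lookup
sketch, assuming b was computed by hash() above and eliding the
I_FREEING/I_WILL_FREE wait that find_inode_fast() performs:

	struct hlist_bl_node *node;
	struct inode *inode;

	hlist_bl_lock(b);
	hlist_bl_for_each_entry(inode, node, b, i_hash) {
		if (inode->i_ino != ino || inode->i_sb != sb)
			continue;
		spin_lock(&inode->i_lock);	/* nests inside bucket lock */
		inode->i_ref++;
		spin_unlock(&inode->i_lock);
		break;
	}
	hlist_bl_unlock(b);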
+
 void end_writeback(struct inode *inode)
 {
 	might_sleep();
@@ -386,8 +459,9 @@ void end_writeback(struct inode *inode)
 	BUG_ON(!(inode->i_state & I_FREEING));
 	BUG_ON(inode->i_state & I_CLEAR);
 	inode_sync_wait(inode);
-	/* don't need i_lock here, no concurrent mods to i_state */
+	spin_lock(&inode->i_lock);
 	inode->i_state = I_FREEING | I_CLEAR;
+	spin_unlock(&inode->i_lock);
 }
 EXPORT_SYMBOL(end_writeback);
 
@@ -408,20 +482,36 @@ static void evict(struct inode *inode)
 		cd_forget(inode);
 }
 
-static void __remove_inode_hash(struct inode *inode);
-
-static void inode_sb_list_del(struct inode *inode);
-
+/*
+ * Free the inode passed in, removing it from the lists it is still connected
+ * to but avoiding unnecessary lock round-trips for the lists it is no longer
+ * on.
+ *
+ * An inode must already be marked I_FREEING so that we avoid the inode being
+ * moved back onto lists if we race with other code that manipulates the lists
+ * (e.g. writeback_single_inode). The caller is responsible for setting the
+ * I_FREEING flag before calling this function.
+ */
 static void dispose_one_inode(struct inode *inode)
 {
-	evict(inode);
+	BUG_ON(!(inode->i_state & I_FREEING));
 
-	spin_lock(&inode->i_lock);
-	__remove_inode_hash(inode);
-	inode_sb_list_del(inode);
-	spin_unlock(&inode->i_lock);
+	/*
+	 * move the inode off the IO lists and LRU once
+	 * I_FREEING is set so that it won't get moved back on
+	 * there if it is dirty.
+	 */
+	if (!list_empty(&inode->i_wb_list))
+		inode_wb_list_del(inode);
+	if (!list_empty(&inode->i_lru))
+		inode_lru_list_del(inode);
+	if (!list_empty(&inode->i_sb_list))
+		inode_sb_list_del(inode);
+
+	evict(inode);
 
+	remove_inode_hash(inode);
 	wake_up_inode(inode);
+	BUG_ON(inode->i_state != (I_FREEING | I_CLEAR));
 	destroy_inode(inode);
 }
 
@@ -437,74 +527,57 @@ static void dispose_list(struct list_head *head)
 	while (!list_empty(head)) {
 		struct inode *inode;
 
-		inode = list_first_entry(head, struct inode, i_lru);
-		list_del_init(&inode->i_lru);
+		inode = list_first_entry(head, struct inode, i_sb_list);
+		list_del_init(&inode->i_sb_list);
 
 		dispose_one_inode(inode);
-		cond_resched();
 	}
 }
 
 /*
- * Add an inode to the LRU list. i_lock must be held.
- */
-void __inode_lru_list_add(struct inode *inode)
-{
-	struct zone *z = page_zone(virt_to_page(inode));
-
-	spin_lock(&z->inode_lru_lock);
-	list_add(&inode->i_lru, &z->inode_lru);
-	z->inode_nr_lru++;
-	spin_unlock(&z->inode_lru_lock);
-}
-
-/*
- * Remove an inode from the LRU list. i_lock must be held.
- */
-void __inode_lru_list_del(struct inode *inode)
-{
-	struct zone *z = page_zone(virt_to_page(inode));
-
-	spin_lock(&z->inode_lru_lock);
-	list_del_init(&inode->i_lru);
-	z->inode_nr_lru--;
-	spin_unlock(&z->inode_lru_lock);
-}
-
-/*
  * Invalidate all inodes for a device.
  */
-static int invalidate_sb_inodes(struct super_block *sb, struct list_head *dispose)
+static int invalidate_list(struct super_block *sb, struct list_head *head,
+			struct list_head *dispose)
 {
-	struct inode *inode;
+	struct list_head *next;
 	int busy = 0;
 
-	do_inode_list_for_each_entry_rcu(sb, inode) {
+	next = head->next;
+	for (;;) {
+		struct list_head *tmp = next;
+		struct inode *inode;
+
+		/*
+		 * We can reschedule here without worrying about the list's
+		 * consistency because the per-sb list of inodes must not
+		 * change during umount anymore, and because iprune_sem keeps
+		 * shrink_icache_memory() away.
+		 */
+		cond_resched_lock(&sb->s_inodes_lock);
+
+		next = next->next;
+		if (tmp == head)
+			break;
+		inode = list_entry(tmp, struct inode, i_sb_list);
 		spin_lock(&inode->i_lock);
 		if (inode->i_state & I_NEW) {
 			spin_unlock(&inode->i_lock);
 			continue;
 		}
 		invalidate_inode_buffers(inode);
-		if (!inode->i_refs) {
-			struct bdi_writeback *wb = inode_to_wb(inode);
-
-			spin_lock(&wb->b_lock);
-			list_del_init(&inode->i_io);
-			spin_unlock(&wb->b_lock);
-
-			__inode_lru_list_del(inode);
-
+		if (!inode->i_ref) {
 			WARN_ON(inode->i_state & I_NEW);
 			inode->i_state |= I_FREEING;
 			spin_unlock(&inode->i_lock);
-			list_add(&inode->i_lru, dispose);
+
+			/* save a lock round trip by removing the inode here. */
+			list_move(&inode->i_sb_list, dispose);
 			continue;
 		}
 		spin_unlock(&inode->i_lock);
 		busy = 1;
-	} while_inode_list_for_each_entry_rcu
-
+	}
 	return busy;
 }
 
@@ -518,127 +591,113 @@ static int invalidate_sb_inodes(struct super_block *sb, struct list_head *dispos
  */
 int invalidate_inodes(struct super_block *sb)
 {
-	int busy;
 	LIST_HEAD(throw_away);
+	int busy;
 
 	down_write(&iprune_sem);
-	/*
-	 * We can walk the per-sb list of inodes here without worrying about
-	 * its consistency, because the list must not change during umount
-	 * anymore, and because iprune_sem keeps shrink_icache_memory() away.
-	 */
-	fsnotify_unmount_inodes(sb);
-	busy = invalidate_sb_inodes(sb, &throw_away);
+	spin_lock(&sb->s_inodes_lock);
+	fsnotify_unmount_inodes(&sb->s_inodes);
+	busy = invalidate_list(sb, &sb->s_inodes, &throw_away);
+	spin_unlock(&sb->s_inodes_lock);
+	up_write(&iprune_sem);
 
 	dispose_list(&throw_away);
-	up_write(&iprune_sem);
 
 	return busy;
 }
 EXPORT_SYMBOL(invalidate_inodes);
 
-static int can_unuse(struct inode *inode)
-{
-	if (inode->i_state & ~I_REFERENCED)
-		return 0;
-	if (inode_has_buffers(inode))
-		return 0;
-	if (inode->i_refs)
-		return 0;
-	if (inode->i_data.nrpages)
-		return 0;
-	return 1;
-}
-
 /*
- * Scan `goal' inodes on the unused list for freeable ones. They are moved to
- * a temporary list and then are freed outside LRU lock by dispose_list().
+ * Scan `goal' inodes on the unused list for freeable ones. They are moved to a
+ * temporary list and then are freed outside locks by dispose_list().
  *
  * Any inodes which are pinned purely because of attached pagecache have their
- * pagecache removed.  We expect the final iput() on that inode to add it to
- * the front of the inode_lru list.  So look for it there and if the
- * inode is still freeable, proceed.  The right inode is found 99.9% of the
- * time in testing on a 4-way.
+ * pagecache removed.  If the inode has metadata buffers attached to
+ * mapping->private_list then try to remove them.
  *
- * If the inode has metadata buffers attached to mapping->private_list then
- * try to remove them.
+ * If the inode has the I_REFERENCED flag set, it means that it has been used
+ * recently - the flag is set in iput_final(). When we encounter such an inode,
+ * clear the flag and move it to the back of the LRU so it gets another pass
+ * through the LRU before it gets reclaimed. This is necessary because we do
+ * lazy LRU updates to minimise lock contention, so the LRU does not have
+ * strict ordering. Hence we don't want to reclaim inodes with this flag set
+ * because they are the ones that are out of order.
  */
-static void prune_icache(struct zone *zone, unsigned long nr_to_scan)
+static void prune_icache(int nr_to_scan)
 {
+	int nr_scanned;
 	unsigned long reap = 0;
 
 	down_read(&iprune_sem);
-again:
-	spin_lock(&zone->inode_lru_lock);
-	for (; nr_to_scan; nr_to_scan--) {
+	spin_lock(&inode_lru_lock);
+	for (nr_scanned = 0; nr_scanned < nr_to_scan; nr_scanned++) {
 		struct inode *inode;
 
-		if (list_empty(&zone->inode_lru))
+		if (list_empty(&inode_lru))
 			break;
 
-		inode = list_entry(zone->inode_lru.prev, struct inode, i_lru);
+		inode = list_entry(inode_lru.prev, struct inode, i_lru);
 
-		if (!spin_trylock(&inode->i_lock)) {
-			spin_unlock(&zone->inode_lru_lock);
-			cpu_relax();
-			goto again;
-		}
-		if (inode->i_refs || (inode->i_state & ~I_REFERENCED)) {
-			list_del_init(&inode->i_lru);
+		/*
+		 * Referenced or dirty inodes are still in use. Give them
+		 * another pass through the LRU as we cannot reclaim them now.
+		 */
+		spin_lock(&inode->i_lock);
+		if (inode->i_ref || (inode->i_state & ~I_REFERENCED)) {
 			spin_unlock(&inode->i_lock);
-			zone->inode_nr_lru--;
+			list_del_init(&inode->i_lru);
+			percpu_counter_dec(&nr_inodes_unused);
 			continue;
 		}
+
+		/* recently referenced inodes get one more pass */
 		if (inode->i_state & I_REFERENCED) {
-			list_move(&inode->i_lru, &zone->inode_lru);
 			inode->i_state &= ~I_REFERENCED;
 			spin_unlock(&inode->i_lock);
+			list_move(&inode->i_lru, &inode_lru);
 			continue;
 		}
 		if (inode_has_buffers(inode) || inode->i_data.nrpages) {
-			/*
-			 * Move back to the head of the unused list in case the
-			 * invalidations failed. Could improve this by going to
-			 * the head of the list only if invalidation fails.
-			 *
-			 * We'll try to get it back if it becomes freeable.
-			 */
-			list_move(&inode->i_lru, &zone->inode_lru);
-			spin_unlock(&zone->inode_lru_lock);
-			inode_get_ilock(inode);
+			inode->i_ref++;
 			spin_unlock(&inode->i_lock);
-
+			spin_unlock(&inode_lru_lock);
 			if (remove_inode_buffers(inode))
 				reap += invalidate_mapping_pages(&inode->i_data,
 								0, -1);
 			iput(inode);
-			spin_lock(&zone->inode_lru_lock);
-			if (inode == list_entry(zone->inode_lru.next,
-						struct inode, i_lru)) {
-				if (spin_trylock(&inode->i_lock)) {
-					if (can_unuse(inode))
-						goto freeable;
-					spin_unlock(&inode->i_lock);
-				}
-			}
+
+			/*
+			 * Rather than try to determine if we can still use the
+			 * inode after calling iput(), leave the inode where it
+			 * is on the LRU. If we race with another reclaimer,
+			 * that reclaimer will either see a reference count
+			 * or the I_REFERENCED flag, and move the inode to the
+			 * back of the LRU. If we don't race, then we'll see
+			 * the I_REFERENCED flag on the next pass and do the
+			 * same. Either way, we won't spin on it in this loop.
+			 */
+			spin_lock(&inode_lru_lock);
 			continue;
 		}
-freeable:
-		list_del_init(&inode->i_lru);
 		WARN_ON(inode->i_state & I_NEW);
 		inode->i_state |= I_FREEING;
 		spin_unlock(&inode->i_lock);
-		zone->inode_nr_lru--;
-		spin_unlock(&zone->inode_lru_lock);
+
+		/* save a lock round trip by removing the inode here. */
+		list_del_init(&inode->i_lru);
+		percpu_counter_dec(&nr_inodes_unused);
+		spin_unlock(&inode_lru_lock);
+
 		dispose_one_inode(inode);
 		cond_resched();
-		spin_lock(&zone->inode_lru_lock);
+
+		spin_lock(&inode_lru_lock);
 	}
 	if (current_is_kswapd())
 		__count_vm_events(KSWAPD_INODESTEAL, reap);
 	else
 		__count_vm_events(PGINODESTEAL, reap);
-	spin_unlock(&zone->inode_lru_lock);
+	spin_unlock(&inode_lru_lock);
 
 	up_read(&iprune_sem);
 }
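
Distilled, the per-inode decision above is a second-chance (clock) policy.
A sketch of the three outcomes, assuming inode_lru_lock is held and
eliding the buffer/pagecache invalidation branch:

	spin_lock(&inode->i_lock);
	if (inode->i_ref || (inode->i_state & ~I_REFERENCED)) {
		spin_unlock(&inode->i_lock);
		list_del_init(&inode->i_lru);	/* lazy removal: in use */
		percpu_counter_dec(&nr_inodes_unused);
	} else if (inode->i_state & I_REFERENCED) {
		inode->i_state &= ~I_REFERENCED;	/* spend the chance */
		spin_unlock(&inode->i_lock);
		list_move(&inode->i_lru, &inode_lru);	/* to the MRU end */
	} else {
		inode->i_state |= I_FREEING;	/* cold: reclaim it */
		spin_unlock(&inode->i_lock);
	}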
@@ -649,47 +708,33 @@ freeable:
  * not open and the dcache references to those inodes have already been
  * reclaimed.
  *
- * This function is defined according to shrinker API described in linux/mm.h.
+ * This function is passed the number of inodes to scan, and it returns the
+ * total number of remaining possibly-reclaimable inodes.
  */
-static void shrink_icache_memory(struct shrinker *shrink,
-		struct zone *zone, unsigned long scanned,
-		unsigned long total, unsigned long global,
-		unsigned long flags, gfp_t gfp_mask)
+static int shrink_icache_memory(struct shrinker *shrink, int nr, gfp_t gfp_mask)
 {
-	unsigned long nr;
-
-	shrinker_add_scan(&zone->inode_nr_scan, scanned, total,
-			zone->inode_nr_lru,
-			SHRINK_DEFAULT_SEEKS * 100 / sysctl_vfs_cache_pressure);
-	/*
-	 * Nasty deadlock avoidance.  We may hold various FS locks,
-	 * and we don't want to recurse into the FS that called us
-	 * in clear_inode() and friends..
-	 */
-	if (!(gfp_mask & __GFP_FS))
-	       return;
-
-	nr = ACCESS_ONCE(zone->inode_nr_scan);
-	if (nr < SHRINK_BATCH)
-		return;
-	zone->inode_nr_scan = 0;
-	prune_icache(zone, nr);
-	count_vm_events(SLABS_SCANNED, nr);
+	if (nr) {
+		/*
+		 * Nasty deadlock avoidance.  We may hold various FS locks,
+		 * and we don't want to recurse into the FS that called us
+		 * in clear_inode() and friends..
+		 */
+		if (!(gfp_mask & __GFP_FS))
+			return -1;
+		prune_icache(nr);
+	}
+	return (get_nr_inodes_unused() / 100) * sysctl_vfs_cache_pressure;
 }
 
 static struct shrinker icache_shrinker = {
-	.shrink_zone = shrink_icache_memory,
+	.shrink = shrink_icache_memory,
+	.seeks = DEFAULT_SEEKS,
 };
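
With the classic API, a shrink call with nr == 0 only reports the pool
size, and returning -1 tells vmscan to back off when the allocation
context forbids filesystem recursion. The report scales the unused-inode
count by vfs_cache_pressure; a worked example of the same arithmetic:

	/* same scaling as the return statement above */
	static int icache_report(int unused, unsigned int pressure)
	{
		return (unused / 100) * pressure;
	}

	/*
	 * icache_report(200000, 100) == 200000	(default pressure)
	 * icache_report(200000,  50) == 100000	(cache twice as sticky)
	 * icache_report(200000, 200) == 400000	(reclaim twice as hard)
	 */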
 
 static void __wait_on_freeing_inode(struct inode *inode);
-/*
- * Called with the inode lock held.
- * NOTE: we are not increasing the inode-refcount, you must call inode_get_ilock()
- * by hand after calling find_inode now! This simplifies iunique and won't
- * add any additional branch in the common code.
- */
+
 static struct inode *find_inode(struct super_block *sb,
-				struct inode_hash_bucket *b,
+				struct hlist_bl_head *b,
 				int (*test)(struct inode *, void *),
 				void *data)
 {
@@ -697,28 +742,25 @@ static struct inode *find_inode(struct super_block *sb,
 	struct inode *inode = NULL;
 
 repeat:
-	rcu_read_lock();
-	hlist_bl_for_each_entry_rcu(inode, node, &b->head, i_hash) {
+	hlist_bl_for_each_entry(inode, node, b, i_hash) {
 		if (inode->i_sb != sb)
 			continue;
 		spin_lock(&inode->i_lock);
-		if (hlist_bl_unhashed(&inode->i_hash)) {
-			spin_unlock(&inode->i_lock);
-			continue;
-		}
 		if (!test(inode, data)) {
 			spin_unlock(&inode->i_lock);
 			continue;
 		}
 		if (inode->i_state & (I_FREEING|I_WILL_FREE)) {
-			rcu_read_unlock();
+			hlist_bl_unlock(b);
 			__wait_on_freeing_inode(inode);
+			hlist_bl_lock(b);
 			goto repeat;
 		}
-		break;
+		inode->i_ref++;
+		spin_unlock(&inode->i_lock);
+		return inode;
 	}
-	rcu_read_unlock();
-	return node ? inode : NULL;
+	return NULL;
 }
 
 /*
@@ -726,135 +768,31 @@ repeat:
  * iget_locked for details.
  */
 static struct inode *find_inode_fast(struct super_block *sb,
-				struct inode_hash_bucket *b,
-				unsigned long ino)
+		struct hlist_bl_head *b, unsigned long ino)
 {
 	struct hlist_bl_node *node;
 	struct inode *inode = NULL;
 
 repeat:
-	rcu_read_lock();
-	hlist_bl_for_each_entry_rcu(inode, node, &b->head, i_hash) {
+	hlist_bl_for_each_entry(inode, node, b, i_hash) {
 		if (inode->i_ino != ino)
 			continue;
 		if (inode->i_sb != sb)
 			continue;
 		spin_lock(&inode->i_lock);
-		if (hlist_bl_unhashed(&inode->i_hash)) {
-			spin_unlock(&inode->i_lock);
-			continue;
-		}
 		if (inode->i_state & (I_FREEING|I_WILL_FREE)) {
-			rcu_read_unlock();
+			hlist_bl_unlock(b);
 			__wait_on_freeing_inode(inode);
+			hlist_bl_lock(b);
 			goto repeat;
 		}
-		break;
-	}
-	rcu_read_unlock();
-	return node ? inode : NULL;
-}
-
-static unsigned long hash(struct super_block *sb, unsigned long hashval)
-{
-	unsigned long tmp;
-
-	tmp = (hashval * (unsigned long)sb) ^ (GOLDEN_RATIO_PRIME + hashval) /
-			L1_CACHE_BYTES;
-	tmp = tmp ^ ((tmp ^ GOLDEN_RATIO_PRIME) >> I_HASHBITS);
-	return tmp & I_HASHMASK;
-}
-
-static inline int inode_list_cpu(struct inode *inode)
-{
-#ifdef CONFIG_SMP
-	return inode->i_sb_list_cpu;
-#else
-	return smp_processor_id();
-#endif
-}
-
-/* helper for file_sb_list_add to reduce ifdefs */
-static inline void __inode_sb_list_add(struct inode *inode, struct super_block *sb)
-{
-	struct list_head *list;
-#ifdef CONFIG_SMP
-	int cpu;
-	cpu = smp_processor_id();
-	inode->i_sb_list_cpu = cpu;
-	list = per_cpu_ptr(sb->s_inodes, cpu);
-#else
-	list = &sb->s_inodes;
-#endif
-	list_add_rcu(&inode->i_sb_list, list);
-}
-
-/**
- * inode_sb_list_add - add an inode to the sb's file list
- * @inode: inode to add
- * @sb: sb to add it to
- *
- * Use this function to associate an with the superblock it belongs to.
- */
-static void inode_sb_list_add(struct inode *inode, struct super_block *sb)
-{
-	lg_local_lock(inode_list_lglock);
-	__inode_sb_list_add(inode, sb);
-	lg_local_unlock(inode_list_lglock);
-}
-
-/**
- * inode_sb_list_del - remove an inode from the sb's inode list
- * @inode: inode to remove
- * @sb: sb to remove it from
- *
- * Use this function to remove an inode from its superblock.
- */
-static void inode_sb_list_del(struct inode *inode)
-{
-	if (list_empty(&inode->i_sb_list))
-		return;
-	lg_local_lock_cpu(inode_list_lglock, inode_list_cpu(inode));
-	list_del_rcu(&inode->i_sb_list);
-	lg_local_unlock_cpu(inode_list_lglock, inode_list_cpu(inode));
-}
-
-static inline void
-__inode_add_to_lists(struct super_block *sb, struct inode_hash_bucket *b,
-			struct inode *inode)
-{
-	inode_sb_list_add(inode, sb);
-	if (b) {
-		spin_lock_bucket(b);
-		hlist_bl_add_head_rcu(&inode->i_hash, &b->head);
-		spin_unlock_bucket(b);
+		inode->i_ref++;
+		spin_unlock(&inode->i_lock);
+		return inode;
 	}
+	return NULL;
 }
 
-/**
- * inode_add_to_lists - add a new inode to relevant lists
- * @sb: superblock inode belongs to
- * @inode: inode to mark in use
- *
- * When an inode is allocated it needs to be accounted for, added to the in use
- * list, the owning superblock and the inode hash.
- *
- * We calculate the hash list to add to here so it is all internal
- * which requires the caller to have already set up the inode number in the
- * inode to add.
- */
-void inode_add_to_lists(struct super_block *sb, struct inode *inode)
-{
-	struct inode_hash_bucket *b = inode_hashtable + hash(sb, inode->i_ino);
-
-	spin_lock(&inode->i_lock);
-	__inode_add_to_lists(sb, b, inode);
-	spin_unlock(&inode->i_lock);
-}
-EXPORT_SYMBOL_GPL(inode_add_to_lists);
-
-#define LAST_INO_BATCH 1024
-
 /*
  * Each cpu owns a range of LAST_INO_BATCH numbers.
  * 'shared_last_ino' is dirtied only once out of LAST_INO_BATCH allocations,
@@ -870,25 +808,25 @@ EXPORT_SYMBOL_GPL(inode_add_to_lists);
  * error if st_ino won't fit in target struct field. Use 32bit counter
  * here to attempt to avoid that.
  */
+#define LAST_INO_BATCH 1024
 static DEFINE_PER_CPU(unsigned int, last_ino);
 
 unsigned int get_next_ino(void)
 {
-	unsigned int res;
+	unsigned int *p = &get_cpu_var(last_ino);
+	unsigned int res = *p;
 
-	get_cpu();
-	res = __this_cpu_read(last_ino);
 #ifdef CONFIG_SMP
-	if (unlikely((res & (LAST_INO_BATCH - 1)) == 0)) {
+	if (unlikely((res & (LAST_INO_BATCH-1)) == 0)) {
 		static atomic_t shared_last_ino;
 		int next = atomic_add_return(LAST_INO_BATCH, &shared_last_ino);
 
 		res = next - LAST_INO_BATCH;
 	}
 #endif
-	res++;
-	__this_cpu_write(last_ino, res);
-	put_cpu();
+
+	*p = ++res;
+	put_cpu_var(last_ino);
 	return res;
 }
 EXPORT_SYMBOL(get_next_ino);
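
Each CPU leases a window of LAST_INO_BATCH numbers from shared_last_ino
and then allocates privately, so the shared cacheline is dirtied once per
1024 allocations instead of once per inode. A single-threaded model of the
arithmetic (the kernel version uses a per-cpu variable and
atomic_add_return()):

	#define LAST_INO_BATCH 1024

	static unsigned int shared_last_ino;	/* atomic_t in the kernel */
	static unsigned int last_ino;		/* per-cpu in the kernel */

	static unsigned int next_ino_model(void)
	{
		unsigned int res = last_ino;

		if ((res & (LAST_INO_BATCH - 1)) == 0) {
			/* window exhausted (or first use): lease a new one */
			shared_last_ino += LAST_INO_BATCH;
			res = shared_last_ino - LAST_INO_BATCH;
		}
		last_ino = ++res;
		return res;		/* first window yields 1..1024 */
	}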
@@ -911,44 +849,16 @@ struct inode *new_inode(struct super_block *sb)
 
 	inode = alloc_inode(sb);
 	if (inode) {
-		inode->i_ino = get_next_ino();
-		inode->i_state = 0;
 		/*
-		 * We could init inode locked here, to improve performance.
+		 * set the inode state before we make the inode accessible to
+		 * the outside world.
 		 */
-		spin_lock(&inode->i_lock);
-		__inode_add_to_lists(sb, NULL, inode);
-		spin_unlock(&inode->i_lock);
-	}
-	return inode;
-}
-EXPORT_SYMBOL(new_inode);
-
-/**
- *	new_anon_inode 	- obtain an anonymous inode
- *	@sb: superblock
- *
- *	Similar to new_inode, however the inode is not given an inode
- *	number, and is not added to the sb's list of inodes, to reduce
- *	overheads.
- *
- *	A filesystem which needs an inode number must subsequently
- *	assign one to i_ino. A filesystem which needs inodes to be on the
- *	per-sb list (currently only used by the vfs for umount or remount)
- *	must add the inode to that list.
- */
-struct inode *new_anon_inode(struct super_block *sb)
-{
-	struct inode *inode;
-
-	inode = alloc_inode(sb);
-	if (inode) {
-		inode->i_ino = ULONG_MAX;
 		inode->i_state = 0;
+		inode_sb_list_add(inode);
 	}
 	return inode;
 }
-EXPORT_SYMBOL(new_anon_inode);
+EXPORT_SYMBOL(new_inode);
 
 void unlock_new_inode(struct inode *inode)
 {
@@ -992,7 +902,7 @@ EXPORT_SYMBOL(unlock_new_inode);
  *	-- rmk@arm.uk.linux.org
  */
 static struct inode *get_new_inode(struct super_block *sb,
-				struct inode_hash_bucket *b,
+				struct hlist_bl_head *b,
 				int (*test)(struct inode *, void *),
 				int (*set)(struct inode *, void *),
 				void *data)
@@ -1003,16 +913,21 @@ static struct inode *get_new_inode(struct super_block *sb,
 	if (inode) {
 		struct inode *old;
 
+		hlist_bl_lock(b);
 		/* We released the lock, so.. */
 		old = find_inode(sb, b, test, data);
 		if (!old) {
-			spin_lock(&inode->i_lock);
 			if (set(inode, data))
 				goto set_failed;
 
+			/*
+			 * Set the inode state before we make the inode
+			 * visible to the outside world.
+			 */
 			inode->i_state = I_NEW;
-			__inode_add_to_lists(sb, b, inode);
-			spin_unlock(&inode->i_lock);
+			hlist_bl_add_head(&inode->i_hash, b);
+			hlist_bl_unlock(b);
+			inode_sb_list_add(inode);
 
 			/* Return the locked inode with I_NEW set, the
 			 * caller is responsible for filling in the contents
@@ -1025,8 +940,7 @@ static struct inode *get_new_inode(struct super_block *sb,
 		 * us. Use the old inode instead of the one we just
 		 * allocated.
 		 */
-		inode_get_ilock(old);
-		spin_unlock(&old->i_lock);
+		hlist_bl_unlock(b);
 		destroy_inode(inode);
 		inode = old;
 		wait_on_inode(inode);
@@ -1034,7 +948,7 @@ static struct inode *get_new_inode(struct super_block *sb,
 	return inode;
 
 set_failed:
-	spin_unlock(&inode->i_lock);
+	hlist_bl_unlock(b);
 	destroy_inode(inode);
 	return NULL;
 }
@@ -1044,7 +958,7 @@ set_failed:
  * comment at iget_locked for details.
  */
 static struct inode *get_new_inode_fast(struct super_block *sb,
-				struct inode_hash_bucket *b, unsigned long ino)
+				struct hlist_bl_head *b, unsigned long ino)
 {
 	struct inode *inode;
 
@@ -1052,14 +966,19 @@ static struct inode *get_new_inode_fast(struct super_block *sb,
 	if (inode) {
 		struct inode *old;
 
+		hlist_bl_lock(b);
 		/* We released the lock, so.. */
 		old = find_inode_fast(sb, b, ino);
 		if (!old) {
-			spin_lock(&inode->i_lock);
+			/*
+			 * Set the inode state before we make the inode
+			 * visible to the outside world.
+			 */
 			inode->i_ino = ino;
 			inode->i_state = I_NEW;
-			__inode_add_to_lists(sb, b, inode);
-			spin_unlock(&inode->i_lock);
+			hlist_bl_add_head(&inode->i_hash, b);
+			hlist_bl_unlock(b);
+			inode_sb_list_add(inode);
 
 			/* Return the locked inode with I_NEW set, the
 			 * caller is responsible for filling in the contents
@@ -1072,8 +991,7 @@ static struct inode *get_new_inode_fast(struct super_block *sb,
 		 * us. Use the old inode instead of the one we just
 		 * allocated.
 		 */
-		inode_get_ilock(old);
-		spin_unlock(&old->i_lock);
+		hlist_bl_unlock(b);
 		destroy_inode(inode);
 		inode = old;
 		wait_on_inode(inode);
@@ -1081,26 +999,28 @@ static struct inode *get_new_inode_fast(struct super_block *sb,
 	return inode;
 }
 
-/* Is the ino for this sb hashed right now? */
-static int is_ino_hashed(struct super_block *sb, unsigned long ino)
+/*
+ * search the inode cache for a matching inode number.
+ * If we find one, then the inode number we are trying to
+ * allocate is not unique and so we should not use it.
+ *
+ * Returns 1 if the inode number is unique, 0 if it is not.
+ */
+static int test_inode_iunique(struct super_block *sb, unsigned long ino)
 {
+	struct hlist_bl_head *b = inode_hashtable + hash(sb, ino);
 	struct hlist_bl_node *node;
-	struct inode *inode = NULL;
-	struct inode_hash_bucket *b = inode_hashtable + hash(sb, ino);
+	struct inode *inode;
 
-	spin_lock_bucket(b);
-	hlist_bl_for_each_entry(inode, node, &b->head, i_hash) {
+	hlist_bl_lock(b);
+	hlist_bl_for_each_entry(inode, node, b, i_hash) {
 		if (inode->i_ino == ino && inode->i_sb == sb) {
-			spin_unlock_bucket(b);
+			hlist_bl_unlock(b);
 			return 0;
 		}
-		/*
-		 * Don't bother checking for I_FREEING etc., because
-		 * we don't want iunique to wait on freeing inodes. Just
-		 * skip it and get the next one.
-		 */
 	}
-	spin_unlock_bucket(b);
+
+	hlist_bl_unlock(b);
 	return 1;
 }
 
@@ -1125,17 +1045,17 @@ ino_t iunique(struct super_block *sb, ino_t max_reserved)
 	 * error if st_ino won't fit in target struct field. Use 32bit counter
 	 * here to attempt to avoid that.
 	 */
-	static DEFINE_SPINLOCK(unique_lock);
+	static DEFINE_SPINLOCK(iunique_lock);
 	static unsigned int counter;
 	ino_t res;
 
-	spin_lock(&unique_lock);
+	spin_lock(&iunique_lock);
 	do {
 		if (counter <= max_reserved)
 			counter = max_reserved + 1;
 		res = counter++;
-	} while (!is_ino_hashed(sb, res));
-	spin_unlock(&unique_lock);
+	} while (!test_inode_iunique(sb, res));
+	spin_unlock(&iunique_lock);
 
 	return res;
 }
@@ -1143,21 +1063,20 @@ EXPORT_SYMBOL(iunique);
 
 struct inode *igrab(struct inode *inode)
 {
-	struct inode *ret = inode;
-
 	spin_lock(&inode->i_lock);
-	if (!(inode->i_state & (I_FREEING|I_WILL_FREE)))
-		inode_get_ilock(inode);
-	else
+	if (!(inode->i_state & (I_FREEING|I_WILL_FREE))) {
+		inode->i_ref++;
+		spin_unlock(&inode->i_lock);
+	} else {
+		spin_unlock(&inode->i_lock);
 		/*
 		 * Handle the case where s_op->clear_inode is not been
 		 * called yet, and somebody is calling igrab
 		 * while the inode is getting freed.
 		 */
-		ret = NULL;
-	spin_unlock(&inode->i_lock);
-
-	return ret;
+		inode = NULL;
+	}
+	return inode;
 }
 EXPORT_SYMBOL(igrab);
 
@@ -1181,21 +1100,19 @@ EXPORT_SYMBOL(igrab);
  * Note, @test is called with the i_lock held, so can't sleep.
  */
 static struct inode *ifind(struct super_block *sb,
-		struct inode_hash_bucket *b,
+		struct hlist_bl_head *b,
 		int (*test)(struct inode *, void *),
 		void *data, const int wait)
 {
 	struct inode *inode;
 
+	hlist_bl_lock(b);
 	inode = find_inode(sb, b, test, data);
-	if (inode) {
-		inode_get_ilock(inode);
-		spin_unlock(&inode->i_lock);
-		if (likely(wait))
-			wait_on_inode(inode);
-		return inode;
-	}
-	return NULL;
+	hlist_bl_unlock(b);
+
+	if (inode && likely(wait))
+		wait_on_inode(inode);
+	return inode;
 }
 
 /**
@@ -1214,19 +1131,18 @@ static struct inode *ifind(struct super_block *sb,
  * Otherwise NULL is returned.
  */
 static struct inode *ifind_fast(struct super_block *sb,
-		struct inode_hash_bucket *b,
+		struct hlist_bl_head *b,
 		unsigned long ino)
 {
 	struct inode *inode;
 
+	hlist_bl_lock(b);
 	inode = find_inode_fast(sb, b, ino);
-	if (inode) {
-		inode_get_ilock(inode);
-		spin_unlock(&inode->i_lock);
+	hlist_bl_unlock(b);
+
+	if (inode)
 		wait_on_inode(inode);
-		return inode;
-	}
-	return NULL;
+	return inode;
 }
 
 /**
@@ -1253,7 +1169,7 @@ static struct inode *ifind_fast(struct super_block *sb,
 struct inode *ilookup5_nowait(struct super_block *sb, unsigned long hashval,
 		int (*test)(struct inode *, void *), void *data)
 {
-	struct inode_hash_bucket *b = inode_hashtable + hash(sb, hashval);
+	struct hlist_bl_head *b = inode_hashtable + hash(sb, hashval);
 
 	return ifind(sb, b, test, data, 0);
 }
@@ -1281,7 +1197,7 @@ EXPORT_SYMBOL(ilookup5_nowait);
 struct inode *ilookup5(struct super_block *sb, unsigned long hashval,
 		int (*test)(struct inode *, void *), void *data)
 {
-	struct inode_hash_bucket *b = inode_hashtable + hash(sb, hashval);
+	struct hlist_bl_head *b = inode_hashtable + hash(sb, hashval);
 
 	return ifind(sb, b, test, data, 1);
 }
@@ -1303,7 +1219,7 @@ EXPORT_SYMBOL(ilookup5);
  */
 struct inode *ilookup(struct super_block *sb, unsigned long ino)
 {
-	struct inode_hash_bucket *b = inode_hashtable + hash(sb, ino);
+	struct hlist_bl_head *b = inode_hashtable + hash(sb, ino);
 
 	return ifind_fast(sb, b, ino);
 }
@@ -1333,7 +1249,7 @@ struct inode *iget5_locked(struct super_block *sb, unsigned long hashval,
 		int (*test)(struct inode *, void *),
 		int (*set)(struct inode *, void *), void *data)
 {
-	struct inode_hash_bucket *b = inode_hashtable + hash(sb, hashval);
+	struct hlist_bl_head *b = inode_hashtable + hash(sb, hashval);
 	struct inode *inode;
 
 	inode = ifind(sb, b, test, data, 1);
@@ -1364,7 +1280,7 @@ EXPORT_SYMBOL(iget5_locked);
  */
 struct inode *iget_locked(struct super_block *sb, unsigned long ino)
 {
-	struct inode_hash_bucket *b = inode_hashtable + hash(sb, ino);
+	struct hlist_bl_head *b = inode_hashtable + hash(sb, ino);
 	struct inode *inode;
 
 	inode = ifind_fast(sb, b, ino);
@@ -1382,43 +1298,40 @@ int insert_inode_locked(struct inode *inode)
 {
 	struct super_block *sb = inode->i_sb;
 	ino_t ino = inode->i_ino;
-	struct inode_hash_bucket *b = inode_hashtable + hash(sb, ino);
-	struct hlist_bl_node *node;
-	struct inode *old;
+	struct hlist_bl_head *b = inode_hashtable + hash(sb, ino);
 
 	inode->i_state |= I_NEW;
-
-repeat:
-	spin_lock_bucket(b);
-	hlist_bl_for_each_entry(old, node, &b->head, i_hash) {
-		if (old->i_ino != ino)
-			continue;
-		if (old->i_sb != sb)
-			continue;
-		if (old->i_state & (I_FREEING|I_WILL_FREE))
-			continue;
-		if (!spin_trylock(&old->i_lock)) {
-			spin_unlock_bucket(b);
-			cpu_relax();
-			goto repeat;
+	while (1) {
+		struct hlist_bl_node *node;
+		struct inode *old = NULL;
+		hlist_bl_lock(b);
+		hlist_bl_for_each_entry(old, node, b, i_hash) {
+			if (old->i_ino != ino)
+				continue;
+			if (old->i_sb != sb)
+				continue;
+			spin_lock(&old->i_lock);
+			if (old->i_state & (I_FREEING|I_WILL_FREE)) {
+				spin_unlock(&old->i_lock);
+				continue;
+			}
+			break;
+		}
+		if (likely(!node)) {
+			hlist_bl_add_head(&inode->i_hash, b);
+			hlist_bl_unlock(b);
+			return 0;
+		}
+		old->i_ref++;
+		spin_unlock(&old->i_lock);
+		hlist_bl_unlock(b);
+		wait_on_inode(old);
+		if (unlikely(!inode_unhashed(old))) {
+			iput(old);
+			return -EBUSY;
 		}
-		goto found_old;
-	}
-	hlist_bl_add_head_rcu(&inode->i_hash, &b->head);
-	spin_unlock_bucket(b);
-	return 0;
-
-found_old:
-	spin_unlock_bucket(b);
-	inode_get_ilock(old);
-	spin_unlock(&old->i_lock);
-	wait_on_inode(old);
-	if (unlikely(!hlist_bl_unhashed(&old->i_hash))) {
 		iput(old);
-		return -EBUSY;
 	}
-	iput(old);
-	goto repeat;
 }
 EXPORT_SYMBOL(insert_inode_locked);
 
@@ -1426,95 +1339,49 @@ int insert_inode_locked4(struct inode *inode, unsigned long hashval,
 		int (*test)(struct inode *, void *), void *data)
 {
 	struct super_block *sb = inode->i_sb;
-	struct inode_hash_bucket *b = inode_hashtable + hash(sb, hashval);
-	struct hlist_bl_node *node;
-	struct inode *old;
+	struct hlist_bl_head *b = inode_hashtable + hash(sb, hashval);
 
+	/*
+	 * Nobody else can see the new inode yet, so it is safe to set flags
+	 * without locking here.
+	 */
 	inode->i_state |= I_NEW;
 
-repeat:
-	spin_lock_bucket(b);
-	hlist_bl_for_each_entry(old, node, &b->head, i_hash) {
-		if (old->i_sb != sb)
-			continue;
-		/* XXX: audit put test outside i_lock? */
-		if (!test(old, data))
-			continue;
-		if (old->i_state & (I_FREEING|I_WILL_FREE))
-			continue;
-		if (!spin_trylock(&old->i_lock)) {
-			spin_unlock_bucket(b);
-			cpu_relax();
-			goto repeat;
+	while (1) {
+		struct hlist_bl_node *node;
+		struct inode *old = NULL;
+
+		hlist_bl_lock(b);
+		hlist_bl_for_each_entry(old, node, b, i_hash) {
+			if (old->i_sb != sb)
+				continue;
+			if (!test(old, data))
+				continue;
+			spin_lock(&old->i_lock);
+			if (old->i_state & (I_FREEING|I_WILL_FREE)) {
+				spin_unlock(&old->i_lock);
+				continue;
+			}
+			break;
+		}
+		if (likely(!node)) {
+			hlist_bl_add_head(&inode->i_hash, b);
+			hlist_bl_unlock(b);
+			return 0;
+		}
+		old->i_ref++;
+		spin_unlock(&old->i_lock);
+		hlist_bl_unlock(b);
+		wait_on_inode(old);
+		if (unlikely(!inode_unhashed(old))) {
+			iput(old);
+			return -EBUSY;
 		}
-		goto found_old;
-	}
-	hlist_bl_add_head_rcu(&inode->i_hash, &b->head);
-	spin_unlock_bucket(b);
-	return 0;
-
-found_old:
-	spin_unlock_bucket(b);
-	inode_get_ilock(old);
-	spin_unlock(&old->i_lock);
-	wait_on_inode(old);
-	if (unlikely(!hlist_bl_unhashed(&old->i_hash))) {
 		iput(old);
-		return -EBUSY;
 	}
-	iput(old);
-	goto repeat;
 }
 EXPORT_SYMBOL(insert_inode_locked4);
 
-/**
- *	__insert_inode_hash - hash an inode
- *	@inode: unhashed inode
- *	@hashval: unsigned long value used to locate this object in the
- *		inode_hashtable.
- *
- *	Add an inode to the inode hash for this superblock.
- */
-void __insert_inode_hash(struct inode *inode, unsigned long hashval)
-{
-	struct inode_hash_bucket *b = inode_hashtable + hash(inode->i_sb, hashval);
-
-	spin_lock(&inode->i_lock);
-	spin_lock_bucket(b);
-	hlist_bl_add_head_rcu(&inode->i_hash, &b->head);
-	spin_unlock_bucket(b);
-	spin_unlock(&inode->i_lock);
-}
-EXPORT_SYMBOL(__insert_inode_hash);
-
-/**
- *	__remove_inode_hash - remove an inode from the hash
- *	@inode: inode to unhash
- *
- *	Remove an inode from the superblock. inode->i_lock must be
- *	held.
- */
-static void __remove_inode_hash(struct inode *inode)
-{
-	struct inode_hash_bucket *b = inode_hashtable + hash(inode->i_sb, inode->i_ino);
-	spin_lock_bucket(b);
-	hlist_bl_del_init_rcu(&inode->i_hash);
-	spin_unlock_bucket(b);
-}
-
-/**
- *	remove_inode_hash - remove an inode from the hash
- *	@inode: inode to unhash
- *
- *	Remove an inode from the superblock.
- */
-void remove_inode_hash(struct inode *inode)
-{
-	spin_lock(&inode->i_lock);
-	__remove_inode_hash(inode);
-	spin_unlock(&inode->i_lock);
-}
-EXPORT_SYMBOL(remove_inode_hash);
 
 int generic_delete_inode(struct inode *inode)
 {
@@ -1529,7 +1396,7 @@ EXPORT_SYMBOL(generic_delete_inode);
  */
 int generic_drop_inode(struct inode *inode)
 {
-	return !inode->i_nlink || hlist_bl_unhashed(&inode->i_hash);
+	return !inode->i_nlink || inode_unhashed(inode);
 }
 EXPORT_SYMBOL_GPL(generic_drop_inode);
 
@@ -1549,6 +1416,8 @@ static void iput_final(struct inode *inode)
 	const struct super_operations *op = inode->i_sb->s_op;
 	int drop;
 
+	assert_spin_locked(&inode->i_lock);
+
 	if (op && op->drop_inode)
 		drop = op->drop_inode(inode);
 	else
@@ -1558,8 +1427,11 @@ static void iput_final(struct inode *inode)
 		if (sb->s_flags & MS_ACTIVE) {
 			inode->i_state |= I_REFERENCED;
 			if (!(inode->i_state & (I_DIRTY|I_SYNC)) &&
-					list_empty(&inode->i_lru))
-				__inode_lru_list_add(inode);
+			    list_empty(&inode->i_lru)) {
+				spin_unlock(&inode->i_lock);
+				inode_lru_list_add(inode);
+				return;
+			}
 			spin_unlock(&inode->i_lock);
 			return;
 		}
@@ -1567,32 +1439,16 @@ static void iput_final(struct inode *inode)
 		inode->i_state |= I_WILL_FREE;
 		spin_unlock(&inode->i_lock);
 		write_inode_now(inode, 1);
+		remove_inode_hash(inode);
 		spin_lock(&inode->i_lock);
 		WARN_ON(inode->i_state & I_NEW);
 		inode->i_state &= ~I_WILL_FREE;
-		__remove_inode_hash(inode);
 	}
-	if (!list_empty(&inode->i_lru))
-		__inode_lru_list_del(inode);
-	if (!list_empty(&inode->i_io)) {
-		struct bdi_writeback *wb = inode_to_wb(inode);
-		spin_lock(&wb->b_lock);
-		list_del_init(&inode->i_io);
-		spin_unlock(&wb->b_lock);
-	}
-	inode_sb_list_del(inode);
 	WARN_ON(inode->i_state & I_NEW);
 	inode->i_state |= I_FREEING;
 	spin_unlock(&inode->i_lock);
-	evict(inode);
-	/*
-	 * i_lock is required to delete from hash because find_inode_fast
-	 * might find us but go to sleep before we run wake_up_inode.
-	 */
-	remove_inode_hash(inode);
-	wake_up_inode(inode);
-	BUG_ON(inode->i_state != (I_FREEING | I_CLEAR));
-	destroy_inode(inode);
+
+	dispose_one_inode(inode);
 }
 
 /**
@@ -1607,14 +1463,14 @@ static void iput_final(struct inode *inode)
 void iput(struct inode *inode)
 {
 	if (inode) {
+		spin_lock(&inode->i_lock);
 		BUG_ON(inode->i_state & I_CLEAR);
 
-		spin_lock(&inode->i_lock);
-		inode->i_refs--;
-		if (inode->i_refs == 0)
+		if (--inode->i_ref == 0) {
 			iput_final(inode);
-		else
-			spin_unlock(&inode->i_lock);
+			return;
+		}
+		spin_unlock(&inode->i_lock);
 	}
 }
 EXPORT_SYMBOL(iput);
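
With i_count gone, the reference count is a plain int serialised by
i_lock, and iput_final() is entered with that lock still held (hence the
assert_spin_locked() added above). The caller-visible contract, condensed:

	spin_lock(&inode->i_lock);
	if (--inode->i_ref == 0)
		iput_final(inode);	/* consumes i_lock, may free inode */
	else
		spin_unlock(&inode->i_lock);

iput_final() then takes one of three exits: park the inode on the LRU with
I_REFERENCED set (MS_ACTIVE, clean), write it back under I_WILL_FREE, or
mark it I_FREEING and hand it to dispose_one_inode().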
@@ -1832,7 +1688,7 @@ void __init inode_init_early(void)
 
 	inode_hashtable =
 		alloc_large_system_hash("Inode-cache",
-					sizeof(struct inode_hash_bucket),
+					sizeof(struct hlist_bl_head),
 					ihash_entries,
 					14,
 					HASH_EARLY,
@@ -1841,13 +1697,12 @@ void __init inode_init_early(void)
 					0);
 
 	for (loop = 0; loop < (1 << i_hash_shift); loop++)
-		INIT_HLIST_BL_HEAD(&inode_hashtable[loop].head);
+		INIT_HLIST_BL_HEAD(&inode_hashtable[loop]);
 }
 
 void __init inode_init(void)
 {
 	int loop;
-	struct zone *zone;
 
 	/* inode slab cache */
 	inode_cachep = kmem_cache_create("inode_cache",
@@ -1856,15 +1711,9 @@ void __init inode_init(void)
 					 (SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|
 					 SLAB_MEM_SPREAD),
 					 init_once);
-	for_each_zone(zone) {
-		spin_lock_init(&zone->inode_lru_lock);
-		INIT_LIST_HEAD(&zone->inode_lru);
-		zone->inode_nr_lru = 0;
-		zone->inode_nr_scan = 0;
-	}
 	register_shrinker(&icache_shrinker);
-
-	lg_lock_init(inode_list_lglock);
+	percpu_counter_init(&nr_inodes, 0);
+	percpu_counter_init(&nr_inodes_unused, 0);
 
 	/* Hash may have been set up in inode_init_early */
 	if (!hashdist)
@@ -1872,7 +1721,7 @@ void __init inode_init(void)
 
 	inode_hashtable =
 		alloc_large_system_hash("Inode-cache",
-					sizeof(struct inode_hash_bucket),
+					sizeof(struct hlist_bl_head),
 					ihash_entries,
 					14,
 					0,
@@ -1881,7 +1730,7 @@ void __init inode_init(void)
 					0);
 
 	for (loop = 0; loop < (1 << i_hash_shift); loop++)
-		INIT_HLIST_BL_HEAD(&inode_hashtable[loop].head);
+		INIT_HLIST_BL_HEAD(&inode_hashtable[loop]);
 }
 
 void init_special_inode(struct inode *inode, umode_t mode, dev_t rdev)
diff --git a/fs/internal.h b/fs/internal.h
index ada4564..f8825ae 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -15,18 +15,6 @@ struct super_block;
 struct linux_binprm;
 struct path;
 
-static inline struct backing_dev_info *inode_to_bdi(struct inode *inode)
-{
-	struct super_block *sb = inode->i_sb;
-
-	if (strcmp(sb->s_type->name, "bdev") == 0)
-		return inode->i_mapping->backing_dev_info;
-
-	return sb->s_bdi;
-}
-
-#define inode_to_wb(inode)   (&inode_to_bdi(inode)->wb)
-
 /*
  * block_dev.c
  */
@@ -113,3 +101,14 @@ extern void put_super(struct super_block *sb);
 struct nameidata;
 extern struct file *nameidata_to_filp(struct nameidata *);
 extern void release_open_intent(struct nameidata *);
+
+/*
+ * inode.c
+ */
+extern void inode_lru_list_add(struct inode *inode);
+extern void inode_lru_list_del(struct inode *inode);
+
+/*
+ * fs-writeback.c
+ */
+extern void inode_wb_list_del(struct inode *inode);
diff --git a/fs/nilfs2/gcdat.c b/fs/nilfs2/gcdat.c
index 84a45d1..c51f0e8 100644
--- a/fs/nilfs2/gcdat.c
+++ b/fs/nilfs2/gcdat.c
@@ -27,6 +27,7 @@
 #include "page.h"
 #include "mdt.h"
 
+/* XXX: what protects i_state? */
 int nilfs_init_gcdat_inode(struct the_nilfs *nilfs)
 {
 	struct inode *dat = nilfs->ns_dat, *gcdat = nilfs->ns_gc_dat;
diff --git a/fs/nilfs2/gcinode.c b/fs/nilfs2/gcinode.c
index 9b2b81c..ce7344e 100644
--- a/fs/nilfs2/gcinode.c
+++ b/fs/nilfs2/gcinode.c
@@ -45,7 +45,6 @@
 #include <linux/buffer_head.h>
 #include <linux/mpage.h>
 #include <linux/hash.h>
-#include <linux/list_bl.h>
 #include <linux/slab.h>
 #include <linux/swap.h>
 #include "nilfs.h"
@@ -286,15 +285,17 @@ void nilfs_clear_gcinode(struct inode *inode)
 void nilfs_remove_all_gcinode(struct the_nilfs *nilfs)
 {
 	struct hlist_bl_head *head = nilfs->ns_gc_inodes_h;
-	struct hlist_bl_node *node, *n;
+	struct hlist_bl_node *node;
 	struct inode *inode;
 	int loop;
 
 	for (loop = 0; loop < NILFS_GCINODE_HASH_SIZE; loop++, head++) {
-		hlist_bl_for_each_entry_safe(inode, node, n, head, i_hash) {
+restart:
+		hlist_bl_for_each_entry(inode, node, head, i_hash) {
 			hlist_bl_del_init(&inode->i_hash);
 			list_del_init(&NILFS_I(inode)->i_dirty);
 			nilfs_clear_gcinode(inode); /* might sleep */
+			goto restart;
 		}
 	}
 }
diff --git a/fs/notify/inode_mark.c b/fs/notify/inode_mark.c
index 08b9888..265ecba 100644
--- a/fs/notify/inode_mark.c
+++ b/fs/notify/inode_mark.c
@@ -22,7 +22,6 @@
 #include <linux/module.h>
 #include <linux/mutex.h>
 #include <linux/spinlock.h>
-#include <linux/writeback.h>
 
 #include <asm/atomic.h>
 
@@ -232,35 +231,35 @@ out:
  * fsnotify_unmount_inodes - an sb is unmounting.  handle any watched inodes.
  * @list: list of inodes being unmounted (sb->s_inodes)
  *
- * Called with iprune_mutex held, keeping shrink_icache_memory() at bay,
- * and with the sb going away, no new inodes will appear or be referenced
- * from other paths.
+ * Called with iprune_sem held, keeping shrink_icache_memory() at bay.
+ * sb->s_inodes_lock protects the super block's list of inodes.
  */
-void fsnotify_unmount_inodes(struct super_block *sb)
+void fsnotify_unmount_inodes(struct list_head *list)
 {
 	struct inode *inode, *next_i, *need_iput = NULL;
 
-	do_inode_list_for_each_entry_safe(sb, inode, next_i) {
+	list_for_each_entry_safe(inode, next_i, list, i_sb_list) {
 		struct inode *need_iput_tmp;
+		struct super_block *sb = inode->i_sb;
 
-		spin_lock(&inode->i_lock);
 		/*
-		 * We cannot inode_get() an inode in state I_FREEING,
+		 * We cannot iref() an inode in state I_FREEING,
 		 * I_WILL_FREE, or I_NEW which is fine because by that point
 		 * the inode cannot have any associated watches.
 		 */
+		spin_lock(&inode->i_lock);
 		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
 			spin_unlock(&inode->i_lock);
 			continue;
 		}
 
 		/*
-		 * If i_refs is zero, the inode cannot have any watches and
-		 * doing an inode_get/iput with MS_ACTIVE clear would actually
-		 * evict all inodes with zero i_refs from icache which is
+		 * If i_ref is zero, the inode cannot have any watches and
+		 * doing an iref/iput with MS_ACTIVE clear would actually
+		 * evict all inodes with zero i_ref from icache which is
 		 * unnecessarily violent and may in fact be illegal to do.
 		 */
-		if (!inode->i_refs) {
+		if (!inode->i_ref) {
 			spin_unlock(&inode->i_lock);
 			continue;
 		}
@@ -270,7 +269,7 @@ void fsnotify_unmount_inodes(struct super_block *sb)
 
 		/* In case fsnotify_inode_delete() drops a reference. */
 		if (inode != need_iput_tmp)
-			inode_get_ilock(inode);
+			inode->i_ref++;
 		else
 			need_iput_tmp = NULL;
 		spin_unlock(&inode->i_lock);
@@ -278,14 +277,22 @@ void fsnotify_unmount_inodes(struct super_block *sb)
 		/* In case the dropping of a reference would nuke next_i. */
 		if (&next_i->i_sb_list != list) {
 			spin_lock(&next_i->i_lock);
-			if (next_i->i_refs &&
+			if (next_i->i_ref &&
 			    !(next_i->i_state & (I_FREEING | I_WILL_FREE))) {
-				inode_get_ilock(next_i);
+				next_i->i_ref++;
 				need_iput = next_i;
 			}
 			spin_unlock(&next_i->i_lock);
 		}
 
+		/*
+		 * We can safely drop sb->s_inodes_lock here because we hold
+		 * references on both inode and next_i.  Also no new inodes
+		 * will be added since the umount has begun.  Finally,
+		 * iprune_sem keeps shrink_icache_memory() away.
+		 */
+		spin_unlock(&sb->s_inodes_lock);
+
 		if (need_iput_tmp)
 			iput(need_iput_tmp);
 
@@ -295,5 +302,7 @@ void fsnotify_unmount_inodes(struct super_block *sb)
 		fsnotify_inode_delete(inode);
 
 		iput(inode);
-	} while_inode_list_for_each_entry_safe
+
+		spin_lock(&sb->s_inodes_lock);
+	}
 }
diff --git a/fs/notify/mark.c b/fs/notify/mark.c
index 325185e..50c0085 100644
--- a/fs/notify/mark.c
+++ b/fs/notify/mark.c
@@ -91,7 +91,6 @@
 #include <linux/slab.h>
 #include <linux/spinlock.h>
 #include <linux/srcu.h>
-#include <linux/writeback.h> /* for inode_lock */
 
 #include <asm/atomic.h>
 
diff --git a/fs/notify/vfsmount_mark.c b/fs/notify/vfsmount_mark.c
index 56772b5..6f8eefe 100644
--- a/fs/notify/vfsmount_mark.c
+++ b/fs/notify/vfsmount_mark.c
@@ -23,7 +23,6 @@
 #include <linux/mount.h>
 #include <linux/mutex.h>
 #include <linux/spinlock.h>
-#include <linux/writeback.h> /* for inode_lock */
 
 #include <asm/atomic.h>
 
diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index aed3559..178bed4 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -246,7 +246,6 @@ struct dqstats dqstats;
 EXPORT_SYMBOL(dqstats);
 
 static qsize_t inode_get_rsv_space(struct inode *inode);
-static qsize_t __inode_get_rsv_space(struct inode *inode);
 static void __dquot_initialize(struct inode *inode, int type);
 
 static inline unsigned int
@@ -897,41 +896,35 @@ static void add_dquot_ref(struct super_block *sb, int type)
 	int reserved = 0;
 #endif
 
-	rcu_read_lock();
-	do_inode_list_for_each_entry_rcu(sb, inode) {
+	spin_lock(&sb->s_inodes_lock);
+	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 		spin_lock(&inode->i_lock);
-		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
+		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
+		    !atomic_read(&inode->i_writecount) ||
+		    !dqinit_needed(inode, type)) {
 			spin_unlock(&inode->i_lock);
 			continue;
 		}
 #ifdef CONFIG_QUOTA_DEBUG
-		if (unlikely(__inode_get_rsv_space(inode) > 0))
+		if (unlikely(inode_get_rsv_space(inode) > 0))
 			reserved = 1;
 #endif
-		if (!atomic_read(&inode->i_writecount)) {
-			spin_unlock(&inode->i_lock);
-			continue;
-		}
-		if (!dqinit_needed(inode, type)) {
-			spin_unlock(&inode->i_lock);
-			continue;
-		}
 
-		inode_get_ilock(inode);
+		inode->i_ref++;
 		spin_unlock(&inode->i_lock);
-		rcu_read_unlock();
+		spin_unlock(&sb->s_inodes_lock);
 
 		iput(old_inode);
 		__dquot_initialize(inode, type);
 		/* We hold a reference to 'inode' so it couldn't have been
-		 * removed from s_inodes list while we dropped the
-		 * i_lock.  We cannot iput the inode now as we can
-		 * be holding the last reference and we cannot iput it under
-		 * lock. So we keep the reference and iput it later. */
+		 * removed from s_inodes list while we dropped the lock.
+		 * We cannot iput the inode now as we can be holding the last
+		 * reference and we cannot iput it under the lock. So we
+		 * keep the reference and iput it later. */
 		old_inode = inode;
-		rcu_read_lock();
-	} while_inode_list_for_each_entry_rcu
-	rcu_read_unlock();
+		spin_lock(&sb->s_inodes_lock);
+	}
+	spin_unlock(&sb->s_inodes_lock);
 	iput(old_inode);
 
 #ifdef CONFIG_QUOTA_DEBUG
@@ -1012,8 +1005,8 @@ static void remove_dquot_ref(struct super_block *sb, int type,
 	struct inode *inode;
 	int reserved = 0;
 
-	rcu_read_lock();
-	do_inode_list_for_each_entry_rcu(sb, inode) {
+	spin_lock(&sb->s_inodes_lock);
+	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 		/*
 		 *  We have to scan also I_NEW inodes because they can already
 		 *  have quota pointer initialized. Luckily, we need to touch
@@ -1025,8 +1018,8 @@ static void remove_dquot_ref(struct super_block *sb, int type,
 				reserved = 1;
 			remove_inode_dquot_ref(inode, type, tofree_head);
 		}
-	} while_inode_list_for_each_entry_rcu
-	rcu_read_unlock();
+	}
+	spin_unlock(&sb->s_inodes_lock);
 #ifdef CONFIG_QUOTA_DEBUG
 	if (reserved) {
 		printk(KERN_WARNING "VFS (%s): Writes happened after quota"
@@ -1497,17 +1490,6 @@ void inode_sub_rsv_space(struct inode *inode, qsize_t number)
 }
 EXPORT_SYMBOL(inode_sub_rsv_space);
 
-/* no i_lock variant of inode_get_rsv_space */
-static qsize_t __inode_get_rsv_space(struct inode *inode)
-{
-	qsize_t ret;
-
-	if (!inode->i_sb->dq_op->get_reserved_space)
-		return 0;
-	ret = *inode_reserved_space(inode);
-	return ret;
-}
-
 static qsize_t inode_get_rsv_space(struct inode *inode)
 {
 	qsize_t ret;
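
The old_inode idiom in add_dquot_ref() above deserves a note: iput() can
sleep and take other locks, so it must never run under s_inodes_lock.
The loop therefore parks the reference it holds and drops it one
iteration later, outside the lock (a sketch distilled from the hunk,
with the skip checks and quota debug code elided):

	struct inode *inode, *old_inode = NULL;

	spin_lock(&sb->s_inodes_lock);
	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
		spin_lock(&inode->i_lock);
		inode->i_ref++;			/* pin under i_lock */
		spin_unlock(&inode->i_lock);
		spin_unlock(&sb->s_inodes_lock);

		iput(old_inode);		/* previous pin, now lock-free */
		__dquot_initialize(inode, type);
		old_inode = inode;		/* our ref keeps it on s_inodes */
		spin_lock(&sb->s_inodes_lock);
	}
	spin_unlock(&sb->s_inodes_lock);
	iput(old_inode);			/* drop the last parked reference */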
diff --git a/fs/super.c b/fs/super.c
index 573c040..c5332e5 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -67,28 +67,16 @@ static struct super_block *alloc_super(struct file_system_type *type)
 			for_each_possible_cpu(i)
 				INIT_LIST_HEAD(per_cpu_ptr(s->s_files, i));
 		}
-		s->s_inodes = alloc_percpu(struct list_head);
-		if (!s->s_inodes) {
-			free_percpu(s->s_files);
-			security_sb_free(s);
-			kfree(s);
-			s = NULL;
-			goto out;
-		} else {
-			int i;
-
-			for_each_possible_cpu(i)
-				INIT_LIST_HEAD(per_cpu_ptr(s->s_inodes, i));
-		}
 #else
 		INIT_LIST_HEAD(&s->s_files);
-		INIT_LIST_HEAD(&s->s_inodes);
 #endif
 		INIT_LIST_HEAD(&s->s_instances);
 		INIT_HLIST_HEAD(&s->s_anon);
+		INIT_LIST_HEAD(&s->s_inodes);
 		INIT_LIST_HEAD(&s->s_dentry_lru);
 		init_rwsem(&s->s_umount);
 		mutex_init(&s->s_lock);
+		spin_lock_init(&s->s_inodes_lock);
 		lockdep_set_class(&s->s_umount, &type->s_umount_key);
 		/*
 		 * The locking rules for s_lock are up to the
@@ -137,7 +125,6 @@ out:
 static inline void destroy_super(struct super_block *s)
 {
 #ifdef CONFIG_SMP
-	free_percpu(s->s_inodes);
 	free_percpu(s->s_files);
 #endif
 	security_sb_free(s);
--- a/fs/xfs/linux-2.6/xfs_iops.c
+++ b/fs/xfs/linux-2.6/xfs_iops.c
@@ -795,7 +795,9 @@ xfs_setup_inode(
 
 	inode->i_ino = ip->i_ino;
 	inode->i_state = I_NEW;
-	inode_add_to_lists(ip->i_mount->m_super, inode);
+
+	inode_sb_list_add(inode);
+	insert_inode_hash(inode);
 
 	inode->i_mode	= ip->i_d.di_mode;
 	inode->i_nlink	= ip->i_d.di_nlink;
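
inode_sb_list_add() itself is not visible in this mail; given the
s_inodes/s_inodes_lock changes in fs/super.c above, it presumably
reduces to something like this (an assumption about the helper's body,
not a quote from the patch):

void inode_sb_list_add(struct inode *inode)
{
	struct super_block *sb = inode->i_sb;

	spin_lock(&sb->s_inodes_lock);
	list_add(&inode->i_sb_list, &sb->s_inodes);
	spin_unlock(&sb->s_inodes_lock);
}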
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index a87f6e7..995a3ad 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -16,7 +16,6 @@
 #include <linux/sched.h>
 #include <linux/timer.h>
 #include <linux/writeback.h>
-#include <linux/spinlock.h>
 #include <asm/atomic.h>
 
 struct page;
@@ -55,10 +54,10 @@ struct bdi_writeback {
 
 	struct task_struct *task;	/* writeback thread */
 	struct timer_list wakeup_timer; /* used for delayed bdi thread wakeup */
-	spinlock_t b_lock;		/* lock for inode lists */
 	struct list_head b_dirty;	/* dirty inodes */
 	struct list_head b_io;		/* parked for writeback */
 	struct list_head b_more_io;	/* parked for more writeback */
+	spinlock_t b_lock;		/* writeback lists lock */
 };
 
 struct backing_dev_info {
@@ -110,6 +109,8 @@ int bdi_writeback_thread(void *data);
 int bdi_has_dirty_io(struct backing_dev_info *bdi);
 void bdi_arm_supers_timer(void);
 void bdi_wakeup_thread_delayed(struct backing_dev_info *bdi);
+void bdi_lock_two(struct backing_dev_info *bdi1,
+				struct backing_dev_info *bdi2);
 
 extern spinlock_t bdi_lock;
 extern struct list_head bdi_list;
diff --git a/include/linux/bit_spinlock.h b/include/linux/bit_spinlock.h
index e612575..7113a32 100644
--- a/include/linux/bit_spinlock.h
+++ b/include/linux/bit_spinlock.h
@@ -1,10 +1,6 @@
 #ifndef __LINUX_BIT_SPINLOCK_H
 #define __LINUX_BIT_SPINLOCK_H
 
-#include <linux/kernel.h>
-#include <linux/preempt.h>
-#include <asm/atomic.h>
-
 /*
  *  bit-based spin_lock()
  *
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 9063486..213272b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -45,6 +45,7 @@ struct inodes_stat_t {
 	int dummy[5];		/* padding for sysctl ABI compatibility */
 };
 
+
 #define NR_FILE  8192	/* this can well be larger on a larger system */
 
 #define MAY_EXEC 1
@@ -374,8 +375,6 @@ struct inodes_stat_t {
 #include <linux/cache.h>
 #include <linux/kobject.h>
 #include <linux/list.h>
-#include <linux/rculist.h>
-#include <linux/rculist_bl.h>
 #include <linux/radix-tree.h>
 #include <linux/prio_tree.h>
 #include <linux/init.h>
@@ -384,6 +383,7 @@ struct inodes_stat_t {
 #include <linux/capability.h>
 #include <linux/semaphore.h>
 #include <linux/fiemap.h>
+#include <linux/list_bl.h>
 
 #include <asm/atomic.h>
 #include <asm/byteorder.h>
@@ -408,8 +408,7 @@ extern struct files_stat_struct files_stat;
 extern int get_max_files(void);
 extern int sysctl_nr_open;
 extern struct inodes_stat_t inodes_stat;
-extern int get_nr_inodes(void);
-extern int get_nr_inodes_unused(void);
+extern int get_nr_dirty_inodes(void);
 extern int leases_enable, lease_break_time;
 
 struct buffer_head;
@@ -727,18 +726,12 @@ struct posix_acl;
 
 struct inode {
 	struct hlist_bl_node	i_hash;
-	struct list_head	i_io;		/* backing dev IO list */
+	struct list_head	i_wb_list;	/* backing dev IO list */
 	struct list_head	i_lru;		/* inode LRU list */
 	struct list_head	i_sb_list;
-	union {
-		struct list_head	i_dentry;
-		struct rcu_head		i_rcu;
-	};
+	struct list_head	i_dentry;
 	unsigned long		i_ino;
-#ifdef CONFIG_SMP
-	int			i_sb_list_cpu;
-#endif
-	unsigned int		i_refs;
+	unsigned int		i_ref;
 	unsigned int		i_nlink;
 	uid_t			i_uid;
 	gid_t			i_gid;
@@ -797,6 +790,11 @@ struct inode {
 	void			*i_private; /* fs or device private pointer */
 };
 
+static inline int inode_unhashed(struct inode *inode)
+{
+	return hlist_bl_unhashed(&inode->i_hash);
+}
+
 /*
  * inode->i_mutex nesting subclasses for the lock validator:
  *
@@ -1349,12 +1347,12 @@ struct super_block {
 #endif
 	const struct xattr_handler **s_xattr;
 
+	spinlock_t		s_inodes_lock;	/* lock for s_inodes */
+	struct list_head	s_inodes;	/* all inodes */
 	struct hlist_head	s_anon;		/* anonymous dentries for (nfs) exporting */
 #ifdef CONFIG_SMP
-	struct list_head __percpu *s_inodes;
 	struct list_head __percpu *s_files;
 #else
-	struct list_head	s_inodes;	/* all inodes */
 	struct list_head	s_files;
 #endif
 	/* s_dentry_lru and s_nr_dentry_unused are protected by dcache_lock */
@@ -1621,7 +1619,7 @@ struct super_operations {
  *			also cause waiting on I_NEW, without I_NEW actually
  *			being set.  find_inode() uses this to prevent returning
  *			nearly-dead inodes.
- * I_WILL_FREE		Must be set when calling write_inode_now() if i_refs
+ * I_WILL_FREE		Must be set when calling write_inode_now() if i_ref
  *			is zero.  I_FREEING must be set when I_WILL_FREE is
  *			cleared.
  * I_FREEING		Set when inode is about to be freed but still has dirty
@@ -2088,8 +2086,6 @@ extern int check_disk_change(struct block_device *);
 extern int __invalidate_device(struct block_device *);
 extern int invalidate_partition(struct gendisk *, int);
 #endif
-extern void __inode_lru_list_add(struct inode *inode);
-extern void __inode_lru_list_del(struct inode *inode);
 extern int invalidate_inodes(struct super_block *);
 unsigned long invalidate_mapping_pages(struct address_space *mapping,
 					pgoff_t start, pgoff_t end);
@@ -2174,7 +2170,6 @@ extern loff_t vfs_llseek(struct file *file, loff_t offset, int origin);
 
 extern int inode_init_always(struct super_block *, struct inode *);
 extern void inode_init_once(struct inode *);
-extern void inode_add_to_lists(struct super_block *, struct inode *);
 extern void iput(struct inode *);
 extern struct inode * igrab(struct inode *);
 extern ino_t iunique(struct super_block *, ino_t);
@@ -2194,74 +2189,24 @@ extern struct inode * iget_locked(struct super_block *, unsigned long);
 extern int insert_inode_locked4(struct inode *, unsigned long, int (*test)(struct inode *, void *), void *);
 extern int insert_inode_locked(struct inode *);
 extern void unlock_new_inode(struct inode *);
-
 extern unsigned int get_next_ino(void);
+
+extern void iref(struct inode *inode);
 extern void iget_failed(struct inode *);
 extern void end_writeback(struct inode *);
 extern void destroy_inode(struct inode *);
 extern void __destroy_inode(struct inode *);
 extern struct inode *new_inode(struct super_block *);
-extern struct inode *new_anon_inode(struct super_block *);
-extern void free_inode_nonrcu(struct inode *inode);
 extern int should_remove_suid(struct dentry *);
 extern int file_remove_suid(struct file *);
 
 extern void __insert_inode_hash(struct inode *, unsigned long hashval);
 extern void remove_inode_hash(struct inode *);
-static inline void insert_inode_hash(struct inode *inode) {
+static inline void insert_inode_hash(struct inode *inode)
+{
 	__insert_inode_hash(inode, inode->i_ino);
 }
-
-#ifdef CONFIG_SMP
-/*
- * These macros iterate all inodes on all CPUs for a given superblock.
- * rcu_read_lock must be held.
- */
-#define do_inode_list_for_each_entry_rcu(__sb, __inode)		\
-{								\
-	int i;							\
-	for_each_possible_cpu(i) {				\
-		struct list_head *list;				\
-		list = per_cpu_ptr((__sb)->s_inodes, i);	\
-		list_for_each_entry_rcu((__inode), list, i_sb_list)
-
-#define while_inode_list_for_each_entry_rcu			\
-	}							\
-}
-
-#define do_inode_list_for_each_entry_safe(__sb, __inode, __tmp)	\
-{								\
-	int i;							\
-	for_each_possible_cpu(i) {				\
-		struct list_head *list;				\
-		list = per_cpu_ptr((__sb)->s_inodes, i);	\
-		list_for_each_entry_safe((__inode), (__tmp), list, i_sb_list)
-
-#define while_inode_list_for_each_entry_safe			\
-	}							\
-}
-
-#else
-
-#define do_inode_list_for_each_entry_rcu(__sb, __inode)		\
-{								\
-	struct list_head *list;					\
-	list = &(sb)->s_inodes;					\
-	list_for_each_entry_rcu((__inode), list, i_sb_list)
-
-#define while_inode_list_for_each_entry_rcu			\
-}
-
-#define do_inode_list_for_each_entry_safe(__sb, __inode, __tmp)	\
-{								\
-	struct list_head *list;					\
-	list = &(sb)->s_inodes;					\
-	list_for_each_entry_safe((__inode), (__tmp), list, i_sb_list)
-
-#define while_inode_list_for_each_entry_safe			\
-}
-
-#endif
+extern void inode_sb_list_add(struct inode *inode);
 
 #ifdef CONFIG_BLOCK
 extern void submit_bio(int, struct bio *);
@@ -2462,20 +2407,6 @@ extern int generic_show_options(struct seq_file *m, struct vfsmount *mnt);
 extern void save_mount_options(struct super_block *sb, char *options);
 extern void replace_mount_options(struct super_block *sb, char *options);
 
-static inline void inode_get_ilock(struct inode *inode)
-{
-	assert_spin_locked(&inode->i_lock);
-	BUG_ON(inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE));
-	inode->i_refs++;
-}
-
-static inline void inode_get(struct inode *inode)
-{
-	spin_lock(&inode->i_lock);
-	inode_get_ilock(inode);
-	spin_unlock(&inode->i_lock);
-}
-
 static inline ino_t parent_ino(struct dentry *dentry)
 {
 	ino_t res;
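
iref(), declared in the hunk above, replaces the deleted
inode_get()/inode_get_ilock() pair. Judging from the removed helpers it
is presumably equivalent to the following (an assumption; the real
definition lives in fs/inode.c, which is not quoted here):

void iref(struct inode *inode)
{
	spin_lock(&inode->i_lock);
	BUG_ON(inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE));
	inode->i_ref++;
	spin_unlock(&inode->i_lock);
}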
diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
index d1849f9..e40190d 100644
--- a/include/linux/fsnotify_backend.h
+++ b/include/linux/fsnotify_backend.h
@@ -402,7 +402,7 @@ extern void fsnotify_clear_marks_by_group_flags(struct fsnotify_group *group, un
 extern void fsnotify_clear_marks_by_group(struct fsnotify_group *group);
 extern void fsnotify_get_mark(struct fsnotify_mark *mark);
 extern void fsnotify_put_mark(struct fsnotify_mark *mark);
-extern void fsnotify_unmount_inodes(struct super_block *sb);
+extern void fsnotify_unmount_inodes(struct list_head *list);
 
 /* put here because inotify does some weird stuff when destroying watches */
 extern struct fsnotify_event *fsnotify_create_event(struct inode *to_tell, __u32 mask,
@@ -443,7 +443,7 @@ static inline u32 fsnotify_get_cookie(void)
 	return 0;
 }
 
-static inline void fsnotify_unmount_inodes(struct super_block *sb)
+static inline void fsnotify_unmount_inodes(struct list_head *list)
 {}
 
 #endif	/* CONFIG_FSNOTIFY */
diff --git a/include/linux/list_bl.h b/include/linux/list_bl.h
index c2034b9..5bb2370 100644
--- a/include/linux/list_bl.h
+++ b/include/linux/list_bl.h
@@ -36,13 +36,13 @@ struct hlist_bl_node {
 #define INIT_HLIST_BL_HEAD(ptr) \
 	((ptr)->first = NULL)
 
-static inline void INIT_HLIST_BL_NODE(struct hlist_bl_node *h)
+static inline void init_hlist_bl_node(struct hlist_bl_node *h)
 {
 	h->next = NULL;
 	h->pprev = NULL;
 }
 
-#define hlist_bl_entry(ptr, type, member) container_of(ptr,type,member)
+#define hlist_bl_entry(ptr, type, member) container_of(ptr, type, member)
 
 static inline int hlist_bl_unhashed(const struct hlist_bl_node *h)
 {
@@ -98,15 +98,15 @@ static inline void __hlist_bl_del(struct hlist_bl_node *n)
 static inline void hlist_bl_del(struct hlist_bl_node *n)
 {
 	__hlist_bl_del(n);
-	n->next = LIST_POISON1;
-	n->pprev = LIST_POISON2;
+	n->next = BL_LIST_POISON1;
+	n->pprev = BL_LIST_POISON2;
 }
 
 static inline void hlist_bl_del_init(struct hlist_bl_node *n)
 {
 	if (!hlist_bl_unhashed(n)) {
 		__hlist_bl_del(n);
-		INIT_HLIST_BL_NODE(n);
+		init_hlist_bl_node(n);
 	}
 }
 
@@ -121,21 +121,26 @@ static inline void hlist_bl_del_init(struct hlist_bl_node *n)
 #define hlist_bl_for_each_entry(tpos, pos, head, member)		\
 	for (pos = hlist_bl_first(head);				\
 	     pos &&							\
-		({ tpos = hlist_bl_entry(pos, typeof(*tpos), member); 1;}); \
+		({ tpos = hlist_bl_entry(pos, typeof(*tpos), member); 1; }); \
 	     pos = pos->next)
 
+
+
 /**
- * hlist_bl_for_each_entry_safe - iterate over list of given type safe against removal of list entry
- * @tpos:	the type * to use as a loop cursor.
- * @pos:	the &struct hlist_node to use as a loop cursor.
- * @n:		another &struct hlist_node to use as temporary storage
- * @head:	the head for your list.
- * @member:	the name of the hlist_node within the struct.
+ * hlist_bl_lock	- lock a hash list
+ * @h:	hash list head to lock
  */
-#define hlist_bl_for_each_entry_safe(tpos, pos, n, head, member)	 \
-	for (pos = hlist_bl_first(head);				 \
-	     pos && ({ n = pos->next; 1; }) && 				 \
-		({ tpos = hlist_bl_entry(pos, typeof(*tpos), member); 1;}); \
-	     pos = n)
+static inline void hlist_bl_lock(struct hlist_bl_head *h)
+{
+	bit_spin_lock(0, (unsigned long *)h);
+}
 
+/**
+ * hlist_bl_unlock	- unlock a hash list
+ * @h:	hash list head to unlock
+ */
+static inline void hlist_bl_unlock(struct hlist_bl_head *h)
+{
+	__bit_spin_unlock(0, (unsigned long *)h);
+}
 #endif
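
Typical use of the two new helpers, in the style of the inode hash
conversion elsewhere in the series (a sketch; the bucket lookup via
inode_hashtable and hash() is hypothetical here):

	struct hlist_bl_head *b = inode_hashtable + hash(sb, hashval);

	hlist_bl_lock(b);	/* spins on bit 0 of b->first */
	hlist_bl_add_head(&inode->i_hash, b);
	hlist_bl_unlock(b);

The lock lives in the low bit of the head pointer itself, so a hash
table of hlist_bl_heads gets one lock per bucket with no extra memory.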
diff --git a/include/linux/poison.h b/include/linux/poison.h
index 2110a81..d367d39 100644
--- a/include/linux/poison.h
+++ b/include/linux/poison.h
@@ -22,6 +22,8 @@
 #define LIST_POISON1  ((void *) 0x00100100 + POISON_POINTER_DELTA)
 #define LIST_POISON2  ((void *) 0x00200200 + POISON_POINTER_DELTA)
 
+#define BL_LIST_POISON1  ((void *) 0x00300300 + POISON_POINTER_DELTA)
+#define BL_LIST_POISON2  ((void *) 0x00400400 + POISON_POINTER_DELTA)
 /********** include/linux/timer.h **********/
 /*
  * Magic number "tsta" to indicate a static timer initializer
diff --git a/include/linux/rculist_bl.h b/include/linux/rculist_bl.h
deleted file mode 100644
index cdfb54e..0000000
--- a/include/linux/rculist_bl.h
+++ /dev/null
@@ -1,128 +0,0 @@
-#ifndef _LINUX_RCULIST_BL_H
-#define _LINUX_RCULIST_BL_H
-
-/*
- * RCU-protected bl list version. See include/linux/list_bl.h.
- */
-#include <linux/list_bl.h>
-#include <linux/rcupdate.h>
-#include <linux/bit_spinlock.h>
-
-static inline void hlist_bl_set_first_rcu(struct hlist_bl_head *h,
-					struct hlist_bl_node *n)
-{
-	LIST_BL_BUG_ON((unsigned long)n & LIST_BL_LOCKMASK);
-	LIST_BL_BUG_ON(!bit_spin_is_locked(0, (unsigned long *)&h->first));
-	rcu_assign_pointer(h->first,
-		(struct hlist_bl_node *)((unsigned long)n | LIST_BL_LOCKMASK));
-}
-
-static inline struct hlist_bl_node *hlist_bl_first_rcu(struct hlist_bl_head *h)
-{
-	return (struct hlist_bl_node *)
-		((unsigned long)rcu_dereference(h->first) & ~LIST_BL_LOCKMASK);
-}
-
-/**
- * hlist_bl_del_init_rcu - deletes entry from hash list with re-initialization
- * @n: the element to delete from the hash list.
- *
- * Note: hlist_bl_unhashed() on the node returns true after this. It is
- * useful for RCU based read lockfree traversal if the writer side
- * must know if the list entry is still hashed or already unhashed.
- *
- * In particular, it means that we can not poison the forward pointers
- * that may still be used for walking the hash list and we can only
- * zero the pprev pointer so list_unhashed() will return true after
- * this.
- *
- * The caller must take whatever precautions are necessary (such as
- * holding appropriate locks) to avoid racing with another
- * list-mutation primitive, such as hlist_bl_add_head_rcu() or
- * hlist_bl_del_rcu(), running on this same list.  However, it is
- * perfectly legal to run concurrently with the _rcu list-traversal
- * primitives, such as hlist_bl_for_each_entry_rcu().
- */
-static inline void hlist_bl_del_init_rcu(struct hlist_bl_node *n)
-{
-	if (!hlist_bl_unhashed(n)) {
-		__hlist_bl_del(n);
-		n->pprev = NULL;
-	}
-}
-
-/**
- * hlist_bl_del_rcu - deletes entry from hash list without re-initialization
- * @n: the element to delete from the hash list.
- *
- * Note: hlist_bl_unhashed() on entry does not return true after this,
- * the entry is in an undefined state. It is useful for RCU based
- * lockfree traversal.
- *
- * In particular, it means that we can not poison the forward
- * pointers that may still be used for walking the hash list.
- *
- * The caller must take whatever precautions are necessary
- * (such as holding appropriate locks) to avoid racing
- * with another list-mutation primitive, such as hlist_bl_add_head_rcu()
- * or hlist_bl_del_rcu(), running on this same list.
- * However, it is perfectly legal to run concurrently with
- * the _rcu list-traversal primitives, such as
- * hlist_bl_for_each_entry().
- */
-static inline void hlist_bl_del_rcu(struct hlist_bl_node *n)
-{
-	__hlist_bl_del(n);
-	n->pprev = LIST_POISON2;
-}
-
-/**
- * hlist_bl_add_head_rcu
- * @n: the element to add to the hash list.
- * @h: the list to add to.
- *
- * Description:
- * Adds the specified element to the specified hlist_bl,
- * while permitting racing traversals.
- *
- * The caller must take whatever precautions are necessary
- * (such as holding appropriate locks) to avoid racing
- * with another list-mutation primitive, such as hlist_bl_add_head_rcu()
- * or hlist_bl_del_rcu(), running on this same list.
- * However, it is perfectly legal to run concurrently with
- * the _rcu list-traversal primitives, such as
- * hlist_bl_for_each_entry_rcu(), used to prevent memory-consistency
- * problems on Alpha CPUs.  Regardless of the type of CPU, the
- * list-traversal primitive must be guarded by rcu_read_lock().
- */
-static inline void hlist_bl_add_head_rcu(struct hlist_bl_node *n,
-					struct hlist_bl_head *h)
-{
-	struct hlist_bl_node *first;
-
-	/* don't need hlist_bl_first_rcu because we're under lock */
-	first = hlist_bl_first(h);
-
-	n->next = first;
-	if (first)
-		first->pprev = &n->next;
-	n->pprev = &h->first;
-
-	/* need _rcu because we can have concurrent lock free readers */
-	hlist_bl_set_first_rcu(h, n);
-}
-/**
- * hlist_bl_for_each_entry_rcu - iterate over rcu list of given type
- * @tpos:	the type * to use as a loop cursor.
- * @pos:	the &struct hlist_bl_node to use as a loop cursor.
- * @head:	the head for your list.
- * @member:	the name of the hlist_bl_node within the struct.
- *
- */
-#define hlist_bl_for_each_entry_rcu(tpos, pos, head, member)		\
-	for (pos = hlist_bl_first_rcu(head);				\
-		pos &&							\
-		({ tpos = hlist_bl_entry(pos, typeof(*tpos), member); 1; }); \
-		pos = rcu_dereference_raw(pos->next))
-
-#endif
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index d52ae7c..af060d4 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -74,11 +74,11 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
 
 	nr_wb = nr_dirty = nr_io = nr_more_io = 0;
 	spin_lock(&wb->b_lock);
-	list_for_each_entry(inode, &wb->b_dirty, i_io)
+	list_for_each_entry(inode, &wb->b_dirty, i_wb_list)
 		nr_dirty++;
-	list_for_each_entry(inode, &wb->b_io, i_io)
+	list_for_each_entry(inode, &wb->b_io, i_wb_list)
 		nr_io++;
-	list_for_each_entry(inode, &wb->b_more_io, i_io)
+	list_for_each_entry(inode, &wb->b_more_io, i_wb_list)
 		nr_more_io++;
 	spin_unlock(&wb->b_lock);
 
@@ -631,10 +631,10 @@ static void bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
 
 	wb->bdi = bdi;
 	wb->last_old_flush = jiffies;
-	spin_lock_init(&wb->b_lock);
 	INIT_LIST_HEAD(&wb->b_dirty);
 	INIT_LIST_HEAD(&wb->b_io);
 	INIT_LIST_HEAD(&wb->b_more_io);
+	spin_lock_init(&wb->b_lock);
 	setup_timer(&wb->wakeup_timer, wakeup_timer_fn, (unsigned long)bdi);
 }
 
@@ -672,7 +672,8 @@ err:
 }
 EXPORT_SYMBOL(bdi_init);
 
-static void bdi_lock_two(struct backing_dev_info *bdi1, struct backing_dev_info *bdi2)
+void bdi_lock_two(struct backing_dev_info *bdi1,
+				struct backing_dev_info *bdi2)
 {
 	if (bdi1 < bdi2) {
 		spin_lock(&bdi1->wb.b_lock);
@@ -682,6 +683,7 @@ static void bdi_lock_two(struct backing_dev_info *bdi1, struct backing_dev_info
 		spin_lock_nested(&bdi1->wb.b_lock, 1);
 	}
 }
+EXPORT_SYMBOL(bdi_lock_two);
 
 void bdi_destroy(struct backing_dev_info *bdi)
 {
@@ -695,13 +697,6 @@ void bdi_destroy(struct backing_dev_info *bdi)
 		struct bdi_writeback *dst = &default_backing_dev_info.wb;
 
 		bdi_lock_two(bdi, &default_backing_dev_info);
-		/*
-		 * It's OK to move inodes between different wb lists without
-		 * locking the individual inodes. i_lock will still protect
-		 * whether or not it is on a writeback list or not. However it
-		 * is a little quirk, maybe better to lock all inodes in this
-		 * uncommon case just to keep locking very regular.
-		 */
 		list_splice(&bdi->wb.b_dirty, &dst->b_dirty);
 		list_splice(&bdi->wb.b_io, &dst->b_io);
 		list_splice(&bdi->wb.b_more_io, &dst->b_more_io);
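
One detail worth noting in the mm/backing-dev.c hunk: bdi_lock_two()
always takes the lower-addressed b_lock first. That fixed global order
is what prevents ABBA deadlock when two callers lock the same pair of
bdis from opposite ends; the interleaving below is illustrative only:

	/*
	 * Without a fixed order:
	 *   CPU0                           CPU1
	 *   spin_lock(&bdiA->wb.b_lock);   spin_lock(&bdiB->wb.b_lock);
	 *   spin_lock(&bdiB->wb.b_lock);   spin_lock(&bdiA->wb.b_lock);
	 * and both spin forever.  Ordering by pointer value makes every
	 * caller take bdiA's lock first, so one of them always wins.
	 * spin_lock_nested() only tells lockdep that taking two locks
	 * of the same class here is intentional.
	 */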

^ permalink raw reply related	[flat|nested] 70+ messages in thread

* Re: [patch 26/35] fs: icache alloc anonymous inode allocation
  2010-10-19  3:42 ` [patch 26/35] fs: icache alloc anonymous inode allocation npiggin
  2010-10-19 15:50   ` Miklos Szeredi
@ 2010-10-19 16:33   ` Christoph Hellwig
  2010-10-20  3:07     ` Nick Piggin
  1 sibling, 1 reply; 70+ messages in thread
From: Christoph Hellwig @ 2010-10-19 16:33 UTC (permalink / raw)
  To: npiggin; +Cc: linux-kernel, linux-fsdevel

On Tue, Oct 19, 2010 at 02:42:42PM +1100, npiggin@kernel.dk wrote:
> Provide new_anon_inode function for inodes without a default inode number, and
> not on sb list. This can enable filesystems to reduce locking. "Real"
> filesystems can also reduce locking by allocating anonymous inode first, then
> adding it to lists after finding the inode number.

Having an _anon inode allocation for filesystems that do manage the inode
lifetime is fine, but please don't mix that up with i_ino assignment,
as they are two totally different things.

Disk and network filesystems do not need a default i_ino, but they
absolutely do need their inodes to be on the per-sb list.
anonfs/pipe/socket (and nothing else) can do away with the per-sb list,
but they do need a pseudo inode number.

I have a version of this ported to Dave's tree which gets this right.
i_ino assignment is already moved out by my patch (which should apply
to your tree with minimal differences), so the new _anon variant only
skips putting the inode on the list.  The other difference is that we
don't bother initializing i_sb_list in the main inode allocation path,
but only in new_anon_inode, and that the function is not exported - to
be safe, it really should only be used for built-in filesystems that
never get unmounted.


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [patch 07/35] fs: icache lock i_state
  2010-10-19 10:47   ` Miklos Szeredi
@ 2010-10-19 17:06     ` Peter Zijlstra
  0 siblings, 0 replies; 70+ messages in thread
From: Peter Zijlstra @ 2010-10-19 17:06 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: npiggin, linux-kernel, linux-fsdevel

On Tue, 2010-10-19 at 12:47 +0200, Miklos Szeredi wrote:
> On Tue, 19 Oct 2010, npiggin@kernel.dk wrote:
> > Index: linux-2.6/fs/fs-writeback.c
> > ===================================================================
> > --- linux-2.6.orig/fs/fs-writeback.c	2010-10-19 14:18:58.000000000 +1100
> > +++ linux-2.6/fs/fs-writeback.c	2010-10-19 14:19:36.000000000 +1100
> > @@ -288,10 +288,12 @@
> 
> The function name here helps review a bit.  If you are using quilt add
> "export QUILT_DIFF_OPTS=-p" to .quiltrc.

I have:

QUILT_DIFF_OPTS="-F ^[[:alpha:]\$_].*[^:]\$"

in my /etc/quilt.quiltrc, it is similar to --show-c-function but doesn't
consider labels.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [patch 08/35] fs: icache lock i_count
  2010-10-19 10:16   ` Boaz Harrosh
@ 2010-10-20  2:14     ` Nick Piggin
  0 siblings, 0 replies; 70+ messages in thread
From: Nick Piggin @ 2010-10-20  2:14 UTC (permalink / raw)
  To: Boaz Harrosh; +Cc: npiggin, linux-kernel, linux-fsdevel

On Tue, Oct 19, 2010 at 12:16:51PM +0200, Boaz Harrosh wrote:
> On 10/19/2010 05:42 AM, npiggin@kernel.dk wrote:
> > Protect inode->i_count with i_lock, rather than having it atomic.
> > 
> > Signed-off-by: Nick Piggin <npiggin@kernel.dk>
> > 
> > ---
> <>
> >  fs/exofs/inode.c                         |   12 +++++++---
> >  fs/exofs/namei.c                         |    4 ++-
> <>
> > Index: linux-2.6/fs/exofs/inode.c
> > ===================================================================
> > --- linux-2.6.orig/fs/exofs/inode.c	2010-10-19 14:17:26.000000000 +1100
> > +++ linux-2.6/fs/exofs/inode.c	2010-10-19 14:19:18.000000000 +1100
> > @@ -1107,7 +1107,9 @@
> >  
> 
> Hi Nick, please use the -p option in your diff(s); it is a bit hard to follow
> and review without the proper function context. These patches are on a git
> tree. Why don't you use git to produce and send these patches?

Yes, sorry, that was a big mistake. New computer; I don't know why it
doesn't do that by default.

Thanks,
Nick



^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [patch 33/35] fs: icache introduce inode_get/inode_get_ilock
  2010-10-19 10:17   ` Boaz Harrosh
@ 2010-10-20  2:17     ` Nick Piggin
  0 siblings, 0 replies; 70+ messages in thread
From: Nick Piggin @ 2010-10-20  2:17 UTC (permalink / raw)
  To: Boaz Harrosh; +Cc: npiggin, linux-kernel, linux-fsdevel

On Tue, Oct 19, 2010 at 12:17:31PM +0200, Boaz Harrosh wrote:
> On 10/19/2010 05:42 AM, npiggin@kernel.dk wrote:
> > Index: linux-2.6/fs/exofs/namei.c
> > ===================================================================
> > --- linux-2.6.orig/fs/exofs/namei.c	2010-10-19 14:18:58.000000000 +1100
> > +++ linux-2.6/fs/exofs/namei.c	2010-10-19 14:19:00.000000000 +1100
> > @@ -153,9 +153,7 @@
> >  
> >  	inode->i_ctime = CURRENT_TIME;
> >  	inode_inc_link_count(inode);
> > -	spin_lock(&inode->i_lock);
> > -	inode->i_count++;
> > -	spin_unlock(&inode->i_lock);
> > +	inode_get(inode);
> >  
> >  	return exofs_add_nondir(dentry, inode);
> >  }
> 
> Why don't you define an intermediate inode_get() in patch 08/35 and
> change both the puts and gets of all filesystems in one patch, instead
> of two tree-sweeping patches? (At least for all the trivial places
> like here.)

I hadn't wanted to make non-locking-related changes before inode_lock
was removed. But yes, it may make sense just to do the name
change. I'll see how it looks.

Thanks,
Nick


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [patch 31/35] fs: icache per-zone inode LRU
  2010-10-19 12:38   ` Dave Chinner
@ 2010-10-20  2:35     ` Nick Piggin
  2010-10-20  3:12       ` Nick Piggin
  2010-10-20  3:14     ` KOSAKI Motohiro
  1 sibling, 1 reply; 70+ messages in thread
From: Nick Piggin @ 2010-10-20  2:35 UTC (permalink / raw)
  To: Dave Chinner; +Cc: npiggin, linux-kernel, linux-fsdevel

[I should have cc'ed this one to linux-mm as well, so I quote your
reply in full here]

On Tue, Oct 19, 2010 at 11:38:52PM +1100, Dave Chinner wrote:
> On Tue, Oct 19, 2010 at 02:42:47PM +1100, npiggin@kernel.dk wrote:
> > Per-zone LRUs and shrinkers for inode cache.
> 
> Regardless of whether this is the right way to scale or not, I don't
> like the fact that this moves the cache LRUs into the memory
> management structures, and expands the use of MM specific structures
> throughout the code.

The zone structure really is the basic unit of memory abstraction
in the whole zoned VM concept (which covers different properties
of both physical address and NUMA cost).

The zone contains structures for memory management that aren't
otherwise directly related to one another. Generic page waitqueues,
page allocator structures, pagecache reclaim structures, memory model
data, and various statistics.

Structures to reclaim inodes from a particular zone belong in the
zone struct as much as those to reclaim pagecache or anonymous
memory from that zone too. It actually fits far better in here than
globally, because all our allocation/reclaiming/watermarks etc are
driven per-zone.

The structure is not frequent -- a couple per NUMA node.


> It ties the cache implementation to the current
> VM implementation. That, IMO, goes against all the principles of
> modularisation at the source code level, and it means we have to tie
> all shrinker implementations to the current internal implementation
> of the VM. I don't think that is a wise thing to do because of the
> dependencies and impedance mismatches it introduces.

It's very fundamental. We allocate memory from, and have to reclaim
memory from -- zones. Memory reclaim is driven based on how the VM
wants to reclaim memory: nothing you can do to avoid some linkage
between the two.

Look at it this way. The dumb global shrinker is also tied to an
MM implementation detail, but that detail in fact does *not* match
the reality of the MM, and so it has all these problems interacting
with real reclaim.

What problems? OK, on an N zone system (assuming equal zones and
even distribution of objects around memory), if there is a shortage
on a particular zone, slabs from _all_ zones are reclaimed. We reclaim
a factor of N too many objects. In a NUMA situation, we also touch
remote memory with probability (N-1)/N.

As the number of nodes grows beyond 2, this quickly goes downhill.

In summary, there needs to be some knowledge of how MM reclaims memory
in memory reclaim shrinkers -- simply can't do a good implementation
without that. If the zone concept changes, the MM gets turned upside
down and all those assumptions would need to be revisited anyway.


> As an example: XFS inodes to be reclaimed are simply tagged in a
> radix tree so the shrinker can reclaim inodes in optimal IO order
> rather than strict LRU order. It simply does not match a zone-based

This is another problem, similar to what we have in pagecache. In
the pagecache, we need to clean pages in optimal IO order, but we
still reclaim them according to some LRU order.

If you reclaim them in optimal IO order, cache efficiency will go
down because you sacrifice recency/frequency information. If you
IO in reclaim order, IO efficiency goes down. The solution is to
decouple them with like writeout versus reclaim.

But anyway, that's kind of an "aside": inode caches are reclaimed
in LRU, IO-suboptimal order today anyway. Per-zone LRU doesn't
change that in the slightest.

> shrinker implementation in any way, shape or form, nor does its
> inherent parallelism match that of the way shrinkers are called.
> 
> Any change in shrinker infrastructure needs to be able to handle
> these sorts of impedance mismatches between the VM and the cache
> subsystem. The current API doesn't handle this very well, either,
> so it's something that we need to fix so that scalability is easy
> for everyone.
> 
> Anyway, my main point is that tying the LRU and shrinker scaling to
> the implementation of the VM is a one-off solution that doesn't work
> for generic infrastructure.

No it isn't. It worked for the pagecache, and it works for dcache.


> Other subsystems need the same
> large-machine scaling treatment, and there's no way we should be
> tying them all into the struct zone. It needs further abstraction.

An abstraction? Other than the zone? What do you suggest? Invent
something that the VM has no concept of and try to use that?

No. The zone is the right thing to base it on.


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [patch 23/35] fs: icache use per-CPU lists and locks for sb inode lists
  2010-10-19 15:33   ` Miklos Szeredi
@ 2010-10-20  2:37     ` Nick Piggin
  0 siblings, 0 replies; 70+ messages in thread
From: Nick Piggin @ 2010-10-20  2:37 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: npiggin, linux-kernel, linux-fsdevel

On Tue, Oct 19, 2010 at 05:33:54PM +0200, Miklos Szeredi wrote:
> On Tue, 19 Oct 2010, npiggin@kernel.dk wrote:
> > +/**
> > + * inode_sb_list_add - add an inode to the sb's file list
> > + * @inode: inode to add
> > + * @sb: sb to add it to
> > + *
> > + * Use this function to associate an with the superblock it belongs to.
> 
>                                        ^^^inode

Yes, thanks.


> > @@ -1270,6 +1316,7 @@
> >  			continue;
> >  		if (!spin_trylock(&old->i_lock)) {
> >  			spin_unlock_bucket(b);
> > +			cpu_relax();
> 
> Doesn't this logically belong to a previous patch?

Yes, I had a couple of them leak out. I'll try to fix them
up... most trylocks get resolved with RCU later.

 
> > Index: linux-2.6/fs/super.c
> > ===================================================================
> > --- linux-2.6.orig/fs/super.c	2010-10-19 14:17:17.000000000 +1100
> > +++ linux-2.6/fs/super.c	2010-10-19 14:18:59.000000000 +1100
> > @@ -67,12 +67,25 @@
> >  			for_each_possible_cpu(i)
> >  				INIT_LIST_HEAD(per_cpu_ptr(s->s_files, i));
> >  		}
> > +		s->s_inodes = alloc_percpu(struct list_head);
> > +		if (!s->s_inodes) {
> > +			free_percpu(s->s_files);
> > +			security_sb_free(s);
> > +			kfree(s);
> > +			s = NULL;
> > +			goto out;
> 
> Factor out error cleanups to separate out labels?

OK... probably makes sense at this point.


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [patch 26/35] fs: icache alloc anonymous inode allocation
  2010-10-19 15:50   ` Miklos Szeredi
@ 2010-10-20  2:38     ` Nick Piggin
  0 siblings, 0 replies; 70+ messages in thread
From: Nick Piggin @ 2010-10-20  2:38 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: npiggin, linux-kernel, linux-fsdevel

On Tue, Oct 19, 2010 at 05:50:00PM +0200, Miklos Szeredi wrote:
> On Tue, 19 Oct 2010, npiggin@kernel.dk wrote:
> > Index: linux-2.6/fs/anon_inodes.c
> > ===================================================================
> > --- linux-2.6.orig/fs/anon_inodes.c	2010-10-19 14:18:58.000000000 +1100
> > +++ linux-2.6/fs/anon_inodes.c	2010-10-19 14:19:19.000000000 +1100
> > @@ -191,7 +191,7 @@
> >   */
> >  static struct inode *anon_inode_mkinode(void)
> >  {
> > -	struct inode *inode = new_inode(anon_inode_mnt->mnt_sb);
> > +	struct inode *inode = new_anon_inode(anon_inode_mnt->mnt_sb);
> >  
> >  	if (!inode)
> >  		return ERR_PTR(-ENOMEM);
> 
> This too needs an inode->i_ino initialization (the default ULONG_MAX
> will cause EOVERFLOW on 32bit fstat, AFAIK), though it could just be a
> constant, say 2.

OK. I'll add a #define for it. Thanks
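
Something along these lines, presumably (the name is hypothetical; the
value follows Miklos's suggestion):

	/* fs/anon_inodes.c sketch; remaining inode setup unchanged */
	#define ANON_INODE_INO	2

	inode->i_ino = ANON_INODE_INO;	/* avoid EOVERFLOW on 32-bit fstat */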


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [patch 27/35] fs: icache split IO and LRU lists
  2010-10-19 16:12   ` Miklos Szeredi
@ 2010-10-20  2:41     ` Nick Piggin
  0 siblings, 0 replies; 70+ messages in thread
From: Nick Piggin @ 2010-10-20  2:41 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: npiggin, linux-kernel, linux-fsdevel

On Tue, Oct 19, 2010 at 06:12:07PM +0200, Miklos Szeredi wrote:
> On Tue, 19 Oct 2010, npiggin@kernel.dk wrote:
> > @@ -422,7 +422,11 @@
> >  			/*
> >  			 * The inode is clean
> >  			 */
> > -			list_move(&inode->i_list, &inode_unused);
> > +			list_del_init(&inode->i_io);
> > +			if (list_empty(&inode->i_lru)) {
> > +				list_add(&inode->i_lru, &inode_unused);
> > +				inodes_stat.nr_unused++;
> 
> It's not obvious where this came from.  How come nr_unused was
> correctly accounted with the previous, list_move() version?

Yes... inode is considered unused if it has 0 refcount, even if
it is dirty. This hunk must have crept in from somewhere else,
good catch.

Thanks,
Nick


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [patch 00/35] my inode scaling series for review
  2010-10-19 16:22 ` [patch 00/35] my inode scaling series for review Christoph Hellwig
@ 2010-10-20  3:05   ` Nick Piggin
  0 siblings, 0 replies; 70+ messages in thread
From: Nick Piggin @ 2010-10-20  3:05 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: npiggin, linux-kernel, linux-fsdevel

On Tue, Oct 19, 2010 at 12:22:07PM -0400, Christoph Hellwig wrote:
> On Tue, Oct 19, 2010 at 02:42:16PM +1100, npiggin@kernel.dk wrote:
> > * My locking design allows i_lock to lock the entire state of the icache
> >   for a particular inode. Not so with Dave's, and he had to add code not
> >   required with inode_lock synchronisation or my i_lock synchronisation.
> >   I prefer being very conservative about making changes, especially before
> >   inode_lock is lifted (which will be the end-point of bisection for any
> >   locking breakage before it).
> 
> Which code exactly?  I've done a diff between his inode.c and yours - 

I pointed it out earlier (not that you replied).

> and Dave's is a lot simpler.  Mostly due to the more regular and simpler
> locking, but also because he did various cleanups before tackling the
> actual locking.  See the diff at the end of this mail for a direct
> comparison.

How can you say it is more regular and simpler locking? _Acquiring_
locks is not the difficult part of a locking rewrite. Verifying that
no unhandled concurrency is introduced is.

At first glance there are parts of my code that are more complex; that
is because it is actually a more regular locking model that is easier to
verify when you lift inode_lock away.

Like I said to Dave when he first looked at my series: he's welcome to
submit a small incremental change to reduce i_lock coverage or put
the lock order another way and we can see what the impact looks like.

Locking changes should start simpler and more conservative.

 
> > * As far as I can tell, I have addressed all Dave and Christoph's real
> >   concerns.  The disagreement about the i_lock locking model can easily be
> >   solved if they post a couple of small incremental patches to the end of the
> >   series, making i_lock locking less regular and no longer protecting icache
> >   state of that given inode (like inode_lock was able to pre-patchset). I've
> >   repeatedly disagreed with this approach, however.
> 
> The diff below and looking over the other patches doesn't make it look
> like you have actually picked up much at all, neither the feedback
> from me, nor from Dave, Andrew, or Al.

I picked up most of it. I missed the per-cpu one from Andrew because
I was waiting for Eric's new API.

I did pick up feedback from you or Dave, and where I didn't, I replied
with my reasons.


> Even worse than that, none of the sometimes quite major bug fixes were
> picked up either.  The get_new_inode re-lookup locking is still wrong,

I see, yes, I missed that point.


> the exofs fix is not there.

It doesn't belong there.


> And the fix for the mapping move of the
> block devices, which we unfortunately still have, seems to be papered
> over by passing the bdi_writeback to the requeuing helpers instead
> of fixing it.  While this makes the assert_spin_lock panic go away
> it still leaves a really nasty race as your version locks a different
> bdi than the one that it actually modifies.

Yeah, Dave's fix got omitted from my series. I agree it is needed;
I just had it commented out for some reason. Thanks.

 
> There's also another bug which was there in your very first version
> with an XXX but that Dave AFAIK never picked up: invalidate_inodes is
> called from a lot of other places than umount, and unlocked list
> access is anything but safe there.

No, I didn't ignore that. It is quite definitely already buggy upstream
if there is concurrent list modification there. I noticed that Al was
thinking about it and I was going to wait until he replied with
something definitive.


> Anyway, below is the diff between the two trees.  I've cut down the
> churn in filesystems a bit - everything related to the gratuitous
> i_ref vs i_refs and iref vs inode_get differences, as well as the
> call_rcu boilerplate additions and get_next_ino calls, has been
> removed to make it somewhat readable.
> 
> To me the inode.c and especially fs-writeback.c code in Dave's version
> looks a lot more polished.

Well, like I said, I disagree. I don't like how the icache state
of the inode is not locked (as it is with inode_lock) when the
inode is being manipulated.

I see any such step in that direction as an incremental patch;
it shouldn't be lumped in with the lock-breaking work. That's just
not how to structure a series that adds fine-grained locking.
As I said, I am willing to review and comment on such an incremental
patch with an open mind. But I would have to see a good justification
and a reason why RCU or another technique would not work.

You also missed my feedback that he didn't add the required RCU, didn't
structure the series properly for a serious scalability rework,
didn't break out the rest of the global locks, and hadn't tested it
with the rest of the vfs stack.

But your review so far has been very helpful, thanks.


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [patch 26/35] fs: icache alloc anonymous inode allocation
  2010-10-19 16:33   ` Christoph Hellwig
@ 2010-10-20  3:07     ` Nick Piggin
  0 siblings, 0 replies; 70+ messages in thread
From: Nick Piggin @ 2010-10-20  3:07 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: npiggin, linux-kernel, linux-fsdevel

On Tue, Oct 19, 2010 at 12:33:27PM -0400, Christoph Hellwig wrote:
> On Tue, Oct 19, 2010 at 02:42:42PM +1100, npiggin@kernel.dk wrote:
> > Provide new_anon_inode function for inodes without a default inode number, and
> > not on sb list. This can enable filesystems to reduce locking. "Real"
> > filesystems can also reduce locking by allocating anonymous inode first, then
> > adding it to lists after finding the inode number.
> 
> Having an _anon inode allocation for fileystsem that do manage the inode
> lifetime is fine, but please don't mix that up with i_ino assignment,
> as they are two totally different things.
> 
> Disk and network filesystem do not need a default i_ino, but they
> absolutely do no need their inodes to be on the per-sb list.
> anonfs/pipe/socket (and nothing else) can do away with the per-sb list,
> but they do need a pseudo inode number.

Probably bad wording of "anon". It should be "raw", maybe. The
filesystem is then of course responsible for assigning i_ino and/or
adding the inode to the lists.


> I have a version of this port to Dave's tree which gets this right.
> i_ino assignment is already moved out by my patch (which should apply
> to your tree with minimal differences), so the new _anon only does not
> put the inode on the list.  The other difference is that we don't bother
> initializing i_sb_list in the main inode allocation path, but only in
> new_anon_inode, and that the function is not exported - it really should
> only be used for built-in filesystems that never get unmounted to be
> safe.

I'll check it.


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [patch 31/35] fs: icache per-zone inode LRU
  2010-10-20  2:35     ` Nick Piggin
@ 2010-10-20  3:12       ` Nick Piggin
  2010-10-20  9:43         ` Dave Chinner
  0 siblings, 1 reply; 70+ messages in thread
From: Nick Piggin @ 2010-10-20  3:12 UTC (permalink / raw)
  To: Nick Piggin, linux-mm; +Cc: Dave Chinner, linux-kernel, linux-fsdevel

Gah. Try again.

On Wed, Oct 20, 2010 at 01:35:56PM +1100, Nick Piggin wrote:
> [I should have cc'ed this one to linux-mm as well, so I quote your
> reply in full here]
> 
> On Tue, Oct 19, 2010 at 11:38:52PM +1100, Dave Chinner wrote:
> > On Tue, Oct 19, 2010 at 02:42:47PM +1100, npiggin@kernel.dk wrote:
> > > Per-zone LRUs and shrinkers for inode cache.
> > 
> > Regardless of whether this is the right way to scale or not, I don't
> > like the fact that this moves the cache LRUs into the memory
> > management structures, and expands the use of MM specific structures
> > throughout the code.
> 
> The zone structure really is the basic unit of memory abstraction
> in the whole zoned VM concept (which covers different properties
> of both physical address and NUMA cost).
> 
> The zone contains structures for memory management that aren't
> otherwise directly related to one another. Generic page waitqueues,
> page allocator structures, pagecache reclaim structures, memory model
> data, and various statistics.
> 
> Structures to reclaim inodes from a particular zone belong in the
> zone struct as much as those to reclaim pagecache or anonymous
> memory from that zone too. It actually fits far better in here than
> globally, because all our allocation/reclaiming/watermarks etc are
> driven per-zone.
> 
> The structure is not frequent -- a couple per NUMA node.
> 
> 
> > It ties the cache implementation to the current
> > VM implementation. That, IMO, goes against all the principles of
> > modularisation at the source code level, and it means we have to tie
> > all shrinker implementations to the current internal implementation
> > of the VM. I don't think that is a wise thing to do because of the
> > dependencies and impedance mismatches it introduces.
> 
> It's very fundamental. We allocate memory from, and have to reclaim
> memory from -- zones. Memory reclaim is driven based on how the VM
> wants to reclaim memory: nothing you can do to avoid some linkage
> between the two.
> 
> Look at it this way. The dumb global shrinker is also tied to an
> MM implementation detail, but that detail in fact does *not* match
> the reality of the MM, and so it has all these problems interacting
> with real reclaim.
> 
> What problems? OK, on an N zone system (assuming equal zones and
> even distribution of objects around memory), if there is a shortage
> on a particular zone, slabs from _all_ zones are reclaimed. We reclaim
> a factor of N too many objects. In a NUMA situation, we also touch
> remote memory with probability (N-1)/N.
> 
> As the number of nodes grows beyond 2, this quickly goes downhill.
> 
> In summary, there needs to be some knowledge of how MM reclaims memory
> in memory reclaim shrinkers -- simply can't do a good implementation
> without that. If the zone concept changes, the MM gets turned upside
> down and all those assumptions would need to be revisited anyway.
> 
> 
> > As an example: XFS inodes to be reclaimed are simply tagged in a
> > radix tree so the shrinker can reclaim inodes in optimal IO order
> > rather than strict LRU order. It simply does not match a zone-based
> 
> This is another problem, similar to what we have in pagecache. In
> the pagecache, we need to clean pages in optimal IO order, but we
> still reclaim them according to some LRU order.
> 
> If you reclaim them in optimal IO order, cache efficiency will go
> down because you sacrifice recency/frequency information. If you
> IO in reclaim order, IO efficiency goes down. The solution is to
> decouple them, as with writeout versus reclaim.
> 
> But anyway, that's kind of an "aside": inode caches are reclaimed
> in LRU, IO-suboptimal order today anyway. Per-zone LRU doesn't
> change that in the slightest.
> 
> > shrinker implementation in any way, shape or form, nor does its
> > inherent parallelism match that of the way shrinkers are called.
> > 
> > Any change in shrinker infrastructure needs to be able to handle
> > these sorts of impedance mismatches between the VM and the cache
> > subsystem. The current API doesn't handle this very well, either,
> > so it's something that we need to fix so that scalability is easy
> > for everyone.
> > 
> > Anyway, my main point is that tying the LRU and shrinker scaling to
> > the implementation of the VM is a one-off solution that doesn't work
> > for generic infrastructure.
> 
> No it isn't. It worked for the pagecache, and it works for dcache.
> 
> 
> > Other subsystems need the same
> > large-machine scaling treatment, and there's no way we should be
> > tying them all into the struct zone. It needs further abstraction.
> 
> An abstraction? Other than the zone? What do you suggest? Invent
> something that the VM has no concept of and try to use that?
> 
> No. The zone is the right thing to base it on.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [patch 31/35] fs: icache per-zone inode LRU
  2010-10-19 12:38   ` Dave Chinner
  2010-10-20  2:35     ` Nick Piggin
@ 2010-10-20  3:14     ` KOSAKI Motohiro
  2010-10-20  3:20       ` Nick Piggin
  1 sibling, 1 reply; 70+ messages in thread
From: KOSAKI Motohiro @ 2010-10-20  3:14 UTC (permalink / raw)
  To: Dave Chinner; +Cc: kosaki.motohiro, npiggin, linux-kernel, linux-fsdevel

Hello,

> On Tue, Oct 19, 2010 at 02:42:47PM +1100, npiggin@kernel.dk wrote:
> > Per-zone LRUs and shrinkers for inode cache.
> 
> Regardless of whether this is the right way to scale or not, I don't
> like the fact that this moves the cache LRUs into the memory
> management structures, and expands the use of MM specific structures
> throughout the code. It ties the cache implementation to the current
> VM implementation. That, IMO, goes against all the principles of
> modularisation at the source code level, and it means we have to tie
> all shrinker implementations to the current internal implementation
> of the VM. I don't think that is a wise thing to do because of the
> dependencies and impedance mismatches it introduces.
> 
> As an example: XFS inodes to be reclaimed are simply tagged in a
> radix tree so the shrinker can reclaim inodes in optimal IO order
> rather than strict LRU order. It simply does not match a zone-based
> shrinker implementation in any way, shape or form, nor does its
> inherent parallelism match that of the way shrinkers are called.
> 
> Any change in shrinker infrastructure needs to be able to handle
> these sorts of impedance mismatches between the VM and the cache
> subsystem. The current API doesn't handle this very well, either,
> so it's something that we need to fix so that scalability is easy
> for everyone.
> 
> Anyway, my main point is that tying the LRU and shrinker scaling to
> the implementation of the VM is a one-off solution that doesn't work
> for generic infrastructure. Other subsystems need the same
> large-machine scaling treatment, and there's no way we should be
> tying them all into the struct zone. It needs further abstraction.

I'm not sure what data structure is best. I can only say the current
zone-unaware slab shrinker can cause the following sad scenarios:

 o a DMA zone shortage triggers reclaim, and plenty of icache in the
   NORMAL zone gets dropped
 o a NUMA-aware system enables zone_reclaim_mode, but shrink_slab() still
   drops unrelated zones' icache

Both cause performance degradation. In other words, Linux does not have
a flat memory model, so I don't think Nick's basic concept is wrong.
It's a straightforward enhancement. But if it doesn't fit the current
shrinkers, I'd like to discuss how to make a better data structure.



And I have a dumb question (sorry, I don't know xfs at all). The current
xfs_mount is below.

typedef struct xfs_mount {
 ...
        struct shrinker         m_inode_shrink; /* inode reclaim shrinker */
} xfs_mount_t;


Do you mean xfs can't convert its shrinker to shrinker[ZONES]? If so, why?


Thanks.




^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [patch 31/35] fs: icache per-zone inode LRU
  2010-10-20  3:14     ` KOSAKI Motohiro
@ 2010-10-20  3:20       ` Nick Piggin
  2010-10-20  3:29         ` KOSAKI Motohiro
  2010-10-20 10:19         ` Dave Chinner
  0 siblings, 2 replies; 70+ messages in thread
From: Nick Piggin @ 2010-10-20  3:20 UTC (permalink / raw)
  To: KOSAKI Motohiro; +Cc: Dave Chinner, npiggin, linux-kernel, linux-fsdevel

On Wed, Oct 20, 2010 at 12:14:32PM +0900, KOSAKI Motohiro wrote:
> > On Tue, Oct 19, 2010 at 02:42:47PM +1100, npiggin@kernel.dk wrote:
> > Anyway, my main point is that tying the LRU and shrinker scaling to
> > the implementation of the VM is a one-off solution that doesn't work
> > for generic infrastructure. Other subsystems need the same
> > large-machine scaling treatment, and there's no way we should be
> > tying them all into the struct zone. It needs further abstraction.
> 
> I'm not sure what data structure is best. I can only say the current
> zone-unaware slab shrinker can cause the following sad scenarios:
> 
>  o a DMA zone shortage triggers reclaim, and plenty of icache in the
>    NORMAL zone gets dropped
>  o a NUMA-aware system enables zone_reclaim_mode, but shrink_slab() still
>    drops unrelated zones' icache
> 
> Both cause performance degradation. In other words, Linux does not have
> a flat memory model, so I don't think Nick's basic concept is wrong.
> It's a straightforward enhancement. But if it doesn't fit the current
> shrinkers, I'd like to discuss how to make a better data structure.
> 
> 
> 
> And I have a dumb question (sorry, I don't know xfs at all). The current
> xfs_mount is below.
> 
> typedef struct xfs_mount {
>  ...
>         struct shrinker         m_inode_shrink; /* inode reclaim shrinker */
> } xfs_mount_t;
> 
> 
> Do you mean xfs can't convert its shrinker to shrinker[ZONES]? If so, why?

Well if XFS were to use per-ZONE shrinkers, it would remain with a
single shrinker context per-sb like it has now, but it would divide
its object management into per-zone structures.

Subsystems that aren't important, don't take much memory, or don't
have much reclaim throughput are free to ignore the zone argument and
keep using the global input to the shrinker.
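
As a rough sketch of the division I mean (illustrative only; the
structure names here are invented):

        struct sb_icache {                      /* hypothetical */
                struct shrinker shrinker;       /* still one per sb */
                struct icache_zone {
                        spinlock_t lock;
                        struct list_head lru;   /* inodes in this zone */
                        unsigned long nr;
                } zones[MAX_NUMNODES * MAX_NR_ZONES];
        };

The shrinker callback then indexes zones[] with the zone it is handed,
rather than taking a single global lock over one global list.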


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [patch 31/35] fs: icache per-zone inode LRU
  2010-10-20  3:20       ` Nick Piggin
@ 2010-10-20  3:29         ` KOSAKI Motohiro
  2010-10-20 10:19         ` Dave Chinner
  1 sibling, 0 replies; 70+ messages in thread
From: KOSAKI Motohiro @ 2010-10-20  3:29 UTC (permalink / raw)
  To: Nick Piggin; +Cc: kosaki.motohiro, Dave Chinner, linux-kernel, linux-fsdevel

> On Wed, Oct 20, 2010 at 12:14:32PM +0900, KOSAKI Motohiro wrote:
> > > On Tue, Oct 19, 2010 at 02:42:47PM +1100, npiggin@kernel.dk wrote:
> > > Anyway, my main point is that tying the LRU and shrinker scaling to
> > > the implementation of the VM is a one-off solution that doesn't work
> > > for generic infrastructure. Other subsystems need the same
> > > large-machine scaling treatment, and there's no way we should be
> > > tying them all into the struct zone. It needs further abstraction.
> > 
> > I'm not sure what data structure is best. I can only say that the
> > current zone-unaware slab shrinker can produce the following sad
> > scenarios:
> > 
> >  o a DMA zone shortage invokes reclaim and drops plenty of icache that
> >    lives in the NORMAL zone
> >  o a NUMA-aware system enables zone_reclaim_mode, but shrink_slab()
> >    still drops unrelated zones' icache
> > 
> > Both cause performance degradation. In other words, Linux does not
> > have a flat memory model, so I don't think Nick's basic concept is
> > wrong. It's a straightforward enhancement. But if it doesn't fit the
> > current shrinkers, I'd like to discuss how to make a better data
> > structure.
> > 
> > 
> > 
> > And I have a dumb question (sorry, I don't know xfs at all). The
> > current xfs_mount is below.
> > 
> > typedef struct xfs_mount {
> >  ...
> >         struct shrinker         m_inode_shrink; /* inode reclaim shrinker */
> > } xfs_mount_t;
> > 
> > 
> > Do you mean xfs can't convert its shrinker to shrinker[ZONES]? If so, why?
> 
> Well if XFS were to use per-ZONE shrinkers, it would remain with a
> single shrinker context per-sb like it has now, but it would divide
> its object management into per-zone structures.

Oops, my fault ;)
Yes, my intention was converting mp->m_perag_tree to per-zone.

Thanks for the correction.


> 
> Subsystems that aren't important, don't take much memory, or don't
> have much reclaim throughput are free to ignore the zone argument and
> keep using the global input to the shrinker.
> 




^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [patch 31/35] fs: icache per-zone inode LRU
  2010-10-20  3:12       ` Nick Piggin
@ 2010-10-20  9:43         ` Dave Chinner
  2010-10-20 10:02           ` Nick Piggin
  0 siblings, 1 reply; 70+ messages in thread
From: Dave Chinner @ 2010-10-20  9:43 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-mm, linux-kernel, linux-fsdevel

> On Wed, Oct 20, 2010 at 01:35:56PM +1100, Nick Piggin wrote:
> > On Tue, Oct 19, 2010 at 11:38:52PM +1100, Dave Chinner wrote:
> > > On Tue, Oct 19, 2010 at 02:42:47PM +1100, npiggin@kernel.dk wrote:
> > > > Per-zone LRUs and shrinkers for inode cache.
> > > 
> > > Regardless of whether this is the right way to scale or not, I don't
> > > like the fact that this moves the cache LRUs into the memory
> > > management structures, and expands the use of MM specific structures
> > > throughout the code.
> > 
> > The zone structure really is the basic unit of memory abstraction
> > in the whole zoned VM concept (which covers different properties
> > of both physical address and NUMA cost).

[ snip lecture on NUMA VM 101 - I got that at SGI w.r.t. Irix more than
8 years ago, and Linux isn't any different. ]

> > > It ties the cache implementation to the current
> > > VM implementation. That, IMO, goes against all the principle of
> > > modularisation at the source code level, and it means we have to tie
> > > all shrinker implemenations to the current internal implementation
> > > of the VM. I don't think that is wise thing to do because of the
> > > dependencies and impedance mismatches it introduces.
> > 
> > It's very fundamental. We allocate memory from, and have to reclaim
> > memory from -- zones. Memory reclaim is driven based on how the VM
> > wants to reclaim memory: nothing you can do to avoid some linkage
> > between the two.

The allocation API exposes per-node allocation, not zones. The zones
are the internal implementation of the API, not what people use
directly for allocation...

> > > As an example: XFS inodes to be reclaimed are simply tagged in a
> > > radix tree so the shrinker can reclaim inodes in optimal IO order
> > > rather strict LRU order. It simply does not match a zone-based
....
> > But anyway, that's kind of an "aside": inode caches are reclaimed
> > in LRU, IO-suboptimal order today anyway. Per-zone LRU doesn't
> > change that in the slightest.

I suspect you didn't read what I wrote, so I'll repeat it. XFS has
reclaimed inodes in optimal IO order for several releases and so
per-zone LRU would change that drastically.

> > > Other subsystems need the same
> > > large-machine scaling treatment, and there's no way we should be
> > > tying them all into the struct zone. It needs further abstraction.
> > 
> > An abstraction? Other than the zone? What do you suggest? Invent
> > something that the VM has no concept of and try to use that?

I think you answered that question yourself a moment ago:

> > The structure is not frequent -- a couple per NUMA node.

Sounds to me like a per-node LRU/shrinker arrangement is an
abstraction that the VM could work with. Indeed, make it run only
from the *per-node kswapd* instead of from direct reclaim, and we'd
also solve the unbound reclaim parallelism problem at the same
time...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [patch 31/35] fs: icache per-zone inode LRU
  2010-10-20  9:43         ` Dave Chinner
@ 2010-10-20 10:02           ` Nick Piggin
  0 siblings, 0 replies; 70+ messages in thread
From: Nick Piggin @ 2010-10-20 10:02 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Nick Piggin, linux-mm, linux-kernel, linux-fsdevel

On Wed, Oct 20, 2010 at 08:43:02PM +1100, Dave Chinner wrote:
> > On Wed, Oct 20, 2010 at 01:35:56PM +1100, Nick Piggin wrote:
> > > 
> > > It's very fundamental. We allocate memory from, and have to reclaim
> > > memory from -- zones. Memory reclaim is driven based on how the VM
> > > wants to reclaim memory: nothing you can do to avoid some linkage
> > > between the two.
> 
> The allocation API exposes per-node allocation, not zones. The zones
> are the internal implementation of the API, not what people use
> directly for allocation...

Of course it exposes zones (with GFP flags). In fact they were exposed
before the zone concept was extended to NUMA.


> > > > As an example: XFS inodes to be reclaimed are simply tagged in a
> > > > radix tree so the shrinker can reclaim inodes in optimal IO order
> > > > rather strict LRU order. It simply does not match a zone-based
> ....
> > > But anyway, that's kind of an "aside": inode caches are reclaimed
> > > in LRU, IO-suboptimal order today anyway. Per-zone LRU doesn't
> > > change that in the slightest.
> 
> I suspect you didn't read what I wrote, so I'll repeat it. XFS has
> reclaimed inodes in optimal IO order for several releases and so
> per-zone LRU would change that drastically.

You were talking about XFS's own inode reclaim code? My patches
of course don't change that. I would like to see them usable by
XFS as well of course, but I'm not forcing anything to be
shoehorned in where it doesn't fit properly yet.

The Linux inode reclaimer is pretty well "random" from the POV of
disk order, as you know.

I don't have the complete answer about how to write back the required
inode information in IO-optimal order, and at the same time make
reclaim-optimal choices.

It could be that a 2-stage reclaim process is enough (have the
Linux inode reclaim select the inode and make it eligible for IO
and real reclaiming, then have an inode writeout pass that does
IO-optimal reclaiming from those).
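
In outline, something like this (the exact split is hand-waving):

        stage 1, LRU order:  the VFS reclaimer takes candidates off the
                             (per-zone) LRU and only marks them eligible,
                             handing them to the filesystem rather than
                             freeing them directly;
        stage 2, IO order:   a writeout pass walks the fs-side structure
                             in disk order, writes back what is dirty and
                             does the actual freeing.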

That is really quite speculative and out of scope of this patch set.
But the point is that this patch set doesn't prohibit anything like
that happening, and it does not change XFS's reclaim currently.


> > > > Other subsystems need the same
> > > > large-machine scaling treatment, and there's no way we should be
> > > > tying them all into the struct zone. It needs further abstraction.
> > > 
> > > An abstraction? Other than the zone? What do you suggest? Invent
> > > something that the VM has no concept of and try to use that?
> 
> I think you answered that question yourself a moment ago:
> 
> > > The structure is not frequent -- a couple per NUMA node.
> 
> Sounds to me like a per-node LRU/shrinker arrangement is an
> abstraction that the VM could work with.

The zone really is the right place. If you do it per node, then
you can still have a shortage in one zone of a node but not in
another, causing the same excessive reclaim problem.


> Indeed, make it run only
> from the *per-node kswapd* instead of from direct reclaim, and we'd
> also solve the unbound reclaim parallelism problem at the same
> time...

That's also out of scope, but it is among the things being
considered, as far as I know (along with capping the number of
threads in reclaim, etc). But doing zone LRUs doesn't change
this either -- kswapd pagecache reclaim also works per node,
by simply processing all the zones that belong to the node.
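
Roughly the shape of it (simplified; shrink_zone_slabs() here is a
hypothetical per-zone hook, not an existing function):

        /* per-node reclaim is just per-zone reclaim over a node's zones */
        int i;

        for (i = 0; i < pgdat->nr_zones; i++) {
                struct zone *zone = pgdat->node_zones + i;

                if (!populated_zone(zone))
                        continue;
                shrink_zone_slabs(zone);        /* hypothetical */
        }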


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [patch 31/35] fs: icache per-zone inode LRU
  2010-10-20  3:20       ` Nick Piggin
  2010-10-20  3:29         ` KOSAKI Motohiro
@ 2010-10-20 10:19         ` Dave Chinner
  2010-10-20 10:41           ` Nick Piggin
  1 sibling, 1 reply; 70+ messages in thread
From: Dave Chinner @ 2010-10-20 10:19 UTC (permalink / raw)
  To: Nick Piggin; +Cc: KOSAKI Motohiro, linux-kernel, linux-fsdevel

On Wed, Oct 20, 2010 at 02:20:24PM +1100, Nick Piggin wrote:
> On Wed, Oct 20, 2010 at 12:14:32PM +0900, KOSAKI Motohiro wrote:
> > > On Tue, Oct 19, 2010 at 02:42:47PM +1100, npiggin@kernel.dk wrote:
> > > Anyway, my main point is that tying the LRU and shrinker scaling to
> > > the implementation of the VM is a one-off solution that doesn't work
> > > for generic infrastructure. Other subsystems need the same
> > > large-machine scaling treatment, and there's no way we should be
> > > tying them all into the struct zone. It needs further abstraction.
> > 
> > I'm not sure what data structure is best. I can only say that the
> > current zone-unaware slab shrinker can produce the following sad
> > scenarios:
> > 
> >  o a DMA zone shortage invokes reclaim and drops plenty of icache that
> >    lives in the NORMAL zone
> >  o a NUMA-aware system enables zone_reclaim_mode, but shrink_slab()
> >    still drops unrelated zones' icache
> > 
> > Both cause performance degradation. In other words, Linux does not
> > have a flat memory model, so I don't think Nick's basic concept is
> > wrong. It's a straightforward enhancement. But if it doesn't fit the
> > current shrinkers, I'd like to discuss how to make a better data
> > structure.
> > 
> > 
> > 
> > And I have a dumb question (sorry, I don't know xfs at all). The
> > current xfs_mount is below.
> > 
> > typedef struct xfs_mount {
> >  ...
> >         struct shrinker         m_inode_shrink; /* inode reclaim shrinker */
> > } xfs_mount_t;
> > 
> > 
> > Do you mean xfs can't convert its shrinker to shrinker[ZONES]? If so, why?
> 
> Well if XFS were to use per-ZONE shrinkers, it would remain with a
> single shrinker context per-sb like it has now, but it would divide
> its object management into per-zone structures.

<sigh>

I don't think anyone wants per-ag X per-zone reclaim lists on a 1024
node machine with a 1,000 AG (1PB) filesystem.

As I have already said, the XFS inode caches are optimised in
structure to minimise IO and maximise internal filesystem
parallelism. They are not optimised for per-cpu or NUMA scalability
because if you don't have filesystem level parallelism, you can't
scale to large numbers of concurrent operations across large numbers
of CPUs in the first place.

In the case of XFS, per-allocation group is the way we scale
internal parallelism and as long as you have more AGs than you have
CPUs, there is very good per-CPU scalability through the filesystem
because most operations are isolated to a single AG.  That is how we
scale parallelism in XFS, and it has proven to scale pretty well for
even the largest of NUMA machines. 

This is what I mean about there being an impedance mismatch between
the way the VM and the VFS/filesystem caches scale. Fundamentally,
the way filesystems want their caches to operate for optimal
performance can be vastly different to the way you want shrinkers to
operate for VM scalability. Forcing the MM way of doing stuff down
into the LRUs and shrinkers is not a good way of solving this
problem.

> Subsystems that aren't important, don't take much memory, or don't
> have much reclaim throughput are free to ignore the zone argument and
> keep using the global input to the shrinker.

Having a global lock in a shrinker is already a major point of
contention because shrinkers have unbound parallelism.  Hence all
shrinkers need to be converted to use scalable structures. What we
need _first_ is the infrastructure to do this in a sane manner, not
tie a couple of shrinkers tightly into the mm structures and then
walk away.

And FWIW, most subsystems that use shrinkers can be compiled in as
modules or not compiled in at all. That'll probably leave #ifdef
CONFIG_ crap all through the struct zone definition as they are
converted to use your current method....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [patch 31/35] fs: icache per-zone inode LRU
  2010-10-20 10:19         ` Dave Chinner
@ 2010-10-20 10:41           ` Nick Piggin
  0 siblings, 0 replies; 70+ messages in thread
From: Nick Piggin @ 2010-10-20 10:41 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Nick Piggin, KOSAKI Motohiro, linux-kernel, linux-fsdevel

On Wed, Oct 20, 2010 at 09:19:06PM +1100, Dave Chinner wrote:
> On Wed, Oct 20, 2010 at 02:20:24PM +1100, Nick Piggin wrote:
> > 
> > Well if XFS were to use per-ZONE shrinkers, it would remain with a
> > single shrinker context per-sb like it has now, but it would divide
> > its object management into per-zone structures.
> 
> <sigh>
> 
> I don't think anyone wants per-ag X per-zone reclaim lists on a 1024
> node machine with a 1,000 AG (1PB) filesystem.

Maybe not, but a 1024 node machine will *definitely* need to minimise
interconnect traffic and remote memory access. So if each node can't
spare enough memory for a couple of thousand LRU list heads, then
XFS's per-ag LRUs may need rethinking (they may provide reasonable
scalability on well-partitioned workloads, but they cannot help
the reclaimers to do the right thing -- remote memory accesses will
still dominate the inode LRU scanning there).


> As I have already said, the XFS inode caches are optimised in
> structure to minimise IO and maximise internal filesystem
> parallelism. They are not optimised for per-cpu or NUMA scalability
> because if you don't have filesystem level parallelism, you can't
> scale to large numbers of concurrent operations across large numbers
> of CPUs in the first place.

And as I have already said, nothing in my patches changes that.
What it provides is *opportunity* for shrinkers to take advantage
of per-zone scalability and improved reclaim patterns. Nothing
forces it, though.

 
> In the case of XFS, per-allocation group is the way we scale
> internal parallelism and as long as you have more AGs than you have
> CPUs, there is very good per-CPU scalability through the filesystem
> because most operations are isolated to a single AG.  That is how we
> scale parallelism in XFS, and it has proven to scale pretty well for
> even the largest of NUMA machines. 
> 
> This is what I mean about there being an impedance mismatch between
> the way the VM and the VFS/filesystem caches scale. Fundamentally,
> the way filesystems want their caches to operate for optimal
> performance can be vastly different to the way you want shrinkers to
> operate for VM scalability. Forcing the MM way of doing stuff down
> into the LRUs and shrinkers is not a good way of solving this
> problem.

It isn't forcing anything. Maybe you didn't understand the patch
because you keep repeating this.

 
> > Subsystems that aren't important, don't take much memory, or don't
> > have much reclaim throughput are free to ignore the zone argument and
> > keep using the global input to the shrinker.
> 
> Having a global lock in a shrinker is already a major point of
> contention because shrinkers have unbound parallelism.  Hence all
> shrinkers need to be converted to use scalable structures. What we
> need _first_ is the infrastructure to do this in a sane manner, not
> tie a couple of shrinkers tightly into the mm structures and then
> walk away.

Per zone is the way to do it. The shrinker and reclaim concepts are
already tightly coupled with the mm. Memory pressure and the need
to reclaim occur solely as a function of a zone (or zones).
Adding the zone argument to the shrinker does nothing more than add
that previously missing input to the shrinker.

"I have a memory shortage in this zone, so I need to free reclaimable
objects from this zone"

This is a pretty core memory managementy idea. If you "decouple"
shrinkers from the mm any further, then you end up with something that
doesn't give shrinkers the required information.
 

> And FWIW, most subsystems that use shrinkers can be compiled in as
> modules or not compiled in at all. That'll probably leave #ifdef
> CONFIG_ crap all through the struct zone definition as they are
> converted to use your current method....

I haven't thought about how random drivers will do per-zone things.
Obviously not an all-out dumping ground in struct zone, but it does
fit for critical central caches like page, inode, and dentry.

Even if they aren't compiled out, we don't want their size bloating
things too much if they aren't loaded or in use. Probably dynamic
allocation would be the best way to go for them. Pretty simple, really.
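
E.g. something along these lines (a sketch only; the foo_* names are
invented):

        /* hypothetical module-owned shrinker with per-zone state */
        struct foo_cache {
                struct shrinker shrinker;
                struct foo_zone_lru *zlru;      /* one entry per zone */
        };

        /*
         * Allocated at register time and freed at unregister, so an
         * unloaded or unused module costs struct zone nothing.
         */
        foo->zlru = kcalloc(MAX_NUMNODES * MAX_NR_ZONES,
                            sizeof(*foo->zlru), GFP_KERNEL);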


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [patch 19/35] fs: icache remove redundant i_sb_list umount locking
  2010-10-19  3:42 ` [patch 19/35] fs: icache remove redundant i_sb_list umount locking npiggin
@ 2010-10-20 12:46   ` Al Viro
  2010-10-20 13:03     ` Nick Piggin
  0 siblings, 1 reply; 70+ messages in thread
From: Al Viro @ 2010-10-20 12:46 UTC (permalink / raw)
  To: npiggin; +Cc: linux-kernel, linux-fsdevel

On Tue, Oct 19, 2010 at 02:42:35PM +1100, npiggin@kernel.dk wrote:
> +	/*
> +	 * We can walk the per-sb list of inodes here without worrying about
> +	 * its consistency, because the list must not change during umount
> +	 * anymore, and because iprune_sem keeps shrink_icache_memory() away.
> +	 */
>  	fsnotify_unmount_inodes(&sb->s_inodes);

OK, explain to me why is that safe.  Note that fsnotify_destroy_mark()
_can_ race with umount, dropping the last reference to inode before
fsnotify_unmount_inodes() would get to it and kill it (along with the mark).
With the current code it's just fine - we walk the list under lock and
iput() won't mess with that list until it acquires the damn lock.  And
no matter who gets there first, the mark will be destroyed and reference
to inode will be dropped.

With your change, AFAICS, removal from the list can happen while we walk
it.  With obvious results.
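
Roughly, the interleaving I'm worried about (ordering illustrative):

        umount                            fsnotify_destroy_mark()
        ------                            -----------------------
        walks sb->s_inodes, unlocked
                                          drops the last reference
                                          iput() removes the inode
                                          from s_inodes and frees it
        the walk steps onto the removed
        (possibly freed) entry -> boom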

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [patch 19/35] fs: icache remove redundant i_sb_list umount locking
  2010-10-20 12:46   ` Al Viro
@ 2010-10-20 13:03     ` Nick Piggin
  2010-10-20 13:27       ` Al Viro
  0 siblings, 1 reply; 70+ messages in thread
From: Nick Piggin @ 2010-10-20 13:03 UTC (permalink / raw)
  To: Al Viro; +Cc: npiggin, linux-kernel, linux-fsdevel

On Wed, Oct 20, 2010 at 01:46:31PM +0100, Al Viro wrote:
> On Tue, Oct 19, 2010 at 02:42:35PM +1100, npiggin@kernel.dk wrote:
> > +	/*
> > +	 * We can walk the per-sb list of inodes here without worrying about
> > +	 * its consistency, because the list must not change during umount
> > +	 * anymore, and because iprune_sem keeps shrink_icache_memory() away.
> > +	 */
> >  	fsnotify_unmount_inodes(&sb->s_inodes);
> 
> OK, explain to me why is that safe.  Note that fsnotify_destroy_mark()
> _can_ race with umount, dropping the last reference to inode before
> fsnotify_unmount_inodes() would get to it and kill it (along with the mark).
> With the current code it's just fine - we walk the list under lock and
> iput() won't mess with that list until it acquires the damn lock.  And
> no matter who gets there first, the mark will be destroyed and reference
> to inode will be dropped.
> 
> With your change, AFAICS, removal from the list can happen while we walk
> it.  With obvious results.

Ah, tricky. So after fsnotify_unmount_inodes runs, the invalidate_list
is now safe because no more marks can put the inode at that point? OK I
can concede that point and drop the patch, thanks.

Where can fsnotify_destroy_mark run concurrently at umount time, can you
explain? I haven't spotted it yet.


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [patch 00/35] my inode scaling series for review
  2010-10-19  3:42 [patch 00/35] my inode scaling series for review npiggin
                   ` (35 preceding siblings ...)
  2010-10-19 16:22 ` [patch 00/35] my inode scaling series for review Christoph Hellwig
@ 2010-10-20 13:14 ` Al Viro
  2010-10-20 13:59   ` Nick Piggin
  36 siblings, 1 reply; 70+ messages in thread
From: Al Viro @ 2010-10-20 13:14 UTC (permalink / raw)
  To: npiggin; +Cc: linux-kernel, linux-fsdevel

On Tue, Oct 19, 2010 at 02:42:16PM +1100, npiggin@kernel.dk wrote:

> I don't think Dave Chinner's approach is the way to go for a number of
> reasons.
> 
> * My locking design allows i_lock to lock the entire state of the icache
>   for a particular inode. Not so with Dave's, and he had to add code not
>   required with inode_lock synchronisation or my i_lock synchronisation.
>   I prefer being very conservative about making changes, especially before
>   inode_lock is lifted (which will be the end-point of bisection for any
>   locking breakage before it).

I don't think so; in particular, hash chains protection would be more
natural *outside* of i_lock.  I'm not particularly concerned about
trylock in there, or about the width of the area where we are holding
->i_lock, but I really don't think this locking hierarchy makes _sense_.

> * As far as I can tell, I have addressed all Dave and Christoph's real
>   concerns.  The disagreement about the i_lock locking model can easily be
>   solved if they post a couple of small incremental patches to the end of the
>   series, making i_lock locking less regular and no longer protecting icache
>   state of that given inode (like inode_lock was able to pre-patchset). I've
>   repeatedly disagreed with this approach, however.

IMO you are worrying about the wrong things.  Frankly, I'm a lot more
concerned about the locking being natural for the data structures we
have and easily understood.  We *are* fscking close to the complexity
cliff, hopefully still on the right side of it.  And "if you compare that
to earlier code, you can show that it doesn't break unless the old one had
been broken too" doesn't work well for analysis of the result.  Even now,
never mind a couple of months later.

> * I have used RCU for inodes, and structured a lot of the locking around that.
>   RCU is required for store-free path walking, so it makes more sense IMO to
>   implement now rather than in a subsequent release (and reworking inode
>   locking to take advantage of it). I have a design sketched for using slab
>   RCU freeing, which is a little more complex, but it should be able to take
>   care of any real-workload regressions if we do discover them.

Separate (sub)series.

> * I implement per-zone LRU lists and locking, which are desperately required
>   for reasonable NUMA performance, and are a first step towards proper mem
>   controller control of vfs caches (Google have a similar per-zone LRU patch
>   they need for their fakenuma based memory control, I believe).

Ditto.

> * I implemented per-cpu locking for inode sb lists. The scalability and
>   single threaded performance of the full vfs-scale stack has been tested
>   quite well. Most of the vfs scales pretty linearly up to several hundreds
>   of sockets at least. I have counted cycles on various x86 and POWER
>   architectures to improve single threaded performance. It's an ongoing
>   process but there has been a lot of work done already there.
> 
>   We want all these things ASAP, so it doesn't make sense to me to stage out
>   significant locking changes in the icache code over several releases. Just
>   get them out of the way now -- the series is bisectable and reviewable, so
>   I think it will reduce churn and headache for everyone to get it out of the
>   way now.

We want all these things, but I'd prefer to have them unbundled, TYVM.  Even
if we end up merging all in one cycle.  With locking going first.

One trivial note on splitup: e.g. the patch closer to the end, introducing
inode_get() can and should be pulled all the way in front, with only the
helper to be modified when you play with i_count later.  General principle:
if you end up switching to helper anyway, better do that immediately and
do subsequent changes to that helper rather than touching every place where
it's been open-coded...
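
I.e. something like this going in first (assuming i_count is still an
atomic_t at that point in the series):

        static inline void inode_get(struct inode *inode)
        {
                atomic_inc(&inode->i_count);
        }

Then the later i_count/i_lock games only touch this one helper instead
of every open-coded atomic_inc() site.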

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [patch 19/35] fs: icache remove redundant i_sb_list umount locking
  2010-10-20 13:03     ` Nick Piggin
@ 2010-10-20 13:27       ` Al Viro
  0 siblings, 0 replies; 70+ messages in thread
From: Al Viro @ 2010-10-20 13:27 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-kernel, linux-fsdevel

On Thu, Oct 21, 2010 at 12:03:45AM +1100, Nick Piggin wrote:

> Where can fsnotify_destroy_mark run concurrently at umount time, can you
> explain? I haven't spotted it yet.

When you destroy the group the mark belongs to.  Which is not tied to
any specific fs...

Basically, these marks are associated with pairs (inode, group) and
can be kicked out by either side.  They pin inodes once set up, without
pinning the superblock down.  Main selling point of idiotify, inherited
by fsnotify.  That's why we need to go through that list on umount in
the first place - we need to evict that crap, or it'll keep inodes busy.

OTOH, removal of group also needs to kill the subset of marks - ones
belonging to that group...  grep for fsnotify_put_group() and look
for the chain through fsnotify_destroy_group() and on to
fsnotify_destroy_mark() and fsnotify_destroy_inode_mark().

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [patch 00/35] my inode scaling series for review
  2010-10-20 13:14 ` Al Viro
@ 2010-10-20 13:59   ` Nick Piggin
  0 siblings, 0 replies; 70+ messages in thread
From: Nick Piggin @ 2010-10-20 13:59 UTC (permalink / raw)
  To: Al Viro; +Cc: npiggin, linux-kernel, linux-fsdevel

On Wed, Oct 20, 2010 at 02:14:46PM +0100, Al Viro wrote:
> On Tue, Oct 19, 2010 at 02:42:16PM +1100, npiggin@kernel.dk wrote:
> 
> > I don't think Dave Chinner's approach is the way to go for a number of
> > reasons.
> > 
> > * My locking design allows i_lock to lock the entire state of the icache
> >   for a particular inode. Not so with Dave's, and he had to add code not
> >   required with inode_lock synchronisation or my i_lock synchronisation.
> >   I prefer being very conservative about making changes, especially before
> >   inode_lock is lifted (which will be the end-point of bisection for any
> >   locking breakage before it).
> 
> I don't think so; in particular, hash chains protection would be more
> natural *outside* of i_lock.  I'm not particulary concerned about
> trylock in there, or about the width of area where we are holding ->i_lock,
> but I really don't think this locking hierarchy makes _sense_.

That's a fair point; as I said, I would like to see a patch and argue
it, but I want a lock that locks the whole icache state of the inode, at
least in the first pass of breaking the locks and removing inode_lock.

In a sense, going from that point (patch 13) to a locking hierarchy
that makes _sense_ becomes easier: individual changes that are
bisectable (to breakage, rather than lifting of inode_lock) and somewhat
reviewable on their own.

 
> > * As far as I can tell, I have addressed all Dave and Christoph's real
> >   concerns.  The disagreement about the i_lock locking model can easily be
> >   solved if they post a couple of small incremental patches to the end of the
> >   series, making i_lock locking less regular and no longer protecting icache
> >   state of that given inode (like inode_lock was able to pre-patchset). I've
> >   repeatedly disagreed with this approach, however.
> 
> IMO you are worrying about the wrong things.  Frankly, I'm a lot more
> concerned about the locking being natural for the data structures we
> have and easily understood.

I still think the task of acquiring locks is smaller than that of
verifying what they protect. So it makes more sense to me to have the
state of the working object consistent and protected. The locking
hierarchy itself, as you can see, is not complex: it's flat and just
two levels.


>  We *are* fscking close to the complexity
> cliff, hopefully still on the right side of it.  And "if you compare that
> to earlier code, you can show that it doesn't break unless the old one had
> been broken too" doesn't work well for analysis of the result.  Even now,
> never mind a couple of months later.

(you're going to love the dcache series)

 
> > * I have used RCU for inodes, and structured a lot of the locking around that.
> >   RCU is required for store-free path walking, so it makes more sense IMO to
> >   implement now rather than in a subsequent release (and reworking inode
> >   locking to take advantage of it). I have a design sketched for using slab
> >   RCU freeing, which is a little more complex, but it should be able to take
> >   care of any real-workload regressions if we do discover them.
> 
> Separate (sub)series.
> 
> > * I implement per-zone LRU lists and locking, which are desperately required
> >   for reasonable NUMA performance, and are a first step towards proper mem
> >   controller control of vfs caches (Google have a similar per-zone LRU patch
> >   they need for their fakenuma based memory control, I believe).
> 
> Ditto.

This one is largely separate (about a 100-line patch at the end
of the series).

The RCU inodes have an impact on some of the locking though, so it
wouldn't be entirely independent.

 
> > * I implemented per-cpu locking for inode sb lists. The scalability and
> >   single threaded performance of the full vfs-scale stack has been tested
> >   quite well. Most of the vfs scales pretty linearly up to several hundreds
> >   of sockets at least. I have counted cycles on various x86 and POWER
> >   architectures to improve single threaded performance. It's an ongoing
> >   process but there has been a lot of work done already there.
> > 
> >   We want all these things ASAP, so it doesn't make sense to me to stage out
> >   significant locking changes in the icache code over several releases. Just
> >   get them out of the way now -- the series is bisectable and reviewable, so
> >   I think it will reduce churn and headache for everyone to get it out of the
> >   way now.
> 
> We want all these things, but I'd prefer to have them unbundled, TYVM.  Even
> if we end up merging all in one cycle.  With locking going first.
>
> One trivial note on splitup: e.g. the patch closer to the end, introducing
> inode_get() can and should be pulled all the way in front, with only the
> helper to be modified when you play with i_count later.  General principle:
> if you end up switching to helper anyway, better do that immediately and
> do subsequent changes to that helper rather than touching every place where
> it's been open-coded...

Yeah I'll do that. Thanks

^ permalink raw reply	[flat|nested] 70+ messages in thread

end of thread, other threads:[~2010-10-20 14:00 UTC | newest]

Thread overview: 70+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-10-19  3:42 [patch 00/35] my inode scaling series for review npiggin
2010-10-19  3:42 ` [patch 01/35] bit_spinlock: add required includes npiggin
2010-10-19  3:42 ` [patch 02/35] kernel: add bl_list npiggin
2010-10-19  3:42 ` [patch 03/35] mm: implement per-zone shrinker npiggin
2010-10-19  4:49   ` KOSAKI Motohiro
2010-10-19  5:33     ` Nick Piggin
2010-10-19  5:40       ` KOSAKI Motohiro
2010-10-19  3:42 ` [patch 04/35] vfs: convert inode and dentry caches to " npiggin
2010-10-19  3:42 ` [patch 05/35] fs: icache lock s_inodes list npiggin
2010-10-19  3:42 ` [patch 06/35] fs: icache lock inode hash npiggin
2010-10-19  3:42 ` [patch 07/35] fs: icache lock i_state npiggin
2010-10-19 10:47   ` Miklos Szeredi
2010-10-19 17:06     ` Peter Zijlstra
2010-10-19  3:42 ` [patch 08/35] fs: icache lock i_count npiggin
2010-10-19 10:16   ` Boaz Harrosh
2010-10-20  2:14     ` Nick Piggin
2010-10-19  3:42 ` [patch 09/35] fs: icache lock lru/writeback lists npiggin
2010-10-19  3:42 ` [patch 10/35] fs: icache atomic inodes_stat npiggin
2010-10-19  3:42 ` [patch 11/35] fs: icache lock inode state npiggin
2010-10-19  3:42 ` [patch 12/35] fs: inode atomic last_ino, iunique lock npiggin
2010-10-19  3:42 ` [patch 13/35] fs: icache remove inode_lock npiggin
2010-10-19  3:42 ` [patch 14/35] fs: icache factor hash lock into functions npiggin
2010-10-19  3:42 ` [patch 15/35] fs: icache per-bucket inode hash locks npiggin
2010-10-19  3:42 ` [patch 16/35] fs: icache lazy inode lru npiggin
2010-10-19  3:42 ` [patch 17/35] fs: icache RCU free inodes npiggin
2010-10-19  3:42 ` [patch 18/35] fs: avoid inode RCU freeing for pseudo fs npiggin
2010-10-19  3:42 ` [patch 19/35] fs: icache remove redundant i_sb_list umount locking npiggin
2010-10-20 12:46   ` Al Viro
2010-10-20 13:03     ` Nick Piggin
2010-10-20 13:27       ` Al Viro
2010-10-19  3:42 ` [patch 20/35] fs: icache rcu walk for i_sb_list npiggin
2010-10-19  3:42 ` [patch 21/35] fs: icache per-cpu nr_inodes, non-atomic nr_unused counters npiggin
2010-10-19  3:42 ` [patch 22/35] fs: icache per-cpu last_ino allocator npiggin
2010-10-19  3:42 ` [patch 23/35] fs: icache use per-CPU lists and locks for sb inode lists npiggin
2010-10-19 15:33   ` Miklos Szeredi
2010-10-20  2:37     ` Nick Piggin
2010-10-19  3:42 ` [patch 24/35] fs: icache use RCU to avoid locking in hash lookups npiggin
2010-10-19  3:42 ` [patch 25/35] fs: icache reduce some locking overheads npiggin
2010-10-19  3:42 ` [patch 26/35] fs: icache alloc anonymous inode allocation npiggin
2010-10-19 15:50   ` Miklos Szeredi
2010-10-20  2:38     ` Nick Piggin
2010-10-19 16:33   ` Christoph Hellwig
2010-10-20  3:07     ` Nick Piggin
2010-10-19  3:42 ` [patch 27/35] fs: icache split IO and LRU lists npiggin
2010-10-19 16:12   ` Miklos Szeredi
2010-10-20  2:41     ` Nick Piggin
2010-10-19  3:42 ` [patch 28/35] fs: icache split writeback and lru locks npiggin
2010-10-19  3:42 ` [patch 29/35] fs: icache per-bdi writeback list locking npiggin
2010-10-19  3:42 ` [patch 30/35] fs: icache lazy LRU avoid LRU locking after IO operation npiggin
2010-10-19  3:42 ` [patch 31/35] fs: icache per-zone inode LRU npiggin
2010-10-19 12:38   ` Dave Chinner
2010-10-20  2:35     ` Nick Piggin
2010-10-20  3:12       ` Nick Piggin
2010-10-20  9:43         ` Dave Chinner
2010-10-20 10:02           ` Nick Piggin
2010-10-20  3:14     ` KOSAKI Motohiro
2010-10-20  3:20       ` Nick Piggin
2010-10-20  3:29         ` KOSAKI Motohiro
2010-10-20 10:19         ` Dave Chinner
2010-10-20 10:41           ` Nick Piggin
2010-10-19  3:42 ` [patch 32/35] fs: icache minimise I_FREEING latency npiggin
2010-10-19  3:42 ` [patch 33/35] fs: icache introduce inode_get/inode_get_ilock npiggin
2010-10-19 10:17   ` Boaz Harrosh
2010-10-20  2:17     ` Nick Piggin
2010-10-19  3:42 ` [patch 34/35] fs: inode rename i_count to i_refs npiggin
2010-10-19  3:42 ` [patch 35/35] fs: icache document more lock orders npiggin
2010-10-19 16:22 ` [patch 00/35] my inode scaling series for review Christoph Hellwig
2010-10-20  3:05   ` Nick Piggin
2010-10-20 13:14 ` Al Viro
2010-10-20 13:59   ` Nick Piggin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).