* [PATCH 0/2] ext4: increase mbcache scalability
@ 2013-07-18  0:55 T Makphaibulchoke
  2013-08-22 15:54 ` [PATCH v2 " T Makphaibulchoke
                   ` (3 more replies)
  0 siblings, 4 replies; 63+ messages in thread
From: T Makphaibulchoke @ 2013-07-18  0:55 UTC (permalink / raw)
  To: viro, tytso, adilger.kernel, linux-fsdevel, linux-kernel,
	linux-ext4, tmakph
  Cc: aswin

This patch set intends to improve the scalability of an ext4 filesystem by
introducing a higher degree of parallelism to the use of its mb_cache and
mb_cache_entry structures.
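
To illustrate the general idea (the names below are made up for illustration
only; the patches in this thread contain the actual implementation), the hash
chain walks stop funneling through a single global lock and instead take a
lock that covers only the chain being walked:

/* Illustrative sketch only -- not taken from the patches. */
#include <linux/kernel.h>
#include <linux/list.h>
#include <linux/spinlock.h>

#define TOY_NUM_BUCKETS	(1 << 6)

struct toy_cache {
	struct list_head	buckets[TOY_NUM_BUCKETS];
	spinlock_t		bucket_locks[TOY_NUM_BUCKETS];	/* one lock per chain */
};

/* Before: every walker serializes on one global lock. */
static DEFINE_SPINLOCK(toy_global_lock);

static void toy_walk_global(struct toy_cache *cache, unsigned int bucket)
{
	spin_lock(&toy_global_lock);
	/* ... walk cache->buckets[bucket] ... */
	spin_unlock(&toy_global_lock);
}

/* After: walkers of different chains no longer contend with each other. */
static void toy_walk_per_chain(struct toy_cache *cache, unsigned int bucket)
{
	spin_lock(&cache->bucket_locks[bucket]);
	/* ... walk cache->buckets[bucket] ... */
	spin_unlock(&cache->bucket_locks[bucket]);
}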

Here are some of the benchmark results with the changes. 

On a 90 core machine:

Here are the performance improvements in some of the aim7 workloads,

---------------------------
|             | % increase |
---------------------------
| custom      |     13.30  |
---------------------------
| disk        |      5.00  |  
---------------------------
| new_fserver |      6.73  |  
---------------------------
| shared      |      7.12  |  
---------------------------

For Swingbench dss workload,

-------------------------------------------------------------------------------
| Users        | 100  | 200  | 300  | 400  | 500  | 600  | 700  | 800  | 900  |
-------------------------------------------------------------------------------
| % improvement| 9.14 | 9.26 | 19.9 | 5.31 | 7.79 | 0.23 |-4.39 |-4.48 |-4.36 |
-------------------------------------------------------------------------------

For SPECjbb2013, composite run,

--------------------------------------------
|               | max-jOPS | critical-jOPS |
--------------------------------------------
| % improvement |   6.73   |     16.37     |
--------------------------------------------



On an 80 core machine:

Here are the improvements in some of the aim7 workloads,
 
---------------------------
|             | % increase |
---------------------------
| fserver     |      6.47  |
---------------------------
| new_fserver |      4.62  |  
---------------------------

For Swingbench dss workload,

-------------------------------------------------------------------------------
| Users        | 100  | 200  | 300  | 400  | 500  | 600  | 700  | 800  | 900  |
-------------------------------------------------------------------------------
| % improvement| 4.29 | 2.28 | 3.25 | 1.45 |-0.84 | 3.11 |-0.61 | 3.80 | 2.58 |
-------------------------------------------------------------------------------

For SPECjbb2013, composite run,

--------------------------------------------
|               | max-jOPS | critical-jOPS |
--------------------------------------------
| % improvement |   0.49   |      3.21     |
--------------------------------------------

The changes have been tested with ext4 xfstests to verify that no regression
has been introduced.  


 fs/ext4/ext4.h          |   1 +
 fs/ext4/super.c         |   8 ++
 fs/ext4/xattr.c         |  41 +++++---
 fs/ext4/xattr.h         |   3 +
 fs/mbcache.c            | 432 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++------------------
 include/linux/mbcache.h |   5 +
 6 files changed, 385 insertions(+), 105 deletions(-)



* Re: [PATCH v2 1/2] mbcache: decoupling the locking of local from global data
  2013-08-22 16:53     ` Linus Torvalds
@ 2013-08-22 15:33       ` Thavatchai Makphaibulchoke
  0 siblings, 0 replies; 63+ messages in thread
From: Thavatchai Makphaibulchoke @ 2013-08-22 15:33 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: T Makphaibulchoke, Theodore Ts'o, adilger.kernel, Al Viro,
	linux-ext4, Linux Kernel Mailing List, linux-fsdevel,
	Chandramouleeswaran, Aswin

Thanks for the comments.

On 08/22/2013 04:53 PM, Linus Torvalds wrote:
> 
> Please don't do these ugly and pointless preprocessor macro expanders
> that hide what the actual operation is.
> 
> The DEBUG case seems to be just for your own testing anyway, so even
> that shouldn't exist in the merged version.
> 

Sorry, will clean them all up.

> 
> And this is really ugly. Again it's also then hidden behind the ugly macro.
> 
> First off, the thousand-time retry seems completely excessive. Does it
> actually need any retry AT ALL? If the hash entry changes, either you
> should retry forever, or if you feel that can result in livelocks
> (fair enough) and you need a fallback case to a bigger lock, then why
> not just do the fallback immediately?
> 
> More importantly, regardless of that retry issue, this seems to be
> abstracted at the wrong level, resulting in every single user of this
> repeating the same complex and hard-to-understand incantation:
> 

Looks like this is a misjudgement on my part.  There is really no need to guard against an mb_cache_entry moving to a different hash chain, as the shrinking and allocation functions already protect against each other through mb_cache_spinlock.  The retry is not needed.
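
With that, the per-entry locking helper collapses to a plain chain lock,
roughly like the sketch below (helper names made up here; not the final code):

/*
 * Sketch only: since the shrinker and the allocation path already serialize
 * against each other through mb_cache_spinlock, the entry cannot move to
 * another block hash chain underneath us, so the chain can be locked
 * directly with no retry loop and no fallback.
 */
static inline void mb_cache_entry_lock_block_hash(struct mb_cache_entry *ce)
{
	hlist_bl_lock(ce->e_block_hash_p);
}

static inline void mb_cache_entry_unlock_block_hash(struct mb_cache_entry *ce)
{
	hlist_bl_unlock(ce->e_block_hash_p);
}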

> 
> where the only difference is that the last one doesn't unlock
> afterwards because it runs in a loop with that LRU list lock held.
> Ugh.

Following the above logic, all these pieces of code are also unnecessary and could be reduced to a simple unhash, as in the original.

> 
> The locking logic also isn't explained anywhere, making the
> hard-to-read code even harder to read.


Will add a comment explaining the locking logic.

> 
>              Linus
> 

Thanks,
Mak.



* [PATCH v2 0/2] ext4: increase mbcache scalability
  2013-07-18  0:55 [PATCH 0/2] ext4: increase mbcache scalability T Makphaibulchoke
@ 2013-08-22 15:54 ` T Makphaibulchoke
  2013-08-22 15:54   ` [PATCH v2 1/2] mbcache: decoupling the locking of local from global data T Makphaibulchoke
                     ` (2 more replies)
  2013-09-04 16:39 ` [PATCH v3 0/2] " T Makphaibulchoke
                   ` (2 subsequent siblings)
  3 siblings, 3 replies; 63+ messages in thread
From: T Makphaibulchoke @ 2013-08-22 15:54 UTC (permalink / raw)
  To: tytso, adilger.kernel, viro, linux-ext4, linux-kernel, linux-fsdevel
  Cc: aswin, torvalds, T Makphaibulchoke

The patch set consists of two parts.  The first part introduces a higher
degree of parallelism to the use of the mb_cache and mb_cache_entry
structures and impacts all ext filesystems.

The second part further increases the scalability of an ext4 filesystem by
having each ext4 filesystem allocate and use its own private mbcache
structure, instead of sharing a single mbcache structure across all ext4
filesystems.
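
For the second part, the mount-time hookup sketched below shows the intent
(illustrative only; the helper name ext4_setup_private_mb_cache() is made up,
and the actual change, including teardown in ext4_put_super(), is in patch 2/2):

/* Illustrative sketch of the per-filesystem cache; see patch 2/2 for the
 * real change. */
static int ext4_setup_private_mb_cache(struct super_block *sb)
{
	struct ext4_sb_info *sbi = EXT4_SB(sb);

	/* One mb_cache per mounted ext4 filesystem, named after the device. */
	sbi->s_mb_cache = ext4_xattr_create_cache(sb->s_id);
	if (!sbi->s_mb_cache)
		return -ENOMEM;
	return 0;
}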

Here are some of the benchmark results with the changes. 

On a 90 core machine:

Here are the performance improvements in some of the aim7 workloads,

---------------------------
|             | % increase |
---------------------------
| alltests    |     11.85  |
---------------------------
| custom      |     14.42  |
---------------------------
| fserver     |     21.36  |  
---------------------------
| new_dbase   |      5.59  |
---------------------------
| new_fserver |     21.45  |  
---------------------------
| shared      |     12.84  |  
---------------------------
For Swingbench dss workload, with 16 GB database,

-------------------------------------------------------------------------------
| Users        | 100  | 200  | 300  | 400  | 500  | 600  | 700  | 800  | 900  |
-------------------------------------------------------------------------------
| % improvement| 8.46 | 8.00 | 7.35 | -.313| 1.09 | 0.69 | 0.30 | 2.18 | 5.23 |
-------------------------------------------------------------------------------
| % improvement|45.66 |47.62 |34.54 |25.15 |15.29 | 3.38 | -8.7 |-4.98 |-7.86 |
| without using|      |      |      |      |      |      |      |      |      |
| shared memory|      |      |      |      |      |      |      |      |      |
-------------------------------------------------------------------------------
For SPECjbb2013, composite run,

--------------------------------------------
|               | max-jOPS | critical-jOPS |
--------------------------------------------
| % improvement |   5.99   |     N/A       |
--------------------------------------------


On an 80 core machine:

The aim7 results for most of the workloads turn out to be the same.

Here are the results of Swingbench dss workload,

-------------------------------------------------------------------------------
| Users        | 100  | 200  | 300  | 400  | 500  | 600  | 700  | 800  | 900  |
-------------------------------------------------------------------------------
| % improvement|-1.79 | 0.37 | 1.36 | 0.08 | 1.66 | 2.09 | 1.16 | 1.48 | 1.92 |
-------------------------------------------------------------------------------

The changes have been tested with ext4 xfstests to verify that no regression
has been introduced. 

Changed in v2:
	- New performance data
	- New diff summary

T Makphaibulchoke (2):
  mbcache: decoupling the locking of local from global data
  ext4: each filesystem creates and uses its own mb_cache

 fs/ext4/ext4.h          |   1 +
 fs/ext4/super.c         |  24 ++--
 fs/ext4/xattr.c         |  51 +++++----
 fs/ext4/xattr.h         |   6 +-
 fs/mbcache.c            | 293 +++++++++++++++++++++++++++++++++++-------------
 include/linux/mbcache.h |  10 +-
 6 files changed, 269 insertions(+), 116 deletions(-)

-- 
1.7.11.3



* [PATCH v2 1/2] mbcache: decoupling the locking of local from global data
  2013-08-22 15:54 ` [PATCH v2 " T Makphaibulchoke
@ 2013-08-22 15:54   ` T Makphaibulchoke
  2013-08-22 16:53     ` Linus Torvalds
  2013-08-22 15:54     ` T Makphaibulchoke
  2014-01-24 18:31   ` [PATCH v4 0/3] ext4: increase mbcache scalability T Makphaibulchoke
  2 siblings, 1 reply; 63+ messages in thread
From: T Makphaibulchoke @ 2013-08-22 15:54 UTC (permalink / raw)
  To: tytso, adilger.kernel, viro, linux-ext4, linux-kernel, linux-fsdevel
  Cc: aswin, torvalds, T Makphaibulchoke

The patch increases the parallelism of mb_cache_entry utilization by using
the bit locks built into the hlist_bl_head hash chain heads to protect the
mb_cache local block and index hash chains, while the global
mb_cache_lru_list and mb_cache_list continue to be protected by the global
mb_cache_spinlock.

Signed-off-by: T. Makphaibulchoke <tmac@hp.com>
---
Changed in v2:
	- As per Linus Torvalds' suggestion, instead of allocating spinlock
	  arrays to protect the hash chains, use hlist_bl_head, which already
	  contains a built-in lock.

 fs/mbcache.c            | 293 +++++++++++++++++++++++++++++++++++-------------
 include/linux/mbcache.h |  10 +-
 2 files changed, 221 insertions(+), 82 deletions(-)

diff --git a/fs/mbcache.c b/fs/mbcache.c
index 8c32ef3..cd4446e 100644
--- a/fs/mbcache.c
+++ b/fs/mbcache.c
@@ -26,6 +26,16 @@
  * back on the lru list.
  */
 
+/* Locking protocol:
+ *
+ * The nth hash chain of both the c_block_hash and c_index_hash are
+ * protected by the mth entry of the c_bdev_locks and c_key_locks respectively,
+ * where m is equal to n & c_lock_mask.
+ *
+ * While holding a c_bdev_locks, a thread can acquire either a c_key_locks
+ * or mb_cache_spinlock.
+ */
+
 #include <linux/kernel.h>
 #include <linux/module.h>
 
@@ -35,9 +45,9 @@
 #include <linux/slab.h>
 #include <linux/sched.h>
 #include <linux/init.h>
+#include <linux/list_bl.h>
 #include <linux/mbcache.h>
 
-
 #ifdef MB_CACHE_DEBUG
 # define mb_debug(f...) do { \
 		printk(KERN_DEBUG f); \
@@ -57,6 +67,40 @@
 
 #define MB_CACHE_WRITER ((unsigned short)~0U >> 1)
 
+#define	MB_LOCK_HASH_CHAIN(hash_head)	hlist_bl_lock(hash_head)
+#define	MB_UNLOCK_HASH_CHAIN(hash_head)	hlist_bl_unlock(hash_head)
+#ifdef MB_CACHE_DEBUG
+#define	MB_LOCK_BLOCK_HASH(ce) do {					\
+		struct hlist_bl_head *block_hash_p = ce->e_block_hash_p;\
+		MB_LOCK_HASH_CHAIN(block_hash_p);			\
+		mb_assert(ce->e_block_hash_p == block_hash_p);		\
+	} while (0)
+#else
+#define	MB_LOCK_BLOCK_HASH(ce)	MB_LOCK_HASH_CHAIN(ce->e_block_hash_p)
+#endif
+#define	MB_UNLOCK_BLOCK_HASH(ce)	MB_UNLOCK_HASH_CHAIN(ce->e_block_hash_p)
+#define	MB_LOCK_INDEX_HASH(ce)		MB_LOCK_HASH_CHAIN(ce->e_index_hash_p)
+#define	MB_UNLOCK_INDEX_HASH(ce)	MB_UNLOCK_HASH_CHAIN(ce->e_index_hash_p)
+
+#define	MAX_LOCK_RETRY	1024
+
+static inline int __mb_cache_lock_entry_block_hash(struct mb_cache_entry *ce)
+{
+	int nretry = 0;
+	struct hlist_bl_head *block_hash_p = ce->e_block_hash_p;
+
+	do {
+		MB_LOCK_HASH_CHAIN(block_hash_p);
+		if (ce->e_block_hash_p == block_hash_p)
+			return 0;
+		MB_UNLOCK_HASH_CHAIN(block_hash_p);
+		block_hash_p = ce->e_block_hash_p;
+	} while (++nretry < MAX_LOCK_RETRY);
+	return -1;
+}
+
+#define	MB_LOCK_ENTRY_BLOCK_HASH(ce)	__mb_cache_lock_entry_block_hash(ce)
+
 static DECLARE_WAIT_QUEUE_HEAD(mb_cache_queue);
 		
 MODULE_AUTHOR("Andreas Gruenbacher <a.gruenbacher@computer.org>");
@@ -101,7 +145,7 @@ static struct shrinker mb_cache_shrinker = {
 static inline int
 __mb_cache_entry_is_hashed(struct mb_cache_entry *ce)
 {
-	return !list_empty(&ce->e_block_list);
+	return !hlist_bl_unhashed(&ce->e_block_list);
 }
 
 
@@ -109,11 +153,26 @@ static void
 __mb_cache_entry_unhash(struct mb_cache_entry *ce)
 {
 	if (__mb_cache_entry_is_hashed(ce)) {
-		list_del_init(&ce->e_block_list);
-		list_del(&ce->e_index.o_list);
+		hlist_bl_del_init(&ce->e_block_list);
+		MB_LOCK_INDEX_HASH(ce);
+		hlist_bl_del(&ce->e_index.o_list);
+		MB_UNLOCK_INDEX_HASH(ce);
 	}
 }
 
+#ifdef	MAK_SRC
+static void
+__mb_cache_entry_unhash_nolock(struct mb_cache_entry *ce)
+{
+	mb_assert(!__mb_cache_entry_is_hashed(ce));
+	mb_assert(!ce->e_block_hash_p);
+	mb_assert(!ce->e_index_hash_p);
+
+	hlist_bl_del_init(&ce->e_block_list);
+	hlist_bl_del(&ce->e_index.o_list);
+}
+#endif	/* MAK_SRC */
+
 
 static void
 __mb_cache_entry_forget(struct mb_cache_entry *ce, gfp_t gfp_mask)
@@ -127,9 +186,10 @@ __mb_cache_entry_forget(struct mb_cache_entry *ce, gfp_t gfp_mask)
 
 
 static void
-__mb_cache_entry_release_unlock(struct mb_cache_entry *ce)
-	__releases(mb_cache_spinlock)
+__mb_cache_entry_release_unlock(struct mb_cache_entry *ce,
+	struct hlist_bl_head *hash_p)
 {
+	mb_assert(hash_p);
 	/* Wake up all processes queuing for this cache entry. */
 	if (ce->e_queued)
 		wake_up_all(&mb_cache_queue);
@@ -139,13 +199,17 @@ __mb_cache_entry_release_unlock(struct mb_cache_entry *ce)
 	if (!(ce->e_used || ce->e_queued)) {
 		if (!__mb_cache_entry_is_hashed(ce))
 			goto forget;
-		mb_assert(list_empty(&ce->e_lru_list));
-		list_add_tail(&ce->e_lru_list, &mb_cache_lru_list);
-	}
-	spin_unlock(&mb_cache_spinlock);
+		MB_UNLOCK_HASH_CHAIN(hash_p);
+		spin_lock(&mb_cache_spinlock);
+		if (list_empty(&ce->e_lru_list))
+			list_add_tail(&ce->e_lru_list, &mb_cache_lru_list);
+		spin_unlock(&mb_cache_spinlock);
+	} else
+		MB_UNLOCK_HASH_CHAIN(hash_p);
 	return;
 forget:
-	spin_unlock(&mb_cache_spinlock);
+	MB_UNLOCK_HASH_CHAIN(hash_p);
+	mb_assert(list_empty(&ce->e_lru_list));
 	__mb_cache_entry_forget(ce, GFP_KERNEL);
 }
 
@@ -164,31 +228,45 @@ forget:
 static int
 mb_cache_shrink_fn(struct shrinker *shrink, struct shrink_control *sc)
 {
-	LIST_HEAD(free_list);
 	struct mb_cache *cache;
-	struct mb_cache_entry *entry, *tmp;
 	int count = 0;
 	int nr_to_scan = sc->nr_to_scan;
 	gfp_t gfp_mask = sc->gfp_mask;
+	int max_loop = nr_to_scan << 1;
 
 	mb_debug("trying to free %d entries", nr_to_scan);
-	spin_lock(&mb_cache_spinlock);
-	while (nr_to_scan-- && !list_empty(&mb_cache_lru_list)) {
-		struct mb_cache_entry *ce =
-			list_entry(mb_cache_lru_list.next,
-				   struct mb_cache_entry, e_lru_list);
-		list_move_tail(&ce->e_lru_list, &free_list);
+	while ((nr_to_scan-- > 0) && (max_loop-- > 0)) {
+		struct mb_cache_entry *ce;
+
+		spin_lock(&mb_cache_spinlock);
+		if (list_empty(&mb_cache_lru_list)) {
+			spin_unlock(&mb_cache_spinlock);
+			break;
+		}
+		ce = list_entry(mb_cache_lru_list.next,
+			struct mb_cache_entry, e_lru_list);
+		list_del_init(&ce->e_lru_list);
+		spin_unlock(&mb_cache_spinlock);
+
+		if (MB_LOCK_ENTRY_BLOCK_HASH(ce)) {
+			spin_lock(&mb_cache_spinlock);
+			list_add_tail(&ce->e_lru_list, &mb_cache_lru_list);
+			spin_unlock(&mb_cache_spinlock);
+			continue;
+		}
+		mb_assert(!(ce->e_used || ce->e_queued));
 		__mb_cache_entry_unhash(ce);
+		MB_UNLOCK_BLOCK_HASH(ce);
+		__mb_cache_entry_forget(ce, gfp_mask);
+		nr_to_scan--;
 	}
+	spin_lock(&mb_cache_spinlock);
 	list_for_each_entry(cache, &mb_cache_list, c_cache_list) {
 		mb_debug("cache %s (%d)", cache->c_name,
 			  atomic_read(&cache->c_entry_count));
 		count += atomic_read(&cache->c_entry_count);
 	}
 	spin_unlock(&mb_cache_spinlock);
-	list_for_each_entry_safe(entry, tmp, &free_list, e_lru_list) {
-		__mb_cache_entry_forget(entry, gfp_mask);
-	}
 	return (count / 100) * sysctl_vfs_cache_pressure;
 }
 
@@ -216,18 +294,18 @@ mb_cache_create(const char *name, int bucket_bits)
 	cache->c_name = name;
 	atomic_set(&cache->c_entry_count, 0);
 	cache->c_bucket_bits = bucket_bits;
-	cache->c_block_hash = kmalloc(bucket_count * sizeof(struct list_head),
-	                              GFP_KERNEL);
+	cache->c_block_hash = kmalloc(bucket_count *
+		sizeof(struct hlist_bl_head), GFP_KERNEL);
 	if (!cache->c_block_hash)
 		goto fail;
 	for (n=0; n<bucket_count; n++)
-		INIT_LIST_HEAD(&cache->c_block_hash[n]);
-	cache->c_index_hash = kmalloc(bucket_count * sizeof(struct list_head),
-				      GFP_KERNEL);
+		INIT_HLIST_BL_HEAD(&cache->c_block_hash[n]);
+	cache->c_index_hash = kmalloc(bucket_count *
+		sizeof(struct hlist_bl_head), GFP_KERNEL);
 	if (!cache->c_index_hash)
 		goto fail;
 	for (n=0; n<bucket_count; n++)
-		INIT_LIST_HEAD(&cache->c_index_hash[n]);
+		INIT_HLIST_BL_HEAD(&cache->c_index_hash[n]);
 	cache->c_entry_cache = kmem_cache_create(name,
 		sizeof(struct mb_cache_entry), 0,
 		SLAB_RECLAIM_ACCOUNT|SLAB_MEM_SPREAD, NULL);
@@ -276,13 +354,22 @@ mb_cache_shrink(struct block_device *bdev)
 			list_entry(l, struct mb_cache_entry, e_lru_list);
 		if (ce->e_bdev == bdev) {
 			list_move_tail(&ce->e_lru_list, &free_list);
-			__mb_cache_entry_unhash(ce);
 		}
 	}
 	spin_unlock(&mb_cache_spinlock);
 	list_for_each_safe(l, ltmp, &free_list) {
-		__mb_cache_entry_forget(list_entry(l, struct mb_cache_entry,
-						   e_lru_list), GFP_KERNEL);
+		struct mb_cache_entry *ce =
+			list_entry(l, struct mb_cache_entry, e_lru_list);
+		if (MB_LOCK_ENTRY_BLOCK_HASH(ce)) {
+			spin_lock(&mb_cache_spinlock);
+			list_add_tail(&ce->e_lru_list, &mb_cache_lru_list);
+			spin_unlock(&mb_cache_spinlock);
+			continue;
+		}
+		mb_assert(!ce->e_used && !ce->e_queued);
+		__mb_cache_entry_unhash(ce);
+		MB_UNLOCK_BLOCK_HASH(ce);
+		__mb_cache_entry_forget(ce, GFP_KERNEL);
 	}
 }
 
@@ -306,15 +393,18 @@ mb_cache_destroy(struct mb_cache *cache)
 			list_entry(l, struct mb_cache_entry, e_lru_list);
 		if (ce->e_cache == cache) {
 			list_move_tail(&ce->e_lru_list, &free_list);
-			__mb_cache_entry_unhash(ce);
 		}
 	}
 	list_del(&cache->c_cache_list);
 	spin_unlock(&mb_cache_spinlock);
 
 	list_for_each_safe(l, ltmp, &free_list) {
-		__mb_cache_entry_forget(list_entry(l, struct mb_cache_entry,
-						   e_lru_list), GFP_KERNEL);
+		struct mb_cache_entry *ce =
+			list_entry(l, struct mb_cache_entry, e_lru_list);
+		MB_LOCK_BLOCK_HASH(ce);
+		__mb_cache_entry_unhash(ce);
+		MB_UNLOCK_BLOCK_HASH(ce);
+		__mb_cache_entry_forget(ce, GFP_KERNEL);
 	}
 
 	if (atomic_read(&cache->c_entry_count) > 0) {
@@ -344,12 +434,26 @@ mb_cache_entry_alloc(struct mb_cache *cache, gfp_t gfp_flags)
 	struct mb_cache_entry *ce = NULL;
 
 	if (atomic_read(&cache->c_entry_count) >= cache->c_max_entries) {
+		struct list_head *l;
+		struct list_head *ltmp;
+
 		spin_lock(&mb_cache_spinlock);
-		if (!list_empty(&mb_cache_lru_list)) {
-			ce = list_entry(mb_cache_lru_list.next,
-					struct mb_cache_entry, e_lru_list);
-			list_del_init(&ce->e_lru_list);
-			__mb_cache_entry_unhash(ce);
+		list_for_each_safe(l, ltmp, &mb_cache_lru_list) {
+			ce = list_entry(l, struct mb_cache_entry, e_lru_list);
+			if (ce->e_cache == cache) {
+				list_del_init(&ce->e_lru_list);
+				spin_unlock(&mb_cache_spinlock);
+				if (MB_LOCK_ENTRY_BLOCK_HASH(ce)) {
+					spin_lock(&mb_cache_spinlock);
+					list_add_tail(&ce->e_lru_list,
+						&mb_cache_lru_list);
+					continue;
+				}
+				mb_assert(!ce->e_used && !ce->e_queued);
+				__mb_cache_entry_unhash(ce);
+				MB_UNLOCK_BLOCK_HASH(ce);
+				break;
+			}
 		}
 		spin_unlock(&mb_cache_spinlock);
 	}
@@ -359,9 +463,12 @@ mb_cache_entry_alloc(struct mb_cache *cache, gfp_t gfp_flags)
 			return NULL;
 		atomic_inc(&cache->c_entry_count);
 		INIT_LIST_HEAD(&ce->e_lru_list);
-		INIT_LIST_HEAD(&ce->e_block_list);
+		INIT_HLIST_BL_NODE(&ce->e_block_list);
+		INIT_HLIST_BL_NODE(&ce->e_index.o_list);
 		ce->e_cache = cache;
 		ce->e_queued = 0;
+		ce->e_block_hash_p = &cache->c_block_hash[0];
+		ce->e_index_hash_p = &cache->c_index_hash[0];
 	}
 	ce->e_used = 1 + MB_CACHE_WRITER;
 	return ce;
@@ -388,28 +495,36 @@ mb_cache_entry_insert(struct mb_cache_entry *ce, struct block_device *bdev,
 {
 	struct mb_cache *cache = ce->e_cache;
 	unsigned int bucket;
-	struct list_head *l;
+	struct hlist_bl_node *l;
 	int error = -EBUSY;
+	struct hlist_bl_head *block_hash_p;
+	struct hlist_bl_head *index_hash_p;
+	struct mb_cache_entry *lce;
 
 	bucket = hash_long((unsigned long)bdev + (block & 0xffffffff), 
 			   cache->c_bucket_bits);
-	spin_lock(&mb_cache_spinlock);
-	list_for_each_prev(l, &cache->c_block_hash[bucket]) {
-		struct mb_cache_entry *ce =
-			list_entry(l, struct mb_cache_entry, e_block_list);
-		if (ce->e_bdev == bdev && ce->e_block == block)
+	block_hash_p = &cache->c_block_hash[bucket];
+	MB_LOCK_HASH_CHAIN(block_hash_p);
+	hlist_bl_for_each_entry(lce, l, block_hash_p, e_block_list) {
+		if (lce->e_bdev == bdev && lce->e_block == block)
 			goto out;
 	}
+	mb_assert(!__mb_cache_entry_is_hashed(ce));
 	__mb_cache_entry_unhash(ce);
 	ce->e_bdev = bdev;
 	ce->e_block = block;
-	list_add(&ce->e_block_list, &cache->c_block_hash[bucket]);
+	hlist_bl_add_head(&ce->e_block_list, block_hash_p);
+	ce->e_block_hash_p = block_hash_p;
 	ce->e_index.o_key = key;
 	bucket = hash_long(key, cache->c_bucket_bits);
-	list_add(&ce->e_index.o_list, &cache->c_index_hash[bucket]);
+	index_hash_p = &cache->c_index_hash[bucket];
+	MB_LOCK_HASH_CHAIN(index_hash_p);
+	hlist_bl_add_head(&ce->e_index.o_list, index_hash_p);
+	ce->e_index_hash_p = index_hash_p;
+	MB_UNLOCK_HASH_CHAIN(index_hash_p);
 	error = 0;
 out:
-	spin_unlock(&mb_cache_spinlock);
+	MB_UNLOCK_HASH_CHAIN(block_hash_p);
 	return error;
 }
 
@@ -424,8 +539,8 @@ out:
 void
 mb_cache_entry_release(struct mb_cache_entry *ce)
 {
-	spin_lock(&mb_cache_spinlock);
-	__mb_cache_entry_release_unlock(ce);
+	MB_LOCK_BLOCK_HASH(ce);
+	__mb_cache_entry_release_unlock(ce, ce->e_block_hash_p);
 }
 
 
@@ -438,10 +553,12 @@ mb_cache_entry_release(struct mb_cache_entry *ce)
 void
 mb_cache_entry_free(struct mb_cache_entry *ce)
 {
-	spin_lock(&mb_cache_spinlock);
+	mb_assert(ce);
+	mb_assert(ce->e_block_hash_p);
+	MB_LOCK_BLOCK_HASH(ce);
 	mb_assert(list_empty(&ce->e_lru_list));
 	__mb_cache_entry_unhash(ce);
-	__mb_cache_entry_release_unlock(ce);
+	__mb_cache_entry_release_unlock(ce, ce->e_block_hash_p);
 }
 
 
@@ -458,34 +575,39 @@ mb_cache_entry_get(struct mb_cache *cache, struct block_device *bdev,
 		   sector_t block)
 {
 	unsigned int bucket;
-	struct list_head *l;
+	struct hlist_bl_node *l;
 	struct mb_cache_entry *ce;
+	struct hlist_bl_head *block_hash_p;
 
 	bucket = hash_long((unsigned long)bdev + (block & 0xffffffff),
 			   cache->c_bucket_bits);
-	spin_lock(&mb_cache_spinlock);
-	list_for_each(l, &cache->c_block_hash[bucket]) {
-		ce = list_entry(l, struct mb_cache_entry, e_block_list);
+	block_hash_p = &cache->c_block_hash[bucket];
+	MB_LOCK_HASH_CHAIN(block_hash_p);
+	hlist_bl_for_each_entry(ce, l, block_hash_p, e_block_list) {
+		mb_assert(ce->e_block_hash_p == block_hash_p);
 		if (ce->e_bdev == bdev && ce->e_block == block) {
 			DEFINE_WAIT(wait);
 
+			spin_lock(&mb_cache_spinlock);
 			if (!list_empty(&ce->e_lru_list))
 				list_del_init(&ce->e_lru_list);
+			spin_unlock(&mb_cache_spinlock);
 
 			while (ce->e_used > 0) {
 				ce->e_queued++;
 				prepare_to_wait(&mb_cache_queue, &wait,
 						TASK_UNINTERRUPTIBLE);
-				spin_unlock(&mb_cache_spinlock);
+				MB_UNLOCK_BLOCK_HASH(ce);
 				schedule();
-				spin_lock(&mb_cache_spinlock);
+				MB_LOCK_BLOCK_HASH(ce);
 				ce->e_queued--;
 			}
 			finish_wait(&mb_cache_queue, &wait);
 			ce->e_used += 1 + MB_CACHE_WRITER;
 
 			if (!__mb_cache_entry_is_hashed(ce)) {
-				__mb_cache_entry_release_unlock(ce);
+				__mb_cache_entry_release_unlock(ce,
+					block_hash_p);
 				return NULL;
 			}
 			goto cleanup;
@@ -494,25 +616,30 @@ mb_cache_entry_get(struct mb_cache *cache, struct block_device *bdev,
 	ce = NULL;
 
 cleanup:
-	spin_unlock(&mb_cache_spinlock);
+	MB_UNLOCK_HASH_CHAIN(block_hash_p);
 	return ce;
 }
 
 #if !defined(MB_CACHE_INDEXES_COUNT) || (MB_CACHE_INDEXES_COUNT > 0)
 
 static struct mb_cache_entry *
-__mb_cache_entry_find(struct list_head *l, struct list_head *head,
+__mb_cache_entry_find(struct hlist_bl_node *l, struct hlist_bl_head *head,
 		      struct block_device *bdev, unsigned int key)
 {
-	while (l != head) {
+	while (l != NULL) {
 		struct mb_cache_entry *ce =
-			list_entry(l, struct mb_cache_entry, e_index.o_list);
+			hlist_bl_entry(l, struct mb_cache_entry,
+				e_index.o_list);
 		if (ce->e_bdev == bdev && ce->e_index.o_key == key) {
 			DEFINE_WAIT(wait);
 
+			MB_UNLOCK_HASH_CHAIN(head);
+			spin_lock(&mb_cache_spinlock);
 			if (!list_empty(&ce->e_lru_list))
 				list_del_init(&ce->e_lru_list);
+			spin_unlock(&mb_cache_spinlock);
 
+			MB_LOCK_HASH_CHAIN(head);
 			/* Incrementing before holding the lock gives readers
 			   priority over writers. */
 			ce->e_used++;
@@ -520,18 +647,20 @@ __mb_cache_entry_find(struct list_head *l, struct list_head *head,
 				ce->e_queued++;
 				prepare_to_wait(&mb_cache_queue, &wait,
 						TASK_UNINTERRUPTIBLE);
-				spin_unlock(&mb_cache_spinlock);
+				MB_UNLOCK_HASH_CHAIN(head);
 				schedule();
-				spin_lock(&mb_cache_spinlock);
+				MB_LOCK_HASH_CHAIN(head);
+				mb_assert(ce->e_index_hash_p == head);
 				ce->e_queued--;
 			}
 			finish_wait(&mb_cache_queue, &wait);
 
 			if (!__mb_cache_entry_is_hashed(ce)) {
-				__mb_cache_entry_release_unlock(ce);
-				spin_lock(&mb_cache_spinlock);
+				__mb_cache_entry_release_unlock(ce, head);
+				MB_LOCK_HASH_CHAIN(head);
 				return ERR_PTR(-EAGAIN);
 			}
+			mb_assert(ce->e_index_hash_p == head);
 			return ce;
 		}
 		l = l->next;
@@ -557,13 +686,17 @@ mb_cache_entry_find_first(struct mb_cache *cache, struct block_device *bdev,
 			  unsigned int key)
 {
 	unsigned int bucket = hash_long(key, cache->c_bucket_bits);
-	struct list_head *l;
-	struct mb_cache_entry *ce;
+	struct hlist_bl_node *l;
+	struct mb_cache_entry *ce = NULL;
+	struct hlist_bl_head *index_hash_p;
 
-	spin_lock(&mb_cache_spinlock);
-	l = cache->c_index_hash[bucket].next;
-	ce = __mb_cache_entry_find(l, &cache->c_index_hash[bucket], bdev, key);
-	spin_unlock(&mb_cache_spinlock);
+	index_hash_p = &cache->c_index_hash[bucket];
+	MB_LOCK_HASH_CHAIN(index_hash_p);
+	if (!hlist_bl_empty(index_hash_p)) {
+		l = hlist_bl_first(index_hash_p);
+		ce = __mb_cache_entry_find(l, index_hash_p, bdev, key);
+	}
+	MB_UNLOCK_HASH_CHAIN(index_hash_p);
 	return ce;
 }
 
@@ -592,13 +725,17 @@ mb_cache_entry_find_next(struct mb_cache_entry *prev,
 {
 	struct mb_cache *cache = prev->e_cache;
 	unsigned int bucket = hash_long(key, cache->c_bucket_bits);
-	struct list_head *l;
+	struct hlist_bl_node *l;
 	struct mb_cache_entry *ce;
+	struct hlist_bl_head *index_hash_p;
 
-	spin_lock(&mb_cache_spinlock);
+	index_hash_p = &cache->c_index_hash[bucket];
+	mb_assert(prev->e_index_hash_p == index_hash_p);
+	MB_LOCK_HASH_CHAIN(index_hash_p);
+	mb_assert(!hlist_bl_empty(index_hash_p));
 	l = prev->e_index.o_list.next;
-	ce = __mb_cache_entry_find(l, &cache->c_index_hash[bucket], bdev, key);
-	__mb_cache_entry_release_unlock(prev);
+	ce = __mb_cache_entry_find(l, index_hash_p, bdev, key);
+	MB_UNLOCK_HASH_CHAIN(index_hash_p);
 	return ce;
 }
 
diff --git a/include/linux/mbcache.h b/include/linux/mbcache.h
index 5525d37..89826c0 100644
--- a/include/linux/mbcache.h
+++ b/include/linux/mbcache.h
@@ -11,11 +11,13 @@ struct mb_cache_entry {
 	unsigned short			e_queued;
 	struct block_device		*e_bdev;
 	sector_t			e_block;
-	struct list_head		e_block_list;
+	struct hlist_bl_node		e_block_list;
 	struct {
-		struct list_head	o_list;
+		struct hlist_bl_node	o_list;
 		unsigned int		o_key;
 	} e_index;
+	struct hlist_bl_head		*e_block_hash_p;
+	struct hlist_bl_head		*e_index_hash_p;
 };
 
 struct mb_cache {
@@ -25,8 +27,8 @@ struct mb_cache {
 	int				c_max_entries;
 	int				c_bucket_bits;
 	struct kmem_cache		*c_entry_cache;
-	struct list_head		*c_block_hash;
-	struct list_head		*c_index_hash;
+	struct hlist_bl_head		*c_block_hash;
+	struct hlist_bl_head		*c_index_hash;
 };
 
 /* Functions on caches */
-- 
1.7.11.3



* [PATCH v2 2/2] ext4: each filesystem creates and uses its own mb_cache
  2013-08-22 15:54 ` [PATCH v2 " T Makphaibulchoke
@ 2013-08-22 15:54     ` T Makphaibulchoke
  2013-08-22 15:54     ` T Makphaibulchoke
  2014-01-24 18:31   ` [PATCH v4 0/3] ext4: increase mbcache scalability T Makphaibulchoke
  2 siblings, 0 replies; 63+ messages in thread
From: T Makphaibulchoke @ 2013-08-22 15:54 UTC (permalink / raw)
  To: tytso, adilger.kernel, viro, linux-ext4, linux-kernel, linux-fsdevel
  Cc: aswin, torvalds, T Makphaibulchoke

This patch adds new interfaces to create and destroy a cache,
ext4_xattr_create_cache() and ext4_xattr_destroy_cache(), and removes the
cache creation and destroy calls from ext4_init_xattr() and ext4_exit_xattr()
in fs/ext4/xattr.c.
fs/ext4/super.c has been changed so that, when a filesystem is mounted, a
cache is allocated and attached to its ext4_sb_info structure.

Signed-off-by: T. Makphaibulchoke <tmac@hp.com>
---
 fs/ext4/ext4.h  |  1 +
 fs/ext4/super.c | 24 ++++++++++++++++--------
 fs/ext4/xattr.c | 51 ++++++++++++++++++++++++++++-----------------------
 fs/ext4/xattr.h |  6 +++---
 4 files changed, 48 insertions(+), 34 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index b577e45..70f2709 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1306,6 +1306,7 @@ struct ext4_sb_info {
 	struct list_head s_es_lru;
 	unsigned long s_es_last_sorted;
 	struct percpu_counter s_extent_cache_cnt;
+	struct mb_cache *s_mb_cache;
 	spinlock_t s_es_lru_lock ____cacheline_aligned_in_smp;
 };
 
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 36b141e..3e7d338 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -59,6 +59,7 @@ static struct kset *ext4_kset;
 static struct ext4_lazy_init *ext4_li_info;
 static struct mutex ext4_li_mtx;
 static struct ext4_features *ext4_feat;
+static int ext4_mballoc_ready;
 
 static int ext4_load_journal(struct super_block *, struct ext4_super_block *,
 			     unsigned long journal_devnum);
@@ -828,6 +829,10 @@ static void ext4_put_super(struct super_block *sb)
 		invalidate_bdev(sbi->journal_bdev);
 		ext4_blkdev_remove(sbi);
 	}
+	if (sbi->s_mb_cache) {
+		ext4_xattr_destroy_cache(sbi->s_mb_cache);
+		sbi->s_mb_cache = NULL;
+	}
 	if (sbi->s_mmp_tsk)
 		kthread_stop(sbi->s_mmp_tsk);
 	sb->s_fs_info = NULL;
@@ -3916,6 +3921,13 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 	set_task_ioprio(sbi->s_journal->j_task, journal_ioprio);
 
 	sbi->s_journal->j_commit_callback = ext4_journal_commit_callback;
+	if (ext4_mballoc_ready) {
+		sbi->s_mb_cache = ext4_xattr_create_cache(sb->s_id);
+		if (!sbi->s_mb_cache) {
+			ext4_msg(sb, KERN_ERR, "Failed to create an mb_cache");
+			goto failed_mount_wq;
+		}
+	}
 
 	/*
 	 * The journal may have updated the bg summary counts, so we
@@ -5428,11 +5440,9 @@ static int __init ext4_init_fs(void)
 
 	err = ext4_init_mballoc();
 	if (err)
-		goto out3;
-
-	err = ext4_init_xattr();
-	if (err)
 		goto out2;
+	else
+		ext4_mballoc_ready = 1;
 	err = init_inodecache();
 	if (err)
 		goto out1;
@@ -5448,10 +5458,9 @@ out:
 	unregister_as_ext3();
 	destroy_inodecache();
 out1:
-	ext4_exit_xattr();
-out2:
+	ext4_mballoc_ready = 0;
 	ext4_exit_mballoc();
-out3:
+out2:
 	ext4_exit_feat_adverts();
 out4:
 	if (ext4_proc_root)
@@ -5474,7 +5483,6 @@ static void __exit ext4_exit_fs(void)
 	unregister_as_ext3();
 	unregister_filesystem(&ext4_fs_type);
 	destroy_inodecache();
-	ext4_exit_xattr();
 	ext4_exit_mballoc();
 	ext4_exit_feat_adverts();
 	remove_proc_entry("fs/ext4", NULL);
diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index c081e34..3e0ff31 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -81,7 +81,7 @@
 # define ea_bdebug(bh, fmt, ...)	no_printk(fmt, ##__VA_ARGS__)
 #endif
 
-static void ext4_xattr_cache_insert(struct buffer_head *);
+static void ext4_xattr_cache_insert(struct mb_cache *, struct buffer_head *);
 static struct buffer_head *ext4_xattr_cache_find(struct inode *,
 						 struct ext4_xattr_header *,
 						 struct mb_cache_entry **);
@@ -90,8 +90,6 @@ static void ext4_xattr_rehash(struct ext4_xattr_header *,
 static int ext4_xattr_list(struct dentry *dentry, char *buffer,
 			   size_t buffer_size);
 
-static struct mb_cache *ext4_xattr_cache;
-
 static const struct xattr_handler *ext4_xattr_handler_map[] = {
 	[EXT4_XATTR_INDEX_USER]		     = &ext4_xattr_user_handler,
 #ifdef CONFIG_EXT4_FS_POSIX_ACL
@@ -117,6 +115,9 @@ const struct xattr_handler *ext4_xattr_handlers[] = {
 	NULL
 };
 
+#define EXT4_GET_MB_CACHE(inode)	(((struct ext4_sb_info *) \
+				inode->i_sb->s_fs_info)->s_mb_cache)
+
 static __le32 ext4_xattr_block_csum(struct inode *inode,
 				    sector_t block_nr,
 				    struct ext4_xattr_header *hdr)
@@ -265,6 +266,7 @@ ext4_xattr_block_get(struct inode *inode, int name_index, const char *name,
 	struct ext4_xattr_entry *entry;
 	size_t size;
 	int error;
+	struct mb_cache *ext4_mb_cache = EXT4_GET_MB_CACHE(inode);
 
 	ea_idebug(inode, "name=%d.%s, buffer=%p, buffer_size=%ld",
 		  name_index, name, buffer, (long)buffer_size);
@@ -286,7 +288,7 @@ bad_block:
 		error = -EIO;
 		goto cleanup;
 	}
-	ext4_xattr_cache_insert(bh);
+	ext4_xattr_cache_insert(ext4_mb_cache, bh);
 	entry = BFIRST(bh);
 	error = ext4_xattr_find_entry(&entry, name_index, name, bh->b_size, 1);
 	if (error == -EIO)
@@ -409,6 +411,7 @@ ext4_xattr_block_list(struct dentry *dentry, char *buffer, size_t buffer_size)
 	struct inode *inode = dentry->d_inode;
 	struct buffer_head *bh = NULL;
 	int error;
+	struct mb_cache *ext4_mb_cache = EXT4_GET_MB_CACHE(inode);
 
 	ea_idebug(inode, "buffer=%p, buffer_size=%ld",
 		  buffer, (long)buffer_size);
@@ -430,7 +433,7 @@ ext4_xattr_block_list(struct dentry *dentry, char *buffer, size_t buffer_size)
 		error = -EIO;
 		goto cleanup;
 	}
-	ext4_xattr_cache_insert(bh);
+	ext4_xattr_cache_insert(ext4_mb_cache, bh);
 	error = ext4_xattr_list_entries(dentry, BFIRST(bh), buffer, buffer_size);
 
 cleanup:
@@ -526,8 +529,9 @@ ext4_xattr_release_block(handle_t *handle, struct inode *inode,
 {
 	struct mb_cache_entry *ce = NULL;
 	int error = 0;
+	struct mb_cache *ext4_mb_cache = EXT4_GET_MB_CACHE(inode);
 
-	ce = mb_cache_entry_get(ext4_xattr_cache, bh->b_bdev, bh->b_blocknr);
+	ce = mb_cache_entry_get(ext4_mb_cache, bh->b_bdev, bh->b_blocknr);
 	error = ext4_journal_get_write_access(handle, bh);
 	if (error)
 		goto out;
@@ -745,13 +749,14 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
 	struct ext4_xattr_search *s = &bs->s;
 	struct mb_cache_entry *ce = NULL;
 	int error = 0;
+	struct mb_cache *ext4_mb_cache = EXT4_GET_MB_CACHE(inode);
 
 #define header(x) ((struct ext4_xattr_header *)(x))
 
 	if (i->value && i->value_len > sb->s_blocksize)
 		return -ENOSPC;
 	if (s->base) {
-		ce = mb_cache_entry_get(ext4_xattr_cache, bs->bh->b_bdev,
+		ce = mb_cache_entry_get(ext4_mb_cache, bs->bh->b_bdev,
 					bs->bh->b_blocknr);
 		error = ext4_journal_get_write_access(handle, bs->bh);
 		if (error)
@@ -769,7 +774,8 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
 				if (!IS_LAST_ENTRY(s->first))
 					ext4_xattr_rehash(header(s->base),
 							  s->here);
-				ext4_xattr_cache_insert(bs->bh);
+				ext4_xattr_cache_insert(ext4_mb_cache,
+					bs->bh);
 			}
 			unlock_buffer(bs->bh);
 			if (error == -EIO)
@@ -905,7 +911,7 @@ getblk_failed:
 			memcpy(new_bh->b_data, s->base, new_bh->b_size);
 			set_buffer_uptodate(new_bh);
 			unlock_buffer(new_bh);
-			ext4_xattr_cache_insert(new_bh);
+			ext4_xattr_cache_insert(ext4_mb_cache, new_bh);
 			error = ext4_handle_dirty_xattr_block(handle,
 							      inode, new_bh);
 			if (error)
@@ -1492,13 +1498,13 @@ ext4_xattr_put_super(struct super_block *sb)
  * Returns 0, or a negative error number on failure.
  */
 static void
-ext4_xattr_cache_insert(struct buffer_head *bh)
+ext4_xattr_cache_insert(struct mb_cache *ext4_mb_cache, struct buffer_head *bh)
 {
 	__u32 hash = le32_to_cpu(BHDR(bh)->h_hash);
 	struct mb_cache_entry *ce;
 	int error;
 
-	ce = mb_cache_entry_alloc(ext4_xattr_cache, GFP_NOFS);
+	ce = mb_cache_entry_alloc(ext4_mb_cache, GFP_NOFS);
 	if (!ce) {
 		ea_bdebug(bh, "out of memory");
 		return;
@@ -1570,12 +1576,13 @@ ext4_xattr_cache_find(struct inode *inode, struct ext4_xattr_header *header,
 {
 	__u32 hash = le32_to_cpu(header->h_hash);
 	struct mb_cache_entry *ce;
+	struct mb_cache *ext4_mb_cache = EXT4_GET_MB_CACHE(inode);
 
 	if (!header->h_hash)
 		return NULL;  /* never share */
 	ea_idebug(inode, "looking for cached blocks [%x]", (int)hash);
 again:
-	ce = mb_cache_entry_find_first(ext4_xattr_cache, inode->i_sb->s_bdev,
+	ce = mb_cache_entry_find_first(ext4_mb_cache, inode->i_sb->s_bdev,
 				       hash);
 	while (ce) {
 		struct buffer_head *bh;
@@ -1673,19 +1680,17 @@ static void ext4_xattr_rehash(struct ext4_xattr_header *header,
 
 #undef BLOCK_HASH_SHIFT
 
-int __init
-ext4_init_xattr(void)
+#define	HASH_BUCKET_BITS	6
+
+struct mb_cache *
+ext4_xattr_create_cache(char *name)
 {
-	ext4_xattr_cache = mb_cache_create("ext4_xattr", 6);
-	if (!ext4_xattr_cache)
-		return -ENOMEM;
-	return 0;
+	return mb_cache_create(name, HASH_BUCKET_BITS);
 }
 
-void
-ext4_exit_xattr(void)
+void ext4_xattr_destroy_cache(struct mb_cache *cache)
 {
-	if (ext4_xattr_cache)
-		mb_cache_destroy(ext4_xattr_cache);
-	ext4_xattr_cache = NULL;
+	if (cache)
+		mb_cache_destroy(cache);
 }
+
diff --git a/fs/ext4/xattr.h b/fs/ext4/xattr.h
index c767dbd..b930320 100644
--- a/fs/ext4/xattr.h
+++ b/fs/ext4/xattr.h
@@ -112,9 +112,6 @@ extern void ext4_xattr_put_super(struct super_block *);
 extern int ext4_expand_extra_isize_ea(struct inode *inode, int new_extra_isize,
 			    struct ext4_inode *raw_inode, handle_t *handle);
 
-extern int __init ext4_init_xattr(void);
-extern void ext4_exit_xattr(void);
-
 extern const struct xattr_handler *ext4_xattr_handlers[];
 
 extern int ext4_xattr_ibody_find(struct inode *inode, struct ext4_xattr_info *i,
@@ -126,6 +123,9 @@ extern int ext4_xattr_ibody_inline_set(handle_t *handle, struct inode *inode,
 				       struct ext4_xattr_info *i,
 				       struct ext4_xattr_ibody_find *is);
 
+extern struct mb_cache *ext4_xattr_create_cache(char *name);
+extern void ext4_xattr_destroy_cache(struct mb_cache *);
+
 #ifdef CONFIG_EXT4_FS_SECURITY
 extern int ext4_init_security(handle_t *handle, struct inode *inode,
 			      struct inode *dir, const struct qstr *qstr);
-- 
1.7.11.3



* Re: [PATCH v2 1/2] mbcache: decoupling the locking of local from global data
  2013-08-22 15:54   ` [PATCH v2 1/2] mbcache: decoupling the locking of local from global data T Makphaibulchoke
@ 2013-08-22 16:53     ` Linus Torvalds
  2013-08-22 15:33       ` Thavatchai Makphaibulchoke
  0 siblings, 1 reply; 63+ messages in thread
From: Linus Torvalds @ 2013-08-22 16:53 UTC (permalink / raw)
  To: T Makphaibulchoke
  Cc: Theodore Ts'o, adilger.kernel, Al Viro, linux-ext4,
	Linux Kernel Mailing List, linux-fsdevel, Chandramouleeswaran,
	Aswin

On Thu, Aug 22, 2013 at 8:54 AM, T Makphaibulchoke <tmac@hp.com> wrote:
>
> +#define        MB_LOCK_HASH_CHAIN(hash_head)   hlist_bl_lock(hash_head)
> +#define        MB_UNLOCK_HASH_CHAIN(hash_head) hlist_bl_unlock(hash_head)
> +#ifdef MB_CACHE_DEBUG
> +#define        MB_LOCK_BLOCK_HASH(ce) do {

Please don't do these ugly and pointless preprocessor macro expanders
that hide what the actual operation is.

The DEBUG case seems to be just for your own testing anyway, so even
that shouldn't exist in the merged version.

> +#define        MAX_LOCK_RETRY  1024
> +
> +static inline int __mb_cache_lock_entry_block_hash(struct mb_cache_entry *ce)
> +{
> +       int nretry = 0;
> +       struct hlist_bl_head *block_hash_p = ce->e_block_hash_p;
> +
> +       do {
> +               MB_LOCK_HASH_CHAIN(block_hash_p);
> +               if (ce->e_block_hash_p == block_hash_p)
> +                       return 0;
> +               MB_UNLOCK_HASH_CHAIN(block_hash_p);
> +               block_hash_p = ce->e_block_hash_p;
> +       } while (++nretry < MAX_LOCK_RETRY);
> +       return -1;
> +}

And this is really ugly. Again it's also then hidden behind the ugly macro.

First off, the thousand-time retry seems completely excessive. Does it
actually need any retry AT ALL? If the hash entry changes, either you
should retry forever, or if you feel that can result in livelocks
(fair enough) and you need a fallback case to a bigger lock, then why
not just do the fallback immediately?

More importantly, regardless of that retry issue, this seems to be
abstracted at the wrong level, resulting in every single user of this
repeating the same complex and hard-to-understand incantation:

> +               if (MB_LOCK_ENTRY_BLOCK_HASH(ce)) {
> +                       spin_lock(&mb_cache_spinlock);
> +                       list_add_tail(&ce->e_lru_list, &mb_cache_lru_list);
> +                       spin_unlock(&mb_cache_spinlock);
> +                       continue;
> +               }
..
> +               if (MB_LOCK_ENTRY_BLOCK_HASH(ce)) {
> +                       spin_lock(&mb_cache_spinlock);
> +                       list_add_tail(&ce->e_lru_list, &mb_cache_lru_list);
> +                       spin_unlock(&mb_cache_spinlock);
> +                       continue;
> +               }
..
> +                               if (MB_LOCK_ENTRY_BLOCK_HASH(ce)) {
> +                                       spin_lock(&mb_cache_spinlock);
> +                                       list_add_tail(&ce->e_lru_list,
> +                                               &mb_cache_lru_list);
> +                                       continue;
> +                               }

where the only difference is that the last one doesn't unlock
afterwards because it runs in a loop with that LRU list lock held.
Ugh.

The locking logic also isn't explained anywhere, making the
hard-to-read code even harder to read.

             Linus


* Re: [PATCH v3 0/2] ext4: increase mbcache scalability
  2013-09-04 20:00   ` [PATCH v3 0/2] ext4: increase mbcache scalability Andreas Dilger
@ 2013-09-04 15:33     ` Thavatchai Makphaibulchoke
  2013-09-05 15:22     ` Thavatchai Makphaibulchoke
  1 sibling, 0 replies; 63+ messages in thread
From: Thavatchai Makphaibulchoke @ 2013-09-04 15:33 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: T Makphaibulchoke, Theodore Ts'o, Al Viro,
	linux-ext4@vger.kernel.org List, Linux Kernel Mailing List,
	linux-fsdevel@vger.kernel.org Devel, aswin, Linus Torvalds,
	aswin_proj

On 09/04/2013 08:00 PM, Andreas Dilger wrote:

> 
> In the past, I've raised the question of whether mbcache is even
> useful on real-world systems.  Essentially, this is providing a
> "deduplication" service for ext2/3/4 xattr blocks that are identical.
> The question is how often this is actually the case in modern use?
> The original design was for allowing external ACL blocks to be
> shared between inodes, at a time when ACLs where pretty much the
> only xattrs stored on inodes.
> 
> The question now is whether there are common uses where all of the
> xattrs stored on multiple inodes are identical?  If that is not the
> case, mbcache is just adding overhead and should just be disabled
> entirely instead of just adding less overhead.
> 
> There aren't good statistics on the hit rate for mbcache, but it
> might be possible to generate some with systemtap or similar to
> see how often ext4_xattr_cache_find() returns NULL vs. non-NULL.
> 
> Cheers, Andreas
> 

Thanks Andreas for the comments.  Since I'm not familiar with systemtap, I'm thinking the quickest and simplest way is probably to re-run aim7 and Swingbench with mbcache disabled for comparison.  Please let me know if you have any other benchmark suggestions, or if you think systemtap on ext4_xattr_cache_find() would give a more accurate measurement.
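
If systemtap turns out to be too much trouble, something along these lines
might work as a crude measurement (rough, untested sketch; the wrapper and
counter names are made up), with the two counters exported through debugfs or
simply printed from the shrinker:

static atomic_t ext4_xattr_cache_hits;
static atomic_t ext4_xattr_cache_misses;

static struct buffer_head *
ext4_xattr_cache_find_counted(struct inode *inode,
			      struct ext4_xattr_header *header,
			      struct mb_cache_entry **pce)
{
	struct buffer_head *bh = ext4_xattr_cache_find(inode, header, pce);

	if (bh)
		atomic_inc(&ext4_xattr_cache_hits);	/* shared xattr block found */
	else
		atomic_inc(&ext4_xattr_cache_misses);	/* no deduplication possible */
	return bh;
}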

Thanks,
Mak.



* [PATCH v3 0/2] ext4: increase mbcache scalability
  2013-07-18  0:55 [PATCH 0/2] ext4: increase mbcache scalability T Makphaibulchoke
  2013-08-22 15:54 ` [PATCH v2 " T Makphaibulchoke
@ 2013-09-04 16:39 ` T Makphaibulchoke
  2013-09-04 16:39   ` [PATCH v3 1/2] mbcache: decoupling the locking of local from global data T Makphaibulchoke
                     ` (3 more replies)
  2014-02-20 18:37 ` [PATCH V5 0/3] " T Makphaibulchoke
  2014-03-12 16:19 ` [PATCH v5 RESEND 0/3] ext4: increase mbcache scalability T Makphaibulchoke
  3 siblings, 4 replies; 63+ messages in thread
From: T Makphaibulchoke @ 2013-09-04 16:39 UTC (permalink / raw)
  To: tytso, adilger.kernel, viro, linux-ext4, linux-kernel, linux-fsdevel
  Cc: aswin, torvalds, aswin_proj, T Makphaibulchoke

This patch set intends to improve the scalability of an ext filesystem,
particularly ext4.

The patch set consists of two parts.  The first part introduces a higher
degree of parallelism to the use of the mb_cache and mb_cache_entry
structures and impacts all ext filesystems.

The second part further increases the scalability of an ext4 filesystem by
having each ext4 filesystem allocate and use its own private mbcache
structure, instead of sharing a single mbcache structure across all ext4
filesystems.

Here are some of the benchmark results with the changes. 

On a 90 core machine:

Here are the performance improvements in some of the aim7 workloads,

---------------------------
|             | % increase |
---------------------------
| alltests    |     11.85  |
---------------------------
| custom      |     14.42  |
---------------------------
| fserver     |     21.36  |  
---------------------------
| new_dbase   |      5.59  |
---------------------------
| new_fserver |     21.45  |  
---------------------------
| shared      |     12.84  |  
---------------------------
For Swingbench dss workload, with 16 GB database,

-------------------------------------------------------------------------------
| Users        | 100  | 200  | 300  | 400  | 500  | 600  | 700  | 800  | 900  |
-------------------------------------------------------------------------------
| % improvement| 8.46 | 8.00 | 7.35 | -.313| 1.09 | 0.69 | 0.30 | 2.18 | 5.23 |
-------------------------------------------------------------------------------
| % improvement|45.66 |47.62 |34.54 |25.15 |15.29 | 3.38 | -8.7 |-4.98 |-7.86 |
| without using|      |      |      |      |      |      |      |      |      |
| shared memory|      |      |      |      |      |      |      |      |      |
-------------------------------------------------------------------------------
For SPECjbb2013, composite run,

--------------------------------------------
|               | max-jOPS | critical-jOPS |
--------------------------------------------
| % improvement |   5.99   |     N/A       |
--------------------------------------------


On an 80 core machine:

The aim7 results for most of the workloads turn out to be about the same.

Here are the results of the Swingbench dss workload,

-------------------------------------------------------------------------------
| Users        | 100  | 200  | 300  | 400  | 500  | 600  | 700  | 800  | 900  |
-------------------------------------------------------------------------------
| % improvement|-1.79 | 0.37 | 1.36 | 0.08 | 1.66 | 2.09 | 1.16 | 1.48 | 1.92 |
-------------------------------------------------------------------------------

The changes have been tested with ext4 xfstests to verify that no regression
has been introduced. 

Changed in v3:
	- New diff summary

Changed in v2:
	- New performance data
	- New diff summary

T Makphaibulchoke (2):
  mbcache: decoupling the locking of local from global data
  ext4: each filesystem creates and uses its own mb_cache

 fs/ext4/ext4.h          |   1 +
 fs/ext4/super.c         |  24 ++--
 fs/ext4/xattr.c         |  51 ++++----
 fs/ext4/xattr.h         |   6 +-
 fs/mbcache.c            | 306 +++++++++++++++++++++++++++++++++++-------------
 include/linux/mbcache.h |  10 +-
 6 files changed, 277 insertions(+), 121 deletions(-)

-- 
1.7.11.3


^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH v3 1/2] mbcache: decoupling the locking of local from global data
  2013-09-04 16:39 ` [PATCH v3 0/2] " T Makphaibulchoke
@ 2013-09-04 16:39   ` T Makphaibulchoke
  2013-10-30 13:27     ` Theodore Ts'o
  2013-10-30 14:42     ` Theodore Ts'o
  2013-09-04 16:39   ` [PATCH v3 2/2] ext4: each filesystem creates and uses its own mb_cache T Makphaibulchoke
                     ` (2 subsequent siblings)
  3 siblings, 2 replies; 63+ messages in thread
From: T Makphaibulchoke @ 2013-09-04 16:39 UTC (permalink / raw)
  To: tytso, adilger.kernel, viro, linux-ext4, linux-kernel, linux-fsdevel
  Cc: aswin, torvalds, aswin_proj, T Makphaibulchoke

The patch increases the parallelism of mb_cache_entry utilization by
replacing list_head with hlist_bl_node for the implementation of both the
block and index hash tables.  Each hlist_bl_head contains a built-in bit
lock that is used to protect mb_cache's local block and index hash chains.
The global data mb_cache_lru_list and mb_cache_list continue to be protected
by the global mb_cache_spinlock.
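
For readers unfamiliar with list_bl, below is a minimal, self-contained
sketch (not part of the patch; the demo_* names are made up for
illustration) of the pattern adopted here: the low bit of each
hlist_bl_head's first pointer doubles as a per-chain spinlock, so
operations on different buckets never contend on a shared lock.

#include <linux/list_bl.h>
#include <linux/hash.h>

#define DEMO_HASH_BITS	6			/* hypothetical: 64 buckets */

static struct hlist_bl_head demo_hash[1 << DEMO_HASH_BITS];

struct demo_entry {
	struct hlist_bl_node	node;
	unsigned long		key;
};

static void demo_hash_init(void)
{
	int n;

	for (n = 0; n < (1 << DEMO_HASH_BITS); n++)
		INIT_HLIST_BL_HEAD(&demo_hash[n]);
}

static void demo_hash_insert(struct demo_entry *e)
{
	struct hlist_bl_head *head =
		&demo_hash[hash_long(e->key, DEMO_HASH_BITS)];

	hlist_bl_lock(head);		/* serializes this chain only */
	hlist_bl_add_head(&e->node, head);
	hlist_bl_unlock(head);
}

static struct demo_entry *demo_hash_find(unsigned long key)
{
	struct hlist_bl_head *head =
		&demo_hash[hash_long(key, DEMO_HASH_BITS)];
	struct hlist_bl_node *pos;
	struct demo_entry *e;

	hlist_bl_lock(head);
	hlist_bl_for_each_entry(e, pos, head, node) {
		if (e->key == key) {
			/* caller is responsible for entry lifetime */
			hlist_bl_unlock(head);
			return e;
		}
	}
	hlist_bl_unlock(head);
	return NULL;
}

The actual patch below applies the same idea to both the block and the
index hash tables, which is where the contention on mb_cache_spinlock
was coming from.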

Signed-off-by: T. Makphaibulchoke <tmac@hp.com>
---
Changed in v3:
	- Removed all hash lock macros.
	- Fixed a possible race condition when updating the e_used and
	  e_queued members of an mb_cache_entry between the mb_cache_entry_get
	  function, traversing a block hash chain, and the mb_cache_entry_find
	  function, traversing an index hash chain.

Changed in v2:
	- As per Linus Torvalds' suggestion, instead of allocating spinlock
	  arrays to protect the hash chains, use hlist_bl_head, which already
	  contains builtin lock.

 fs/mbcache.c            | 306 +++++++++++++++++++++++++++++++++++-------------
 include/linux/mbcache.h |  10 +-
 2 files changed, 229 insertions(+), 87 deletions(-)

diff --git a/fs/mbcache.c b/fs/mbcache.c
index 8c32ef3..dd45fe9 100644
--- a/fs/mbcache.c
+++ b/fs/mbcache.c
@@ -26,6 +26,38 @@
  * back on the lru list.
  */
 
+/*
+ * Lock descriptions and usage:
+ *
+ * Each hash chain of both the block and index hash tables now contains
+ * a built-in lock used to serialize accesses to the hash chain.
+ *
+ * Accesses to global data structures mb_cache_list and mb_cache_lru_list
+ * are serialized via the global spinlock mb_cache_spinlock.
+ *
+ * Lock ordering:
+ *
+ * Each block hash chain's lock has the highest order, followed by each
+ * index hash chain's lock, with mb_cache_spinlock the lowest.
+ * While holding a block hash chain lock a thread can acquire either
+ * an index hash chain lock or mb_cache_spinlock.
+ *
+ * Synchronization:
+ *
+ * Since both the e_used and e_queued members of each mb_cache_entry can
+ * be updated while traversing either a block hash chain or an index hash
+ * chain and the index hash chain lock has lower order, each index hash
+ * chain's lock, in addition to being used to serialize accesses to the
+ * index hash chain, is also used to serialize accesses to both e_used
+ * and e_queued.
+ *
+ * To avoid having a dangling reference to an already freed
+ * mb_cache_entry, an mb_cache_entry is only freed when it is not on a
+ * block hash chain and also no longer being referenced.  Both e_used
+ * and e_queued are 0's.  When an mb_cache_entry is explicitly
+ * freed it is first removed from a block hash chain.
+ */
+
 #include <linux/kernel.h>
 #include <linux/module.h>
 
@@ -35,9 +67,9 @@
 #include <linux/slab.h>
 #include <linux/sched.h>
 #include <linux/init.h>
+#include <linux/list_bl.h>
 #include <linux/mbcache.h>
 
-
 #ifdef MB_CACHE_DEBUG
 # define mb_debug(f...) do { \
 		printk(KERN_DEBUG f); \
@@ -99,23 +131,34 @@ static struct shrinker mb_cache_shrinker = {
 };
 
 static inline int
-__mb_cache_entry_is_hashed(struct mb_cache_entry *ce)
+__mb_cache_entry_is_block_hashed(struct mb_cache_entry *ce)
 {
-	return !list_empty(&ce->e_block_list);
+	return !hlist_bl_unhashed(&ce->e_block_list);
 }
 
 
 static void
-__mb_cache_entry_unhash(struct mb_cache_entry *ce)
+__mb_cache_entry_unhash_block(struct mb_cache_entry *ce)
 {
-	if (__mb_cache_entry_is_hashed(ce)) {
-		list_del_init(&ce->e_block_list);
-		list_del(&ce->e_index.o_list);
-	}
+	if (__mb_cache_entry_is_block_hashed(ce))
+		hlist_bl_del_init(&ce->e_block_list);
+}
+
+static inline int
+__mb_cache_entry_is_index_hashed(struct mb_cache_entry *ce)
+{
+	return !hlist_bl_unhashed(&ce->e_index.o_list);
 }
 
 
 static void
+__mb_cache_entry_unhash_index(struct mb_cache_entry *ce)
+{
+	if (__mb_cache_entry_is_index_hashed(ce))
+		hlist_bl_del(&ce->e_index.o_list);
+}
+
+static void
 __mb_cache_entry_forget(struct mb_cache_entry *ce, gfp_t gfp_mask)
 {
 	struct mb_cache *cache = ce->e_cache;
@@ -128,8 +171,8 @@ __mb_cache_entry_forget(struct mb_cache_entry *ce, gfp_t gfp_mask)
 
 static void
 __mb_cache_entry_release_unlock(struct mb_cache_entry *ce)
-	__releases(mb_cache_spinlock)
 {
+	hlist_bl_lock(ce->e_index_hash_p);
 	/* Wake up all processes queuing for this cache entry. */
 	if (ce->e_queued)
 		wake_up_all(&mb_cache_queue);
@@ -137,15 +180,20 @@ __mb_cache_entry_release_unlock(struct mb_cache_entry *ce)
 		ce->e_used -= MB_CACHE_WRITER;
 	ce->e_used--;
 	if (!(ce->e_used || ce->e_queued)) {
-		if (!__mb_cache_entry_is_hashed(ce))
+		hlist_bl_unlock(ce->e_index_hash_p);
+		if (!__mb_cache_entry_is_block_hashed(ce))
 			goto forget;
+		spin_lock(&mb_cache_spinlock);
 		mb_assert(list_empty(&ce->e_lru_list));
 		list_add_tail(&ce->e_lru_list, &mb_cache_lru_list);
-	}
-	spin_unlock(&mb_cache_spinlock);
+		spin_unlock(&mb_cache_spinlock);
+	} else
+		hlist_bl_unlock(ce->e_index_hash_p);
+	hlist_bl_unlock(ce->e_block_hash_p);
 	return;
 forget:
-	spin_unlock(&mb_cache_spinlock);
+	hlist_bl_unlock(ce->e_block_hash_p);
+	mb_assert(list_empty(&ce->e_lru_list));
 	__mb_cache_entry_forget(ce, GFP_KERNEL);
 }
 
@@ -164,31 +212,46 @@ forget:
 static int
 mb_cache_shrink_fn(struct shrinker *shrink, struct shrink_control *sc)
 {
-	LIST_HEAD(free_list);
 	struct mb_cache *cache;
-	struct mb_cache_entry *entry, *tmp;
 	int count = 0;
 	int nr_to_scan = sc->nr_to_scan;
 	gfp_t gfp_mask = sc->gfp_mask;
 
 	mb_debug("trying to free %d entries", nr_to_scan);
-	spin_lock(&mb_cache_spinlock);
-	while (nr_to_scan-- && !list_empty(&mb_cache_lru_list)) {
-		struct mb_cache_entry *ce =
-			list_entry(mb_cache_lru_list.next,
-				   struct mb_cache_entry, e_lru_list);
-		list_move_tail(&ce->e_lru_list, &free_list);
-		__mb_cache_entry_unhash(ce);
+	while (nr_to_scan > 0) {
+		struct mb_cache_entry *ce;
+
+		spin_lock(&mb_cache_spinlock);
+		if (list_empty(&mb_cache_lru_list)) {
+			spin_unlock(&mb_cache_spinlock);
+			break;
+		}
+		ce = list_entry(mb_cache_lru_list.next,
+			struct mb_cache_entry, e_lru_list);
+		list_del_init(&ce->e_lru_list);
+		spin_unlock(&mb_cache_spinlock);
+
+		hlist_bl_lock(ce->e_block_hash_p);
+		hlist_bl_lock(ce->e_index_hash_p);
+		if (!(ce->e_used || ce->e_queued)) {
+			__mb_cache_entry_unhash_index(ce);
+			hlist_bl_unlock(ce->e_index_hash_p);
+			__mb_cache_entry_unhash_block(ce);
+			hlist_bl_unlock(ce->e_block_hash_p);
+			__mb_cache_entry_forget(ce, gfp_mask);
+			--nr_to_scan;
+		} else {
+			hlist_bl_unlock(ce->e_index_hash_p);
+			hlist_bl_unlock(ce->e_block_hash_p);
+		}
 	}
+	spin_lock(&mb_cache_spinlock);
 	list_for_each_entry(cache, &mb_cache_list, c_cache_list) {
 		mb_debug("cache %s (%d)", cache->c_name,
 			  atomic_read(&cache->c_entry_count));
 		count += atomic_read(&cache->c_entry_count);
 	}
 	spin_unlock(&mb_cache_spinlock);
-	list_for_each_entry_safe(entry, tmp, &free_list, e_lru_list) {
-		__mb_cache_entry_forget(entry, gfp_mask);
-	}
 	return (count / 100) * sysctl_vfs_cache_pressure;
 }
 
@@ -216,18 +279,18 @@ mb_cache_create(const char *name, int bucket_bits)
 	cache->c_name = name;
 	atomic_set(&cache->c_entry_count, 0);
 	cache->c_bucket_bits = bucket_bits;
-	cache->c_block_hash = kmalloc(bucket_count * sizeof(struct list_head),
-	                              GFP_KERNEL);
+	cache->c_block_hash = kmalloc(bucket_count *
+		sizeof(struct hlist_bl_head), GFP_KERNEL);
 	if (!cache->c_block_hash)
 		goto fail;
 	for (n=0; n<bucket_count; n++)
-		INIT_LIST_HEAD(&cache->c_block_hash[n]);
-	cache->c_index_hash = kmalloc(bucket_count * sizeof(struct list_head),
-				      GFP_KERNEL);
+		INIT_HLIST_BL_HEAD(&cache->c_block_hash[n]);
+	cache->c_index_hash = kmalloc(bucket_count *
+		sizeof(struct hlist_bl_head), GFP_KERNEL);
 	if (!cache->c_index_hash)
 		goto fail;
 	for (n=0; n<bucket_count; n++)
-		INIT_LIST_HEAD(&cache->c_index_hash[n]);
+		INIT_HLIST_BL_HEAD(&cache->c_index_hash[n]);
 	cache->c_entry_cache = kmem_cache_create(name,
 		sizeof(struct mb_cache_entry), 0,
 		SLAB_RECLAIM_ACCOUNT|SLAB_MEM_SPREAD, NULL);
@@ -276,13 +339,24 @@ mb_cache_shrink(struct block_device *bdev)
 			list_entry(l, struct mb_cache_entry, e_lru_list);
 		if (ce->e_bdev == bdev) {
 			list_move_tail(&ce->e_lru_list, &free_list);
-			__mb_cache_entry_unhash(ce);
 		}
 	}
 	spin_unlock(&mb_cache_spinlock);
 	list_for_each_safe(l, ltmp, &free_list) {
-		__mb_cache_entry_forget(list_entry(l, struct mb_cache_entry,
-						   e_lru_list), GFP_KERNEL);
+		struct mb_cache_entry *ce =
+			list_entry(l, struct mb_cache_entry, e_lru_list);
+		hlist_bl_lock(ce->e_block_hash_p);
+		hlist_bl_lock(ce->e_index_hash_p);
+		if (!(ce->e_used || ce->e_queued)) {
+			__mb_cache_entry_unhash_index(ce);
+			hlist_bl_unlock(ce->e_index_hash_p);
+			__mb_cache_entry_unhash_block(ce);
+			hlist_bl_unlock(ce->e_block_hash_p);
+			__mb_cache_entry_forget(ce, GFP_KERNEL);
+		} else {
+			hlist_bl_unlock(ce->e_index_hash_p);
+			hlist_bl_unlock(ce->e_block_hash_p);
+		}
 	}
 }
 
@@ -306,15 +380,21 @@ mb_cache_destroy(struct mb_cache *cache)
 			list_entry(l, struct mb_cache_entry, e_lru_list);
 		if (ce->e_cache == cache) {
 			list_move_tail(&ce->e_lru_list, &free_list);
-			__mb_cache_entry_unhash(ce);
 		}
 	}
 	list_del(&cache->c_cache_list);
 	spin_unlock(&mb_cache_spinlock);
 
 	list_for_each_safe(l, ltmp, &free_list) {
-		__mb_cache_entry_forget(list_entry(l, struct mb_cache_entry,
-						   e_lru_list), GFP_KERNEL);
+		struct mb_cache_entry *ce =
+			list_entry(l, struct mb_cache_entry, e_lru_list);
+		hlist_bl_lock(ce->e_block_hash_p);
+		hlist_bl_lock(ce->e_index_hash_p);
+		__mb_cache_entry_unhash_index(ce);
+		hlist_bl_unlock(ce->e_index_hash_p);
+		__mb_cache_entry_unhash_block(ce);
+		hlist_bl_unlock(ce->e_block_hash_p);
+		__mb_cache_entry_forget(ce, GFP_KERNEL);
 	}
 
 	if (atomic_read(&cache->c_entry_count) > 0) {
@@ -344,12 +424,27 @@ mb_cache_entry_alloc(struct mb_cache *cache, gfp_t gfp_flags)
 	struct mb_cache_entry *ce = NULL;
 
 	if (atomic_read(&cache->c_entry_count) >= cache->c_max_entries) {
+		struct list_head *l;
+		struct list_head *ltmp;
+
 		spin_lock(&mb_cache_spinlock);
-		if (!list_empty(&mb_cache_lru_list)) {
-			ce = list_entry(mb_cache_lru_list.next,
-					struct mb_cache_entry, e_lru_list);
-			list_del_init(&ce->e_lru_list);
-			__mb_cache_entry_unhash(ce);
+		list_for_each_safe(l, ltmp, &mb_cache_lru_list) {
+			ce = list_entry(l, struct mb_cache_entry, e_lru_list);
+			if (ce->e_cache == cache) {
+				list_del_init(&ce->e_lru_list);
+				spin_unlock(&mb_cache_spinlock);
+				hlist_bl_lock(ce->e_block_hash_p);
+				hlist_bl_lock(ce->e_index_hash_p);
+				if (!(ce->e_used || ce->e_queued)) {
+					__mb_cache_entry_unhash_index(ce);
+					hlist_bl_unlock(ce->e_index_hash_p);
+					__mb_cache_entry_unhash_block(ce);
+					hlist_bl_unlock(ce->e_block_hash_p);
+					break;
+				}
+				hlist_bl_unlock(ce->e_index_hash_p);
+				hlist_bl_unlock(ce->e_block_hash_p);
+			}
 		}
 		spin_unlock(&mb_cache_spinlock);
 	}
@@ -359,9 +454,12 @@ mb_cache_entry_alloc(struct mb_cache *cache, gfp_t gfp_flags)
 			return NULL;
 		atomic_inc(&cache->c_entry_count);
 		INIT_LIST_HEAD(&ce->e_lru_list);
-		INIT_LIST_HEAD(&ce->e_block_list);
+		INIT_HLIST_BL_NODE(&ce->e_block_list);
+		INIT_HLIST_BL_NODE(&ce->e_index.o_list);
 		ce->e_cache = cache;
 		ce->e_queued = 0;
+		ce->e_block_hash_p = &cache->c_block_hash[0];
+		ce->e_index_hash_p = &cache->c_index_hash[0];
 	}
 	ce->e_used = 1 + MB_CACHE_WRITER;
 	return ce;
@@ -388,28 +486,37 @@ mb_cache_entry_insert(struct mb_cache_entry *ce, struct block_device *bdev,
 {
 	struct mb_cache *cache = ce->e_cache;
 	unsigned int bucket;
-	struct list_head *l;
+	struct hlist_bl_node *l;
 	int error = -EBUSY;
+	struct hlist_bl_head *block_hash_p;
+	struct hlist_bl_head *index_hash_p;
+	struct mb_cache_entry *lce;
 
 	bucket = hash_long((unsigned long)bdev + (block & 0xffffffff), 
 			   cache->c_bucket_bits);
-	spin_lock(&mb_cache_spinlock);
-	list_for_each_prev(l, &cache->c_block_hash[bucket]) {
-		struct mb_cache_entry *ce =
-			list_entry(l, struct mb_cache_entry, e_block_list);
-		if (ce->e_bdev == bdev && ce->e_block == block)
+	block_hash_p = &cache->c_block_hash[bucket];
+	hlist_bl_lock(block_hash_p);
+	hlist_bl_for_each_entry(lce, l, block_hash_p, e_block_list) {
+		if (lce->e_bdev == bdev && lce->e_block == block)
 			goto out;
 	}
-	__mb_cache_entry_unhash(ce);
+	mb_assert(!__mb_cache_entry_is_hashed(ce));
+	__mb_cache_entry_unhash_index(ce);
+	__mb_cache_entry_unhash_block(ce);
 	ce->e_bdev = bdev;
 	ce->e_block = block;
-	list_add(&ce->e_block_list, &cache->c_block_hash[bucket]);
+	hlist_bl_add_head(&ce->e_block_list, block_hash_p);
+	ce->e_block_hash_p = block_hash_p;
 	ce->e_index.o_key = key;
 	bucket = hash_long(key, cache->c_bucket_bits);
-	list_add(&ce->e_index.o_list, &cache->c_index_hash[bucket]);
+	index_hash_p = &cache->c_index_hash[bucket];
+	hlist_bl_lock(index_hash_p);
+	hlist_bl_add_head(&ce->e_index.o_list, index_hash_p);
+	ce->e_index_hash_p = index_hash_p;
+	hlist_bl_unlock(index_hash_p);
 	error = 0;
 out:
-	spin_unlock(&mb_cache_spinlock);
+	hlist_bl_unlock(block_hash_p);
 	return error;
 }
 
@@ -424,7 +531,7 @@ out:
 void
 mb_cache_entry_release(struct mb_cache_entry *ce)
 {
-	spin_lock(&mb_cache_spinlock);
+	hlist_bl_lock(ce->e_block_hash_p);
 	__mb_cache_entry_release_unlock(ce);
 }
 
@@ -438,9 +545,13 @@ mb_cache_entry_release(struct mb_cache_entry *ce)
 void
 mb_cache_entry_free(struct mb_cache_entry *ce)
 {
-	spin_lock(&mb_cache_spinlock);
+	mb_assert(ce);
+	hlist_bl_lock(ce->e_block_hash_p);
+	hlist_bl_lock(ce->e_index_hash_p);
 	mb_assert(list_empty(&ce->e_lru_list));
-	__mb_cache_entry_unhash(ce);
+	__mb_cache_entry_unhash_index(ce);
+	hlist_bl_unlock(ce->e_index_hash_p);
+	__mb_cache_entry_unhash_block(ce);
 	__mb_cache_entry_release_unlock(ce);
 }
 
@@ -458,33 +569,46 @@ mb_cache_entry_get(struct mb_cache *cache, struct block_device *bdev,
 		   sector_t block)
 {
 	unsigned int bucket;
-	struct list_head *l;
+	struct hlist_bl_node *l;
 	struct mb_cache_entry *ce;
+	struct hlist_bl_head *block_hash_p;
+	int held_block_lock = 1;
 
 	bucket = hash_long((unsigned long)bdev + (block & 0xffffffff),
 			   cache->c_bucket_bits);
-	spin_lock(&mb_cache_spinlock);
-	list_for_each(l, &cache->c_block_hash[bucket]) {
-		ce = list_entry(l, struct mb_cache_entry, e_block_list);
+	block_hash_p = &cache->c_block_hash[bucket];
+	hlist_bl_lock(block_hash_p);
+	hlist_bl_for_each_entry(ce, l, block_hash_p, e_block_list) {
+		mb_assert(ce->e_block_hash_p == block_hash_p);
 		if (ce->e_bdev == bdev && ce->e_block == block) {
 			DEFINE_WAIT(wait);
 
+			spin_lock(&mb_cache_spinlock);
 			if (!list_empty(&ce->e_lru_list))
 				list_del_init(&ce->e_lru_list);
+			spin_unlock(&mb_cache_spinlock);
 
+			hlist_bl_lock(ce->e_index_hash_p);
 			while (ce->e_used > 0) {
 				ce->e_queued++;
+				hlist_bl_unlock(ce->e_index_hash_p);
 				prepare_to_wait(&mb_cache_queue, &wait,
 						TASK_UNINTERRUPTIBLE);
-				spin_unlock(&mb_cache_spinlock);
+				if (held_block_lock) {
+					hlist_bl_unlock(block_hash_p);
+					held_block_lock = 0;
+				}
 				schedule();
-				spin_lock(&mb_cache_spinlock);
+				hlist_bl_lock(ce->e_index_hash_p);
 				ce->e_queued--;
 			}
-			finish_wait(&mb_cache_queue, &wait);
 			ce->e_used += 1 + MB_CACHE_WRITER;
+			hlist_bl_unlock(ce->e_index_hash_p);
+			finish_wait(&mb_cache_queue, &wait);
 
-			if (!__mb_cache_entry_is_hashed(ce)) {
+			if (!held_block_lock)
+				hlist_bl_lock(block_hash_p);
+			if (!__mb_cache_entry_is_block_hashed(ce)) {
 				__mb_cache_entry_release_unlock(ce);
 				return NULL;
 			}
@@ -494,24 +618,27 @@ mb_cache_entry_get(struct mb_cache *cache, struct block_device *bdev,
 	ce = NULL;
 
 cleanup:
-	spin_unlock(&mb_cache_spinlock);
+	hlist_bl_unlock(block_hash_p);
 	return ce;
 }
 
 #if !defined(MB_CACHE_INDEXES_COUNT) || (MB_CACHE_INDEXES_COUNT > 0)
 
 static struct mb_cache_entry *
-__mb_cache_entry_find(struct list_head *l, struct list_head *head,
+__mb_cache_entry_find(struct hlist_bl_node *l, struct hlist_bl_head *head,
 		      struct block_device *bdev, unsigned int key)
 {
-	while (l != head) {
+	while (l != NULL) {
 		struct mb_cache_entry *ce =
-			list_entry(l, struct mb_cache_entry, e_index.o_list);
+			hlist_bl_entry(l, struct mb_cache_entry,
+				e_index.o_list);
 		if (ce->e_bdev == bdev && ce->e_index.o_key == key) {
 			DEFINE_WAIT(wait);
 
+			spin_lock(&mb_cache_spinlock);
 			if (!list_empty(&ce->e_lru_list))
 				list_del_init(&ce->e_lru_list);
+			spin_unlock(&mb_cache_spinlock);
 
 			/* Incrementing before holding the lock gives readers
 			   priority over writers. */
@@ -520,18 +647,23 @@ __mb_cache_entry_find(struct list_head *l, struct list_head *head,
 				ce->e_queued++;
 				prepare_to_wait(&mb_cache_queue, &wait,
 						TASK_UNINTERRUPTIBLE);
-				spin_unlock(&mb_cache_spinlock);
+				hlist_bl_unlock(head);
 				schedule();
-				spin_lock(&mb_cache_spinlock);
+				hlist_bl_lock(head);
+				mb_assert(ce->e_index_hash_p == head);
 				ce->e_queued--;
 			}
+			hlist_bl_unlock(head);
 			finish_wait(&mb_cache_queue, &wait);
 
-			if (!__mb_cache_entry_is_hashed(ce)) {
+			hlist_bl_lock(ce->e_block_hash_p);
+			if (!__mb_cache_entry_is_block_hashed(ce)) {
 				__mb_cache_entry_release_unlock(ce);
-				spin_lock(&mb_cache_spinlock);
+				hlist_bl_lock(head);
 				return ERR_PTR(-EAGAIN);
-			}
+			} else
+				hlist_bl_unlock(ce->e_block_hash_p);
+			mb_assert(ce->e_index_hash_p == head);
 			return ce;
 		}
 		l = l->next;
@@ -557,13 +689,17 @@ mb_cache_entry_find_first(struct mb_cache *cache, struct block_device *bdev,
 			  unsigned int key)
 {
 	unsigned int bucket = hash_long(key, cache->c_bucket_bits);
-	struct list_head *l;
-	struct mb_cache_entry *ce;
+	struct hlist_bl_node *l;
+	struct mb_cache_entry *ce = NULL;
+	struct hlist_bl_head *index_hash_p;
 
-	spin_lock(&mb_cache_spinlock);
-	l = cache->c_index_hash[bucket].next;
-	ce = __mb_cache_entry_find(l, &cache->c_index_hash[bucket], bdev, key);
-	spin_unlock(&mb_cache_spinlock);
+	index_hash_p = &cache->c_index_hash[bucket];
+	hlist_bl_lock(index_hash_p);
+	if (!hlist_bl_empty(index_hash_p)) {
+		l = hlist_bl_first(index_hash_p);
+		ce = __mb_cache_entry_find(l, index_hash_p, bdev, key);
+	}
+	hlist_bl_unlock(index_hash_p);
 	return ce;
 }
 
@@ -592,13 +728,17 @@ mb_cache_entry_find_next(struct mb_cache_entry *prev,
 {
 	struct mb_cache *cache = prev->e_cache;
 	unsigned int bucket = hash_long(key, cache->c_bucket_bits);
-	struct list_head *l;
+	struct hlist_bl_node *l;
 	struct mb_cache_entry *ce;
+	struct hlist_bl_head *index_hash_p;
 
-	spin_lock(&mb_cache_spinlock);
+	index_hash_p = &cache->c_index_hash[bucket];
+	mb_assert(prev->e_index_hash_p == index_hash_p);
+	hlist_bl_lock(index_hash_p);
+	mb_assert(!hlist_bl_empty(index_hash_p));
 	l = prev->e_index.o_list.next;
-	ce = __mb_cache_entry_find(l, &cache->c_index_hash[bucket], bdev, key);
-	__mb_cache_entry_release_unlock(prev);
+	ce = __mb_cache_entry_find(l, index_hash_p, bdev, key);
+	hlist_bl_unlock(index_hash_p);
 	return ce;
 }
 
diff --git a/include/linux/mbcache.h b/include/linux/mbcache.h
index 5525d37..89826c0 100644
--- a/include/linux/mbcache.h
+++ b/include/linux/mbcache.h
@@ -11,11 +11,13 @@ struct mb_cache_entry {
 	unsigned short			e_queued;
 	struct block_device		*e_bdev;
 	sector_t			e_block;
-	struct list_head		e_block_list;
+	struct hlist_bl_node		e_block_list;
 	struct {
-		struct list_head	o_list;
+		struct hlist_bl_node	o_list;
 		unsigned int		o_key;
 	} e_index;
+	struct hlist_bl_head		*e_block_hash_p;
+	struct hlist_bl_head		*e_index_hash_p;
 };
 
 struct mb_cache {
@@ -25,8 +27,8 @@ struct mb_cache {
 	int				c_max_entries;
 	int				c_bucket_bits;
 	struct kmem_cache		*c_entry_cache;
-	struct list_head		*c_block_hash;
-	struct list_head		*c_index_hash;
+	struct hlist_bl_head		*c_block_hash;
+	struct hlist_bl_head		*c_index_hash;
 };
 
 /* Functions on caches */
-- 
1.7.11.3


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH v3 2/2] ext4: each filesystem creates and uses its own mb_cache
  2013-09-04 16:39 ` [PATCH v3 0/2] " T Makphaibulchoke
  2013-09-04 16:39   ` [PATCH v3 1/2] mbcache: decoupling the locking of local from global data T Makphaibulchoke
@ 2013-09-04 16:39   ` T Makphaibulchoke
  2013-10-30 14:30     ` Theodore Ts'o
  2013-09-04 20:00   ` [PATCH v3 0/2] ext4: increase mbcache scalability Andreas Dilger
  2013-09-05  2:35   ` Theodore Ts'o
  3 siblings, 1 reply; 63+ messages in thread
From: T Makphaibulchoke @ 2013-09-04 16:39 UTC (permalink / raw)
  To: tytso, adilger.kernel, viro, linux-ext4, linux-kernel, linux-fsdevel
  Cc: aswin, torvalds, aswin_proj, T Makphaibulchoke

This patch adds new interfaces to create and destroy a cache,
ext4_xattr_create_cache() and ext4_xattr_destroy_cache(), and removes the
cache creation and destruction calls from ext4_init_xattr() and
ext4_exit_xattr() in fs/ext4/xattr.c.

fs/ext4/super.c has been changed so that when a filesystem is mounted a
cache is allocated and attached to its ext4_sb_info structure.
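
As a rough sketch (not part of the patch, error handling abbreviated, and
the demo_* names hypothetical), the mount and unmount paths end up using
the new interfaces like this:

/* assumes fs/ext4 context: ext4.h and xattr.h declare the types used */

static int demo_attach_xattr_cache(struct super_block *sb)
{
	struct ext4_sb_info *sbi = EXT4_SB(sb);

	/* one private mbcache per mounted ext4 filesystem */
	sbi->s_mb_cache = ext4_xattr_create_cache(sb->s_id);
	if (!sbi->s_mb_cache)
		return -ENOMEM;
	return 0;
}

static void demo_detach_xattr_cache(struct super_block *sb)
{
	struct ext4_sb_info *sbi = EXT4_SB(sb);

	ext4_xattr_destroy_cache(sbi->s_mb_cache);
	sbi->s_mb_cache = NULL;
}

The xattr code then reaches the cache through the super block instead of
the old file-scope ext4_xattr_cache variable, as the diff below shows.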

Signed-off-by: T. Makphaibulchoke <tmac@hp.com>
---
Changed in v3:
	- No change

Changed in v2:
	- No change

 fs/ext4/ext4.h  |  1 +
 fs/ext4/super.c | 24 ++++++++++++++++--------
 fs/ext4/xattr.c | 51 ++++++++++++++++++++++++++++-----------------------
 fs/ext4/xattr.h |  6 +++---
 4 files changed, 48 insertions(+), 34 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index b577e45..70f2709 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1306,6 +1306,7 @@ struct ext4_sb_info {
 	struct list_head s_es_lru;
 	unsigned long s_es_last_sorted;
 	struct percpu_counter s_extent_cache_cnt;
+	struct mb_cache *s_mb_cache;
 	spinlock_t s_es_lru_lock ____cacheline_aligned_in_smp;
 };
 
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 36b141e..3e7d338 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -59,6 +59,7 @@ static struct kset *ext4_kset;
 static struct ext4_lazy_init *ext4_li_info;
 static struct mutex ext4_li_mtx;
 static struct ext4_features *ext4_feat;
+static int ext4_mballoc_ready;
 
 static int ext4_load_journal(struct super_block *, struct ext4_super_block *,
 			     unsigned long journal_devnum);
@@ -828,6 +829,10 @@ static void ext4_put_super(struct super_block *sb)
 		invalidate_bdev(sbi->journal_bdev);
 		ext4_blkdev_remove(sbi);
 	}
+	if (sbi->s_mb_cache) {
+		ext4_xattr_destroy_cache(sbi->s_mb_cache);
+		sbi->s_mb_cache = NULL;
+	}
 	if (sbi->s_mmp_tsk)
 		kthread_stop(sbi->s_mmp_tsk);
 	sb->s_fs_info = NULL;
@@ -3916,6 +3921,13 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 	set_task_ioprio(sbi->s_journal->j_task, journal_ioprio);
 
 	sbi->s_journal->j_commit_callback = ext4_journal_commit_callback;
+	if (ext4_mballoc_ready) {
+		sbi->s_mb_cache = ext4_xattr_create_cache(sb->s_id);
+		if (!sbi->s_mb_cache) {
+			ext4_msg(sb, KERN_ERR, "Failed to create an mb_cache");
+			goto failed_mount_wq;
+		}
+	}
 
 	/*
 	 * The journal may have updated the bg summary counts, so we
@@ -5428,11 +5440,9 @@ static int __init ext4_init_fs(void)
 
 	err = ext4_init_mballoc();
 	if (err)
-		goto out3;
-
-	err = ext4_init_xattr();
-	if (err)
 		goto out2;
+	else
+		ext4_mballoc_ready = 1;
 	err = init_inodecache();
 	if (err)
 		goto out1;
@@ -5448,10 +5458,9 @@ out:
 	unregister_as_ext3();
 	destroy_inodecache();
 out1:
-	ext4_exit_xattr();
-out2:
+	ext4_mballoc_ready = 0;
 	ext4_exit_mballoc();
-out3:
+out2:
 	ext4_exit_feat_adverts();
 out4:
 	if (ext4_proc_root)
@@ -5474,7 +5483,6 @@ static void __exit ext4_exit_fs(void)
 	unregister_as_ext3();
 	unregister_filesystem(&ext4_fs_type);
 	destroy_inodecache();
-	ext4_exit_xattr();
 	ext4_exit_mballoc();
 	ext4_exit_feat_adverts();
 	remove_proc_entry("fs/ext4", NULL);
diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index c081e34..3e0ff31 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -81,7 +81,7 @@
 # define ea_bdebug(bh, fmt, ...)	no_printk(fmt, ##__VA_ARGS__)
 #endif
 
-static void ext4_xattr_cache_insert(struct buffer_head *);
+static void ext4_xattr_cache_insert(struct mb_cache *, struct buffer_head *);
 static struct buffer_head *ext4_xattr_cache_find(struct inode *,
 						 struct ext4_xattr_header *,
 						 struct mb_cache_entry **);
@@ -90,8 +90,6 @@ static void ext4_xattr_rehash(struct ext4_xattr_header *,
 static int ext4_xattr_list(struct dentry *dentry, char *buffer,
 			   size_t buffer_size);
 
-static struct mb_cache *ext4_xattr_cache;
-
 static const struct xattr_handler *ext4_xattr_handler_map[] = {
 	[EXT4_XATTR_INDEX_USER]		     = &ext4_xattr_user_handler,
 #ifdef CONFIG_EXT4_FS_POSIX_ACL
@@ -117,6 +115,9 @@ const struct xattr_handler *ext4_xattr_handlers[] = {
 	NULL
 };
 
+#define EXT4_GET_MB_CACHE(inode)	(((struct ext4_sb_info *) \
+				inode->i_sb->s_fs_info)->s_mb_cache)
+
 static __le32 ext4_xattr_block_csum(struct inode *inode,
 				    sector_t block_nr,
 				    struct ext4_xattr_header *hdr)
@@ -265,6 +266,7 @@ ext4_xattr_block_get(struct inode *inode, int name_index, const char *name,
 	struct ext4_xattr_entry *entry;
 	size_t size;
 	int error;
+	struct mb_cache *ext4_mb_cache = EXT4_GET_MB_CACHE(inode);
 
 	ea_idebug(inode, "name=%d.%s, buffer=%p, buffer_size=%ld",
 		  name_index, name, buffer, (long)buffer_size);
@@ -286,7 +288,7 @@ bad_block:
 		error = -EIO;
 		goto cleanup;
 	}
-	ext4_xattr_cache_insert(bh);
+	ext4_xattr_cache_insert(ext4_mb_cache, bh);
 	entry = BFIRST(bh);
 	error = ext4_xattr_find_entry(&entry, name_index, name, bh->b_size, 1);
 	if (error == -EIO)
@@ -409,6 +411,7 @@ ext4_xattr_block_list(struct dentry *dentry, char *buffer, size_t buffer_size)
 	struct inode *inode = dentry->d_inode;
 	struct buffer_head *bh = NULL;
 	int error;
+	struct mb_cache *ext4_mb_cache = EXT4_GET_MB_CACHE(inode);
 
 	ea_idebug(inode, "buffer=%p, buffer_size=%ld",
 		  buffer, (long)buffer_size);
@@ -430,7 +433,7 @@ ext4_xattr_block_list(struct dentry *dentry, char *buffer, size_t buffer_size)
 		error = -EIO;
 		goto cleanup;
 	}
-	ext4_xattr_cache_insert(bh);
+	ext4_xattr_cache_insert(ext4_mb_cache, bh);
 	error = ext4_xattr_list_entries(dentry, BFIRST(bh), buffer, buffer_size);
 
 cleanup:
@@ -526,8 +529,9 @@ ext4_xattr_release_block(handle_t *handle, struct inode *inode,
 {
 	struct mb_cache_entry *ce = NULL;
 	int error = 0;
+	struct mb_cache *ext4_mb_cache = EXT4_GET_MB_CACHE(inode);
 
-	ce = mb_cache_entry_get(ext4_xattr_cache, bh->b_bdev, bh->b_blocknr);
+	ce = mb_cache_entry_get(ext4_mb_cache, bh->b_bdev, bh->b_blocknr);
 	error = ext4_journal_get_write_access(handle, bh);
 	if (error)
 		goto out;
@@ -745,13 +749,14 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
 	struct ext4_xattr_search *s = &bs->s;
 	struct mb_cache_entry *ce = NULL;
 	int error = 0;
+	struct mb_cache *ext4_mb_cache = EXT4_GET_MB_CACHE(inode);
 
 #define header(x) ((struct ext4_xattr_header *)(x))
 
 	if (i->value && i->value_len > sb->s_blocksize)
 		return -ENOSPC;
 	if (s->base) {
-		ce = mb_cache_entry_get(ext4_xattr_cache, bs->bh->b_bdev,
+		ce = mb_cache_entry_get(ext4_mb_cache, bs->bh->b_bdev,
 					bs->bh->b_blocknr);
 		error = ext4_journal_get_write_access(handle, bs->bh);
 		if (error)
@@ -769,7 +774,8 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
 				if (!IS_LAST_ENTRY(s->first))
 					ext4_xattr_rehash(header(s->base),
 							  s->here);
-				ext4_xattr_cache_insert(bs->bh);
+				ext4_xattr_cache_insert(ext4_mb_cache,
+					bs->bh);
 			}
 			unlock_buffer(bs->bh);
 			if (error == -EIO)
@@ -905,7 +911,7 @@ getblk_failed:
 			memcpy(new_bh->b_data, s->base, new_bh->b_size);
 			set_buffer_uptodate(new_bh);
 			unlock_buffer(new_bh);
-			ext4_xattr_cache_insert(new_bh);
+			ext4_xattr_cache_insert(ext4_mb_cache, new_bh);
 			error = ext4_handle_dirty_xattr_block(handle,
 							      inode, new_bh);
 			if (error)
@@ -1492,13 +1498,13 @@ ext4_xattr_put_super(struct super_block *sb)
  * Returns 0, or a negative error number on failure.
  */
 static void
-ext4_xattr_cache_insert(struct buffer_head *bh)
+ext4_xattr_cache_insert(struct mb_cache *ext4_mb_cache, struct buffer_head *bh)
 {
 	__u32 hash = le32_to_cpu(BHDR(bh)->h_hash);
 	struct mb_cache_entry *ce;
 	int error;
 
-	ce = mb_cache_entry_alloc(ext4_xattr_cache, GFP_NOFS);
+	ce = mb_cache_entry_alloc(ext4_mb_cache, GFP_NOFS);
 	if (!ce) {
 		ea_bdebug(bh, "out of memory");
 		return;
@@ -1570,12 +1576,13 @@ ext4_xattr_cache_find(struct inode *inode, struct ext4_xattr_header *header,
 {
 	__u32 hash = le32_to_cpu(header->h_hash);
 	struct mb_cache_entry *ce;
+	struct mb_cache *ext4_mb_cache = EXT4_GET_MB_CACHE(inode);
 
 	if (!header->h_hash)
 		return NULL;  /* never share */
 	ea_idebug(inode, "looking for cached blocks [%x]", (int)hash);
 again:
-	ce = mb_cache_entry_find_first(ext4_xattr_cache, inode->i_sb->s_bdev,
+	ce = mb_cache_entry_find_first(ext4_mb_cache, inode->i_sb->s_bdev,
 				       hash);
 	while (ce) {
 		struct buffer_head *bh;
@@ -1673,19 +1680,17 @@ static void ext4_xattr_rehash(struct ext4_xattr_header *header,
 
 #undef BLOCK_HASH_SHIFT
 
-int __init
-ext4_init_xattr(void)
+#define	HASH_BUCKET_BITS	6
+
+struct mb_cache *
+ext4_xattr_create_cache(char *name)
 {
-	ext4_xattr_cache = mb_cache_create("ext4_xattr", 6);
-	if (!ext4_xattr_cache)
-		return -ENOMEM;
-	return 0;
+	return mb_cache_create(name, HASH_BUCKET_BITS);
 }
 
-void
-ext4_exit_xattr(void)
+void ext4_xattr_destroy_cache(struct mb_cache *cache)
 {
-	if (ext4_xattr_cache)
-		mb_cache_destroy(ext4_xattr_cache);
-	ext4_xattr_cache = NULL;
+	if (cache)
+		mb_cache_destroy(cache);
 }
+
diff --git a/fs/ext4/xattr.h b/fs/ext4/xattr.h
index c767dbd..b930320 100644
--- a/fs/ext4/xattr.h
+++ b/fs/ext4/xattr.h
@@ -112,9 +112,6 @@ extern void ext4_xattr_put_super(struct super_block *);
 extern int ext4_expand_extra_isize_ea(struct inode *inode, int new_extra_isize,
 			    struct ext4_inode *raw_inode, handle_t *handle);
 
-extern int __init ext4_init_xattr(void);
-extern void ext4_exit_xattr(void);
-
 extern const struct xattr_handler *ext4_xattr_handlers[];
 
 extern int ext4_xattr_ibody_find(struct inode *inode, struct ext4_xattr_info *i,
@@ -126,6 +123,9 @@ extern int ext4_xattr_ibody_inline_set(handle_t *handle, struct inode *inode,
 				       struct ext4_xattr_info *i,
 				       struct ext4_xattr_ibody_find *is);
 
+extern struct mb_cache *ext4_xattr_create_cache(char *name);
+extern void ext4_xattr_destroy_cache(struct mb_cache *);
+
 #ifdef CONFIG_EXT4_FS_SECURITY
 extern int ext4_init_security(handle_t *handle, struct inode *inode,
 			      struct inode *dir, const struct qstr *qstr);
-- 
1.7.11.3


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [PATCH v3 0/2] ext4: increase mbcache scalability
  2013-09-04 16:39 ` [PATCH v3 0/2] " T Makphaibulchoke
  2013-09-04 16:39   ` [PATCH v3 1/2] mbcache: decoupling the locking of local from global data T Makphaibulchoke
  2013-09-04 16:39   ` [PATCH v3 2/2] ext4: each filesystem creates and uses its own mb_cache T Makphaibulchoke
@ 2013-09-04 20:00   ` Andreas Dilger
  2013-09-04 15:33     ` Thavatchai Makphaibulchoke
  2013-09-05 15:22     ` Thavatchai Makphaibulchoke
  2013-09-05  2:35   ` Theodore Ts'o
  3 siblings, 2 replies; 63+ messages in thread
From: Andreas Dilger @ 2013-09-04 20:00 UTC (permalink / raw)
  To: T Makphaibulchoke
  Cc: Theodore Ts'o, Al Viro, linux-ext4@vger.kernel.org List,
	Linux Kernel Mailing List, linux-fsdevel@vger.kernel.org Devel,
	aswin, Linus Torvalds, aswin_proj

On 2013-09-04, at 10:39 AM, T Makphaibulchoke wrote:
> This patch intends to improve the scalability of an ext filesystem,
> particularly ext4.

In the past, I've raised the question of whether mbcache is even
useful on real-world systems.  Essentially, this is providing a
"deduplication" service for ext2/3/4 xattr blocks that are identical.
The question is how often this is actually the case in modern use?
The original design was for allowing external ACL blocks to be
shared between inodes, at a time when ACLs where pretty much the
only xattrs stored on inodes.

The question now is whether there are common uses where all of the
xattrs stored on multiple inodes are identical?  If that is not the
case, mbcache is just adding overhead and should just be disabled
entirely instead of just adding less overhead.

There aren't good statistics on the hit rate for mbcache, but it
might be possible to generate some with systemtap or similar to
see how often ext4_xattr_cache_find() returns NULL vs. non-NULL.

Cheers, Andreas

> The patch consists of two parts.  The first part introduces a higher
> degree of parallelism to the usage of the mb_cache and
> mb_cache_entry structures and impacts all ext filesystems.
> 
> The second part of the patch further increases the scalability of
> an ext4 filesystem by having each ext4 filesystem allocate and use
> its own private mbcache structure, instead of sharing a single
> mbcache structure across all ext4 filesystems.
> 
> Here are some of the benchmark results with the changes. 
> 
> On a 90 core machine:
> 
> Here are the performance improvements in some of the aim7 workloads,
> 
> ---------------------------
> |             | % increase |
> ---------------------------
> | alltests    |     11.85  |
> ---------------------------
> | custom      |     14.42  |
> ---------------------------
> | fserver     |     21.36  |  
> ---------------------------
> | new_dbase   |      5.59 |  
> ---------------------------
> | new_fserver |     21.45  |  
> ---------------------------
> | shared      |     12.84  |  
> ---------------------------
> For Swingbench dss workload, with 16 GB database,
> 
> -------------------------------------------------------------------------------
> | Users        | 100  | 200  | 300  | 400  | 500  | 600  | 700  | 800  | 900  |
> -------------------------------------------------------------------------------
> | % improvement| 8.46 | 8.00 | 7.35 | -.313| 1.09 | 0.69 | 0.30 | 2.18 | 5.23 |
> -------------------------------------------------------------------------------
> | % improvement|45.66 |47.62 |34.54 |25.15 |15.29 | 3.38 | -8.7 |-4.98 |-7.86 |
> | without using|      |      |      |      |      |      |      |      |      |
> | shared memory|      |      |      |      |      |      |      |      |      |
> -------------------------------------------------------------------------------
> For SPECjbb2013, composite run,
> 
> --------------------------------------------
> |               | max-jOPS | critical-jOPS |
> --------------------------------------------
> | % improvement |   5.99   |     N/A       |
> --------------------------------------------
> 
> 
> On an 80 core machine:
> 
> The aim7 results for most of the workloads turn out to be about the same.
> 
> Here are the results of the Swingbench dss workload,
> 
> -------------------------------------------------------------------------------
> | Users        | 100  | 200  | 300  | 400  | 500  | 600  | 700  | 800  | 900  |
> -------------------------------------------------------------------------------
> | % improvement|-1.79 | 0.37 | 1.36 | 0.08 | 1.66 | 2.09 | 1.16 | 1.48 | 1.92 |
> -------------------------------------------------------------------------------
> 
> The changes have been tested with ext4 xfstests to verify that no regression
> has been introduced. 
> 
> Changed in v3:
> 	- New diff summary
> 
> Changed in v2:
> 	- New performance data
> 	- New diff summary
> 
> T Makphaibulchoke (2):
>  mbcache: decoupling the locking of local from global data
>  ext4: each filesystem creates and uses its own mb_cache
> 
> fs/ext4/ext4.h          |   1 +
> fs/ext4/super.c         |  24 ++--
> fs/ext4/xattr.c         |  51 ++++----
> fs/ext4/xattr.h         |   6 +-
> fs/mbcache.c            | 306 +++++++++++++++++++++++++++++++++++-------------
> include/linux/mbcache.h |  10 +-
> 6 files changed, 277 insertions(+), 121 deletions(-)
> 
> -- 
> 1.7.11.3
> 


Cheers, Andreas






^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v3 0/2] ext4: increase mbcache scalability
  2013-09-04 16:39 ` [PATCH v3 0/2] " T Makphaibulchoke
                     ` (2 preceding siblings ...)
  2013-09-04 20:00   ` [PATCH v3 0/2] ext4: increase mbcache scalability Andreas Dilger
@ 2013-09-05  2:35   ` Theodore Ts'o
  2013-09-05  9:49     ` Thavatchai Makphaibulchoke
  3 siblings, 1 reply; 63+ messages in thread
From: Theodore Ts'o @ 2013-09-05  2:35 UTC (permalink / raw)
  To: T Makphaibulchoke
  Cc: adilger.kernel, viro, linux-ext4, linux-kernel, linux-fsdevel,
	aswin, torvalds, aswin_proj

On Wed, Sep 04, 2013 at 10:39:14AM -0600, T Makphaibulchoke wrote:
> 
> Here are the performance improvements in some of the aim7 workloads,

How did you gather these results?  The mbcache is only used if you are
using extended attributes, and only if the extended attributes don't
fit in the inode's extra space.

I checked aim7, and it doesn't do any extended attribute operations.
So why are you seeing differences?  Are you doing something like
deliberately using 128 byte inodes (which is not the default inode
size), and then enabling SELinux, or some such?

						- Ted

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v3 0/2] ext4: increase mbcache scalability
  2013-09-05  2:35   ` Theodore Ts'o
@ 2013-09-05  9:49     ` Thavatchai Makphaibulchoke
  2013-09-06  5:10       ` Andreas Dilger
  0 siblings, 1 reply; 63+ messages in thread
From: Thavatchai Makphaibulchoke @ 2013-09-05  9:49 UTC (permalink / raw)
  To: Theodore Ts'o, T Makphaibulchoke, adilger.kernel, viro,
	linux-ext4, linux-kernel, linux-fsdevel, aswin, torvalds,
	aswin_proj

On 09/05/2013 02:35 AM, Theodore Ts'o wrote:
> How did you gather these results?  The mbcache is only used if you are
> using extended attributes, and only if the extended attributes don't
> fit in the inode's extra space.
> 
> I checked aim7, and it doesn't do any extended attribute operations.
> So why are you seeing differences?  Are you doing something like
> deliberately using 128 byte inodes (which is not the default inode
> size), and then enabling SELinux, or some such?
> 
> 						- Ted
> 

No, I did not do anything special, including changing the inode size.  I just used the profile data, which indicated the mb_cache module as one of the bottlenecks.  Please see below for perf data from one of the new_fserver runs, which also shows some mb_cache activity.


                        |--3.51%-- __mb_cache_entry_find
                        |          mb_cache_entry_find_first
                        |          ext4_xattr_cache_find
                        |          ext4_xattr_block_set
                        |          ext4_xattr_set_handle
                        |          ext4_initxattrs
                        |          security_inode_init_security
                        |          ext4_init_security
                        |          __ext4_new_inode
                        |          ext4_create
                        |          vfs_create
                        |          lookup_open
                        |          do_last
                        |          path_openat
                        |          do_filp_open
                        |          do_sys_open
                        |          SyS_open
                        |          sys_creat
                        |          system_call
                        |          __creat_nocancel
                        |          |
                        |          |--16.67%-- 0x11fe2c0
                        |          |

Thanks,
Mak.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v3 0/2] ext4: increase mbcache scalability
  2013-09-04 20:00   ` [PATCH v3 0/2] ext4: increase mbcache scalability Andreas Dilger
  2013-09-04 15:33     ` Thavatchai Makphaibulchoke
@ 2013-09-05 15:22     ` Thavatchai Makphaibulchoke
  1 sibling, 0 replies; 63+ messages in thread
From: Thavatchai Makphaibulchoke @ 2013-09-05 15:22 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: T Makphaibulchoke, Theodore Ts'o, Al Viro,
	linux-ext4@vger.kernel.org List, Linux Kernel Mailing List,
	linux-fsdevel@vger.kernel.org Devel, aswin, Linus Torvalds,
	aswin_proj

On 09/04/2013 08:00 PM, Andreas Dilger wrote:
> 
> In the past, I've raised the question of whether mbcache is even
> useful on real-world systems.  Essentially, this is providing a
> "deduplication" service for ext2/3/4 xattr blocks that are identical.
> The question is how often this is actually the case in modern use?
> The original design was for allowing external ACL blocks to be
> shared between inodes, at a time when ACLs where pretty much the
> only xattrs stored on inodes.
> 
> The question now is whether there are common uses where all of the
> xattrs stored on multiple inodes are identical?  If that is not the
> case, mbcache is just adding overhead and should just be disabled
> entirely instead of just adding less overhead.
> 
> There aren't good statistics on the hit rate for mbcache, but it
> might be possible to generate some with systemtap or similar to
> see how often ext4_xattr_cache_find() returns NULL vs. non-NULL.
> 
> Cheers, Andreas
> 

Looks like it's a bit harder to disable mbcache than I thought. I ended up adding code to collect the statistics.

With SELinux enabled, for the new_fserver workload of aim7, there are a total of 0x7e05420100000000 ext4_xattr_cache_find() calls that result in a hit and 0xc100000000000000 calls that do not.  The numbers do not seem to favor completely disabling mbcache in this case.
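
The instrumentation was along these lines (a hypothetical sketch, not the
exact code used; the _stats wrapper and the counter names are made up for
illustration, added in fs/ext4/xattr.c which already pulls in the needed
headers):

static atomic64_t ext4_xattr_cache_hits;	/* hypothetical counters */
static atomic64_t ext4_xattr_cache_misses;

static struct buffer_head *
ext4_xattr_cache_find_stats(struct inode *inode,
			    struct ext4_xattr_header *header,
			    struct mb_cache_entry **pce)
{
	/* same lookup as ext4_xattr_cache_find(), plus hit/miss accounting */
	struct buffer_head *bh = ext4_xattr_cache_find(inode, header, pce);

	if (bh)
		atomic64_inc(&ext4_xattr_cache_hits);
	else
		atomic64_inc(&ext4_xattr_cache_misses);
	return bh;
}

The hit rate is then hits / (hits + misses), which could be read out with a
printk at unmount or exported through debugfs.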

Thanks,
Mak.



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v3 0/2] ext4: increase mbcache scalability
  2013-09-05  9:49     ` Thavatchai Makphaibulchoke
@ 2013-09-06  5:10       ` Andreas Dilger
  2013-09-06 12:23         ` Thavatchai Makphaibulchoke
  0 siblings, 1 reply; 63+ messages in thread
From: Andreas Dilger @ 2013-09-06  5:10 UTC (permalink / raw)
  To: Thavatchai Makphaibulchoke
  Cc: Theodore Ts'o, T Makphaibulchoke, Al Viro,
	linux-ext4@vger.kernel.org List, Linux Kernel Mailing List,
	linux-fsdevel@vger.kernel.org Devel, aswin, Linus Torvalds,
	aswin_proj

On 2013-09-05, at 3:49 AM, Thavatchai Makphaibulchoke wrote:
> On 09/05/2013 02:35 AM, Theodore Ts'o wrote:
>> How did you gather these results?  The mbcache is only used if you
>> are using extended attributes, and only if the extended attributes don't fit in the inode's extra space.
>> 
>> I checked aim7, and it doesn't do any extended attribute operations.
>> So why are you seeing differences?  Are you doing something like
>> deliberately using 128 byte inodes (which is not the default inode
>> size), and then enabling SELinux, or some such?
> 
> No, I did not do anything special, including changing the inode size.  I just used the profile data, which indicated the mb_cache module as one of the bottlenecks.  Please see below for perf data from one of the new_fserver runs, which also shows some mb_cache activity.
> 
> 
>                        |--3.51%-- __mb_cache_entry_find
>                        |          mb_cache_entry_find_first
>                        |          ext4_xattr_cache_find
>                        |          ext4_xattr_block_set
>                        |          ext4_xattr_set_handle
>                        |          ext4_initxattrs
>                        |          security_inode_init_security
>                        |          ext4_init_security

Looks like this is some large security xattr, or enough smaller
xattrs to exceed the ~120 bytes of in-inode xattr storage.  How
big is the SELinux xattr (assuming that is what it is)?

> Looks like it's a bit harder to disable mbcache than I thought.
> I ended up adding code to collect the statistics.
> 
> With selinux enabled, for new_fserver workload of aim7, there
> are a total of 0x7e05420100000000 ext4_xattr_cache_find() calls
> that result in a hit and 0xc100000000000000 calls that are not.
> The number does not seem to favor the complete disabling of
> mbcache in this case.

This is about a 65% hit rate, which seems reasonable.

You could try a few different things here:
- disable selinux completely (boot with "selinux=0" on the kernel
  command line) and see how much faster it is
- format your ext4 filesystem with larger inodes (-I 512) and see
  if this is an improvement or not.  That depends on the size of
  the selinux xattrs and if they will fit into the extra 256 bytes
  of xattr space these larger inodes will give you.  The performance
  might also be worse, since there will be more data to read/write
  for each inode, but it would avoid seeking to the xattr blocks.

Cheers, Andreas






^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v3 0/2] ext4: increase mbcache scalability
  2013-09-06  5:10       ` Andreas Dilger
@ 2013-09-06 12:23         ` Thavatchai Makphaibulchoke
  2013-09-10 20:47           ` Andreas Dilger
  0 siblings, 1 reply; 63+ messages in thread
From: Thavatchai Makphaibulchoke @ 2013-09-06 12:23 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Theodore Ts'o, T Makphaibulchoke, Al Viro,
	linux-ext4@vger.kernel.org List, Linux Kernel Mailing List,
	linux-fsdevel@vger.kernel.org Devel, aswin, Linus Torvalds,
	aswin_proj

On 09/06/2013 05:10 AM, Andreas Dilger wrote:
> On 2013-09-05, at 3:49 AM, Thavatchai Makphaibulchoke wrote:
>> No, I did not do anything special, including changing the inode size.  I just used the profile data, which indicated the mb_cache module as one of the bottlenecks.  Please see below for perf data from one of the new_fserver runs, which also shows some mb_cache activity.
>>
>>
>>                        |--3.51%-- __mb_cache_entry_find
>>                        |          mb_cache_entry_find_first
>>                        |          ext4_xattr_cache_find
>>                        |          ext4_xattr_block_set
>>                        |          ext4_xattr_set_handle
>>                        |          ext4_initxattrs
>>                        |          security_inode_init_security
>>                        |          ext4_init_security
> 
> Looks like this is some large security xattr, or enough smaller
> xattrs to exceed the ~120 bytes of in-inode xattr storage.  How
> big is the SELinux xattr (assuming that is what it is)?
> 

Sorry I'm familiar with SELinux enough to say how big its xattr is. Anyway, I'm positive that SELinux is what is generating these xattrs.  With SELinux disabled, there seems to be no call to ext4_xattr_cache_find().

>> Looks like it's a bit harder to disable mbcache than I thought.
>> I ended up adding code to collect the statistics.
>>
>> With selinux enabled, for new_fserver workload of aim7, there
>> are a total of 0x7e05420100000000 ext4_xattr_cache_find() calls
>> that result in a hit and 0xc100000000000000 calls that are not.
>> The number does not seem to favor the complete disabling of
>> mbcache in this case.
> 
> This is about a 65% hit rate, which seems reasonable.
> 
> You could try a few different things here:
> - disable selinux completely (boot with "selinux=0" on the kernel
>   command line) and see how much faster it is
> - format your ext4 filesystem with larger inodes (-I 512) and see
>   if this is an improvement or not.  That depends on the size of
>   the selinux xattrs and if they will fit into the extra 256 bytes
>   of xattr space these larger inodes will give you.  The performance
>   might also be worse, since there will be more data to read/write
>   for each inode, but it would avoid seeking to the xattr blocks.
> 

Thanks for the above suggestions. Could you please clarify if we are attempting to look for a workaround here? Since we agree the way mb_cache uses one global spinlock is incorrect and SELinux exposes the problem (which is not uncommon with Enterprise installations), I believe we should look at fixing it (patch 1/2). As you also mentioned, this will also impact both ext2 and ext3 filesystems.

Anyway, please let me know if you still think any of the above experiments is relevant.

Thanks,
Mak.


> Cheers, Andreas
> 
> 
> 
> 
> 


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v3 0/2] ext4: increase mbcache scalability
  2013-09-10 21:02             ` Theodore Ts'o
@ 2013-09-10 17:10               ` Thavatchai Makphaibulchoke
  2013-09-11  3:13               ` Eric Sandeen
  1 sibling, 0 replies; 63+ messages in thread
From: Thavatchai Makphaibulchoke @ 2013-09-10 17:10 UTC (permalink / raw)
  To: Theodore Ts'o, Andreas Dilger
  Cc: T Makphaibulchoke, Al Viro, linux-ext4@vger.kernel.org List,
	Linux Kernel Mailing List, linux-fsdevel@vger.kernel.org Devel,
	aswin, Linus Torvalds, aswin_proj

On 09/10/2013 09:02 PM, Theodore Ts'o wrote:
> On Tue, Sep 10, 2013 at 02:47:33PM -0600, Andreas Dilger wrote:
>> I agree that SELinux is enabled on enterprise distributions by default,
>> but I'm also interested to know how much overhead this imposes.  I would
>> expect that writing large external xattrs for each file would have quite
>> a significant performance overhead that should not be ignored.  Reducing
>> the mbcache overhead is good, but eliminating it entirely is better.
> 
> I was under the impression that using a 256 byte inode (which gives a
> bit over 100 bytes worth of xattr space) was plenty for SELinux.  If
> it turns out that SELinux's use of xattrs have gotten especially
> piggy, then we may need to revisit the recommended inode size for
> those systems who insist on using SELinux...  even if we eliminate the
> overhead associated with mbcache, the fact that files are requiring a
> separate xattr is going to seriously degrade performance.
> 
> 	       	  	   	     - Ted
> 

Thank you, Andreas and Ted, for the explanations and comments.  Yes, I see both of your points now.  Even though we can reduce the mbcache overhead, the additional xattr I/O still carries a cost, so it would be better to provide some data to help users or distros determine whether they are better off completely disabling SELinux or increasing the inode size.  I will go ahead and run the suggested experiments and get back with the results.

Thanks,
Mak.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v3 0/2] ext4: increase mbcache scalability
  2013-09-06 12:23         ` Thavatchai Makphaibulchoke
@ 2013-09-10 20:47           ` Andreas Dilger
  2013-09-10 21:02             ` Theodore Ts'o
  0 siblings, 1 reply; 63+ messages in thread
From: Andreas Dilger @ 2013-09-10 20:47 UTC (permalink / raw)
  To: Thavatchai Makphaibulchoke
  Cc: Theodore Ts'o, T Makphaibulchoke, Al Viro,
	linux-ext4@vger.kernel.org List, Linux Kernel Mailing List,
	linux-fsdevel@vger.kernel.org Devel, aswin, Linus Torvalds,
	aswin_proj

On 2013-09-06, at 6:23 AM, Thavatchai Makphaibulchoke wrote:
> On 09/06/2013 05:10 AM, Andreas Dilger wrote:
>> On 2013-09-05, at 3:49 AM, Thavatchai Makphaibulchoke wrote:
>>> No, I did not do anything special, including changing the inode size.  I just used the profile data, which indicated the mb_cache module as one of the bottlenecks.  Please see below for perf data from one of the new_fserver runs, which also shows some mb_cache activity.
>>> 
>>> 
>>>                       |--3.51%-- __mb_cache_entry_find
>>>                       |          mb_cache_entry_find_first
>>>                       |          ext4_xattr_cache_find
>>>                       |          ext4_xattr_block_set
>>>                       |          ext4_xattr_set_handle
>>>                       |          ext4_initxattrs
>>>                       |          security_inode_init_security
>>>                       |          ext4_init_security
>> 
>> Looks like this is some large security xattr, or enough smaller
>> xattrs to exceed the ~120 bytes of in-inode xattr storage.  How
>> big is the SELinux xattr (assuming that is what it is)?
>> 
>> You could try a few different things here:
>> - disable selinux completely (boot with "selinux=0" on the kernel
>>  command line) and see how much faster it is

> Sorry I'm

not?

> familiar with SELinux enough to say how big its xattr is. Anyway, I'm positive that SELinux is what is generating these xattrs.  With SELinux disabled, there seems to be no call to ext4_xattr_cache_find().

What is the relative performance of your benchmark with SELinux disabled?
While the oprofile graphs will be of passing interest to see that the
mbcache overhead is gone, they will not show the reduction in disk IO
from not writing/updating the external xattr blocks at all.

>> - format your ext4 filesystem with larger inodes (-I 512) and see
>>  if this is an improvement or not.  That depends on the size of
>>  the selinux xattrs and if they will fit into the extra 256 bytes
>>  of xattr space these larger inodes will give you.  The performance
>>  might also be worse, since there will be more data to read/write
>>  for each inode, but it would avoid seeking to the xattr blocks.
> 
> Thanks for the above suggestions. Could you please clarify if we are
> attempting to look for a workaround here? Since we agree the way
> mb_cache uses one global spinlock is incorrect and SELinux exposes
> the problem (which is not uncommon with Enterprise installations),
> I believe we should look at fixing it (patch 1/2). As you also
> mentioned, this will also impact both ext2 and ext3 filesystems.

I agree that SELinux is enabled on enterprise distributions by default,
but I'm also interested to know how much overhead this imposes.  I would
expect that writing large external xattrs for each file would have quite
a significant performance overhead that should not be ignored.  Reducing
the mbcache overhead is good, but eliminating it entirely is better.

Depending on how much overhead SELinux has, it might be important to
spend more time to optimize it (not just the mbcache part), or users
may consider disabling SELinux entirely on systems where they care
about peak performance.

> Anyway, please let me know if you still think any of the above
> experiments is relevant.

You have already done one of the tests that I'm interested in (the above
test which showed that disabling SELinux removed the mbcache overhead).
What I'm interested in is the actual performance (or relative performance
if you are not allowed to publish the actual numbers) of your AIM7
benchmark between SELinux enabled and SELinux disabled.

Next would be a new test that has SELinux enabled, but formatting the
filesystem with 512-byte inodes instead of the ext4 default of 256-byte
inodes.  If this makes a significant improvement, it would potentially
mean users and the upstream distros should use different formatting
options along with SELinux.  This is less clearly a win, since I don't
know enough details of how SELinux uses xattrs (I always disable it,
so I don't have any systems to check).
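
If you do try it, something along these lines is all that should be
needed to create and verify the test filesystem (the device name is
just a placeholder):

# mke2fs -t ext4 -I 512 /dev/sdXX
# dumpe2fs -h /dev/sdXX | grep "Inode size"

Then mount it with SELinux enabled and rerun the same AIM7 workloads.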

Cheers, Andreas






^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v3 0/2] ext4: increase mbcache scalability
  2013-09-10 20:47           ` Andreas Dilger
@ 2013-09-10 21:02             ` Theodore Ts'o
  2013-09-10 17:10               ` Thavatchai Makphaibulchoke
  2013-09-11  3:13               ` Eric Sandeen
  0 siblings, 2 replies; 63+ messages in thread
From: Theodore Ts'o @ 2013-09-10 21:02 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Thavatchai Makphaibulchoke, T Makphaibulchoke, Al Viro,
	linux-ext4@vger.kernel.org List, Linux Kernel Mailing List,
	linux-fsdevel@vger.kernel.org Devel, aswin, Linus Torvalds,
	aswin_proj

On Tue, Sep 10, 2013 at 02:47:33PM -0600, Andreas Dilger wrote:
> I agree that SELinux is enabled on enterprise distributions by default,
> but I'm also interested to know how much overhead this imposes.  I would
> expect that writing large external xattrs for each file would have quite
> a significant performance overhead that should not be ignored.  Reducing
> the mbcache overhead is good, but eliminating it entirely is better.

I was under the impression that using a 256 byte inode (which gives a
bit over 100 bytes worth of xattr space) was plenty for SELinux.  If
it turns out that SELinux's use of xattrs has gotten especially
piggy, then we may need to revisit the recommended inode size for
those systems who insist on using SELinux...  even if we eliminate the
overhead associated with mbcache, the fact that files are requiring a
separate xattr is going to seriously degrade performance.

	       	  	   	     - Ted

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v3 0/2] ext4: increase mbcache scalability
  2013-09-10 21:02             ` Theodore Ts'o
  2013-09-10 17:10               ` Thavatchai Makphaibulchoke
@ 2013-09-11  3:13               ` Eric Sandeen
  2013-09-11 11:30                 ` Theodore Ts'o
  1 sibling, 1 reply; 63+ messages in thread
From: Eric Sandeen @ 2013-09-11  3:13 UTC (permalink / raw)
  To: Theodore Ts'o, Andreas Dilger, Thavatchai Makphaibulchoke,
	T Makphaibulchoke, Al Viro, linux-ext4@vger.kernel.org List,
	Linux Kernel Mailing List, linux-fsdevel@vger.kernel.org Devel,
	aswin, Linus Torvalds, aswin_proj

On 9/10/13 4:02 PM, Theodore Ts'o wrote:
> On Tue, Sep 10, 2013 at 02:47:33PM -0600, Andreas Dilger wrote:
>> I agree that SELinux is enabled on enterprise distributions by default,
>> but I'm also interested to know how much overhead this imposes.  I would
>> expect that writing large external xattrs for each file would have quite
>> a significant performance overhead that should not be ignored.  Reducing
>> the mbcache overhead is good, but eliminating it entirely is better.
> 
> I was under the impression that using a 256 byte inode (which gives a
> bit over 100 bytes worth of xattr space) was plenty for SELinux.  If
> it turns out that SELinux's use of xattrs has gotten especially
> piggy, then we may need to revisit the recommended inode size for
> those systems who insist on using SELinux...  even if we eliminate the
> overhead associated with mbcache, the fact that files are requiring a
> separate xattr is going to seriously degrade performance.

On my RHEL6 system,

# find / -xdev -exec getfattr --only-values -m security.* {} 2>/dev/null \; | wc -c
11082179

bytes of values for:

# df -i /
Filesystem            Inodes   IUsed   IFree IUse% Mounted on
/dev/mapper/vg_bp05-lv_root
                     3276800  280785 2996015    9% /

280785 inodes used,

so:
11082179/280785 = ~39.5 bytes per value on average, plus:

# echo -n "security.selinux" | wc -c
16

Adding 16 bytes for the name gives only about 55-56 bytes per selinux attr on average.

So nope, not "especially piggy" on average.

Another way to do it is this; list all possible file contexts, and make
a histogram of sizes:

# for CONTEXT in `semanage fcontext -l | awk '{print $NF}' `; do echo -n $CONTEXT | wc -c; done | sort -n | uniq -c
      1 7
     33 8
    356 26
     14 27
     14 28
     37 29
     75 30
    237 31
    295 32
    425 33
    324 34
    445 35
    548 36
    229 37
    193 38
    181 39
    259 40
     81 41
    108 42
     96 43
     55 44
     55 45
     16 46
     41 47
     23 48
     28 49
     36 50
     10 51
     10 52
      5 54
      2 57

so a 57 byte value is max, but there aren't many of the larger values.

Above doesn't tell us the prevalence of various contexts on the actual system,
but they are all under 100 bytes in any case.

-Eric

> 	       	  	   	     - Ted
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v3 0/2] ext4: increase mbcache scalability
  2013-09-11  3:13               ` Eric Sandeen
@ 2013-09-11 11:30                 ` Theodore Ts'o
  2013-09-11 16:49                   ` Eric Sandeen
  0 siblings, 1 reply; 63+ messages in thread
From: Theodore Ts'o @ 2013-09-11 11:30 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Andreas Dilger, Thavatchai Makphaibulchoke, T Makphaibulchoke,
	Al Viro, linux-ext4@vger.kernel.org List,
	Linux Kernel Mailing List, linux-fsdevel@vger.kernel.org Devel,
	aswin, Linus Torvalds, aswin_proj

On Tue, Sep 10, 2013 at 10:13:16PM -0500, Eric Sandeen wrote:
> 
> Above doesn't tell us the prevalence of various contexts on the actual system,
> but they are all under 100 bytes in any case.

OK, so in other words, on your system i_file_acl and i_file_acl_high
(which is where we store the block # for the external xattr block),
should always be zero for all inodes, yes?

Thavatchai, can you check to see whether or not this is true on your
system?  You can use debugfs on the file system, and then use the
"stat" command to sample various inodes in your system.  Or I can make
a version of e2fsck which counts the number of inodes with external
xattr blocks --- it sounds like this is something we should do anyway.
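Something like the following is enough to check (the inode number is
just an example); a non-zero "File ACL:" field means that inode has an
external xattr block:

# debugfs -R "stat <12345>" /dev/sdXX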

One difference might be that Eric ran this test on RHEL 6, and
Thavatchai is using an upstream kernel, so maybe this is bloat has
been added recently?

The reason why I'm pushing here is that mbcache shouldn't be showing
up in the profiles at all if there is no external xattr block.  And so
if newer versions of SELinux (like Andreas, I've been burned by
SELinux too many times in the past, so I don't use SELinux on any of
my systems) is somehow causing mbcache to get triggered, we should
figure this out and understand what's really going on.

Sigh, I suppose I should figure out how to create a minimal KVM setup
which uses SELinux just so I can see what the heck is going on....

      	  	      	     	      	 - Ted

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v3 0/2] ext4: increase mbcache scalability
  2013-09-11 11:30                 ` Theodore Ts'o
@ 2013-09-11 16:49                   ` Eric Sandeen
  2013-09-11 19:33                     ` Eric Sandeen
  2013-09-11 20:32                     ` David Lang
  0 siblings, 2 replies; 63+ messages in thread
From: Eric Sandeen @ 2013-09-11 16:49 UTC (permalink / raw)
  To: Theodore Ts'o, Andreas Dilger, Thavatchai Makphaibulchoke,
	T Makphaibulchoke, Al Viro, linux-ext4@vger.kernel.org List,
	Linux Kernel Mailing List, linux-fsdevel@vger.kernel.org Devel,
	aswin, Linus Torvalds, aswin_proj

On 9/11/13 6:30 AM, Theodore Ts'o wrote:
> On Tue, Sep 10, 2013 at 10:13:16PM -0500, Eric Sandeen wrote:
>>
>> Above doesn't tell us the prevalence of various contexts on the actual system,
>> but they are all under 100 bytes in any case.
> 
> OK, so in other words, on your system i_file_acl and i_file_acl_high
> (which is where we store the block # for the external xattr block),
> should always be zero for all inodes, yes?

Oh, hum - ok, so that would have been the better thing to check (or at
least an additional thing).

# find / -xdev -exec filefrag -x {} \; | awk -F : '{print $NF}' | sort | uniq -c

Finds quite a lot that claim to have external blocks, but it seems broken:

# filefrag -xv /var/lib/yum/repos/x86_64/6Server/epel
Filesystem type is: ef53
File size of /var/lib/yum/repos/x86_64/6Server/epel is 4096 (1 block, blocksize 4096)
 ext logical physical expected length flags
   0       0 32212996252             100 not_aligned,inline
/var/lib/yum/repos/x86_64/6Server/epel: 1 extent found

So _filefrag_ says it has a block (at a 120T physical address not on my fs!)

And yet it's a small attr:

# getfattr -m - -d /var/lib/yum/repos/x86_64/6Server/epel
getfattr: Removing leading '/' from absolute path names
# file: var/lib/yum/repos/x86_64/6Server/epel
security.selinux="unconfined_u:object_r:rpm_var_lib_t:s0"

And debugfs thinks it's stored within the inode:

debugfs:  stat var/lib/yum/repos/x86_64/6Server/epel
Inode: 1968466   Type: directory    Mode:  0755   Flags: 0x80000
Generation: 2728788146    Version: 0x00000000:00000001
User:     0   Group:     0   Size: 4096
File ACL: 0    Directory ACL: 0
Links: 2   Blockcount: 8
Fragment:  Address: 0    Number: 0    Size: 0
 ctime: 0x50b4d808:cb7dd9a8 -- Tue Nov 27 09:11:04 2012
 atime: 0x522fc8fa:62eb2d90 -- Tue Sep 10 20:35:54 2013
 mtime: 0x50b4d808:cb7dd9a8 -- Tue Nov 27 09:11:04 2012
crtime: 0x50b4d808:cb7dd9a8 -- Tue Nov 27 09:11:04 2012
Size of extra inode fields: 28
Extended attributes stored in inode body: 
  selinux = "unconfined_u:object_r:rpm_var_lib_t:s0\000" (39)
EXTENTS:
(0): 7873422

sooo seems like filefrag -x is buggy and can't be trusted.  :(

> Thavatchai, can you check to see whether or not this is true on your
> system?  You can use debugfs on the file system, and then use the
> "stat" command to sample various inodes in your system.  Or I can make
> a version of e2fsck which counts the number of inodes with external
> xattr blocks --- it sounds like this is something we should do anyway.
> 
> One difference might be that Eric ran this test on RHEL 6, and
> Thavatchai is using an upstream kernel, so maybe this is bloat has
> been added recently?

It's a userspace policy so the kernel shouldn't matter... "bloat" would
only come from new, longer contexts (outside the kernel).

> The reason why I'm pushing here is that mbcache shouldn't be showing
> up in the profiles at all if there is no external xattr block.  And so
> if newer versions of SELinux (like Andreas, I've been burned by
> SELinux too many times in the past, so I don't use SELinux on any of
> my systems) is somehow causing mbcache to get triggered, we should
> figure this out and understand what's really going on.

selinux, from an fs allocation behavior perspective, is simply setxattr.

And as I showed earlier, name+value for all of the attrs set by at least RHEL6
selinux policy are well under 100 bytes.

(Add in a bunch of other non-selinux xattrs, and you'll go out of a 256b inode,
sure, but selinux on its own should not).

> Sigh, I suppose I should figure out how to create a minimal KVM setup
> which uses SELinux just so I can see what the heck is going on....

http://fedoraproject.org/en/get-fedora ;)

-Eric

>       	  	      	     	      	 - Ted
> 


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v3 0/2] ext4: increase mbcache scalability
  2013-09-11 16:49                   ` Eric Sandeen
@ 2013-09-11 19:33                     ` Eric Sandeen
  2013-09-11 20:32                     ` David Lang
  1 sibling, 0 replies; 63+ messages in thread
From: Eric Sandeen @ 2013-09-11 19:33 UTC (permalink / raw)
  To: Theodore Ts'o, Andreas Dilger, Thavatchai Makphaibulchoke,
	T Makphaibulchoke, Al Viro, linux-ext4@vger.kernel.org List,
	Linux Kernel Mailing List, linux-fsdevel@vger.kernel.org Devel,
	aswin, Linus Torvalds, aswin_proj

On 9/11/13 11:49 AM, Eric Sandeen wrote:
> On 9/11/13 6:30 AM, Theodore Ts'o wrote:
>> On Tue, Sep 10, 2013 at 10:13:16PM -0500, Eric Sandeen wrote:
>>>
>>> Above doesn't tell us the prevalence of various contexts on the actual system,
>>> but they are all under 100 bytes in any case.
>>
>> OK, so in other words, on your system i_file_acl and i_file_acl_high
>> (which is where we store the block # for the external xattr block),
>> should always be zero for all inodes, yes?
> 
> Oh, hum - ok, so that would have been the better thing to check (or at
> least an additional thing).
> 
> # find / -xdev -exec filefrag -x {} \; | awk -F : '{print $NF}' | sort | uniq -c
> 
> Finds quite a lot that claim to have external blocks, but it seems broken:
> 
> # filefrag -xv /var/lib/yum/repos/x86_64/6Server/epel
> Filesystem type is: ef53
> File size of /var/lib/yum/repos/x86_64/6Server/epel is 4096 (1 block, blocksize 4096)
>  ext logical physical expected length flags
>    0       0 32212996252             100 not_aligned,inline
> /var/lib/yum/repos/x86_64/6Server/epel: 1 extent found
> 
> So _filefrag_ says it has a block (at a 120T physical address not on my fs!)

Oh, this is the special-but-not-documented "print inline extents in bytes
not blocks"  :(

I'll send a patch to ignore inline extents on fiemap calls to make this
easier, but in the meantime, neither my RHEL6 root nor my F17 root have
any out-of-inode selinux xattrs on 256-byte-inode filesystems.

So selinux alone should not be exercising mbcache much, if at all, w/ 256 byte
inodes.

-Eric

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v3 0/2] ext4: increase mbcache scalability
  2013-09-11 16:49                   ` Eric Sandeen
  2013-09-11 19:33                     ` Eric Sandeen
@ 2013-09-11 20:32                     ` David Lang
  2013-09-11 20:48                       ` Eric Sandeen
  1 sibling, 1 reply; 63+ messages in thread
From: David Lang @ 2013-09-11 20:32 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Theodore Ts'o, Andreas Dilger, Thavatchai Makphaibulchoke,
	T Makphaibulchoke, Al Viro, linux-ext4@vger.kernel.org List,
	Linux Kernel Mailing List, linux-fsdevel@vger.kernel.org Devel,
	aswin, Linus Torvalds, aswin_proj

On Wed, 11 Sep 2013, Eric Sandeen wrote:

>> The reason why I'm pushing here is that mbcache shouldn't be showing
>> up in the profiles at all if there is no external xattr block.  And so
>> if newer versions of SELinux (like Andreas, I've been burned by
>> SELinux too many times in the past, so I don't use SELinux on any of
>> my systems) is somehow causing mbcache to get triggered, we should
>> figure this out and understand what's really going on.
>
> selinux, from an fs allocation behavior perspective, is simply setxattr.

what you are missing is that Ted is saying that unless you are using xattrs, the 
mbcache should not show up at all.

The fact that you are using SELinux, and SELinux sets the xattrs, is what makes 
this show up on your system, but other people who don't use SELinux (and so 
don't have any xattrs set) don't see the same bottleneck.

David Lang

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v3 0/2] ext4: increase mbcache scalability
  2013-09-11 21:25                         ` Theodore Ts'o
@ 2013-09-11 20:36                           ` Thavatchai Makphaibulchoke
  2013-09-12  3:42                             ` Eric Sandeen
  2013-09-12 12:23                             ` Theodore Ts'o
  0 siblings, 2 replies; 63+ messages in thread
From: Thavatchai Makphaibulchoke @ 2013-09-11 20:36 UTC (permalink / raw)
  To: Theodore Ts'o, Eric Sandeen
  Cc: David Lang, Andreas Dilger, T Makphaibulchoke, Al Viro,
	linux-ext4@vger.kernel.org List, Linux Kernel Mailing List,
	linux-fsdevel@vger.kernel.org Devel, aswin, Linus Torvalds,
	aswin_proj

On 09/11/2013 09:25 PM, Theodore Ts'o wrote:
> On Wed, Sep 11, 2013 at 03:48:57PM -0500, Eric Sandeen wrote:
>>
>> So at this point I think it's up to Mak to figure out why on his system, aim7 is triggering mbcache codepaths.
>>
> 
> Yes, the next thing is to see if on his systems, whether or not he's
> seeing external xattr blocks.
> 
> 					- Ted
> 

I seem to be seeing the same thing as Eric is seeing.

On one of my systems,

# find / -mount -exec getfattr --only-values -m security.* {} 2>/dev/null \; | wc -c
2725655
# df -i /
Filesystem            Inodes   IUsed   IFree IUse% Mounted on
/dev/mapper/vg_dhg1-lv_root
                     1974272   84737 1889535    5% /
# find /home -mount -exec getfattr --only-values -m security.* {} 2>/dev/null \; | wc -c
274173
# df -i /home
Filesystem            Inodes   IUsed   IFree IUse% Mounted on
/dev/mapper/vg_dhg1-lv_home
                      192384    7862  184522    5% /home

For both filesystems, the security xattr values average about 32.17 and 34.87 bytes respectively.

I also see a similar problem with filefrag.

# filefrag -xv /bin/sh
Filesystem type is: ef53
File size of /bin/sh is 938736 (230 blocks, blocksize 4096)
 ext logical physical expected length flags
   0       0 23622459548             100 not_aligned,inline
/bin/sh: 1 extent found
 
# getfattr -m - -d /bin/sh
getfattr: Removing leading '/' from absolute path names
# file: bin/sh
security.selinux="system_u:object_r:shell_exec_t:s0"

debugfs:  stat /bin/sh
Inode: 1441795   Type: symlink    Mode:  0777   Flags: 0x0
Generation: 3470616846    Version: 0x00000000:00000001
User:     0   Group:     0   Size: 4
File ACL: 0    Directory ACL: 0
Links: 1   Blockcount: 0
Fragment:  Address: 0    Number: 0    Size: 0
 ctime: 0x50c2779d:ad792a58 -- Fri Dec  7 16:11:25 2012
 atime: 0x52311211:006d1658 -- Wed Sep 11 19:00:01 2013
 mtime: 0x50c2779d:ad792a58 -- Fri Dec  7 16:11:25 2012
crtime: 0x50c2779d:ad792a58 -- Fri Dec  7 16:11:25 2012
Size of extra inode fields: 28
Extended attributes stored in inode body: 
  selinux = "system_u:object_r:bin_t:s0\000" (27)
Fast_link_dest: bash

At this point, I'm not sure why we get into the mbcache path when SELinux is enabled.  As mentioned in one of my earlier replies to Andreas, I did see actual calls into ext4_xattr_cache_find().

There seems to be one difference between the 3.11 and 2.6 kernels in security_inode_init_security(): there is an additional attempt to initialize an EVM xattr.  But I do not seem to be seeing any EVM xattr on any file.

I will continue to try to find out how we get into the mbcache path.  Please let me know if anyone has any suggestion.

Thanks,
Mak.




^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v3 0/2] ext4: increase mbcache scalability
  2013-09-11 20:32                     ` David Lang
@ 2013-09-11 20:48                       ` Eric Sandeen
  2013-09-11 21:25                         ` Theodore Ts'o
  0 siblings, 1 reply; 63+ messages in thread
From: Eric Sandeen @ 2013-09-11 20:48 UTC (permalink / raw)
  To: David Lang
  Cc: Theodore Ts'o, Andreas Dilger, Thavatchai Makphaibulchoke,
	T Makphaibulchoke, Al Viro, linux-ext4@vger.kernel.org List,
	Linux Kernel Mailing List, linux-fsdevel@vger.kernel.org Devel,
	aswin, Linus Torvalds, aswin_proj

On 9/11/13 3:32 PM, David Lang wrote:
> On Wed, 11 Sep 2013, Eric Sandeen wrote:
> 
>>> The reason why I'm pushing here is that mbcache shouldn't be showing
>>> up in the profiles at all if there is no external xattr block.  And so
>>> if newer versions of SELinux (like Andreas, I've been burned by
>>> SELinux too many times in the past, so I don't use SELinux on any of
>>> my systems) is somehow causing mbcache to get triggered, we should
>>> figure this out and understand what's really going on.
>>
>> selinux, from an fs allocation behavior perspective, is simply setxattr.
> 
> what you are missing is that Ted is saying that unless you are using xattrs, the mbcache should not show up at all.
> 
> The fact that you are using SElinux, and SELinux sets the xattrs is
> what makes this show up on your system, but other people who don't
> use SELinux (and so don't have any xattrs set) don't see the same
> bottleneck.

Sure, I understand that quite well.  But Ted was also saying that perhaps selinux had "gotten piggy" and was causing this.  I've shown that it hasn't.


This matters because unless the selinux xattrs go out of the inode into their own block, mbcache should STILL not come into it at all.

And for attrs < 100 bytes, they stay in the inode.  And on inspection, my SELinux boxes have no external attr blocks allocated.

mbcache only handles extended attributes that live in separately-allocated blocks, and selinux does not cause that on its own.

Soo... selinux by itself should not be triggering any mbcache codepaths.

Ted suggested that "selinux had gotten piggy" so I checked, and showed that it hadn't.  That's all.

So at this point I think it's up to Mak to figure out why on his system, aim7 is triggering mbcache codepaths.

-Eric

> David Lang


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v3 0/2] ext4: increase mbcache scalability
  2013-09-11 20:48                       ` Eric Sandeen
@ 2013-09-11 21:25                         ` Theodore Ts'o
  2013-09-11 20:36                           ` Thavatchai Makphaibulchoke
  0 siblings, 1 reply; 63+ messages in thread
From: Theodore Ts'o @ 2013-09-11 21:25 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: David Lang, Andreas Dilger, Thavatchai Makphaibulchoke,
	T Makphaibulchoke, Al Viro, linux-ext4@vger.kernel.org List,
	Linux Kernel Mailing List, linux-fsdevel@vger.kernel.org Devel,
	aswin, Linus Torvalds, aswin_proj

On Wed, Sep 11, 2013 at 03:48:57PM -0500, Eric Sandeen wrote:
> 
> So at this point I think it's up to Mak to figure out why on his system, aim7 is triggering mbcache codepaths.
> 

Yes, the next thing is to see if on his systems, whether or not he's
seeing external xattr blocks.

					- Ted

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v3 0/2] ext4: increase mbcache scalability
  2013-09-11 20:36                           ` Thavatchai Makphaibulchoke
@ 2013-09-12  3:42                             ` Eric Sandeen
  2013-09-12 12:23                             ` Theodore Ts'o
  1 sibling, 0 replies; 63+ messages in thread
From: Eric Sandeen @ 2013-09-12  3:42 UTC (permalink / raw)
  To: Thavatchai Makphaibulchoke
  Cc: Theodore Ts'o, Andreas Dilger, T Makphaibulchoke, Al Viro,
	linux-ext4@vger.kernel.org List, Linux Kernel Mailing List,
	linux-fsdevel@vger.kernel.org Devel, aswin, Linus Torvalds,
	aswin_proj

On 9/11/13 3:36 PM, Thavatchai Makphaibulchoke wrote:

> I seem to be seeing the same thing as Eric is seeing.

...
> For both filesystems, the security xattr are about 32.17 and 34.87 bytes respectively.
...

Can you triple-check the inode size on your fs, for good measure?

dumpe2fs -h /dev/whatever | grep "Inode size"

> I also see a similar problem with filefrag.

turns out it's not a problem, it's an undocumented & surprising "feature."  :(

        /* For inline data all offsets should be in bytes, not blocks */
        if (fm_extent->fe_flags & FIEMAP_EXTENT_DATA_INLINE)
                blk_shift = 0;

because ... ?  (the commit which added it didn't mention anything about it).

But I guess it does mean for at least those files, the xattr data is actually inline.

> At this point, I'm not sure why we get into the mbcache path when
> SELinux is enabled. As mentioned in one my earlier replies to
> Andreas, I did see actual calls into ext4_xattr_cache.
> 
> There seems to be one difference between 3.11 kernel and 2.6 kernel
> in set_inode_init_security(). There is an additional attempt to
> initialize evm xattr. But I do not seem to be seeing any evm xattr in
> any file.
> 
> I will continue to try to find out how we get into the mbcache path.
> Please let me know if anyone has any suggestion.

Sorry we got so far off the thread of the original patches.

But it seems like a mystery worth solving.

Perhaps in ext4_xattr_set_handle() you can instrument the case(s) where it
gets into ext4_xattr_block_set().

Or most simply, just printk inode number in ext4_xattr_block_set() so
you can look at them later via debugfs.

And in here,

        } else {
                error = ext4_xattr_ibody_set(handle, inode, &i, &is);
                if (!error && !bs.s.not_found) {
                        i.value = NULL;
                        error = ext4_xattr_block_set(handle, inode, &i, &bs);
                } else if (error == -ENOSPC) {
                        if (EXT4_I(inode)->i_file_acl && !bs.s.base) {
                                error = ext4_xattr_block_find(inode, &i, &bs);
                                if (error)
                                        goto cleanup;
                        }
                        error = ext4_xattr_block_set(handle, inode, &i, &bs);

maybe print out, in the ext4_xattr_ibody_set() error case, what the size
of the xattr is, and probably the inode number again for good measure, to get
an idea of what's causing it to fail to land in the inode?
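
Totally untested, but I'm thinking of nothing fancier than something
like this near the top of ext4_xattr_block_set() (and similarly in the
ibody_set error case):

	printk(KERN_INFO "%s: ino %lu name %s value_len %zu\n",
	       __func__, (unsigned long) inode->i_ino,
	       i->name ? i->name : "(none)",
	       i->value ? i->value_len : 0);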

-Eric

> 
> Thanks,
> Mak.



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v3 0/2] ext4: increase mbcache scalability
  2013-09-11 20:36                           ` Thavatchai Makphaibulchoke
  2013-09-12  3:42                             ` Eric Sandeen
@ 2013-09-12 12:23                             ` Theodore Ts'o
  2013-09-13 12:04                               ` Thavatchai Makphaibulchoke
  1 sibling, 1 reply; 63+ messages in thread
From: Theodore Ts'o @ 2013-09-12 12:23 UTC (permalink / raw)
  To: Thavatchai Makphaibulchoke
  Cc: Eric Sandeen, David Lang, Andreas Dilger, T Makphaibulchoke,
	linux-ext4@vger.kernel.org List, aswin, swin_proj

(I've trimmed the cc list to stop spamming people who probably don't
care as much about this).

So I've tried the following patch, and I've confirmed that for short
xattrs (i.e., that fit inside the inode body, assuming an inode size >
128 bytes), the mbcache paths don't trigger at all.

Could you try this patch and see if we can figure out why mbcache code
paths are triggering for you?

						- Ted

diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index c081e34..0942ae3 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -265,6 +265,8 @@ ext4_xattr_block_get(struct inode *inode, int name_index, const char *name,
 	struct ext4_xattr_entry *entry;
 	size_t size;
 	int error;
+	static DEFINE_RATELIMIT_STATE(_rs, DEFAULT_RATELIMIT_INTERVAL,
+				      DEFAULT_RATELIMIT_BURST);
 
 	ea_idebug(inode, "name=%d.%s, buffer=%p, buffer_size=%ld",
 		  name_index, name, buffer, (long)buffer_size);
@@ -286,6 +288,11 @@ bad_block:
 		error = -EIO;
 		goto cleanup;
 	}
+	if (__ratelimit(&_rs)) {
+		ext4_warning(inode->i_sb, "inode %lu",
+			     (unsigned long) inode->i_ino);
+		WARN_ON(1);
+	}
 	ext4_xattr_cache_insert(bh);
 	entry = BFIRST(bh);
 	error = ext4_xattr_find_entry(&entry, name_index, name, bh->b_size, 1);
@@ -409,6 +416,8 @@ ext4_xattr_block_list(struct dentry *dentry, char *buffer, size_t buffer_size)
 	struct inode *inode = dentry->d_inode;
 	struct buffer_head *bh = NULL;
 	int error;
+	static DEFINE_RATELIMIT_STATE(_rs, DEFAULT_RATELIMIT_INTERVAL,
+				      DEFAULT_RATELIMIT_BURST);
 
 	ea_idebug(inode, "buffer=%p, buffer_size=%ld",
 		  buffer, (long)buffer_size);
@@ -430,6 +439,11 @@ ext4_xattr_block_list(struct dentry *dentry, char *buffer, size_t buffer_size)
 		error = -EIO;
 		goto cleanup;
 	}
+	if (__ratelimit(&_rs)) {
+		ext4_warning(inode->i_sb, "inode %lu",
+			     (unsigned long) inode->i_ino);
+		WARN_ON(1);
+	}
 	ext4_xattr_cache_insert(bh);
 	error = ext4_xattr_list_entries(dentry, BFIRST(bh), buffer, buffer_size);
 
@@ -526,7 +540,14 @@ ext4_xattr_release_block(handle_t *handle, struct inode *inode,
 {
 	struct mb_cache_entry *ce = NULL;
 	int error = 0;
+	static DEFINE_RATELIMIT_STATE(_rs, DEFAULT_RATELIMIT_INTERVAL,
+				      DEFAULT_RATELIMIT_BURST);
 
+	if (__ratelimit(&_rs)) {
+		ext4_warning(inode->i_sb, "inode %lu",
+			     (unsigned long) inode->i_ino);
+		WARN_ON(1);
+	}
 	ce = mb_cache_entry_get(ext4_xattr_cache, bh->b_bdev, bh->b_blocknr);
 	error = ext4_journal_get_write_access(handle, bh);
 	if (error)
@@ -745,12 +766,19 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
 	struct ext4_xattr_search *s = &bs->s;
 	struct mb_cache_entry *ce = NULL;
 	int error = 0;
+	static DEFINE_RATELIMIT_STATE(_rs, DEFAULT_RATELIMIT_INTERVAL,
+				      DEFAULT_RATELIMIT_BURST);
 
 #define header(x) ((struct ext4_xattr_header *)(x))
 
 	if (i->value && i->value_len > sb->s_blocksize)
 		return -ENOSPC;
 	if (s->base) {
+		if (__ratelimit(&_rs)) {
+			ext4_warning(inode->i_sb, "inode %lu",
+				     (unsigned long) inode->i_ino);
+			WARN_ON(1);
+		}
 		ce = mb_cache_entry_get(ext4_xattr_cache, bs->bh->b_bdev,
 					bs->bh->b_blocknr);
 		error = ext4_journal_get_write_access(handle, bs->bh);
@@ -769,6 +797,12 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
 				if (!IS_LAST_ENTRY(s->first))
 					ext4_xattr_rehash(header(s->base),
 							  s->here);
+				if (__ratelimit(&_rs)) {
+					ext4_warning(inode->i_sb,
+						     "inode %lu",
+						(unsigned long) inode->i_ino);
+					WARN_ON(1);
+				}
 				ext4_xattr_cache_insert(bs->bh);
 			}
 			unlock_buffer(bs->bh);
@@ -905,6 +939,11 @@ getblk_failed:
 			memcpy(new_bh->b_data, s->base, new_bh->b_size);
 			set_buffer_uptodate(new_bh);
 			unlock_buffer(new_bh);
+			if (__ratelimit(&_rs)) {
+				ext4_warning(inode->i_sb, "inode %lu",
+					     (unsigned long) inode->i_ino);
+				WARN_ON(1);
+			}
 			ext4_xattr_cache_insert(new_bh);
 			error = ext4_handle_dirty_xattr_block(handle,
 							      inode, new_bh);
@@ -1570,10 +1609,17 @@ ext4_xattr_cache_find(struct inode *inode, struct ext4_xattr_header *header,
 {
 	__u32 hash = le32_to_cpu(header->h_hash);
 	struct mb_cache_entry *ce;
+	static DEFINE_RATELIMIT_STATE(_rs, DEFAULT_RATELIMIT_INTERVAL,
+				      DEFAULT_RATELIMIT_BURST);
 
 	if (!header->h_hash)
 		return NULL;  /* never share */
 	ea_idebug(inode, "looking for cached blocks [%x]", (int)hash);
+	if (__ratelimit(&_rs)) {
+		ext4_warning(inode->i_sb, "inode %lu",
+			     (unsigned long) inode->i_ino);
+		WARN_ON(1);
+	}
 again:
 	ce = mb_cache_entry_find_first(ext4_xattr_cache, inode->i_sb->s_bdev,
 				       hash);

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [PATCH v3 0/2] ext4: increase mbcache scalability
  2013-09-12 12:23                             ` Theodore Ts'o
@ 2013-09-13 12:04                               ` Thavatchai Makphaibulchoke
  2013-09-13 18:59                                 ` Eric Sandeen
  0 siblings, 1 reply; 63+ messages in thread
From: Thavatchai Makphaibulchoke @ 2013-09-13 12:04 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Eric Sandeen, David Lang, Andreas Dilger, T Makphaibulchoke,
	linux-ext4@vger.kernel.org List, aswin, aswin_proj

On 09/12/2013 06:23 AM, Theodore Ts'o wrote:
> (I've trimmed the cc list to stop spamming people who probably don't
> care as much about this).
> 
> So I've tried the following patch, and I've confirmed that for short
> xattrs (i.e., that fit inside the inode body, assuming an inode size >
> 128 bytes), the mbcache paths don't trigger at all.
> 
> Could you try this patch and see if we can figure out why mbcache code
> paths are triggering for you?
> 
> 						- Ted
> 

I tried the patch.  First off, since the patch adds a lot of warnings, it produces tremendous non-stop warning output during the aim7 run, particularly the one from line 1627.

Unfortunately, looks like with this patch we are still calling ext4_xattr_cache_find().

I added more debugging code to print out the first few xattrs that get us into ext4_xattr_cache_find().  This is what I have,

[   80.540977] ext4_xattr_block_set: name  len 28.
[  155.480692] ext4_xattr_block_set: name selinux len 32.
[  155.486531] ext4_xattr_block_set: name selinux len 32.
[  155.492283] ext4_xattr_block_set: name selinux len 32.
[  155.497988] ext4_xattr_block_set: name selinux len 32.
[  155.504064] ext4_xattr_block_set: name selinux len 32.
[  155.509771] ext4_xattr_block_set: name selinux len 32.
[  155.515475] ext4_xattr_block_set: name selinux len 32.
[  155.521179] ext4_xattr_block_set: name selinux len 32.

I'll continue debugging and get back with any new finding.

Thanks,
Mak.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v3 0/2] ext4: increase mbcache scalability
  2013-09-13 12:04                               ` Thavatchai Makphaibulchoke
@ 2013-09-13 18:59                                 ` Eric Sandeen
  2013-09-17 16:43                                   ` Thavatchai Makphaibulchoke
  0 siblings, 1 reply; 63+ messages in thread
From: Eric Sandeen @ 2013-09-13 18:59 UTC (permalink / raw)
  To: Thavatchai Makphaibulchoke
  Cc: Theodore Ts'o, Andreas Dilger, T Makphaibulchoke,
	linux-ext4@vger.kernel.org List, aswin, aswin_proj

On 9/13/13 7:04 AM, Thavatchai Makphaibulchoke wrote:
> On 09/12/2013 06:23 AM, Theodore Ts'o wrote:
>> (I've trimmed the cc list to stop spamming people who probably don't
>> care as much about this).
>>
>> So I've tried the following patch, and I've confirmed that for short
>> xattrs (i.e., that fit inside the inode body, assuming an inode size >
>> 128 bytes), the mbcache paths don't trigger at all.
>>
>> Could you try this patch and see if we can figure out why mbcache code
>> paths are triggering for you?
>>
>> 						- Ted
>>
> 
> I tried the patch.  First off, since the patch adds a lot of warnings, it produces tremendous non-stop warning output during the aim7 run, particularly the one from line 1627.
> 
> Unfortunately, looks like with this patch we are still calling ext4_xattr_cache_find().

All the patch does is issue warnings, so no fixing was expected ;)

> I added more debugging code to print out the first few xattrs that get us into ext4_xattr_cache_find().  This is what I have,
> 
> [   80.540977] ext4_xattr_block_set: name  len 28.
> [  155.480692] ext4_xattr_block_set: name selinux len 32.
> [  155.486531] ext4_xattr_block_set: name selinux len 32.
> [  155.492283] ext4_xattr_block_set: name selinux len 32.
> [  155.497988] ext4_xattr_block_set: name selinux len 32.
> [  155.504064] ext4_xattr_block_set: name selinux len 32.
> [  155.509771] ext4_xattr_block_set: name selinux len 32.
> [  155.515475] ext4_xattr_block_set: name selinux len 32.
> [  155.521179] ext4_xattr_block_set: name selinux len 32.

ok, all small...

> I'll continue debugging and get back with any new finding.

suggestions:

1) send us dumpe2fs -h /dev/sdXX | grep "Inode size" for this filesystem
2) print out the inode number in your printk's above
3) stat one of those inodes in debugfs after the run (debugfs stat <inodenumber>) - with <> around the inode nr

-Eric

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v3 0/2] ext4: increase mbcache scalability
  2013-09-13 18:59                                 ` Eric Sandeen
@ 2013-09-17 16:43                                   ` Thavatchai Makphaibulchoke
  2013-09-18 14:19                                     ` Eric Sandeen
  0 siblings, 1 reply; 63+ messages in thread
From: Thavatchai Makphaibulchoke @ 2013-09-17 16:43 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Theodore Ts'o, Andreas Dilger, T Makphaibulchoke,
	linux-ext4@vger.kernel.org List, aswin, aswin_proj

On 09/13/2013 12:59 PM, Eric Sandeen wrote:
> 
> suggestions:
> 
> 1) send us dumpe2fs -h /dev/sdXX | grep "Inode size" for this filesystem
> 2) print out the inode number in your printk's above
> 3) stat one of those inodes in debugfs after the run (debugfs stat <inodenumber>) - with <> around the inode nr
> 
> -Eric
> 

I think I found out why we get into the mbcache path.  The test uses ramfs filesystems, which are mounted and unmounted at the start and end of the test.  It looks like the default inode size on those filesystems is 128 bytes, causing all the mbcaching for the xattrs.  There seems to be nothing wrong with either SELinux or xattrs.  Sorry for the confusion.

Thanks,
Mak.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v3 0/2] ext4: increase mbcache scalability
  2013-09-17 16:43                                   ` Thavatchai Makphaibulchoke
@ 2013-09-18 14:19                                     ` Eric Sandeen
  2013-09-21 18:53                                       ` Theodore Ts'o
  0 siblings, 1 reply; 63+ messages in thread
From: Eric Sandeen @ 2013-09-18 14:19 UTC (permalink / raw)
  To: Thavatchai Makphaibulchoke
  Cc: Theodore Ts'o, Andreas Dilger, T Makphaibulchoke,
	linux-ext4@vger.kernel.org List, aswin, aswin_proj

On 9/17/13 11:43 AM, Thavatchai Makphaibulchoke wrote:
> On 09/13/2013 12:59 PM, Eric Sandeen wrote:
>> 
>> suggestions:
>> 
>> 1) send us dumpe2fs -h /dev/sdXX | grep "Inode size" for this
>> filesystem 2) print out the inode number in your printk's above 3)
>> stat one of those inodes in debugfs after the run (debugfs stat
>> <inodenumber>) - with <> around the inode nr
>> 
>> -Eric
>> 
> 
> I think I found out why we get into the mbcache path.  The test uses
> ramfs filesystems, which are mounted and unmounted at the start and
> end of the test.  It looks like the default inode size on those
> filesystems is 128 bytes, causing all the mbcaching for the xattrs.
> There seems to be nothing wrong with either SELinux or xattrs.
> Sorry for the confusion.

smaller filesystems go back to 128 byte inodes, IIRC.

That's why I asked for dumpe2fs long ago ;)
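
(mke2fs picks that up from the "small" and "floppy" fs_types in
/etc/mke2fs.conf; exact values vary by distro and e2fsprogs version,
but the stock config has something along the lines of

	small = {
		blocksize = 1024
		inode_size = 128
		inode_ratio = 4096
	}

so small filesystems end up with 128-byte inodes unless you pass -I
explicitly.)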

But glad to have the mystery solved!

Thanks,
-Eric

> Thanks, Mak.
> 


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v3 0/2] ext4: increase mbcache scalability
  2013-09-18 14:19                                     ` Eric Sandeen
@ 2013-09-21 18:53                                       ` Theodore Ts'o
  2013-09-23 15:35                                         ` Thavatchai Makphaibulchoke
       [not found]                                         ` <D59BF31F-934D-4DE6-9E5C-F92014F09C68@dilger.ca>
  0 siblings, 2 replies; 63+ messages in thread
From: Theodore Ts'o @ 2013-09-21 18:53 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Thavatchai Makphaibulchoke, Andreas Dilger, T Makphaibulchoke,
	linux-ext4@vger.kernel.org List, aswin, aswin_proj

On Wed, Sep 18, 2013 at 09:19:10AM -0500, Eric Sandeen wrote:
> > I think I found out why we get into the mbcache path.  The test uses
> > ramfs filesystems, which are mounted and unmounted at the start and
> > end of the test.  It looks like the default inode size on those
> > filesystems is 128 bytes, causing all the mbcaching for the xattrs.
> > There seems to be nothing wrong with either SELinux or xattrs.
> > Sorry for the confusion.
> 
> smaller filesystems go back to 128 byte inodes, IIRC.

Mak, did you mean an ext4 file system created on a ramdisk, using a
/dev/ramNN device, or the literal "ramfs" file system?

I assume the former, right?

					- Ted

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v3 0/2] ext4: increase mbcache scalability
  2013-09-21 18:53                                       ` Theodore Ts'o
@ 2013-09-23 15:35                                         ` Thavatchai Makphaibulchoke
       [not found]                                         ` <D59BF31F-934D-4DE6-9E5C-F92014F09C68@dilger.ca>
  1 sibling, 0 replies; 63+ messages in thread
From: Thavatchai Makphaibulchoke @ 2013-09-23 15:35 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Eric Sandeen, Andreas Dilger, T Makphaibulchoke,
	linux-ext4@vger.kernel.org List, aswin, aswin_proj

On 09/21/2013 12:53 PM, Theodore Ts'o wrote:

> 
> Mak, did you mean a ext4 file system created on a ramdisk, using a
> /dev/ramNN device, or the literal "ramfs" file system?
> 
> I assume the former, right?
> 
> 					- Ted
> 

Yes Ted, that is correct; the filesystem is using /dev/ramNN.

Thanks,
Mak.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v3 0/2] ext4: increase mbcache scalability
       [not found]                                         ` <D59BF31F-934D-4DE6-9E5C-F92014F09C68@dilger.ca>
@ 2013-09-25 18:43                                           ` Thavatchai Makphaibulchoke
  2013-10-04  3:22                                             ` Andreas Dilger
  0 siblings, 1 reply; 63+ messages in thread
From: Thavatchai Makphaibulchoke @ 2013-09-25 18:43 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Theodore Ts'o, Eric Sandeen, T Makphaibulchoke,
	linux-ext4@vger.kernel.org List, aswin, aswin_proj

On 09/21/2013 05:12 PM, Andreas Dilger wrote:
> I just noticed that we have a patch for Lustre that adds a "no_mbcache" mount option to disable mbcache on a per-filesystem basis.
> 
> The newest patch we have is based on FC19 (3.10 kernel), is this something that would be of interest if we submitted it upstream?
> 
> http://review.whamcloud.com/#/c/7263/8/ldiskfs/kernel_patches/patches/fc19/ext4-disable-mb-cache.patch
> 
> Cheers, Andreas
> 

Here are some of the improvements I saw with SELinux disabled.

On an 80 core machine,

custom 15.5%
disk 27.81
fserver 5.33%
new_dbase 9.06%
new_fserver 5.45%


On an 8 core machine,

alltests 5.24
custom 2.26
shared 9.32
short 24.08

The rest is not noticeably different.

Thanks,
Mak.



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v3 0/2] ext4: increase mbcache scalability
  2013-09-25 18:43                                           ` Thavatchai Makphaibulchoke
@ 2013-10-04  3:22                                             ` Andreas Dilger
  2013-10-18 12:43                                               ` Thavatchai Makphaibulchoke
  0 siblings, 1 reply; 63+ messages in thread
From: Andreas Dilger @ 2013-10-04  3:22 UTC (permalink / raw)
  To: Thavatchai Makphaibulchoke
  Cc: Theodore Ts'o, Eric Sandeen, linux-ext4@vger.kernel.org List, aswin

On 2013-09-25, at 12:43 PM, Thavatchai Makphaibulchoke wrote:
> On 09/21/2013 05:12 PM, Andreas Dilger wrote:
>> I just noticed that we have a patch for Lustre that adds a "no_mbcache" mount option to disable mbcache on a per-filesystem basis.
>> 
>> The newest patch we have is based on FC19 (3.10 kernel), is this something that would be of interest if we submitted it upstream?
>> 
>> http://review.whamcloud.com/#/c/7263/8/ldiskfs/kernel_patches/patches/fc19/ext4-disable-mb-cache.patch
>> 
>> Cheers, Andreas
>> 
> 
> Here are some of the improvements I saw with SELinux disabled.

To confirm, is this data on your "128-byte inode on ramdisk" config,
or is this on a system with a real disk and 256-byte inodes?

> On an 80 core machine,
> 
> custom 15.5%
> disk 27.81
> fserver 5.33%
> new_dbase 9.06%
> new_fserver 5.45%
> 
> 
> On an 8 core machine,
> 
> alltests 5.24
> custom 2.26
> shared 9.32
> short 24.08
> 
> The rest is not noticeably different.

This is a pretty significant price to pay for SELinux in any case.
I guess it is probably lower overhead with 256-byte inodes, but
anything that adds 5-25% overhead shouldn't be taken for granted.

Cheers, Andreas






^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v3 0/2] ext4: increase mbcache scalability
  2013-10-04  3:22                                             ` Andreas Dilger
@ 2013-10-18 12:43                                               ` Thavatchai Makphaibulchoke
  2013-10-19 23:06                                                 ` Eric Sandeen
  0 siblings, 1 reply; 63+ messages in thread
From: Thavatchai Makphaibulchoke @ 2013-10-18 12:43 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Theodore Ts'o, Eric Sandeen, linux-ext4@vger.kernel.org List,
	Chandramouleeswaran, Aswin, Linus Torvalds

On 10/03/2013 09:22 PM, Andreas Dilger wrote:

> 
> This is a pretty significant price to pay for SELinux in any case.
> I guess it is probably lower overhead with 256-byte inodes, but
> anything that adds 5-25% overhead shouldn't be taken for granted.
> 
> Cheers, Andreas
> 

So far we have determined that,

- With SELinux enabled on a ramdisk filesystem with the default inode size of 128 bytes, external xattr blocks are generated
- In one of the aim7 workloads, mbcache has a hit ratio of about 65%

It seems like mbcache itself, and the mbcache locking optimization attempted by the patch, could actually improve performance for a real-world system where extended attributes are employed.  I believe the patch should go in if there isn't any concern.

Please let me know if there is any comment or concern about the patch.

Thanks,
Mak.




^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v3 0/2] ext4: increase mbcache scalability
  2013-10-18 12:43                                               ` Thavatchai Makphaibulchoke
@ 2013-10-19 23:06                                                 ` Eric Sandeen
  0 siblings, 0 replies; 63+ messages in thread
From: Eric Sandeen @ 2013-10-19 23:06 UTC (permalink / raw)
  To: Thavatchai Makphaibulchoke
  Cc: Andreas Dilger, Theodore Ts'o,
	linux-ext4@vger.kernel.org List, Chandramouleeswaran, Aswin,
	Linus Torvalds

On 10/18/13 7:43 AM, Thavatchai Makphaibulchoke wrote:
> On 10/03/2013 09:22 PM, Andreas Dilger wrote:
> 
>>
>> This is a pretty significant price to pay for SELinux in any case.
>> I guess it is probably lower overhead with 256-byte inodes, but
>> anything that adds 5-25% overhead shouldn't be taken for granted.
>>
>> Cheers, Andreas
>>
> 
> So far we have determined that,
> 
> - With SELinux enabled on a ramdisk filesystem with the default inode size of 128 bytes, external xattr blocks are generated
> - In one of the aim7 workloads, mbcache has a hit ratio of about 65%
> 
> It seems like mbcache itself, and the mbcache locking optimization attempted
> by the patch, could actually improve performance for a real-world
> system where extended attributes are employed. I believe the patch should
> go in if there isn't any concern.

But the 128-byte inode size is not common, so in general, mbcache isn't
exercised that often.

(still, it is there, and if there's a nice scalability patch for it,
it's probably worth it).

-Eric

> Please let me know if there is any comment or concern about the patch.
> 
> Thanks,
> Mak.
> 
> 
> 


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v3 1/2] mbcache: decoupling the locking of local from global data
  2013-09-04 16:39   ` [PATCH v3 1/2] mbcache: decoupling the locking of local from global data T Makphaibulchoke
@ 2013-10-30 13:27     ` Theodore Ts'o
  2013-10-30 14:42     ` Theodore Ts'o
  1 sibling, 0 replies; 63+ messages in thread
From: Theodore Ts'o @ 2013-10-30 13:27 UTC (permalink / raw)
  To: T Makphaibulchoke
  Cc: adilger.kernel, viro, linux-ext4, linux-kernel, linux-fsdevel,
	aswin, torvalds, aswin_proj

On Wed, Sep 04, 2013 at 10:39:15AM -0600, T Makphaibulchoke wrote:
> The patch increases the parallelism of mb_cache_entry utilization by
> replacing list_head with hlist_bl_node for the implementation of both the
> block and index hash tables.  Each hlist_bl_node contains a built-in lock
> used to protect mb_cache's local block and index hash chains. The global
> data mb_cache_lru_list and mb_cache_list continue to be protected by the
> global mb_cache_spinlock.

In the process of applying this patch to the ext4 tree, I had to
rework one of the patches to account for a change upstream to the
shrinker interface (which modified mb_cache_shrink_fn() to be
mb_cache_shrink_scan()).
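
(For reference, with the new interface the shrinker is split into a
count callback and a scan callback, roughly:

	static struct shrinker mb_cache_shrinker = {
		.count_objects = mb_cache_shrink_count,
		.scan_objects = mb_cache_shrink_scan,
		.seeks = DEFAULT_SEEKS,
	};

and the scan callback is expected to return the number of entries it
actually freed.)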

Can you verify that the changes I made look sane?

Thanks,

					- Ted

diff --git a/fs/mbcache.c b/fs/mbcache.c
index 1f90cd0..44e7153 100644
--- a/fs/mbcache.c
+++ b/fs/mbcache.c
@@ -200,25 +200,38 @@ forget:
 static unsigned long
 mb_cache_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
 {
-	LIST_HEAD(free_list);
-	struct mb_cache_entry *entry, *tmp;
 	int nr_to_scan = sc->nr_to_scan;
 	gfp_t gfp_mask = sc->gfp_mask;
 	unsigned long freed = 0;
 
 	mb_debug("trying to free %d entries", nr_to_scan);
-	spin_lock(&mb_cache_spinlock);
-	while (nr_to_scan-- && !list_empty(&mb_cache_lru_list)) {
-		struct mb_cache_entry *ce =
-			list_entry(mb_cache_lru_list.next,
-				   struct mb_cache_entry, e_lru_list);
-		list_move_tail(&ce->e_lru_list, &free_list);
-		__mb_cache_entry_unhash(ce);
-		freed++;
-	}
-	spin_unlock(&mb_cache_spinlock);
-	list_for_each_entry_safe(entry, tmp, &free_list, e_lru_list) {
-		__mb_cache_entry_forget(entry, gfp_mask);
+	while (nr_to_scan > 0) {
+		struct mb_cache_entry *ce;
+
+		spin_lock(&mb_cache_spinlock);
+		if (list_empty(&mb_cache_lru_list)) {
+			spin_unlock(&mb_cache_spinlock);
+			break;
+		}
+		ce = list_entry(mb_cache_lru_list.next,
+			struct mb_cache_entry, e_lru_list);
+		list_del_init(&ce->e_lru_list);
+		spin_unlock(&mb_cache_spinlock);
+
+		hlist_bl_lock(ce->e_block_hash_p);
+		hlist_bl_lock(ce->e_index_hash_p);
+		if (!(ce->e_used || ce->e_queued)) {
+			__mb_cache_entry_unhash_index(ce);
+			hlist_bl_unlock(ce->e_index_hash_p);
+			__mb_cache_entry_unhash_block(ce);
+			hlist_bl_unlock(ce->e_block_hash_p);
+			__mb_cache_entry_forget(ce, gfp_mask);
+			--nr_to_scan;
+			freed++;
+		} else {
+			hlist_bl_unlock(ce->e_index_hash_p);
+			hlist_bl_unlock(ce->e_block_hash_p);
+		}
 	}
 	return freed;
 }

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [PATCH v3 2/2] ext4: each filesystem creates and uses its own mb_cache
  2013-09-04 16:39   ` [PATCH v3 2/2] ext4: each filesystem creates and uses its own mb_cache T Makphaibulchoke
@ 2013-10-30 14:30     ` Theodore Ts'o
  0 siblings, 0 replies; 63+ messages in thread
From: Theodore Ts'o @ 2013-10-30 14:30 UTC (permalink / raw)
  To: T Makphaibulchoke
  Cc: adilger.kernel, viro, linux-ext4, linux-kernel, linux-fsdevel,
	aswin, torvalds, aswin_proj

On Wed, Sep 04, 2013 at 10:39:16AM -0600, T Makphaibulchoke wrote:
> This patch adds new interfaces to create and destroy caches,
> ext4_xattr_create_cache() and ext4_xattr_destroy_cache(), and removes the
> cache creation and destroy calls from ext4_init_xattr() and ext4_exit_xattr()
> in fs/ext4/xattr.c.
> fs/ext4/super.c has been changed so that when a filesystem is mounted a
> cache is allocated and attached to its ext4_sb_info structure.

One problem with this patch is that it creates separate slabs for each
ext4 file system where previously we had a single slab cache for
mbcache entries.  I've tried adding the following patch on top of
yours so that we can retain the previous behavior of creating a single
kmem_cache for all ext4 file systems.

Please comment/review...

							- Ted

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 3b525b6..d529807 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -31,6 +31,7 @@
 #include <linux/vfs.h>
 #include <linux/random.h>
 #include <linux/mount.h>
+#include <linux/mbcache.h>
 #include <linux/namei.h>
 #include <linux/quotaops.h>
 #include <linux/seq_file.h>
@@ -5492,9 +5493,16 @@ static int __init ext4_init_fs(void)
 		init_waitqueue_head(&ext4__ioend_wq[i]);
 	}
 
+	ext4_xattr_kmem_cache = kmem_cache_create("ext4_xattr",
+		sizeof(struct mb_cache_entry), 0,
+		SLAB_RECLAIM_ACCOUNT|SLAB_MEM_SPREAD, NULL);
+	if (ext4_xattr_kmem_cache == NULL)
+		return -ENOMEM;
+
+
 	err = ext4_init_es();
 	if (err)
-		return err;
+		goto out8;
 
 	err = ext4_init_pageio();
 	if (err)
@@ -5548,12 +5556,15 @@ out6:
 	ext4_exit_pageio();
 out7:
 	ext4_exit_es();
+out8:
+	kmem_cache_destroy(ext4_xattr_kmem_cache);
 
 	return err;
 }
 
 static void __exit ext4_exit_fs(void)
 {
+	kmem_cache_destroy(ext4_xattr_kmem_cache);
 	ext4_destroy_lazyinit_thread();
 	unregister_as_ext2();
 	unregister_as_ext3();
diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index 07ca399..3d2bd5a 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -81,6 +81,8 @@
 # define ea_bdebug(bh, fmt, ...)	no_printk(fmt, ##__VA_ARGS__)
 #endif
 
+struct kmem_cache *ext4_xattr_kmem_cache;
+
 static void ext4_xattr_cache_insert(struct mb_cache *, struct buffer_head *);
 static struct buffer_head *ext4_xattr_cache_find(struct inode *,
 						 struct ext4_xattr_header *,
@@ -1687,7 +1689,8 @@ static void ext4_xattr_rehash(struct ext4_xattr_header *header,
 struct mb_cache *
 ext4_xattr_create_cache(char *name)
 {
-	return mb_cache_create(name, HASH_BUCKET_BITS);
+	return mb_cache_create_instance(name, HASH_BUCKET_BITS,
+					ext4_xattr_kmem_cache);
 }
 
 void ext4_xattr_destroy_cache(struct mb_cache *cache)
diff --git a/fs/ext4/xattr.h b/fs/ext4/xattr.h
index b930320..15616eb 100644
--- a/fs/ext4/xattr.h
+++ b/fs/ext4/xattr.h
@@ -94,6 +94,8 @@ struct ext4_xattr_ibody_find {
 	struct ext4_iloc iloc;
 };
 
+extern struct kmem_cache *ext4_xattr_kmem_cache;
+
 extern const struct xattr_handler ext4_xattr_user_handler;
 extern const struct xattr_handler ext4_xattr_trusted_handler;
 extern const struct xattr_handler ext4_xattr_acl_access_handler;
diff --git a/fs/mbcache.c b/fs/mbcache.c
index 44e7153..8e373c6 100644
--- a/fs/mbcache.c
+++ b/fs/mbcache.c
@@ -96,6 +96,7 @@ MODULE_DESCRIPTION("Meta block cache (for extended attributes)");
 MODULE_LICENSE("GPL");
 
 EXPORT_SYMBOL(mb_cache_create);
+EXPORT_SYMBOL(mb_cache_create_instance);
 EXPORT_SYMBOL(mb_cache_shrink);
 EXPORT_SYMBOL(mb_cache_destroy);
 EXPORT_SYMBOL(mb_cache_entry_alloc);
@@ -260,18 +261,20 @@ static struct shrinker mb_cache_shrinker = {
 };
 
 /*
- * mb_cache_create()  create a new cache
+ * mb_cache_create_instance()  create a new cache for a device
  *
  * All entries in one cache are equal size. Cache entries may be from
- * multiple devices. If this is the first mbcache created, registers
- * the cache with kernel memory management. Returns NULL if no more
+ * multiple devices.  The caller must create a kmem_cache appropriate
+ * for allocating struct mb_cache_entry.  Returns NULL if no more
  * memory was available.
  *
  * @name: name of the cache (informal)
  * @bucket_bits: log2(number of hash buckets)
+ * @mb_kmem_cache: slab cache for mb_cache_entry structures
  */
 struct mb_cache *
-mb_cache_create(const char *name, int bucket_bits)
+mb_cache_create_instance(const char *name, int bucket_bits,
+			 struct kmem_cache *mb_kmem_cache)
 {
 	int n, bucket_count = 1 << bucket_bits;
 	struct mb_cache *cache = NULL;
@@ -294,11 +297,8 @@ mb_cache_create(const char *name, int bucket_bits)
 		goto fail;
 	for (n=0; n<bucket_count; n++)
 		INIT_HLIST_BL_HEAD(&cache->c_index_hash[n]);
-	cache->c_entry_cache = kmem_cache_create(name,
-		sizeof(struct mb_cache_entry), 0,
-		SLAB_RECLAIM_ACCOUNT|SLAB_MEM_SPREAD, NULL);
-	if (!cache->c_entry_cache)
-		goto fail2;
+	BUG_ON(mb_kmem_cache == NULL);
+	cache->c_entry_cache = mb_kmem_cache;
 
 	/*
 	 * Set an upper limit on the number of cache entries so that the hash
@@ -311,15 +311,38 @@ mb_cache_create(const char *name, int bucket_bits)
 	spin_unlock(&mb_cache_spinlock);
 	return cache;
 
-fail2:
-	kfree(cache->c_index_hash);
-
 fail:
 	kfree(cache->c_block_hash);
 	kfree(cache);
 	return NULL;
 }
 
+/*
+ * mb_cache_create()  create a new cache
+ *
+ * All entries in one cache are equal size. Cache entries may be from
+ * multiple devices.  Returns NULL if no more memory was available.
+ *
+ * @name: name of the cache (informal)
+ * @bucket_bits: log2(number of hash buckets)
+ */
+struct mb_cache *
+mb_cache_create(const char *name, int bucket_bits)
+{
+	struct kmem_cache *mb_kmem_cache;
+	struct mb_cache *cache;
+
+	mb_kmem_cache = kmem_cache_create(name,
+		sizeof(struct mb_cache_entry), 0,
+		SLAB_RECLAIM_ACCOUNT|SLAB_MEM_SPREAD, NULL);
+	if (mb_kmem_cache == NULL)
+		return NULL;
+	cache = mb_cache_create_instance(name, bucket_bits, mb_kmem_cache); 
+	if (cache == NULL)
+		kmem_cache_destroy(mb_kmem_cache);
+	cache->c_entry_cache_autofree = 1;
+	return cache;
+}
 
 /*
  * mb_cache_shrink()
@@ -406,7 +429,8 @@ mb_cache_destroy(struct mb_cache *cache)
 			  atomic_read(&cache->c_entry_count));
 	}
 
-	kmem_cache_destroy(cache->c_entry_cache);
+	if (cache->c_entry_cache_autofree)
+		kmem_cache_destroy(cache->c_entry_cache);
 
 	kfree(cache->c_index_hash);
 	kfree(cache->c_block_hash);
diff --git a/include/linux/mbcache.h b/include/linux/mbcache.h
index 89826c0..b326ce7 100644
--- a/include/linux/mbcache.h
+++ b/include/linux/mbcache.h
@@ -29,11 +29,14 @@ struct mb_cache {
 	struct kmem_cache		*c_entry_cache;
 	struct hlist_bl_head		*c_block_hash;
 	struct hlist_bl_head		*c_index_hash;
+	unsigned int			c_entry_cache_autofree:1;
 };
 
 /* Functions on caches */
 
 struct mb_cache *mb_cache_create(const char *, int);
+struct mb_cache *mb_cache_create_instance(const char *,
+					  int, struct kmem_cache *);
 void mb_cache_shrink(struct block_device *);
 void mb_cache_destroy(struct mb_cache *);
 

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [PATCH v3 1/2] mbcache: decoupling the locking of local from global data
  2013-09-04 16:39   ` [PATCH v3 1/2] mbcache: decoupling the locking of local from global data T Makphaibulchoke
  2013-10-30 13:27     ` Theodore Ts'o
@ 2013-10-30 14:42     ` Theodore Ts'o
  2013-10-30 17:32       ` Thavatchai Makphaibulchoke
  1 sibling, 1 reply; 63+ messages in thread
From: Theodore Ts'o @ 2013-10-30 14:42 UTC (permalink / raw)
  To: T Makphaibulchoke
  Cc: adilger.kernel, viro, linux-ext4, linux-kernel, linux-fsdevel,
	aswin, torvalds, aswin_proj

On Wed, Sep 04, 2013 at 10:39:15AM -0600, T Makphaibulchoke wrote:
> The patch increases the parallelism of mb_cache_entry utilization by
> replacing list_head with hlist_bl_node for the implementation of both the
> block and index hash tables.  Each hlist_bl_node contains a built-in lock
> used to protect mb_cache's local block and index hash chains. The global
> data mb_cache_lru_list and mb_cache_list continue to be protected by the
> global mb_cache_spinlock.
> 
> Signed-off-by: T. Makphaibulchoke <tmac@hp.com>

I tried running xfstests with this patch, and it blew up on
generic/020 test:

generic/020	[10:21:50][  105.170352] ------------[ cut here ]------------
[  105.171683] kernel BUG at /usr/projects/linux/ext4/include/linux/bit_spinlock.h:76!
[  105.173346] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
[  105.173346] Modules linked in:
[  105.173346] CPU: 1 PID: 8519 Comm: attr Not tainted 3.12.0-rc5-00008-gffbe1d7-dirty #1492
[  105.173346] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[  105.173346] task: f5abe560 ti: f2274000 task.ti: f2274000
[  105.173346] EIP: 0060:[<c026b464>] EFLAGS: 00010246 CPU: 1
[  105.173346] EIP is at hlist_bl_unlock+0x7/0x1c
[  105.173346] EAX: f488d360 EBX: f488d360 ECX: 00000000 EDX: f2998800
[  105.173346] ESI: f29987f0 EDI: 6954c848 EBP: f2275cc8 ESP: f2275cb8
[  105.173346]  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
[  105.173346] CR0: 80050033 CR2: b76bcf54 CR3: 34844000 CR4: 000006f0
[  105.173346] Stack:
[  105.173346]  c026bc78 f2275d48 6954c848 f29987f0 f2275d24 c02cd7a9 f2275ce4 c02e2881
[  105.173346]  f255d8c8 00000000 f1109020 f4a67f00 f2275d54 f2275d08 c02cd020 6954c848
[  105.173346]  f4a67f00 f1109000 f2b0eba8 f2ee3800 f2275d28 f4f811e8 f2275d38 00000000
[  105.173346] Call Trace:
[  105.173346]  [<c026bc78>] ? mb_cache_entry_find_first+0x4b/0x55
[  105.173346]  [<c02cd7a9>] ext4_xattr_block_set+0x248/0x6e7
[  105.173346]  [<c02e2881>] ? jbd2_journal_put_journal_head+0xe2/0xed
[  105.173346]  [<c02cd020>] ? ext4_xattr_find_entry+0x52/0xac
[  105.173346]  [<c02ce307>] ext4_xattr_set_handle+0x1c7/0x30f
[  105.173346]  [<c02ce4f4>] ext4_xattr_set+0xa5/0xe1
[  105.173346]  [<c02ceb36>] ext4_xattr_user_set+0x46/0x5f
[  105.173346]  [<c024a4da>] generic_setxattr+0x4c/0x5e
[  105.173346]  [<c024a48e>] ? generic_listxattr+0x95/0x95
[  105.173346]  [<c024ab0f>] __vfs_setxattr_noperm+0x56/0xb6
[  105.173346]  [<c024abd2>] vfs_setxattr+0x63/0x7e
[  105.173346]  [<c024ace8>] setxattr+0xfb/0x139
[  105.173346]  [<c01b200a>] ? __lock_acquire+0x540/0xca6
[  105.173346]  [<c01877a3>] ? lg_local_unlock+0x1b/0x34
[  105.173346]  [<c01af8dd>] ? trace_hardirqs_off_caller+0x2e/0x98
[  105.173346]  [<c0227e69>] ? kmem_cache_free+0xd4/0x149
[  105.173346]  [<c01b2c2b>] ? lock_acquire+0xdd/0x107
[  105.173346]  [<c023225e>] ? __sb_start_write+0xee/0x11d
[  105.173346]  [<c0247383>] ? mnt_want_write+0x1e/0x3e
[  105.173346]  [<c01b3019>] ? trace_hardirqs_on_caller+0x12a/0x17e
[  105.173346]  [<c0247353>] ? __mnt_want_write+0x4e/0x60
[  105.173346]  [<c024af3b>] SyS_lsetxattr+0x6a/0x9f
[  105.173346]  [<c078d0e8>] syscall_call+0x7/0xb
[  105.173346] Code: 00 00 00 00 5b 5d c3 55 89 e5 53 3e 8d 74 26 00 8b 58 08 89 c2 8b 43 18 e8 3f c9 fb ff f0 ff 4b 0c 5b 5d c3 8b 10 80 e2 01 75 02 <0f> 0b 55 89 e5 0f ba 30 00 89 e0 25 00 e0 ff ff ff 48 14 5d c3
[  105.173346] EIP: [<c026b464>] hlist_bl_unlock+0x7/0x1c SS:ESP 0068:f2275cb8
[  105.273781] ---[ end trace 1ee45ddfc1df0935 ]---

When I tried to find a potential problem, I immediately ran into this.
I'm not entirely sure it's the problem, but it's raised a number of
red flags for me in terms of (a) how much testing you've employed with
this patch set, and (b) how maintainable and easy-to-audit the code
will be with this extra locking.  The comments are a good start, but
some additional comments about exactly what assumptions a function
makes about locks that are held on function entry, or especially about
cases where the locking is different on function entry and function
exit, might make it easier for people to audit this patch.
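
For example, each function that takes or drops a chain lock could start
with something like this (hypothetical wording, not code from the patch):

	/*
	 * __mb_cache_entry_find()
	 *
	 * On entry: the index hash chain lock for @head is held by the caller.
	 * On exit:  that lock has been dropped; if an entry is returned, a
	 * reference (e_used) has been taken on it instead.
	 */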

Or maybe this commit needs to be split up, with first a conversion from
using list_head to hlist_bl_node, and then changing the locking?  The
bottom line is that we need to somehow make this patch easier to
validate/review.

> @@ -520,18 +647,23 @@ __mb_cache_entry_find(struct list_head *l, struct list_head *head,
>  				ce->e_queued++;
>  				prepare_to_wait(&mb_cache_queue, &wait,
>  						TASK_UNINTERRUPTIBLE);
> -				spin_unlock(&mb_cache_spinlock);
> +				hlist_bl_unlock(head);
>  				schedule();
> -				spin_lock(&mb_cache_spinlock);
> +				hlist_bl_lock(head);
> +				mb_assert(ce->e_index_hash_p == head);
>  				ce->e_queued--;
>  			}
> +			hlist_bl_unlock(head);
>  			finish_wait(&mb_cache_queue, &wait);
>  
> -			if (!__mb_cache_entry_is_hashed(ce)) {
> +			hlist_bl_lock(ce->e_block_hash_p);
> +			if (!__mb_cache_entry_is_block_hashed(ce)) {
>  				__mb_cache_entry_release_unlock(ce);
> -				spin_lock(&mb_cache_spinlock);
> +				hlist_bl_lock(head);

<---  are we missing a "hlist_bl_unlock(ce->e_block_hash_p);" here?

<---- Why is it that we call "hlist_bl_lock(head);" but not in the else clause?

>  				return ERR_PTR(-EAGAIN);
> -			}
> +			} else
> +				hlist_bl_unlock(ce->e_block_hash_p);
> +			mb_assert(ce->e_index_hash_p == head);
>  			return ce;
>  		}
>  		l = l->next; 


Cheers,

					- Ted

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v3 1/2] mbcache: decoupling the locking of local from global data
  2013-10-30 14:42     ` Theodore Ts'o
@ 2013-10-30 17:32       ` Thavatchai Makphaibulchoke
  0 siblings, 0 replies; 63+ messages in thread
From: Thavatchai Makphaibulchoke @ 2013-10-30 17:32 UTC (permalink / raw)
  To: Theodore Ts'o, T Makphaibulchoke, adilger.kernel, viro,
	linux-ext4, linux-kernel, linux-fsdevel, aswin, torvalds,
	aswin_proj

On 10/30/2013 08:42 AM, Theodore Ts'o wrote:
> I tried running xfstests with this patch, and it blew up on
> generic/020 test:
> 
> generic/020	[10:21:50][  105.170352] ------------[ cut here ]------------
> [  105.171683] kernel BUG at /usr/projects/linux/ext4/include/linux/bit_spinlock.h:76!
> [  105.173346] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
> [  105.173346] Modules linked in:
> [  105.173346] CPU: 1 PID: 8519 Comm: attr Not tainted 3.12.0-rc5-00008-gffbe1d7-dirty #1492
> [  105.173346] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> [  105.173346] task: f5abe560 ti: f2274000 task.ti: f2274000
> [  105.173346] EIP: 0060:[<c026b464>] EFLAGS: 00010246 CPU: 1
> [  105.173346] EIP is at hlist_bl_unlock+0x7/0x1c
> [  105.173346] EAX: f488d360 EBX: f488d360 ECX: 00000000 EDX: f2998800
> [  105.173346] ESI: f29987f0 EDI: 6954c848 EBP: f2275cc8 ESP: f2275cb8
> [  105.173346]  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
> [  105.173346] CR0: 80050033 CR2: b76bcf54 CR3: 34844000 CR4: 000006f0
> [  105.173346] Stack:
> [  105.173346]  c026bc78 f2275d48 6954c848 f29987f0 f2275d24 c02cd7a9 f2275ce4 c02e2881
> [  105.173346]  f255d8c8 00000000 f1109020 f4a67f00 f2275d54 f2275d08 c02cd020 6954c848
> [  105.173346]  f4a67f00 f1109000 f2b0eba8 f2ee3800 f2275d28 f4f811e8 f2275d38 00000000
> [  105.173346] Call Trace:
> [  105.173346]  [<c026bc78>] ? mb_cache_entry_find_first+0x4b/0x55
> [  105.173346]  [<c02cd7a9>] ext4_xattr_block_set+0x248/0x6e7
> [  105.173346]  [<c02e2881>] ? jbd2_journal_put_journal_head+0xe2/0xed
> [  105.173346]  [<c02cd020>] ? ext4_xattr_find_entry+0x52/0xac
> [  105.173346]  [<c02ce307>] ext4_xattr_set_handle+0x1c7/0x30f
> [  105.173346]  [<c02ce4f4>] ext4_xattr_set+0xa5/0xe1
> [  105.173346]  [<c02ceb36>] ext4_xattr_user_set+0x46/0x5f
> [  105.173346]  [<c024a4da>] generic_setxattr+0x4c/0x5e
> [  105.173346]  [<c024a48e>] ? generic_listxattr+0x95/0x95
> [  105.173346]  [<c024ab0f>] __vfs_setxattr_noperm+0x56/0xb6
> [  105.173346]  [<c024abd2>] vfs_setxattr+0x63/0x7e
> [  105.173346]  [<c024ace8>] setxattr+0xfb/0x139
> [  105.173346]  [<c01b200a>] ? __lock_acquire+0x540/0xca6
> [  105.173346]  [<c01877a3>] ? lg_local_unlock+0x1b/0x34
> [  105.173346]  [<c01af8dd>] ? trace_hardirqs_off_caller+0x2e/0x98
> [  105.173346]  [<c0227e69>] ? kmem_cache_free+0xd4/0x149
> [  105.173346]  [<c01b2c2b>] ? lock_acquire+0xdd/0x107
> [  105.173346]  [<c023225e>] ? __sb_start_write+0xee/0x11d
> [  105.173346]  [<c0247383>] ? mnt_want_write+0x1e/0x3e
> [  105.173346]  [<c01b3019>] ? trace_hardirqs_on_caller+0x12a/0x17e
> [  105.173346]  [<c0247353>] ? __mnt_want_write+0x4e/0x60
> [  105.173346]  [<c024af3b>] SyS_lsetxattr+0x6a/0x9f
> [  105.173346]  [<c078d0e8>] syscall_call+0x7/0xb
> [  105.173346] Code: 00 00 00 00 5b 5d c3 55 89 e5 53 3e 8d 74 26 00 8b 58 08 89 c2 8b 43 18 e8 3f c9 fb ff f0 ff 4b 0c 5b 5d c3 8b 10 80 e2 01 75 02 <0f> 0b 55 89 e5 0f ba 30 00 89 e0 25 00 e0 ff ff ff 48 14 5d c3
> [  105.173346] EIP: [<c026b464>] hlist_bl_unlock+0x7/0x1c SS:ESP 0068:f2275cb8
> [  105.273781] ---[ end trace 1ee45ddfc1df0935 ]---
> 
> When I tried to find a potential problem, I immediately ran into this.
> I'm not entirely sure it's the problem, but it's raised a number of
> red flags for me in terms of (a) how much testing you've employed with
> this patch set, and (b) how maintainable and easy-to-audit the code
> will be with this extra locking.  The comments are a good start, but
> some additional comments about exactly what assumptions a function
> makes about locks that are held on function entry, or especially about
> cases where the locking is different on function entry and function
> exit, might make it easier for people to audit this patch.
> 
> Or maybe this commit needs to be split up, with first a conversion from
> using list_head to hlist_bl_node, and then changing the locking?  The
> bottom line is that we need to somehow make this patch easier to
> validate/review.
> 

Thanks for the comments.  Yes, I did run it through xfstests.  My guess is that you probably ran into a race condition that I did not.

I will try to port the patch to a more recent kernel, including the mb_cache_shrink_scan() change sent earlier (BTW, it looks good), and debug the problem.

Yes, those are good suggestions.  Once I find the problem, I will resubmit with more comments and also split it into two patches, as suggested. 

>> @@ -520,18 +647,23 @@ __mb_cache_entry_find(struct list_head *l, struct list_head *head,
>>  				ce->e_queued++;
>>  				prepare_to_wait(&mb_cache_queue, &wait,
>>  						TASK_UNINTERRUPTIBLE);
>> -				spin_unlock(&mb_cache_spinlock);
>> +				hlist_bl_unlock(head);
>>  				schedule();
>> -				spin_lock(&mb_cache_spinlock);
>> +				hlist_bl_lock(head);
>> +				mb_assert(ce->e_index_hash_p == head);
>>  				ce->e_queued--;
>>  			}
>> +			hlist_bl_unlock(head);
>>  			finish_wait(&mb_cache_queue, &wait);
>>  
>> -			if (!__mb_cache_entry_is_hashed(ce)) {
>> +			hlist_bl_lock(ce->e_block_hash_p);
>> +			if (!__mb_cache_entry_is_block_hashed(ce)) {
>>  				__mb_cache_entry_release_unlock(ce);
>> -				spin_lock(&mb_cache_spinlock);
>> +				hlist_bl_lock(head);
> 
> <---  are we missing a "hlist_bl_unlock(ce->e_block_hash_p);" here?
> 
> <---- Why is it that we call "hlist_bl_lock(head);" but not in the else clause?

The function __mb_cache_entry_release_unlock() releases both the e_block_hash and e_index_hash locks.  So we have to reacquire the e_index_hash (head) lock in the then part, and release the e_block_hash lock in the else part.
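
In other words, the intended flow in that hunk is roughly this (simplified
sketch, showing only the locking):

	hlist_bl_lock(ce->e_block_hash_p);
	if (!__mb_cache_entry_is_block_hashed(ce)) {
		/* drops both the e_block_hash and e_index_hash locks */
		__mb_cache_entry_release_unlock(ce);
		hlist_bl_lock(head);	/* so the head lock is reacquired here */
		return ERR_PTR(-EAGAIN);
	}
	/* entry is still hashed: only the e_block_hash lock needs dropping */
	hlist_bl_unlock(ce->e_block_hash_p);
	return ce;
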
> 
>>  				return ERR_PTR(-EAGAIN);
>> -			}
>> +			} else
>> +				hlist_bl_unlock(ce->e_block_hash_p);
>> +			mb_assert(ce->e_index_hash_p == head);
>>  			return ce;
>>  		}
>>  		l = l->next; 
> 
> 
> Cheers,
> 
> 					- Ted
> 

Thanks,
Mak.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH v4 0/3] ext4: increase mbcache scalability
  2013-08-22 15:54 ` [PATCH v2 " T Makphaibulchoke
  2013-08-22 15:54   ` [PATCH v2 1/2] mbcache: decoupling the locking of local from global data T Makphaibulchoke
  2013-08-22 15:54     ` T Makphaibulchoke
@ 2014-01-24 18:31   ` T Makphaibulchoke
  2014-01-24 18:31     ` [PATCH v4 1/3] fs/mbcache.c change block and index hash chain to hlist_bl_node T Makphaibulchoke
                       ` (3 more replies)
  2 siblings, 4 replies; 63+ messages in thread
From: T Makphaibulchoke @ 2014-01-24 18:31 UTC (permalink / raw)
  To: viro, tytso, adilger.kernel, linux-ext4, linux-kernel, linux-fsdevel
  Cc: aswin, T Makphaibulchoke

The patch consists of three parts.

The first part changes the implementation of both the block and index hash
chains of an mb_cache from list_head to hlist_bl_head and also introduces new
members, including a spinlock, to mb_cache_entry, as required by the second
part.

The second part introduces a higher degree of parallelism to the use of the
mb_cache and its mb_cache_entries, and impacts all ext filesystems.
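
For example, with these two parts a block hash lookup serializes only on the
per-chain bit lock instead of the global mb_cache_spinlock; the pattern used
in the patches is roughly:

	struct hlist_bl_head *block_hash_p = &cache->c_block_hash[bucket];
	struct hlist_bl_node *l;
	struct mb_cache_entry *ce;

	hlist_bl_lock(block_hash_p);	/* per-chain bit lock, not mb_cache_spinlock */
	hlist_bl_for_each_entry(ce, l, block_hash_p, e_block_list) {
		if (ce->e_bdev == bdev && ce->e_block == block)
			goto found;	/* found: the chain is still locked here */
	}
	hlist_bl_unlock(block_hash_p);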

The third part of the patch further increases the scalability of an ext4
filesystem by having each ext4 filesystem allocate and use its own private
mbcache structure, instead of sharing a single mbcache structure across all
ext4 filesystems, and increases the size of its mbcache hash tables.
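
Concretely, with the third part each mount creates and destroys its own cache,
roughly as follows (simplified from the third patch):

	/* in ext4_fill_super() */
	sbi->s_mb_cache = ext4_xattr_create_cache(sb->s_id);
	if (!sbi->s_mb_cache)
		goto failed_mount_wq;

	/* in ext4_put_super() */
	ext4_xattr_destroy_cache(sbi->s_mb_cache);
	sbi->s_mb_cache = NULL;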

Here are some of the benchmark results with the changes.

Using a ram disk, there seem to be no performance differences with aim7 for
all workloads on all platforms tested.

With regular disk filesystems with an inode size of 128 bytes, forcing the
use of external xattrs, there seem to be some good performance increases
with some of the aim7 workloads on all platforms tested.

Here are some of the performance improvements on aim7 with 2000 users.

On a 60 core machine:

---------------------------
|             | % increase |
---------------------------
| alltests    |     30.97  |
---------------------------
| custom      |     41.18  |
---------------------------
| dbase       |     32.06  |
---------------------------
| disk        |    125.02  |
---------------------------
| fserver     |     10.23  |
---------------------------
| new_dbase   |     14.58  |
---------------------------
| new_fserve  |      9.62  |
---------------------------
| shared      |     52.56  |
---------------------------

On a 40 core machine:

---------------------------
|             | % increase |
---------------------------
| custom      |     63.45  |
---------------------------
| disk        |    133.25  |
---------------------------
| fserver     |     78.29  |
---------------------------
| new_fserver |     80.66  |
---------------------------
| shared      |     56.34  |
---------------------------

The changes have been tested with ext4 xfstests to verify that no regression
has been introduced.

Changed in v4:
        - New performance data
	- New diff summary
	- New patch architecture

Changed in v3:
	- New diff summary

Changed in v2:
        - New performance data
	- New diff summary
T Makphaibulchoke (3):
  fs/mbcache.c change block and index hash chain to hlist_bl_node
  mbcache: decoupling the locking of local from global data
  ext4: each filesystem creates and uses its own mc_cache

 fs/ext4/ext4.h          |   1 +
 fs/ext4/super.c         |  24 ++-
 fs/ext4/xattr.c         |  51 ++---
 fs/ext4/xattr.h         |   6 +-
 fs/mbcache.c            | 491 ++++++++++++++++++++++++++++++++++--------------
 include/linux/mbcache.h |  11 +-
 6 files changed, 402 insertions(+), 182 deletions(-)

-- 
1.7.11.3


^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH v4 1/3] fs/mbcache.c change block and index hash chain to hlist_bl_node
  2014-01-24 18:31   ` [PATCH v4 0/3] ext4: increase mbcache scalability T Makphaibulchoke
@ 2014-01-24 18:31     ` T Makphaibulchoke
  2014-01-24 18:31     ` [PATCH v4 2/3] mbcache: decoupling the locking of local from global data T Makphaibulchoke
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 63+ messages in thread
From: T Makphaibulchoke @ 2014-01-24 18:31 UTC (permalink / raw)
  To: viro, tytso, adilger.kernel, linux-ext4, linux-kernel, linux-fsdevel
  Cc: aswin, T Makphaibulchoke

This patch changes both the block and index hash chains of each mb_cache from
a regular linked list to hlist_bl_node, which contains a built-in lock.  This
is the first step in decoupling the locks serializing accesses to mb_cache
global data from those protecting each mb_cache_entry's local data.

Signed-off-by: T. Makphaibulchoke <tmac@hp.com>
---
 fs/mbcache.c            | 117 ++++++++++++++++++++++++++++++++----------------
 include/linux/mbcache.h |  11 +++--
 2 files changed, 85 insertions(+), 43 deletions(-)

diff --git a/fs/mbcache.c b/fs/mbcache.c
index e519e45..55db0da 100644
--- a/fs/mbcache.c
+++ b/fs/mbcache.c
@@ -34,9 +34,9 @@
 #include <linux/mm.h>
 #include <linux/slab.h>
 #include <linux/sched.h>
-#include <linux/init.h>
+#include <linux/list_bl.h>
 #include <linux/mbcache.h>
-
+#include <linux/init.h>
 
 #ifdef MB_CACHE_DEBUG
 # define mb_debug(f...) do { \
@@ -87,21 +87,38 @@ static LIST_HEAD(mb_cache_lru_list);
 static DEFINE_SPINLOCK(mb_cache_spinlock);
 
 static inline int
-__mb_cache_entry_is_hashed(struct mb_cache_entry *ce)
+__mb_cache_entry_is_block_hashed(struct mb_cache_entry *ce)
 {
-	return !list_empty(&ce->e_block_list);
+	return !hlist_bl_unhashed(&ce->e_block_list);
 }
 
 
-static void
-__mb_cache_entry_unhash(struct mb_cache_entry *ce)
+static inline void
+__mb_cache_entry_unhash_block(struct mb_cache_entry *ce)
 {
-	if (__mb_cache_entry_is_hashed(ce)) {
-		list_del_init(&ce->e_block_list);
-		list_del(&ce->e_index.o_list);
-	}
+	if (__mb_cache_entry_is_block_hashed(ce))
+		hlist_bl_del_init(&ce->e_block_list);
+}
+
+static inline int
+__mb_cache_entry_is_index_hashed(struct mb_cache_entry *ce)
+{
+	return !hlist_bl_unhashed(&ce->e_index.o_list);
 }
 
+static inline void
+__mb_cache_entry_unhash_index(struct mb_cache_entry *ce)
+{
+	if (__mb_cache_entry_is_index_hashed(ce))
+		hlist_bl_del_init(&ce->e_index.o_list);
+}
+
+static inline void
+__mb_cache_entry_unhash(struct mb_cache_entry *ce)
+{
+	__mb_cache_entry_unhash_index(ce);
+	__mb_cache_entry_unhash_block(ce);
+}
 
 static void
 __mb_cache_entry_forget(struct mb_cache_entry *ce, gfp_t gfp_mask)
@@ -125,7 +142,7 @@ __mb_cache_entry_release_unlock(struct mb_cache_entry *ce)
 		ce->e_used -= MB_CACHE_WRITER;
 	ce->e_used--;
 	if (!(ce->e_used || ce->e_queued)) {
-		if (!__mb_cache_entry_is_hashed(ce))
+		if (!__mb_cache_entry_is_block_hashed(ce))
 			goto forget;
 		mb_assert(list_empty(&ce->e_lru_list));
 		list_add_tail(&ce->e_lru_list, &mb_cache_lru_list);
@@ -221,18 +238,18 @@ mb_cache_create(const char *name, int bucket_bits)
 	cache->c_name = name;
 	atomic_set(&cache->c_entry_count, 0);
 	cache->c_bucket_bits = bucket_bits;
-	cache->c_block_hash = kmalloc(bucket_count * sizeof(struct list_head),
-	                              GFP_KERNEL);
+	cache->c_block_hash = kmalloc(bucket_count *
+		sizeof(struct hlist_bl_head), GFP_KERNEL);
 	if (!cache->c_block_hash)
 		goto fail;
 	for (n=0; n<bucket_count; n++)
-		INIT_LIST_HEAD(&cache->c_block_hash[n]);
-	cache->c_index_hash = kmalloc(bucket_count * sizeof(struct list_head),
-				      GFP_KERNEL);
+		INIT_HLIST_BL_HEAD(&cache->c_block_hash[n]);
+	cache->c_index_hash = kmalloc(bucket_count *
+		sizeof(struct hlist_bl_head), GFP_KERNEL);
 	if (!cache->c_index_hash)
 		goto fail;
 	for (n=0; n<bucket_count; n++)
-		INIT_LIST_HEAD(&cache->c_index_hash[n]);
+		INIT_HLIST_BL_HEAD(&cache->c_index_hash[n]);
 	cache->c_entry_cache = kmem_cache_create(name,
 		sizeof(struct mb_cache_entry), 0,
 		SLAB_RECLAIM_ACCOUNT|SLAB_MEM_SPREAD, NULL);
@@ -364,10 +381,13 @@ mb_cache_entry_alloc(struct mb_cache *cache, gfp_t gfp_flags)
 			return NULL;
 		atomic_inc(&cache->c_entry_count);
 		INIT_LIST_HEAD(&ce->e_lru_list);
-		INIT_LIST_HEAD(&ce->e_block_list);
+		INIT_HLIST_BL_NODE(&ce->e_block_list);
+		INIT_HLIST_BL_NODE(&ce->e_index.o_list);
 		ce->e_cache = cache;
 		ce->e_queued = 0;
 	}
+	ce->e_block_hash_p = &cache->c_block_hash[0];
+	ce->e_index_hash_p = &cache->c_index_hash[0];
 	ce->e_used = 1 + MB_CACHE_WRITER;
 	return ce;
 }
@@ -393,25 +413,32 @@ mb_cache_entry_insert(struct mb_cache_entry *ce, struct block_device *bdev,
 {
 	struct mb_cache *cache = ce->e_cache;
 	unsigned int bucket;
-	struct list_head *l;
+	struct hlist_bl_node *l;
 	int error = -EBUSY;
+	struct hlist_bl_head *block_hash_p;
+	struct hlist_bl_head *index_hash_p;
+	struct mb_cache_entry *lce;
 
+	mb_assert(ce);
 	bucket = hash_long((unsigned long)bdev + (block & 0xffffffff), 
 			   cache->c_bucket_bits);
+	block_hash_p = &cache->c_block_hash[bucket];
 	spin_lock(&mb_cache_spinlock);
-	list_for_each_prev(l, &cache->c_block_hash[bucket]) {
-		struct mb_cache_entry *ce =
-			list_entry(l, struct mb_cache_entry, e_block_list);
-		if (ce->e_bdev == bdev && ce->e_block == block)
+	hlist_bl_for_each_entry(lce, l, block_hash_p, e_block_list) {
+		if (lce->e_bdev == bdev && lce->e_block == block)
 			goto out;
 	}
+	mb_assert(!__mb_cache_entry_is_block_hashed(ce));
 	__mb_cache_entry_unhash(ce);
 	ce->e_bdev = bdev;
 	ce->e_block = block;
-	list_add(&ce->e_block_list, &cache->c_block_hash[bucket]);
+	ce->e_block_hash_p = block_hash_p;
 	ce->e_index.o_key = key;
 	bucket = hash_long(key, cache->c_bucket_bits);
-	list_add(&ce->e_index.o_list, &cache->c_index_hash[bucket]);
+	index_hash_p = &cache->c_index_hash[bucket];
+	ce->e_index_hash_p = index_hash_p;
+	hlist_bl_add_head(&ce->e_index.o_list, index_hash_p);
+	hlist_bl_add_head(&ce->e_block_list, block_hash_p);
 	error = 0;
 out:
 	spin_unlock(&mb_cache_spinlock);
@@ -463,14 +490,16 @@ mb_cache_entry_get(struct mb_cache *cache, struct block_device *bdev,
 		   sector_t block)
 {
 	unsigned int bucket;
-	struct list_head *l;
+	struct hlist_bl_node *l;
 	struct mb_cache_entry *ce;
+	struct hlist_bl_head *block_hash_p;
 
 	bucket = hash_long((unsigned long)bdev + (block & 0xffffffff),
 			   cache->c_bucket_bits);
+	block_hash_p = &cache->c_block_hash[bucket];
 	spin_lock(&mb_cache_spinlock);
-	list_for_each(l, &cache->c_block_hash[bucket]) {
-		ce = list_entry(l, struct mb_cache_entry, e_block_list);
+	hlist_bl_for_each_entry(ce, l, block_hash_p, e_block_list) {
+		mb_assert(ce->e_block_hash_p == block_hash_p);
 		if (ce->e_bdev == bdev && ce->e_block == block) {
 			DEFINE_WAIT(wait);
 
@@ -489,7 +518,7 @@ mb_cache_entry_get(struct mb_cache *cache, struct block_device *bdev,
 			finish_wait(&mb_cache_queue, &wait);
 			ce->e_used += 1 + MB_CACHE_WRITER;
 
-			if (!__mb_cache_entry_is_hashed(ce)) {
+			if (!__mb_cache_entry_is_block_hashed(ce)) {
 				__mb_cache_entry_release_unlock(ce);
 				return NULL;
 			}
@@ -506,12 +535,14 @@ cleanup:
 #if !defined(MB_CACHE_INDEXES_COUNT) || (MB_CACHE_INDEXES_COUNT > 0)
 
 static struct mb_cache_entry *
-__mb_cache_entry_find(struct list_head *l, struct list_head *head,
+__mb_cache_entry_find(struct hlist_bl_node *l, struct hlist_bl_head *head,
 		      struct block_device *bdev, unsigned int key)
 {
-	while (l != head) {
+	while (l != NULL) {
 		struct mb_cache_entry *ce =
-			list_entry(l, struct mb_cache_entry, e_index.o_list);
+			hlist_bl_entry(l, struct mb_cache_entry,
+				e_index.o_list);
+		mb_assert(ce->e_index_hash_p == head);
 		if (ce->e_bdev == bdev && ce->e_index.o_key == key) {
 			DEFINE_WAIT(wait);
 
@@ -532,7 +563,7 @@ __mb_cache_entry_find(struct list_head *l, struct list_head *head,
 			}
 			finish_wait(&mb_cache_queue, &wait);
 
-			if (!__mb_cache_entry_is_hashed(ce)) {
+			if (!__mb_cache_entry_is_block_hashed(ce)) {
 				__mb_cache_entry_release_unlock(ce);
 				spin_lock(&mb_cache_spinlock);
 				return ERR_PTR(-EAGAIN);
@@ -562,12 +593,16 @@ mb_cache_entry_find_first(struct mb_cache *cache, struct block_device *bdev,
 			  unsigned int key)
 {
 	unsigned int bucket = hash_long(key, cache->c_bucket_bits);
-	struct list_head *l;
-	struct mb_cache_entry *ce;
+	struct hlist_bl_node *l;
+	struct mb_cache_entry *ce = NULL;
+	struct hlist_bl_head *index_hash_p;
 
+	index_hash_p = &cache->c_index_hash[bucket];
 	spin_lock(&mb_cache_spinlock);
-	l = cache->c_index_hash[bucket].next;
-	ce = __mb_cache_entry_find(l, &cache->c_index_hash[bucket], bdev, key);
+	if (!hlist_bl_empty(index_hash_p)) {
+		l = hlist_bl_first(index_hash_p);
+		ce = __mb_cache_entry_find(l, index_hash_p, bdev, key);
+	}
 	spin_unlock(&mb_cache_spinlock);
 	return ce;
 }
@@ -597,12 +632,16 @@ mb_cache_entry_find_next(struct mb_cache_entry *prev,
 {
 	struct mb_cache *cache = prev->e_cache;
 	unsigned int bucket = hash_long(key, cache->c_bucket_bits);
-	struct list_head *l;
+	struct hlist_bl_node *l;
 	struct mb_cache_entry *ce;
+	struct hlist_bl_head *index_hash_p;
 
+	index_hash_p = &cache->c_index_hash[bucket];
+	mb_assert(prev->e_index_hash_p == index_hash_p);
 	spin_lock(&mb_cache_spinlock);
+	mb_assert(!hlist_bl_empty(index_hash_p));
 	l = prev->e_index.o_list.next;
-	ce = __mb_cache_entry_find(l, &cache->c_index_hash[bucket], bdev, key);
+	ce = __mb_cache_entry_find(l, index_hash_p, bdev, key);
 	__mb_cache_entry_release_unlock(prev);
 	return ce;
 }
diff --git a/include/linux/mbcache.h b/include/linux/mbcache.h
index 5525d37..e76cc0e 100644
--- a/include/linux/mbcache.h
+++ b/include/linux/mbcache.h
@@ -11,11 +11,14 @@ struct mb_cache_entry {
 	unsigned short			e_queued;
 	struct block_device		*e_bdev;
 	sector_t			e_block;
-	struct list_head		e_block_list;
+	struct hlist_bl_node		e_block_list;
 	struct {
-		struct list_head	o_list;
+		struct hlist_bl_node	o_list;
 		unsigned int		o_key;
 	} e_index;
+	struct hlist_bl_head		*e_block_hash_p;
+	struct hlist_bl_head		*e_index_hash_p;
+	spinlock_t			e_entry_lock;
 };
 
 struct mb_cache {
@@ -25,8 +28,8 @@ struct mb_cache {
 	int				c_max_entries;
 	int				c_bucket_bits;
 	struct kmem_cache		*c_entry_cache;
-	struct list_head		*c_block_hash;
-	struct list_head		*c_index_hash;
+	struct hlist_bl_head		*c_block_hash;
+	struct hlist_bl_head		*c_index_hash;
 };
 
 /* Functions on caches */
-- 
1.7.11.3


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH v4 2/3] mbcache: decoupling the locking of local from global data
  2014-01-24 18:31   ` [PATCH v4 0/3] ext4: increase mbcache scalability T Makphaibulchoke
  2014-01-24 18:31     ` [PATCH v4 1/3] fs/mbcache.c change block and index hash chain to hlist_bl_node T Makphaibulchoke
@ 2014-01-24 18:31     ` T Makphaibulchoke
  2014-01-24 18:31     ` [PATCH v4 3/3] ext4: each filesystem creates and uses its own mc_cache T Makphaibulchoke
  2014-01-24 21:38     ` [PATCH v4 0/3] ext4: increase mbcache scalability Andi Kleen
  3 siblings, 0 replies; 63+ messages in thread
From: T Makphaibulchoke @ 2014-01-24 18:31 UTC (permalink / raw)
  To: viro, tytso, adilger.kernel, linux-ext4, linux-kernel, linux-fsdevel
  Cc: aswin, T Makphaibulchoke

The patch increases the parallelism of mb_cache_entry utilization by
replacing list_head with hlist_bl_node for the implementation of both the
block and index hash tables.  Each hlist_bl_node contains a built-in lock
used to protect mb_cache's local block and index hash chains. The global
data mb_cache_lru_list and mb_cache_list continue to be protected by the
global mb_cache_spinlock.

A new spinlock is also added to the mb_cache_entry to protect its local data,
e_used and e_queued.

Signed-off-by: T. Makphaibulchoke <tmac@hp.com>
---
 fs/mbcache.c | 363 ++++++++++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 260 insertions(+), 103 deletions(-)

diff --git a/fs/mbcache.c b/fs/mbcache.c
index 55db0da..0c4cec2 100644
--- a/fs/mbcache.c
+++ b/fs/mbcache.c
@@ -26,6 +26,40 @@
  * back on the lru list.
  */
 
+/*
+ * Lock descriptions and usage:
+ *
+ * Each hash chain of both the block and index hash tables now contains
+ * a built-in lock used to serialize accesses to the hash chain.
+ *
+ * Accesses to global data structures mb_cache_list and mb_cache_lru_list
+ * are serialized via the global spinlock mb_cache_spinlock.
+ *
+ * Each mb_cache_entry contains a spinlock, e_entry_lock, to serialize
+ * accesses to its local data, such as e_used and e_queued.
+ *
+ * Lock ordering:
+ *
+ * mb_cache_spinlock has the highest lock order, followed by each block and
+ * index hash chain's lock, with e_entry_lock the lowest.  While holding
+ * mb_cache_spinlock, a thread can acquire either a block and/or index hash
+ * chain lock, and in turn can also acquire each entry spinlock, e_entry_lock.
+ *
+ * Synchronization:
+ *
+ * Since both mb_cache_entry_get and mb_cache_entry_find scan the block and
+ * index hash chains, they need to lock the corresponding hash chain.  For each
+ * mb_cache_entry within the chain, they need to lock the mb_cache_entry to
+ * prevent any simultaneous release or free of the entry and also
+ * to serialize accesses to either the e_used or e_queued member of the entry.
+ *
+ * To avoid having a dangling reference to an already freed
+ * mb_cache_entry, an mb_cache_entry is only freed when it is not on a
+ * block hash chain and also no longer being referenced, both e_used,
+ * and e_queued are 0's.  When an mb_cache_entry is explicitly freed it is
+ * first removed from a block hash chain.
+ */
+
 #include <linux/kernel.h>
 #include <linux/module.h>
 
@@ -113,11 +147,21 @@ __mb_cache_entry_unhash_index(struct mb_cache_entry *ce)
 		hlist_bl_del_init(&ce->e_index.o_list);
 }
 
+/*
+ * __mb_cache_entry_unhash_unlock()
+ *
+ * This function is called to unhash the entry from both the block and
+ * index hash chains.
+ * It assumes both the block and index hash chain locks are held upon entry.
+ * It also unlocks both hash chains upon exit.
+ */
 static inline void
-__mb_cache_entry_unhash(struct mb_cache_entry *ce)
+__mb_cache_entry_unhash_unlock(struct mb_cache_entry *ce)
 {
 	__mb_cache_entry_unhash_index(ce);
+	hlist_bl_unlock(ce->e_index_hash_p);
 	__mb_cache_entry_unhash_block(ce);
+	hlist_bl_unlock(ce->e_block_hash_p);
 }
 
 static void
@@ -130,31 +174,42 @@ __mb_cache_entry_forget(struct mb_cache_entry *ce, gfp_t gfp_mask)
 	atomic_dec(&cache->c_entry_count);
 }
 
-
 static void
-__mb_cache_entry_release_unlock(struct mb_cache_entry *ce)
-	__releases(mb_cache_spinlock)
+__mb_cache_entry_release(struct mb_cache_entry *ce)
 {
+	/* First lock the entry to serialize access to its local data. */
+	spin_lock(&ce->e_entry_lock);
 	/* Wake up all processes queuing for this cache entry. */
 	if (ce->e_queued)
 		wake_up_all(&mb_cache_queue);
 	if (ce->e_used >= MB_CACHE_WRITER)
 		ce->e_used -= MB_CACHE_WRITER;
+	/*
+	 * Make sure that all cache entries on lru_list have
+	 * both e_used and e_queued equal to 0.
+	 */
 	ce->e_used--;
 	if (!(ce->e_used || ce->e_queued)) {
-		if (!__mb_cache_entry_is_block_hashed(ce))
+		if (!__mb_cache_entry_is_block_hashed(ce)) {
+			spin_unlock(&ce->e_entry_lock);
 			goto forget;
-		mb_assert(list_empty(&ce->e_lru_list));
-		list_add_tail(&ce->e_lru_list, &mb_cache_lru_list);
+		}
+		/*
+		 * Need access to lru list, first drop entry lock,
+		 * then reacquire the lock in the proper order.
+		 */
+		spin_lock(&mb_cache_spinlock);
+		if (list_empty(&ce->e_lru_list))
+			list_add_tail(&ce->e_lru_list, &mb_cache_lru_list);
+		spin_unlock(&mb_cache_spinlock);
 	}
-	spin_unlock(&mb_cache_spinlock);
+	spin_unlock(&ce->e_entry_lock);
 	return;
 forget:
-	spin_unlock(&mb_cache_spinlock);
+	mb_assert(list_empty(&ce->e_lru_list));
 	__mb_cache_entry_forget(ce, GFP_KERNEL);
 }
 
-
 /*
  * mb_cache_shrink_scan()  memory pressure callback
  *
@@ -177,17 +232,32 @@ mb_cache_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
 
 	mb_debug("trying to free %d entries", nr_to_scan);
 	spin_lock(&mb_cache_spinlock);
-	while (nr_to_scan-- && !list_empty(&mb_cache_lru_list)) {
+	while ((nr_to_scan-- > 0) && !list_empty(&mb_cache_lru_list)) {
 		struct mb_cache_entry *ce =
 			list_entry(mb_cache_lru_list.next,
-				   struct mb_cache_entry, e_lru_list);
-		list_move_tail(&ce->e_lru_list, &free_list);
-		__mb_cache_entry_unhash(ce);
-		freed++;
+				struct mb_cache_entry, e_lru_list);
+		list_del_init(&ce->e_lru_list);
+		spin_unlock(&mb_cache_spinlock);
+		/* Prevent any find or get operation on the entry */
+		hlist_bl_lock(ce->e_block_hash_p);
+		hlist_bl_lock(ce->e_index_hash_p);
+		/* Ignore if it is touched by a find/get */
+		if (ce->e_used || ce->e_queued ||
+			!list_empty(&ce->e_lru_list)) {
+			hlist_bl_unlock(ce->e_index_hash_p);
+			hlist_bl_unlock(ce->e_block_hash_p);
+			spin_lock(&mb_cache_spinlock);
+			continue;
+		}
+		__mb_cache_entry_unhash_unlock(ce);
+		list_add_tail(&ce->e_lru_list, &free_list);
+		spin_lock(&mb_cache_spinlock);
 	}
 	spin_unlock(&mb_cache_spinlock);
+
 	list_for_each_entry_safe(entry, tmp, &free_list, e_lru_list) {
 		__mb_cache_entry_forget(entry, gfp_mask);
+		freed++;
 	}
 	return freed;
 }
@@ -290,21 +360,41 @@ void
 mb_cache_shrink(struct block_device *bdev)
 {
 	LIST_HEAD(free_list);
-	struct list_head *l, *ltmp;
+	struct list_head *l;
+	struct mb_cache_entry *ce, *tmp;
 
+	l = &mb_cache_lru_list;
 	spin_lock(&mb_cache_spinlock);
-	list_for_each_safe(l, ltmp, &mb_cache_lru_list) {
-		struct mb_cache_entry *ce =
-			list_entry(l, struct mb_cache_entry, e_lru_list);
+	while (!list_is_last(l, &mb_cache_lru_list)) {
+		l = l->next;
+		ce = list_entry(l, struct mb_cache_entry, e_lru_list);
 		if (ce->e_bdev == bdev) {
-			list_move_tail(&ce->e_lru_list, &free_list);
-			__mb_cache_entry_unhash(ce);
+			list_del_init(&ce->e_lru_list);
+			spin_unlock(&mb_cache_spinlock);
+			/*
+			 * Prevent any find or get operation on the entry.
+			 */
+			hlist_bl_lock(ce->e_block_hash_p);
+			hlist_bl_lock(ce->e_index_hash_p);
+			/* Ignore if it is touched by a find/get */
+			if (ce->e_used || ce->e_queued ||
+				!list_empty(&ce->e_lru_list)) {
+				hlist_bl_unlock(ce->e_index_hash_p);
+				hlist_bl_unlock(ce->e_block_hash_p);
+				l = &mb_cache_lru_list;
+				spin_lock(&mb_cache_spinlock);
+				continue;
+			}
+			__mb_cache_entry_unhash_unlock(ce);
+			list_add_tail(&ce->e_lru_list, &free_list);
+			l = &mb_cache_lru_list;
+			spin_lock(&mb_cache_spinlock);
 		}
 	}
 	spin_unlock(&mb_cache_spinlock);
-	list_for_each_safe(l, ltmp, &free_list) {
-		__mb_cache_entry_forget(list_entry(l, struct mb_cache_entry,
-						   e_lru_list), GFP_KERNEL);
+
+	list_for_each_entry_safe(ce, tmp, &free_list, e_lru_list) {
+		__mb_cache_entry_forget(ce, GFP_KERNEL);
 	}
 }
 
@@ -320,23 +410,26 @@ void
 mb_cache_destroy(struct mb_cache *cache)
 {
 	LIST_HEAD(free_list);
-	struct list_head *l, *ltmp;
+	struct mb_cache_entry *ce, *tmp;
 
 	spin_lock(&mb_cache_spinlock);
-	list_for_each_safe(l, ltmp, &mb_cache_lru_list) {
-		struct mb_cache_entry *ce =
-			list_entry(l, struct mb_cache_entry, e_lru_list);
-		if (ce->e_cache == cache) {
+	list_for_each_entry_safe(ce, tmp, &mb_cache_lru_list, e_lru_list) {
+		if (ce->e_cache == cache)
 			list_move_tail(&ce->e_lru_list, &free_list);
-			__mb_cache_entry_unhash(ce);
-		}
 	}
 	list_del(&cache->c_cache_list);
 	spin_unlock(&mb_cache_spinlock);
 
-	list_for_each_safe(l, ltmp, &free_list) {
-		__mb_cache_entry_forget(list_entry(l, struct mb_cache_entry,
-						   e_lru_list), GFP_KERNEL);
+	list_for_each_entry_safe(ce, tmp, &free_list, e_lru_list) {
+		list_del_init(&ce->e_lru_list);
+		/*
+		 * Prevent any find or get operation on the entry.
+		 */
+		hlist_bl_lock(ce->e_block_hash_p);
+		hlist_bl_lock(ce->e_index_hash_p);
+		mb_assert(!(ce->e_used || ce->e_queued));
+		__mb_cache_entry_unhash_unlock(ce);
+		__mb_cache_entry_forget(ce, GFP_KERNEL);
 	}
 
 	if (atomic_read(&cache->c_entry_count) > 0) {
@@ -363,31 +456,56 @@ mb_cache_destroy(struct mb_cache *cache)
 struct mb_cache_entry *
 mb_cache_entry_alloc(struct mb_cache *cache, gfp_t gfp_flags)
 {
-	struct mb_cache_entry *ce = NULL;
+	struct mb_cache_entry *ce;
 
 	if (atomic_read(&cache->c_entry_count) >= cache->c_max_entries) {
+		struct list_head *l;
+
+		l = &mb_cache_lru_list;
 		spin_lock(&mb_cache_spinlock);
-		if (!list_empty(&mb_cache_lru_list)) {
-			ce = list_entry(mb_cache_lru_list.next,
-					struct mb_cache_entry, e_lru_list);
-			list_del_init(&ce->e_lru_list);
-			__mb_cache_entry_unhash(ce);
+		while (!list_is_last(l, &mb_cache_lru_list)) {
+			l = l->next;
+			ce = list_entry(l, struct mb_cache_entry, e_lru_list);
+			if (ce->e_cache == cache) {
+				list_del_init(&ce->e_lru_list);
+				spin_unlock(&mb_cache_spinlock);
+				/*
+				 * Prevent any find or get operation on the
+				 * entry.
+				 */
+				hlist_bl_lock(ce->e_block_hash_p);
+				hlist_bl_lock(ce->e_index_hash_p);
+				/* Ignore if it is touched by a find/get */
+				if (ce->e_used || ce->e_queued ||
+					!list_empty(&ce->e_lru_list)) {
+					hlist_bl_unlock(ce->e_index_hash_p);
+					hlist_bl_unlock(ce->e_block_hash_p);
+					l = &mb_cache_lru_list;
+					spin_lock(&mb_cache_spinlock);
+					continue;
+				}
+				mb_assert(list_empty(&ce->e_lru_list));
+				mb_assert(!(ce->e_used || ce->e_queued));
+				__mb_cache_entry_unhash_unlock(ce);
+				goto found;
+			}
 		}
 		spin_unlock(&mb_cache_spinlock);
 	}
-	if (!ce) {
-		ce = kmem_cache_alloc(cache->c_entry_cache, gfp_flags);
-		if (!ce)
-			return NULL;
-		atomic_inc(&cache->c_entry_count);
-		INIT_LIST_HEAD(&ce->e_lru_list);
-		INIT_HLIST_BL_NODE(&ce->e_block_list);
-		INIT_HLIST_BL_NODE(&ce->e_index.o_list);
-		ce->e_cache = cache;
-		ce->e_queued = 0;
-	}
+
+	ce = kmem_cache_alloc(cache->c_entry_cache, gfp_flags);
+	if (!ce)
+		return NULL;
+	atomic_inc(&cache->c_entry_count);
+	INIT_LIST_HEAD(&ce->e_lru_list);
+	INIT_HLIST_BL_NODE(&ce->e_block_list);
+	INIT_HLIST_BL_NODE(&ce->e_index.o_list);
+	ce->e_cache = cache;
+	ce->e_queued = 0;
+found:
 	ce->e_block_hash_p = &cache->c_block_hash[0];
 	ce->e_index_hash_p = &cache->c_index_hash[0];
+	spin_lock_init(&ce->e_entry_lock);
 	ce->e_used = 1 + MB_CACHE_WRITER;
 	return ce;
 }
@@ -423,25 +541,28 @@ mb_cache_entry_insert(struct mb_cache_entry *ce, struct block_device *bdev,
 	bucket = hash_long((unsigned long)bdev + (block & 0xffffffff), 
 			   cache->c_bucket_bits);
 	block_hash_p = &cache->c_block_hash[bucket];
-	spin_lock(&mb_cache_spinlock);
+	hlist_bl_lock(block_hash_p);
 	hlist_bl_for_each_entry(lce, l, block_hash_p, e_block_list) {
 		if (lce->e_bdev == bdev && lce->e_block == block)
 			goto out;
 	}
 	mb_assert(!__mb_cache_entry_is_block_hashed(ce));
-	__mb_cache_entry_unhash(ce);
+	__mb_cache_entry_unhash_block(ce);
+	__mb_cache_entry_unhash_index(ce);
 	ce->e_bdev = bdev;
 	ce->e_block = block;
 	ce->e_block_hash_p = block_hash_p;
 	ce->e_index.o_key = key;
 	bucket = hash_long(key, cache->c_bucket_bits);
 	index_hash_p = &cache->c_index_hash[bucket];
+	hlist_bl_lock(index_hash_p);
 	ce->e_index_hash_p = index_hash_p;
 	hlist_bl_add_head(&ce->e_index.o_list, index_hash_p);
 	hlist_bl_add_head(&ce->e_block_list, block_hash_p);
+	hlist_bl_unlock(index_hash_p);
 	error = 0;
 out:
-	spin_unlock(&mb_cache_spinlock);
+	hlist_bl_unlock(block_hash_p);
 	return error;
 }
 
@@ -456,24 +577,26 @@ out:
 void
 mb_cache_entry_release(struct mb_cache_entry *ce)
 {
-	spin_lock(&mb_cache_spinlock);
-	__mb_cache_entry_release_unlock(ce);
+	__mb_cache_entry_release(ce);
 }
 
 
 /*
  * mb_cache_entry_free()
  *
- * This is equivalent to the sequence mb_cache_entry_takeout() --
- * mb_cache_entry_release().
  */
 void
 mb_cache_entry_free(struct mb_cache_entry *ce)
 {
-	spin_lock(&mb_cache_spinlock);
+	mb_assert(ce);
 	mb_assert(list_empty(&ce->e_lru_list));
-	__mb_cache_entry_unhash(ce);
-	__mb_cache_entry_release_unlock(ce);
+	hlist_bl_lock(ce->e_index_hash_p);
+	__mb_cache_entry_unhash_index(ce);
+	hlist_bl_unlock(ce->e_index_hash_p);
+	hlist_bl_lock(ce->e_block_hash_p);
+	__mb_cache_entry_unhash_block(ce);
+	hlist_bl_unlock(ce->e_block_hash_p);
+	__mb_cache_entry_release(ce);
 }
 
 
@@ -493,43 +616,58 @@ mb_cache_entry_get(struct mb_cache *cache, struct block_device *bdev,
 	struct hlist_bl_node *l;
 	struct mb_cache_entry *ce;
 	struct hlist_bl_head *block_hash_p;
+	int held_lock = 1;
 
 	bucket = hash_long((unsigned long)bdev + (block & 0xffffffff),
 			   cache->c_bucket_bits);
 	block_hash_p = &cache->c_block_hash[bucket];
-	spin_lock(&mb_cache_spinlock);
+	/* First serialize access to the block corresponding hash chain. */
+	hlist_bl_lock(block_hash_p);
 	hlist_bl_for_each_entry(ce, l, block_hash_p, e_block_list) {
 		mb_assert(ce->e_block_hash_p == block_hash_p);
 		if (ce->e_bdev == bdev && ce->e_block == block) {
-			DEFINE_WAIT(wait);
-
-			if (!list_empty(&ce->e_lru_list))
+			/*
+			 * Prevent a release on the entry.
+			 */
+			spin_lock(&ce->e_entry_lock);
+			if (!list_empty(&ce->e_lru_list)) {
+				spin_lock(&mb_cache_spinlock);
 				list_del_init(&ce->e_lru_list);
-
-			while (ce->e_used > 0) {
-				ce->e_queued++;
-				prepare_to_wait(&mb_cache_queue, &wait,
-						TASK_UNINTERRUPTIBLE);
 				spin_unlock(&mb_cache_spinlock);
-				schedule();
-				spin_lock(&mb_cache_spinlock);
-				ce->e_queued--;
 			}
-			finish_wait(&mb_cache_queue, &wait);
+			if (ce->e_used > 0) {
+				DEFINE_WAIT(wait);
+				while (ce->e_used > 0) {
+					ce->e_queued++;
+					prepare_to_wait(&mb_cache_queue, &wait,
+							TASK_UNINTERRUPTIBLE);
+					spin_unlock(&ce->e_entry_lock);
+					if (held_lock) {
+						hlist_bl_unlock(block_hash_p);
+						held_lock = 0;
+					}
+					schedule();
+					spin_lock(&ce->e_entry_lock);
+					ce->e_queued--;
+				}
+				finish_wait(&mb_cache_queue, &wait);
+			}
 			ce->e_used += 1 + MB_CACHE_WRITER;
+			spin_unlock(&ce->e_entry_lock);
 
 			if (!__mb_cache_entry_is_block_hashed(ce)) {
-				__mb_cache_entry_release_unlock(ce);
+				if (held_lock)
+					hlist_bl_unlock(block_hash_p);
+				__mb_cache_entry_release(ce);
 				return NULL;
 			}
-			goto cleanup;
+			if (held_lock)
+				hlist_bl_unlock(block_hash_p);
+			return ce;
 		}
 	}
-	ce = NULL;
-
-cleanup:
-	spin_unlock(&mb_cache_spinlock);
-	return ce;
+	hlist_bl_unlock(block_hash_p);
+	return NULL;
 }
 
 #if !defined(MB_CACHE_INDEXES_COUNT) || (MB_CACHE_INDEXES_COUNT > 0)
@@ -538,40 +676,59 @@ static struct mb_cache_entry *
 __mb_cache_entry_find(struct hlist_bl_node *l, struct hlist_bl_head *head,
 		      struct block_device *bdev, unsigned int key)
 {
+	int held_lock = 1;
+
+	/* The index hash chain lock is already held by the caller. */
 	while (l != NULL) {
 		struct mb_cache_entry *ce =
 			hlist_bl_entry(l, struct mb_cache_entry,
 				e_index.o_list);
 		mb_assert(ce->e_index_hash_p == head);
 		if (ce->e_bdev == bdev && ce->e_index.o_key == key) {
-			DEFINE_WAIT(wait);
-
-			if (!list_empty(&ce->e_lru_list))
+			/*
+			 * Prevent a release on the entry.
+			 */
+			spin_lock(&ce->e_entry_lock);
+			if (!list_empty(&ce->e_lru_list)) {
+				spin_lock(&mb_cache_spinlock);
 				list_del_init(&ce->e_lru_list);
-
+				spin_unlock(&mb_cache_spinlock);
+			}
+			ce->e_used++;
 			/* Incrementing before holding the lock gives readers
 			   priority over writers. */
-			ce->e_used++;
-			while (ce->e_used >= MB_CACHE_WRITER) {
-				ce->e_queued++;
-				prepare_to_wait(&mb_cache_queue, &wait,
-						TASK_UNINTERRUPTIBLE);
-				spin_unlock(&mb_cache_spinlock);
-				schedule();
-				spin_lock(&mb_cache_spinlock);
-				ce->e_queued--;
+			if (ce->e_used >= MB_CACHE_WRITER) {
+				DEFINE_WAIT(wait);
+
+				while (ce->e_used >= MB_CACHE_WRITER) {
+					ce->e_queued++;
+					prepare_to_wait(&mb_cache_queue, &wait,
+							TASK_UNINTERRUPTIBLE);
+					spin_unlock(&ce->e_entry_lock);
+					if (held_lock) {
+						hlist_bl_unlock(head);
+						held_lock = 0;
+					}
+					schedule();
+					spin_lock(&ce->e_entry_lock);
+					ce->e_queued--;
+				}
+				finish_wait(&mb_cache_queue, &wait);
 			}
-			finish_wait(&mb_cache_queue, &wait);
-
+			spin_unlock(&ce->e_entry_lock);
 			if (!__mb_cache_entry_is_block_hashed(ce)) {
-				__mb_cache_entry_release_unlock(ce);
-				spin_lock(&mb_cache_spinlock);
+				if (held_lock)
+					hlist_bl_unlock(head);
+				__mb_cache_entry_release(ce);
 				return ERR_PTR(-EAGAIN);
 			}
+			if (held_lock)
+				hlist_bl_unlock(head);
 			return ce;
 		}
 		l = l->next;
 	}
+	hlist_bl_unlock(head);
 	return NULL;
 }
 
@@ -598,12 +755,12 @@ mb_cache_entry_find_first(struct mb_cache *cache, struct block_device *bdev,
 	struct hlist_bl_head *index_hash_p;
 
 	index_hash_p = &cache->c_index_hash[bucket];
-	spin_lock(&mb_cache_spinlock);
+	hlist_bl_lock(index_hash_p);
 	if (!hlist_bl_empty(index_hash_p)) {
 		l = hlist_bl_first(index_hash_p);
 		ce = __mb_cache_entry_find(l, index_hash_p, bdev, key);
-	}
-	spin_unlock(&mb_cache_spinlock);
+	} else
+		hlist_bl_unlock(index_hash_p);
 	return ce;
 }
 
@@ -638,11 +795,11 @@ mb_cache_entry_find_next(struct mb_cache_entry *prev,
 
 	index_hash_p = &cache->c_index_hash[bucket];
 	mb_assert(prev->e_index_hash_p == index_hash_p);
-	spin_lock(&mb_cache_spinlock);
+	hlist_bl_lock(index_hash_p);
 	mb_assert(!hlist_bl_empty(index_hash_p));
 	l = prev->e_index.o_list.next;
 	ce = __mb_cache_entry_find(l, index_hash_p, bdev, key);
-	__mb_cache_entry_release_unlock(prev);
+	__mb_cache_entry_release(prev);
 	return ce;
 }
 
-- 
1.7.11.3


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH v4 3/3] ext4: each filesystem creates and uses its own mc_cache
  2014-01-24 18:31   ` [PATCH v4 0/3] ext4: increase mbcache scalability T Makphaibulchoke
  2014-01-24 18:31     ` [PATCH v4 1/3] fs/mbcache.c change block and index hash chain to hlist_bl_node T Makphaibulchoke
  2014-01-24 18:31     ` [PATCH v4 2/3] mbcache: decoupling the locking of local from global data T Makphaibulchoke
@ 2014-01-24 18:31     ` T Makphaibulchoke
  2014-01-24 21:38     ` [PATCH v4 0/3] ext4: increase mbcache scalability Andi Kleen
  3 siblings, 0 replies; 63+ messages in thread
From: T Makphaibulchoke @ 2014-01-24 18:31 UTC (permalink / raw)
  To: viro, tytso, adilger.kernel, linux-ext4, linux-kernel, linux-fsdevel
  Cc: aswin, T Makphaibulchoke

This patch adds new interfaces to create and destroy a cache,
ext4_xattr_create_cache() and ext4_xattr_destroy_cache(), and removes the
cache creation and destroy calls from ext4_init_xattr() and ext4_exit_xattr()
in fs/ext4/xattr.c. It also increases the size of the mbcache hash tables for
ext4.

fs/ext4/super.c has been changed so that when a filesystem is mounted a
cache is allocated and attached to its ext4_sb_info structure.

fs/mbcache.c has been changed so that a single slab cache is created and
shared by all mbcache structures.

Signed-off-by: T. Makphaibulchoke <tmac@hp.com>
---
 fs/ext4/ext4.h  |  1 +
 fs/ext4/super.c | 24 ++++++++++++++++--------
 fs/ext4/xattr.c | 51 ++++++++++++++++++++++++++++-----------------------
 fs/ext4/xattr.h |  6 +++---
 fs/mbcache.c    | 19 +++++++++++++------
 5 files changed, 61 insertions(+), 40 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index e618503..fa20aab 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1314,6 +1314,7 @@ struct ext4_sb_info {
 	struct list_head s_es_lru;
 	unsigned long s_es_last_sorted;
 	struct percpu_counter s_extent_cache_cnt;
+	struct mb_cache *s_mb_cache;
 	spinlock_t s_es_lru_lock ____cacheline_aligned_in_smp;
 
 	/* Ratelimit ext4 messages. */
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index c977f4e..176048e 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -59,6 +59,7 @@ static struct kset *ext4_kset;
 static struct ext4_lazy_init *ext4_li_info;
 static struct mutex ext4_li_mtx;
 static struct ext4_features *ext4_feat;
+static int ext4_mballoc_ready;
 
 static int ext4_load_journal(struct super_block *, struct ext4_super_block *,
 			     unsigned long journal_devnum);
@@ -845,6 +846,10 @@ static void ext4_put_super(struct super_block *sb)
 		invalidate_bdev(sbi->journal_bdev);
 		ext4_blkdev_remove(sbi);
 	}
+	if (sbi->s_mb_cache) {
+		ext4_xattr_destroy_cache(sbi->s_mb_cache);
+		sbi->s_mb_cache = NULL;
+	}
 	if (sbi->s_mmp_tsk)
 		kthread_stop(sbi->s_mmp_tsk);
 	sb->s_fs_info = NULL;
@@ -3981,6 +3986,13 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 	set_task_ioprio(sbi->s_journal->j_task, journal_ioprio);
 
 	sbi->s_journal->j_commit_callback = ext4_journal_commit_callback;
+	if (ext4_mballoc_ready) {
+		sbi->s_mb_cache = ext4_xattr_create_cache(sb->s_id);
+		if (!sbi->s_mb_cache) {
+			ext4_msg(sb, KERN_ERR, "Failed to create an mb_cache");
+			goto failed_mount_wq;
+		}
+	}
 
 	/*
 	 * The journal may have updated the bg summary counts, so we
@@ -5501,11 +5513,9 @@ static int __init ext4_init_fs(void)
 
 	err = ext4_init_mballoc();
 	if (err)
-		goto out3;
-
-	err = ext4_init_xattr();
-	if (err)
 		goto out2;
+	else
+		ext4_mballoc_ready = 1;
 	err = init_inodecache();
 	if (err)
 		goto out1;
@@ -5521,10 +5531,9 @@ out:
 	unregister_as_ext3();
 	destroy_inodecache();
 out1:
-	ext4_exit_xattr();
-out2:
+	ext4_mballoc_ready = 0;
 	ext4_exit_mballoc();
-out3:
+out2:
 	ext4_exit_feat_adverts();
 out4:
 	if (ext4_proc_root)
@@ -5547,7 +5556,6 @@ static void __exit ext4_exit_fs(void)
 	unregister_as_ext3();
 	unregister_filesystem(&ext4_fs_type);
 	destroy_inodecache();
-	ext4_exit_xattr();
 	ext4_exit_mballoc();
 	ext4_exit_feat_adverts();
 	remove_proc_entry("fs/ext4", NULL);
diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index 1423c48..6e2788f 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -81,7 +81,7 @@
 # define ea_bdebug(bh, fmt, ...)	no_printk(fmt, ##__VA_ARGS__)
 #endif
 
-static void ext4_xattr_cache_insert(struct buffer_head *);
+static void ext4_xattr_cache_insert(struct mb_cache *, struct buffer_head *);
 static struct buffer_head *ext4_xattr_cache_find(struct inode *,
 						 struct ext4_xattr_header *,
 						 struct mb_cache_entry **);
@@ -90,8 +90,6 @@ static void ext4_xattr_rehash(struct ext4_xattr_header *,
 static int ext4_xattr_list(struct dentry *dentry, char *buffer,
 			   size_t buffer_size);
 
-static struct mb_cache *ext4_xattr_cache;
-
 static const struct xattr_handler *ext4_xattr_handler_map[] = {
 	[EXT4_XATTR_INDEX_USER]		     = &ext4_xattr_user_handler,
 #ifdef CONFIG_EXT4_FS_POSIX_ACL
@@ -117,6 +115,9 @@ const struct xattr_handler *ext4_xattr_handlers[] = {
 	NULL
 };
 
+#define EXT4_GET_MB_CACHE(inode)	(((struct ext4_sb_info *) \
+				inode->i_sb->s_fs_info)->s_mb_cache)
+
 static __le32 ext4_xattr_block_csum(struct inode *inode,
 				    sector_t block_nr,
 				    struct ext4_xattr_header *hdr)
@@ -265,6 +266,7 @@ ext4_xattr_block_get(struct inode *inode, int name_index, const char *name,
 	struct ext4_xattr_entry *entry;
 	size_t size;
 	int error;
+	struct mb_cache *ext4_mb_cache = EXT4_GET_MB_CACHE(inode);
 
 	ea_idebug(inode, "name=%d.%s, buffer=%p, buffer_size=%ld",
 		  name_index, name, buffer, (long)buffer_size);
@@ -286,7 +288,7 @@ bad_block:
 		error = -EIO;
 		goto cleanup;
 	}
-	ext4_xattr_cache_insert(bh);
+	ext4_xattr_cache_insert(ext4_mb_cache, bh);
 	entry = BFIRST(bh);
 	error = ext4_xattr_find_entry(&entry, name_index, name, bh->b_size, 1);
 	if (error == -EIO)
@@ -409,6 +411,7 @@ ext4_xattr_block_list(struct dentry *dentry, char *buffer, size_t buffer_size)
 	struct inode *inode = dentry->d_inode;
 	struct buffer_head *bh = NULL;
 	int error;
+	struct mb_cache *ext4_mb_cache = EXT4_GET_MB_CACHE(inode);
 
 	ea_idebug(inode, "buffer=%p, buffer_size=%ld",
 		  buffer, (long)buffer_size);
@@ -430,7 +433,7 @@ ext4_xattr_block_list(struct dentry *dentry, char *buffer, size_t buffer_size)
 		error = -EIO;
 		goto cleanup;
 	}
-	ext4_xattr_cache_insert(bh);
+	ext4_xattr_cache_insert(ext4_mb_cache, bh);
 	error = ext4_xattr_list_entries(dentry, BFIRST(bh), buffer, buffer_size);
 
 cleanup:
@@ -526,8 +529,9 @@ ext4_xattr_release_block(handle_t *handle, struct inode *inode,
 {
 	struct mb_cache_entry *ce = NULL;
 	int error = 0;
+	struct mb_cache *ext4_mb_cache = EXT4_GET_MB_CACHE(inode);
 
-	ce = mb_cache_entry_get(ext4_xattr_cache, bh->b_bdev, bh->b_blocknr);
+	ce = mb_cache_entry_get(ext4_mb_cache, bh->b_bdev, bh->b_blocknr);
 	error = ext4_journal_get_write_access(handle, bh);
 	if (error)
 		goto out;
@@ -745,13 +749,14 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
 	struct ext4_xattr_search *s = &bs->s;
 	struct mb_cache_entry *ce = NULL;
 	int error = 0;
+	struct mb_cache *ext4_mb_cache = EXT4_GET_MB_CACHE(inode);
 
 #define header(x) ((struct ext4_xattr_header *)(x))
 
 	if (i->value && i->value_len > sb->s_blocksize)
 		return -ENOSPC;
 	if (s->base) {
-		ce = mb_cache_entry_get(ext4_xattr_cache, bs->bh->b_bdev,
+		ce = mb_cache_entry_get(ext4_mb_cache, bs->bh->b_bdev,
 					bs->bh->b_blocknr);
 		error = ext4_journal_get_write_access(handle, bs->bh);
 		if (error)
@@ -769,7 +774,8 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
 				if (!IS_LAST_ENTRY(s->first))
 					ext4_xattr_rehash(header(s->base),
 							  s->here);
-				ext4_xattr_cache_insert(bs->bh);
+				ext4_xattr_cache_insert(ext4_mb_cache,
+					bs->bh);
 			}
 			unlock_buffer(bs->bh);
 			if (error == -EIO)
@@ -905,7 +911,7 @@ getblk_failed:
 			memcpy(new_bh->b_data, s->base, new_bh->b_size);
 			set_buffer_uptodate(new_bh);
 			unlock_buffer(new_bh);
-			ext4_xattr_cache_insert(new_bh);
+			ext4_xattr_cache_insert(ext4_mb_cache, new_bh);
 			error = ext4_handle_dirty_xattr_block(handle,
 							      inode, new_bh);
 			if (error)
@@ -1495,13 +1501,13 @@ ext4_xattr_put_super(struct super_block *sb)
  * Returns 0, or a negative error number on failure.
  */
 static void
-ext4_xattr_cache_insert(struct buffer_head *bh)
+ext4_xattr_cache_insert(struct mb_cache *ext4_mb_cache, struct buffer_head *bh)
 {
 	__u32 hash = le32_to_cpu(BHDR(bh)->h_hash);
 	struct mb_cache_entry *ce;
 	int error;
 
-	ce = mb_cache_entry_alloc(ext4_xattr_cache, GFP_NOFS);
+	ce = mb_cache_entry_alloc(ext4_mb_cache, GFP_NOFS);
 	if (!ce) {
 		ea_bdebug(bh, "out of memory");
 		return;
@@ -1573,12 +1579,13 @@ ext4_xattr_cache_find(struct inode *inode, struct ext4_xattr_header *header,
 {
 	__u32 hash = le32_to_cpu(header->h_hash);
 	struct mb_cache_entry *ce;
+	struct mb_cache *ext4_mb_cache = EXT4_GET_MB_CACHE(inode);
 
 	if (!header->h_hash)
 		return NULL;  /* never share */
 	ea_idebug(inode, "looking for cached blocks [%x]", (int)hash);
 again:
-	ce = mb_cache_entry_find_first(ext4_xattr_cache, inode->i_sb->s_bdev,
+	ce = mb_cache_entry_find_first(ext4_mb_cache, inode->i_sb->s_bdev,
 				       hash);
 	while (ce) {
 		struct buffer_head *bh;
@@ -1676,19 +1683,17 @@ static void ext4_xattr_rehash(struct ext4_xattr_header *header,
 
 #undef BLOCK_HASH_SHIFT
 
-int __init
-ext4_init_xattr(void)
+#define	HASH_BUCKET_BITS	10
+
+struct mb_cache *
+ext4_xattr_create_cache(char *name)
 {
-	ext4_xattr_cache = mb_cache_create("ext4_xattr", 6);
-	if (!ext4_xattr_cache)
-		return -ENOMEM;
-	return 0;
+	return mb_cache_create(name, HASH_BUCKET_BITS);
 }
 
-void
-ext4_exit_xattr(void)
+void ext4_xattr_destroy_cache(struct mb_cache *cache)
 {
-	if (ext4_xattr_cache)
-		mb_cache_destroy(ext4_xattr_cache);
-	ext4_xattr_cache = NULL;
+	if (cache)
+		mb_cache_destroy(cache);
 }
+
diff --git a/fs/ext4/xattr.h b/fs/ext4/xattr.h
index c767dbd..b930320 100644
--- a/fs/ext4/xattr.h
+++ b/fs/ext4/xattr.h
@@ -112,9 +112,6 @@ extern void ext4_xattr_put_super(struct super_block *);
 extern int ext4_expand_extra_isize_ea(struct inode *inode, int new_extra_isize,
 			    struct ext4_inode *raw_inode, handle_t *handle);
 
-extern int __init ext4_init_xattr(void);
-extern void ext4_exit_xattr(void);
-
 extern const struct xattr_handler *ext4_xattr_handlers[];
 
 extern int ext4_xattr_ibody_find(struct inode *inode, struct ext4_xattr_info *i,
@@ -126,6 +123,9 @@ extern int ext4_xattr_ibody_inline_set(handle_t *handle, struct inode *inode,
 				       struct ext4_xattr_info *i,
 				       struct ext4_xattr_ibody_find *is);
 
+extern struct mb_cache *ext4_xattr_create_cache(char *name);
+extern void ext4_xattr_destroy_cache(struct mb_cache *);
+
 #ifdef CONFIG_EXT4_FS_SECURITY
 extern int ext4_init_security(handle_t *handle, struct inode *inode,
 			      struct inode *dir, const struct qstr *qstr);
diff --git a/fs/mbcache.c b/fs/mbcache.c
index 0c4cec2..d620b7f 100644
--- a/fs/mbcache.c
+++ b/fs/mbcache.c
@@ -92,6 +92,7 @@
 #define MB_CACHE_WRITER ((unsigned short)~0U >> 1)
 
 static DECLARE_WAIT_QUEUE_HEAD(mb_cache_queue);
+static struct kmem_cache *mb_cache_kmem_cache;
 		
 MODULE_AUTHOR("Andreas Gruenbacher <a.gruenbacher@computer.org>");
 MODULE_DESCRIPTION("Meta block cache (for extended attributes)");
@@ -320,11 +321,14 @@ mb_cache_create(const char *name, int bucket_bits)
 		goto fail;
 	for (n=0; n<bucket_count; n++)
 		INIT_HLIST_BL_HEAD(&cache->c_index_hash[n]);
-	cache->c_entry_cache = kmem_cache_create(name,
-		sizeof(struct mb_cache_entry), 0,
-		SLAB_RECLAIM_ACCOUNT|SLAB_MEM_SPREAD, NULL);
-	if (!cache->c_entry_cache)
-		goto fail2;
+	if (!mb_cache_kmem_cache) {
+		mb_cache_kmem_cache = kmem_cache_create(name,
+			sizeof(struct mb_cache_entry), 0,
+			SLAB_RECLAIM_ACCOUNT|SLAB_MEM_SPREAD, NULL);
+		if (!mb_cache_kmem_cache)
+			goto fail2;
+	}
+	cache->c_entry_cache = mb_cache_kmem_cache;
 
 	/*
 	 * Set an upper limit on the number of cache entries so that the hash
@@ -438,7 +442,10 @@ mb_cache_destroy(struct mb_cache *cache)
 			  atomic_read(&cache->c_entry_count));
 	}
 
-	kmem_cache_destroy(cache->c_entry_cache);
+	if (list_empty(&mb_cache_list)) {
+		kmem_cache_destroy(mb_cache_kmem_cache);
+		mb_cache_kmem_cache = NULL;
+	}
 
 	kfree(cache->c_index_hash);
 	kfree(cache->c_block_hash);
-- 
1.7.11.3


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [PATCH v4 0/3] ext4: increase mbcache scalability
  2014-01-24 18:31   ` [PATCH v4 0/3] ext4: increase mbcache scalability T Makphaibulchoke
                       ` (2 preceding siblings ...)
  2014-01-24 18:31     ` [PATCH v4 3/3] ext4: each filesystem creates and uses its own mc_cache T Makphaibulchoke
@ 2014-01-24 21:38     ` Andi Kleen
  2014-01-25  1:13       ` Thavatchai Makphaibulchoke
  2014-01-25  6:09       ` Andreas Dilger
  3 siblings, 2 replies; 63+ messages in thread
From: Andi Kleen @ 2014-01-24 21:38 UTC (permalink / raw)
  To: T Makphaibulchoke
  Cc: viro, tytso, adilger.kernel, linux-ext4, linux-kernel,
	linux-fsdevel, aswin

T Makphaibulchoke <tmac@hp.com> writes:

> The patch consists of three parts.
>
> The first part changes the implementation of both the block and hash chains of
> an mb_cache from list_head to hlist_bl_head and also introduces new members,
> including a spinlock to mb_cache_entry, as required by the second part.

spinlock per entry is usually overkill for larger hash tables.

Can you use a second smaller lock table that just has locks and is 
indexed by a subset of the hash key. Most likely a very small 
table is good enough.
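
For illustration only, such a small lock table could look roughly like the
sketch below.  The table size and helper names are assumptions, not code
from the patch:

#include <linux/spinlock.h>

#define MB_ENTRY_LOCK_BITS	7	/* 128 locks; size is an assumption */
#define MB_ENTRY_LOCK_COUNT	(1U << MB_ENTRY_LOCK_BITS)

static spinlock_t mb_entry_locks[MB_ENTRY_LOCK_COUNT];

static void mb_entry_locks_init(void)
{
	int i;

	for (i = 0; i < MB_ENTRY_LOCK_COUNT; i++)
		spin_lock_init(&mb_entry_locks[i]);
}

/* Pick a lock using only a subset (the low bits) of the hash key. */
static inline spinlock_t *mb_entry_lock_ptr(unsigned int hash)
{
	return &mb_entry_locks[hash & (MB_ENTRY_LOCK_COUNT - 1)];
}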

Also it would be good to have some data on the additional memory consumption.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v4 0/3] ext4: increase mbcache scalability
  2014-01-24 21:38     ` [PATCH v4 0/3] ext4: increase mbcache scalability Andi Kleen
@ 2014-01-25  1:13       ` Thavatchai Makphaibulchoke
  2014-01-25  6:09       ` Andreas Dilger
  1 sibling, 0 replies; 63+ messages in thread
From: Thavatchai Makphaibulchoke @ 2014-01-25  1:13 UTC (permalink / raw)
  To: Andi Kleen, T Makphaibulchoke
  Cc: viro, tytso, adilger.kernel, linux-ext4, linux-kernel,
	linux-fsdevel, aswin

On 01/24/2014 02:38 PM, Andi Kleen wrote:
> T Makphaibulchoke <tmac@hp.com> writes:
> 
>> The patch consists of three parts.
>>
>> The first part changes the implementation of both the block and hash chains of
>> an mb_cache from list_head to hlist_bl_head and also introduces new members,
>> including a spinlock to mb_cache_entry, as required by the second part.
> 
> spinlock per entry is usually overkill for larger hash tables.
> 
> Can you use a second smaller lock table that just has locks and is 
> indexed by a subset of the hash key. Most likely a very small 
> table is good enough.
> 
> Also it would be good to have some data on the additional memory consumption.
> 
> -Andi
> 

Thanks Andi for the comments.  Will look into that.

Thanks,
Mak.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v4 0/3] ext4: increase mbcache scalability
  2014-01-24 21:38     ` [PATCH v4 0/3] ext4: increase mbcache scalability Andi Kleen
  2014-01-25  1:13       ` Thavatchai Makphaibulchoke
@ 2014-01-25  6:09       ` Andreas Dilger
  2014-01-27 12:27         ` Thavatchai Makphaibulchoke
                           ` (2 more replies)
  1 sibling, 3 replies; 63+ messages in thread
From: Andreas Dilger @ 2014-01-25  6:09 UTC (permalink / raw)
  To: Andi Kleen
  Cc: T Makphaibulchoke, viro, tytso, adilger.kernel, linux-ext4,
	linux-kernel, linux-fsdevel, aswin

I think the ext4 block groups are locked with the blockgroup_lock that has about the same number of locks as the number of cores, with a max of 128, IIRC.  See blockgroup_lock.h. 
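
(Sketch of the idea behind struct blockgroup_lock; see
include/linux/blockgroup_lock.h for the real definition, the layout below is
illustrative only.)

/* An array of cacheline-aligned spinlocks, sized from NR_CPUS and capped
 * at 128, indexed via bgl_lock_ptr(bgl, group_or_hash). */
struct bgl_lock {
	spinlock_t lock;
} ____cacheline_aligned_in_smp;

struct blockgroup_lock {
	struct bgl_lock locks[NR_BG_LOCKS];	/* NR_BG_LOCKS <= 128 */
};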

While there is some chance of contention, it is also unlikely that all of the cores are locking this area at the same time.  

Cheers, Andreas

> On Jan 24, 2014, at 14:38, Andi Kleen <andi@firstfloor.org> wrote:
> 
> T Makphaibulchoke <tmac@hp.com> writes:
> 
>> The patch consists of three parts.
>> 
>> The first part changes the implementation of both the block and hash chains of
>> an mb_cache from list_head to hlist_bl_head and also introduces new members,
>> including a spinlock to mb_cache_entry, as required by the second part.
> 
> spinlock per entry is usually overkill for larger hash tables.
> 
> Can you use a second smaller lock table that just has locks and is 
> indexed by a subset of the hash key. Most likely a very small 
> table is good enough.
> 
> Also it would be good to have some data on the additional memory consumption.
> 
> -Andi
> 
> -- 
> ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v4 0/3] ext4: increase mbcache scalability
  2014-01-25  6:09       ` Andreas Dilger
@ 2014-01-27 12:27         ` Thavatchai Makphaibulchoke
  2014-02-09 19:46         ` Thavatchai Makphaibulchoke
  2014-02-11 19:58         ` Thavatchai Makphaibulchoke
  2 siblings, 0 replies; 63+ messages in thread
From: Thavatchai Makphaibulchoke @ 2014-01-27 12:27 UTC (permalink / raw)
  To: Andreas Dilger, Andi Kleen
  Cc: T Makphaibulchoke, viro, tytso, adilger.kernel, linux-ext4,
	linux-kernel, linux-fsdevel, aswin

On 01/24/2014 11:09 PM, Andreas Dilger wrote:
> I think the ext4 block groups are locked with the blockgroup_lock that has about the same number of locks as the number of cores, with a max of 128, IIRC.  See blockgroup_lock.h. 
> 
> While there is some chance of contention, it is also unlikely that all of the cores are locking this area at the same time.  
> 
> Cheers, Andreas
> 


Thanks Andreas for the suggestion. Will try that versus adding just a new private spinlock array in mbcache and compare the performance.

Thanks,
Mak.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v4 0/3] ext4: increase mbcache scalability
  2014-01-25  6:09       ` Andreas Dilger
  2014-01-27 12:27         ` Thavatchai Makphaibulchoke
@ 2014-02-09 19:46         ` Thavatchai Makphaibulchoke
  2014-02-11 19:58         ` Thavatchai Makphaibulchoke
  2 siblings, 0 replies; 63+ messages in thread
From: Thavatchai Makphaibulchoke @ 2014-02-09 19:46 UTC (permalink / raw)
  To: Andreas Dilger, Andi Kleen
  Cc: T Makphaibulchoke, viro, tytso, adilger.kernel, linux-ext4,
	linux-kernel, linux-fsdevel, aswin

On 01/24/2014 11:09 PM, Andreas Dilger wrote:
> I think the ext4 block groups are locked with the blockgroup_lock that has about the same number of locks as the number of cores, with a max of 128, IIRC.  See blockgroup_lock.h. 
> 
> While there is some chance of contention, it is also unlikely that all of the cores are locking this area at the same time.  
> 
> Cheers, Andreas
> 




^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v4 0/3] ext4: increase mbcache scalability
  2014-01-25  6:09       ` Andreas Dilger
  2014-01-27 12:27         ` Thavatchai Makphaibulchoke
  2014-02-09 19:46         ` Thavatchai Makphaibulchoke
@ 2014-02-11 19:58         ` Thavatchai Makphaibulchoke
  2014-02-13  2:01           ` Andreas Dilger
  2 siblings, 1 reply; 63+ messages in thread
From: Thavatchai Makphaibulchoke @ 2014-02-11 19:58 UTC (permalink / raw)
  To: Andreas Dilger, Andi Kleen
  Cc: T Makphaibulchoke, viro, tytso, adilger.kernel, linux-ext4,
	linux-kernel, linux-fsdevel, aswin

On 01/24/2014 11:09 PM, Andreas Dilger wrote:
> I think the ext4 block groups are locked with the blockgroup_lock that has about the same number of locks as the number of cores, with a max of 128, IIRC.  See blockgroup_lock.h. 
> 
> While there is some chance of contention, it is also unlikely that all of the cores are locking this area at the same time.  
> 
> Cheers, Andreas
> 

Andreas, looks like your assumption is correct.  On all 3 systems, 80, 60 and 20 cores, I got almost identical aim7 results using either a smaller dedicated lock array or the block group lock.  I'm inclined to go with using the block group lock as it does not incur any extra space.

One problem is that, with the current implementation, mbcache has no knowledge of the filesystem's super block, including its block group lock.  In my implementation I have to change the first argument of mb_cache_create() from char * to struct super_block * to be able to access the super block's block group lock. This works with my proposed change to allocate an mb_cache for each mounted ext4 filesystem.  This would also require the same change, allocating an mb_cache for each mounted filesystem, to both ext2 and ext3, which would increase the scope of the patch.  The other alternative, allocating a new smaller spinlock array, would not require any change to either ext2 or ext3.
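
For illustration, the interface change described above would look roughly like
this (a sketch only, not the final patch; bucket_bits stays as in the current
interface):

/* Sketch: pass the super block instead of a name, so mbcache can reach
 * the filesystem's block group lock through sb->s_fs_info. */
struct mb_cache *mb_cache_create(struct super_block *sb, int bucket_bits);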

I'm working on resubmitting my patches using the block group locks and extending the changes to also include both ext2 and ext3.  With this approach, not only is no additional space required for a dedicated new spinlock array, but the e_bdev member of struct mb_cache_entry could also be removed, reducing the space required for each mb_cache_entry.

Please let me know if you have any concern or suggestion.

Thanks,
Mak.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v4 0/3] ext4: increase mbcache scalability
  2014-02-11 19:58         ` Thavatchai Makphaibulchoke
@ 2014-02-13  2:01           ` Andreas Dilger
  0 siblings, 0 replies; 63+ messages in thread
From: Andreas Dilger @ 2014-02-13  2:01 UTC (permalink / raw)
  To: Thavatchai Makphaibulchoke
  Cc: Andi Kleen, T Makphaibulchoke, viro, tytso, linux-ext4,
	linux-kernel, linux-fsdevel, aswin

[-- Attachment #1: Type: text/plain, Size: 2962 bytes --]


On Feb 11, 2014, at 12:58 PM, Thavatchai Makphaibulchoke <thavatchai.makpahibulchoke@hp.com> wrote:
> On 01/24/2014 11:09 PM, Andreas Dilger wrote:
>> I think the ext4 block groups are locked with the blockgroup_lock that has about the same number of locks as the number of cores, with a max of 128, IIRC.  See blockgroup_lock.h. 
>> 
>> While there is some chance of contention, it is also unlikely that all of the cores are locking this area at the same time.  
>> 
>> Cheers, Andreas
>> 
> 
> Andreas, looks like your assumption is correct.  On all 3 systems, 80, 60 and 20 cores, I got almost identical aim7 results using either a smaller dedicated lock array or the block group lock.  I'm inclined to go with using the block group lock as it does not incur any extra space.
> 
> One problem is that, with the current implementation, mbcache has no knowledge of the filesystem's super block, including its block group lock.  In my implementation I have to change the first argument of mb_cache_create() from char * to struct super_block * to be able to access the super block's block group lock.

Note that you don't have to use the ext4_sb_info->s_blockgroup_lock.
You can allocate and use a separate struct blockgroup_lock for mbcache
instead of allocating a spinlock array (and essentially reimplementing
the bgl_lock_*() code).  While it isn't a huge amount of duplication,
that code is already tuned for different SMP core configurations and
there is no reason NOT to use struct blockgroup_lock.
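
For example, a dedicated instance inside mbcache could look roughly like the
sketch below (the hash choice and helper name are illustrative only):

#include <linux/blockgroup_lock.h>
#include <linux/hash.h>
#include <linux/log2.h>
#include <linux/mbcache.h>

static struct blockgroup_lock mb_cache_bg_lock;

static void mb_cache_bg_lock_init(void)
{
	bgl_lock_init(&mb_cache_bg_lock);	/* sets up NR_BG_LOCKS spinlocks */
}

/* Serialize access to one entry's local data by hashing the entry pointer
 * into the blockgroup_lock's spinlock array. */
static void mb_cache_entry_lock(struct mb_cache_entry *ce)
{
	spin_lock(bgl_lock_ptr(&mb_cache_bg_lock,
			       hash_ptr(ce, ilog2(NR_BG_LOCKS))));
}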

> This works with my proposed change to allocate an mb_cache for each mounted ext4 filesystem.  This would also require the same change, allocating an mb_cache for each mounted filesystem, to both ext2 and ext3, which would increase the scope of the patch.  The other alternative, allocating a new smaller spinlock array, would not require any change to either ext2 or ext3.
> 
> I'm working on resubmitting my patches using the block group locks and extending the changes to also include both ext2 and ext3.  With this approach, not only is no additional space required for a dedicated new spinlock array, but the e_bdev member of struct mb_cache_entry could also be removed, reducing the space required for each mb_cache_entry.
> 
> Please let me know if you have any concern or suggestion.

I'm not against re-using the s_blockgroup_lock in the superblock, since
the chance of contention on this lock between threads is very small, as
there are normally 4x the number of spinlocks as cores.

You might consider starting with a dedicated struct blockgroup_lock in
the mbcache code, then move to use the in-superblock struct in a later
patch.  That would allow you to push and land the mbcache and ext4 patches
independently of the ext2 and ext3 patches (if they are big).  If the
ext2 and ext3 patches are relatively small then this extra complexity
in the patches may not be needed.

Cheers, Andreas






[-- Attachment #2: Message signed with OpenPGP using GPGMail --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH V5 0/3] ext4: increase mbcache scalability
  2013-07-18  0:55 [PATCH 0/2] ext4: increase mbcache scalability T Makphaibulchoke
  2013-08-22 15:54 ` [PATCH v2 " T Makphaibulchoke
  2013-09-04 16:39 ` [PATCH v3 0/2] " T Makphaibulchoke
@ 2014-02-20 18:37 ` T Makphaibulchoke
  2014-02-20 18:37   ` [PATCH V5 1/3] fs/mbcache.c change block and index hash chain to hlist_bl_node T Makphaibulchoke
                     ` (2 more replies)
  2014-03-12 16:19 ` [PATCH v5 RESEND 0/3] ext4: increase mbcache scalability T Makphaibulchoke
  3 siblings, 3 replies; 63+ messages in thread
From: T Makphaibulchoke @ 2014-02-20 18:37 UTC (permalink / raw)
  To: tytso, adilger.kernel, viro, linux-ext4, linux-kernel, linux-fsdevel
  Cc: aswin, T Makphaibulchoke

The patch consists of three parts.

The first part changes the implementation of both the block and hash chains of
an mb_cache from list_head to hlist_bl_head and also introduces new members,
including a spinlock to mb_cache_entry, as required by the second part.

The second part introduces higher degree of parallelism to the usages of the
mb_cache and mb_cache_entries and impacts all ext filesystems.

The third part of the patch further increases the scalability of an ext4
filesystem by having each ext4 filesystem allocate and use its own private
mbcache structure, instead of sharing a single mbcache structure across all
ext4 filesystems, and increases the size of its mbcache hash tables.

Here are some of the benchmark results with the changes.

Using a ram disk, there seems to be no performance difference with aim7 for all
workloads on all platforms tested.

With regular disk filesystems with an inode size of 128 bytes, forcing the use
of external xattrs, there seem to be some good performance increases with
some of the aim7 workloads on all platforms tested.

Here are some of the performance improvements on aim7 with 2000 users.

On a 20 core machine, there is no performance difference.

On a 60 core machine:

---------------------------
|             | % increase |
---------------------------
| alltests    |     74.69  |
---------------------------
| custom      |     77.1   |
---------------------------
| disk        |    125.02  |
---------------------------
| fserver     |    113.22  |
---------------------------
| new_dbase   |     21.17  |
---------------------------
| new_fserve  |     70.31  |
---------------------------
| shared      |     52.56  |
---------------------------

On an 80 core machine:

---------------------------
|             | % increase |
---------------------------
| custom      |     74.29  |
---------------------------
| disk        |     61.01  |
---------------------------
| fserver     |     11.59  |
---------------------------
| new_fserver |     32.76  |
---------------------------

The changes have been tested with ext4 xfstests to verify that no regression
has been introduced.

Changed in v5:
        - New performance data
	- New diff summary

Changed in v4:
        - New performance data
	- New diff summary
	- New patch architecture

Changed in v3:
	- New diff summary

Changed in v2:
        - New performance data
	- New diff summary

T Makphaibulchoke (3):
  fs/mbcache.c change block and index hash chain to hlist_bl_node
  mbcache: decoupling the locking of local from global data
  ext4: each filesystem creates and uses its own mb_cache

 fs/ext4/ext4.h          |   1 +
 fs/ext4/super.c         |  24 ++-
 fs/ext4/xattr.c         |  51 ++---
 fs/ext4/xattr.h         |   6 +-
 fs/mbcache.c            | 540 ++++++++++++++++++++++++++++++++++--------------
 include/linux/mbcache.h |  12 +-
 6 files changed, 441 insertions(+), 193 deletions(-)

-- 
1.7.11.3


^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH V5 1/3] fs/mbcache.c change block and index hash chain to hlist_bl_node
  2014-02-20 18:37 ` [PATCH V5 0/3] " T Makphaibulchoke
@ 2014-02-20 18:37   ` T Makphaibulchoke
  2014-02-20 18:37   ` [PATCH V5 2/3] mbcache: decoupling the locking of local from global data T Makphaibulchoke
  2014-02-20 18:37   ` [PATCH V5 3/3] ext4: each filesystem creates and uses its own mb_cache T Makphaibulchoke
  2 siblings, 0 replies; 63+ messages in thread
From: T Makphaibulchoke @ 2014-02-20 18:37 UTC (permalink / raw)
  To: tytso, adilger.kernel, viro, linux-ext4, linux-kernel, linux-fsdevel
  Cc: aswin, T Makphaibulchoke

This patch changes both the block and index hash chains of each mb_cache from
regular linked lists to hlist_bl lists, whose heads carry a built-in bit lock.
This is the first step in decoupling the locks serializing accesses to mb_cache
global data from those protecting each mb_cache_entry's local data.
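
As a rough illustration of the hlist_bl API this relies on (a sketch, not code
from the patch): the lock is a bit lock stored in bit 0 of the list head's
first pointer and is taken with hlist_bl_lock()/hlist_bl_unlock().

#include <linux/list_bl.h>

struct item {
	struct hlist_bl_node	node;
	int			key;
};

static struct hlist_bl_head bucket;	/* zero-initialized: empty, unlocked */

static void add_item(struct item *it)
{
	INIT_HLIST_BL_NODE(&it->node);
	hlist_bl_lock(&bucket);		/* bit lock lives in bucket.first */
	hlist_bl_add_head(&it->node, &bucket);
	hlist_bl_unlock(&bucket);
}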

Signed-off-by: T. Makphaibulchoke <tmac@hp.com>
---
Changed in v5:
	- Added e_refcnt member to mb_cache_entry structure to be used in
	  the decoupling in patch 2/3.
	- Removed e_entry_lock from mb_cache_entry structure.

 fs/mbcache.c            | 117 ++++++++++++++++++++++++++++++++----------------
 include/linux/mbcache.h |  12 ++---
 2 files changed, 85 insertions(+), 44 deletions(-)

diff --git a/fs/mbcache.c b/fs/mbcache.c
index e519e45..55db0da 100644
--- a/fs/mbcache.c
+++ b/fs/mbcache.c
@@ -34,9 +34,9 @@
 #include <linux/mm.h>
 #include <linux/slab.h>
 #include <linux/sched.h>
-#include <linux/init.h>
+#include <linux/list_bl.h>
 #include <linux/mbcache.h>
-
+#include <linux/init.h>
 
 #ifdef MB_CACHE_DEBUG
 # define mb_debug(f...) do { \
@@ -87,21 +87,38 @@ static LIST_HEAD(mb_cache_lru_list);
 static DEFINE_SPINLOCK(mb_cache_spinlock);
 
 static inline int
-__mb_cache_entry_is_hashed(struct mb_cache_entry *ce)
+__mb_cache_entry_is_block_hashed(struct mb_cache_entry *ce)
 {
-	return !list_empty(&ce->e_block_list);
+	return !hlist_bl_unhashed(&ce->e_block_list);
 }
 
 
-static void
-__mb_cache_entry_unhash(struct mb_cache_entry *ce)
+static inline void
+__mb_cache_entry_unhash_block(struct mb_cache_entry *ce)
 {
-	if (__mb_cache_entry_is_hashed(ce)) {
-		list_del_init(&ce->e_block_list);
-		list_del(&ce->e_index.o_list);
-	}
+	if (__mb_cache_entry_is_block_hashed(ce))
+		hlist_bl_del_init(&ce->e_block_list);
+}
+
+static inline int
+__mb_cache_entry_is_index_hashed(struct mb_cache_entry *ce)
+{
+	return !hlist_bl_unhashed(&ce->e_index.o_list);
 }
 
+static inline void
+__mb_cache_entry_unhash_index(struct mb_cache_entry *ce)
+{
+	if (__mb_cache_entry_is_index_hashed(ce))
+		hlist_bl_del_init(&ce->e_index.o_list);
+}
+
+static inline void
+__mb_cache_entry_unhash(struct mb_cache_entry *ce)
+{
+	__mb_cache_entry_unhash_index(ce);
+	__mb_cache_entry_unhash_block(ce);
+}
 
 static void
 __mb_cache_entry_forget(struct mb_cache_entry *ce, gfp_t gfp_mask)
@@ -125,7 +142,7 @@ __mb_cache_entry_release_unlock(struct mb_cache_entry *ce)
 		ce->e_used -= MB_CACHE_WRITER;
 	ce->e_used--;
 	if (!(ce->e_used || ce->e_queued)) {
-		if (!__mb_cache_entry_is_hashed(ce))
+		if (!__mb_cache_entry_is_block_hashed(ce))
 			goto forget;
 		mb_assert(list_empty(&ce->e_lru_list));
 		list_add_tail(&ce->e_lru_list, &mb_cache_lru_list);
@@ -221,18 +238,18 @@ mb_cache_create(const char *name, int bucket_bits)
 	cache->c_name = name;
 	atomic_set(&cache->c_entry_count, 0);
 	cache->c_bucket_bits = bucket_bits;
-	cache->c_block_hash = kmalloc(bucket_count * sizeof(struct list_head),
-	                              GFP_KERNEL);
+	cache->c_block_hash = kmalloc(bucket_count *
+		sizeof(struct hlist_bl_head), GFP_KERNEL);
 	if (!cache->c_block_hash)
 		goto fail;
 	for (n=0; n<bucket_count; n++)
-		INIT_LIST_HEAD(&cache->c_block_hash[n]);
-	cache->c_index_hash = kmalloc(bucket_count * sizeof(struct list_head),
-				      GFP_KERNEL);
+		INIT_HLIST_BL_HEAD(&cache->c_block_hash[n]);
+	cache->c_index_hash = kmalloc(bucket_count *
+		sizeof(struct hlist_bl_head), GFP_KERNEL);
 	if (!cache->c_index_hash)
 		goto fail;
 	for (n=0; n<bucket_count; n++)
-		INIT_LIST_HEAD(&cache->c_index_hash[n]);
+		INIT_HLIST_BL_HEAD(&cache->c_index_hash[n]);
 	cache->c_entry_cache = kmem_cache_create(name,
 		sizeof(struct mb_cache_entry), 0,
 		SLAB_RECLAIM_ACCOUNT|SLAB_MEM_SPREAD, NULL);
@@ -364,10 +381,13 @@ mb_cache_entry_alloc(struct mb_cache *cache, gfp_t gfp_flags)
 			return NULL;
 		atomic_inc(&cache->c_entry_count);
 		INIT_LIST_HEAD(&ce->e_lru_list);
-		INIT_LIST_HEAD(&ce->e_block_list);
+		INIT_HLIST_BL_NODE(&ce->e_block_list);
+		INIT_HLIST_BL_NODE(&ce->e_index.o_list);
 		ce->e_cache = cache;
 		ce->e_queued = 0;
 	}
+	ce->e_block_hash_p = &cache->c_block_hash[0];
+	ce->e_index_hash_p = &cache->c_index_hash[0];
 	ce->e_used = 1 + MB_CACHE_WRITER;
 	return ce;
 }
@@ -393,25 +413,32 @@ mb_cache_entry_insert(struct mb_cache_entry *ce, struct block_device *bdev,
 {
 	struct mb_cache *cache = ce->e_cache;
 	unsigned int bucket;
-	struct list_head *l;
+	struct hlist_bl_node *l;
 	int error = -EBUSY;
+	struct hlist_bl_head *block_hash_p;
+	struct hlist_bl_head *index_hash_p;
+	struct mb_cache_entry *lce;
 
+	mb_assert(ce);
 	bucket = hash_long((unsigned long)bdev + (block & 0xffffffff), 
 			   cache->c_bucket_bits);
+	block_hash_p = &cache->c_block_hash[bucket];
 	spin_lock(&mb_cache_spinlock);
-	list_for_each_prev(l, &cache->c_block_hash[bucket]) {
-		struct mb_cache_entry *ce =
-			list_entry(l, struct mb_cache_entry, e_block_list);
-		if (ce->e_bdev == bdev && ce->e_block == block)
+	hlist_bl_for_each_entry(lce, l, block_hash_p, e_block_list) {
+		if (lce->e_bdev == bdev && lce->e_block == block)
 			goto out;
 	}
+	mb_assert(!__mb_cache_entry_is_block_hashed(ce));
 	__mb_cache_entry_unhash(ce);
 	ce->e_bdev = bdev;
 	ce->e_block = block;
-	list_add(&ce->e_block_list, &cache->c_block_hash[bucket]);
+	ce->e_block_hash_p = block_hash_p;
 	ce->e_index.o_key = key;
 	bucket = hash_long(key, cache->c_bucket_bits);
-	list_add(&ce->e_index.o_list, &cache->c_index_hash[bucket]);
+	index_hash_p = &cache->c_index_hash[bucket];
+	ce->e_index_hash_p = index_hash_p;
+	hlist_bl_add_head(&ce->e_index.o_list, index_hash_p);
+	hlist_bl_add_head(&ce->e_block_list, block_hash_p);
 	error = 0;
 out:
 	spin_unlock(&mb_cache_spinlock);
@@ -463,14 +490,16 @@ mb_cache_entry_get(struct mb_cache *cache, struct block_device *bdev,
 		   sector_t block)
 {
 	unsigned int bucket;
-	struct list_head *l;
+	struct hlist_bl_node *l;
 	struct mb_cache_entry *ce;
+	struct hlist_bl_head *block_hash_p;
 
 	bucket = hash_long((unsigned long)bdev + (block & 0xffffffff),
 			   cache->c_bucket_bits);
+	block_hash_p = &cache->c_block_hash[bucket];
 	spin_lock(&mb_cache_spinlock);
-	list_for_each(l, &cache->c_block_hash[bucket]) {
-		ce = list_entry(l, struct mb_cache_entry, e_block_list);
+	hlist_bl_for_each_entry(ce, l, block_hash_p, e_block_list) {
+		mb_assert(ce->e_block_hash_p == block_hash_p);
 		if (ce->e_bdev == bdev && ce->e_block == block) {
 			DEFINE_WAIT(wait);
 
@@ -489,7 +518,7 @@ mb_cache_entry_get(struct mb_cache *cache, struct block_device *bdev,
 			finish_wait(&mb_cache_queue, &wait);
 			ce->e_used += 1 + MB_CACHE_WRITER;
 
-			if (!__mb_cache_entry_is_hashed(ce)) {
+			if (!__mb_cache_entry_is_block_hashed(ce)) {
 				__mb_cache_entry_release_unlock(ce);
 				return NULL;
 			}
@@ -506,12 +535,14 @@ cleanup:
 #if !defined(MB_CACHE_INDEXES_COUNT) || (MB_CACHE_INDEXES_COUNT > 0)
 
 static struct mb_cache_entry *
-__mb_cache_entry_find(struct list_head *l, struct list_head *head,
+__mb_cache_entry_find(struct hlist_bl_node *l, struct hlist_bl_head *head,
 		      struct block_device *bdev, unsigned int key)
 {
-	while (l != head) {
+	while (l != NULL) {
 		struct mb_cache_entry *ce =
-			list_entry(l, struct mb_cache_entry, e_index.o_list);
+			hlist_bl_entry(l, struct mb_cache_entry,
+				e_index.o_list);
+		mb_assert(ce->e_index_hash_p == head);
 		if (ce->e_bdev == bdev && ce->e_index.o_key == key) {
 			DEFINE_WAIT(wait);
 
@@ -532,7 +563,7 @@ __mb_cache_entry_find(struct list_head *l, struct list_head *head,
 			}
 			finish_wait(&mb_cache_queue, &wait);
 
-			if (!__mb_cache_entry_is_hashed(ce)) {
+			if (!__mb_cache_entry_is_block_hashed(ce)) {
 				__mb_cache_entry_release_unlock(ce);
 				spin_lock(&mb_cache_spinlock);
 				return ERR_PTR(-EAGAIN);
@@ -562,12 +593,16 @@ mb_cache_entry_find_first(struct mb_cache *cache, struct block_device *bdev,
 			  unsigned int key)
 {
 	unsigned int bucket = hash_long(key, cache->c_bucket_bits);
-	struct list_head *l;
-	struct mb_cache_entry *ce;
+	struct hlist_bl_node *l;
+	struct mb_cache_entry *ce = NULL;
+	struct hlist_bl_head *index_hash_p;
 
+	index_hash_p = &cache->c_index_hash[bucket];
 	spin_lock(&mb_cache_spinlock);
-	l = cache->c_index_hash[bucket].next;
-	ce = __mb_cache_entry_find(l, &cache->c_index_hash[bucket], bdev, key);
+	if (!hlist_bl_empty(index_hash_p)) {
+		l = hlist_bl_first(index_hash_p);
+		ce = __mb_cache_entry_find(l, index_hash_p, bdev, key);
+	}
 	spin_unlock(&mb_cache_spinlock);
 	return ce;
 }
@@ -597,12 +632,16 @@ mb_cache_entry_find_next(struct mb_cache_entry *prev,
 {
 	struct mb_cache *cache = prev->e_cache;
 	unsigned int bucket = hash_long(key, cache->c_bucket_bits);
-	struct list_head *l;
+	struct hlist_bl_node *l;
 	struct mb_cache_entry *ce;
+	struct hlist_bl_head *index_hash_p;
 
+	index_hash_p = &cache->c_index_hash[bucket];
+	mb_assert(prev->e_index_hash_p == index_hash_p);
 	spin_lock(&mb_cache_spinlock);
+	mb_assert(!hlist_bl_empty(index_hash_p));
 	l = prev->e_index.o_list.next;
-	ce = __mb_cache_entry_find(l, &cache->c_index_hash[bucket], bdev, key);
+	ce = __mb_cache_entry_find(l, index_hash_p, bdev, key);
 	__mb_cache_entry_release_unlock(prev);
 	return ce;
 }
diff --git a/include/linux/mbcache.h b/include/linux/mbcache.h
index 5525d37..6a392e7 100644
--- a/include/linux/mbcache.h
+++ b/include/linux/mbcache.h
@@ -3,19 +3,21 @@
 
   (C) 2001 by Andreas Gruenbacher, <a.gruenbacher@computer.org>
 */
-
 struct mb_cache_entry {
 	struct list_head		e_lru_list;
 	struct mb_cache			*e_cache;
 	unsigned short			e_used;
 	unsigned short			e_queued;
+	atomic_t			e_refcnt;
 	struct block_device		*e_bdev;
 	sector_t			e_block;
-	struct list_head		e_block_list;
+	struct hlist_bl_node		e_block_list;
 	struct {
-		struct list_head	o_list;
+		struct hlist_bl_node	o_list;
 		unsigned int		o_key;
 	} e_index;
+	struct hlist_bl_head		*e_block_hash_p;
+	struct hlist_bl_head		*e_index_hash_p;
 };
 
 struct mb_cache {
@@ -25,8 +27,8 @@ struct mb_cache {
 	int				c_max_entries;
 	int				c_bucket_bits;
 	struct kmem_cache		*c_entry_cache;
-	struct list_head		*c_block_hash;
-	struct list_head		*c_index_hash;
+	struct hlist_bl_head		*c_block_hash;
+	struct hlist_bl_head		*c_index_hash;
 };
 
 /* Functions on caches */
-- 
1.7.11.3


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH V5 2/3] mbcache: decoupling the locking of local from global data
  2014-02-20 18:37 ` [PATCH V5 0/3] " T Makphaibulchoke
  2014-02-20 18:37   ` [PATCH V5 1/3] fs/mbcache.c change block and index hash chain to hlist_bl_node T Makphaibulchoke
@ 2014-02-20 18:37   ` T Makphaibulchoke
  2014-02-20 18:37   ` [PATCH V5 3/3] ext4: each filesystem creates and uses its own mb_cache T Makphaibulchoke
  2 siblings, 0 replies; 63+ messages in thread
From: T Makphaibulchoke @ 2014-02-20 18:37 UTC (permalink / raw)
  To: tytso, adilger.kernel, viro, linux-ext4, linux-kernel, linux-fsdevel
  Cc: aswin, T Makphaibulchoke

The patch increases the parallelism of mb_cache_entry utilization by
replacing list_head with hlist_bl_node for the implementation of both the
block and index hash tables.  Each hash chain head (hlist_bl_head) contains
a built-in bit lock used to protect the mb_cache's local block and index
hash chains.  The global data mb_cache_lru_list and mb_cache_list continue
to be protected by the global mb_cache_spinlock.

A new block group spinlock, mb_cache_bg_lock, is also added to serialize
accesses to each mb_cache_entry's local data.

A new member e_refcnt is added to the mb_cache_entry structure to help
prevent an mb_cache_entry from being deallocated by a free while it is
being referenced by either mb_cache_entry_get() or mb_cache_entry_find().
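
Condensed from the mb_cache_entry_get() hunk below, the lookup-side pattern is
roughly the following sketch (not a drop-in function):

static struct mb_cache_entry *
lookup_sketch(struct hlist_bl_head *block_hash_p,
	      struct block_device *bdev, sector_t block)
{
	struct hlist_bl_node *l;
	struct mb_cache_entry *ce;

	hlist_bl_lock(block_hash_p);
	hlist_bl_for_each_entry(ce, l, block_hash_p, e_block_list) {
		if (ce->e_bdev == bdev && ce->e_block == block) {
			/* Pin: a concurrent free cannot release ce now. */
			atomic_inc(&ce->e_refcnt);
			/* Drop the chain's bit lock before sleeping. */
			hlist_bl_unlock(block_hash_p);
			/* Per-entry lock comes from mb_cache_bg_lock. */
			__spin_lock_mb_cache_entry(ce);
			atomic_dec(&ce->e_refcnt);
			/* ... wait for writers, bump e_used, unlock ... */
			return ce;
		}
	}
	hlist_bl_unlock(block_hash_p);
	return NULL;
}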

Signed-off-by: T. Makphaibulchoke <tmac@hp.com>
---
Changed in v5:
	- Added mb_cache_bg_lock, which is used to serialize accesses to an
	  mb_cache_entry's local data.
	- Used e_refcnt to help prevent an mb_cache_entry from being
	  deallocated by a free inside mb_cache_entry_get() and
	  mb_cache_entry_find(), while at the same time reducing the time the
	  hash chain's lock is held.

 fs/mbcache.c | 417 ++++++++++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 301 insertions(+), 116 deletions(-)

diff --git a/fs/mbcache.c b/fs/mbcache.c
index 55db0da..786ecab 100644
--- a/fs/mbcache.c
+++ b/fs/mbcache.c
@@ -26,6 +26,41 @@
  * back on the lru list.
  */
 
+/*
+ * Lock descriptions and usage:
+ *
+ * Each hash chain of both the block and index hash tables now contains
+ * a built-in lock used to serialize accesses to the hash chain.
+ *
+ * Accesses to global data structures mb_cache_list and mb_cache_lru_list
+ * are serialized via the global spinlock mb_cache_spinlock.
+ *
+ * Accesses to each mb_cache_entry's local data, such as e_used and e_queued,
+ * are serialized via a spinlock taken from mb_cache_bg_lock.
+ *
+ * Lock ordering:
+ *
+ * Each block hash chain's lock has the highest lock order, followed by an
+ * index hash chain's lock, mb_cache_bg_lock (used to implement mb_cache_entry's
+ * lock), and mb_cache_spinlock, with the lowest order.  While holding
+ * either a block or index hash chain lock, a thread can acquire an
+ * mb_cache_bg_lock, which in turn can also acquire mb_cache_spinlock.
+ *
+ * Synchronization:
+ *
+ * Since both mb_cache_entry_get and mb_cache_entry_find scan the block and
+ * index hash chains, they need to lock the corresponding hash chain.  For each
+ * mb_cache_entry within the chain, they need to lock the mb_cache_entry to
+ * prevent any simultaneous release or free of the entry and also
+ * to serialize accesses to either the e_used or e_queued member of the entry.
+ *
+ * To avoid having a dangling reference to an already freed
+ * mb_cache_entry, an mb_cache_entry is only freed when it is not on a
+ * block hash chain and also no longer being referenced, i.e. both e_used
+ * and e_queued are 0.  When an mb_cache_entry is explicitly freed, it is
+ * first removed from its block hash chain.
+ */
+
 #include <linux/kernel.h>
 #include <linux/module.h>
 
@@ -37,6 +72,7 @@
 #include <linux/list_bl.h>
 #include <linux/mbcache.h>
 #include <linux/init.h>
+#include <linux/blockgroup_lock.h>
 
 #ifdef MB_CACHE_DEBUG
 # define mb_debug(f...) do { \
@@ -57,8 +93,13 @@
 
 #define MB_CACHE_WRITER ((unsigned short)~0U >> 1)
 
+#define MB_CACHE_ENTRY_LOCK_BITS	__builtin_log2(NR_BG_LOCKS)
+#define	MB_CACHE_ENTRY_LOCK_INDEX(ce)			\
+	(hash_long((unsigned long)ce, MB_CACHE_ENTRY_LOCK_BITS))
+
 static DECLARE_WAIT_QUEUE_HEAD(mb_cache_queue);
-		
+static struct blockgroup_lock *mb_cache_bg_lock;
+
 MODULE_AUTHOR("Andreas Gruenbacher <a.gruenbacher@computer.org>");
 MODULE_DESCRIPTION("Meta block cache (for extended attributes)");
 MODULE_LICENSE("GPL");
@@ -86,6 +127,20 @@ static LIST_HEAD(mb_cache_list);
 static LIST_HEAD(mb_cache_lru_list);
 static DEFINE_SPINLOCK(mb_cache_spinlock);
 
+static inline void
+__spin_lock_mb_cache_entry(struct mb_cache_entry *ce)
+{
+	spin_lock(bgl_lock_ptr(mb_cache_bg_lock,
+		MB_CACHE_ENTRY_LOCK_INDEX(ce)));
+}
+
+static inline void
+__spin_unlock_mb_cache_entry(struct mb_cache_entry *ce)
+{
+	spin_unlock(bgl_lock_ptr(mb_cache_bg_lock,
+		MB_CACHE_ENTRY_LOCK_INDEX(ce)));
+}
+
 static inline int
 __mb_cache_entry_is_block_hashed(struct mb_cache_entry *ce)
 {
@@ -113,11 +168,21 @@ __mb_cache_entry_unhash_index(struct mb_cache_entry *ce)
 		hlist_bl_del_init(&ce->e_index.o_list);
 }
 
+/*
+ * __mb_cache_entry_unhash_unlock()
+ *
+ * This function is called to unhash the entry from both the block and
+ * index hash chains.
+ * It assumes both the block and index hash chains are locked upon entry.
+ * It also unlocks both hash chains before exit.
+ */
 static inline void
-__mb_cache_entry_unhash(struct mb_cache_entry *ce)
+__mb_cache_entry_unhash_unlock(struct mb_cache_entry *ce)
 {
 	__mb_cache_entry_unhash_index(ce);
+	hlist_bl_unlock(ce->e_index_hash_p);
 	__mb_cache_entry_unhash_block(ce);
+	hlist_bl_unlock(ce->e_block_hash_p);
 }
 
 static void
@@ -125,36 +190,47 @@ __mb_cache_entry_forget(struct mb_cache_entry *ce, gfp_t gfp_mask)
 {
 	struct mb_cache *cache = ce->e_cache;
 
-	mb_assert(!(ce->e_used || ce->e_queued));
+	mb_assert(!(ce->e_used || ce->e_queued || atomic_read(&ce->e_refcnt)));
 	kmem_cache_free(cache->c_entry_cache, ce);
 	atomic_dec(&cache->c_entry_count);
 }
 
-
 static void
-__mb_cache_entry_release_unlock(struct mb_cache_entry *ce)
-	__releases(mb_cache_spinlock)
+__mb_cache_entry_release(struct mb_cache_entry *ce)
 {
+	/* First lock the entry to serialize access to its local data. */
+	__spin_lock_mb_cache_entry(ce);
 	/* Wake up all processes queuing for this cache entry. */
 	if (ce->e_queued)
 		wake_up_all(&mb_cache_queue);
 	if (ce->e_used >= MB_CACHE_WRITER)
 		ce->e_used -= MB_CACHE_WRITER;
+	/*
+	 * Make sure that all cache entries on the lru_list have
+	 * both e_used and e_queued equal to 0.
+	 */
 	ce->e_used--;
-	if (!(ce->e_used || ce->e_queued)) {
-		if (!__mb_cache_entry_is_block_hashed(ce))
+	if (!(ce->e_used || ce->e_queued || atomic_read(&ce->e_refcnt))) {
+		if (!__mb_cache_entry_is_block_hashed(ce)) {
+			__spin_unlock_mb_cache_entry(ce);
 			goto forget;
-		mb_assert(list_empty(&ce->e_lru_list));
-		list_add_tail(&ce->e_lru_list, &mb_cache_lru_list);
+		}
+		/*
+		 * Need access to lru list, first drop entry lock,
+		 * then reacquire the lock in the proper order.
+		 */
+		spin_lock(&mb_cache_spinlock);
+		if (list_empty(&ce->e_lru_list))
+			list_add_tail(&ce->e_lru_list, &mb_cache_lru_list);
+		spin_unlock(&mb_cache_spinlock);
 	}
-	spin_unlock(&mb_cache_spinlock);
+	__spin_unlock_mb_cache_entry(ce);
 	return;
 forget:
-	spin_unlock(&mb_cache_spinlock);
+	mb_assert(list_empty(&ce->e_lru_list));
 	__mb_cache_entry_forget(ce, GFP_KERNEL);
 }
 
-
 /*
  * mb_cache_shrink_scan()  memory pressure callback
  *
@@ -177,17 +253,34 @@ mb_cache_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
 
 	mb_debug("trying to free %d entries", nr_to_scan);
 	spin_lock(&mb_cache_spinlock);
-	while (nr_to_scan-- && !list_empty(&mb_cache_lru_list)) {
+	while ((nr_to_scan-- > 0) && !list_empty(&mb_cache_lru_list)) {
 		struct mb_cache_entry *ce =
 			list_entry(mb_cache_lru_list.next,
-				   struct mb_cache_entry, e_lru_list);
-		list_move_tail(&ce->e_lru_list, &free_list);
-		__mb_cache_entry_unhash(ce);
-		freed++;
+				struct mb_cache_entry, e_lru_list);
+		list_del_init(&ce->e_lru_list);
+		if (ce->e_used || ce->e_queued || atomic_read(&ce->e_refcnt))
+			continue;
+		spin_unlock(&mb_cache_spinlock);
+		/* Prevent any find or get operation on the entry */
+		hlist_bl_lock(ce->e_block_hash_p);
+		hlist_bl_lock(ce->e_index_hash_p);
+		/* Ignore if it is touched by a find/get */
+		if (ce->e_used || ce->e_queued || atomic_read(&ce->e_refcnt) ||
+			!list_empty(&ce->e_lru_list)) {
+			hlist_bl_unlock(ce->e_index_hash_p);
+			hlist_bl_unlock(ce->e_block_hash_p);
+			spin_lock(&mb_cache_spinlock);
+			continue;
+		}
+		__mb_cache_entry_unhash_unlock(ce);
+		list_add_tail(&ce->e_lru_list, &free_list);
+		spin_lock(&mb_cache_spinlock);
 	}
 	spin_unlock(&mb_cache_spinlock);
+
 	list_for_each_entry_safe(entry, tmp, &free_list, e_lru_list) {
 		__mb_cache_entry_forget(entry, gfp_mask);
+		freed++;
 	}
 	return freed;
 }
@@ -232,6 +325,14 @@ mb_cache_create(const char *name, int bucket_bits)
 	int n, bucket_count = 1 << bucket_bits;
 	struct mb_cache *cache = NULL;
 
+	if (!mb_cache_bg_lock) {
+		mb_cache_bg_lock = kmalloc(sizeof(struct blockgroup_lock),
+			GFP_KERNEL);
+		if (!mb_cache_bg_lock)
+			return NULL;
+		bgl_lock_init(mb_cache_bg_lock);
+	}
+
 	cache = kmalloc(sizeof(struct mb_cache), GFP_KERNEL);
 	if (!cache)
 		return NULL;
@@ -290,21 +391,47 @@ void
 mb_cache_shrink(struct block_device *bdev)
 {
 	LIST_HEAD(free_list);
-	struct list_head *l, *ltmp;
+	struct list_head *l;
+	struct mb_cache_entry *ce, *tmp;
 
+	l = &mb_cache_lru_list;
 	spin_lock(&mb_cache_spinlock);
-	list_for_each_safe(l, ltmp, &mb_cache_lru_list) {
-		struct mb_cache_entry *ce =
-			list_entry(l, struct mb_cache_entry, e_lru_list);
+	while (!list_is_last(l, &mb_cache_lru_list)) {
+		l = l->next;
+		ce = list_entry(l, struct mb_cache_entry, e_lru_list);
 		if (ce->e_bdev == bdev) {
-			list_move_tail(&ce->e_lru_list, &free_list);
-			__mb_cache_entry_unhash(ce);
+			list_del_init(&ce->e_lru_list);
+			if (ce->e_used || ce->e_queued ||
+				atomic_read(&ce->e_refcnt))
+				continue;
+			spin_unlock(&mb_cache_spinlock);
+			/*
+			 * Prevent any find or get operation on the entry.
+			 */
+			hlist_bl_lock(ce->e_block_hash_p);
+			hlist_bl_lock(ce->e_index_hash_p);
+			/* Ignore if it is touched by a find/get */
+			if (ce->e_used || ce->e_queued ||
+				atomic_read(&ce->e_refcnt) ||
+				!list_empty(&ce->e_lru_list)) {
+				hlist_bl_unlock(ce->e_index_hash_p);
+				hlist_bl_unlock(ce->e_block_hash_p);
+				l = &mb_cache_lru_list;
+				spin_lock(&mb_cache_spinlock);
+				continue;
+			}
+			__mb_cache_entry_unhash_unlock(ce);
+			mb_assert(!(ce->e_used || ce->e_queued ||
+				atomic_read(&ce->e_refcnt)));
+			list_add_tail(&ce->e_lru_list, &free_list);
+			l = &mb_cache_lru_list;
+			spin_lock(&mb_cache_spinlock);
 		}
 	}
 	spin_unlock(&mb_cache_spinlock);
-	list_for_each_safe(l, ltmp, &free_list) {
-		__mb_cache_entry_forget(list_entry(l, struct mb_cache_entry,
-						   e_lru_list), GFP_KERNEL);
+
+	list_for_each_entry_safe(ce, tmp, &free_list, e_lru_list) {
+		__mb_cache_entry_forget(ce, GFP_KERNEL);
 	}
 }
 
@@ -320,23 +447,27 @@ void
 mb_cache_destroy(struct mb_cache *cache)
 {
 	LIST_HEAD(free_list);
-	struct list_head *l, *ltmp;
+	struct mb_cache_entry *ce, *tmp;
 
 	spin_lock(&mb_cache_spinlock);
-	list_for_each_safe(l, ltmp, &mb_cache_lru_list) {
-		struct mb_cache_entry *ce =
-			list_entry(l, struct mb_cache_entry, e_lru_list);
-		if (ce->e_cache == cache) {
+	list_for_each_entry_safe(ce, tmp, &mb_cache_lru_list, e_lru_list) {
+		if (ce->e_cache == cache)
 			list_move_tail(&ce->e_lru_list, &free_list);
-			__mb_cache_entry_unhash(ce);
-		}
 	}
 	list_del(&cache->c_cache_list);
 	spin_unlock(&mb_cache_spinlock);
 
-	list_for_each_safe(l, ltmp, &free_list) {
-		__mb_cache_entry_forget(list_entry(l, struct mb_cache_entry,
-						   e_lru_list), GFP_KERNEL);
+	list_for_each_entry_safe(ce, tmp, &free_list, e_lru_list) {
+		list_del_init(&ce->e_lru_list);
+		/*
+		 * Prevent any find or get operation on the entry.
+		 */
+		hlist_bl_lock(ce->e_block_hash_p);
+		hlist_bl_lock(ce->e_index_hash_p);
+		mb_assert(!(ce->e_used || ce->e_queued ||
+			atomic_read(&ce->e_refcnt)));
+		__mb_cache_entry_unhash_unlock(ce);
+		__mb_cache_entry_forget(ce, GFP_KERNEL);
 	}
 
 	if (atomic_read(&cache->c_entry_count) > 0) {
@@ -345,8 +476,6 @@ mb_cache_destroy(struct mb_cache *cache)
 			  atomic_read(&cache->c_entry_count));
 	}
 
-	kmem_cache_destroy(cache->c_entry_cache);
-
 	kfree(cache->c_index_hash);
 	kfree(cache->c_block_hash);
 	kfree(cache);
@@ -363,29 +492,59 @@ mb_cache_destroy(struct mb_cache *cache)
 struct mb_cache_entry *
 mb_cache_entry_alloc(struct mb_cache *cache, gfp_t gfp_flags)
 {
-	struct mb_cache_entry *ce = NULL;
+	struct mb_cache_entry *ce;
 
 	if (atomic_read(&cache->c_entry_count) >= cache->c_max_entries) {
+		struct list_head *l;
+
+		l = &mb_cache_lru_list;
 		spin_lock(&mb_cache_spinlock);
-		if (!list_empty(&mb_cache_lru_list)) {
-			ce = list_entry(mb_cache_lru_list.next,
-					struct mb_cache_entry, e_lru_list);
-			list_del_init(&ce->e_lru_list);
-			__mb_cache_entry_unhash(ce);
+		while (!list_is_last(l, &mb_cache_lru_list)) {
+			l = l->next;
+			ce = list_entry(l, struct mb_cache_entry, e_lru_list);
+			if (ce->e_cache == cache) {
+				list_del_init(&ce->e_lru_list);
+				if (ce->e_used || ce->e_queued ||
+					atomic_read(&ce->e_refcnt))
+					continue;
+				spin_unlock(&mb_cache_spinlock);
+				/*
+				 * Prevent any find or get operation on the
+				 * entry.
+				 */
+				hlist_bl_lock(ce->e_block_hash_p);
+				hlist_bl_lock(ce->e_index_hash_p);
+				/* Ignore if it is touched by a find/get */
+				if (ce->e_used || ce->e_queued ||
+					atomic_read(&ce->e_refcnt) ||
+					!list_empty(&ce->e_lru_list)) {
+					hlist_bl_unlock(ce->e_index_hash_p);
+					hlist_bl_unlock(ce->e_block_hash_p);
+					l = &mb_cache_lru_list;
+					spin_lock(&mb_cache_spinlock);
+					continue;
+				}
+				mb_assert(list_empty(&ce->e_lru_list));
+				mb_assert(!(ce->e_used || ce->e_queued ||
+					atomic_read(&ce->e_refcnt)));
+				__mb_cache_entry_unhash_unlock(ce);
+				goto found;
+			}
 		}
 		spin_unlock(&mb_cache_spinlock);
 	}
-	if (!ce) {
-		ce = kmem_cache_alloc(cache->c_entry_cache, gfp_flags);
-		if (!ce)
-			return NULL;
-		atomic_inc(&cache->c_entry_count);
-		INIT_LIST_HEAD(&ce->e_lru_list);
-		INIT_HLIST_BL_NODE(&ce->e_block_list);
-		INIT_HLIST_BL_NODE(&ce->e_index.o_list);
-		ce->e_cache = cache;
-		ce->e_queued = 0;
-	}
+
+	ce = kmem_cache_alloc(cache->c_entry_cache, gfp_flags);
+	if (!ce)
+		return NULL;
+	atomic_inc(&cache->c_entry_count);
+	INIT_LIST_HEAD(&ce->e_lru_list);
+	INIT_HLIST_BL_NODE(&ce->e_block_list);
+	INIT_HLIST_BL_NODE(&ce->e_index.o_list);
+	ce->e_cache = cache;
+	ce->e_queued = 0;
+	atomic_set(&ce->e_refcnt, 0);
+found:
 	ce->e_block_hash_p = &cache->c_block_hash[0];
 	ce->e_index_hash_p = &cache->c_index_hash[0];
 	ce->e_used = 1 + MB_CACHE_WRITER;
@@ -414,7 +573,6 @@ mb_cache_entry_insert(struct mb_cache_entry *ce, struct block_device *bdev,
 	struct mb_cache *cache = ce->e_cache;
 	unsigned int bucket;
 	struct hlist_bl_node *l;
-	int error = -EBUSY;
 	struct hlist_bl_head *block_hash_p;
 	struct hlist_bl_head *index_hash_p;
 	struct mb_cache_entry *lce;
@@ -423,26 +581,29 @@ mb_cache_entry_insert(struct mb_cache_entry *ce, struct block_device *bdev,
 	bucket = hash_long((unsigned long)bdev + (block & 0xffffffff), 
 			   cache->c_bucket_bits);
 	block_hash_p = &cache->c_block_hash[bucket];
-	spin_lock(&mb_cache_spinlock);
+	hlist_bl_lock(block_hash_p);
 	hlist_bl_for_each_entry(lce, l, block_hash_p, e_block_list) {
-		if (lce->e_bdev == bdev && lce->e_block == block)
-			goto out;
+		if (lce->e_bdev == bdev && lce->e_block == block) {
+			hlist_bl_unlock(block_hash_p);
+			return -EBUSY;
+		}
 	}
 	mb_assert(!__mb_cache_entry_is_block_hashed(ce));
-	__mb_cache_entry_unhash(ce);
+	__mb_cache_entry_unhash_block(ce);
+	__mb_cache_entry_unhash_index(ce);
 	ce->e_bdev = bdev;
 	ce->e_block = block;
 	ce->e_block_hash_p = block_hash_p;
 	ce->e_index.o_key = key;
+	hlist_bl_add_head(&ce->e_block_list, block_hash_p);
+	hlist_bl_unlock(block_hash_p);
 	bucket = hash_long(key, cache->c_bucket_bits);
 	index_hash_p = &cache->c_index_hash[bucket];
+	hlist_bl_lock(index_hash_p);
 	ce->e_index_hash_p = index_hash_p;
 	hlist_bl_add_head(&ce->e_index.o_list, index_hash_p);
-	hlist_bl_add_head(&ce->e_block_list, block_hash_p);
-	error = 0;
-out:
-	spin_unlock(&mb_cache_spinlock);
-	return error;
+	hlist_bl_unlock(index_hash_p);
+	return 0;
 }
 
 
@@ -456,24 +617,26 @@ out:
 void
 mb_cache_entry_release(struct mb_cache_entry *ce)
 {
-	spin_lock(&mb_cache_spinlock);
-	__mb_cache_entry_release_unlock(ce);
+	__mb_cache_entry_release(ce);
 }
 
 
 /*
  * mb_cache_entry_free()
  *
- * This is equivalent to the sequence mb_cache_entry_takeout() --
- * mb_cache_entry_release().
  */
 void
 mb_cache_entry_free(struct mb_cache_entry *ce)
 {
-	spin_lock(&mb_cache_spinlock);
+	mb_assert(ce);
 	mb_assert(list_empty(&ce->e_lru_list));
-	__mb_cache_entry_unhash(ce);
-	__mb_cache_entry_release_unlock(ce);
+	hlist_bl_lock(ce->e_index_hash_p);
+	__mb_cache_entry_unhash_index(ce);
+	hlist_bl_unlock(ce->e_index_hash_p);
+	hlist_bl_lock(ce->e_block_hash_p);
+	__mb_cache_entry_unhash_block(ce);
+	hlist_bl_unlock(ce->e_block_hash_p);
+	__mb_cache_entry_release(ce);
 }
 
 
@@ -497,39 +660,48 @@ mb_cache_entry_get(struct mb_cache *cache, struct block_device *bdev,
 	bucket = hash_long((unsigned long)bdev + (block & 0xffffffff),
 			   cache->c_bucket_bits);
 	block_hash_p = &cache->c_block_hash[bucket];
-	spin_lock(&mb_cache_spinlock);
+	/* First serialize access to the block corresponding hash chain. */
+	hlist_bl_lock(block_hash_p);
 	hlist_bl_for_each_entry(ce, l, block_hash_p, e_block_list) {
 		mb_assert(ce->e_block_hash_p == block_hash_p);
 		if (ce->e_bdev == bdev && ce->e_block == block) {
-			DEFINE_WAIT(wait);
+			/*
+			 * Prevent a free from removing the entry.
+			 */
+			atomic_inc(&ce->e_refcnt);
+			hlist_bl_unlock(block_hash_p);
+			__spin_lock_mb_cache_entry(ce);
+			atomic_dec(&ce->e_refcnt);
+			if (ce->e_used > 0) {
+				DEFINE_WAIT(wait);
+				while (ce->e_used > 0) {
+					ce->e_queued++;
+					prepare_to_wait(&mb_cache_queue, &wait,
+							TASK_UNINTERRUPTIBLE);
+					__spin_unlock_mb_cache_entry(ce);
+					schedule();
+					__spin_lock_mb_cache_entry(ce);
+					ce->e_queued--;
+				}
+				finish_wait(&mb_cache_queue, &wait);
+			}
+			ce->e_used += 1 + MB_CACHE_WRITER;
+			__spin_unlock_mb_cache_entry(ce);
 
-			if (!list_empty(&ce->e_lru_list))
+			if (!list_empty(&ce->e_lru_list)) {
+				spin_lock(&mb_cache_spinlock);
 				list_del_init(&ce->e_lru_list);
-
-			while (ce->e_used > 0) {
-				ce->e_queued++;
-				prepare_to_wait(&mb_cache_queue, &wait,
-						TASK_UNINTERRUPTIBLE);
 				spin_unlock(&mb_cache_spinlock);
-				schedule();
-				spin_lock(&mb_cache_spinlock);
-				ce->e_queued--;
 			}
-			finish_wait(&mb_cache_queue, &wait);
-			ce->e_used += 1 + MB_CACHE_WRITER;
-
 			if (!__mb_cache_entry_is_block_hashed(ce)) {
-				__mb_cache_entry_release_unlock(ce);
+				__mb_cache_entry_release(ce);
 				return NULL;
 			}
-			goto cleanup;
+			return ce;
 		}
 	}
-	ce = NULL;
-
-cleanup:
-	spin_unlock(&mb_cache_spinlock);
-	return ce;
+	hlist_bl_unlock(block_hash_p);
+	return NULL;
 }
 
 #if !defined(MB_CACHE_INDEXES_COUNT) || (MB_CACHE_INDEXES_COUNT > 0)
@@ -538,40 +710,53 @@ static struct mb_cache_entry *
 __mb_cache_entry_find(struct hlist_bl_node *l, struct hlist_bl_head *head,
 		      struct block_device *bdev, unsigned int key)
 {
+
+	/* The index hash chain's lock is already held by the caller. */
 	while (l != NULL) {
 		struct mb_cache_entry *ce =
 			hlist_bl_entry(l, struct mb_cache_entry,
 				e_index.o_list);
 		mb_assert(ce->e_index_hash_p == head);
 		if (ce->e_bdev == bdev && ce->e_index.o_key == key) {
-			DEFINE_WAIT(wait);
-
-			if (!list_empty(&ce->e_lru_list))
-				list_del_init(&ce->e_lru_list);
-
+			/*
+			 * Prevent a free from removing the entry.
+			 */
+			atomic_inc(&ce->e_refcnt);
+			hlist_bl_unlock(head);
+			__spin_lock_mb_cache_entry(ce);
+			atomic_dec(&ce->e_refcnt);
+			ce->e_used++;
 			/* Incrementing before holding the lock gives readers
 			   priority over writers. */
-			ce->e_used++;
-			while (ce->e_used >= MB_CACHE_WRITER) {
-				ce->e_queued++;
-				prepare_to_wait(&mb_cache_queue, &wait,
-						TASK_UNINTERRUPTIBLE);
-				spin_unlock(&mb_cache_spinlock);
-				schedule();
+			if (ce->e_used >= MB_CACHE_WRITER) {
+				DEFINE_WAIT(wait);
+
+				while (ce->e_used >= MB_CACHE_WRITER) {
+					ce->e_queued++;
+					prepare_to_wait(&mb_cache_queue, &wait,
+							TASK_UNINTERRUPTIBLE);
+					__spin_unlock_mb_cache_entry(ce);
+					schedule();
+					__spin_lock_mb_cache_entry(ce);
+					ce->e_queued--;
+				}
+				finish_wait(&mb_cache_queue, &wait);
+			}
+			__spin_unlock_mb_cache_entry(ce);
+			if (!list_empty(&ce->e_lru_list)) {
 				spin_lock(&mb_cache_spinlock);
-				ce->e_queued--;
+				list_del_init(&ce->e_lru_list);
+				spin_unlock(&mb_cache_spinlock);
 			}
-			finish_wait(&mb_cache_queue, &wait);
-
 			if (!__mb_cache_entry_is_block_hashed(ce)) {
-				__mb_cache_entry_release_unlock(ce);
-				spin_lock(&mb_cache_spinlock);
+				__mb_cache_entry_release(ce);
 				return ERR_PTR(-EAGAIN);
 			}
 			return ce;
 		}
 		l = l->next;
 	}
+	hlist_bl_unlock(head);
 	return NULL;
 }
 
@@ -598,12 +783,12 @@ mb_cache_entry_find_first(struct mb_cache *cache, struct block_device *bdev,
 	struct hlist_bl_head *index_hash_p;
 
 	index_hash_p = &cache->c_index_hash[bucket];
-	spin_lock(&mb_cache_spinlock);
+	hlist_bl_lock(index_hash_p);
 	if (!hlist_bl_empty(index_hash_p)) {
 		l = hlist_bl_first(index_hash_p);
 		ce = __mb_cache_entry_find(l, index_hash_p, bdev, key);
-	}
-	spin_unlock(&mb_cache_spinlock);
+	} else
+		hlist_bl_unlock(index_hash_p);
 	return ce;
 }
 
@@ -638,11 +823,11 @@ mb_cache_entry_find_next(struct mb_cache_entry *prev,
 
 	index_hash_p = &cache->c_index_hash[bucket];
 	mb_assert(prev->e_index_hash_p == index_hash_p);
-	spin_lock(&mb_cache_spinlock);
+	hlist_bl_lock(index_hash_p);
 	mb_assert(!hlist_bl_empty(index_hash_p));
 	l = prev->e_index.o_list.next;
 	ce = __mb_cache_entry_find(l, index_hash_p, bdev, key);
-	__mb_cache_entry_release_unlock(prev);
+	__mb_cache_entry_release(prev);
 	return ce;
 }
 
-- 
1.7.11.3


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH V5 3/3] ext4: each filesystem creates and uses its own mb_cache
  2014-02-20 18:37 ` [PATCH V5 0/3] " T Makphaibulchoke
  2014-02-20 18:37   ` [PATCH V5 1/3] fs/mbcache.c change block and index hash chain to hlist_bl_node T Makphaibulchoke
  2014-02-20 18:37   ` [PATCH V5 2/3] mbcache: decoupling the locking of local from global data T Makphaibulchoke
@ 2014-02-20 18:37   ` T Makphaibulchoke
  2 siblings, 0 replies; 63+ messages in thread
From: T Makphaibulchoke @ 2014-02-20 18:37 UTC (permalink / raw)
  To: tytso, adilger.kernel, viro, linux-ext4, linux-kernel, linux-fsdevel
  Cc: aswin, T Makphaibulchoke

This patch adds new interfaces to create and destroy a cache,
ext4_xattr_create_cache() and ext4_xattr_destroy_cache(), and removes the cache
creation and destroy calls from ext4_init_xattr() and ext4_exit_xattr() in
fs/ext4/xattr.c.

fs/ext4/super.c has been changed so that when a filesystem is mounted a
cache is allocated and attached to its ext4_sb_info structure.

fs/mbcache.c has been changed so that only one slab cache is created
and shared by all mbcache structures.

Signed-off-by: T. Makphaibulchoke <tmac@hp.com>
---
 fs/ext4/ext4.h  |  1 +
 fs/ext4/super.c | 24 ++++++++++++++++--------
 fs/ext4/xattr.c | 51 ++++++++++++++++++++++++++++-----------------------
 fs/ext4/xattr.h |  6 +++---
 fs/mbcache.c    | 18 +++++++++++++-----
 5 files changed, 61 insertions(+), 39 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index e618503..fa20aab 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1314,6 +1314,7 @@ struct ext4_sb_info {
 	struct list_head s_es_lru;
 	unsigned long s_es_last_sorted;
 	struct percpu_counter s_extent_cache_cnt;
+	struct mb_cache *s_mb_cache;
 	spinlock_t s_es_lru_lock ____cacheline_aligned_in_smp;
 
 	/* Ratelimit ext4 messages. */
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index c977f4e..176048e 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -59,6 +59,7 @@ static struct kset *ext4_kset;
 static struct ext4_lazy_init *ext4_li_info;
 static struct mutex ext4_li_mtx;
 static struct ext4_features *ext4_feat;
+static int ext4_mballoc_ready;
 
 static int ext4_load_journal(struct super_block *, struct ext4_super_block *,
 			     unsigned long journal_devnum);
@@ -845,6 +846,10 @@ static void ext4_put_super(struct super_block *sb)
 		invalidate_bdev(sbi->journal_bdev);
 		ext4_blkdev_remove(sbi);
 	}
+	if (sbi->s_mb_cache) {
+		ext4_xattr_destroy_cache(sbi->s_mb_cache);
+		sbi->s_mb_cache = NULL;
+	}
 	if (sbi->s_mmp_tsk)
 		kthread_stop(sbi->s_mmp_tsk);
 	sb->s_fs_info = NULL;
@@ -3981,6 +3986,13 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 	set_task_ioprio(sbi->s_journal->j_task, journal_ioprio);
 
 	sbi->s_journal->j_commit_callback = ext4_journal_commit_callback;
+	if (ext4_mballoc_ready) {
+		sbi->s_mb_cache = ext4_xattr_create_cache(sb->s_id);
+		if (!sbi->s_mb_cache) {
+			ext4_msg(sb, KERN_ERR, "Failed to create an mb_cache");
+			goto failed_mount_wq;
+		}
+	}
 
 	/*
 	 * The journal may have updated the bg summary counts, so we
@@ -5501,11 +5513,9 @@ static int __init ext4_init_fs(void)
 
 	err = ext4_init_mballoc();
 	if (err)
-		goto out3;
-
-	err = ext4_init_xattr();
-	if (err)
 		goto out2;
+	else
+		ext4_mballoc_ready = 1;
 	err = init_inodecache();
 	if (err)
 		goto out1;
@@ -5521,10 +5531,9 @@ out:
 	unregister_as_ext3();
 	destroy_inodecache();
 out1:
-	ext4_exit_xattr();
-out2:
+	ext4_mballoc_ready = 0;
 	ext4_exit_mballoc();
-out3:
+out2:
 	ext4_exit_feat_adverts();
 out4:
 	if (ext4_proc_root)
@@ -5547,7 +5556,6 @@ static void __exit ext4_exit_fs(void)
 	unregister_as_ext3();
 	unregister_filesystem(&ext4_fs_type);
 	destroy_inodecache();
-	ext4_exit_xattr();
 	ext4_exit_mballoc();
 	ext4_exit_feat_adverts();
 	remove_proc_entry("fs/ext4", NULL);
diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index 1423c48..6e2788f 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -81,7 +81,7 @@
 # define ea_bdebug(bh, fmt, ...)	no_printk(fmt, ##__VA_ARGS__)
 #endif
 
-static void ext4_xattr_cache_insert(struct buffer_head *);
+static void ext4_xattr_cache_insert(struct mb_cache *, struct buffer_head *);
 static struct buffer_head *ext4_xattr_cache_find(struct inode *,
 						 struct ext4_xattr_header *,
 						 struct mb_cache_entry **);
@@ -90,8 +90,6 @@ static void ext4_xattr_rehash(struct ext4_xattr_header *,
 static int ext4_xattr_list(struct dentry *dentry, char *buffer,
 			   size_t buffer_size);
 
-static struct mb_cache *ext4_xattr_cache;
-
 static const struct xattr_handler *ext4_xattr_handler_map[] = {
 	[EXT4_XATTR_INDEX_USER]		     = &ext4_xattr_user_handler,
 #ifdef CONFIG_EXT4_FS_POSIX_ACL
@@ -117,6 +115,9 @@ const struct xattr_handler *ext4_xattr_handlers[] = {
 	NULL
 };
 
+#define EXT4_GET_MB_CACHE(inode)	(((struct ext4_sb_info *) \
+				inode->i_sb->s_fs_info)->s_mb_cache)
+
 static __le32 ext4_xattr_block_csum(struct inode *inode,
 				    sector_t block_nr,
 				    struct ext4_xattr_header *hdr)
@@ -265,6 +266,7 @@ ext4_xattr_block_get(struct inode *inode, int name_index, const char *name,
 	struct ext4_xattr_entry *entry;
 	size_t size;
 	int error;
+	struct mb_cache *ext4_mb_cache = EXT4_GET_MB_CACHE(inode);
 
 	ea_idebug(inode, "name=%d.%s, buffer=%p, buffer_size=%ld",
 		  name_index, name, buffer, (long)buffer_size);
@@ -286,7 +288,7 @@ bad_block:
 		error = -EIO;
 		goto cleanup;
 	}
-	ext4_xattr_cache_insert(bh);
+	ext4_xattr_cache_insert(ext4_mb_cache, bh);
 	entry = BFIRST(bh);
 	error = ext4_xattr_find_entry(&entry, name_index, name, bh->b_size, 1);
 	if (error == -EIO)
@@ -409,6 +411,7 @@ ext4_xattr_block_list(struct dentry *dentry, char *buffer, size_t buffer_size)
 	struct inode *inode = dentry->d_inode;
 	struct buffer_head *bh = NULL;
 	int error;
+	struct mb_cache *ext4_mb_cache = EXT4_GET_MB_CACHE(inode);
 
 	ea_idebug(inode, "buffer=%p, buffer_size=%ld",
 		  buffer, (long)buffer_size);
@@ -430,7 +433,7 @@ ext4_xattr_block_list(struct dentry *dentry, char *buffer, size_t buffer_size)
 		error = -EIO;
 		goto cleanup;
 	}
-	ext4_xattr_cache_insert(bh);
+	ext4_xattr_cache_insert(ext4_mb_cache, bh);
 	error = ext4_xattr_list_entries(dentry, BFIRST(bh), buffer, buffer_size);
 
 cleanup:
@@ -526,8 +529,9 @@ ext4_xattr_release_block(handle_t *handle, struct inode *inode,
 {
 	struct mb_cache_entry *ce = NULL;
 	int error = 0;
+	struct mb_cache *ext4_mb_cache = EXT4_GET_MB_CACHE(inode);
 
-	ce = mb_cache_entry_get(ext4_xattr_cache, bh->b_bdev, bh->b_blocknr);
+	ce = mb_cache_entry_get(ext4_mb_cache, bh->b_bdev, bh->b_blocknr);
 	error = ext4_journal_get_write_access(handle, bh);
 	if (error)
 		goto out;
@@ -745,13 +749,14 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
 	struct ext4_xattr_search *s = &bs->s;
 	struct mb_cache_entry *ce = NULL;
 	int error = 0;
+	struct mb_cache *ext4_mb_cache = EXT4_GET_MB_CACHE(inode);
 
 #define header(x) ((struct ext4_xattr_header *)(x))
 
 	if (i->value && i->value_len > sb->s_blocksize)
 		return -ENOSPC;
 	if (s->base) {
-		ce = mb_cache_entry_get(ext4_xattr_cache, bs->bh->b_bdev,
+		ce = mb_cache_entry_get(ext4_mb_cache, bs->bh->b_bdev,
 					bs->bh->b_blocknr);
 		error = ext4_journal_get_write_access(handle, bs->bh);
 		if (error)
@@ -769,7 +774,8 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
 				if (!IS_LAST_ENTRY(s->first))
 					ext4_xattr_rehash(header(s->base),
 							  s->here);
-				ext4_xattr_cache_insert(bs->bh);
+				ext4_xattr_cache_insert(ext4_mb_cache,
+					bs->bh);
 			}
 			unlock_buffer(bs->bh);
 			if (error == -EIO)
@@ -905,7 +911,7 @@ getblk_failed:
 			memcpy(new_bh->b_data, s->base, new_bh->b_size);
 			set_buffer_uptodate(new_bh);
 			unlock_buffer(new_bh);
-			ext4_xattr_cache_insert(new_bh);
+			ext4_xattr_cache_insert(ext4_mb_cache, new_bh);
 			error = ext4_handle_dirty_xattr_block(handle,
 							      inode, new_bh);
 			if (error)
@@ -1495,13 +1501,13 @@ ext4_xattr_put_super(struct super_block *sb)
  * Returns 0, or a negative error number on failure.
  */
 static void
-ext4_xattr_cache_insert(struct buffer_head *bh)
+ext4_xattr_cache_insert(struct mb_cache *ext4_mb_cache, struct buffer_head *bh)
 {
 	__u32 hash = le32_to_cpu(BHDR(bh)->h_hash);
 	struct mb_cache_entry *ce;
 	int error;
 
-	ce = mb_cache_entry_alloc(ext4_xattr_cache, GFP_NOFS);
+	ce = mb_cache_entry_alloc(ext4_mb_cache, GFP_NOFS);
 	if (!ce) {
 		ea_bdebug(bh, "out of memory");
 		return;
@@ -1573,12 +1579,13 @@ ext4_xattr_cache_find(struct inode *inode, struct ext4_xattr_header *header,
 {
 	__u32 hash = le32_to_cpu(header->h_hash);
 	struct mb_cache_entry *ce;
+	struct mb_cache *ext4_mb_cache = EXT4_GET_MB_CACHE(inode);
 
 	if (!header->h_hash)
 		return NULL;  /* never share */
 	ea_idebug(inode, "looking for cached blocks [%x]", (int)hash);
 again:
-	ce = mb_cache_entry_find_first(ext4_xattr_cache, inode->i_sb->s_bdev,
+	ce = mb_cache_entry_find_first(ext4_mb_cache, inode->i_sb->s_bdev,
 				       hash);
 	while (ce) {
 		struct buffer_head *bh;
@@ -1676,19 +1683,17 @@ static void ext4_xattr_rehash(struct ext4_xattr_header *header,
 
 #undef BLOCK_HASH_SHIFT
 
-int __init
-ext4_init_xattr(void)
+#define	HASH_BUCKET_BITS	10
+
+struct mb_cache *
+ext4_xattr_create_cache(char *name)
 {
-	ext4_xattr_cache = mb_cache_create("ext4_xattr", 6);
-	if (!ext4_xattr_cache)
-		return -ENOMEM;
-	return 0;
+	return mb_cache_create(name, HASH_BUCKET_BITS);
 }
 
-void
-ext4_exit_xattr(void)
+void ext4_xattr_destroy_cache(struct mb_cache *cache)
 {
-	if (ext4_xattr_cache)
-		mb_cache_destroy(ext4_xattr_cache);
-	ext4_xattr_cache = NULL;
+	if (cache)
+		mb_cache_destroy(cache);
 }
+
diff --git a/fs/ext4/xattr.h b/fs/ext4/xattr.h
index c767dbd..b930320 100644
--- a/fs/ext4/xattr.h
+++ b/fs/ext4/xattr.h
@@ -112,9 +112,6 @@ extern void ext4_xattr_put_super(struct super_block *);
 extern int ext4_expand_extra_isize_ea(struct inode *inode, int new_extra_isize,
 			    struct ext4_inode *raw_inode, handle_t *handle);
 
-extern int __init ext4_init_xattr(void);
-extern void ext4_exit_xattr(void);
-
 extern const struct xattr_handler *ext4_xattr_handlers[];
 
 extern int ext4_xattr_ibody_find(struct inode *inode, struct ext4_xattr_info *i,
@@ -126,6 +123,9 @@ extern int ext4_xattr_ibody_inline_set(handle_t *handle, struct inode *inode,
 				       struct ext4_xattr_info *i,
 				       struct ext4_xattr_ibody_find *is);
 
+extern struct mb_cache *ext4_xattr_create_cache(char *name);
+extern void ext4_xattr_destroy_cache(struct mb_cache *);
+
 #ifdef CONFIG_EXT4_FS_SECURITY
 extern int ext4_init_security(handle_t *handle, struct inode *inode,
 			      struct inode *dir, const struct qstr *qstr);
diff --git a/fs/mbcache.c b/fs/mbcache.c
index 786ecab..bf166e3 100644
--- a/fs/mbcache.c
+++ b/fs/mbcache.c
@@ -99,6 +99,7 @@
 
 static DECLARE_WAIT_QUEUE_HEAD(mb_cache_queue);
 static struct blockgroup_lock *mb_cache_bg_lock;
+static struct kmem_cache *mb_cache_kmem_cache;
 
 MODULE_AUTHOR("Andreas Gruenbacher <a.gruenbacher@computer.org>");
 MODULE_DESCRIPTION("Meta block cache (for extended attributes)");
@@ -351,11 +352,14 @@ mb_cache_create(const char *name, int bucket_bits)
 		goto fail;
 	for (n=0; n<bucket_count; n++)
 		INIT_HLIST_BL_HEAD(&cache->c_index_hash[n]);
-	cache->c_entry_cache = kmem_cache_create(name,
-		sizeof(struct mb_cache_entry), 0,
-		SLAB_RECLAIM_ACCOUNT|SLAB_MEM_SPREAD, NULL);
-	if (!cache->c_entry_cache)
-		goto fail2;
+	if (!mb_cache_kmem_cache) {
+		mb_cache_kmem_cache = kmem_cache_create(name,
+			sizeof(struct mb_cache_entry), 0,
+			SLAB_RECLAIM_ACCOUNT|SLAB_MEM_SPREAD, NULL);
+		if (!mb_cache_kmem_cache)
+			goto fail2;
+	}
+	cache->c_entry_cache = mb_cache_kmem_cache;
 
 	/*
 	 * Set an upper limit on the number of cache entries so that the hash
@@ -476,6 +480,10 @@ mb_cache_destroy(struct mb_cache *cache)
 			  atomic_read(&cache->c_entry_count));
 	}
 
+	if (list_empty(&mb_cache_list)) {
+		kmem_cache_destroy(mb_cache_kmem_cache);
+		mb_cache_kmem_cache = NULL;
+	}
 	kfree(cache->c_index_hash);
 	kfree(cache->c_block_hash);
 	kfree(cache);
-- 
1.7.11.3


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH v5 RESEND 0/3] ext4: increase mbcache scalability
  2013-07-18  0:55 [PATCH 0/2] ext4: increase mbcache scalability T Makphaibulchoke
                   ` (2 preceding siblings ...)
  2014-02-20 18:37 ` [PATCH V5 0/3] " T Makphaibulchoke
@ 2014-03-12 16:19 ` T Makphaibulchoke
  2014-03-12 16:19   ` [PATCH v5 RESEND 1/3] fs/mbcache.c change block and index hash chain to hlist_bl_node T Makphaibulchoke
                     ` (2 more replies)
  3 siblings, 3 replies; 63+ messages in thread
From: T Makphaibulchoke @ 2014-03-12 16:19 UTC (permalink / raw)
  To: tytso, adilger.kernel, viro, linux-ext4, linux-kernel, linux-fsdevel
  Cc: aswin, T Makphaibulchoke

The patch consists of three parts.

The first part changes the implementation of both the block and index hash
chains of an mb_cache from list_head to hlist_bl_head and also introduces new
members, including a spinlock, to mb_cache_entry, as required by the second part.

The second part introduces a higher degree of parallelism to the usage of the
mb_cache and mb_cache_entries and impacts all ext filesystems.

The third part of the patch further increases the scalability of an ext4
filesystem by having each ext4 filesystem allocate and use its own private
mb_cache structure, instead of sharing a single mb_cache structure across all
ext4 filesystems, and also increases the size of its mb_cache hash tables.
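
As a rough illustration of the third part (a sketch only, reusing names
introduced by patch 3/3 rather than defining anything new), callers now reach
the cache through the superblock instead of a single global pointer:

/* sketch only - names as introduced by patch 3/3, not new definitions */
#define EXT4_GET_MB_CACHE(inode)	(((struct ext4_sb_info *) \
				inode->i_sb->s_fs_info)->s_mb_cache)

struct mb_cache *ext4_mb_cache = EXT4_GET_MB_CACHE(inode);

/* xattr blocks now land in this filesystem's private cache */
ext4_xattr_cache_insert(ext4_mb_cache, bh);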

Here are some of the benchmark results with the changes.

Using a ram disk, there seem to be no performance differences with aim7 for any
workload on any platform tested.

With regular disk filesystems using an inode size of 128 bytes, forcing the use
of external xattrs, there are some good performance increases with some of the
aim7 workloads on all platforms tested.

Here are some of the performance improvements on aim7 with 2000 users.

On a 20 core machine, there are no performance differences.

On a 60 core machine:

---------------------------
|             | % increase |
---------------------------
| alltests    |     74.69  |
---------------------------
| custom      |     77.1   |
---------------------------
| disk        |    125.02  |
---------------------------
| fserver     |    113.22  |
---------------------------
| new_dbase   |     21.17  |
---------------------------
| new_fserve  |     70.31  |
---------------------------
| shared      |     52.56  |
---------------------------

On an 80 core machine:

---------------------------
|             | % increase |
---------------------------
| custom      |     74.29  |
---------------------------
| disk        |     61.01  |
---------------------------
| fserver     |     11.59  |
---------------------------
| new_fserver |     32.76  |
---------------------------

The changes have been tested with ext4 xfstests to verify that no regression
has been introduced.

Changed in v5:
        - New performance data
	- New diff summary

Changed in v4:
        - New performance data
	- New diff summary
	- New patch architecture

Changed in v3:
	- New diff summary

Changed in v2:
        - New performance data
	- New diff summary

T Makphaibulchoke (3):
  fs/mbcache.c change block and index hash chain to hlist_bl_node
  mbcache: decoupling the locking of local from global data
  ext4: each filesystem creates and uses its own mb_cache

 fs/ext4/ext4.h          |   1 +
 fs/ext4/super.c         |  24 ++-
 fs/ext4/xattr.c         |  51 ++---
 fs/ext4/xattr.h         |   6 +-
 fs/mbcache.c            | 540 ++++++++++++++++++++++++++++++++++--------------
 include/linux/mbcache.h |  12 +-
 6 files changed, 441 insertions(+), 193 deletions(-)

-- 
1.7.11.3


^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH v5 RESEND 1/3] fs/mbcache.c change block and index hash chain to hlist_bl_node
  2014-03-12 16:19 ` [PATCH v5 RESEND 0/3] ext4: increase mbcache scalability T Makphaibulchoke
@ 2014-03-12 16:19   ` T Makphaibulchoke
  2014-03-12 16:19   ` [PATCH v5 RESEND 2/3] mbcache: decoupling the locking of local from global data T Makphaibulchoke
  2014-03-12 16:19   ` [PATCH v5 RESEND 3/3] ext4: each filesystem creates and uses its own mb_cache T Makphaibulchoke
  2 siblings, 0 replies; 63+ messages in thread
From: T Makphaibulchoke @ 2014-03-12 16:19 UTC (permalink / raw)
  To: tytso, adilger.kernel, viro, linux-ext4, linux-kernel, linux-fsdevel
  Cc: aswin, T Makphaibulchoke

This patch changes both the block and index hash chains of each mb_cache from
a regular linked list to hlist_bl_node, which contains a built-in lock.  This
is the first step in decoupling the locks serializing accesses to mb_cache
global data from those protecting each mb_cache_entry's local data.
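
For illustration only (not part of the patch): an hlist_bl_head keeps a bit
spinlock in bit 0 of its head pointer, so every hash chain can be locked
independently of the global mb_cache_spinlock.  A minimal sketch of the
pattern, with hypothetical example_* names, looks like this:

#include <linux/list_bl.h>

/* illustrative only: stand-in for one of the mb_cache hash tables */
static struct hlist_bl_head example_block_hash[1 << 10];

/* add an entry to its chain under that chain's own bit lock */
static void example_hash_insert(struct mb_cache_entry *ce, unsigned int bucket)
{
	struct hlist_bl_head *head = &example_block_hash[bucket];

	hlist_bl_lock(head);		/* per-chain lock, no global spinlock */
	hlist_bl_add_head(&ce->e_block_list, head);
	hlist_bl_unlock(head);
}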

Signed-off-by: T. Makphaibulchoke <tmac@hp.com>
---
 fs/mbcache.c            | 117 ++++++++++++++++++++++++++++++++----------------
 include/linux/mbcache.h |  12 ++---
 2 files changed, 85 insertions(+), 44 deletions(-)

diff --git a/fs/mbcache.c b/fs/mbcache.c
index e519e45..55db0da 100644
--- a/fs/mbcache.c
+++ b/fs/mbcache.c
@@ -34,9 +34,9 @@
 #include <linux/mm.h>
 #include <linux/slab.h>
 #include <linux/sched.h>
-#include <linux/init.h>
+#include <linux/list_bl.h>
 #include <linux/mbcache.h>
-
+#include <linux/init.h>
 
 #ifdef MB_CACHE_DEBUG
 # define mb_debug(f...) do { \
@@ -87,21 +87,38 @@ static LIST_HEAD(mb_cache_lru_list);
 static DEFINE_SPINLOCK(mb_cache_spinlock);
 
 static inline int
-__mb_cache_entry_is_hashed(struct mb_cache_entry *ce)
+__mb_cache_entry_is_block_hashed(struct mb_cache_entry *ce)
 {
-	return !list_empty(&ce->e_block_list);
+	return !hlist_bl_unhashed(&ce->e_block_list);
 }
 
 
-static void
-__mb_cache_entry_unhash(struct mb_cache_entry *ce)
+static inline void
+__mb_cache_entry_unhash_block(struct mb_cache_entry *ce)
 {
-	if (__mb_cache_entry_is_hashed(ce)) {
-		list_del_init(&ce->e_block_list);
-		list_del(&ce->e_index.o_list);
-	}
+	if (__mb_cache_entry_is_block_hashed(ce))
+		hlist_bl_del_init(&ce->e_block_list);
+}
+
+static inline int
+__mb_cache_entry_is_index_hashed(struct mb_cache_entry *ce)
+{
+	return !hlist_bl_unhashed(&ce->e_index.o_list);
 }
 
+static inline void
+__mb_cache_entry_unhash_index(struct mb_cache_entry *ce)
+{
+	if (__mb_cache_entry_is_index_hashed(ce))
+		hlist_bl_del_init(&ce->e_index.o_list);
+}
+
+static inline void
+__mb_cache_entry_unhash(struct mb_cache_entry *ce)
+{
+	__mb_cache_entry_unhash_index(ce);
+	__mb_cache_entry_unhash_block(ce);
+}
 
 static void
 __mb_cache_entry_forget(struct mb_cache_entry *ce, gfp_t gfp_mask)
@@ -125,7 +142,7 @@ __mb_cache_entry_release_unlock(struct mb_cache_entry *ce)
 		ce->e_used -= MB_CACHE_WRITER;
 	ce->e_used--;
 	if (!(ce->e_used || ce->e_queued)) {
-		if (!__mb_cache_entry_is_hashed(ce))
+		if (!__mb_cache_entry_is_block_hashed(ce))
 			goto forget;
 		mb_assert(list_empty(&ce->e_lru_list));
 		list_add_tail(&ce->e_lru_list, &mb_cache_lru_list);
@@ -221,18 +238,18 @@ mb_cache_create(const char *name, int bucket_bits)
 	cache->c_name = name;
 	atomic_set(&cache->c_entry_count, 0);
 	cache->c_bucket_bits = bucket_bits;
-	cache->c_block_hash = kmalloc(bucket_count * sizeof(struct list_head),
-	                              GFP_KERNEL);
+	cache->c_block_hash = kmalloc(bucket_count *
+		sizeof(struct hlist_bl_head), GFP_KERNEL);
 	if (!cache->c_block_hash)
 		goto fail;
 	for (n=0; n<bucket_count; n++)
-		INIT_LIST_HEAD(&cache->c_block_hash[n]);
-	cache->c_index_hash = kmalloc(bucket_count * sizeof(struct list_head),
-				      GFP_KERNEL);
+		INIT_HLIST_BL_HEAD(&cache->c_block_hash[n]);
+	cache->c_index_hash = kmalloc(bucket_count *
+		sizeof(struct hlist_bl_head), GFP_KERNEL);
 	if (!cache->c_index_hash)
 		goto fail;
 	for (n=0; n<bucket_count; n++)
-		INIT_LIST_HEAD(&cache->c_index_hash[n]);
+		INIT_HLIST_BL_HEAD(&cache->c_index_hash[n]);
 	cache->c_entry_cache = kmem_cache_create(name,
 		sizeof(struct mb_cache_entry), 0,
 		SLAB_RECLAIM_ACCOUNT|SLAB_MEM_SPREAD, NULL);
@@ -364,10 +381,13 @@ mb_cache_entry_alloc(struct mb_cache *cache, gfp_t gfp_flags)
 			return NULL;
 		atomic_inc(&cache->c_entry_count);
 		INIT_LIST_HEAD(&ce->e_lru_list);
-		INIT_LIST_HEAD(&ce->e_block_list);
+		INIT_HLIST_BL_NODE(&ce->e_block_list);
+		INIT_HLIST_BL_NODE(&ce->e_index.o_list);
 		ce->e_cache = cache;
 		ce->e_queued = 0;
 	}
+	ce->e_block_hash_p = &cache->c_block_hash[0];
+	ce->e_index_hash_p = &cache->c_index_hash[0];
 	ce->e_used = 1 + MB_CACHE_WRITER;
 	return ce;
 }
@@ -393,25 +413,32 @@ mb_cache_entry_insert(struct mb_cache_entry *ce, struct block_device *bdev,
 {
 	struct mb_cache *cache = ce->e_cache;
 	unsigned int bucket;
-	struct list_head *l;
+	struct hlist_bl_node *l;
 	int error = -EBUSY;
+	struct hlist_bl_head *block_hash_p;
+	struct hlist_bl_head *index_hash_p;
+	struct mb_cache_entry *lce;
 
+	mb_assert(ce);
 	bucket = hash_long((unsigned long)bdev + (block & 0xffffffff), 
 			   cache->c_bucket_bits);
+	block_hash_p = &cache->c_block_hash[bucket];
 	spin_lock(&mb_cache_spinlock);
-	list_for_each_prev(l, &cache->c_block_hash[bucket]) {
-		struct mb_cache_entry *ce =
-			list_entry(l, struct mb_cache_entry, e_block_list);
-		if (ce->e_bdev == bdev && ce->e_block == block)
+	hlist_bl_for_each_entry(lce, l, block_hash_p, e_block_list) {
+		if (lce->e_bdev == bdev && lce->e_block == block)
 			goto out;
 	}
+	mb_assert(!__mb_cache_entry_is_block_hashed(ce));
 	__mb_cache_entry_unhash(ce);
 	ce->e_bdev = bdev;
 	ce->e_block = block;
-	list_add(&ce->e_block_list, &cache->c_block_hash[bucket]);
+	ce->e_block_hash_p = block_hash_p;
 	ce->e_index.o_key = key;
 	bucket = hash_long(key, cache->c_bucket_bits);
-	list_add(&ce->e_index.o_list, &cache->c_index_hash[bucket]);
+	index_hash_p = &cache->c_index_hash[bucket];
+	ce->e_index_hash_p = index_hash_p;
+	hlist_bl_add_head(&ce->e_index.o_list, index_hash_p);
+	hlist_bl_add_head(&ce->e_block_list, block_hash_p);
 	error = 0;
 out:
 	spin_unlock(&mb_cache_spinlock);
@@ -463,14 +490,16 @@ mb_cache_entry_get(struct mb_cache *cache, struct block_device *bdev,
 		   sector_t block)
 {
 	unsigned int bucket;
-	struct list_head *l;
+	struct hlist_bl_node *l;
 	struct mb_cache_entry *ce;
+	struct hlist_bl_head *block_hash_p;
 
 	bucket = hash_long((unsigned long)bdev + (block & 0xffffffff),
 			   cache->c_bucket_bits);
+	block_hash_p = &cache->c_block_hash[bucket];
 	spin_lock(&mb_cache_spinlock);
-	list_for_each(l, &cache->c_block_hash[bucket]) {
-		ce = list_entry(l, struct mb_cache_entry, e_block_list);
+	hlist_bl_for_each_entry(ce, l, block_hash_p, e_block_list) {
+		mb_assert(ce->e_block_hash_p == block_hash_p);
 		if (ce->e_bdev == bdev && ce->e_block == block) {
 			DEFINE_WAIT(wait);
 
@@ -489,7 +518,7 @@ mb_cache_entry_get(struct mb_cache *cache, struct block_device *bdev,
 			finish_wait(&mb_cache_queue, &wait);
 			ce->e_used += 1 + MB_CACHE_WRITER;
 
-			if (!__mb_cache_entry_is_hashed(ce)) {
+			if (!__mb_cache_entry_is_block_hashed(ce)) {
 				__mb_cache_entry_release_unlock(ce);
 				return NULL;
 			}
@@ -506,12 +535,14 @@ cleanup:
 #if !defined(MB_CACHE_INDEXES_COUNT) || (MB_CACHE_INDEXES_COUNT > 0)
 
 static struct mb_cache_entry *
-__mb_cache_entry_find(struct list_head *l, struct list_head *head,
+__mb_cache_entry_find(struct hlist_bl_node *l, struct hlist_bl_head *head,
 		      struct block_device *bdev, unsigned int key)
 {
-	while (l != head) {
+	while (l != NULL) {
 		struct mb_cache_entry *ce =
-			list_entry(l, struct mb_cache_entry, e_index.o_list);
+			hlist_bl_entry(l, struct mb_cache_entry,
+				e_index.o_list);
+		mb_assert(ce->e_index_hash_p == head);
 		if (ce->e_bdev == bdev && ce->e_index.o_key == key) {
 			DEFINE_WAIT(wait);
 
@@ -532,7 +563,7 @@ __mb_cache_entry_find(struct list_head *l, struct list_head *head,
 			}
 			finish_wait(&mb_cache_queue, &wait);
 
-			if (!__mb_cache_entry_is_hashed(ce)) {
+			if (!__mb_cache_entry_is_block_hashed(ce)) {
 				__mb_cache_entry_release_unlock(ce);
 				spin_lock(&mb_cache_spinlock);
 				return ERR_PTR(-EAGAIN);
@@ -562,12 +593,16 @@ mb_cache_entry_find_first(struct mb_cache *cache, struct block_device *bdev,
 			  unsigned int key)
 {
 	unsigned int bucket = hash_long(key, cache->c_bucket_bits);
-	struct list_head *l;
-	struct mb_cache_entry *ce;
+	struct hlist_bl_node *l;
+	struct mb_cache_entry *ce = NULL;
+	struct hlist_bl_head *index_hash_p;
 
+	index_hash_p = &cache->c_index_hash[bucket];
 	spin_lock(&mb_cache_spinlock);
-	l = cache->c_index_hash[bucket].next;
-	ce = __mb_cache_entry_find(l, &cache->c_index_hash[bucket], bdev, key);
+	if (!hlist_bl_empty(index_hash_p)) {
+		l = hlist_bl_first(index_hash_p);
+		ce = __mb_cache_entry_find(l, index_hash_p, bdev, key);
+	}
 	spin_unlock(&mb_cache_spinlock);
 	return ce;
 }
@@ -597,12 +632,16 @@ mb_cache_entry_find_next(struct mb_cache_entry *prev,
 {
 	struct mb_cache *cache = prev->e_cache;
 	unsigned int bucket = hash_long(key, cache->c_bucket_bits);
-	struct list_head *l;
+	struct hlist_bl_node *l;
 	struct mb_cache_entry *ce;
+	struct hlist_bl_head *index_hash_p;
 
+	index_hash_p = &cache->c_index_hash[bucket];
+	mb_assert(prev->e_index_hash_p == index_hash_p);
 	spin_lock(&mb_cache_spinlock);
+	mb_assert(!hlist_bl_empty(index_hash_p));
 	l = prev->e_index.o_list.next;
-	ce = __mb_cache_entry_find(l, &cache->c_index_hash[bucket], bdev, key);
+	ce = __mb_cache_entry_find(l, index_hash_p, bdev, key);
 	__mb_cache_entry_release_unlock(prev);
 	return ce;
 }
diff --git a/include/linux/mbcache.h b/include/linux/mbcache.h
index 5525d37..6a392e7 100644
--- a/include/linux/mbcache.h
+++ b/include/linux/mbcache.h
@@ -3,19 +3,21 @@
 
   (C) 2001 by Andreas Gruenbacher, <a.gruenbacher@computer.org>
 */
-
 struct mb_cache_entry {
 	struct list_head		e_lru_list;
 	struct mb_cache			*e_cache;
 	unsigned short			e_used;
 	unsigned short			e_queued;
+	atomic_t			e_refcnt;
 	struct block_device		*e_bdev;
 	sector_t			e_block;
-	struct list_head		e_block_list;
+	struct hlist_bl_node		e_block_list;
 	struct {
-		struct list_head	o_list;
+		struct hlist_bl_node	o_list;
 		unsigned int		o_key;
 	} e_index;
+	struct hlist_bl_head		*e_block_hash_p;
+	struct hlist_bl_head		*e_index_hash_p;
 };
 
 struct mb_cache {
@@ -25,8 +27,8 @@ struct mb_cache {
 	int				c_max_entries;
 	int				c_bucket_bits;
 	struct kmem_cache		*c_entry_cache;
-	struct list_head		*c_block_hash;
-	struct list_head		*c_index_hash;
+	struct hlist_bl_head		*c_block_hash;
+	struct hlist_bl_head		*c_index_hash;
 };
 
 /* Functions on caches */
-- 
1.7.11.3


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH v5 RESEND 2/3] mbcache: decoupling the locking of local from global data
  2014-03-12 16:19 ` [PATCH v5 RESEND 0/3] ext4: increase mbcache scalability T Makphaibulchoke
  2014-03-12 16:19   ` [PATCH v5 RESEND 1/3] fs/mbcache.c change block and index hash chain to hlist_bl_node T Makphaibulchoke
@ 2014-03-12 16:19   ` T Makphaibulchoke
  2014-03-12 16:19   ` [PATCH v5 RESEND 3/3] ext4: each filesystem creates and uses its own mb_cache T Makphaibulchoke
  2 siblings, 0 replies; 63+ messages in thread
From: T Makphaibulchoke @ 2014-03-12 16:19 UTC (permalink / raw)
  To: tytso, adilger.kernel, viro, linux-ext4, linux-kernel, linux-fsdevel
  Cc: aswin, T Makphaibulchoke

The patch increases the parallelism of mb_cache_entry utilization by
replacing list_head with hlist_bl_node for the implementation of both the
block and index hash tables.  Each hlist_bl_node contains a built-in lock
used to protect mb_cache's local block and index hash chains. The global
data mb_cache_lru_list and mb_cache_list continue to be protected by the
global mb_cache_spinlock.

A new block group spinlock, mb_cache_bg_lock, is also added to serialize
accesses to each mb_cache_entry's local data.

A new member, e_refcnt, is added to the mb_cache_entry structure to help
prevent an mb_cache_entry from being deallocated by a free while it is
being referenced by either mb_cache_entry_get() or mb_cache_entry_find().
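
In outline, the lookup path now takes the chain's bit lock first, pins the
entry with e_refcnt before dropping that lock, and only then takes the
per-entry lock.  A rough sketch of the pattern, simplified from the full
mb_cache_entry_get() in the hunk below:

/* sketch: simplified from mb_cache_entry_get() below */
hlist_bl_lock(block_hash_p);			/* highest in the lock order */
hlist_bl_for_each_entry(ce, l, block_hash_p, e_block_list) {
	if (ce->e_bdev == bdev && ce->e_block == block) {
		atomic_inc(&ce->e_refcnt);	/* keep a concurrent free away */
		hlist_bl_unlock(block_hash_p);
		__spin_lock_mb_cache_entry(ce);	/* bgl_lock_ptr(mb_cache_bg_lock, ...) */
		atomic_dec(&ce->e_refcnt);
		/* ... wait for e_used/e_queued, then take the reference ... */
		__spin_unlock_mb_cache_entry(ce);
		return ce;
	}
}
hlist_bl_unlock(block_hash_p);
return NULL;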

Signed-off-by: T. Makphaibulchoke <tmac@hp.com>
---
 fs/mbcache.c | 417 ++++++++++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 301 insertions(+), 116 deletions(-)

diff --git a/fs/mbcache.c b/fs/mbcache.c
index 55db0da..786ecab 100644
--- a/fs/mbcache.c
+++ b/fs/mbcache.c
@@ -26,6 +26,41 @@
  * back on the lru list.
  */
 
+/*
+ * Lock descriptions and usage:
+ *
+ * Each hash chain of both the block and index hash tables now contains
+ * a built-in lock used to serialize accesses to the hash chain.
+ *
+ * Accesses to global data structures mb_cache_list and mb_cache_lru_list
+ * are serialized via the global spinlock mb_cache_spinlock.
+ *
+ * Each mb_cache_entry contains a spinlock, e_entry_lock, to serialize
+ * accesses to its local data, such as e_used and e_queued.
+ *
+ * Lock ordering:
+ *
+ * Each block hash chain's lock has the highest lock order, followed by an
+ * index hash chain's lock, mb_cache_bg_lock (used to implement mb_cache_entry's
+ * lock), and mb_cache_spinlock, with the lowest order.  While holding
+ * either a block or index hash chain lock, a thread can acquire an
+ * mb_cache_bg_lock, which in turn can also acquire mb_cache_spinlock.
+ *
+ * Synchronization:
+ *
+ * Since both mb_cache_entry_get and mb_cache_entry_find scan the block and
+ * index hash chains, they need to lock the corresponding hash chain.  For each
+ * mb_cache_entry within the chain, they need to lock the mb_cache_entry to
+ * prevent any simultaneous release or free of the entry and also
+ * to serialize accesses to either the e_used or e_queued member of the entry.
+ *
+ * To avoid having a dangling reference to an already freed
+ * mb_cache_entry, an mb_cache_entry is only freed when it is not on a
+ * block hash chain and also no longer being referenced, both e_used,
+ * and e_queued are 0's.  When an mb_cache_entry is explicitly freed it is
+ * first removed from a block hash chain.
+ */
+
 #include <linux/kernel.h>
 #include <linux/module.h>
 
@@ -37,6 +72,7 @@
 #include <linux/list_bl.h>
 #include <linux/mbcache.h>
 #include <linux/init.h>
+#include <linux/blockgroup_lock.h>
 
 #ifdef MB_CACHE_DEBUG
 # define mb_debug(f...) do { \
@@ -57,8 +93,13 @@
 
 #define MB_CACHE_WRITER ((unsigned short)~0U >> 1)
 
+#define MB_CACHE_ENTRY_LOCK_BITS	__builtin_log2(NR_BG_LOCKS)
+#define	MB_CACHE_ENTRY_LOCK_INDEX(ce)			\
+	(hash_long((unsigned long)ce, MB_CACHE_ENTRY_LOCK_BITS))
+
 static DECLARE_WAIT_QUEUE_HEAD(mb_cache_queue);
-		
+static struct blockgroup_lock *mb_cache_bg_lock;
+
 MODULE_AUTHOR("Andreas Gruenbacher <a.gruenbacher@computer.org>");
 MODULE_DESCRIPTION("Meta block cache (for extended attributes)");
 MODULE_LICENSE("GPL");
@@ -86,6 +127,20 @@ static LIST_HEAD(mb_cache_list);
 static LIST_HEAD(mb_cache_lru_list);
 static DEFINE_SPINLOCK(mb_cache_spinlock);
 
+static inline void
+__spin_lock_mb_cache_entry(struct mb_cache_entry *ce)
+{
+	spin_lock(bgl_lock_ptr(mb_cache_bg_lock,
+		MB_CACHE_ENTRY_LOCK_INDEX(ce)));
+}
+
+static inline void
+__spin_unlock_mb_cache_entry(struct mb_cache_entry *ce)
+{
+	spin_unlock(bgl_lock_ptr(mb_cache_bg_lock,
+		MB_CACHE_ENTRY_LOCK_INDEX(ce)));
+}
+
 static inline int
 __mb_cache_entry_is_block_hashed(struct mb_cache_entry *ce)
 {
@@ -113,11 +168,21 @@ __mb_cache_entry_unhash_index(struct mb_cache_entry *ce)
 		hlist_bl_del_init(&ce->e_index.o_list);
 }
 
+/*
+ * __mb_cache_entry_unhash_unlock()
+ *
+ * This function is called to unhash the entry from both the block and
+ * index hash chains.
+ * It assumes both hash chains are locked upon entry.
+ * It also unlocks both hash chains upon exit.
+ */
 static inline void
-__mb_cache_entry_unhash(struct mb_cache_entry *ce)
+__mb_cache_entry_unhash_unlock(struct mb_cache_entry *ce)
 {
 	__mb_cache_entry_unhash_index(ce);
+	hlist_bl_unlock(ce->e_index_hash_p);
 	__mb_cache_entry_unhash_block(ce);
+	hlist_bl_unlock(ce->e_block_hash_p);
 }
 
 static void
@@ -125,36 +190,47 @@ __mb_cache_entry_forget(struct mb_cache_entry *ce, gfp_t gfp_mask)
 {
 	struct mb_cache *cache = ce->e_cache;
 
-	mb_assert(!(ce->e_used || ce->e_queued));
+	mb_assert(!(ce->e_used || ce->e_queued || atomic_read(&ce->e_refcnt)));
 	kmem_cache_free(cache->c_entry_cache, ce);
 	atomic_dec(&cache->c_entry_count);
 }
 
-
 static void
-__mb_cache_entry_release_unlock(struct mb_cache_entry *ce)
-	__releases(mb_cache_spinlock)
+__mb_cache_entry_release(struct mb_cache_entry *ce)
 {
+	/* First lock the entry to serialize access to its local data. */
+	__spin_lock_mb_cache_entry(ce);
 	/* Wake up all processes queuing for this cache entry. */
 	if (ce->e_queued)
 		wake_up_all(&mb_cache_queue);
 	if (ce->e_used >= MB_CACHE_WRITER)
 		ce->e_used -= MB_CACHE_WRITER;
+	/*
+	 * Make sure that all cache entries on lru_list have
+	 * both e_used and e_queued set to 0.
+	 */
 	ce->e_used--;
-	if (!(ce->e_used || ce->e_queued)) {
-		if (!__mb_cache_entry_is_block_hashed(ce))
+	if (!(ce->e_used || ce->e_queued || atomic_read(&ce->e_refcnt))) {
+		if (!__mb_cache_entry_is_block_hashed(ce)) {
+			__spin_unlock_mb_cache_entry(ce);
 			goto forget;
-		mb_assert(list_empty(&ce->e_lru_list));
-		list_add_tail(&ce->e_lru_list, &mb_cache_lru_list);
+		}
+		/*
+		 * Need access to lru list, first drop entry lock,
+		 * then reacquire the lock in the proper order.
+		 */
+		spin_lock(&mb_cache_spinlock);
+		if (list_empty(&ce->e_lru_list))
+			list_add_tail(&ce->e_lru_list, &mb_cache_lru_list);
+		spin_unlock(&mb_cache_spinlock);
 	}
-	spin_unlock(&mb_cache_spinlock);
+	__spin_unlock_mb_cache_entry(ce);
 	return;
 forget:
-	spin_unlock(&mb_cache_spinlock);
+	mb_assert(list_empty(&ce->e_lru_list));
 	__mb_cache_entry_forget(ce, GFP_KERNEL);
 }
 
-
 /*
  * mb_cache_shrink_scan()  memory pressure callback
  *
@@ -177,17 +253,34 @@ mb_cache_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
 
 	mb_debug("trying to free %d entries", nr_to_scan);
 	spin_lock(&mb_cache_spinlock);
-	while (nr_to_scan-- && !list_empty(&mb_cache_lru_list)) {
+	while ((nr_to_scan-- > 0) && !list_empty(&mb_cache_lru_list)) {
 		struct mb_cache_entry *ce =
 			list_entry(mb_cache_lru_list.next,
-				   struct mb_cache_entry, e_lru_list);
-		list_move_tail(&ce->e_lru_list, &free_list);
-		__mb_cache_entry_unhash(ce);
-		freed++;
+				struct mb_cache_entry, e_lru_list);
+		list_del_init(&ce->e_lru_list);
+		if (ce->e_used || ce->e_queued || atomic_read(&ce->e_refcnt))
+			continue;
+		spin_unlock(&mb_cache_spinlock);
+		/* Prevent any find or get operation on the entry */
+		hlist_bl_lock(ce->e_block_hash_p);
+		hlist_bl_lock(ce->e_index_hash_p);
+		/* Ignore if it is touched by a find/get */
+		if (ce->e_used || ce->e_queued || atomic_read(&ce->e_refcnt) ||
+			!list_empty(&ce->e_lru_list)) {
+			hlist_bl_unlock(ce->e_index_hash_p);
+			hlist_bl_unlock(ce->e_block_hash_p);
+			spin_lock(&mb_cache_spinlock);
+			continue;
+		}
+		__mb_cache_entry_unhash_unlock(ce);
+		list_add_tail(&ce->e_lru_list, &free_list);
+		spin_lock(&mb_cache_spinlock);
 	}
 	spin_unlock(&mb_cache_spinlock);
+
 	list_for_each_entry_safe(entry, tmp, &free_list, e_lru_list) {
 		__mb_cache_entry_forget(entry, gfp_mask);
+		freed++;
 	}
 	return freed;
 }
@@ -232,6 +325,14 @@ mb_cache_create(const char *name, int bucket_bits)
 	int n, bucket_count = 1 << bucket_bits;
 	struct mb_cache *cache = NULL;
 
+	if (!mb_cache_bg_lock) {
+		mb_cache_bg_lock = kmalloc(sizeof(struct blockgroup_lock),
+			GFP_KERNEL);
+		if (!mb_cache_bg_lock)
+			return NULL;
+		bgl_lock_init(mb_cache_bg_lock);
+	}
+
 	cache = kmalloc(sizeof(struct mb_cache), GFP_KERNEL);
 	if (!cache)
 		return NULL;
@@ -290,21 +391,47 @@ void
 mb_cache_shrink(struct block_device *bdev)
 {
 	LIST_HEAD(free_list);
-	struct list_head *l, *ltmp;
+	struct list_head *l;
+	struct mb_cache_entry *ce, *tmp;
 
+	l = &mb_cache_lru_list;
 	spin_lock(&mb_cache_spinlock);
-	list_for_each_safe(l, ltmp, &mb_cache_lru_list) {
-		struct mb_cache_entry *ce =
-			list_entry(l, struct mb_cache_entry, e_lru_list);
+	while (!list_is_last(l, &mb_cache_lru_list)) {
+		l = l->next;
+		ce = list_entry(l, struct mb_cache_entry, e_lru_list);
 		if (ce->e_bdev == bdev) {
-			list_move_tail(&ce->e_lru_list, &free_list);
-			__mb_cache_entry_unhash(ce);
+			list_del_init(&ce->e_lru_list);
+			if (ce->e_used || ce->e_queued ||
+				atomic_read(&ce->e_refcnt))
+				continue;
+			spin_unlock(&mb_cache_spinlock);
+			/*
+			 * Prevent any find or get operation on the entry.
+			 */
+			hlist_bl_lock(ce->e_block_hash_p);
+			hlist_bl_lock(ce->e_index_hash_p);
+			/* Ignore if it is touched by a find/get */
+			if (ce->e_used || ce->e_queued ||
+				atomic_read(&ce->e_refcnt) ||
+				!list_empty(&ce->e_lru_list)) {
+				hlist_bl_unlock(ce->e_index_hash_p);
+				hlist_bl_unlock(ce->e_block_hash_p);
+				l = &mb_cache_lru_list;
+				spin_lock(&mb_cache_spinlock);
+				continue;
+			}
+			__mb_cache_entry_unhash_unlock(ce);
+			mb_assert(!(ce->e_used || ce->e_queued ||
+				atomic_read(&ce->e_refcnt)));
+			list_add_tail(&ce->e_lru_list, &free_list);
+			l = &mb_cache_lru_list;
+			spin_lock(&mb_cache_spinlock);
 		}
 	}
 	spin_unlock(&mb_cache_spinlock);
-	list_for_each_safe(l, ltmp, &free_list) {
-		__mb_cache_entry_forget(list_entry(l, struct mb_cache_entry,
-						   e_lru_list), GFP_KERNEL);
+
+	list_for_each_entry_safe(ce, tmp, &free_list, e_lru_list) {
+		__mb_cache_entry_forget(ce, GFP_KERNEL);
 	}
 }
 
@@ -320,23 +447,27 @@ void
 mb_cache_destroy(struct mb_cache *cache)
 {
 	LIST_HEAD(free_list);
-	struct list_head *l, *ltmp;
+	struct mb_cache_entry *ce, *tmp;
 
 	spin_lock(&mb_cache_spinlock);
-	list_for_each_safe(l, ltmp, &mb_cache_lru_list) {
-		struct mb_cache_entry *ce =
-			list_entry(l, struct mb_cache_entry, e_lru_list);
-		if (ce->e_cache == cache) {
+	list_for_each_entry_safe(ce, tmp, &mb_cache_lru_list, e_lru_list) {
+		if (ce->e_cache == cache)
 			list_move_tail(&ce->e_lru_list, &free_list);
-			__mb_cache_entry_unhash(ce);
-		}
 	}
 	list_del(&cache->c_cache_list);
 	spin_unlock(&mb_cache_spinlock);
 
-	list_for_each_safe(l, ltmp, &free_list) {
-		__mb_cache_entry_forget(list_entry(l, struct mb_cache_entry,
-						   e_lru_list), GFP_KERNEL);
+	list_for_each_entry_safe(ce, tmp, &free_list, e_lru_list) {
+		list_del_init(&ce->e_lru_list);
+		/*
+		 * Prevent any find or get operation on the entry.
+		 */
+		hlist_bl_lock(ce->e_block_hash_p);
+		hlist_bl_lock(ce->e_index_hash_p);
+		mb_assert(!(ce->e_used || ce->e_queued ||
+			atomic_read(&ce->e_refcnt)));
+		__mb_cache_entry_unhash_unlock(ce);
+		__mb_cache_entry_forget(ce, GFP_KERNEL);
 	}
 
 	if (atomic_read(&cache->c_entry_count) > 0) {
@@ -345,8 +476,6 @@ mb_cache_destroy(struct mb_cache *cache)
 			  atomic_read(&cache->c_entry_count));
 	}
 
-	kmem_cache_destroy(cache->c_entry_cache);
-
 	kfree(cache->c_index_hash);
 	kfree(cache->c_block_hash);
 	kfree(cache);
@@ -363,29 +492,59 @@ mb_cache_destroy(struct mb_cache *cache)
 struct mb_cache_entry *
 mb_cache_entry_alloc(struct mb_cache *cache, gfp_t gfp_flags)
 {
-	struct mb_cache_entry *ce = NULL;
+	struct mb_cache_entry *ce;
 
 	if (atomic_read(&cache->c_entry_count) >= cache->c_max_entries) {
+		struct list_head *l;
+
+		l = &mb_cache_lru_list;
 		spin_lock(&mb_cache_spinlock);
-		if (!list_empty(&mb_cache_lru_list)) {
-			ce = list_entry(mb_cache_lru_list.next,
-					struct mb_cache_entry, e_lru_list);
-			list_del_init(&ce->e_lru_list);
-			__mb_cache_entry_unhash(ce);
+		while (!list_is_last(l, &mb_cache_lru_list)) {
+			l = l->next;
+			ce = list_entry(l, struct mb_cache_entry, e_lru_list);
+			if (ce->e_cache == cache) {
+				list_del_init(&ce->e_lru_list);
+				if (ce->e_used || ce->e_queued ||
+					atomic_read(&ce->e_refcnt))
+					continue;
+				spin_unlock(&mb_cache_spinlock);
+				/*
+				 * Prevent any find or get operation on the
+				 * entry.
+				 */
+				hlist_bl_lock(ce->e_block_hash_p);
+				hlist_bl_lock(ce->e_index_hash_p);
+				/* Ignore if it is touched by a find/get */
+				if (ce->e_used || ce->e_queued ||
+					atomic_read(&ce->e_refcnt) ||
+					!list_empty(&ce->e_lru_list)) {
+					hlist_bl_unlock(ce->e_index_hash_p);
+					hlist_bl_unlock(ce->e_block_hash_p);
+					l = &mb_cache_lru_list;
+					spin_lock(&mb_cache_spinlock);
+					continue;
+				}
+				mb_assert(list_empty(&ce->e_lru_list));
+				mb_assert(!(ce->e_used || ce->e_queued ||
+					atomic_read(&ce->e_refcnt)));
+				__mb_cache_entry_unhash_unlock(ce);
+				goto found;
+			}
 		}
 		spin_unlock(&mb_cache_spinlock);
 	}
-	if (!ce) {
-		ce = kmem_cache_alloc(cache->c_entry_cache, gfp_flags);
-		if (!ce)
-			return NULL;
-		atomic_inc(&cache->c_entry_count);
-		INIT_LIST_HEAD(&ce->e_lru_list);
-		INIT_HLIST_BL_NODE(&ce->e_block_list);
-		INIT_HLIST_BL_NODE(&ce->e_index.o_list);
-		ce->e_cache = cache;
-		ce->e_queued = 0;
-	}
+
+	ce = kmem_cache_alloc(cache->c_entry_cache, gfp_flags);
+	if (!ce)
+		return NULL;
+	atomic_inc(&cache->c_entry_count);
+	INIT_LIST_HEAD(&ce->e_lru_list);
+	INIT_HLIST_BL_NODE(&ce->e_block_list);
+	INIT_HLIST_BL_NODE(&ce->e_index.o_list);
+	ce->e_cache = cache;
+	ce->e_queued = 0;
+	atomic_set(&ce->e_refcnt, 0);
+found:
 	ce->e_block_hash_p = &cache->c_block_hash[0];
 	ce->e_index_hash_p = &cache->c_index_hash[0];
 	ce->e_used = 1 + MB_CACHE_WRITER;
@@ -414,7 +573,6 @@ mb_cache_entry_insert(struct mb_cache_entry *ce, struct block_device *bdev,
 	struct mb_cache *cache = ce->e_cache;
 	unsigned int bucket;
 	struct hlist_bl_node *l;
-	int error = -EBUSY;
 	struct hlist_bl_head *block_hash_p;
 	struct hlist_bl_head *index_hash_p;
 	struct mb_cache_entry *lce;
@@ -423,26 +581,29 @@ mb_cache_entry_insert(struct mb_cache_entry *ce, struct block_device *bdev,
 	bucket = hash_long((unsigned long)bdev + (block & 0xffffffff), 
 			   cache->c_bucket_bits);
 	block_hash_p = &cache->c_block_hash[bucket];
-	spin_lock(&mb_cache_spinlock);
+	hlist_bl_lock(block_hash_p);
 	hlist_bl_for_each_entry(lce, l, block_hash_p, e_block_list) {
-		if (lce->e_bdev == bdev && lce->e_block == block)
-			goto out;
+		if (lce->e_bdev == bdev && lce->e_block == block) {
+			hlist_bl_unlock(block_hash_p);
+			return -EBUSY;
+		}
 	}
 	mb_assert(!__mb_cache_entry_is_block_hashed(ce));
-	__mb_cache_entry_unhash(ce);
+	__mb_cache_entry_unhash_block(ce);
+	__mb_cache_entry_unhash_index(ce);
 	ce->e_bdev = bdev;
 	ce->e_block = block;
 	ce->e_block_hash_p = block_hash_p;
 	ce->e_index.o_key = key;
+	hlist_bl_add_head(&ce->e_block_list, block_hash_p);
+	hlist_bl_unlock(block_hash_p);
 	bucket = hash_long(key, cache->c_bucket_bits);
 	index_hash_p = &cache->c_index_hash[bucket];
+	hlist_bl_lock(index_hash_p);
 	ce->e_index_hash_p = index_hash_p;
 	hlist_bl_add_head(&ce->e_index.o_list, index_hash_p);
-	hlist_bl_add_head(&ce->e_block_list, block_hash_p);
-	error = 0;
-out:
-	spin_unlock(&mb_cache_spinlock);
-	return error;
+	hlist_bl_unlock(index_hash_p);
+	return 0;
 }
 
 
@@ -456,24 +617,26 @@ out:
 void
 mb_cache_entry_release(struct mb_cache_entry *ce)
 {
-	spin_lock(&mb_cache_spinlock);
-	__mb_cache_entry_release_unlock(ce);
+	__mb_cache_entry_release(ce);
 }
 
 
 /*
  * mb_cache_entry_free()
  *
- * This is equivalent to the sequence mb_cache_entry_takeout() --
- * mb_cache_entry_release().
  */
 void
 mb_cache_entry_free(struct mb_cache_entry *ce)
 {
-	spin_lock(&mb_cache_spinlock);
+	mb_assert(ce);
 	mb_assert(list_empty(&ce->e_lru_list));
-	__mb_cache_entry_unhash(ce);
-	__mb_cache_entry_release_unlock(ce);
+	hlist_bl_lock(ce->e_index_hash_p);
+	__mb_cache_entry_unhash_index(ce);
+	hlist_bl_unlock(ce->e_index_hash_p);
+	hlist_bl_lock(ce->e_block_hash_p);
+	__mb_cache_entry_unhash_block(ce);
+	hlist_bl_unlock(ce->e_block_hash_p);
+	__mb_cache_entry_release(ce);
 }
 
 
@@ -497,39 +660,48 @@ mb_cache_entry_get(struct mb_cache *cache, struct block_device *bdev,
 	bucket = hash_long((unsigned long)bdev + (block & 0xffffffff),
 			   cache->c_bucket_bits);
 	block_hash_p = &cache->c_block_hash[bucket];
-	spin_lock(&mb_cache_spinlock);
+	/* First serialize access to the block corresponding hash chain. */
+	hlist_bl_lock(block_hash_p);
 	hlist_bl_for_each_entry(ce, l, block_hash_p, e_block_list) {
 		mb_assert(ce->e_block_hash_p == block_hash_p);
 		if (ce->e_bdev == bdev && ce->e_block == block) {
-			DEFINE_WAIT(wait);
+			/*
+			 * Prevent a free from removing the entry.
+			 */
+			atomic_inc(&ce->e_refcnt);
+			hlist_bl_unlock(block_hash_p);
+			__spin_lock_mb_cache_entry(ce);
+			atomic_dec(&ce->e_refcnt);
+			if (ce->e_used > 0) {
+				DEFINE_WAIT(wait);
+				while (ce->e_used > 0) {
+					ce->e_queued++;
+					prepare_to_wait(&mb_cache_queue, &wait,
+							TASK_UNINTERRUPTIBLE);
+					__spin_unlock_mb_cache_entry(ce);
+					schedule();
+					__spin_lock_mb_cache_entry(ce);
+					ce->e_queued--;
+				}
+				finish_wait(&mb_cache_queue, &wait);
+			}
+			ce->e_used += 1 + MB_CACHE_WRITER;
+			__spin_unlock_mb_cache_entry(ce);
 
-			if (!list_empty(&ce->e_lru_list))
+			if (!list_empty(&ce->e_lru_list)) {
+				spin_lock(&mb_cache_spinlock);
 				list_del_init(&ce->e_lru_list);
-
-			while (ce->e_used > 0) {
-				ce->e_queued++;
-				prepare_to_wait(&mb_cache_queue, &wait,
-						TASK_UNINTERRUPTIBLE);
 				spin_unlock(&mb_cache_spinlock);
-				schedule();
-				spin_lock(&mb_cache_spinlock);
-				ce->e_queued--;
 			}
-			finish_wait(&mb_cache_queue, &wait);
-			ce->e_used += 1 + MB_CACHE_WRITER;
-
 			if (!__mb_cache_entry_is_block_hashed(ce)) {
-				__mb_cache_entry_release_unlock(ce);
+				__mb_cache_entry_release(ce);
 				return NULL;
 			}
-			goto cleanup;
+			return ce;
 		}
 	}
-	ce = NULL;
-
-cleanup:
-	spin_unlock(&mb_cache_spinlock);
-	return ce;
+	hlist_bl_unlock(block_hash_p);
+	return NULL;
 }
 
 #if !defined(MB_CACHE_INDEXES_COUNT) || (MB_CACHE_INDEXES_COUNT > 0)
@@ -538,40 +710,53 @@ static struct mb_cache_entry *
 __mb_cache_entry_find(struct hlist_bl_node *l, struct hlist_bl_head *head,
 		      struct block_device *bdev, unsigned int key)
 {
+
+	/* The index hash chain is already locked by the caller. */
 	while (l != NULL) {
 		struct mb_cache_entry *ce =
 			hlist_bl_entry(l, struct mb_cache_entry,
 				e_index.o_list);
 		mb_assert(ce->e_index_hash_p == head);
 		if (ce->e_bdev == bdev && ce->e_index.o_key == key) {
-			DEFINE_WAIT(wait);
-
-			if (!list_empty(&ce->e_lru_list))
-				list_del_init(&ce->e_lru_list);
-
+			/*
+			 * Prevent a free from removing the entry.
+			 */
+			atomic_inc(&ce->e_refcnt);
+			hlist_bl_unlock(head);
+			__spin_lock_mb_cache_entry(ce);
+			atomic_dec(&ce->e_refcnt);
+			ce->e_used++;
 			/* Incrementing before holding the lock gives readers
 			   priority over writers. */
-			ce->e_used++;
-			while (ce->e_used >= MB_CACHE_WRITER) {
-				ce->e_queued++;
-				prepare_to_wait(&mb_cache_queue, &wait,
-						TASK_UNINTERRUPTIBLE);
-				spin_unlock(&mb_cache_spinlock);
-				schedule();
+			if (ce->e_used >= MB_CACHE_WRITER) {
+				DEFINE_WAIT(wait);
+
+				while (ce->e_used >= MB_CACHE_WRITER) {
+					ce->e_queued++;
+					prepare_to_wait(&mb_cache_queue, &wait,
+							TASK_UNINTERRUPTIBLE);
+					__spin_unlock_mb_cache_entry(ce);
+					schedule();
+					__spin_lock_mb_cache_entry(ce);
+					ce->e_queued--;
+				}
+				finish_wait(&mb_cache_queue, &wait);
+			}
+			__spin_unlock_mb_cache_entry(ce);
+			if (!list_empty(&ce->e_lru_list)) {
 				spin_lock(&mb_cache_spinlock);
-				ce->e_queued--;
+				list_del_init(&ce->e_lru_list);
+				spin_unlock(&mb_cache_spinlock);
 			}
-			finish_wait(&mb_cache_queue, &wait);
-
 			if (!__mb_cache_entry_is_block_hashed(ce)) {
-				__mb_cache_entry_release_unlock(ce);
-				spin_lock(&mb_cache_spinlock);
+				__mb_cache_entry_release(ce);
 				return ERR_PTR(-EAGAIN);
 			}
 			return ce;
 		}
 		l = l->next;
 	}
+	hlist_bl_unlock(head);
 	return NULL;
 }
 
@@ -598,12 +783,12 @@ mb_cache_entry_find_first(struct mb_cache *cache, struct block_device *bdev,
 	struct hlist_bl_head *index_hash_p;
 
 	index_hash_p = &cache->c_index_hash[bucket];
-	spin_lock(&mb_cache_spinlock);
+	hlist_bl_lock(index_hash_p);
 	if (!hlist_bl_empty(index_hash_p)) {
 		l = hlist_bl_first(index_hash_p);
 		ce = __mb_cache_entry_find(l, index_hash_p, bdev, key);
-	}
-	spin_unlock(&mb_cache_spinlock);
+	} else
+		hlist_bl_unlock(index_hash_p);
 	return ce;
 }
 
@@ -638,11 +823,11 @@ mb_cache_entry_find_next(struct mb_cache_entry *prev,
 
 	index_hash_p = &cache->c_index_hash[bucket];
 	mb_assert(prev->e_index_hash_p == index_hash_p);
-	spin_lock(&mb_cache_spinlock);
+	hlist_bl_lock(index_hash_p);
 	mb_assert(!hlist_bl_empty(index_hash_p));
 	l = prev->e_index.o_list.next;
 	ce = __mb_cache_entry_find(l, index_hash_p, bdev, key);
-	__mb_cache_entry_release_unlock(prev);
+	__mb_cache_entry_release(prev);
 	return ce;
 }
 
-- 
1.7.11.3


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH v5 RESEND 3/3] ext4: each filesystem creates and uses its own mb_cache
  2014-03-12 16:19 ` [PATCH v5 RESEND 0/3] ext4: increase mbcache scalability T Makphaibulchoke
  2014-03-12 16:19   ` [PATCH v5 RESEND 1/3] fs/mbcache.c change block and index hash chain to hlist_bl_node T Makphaibulchoke
  2014-03-12 16:19   ` [PATCH v5 RESEND 2/3] mbcache: decoupling the locking of local from global data T Makphaibulchoke
@ 2014-03-12 16:19   ` T Makphaibulchoke
  2 siblings, 0 replies; 63+ messages in thread
From: T Makphaibulchoke @ 2014-03-12 16:19 UTC (permalink / raw)
  To: tytso, adilger.kernel, viro, linux-ext4, linux-kernel, linux-fsdevel
  Cc: aswin, T Makphaibulchoke

This patch adds new interfaces to create and destroy a cache,
ext4_xattr_create_cache() and ext4_xattr_destroy_cache(), and removes the
cache creation and destroy calls from ext4_init_xattr() and ext4_exit_xattr()
in fs/ext4/xattr.c.

fs/ext4/super.c has been changed so that when a filesystem is mounted a
cache is allocated and attached to its ext4_sb_info structure.

fs/mbcache.c has been changed so that only one slab cache is allocated
and shared by all mb_cache structures.
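
In outline (a sketch of the flow in the super.c hunks below, not a drop-in
replacement for them), the per-filesystem cache lifecycle becomes:

/* sketch of the super.c changes below */
/* mount time, in ext4_fill_super() */
sbi->s_mb_cache = ext4_xattr_create_cache(sb->s_id);
if (!sbi->s_mb_cache) {
	ext4_msg(sb, KERN_ERR, "Failed to create an mb_cache");
	goto failed_mount_wq;
}

/* unmount time, in ext4_put_super() */
if (sbi->s_mb_cache) {
	ext4_xattr_destroy_cache(sbi->s_mb_cache);
	sbi->s_mb_cache = NULL;
}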

Signed-off-by: T. Makphaibulchoke <tmac@hp.com>
---
 fs/ext4/ext4.h  |  1 +
 fs/ext4/super.c | 24 ++++++++++++++++--------
 fs/ext4/xattr.c | 51 ++++++++++++++++++++++++++++-----------------------
 fs/ext4/xattr.h |  6 +++---
 fs/mbcache.c    | 18 +++++++++++++-----
 5 files changed, 61 insertions(+), 39 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index e618503..fa20aab 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1314,6 +1314,7 @@ struct ext4_sb_info {
 	struct list_head s_es_lru;
 	unsigned long s_es_last_sorted;
 	struct percpu_counter s_extent_cache_cnt;
+	struct mb_cache *s_mb_cache;
 	spinlock_t s_es_lru_lock ____cacheline_aligned_in_smp;
 
 	/* Ratelimit ext4 messages. */
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index c977f4e..176048e 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -59,6 +59,7 @@ static struct kset *ext4_kset;
 static struct ext4_lazy_init *ext4_li_info;
 static struct mutex ext4_li_mtx;
 static struct ext4_features *ext4_feat;
+static int ext4_mballoc_ready;
 
 static int ext4_load_journal(struct super_block *, struct ext4_super_block *,
 			     unsigned long journal_devnum);
@@ -845,6 +846,10 @@ static void ext4_put_super(struct super_block *sb)
 		invalidate_bdev(sbi->journal_bdev);
 		ext4_blkdev_remove(sbi);
 	}
+	if (sbi->s_mb_cache) {
+		ext4_xattr_destroy_cache(sbi->s_mb_cache);
+		sbi->s_mb_cache = NULL;
+	}
 	if (sbi->s_mmp_tsk)
 		kthread_stop(sbi->s_mmp_tsk);
 	sb->s_fs_info = NULL;
@@ -3981,6 +3986,13 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 	set_task_ioprio(sbi->s_journal->j_task, journal_ioprio);
 
 	sbi->s_journal->j_commit_callback = ext4_journal_commit_callback;
+	if (ext4_mballoc_ready) {
+		sbi->s_mb_cache = ext4_xattr_create_cache(sb->s_id);
+		if (!sbi->s_mb_cache) {
+			ext4_msg(sb, KERN_ERR, "Failed to create an mb_cache");
+			goto failed_mount_wq;
+		}
+	}
 
 	/*
 	 * The journal may have updated the bg summary counts, so we
@@ -5501,11 +5513,9 @@ static int __init ext4_init_fs(void)
 
 	err = ext4_init_mballoc();
 	if (err)
-		goto out3;
-
-	err = ext4_init_xattr();
-	if (err)
 		goto out2;
+	else
+		ext4_mballoc_ready = 1;
 	err = init_inodecache();
 	if (err)
 		goto out1;
@@ -5521,10 +5531,9 @@ out:
 	unregister_as_ext3();
 	destroy_inodecache();
 out1:
-	ext4_exit_xattr();
-out2:
+	ext4_mballoc_ready = 0;
 	ext4_exit_mballoc();
-out3:
+out2:
 	ext4_exit_feat_adverts();
 out4:
 	if (ext4_proc_root)
@@ -5547,7 +5556,6 @@ static void __exit ext4_exit_fs(void)
 	unregister_as_ext3();
 	unregister_filesystem(&ext4_fs_type);
 	destroy_inodecache();
-	ext4_exit_xattr();
 	ext4_exit_mballoc();
 	ext4_exit_feat_adverts();
 	remove_proc_entry("fs/ext4", NULL);
diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index 1423c48..6e2788f 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -81,7 +81,7 @@
 # define ea_bdebug(bh, fmt, ...)	no_printk(fmt, ##__VA_ARGS__)
 #endif
 
-static void ext4_xattr_cache_insert(struct buffer_head *);
+static void ext4_xattr_cache_insert(struct mb_cache *, struct buffer_head *);
 static struct buffer_head *ext4_xattr_cache_find(struct inode *,
 						 struct ext4_xattr_header *,
 						 struct mb_cache_entry **);
@@ -90,8 +90,6 @@ static void ext4_xattr_rehash(struct ext4_xattr_header *,
 static int ext4_xattr_list(struct dentry *dentry, char *buffer,
 			   size_t buffer_size);
 
-static struct mb_cache *ext4_xattr_cache;
-
 static const struct xattr_handler *ext4_xattr_handler_map[] = {
 	[EXT4_XATTR_INDEX_USER]		     = &ext4_xattr_user_handler,
 #ifdef CONFIG_EXT4_FS_POSIX_ACL
@@ -117,6 +115,9 @@ const struct xattr_handler *ext4_xattr_handlers[] = {
 	NULL
 };
 
+#define EXT4_GET_MB_CACHE(inode)	(((struct ext4_sb_info *) \
+				inode->i_sb->s_fs_info)->s_mb_cache)
+
 static __le32 ext4_xattr_block_csum(struct inode *inode,
 				    sector_t block_nr,
 				    struct ext4_xattr_header *hdr)
@@ -265,6 +266,7 @@ ext4_xattr_block_get(struct inode *inode, int name_index, const char *name,
 	struct ext4_xattr_entry *entry;
 	size_t size;
 	int error;
+	struct mb_cache *ext4_mb_cache = EXT4_GET_MB_CACHE(inode);
 
 	ea_idebug(inode, "name=%d.%s, buffer=%p, buffer_size=%ld",
 		  name_index, name, buffer, (long)buffer_size);
@@ -286,7 +288,7 @@ bad_block:
 		error = -EIO;
 		goto cleanup;
 	}
-	ext4_xattr_cache_insert(bh);
+	ext4_xattr_cache_insert(ext4_mb_cache, bh);
 	entry = BFIRST(bh);
 	error = ext4_xattr_find_entry(&entry, name_index, name, bh->b_size, 1);
 	if (error == -EIO)
@@ -409,6 +411,7 @@ ext4_xattr_block_list(struct dentry *dentry, char *buffer, size_t buffer_size)
 	struct inode *inode = dentry->d_inode;
 	struct buffer_head *bh = NULL;
 	int error;
+	struct mb_cache *ext4_mb_cache = EXT4_GET_MB_CACHE(inode);
 
 	ea_idebug(inode, "buffer=%p, buffer_size=%ld",
 		  buffer, (long)buffer_size);
@@ -430,7 +433,7 @@ ext4_xattr_block_list(struct dentry *dentry, char *buffer, size_t buffer_size)
 		error = -EIO;
 		goto cleanup;
 	}
-	ext4_xattr_cache_insert(bh);
+	ext4_xattr_cache_insert(ext4_mb_cache, bh);
 	error = ext4_xattr_list_entries(dentry, BFIRST(bh), buffer, buffer_size);
 
 cleanup:
@@ -526,8 +529,9 @@ ext4_xattr_release_block(handle_t *handle, struct inode *inode,
 {
 	struct mb_cache_entry *ce = NULL;
 	int error = 0;
+	struct mb_cache *ext4_mb_cache = EXT4_GET_MB_CACHE(inode);
 
-	ce = mb_cache_entry_get(ext4_xattr_cache, bh->b_bdev, bh->b_blocknr);
+	ce = mb_cache_entry_get(ext4_mb_cache, bh->b_bdev, bh->b_blocknr);
 	error = ext4_journal_get_write_access(handle, bh);
 	if (error)
 		goto out;
@@ -745,13 +749,14 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
 	struct ext4_xattr_search *s = &bs->s;
 	struct mb_cache_entry *ce = NULL;
 	int error = 0;
+	struct mb_cache *ext4_mb_cache = EXT4_GET_MB_CACHE(inode);
 
 #define header(x) ((struct ext4_xattr_header *)(x))
 
 	if (i->value && i->value_len > sb->s_blocksize)
 		return -ENOSPC;
 	if (s->base) {
-		ce = mb_cache_entry_get(ext4_xattr_cache, bs->bh->b_bdev,
+		ce = mb_cache_entry_get(ext4_mb_cache, bs->bh->b_bdev,
 					bs->bh->b_blocknr);
 		error = ext4_journal_get_write_access(handle, bs->bh);
 		if (error)
@@ -769,7 +774,8 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
 				if (!IS_LAST_ENTRY(s->first))
 					ext4_xattr_rehash(header(s->base),
 							  s->here);
-				ext4_xattr_cache_insert(bs->bh);
+				ext4_xattr_cache_insert(ext4_mb_cache,
+					bs->bh);
 			}
 			unlock_buffer(bs->bh);
 			if (error == -EIO)
@@ -905,7 +911,7 @@ getblk_failed:
 			memcpy(new_bh->b_data, s->base, new_bh->b_size);
 			set_buffer_uptodate(new_bh);
 			unlock_buffer(new_bh);
-			ext4_xattr_cache_insert(new_bh);
+			ext4_xattr_cache_insert(ext4_mb_cache, new_bh);
 			error = ext4_handle_dirty_xattr_block(handle,
 							      inode, new_bh);
 			if (error)
@@ -1495,13 +1501,13 @@ ext4_xattr_put_super(struct super_block *sb)
  * Returns 0, or a negative error number on failure.
  */
 static void
-ext4_xattr_cache_insert(struct buffer_head *bh)
+ext4_xattr_cache_insert(struct mb_cache *ext4_mb_cache, struct buffer_head *bh)
 {
 	__u32 hash = le32_to_cpu(BHDR(bh)->h_hash);
 	struct mb_cache_entry *ce;
 	int error;
 
-	ce = mb_cache_entry_alloc(ext4_xattr_cache, GFP_NOFS);
+	ce = mb_cache_entry_alloc(ext4_mb_cache, GFP_NOFS);
 	if (!ce) {
 		ea_bdebug(bh, "out of memory");
 		return;
@@ -1573,12 +1579,13 @@ ext4_xattr_cache_find(struct inode *inode, struct ext4_xattr_header *header,
 {
 	__u32 hash = le32_to_cpu(header->h_hash);
 	struct mb_cache_entry *ce;
+	struct mb_cache *ext4_mb_cache = EXT4_GET_MB_CACHE(inode);
 
 	if (!header->h_hash)
 		return NULL;  /* never share */
 	ea_idebug(inode, "looking for cached blocks [%x]", (int)hash);
 again:
-	ce = mb_cache_entry_find_first(ext4_xattr_cache, inode->i_sb->s_bdev,
+	ce = mb_cache_entry_find_first(ext4_mb_cache, inode->i_sb->s_bdev,
 				       hash);
 	while (ce) {
 		struct buffer_head *bh;
@@ -1676,19 +1683,17 @@ static void ext4_xattr_rehash(struct ext4_xattr_header *header,
 
 #undef BLOCK_HASH_SHIFT
 
-int __init
-ext4_init_xattr(void)
+#define	HASH_BUCKET_BITS	10
+
+struct mb_cache *
+ext4_xattr_create_cache(char *name)
 {
-	ext4_xattr_cache = mb_cache_create("ext4_xattr", 6);
-	if (!ext4_xattr_cache)
-		return -ENOMEM;
-	return 0;
+	return mb_cache_create(name, HASH_BUCKET_BITS);
 }
 
-void
-ext4_exit_xattr(void)
+void ext4_xattr_destroy_cache(struct mb_cache *cache)
 {
-	if (ext4_xattr_cache)
-		mb_cache_destroy(ext4_xattr_cache);
-	ext4_xattr_cache = NULL;
+	if (cache)
+		mb_cache_destroy(cache);
 }
+
diff --git a/fs/ext4/xattr.h b/fs/ext4/xattr.h
index c767dbd..b930320 100644
--- a/fs/ext4/xattr.h
+++ b/fs/ext4/xattr.h
@@ -112,9 +112,6 @@ extern void ext4_xattr_put_super(struct super_block *);
 extern int ext4_expand_extra_isize_ea(struct inode *inode, int new_extra_isize,
 			    struct ext4_inode *raw_inode, handle_t *handle);
 
-extern int __init ext4_init_xattr(void);
-extern void ext4_exit_xattr(void);
-
 extern const struct xattr_handler *ext4_xattr_handlers[];
 
 extern int ext4_xattr_ibody_find(struct inode *inode, struct ext4_xattr_info *i,
@@ -126,6 +123,9 @@ extern int ext4_xattr_ibody_inline_set(handle_t *handle, struct inode *inode,
 				       struct ext4_xattr_info *i,
 				       struct ext4_xattr_ibody_find *is);
 
+extern struct mb_cache *ext4_xattr_create_cache(char *name);
+extern void ext4_xattr_destroy_cache(struct mb_cache *);
+
 #ifdef CONFIG_EXT4_FS_SECURITY
 extern int ext4_init_security(handle_t *handle, struct inode *inode,
 			      struct inode *dir, const struct qstr *qstr);
diff --git a/fs/mbcache.c b/fs/mbcache.c
index 786ecab..bf166e3 100644
--- a/fs/mbcache.c
+++ b/fs/mbcache.c
@@ -99,6 +99,7 @@
 
 static DECLARE_WAIT_QUEUE_HEAD(mb_cache_queue);
 static struct blockgroup_lock *mb_cache_bg_lock;
+static struct kmem_cache *mb_cache_kmem_cache;
 
 MODULE_AUTHOR("Andreas Gruenbacher <a.gruenbacher@computer.org>");
 MODULE_DESCRIPTION("Meta block cache (for extended attributes)");
@@ -351,11 +352,14 @@ mb_cache_create(const char *name, int bucket_bits)
 		goto fail;
 	for (n=0; n<bucket_count; n++)
 		INIT_HLIST_BL_HEAD(&cache->c_index_hash[n]);
-	cache->c_entry_cache = kmem_cache_create(name,
-		sizeof(struct mb_cache_entry), 0,
-		SLAB_RECLAIM_ACCOUNT|SLAB_MEM_SPREAD, NULL);
-	if (!cache->c_entry_cache)
-		goto fail2;
+	if (!mb_cache_kmem_cache) {
+		mb_cache_kmem_cache = kmem_cache_create(name,
+			sizeof(struct mb_cache_entry), 0,
+			SLAB_RECLAIM_ACCOUNT|SLAB_MEM_SPREAD, NULL);
+		if (!mb_cache_kmem_cache)
+			goto fail2;
+	}
+	cache->c_entry_cache = mb_cache_kmem_cache;
 
 	/*
 	 * Set an upper limit on the number of cache entries so that the hash
@@ -476,6 +480,10 @@ mb_cache_destroy(struct mb_cache *cache)
 			  atomic_read(&cache->c_entry_count));
 	}
 
+	if (list_empty(&mb_cache_list)) {
+		kmem_cache_destroy(mb_cache_kmem_cache);
+		mb_cache_kmem_cache = NULL;
+	}
 	kfree(cache->c_index_hash);
 	kfree(cache->c_block_hash);
 	kfree(cache);
-- 
1.7.11.3


^ permalink raw reply related	[flat|nested] 63+ messages in thread

end of thread, other threads:[~2014-03-12 16:21 UTC | newest]

Thread overview: 63+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-07-18  0:55 [PATCH 0/2] ext4: increase mbcache scalability T Makphaibulchoke
2013-08-22 15:54 ` [PATCH v2 " T Makphaibulchoke
2013-08-22 15:54   ` [PATCH v2 1/2] mbcache: decoupling the locking of local from global data T Makphaibulchoke
2013-08-22 16:53     ` Linus Torvalds
2013-08-22 15:33       ` Thavatchai Makphaibulchoke
2013-08-22 15:54   ` [PATCH v2 2/2] ext4: each filesystem creates and uses its own mc_cache T Makphaibulchoke
2013-08-22 15:54     ` T Makphaibulchoke
2014-01-24 18:31   ` [PATCH v4 0/3] ext4: increase mbcache scalability T Makphaibulchoke
2014-01-24 18:31     ` [PATCH v4 1/3] fs/mbcache.c change block and index hash chain to hlist_bl_node T Makphaibulchoke
2014-01-24 18:31     ` [PATCH v4 2/3] mbcache: decoupling the locking of local from global data T Makphaibulchoke
2014-01-24 18:31     ` [PATCH v4 3/3] ext4: each filesystem creates and uses its own mc_cache T Makphaibulchoke
2014-01-24 21:38     ` [PATCH v4 0/3] ext4: increase mbcache scalability Andi Kleen
2014-01-25  1:13       ` Thavatchai Makphaibulchoke
2014-01-25  6:09       ` Andreas Dilger
2014-01-27 12:27         ` Thavatchai Makphaibulchoke
2014-02-09 19:46         ` Thavatchai Makphaibulchoke
2014-02-11 19:58         ` Thavatchai Makphaibulchoke
2014-02-13  2:01           ` Andreas Dilger
2013-09-04 16:39 ` [PATCH v3 0/2] " T Makphaibulchoke
2013-09-04 16:39   ` [PATCH v3 1/2] mbcache: decoupling the locking of local from global data T Makphaibulchoke
2013-10-30 13:27     ` Theodore Ts'o
2013-10-30 14:42     ` Theodore Ts'o
2013-10-30 17:32       ` Thavatchai Makphaibulchoke
2013-09-04 16:39   ` [PATCH v3 2/2] ext4: each filesystem creates and uses its own mb_cache T Makphaibulchoke
2013-10-30 14:30     ` Theodore Ts'o
2013-09-04 20:00   ` [PATCH v3 0/2] ext4: increase mbcache scalability Andreas Dilger
2013-09-04 15:33     ` Thavatchai Makphaibulchoke
2013-09-05 15:22     ` Thavatchai Makphaibulchoke
2013-09-05  2:35   ` Theodore Ts'o
2013-09-05  9:49     ` Thavatchai Makphaibulchoke
2013-09-06  5:10       ` Andreas Dilger
2013-09-06 12:23         ` Thavatchai Makphaibulchoke
2013-09-10 20:47           ` Andreas Dilger
2013-09-10 21:02             ` Theodore Ts'o
2013-09-10 17:10               ` Thavatchai Makphaibulchoke
2013-09-11  3:13               ` Eric Sandeen
2013-09-11 11:30                 ` Theodore Ts'o
2013-09-11 16:49                   ` Eric Sandeen
2013-09-11 19:33                     ` Eric Sandeen
2013-09-11 20:32                     ` David Lang
2013-09-11 20:48                       ` Eric Sandeen
2013-09-11 21:25                         ` Theodore Ts'o
2013-09-11 20:36                           ` Thavatchai Makphaibulchoke
2013-09-12  3:42                             ` Eric Sandeen
2013-09-12 12:23                             ` Theodore Ts'o
2013-09-13 12:04                               ` Thavatchai Makphaibulchoke
2013-09-13 18:59                                 ` Eric Sandeen
2013-09-17 16:43                                   ` Thavatchai Makphaibulchoke
2013-09-18 14:19                                     ` Eric Sandeen
2013-09-21 18:53                                       ` Theodore Ts'o
2013-09-23 15:35                                         ` Thavatchai Makphaibulchoke
     [not found]                                         ` <D59BF31F-934D-4DE6-9E5C-F92014F09C68@dilger.ca>
2013-09-25 18:43                                           ` Thavatchai Makphaibulchoke
2013-10-04  3:22                                             ` Andreas Dilger
2013-10-18 12:43                                               ` Thavatchai Makphaibulchoke
2013-10-19 23:06                                                 ` Eric Sandeen
2014-02-20 18:37 ` [PATCH V5 0/3] " T Makphaibulchoke
2014-02-20 18:37   ` [PATCH V5 1/3] fs/mbcache.c change block and index hash chain to hlist_bl_node T Makphaibulchoke
2014-02-20 18:37   ` [PATCH V5 2/3] mbcache: decoupling the locking of local from global data T Makphaibulchoke
2014-02-20 18:37   ` [PATCH V5 3/3] ext4: each filesystem creates and uses its own mb_cache T Makphaibulchoke
2014-03-12 16:19 ` [PATCH v5 RESEND 0/3] ext4: increase mbcache scalability T Makphaibulchoke
2014-03-12 16:19   ` [PATCH v5 RESEND 1/3] fs/mbcache.c change block and index hash chain to hlist_bl_node T Makphaibulchoke
2014-03-12 16:19   ` [PATCH v5 RESEND 2/3] mbcache: decoupling the locking of local from global data T Makphaibulchoke
2014-03-12 16:19   ` [PATCH v5 RESEND 3/3] ext4: each filesystem creates and uses its own mb_cache T Makphaibulchoke

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.