linux-kernel.vger.kernel.org archive mirror
* [patch 00/26] Current slab allocator / SLUB patch queue
@ 2007-06-18  9:58 clameter
  2007-06-18  9:58 ` [patch 01/26] SLUB Debug: Fix initial object debug state of NUMA bootstrap objects clameter
                   ` (26 more replies)
  0 siblings, 27 replies; 73+ messages in thread
From: clameter @ 2007-06-18  9:58 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-mm, Pekka Enberg, suresh.b.siddha

These contain the following groups of patches:

1. Slab allocator code consolidation and fixing of inconsistencies

This makes ZERO_SIZE_PTR generic so that it works in all
slab allocators.

It adds __GFP_ZERO support to all slab allocators, cleans up the zeroing in
the slab allocators, and provides modifications that remove explicit zeroing
following kmalloc_node and kmem_cache_alloc_node calls.

2. SLUB improvements

Inline some small functions to reduce code size. Some more memory
optimizations using CONFIG_SLUB_DEBUG. Changes to the handling of the
slub_lock and an optimization of the runtime determination of kmalloc slabs
(this replaces the ilog2 patch that failed with gcc 3.3 on powerpc).

3. Slab defragmentation

This is V3 of the patchset with the one fix for the locking problem that
showed up during testing.

4. Performance optimizations

These patches have a long history, going back to the early drafts of SLUB.
The problem with these patches is that they require touching additional
cachelines (only for read), whereas SLUB was designed for minimal cacheline
touching. In exchange we may be able to remove cacheline bouncing, in
particular for remote alloc/free situations where I have had reports of
issues that I was not able to confirm for lack of specificity. The tradeoffs
here are not clear. Certainly the larger cacheline footprint will hurt the
casual slab user somewhat, but it will benefit processes that perform these
local/remote alloc/free operations.

I'd appreciate it if someone could evaluate these.

The complete patchset against 2.6.22-rc4-mm2 is available at

http://ftp.kernel.org/pub/linux/kernel/people/christoph/slub/2.6.22-rc4-mm2

Tested on

x86_64 SMP
x86_64 NUMA emulation
IA64 emulator
Altix 64p/128G NUMA system.
Altix 8p/6G asymmetric NUMA system.


-- 


* [patch 01/26] SLUB Debug: Fix initial object debug state of NUMA bootstrap objects
  2007-06-18  9:58 [patch 00/26] Current slab allocator / SLUB patch queue clameter
@ 2007-06-18  9:58 ` clameter
  2007-06-18  9:58 ` [patch 02/26] Slab allocators: Consolidate code for krealloc in mm/util.c clameter
                   ` (25 subsequent siblings)
  26 siblings, 0 replies; 73+ messages in thread
From: clameter @ 2007-06-18  9:58 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-mm, Pekka Enberg, suresh.b.siddha

[-- Attachment #1: fix_numa_boostrap_debug --]
[-- Type: text/plain, Size: 999 bytes --]

The function we call to initialize object debug state during early NUMA
bootstrap sets up an inactive object, giving it the wrong redzone signature.
The bootstrap nodes are active objects and should have active redzone
signatures.

Currently slab validation complains and reverts the object to active state.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.22-rc4-mm2/mm/slub.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/mm/slub.c	2007-06-17 23:51:43.000000000 -0700
+++ linux-2.6.22-rc4-mm2/mm/slub.c	2007-06-18 00:40:04.000000000 -0700
@@ -1925,7 +1925,8 @@ static struct kmem_cache_node * __init e
 	page->freelist = get_freepointer(kmalloc_caches, n);
 	page->inuse++;
 	kmalloc_caches->node[node] = n;
-	setup_object_debug(kmalloc_caches, page, n);
+	init_object(kmalloc_caches, n, 1);
+	init_tracking(kmalloc_caches, n);
 	init_kmem_cache_node(n);
 	atomic_long_inc(&n->nr_slabs);
 	add_partial(n, page);

-- 


* [patch 02/26] Slab allocators: Consolidate code for krealloc in mm/util.c
  2007-06-18  9:58 [patch 00/26] Current slab allocator / SLUB patch queue clameter
  2007-06-18  9:58 ` [patch 01/26] SLUB Debug: Fix initial object debug state of NUMA bootstrap objects clameter
@ 2007-06-18  9:58 ` clameter
  2007-06-18 20:03   ` Pekka Enberg
  2007-06-18  9:58 ` [patch 03/26] Slab allocators: Consistent ZERO_SIZE_PTR support and NULL result semantics clameter
                   ` (24 subsequent siblings)
  26 siblings, 1 reply; 73+ messages in thread
From: clameter @ 2007-06-18  9:58 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-mm, Pekka Enberg, suresh.b.siddha

[-- Attachment #1: slab_allocators_consolidate_krealloc --]
[-- Type: text/plain, Size: 6250 bytes --]

The size of a kmalloc object is readily available via ksize(). ksize() is
provided by all allocators and thus we can implement krealloc() in a
generic way.

Implement krealloc in mm/util.c and drop slab specific implementations
of krealloc.
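
A short caller-side sketch of the semantics the consolidated version keeps.
The helper and its names below are hypothetical and only for illustration;
they are not part of the patch:

	#include <linux/slab.h>

	/*
	 * On failure krealloc() returns NULL and leaves the old allocation
	 * untouched, so the original pointer must not be overwritten before
	 * the result has been checked.
	 */
	static int grow_buffer(void **bufp, size_t new_len)
	{
		void *bigger = krealloc(*bufp, new_len, GFP_KERNEL);

		if (!bigger)
			return -ENOMEM;	/* *bufp is still valid */
		*bufp = bigger;	/* contents preserved up to min(old, new) size */
		return 0;
	}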

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 mm/slab.c |   46 ----------------------------------------------
 mm/slob.c |   33 ---------------------------------
 mm/slub.c |   37 -------------------------------------
 mm/util.c |   34 ++++++++++++++++++++++++++++++++++
 4 files changed, 34 insertions(+), 116 deletions(-)

Index: linux-2.6.22-rc4-mm2/mm/slab.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/mm/slab.c	2007-06-16 18:58:02.000000000 -0700
+++ linux-2.6.22-rc4-mm2/mm/slab.c	2007-06-16 18:58:10.000000000 -0700
@@ -3715,52 +3715,6 @@ EXPORT_SYMBOL(__kmalloc);
 #endif
 
 /**
- * krealloc - reallocate memory. The contents will remain unchanged.
- * @p: object to reallocate memory for.
- * @new_size: how many bytes of memory are required.
- * @flags: the type of memory to allocate.
- *
- * The contents of the object pointed to are preserved up to the
- * lesser of the new and old sizes.  If @p is %NULL, krealloc()
- * behaves exactly like kmalloc().  If @size is 0 and @p is not a
- * %NULL pointer, the object pointed to is freed.
- */
-void *krealloc(const void *p, size_t new_size, gfp_t flags)
-{
-	struct kmem_cache *cache, *new_cache;
-	void *ret;
-
-	if (unlikely(!p))
-		return kmalloc_track_caller(new_size, flags);
-
-	if (unlikely(!new_size)) {
-		kfree(p);
-		return NULL;
-	}
-
-	cache = virt_to_cache(p);
-	new_cache = __find_general_cachep(new_size, flags);
-
-	/*
- 	 * If new size fits in the current cache, bail out.
- 	 */
-	if (likely(cache == new_cache))
-		return (void *)p;
-
-	/*
- 	 * We are on the slow-path here so do not use __cache_alloc
- 	 * because it bloats kernel text.
- 	 */
-	ret = kmalloc_track_caller(new_size, flags);
-	if (ret) {
-		memcpy(ret, p, min(new_size, ksize(p)));
-		kfree(p);
-	}
-	return ret;
-}
-EXPORT_SYMBOL(krealloc);
-
-/**
  * kmem_cache_free - Deallocate an object
  * @cachep: The cache the allocation was from.
  * @objp: The previously allocated object.
Index: linux-2.6.22-rc4-mm2/mm/slob.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/mm/slob.c	2007-06-16 18:58:02.000000000 -0700
+++ linux-2.6.22-rc4-mm2/mm/slob.c	2007-06-16 18:58:10.000000000 -0700
@@ -407,39 +407,6 @@ void *__kmalloc(size_t size, gfp_t gfp)
 }
 EXPORT_SYMBOL(__kmalloc);
 
-/**
- * krealloc - reallocate memory. The contents will remain unchanged.
- *
- * @p: object to reallocate memory for.
- * @new_size: how many bytes of memory are required.
- * @flags: the type of memory to allocate.
- *
- * The contents of the object pointed to are preserved up to the
- * lesser of the new and old sizes.  If @p is %NULL, krealloc()
- * behaves exactly like kmalloc().  If @size is 0 and @p is not a
- * %NULL pointer, the object pointed to is freed.
- */
-void *krealloc(const void *p, size_t new_size, gfp_t flags)
-{
-	void *ret;
-
-	if (unlikely(!p))
-		return kmalloc_track_caller(new_size, flags);
-
-	if (unlikely(!new_size)) {
-		kfree(p);
-		return NULL;
-	}
-
-	ret = kmalloc_track_caller(new_size, flags);
-	if (ret) {
-		memcpy(ret, p, min(new_size, ksize(p)));
-		kfree(p);
-	}
-	return ret;
-}
-EXPORT_SYMBOL(krealloc);
-
 void kfree(const void *block)
 {
 	struct slob_page *sp;
Index: linux-2.6.22-rc4-mm2/mm/slub.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/mm/slub.c	2007-06-16 18:58:02.000000000 -0700
+++ linux-2.6.22-rc4-mm2/mm/slub.c	2007-06-16 18:58:10.000000000 -0700
@@ -2476,43 +2476,6 @@ int kmem_cache_shrink(struct kmem_cache 
 }
 EXPORT_SYMBOL(kmem_cache_shrink);
 
-/**
- * krealloc - reallocate memory. The contents will remain unchanged.
- * @p: object to reallocate memory for.
- * @new_size: how many bytes of memory are required.
- * @flags: the type of memory to allocate.
- *
- * The contents of the object pointed to are preserved up to the
- * lesser of the new and old sizes.  If @p is %NULL, krealloc()
- * behaves exactly like kmalloc().  If @size is 0 and @p is not a
- * %NULL pointer, the object pointed to is freed.
- */
-void *krealloc(const void *p, size_t new_size, gfp_t flags)
-{
-	void *ret;
-	size_t ks;
-
-	if (unlikely(!p || p == ZERO_SIZE_PTR))
-		return kmalloc(new_size, flags);
-
-	if (unlikely(!new_size)) {
-		kfree(p);
-		return ZERO_SIZE_PTR;
-	}
-
-	ks = ksize(p);
-	if (ks >= new_size)
-		return (void *)p;
-
-	ret = kmalloc(new_size, flags);
-	if (ret) {
-		memcpy(ret, p, min(new_size, ks));
-		kfree(p);
-	}
-	return ret;
-}
-EXPORT_SYMBOL(krealloc);
-
 /********************************************************************
  *			Basic setup of slabs
  *******************************************************************/
Index: linux-2.6.22-rc4-mm2/mm/util.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/mm/util.c	2007-06-16 18:58:02.000000000 -0700
+++ linux-2.6.22-rc4-mm2/mm/util.c	2007-06-16 19:01:26.000000000 -0700
@@ -81,6 +81,40 @@ void *kmemdup(const void *src, size_t le
 }
 EXPORT_SYMBOL(kmemdup);
 
+/**
+ * krealloc - reallocate memory. The contents will remain unchanged.
+ * @p: object to reallocate memory for.
+ * @new_size: how many bytes of memory are required.
+ * @flags: the type of memory to allocate.
+ *
+ * The contents of the object pointed to are preserved up to the
+ * lesser of the new and old sizes.  If @p is %NULL, krealloc()
+ * behaves exactly like kmalloc().  If @size is 0 and @p is not a
+ * %NULL pointer, the object pointed to is freed.
+ */
+void *krealloc(const void *p, size_t new_size, gfp_t flags)
+{
+	void *ret;
+	size_t ks;
+
+	if (unlikely(!new_size)) {
+		kfree(p);
+		return NULL;
+	}
+
+	ks = ksize(p);
+	if (ks >= new_size)
+		return (void *)p;
+
+	ret = kmalloc_track_caller(new_size, flags);
+	if (ret) {
+		memcpy(ret, p, min(new_size, ks));
+		kfree(p);
+	}
+	return ret;
+}
+EXPORT_SYMBOL(krealloc);
+
 /*
  * strndup_user - duplicate an existing string from user space
  * @s: The string to duplicate

-- 


* [patch 03/26] Slab allocators: Consistent ZERO_SIZE_PTR support and NULL result semantics
  2007-06-18  9:58 [patch 00/26] Current slab allocator / SLUB patch queue clameter
  2007-06-18  9:58 ` [patch 01/26] SLUB Debug: Fix initial object debug state of NUMA bootstrap objects clameter
  2007-06-18  9:58 ` [patch 02/26] Slab allocators: Consolidate code for krealloc in mm/util.c clameter
@ 2007-06-18  9:58 ` clameter
  2007-06-18 20:08   ` Pekka Enberg
  2007-06-18  9:58 ` [patch 04/26] Slab allocators: Support __GFP_ZERO in all allocators clameter
                   ` (23 subsequent siblings)
  26 siblings, 1 reply; 73+ messages in thread
From: clameter @ 2007-06-18  9:58 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-mm, Pekka Enberg, suresh.b.siddha

[-- Attachment #1: slab_allocators_consolidate_zero_size_ptr --]
[-- Type: text/plain, Size: 8718 bytes --]

Define a ZERO_OR_NULL_PTR macro to replace the open-coded NULL and
ZERO_SIZE_PTR checks in the allocators. Move the ZERO_SIZE_PTR related
definitions into slab.h.

Make ZERO_SIZE_PTR work for all slab allocators and get rid of the
WARN_ON_ONCE(size == 0) that still remains in SLAB.

Make SLUB return NULL like the other allocators if too large a memory
segment is requested via __kmalloc().
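
A condensed sketch of the resulting behaviour once the hunks below are
applied (illustrative only, not code from the patch):

	void *p = kmalloc(0, GFP_KERNEL); /* now ZERO_SIZE_PTR in every allocator */

	BUG_ON(!ZERO_OR_NULL_PTR(p));	/* true for NULL and for ZERO_SIZE_PTR */
	BUG_ON(p == NULL);		/* distinct from NULL; dereferencing it faults */
	kfree(p);			/* no-op, just like kfree(NULL) */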

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 include/linux/slab.h     |   13 +++++++++++++
 include/linux/slab_def.h |   12 ++++++++++++
 include/linux/slub_def.h |   11 -----------
 mm/slab.c                |   14 ++++++++------
 mm/slob.c                |   13 ++++++++-----
 mm/slub.c                |   29 ++++++++++++++++-------------
 mm/util.c                |    2 +-
 7 files changed, 58 insertions(+), 36 deletions(-)

Index: linux-2.6.22-rc4-mm2/include/linux/slab.h
===================================================================
--- linux-2.6.22-rc4-mm2.orig/include/linux/slab.h	2007-06-12 12:04:57.000000000 -0700
+++ linux-2.6.22-rc4-mm2/include/linux/slab.h	2007-06-17 20:35:03.000000000 -0700
@@ -33,6 +33,19 @@
 #define SLAB_RECLAIM_ACCOUNT	0x00020000UL		/* Objects are reclaimable */
 #define SLAB_TEMPORARY		SLAB_RECLAIM_ACCOUNT	/* Objects are short-lived */
 /*
+ * ZERO_SIZE_PTR will be returned for zero sized kmalloc requests.
+ *
+ * Dereferencing ZERO_SIZE_PTR will lead to a distinct access fault.
+ *
+ * ZERO_SIZE_PTR can be passed to kfree though in the same way that NULL can.
+ * Both make kfree a no-op.
+ */
+#define ZERO_SIZE_PTR ((void *)16)
+
+#define ZERO_OR_NULL_PTR(x) ((unsigned long)(x) < \
+				(unsigned long)ZERO_SIZE_PTR)
+
+/*
  * struct kmem_cache related prototypes
  */
 void __init kmem_cache_init(void);
Index: linux-2.6.22-rc4-mm2/include/linux/slab_def.h
===================================================================
--- linux-2.6.22-rc4-mm2.orig/include/linux/slab_def.h	2007-06-04 17:57:25.000000000 -0700
+++ linux-2.6.22-rc4-mm2/include/linux/slab_def.h	2007-06-17 20:35:03.000000000 -0700
@@ -29,6 +29,10 @@ static inline void *kmalloc(size_t size,
 {
 	if (__builtin_constant_p(size)) {
 		int i = 0;
+
+		if (!size)
+			return ZERO_SIZE_PTR;
+
 #define CACHE(x) \
 		if (size <= x) \
 			goto found; \
@@ -55,6 +59,10 @@ static inline void *kzalloc(size_t size,
 {
 	if (__builtin_constant_p(size)) {
 		int i = 0;
+
+		if (!size)
+			return ZERO_SIZE_PTR;
+
 #define CACHE(x) \
 		if (size <= x) \
 			goto found; \
@@ -84,6 +92,10 @@ static inline void *kmalloc_node(size_t 
 {
 	if (__builtin_constant_p(size)) {
 		int i = 0;
+
+		if (!size)
+			return ZERO_SIZE_PTR;
+
 #define CACHE(x) \
 		if (size <= x) \
 			goto found; \
Index: linux-2.6.22-rc4-mm2/include/linux/slub_def.h
===================================================================
--- linux-2.6.22-rc4-mm2.orig/include/linux/slub_def.h	2007-06-17 19:10:33.000000000 -0700
+++ linux-2.6.22-rc4-mm2/include/linux/slub_def.h	2007-06-17 20:35:03.000000000 -0700
@@ -160,17 +160,6 @@ static inline struct kmem_cache *kmalloc
 #endif
 
 
-/*
- * ZERO_SIZE_PTR will be returned for zero sized kmalloc requests.
- *
- * Dereferencing ZERO_SIZE_PTR will lead to a distinct access fault.
- *
- * ZERO_SIZE_PTR can be passed to kfree though in the same way that NULL can.
- * Both make kfree a no-op.
- */
-#define ZERO_SIZE_PTR ((void *)16)
-
-
 static inline void *kmalloc(size_t size, gfp_t flags)
 {
 	if (__builtin_constant_p(size) && !(flags & SLUB_DMA)) {
Index: linux-2.6.22-rc4-mm2/mm/slab.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/mm/slab.c	2007-06-17 19:10:33.000000000 -0700
+++ linux-2.6.22-rc4-mm2/mm/slab.c	2007-06-17 20:35:08.000000000 -0700
@@ -774,7 +774,9 @@ static inline struct kmem_cache *__find_
 	 */
 	BUG_ON(malloc_sizes[INDEX_AC].cs_cachep == NULL);
 #endif
-	WARN_ON_ONCE(size == 0);
+	if (!size)
+		return ZERO_SIZE_PTR;
+
 	while (size > csizep->cs_size)
 		csizep++;
 
@@ -2340,7 +2342,7 @@ kmem_cache_create (const char *name, siz
 		 * this should not happen at all.
 		 * But leave a BUG_ON for some lucky dude.
 		 */
-		BUG_ON(!cachep->slabp_cache);
+		BUG_ON(ZERO_OR_NULL_PTR(cachep->slabp_cache));
 	}
 	cachep->ctor = ctor;
 	cachep->name = name;
@@ -3642,8 +3644,8 @@ __do_kmalloc_node(size_t size, gfp_t fla
 	struct kmem_cache *cachep;
 
 	cachep = kmem_find_general_cachep(size, flags);
-	if (unlikely(cachep == NULL))
-		return NULL;
+	if (unlikely(ZERO_OR_NULL_PTR(cachep)))
+		return cachep;
 	return kmem_cache_alloc_node(cachep, flags, node);
 }
 
@@ -3749,7 +3751,7 @@ void kfree(const void *objp)
 	struct kmem_cache *c;
 	unsigned long flags;
 
-	if (unlikely(!objp))
+	if (unlikely(ZERO_OR_NULL_PTR(objp)))
 		return;
 	local_irq_save(flags);
 	kfree_debugcheck(objp);
@@ -4436,7 +4438,7 @@ const struct seq_operations slabstats_op
  */
 size_t ksize(const void *objp)
 {
-	if (unlikely(objp == NULL))
+	if (unlikely(ZERO_OR_NULL_PTR(objp)))
 		return 0;
 
 	return obj_size(virt_to_cache(objp));
Index: linux-2.6.22-rc4-mm2/mm/slob.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/mm/slob.c	2007-06-17 19:10:33.000000000 -0700
+++ linux-2.6.22-rc4-mm2/mm/slob.c	2007-06-17 20:35:04.000000000 -0700
@@ -306,7 +306,7 @@ static void slob_free(void *block, int s
 	slobidx_t units;
 	unsigned long flags;
 
-	if (!block)
+	if (ZERO_OR_NULL_PTR(block))
 		return;
 	BUG_ON(!size);
 
@@ -384,11 +384,14 @@ out:
 
 void *__kmalloc(size_t size, gfp_t gfp)
 {
+	unsigned int *m;
 	int align = max(ARCH_KMALLOC_MINALIGN, ARCH_SLAB_MINALIGN);
 
 	if (size < PAGE_SIZE - align) {
-		unsigned int *m;
-		m = slob_alloc(size + align, gfp, align);
+		if (!size)
+			return ZERO_SIZE_PTR;
+
+		m = slob_alloc(size + align, gfp, align);
 		if (m)
 			*m = size;
 		return (void *)m + align;
@@ -411,7 +414,7 @@ void kfree(const void *block)
 {
 	struct slob_page *sp;
 
-	if (!block)
+	if (ZERO_OR_NULL_PTR(block))
 		return;
 
 	sp = (struct slob_page *)virt_to_page(block);
@@ -430,7 +433,7 @@ size_t ksize(const void *block)
 {
 	struct slob_page *sp;
 
-	if (!block)
+	if (ZERO_OR_NULL_PTR(block))
 		return 0;
 
 	sp = (struct slob_page *)virt_to_page(block);
Index: linux-2.6.22-rc4-mm2/mm/util.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/mm/util.c	2007-06-17 19:10:33.000000000 -0700
+++ linux-2.6.22-rc4-mm2/mm/util.c	2007-06-17 20:35:03.000000000 -0700
@@ -99,7 +99,7 @@ void *krealloc(const void *p, size_t new
 
 	if (unlikely(!new_size)) {
 		kfree(p);
-		return NULL;
+		return ZERO_SIZE_PTR;
 	}
 
 	ks = ksize(p);
Index: linux-2.6.22-rc4-mm2/mm/slub.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/mm/slub.c	2007-06-17 19:10:33.000000000 -0700
+++ linux-2.6.22-rc4-mm2/mm/slub.c	2007-06-17 20:35:04.000000000 -0700
@@ -2279,10 +2279,11 @@ static struct kmem_cache *get_slab(size_
 	int index = kmalloc_index(size);
 
 	if (!index)
-		return NULL;
+		return ZERO_SIZE_PTR;
 
 	/* Allocation too large? */
-	BUG_ON(index < 0);
+	if (index < 0)
+		return NULL;
 
 #ifdef CONFIG_ZONE_DMA
 	if ((flags & SLUB_DMA)) {
@@ -2323,9 +2324,10 @@ void *__kmalloc(size_t size, gfp_t flags
 {
 	struct kmem_cache *s = get_slab(size, flags);
 
-	if (s)
-		return slab_alloc(s, flags, -1, __builtin_return_address(0));
-	return ZERO_SIZE_PTR;
+	if (ZERO_OR_NULL_PTR(s))
+		return s;
+
+	return slab_alloc(s, flags, -1, __builtin_return_address(0));
 }
 EXPORT_SYMBOL(__kmalloc);
 
@@ -2334,9 +2336,10 @@ void *__kmalloc_node(size_t size, gfp_t 
 {
 	struct kmem_cache *s = get_slab(size, flags);
 
-	if (s)
-		return slab_alloc(s, flags, node, __builtin_return_address(0));
-	return ZERO_SIZE_PTR;
+	if (ZERO_OR_NULL_PTR(s))
+		return s;
+
+	return slab_alloc(s, flags, node, __builtin_return_address(0));
 }
 EXPORT_SYMBOL(__kmalloc_node);
 #endif
@@ -2387,7 +2390,7 @@ void kfree(const void *x)
 	 * this comparison would be true for all "negative" pointers
 	 * (which would cover the whole upper half of the address space).
 	 */
-	if ((unsigned long)x <= (unsigned long)ZERO_SIZE_PTR)
+	if (ZERO_OR_NULL_PTR(x))
 		return;
 
 	page = virt_to_head_page(x);
@@ -2706,8 +2709,8 @@ void *__kmalloc_track_caller(size_t size
 {
 	struct kmem_cache *s = get_slab(size, gfpflags);
 
-	if (!s)
-		return ZERO_SIZE_PTR;
+	if (ZERO_OR_NULL_PTR(s))
+		return s;
 
 	return slab_alloc(s, gfpflags, -1, caller);
 }
@@ -2717,8 +2720,8 @@ void *__kmalloc_node_track_caller(size_t
 {
 	struct kmem_cache *s = get_slab(size, gfpflags);
 
-	if (!s)
-		return ZERO_SIZE_PTR;
+	if (ZERO_OR_NULL_PTR(s))
+		return s;
 
 	return slab_alloc(s, gfpflags, node, caller);
 }

-- 


* [patch 04/26] Slab allocators: Support __GFP_ZERO in all allocators.
  2007-06-18  9:58 [patch 00/26] Current slab allocator / SLUB patch queue clameter
                   ` (2 preceding siblings ...)
  2007-06-18  9:58 ` [patch 03/26] Slab allocators: Consistent ZERO_SIZE_PTR support and NULL result semantics clameter
@ 2007-06-18  9:58 ` clameter
  2007-06-18 10:09   ` Paul Mundt
  2007-06-18 20:11   ` Pekka Enberg
  2007-06-18  9:58 ` [patch 05/26] Slab allocators: Cleanup zeroing allocations clameter
                   ` (22 subsequent siblings)
  26 siblings, 2 replies; 73+ messages in thread
From: clameter @ 2007-06-18  9:58 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-mm, Pekka Enberg, suresh.b.siddha

[-- Attachment #1: slab_gfpzero --]
[-- Type: text/plain, Size: 5989 bytes --]

A kernel convention for many allocators is that if __GFP_ZERO is passed to
an allocator then the allocated memory should be zeroed.

This is currently not supported by the slab allocators. The inconsistency
makes it difficult to implement zeroing in derived allocators such as the
uncached allocator and the pool allocators.

In addition, the support for zeroed allocations in the slab allocators does
not have a consistent API. There are no zeroing allocator functions for NUMA
node placement (kmalloc_node, kmem_cache_alloc_node). Zeroing allocations are
only provided for the default allocators (kzalloc, kmem_cache_zalloc).
__GFP_ZERO makes zeroing universally available and does not require any
additional functions.

So add the necessary logic to all slab allocators to support __GFP_ZERO.

The code is added to the hot path. The gfp flags are on the stack
and so the cacheline is readily available for checking if we want a zeroed
object.

Zeroing while allocating is now a frequent operation and we seem to be
gradually approaching 1-1 parity between zeroing and non-zeroing allocations.
The current tree has 3476 uses of kmalloc vs 2731 uses of kzalloc.
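
As an illustration of the API gap this fills: there is no kzalloc_node() or
kmem_cache_zalloc_node(), but with __GFP_ZERO a zeroed node-local allocation
needs no new entry point. The helper below is hypothetical and only shows the
calling convention:

	#include <linux/slab.h>

	/* Zeroed, node-local allocation without a dedicated zeroing function. */
	static void *alloc_zeroed_on_node(size_t size, int node)
	{
		return kmalloc_node(size, GFP_KERNEL | __GFP_ZERO, node);
	}

The same applies to kmem_cache_alloc_node(cache, GFP_KERNEL | __GFP_ZERO, node).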

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 mm/slab.c |    8 +++++++-
 mm/slob.c |    2 ++
 mm/slub.c |   24 +++++++++++++++---------
 3 files changed, 24 insertions(+), 10 deletions(-)

Index: linux-2.6.22-rc4-mm2/mm/slob.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/mm/slob.c	2007-06-17 22:30:01.000000000 -0700
+++ linux-2.6.22-rc4-mm2/mm/slob.c	2007-06-17 22:30:38.000000000 -0700
@@ -293,6 +293,8 @@ static void *slob_alloc(size_t size, gfp
 		BUG_ON(!b);
 		spin_unlock_irqrestore(&slob_lock, flags);
 	}
+	if (unlikely((gfp & __GFP_ZERO) && b))
+		memset(b, 0, size);
 	return b;
 }
 
Index: linux-2.6.22-rc4-mm2/mm/slub.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/mm/slub.c	2007-06-17 22:30:01.000000000 -0700
+++ linux-2.6.22-rc4-mm2/mm/slub.c	2007-06-17 22:31:41.000000000 -0700
@@ -1087,7 +1087,7 @@ static struct page *new_slab(struct kmem
 	void *last;
 	void *p;
 
-	BUG_ON(flags & ~(GFP_DMA | GFP_LEVEL_MASK));
+	BUG_ON(flags & ~(GFP_DMA | __GFP_ZERO | GFP_LEVEL_MASK));
 
 	if (flags & __GFP_WAIT)
 		local_irq_enable();
@@ -1550,7 +1550,7 @@ debug:
  * Otherwise we can simply pick the next object from the lockless free list.
  */
 static void __always_inline *slab_alloc(struct kmem_cache *s,
-				gfp_t gfpflags, int node, void *addr)
+		gfp_t gfpflags, int node, void *addr, int length)
 {
 	struct page *page;
 	void **object;
@@ -1568,19 +1568,25 @@ static void __always_inline *slab_alloc(
 		page->lockless_freelist = object[page->offset];
 	}
 	local_irq_restore(flags);
+
+	if (unlikely((gfpflags & __GFP_ZERO) && object))
+		memset(object, 0, length);
+
 	return object;
 }
 
 void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags)
 {
-	return slab_alloc(s, gfpflags, -1, __builtin_return_address(0));
+	return slab_alloc(s, gfpflags, -1,
+			__builtin_return_address(0), s->objsize);
 }
 EXPORT_SYMBOL(kmem_cache_alloc);
 
 #ifdef CONFIG_NUMA
 void *kmem_cache_alloc_node(struct kmem_cache *s, gfp_t gfpflags, int node)
 {
-	return slab_alloc(s, gfpflags, node, __builtin_return_address(0));
+	return slab_alloc(s, gfpflags, node,
+		__builtin_return_address(0), s->objsize);
 }
 EXPORT_SYMBOL(kmem_cache_alloc_node);
 #endif
@@ -2327,7 +2333,7 @@ void *__kmalloc(size_t size, gfp_t flags
 	if (ZERO_OR_NULL_PTR(s))
 		return s;
 
-	return slab_alloc(s, flags, -1, __builtin_return_address(0));
+	return slab_alloc(s, flags, -1, __builtin_return_address(0), size);
 }
 EXPORT_SYMBOL(__kmalloc);
 
@@ -2339,7 +2345,7 @@ void *__kmalloc_node(size_t size, gfp_t 
 	if (ZERO_OR_NULL_PTR(s))
 		return s;
 
-	return slab_alloc(s, flags, node, __builtin_return_address(0));
+	return slab_alloc(s, flags, node, __builtin_return_address(0), size);
 }
 EXPORT_SYMBOL(__kmalloc_node);
 #endif
@@ -2662,7 +2668,7 @@ void *kmem_cache_zalloc(struct kmem_cach
 {
 	void *x;
 
-	x = slab_alloc(s, flags, -1, __builtin_return_address(0));
+	x = slab_alloc(s, flags, -1, __builtin_return_address(0), 0);
 	if (x)
 		memset(x, 0, s->objsize);
 	return x;
@@ -2712,7 +2718,7 @@ void *__kmalloc_track_caller(size_t size
 	if (ZERO_OR_NULL_PTR(s))
 		return s;
 
-	return slab_alloc(s, gfpflags, -1, caller);
+	return slab_alloc(s, gfpflags, -1, caller, size);
 }
 
 void *__kmalloc_node_track_caller(size_t size, gfp_t gfpflags,
@@ -2723,7 +2729,7 @@ void *__kmalloc_node_track_caller(size_t
 	if (ZERO_OR_NULL_PTR(s))
 		return s;
 
-	return slab_alloc(s, gfpflags, node, caller);
+	return slab_alloc(s, gfpflags, node, caller, size);
 }
 
 #if defined(CONFIG_SYSFS) && defined(CONFIG_SLUB_DEBUG)
Index: linux-2.6.22-rc4-mm2/mm/slab.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/mm/slab.c	2007-06-17 22:30:01.000000000 -0700
+++ linux-2.6.22-rc4-mm2/mm/slab.c	2007-06-17 22:30:38.000000000 -0700
@@ -2734,7 +2734,7 @@ static int cache_grow(struct kmem_cache 
 	 * Be lazy and only check for valid flags here,  keeping it out of the
 	 * critical path in kmem_cache_alloc().
 	 */
-	BUG_ON(flags & ~(GFP_DMA | GFP_LEVEL_MASK));
+	BUG_ON(flags & ~(GFP_DMA | __GFP_ZERO | GFP_LEVEL_MASK));
 
 	local_flags = (flags & GFP_LEVEL_MASK);
 	/* Take the l3 list lock to change the colour_next on this node */
@@ -3380,6 +3380,9 @@ __cache_alloc_node(struct kmem_cache *ca
 	local_irq_restore(save_flags);
 	ptr = cache_alloc_debugcheck_after(cachep, flags, ptr, caller);
 
+	if (unlikely((flags & __GFP_ZERO) && ptr))
+		memset(ptr, 0, cachep->buffer_size);
+
 	return ptr;
 }
 
@@ -3431,6 +3434,9 @@ __cache_alloc(struct kmem_cache *cachep,
 	objp = cache_alloc_debugcheck_after(cachep, flags, objp, caller);
 	prefetchw(objp);
 
+	if (unlikely((flags & __GFP_ZERO) && objp))
+		memset(objp, 0, cachep->buffer_size);
+
 	return objp;
 }
 

-- 


* [patch 05/26] Slab allocators: Cleanup zeroing allocations
  2007-06-18  9:58 [patch 00/26] Current slab allocator / SLUB patch queue clameter
                   ` (3 preceding siblings ...)
  2007-06-18  9:58 ` [patch 04/26] Slab allocators: Support __GFP_ZERO in all allocators clameter
@ 2007-06-18  9:58 ` clameter
  2007-06-18 20:16   ` Pekka Enberg
  2007-06-19 21:00   ` Matt Mackall
  2007-06-18  9:58 ` [patch 06/26] Slab allocators: Replace explicit zeroing with __GFP_ZERO clameter
                   ` (21 subsequent siblings)
  26 siblings, 2 replies; 73+ messages in thread
From: clameter @ 2007-06-18  9:58 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-mm, Pekka Enberg, suresh.b.siddha

[-- Attachment #1: slab_remove_shortcut --]
[-- Type: text/plain, Size: 8015 bytes --]

It now becomes easy to support the zeroing allocations with generic inline
functions in slab.h. Provide inline definitions to allow the continued use of
kzalloc, kmem_cache_zalloc etc., but remove the other definitions of zeroing
functions from the slab allocators and util.c.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 include/linux/slab.h     |   36 ++++++++++++++++++++++++------------
 include/linux/slab_def.h |   30 ------------------------------
 include/linux/slub_def.h |   13 -------------
 mm/slab.c                |   17 -----------------
 mm/slob.c                |   10 ----------
 mm/slub.c                |   11 -----------
 mm/util.c                |   14 --------------
 7 files changed, 24 insertions(+), 107 deletions(-)

Index: linux-2.6.22-rc4-mm2/include/linux/slab.h
===================================================================
--- linux-2.6.22-rc4-mm2.orig/include/linux/slab.h	2007-06-17 18:08:09.000000000 -0700
+++ linux-2.6.22-rc4-mm2/include/linux/slab.h	2007-06-17 18:11:59.000000000 -0700
@@ -58,7 +58,6 @@ struct kmem_cache *kmem_cache_create(con
 void kmem_cache_destroy(struct kmem_cache *);
 int kmem_cache_shrink(struct kmem_cache *);
 void *kmem_cache_alloc(struct kmem_cache *, gfp_t);
-void *kmem_cache_zalloc(struct kmem_cache *, gfp_t);
 void kmem_cache_free(struct kmem_cache *, void *);
 unsigned int kmem_cache_size(struct kmem_cache *);
 const char *kmem_cache_name(struct kmem_cache *);
@@ -105,7 +104,6 @@ static inline void *kmem_cache_alloc_nod
  * Common kmalloc functions provided by all allocators
  */
 void *__kmalloc(size_t, gfp_t);
-void *__kzalloc(size_t, gfp_t);
 void * __must_check krealloc(const void *, size_t, gfp_t);
 void kfree(const void *);
 size_t ksize(const void *);
@@ -120,7 +118,7 @@ static inline void *kcalloc(size_t n, si
 {
 	if (n != 0 && size > ULONG_MAX / n)
 		return NULL;
-	return __kzalloc(n * size, flags);
+	return __kmalloc(n * size, flags | __GFP_ZERO);
 }
 
 /*
@@ -192,15 +190,6 @@ static inline void *kmalloc(size_t size,
 	return __kmalloc(size, flags);
 }
 
-/**
- * kzalloc - allocate memory. The memory is set to zero.
- * @size: how many bytes of memory are required.
- * @flags: the type of memory to allocate (see kmalloc).
- */
-static inline void *kzalloc(size_t size, gfp_t flags)
-{
-	return __kzalloc(size, flags);
-}
 #endif
 
 #ifndef CONFIG_NUMA
@@ -258,6 +247,29 @@ extern void *__kmalloc_node_track_caller
 
 #endif /* DEBUG_SLAB */
 
+/*
+ * Shortcuts
+ */
+static inline void *kmem_cache_zalloc(struct kmem_cache *k, gfp_t flags)
+{
+	return kmem_cache_alloc(k, flags | __GFP_ZERO);
+}
+
+static inline void *__kzalloc(int size, gfp_t flags)
+{
+	return kmalloc(size, flags | __GFP_ZERO);
+}
+
+/**
+ * kzalloc - allocate memory. The memory is set to zero.
+ * @size: how many bytes of memory are required.
+ * @flags: the type of memory to allocate (see kmalloc).
+ */
+static inline void *kzalloc(size_t size, gfp_t flags)
+{
+	return kmalloc(size, flags | __GFP_ZERO);
+}
+
 #endif	/* __KERNEL__ */
 #endif	/* _LINUX_SLAB_H */
 
Index: linux-2.6.22-rc4-mm2/include/linux/slab_def.h
===================================================================
--- linux-2.6.22-rc4-mm2.orig/include/linux/slab_def.h	2007-06-17 18:08:09.000000000 -0700
+++ linux-2.6.22-rc4-mm2/include/linux/slab_def.h	2007-06-17 18:11:59.000000000 -0700
@@ -55,36 +55,6 @@ found:
 	return __kmalloc(size, flags);
 }
 
-static inline void *kzalloc(size_t size, gfp_t flags)
-{
-	if (__builtin_constant_p(size)) {
-		int i = 0;
-
-		if (!size)
-			return ZERO_SIZE_PTR;
-
-#define CACHE(x) \
-		if (size <= x) \
-			goto found; \
-		else \
-			i++;
-#include "kmalloc_sizes.h"
-#undef CACHE
-		{
-			extern void __you_cannot_kzalloc_that_much(void);
-			__you_cannot_kzalloc_that_much();
-		}
-found:
-#ifdef CONFIG_ZONE_DMA
-		if (flags & GFP_DMA)
-			return kmem_cache_zalloc(malloc_sizes[i].cs_dmacachep,
-						flags);
-#endif
-		return kmem_cache_zalloc(malloc_sizes[i].cs_cachep, flags);
-	}
-	return __kzalloc(size, flags);
-}
-
 #ifdef CONFIG_NUMA
 extern void *__kmalloc_node(size_t size, gfp_t flags, int node);
 
Index: linux-2.6.22-rc4-mm2/include/linux/slub_def.h
===================================================================
--- linux-2.6.22-rc4-mm2.orig/include/linux/slub_def.h	2007-06-17 18:08:09.000000000 -0700
+++ linux-2.6.22-rc4-mm2/include/linux/slub_def.h	2007-06-17 18:11:59.000000000 -0700
@@ -173,19 +173,6 @@ static inline void *kmalloc(size_t size,
 		return __kmalloc(size, flags);
 }
 
-static inline void *kzalloc(size_t size, gfp_t flags)
-{
-	if (__builtin_constant_p(size) && !(flags & SLUB_DMA)) {
-		struct kmem_cache *s = kmalloc_slab(size);
-
-		if (!s)
-			return ZERO_SIZE_PTR;
-
-		return kmem_cache_zalloc(s, flags);
-	} else
-		return __kzalloc(size, flags);
-}
-
 #ifdef CONFIG_NUMA
 extern void *__kmalloc_node(size_t size, gfp_t flags, int node);
 
Index: linux-2.6.22-rc4-mm2/mm/slab.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/mm/slab.c	2007-06-17 18:11:55.000000000 -0700
+++ linux-2.6.22-rc4-mm2/mm/slab.c	2007-06-17 18:11:59.000000000 -0700
@@ -3578,23 +3578,6 @@ void *kmem_cache_alloc(struct kmem_cache
 EXPORT_SYMBOL(kmem_cache_alloc);
 
 /**
- * kmem_cache_zalloc - Allocate an object. The memory is set to zero.
- * @cache: The cache to allocate from.
- * @flags: See kmalloc().
- *
- * Allocate an object from this cache and set the allocated memory to zero.
- * The flags are only relevant if the cache has no available objects.
- */
-void *kmem_cache_zalloc(struct kmem_cache *cache, gfp_t flags)
-{
-	void *ret = __cache_alloc(cache, flags, __builtin_return_address(0));
-	if (ret)
-		memset(ret, 0, obj_size(cache));
-	return ret;
-}
-EXPORT_SYMBOL(kmem_cache_zalloc);
-
-/**
  * kmem_ptr_validate - check if an untrusted pointer might
  *	be a slab entry.
  * @cachep: the cache we're checking against
Index: linux-2.6.22-rc4-mm2/mm/slob.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/mm/slob.c	2007-06-17 18:10:47.000000000 -0700
+++ linux-2.6.22-rc4-mm2/mm/slob.c	2007-06-17 18:11:59.000000000 -0700
@@ -505,16 +505,6 @@ void *kmem_cache_alloc(struct kmem_cache
 }
 EXPORT_SYMBOL(kmem_cache_alloc);
 
-void *kmem_cache_zalloc(struct kmem_cache *c, gfp_t flags)
-{
-	void *ret = kmem_cache_alloc(c, flags);
-	if (ret)
-		memset(ret, 0, c->size);
-
-	return ret;
-}
-EXPORT_SYMBOL(kmem_cache_zalloc);
-
 static void __kmem_cache_free(void *b, int size)
 {
 	if (size < PAGE_SIZE)
Index: linux-2.6.22-rc4-mm2/mm/slub.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/mm/slub.c	2007-06-17 18:10:47.000000000 -0700
+++ linux-2.6.22-rc4-mm2/mm/slub.c	2007-06-17 18:11:59.000000000 -0700
@@ -2664,17 +2664,6 @@ err:
 }
 EXPORT_SYMBOL(kmem_cache_create);
 
-void *kmem_cache_zalloc(struct kmem_cache *s, gfp_t flags)
-{
-	void *x;
-
-	x = slab_alloc(s, flags, -1, __builtin_return_address(0), 0);
-	if (x)
-		memset(x, 0, s->objsize);
-	return x;
-}
-EXPORT_SYMBOL(kmem_cache_zalloc);
-
 #ifdef CONFIG_SMP
 /*
  * Use the cpu notifier to insure that the cpu slabs are flushed when
Index: linux-2.6.22-rc4-mm2/mm/util.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/mm/util.c	2007-06-17 18:08:09.000000000 -0700
+++ linux-2.6.22-rc4-mm2/mm/util.c	2007-06-17 18:11:59.000000000 -0700
@@ -5,20 +5,6 @@
 #include <asm/uaccess.h>
 
 /**
- * __kzalloc - allocate memory. The memory is set to zero.
- * @size: how many bytes of memory are required.
- * @flags: the type of memory to allocate.
- */
-void *__kzalloc(size_t size, gfp_t flags)
-{
-	void *ret = kmalloc_track_caller(size, flags);
-	if (ret)
-		memset(ret, 0, size);
-	return ret;
-}
-EXPORT_SYMBOL(__kzalloc);
-
-/**
  * kstrdup - allocate space for and copy an existing string
  * @s: the string to duplicate
  * @gfp: the GFP mask used in the kmalloc() call when allocating memory

-- 


* [patch 06/26] Slab allocators: Replace explicit zeroing with __GFP_ZERO
  2007-06-18  9:58 [patch 00/26] Current slab allocator / SLUB patch queue clameter
                   ` (4 preceding siblings ...)
  2007-06-18  9:58 ` [patch 05/26] Slab allocators: Cleanup zeroing allocations clameter
@ 2007-06-18  9:58 ` clameter
  2007-06-19 20:55   ` Pekka Enberg
  2007-06-28  6:09   ` Andrew Morton
  2007-06-18  9:58 ` [patch 07/26] SLUB: Add some more inlines and #ifdef CONFIG_SLUB_DEBUG clameter
                   ` (20 subsequent siblings)
  26 siblings, 2 replies; 73+ messages in thread
From: clameter @ 2007-06-18  9:58 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-mm, Pekka Enberg, suresh.b.siddha

[-- Attachment #1: slab_use_gfpzero_for_kmalloc_node --]
[-- Type: text/plain, Size: 10987 bytes --]

kmalloc_node() and kmem_cache_alloc_node() were not available in zeroing
variants in the past. But with __GFP_ZERO it is now possible to zero while
allocating.

Use __GFP_ZERO to remove the explicit clearing of memory via memset wherever
we can.
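
The conversions below all follow the same mechanical pattern (sketched here
with placeholder names; ptr and node stand in for the real variables):

	/* Before: allocate, check, then clear by hand. */
	ptr = kmalloc_node(sizeof(*ptr), GFP_KERNEL, node);
	if (!ptr)
		return NULL;
	memset(ptr, 0, sizeof(*ptr));

	/* After: let the allocator zero the object. */
	ptr = kmalloc_node(sizeof(*ptr), GFP_KERNEL | __GFP_ZERO, node);
	if (!ptr)
		return NULL;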

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 block/as-iosched.c       |    3 +--
 block/cfq-iosched.c      |   18 +++++++++---------
 block/deadline-iosched.c |    3 +--
 block/elevator.c         |    3 +--
 block/genhd.c            |    8 ++++----
 block/ll_rw_blk.c        |    4 ++--
 drivers/ide/ide-probe.c  |    4 ++--
 kernel/timer.c           |    4 ++--
 lib/genalloc.c           |    3 +--
 mm/allocpercpu.c         |    9 +++------
 mm/mempool.c             |    3 +--
 mm/vmalloc.c             |    6 +++---
 12 files changed, 30 insertions(+), 38 deletions(-)

Index: linux-2.6.22-rc4-mm2/block/as-iosched.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/block/as-iosched.c	2007-06-17 15:46:35.000000000 -0700
+++ linux-2.6.22-rc4-mm2/block/as-iosched.c	2007-06-17 15:46:59.000000000 -0700
@@ -1322,10 +1322,9 @@ static void *as_init_queue(request_queue
 {
 	struct as_data *ad;
 
-	ad = kmalloc_node(sizeof(*ad), GFP_KERNEL, q->node);
+	ad = kmalloc_node(sizeof(*ad), GFP_KERNEL | __GFP_ZERO, q->node);
 	if (!ad)
 		return NULL;
-	memset(ad, 0, sizeof(*ad));
 
 	ad->q = q; /* Identify what queue the data belongs to */
 
Index: linux-2.6.22-rc4-mm2/block/cfq-iosched.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/block/cfq-iosched.c	2007-06-17 15:42:50.000000000 -0700
+++ linux-2.6.22-rc4-mm2/block/cfq-iosched.c	2007-06-17 15:47:21.000000000 -0700
@@ -1249,9 +1249,9 @@ cfq_alloc_io_context(struct cfq_data *cf
 {
 	struct cfq_io_context *cic;
 
-	cic = kmem_cache_alloc_node(cfq_ioc_pool, gfp_mask, cfqd->queue->node);
+	cic = kmem_cache_alloc_node(cfq_ioc_pool, gfp_mask | __GFP_ZERO,
+							cfqd->queue->node);
 	if (cic) {
-		memset(cic, 0, sizeof(*cic));
 		cic->last_end_request = jiffies;
 		INIT_LIST_HEAD(&cic->queue_list);
 		cic->dtor = cfq_free_io_context;
@@ -1374,17 +1374,19 @@ retry:
 			 * free memory.
 			 */
 			spin_unlock_irq(cfqd->queue->queue_lock);
-			new_cfqq = kmem_cache_alloc_node(cfq_pool, gfp_mask|__GFP_NOFAIL, cfqd->queue->node);
+			new_cfqq = kmem_cache_alloc_node(cfq_pool,
+					gfp_mask | __GFP_NOFAIL | __GFP_ZERO,
+					cfqd->queue->node);
 			spin_lock_irq(cfqd->queue->queue_lock);
 			goto retry;
 		} else {
-			cfqq = kmem_cache_alloc_node(cfq_pool, gfp_mask, cfqd->queue->node);
+			cfqq = kmem_cache_alloc_node(cfq_pool,
+					gfp_mask | __GFP_ZERO,
+					cfqd->queue->node);
 			if (!cfqq)
 				goto out;
 		}
 
-		memset(cfqq, 0, sizeof(*cfqq));
-
 		RB_CLEAR_NODE(&cfqq->rb_node);
 		INIT_LIST_HEAD(&cfqq->fifo);
 
@@ -2046,12 +2048,10 @@ static void *cfq_init_queue(request_queu
 {
 	struct cfq_data *cfqd;
 
-	cfqd = kmalloc_node(sizeof(*cfqd), GFP_KERNEL, q->node);
+	cfqd = kmalloc_node(sizeof(*cfqd), GFP_KERNEL | __GFP_ZERO, q->node);
 	if (!cfqd)
 		return NULL;
 
-	memset(cfqd, 0, sizeof(*cfqd));
-
 	cfqd->service_tree = CFQ_RB_ROOT;
 	INIT_LIST_HEAD(&cfqd->cic_list);
 
Index: linux-2.6.22-rc4-mm2/block/deadline-iosched.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/block/deadline-iosched.c	2007-06-17 15:47:37.000000000 -0700
+++ linux-2.6.22-rc4-mm2/block/deadline-iosched.c	2007-06-17 15:47:47.000000000 -0700
@@ -360,10 +360,9 @@ static void *deadline_init_queue(request
 {
 	struct deadline_data *dd;
 
-	dd = kmalloc_node(sizeof(*dd), GFP_KERNEL, q->node);
+	dd = kmalloc_node(sizeof(*dd), GFP_KERNEL | __GFP_ZERO, q->node);
 	if (!dd)
 		return NULL;
-	memset(dd, 0, sizeof(*dd));
 
 	INIT_LIST_HEAD(&dd->fifo_list[READ]);
 	INIT_LIST_HEAD(&dd->fifo_list[WRITE]);
Index: linux-2.6.22-rc4-mm2/block/elevator.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/block/elevator.c	2007-06-17 15:47:57.000000000 -0700
+++ linux-2.6.22-rc4-mm2/block/elevator.c	2007-06-17 15:48:07.000000000 -0700
@@ -177,11 +177,10 @@ static elevator_t *elevator_alloc(reques
 	elevator_t *eq;
 	int i;
 
-	eq = kmalloc_node(sizeof(elevator_t), GFP_KERNEL, q->node);
+	eq = kmalloc_node(sizeof(elevator_t), GFP_KERNEL | __GFP_ZERO, q->node);
 	if (unlikely(!eq))
 		goto err;
 
-	memset(eq, 0, sizeof(*eq));
 	eq->ops = &e->ops;
 	eq->elevator_type = e;
 	kobject_init(&eq->kobj);
Index: linux-2.6.22-rc4-mm2/block/genhd.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/block/genhd.c	2007-06-17 15:48:27.000000000 -0700
+++ linux-2.6.22-rc4-mm2/block/genhd.c	2007-06-17 15:49:03.000000000 -0700
@@ -726,21 +726,21 @@ struct gendisk *alloc_disk_node(int mino
 {
 	struct gendisk *disk;
 
-	disk = kmalloc_node(sizeof(struct gendisk), GFP_KERNEL, node_id);
+	disk = kmalloc_node(sizeof(struct gendisk),
+				GFP_KERNEL | __GFP_ZERO, node_id);
 	if (disk) {
-		memset(disk, 0, sizeof(struct gendisk));
 		if (!init_disk_stats(disk)) {
 			kfree(disk);
 			return NULL;
 		}
 		if (minors > 1) {
 			int size = (minors - 1) * sizeof(struct hd_struct *);
-			disk->part = kmalloc_node(size, GFP_KERNEL, node_id);
+			disk->part = kmalloc_node(size,
+				GFP_KERNEL | __GFP_ZERO, node_id);
 			if (!disk->part) {
 				kfree(disk);
 				return NULL;
 			}
-			memset(disk->part, 0, size);
 		}
 		disk->minors = minors;
 		kobj_set_kset_s(disk,block_subsys);
Index: linux-2.6.22-rc4-mm2/block/ll_rw_blk.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/block/ll_rw_blk.c	2007-06-17 15:44:27.000000000 -0700
+++ linux-2.6.22-rc4-mm2/block/ll_rw_blk.c	2007-06-17 15:45:03.000000000 -0700
@@ -1828,11 +1828,11 @@ request_queue_t *blk_alloc_queue_node(gf
 {
 	request_queue_t *q;
 
-	q = kmem_cache_alloc_node(requestq_cachep, gfp_mask, node_id);
+	q = kmem_cache_alloc_node(requestq_cachep,
+				gfp_mask | __GFP_ZERO, node_id);
 	if (!q)
 		return NULL;
 
-	memset(q, 0, sizeof(*q));
 	init_timer(&q->unplug_timer);
 
 	snprintf(q->kobj.name, KOBJ_NAME_LEN, "%s", "queue");
Index: linux-2.6.22-rc4-mm2/drivers/ide/ide-probe.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/drivers/ide/ide-probe.c	2007-06-17 15:49:57.000000000 -0700
+++ linux-2.6.22-rc4-mm2/drivers/ide/ide-probe.c	2007-06-17 15:50:13.000000000 -0700
@@ -1073,14 +1073,14 @@ static int init_irq (ide_hwif_t *hwif)
 		hwgroup->hwif->next = hwif;
 		spin_unlock_irq(&ide_lock);
 	} else {
-		hwgroup = kmalloc_node(sizeof(ide_hwgroup_t), GFP_KERNEL,
+		hwgroup = kmalloc_node(sizeof(ide_hwgroup_t),
+					GFP_KERNEL | __GFP_ZERO,
 					hwif_to_node(hwif->drives[0].hwif));
 		if (!hwgroup)
 	       		goto out_up;
 
 		hwif->hwgroup = hwgroup;
 
-		memset(hwgroup, 0, sizeof(ide_hwgroup_t));
 		hwgroup->hwif     = hwif->next = hwif;
 		hwgroup->rq       = NULL;
 		hwgroup->handler  = NULL;
Index: linux-2.6.22-rc4-mm2/kernel/timer.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/kernel/timer.c	2007-06-17 15:50:50.000000000 -0700
+++ linux-2.6.22-rc4-mm2/kernel/timer.c	2007-06-17 15:51:16.000000000 -0700
@@ -1221,7 +1221,8 @@ static int __devinit init_timers_cpu(int
 			/*
 			 * The APs use this path later in boot
 			 */
-			base = kmalloc_node(sizeof(*base), GFP_KERNEL,
+			base = kmalloc_node(sizeof(*base),
+						GFP_KERNEL | __GFP_ZERO,
 						cpu_to_node(cpu));
 			if (!base)
 				return -ENOMEM;
@@ -1232,7 +1233,6 @@ static int __devinit init_timers_cpu(int
 				kfree(base);
 				return -ENOMEM;
 			}
-			memset(base, 0, sizeof(*base));
 			per_cpu(tvec_bases, cpu) = base;
 		} else {
 			/*
Index: linux-2.6.22-rc4-mm2/lib/genalloc.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/lib/genalloc.c	2007-06-17 15:51:38.000000000 -0700
+++ linux-2.6.22-rc4-mm2/lib/genalloc.c	2007-06-17 15:51:56.000000000 -0700
@@ -54,11 +54,10 @@ int gen_pool_add(struct gen_pool *pool, 
 	int nbytes = sizeof(struct gen_pool_chunk) +
 				(nbits + BITS_PER_BYTE - 1) / BITS_PER_BYTE;
 
-	chunk = kmalloc_node(nbytes, GFP_KERNEL, nid);
+	chunk = kmalloc_node(nbytes, GFP_KERNEL | __GFP_ZERO, nid);
 	if (unlikely(chunk == NULL))
 		return -1;
 
-	memset(chunk, 0, nbytes);
 	spin_lock_init(&chunk->lock);
 	chunk->start_addr = addr;
 	chunk->end_addr = addr + size;
Index: linux-2.6.22-rc4-mm2/mm/allocpercpu.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/mm/allocpercpu.c	2007-06-17 15:52:19.000000000 -0700
+++ linux-2.6.22-rc4-mm2/mm/allocpercpu.c	2007-06-17 15:52:38.000000000 -0700
@@ -53,12 +53,9 @@ void *percpu_populate(void *__pdata, siz
 	int node = cpu_to_node(cpu);
 
 	BUG_ON(pdata->ptrs[cpu]);
-	if (node_online(node)) {
-		/* FIXME: kzalloc_node(size, gfp, node) */
-		pdata->ptrs[cpu] = kmalloc_node(size, gfp, node);
-		if (pdata->ptrs[cpu])
-			memset(pdata->ptrs[cpu], 0, size);
-	} else
+	if (node_online(node))
+		pdata->ptrs[cpu] = kmalloc_node(size, gfp|__GFP_ZERO, node);
+	else
 		pdata->ptrs[cpu] = kzalloc(size, gfp);
 	return pdata->ptrs[cpu];
 }
Index: linux-2.6.22-rc4-mm2/mm/mempool.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/mm/mempool.c	2007-06-17 15:52:52.000000000 -0700
+++ linux-2.6.22-rc4-mm2/mm/mempool.c	2007-06-17 15:53:19.000000000 -0700
@@ -62,10 +62,9 @@ mempool_t *mempool_create_node(int min_n
 			mempool_free_t *free_fn, void *pool_data, int node_id)
 {
 	mempool_t *pool;
-	pool = kmalloc_node(sizeof(*pool), GFP_KERNEL, node_id);
+	pool = kmalloc_node(sizeof(*pool), GFP_KERNEL | __GFP_ZERO, node_id);
 	if (!pool)
 		return NULL;
-	memset(pool, 0, sizeof(*pool));
 	pool->elements = kmalloc_node(min_nr * sizeof(void *),
 					GFP_KERNEL, node_id);
 	if (!pool->elements) {
Index: linux-2.6.22-rc4-mm2/mm/vmalloc.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/mm/vmalloc.c	2007-06-17 15:57:18.000000000 -0700
+++ linux-2.6.22-rc4-mm2/mm/vmalloc.c	2007-06-17 16:03:38.000000000 -0700
@@ -434,11 +434,12 @@ void *__vmalloc_area_node(struct vm_stru
 	area->nr_pages = nr_pages;
 	/* Please note that the recursion is strictly bounded. */
 	if (array_size > PAGE_SIZE) {
-		pages = __vmalloc_node(array_size, gfp_mask, PAGE_KERNEL, node);
+		pages = __vmalloc_node(array_size, gfp_mask | __GFP_ZERO,
+					PAGE_KERNEL, node);
 		area->flags |= VM_VPAGES;
 	} else {
 		pages = kmalloc_node(array_size,
-				(gfp_mask & GFP_LEVEL_MASK),
+				(gfp_mask & GFP_LEVEL_MASK) | __GFP_ZERO,
 				node);
 	}
 	area->pages = pages;
@@ -447,7 +448,6 @@ void *__vmalloc_area_node(struct vm_stru
 		kfree(area);
 		return NULL;
 	}
-	memset(area->pages, 0, array_size);
 
 	for (i = 0; i < area->nr_pages; i++) {
 		if (node < 0)

-- 


* [patch 07/26] SLUB: Add some more inlines and #ifdef CONFIG_SLUB_DEBUG
  2007-06-18  9:58 [patch 00/26] Current slab allocator / SLUB patch queue clameter
                   ` (5 preceding siblings ...)
  2007-06-18  9:58 ` [patch 06/26] Slab allocators: Replace explicit zeroing with __GFP_ZERO clameter
@ 2007-06-18  9:58 ` clameter
  2007-06-18  9:58 ` [patch 08/26] SLUB: Extract dma_kmalloc_cache from get_cache clameter
                   ` (19 subsequent siblings)
  26 siblings, 0 replies; 73+ messages in thread
From: clameter @ 2007-06-18  9:58 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-mm, Pekka Enberg, suresh.b.siddha

[-- Attachment #1: slub_inlines_ifdefs --]
[-- Type: text/plain, Size: 2927 bytes --]

Add #ifdefs around data structures only needed if debugging is compiled
into SLUB.

Add inlines to small functions to reduce code size.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 include/linux/slub_def.h |    4 ++++
 mm/slub.c                |   13 +++++++------
 2 files changed, 11 insertions(+), 6 deletions(-)

Index: linux-2.6.22-rc4-mm2/mm/slub.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/mm/slub.c	2007-06-17 18:11:59.000000000 -0700
+++ linux-2.6.22-rc4-mm2/mm/slub.c	2007-06-17 18:12:04.000000000 -0700
@@ -259,9 +259,10 @@ static int sysfs_slab_add(struct kmem_ca
 static int sysfs_slab_alias(struct kmem_cache *, const char *);
 static void sysfs_slab_remove(struct kmem_cache *);
 #else
-static int sysfs_slab_add(struct kmem_cache *s) { return 0; }
-static int sysfs_slab_alias(struct kmem_cache *s, const char *p) { return 0; }
-static void sysfs_slab_remove(struct kmem_cache *s) {}
+static inline int sysfs_slab_add(struct kmem_cache *s) { return 0; }
+static inline int sysfs_slab_alias(struct kmem_cache *s, const char *p)
+							{ return 0; }
+static inline void sysfs_slab_remove(struct kmem_cache *s) {}
 #endif
 
 /********************************************************************
@@ -1405,7 +1406,7 @@ static void deactivate_slab(struct kmem_
 	unfreeze_slab(s, page);
 }
 
-static void flush_slab(struct kmem_cache *s, struct page *page, int cpu)
+static inline void flush_slab(struct kmem_cache *s, struct page *page, int cpu)
 {
 	slab_lock(page);
 	deactivate_slab(s, page, cpu);
@@ -1415,7 +1416,7 @@ static void flush_slab(struct kmem_cache
  * Flush cpu slab.
  * Called from IPI handler with interrupts disabled.
  */
-static void __flush_cpu_slab(struct kmem_cache *s, int cpu)
+static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu)
 {
 	struct page *page = s->cpu_slab[cpu];
 
@@ -2174,7 +2175,7 @@ static int free_list(struct kmem_cache *
 /*
  * Release all resources used by a slab cache.
  */
-static int kmem_cache_close(struct kmem_cache *s)
+static inline int kmem_cache_close(struct kmem_cache *s)
 {
 	int node;
 
Index: linux-2.6.22-rc4-mm2/include/linux/slub_def.h
===================================================================
--- linux-2.6.22-rc4-mm2.orig/include/linux/slub_def.h	2007-06-17 18:11:59.000000000 -0700
+++ linux-2.6.22-rc4-mm2/include/linux/slub_def.h	2007-06-17 18:12:04.000000000 -0700
@@ -16,7 +16,9 @@ struct kmem_cache_node {
 	unsigned long nr_partial;
 	atomic_long_t nr_slabs;
 	struct list_head partial;
+#ifdef CONFIG_SLUB_DEBUG
 	struct list_head full;
+#endif
 };
 
 /*
@@ -44,7 +46,9 @@ struct kmem_cache {
 	int align;		/* Alignment */
 	const char *name;	/* Name (only for display!) */
 	struct list_head list;	/* List of slab caches */
+#ifdef CONFIG_SLUB_DEBUG
 	struct kobject kobj;	/* For sysfs */
+#endif
 
 #ifdef CONFIG_NUMA
 	int defrag_ratio;

-- 


* [patch 08/26] SLUB: Extract dma_kmalloc_cache from get_cache.
  2007-06-18  9:58 [patch 00/26] Current slab allocator / SLUB patch queue clameter
                   ` (6 preceding siblings ...)
  2007-06-18  9:58 ` [patch 07/26] SLUB: Add some more inlines and #ifdef CONFIG_SLUB_DEBUG clameter
@ 2007-06-18  9:58 ` clameter
  2007-06-18  9:58 ` [patch 09/26] SLUB: Do proper locking during dma slab creation clameter
                   ` (18 subsequent siblings)
  26 siblings, 0 replies; 73+ messages in thread
From: clameter @ 2007-06-18  9:58 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-mm, Pekka Enberg, suresh.b.siddha

[-- Attachment #1: slub_extract_dma_cache --]
[-- Type: text/plain, Size: 2583 bytes --]

The rarely used dma functionality in get_slab() makes the function too
complex: the compiler begins to spill variables from the working set onto
the stack. The extracted function is only used in extremely rare cases, so
mark it noinline to make sure that the compiler does not decide on its own
to merge it back into get_slab().

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 mm/slub.c |   66 +++++++++++++++++++++++++++++++++-----------------------------
 1 file changed, 36 insertions(+), 30 deletions(-)

Index: linux-2.6.22-rc4-mm2/mm/slub.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/mm/slub.c	2007-06-17 18:12:04.000000000 -0700
+++ linux-2.6.22-rc4-mm2/mm/slub.c	2007-06-17 18:12:10.000000000 -0700
@@ -2281,6 +2281,40 @@ panic:
 	panic("Creation of kmalloc slab %s size=%d failed.\n", name, size);
 }
 
+#ifdef CONFIG_ZONE_DMA
+static noinline struct kmem_cache *dma_kmalloc_cache(int index, gfp_t flags)
+{
+	struct kmem_cache *s;
+	struct kmem_cache *x;
+	char *text;
+	size_t realsize;
+
+	s = kmalloc_caches_dma[index];
+	if (s)
+		return s;
+
+	/* Dynamically create dma cache */
+	x = kmalloc(kmem_size, flags & ~SLUB_DMA);
+	if (!x)
+		panic("Unable to allocate memory for dma cache\n");
+
+	if (index <= KMALLOC_SHIFT_HIGH)
+		realsize = 1 << index;
+	else {
+		if (index == 1)
+			realsize = 96;
+		else
+			realsize = 192;
+	}
+
+	text = kasprintf(flags & ~SLUB_DMA, "kmalloc_dma-%d",
+			(unsigned int)realsize);
+	s = create_kmalloc_cache(x, text, realsize, flags);
+	kmalloc_caches_dma[index] = s;
+	return s;
+}
+#endif
+
 static struct kmem_cache *get_slab(size_t size, gfp_t flags)
 {
 	int index = kmalloc_index(size);
@@ -2293,36 +2327,8 @@ static struct kmem_cache *get_slab(size_
 		return NULL;
 
 #ifdef CONFIG_ZONE_DMA
-	if ((flags & SLUB_DMA)) {
-		struct kmem_cache *s;
-		struct kmem_cache *x;
-		char *text;
-		size_t realsize;
-
-		s = kmalloc_caches_dma[index];
-		if (s)
-			return s;
-
-		/* Dynamically create dma cache */
-		x = kmalloc(kmem_size, flags & ~SLUB_DMA);
-		if (!x)
-			panic("Unable to allocate memory for dma cache\n");
-
-		if (index <= KMALLOC_SHIFT_HIGH)
-			realsize = 1 << index;
-		else {
-			if (index == 1)
-				realsize = 96;
-			else
-				realsize = 192;
-		}
-
-		text = kasprintf(flags & ~SLUB_DMA, "kmalloc_dma-%d",
-				(unsigned int)realsize);
-		s = create_kmalloc_cache(x, text, realsize, flags);
-		kmalloc_caches_dma[index] = s;
-		return s;
-	}
+	if ((flags & SLUB_DMA))
+		return dma_kmalloc_cache(index, flags);
 #endif
 	return &kmalloc_caches[index];
 }

-- 


* [patch 09/26] SLUB: Do proper locking during dma slab creation
  2007-06-18  9:58 [patch 00/26] Current slab allocator / SLUB patch queue clameter
                   ` (7 preceding siblings ...)
  2007-06-18  9:58 ` [patch 08/26] SLUB: Extract dma_kmalloc_cache from get_cache clameter
@ 2007-06-18  9:58 ` clameter
  2007-06-18  9:58 ` [patch 10/26] SLUB: Faster more efficient slab determination for __kmalloc clameter
                   ` (17 subsequent siblings)
  26 siblings, 0 replies; 73+ messages in thread
From: clameter @ 2007-06-18  9:58 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-mm, Pekka Enberg, suresh.b.siddha

[-- Attachment #1: slub_dma_create_lock --]
[-- Type: text/plain, Size: 1043 bytes --]

We modify the kmalloc_caches_dma[] array without proper locking. Take the
slub_lock for the update and undo the dma cache creation if another
processor has already created it.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 mm/slub.c |   11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

Index: linux-2.6.22-rc4-mm2/mm/slub.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/mm/slub.c	2007-06-17 18:12:10.000000000 -0700
+++ linux-2.6.22-rc4-mm2/mm/slub.c	2007-06-17 18:12:13.000000000 -0700
@@ -2310,8 +2310,15 @@ static struct kmem_cache *dma_kmalloc_ca
 	text = kasprintf(flags & ~SLUB_DMA, "kmalloc_dma-%d",
 			(unsigned int)realsize);
 	s = create_kmalloc_cache(x, text, realsize, flags);
-	kmalloc_caches_dma[index] = s;
-	return s;
+	down_write(&slub_lock);
+	if (!kmalloc_caches_dma[index]) {
+		kmalloc_caches_dma[index] = s;
+		up_write(&slub_lock);
+		return s;
+	}
+	up_write(&slub_lock);
+	kmem_cache_destroy(s);
+	return kmalloc_caches_dma[index];
 }
 #endif
 

-- 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* [patch 10/26] SLUB: Faster more efficient slab determination for __kmalloc.
  2007-06-18  9:58 [patch 00/26] Current slab allocator / SLUB patch queue clameter
                   ` (8 preceding siblings ...)
  2007-06-18  9:58 ` [patch 09/26] SLUB: Do proper locking during dma slab creation clameter
@ 2007-06-18  9:58 ` clameter
  2007-06-19 20:08   ` Andrew Morton
  2007-06-18  9:58 ` [patch 11/26] SLUB: Add support for kmem_cache_ops clameter
                   ` (16 subsequent siblings)
  26 siblings, 1 reply; 73+ messages in thread
From: clameter @ 2007-06-18  9:58 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-mm, Pekka Enberg, suresh.b.siddha

[-- Attachment #1: slub_faster_kmalloc_slab --]
[-- Type: text/plain, Size: 3637 bytes --]

kmalloc_index is a long series of comparisons. The attempt to replace
kmalloc_index with something more efficient like ilog2 failed due to
compiler issues with constant folding on gcc 3.3 / powerpc.

kmalloc_index()'s long list of comparisons works fine for constant folding
since all the comparisons are optimized away. However, SLUB also uses
kmalloc_index to determine the slab to use for the __kmalloc_xxx functions.
This leads to a large set of comparisons in get_slab().

This patch gets rid of that list of comparisons in get_slab():

1. If the requested size is larger than 192 then we can simply use
   fls to determine the slab index since all larger slabs are
   power-of-two sized.

2. If the requested size is 192 or smaller then we cannot use fls since
   there are non-power-of-two caches to be considered. However, the sizes
   are in a manageable range, so we divide the size by 8. That leaves only
   24 possibilities and we simply look up the kmalloc index
   in a table.

Code size of slub.o decreases by more than 200 bytes through this patch.
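
As a short illustration (not part of the patch, just a worked example),
this is how a 100 byte request resolves under the new scheme:

	size_t size = 100;
	int index;

	/* size <= 192: table lookup at (size - 1) / 8 = 12 */
	index = size_index[12];	/* = 7, i.e. the 128 byte kmalloc cache */

A request above 192 bytes instead takes the fls() path since all larger
caches are power-of-two sized.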

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 mm/slub.c |   73 +++++++++++++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 65 insertions(+), 8 deletions(-)

Index: linux-2.6.22-rc4-mm2/mm/slub.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/mm/slub.c	2007-06-17 18:12:13.000000000 -0700
+++ linux-2.6.22-rc4-mm2/mm/slub.c	2007-06-17 18:12:16.000000000 -0700
@@ -2322,20 +2322,59 @@ static struct kmem_cache *dma_kmalloc_ca
 }
 #endif
 
+/*
+ * Conversion table for small slabs sizes / 8 to the index in the
+ * kmalloc array. This is necessary for slabs < 192 since we have non power
+ * of two cache sizes there. The size of larger slabs can be determined using
+ * fls.
+ */
+static s8 size_index[24] = {
+	3,	/* 8 */
+	4,	/* 16 */
+	5,	/* 24 */
+	5,	/* 32 */
+	6,	/* 40 */
+	6,	/* 48 */
+	6,	/* 56 */
+	6,	/* 64 */
+	1,	/* 72 */
+	1,	/* 80 */
+	1,	/* 88 */
+	1,	/* 96 */
+	7,	/* 104 */
+	7,	/* 112 */
+	7,	/* 120 */
+	7,	/* 128 */
+	2,	/* 136 */
+	2,	/* 144 */
+	2,	/* 152 */
+	2,	/* 160 */
+	2,	/* 168 */
+	2,	/* 176 */
+	2,	/* 184 */
+	2	/* 192 */
+};
+
 static struct kmem_cache *get_slab(size_t size, gfp_t flags)
 {
-	int index = kmalloc_index(size);
+	int index;
 
-	if (!index)
-		return ZERO_SIZE_PTR;
+	if (size <= 192) {
+		if (!size)
+			return ZERO_SIZE_PTR;
 
-	/* Allocation too large? */
-	if (index < 0)
-		return NULL;
+		index = size_index[(size - 1) / 8];
+	} else {
+		if (size > KMALLOC_MAX_SIZE)
+			return NULL;
+
+		index = fls(size - 1) + 1;
+	}
 
 #ifdef CONFIG_ZONE_DMA
-	if ((flags & SLUB_DMA))
+	if (unlikely((flags & SLUB_DMA)))
 		return dma_kmalloc_cache(index, flags);
+
 #endif
 	return &kmalloc_caches[index];
 }
@@ -2550,6 +2589,24 @@ void __init kmem_cache_init(void)
 		caches++;
 	}
 
+
+	/*
+	 * Patch up the size_index table if we have strange large alignment
+	 * requirements for the kmalloc array. This is only the case for
+	 * mips it seems. The standard arches will not generate any code here.
+	 *
+	 * Largest permitted alignment is 256 bytes due to the way we
+	 * handle the index determination for the smaller caches.
+	 *
+	 * Make sure that nothing crazy happens if someone starts tinkering
+	 * around with ARCH_KMALLOC_MINALIGN
+	 */
+	BUG_ON(KMALLOC_MIN_SIZE > 256 ||
+		(KMALLOC_MIN_SIZE & (KMALLOC_MIN_SIZE - 1)));
+
+	for (i = 8; i < KMALLOC_MIN_SIZE;i++)
+		size_index[(i - 1) / 8] = KMALLOC_SHIFT_LOW;
+
 	slab_state = UP;
 
 	/* Provide the correct kmalloc names now that the caches are up */

-- 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* [patch 11/26] SLUB: Add support for kmem_cache_ops
  2007-06-18  9:58 [patch 00/26] Current slab allocator / SLUB patch queue clameter
                   ` (9 preceding siblings ...)
  2007-06-18  9:58 ` [patch 10/26] SLUB: Faster more efficient slab determination for __kmalloc clameter
@ 2007-06-18  9:58 ` clameter
  2007-06-19 20:58   ` Pekka Enberg
  2007-06-18  9:58 ` [patch 12/26] SLUB: Slab defragmentation core clameter
                   ` (15 subsequent siblings)
  26 siblings, 1 reply; 73+ messages in thread
From: clameter @ 2007-06-18  9:58 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-mm, Pekka Enberg, suresh.b.siddha

[-- Attachment #1: slab_defrag_kmem_cache_ops --]
[-- Type: text/plain, Size: 8559 bytes --]

We use the parameter formerly used by the destructor to pass an optional
pointer to a kmem_cache_ops structure to kmem_cache_create.

kmem_cache_ops is initially empty. Later patches will populate it.

Create a KMEM_CACHE_OPS macro that allows the specification of the
kmem_cache_ops.

Code to handle kmem_cache_ops is added to SLUB. SLAB and SLOB are updated
to be able to accept a kmem_cache_ops structure but will ignore it.
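
For illustration, a minimal sketch of how a cache would opt into the new
interface (the "foo" structure and cache are made up for this example):

	struct foo {
		int bar;
	};

	/* Empty for now; later patches add methods such as get() and kick() */
	static const struct kmem_cache_ops foo_ops = {
	};

	static struct kmem_cache *foo_cachep;

	static int __init foo_cache_init(void)
	{
		/* Like KMEM_CACHE() but passing a kmem_cache_ops pointer */
		foo_cachep = KMEM_CACHE_OPS(foo, SLAB_RECLAIM_ACCOUNT, &foo_ops);
		return foo_cachep ? 0 : -ENOMEM;
	}

Existing users of KMEM_CACHE() are unaffected since that macro now simply
expands to KMEM_CACHE_OPS(..., NULL).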

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 include/linux/slab.h     |   13 +++++++++----
 include/linux/slub_def.h |    1 +
 mm/slab.c                |    6 +++---
 mm/slob.c                |    2 +-
 mm/slub.c                |   44 ++++++++++++++++++++++++++++++--------------
 5 files changed, 44 insertions(+), 22 deletions(-)

Index: linux-2.6.22-rc4-mm2/include/linux/slab.h
===================================================================
--- linux-2.6.22-rc4-mm2.orig/include/linux/slab.h	2007-06-17 18:11:59.000000000 -0700
+++ linux-2.6.22-rc4-mm2/include/linux/slab.h	2007-06-17 18:12:19.000000000 -0700
@@ -51,10 +51,13 @@
 void __init kmem_cache_init(void);
 int slab_is_available(void);
 
+struct kmem_cache_ops {
+};
+
 struct kmem_cache *kmem_cache_create(const char *, size_t, size_t,
 			unsigned long,
 			void (*)(void *, struct kmem_cache *, unsigned long),
-			void (*)(void *, struct kmem_cache *, unsigned long));
+			const struct kmem_cache_ops *s);
 void kmem_cache_destroy(struct kmem_cache *);
 int kmem_cache_shrink(struct kmem_cache *);
 void *kmem_cache_alloc(struct kmem_cache *, gfp_t);
@@ -71,9 +74,11 @@ int kmem_ptr_validate(struct kmem_cache 
  * f.e. add ____cacheline_aligned_in_smp to the struct declaration
  * then the objects will be properly aligned in SMP configurations.
  */
-#define KMEM_CACHE(__struct, __flags) kmem_cache_create(#__struct,\
-		sizeof(struct __struct), __alignof__(struct __struct),\
-		(__flags), NULL, NULL)
+#define KMEM_CACHE_OPS(__struct, __flags, __ops) \
+	kmem_cache_create(#__struct, sizeof(struct __struct), \
+	__alignof__(struct __struct), (__flags), NULL, (__ops))
+
+#define KMEM_CACHE(__struct, __flags) KMEM_CACHE_OPS(__struct, __flags, NULL)
 
 #ifdef CONFIG_NUMA
 extern void *kmem_cache_alloc_node(struct kmem_cache *, gfp_t flags, int node);
Index: linux-2.6.22-rc4-mm2/mm/slub.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/mm/slub.c	2007-06-17 18:12:16.000000000 -0700
+++ linux-2.6.22-rc4-mm2/mm/slub.c	2007-06-17 18:12:19.000000000 -0700
@@ -300,6 +300,9 @@ static inline int check_valid_pointer(st
 	return 1;
 }
 
+struct kmem_cache_ops slub_default_ops = {
+};
+
 /*
  * Slow version of get and set free pointer.
  *
@@ -2081,11 +2084,13 @@ static int calculate_sizes(struct kmem_c
 static int kmem_cache_open(struct kmem_cache *s, gfp_t gfpflags,
 		const char *name, size_t size,
 		size_t align, unsigned long flags,
-		void (*ctor)(void *, struct kmem_cache *, unsigned long))
+		void (*ctor)(void *, struct kmem_cache *, unsigned long),
+		const struct kmem_cache_ops *ops)
 {
 	memset(s, 0, kmem_size);
 	s->name = name;
 	s->ctor = ctor;
+	s->ops = ops;
 	s->objsize = size;
 	s->flags = flags;
 	s->align = align;
@@ -2268,7 +2273,7 @@ static struct kmem_cache *create_kmalloc
 
 	down_write(&slub_lock);
 	if (!kmem_cache_open(s, gfp_flags, name, size, ARCH_KMALLOC_MINALIGN,
-			flags, NULL))
+			flags, NULL, &slub_default_ops))
 		goto panic;
 
 	list_add(&s->list, &slab_caches);
@@ -2645,12 +2650,16 @@ static int slab_unmergeable(struct kmem_
 	if (s->refcount < 0)
 		return 1;
 
+	if (s->ops != &slub_default_ops)
+		return 1;
+
 	return 0;
 }
 
 static struct kmem_cache *find_mergeable(size_t size,
 		size_t align, unsigned long flags,
-		void (*ctor)(void *, struct kmem_cache *, unsigned long))
+		void (*ctor)(void *, struct kmem_cache *, unsigned long),
+		const struct kmem_cache_ops *ops)
 {
 	struct kmem_cache *s;
 
@@ -2660,6 +2669,9 @@ static struct kmem_cache *find_mergeable
 	if (ctor)
 		return NULL;
 
+	if (ops != &slub_default_ops)
+		return NULL;
+
 	size = ALIGN(size, sizeof(void *));
 	align = calculate_alignment(flags, align, size);
 	size = ALIGN(size, align);
@@ -2692,13 +2704,15 @@ static struct kmem_cache *find_mergeable
 struct kmem_cache *kmem_cache_create(const char *name, size_t size,
 		size_t align, unsigned long flags,
 		void (*ctor)(void *, struct kmem_cache *, unsigned long),
-		void (*dtor)(void *, struct kmem_cache *, unsigned long))
+		const struct kmem_cache_ops *ops)
 {
 	struct kmem_cache *s;
 
-	BUG_ON(dtor);
+	if (!ops)
+		ops = &slub_default_ops;
+
 	down_write(&slub_lock);
-	s = find_mergeable(size, align, flags, ctor);
+	s = find_mergeable(size, align, flags, ctor, ops);
 	if (s) {
 		s->refcount++;
 		/*
@@ -2712,7 +2726,7 @@ struct kmem_cache *kmem_cache_create(con
 	} else {
 		s = kmalloc(kmem_size, GFP_KERNEL);
 		if (s && kmem_cache_open(s, GFP_KERNEL, name,
-				size, align, flags, ctor)) {
+				size, align, flags, ctor, ops)) {
 			if (sysfs_slab_add(s)) {
 				kfree(s);
 				goto err;
@@ -3323,16 +3337,18 @@ static ssize_t order_show(struct kmem_ca
 }
 SLAB_ATTR_RO(order);
 
-static ssize_t ctor_show(struct kmem_cache *s, char *buf)
+static ssize_t ops_show(struct kmem_cache *s, char *buf)
 {
-	if (s->ctor) {
-		int n = sprint_symbol(buf, (unsigned long)s->ctor);
+	int x = 0;
 
-		return n + sprintf(buf + n, "\n");
+	if (s->ctor) {
+		x += sprintf(buf + x, "ctor : ");
+		x += sprint_symbol(buf + x, (unsigned long)s->ctor);
+		x += sprintf(buf + x, "\n");
 	}
-	return 0;
+	return x;
 }
-SLAB_ATTR_RO(ctor);
+SLAB_ATTR_RO(ops);
 
 static ssize_t aliases_show(struct kmem_cache *s, char *buf)
 {
@@ -3564,7 +3580,7 @@ static struct attribute * slab_attrs[] =
 	&slabs_attr.attr,
 	&partial_attr.attr,
 	&cpu_slabs_attr.attr,
-	&ctor_attr.attr,
+	&ops_attr.attr,
 	&aliases_attr.attr,
 	&align_attr.attr,
 	&sanity_checks_attr.attr,
Index: linux-2.6.22-rc4-mm2/include/linux/slub_def.h
===================================================================
--- linux-2.6.22-rc4-mm2.orig/include/linux/slub_def.h	2007-06-17 18:12:04.000000000 -0700
+++ linux-2.6.22-rc4-mm2/include/linux/slub_def.h	2007-06-17 18:12:19.000000000 -0700
@@ -42,6 +42,7 @@ struct kmem_cache {
 	int objects;		/* Number of objects in slab */
 	int refcount;		/* Refcount for slab cache destroy */
 	void (*ctor)(void *, struct kmem_cache *, unsigned long);
+	const struct kmem_cache_ops *ops;
 	int inuse;		/* Offset to metadata */
 	int align;		/* Alignment */
 	const char *name;	/* Name (only for display!) */
Index: linux-2.6.22-rc4-mm2/mm/slab.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/mm/slab.c	2007-06-17 18:11:59.000000000 -0700
+++ linux-2.6.22-rc4-mm2/mm/slab.c	2007-06-17 18:12:19.000000000 -0700
@@ -2102,7 +2102,7 @@ static int __init_refok setup_cpu_cache(
  * @align: The required alignment for the objects.
  * @flags: SLAB flags
  * @ctor: A constructor for the objects.
- * @dtor: A destructor for the objects (not implemented anymore).
+ * @ops: A kmem_cache_ops structure (ignored).
  *
  * Returns a ptr to the cache on success, NULL on failure.
  * Cannot be called within a int, but can be interrupted.
@@ -2128,7 +2128,7 @@ struct kmem_cache *
 kmem_cache_create (const char *name, size_t size, size_t align,
 	unsigned long flags,
 	void (*ctor)(void*, struct kmem_cache *, unsigned long),
-	void (*dtor)(void*, struct kmem_cache *, unsigned long))
+	const struct kmem_cache_ops *ops)
 {
 	size_t left_over, slab_size, ralign;
 	struct kmem_cache *cachep = NULL, *pc;
@@ -2137,7 +2137,7 @@ kmem_cache_create (const char *name, siz
 	 * Sanity checks... these are all serious usage bugs.
 	 */
 	if (!name || in_interrupt() || (size < BYTES_PER_WORD) ||
-	    size > KMALLOC_MAX_SIZE || dtor) {
+	    size > KMALLOC_MAX_SIZE) {
 		printk(KERN_ERR "%s: Early error in slab %s\n", __FUNCTION__,
 				name);
 		BUG();
Index: linux-2.6.22-rc4-mm2/mm/slob.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/mm/slob.c	2007-06-17 18:11:59.000000000 -0700
+++ linux-2.6.22-rc4-mm2/mm/slob.c	2007-06-17 18:12:19.000000000 -0700
@@ -455,7 +455,7 @@ struct kmem_cache {
 struct kmem_cache *kmem_cache_create(const char *name, size_t size,
 	size_t align, unsigned long flags,
 	void (*ctor)(void*, struct kmem_cache *, unsigned long),
-	void (*dtor)(void*, struct kmem_cache *, unsigned long))
+	const struct kmem_cache_ops *o)
 {
 	struct kmem_cache *c;
 

-- 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* [patch 12/26] SLUB: Slab defragmentation core
  2007-06-18  9:58 [patch 00/26] Current slab allocator / SLUB patch queue clameter
                   ` (10 preceding siblings ...)
  2007-06-18  9:58 ` [patch 11/26] SLUB: Add support for kmem_cache_ops clameter
@ 2007-06-18  9:58 ` clameter
  2007-06-26  8:18   ` Andrew Morton
  2007-06-26 19:13   ` Nish Aravamudan
  2007-06-18  9:58 ` [patch 13/26] SLUB: Extend slabinfo to support -D and -C options clameter
                   ` (14 subsequent siblings)
  26 siblings, 2 replies; 73+ messages in thread
From: clameter @ 2007-06-18  9:58 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-mm, Pekka Enberg, suresh.b.siddha

[-- Attachment #1: slab_defrag_core --]
[-- Type: text/plain, Size: 16592 bytes --]

Slab defragmentation occurs either

1. Unconditionally when kmem_cache_shrink is called on a slab cache, either
   directly by the kernel or by slabinfo triggering slab shrinking. This
   form performs defragmentation on all nodes of a NUMA system.

2. Conditionally when kmem_cache_defrag(<percentage>, <node>) is called.

   The defragmentation is only performed if the usage ratio of the slab
   cache is lower than the specified percentage. The usage ratio is the
   percentage of objects in use compared to the total number of objects
   that the slab cache could hold.

   kmem_cache_defrag takes a node parameter. This can either be -1 if
   defragmentation should be performed on all nodes, or a node number.
   If a node number was specified then defragmentation is only performed
   on a specific node.

   Slab defragmentation is a memory intensive operation that can be
   sped up in a NUMA system if mostly node local memory is accessed. That
   is the case if we have just performed reclaim on a node.
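
For illustration, the two entry points look like this in use (a sketch;
the percentage and node number are arbitrary examples):

	/* Unconditional: shrink and defragment cache s on all nodes */
	kmem_cache_shrink(s);

	/* Conditional: defragment caches on node 2 whose usage ratio is
	 * below 30%; passing -1 instead of 2 would mean all nodes */
	kmem_cache_defrag(30, 2);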

For defragmentation SLUB first generates a sorted list of partial slabs.
Sorting is performed according to the number of objects allocated.
Thus the slabs with the least objects will be at the end.

We extract slabs off the tail of that list until we have either reached a
minimum number of slabs or until we encounter a slab that has more than a
quarter of its objects allocated. Then we attempt to remove the objects
from each of the slabs taken.

In order for a slabcache to support defragmentation a couple of functions
must be defined via kmem_cache_ops. These are

void *get(struct kmem_cache *s, int nr, void **objects)

	Must obtain a reference to the listed objects. SLUB guarantees that
	the objects are still allocated. However, other threads may be blocked
	in slab_free attempting to free objects in the slab. These may succeed
	as soon as get() returns to the slab allocator. The function must
	be able to detect the situation and void the attempts to handle such
	objects (by for example voiding the corresponding entry in the objects
	array).

	No slab operations may be performed in get(). Interrupts
	are disabled. What can be done is very limited. The slab lock
	for the page with the object is taken. Any attempt to perform a slab
	operation may lead to a deadlock.

	get() returns a private pointer that is passed to kick(). Should we
	be unable to obtain all references then that pointer may indicate to
	the kick() function that it should not attempt any object removal
	or move but simply drop the references that were obtained.

void kick(struct kmem_cache *, int nr, void **objects, void *get_result)

	After SLUB has established references to the objects in a
	slab it will drop all locks and then use kick() to move objects out
	of the slab. The existence of the object is guaranteed by virtue of
	the earlier obtained references via get(). The callback may perform
	any slab operation since no locks are held at the time of call.

	The callback should remove the object from the slab in some way. This
	may be accomplished by reclaiming the object and then running
	kmem_cache_free() or reallocating it and then running
	kmem_cache_free(). Reallocation is advantageous because the partial
	slabs were just sorted to put the slabs with the most objects first.
	Reallocation is likely to fill up another slab in addition to freeing
	up one slab, so that the filled slab can also be removed from the
	partial list.

	Kick() does not return a result. SLUB will check the number of
	remaining objects in the slab. If all objects were removed then
	we know that the operation was successful.
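
A minimal sketch of what such callbacks could look like for a hypothetical
cache of "foo" objects (foo_tryget() and foo_evict() are assumed helpers of
that subsystem; the real implementations for inodes follow in later patches
of this series):

	static void *foo_get(struct kmem_cache *s, int nr, void **v)
	{
		int i;

		/* Slab lock held, interrupts off: only take references */
		for (i = 0; i < nr; i++)
			if (!foo_tryget(v[i]))
				v[i] = NULL;	/* object is being freed, skip */
		return NULL;			/* no private data for kick() */
	}

	static void foo_kick(struct kmem_cache *s, int nr, void **v, void *private)
	{
		int i;

		/* No locks held: free or reallocate each referenced object */
		for (i = 0; i < nr; i++)
			if (v[i])
				foo_evict(v[i]);	/* ends in kmem_cache_free() */
	}

	static struct kmem_cache_ops foo_ops = {
		.get	= foo_get,
		.kick	= foo_kick,
	};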

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 include/linux/slab.h |   32 ++++
 mm/slab.c            |    5 
 mm/slob.c            |    5 
 mm/slub.c            |  344 +++++++++++++++++++++++++++++++++++++++++----------
 4 files changed, 322 insertions(+), 64 deletions(-)

Index: linux-2.6.22-rc4-mm2/include/linux/slab.h
===================================================================
--- linux-2.6.22-rc4-mm2.orig/include/linux/slab.h	2007-06-17 18:12:19.000000000 -0700
+++ linux-2.6.22-rc4-mm2/include/linux/slab.h	2007-06-17 18:12:22.000000000 -0700
@@ -51,7 +51,39 @@
 void __init kmem_cache_init(void);
 int slab_is_available(void);
 
+struct kmem_cache;
+
 struct kmem_cache_ops {
+	/*
+	 * Called with slab lock held and interrupts disabled.
+	 * No slab operation may be performed.
+	 *
+	 * Parameters passed are the number of objects to process
+	 * and an array of pointers to objects for which we
+	 * need references.
+	 *
+	 * Returns a pointer that is passed to the kick function.
+	 * If all objects cannot be moved then the pointer may
+	 * indicate that this wont work and then kick can simply
+	 * remove the references that were already obtained.
+	 *
+	 * The array passed to get() is also passed to kick(). The
+	 * function may remove objects by setting array elements to NULL.
+	 */
+	void *(*get)(struct kmem_cache *, int nr, void **);
+
+	/*
+	 * Called with no locks held and interrupts enabled.
+	 * Any operation may be performed in kick().
+	 *
+	 * Parameters passed are the number of objects in the array,
+	 * the array of pointers to the objects and the pointer
+	 * returned by get().
+	 *
+	 * Success is checked by examining the number of remaining
+	 * objects in the slab.
+	 */
+	void (*kick)(struct kmem_cache *, int nr, void **, void *private);
 };
 
 struct kmem_cache *kmem_cache_create(const char *, size_t, size_t,
Index: linux-2.6.22-rc4-mm2/mm/slub.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/mm/slub.c	2007-06-17 18:12:19.000000000 -0700
+++ linux-2.6.22-rc4-mm2/mm/slub.c	2007-06-17 18:12:22.000000000 -0700
@@ -2464,6 +2464,195 @@ void kfree(const void *x)
 }
 EXPORT_SYMBOL(kfree);
 
+static unsigned long count_partial(struct kmem_cache_node *n)
+{
+	unsigned long flags;
+	unsigned long x = 0;
+	struct page *page;
+
+	spin_lock_irqsave(&n->list_lock, flags);
+	list_for_each_entry(page, &n->partial, lru)
+		x += page->inuse;
+	spin_unlock_irqrestore(&n->list_lock, flags);
+	return x;
+}
+
+/*
+ * Vacate all objects in the given slab.
+ *
+ * Slab must be locked and frozen. Interrupts are disabled (flags must
+ * be passed).
+ *
+ * Will drop and regain and drop the slab lock. At the end the slab will
+ * either be freed or returned to the partial lists.
+ *
+ * Returns the number of remaining objects
+ */
+static int __kmem_cache_vacate(struct kmem_cache *s,
+		struct page *page, unsigned long flags, void *scratch)
+{
+	void **vector = scratch;
+	void *p;
+	void *addr = page_address(page);
+	DECLARE_BITMAP(map, s->objects);
+	int leftover;
+	int objects;
+	void *private;
+
+	if (!page->inuse)
+		goto out;
+
+	/* Determine used objects */
+	bitmap_fill(map, s->objects);
+	for_each_free_object(p, s, page->freelist)
+		__clear_bit(slab_index(p, s, addr), map);
+
+	objects = 0;
+	memset(vector, 0, s->objects * sizeof(void **));
+	for_each_object(p, s, addr) {
+		if (test_bit(slab_index(p, s, addr), map))
+			vector[objects++] = p;
+	}
+
+	private = s->ops->get(s, objects, vector);
+
+	/*
+	 * Got references. Now we can drop the slab lock. The slab
+	 * is frozen so it cannot vanish from under us nor will
+	 * allocations be performed on the slab. However, unlocking the
+	 * slab will allow concurrent slab_frees to proceed.
+	 */
+	slab_unlock(page);
+	local_irq_restore(flags);
+
+	/*
+	 * Perform the KICK callbacks to remove the objects.
+	 */
+	s->ops->kick(s, objects, vector, private);
+
+	local_irq_save(flags);
+	slab_lock(page);
+out:
+	/*
+	 * Check the result and unfreeze the slab
+	 */
+	leftover = page->inuse;
+	unfreeze_slab(s, page);
+	local_irq_restore(flags);
+	return leftover;
+}
+
+/*
+ * Sort the partial slabs by the number of items allocated.
+ * The slabs with the least objects come last.
+ */
+static unsigned long sort_partial_list(struct kmem_cache *s,
+	struct kmem_cache_node *n, void *scratch)
+{
+	struct list_head *slabs_by_inuse = scratch;
+	int i;
+	struct page *page;
+	struct page *t;
+	unsigned long freed = 0;
+
+	for (i = 0; i < s->objects; i++)
+		INIT_LIST_HEAD(slabs_by_inuse + i);
+
+	/*
+	 * Build lists indexed by the items in use in each slab.
+	 *
+	 * Note that concurrent frees may occur while we hold the
+	 * list_lock. page->inuse here is the upper limit.
+	 */
+	list_for_each_entry_safe(page, t, &n->partial, lru) {
+		if (!page->inuse && slab_trylock(page)) {
+			/*
+			 * Must hold slab lock here because slab_free
+			 * may have freed the last object and be
+			 * waiting to release the slab.
+			 */
+			list_del(&page->lru);
+			n->nr_partial--;
+			slab_unlock(page);
+			discard_slab(s, page);
+			freed++;
+		} else {
+			list_move(&page->lru,
+			slabs_by_inuse + page->inuse);
+		}
+	}
+
+	/*
+	 * Rebuild the partial list with the slabs filled up most
+	 * first and the least used slabs at the end.
+	 */
+	for (i = s->objects - 1; i >= 0; i--)
+		list_splice(slabs_by_inuse + i, n->partial.prev);
+
+	return freed;
+}
+
+/*
+ * Shrink the slab cache on a particular node of the cache
+ */
+static unsigned long __kmem_cache_shrink(struct kmem_cache *s,
+	struct kmem_cache_node *n, void *scratch)
+{
+	unsigned long flags;
+	struct page *page, *page2;
+	LIST_HEAD(zaplist);
+	int freed;
+
+	spin_lock_irqsave(&n->list_lock, flags);
+	freed = sort_partial_list(s, n, scratch);
+
+	/*
+	 * If we have no functions available to defragment the slabs
+	 * then we are done.
+	*/
+	if (!s->ops->get || !s->ops->kick) {
+		spin_unlock_irqrestore(&n->list_lock, flags);
+		return freed;
+	}
+
+	/*
+	 * Take slabs with just a few objects off the tail of the now
+	 * ordered list. These are the slabs with the least objects
+	 * and those are likely easy to reclaim.
+	 */
+	while (n->nr_partial > MAX_PARTIAL) {
+		page = container_of(n->partial.prev, struct page, lru);
+
+		/*
+		 * We are holding the list_lock so we can only
+		 * trylock the slab
+		 */
+		if (page->inuse > s->objects / 4)
+			break;
+
+		if (!slab_trylock(page))
+			break;
+
+		list_move_tail(&page->lru, &zaplist);
+		n->nr_partial--;
+		SetSlabFrozen(page);
+		slab_unlock(page);
+	}
+
+	spin_unlock_irqrestore(&n->list_lock, flags);
+
+	/* Now we can free objects in the slabs on the zaplist */
+	list_for_each_entry_safe(page, page2, &zaplist, lru) {
+		unsigned long flags;
+
+		local_irq_save(flags);
+		slab_lock(page);
+		if (__kmem_cache_vacate(s, page, flags, scratch) == 0)
+			freed++;
+	}
+	return freed;
+}
+
 /*
  * kmem_cache_shrink removes empty slabs from the partial lists and sorts
  * the remaining slabs by the number of items in use. The slabs with the
@@ -2477,71 +2666,97 @@ EXPORT_SYMBOL(kfree);
 int kmem_cache_shrink(struct kmem_cache *s)
 {
 	int node;
-	int i;
-	struct kmem_cache_node *n;
-	struct page *page;
-	struct page *t;
-	struct list_head *slabs_by_inuse =
-		kmalloc(sizeof(struct list_head) * s->objects, GFP_KERNEL);
-	unsigned long flags;
+	void *scratch;
+
+	flush_all(s);
 
-	if (!slabs_by_inuse)
+	scratch = kmalloc(sizeof(struct list_head) * s->objects,
+							GFP_KERNEL);
+	if (!scratch)
 		return -ENOMEM;
 
-	flush_all(s);
-	for_each_online_node(node) {
-		n = get_node(s, node);
+	for_each_online_node(node)
+		__kmem_cache_shrink(s, get_node(s, node), scratch);
 
-		if (!n->nr_partial)
-			continue;
+	kfree(scratch);
+	return 0;
+}
+EXPORT_SYMBOL(kmem_cache_shrink);
 
-		for (i = 0; i < s->objects; i++)
-			INIT_LIST_HEAD(slabs_by_inuse + i);
+static unsigned long __kmem_cache_defrag(struct kmem_cache *s,
+				int percent, int node, void *scratch)
+{
+	unsigned long capacity;
+	unsigned long objects;
+	unsigned long ratio;
+	struct kmem_cache_node *n = get_node(s, node);
 
-		spin_lock_irqsave(&n->list_lock, flags);
+	/*
+	 * An insignificant number of partial slabs makes
+	 * the slab not interesting.
+	 */
+	if (n->nr_partial <= MAX_PARTIAL)
+		return 0;
 
-		/*
-		 * Build lists indexed by the items in use in each slab.
-		 *
-		 * Note that concurrent frees may occur while we hold the
-		 * list_lock. page->inuse here is the upper limit.
-		 */
-		list_for_each_entry_safe(page, t, &n->partial, lru) {
-			if (!page->inuse && slab_trylock(page)) {
-				/*
-				 * Must hold slab lock here because slab_free
-				 * may have freed the last object and be
-				 * waiting to release the slab.
-				 */
-				list_del(&page->lru);
-				n->nr_partial--;
-				slab_unlock(page);
-				discard_slab(s, page);
-			} else {
-				if (n->nr_partial > MAX_PARTIAL)
-					list_move(&page->lru,
-					slabs_by_inuse + page->inuse);
-			}
-		}
+	/*
+	 * Calculate usage ratio
+	 */
+	capacity = atomic_long_read(&n->nr_slabs) * s->objects;
+	objects = capacity - n->nr_partial * s->objects + count_partial(n);
+	ratio = objects * 100 / capacity;
 
-		if (n->nr_partial <= MAX_PARTIAL)
-			goto out;
+	/*
+	 * If usage ratio is more than required then no
+	 * defragmentation
+	 */
+	if (ratio > percent)
+		return 0;
+
+	return __kmem_cache_shrink(s, n, scratch) << s->order;
+}
+
+/*
+ * Defrag slabs on the local node if fragmentation is higher
+ * than the given percentage. This is called from the memory reclaim
+ * path.
+ */
+int kmem_cache_defrag(int percent, int node)
+{
+	struct kmem_cache *s;
+	unsigned long pages = 0;
+	void *scratch;
+
+	/*
+	 * kmem_cache_defrag may be called from the reclaim path which may be
+	 * called for any page allocator alloc. So there is the danger that we
+	 * get called in a situation where slub already acquired the slub_lock
+	 * for other purposes.
+	 */
+	if (!down_read_trylock(&slub_lock))
+		return 0;
+
+	list_for_each_entry(s, &slab_caches, list) {
 
 		/*
-		 * Rebuild the partial list with the slabs filled up most
-		 * first and the least used slabs at the end.
+		 * The slab cache must have defrag methods.
 		 */
-		for (i = s->objects - 1; i >= 0; i--)
-			list_splice(slabs_by_inuse + i, n->partial.prev);
+		if (!s->ops || !s->ops->kick)
+			continue;
 
-	out:
-		spin_unlock_irqrestore(&n->list_lock, flags);
+		scratch = kmalloc(sizeof(struct list_head) * s->objects,
+								GFP_KERNEL);
+		if (node == -1) {
+			for_each_online_node(node)
+				pages += __kmem_cache_defrag(s, percent,
+							node, scratch);
+		} else
+			pages += __kmem_cache_defrag(s, percent, node, scratch);
+		kfree(scratch);
 	}
-
-	kfree(slabs_by_inuse);
-	return 0;
+	up_read(&slub_lock);
+	return pages;
 }
-EXPORT_SYMBOL(kmem_cache_shrink);
+EXPORT_SYMBOL(kmem_cache_defrag);
 
 /********************************************************************
  *			Basic setup of slabs
@@ -3178,19 +3393,6 @@ static int list_locations(struct kmem_ca
 	return n;
 }
 
-static unsigned long count_partial(struct kmem_cache_node *n)
-{
-	unsigned long flags;
-	unsigned long x = 0;
-	struct page *page;
-
-	spin_lock_irqsave(&n->list_lock, flags);
-	list_for_each_entry(page, &n->partial, lru)
-		x += page->inuse;
-	spin_unlock_irqrestore(&n->list_lock, flags);
-	return x;
-}
-
 enum slab_stat_type {
 	SL_FULL,
 	SL_PARTIAL,
@@ -3346,6 +3548,20 @@ static ssize_t ops_show(struct kmem_cach
 		x += sprint_symbol(buf + x, (unsigned long)s->ctor);
 		x += sprintf(buf + x, "\n");
 	}
+
+	if (s->ops->get) {
+		x += sprintf(buf + x, "get : ");
+		x += sprint_symbol(buf + x,
+				(unsigned long)s->ops->get);
+		x += sprintf(buf + x, "\n");
+	}
+
+	if (s->ops->kick) {
+		x += sprintf(buf + x, "kick : ");
+		x += sprint_symbol(buf + x,
+				(unsigned long)s->ops->kick);
+		x += sprintf(buf + x, "\n");
+	}
 	return x;
 }
 SLAB_ATTR_RO(ops);
Index: linux-2.6.22-rc4-mm2/mm/slab.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/mm/slab.c	2007-06-17 18:12:19.000000000 -0700
+++ linux-2.6.22-rc4-mm2/mm/slab.c	2007-06-17 18:12:22.000000000 -0700
@@ -2518,6 +2518,11 @@ int kmem_cache_shrink(struct kmem_cache 
 }
 EXPORT_SYMBOL(kmem_cache_shrink);
 
+int kmem_cache_defrag(int percent, int node)
+{
+	return 0;
+}
+
 /**
  * kmem_cache_destroy - delete a cache
  * @cachep: the cache to destroy
Index: linux-2.6.22-rc4-mm2/mm/slob.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/mm/slob.c	2007-06-17 18:12:19.000000000 -0700
+++ linux-2.6.22-rc4-mm2/mm/slob.c	2007-06-17 18:12:22.000000000 -0700
@@ -553,6 +553,11 @@ int kmem_cache_shrink(struct kmem_cache 
 }
 EXPORT_SYMBOL(kmem_cache_shrink);
 
+int kmem_cache_defrag(int percentage, int node)
+{
+	return 0;
+}
+
 int kmem_ptr_validate(struct kmem_cache *a, const void *b)
 {
 	return 0;

-- 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* [patch 13/26] SLUB: Extend slabinfo to support -D and -C options
  2007-06-18  9:58 [patch 00/26] Current slab allocator / SLUB patch queue clameter
                   ` (11 preceding siblings ...)
  2007-06-18  9:58 ` [patch 12/26] SLUB: Slab defragmentation core clameter
@ 2007-06-18  9:58 ` clameter
  2007-06-18  9:58 ` [patch 14/26] SLUB: Logic to trigger slab defragmentation from memory reclaim clameter
                   ` (13 subsequent siblings)
  26 siblings, 0 replies; 73+ messages in thread
From: clameter @ 2007-06-18  9:58 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-mm, Pekka Enberg, suresh.b.siddha

[-- Attachment #1: slab_defrag_slabinfo_updates --]
[-- Type: text/plain, Size: 4606 bytes --]

-D lists caches that support defragmentation

-C lists caches that use a ctor.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 Documentation/vm/slabinfo.c |   39 ++++++++++++++++++++++++++++++++++-----
 1 file changed, 34 insertions(+), 5 deletions(-)

Index: linux-2.6.22-rc4-mm2/Documentation/vm/slabinfo.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/Documentation/vm/slabinfo.c	2007-06-18 01:26:22.000000000 -0700
+++ linux-2.6.22-rc4-mm2/Documentation/vm/slabinfo.c	2007-06-18 01:27:21.000000000 -0700
@@ -30,6 +30,7 @@ struct slabinfo {
 	int hwcache_align, object_size, objs_per_slab;
 	int sanity_checks, slab_size, store_user, trace;
 	int order, poison, reclaim_account, red_zone;
+	int defrag, ctor;
 	unsigned long partial, objects, slabs;
 	int numa[MAX_NODES];
 	int numa_partial[MAX_NODES];
@@ -56,6 +57,8 @@ int show_slab = 0;
 int skip_zero = 1;
 int show_numa = 0;
 int show_track = 0;
+int show_defrag = 0;
+int show_ctor = 0;
 int show_first_alias = 0;
 int validate = 0;
 int shrink = 0;
@@ -90,18 +93,20 @@ void fatal(const char *x, ...)
 void usage(void)
 {
 	printf("slabinfo 5/7/2007. (c) 2007 sgi. clameter@sgi.com\n\n"
-		"slabinfo [-ahnpvtsz] [-d debugopts] [slab-regexp]\n"
+		"slabinfo [-aCDefhilnosSrtTvz1] [-d debugopts] [slab-regexp]\n"
 		"-a|--aliases           Show aliases\n"
+		"-C|--ctor              Show slabs with ctors\n"
 		"-d<options>|--debug=<options> Set/Clear Debug options\n"
-		"-e|--empty		Show empty slabs\n"
+		"-D|--defrag            Show defragmentable caches\n"
+		"-e|--empty             Show empty slabs\n"
 		"-f|--first-alias       Show first alias\n"
 		"-h|--help              Show usage information\n"
 		"-i|--inverted          Inverted list\n"
 		"-l|--slabs             Show slabs\n"
 		"-n|--numa              Show NUMA information\n"
-		"-o|--ops		Show kmem_cache_ops\n"
+		"-o|--ops               Show kmem_cache_ops\n"
 		"-s|--shrink            Shrink slabs\n"
-		"-r|--report		Detailed report on single slabs\n"
+		"-r|--report            Detailed report on single slabs\n"
 		"-S|--Size              Sort by size\n"
 		"-t|--tracking          Show alloc/free information\n"
 		"-T|--Totals            Show summary information\n"
@@ -281,7 +286,7 @@ int line = 0;
 void first_line(void)
 {
 	printf("Name                   Objects Objsize    Space "
-		"Slabs/Part/Cpu  O/S O %%Fr %%Ef Flg\n");
+		"Slabs/Part/Cpu  O/S O %%Ra %%Ef Flg\n");
 }
 
 /*
@@ -452,6 +457,12 @@ void slabcache(struct slabinfo *s)
 	if (show_empty && s->slabs)
 		return;
 
+	if (show_defrag && !s->defrag)
+		return;
+
+	if (show_ctor && !s->ctor)
+		return;
+
 	store_size(size_str, slab_size(s));
 	sprintf(dist_str,"%lu/%lu/%d", s->slabs, s->partial, s->cpu_slabs);
 
@@ -462,6 +473,10 @@ void slabcache(struct slabinfo *s)
 		*p++ = '*';
 	if (s->cache_dma)
 		*p++ = 'd';
+	if (s->defrag)
+		*p++ = 'D';
+	if (s->ctor)
+		*p++ = 'C';
 	if (s->hwcache_align)
 		*p++ = 'A';
 	if (s->poison)
@@ -481,7 +496,7 @@ void slabcache(struct slabinfo *s)
 	printf("%-21s %8ld %7d %8s %14s %4d %1d %3ld %3ld %s\n",
 		s->name, s->objects, s->object_size, size_str, dist_str,
 		s->objs_per_slab, s->order,
-		s->slabs ? (s->partial * 100) / s->slabs : 100,
+		s->slabs ? (s->objects * 100) / (s->slabs * s->objs_per_slab) : 100,
 		s->slabs ? (s->objects * s->object_size * 100) /
 			(s->slabs * (page_size << s->order)) : 100,
 		flags);
@@ -1072,6 +1087,12 @@ void read_slab_dir(void)
 			slab->store_user = get_obj("store_user");
 			slab->trace = get_obj("trace");
 			chdir("..");
+			if (read_slab_obj(slab, "ops")) {
+				if (strstr(buffer, "ctor :"))
+					slab->ctor = 1;
+				if (strstr(buffer, "kick :"))
+					slab->defrag = 1;
+			}
 			if (slab->name[0] == ':')
 				alias_targets++;
 			slab++;
@@ -1121,7 +1142,9 @@ void output_slabs(void)
 
 struct option opts[] = {
 	{ "aliases", 0, NULL, 'a' },
+	{ "ctor", 0, NULL, 'C' },
 	{ "debug", 2, NULL, 'd' },
+	{ "defrag", 0, NULL, 'D' },
 	{ "empty", 0, NULL, 'e' },
 	{ "first-alias", 0, NULL, 'f' },
 	{ "help", 0, NULL, 'h' },
@@ -1146,7 +1169,7 @@ int main(int argc, char *argv[])
 
 	page_size = getpagesize();
 
-	while ((c = getopt_long(argc, argv, "ad::efhil1noprstvzTS",
+	while ((c = getopt_long(argc, argv, "ad::efhil1noprstvzCDTS",
 						opts, NULL)) != -1)
 	switch(c) {
 		case '1':
@@ -1196,6 +1219,12 @@ int main(int argc, char *argv[])
 		case 'z':
 			skip_zero = 0;
 			break;
+		case 'C':
+			show_ctor = 1;
+			break;
+		case 'D':
+			show_defrag = 1;
+			break;
 		case 'T':
 			show_totals = 1;
 			break;

-- 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* [patch 14/26] SLUB: Logic to trigger slab defragmentation from memory reclaim
  2007-06-18  9:58 [patch 00/26] Current slab allocator / SLUB patch queue clameter
                   ` (12 preceding siblings ...)
  2007-06-18  9:58 ` [patch 13/26] SLUB: Extend slabinfo to support -D and -C options clameter
@ 2007-06-18  9:58 ` clameter
  2007-06-18  9:58 ` [patch 15/26] Slab defrag: Support generic defragmentation for inode slab caches clameter
                   ` (12 subsequent siblings)
  26 siblings, 0 replies; 73+ messages in thread
From: clameter @ 2007-06-18  9:58 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-mm, Pekka Enberg, suresh.b.siddha

[-- Attachment #1: slab_defrag_trigger --]
[-- Type: text/plain, Size: 9555 bytes --]

At some point slab defragmentation needs to be triggered. The logical
point for this is after slab shrinking has been performed in vmscan.c. At
that point the fragmentation ratio of a slab has been increased by objects
being freed. So we call kmem_cache_defrag() from there.

kmem_cache_defrag takes the defrag ratio to make the decision to
defrag a slab or not. We define a new VM tunable

	slab_defrag_ratio

that contains the limit to trigger slab defragmentation.

shrink_slab() from vmscan.c is called in some contexts to do
global shrinking of slabs and in others to do shrinking for
a particular zone. Pass the zone to shrink_slab(), so that it
can call kmem_cache_defrag() and restrict the defragmentation to
the node that is under memory pressure.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 Documentation/sysctl/vm.txt |   25 +++++++++++++++++++++++++
 fs/drop_caches.c            |    2 +-
 include/linux/mm.h          |    2 +-
 include/linux/slab.h        |    1 +
 kernel/sysctl.c             |   10 ++++++++++
 mm/vmscan.c                 |   34 +++++++++++++++++++++++++++-------
 6 files changed, 65 insertions(+), 9 deletions(-)

Index: linux-2.6.22-rc4-mm2/Documentation/sysctl/vm.txt
===================================================================
--- linux-2.6.22-rc4-mm2.orig/Documentation/sysctl/vm.txt	2007-06-17 18:08:02.000000000 -0700
+++ linux-2.6.22-rc4-mm2/Documentation/sysctl/vm.txt	2007-06-17 18:12:29.000000000 -0700
@@ -35,6 +35,7 @@ Currently, these files are in /proc/sys/
 - swap_prefetch
 - swap_prefetch_delay
 - swap_prefetch_sleep
+- slab_defrag_ratio
 
 ==============================================================
 
@@ -300,3 +301,27 @@ sleep for when the ram is found to be fu
 further.
 
 The default value is 5.
+
+==============================================================
+
+slab_defrag_ratio
+
+After shrinking the slabs the system checks if slabs have a lower usage
+ratio than the percentage given here. If so then slab defragmentation is
+activated to increase the usage ratio of the slab and in order to free
+memory.
+
+This is the percentage of objects allocated of the total possible number
+of objects in a slab. A lower percentage signifies more fragmentation.
+
+Note slab defragmentation only works on slabs that have the proper methods
+defined (see /sys/slab/<slabname>/ops). When this text was written slab
+defragmentation was only supported by the dentry cache and the inode cache.
+
+The main purpose of the slab defragmentation is to address pathological
+situations in which large amounts of inodes or dentries have been
+removed from the system. That may leave lots of slabs around with just
+a few objects. Slab defragmentation removes these slabs.
+
+The default value is 30% meaning for 3 items in use we have 7 free
+and unused items.
Index: linux-2.6.22-rc4-mm2/include/linux/slab.h
===================================================================
--- linux-2.6.22-rc4-mm2.orig/include/linux/slab.h	2007-06-17 18:12:22.000000000 -0700
+++ linux-2.6.22-rc4-mm2/include/linux/slab.h	2007-06-17 18:12:29.000000000 -0700
@@ -97,6 +97,7 @@ void kmem_cache_free(struct kmem_cache *
 unsigned int kmem_cache_size(struct kmem_cache *);
 const char *kmem_cache_name(struct kmem_cache *);
 int kmem_ptr_validate(struct kmem_cache *cachep, const void *ptr);
+int kmem_cache_defrag(int percentage, int node);
 
 /*
  * Please use this macro to create slab caches. Simply specify the
Index: linux-2.6.22-rc4-mm2/kernel/sysctl.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/kernel/sysctl.c	2007-06-17 18:08:02.000000000 -0700
+++ linux-2.6.22-rc4-mm2/kernel/sysctl.c	2007-06-17 18:12:29.000000000 -0700
@@ -81,6 +81,7 @@ extern int percpu_pagelist_fraction;
 extern int compat_log;
 extern int maps_protect;
 extern int sysctl_stat_interval;
+extern int sysctl_slab_defrag_ratio;
 
 /* this is needed for the proc_dointvec_minmax for [fs_]overflow UID and GID */
 static int maxolduid = 65535;
@@ -917,6 +918,15 @@ static ctl_table vm_table[] = {
 		.strategy	= &sysctl_intvec,
 		.extra1		= &zero,
 	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "slab_defrag_ratio",
+		.data		= &sysctl_slab_defrag_ratio,
+		.maxlen		= sizeof(sysctl_slab_defrag_ratio),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+		.strategy	= &sysctl_intvec,
+	},
 #ifdef HAVE_ARCH_PICK_MMAP_LAYOUT
 	{
 		.ctl_name	= VM_LEGACY_VA_LAYOUT,
Index: linux-2.6.22-rc4-mm2/mm/vmscan.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/mm/vmscan.c	2007-06-17 18:08:02.000000000 -0700
+++ linux-2.6.22-rc4-mm2/mm/vmscan.c	2007-06-17 18:12:29.000000000 -0700
@@ -135,6 +135,12 @@ void unregister_shrinker(struct shrinker
 EXPORT_SYMBOL(unregister_shrinker);
 
 #define SHRINK_BATCH 128
+
+/*
+ * Slabs should be defragmented if less than 30% of objects are allocated.
+ */
+int sysctl_slab_defrag_ratio = 30;
+
 /*
  * Call the shrink functions to age shrinkable caches
  *
@@ -152,10 +158,19 @@ EXPORT_SYMBOL(unregister_shrinker);
  * are eligible for the caller's allocation attempt.  It is used for balancing
  * slab reclaim versus page reclaim.
  *
+ * zone is the zone for which we are shrinking the slabs. If the intent
+ * is to do a global shrink then zone can be be NULL. This is currently
+ * only used to limit slab defragmentation to a NUMA node. The performace
+ * of shrink_slab would be better (in particular under NUMA) if it could
+ * be targeted as a whole to a zone that is under memory pressure but
+ * the VFS datastructures do not allow that at the present time. As a
+ * result zone_reclaim must perform global slab reclaim in order
+ * to free up memory in a zone.
+ *
  * Returns the number of slab objects which we shrunk.
  */
 unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask,
-			unsigned long lru_pages)
+			unsigned long lru_pages, struct zone *zone)
 {
 	struct shrinker *shrinker;
 	unsigned long ret = 0;
@@ -218,6 +233,8 @@ unsigned long shrink_slab(unsigned long 
 		shrinker->nr += total_scan;
 	}
 	up_read(&shrinker_rwsem);
+	kmem_cache_defrag(sysctl_slab_defrag_ratio,
+		zone ? zone_to_nid(zone) : -1);
 	return ret;
 }
 
@@ -1163,7 +1180,8 @@ unsigned long try_to_free_pages(struct z
 		if (!priority)
 			disable_swap_token();
 		nr_reclaimed += shrink_zones(priority, zones, &sc);
-		shrink_slab(sc.nr_scanned, gfp_mask, lru_pages);
+		shrink_slab(sc.nr_scanned, gfp_mask, lru_pages,
+						NULL);
 		if (reclaim_state) {
 			nr_reclaimed += reclaim_state->reclaimed_slab;
 			reclaim_state->reclaimed_slab = 0;
@@ -1333,7 +1351,7 @@ loop_again:
 			nr_reclaimed += shrink_zone(priority, zone, &sc);
 			reclaim_state->reclaimed_slab = 0;
 			nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL,
-						lru_pages);
+						lru_pages, zone);
 			nr_reclaimed += reclaim_state->reclaimed_slab;
 			total_scanned += sc.nr_scanned;
 			if (zone->all_unreclaimable)
@@ -1601,7 +1619,7 @@ unsigned long shrink_all_memory(unsigned
 	/* If slab caches are huge, it's better to hit them first */
 	while (nr_slab >= lru_pages) {
 		reclaim_state.reclaimed_slab = 0;
-		shrink_slab(nr_pages, sc.gfp_mask, lru_pages);
+		shrink_slab(nr_pages, sc.gfp_mask, lru_pages, NULL);
 		if (!reclaim_state.reclaimed_slab)
 			break;
 
@@ -1639,7 +1657,7 @@ unsigned long shrink_all_memory(unsigned
 
 			reclaim_state.reclaimed_slab = 0;
 			shrink_slab(sc.nr_scanned, sc.gfp_mask,
-					count_lru_pages());
+					count_lru_pages(), NULL);
 			ret += reclaim_state.reclaimed_slab;
 			if (ret >= nr_pages)
 				goto out;
@@ -1656,7 +1674,8 @@ unsigned long shrink_all_memory(unsigned
 	if (!ret) {
 		do {
 			reclaim_state.reclaimed_slab = 0;
-			shrink_slab(nr_pages, sc.gfp_mask, count_lru_pages());
+			shrink_slab(nr_pages, sc.gfp_mask,
+					count_lru_pages(), NULL);
 			ret += reclaim_state.reclaimed_slab;
 		} while (ret < nr_pages && reclaim_state.reclaimed_slab > 0);
 	}
@@ -1816,7 +1835,8 @@ static int __zone_reclaim(struct zone *z
 		 * Note that shrink_slab will free memory on all zones and may
 		 * take a long time.
 		 */
-		while (shrink_slab(sc.nr_scanned, gfp_mask, order) &&
+		while (shrink_slab(sc.nr_scanned, gfp_mask, order,
+						zone) &&
 			zone_page_state(zone, NR_SLAB_RECLAIMABLE) >
 				slab_reclaimable - nr_pages)
 			;
Index: linux-2.6.22-rc4-mm2/fs/drop_caches.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/fs/drop_caches.c	2007-06-17 18:08:02.000000000 -0700
+++ linux-2.6.22-rc4-mm2/fs/drop_caches.c	2007-06-17 18:12:29.000000000 -0700
@@ -52,7 +52,7 @@ void drop_slab(void)
 	int nr_objects;
 
 	do {
-		nr_objects = shrink_slab(1000, GFP_KERNEL, 1000);
+		nr_objects = shrink_slab(1000, GFP_KERNEL, 1000, NULL);
 	} while (nr_objects > 10);
 }
 
Index: linux-2.6.22-rc4-mm2/include/linux/mm.h
===================================================================
--- linux-2.6.22-rc4-mm2.orig/include/linux/mm.h	2007-06-17 18:08:02.000000000 -0700
+++ linux-2.6.22-rc4-mm2/include/linux/mm.h	2007-06-17 18:12:29.000000000 -0700
@@ -1229,7 +1229,7 @@ int in_gate_area_no_task(unsigned long a
 int drop_caches_sysctl_handler(struct ctl_table *, int, struct file *,
 					void __user *, size_t *, loff_t *);
 unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask,
-			unsigned long lru_pages);
+			unsigned long lru_pages, struct zone *zone);
 extern void drop_pagecache_sb(struct super_block *);
 void drop_pagecache(void);
 void drop_slab(void);

-- 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* [patch 15/26] Slab defrag: Support generic defragmentation for inode slab caches
  2007-06-18  9:58 [patch 00/26] Current slab allocator / SLUB patch queue clameter
                   ` (13 preceding siblings ...)
  2007-06-18  9:58 ` [patch 14/26] SLUB: Logic to trigger slab defragmentation from memory reclaim clameter
@ 2007-06-18  9:58 ` clameter
  2007-06-26  8:18   ` Andrew Morton
  2007-06-18  9:58 ` [patch 16/26] Slab defragmentation: Support defragmentation for extX filesystem inodes clameter
                   ` (11 subsequent siblings)
  26 siblings, 1 reply; 73+ messages in thread
From: clameter @ 2007-06-18  9:58 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-mm, Pekka Enberg, suresh.b.siddha

[-- Attachment #1: slub_defrag_inode_generic --]
[-- Type: text/plain, Size: 4138 bytes --]

This implements the ability to remove the inodes in a particular slab
from the inode cache. In order to remove an inode we may have to write out
the pages of the inode and the inode itself, and remove the dentries
referring to the inode.

Provide generic functionality so that filesystems with their own inode
caches can also tie into the defragmentation functions made available
here.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 fs/inode.c         |  100 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
 include/linux/fs.h |    5 ++
 2 files changed, 104 insertions(+), 1 deletion(-)

Index: linux-2.6.22-rc4-mm2/fs/inode.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/fs/inode.c	2007-06-17 22:29:43.000000000 -0700
+++ linux-2.6.22-rc4-mm2/fs/inode.c	2007-06-17 22:54:41.000000000 -0700
@@ -1351,6 +1351,105 @@ static int __init set_ihash_entries(char
 }
 __setup("ihash_entries=", set_ihash_entries);
 
+static void *get_inodes(struct kmem_cache *s, int nr, void **v)
+{
+	int i;
+
+	spin_lock(&inode_lock);
+	for (i = 0; i < nr; i++) {
+		struct inode *inode = v[i];
+
+		if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE))
+			v[i] = NULL;
+		else
+			__iget(inode);
+	}
+	spin_unlock(&inode_lock);
+	return NULL;
+}
+
+/*
+ * Function for filesystems that embedd struct inode into their own
+ * structures. The offset is the offset of the struct inode in the fs inode.
+ */
+void *fs_get_inodes(struct kmem_cache *s, int nr, void **v,
+						unsigned long offset)
+{
+	int i;
+
+	for (i = 0; i < nr; i++)
+		v[i] += offset;
+
+	return get_inodes(s, nr, v);
+}
+EXPORT_SYMBOL(fs_get_inodes);
+
+void kick_inodes(struct kmem_cache *s, int nr, void **v, void *private)
+{
+	struct inode *inode;
+	int i;
+	int abort = 0;
+	LIST_HEAD(freeable);
+	struct super_block *sb;
+
+	for (i = 0; i < nr; i++) {
+		inode = v[i];
+		if (!inode)
+			continue;
+
+		if (inode_has_buffers(inode) || inode->i_data.nrpages) {
+			if (remove_inode_buffers(inode))
+				invalidate_mapping_pages(&inode->i_data,
+								0, -1);
+		}
+
+		/* Invalidate children and dentry */
+		if (S_ISDIR(inode->i_mode)) {
+			struct dentry *d = d_find_alias(inode);
+
+			if (d) {
+				d_invalidate(d);
+				dput(d);
+			}
+		}
+
+		if (inode->i_state & I_DIRTY)
+			write_inode_now(inode, 1);
+
+		d_prune_aliases(inode);
+	}
+
+	mutex_lock(&iprune_mutex);
+	for (i = 0; i < nr; i++) {
+		inode = v[i];
+		if (!inode)
+			continue;
+
+		sb = inode->i_sb;
+		iput(inode);
+		if (abort || !(sb->s_flags & MS_ACTIVE))
+			continue;
+
+		spin_lock(&inode_lock);
+		abort =  !can_unuse(inode);
+
+		if (!abort) {
+			list_move(&inode->i_list, &freeable);
+			inode->i_state |= I_FREEING;
+			inodes_stat.nr_unused--;
+		}
+		spin_unlock(&inode_lock);
+	}
+	dispose_list(&freeable);
+	mutex_unlock(&iprune_mutex);
+}
+EXPORT_SYMBOL(kick_inodes);
+
+static struct kmem_cache_ops inode_kmem_cache_ops = {
+	.get = get_inodes,
+	.kick = kick_inodes
+};
+
 /*
  * Initialize the waitqueues and inode hash table.
  */
@@ -1389,7 +1488,7 @@ void __init inode_init(unsigned long mem
 					 (SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|
 					 SLAB_MEM_SPREAD),
 					 init_once,
-					 NULL);
+					 &inode_kmem_cache_ops);
 	register_shrinker(&icache_shrinker);
 
 	/* Hash may have been set up in inode_init_early */
Index: linux-2.6.22-rc4-mm2/include/linux/fs.h
===================================================================
--- linux-2.6.22-rc4-mm2.orig/include/linux/fs.h	2007-06-17 22:29:43.000000000 -0700
+++ linux-2.6.22-rc4-mm2/include/linux/fs.h	2007-06-17 22:31:52.000000000 -0700
@@ -1790,6 +1790,11 @@ static inline void insert_inode_hash(str
 	__insert_inode_hash(inode, inode->i_ino);
 }
 
+/* Helper functions for inode defragmentation support in filesystems */
+extern void kick_inodes(struct kmem_cache *, int, void **, void *);
+extern void *fs_get_inodes(struct kmem_cache *, int nr, void **,
+						unsigned long offset);
+
 extern struct file * get_empty_filp(void);
 extern void file_move(struct file *f, struct list_head *list);
 extern void file_kill(struct file *f);

-- 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* [patch 16/26] Slab defragmentation: Support defragmentation for extX filesystem inodes
  2007-06-18  9:58 [patch 00/26] Current slab allocator / SLUB patch queue clameter
                   ` (14 preceding siblings ...)
  2007-06-18  9:58 ` [patch 15/26] Slab defrag: Support generic defragmentation for inode slab caches clameter
@ 2007-06-18  9:58 ` clameter
  2007-06-18  9:58 ` [patch 17/26] Slab defragmentation: Support inode defragmentation for xfs clameter
                   ` (10 subsequent siblings)
  26 siblings, 0 replies; 73+ messages in thread
From: clameter @ 2007-06-18  9:58 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-mm, Pekka Enberg, suresh.b.siddha

[-- Attachment #1: slub_defrag_fs_ext234 --]
[-- Type: text/plain, Size: 3106 bytes --]

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 fs/ext2/super.c |   16 ++++++++++++++--
 fs/ext3/super.c |   14 +++++++++++++-
 fs/ext4/super.c |   14 +++++++++++++-
 3 files changed, 40 insertions(+), 4 deletions(-)

Index: slub/fs/ext2/super.c
===================================================================
--- slub.orig/fs/ext2/super.c	2007-06-07 14:09:36.000000000 -0700
+++ slub/fs/ext2/super.c	2007-06-07 14:28:47.000000000 -0700
@@ -168,14 +168,26 @@ static void init_once(void * foo, struct
 	mutex_init(&ei->truncate_mutex);
 	inode_init_once(&ei->vfs_inode);
 }
- 
+
+static void *ext2_get_inodes(struct kmem_cache *s, int nr, void **v)
+{
+	return fs_get_inodes(s, nr, v,
+		offsetof(struct ext2_inode_info, vfs_inode));
+}
+
+static struct kmem_cache_ops ext2_kmem_cache_ops = {
+	.get = ext2_get_inodes,
+	.kick = kick_inodes
+};
+
 static int init_inodecache(void)
 {
 	ext2_inode_cachep = kmem_cache_create("ext2_inode_cache",
 					     sizeof(struct ext2_inode_info),
 					     0, (SLAB_RECLAIM_ACCOUNT|
 						SLAB_MEM_SPREAD),
-					     init_once, NULL);
+					     init_once,
+					     &ext2_kmem_cache_ops);
 	if (ext2_inode_cachep == NULL)
 		return -ENOMEM;
 	return 0;
Index: slub/fs/ext3/super.c
===================================================================
--- slub.orig/fs/ext3/super.c	2007-06-07 14:09:36.000000000 -0700
+++ slub/fs/ext3/super.c	2007-06-07 14:28:47.000000000 -0700
@@ -483,13 +483,25 @@ static void init_once(void * foo, struct
 	inode_init_once(&ei->vfs_inode);
 }
 
+static void *ext3_get_inodes(struct kmem_cache *s, int nr, void **v)
+{
+	return fs_get_inodes(s, nr, v,
+		offsetof(struct ext3_inode_info, vfs_inode));
+}
+
+static struct kmem_cache_ops ext3_kmem_cache_ops = {
+	.get = ext3_get_inodes,
+	.kick = kick_inodes
+};
+
 static int init_inodecache(void)
 {
 	ext3_inode_cachep = kmem_cache_create("ext3_inode_cache",
 					     sizeof(struct ext3_inode_info),
 					     0, (SLAB_RECLAIM_ACCOUNT|
 						SLAB_MEM_SPREAD),
-					     init_once, NULL);
+					     init_once,
+					     &ext3_kmem_cache_ops);
 	if (ext3_inode_cachep == NULL)
 		return -ENOMEM;
 	return 0;
Index: slub/fs/ext4/super.c
===================================================================
--- slub.orig/fs/ext4/super.c	2007-06-07 14:09:36.000000000 -0700
+++ slub/fs/ext4/super.c	2007-06-07 14:29:49.000000000 -0700
@@ -543,13 +543,25 @@ static void init_once(void * foo, struct
 	inode_init_once(&ei->vfs_inode);
 }
 
+static void *ext4_get_inodes(struct kmem_cache *s, int nr, void **v)
+{
+	return fs_get_inodes(s, nr, v,
+		offsetof(struct ext4_inode_info, vfs_inode));
+}
+
+static struct kmem_cache_ops ext4_kmem_cache_ops = {
+	.get = ext4_get_inodes,
+	.kick = kick_inodes
+};
+
 static int init_inodecache(void)
 {
 	ext4_inode_cachep = kmem_cache_create("ext4_inode_cache",
 					     sizeof(struct ext4_inode_info),
 					     0, (SLAB_RECLAIM_ACCOUNT|
 						SLAB_MEM_SPREAD),
-					     init_once, NULL);
+					     init_once,
+					     &ext4_kmem_cache_ops);
 	if (ext4_inode_cachep == NULL)
 		return -ENOMEM;
 	return 0;

-- 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* [patch 17/26] Slab defragmentation: Support inode defragmentation for xfs
  2007-06-18  9:58 [patch 00/26] Current slab allocator / SLUB patch queue clameter
                   ` (15 preceding siblings ...)
  2007-06-18  9:58 ` [patch 16/26] Slab defragmentation: Support defragmentation for extX filesystem inodes clameter
@ 2007-06-18  9:58 ` clameter
  2007-06-18  9:58 ` [patch 18/26] Slab defragmentation: Support procfs inode defragmentation clameter
                   ` (9 subsequent siblings)
  26 siblings, 0 replies; 73+ messages in thread
From: clameter @ 2007-06-18  9:58 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-mm, Pekka Enberg, suresh.b.siddha

[-- Attachment #1: slub_defrag_fs_xfs --]
[-- Type: text/plain, Size: 3279 bytes --]

Add slab defrag support.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 fs/xfs/linux-2.6/kmem.h      |    5 +++--
 fs/xfs/linux-2.6/xfs_buf.c   |    2 +-
 fs/xfs/linux-2.6/xfs_super.c |   13 ++++++++++++-
 fs/xfs/xfs_vfsops.c          |    6 +++---
 4 files changed, 19 insertions(+), 7 deletions(-)

Index: slub/fs/xfs/linux-2.6/kmem.h
===================================================================
--- slub.orig/fs/xfs/linux-2.6/kmem.h	2007-06-06 13:08:09.000000000 -0700
+++ slub/fs/xfs/linux-2.6/kmem.h	2007-06-06 13:32:58.000000000 -0700
@@ -79,9 +79,10 @@ kmem_zone_init(int size, char *zone_name
 
 static inline kmem_zone_t *
 kmem_zone_init_flags(int size, char *zone_name, unsigned long flags,
-		     void (*construct)(void *, kmem_zone_t *, unsigned long))
+		     void (*construct)(void *, kmem_zone_t *, unsigned long),
+		     const struct kmem_cache_ops *ops)
 {
-	return kmem_cache_create(zone_name, size, 0, flags, construct, NULL);
+	return kmem_cache_create(zone_name, size, 0, flags, construct, ops);
 }
 
 static inline void
Index: slub/fs/xfs/linux-2.6/xfs_buf.c
===================================================================
--- slub.orig/fs/xfs/linux-2.6/xfs_buf.c	2007-06-06 13:08:09.000000000 -0700
+++ slub/fs/xfs/linux-2.6/xfs_buf.c	2007-06-06 13:32:58.000000000 -0700
@@ -1834,7 +1834,7 @@ xfs_buf_init(void)
 #endif
 
 	xfs_buf_zone = kmem_zone_init_flags(sizeof(xfs_buf_t), "xfs_buf",
-						KM_ZONE_HWALIGN, NULL);
+						KM_ZONE_HWALIGN, NULL, NULL);
 	if (!xfs_buf_zone)
 		goto out_free_trace_buf;
 
Index: slub/fs/xfs/linux-2.6/xfs_super.c
===================================================================
--- slub.orig/fs/xfs/linux-2.6/xfs_super.c	2007-06-06 13:08:09.000000000 -0700
+++ slub/fs/xfs/linux-2.6/xfs_super.c	2007-06-06 13:32:58.000000000 -0700
@@ -355,13 +355,24 @@ xfs_fs_inode_init_once(
 	inode_init_once(vn_to_inode((bhv_vnode_t *)vnode));
 }
 
+static void *xfs_get_inodes(struct kmem_cache *s, int nr, void **v)
+{
+	return fs_get_inodes(s, nr, v, offsetof(bhv_vnode_t, v_inode));
+};
+
+static struct kmem_cache_ops xfs_kmem_cache_ops = {
+	.get = xfs_get_inodes,
+	.kick = kick_inodes
+};
+
 STATIC int
 xfs_init_zones(void)
 {
 	xfs_vnode_zone = kmem_zone_init_flags(sizeof(bhv_vnode_t), "xfs_vnode",
 					KM_ZONE_HWALIGN | KM_ZONE_RECLAIM |
 					KM_ZONE_SPREAD,
-					xfs_fs_inode_init_once);
+					xfs_fs_inode_init_once,
+					&xfs_kmem_cache_ops);
 	if (!xfs_vnode_zone)
 		goto out;
 
Index: slub/fs/xfs/xfs_vfsops.c
===================================================================
--- slub.orig/fs/xfs/xfs_vfsops.c	2007-06-06 15:19:52.000000000 -0700
+++ slub/fs/xfs/xfs_vfsops.c	2007-06-06 15:20:36.000000000 -0700
@@ -109,13 +109,13 @@ xfs_init(void)
 	xfs_inode_zone =
 		kmem_zone_init_flags(sizeof(xfs_inode_t), "xfs_inode",
 					KM_ZONE_HWALIGN | KM_ZONE_RECLAIM |
-					KM_ZONE_SPREAD, NULL);
+					KM_ZONE_SPREAD, NULL, NULL);
 	xfs_ili_zone =
 		kmem_zone_init_flags(sizeof(xfs_inode_log_item_t), "xfs_ili",
-					KM_ZONE_SPREAD, NULL);
+					KM_ZONE_SPREAD, NULL, NULL);
 	xfs_chashlist_zone =
 		kmem_zone_init_flags(sizeof(xfs_chashlist_t), "xfs_chashlist",
-					KM_ZONE_SPREAD, NULL);
+					KM_ZONE_SPREAD, NULL, NULL);
 
 	/*
 	 * Allocate global trace buffers.

-- 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* [patch 18/26] Slab defragmentation: Support procfs inode defragmentation
  2007-06-18  9:58 [patch 00/26] Current slab allocator / SLUB patch queue clameter
                   ` (16 preceding siblings ...)
  2007-06-18  9:58 ` [patch 17/26] Slab defragmentation: Support inode defragmentation for xfs clameter
@ 2007-06-18  9:58 ` clameter
  2007-06-18  9:58 ` [patch 19/26] Slab defragmentation: Support reiserfs " clameter
                   ` (8 subsequent siblings)
  26 siblings, 0 replies; 73+ messages in thread
From: clameter @ 2007-06-18  9:58 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-mm, Pekka Enberg, suresh.b.siddha

[-- Attachment #1: slub_defrag_fs_proc --]
[-- Type: text/plain, Size: 1096 bytes --]

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 fs/proc/inode.c |   22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

Index: slub/fs/proc/inode.c
===================================================================
--- slub.orig/fs/proc/inode.c	2007-06-04 20:12:56.000000000 -0700
+++ slub/fs/proc/inode.c	2007-06-04 21:35:00.000000000 -0700
@@ -112,14 +112,25 @@ static void init_once(void * foo, struct
 
 	inode_init_once(&ei->vfs_inode);
 }
- 
+
+static void *proc_get_inodes(struct kmem_cache *s, int nr, void **v)
+{
+	return fs_get_inodes(s, nr, v,
+			offsetof(struct proc_inode, vfs_inode));
+};
+
+static struct kmem_cache_ops proc_kmem_cache_ops = {
+	.get = proc_get_inodes,
+	.kick = kick_inodes
+};
+
 int __init proc_init_inodecache(void)
 {
 	proc_inode_cachep = kmem_cache_create("proc_inode_cache",
 					     sizeof(struct proc_inode),
 					     0, (SLAB_RECLAIM_ACCOUNT|
 						SLAB_MEM_SPREAD),
-					     init_once, NULL);
+					     init_once, &proc_kmem_cache_ops);
 	if (proc_inode_cachep == NULL)
 		return -ENOMEM;
 	return 0;

-- 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* [patch 19/26] Slab defragmentation: Support reiserfs inode defragmentation
  2007-06-18  9:58 [patch 00/26] Current slab allocator / SLUB patch queue clameter
                   ` (17 preceding siblings ...)
  2007-06-18  9:58 ` [patch 18/26] Slab defragmentation: Support procfs inode defragmentation clameter
@ 2007-06-18  9:58 ` clameter
  2007-06-18  9:58 ` [patch 20/26] Slab defragmentation: Support inode defragmentation for sockets clameter
                   ` (7 subsequent siblings)
  26 siblings, 0 replies; 73+ messages in thread
From: clameter @ 2007-06-18  9:58 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-mm, Pekka Enberg, suresh.b.siddha

[-- Attachment #1: slub_defrag_fs_reiser --]
[-- Type: text/plain, Size: 1168 bytes --]

Add inode defrag support.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 fs/reiserfs/super.c |   14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

Index: slub/fs/reiserfs/super.c
===================================================================
--- slub.orig/fs/reiserfs/super.c	2007-06-07 14:09:36.000000000 -0700
+++ slub/fs/reiserfs/super.c	2007-06-07 14:30:49.000000000 -0700
@@ -520,6 +520,17 @@ static void init_once(void *foo, struct 
 #endif
 }
 
+static void *reiserfs_get_inodes(struct kmem_cache *s, int nr, void **v)
+{
+	return fs_get_inodes(s, nr, v,
+			offsetof(struct reiserfs_inode_info, vfs_inode));
+}
+
+struct kmem_cache_ops reiserfs_kmem_cache_ops = {
+	.get = reiserfs_get_inodes,
+	.kick = kick_inodes
+};
+
 static int init_inodecache(void)
 {
 	reiserfs_inode_cachep = kmem_cache_create("reiser_inode_cache",
@@ -527,7 +538,8 @@ static int init_inodecache(void)
 							 reiserfs_inode_info),
 						  0, (SLAB_RECLAIM_ACCOUNT|
 							SLAB_MEM_SPREAD),
-						  init_once, NULL);
+						  init_once,
+						  &reiserfs_kmem_cache_ops);
 	if (reiserfs_inode_cachep == NULL)
 		return -ENOMEM;
 	return 0;

-- 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* [patch 20/26] Slab defragmentation: Support inode defragmentation for sockets
  2007-06-18  9:58 [patch 00/26] Current slab allocator / SLUB patch queue clameter
                   ` (18 preceding siblings ...)
  2007-06-18  9:58 ` [patch 19/26] Slab defragmentation: Support reiserfs " clameter
@ 2007-06-18  9:58 ` clameter
  2007-06-18  9:58 ` [patch 21/26] Slab defragmentation: support dentry defragmentation clameter
                   ` (6 subsequent siblings)
  26 siblings, 0 replies; 73+ messages in thread
From: clameter @ 2007-06-18  9:58 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-mm, Pekka Enberg, suresh.b.siddha

[-- Attachment #1: slub_defrag_fs_socket --]
[-- Type: text/plain, Size: 1086 bytes --]

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 net/socket.c |   13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

Index: slub/net/socket.c
===================================================================
--- slub.orig/net/socket.c	2007-06-06 15:19:29.000000000 -0700
+++ slub/net/socket.c	2007-06-06 15:20:54.000000000 -0700
@@ -264,6 +264,17 @@ static void init_once(void *foo, struct 
 	inode_init_once(&ei->vfs_inode);
 }
 
+static void *sock_get_inodes(struct kmem_cache *s, int nr, void **v)
+{
+	return fs_get_inodes(s, nr, v,
+		offsetof(struct socket_alloc, vfs_inode));
+}
+
+static struct kmem_cache_ops sock_kmem_cache_ops = {
+	.get = sock_get_inodes,
+	.kick = kick_inodes
+};
+
 static int init_inodecache(void)
 {
 	sock_inode_cachep = kmem_cache_create("sock_inode_cache",
@@ -273,7 +284,7 @@ static int init_inodecache(void)
 					       SLAB_RECLAIM_ACCOUNT |
 					       SLAB_MEM_SPREAD),
 					      init_once,
-					      NULL);
+					      &sock_kmem_cache_ops);
 	if (sock_inode_cachep == NULL)
 		return -ENOMEM;
 	return 0;

-- 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* [patch 21/26] Slab defragmentation: support dentry defragmentation
  2007-06-18  9:58 [patch 00/26] Current slab allocator / SLUB patch queue clameter
                   ` (19 preceding siblings ...)
  2007-06-18  9:58 ` [patch 20/26] Slab defragmentation: Support inode defragmentation for sockets clameter
@ 2007-06-18  9:58 ` clameter
  2007-06-26  8:18   ` Andrew Morton
  2007-06-18  9:59 ` [patch 22/26] SLUB: kmem_cache_vacate to support page allocator memory defragmentation clameter
                   ` (5 subsequent siblings)
  26 siblings, 1 reply; 73+ messages in thread
From: clameter @ 2007-06-18  9:58 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-mm, Pekka Enberg, suresh.b.siddha

[-- Attachment #1: slub_defrag_dentry --]
[-- Type: text/plain, Size: 4793 bytes --]

get() uses the dcache lock and then works with dget_locked to obtain a
reference to the dentry. An additional complication is that the dentry
may be in the process of being freed or it may just have been allocated.
We add an additional flag to d_flags so that we can determine the
status of an object.

kick() is called after get() has been used and after the slab has dropped
all of its own locks. The dentry pruning for unused entries works in a
straightforward way.
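
Roughly, the contract between the slab allocator and these callbacks is
sketched below. The driver function is illustrative only; the
ops->get()/ops->kick() hooks and their arguments are the ones used by
this patchset.

/*
 * Illustrative sketch of the calling sequence, not actual allocator code:
 * get() runs while the slab is locked so the objects cannot vanish,
 * kick() runs after the slab allocator has dropped its locks.
 */
static void vacate_objects(struct kmem_cache *s, int nr, void **v)
{
	void *private;

	/* Slab locks held: pin the objects (dget_locked for dentries). */
	private = s->ops->get(s, nr, v);

	/* ... the slab allocator drops its locks here ... */

	/* Now the callback may sleep, take dcache_lock, call dput() etc. */
	s->ops->kick(s, nr, v, private);
}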

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 fs/dcache.c            |  112 +++++++++++++++++++++++++++++++++++++++++++++----
 include/linux/dcache.h |    5 ++
 2 files changed, 109 insertions(+), 8 deletions(-)

Index: slub/fs/dcache.c
===================================================================
--- slub.orig/fs/dcache.c	2007-06-07 14:31:24.000000000 -0700
+++ slub/fs/dcache.c	2007-06-07 14:31:39.000000000 -0700
@@ -135,6 +135,7 @@ static struct dentry *d_kill(struct dent
 
 	list_del(&dentry->d_u.d_child);
 	dentry_stat.nr_dentry--;	/* For d_free, below */
+	dentry->d_flags &= ~DCACHE_ENTRY_VALID;
 	/*drops the locks, at that point nobody can reach this dentry */
 	dentry_iput(dentry);
 	parent = dentry->d_parent;
@@ -951,6 +952,7 @@ struct dentry *d_alloc(struct dentry * p
 	if (parent)
 		list_add(&dentry->d_u.d_child, &parent->d_subdirs);
 	dentry_stat.nr_dentry++;
+	dentry->d_flags |= DCACHE_ENTRY_VALID;
 	spin_unlock(&dcache_lock);
 
 	return dentry;
@@ -2108,18 +2110,112 @@ static void __init dcache_init_early(voi
 		INIT_HLIST_HEAD(&dentry_hashtable[loop]);
 }
 
+/*
+ * The slab is holding off frees. Thus we can safely examine
+ * the object without the danger of it vanishing from under us.
+ */
+static void *get_dentries(struct kmem_cache *s, int nr, void **v)
+{
+	struct dentry *dentry;
+	int i;
+
+	spin_lock(&dcache_lock);
+	for (i = 0; i < nr; i++) {
+		dentry = v[i];
+		/*
+		 * if DCACHE_ENTRY_VALID is not set then the dentry
+		 * may be already in the process of being freed.
+		 */
+		if (!(dentry->d_flags & DCACHE_ENTRY_VALID))
+			v[i] = NULL;
+		else
+			dget_locked(dentry);
+	}
+	spin_unlock(&dcache_lock);
+	return 0;
+}
+
+/*
+ * Slab has dropped all the locks. Get rid of the
+ * refcount we obtained earlier and also rid of the
+ * object.
+ */
+static void kick_dentries(struct kmem_cache *s, int nr, void **v, void *private)
+{
+	struct dentry *dentry;
+	int abort = 0;
+	int i;
+
+	/*
+	 * First invalidate the dentries without holding the dcache lock
+	 */
+	for (i = 0; i < nr; i++) {
+		dentry = v[i];
+
+		if (dentry)
+			d_invalidate(dentry);
+	}
+
+	/*
+	 * If we are the last one holding a reference then the dentries can
+	 * be freed. We  need the dcache_lock.
+	 */
+	spin_lock(&dcache_lock);
+	for (i = 0; i < nr; i++) {
+		dentry = v[i];
+		if (!dentry)
+			continue;
+
+		if (abort)
+			goto put_dentry;
+
+		spin_lock(&dentry->d_lock);
+		if (atomic_read(&dentry->d_count) > 1) {
+			/*
+			 * Reference count was increased.
+			 * We need to abandon the freeing of
+			 * objects.
+			 */
+			abort = 1;
+			spin_unlock(&dentry->d_lock);
+put_dentry:
+			spin_unlock(&dcache_lock);
+			dput(dentry);
+			spin_lock(&dcache_lock);
+			continue;
+		}
+
+		/* Remove from LRU */
+		if (!list_empty(&dentry->d_lru)) {
+			dentry_stat.nr_unused--;
+			list_del_init(&dentry->d_lru);
+		}
+		/* Drop the entry */
+		prune_one_dentry(dentry, 1);
+	}
+	spin_unlock(&dcache_lock);
+
+	/*
+	 * dentries are freed using RCU so we need to wait until RCU
+	 * operations are complete
+	 */
+	if (!abort)
+		synchronize_rcu();
+}
+
+static struct kmem_cache_ops dentry_kmem_cache_ops = {
+	.get = get_dentries,
+	.kick = kick_dentries,
+};
+
 static void __init dcache_init(unsigned long mempages)
 {
 	int loop;
 
-	/* 
-	 * A constructor could be added for stable state like the lists,
-	 * but it is probably not worth it because of the cache nature
-	 * of the dcache. 
-	 */
-	dentry_cache = KMEM_CACHE(dentry,
-		SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|SLAB_MEM_SPREAD);
-	
+	dentry_cache = KMEM_CACHE_OPS(dentry,
+		SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|SLAB_MEM_SPREAD,
+		&dentry_kmem_cache_ops);
+
 	register_shrinker(&dcache_shrinker);
 
 	/* Hash may have been set up in dcache_init_early */
Index: slub/include/linux/dcache.h
===================================================================
--- slub.orig/include/linux/dcache.h	2007-06-07 14:31:24.000000000 -0700
+++ slub/include/linux/dcache.h	2007-06-07 14:32:35.000000000 -0700
@@ -177,6 +177,11 @@ d_iput:		no		no		no       yes
 
 #define DCACHE_INOTIFY_PARENT_WATCHED	0x0020 /* Parent inode is watched */
 
+#define DCACHE_ENTRY_VALID	0x0040	/*
+					 * Entry is valid and not in the
+					 * process of being created or
+					 * destroyed.
+					 */
 extern spinlock_t dcache_lock;
 
 /**

-- 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* [patch 22/26] SLUB: kmem_cache_vacate to support page allocator memory defragmentation
  2007-06-18  9:58 [patch 00/26] Current slab allocator / SLUB patch queue clameter
                   ` (20 preceding siblings ...)
  2007-06-18  9:58 ` [patch 21/26] Slab defragmentation: support dentry defragmentation clameter
@ 2007-06-18  9:59 ` clameter
  2007-06-18  9:59 ` [patch 23/26] SLUB: Move sysfs operations outside of slub_lock clameter
                   ` (4 subsequent siblings)
  26 siblings, 0 replies; 73+ messages in thread
From: clameter @ 2007-06-18  9:59 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-mm, Pekka Enberg, suresh.b.siddha

[-- Attachment #1: slab_defrag_kmem_cache_vacate --]
[-- Type: text/plain, Size: 6976 bytes --]

Special function kmem_cache_vacate() to push out the objects in a
specified slab. In order to make that work we will have to handle
slab page allocations in such a way that we can determine whether a slab
is valid whenever we access it, regardless of where it is in its lifetime.

A valid slab that can be freed has PageSlab(page) and page->inuse > 0 set.
So we need to make sure in allocate_slab that page->inuse is zero before
PageSlab is set; otherwise kmem_cache_vacate may operate on a slab that
has not been properly set up yet.

There is currently no in-kernel user. The hope is that Mel's defragmentation
method can at some point use this functionality to make slabs movable
so that the reclaimable page type may no longer be necessary.
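
Since there is no caller yet, a hypothetical use would look roughly like
the sketch below; the function name and the page loop are made up for
illustration, only kmem_cache_vacate() itself comes from this patch.

/*
 * Hypothetical defrag pass over a range of pages. kmem_cache_vacate()
 * does all of the checking itself (PageSlab, frozen, page->inuse) and,
 * per the comment in the patch, returns 1 when it succeeded.
 */
static int vacate_slab_pages(struct page *start, int nr_pages)
{
	int i;
	int emptied = 0;

	for (i = 0; i < nr_pages; i++)
		if (kmem_cache_vacate(start + i))
			emptied++;

	return emptied;
}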

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 include/linux/slab.h |    1 
 mm/slab.c            |    9 ++++
 mm/slob.c            |    9 ++++
 mm/slub.c            |  109 ++++++++++++++++++++++++++++++++++++++++++++++-----
 4 files changed, 119 insertions(+), 9 deletions(-)

Index: linux-2.6.22-rc4-mm2/include/linux/slab.h
===================================================================
--- linux-2.6.22-rc4-mm2.orig/include/linux/slab.h	2007-06-17 18:12:29.000000000 -0700
+++ linux-2.6.22-rc4-mm2/include/linux/slab.h	2007-06-17 18:12:37.000000000 -0700
@@ -98,6 +98,7 @@ unsigned int kmem_cache_size(struct kmem
 const char *kmem_cache_name(struct kmem_cache *);
 int kmem_ptr_validate(struct kmem_cache *cachep, const void *ptr);
 int kmem_cache_defrag(int percentage, int node);
+int kmem_cache_vacate(struct page *);
 
 /*
  * Please use this macro to create slab caches. Simply specify the
Index: linux-2.6.22-rc4-mm2/mm/slab.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/mm/slab.c	2007-06-17 18:12:22.000000000 -0700
+++ linux-2.6.22-rc4-mm2/mm/slab.c	2007-06-17 18:12:37.000000000 -0700
@@ -2523,6 +2523,15 @@ int kmem_cache_defrag(int percent, int n
 	return 0;
 }
 
+/*
+ * SLAB does not support slab defragmentation
+ */
+int kmem_cache_vacate(struct page *page)
+{
+	return 0;
+}
+EXPORT_SYMBOL(kmem_cache_vacate);
+
 /**
  * kmem_cache_destroy - delete a cache
  * @cachep: the cache to destroy
Index: linux-2.6.22-rc4-mm2/mm/slob.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/mm/slob.c	2007-06-17 18:12:22.000000000 -0700
+++ linux-2.6.22-rc4-mm2/mm/slob.c	2007-06-17 18:12:37.000000000 -0700
@@ -558,6 +558,15 @@ int kmem_cache_defrag(int percentage, in
 	return 0;
 }
 
+/*
+ * SLOB does not support slab defragmentation
+ */
+int kmem_cache_vacate(struct page *page)
+{
+	return 0;
+}
+EXPORT_SYMBOL(kmem_cache_vacate);
+
 int kmem_ptr_validate(struct kmem_cache *a, const void *b)
 {
 	return 0;
Index: linux-2.6.22-rc4-mm2/mm/slub.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/mm/slub.c	2007-06-17 18:12:22.000000000 -0700
+++ linux-2.6.22-rc4-mm2/mm/slub.c	2007-06-17 18:12:37.000000000 -0700
@@ -1038,6 +1038,7 @@ static inline int slab_pad_check(struct 
 static inline int check_object(struct kmem_cache *s, struct page *page,
 			void *object, int active) { return 1; }
 static inline void add_full(struct kmem_cache_node *n, struct page *page) {}
+static inline void remove_full(struct kmem_cache *s, struct page *page) {}
 static inline void kmem_cache_open_debug_check(struct kmem_cache *s) {}
 #define slub_debug 0
 #endif
@@ -1103,12 +1104,11 @@ static struct page *new_slab(struct kmem
 	n = get_node(s, page_to_nid(page));
 	if (n)
 		atomic_long_inc(&n->nr_slabs);
+
+	page->inuse = 0;
+	page->lockless_freelist = NULL;
 	page->offset = s->offset / sizeof(void *);
 	page->slab = s;
-	page->flags |= 1 << PG_slab;
-	if (s->flags & (SLAB_DEBUG_FREE | SLAB_RED_ZONE | SLAB_POISON |
-			SLAB_STORE_USER | SLAB_TRACE))
-		SetSlabDebug(page);
 
 	start = page_address(page);
 	end = start + s->objects * s->size;
@@ -1126,11 +1126,20 @@ static struct page *new_slab(struct kmem
 	set_freepointer(s, last, NULL);
 
 	page->freelist = start;
-	page->lockless_freelist = NULL;
-	page->inuse = 0;
-out:
-	if (flags & __GFP_WAIT)
-		local_irq_disable();
+
+	/*
+	 * page->inuse must be 0 when PageSlab(page) becomes
+	 * true so that defrag knows that this slab is not in use.
+	 */
+	smp_wmb();
+	__SetPageSlab(page);
+	if (s->flags & (SLAB_DEBUG_FREE | SLAB_RED_ZONE | SLAB_POISON |
+			SLAB_STORE_USER | SLAB_TRACE))
+		SetSlabDebug(page);
+
+ out:
+	if (flags & __GFP_WAIT)
+		local_irq_disable();
 	return page;
 }
 
@@ -2654,6 +2663,88 @@ static unsigned long __kmem_cache_shrink
 }
 
 /*
+ * Get a page off a list and freeze it. Must be holding slab lock.
+ */
+static void freeze_from_list(struct kmem_cache *s, struct page *page)
+{
+	if (page->inuse < s->objects)
+		remove_partial(s, page);
+	else if (s->flags & SLAB_STORE_USER)
+		remove_full(s, page);
+	SetSlabFrozen(page);
+}
+
+/*
+ * Attempt to free objects in a page. Return 1 if successful.
+ */
+int kmem_cache_vacate(struct page *page)
+{
+	unsigned long flags;
+	struct kmem_cache *s;
+	int vacated = 0;
+	void **vector = NULL;
+
+	/*
+	 * Get a reference to the page. Return if its freed or being freed.
+	 * This is necessary to make sure that the page does not vanish
+	 * from under us before we are able to check the result.
+	 */
+	if (!get_page_unless_zero(page))
+		return 0;
+
+	if (!PageSlab(page))
+		goto out;
+
+	s = page->slab;
+	if (!s)
+		goto out;
+
+	vector = kmalloc(s->objects * sizeof(void *), GFP_KERNEL);
+	if (!vector)
+		goto out2;
+
+	local_irq_save(flags);
+	/*
+	 * The implicit memory barrier in slab_lock guarantees that page->inuse
+	 * is loaded after PageSlab(page) has been established to be true.
+	 * Only relevant for a newly created slab.
+	 */
+	slab_lock(page);
+
+	/*
+	 * We may now have locked a page that may be in various stages of
+	 * being freed. If the PageSlab bit is off then we have already
+	 * reached the page allocator. If page->inuse is zero then we are
+	 * in SLUB but freeing or allocating the page.
+	 * page->inuse is never modified without the slab lock held.
+	 *
+	 * Also abort if the page happens to be already frozen. If its
+	 * frozen then a concurrent vacate may be in progress.
+	 */
+	if (!PageSlab(page) || SlabFrozen(page) || !page->inuse)
+		goto out_locked;
+
+	/*
+	 * We are holding a lock on a slab page and all operations on the
+	 * slab are blocking.
+	 */
+	if (!s->ops->get || !s->ops->kick)
+		goto out_locked;
+	freeze_from_list(s, page);
+	vacated = __kmem_cache_vacate(s, page, flags, vector);
+out:
+	kfree(vector);
+out2:
+	put_page(page);
+	return vacated == 0;
+out_locked:
+	slab_unlock(page);
+	local_irq_restore(flags);
+	goto out;
+
+}
+
+/*
  * kmem_cache_shrink removes empty slabs from the partial lists and sorts
  * the remaining slabs by the number of items in use. The slabs with the
  * most items in use come first. New allocations will then fill those up

-- 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* [patch 23/26] SLUB: Move sysfs operations outside of slub_lock
  2007-06-18  9:58 [patch 00/26] Current slab allocator / SLUB patch queue clameter
                   ` (21 preceding siblings ...)
  2007-06-18  9:59 ` [patch 22/26] SLUB: kmem_cache_vacate to support page allocator memory defragmentation clameter
@ 2007-06-18  9:59 ` clameter
  2007-06-18  9:59 ` [patch 24/26] SLUB: Avoid page struct cacheline bouncing due to remote frees to cpu slab clameter
                   ` (3 subsequent siblings)
  26 siblings, 0 replies; 73+ messages in thread
From: clameter @ 2007-06-18  9:59 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-mm, Pekka Enberg, suresh.b.siddha

[-- Attachment #1: slub_lock_cleanup --]
[-- Type: text/plain, Size: 1931 bytes --]

Sysfs can do a gazillion things when called. Make sure that we do
not call any sysfs functions while holding the slub_lock.

Just protect the essentials:

1. The list of all slab caches
2. The kmalloc_dma array
3. The ref counters of the slabs.
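
The resulting pattern is sketched below, condensed from the diff with error
handling left out: the cache list is manipulated under slub_lock, while the
sysfs calls are made only after the lock has been dropped.

	down_write(&slub_lock);
	list_add(&s->list, &slab_caches);	/* protected: list of all caches */
	up_write(&slub_lock);

	sysfs_slab_add(s);			/* called without slub_lock held */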

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 mm/slub.c |   34 +++++++++++++++++++++-------------
 1 file changed, 21 insertions(+), 13 deletions(-)

Index: linux-2.6.22-rc4-mm2/mm/slub.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/mm/slub.c	2007-06-17 18:12:37.000000000 -0700
+++ linux-2.6.22-rc4-mm2/mm/slub.c	2007-06-17 18:12:44.000000000 -0700
@@ -2217,12 +2217,13 @@ void kmem_cache_destroy(struct kmem_cach
 	s->refcount--;
 	if (!s->refcount) {
 		list_del(&s->list);
+		up_write(&slub_lock);
 		if (kmem_cache_close(s))
 			WARN_ON(1);
 		sysfs_slab_remove(s);
 		kfree(s);
-	}
-	up_write(&slub_lock);
+	} else
+		up_write(&slub_lock);
 }
 EXPORT_SYMBOL(kmem_cache_destroy);
 
@@ -3027,26 +3028,33 @@ struct kmem_cache *kmem_cache_create(con
 		 */
 		s->objsize = max(s->objsize, (int)size);
 		s->inuse = max_t(int, s->inuse, ALIGN(size, sizeof(void *)));
+		up_write(&slub_lock);
+
 		if (sysfs_slab_alias(s, name))
 			goto err;
-	} else {
-		s = kmalloc(kmem_size, GFP_KERNEL);
-		if (s && kmem_cache_open(s, GFP_KERNEL, name,
+
+		return s;
+	}
+
+	s = kmalloc(kmem_size, GFP_KERNEL);
+	if (s) {
+		if (kmem_cache_open(s, GFP_KERNEL, name,
 				size, align, flags, ctor, ops)) {
-			if (sysfs_slab_add(s)) {
-				kfree(s);
-				goto err;
-			}
 			list_add(&s->list, &slab_caches);
+			up_write(&slub_lock);
 			raise_kswapd_order(s->order);
-		} else
-			kfree(s);
+
+			if (sysfs_slab_add(s))
+				goto err;
+
+			return s;
+
+		}
+		kfree(s);
 	}
 	up_write(&slub_lock);
-	return s;
 
 err:
-	up_write(&slub_lock);
 	if (flags & SLAB_PANIC)
 		panic("Cannot create slabcache %s\n", name);
 	else

-- 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* [patch 24/26] SLUB: Avoid page struct cacheline bouncing due to remote frees to cpu slab
  2007-06-18  9:58 [patch 00/26] Current slab allocator / SLUB patch queue clameter
                   ` (22 preceding siblings ...)
  2007-06-18  9:59 ` [patch 23/26] SLUB: Move sysfs operations outside of slub_lock clameter
@ 2007-06-18  9:59 ` clameter
  2007-06-18  9:59 ` [patch 25/26] SLUB: Add an object counter to the kmem_cache_cpu structure clameter
                   ` (2 subsequent siblings)
  26 siblings, 0 replies; 73+ messages in thread
From: clameter @ 2007-06-18  9:59 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-mm, Pekka Enberg, suresh.b.siddha

[-- Attachment #1: slub_performance_conc_free_alloc --]
[-- Type: text/plain, Size: 14853 bytes --]

A remote free may access the same page struct that also contains the lockless
freelist for the cpu slab. If objects have a short lifetime and are freed by
a different processor, then remote frees back to the slab from which we are
currently allocating are frequent. The cacheline with the page struct needs
to be repeatedly acquired in exclusive mode by both the allocating thread and
the freeing thread. If this is frequent enough then performance will suffer
because of cacheline bouncing.

This patch puts the lockless_freelist pointer in its own cacheline. In
order to make that happen we introduce a per cpu structure called
kmem_cache_cpu.

Instead of keeping an array of pointers to page structs we now keep an array
of per cpu structures that--among other things--contain the pointer to the
lockless freelist. The freeing thread can then keep possession of exclusive
access to the page struct cacheline while the allocating thread keeps its
exclusive access to the cacheline containing the per cpu structure.
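
Condensed, the new per cpu structure looks as follows (the full version is
in the diff below); putting the lockless freelist into its own
cacheline-aligned structure is what separates it from the page struct that
remote frees write to.

struct kmem_cache_cpu {
	void **lockless_freelist;	/* written by the allocating cpu */
	struct page *page;		/* the current cpu slab */
	int node;
	/* Lots of wasted space */
} ____cacheline_aligned_in_smp;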

This works as long as the allocating cpu is able to service its request
from the lockless freelist. If the lockless freelist runs empty then the
allocating thread needs to acquire exclusive access to the cacheline with
the page struct lock the slab.

The allocating thread will then check if new objects were freed to the per
cpu slab. If so it will keep the slab as the cpu slab and continue with the
recently remotely freed objects. So the allocating thread can take a series
of objects just freed by remote cpus and dish them out again. Ideally allocations
could be just recycling objects in the same slab this way which will lead
to an ideal allocation / remote free pattern.

The number of objects that can be treated like that is limited by the
capacity of one slab. Increasing slab size via slub_min_objects/
slub_max_order may increase the number of objects if necessary.

If the allocating thread runs out of objects and finds that no objects were
put back by the remote processor then it will retrieve a new slab (from the
partial lists or from the page allocator) and start with a whole
new set of objects while the remote thread may still be freeing objects to
the old cpu slab. This may then repeat until the new slab is also exhausted.
If remote freeing has freed objects in the earlier slab then that earlier
slab will now be on the partial freelist and the allocating thread will
pick that slab next for allocation. So the loop is extended. However,
both threads need to take the list_lock to make the swizzling via
the partial list happen.

It is likely that this kind of scheme will keep the set of objects being
passed around small enough to stay in the cpu caches, leading to increased
performance.

More code cleanups become possible:

- Instead of passing a cpu we can now pass a kmem_cache_cpu structure around.
  Allows reducing the number of parameters to various functions.
- Can define a new node_match() function for NUMA to encapsulate locality
  checks.


Effect on allocations:

Cachelines touched before this patch:

	Write:	page cache struct and first cacheline of object

Cachelines touched after this patch:

	Write:	kmem_cache_cpu cacheline and first cacheline of object


The handling when the lockless alloc list runs empty gets to be a bit more
complicated since another cacheline now has to be written to. But that is
halfway out of the hot path.


Effect on freeing:

Cachelines touched before this patch:

	Write: page_struct and first cacheline of object

Cachelines touched after this patch depending on how we free:

  Write(to cpu_slab):	kmem_cache_cpu struct and first cacheline of object
  Write(to other):	page struct and first cacheline of object

  Read(to cpu_slab):	page struct to id slab etc.
  Read(to other):	cpu local kmem_cache_cpu struct to verify its not
  			the cpu slab.



Summary:

Pro:
	- Distinct cachelines so that concurrent remote frees and local
	  allocs on a cpuslab can occur without cacheline bouncing.
	- Avoids potential cacheline bouncing caused by neighboring
	  per cpu pointer updates in kmem_cache's cpu_slab structure, since
	  each entry now grows to a full cacheline (therefore the comment
	  that talks about that concern is removed).

Cons:
	- Freeing objects now requires the reading of one additional
	  cacheline.

	- Memory usage grows slightly.

	The size of each per cpu object is blown up from one word
	(pointing to the page struct) to one cacheline with various data.
	So this is roughly NR_CPUS*NR_SLABS*L1_BYTES more memory use. Let's say
	NR_SLABS is 100 and the cache line size is 128 bytes; then we have just
	increased slab metadata requirements by roughly 12.8k per cpu.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 include/linux/slub_def.h |    9 ++
 mm/slub.c                |  164 +++++++++++++++++++++++++----------------------
 2 files changed, 97 insertions(+), 76 deletions(-)

Index: linux-2.6.22-rc4-mm2/include/linux/slub_def.h
===================================================================
--- linux-2.6.22-rc4-mm2.orig/include/linux/slub_def.h	2007-06-17 18:12:19.000000000 -0700
+++ linux-2.6.22-rc4-mm2/include/linux/slub_def.h	2007-06-17 18:12:48.000000000 -0700
@@ -11,6 +11,13 @@
 #include <linux/workqueue.h>
 #include <linux/kobject.h>
 
+struct kmem_cache_cpu {
+	void **lockless_freelist;
+	struct page *page;
+	int node;
+	/* Lots of wasted space */
+} ____cacheline_aligned_in_smp;
+
 struct kmem_cache_node {
 	spinlock_t list_lock;	/* Protect partial list and nr_partial */
 	unsigned long nr_partial;
@@ -55,7 +62,7 @@ struct kmem_cache {
 	int defrag_ratio;
 	struct kmem_cache_node *node[MAX_NUMNODES];
 #endif
-	struct page *cpu_slab[NR_CPUS];
+	struct kmem_cache_cpu cpu_slab[NR_CPUS];
 };
 
 /*
Index: linux-2.6.22-rc4-mm2/mm/slub.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/mm/slub.c	2007-06-17 18:12:44.000000000 -0700
+++ linux-2.6.22-rc4-mm2/mm/slub.c	2007-06-17 18:12:48.000000000 -0700
@@ -140,11 +140,6 @@ static inline void ClearSlabDebug(struct
 /*
  * Issues still to be resolved:
  *
- * - The per cpu array is updated for each new slab and and is a remote
- *   cacheline for most nodes. This could become a bouncing cacheline given
- *   enough frequent updates. There are 16 pointers in a cacheline, so at
- *   max 16 cpus could compete for the cacheline which may be okay.
- *
  * - Support PAGE_ALLOC_DEBUG. Should be easy to do.
  *
  * - Variable sizing of the per node arrays
@@ -283,6 +278,11 @@ static inline struct kmem_cache_node *ge
 #endif
 }
 
+static inline struct kmem_cache_cpu *get_cpu_slab(struct kmem_cache *s, int cpu)
+{
+	return &s->cpu_slab[cpu];
+}
+
 static inline int check_valid_pointer(struct kmem_cache *s,
 				struct page *page, const void *object)
 {
@@ -1395,33 +1395,34 @@ static void unfreeze_slab(struct kmem_ca
 /*
  * Remove the cpu slab
  */
-static void deactivate_slab(struct kmem_cache *s, struct page *page, int cpu)
+static void deactivate_slab(struct kmem_cache *s, struct kmem_cache_cpu *c)
 {
+	struct page *page = c->page;
 	/*
 	 * Merge cpu freelist into freelist. Typically we get here
 	 * because both freelists are empty. So this is unlikely
 	 * to occur.
 	 */
-	while (unlikely(page->lockless_freelist)) {
+	while (unlikely(c->lockless_freelist)) {
 		void **object;
 
 		/* Retrieve object from cpu_freelist */
-		object = page->lockless_freelist;
-		page->lockless_freelist = page->lockless_freelist[page->offset];
+		object = c->lockless_freelist;
+		c->lockless_freelist = c->lockless_freelist[page->offset];
 
 		/* And put onto the regular freelist */
 		object[page->offset] = page->freelist;
 		page->freelist = object;
 		page->inuse--;
 	}
-	s->cpu_slab[cpu] = NULL;
+	c->page = NULL;
 	unfreeze_slab(s, page);
 }
 
-static inline void flush_slab(struct kmem_cache *s, struct page *page, int cpu)
+static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c)
 {
-	slab_lock(page);
-	deactivate_slab(s, page, cpu);
+	slab_lock(c->page);
+	deactivate_slab(s, c);
 }
 
 /*
@@ -1430,18 +1431,17 @@ static inline void flush_slab(struct kme
  */
 static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu)
 {
-	struct page *page = s->cpu_slab[cpu];
+	struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
 
-	if (likely(page))
-		flush_slab(s, page, cpu);
+	if (likely(c && c->page))
+		flush_slab(s, c);
 }
 
 static void flush_cpu_slab(void *d)
 {
 	struct kmem_cache *s = d;
-	int cpu = smp_processor_id();
 
-	__flush_cpu_slab(s, cpu);
+	__flush_cpu_slab(s, smp_processor_id());
 }
 
 static void flush_all(struct kmem_cache *s)
@@ -1458,6 +1458,19 @@ static void flush_all(struct kmem_cache 
 }
 
 /*
+ * Check if the objects in a per cpu structure fit numa
+ * locality expectations.
+ */
+static inline int node_match(struct kmem_cache_cpu *c, int node)
+{
+#ifdef CONFIG_NUMA
+	if (node != -1 && c->node != node)
+		return 0;
+#endif
+	return 1;
+}
+
+/*
  * Slow path. The lockless freelist is empty or we need to perform
  * debugging duties.
  *
@@ -1475,45 +1488,46 @@ static void flush_all(struct kmem_cache 
  * we need to allocate a new slab. This is slowest path since we may sleep.
  */
 static void *__slab_alloc(struct kmem_cache *s,
-		gfp_t gfpflags, int node, void *addr, struct page *page)
+		gfp_t gfpflags, int node, void *addr, struct kmem_cache_cpu *c)
 {
 	void **object;
-	int cpu = smp_processor_id();
+	struct page *new;
 
-	if (!page)
+	if (!c->page)
 		goto new_slab;
 
-	slab_lock(page);
-	if (unlikely(node != -1 && page_to_nid(page) != node))
+	slab_lock(c->page);
+	if (unlikely(!node_match(c, node)))
 		goto another_slab;
 load_freelist:
-	object = page->freelist;
+	object = c->page->freelist;
 	if (unlikely(!object))
 		goto another_slab;
-	if (unlikely(SlabDebug(page)))
+	if (unlikely(SlabDebug(c->page)))
 		goto debug;
 
-	object = page->freelist;
-	page->lockless_freelist = object[page->offset];
-	page->inuse = s->objects;
-	page->freelist = NULL;
-	slab_unlock(page);
+	object = c->page->freelist;
+	c->lockless_freelist = object[c->page->offset];
+	c->page->inuse = s->objects;
+	c->page->freelist = NULL;
+	c->node = page_to_nid(c->page);
+	slab_unlock(c->page);
 	return object;
 
 another_slab:
-	deactivate_slab(s, page, cpu);
+	deactivate_slab(s, c);
 
 new_slab:
-	page = get_partial(s, gfpflags, node);
-	if (page) {
-		s->cpu_slab[cpu] = page;
+	new = get_partial(s, gfpflags, node);
+	if (new) {
+		c->page = new;
 		goto load_freelist;
 	}
 
-	page = new_slab(s, gfpflags, node);
-	if (page) {
-		cpu = smp_processor_id();
-		if (s->cpu_slab[cpu]) {
+	new = new_slab(s, gfpflags, node);
+	if (new) {
+		c = get_cpu_slab(s, smp_processor_id());
+		if (c->page) {
 			/*
 			 * Someone else populated the cpu_slab while we
 			 * enabled interrupts, or we have gotten scheduled
@@ -1521,34 +1535,32 @@ new_slab:
 			 * requested node even if __GFP_THISNODE was
 			 * specified. So we need to recheck.
 			 */
-			if (node == -1 ||
-				page_to_nid(s->cpu_slab[cpu]) == node) {
+			if (node_match(c, node)) {
 				/*
 				 * Current cpuslab is acceptable and we
 				 * want the current one since its cache hot
 				 */
-				discard_slab(s, page);
-				page = s->cpu_slab[cpu];
-				slab_lock(page);
+				discard_slab(s, new);
+				slab_lock(c->page);
 				goto load_freelist;
 			}
 			/* New slab does not fit our expectations */
-			flush_slab(s, s->cpu_slab[cpu], cpu);
+			flush_slab(s, c);
 		}
-		slab_lock(page);
-		SetSlabFrozen(page);
-		s->cpu_slab[cpu] = page;
+		slab_lock(new);
+		SetSlabFrozen(new);
+		c->page = new;
 		goto load_freelist;
 	}
 	return NULL;
 debug:
-	object = page->freelist;
-	if (!alloc_debug_processing(s, page, object, addr))
+	object = c->page->freelist;
+	if (!alloc_debug_processing(s, c->page, object, addr))
 		goto another_slab;
 
-	page->inuse++;
-	page->freelist = object[page->offset];
-	slab_unlock(page);
+	c->page->inuse++;
+	c->page->freelist = object[c->page->offset];
+	slab_unlock(c->page);
 	return object;
 }
 
@@ -1565,20 +1577,20 @@ debug:
 static void __always_inline *slab_alloc(struct kmem_cache *s,
 		gfp_t gfpflags, int node, void *addr, int length)
 {
-	struct page *page;
 	void **object;
 	unsigned long flags;
+	struct kmem_cache_cpu *c;
 
 	local_irq_save(flags);
-	page = s->cpu_slab[smp_processor_id()];
-	if (unlikely(!page || !page->lockless_freelist ||
-			(node != -1 && page_to_nid(page) != node)))
+	c = get_cpu_slab(s, smp_processor_id());
+	if (unlikely(!c->page || !c->lockless_freelist ||
+					!node_match(c, node)))
 
-		object = __slab_alloc(s, gfpflags, node, addr, page);
+		object = __slab_alloc(s, gfpflags, node, addr, c);
 
 	else {
-		object = page->lockless_freelist;
-		page->lockless_freelist = object[page->offset];
+		object = c->lockless_freelist;
+		c->lockless_freelist = object[c->page->offset];
 	}
 	local_irq_restore(flags);
 
@@ -1678,12 +1690,13 @@ static void __always_inline slab_free(st
 {
 	void **object = (void *)x;
 	unsigned long flags;
+	struct kmem_cache_cpu *c;
 
 	local_irq_save(flags);
-	if (likely(page == s->cpu_slab[smp_processor_id()] &&
-						!SlabDebug(page))) {
-		object[page->offset] = page->lockless_freelist;
-		page->lockless_freelist = object;
+	c = get_cpu_slab(s, smp_processor_id());
+	if (likely(page == c->page && !SlabDebug(page))) {
+		object[page->offset] = c->lockless_freelist;
+		c->lockless_freelist = object;
 	} else
 		__slab_free(s, page, x, addr);
 
@@ -2931,7 +2944,7 @@ void __init kmem_cache_init(void)
 #endif
 
 	kmem_size = offsetof(struct kmem_cache, cpu_slab) +
-				nr_cpu_ids * sizeof(struct page *);
+				nr_cpu_ids * sizeof(struct kmem_cache_cpu);
 
 	printk(KERN_INFO "SLUB: Genslabs=%d, HWalign=%d, Order=%d-%d,"
 		" MinObjects=%d, CPUs=%d, Nodes=%d\n",
@@ -3518,22 +3531,20 @@ static unsigned long slab_objects(struct
 	per_cpu = nodes + nr_node_ids;
 
 	for_each_possible_cpu(cpu) {
-		struct page *page = s->cpu_slab[cpu];
-		int node;
+		struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
 
-		if (page) {
-			node = page_to_nid(page);
+		if (c && c->page) {
 			if (flags & SO_CPU) {
 				int x = 0;
 
 				if (flags & SO_OBJECTS)
-					x = page->inuse;
+					x = c->page->inuse;
 				else
 					x = 1;
 				total += x;
-				nodes[node] += x;
+				nodes[c->node] += x;
 			}
-			per_cpu[node]++;
+			per_cpu[c->node]++;
 		}
 	}
 
@@ -3579,14 +3590,17 @@ static int any_slab_objects(struct kmem_
 	int node;
 	int cpu;
 
-	for_each_possible_cpu(cpu)
-		if (s->cpu_slab[cpu])
+	for_each_possible_cpu(cpu) {
+		struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+
+		if (c && c->page)
 			return 1;
+	}
 
-	for_each_node(node) {
+	for_each_online_node(node) {
 		struct kmem_cache_node *n = get_node(s, node);
 
-		if (n->nr_partial || atomic_read(&n->nr_slabs))
+		if (n && (n->nr_partial || atomic_read(&n->nr_slabs)))
 			return 1;
 	}
 	return 0;

-- 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* [patch 25/26] SLUB: Add an object counter to the kmem_cache_cpu structure
  2007-06-18  9:58 [patch 00/26] Current slab allocator / SLUB patch queue clameter
                   ` (23 preceding siblings ...)
  2007-06-18  9:59 ` [patch 24/26] SLUB: Avoid page struct cacheline bouncing due to remote frees to cpu slab clameter
@ 2007-06-18  9:59 ` clameter
  2007-06-18  9:59 ` [patch 26/26] SLUB: Place kmem_cache_cpu structures in a NUMA aware way clameter
  2007-06-18 11:57 ` [patch 00/26] Current slab allocator / SLUB patch queue Michal Piotrowski
  26 siblings, 0 replies; 73+ messages in thread
From: clameter @ 2007-06-18  9:59 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-mm, Pekka Enberg, suresh.b.siddha

[-- Attachment #1: slub_performance_cpuslab_counter --]
[-- Type: text/plain, Size: 4500 bytes --]

The kmem_cache_cpu structure is now 2 1/2 words. Allocation sizes are rounded
to word boundaries so we can place an additional integer in the
kmem_cache_cpu structure without increasing its size.

The counter is useful to keep track of the number of objects left in the
lockless per cpu list. If we have this number then the merging of the
per cpu objects back into the slab (when a slab is deactivated) can be
very fast since we no longer need to count the objects.

Pros:
	- The benefit is that requests from a single processor that rapidly
	  change node numbers are handled better on NUMA.
	  Switching from a slab on one node to another becomes faster
	  since spilling objects back into the slab is simplified.

Cons:
	- Additional need to increase and decrease a counter in slab_alloc
	  and slab_free. But the counter is in a cacheline already written to
	  so it is cheap to do.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 include/linux/slub_def.h |    1 
 mm/slub.c                |   48 ++++++++++++++++++++++++++++++++++++-----------
 2 files changed, 38 insertions(+), 11 deletions(-)

Index: linux-2.6.22-rc4-mm2/include/linux/slub_def.h
===================================================================
--- linux-2.6.22-rc4-mm2.orig/include/linux/slub_def.h	2007-06-17 23:51:36.000000000 -0700
+++ linux-2.6.22-rc4-mm2/include/linux/slub_def.h	2007-06-18 00:45:54.000000000 -0700
@@ -14,6 +14,7 @@
 struct kmem_cache_cpu {
 	void **lockless_freelist;
 	struct page *page;
+	int objects;	/* Saved page->inuse */
 	int node;
 	/* Lots of wasted space */
 } ____cacheline_aligned_in_smp;
Index: linux-2.6.22-rc4-mm2/mm/slub.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/mm/slub.c	2007-06-18 00:40:04.000000000 -0700
+++ linux-2.6.22-rc4-mm2/mm/slub.c	2007-06-18 00:45:54.000000000 -0700
@@ -1398,23 +1398,47 @@ static void unfreeze_slab(struct kmem_ca
 static void deactivate_slab(struct kmem_cache *s, struct kmem_cache_cpu *c)
 {
 	struct page *page = c->page;
+
 	/*
 	 * Merge cpu freelist into freelist. Typically we get here
 	 * because both freelists are empty. So this is unlikely
 	 * to occur.
 	 */
-	while (unlikely(c->lockless_freelist)) {
-		void **object;
+	if (unlikely(c->lockless_freelist)) {
 
-		/* Retrieve object from cpu_freelist */
-		object = c->lockless_freelist;
-		c->lockless_freelist = c->lockless_freelist[page->offset];
+		/*
+		 * Special case in which no remote frees have occurred.
+		 * Then we can simply have the lockless_freelist become
+		 * the page->freelist and put the counter back.
+		 */
+		if (!page->freelist) {
+			page->freelist = c->lockless_freelist;
+			page->inuse = c->objects;
+			c->lockless_freelist = NULL;
+		} else {
 
-		/* And put onto the regular freelist */
-		object[page->offset] = page->freelist;
-		page->freelist = object;
-		page->inuse--;
+			/*
+			 * Objects both on page freelist and cpu freelist.
+			 * We need to merge both lists. By doing that
+			 * we reverse the object order in the slab.
+			 * Sigh. But we rarely get here.
+			 */
+			while (c->lockless_freelist) {
+				void **object;
+
+				/* Retrieve object from lockless freelist */
+				object = c->lockless_freelist;
+				c->lockless_freelist =
+					c->lockless_freelist[page->offset];
+
+				/* And put onto the regular freelist */
+				object[page->offset] = page->freelist;
+				page->freelist = object;
+				page->inuse--;
+			}
+		}
 	}
+
 	c->page = NULL;
 	unfreeze_slab(s, page);
 }
@@ -1508,6 +1532,7 @@ load_freelist:
 
 	object = c->page->freelist;
 	c->lockless_freelist = object[c->page->offset];
+	c->objects = c->page->inuse + 1;
 	c->page->inuse = s->objects;
 	c->page->freelist = NULL;
 	c->node = page_to_nid(c->page);
@@ -1583,14 +1608,14 @@ static void __always_inline *slab_alloc(
 
 	local_irq_save(flags);
 	c = get_cpu_slab(s, smp_processor_id());
-	if (unlikely(!c->page || !c->lockless_freelist ||
-					!node_match(c, node)))
+	if (unlikely(!c->lockless_freelist || !node_match(c, node)))
 
 		object = __slab_alloc(s, gfpflags, node, addr, c);
 
 	else {
 		object = c->lockless_freelist;
 		c->lockless_freelist = object[c->page->offset];
+		c->objects++;
 	}
 	local_irq_restore(flags);
 
@@ -1697,6 +1722,7 @@ static void __always_inline slab_free(st
 	if (likely(page == c->page && !SlabDebug(page))) {
 		object[page->offset] = c->lockless_freelist;
 		c->lockless_freelist = object;
+		c->objects--;
 	} else
 		__slab_free(s, page, x, addr);
 

-- 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* [patch 26/26] SLUB: Place kmem_cache_cpu structures in a NUMA aware way.
  2007-06-18  9:58 [patch 00/26] Current slab allocator / SLUB patch queue clameter
                   ` (24 preceding siblings ...)
  2007-06-18  9:59 ` [patch 25/26] SLUB: Add an object counter to the kmem_cache_cpu structure clameter
@ 2007-06-18  9:59 ` clameter
  2007-06-19 23:17   ` Christoph Lameter
  2007-06-18 11:57 ` [patch 00/26] Current slab allocator / SLUB patch queue Michal Piotrowski
  26 siblings, 1 reply; 73+ messages in thread
From: clameter @ 2007-06-18  9:59 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-mm, Pekka Enberg, suresh.b.siddha

[-- Attachment #1: slub_performance_numa_placement --]
[-- Type: text/plain, Size: 8726 bytes --]

The kmem_cache_cpu structures introduced are currently an array placed in
the kmem_cache struct. This means that the kmem_cache_cpu structures are
overwhelmingly on the wrong node for systems with a larger number of nodes.
These are performance critical structures since the per cpu information
has to be touched for every alloc and free in a slab.

In order to place the kmem_cache_cpu structure optimally we put an array
of pointers to kmem_cache_cpu structs in kmem_cache (similar to SLAB).

The kmem_cache_cpu structures can now be allocated in a more intelligent way.
We could put per cpu structures for the same cpu but different
slab caches in cachelines together to save space and decrease the cache
footprint. However, the slab allocator itself controls only allocations
per node. Thus we set up a simple static array of 100 per cpu structures
for every processor, which is usually enough to get them all set up right.
If we run out then we fall back to kmalloc_node. This also solves the
bootstrap problem since we do not have to use slab allocator functions
early in boot to get memory for the small per cpu structures.
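
Condensed, the allocation strategy is as follows (full version in the diff
below; free entries of the static pool are chained through their
lockless_freelist pointer):

	struct kmem_cache_cpu *c = per_cpu(kmem_cache_cpu_free, cpu);

	if (c)		/* take one from the per cpu pool */
		per_cpu(kmem_cache_cpu_free, cpu) = (void *)c->lockless_freelist;
	else		/* pool exhausted: NUMA aware fallback */
		c = kmalloc_node(
			ALIGN(sizeof(struct kmem_cache_cpu), cache_line_size()),
			flags, cpu_to_node(cpu));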

Pro:
	- NUMA aware placement improves memory performance
	- All global structures in struct kmem_cache become readonly
	- Dense packing of per cpu structures reduces cacheline
	  footprint in SMP and NUMA.
	- Potential avoidance of exclusive cacheline fetches
	  on the free and alloc hotpath since multiple kmem_cache_cpu
	  structures are in one cacheline. This is particularly important
	  for the kmalloc array.

Cons:
	- Additional reference to one read only cacheline (per cpu
	  array of pointers to kmem_cache_cpu) in both slab_alloc()
	  and slab_free().

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 include/linux/slub_def.h |    9 ++-
 mm/slub.c                |  131 +++++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 133 insertions(+), 7 deletions(-)

Index: linux-2.6.22-rc4-mm2/include/linux/slub_def.h
===================================================================
--- linux-2.6.22-rc4-mm2.orig/include/linux/slub_def.h	2007-06-18 01:28:48.000000000 -0700
+++ linux-2.6.22-rc4-mm2/include/linux/slub_def.h	2007-06-18 01:34:52.000000000 -0700
@@ -16,8 +16,7 @@ struct kmem_cache_cpu {
 	struct page *page;
 	int objects;	/* Saved page->inuse */
 	int node;
-	/* Lots of wasted space */
-} ____cacheline_aligned_in_smp;
+};
 
 struct kmem_cache_node {
 	spinlock_t list_lock;	/* Protect partial list and nr_partial */
@@ -63,7 +62,11 @@ struct kmem_cache {
 	int defrag_ratio;
 	struct kmem_cache_node *node[MAX_NUMNODES];
 #endif
-	struct kmem_cache_cpu cpu_slab[NR_CPUS];
+#ifdef CONFIG_SMP
+	struct kmem_cache_cpu *cpu_slab[NR_CPUS];
+#else
+	struct kmem_cache_cpu cpu_slab;
+#endif
 };
 
 /*
Index: linux-2.6.22-rc4-mm2/mm/slub.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/mm/slub.c	2007-06-18 01:34:42.000000000 -0700
+++ linux-2.6.22-rc4-mm2/mm/slub.c	2007-06-18 02:15:22.000000000 -0700
@@ -280,7 +280,11 @@ static inline struct kmem_cache_node *ge
 
 static inline struct kmem_cache_cpu *get_cpu_slab(struct kmem_cache *s, int cpu)
 {
-	return &s->cpu_slab[cpu];
+#ifdef CONFIG_SMP
+	return s->cpu_slab[cpu];
+#else
+	return &s->cpu_slab;
+#endif
 }
 
 static inline int check_valid_pointer(struct kmem_cache *s,
@@ -1924,14 +1928,126 @@ static void init_kmem_cache_node(struct 
 	INIT_LIST_HEAD(&n->full);
 }
 
+#ifdef CONFIG_SMP
+/*
+ * Per cpu array for per cpu structures.
+ *
+ * The per cpu array places all kmem_cache_cpu structures from one processor
+ * close together meaning that it becomes possible that multiple per cpu
+ * structures are contained in one cacheline. This may be particularly
+ * beneficial for the kmalloc caches.
+ *
+ * A desktop system typically has around 60-80 slabs. With 100 here we are
+ * likely able to get per cpu structures for all caches from the array defined
+ * here. We must be able to cover all kmalloc caches during bootstrap.
+ *
+ * If the per cpu array is exhausted then fall back to kmalloc
+ * of individual cachelines. No sharing is possible then.
+ */
+#define NR_KMEM_CACHE_CPU 100
+
+static DEFINE_PER_CPU(struct kmem_cache_cpu,
+				kmem_cache_cpu)[NR_KMEM_CACHE_CPU];
+
+static DEFINE_PER_CPU(struct kmem_cache_cpu *, kmem_cache_cpu_free);
+
+static struct kmem_cache_cpu *alloc_kmem_cache_cpu(int cpu, gfp_t flags)
+{
+	struct kmem_cache_cpu *c = per_cpu(kmem_cache_cpu_free, cpu);
+
+	if (c)
+		per_cpu(kmem_cache_cpu_free, cpu) =
+				(void *)c->lockless_freelist;
+	else {
+		/* Table overflow: So allocate ourselves */
+		c = kmalloc_node(
+			ALIGN(sizeof(struct kmem_cache_cpu), cache_line_size()),
+			flags, cpu_to_node(cpu));
+		if (!c)
+			return NULL;
+	}
+
+	memset(c, 0, sizeof(struct kmem_cache_cpu));
+	return c;
+}
+
+static void free_kmem_cache_cpu(struct kmem_cache_cpu *c, int cpu)
+{
+	if (c < per_cpu(kmem_cache_cpu, cpu) ||
+			c > per_cpu(kmem_cache_cpu, cpu) + NR_KMEM_CACHE_CPU) {
+		kfree(c);
+		return;
+	}
+	c->lockless_freelist = (void *)per_cpu(kmem_cache_cpu_free, cpu);
+	per_cpu(kmem_cache_cpu_free, cpu) = c;
+}
+
+static void free_kmem_cache_cpus(struct kmem_cache *s)
+{
+	int cpu;
+
+	for_each_online_cpu(cpu) {
+		struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+
+		if (c) {
+			s->cpu_slab[cpu] = NULL;
+			free_kmem_cache_cpu(c, cpu);
+		}
+	}
+}
+
+static int alloc_kmem_cache_cpus(struct kmem_cache *s, gfp_t flags)
+{
+	int cpu;
+
+	for_each_online_cpu(cpu) {
+		struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+
+		if (c)
+			continue;
+
+		c = alloc_kmem_cache_cpu(cpu, flags);
+		if (!c) {
+			free_kmem_cache_cpus(s);
+			return 0;
+		}
+		s->cpu_slab[cpu] = c;
+	}
+	return 1;
+}
+
+static void __init init_alloc_cpu(void)
+{
+	int cpu;
+	int i;
+
+	for_each_online_cpu(cpu) {
+		for (i = NR_KMEM_CACHE_CPU - 1; i >= 0; i--)
+			free_kmem_cache_cpu(&per_cpu(kmem_cache_cpu, cpu)[i],
+								cpu);
+	}
+}
+
+#else
+static inline void free_kmem_cache_cpus(struct kmem_cache *s) {}
+static inline void init_alloc_cpu(struct kmem_cache *s) {}
+
+static inline int alloc_kmem_cache_cpus(struct kmem_cache *s, gfp_t flags)
+{
+	return 1;
+}
+#endif
+
 #ifdef CONFIG_NUMA
+
 /*
  * No kmalloc_node yet so do it by hand. We know that this is the first
  * slab on the node for this slabcache. There are no concurrent accesses
  * possible.
  *
  * Note that this function only works on the kmalloc_node_cache
- * when allocating for the kmalloc_node_cache.
+ * when allocating for the kmalloc_node_cache. This is used for bootstrapping
+ * memory on a fresh node that has no slab structures yet.
  */
 static struct kmem_cache_node * __init early_kmem_cache_node_alloc(gfp_t gfpflags,
 								int node)
@@ -2152,8 +2268,13 @@ static int kmem_cache_open(struct kmem_c
 #ifdef CONFIG_NUMA
 	s->defrag_ratio = 100;
 #endif
-	if (init_kmem_cache_nodes(s, gfpflags & ~SLUB_DMA))
+	if (!init_kmem_cache_nodes(s, gfpflags & ~SLUB_DMA))
+		goto error;
+
+	if (alloc_kmem_cache_cpus(s, gfpflags & ~SLUB_DMA))
 		return 1;
+
+	free_kmem_cache_nodes(s);
 error:
 	if (flags & SLAB_PANIC)
 		panic("Cannot create slab %s size=%lu realsize=%u "
@@ -2236,6 +2357,8 @@ static inline int kmem_cache_close(struc
 	flush_all(s);
 
 	/* Attempt to free all objects */
+	free_kmem_cache_cpus(s);
+
 	for_each_online_node(node) {
 		struct kmem_cache_node *n = get_node(s, node);
 
@@ -2908,6 +3031,8 @@ void __init kmem_cache_init(void)
 		slub_min_objects = DEFAULT_ANTIFRAG_MIN_OBJECTS;
 	}
 
+	init_alloc_cpu();
+
 #ifdef CONFIG_NUMA
 	/*
 	 * Must first have the slab cache available for the allocations of the
@@ -2971,7 +3096,7 @@ void __init kmem_cache_init(void)
 #endif
 
 	kmem_size = offsetof(struct kmem_cache, cpu_slab) +
-				nr_cpu_ids * sizeof(struct kmem_cache_cpu);
+				nr_cpu_ids * sizeof(struct kmem_cache_cpu *);
 
 	printk(KERN_INFO "SLUB: Genslabs=%d, HWalign=%d, Order=%d-%d,"
 		" MinObjects=%d, CPUs=%d, Nodes=%d\n",
@@ -3116,15 +3241,28 @@ static int __cpuinit slab_cpuup_callback
 	unsigned long flags;
 
 	switch (action) {
+	case CPU_UP_PREPARE:
+	case CPU_UP_PREPARE_FROZEN:
+		down_read(&slub_lock);
+		list_for_each_entry(s, &slab_caches, list)
+			s->cpu_slab[cpu] = alloc_kmem_cache_cpu(cpu,
+							GFP_KERNEL);
+		up_read(&slub_lock);
+		break;
+
 	case CPU_UP_CANCELED:
 	case CPU_UP_CANCELED_FROZEN:
 	case CPU_DEAD:
 	case CPU_DEAD_FROZEN:
 		down_read(&slub_lock);
 		list_for_each_entry(s, &slab_caches, list) {
+			struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+
 			local_irq_save(flags);
 			__flush_cpu_slab(s, cpu);
 			local_irq_restore(flags);
+			free_kmem_cache_cpu(c, cpu);
+			s->cpu_slab[cpu] = NULL;
 		}
 		up_read(&slub_lock);
 		break;

-- 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 04/26] Slab allocators: Support __GFP_ZERO in all allocators.
  2007-06-18  9:58 ` [patch 04/26] Slab allocators: Support __GFP_ZERO in all allocators clameter
@ 2007-06-18 10:09   ` Paul Mundt
  2007-06-18 16:17     ` Christoph Lameter
  2007-06-18 20:11   ` Pekka Enberg
  1 sibling, 1 reply; 73+ messages in thread
From: Paul Mundt @ 2007-06-18 10:09 UTC (permalink / raw)
  To: clameter; +Cc: akpm, linux-kernel, linux-mm, Pekka Enberg, suresh.b.siddha

On Mon, Jun 18, 2007 at 02:58:42AM -0700, clameter@sgi.com wrote:
> So add the necessary logic to all slab allocators to support __GFP_ZERO.
> 
Does this mean I should update my SLOB NUMA support patch? ;-)

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 00/26] Current slab allocator / SLUB patch queue
  2007-06-18  9:58 [patch 00/26] Current slab allocator / SLUB patch queue clameter
                   ` (25 preceding siblings ...)
  2007-06-18  9:59 ` [patch 26/26] SLUB: Place kmem_cache_cpu structures in a NUMA aware way clameter
@ 2007-06-18 11:57 ` Michal Piotrowski
  2007-06-18 16:46   ` Christoph Lameter
  26 siblings, 1 reply; 73+ messages in thread
From: Michal Piotrowski @ 2007-06-18 11:57 UTC (permalink / raw)
  To: clameter; +Cc: akpm, linux-kernel, linux-mm, Pekka Enberg, suresh.b.siddha

Hi,

clameter@sgi.com writes:
> These contain the following groups of patches:
> 
> 1. Slab allocator code consolidation and fixing of inconsistencies
> 
> This makes ZERO_SIZE_PTR generic so that it works in all
> slab allocators.
> 
> It adds __GFP_ZERO support to all slab allocators and
> cleans up the zeroing in the slabs and provides modifications
> to remove explicit zeroing following kmalloc_node and
> kmem_cache_alloc_node calls.
> 
> 2. SLUB improvements
> 
> Inline some small functions to reduce code size. Some more memory
> optimizations using CONFIG_SLUB_DEBUG. Changes to handling of the
> slub_lock and an optimization of runtime determination of kmalloc slabs
> (replaces ilog2 patch that failed with gcc 3.3 on powerpc).
> 
> 3. Slab defragmentation
> 
> This is V3 of the patchset with the one fix for the locking problem that
> showed up during testing.
> 
> 4. Performance optimizations
> 
> These patches have a long history since the early drafts of SLUB. The
> problem with these patches is that they require the touching of additional
> cachelines (only for read) and SLUB was designed for minimal cacheline
> touching. In doing so we may be able to remove cacheline bouncing in
> particular for remote alloc/ free situations where I have had reports of
> issues that I was not able to confirm for lack of specificity. The tradeoffs
> here are not clear. Certainly the larger cacheline footprint will hurt the
> casual slab user somewhat but it will benefit processes that perform these
> local/remote alloc/free operations.
> 
> I'd appreciate if someone could evaluate these.
> 
> The complete patchset against 2.6.22-rc4-mm2 is available at
> 
> http://ftp.kernel.org/pub/linux/kernel/people/christoph/slub/2.6.22-rc4-mm2
> 
> Tested on
> 
> x86_64 SMP
> x86_64 NUMA emulation
> IA64 emulator
> Altix 64p/128G NUMA system.
> Altix 8p/6G asymmetric NUMA system.
> 
> 

Testcase:

#! /bin/sh

for i in `find /sys/ -type f`
do
    echo "wyświetlam $i"
    sudo cat $i > /dev/null
#    sleep 1s
done

Result:

[  212.247759] WARNING: at lib/vsprintf.c:280 vsnprintf()
[  212.253263]  [<c04052ad>] dump_trace+0x63/0x1eb
[  212.259042]  [<c040544f>] show_trace_log_lvl+0x1a/0x2f
[  212.266672]  [<c040608d>] show_trace+0x12/0x14
[  212.271622]  [<c04060a5>] dump_stack+0x16/0x18
[  212.276663]  [<c050d512>] vsnprintf+0x6b/0x48c
[  212.281325]  [<c050d9f0>] scnprintf+0x20/0x2d
[  212.286707]  [<c0508dbc>] bitmap_scnlistprintf+0xa8/0xec
[  212.292508]  [<c0480d40>] list_locations+0x24c/0x2a2
[  212.298241]  [<c0480dde>] alloc_calls_show+0x1f/0x26
[  212.303459]  [<c047e72e>] slab_attr_show+0x1c/0x20
[  212.309469]  [<c04c1cf9>] sysfs_read_file+0x94/0x105
[  212.315519]  [<c0485933>] vfs_read+0xcf/0x158
[  212.320215]  [<c0485d99>] sys_read+0x3d/0x72
[  212.327539]  [<c040420c>] syscall_call+0x7/0xb
[  212.332203]  [<b7f74410>] 0xb7f74410
[  212.336229]  =======================

Unfortunately, I don't know which file was cat'ed

http://www.stardust.webpages.pl/files/tbf/bitis-gabonica/2.6.22-rc4-mm2-slub/slub-config
http://www.stardust.webpages.pl/files/tbf/bitis-gabonica/2.6.22-rc4-mm2-slub/slub-dmesg

Regards,
Michal

-- 
LOG
http://www.stardust.webpages.pl/log/

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 04/26] Slab allocators: Support __GFP_ZERO in all allocators.
  2007-06-18 10:09   ` Paul Mundt
@ 2007-06-18 16:17     ` Christoph Lameter
  0 siblings, 0 replies; 73+ messages in thread
From: Christoph Lameter @ 2007-06-18 16:17 UTC (permalink / raw)
  To: Paul Mundt; +Cc: akpm, linux-kernel, linux-mm, Pekka Enberg, suresh.b.siddha

On Mon, 18 Jun 2007, Paul Mundt wrote:

> On Mon, Jun 18, 2007 at 02:58:42AM -0700, clameter@sgi.com wrote:
> > So add the necessary logic to all slab allocators to support __GFP_ZERO.
> > 
> Does this mean I should update my SLOB NUMA support patch? ;-)

Hehehe. It's not merged yet. Sorry about the fluidity here. The 
discussion with you triggered some thought processes on the 
consistency issues with zeroing in allocators.



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 00/26] Current slab allocator / SLUB patch queue
  2007-06-18 11:57 ` [patch 00/26] Current slab allocator / SLUB patch queue Michal Piotrowski
@ 2007-06-18 16:46   ` Christoph Lameter
  2007-06-18 17:38     ` Michal Piotrowski
  0 siblings, 1 reply; 73+ messages in thread
From: Christoph Lameter @ 2007-06-18 16:46 UTC (permalink / raw)
  To: Michal Piotrowski
  Cc: akpm, linux-kernel, linux-mm, Pekka Enberg, suresh.b.siddha

On Mon, 18 Jun 2007, Michal Piotrowski wrote:

> Result:
> 
> [  212.247759] WARNING: at lib/vsprintf.c:280 vsnprintf()
> [  212.253263]  [<c04052ad>] dump_trace+0x63/0x1eb
> [  212.259042]  [<c040544f>] show_trace_log_lvl+0x1a/0x2f
> [  212.266672]  [<c040608d>] show_trace+0x12/0x14
> [  212.271622]  [<c04060a5>] dump_stack+0x16/0x18
> [  212.276663]  [<c050d512>] vsnprintf+0x6b/0x48c
> [  212.281325]  [<c050d9f0>] scnprintf+0x20/0x2d
> [  212.286707]  [<c0508dbc>] bitmap_scnlistprintf+0xa8/0xec
> [  212.292508]  [<c0480d40>] list_locations+0x24c/0x2a2
> [  212.298241]  [<c0480dde>] alloc_calls_show+0x1f/0x26
> [  212.303459]  [<c047e72e>] slab_attr_show+0x1c/0x20
> [  212.309469]  [<c04c1cf9>] sysfs_read_file+0x94/0x105
> [  212.315519]  [<c0485933>] vfs_read+0xcf/0x158
> [  212.320215]  [<c0485d99>] sys_read+0x3d/0x72
> [  212.327539]  [<c040420c>] syscall_call+0x7/0xb
> [  212.332203]  [<b7f74410>] 0xb7f74410
> [  212.336229]  =======================
> 
> Unfortunately, I don't know which file was cat'ed

The dump shows that it was alloc_calls. But the issue is not related to 
this patchset.

Looks like we overflowed the buffer available for /sys output. The calls
in list_locations() to format the cpu and node lists attempt to allow very
long lists by trying to calculate how many bytes are remaining in the
page. If we are already beyond the space they leave over then we may pass
a negative size to the scnprintf functions.

So we need to check first if there are enough bytes remaining before
doing the calculation of how many remaining bytes can be used to
format these lists.
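
To make the arithmetic concrete, here is a minimal userspace sketch of
the failure mode (illustrative only, not the kernel code; the constants
mirror the PAGE_SIZE - n - 50 expression used in list_locations()):

#include <stdio.h>

#define PAGE_SIZE 4096

int main(void)
{
	char buf[PAGE_SIZE];
	int n = 4080;	/* the output has nearly filled the page already */
	int remaining = PAGE_SIZE - n - 50;	/* = -34 */

	printf("remaining = %d\n", remaining);

	/*
	 * Without a check along these lines, the negative value would be
	 * converted to a huge size_t when handed to an snprintf-style
	 * helper -- roughly the situation described above.
	 */
	if (remaining > 0)
		snprintf(buf + n, (size_t)remaining, " cpus=%s", "0-3");
	return 0;
}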

Does this patch fix the issue?

Index: linux-2.6.22-rc4-mm2/mm/slub.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/mm/slub.c	2007-06-18 09:37:41.000000000 -0700
+++ linux-2.6.22-rc4-mm2/mm/slub.c	2007-06-18 09:44:38.000000000 -0700
@@ -3649,13 +3649,15 @@ static int list_locations(struct kmem_ca
 			n += sprintf(buf + n, " pid=%ld",
 				l->min_pid);
 
-		if (num_online_cpus() > 1 && !cpus_empty(l->cpus)) {
+		if (num_online_cpus() > 1 && !cpus_empty(l->cpus) &&
+				n < PAGE_SIZE - n - 57) {
 			n += sprintf(buf + n, " cpus=");
 			n += cpulist_scnprintf(buf + n, PAGE_SIZE - n - 50,
 					l->cpus);
 		}
 
-		if (num_online_nodes() > 1 && !nodes_empty(l->nodes)) {
+		if (num_online_nodes() > 1 && !nodes_empty(l->nodes) &&
+				n < PAGE_SIZE - n - 57) {
 			n += sprintf(buf + n, " nodes=");
 			n += nodelist_scnprintf(buf + n, PAGE_SIZE - n - 50,
 					l->nodes);






^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 00/26] Current slab allocator / SLUB patch queue
  2007-06-18 16:46   ` Christoph Lameter
@ 2007-06-18 17:38     ` Michal Piotrowski
  2007-06-18 18:05       ` Christoph Lameter
  0 siblings, 1 reply; 73+ messages in thread
From: Michal Piotrowski @ 2007-06-18 17:38 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, linux-kernel, linux-mm, Pekka Enberg, suresh.b.siddha

On 18/06/07, Christoph Lameter <clameter@sgi.com> wrote:
> On Mon, 18 Jun 2007, Michal Piotrowski wrote:
>
> > Result:
> >
> > [  212.247759] WARNING: at lib/vsprintf.c:280 vsnprintf()
> > [  212.253263]  [<c04052ad>] dump_trace+0x63/0x1eb
> > [  212.259042]  [<c040544f>] show_trace_log_lvl+0x1a/0x2f
> > [  212.266672]  [<c040608d>] show_trace+0x12/0x14
> > [  212.271622]  [<c04060a5>] dump_stack+0x16/0x18
> > [  212.276663]  [<c050d512>] vsnprintf+0x6b/0x48c
> > [  212.281325]  [<c050d9f0>] scnprintf+0x20/0x2d
> > [  212.286707]  [<c0508dbc>] bitmap_scnlistprintf+0xa8/0xec
> > [  212.292508]  [<c0480d40>] list_locations+0x24c/0x2a2
> > [  212.298241]  [<c0480dde>] alloc_calls_show+0x1f/0x26
> > [  212.303459]  [<c047e72e>] slab_attr_show+0x1c/0x20
> > [  212.309469]  [<c04c1cf9>] sysfs_read_file+0x94/0x105
> > [  212.315519]  [<c0485933>] vfs_read+0xcf/0x158
> > [  212.320215]  [<c0485d99>] sys_read+0x3d/0x72
> > [  212.327539]  [<c040420c>] syscall_call+0x7/0xb
> > [  212.332203]  [<b7f74410>] 0xb7f74410
> > [  212.336229]  =======================
> >
> > Unfortunately, I don't know which file was cat'ed
>
> The dump shows that it was alloc_calls. But the issue is not related to
> this patchset.
>
> Looks like we overflowed the buffer available for /sys output. The calls
> in list_locations() to format the cpu and node lists attempt to allow very
> long lists by trying to calculate how many bytes are remaining in the
> page. If we are already beyond the space they leave over then we may pass
> a negative size to the scnprintf functions.
>
> So we need to check first if there are enough bytes remaining before
> doing the calculation of how many remaining bytes can be used to
> format these lists.
>
> Does this patch fix the issue?
>

Unfortunately no.

AFAIR I didn't see it in 2.6.22-rc4-mm2

Regards,
Michal

-- 
LOG
http://www.stardust.webpages.pl/log/

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 00/26] Current slab allocator / SLUB patch queue
  2007-06-18 17:38     ` Michal Piotrowski
@ 2007-06-18 18:05       ` Christoph Lameter
  2007-06-18 18:58         ` Michal Piotrowski
  0 siblings, 1 reply; 73+ messages in thread
From: Christoph Lameter @ 2007-06-18 18:05 UTC (permalink / raw)
  To: Michal Piotrowski
  Cc: akpm, linux-kernel, linux-mm, Pekka Enberg, suresh.b.siddha

On Mon, 18 Jun 2007, Michal Piotrowski wrote:

> > Does this patch fix the issue?
> Unfortunately no.
> 
> AFAIR I didn't see it in 2.6.22-rc4-mm2

Seems that I miscounted. We need a larger safe area.


SLUB: Fix behavior if the text output of list_locations overflows PAGE_SIZE

If slabs are allocated or freed from a large set of call sites (typical 
for the kmalloc area) then we may create more output than fits into
a single PAGE and sysfs only gives us one page. The output should be
truncated. This patch fixes the checks to do the truncation properly.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.22-rc4-mm2/mm/slub.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/mm/slub.c	2007-06-18 09:37:41.000000000 -0700
+++ linux-2.6.22-rc4-mm2/mm/slub.c	2007-06-18 11:02:19.000000000 -0700
@@ -3649,13 +3649,15 @@ static int list_locations(struct kmem_ca
 			n += sprintf(buf + n, " pid=%ld",
 				l->min_pid);
 
-		if (num_online_cpus() > 1 && !cpus_empty(l->cpus)) {
+		if (num_online_cpus() > 1 && !cpus_empty(l->cpus) &&
+				n < PAGE_SIZE - n - 60) {
 			n += sprintf(buf + n, " cpus=");
 			n += cpulist_scnprintf(buf + n, PAGE_SIZE - n - 50,
 					l->cpus);
 		}
 
-		if (num_online_nodes() > 1 && !nodes_empty(l->nodes)) {
+		if (num_online_nodes() > 1 && !nodes_empty(l->nodes) &&
+				n < PAGE_SIZE - n - 60) {
 			n += sprintf(buf + n, " nodes=");
 			n += nodelist_scnprintf(buf + n, PAGE_SIZE - n - 50,
 					l->nodes);

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 00/26] Current slab allocator / SLUB patch queue
  2007-06-18 18:05       ` Christoph Lameter
@ 2007-06-18 18:58         ` Michal Piotrowski
  2007-06-18 19:00           ` Christoph Lameter
  0 siblings, 1 reply; 73+ messages in thread
From: Michal Piotrowski @ 2007-06-18 18:58 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, linux-kernel, linux-mm, Pekka Enberg, suresh.b.siddha

On 18/06/07, Christoph Lameter <clameter@sgi.com> wrote:
> On Mon, 18 Jun 2007, Michal Piotrowski wrote:
>
> > > Does this patch fix the issue?
> > Unfortunately no.
> >
> > AFAIR I didn't see it in 2.6.22-rc4-mm2
>
> Seems that I miscounted. We need a larger safe area.
>

Still the same.

Regards,
Michal

-- 
LOG
http://www.stardust.webpages.pl/log/

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 00/26] Current slab allocator / SLUB patch queue
  2007-06-18 18:58         ` Michal Piotrowski
@ 2007-06-18 19:00           ` Christoph Lameter
  2007-06-18 19:09             ` Michal Piotrowski
  0 siblings, 1 reply; 73+ messages in thread
From: Christoph Lameter @ 2007-06-18 19:00 UTC (permalink / raw)
  To: Michal Piotrowski
  Cc: akpm, linux-kernel, linux-mm, Pekka Enberg, suresh.b.siddha

On Mon, 18 Jun 2007, Michal Piotrowski wrote:

> Still the same.

Is it still exactly the same stack trace? There could be multiple issues 
if we overflow PAGE_SIZE there.


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 00/26] Current slab allocator / SLUB patch queue
  2007-06-18 19:00           ` Christoph Lameter
@ 2007-06-18 19:09             ` Michal Piotrowski
  2007-06-18 19:19               ` Christoph Lameter
  0 siblings, 1 reply; 73+ messages in thread
From: Michal Piotrowski @ 2007-06-18 19:09 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, linux-kernel, linux-mm, Pekka Enberg, suresh.b.siddha

On 18/06/07, Christoph Lameter <clameter@sgi.com> wrote:
> On Mon, 18 Jun 2007, Michal Piotrowski wrote:
>
> > Still the same.
>
> Is it still exactly the same stack trace?

Not exactly the same
[<c0480d4b>] list_locations+0x257/0x2ad
is the only difference

 l *list_locations+0x257
0xc1080d4b is in list_locations (mm/slub.c:3655).
3650                                    l->min_pid);
3651
3652                    if (num_online_cpus() > 1 && !cpus_empty(l->cpus) &&
3653                                    n < PAGE_SIZE - n - 60) {
3654                            n += sprintf(buf + n, " cpus=");
3655                            n += cpulist_scnprintf(buf + n, PAGE_SIZE - n - 50,
3656                                            l->cpus);
3657                    }
3658
3659                    if (num_online_nodes() > 1 && !nodes_empty(l->nodes) &&


> There could be multiple issues
> if we overflow PAGE_SIZE there.

Regards,
Michal

-- 
LOG
http://www.stardust.webpages.pl/log/

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 00/26] Current slab allocator / SLUB patch queue
  2007-06-18 19:09             ` Michal Piotrowski
@ 2007-06-18 19:19               ` Christoph Lameter
  2007-06-18 20:43                 ` Michal Piotrowski
  0 siblings, 1 reply; 73+ messages in thread
From: Christoph Lameter @ 2007-06-18 19:19 UTC (permalink / raw)
  To: Michal Piotrowski
  Cc: akpm, linux-kernel, linux-mm, Pekka Enberg, suresh.b.siddha

Stupid me. n on both sides of the comparison. Tried to run your script 
here but I cannot trigger it.

Next attempt: Sorry for the churn.

SLUB: Fix behavior if the text output of list_locations overflows PAGE_SIZE

If slabs are allocated or freed from a large set of call sites (typical
for the kmalloc area) then we may create more output than fits into
a single PAGE and sysfs only gives us one page. The output should be
truncated. This patch fixes the checks to do the truncation properly.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.22-rc4-mm2/mm/slub.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/mm/slub.c	2007-06-18 12:13:48.000000000 -0700
+++ linux-2.6.22-rc4-mm2/mm/slub.c	2007-06-18 12:15:10.000000000 -0700
@@ -3649,13 +3649,15 @@ static int list_locations(struct kmem_ca
 			n += sprintf(buf + n, " pid=%ld",
 				l->min_pid);
 
-		if (num_online_cpus() > 1 && !cpus_empty(l->cpus)) {
+		if (num_online_cpus() > 1 && !cpus_empty(l->cpus) &&
+				n < PAGE_SIZE - 60) {
 			n += sprintf(buf + n, " cpus=");
 			n += cpulist_scnprintf(buf + n, PAGE_SIZE - n - 50,
 					l->cpus);
 		}
 
-		if (num_online_nodes() > 1 && !nodes_empty(l->nodes)) {
+		if (num_online_nodes() > 1 && !nodes_empty(l->nodes) &&
+				n < PAGE_SIZE - 60) {
 			n += sprintf(buf + n, " nodes=");
 			n += nodelist_scnprintf(buf + n, PAGE_SIZE - n - 50,
 					l->nodes);

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 02/26] Slab allocators: Consolidate code for krealloc in mm/util.c
  2007-06-18  9:58 ` [patch 02/26] Slab allocators: Consolidate code for krealloc in mm/util.c clameter
@ 2007-06-18 20:03   ` Pekka Enberg
  0 siblings, 0 replies; 73+ messages in thread
From: Pekka Enberg @ 2007-06-18 20:03 UTC (permalink / raw)
  To: clameter; +Cc: akpm, linux-kernel, linux-mm, suresh.b.siddha

On 6/18/07, clameter@sgi.com <clameter@sgi.com> wrote:
> The size of a kmalloc object is readily available via ksize().
> ksize is provided by all allocators and thus we can implement
> krealloc in a generic way.
>
> Implement krealloc in mm/util.c and drop slab specific implementations
> of krealloc.

Looks good to me.

Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 03/26] Slab allocators: Consistent ZERO_SIZE_PTR support and NULL result semantics
  2007-06-18  9:58 ` [patch 03/26] Slab allocators: Consistent ZERO_SIZE_PTR support and NULL result semantics clameter
@ 2007-06-18 20:08   ` Pekka Enberg
  0 siblings, 0 replies; 73+ messages in thread
From: Pekka Enberg @ 2007-06-18 20:08 UTC (permalink / raw)
  To: clameter; +Cc: akpm, linux-kernel, linux-mm, suresh.b.siddha

On 6/18/07, clameter@sgi.com <clameter@sgi.com> wrote:
> Define ZERO_OR_NULL_PTR macro to be able to remove the checks
> from the allocators. Move ZERO_SIZE_PTR related stuff into slab.h.
>
> Make ZERO_SIZE_PTR work for all slab allocators and get rid of the
> WARN_ON_ONCE(size == 0) that is still remaining in SLAB.
>
> Make slub return NULL like the other allocators if a too large
> memory segment is requested via __kmalloc.

Looks good to me.

Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 04/26] Slab allocators: Support __GFP_ZERO in all allocators.
  2007-06-18  9:58 ` [patch 04/26] Slab allocators: Support __GFP_ZERO in all allocators clameter
  2007-06-18 10:09   ` Paul Mundt
@ 2007-06-18 20:11   ` Pekka Enberg
  1 sibling, 0 replies; 73+ messages in thread
From: Pekka Enberg @ 2007-06-18 20:11 UTC (permalink / raw)
  To: clameter; +Cc: akpm, linux-kernel, linux-mm, suresh.b.siddha

On 6/18/07, clameter@sgi.com <clameter@sgi.com> wrote:
> A kernel convention for many allocators is that if __GFP_ZERO is passed to
> an allocator then the allocated memory should be zeroed.
>
> This is currently not supported by the slab allocators. The inconsistency
> makes it difficult to implement in derived allocators such as in the uncached
> allocator and the pool allocators.

[snip]

> So add the necessary logic to all slab allocators to support __GFP_ZERO.

Looks good to me.

Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 05/26] Slab allocators: Cleanup zeroing allocations
  2007-06-18  9:58 ` [patch 05/26] Slab allocators: Cleanup zeroing allocations clameter
@ 2007-06-18 20:16   ` Pekka Enberg
  2007-06-18 20:26     ` Pekka Enberg
  2007-06-18 21:55     ` Christoph Lameter
  2007-06-19 21:00   ` Matt Mackall
  1 sibling, 2 replies; 73+ messages in thread
From: Pekka Enberg @ 2007-06-18 20:16 UTC (permalink / raw)
  To: clameter; +Cc: akpm, linux-kernel, linux-mm, suresh.b.siddha

On 6/18/07, clameter@sgi.com <clameter@sgi.com> wrote:
> +static inline void *kmem_cache_zalloc(struct kmem_cache *k, gfp_t flags)
> +{
> +       return kmem_cache_alloc(k, flags | __GFP_ZERO);
> +}
> +
> +static inline void *__kzalloc(int size, gfp_t flags)
> +{
> +       return kmalloc(size, flags | __GFP_ZERO);
> +}

Hmm, did you check kernel text size before and after this change?
Setting the __GFP_ZERO flag at every kzalloc call-site seems like a
bad idea.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 05/26] Slab allocators: Cleanup zeroing allocations
  2007-06-18 20:16   ` Pekka Enberg
@ 2007-06-18 20:26     ` Pekka Enberg
  2007-06-18 22:34       ` Christoph Lameter
  2007-06-18 21:55     ` Christoph Lameter
  1 sibling, 1 reply; 73+ messages in thread
From: Pekka Enberg @ 2007-06-18 20:26 UTC (permalink / raw)
  To: clameter; +Cc: akpm, linux-kernel, linux-mm, suresh.b.siddha

On 6/18/07, Pekka Enberg <penberg@cs.helsinki.fi> wrote:
> Hmm, did you check kernel text size before and after this change?
> Setting the __GFP_ZERO flag at every kzalloc call-site seems like a
> bad idea.

Aah but most call-sites, of course, use constants such as GFP_KERNEL
only which should be folded nicely by the compiler. So this probably
doesn't have much impact. Would be nice if you'd check, though.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 00/26] Current slab allocator / SLUB patch queue
  2007-06-18 19:19               ` Christoph Lameter
@ 2007-06-18 20:43                 ` Michal Piotrowski
  0 siblings, 0 replies; 73+ messages in thread
From: Michal Piotrowski @ 2007-06-18 20:43 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, linux-kernel, linux-mm, Pekka Enberg, suresh.b.siddha

On 18/06/07, Christoph Lameter <clameter@sgi.com> wrote:
> Stupid me. n on both sides of the comparison. Tried to run your script
> here but I cannot trigger it.
>
> Next attempt: Sorry for the churn.

Problem fixed. Thanks!

Regards,
Michal

-- 
LOG
http://www.stardust.webpages.pl/log/

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 05/26] Slab allocators: Cleanup zeroing allocations
  2007-06-18 20:16   ` Pekka Enberg
  2007-06-18 20:26     ` Pekka Enberg
@ 2007-06-18 21:55     ` Christoph Lameter
  1 sibling, 0 replies; 73+ messages in thread
From: Christoph Lameter @ 2007-06-18 21:55 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: akpm, linux-kernel, linux-mm, suresh.b.siddha

On Mon, 18 Jun 2007, Pekka Enberg wrote:

> On 6/18/07, clameter@sgi.com <clameter@sgi.com> wrote:
> > +static inline void *kmem_cache_zalloc(struct kmem_cache *k, gfp_t flags)
> > +{
> > +       return kmem_cache_alloc(k, flags | __GFP_ZERO);
> > +}
> > +
> > +static inline void *__kzalloc(int size, gfp_t flags)
> > +{
> > +       return kmalloc(size, flags | __GFP_ZERO);
> > +}
> 
> Hmm, did you check kernel text size before and after this change?
> Setting the __GFP_ZERO flag at every kzalloc call-site seems like a
> bad idea.

I did not check, but the flags are usually constant; the compiler does the | at compile time.
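
As a sketch of what that means in practice (kernel context assumed; this
just mirrors the __kzalloc wrapper quoted earlier in the thread):

#include <linux/slab.h>

/* Hypothetical caller, for illustration only. */
void *example_alloc(void)
{
	/*
	 * __kzalloc() expands to kmalloc(size, flags | __GFP_ZERO).  Since
	 * GFP_KERNEL is a compile-time constant, the OR is folded at build
	 * time and the call site passes a single constant flags value, so
	 * no extra instruction is emitted for the | itself.
	 */
	return __kzalloc(128, GFP_KERNEL);
}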


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 05/26] Slab allocators: Cleanup zeroing allocations
  2007-06-18 20:26     ` Pekka Enberg
@ 2007-06-18 22:34       ` Christoph Lameter
  2007-06-19  5:48         ` Pekka Enberg
  0 siblings, 1 reply; 73+ messages in thread
From: Christoph Lameter @ 2007-06-18 22:34 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: akpm, linux-kernel, linux-mm, suresh.b.siddha

On Mon, 18 Jun 2007, Pekka Enberg wrote:

> On 6/18/07, Pekka Enberg <penberg@cs.helsinki.fi> wrote:
> > Hmm, did you check kernel text size before and after this change?
> > Setting the __GFP_ZERO flag at every kzalloc call-site seems like a
> > bad idea.
> 
> Aah but most call-sites, of course, use constants such as GFP_KERNEL
> only which should be folded nicely by the compiler. So this probably
> doesn't have much impact. Would be nice if you'd check, though.

IA64

Before:

   text    data     bss     dec     hex filename
10486815        4128471 3686044 18301330        1174192 vmlinux

After:

   text    data     bss     dec     hex filename
10486335        4128439 3686044 18300818        1173f92 vmlinux

Saved ~500 bytes in text size.

x86_64:

Before:

   text    data     bss     dec     hex filename
3823932  333840  220484 4378256  42ce90 vmlinux

After

   text    data     bss     dec     hex filename
3823716  333840  220484 4378040  42cdb8 vmlinux

200 bytes saved.


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 05/26] Slab allocators: Cleanup zeroing allocations
  2007-06-18 22:34       ` Christoph Lameter
@ 2007-06-19  5:48         ` Pekka Enberg
  0 siblings, 0 replies; 73+ messages in thread
From: Pekka Enberg @ 2007-06-19  5:48 UTC (permalink / raw)
  To: clameter; +Cc: akpm, linux-kernel, linux-mm, suresh.b.siddha


On 6/19/2007, "Christoph Lameter" <clameter@sgi.com> wrote:
> IA64

[snip]

> Saved ~500 bytes in text size.
> 
> x86_64:

[snip]

> 200 bytes saved.

Looks good. Thanks Christoph.

Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 10/26] SLUB: Faster more efficient slab determination for __kmalloc.
  2007-06-18  9:58 ` [patch 10/26] SLUB: Faster more efficient slab determination for __kmalloc clameter
@ 2007-06-19 20:08   ` Andrew Morton
  2007-06-19 22:22     ` Christoph Lameter
  0 siblings, 1 reply; 73+ messages in thread
From: Andrew Morton @ 2007-06-19 20:08 UTC (permalink / raw)
  To: clameter; +Cc: linux-kernel, linux-mm, Pekka Enberg, suresh.b.siddha

On Mon, 18 Jun 2007 02:58:48 -0700
clameter@sgi.com wrote:

> +	BUG_ON(KMALLOC_MIN_SIZE > 256 ||
> +		(KMALLOC_MIN_SIZE & (KMALLOC_MIN_SIZE - 1)));

BUILD_BUG_ON?

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 06/26] Slab allocators: Replace explicit zeroing with __GFP_ZERO
  2007-06-18  9:58 ` [patch 06/26] Slab allocators: Replace explicit zeroing with __GFP_ZERO clameter
@ 2007-06-19 20:55   ` Pekka Enberg
  2007-06-28  6:09   ` Andrew Morton
  1 sibling, 0 replies; 73+ messages in thread
From: Pekka Enberg @ 2007-06-19 20:55 UTC (permalink / raw)
  To: clameter; +Cc: akpm, linux-kernel, linux-mm, suresh.b.siddha

On 6/18/07, clameter@sgi.com <clameter@sgi.com> wrote:
> kmalloc_node() and kmem_cache_alloc_node() were not available in
> a zeroing variant in the past. But with __GFP_ZERO it is possible
> now to do zeroing while allocating.

Looks good. Maybe we want to phase out the zeroing variants altogether
(except maybe kzalloc, which is widespread now)?

Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 11/26] SLUB: Add support for kmem_cache_ops
  2007-06-18  9:58 ` [patch 11/26] SLUB: Add support for kmem_cache_ops clameter
@ 2007-06-19 20:58   ` Pekka Enberg
  2007-06-19 22:32     ` Christoph Lameter
  0 siblings, 1 reply; 73+ messages in thread
From: Pekka Enberg @ 2007-06-19 20:58 UTC (permalink / raw)
  To: clameter; +Cc: akpm, linux-kernel, linux-mm, suresh.b.siddha

On 6/18/07, clameter@sgi.com <clameter@sgi.com> wrote:
> We use the parameter formerly used by the destructor to pass an optional
> pointer to a kmem_cache_ops structure to kmem_cache_create.
>
> kmem_cache_ops is created as empty. Later patches populate kmem_cache_ops.

I like kmem_cache_ops but I don't like this patch. I know it's painful
but we really want the introduction patch to fix up the API (move ctor
to kmem_cache_ops and do the callers).

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 05/26] Slab allocators: Cleanup zeroing allocations
  2007-06-18  9:58 ` [patch 05/26] Slab allocators: Cleanup zeroing allocations clameter
  2007-06-18 20:16   ` Pekka Enberg
@ 2007-06-19 21:00   ` Matt Mackall
  2007-06-19 22:33     ` Christoph Lameter
  1 sibling, 1 reply; 73+ messages in thread
From: Matt Mackall @ 2007-06-19 21:00 UTC (permalink / raw)
  To: clameter; +Cc: akpm, linux-kernel, linux-mm, Pekka Enberg, suresh.b.siddha

On Mon, Jun 18, 2007 at 02:58:43AM -0700, clameter@sgi.com wrote:
> It becomes now easy to support the zeroing allocs with generic inline functions
> in slab.h. Provide inline definitions to allow the continued use of
> kzalloc, kmem_cache_zalloc etc but remove other definitions of zeroing functions
> from the slab allocators and util.c.

The SLOB bits up through here look fine.

I worry a bit about adding another branch checking __GFP_ZERO in such
a hot path for SLAB/SLUB.

-- 
Mathematics is the supreme nostalgia of our time.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 10/26] SLUB: Faster more efficient slab determination for __kmalloc.
  2007-06-19 20:08   ` Andrew Morton
@ 2007-06-19 22:22     ` Christoph Lameter
  2007-06-19 22:29       ` Andrew Morton
  0 siblings, 1 reply; 73+ messages in thread
From: Christoph Lameter @ 2007-06-19 22:22 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-mm, Pekka Enberg, suresh.b.siddha

On Tue, 19 Jun 2007, Andrew Morton wrote:

> On Mon, 18 Jun 2007 02:58:48 -0700
> clameter@sgi.com wrote:
> 
> > +	BUG_ON(KMALLOC_MIN_SIZE > 256 ||
> > +		(KMALLOC_MIN_SIZE & (KMALLOC_MIN_SIZE - 1)));
> 
> BUILD_BUG_ON?
> 
Does not matter. That code is __init.


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 10/26] SLUB: Faster more efficient slab determination for __kmalloc.
  2007-06-19 22:22     ` Christoph Lameter
@ 2007-06-19 22:29       ` Andrew Morton
  2007-06-19 22:38         ` Christoph Lameter
  0 siblings, 1 reply; 73+ messages in thread
From: Andrew Morton @ 2007-06-19 22:29 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-kernel, linux-mm, Pekka Enberg, suresh.b.siddha

On Tue, 19 Jun 2007 15:22:36 -0700 (PDT)
Christoph Lameter <clameter@sgi.com> wrote:

> On Tue, 19 Jun 2007, Andrew Morton wrote:
> 
> > On Mon, 18 Jun 2007 02:58:48 -0700
> > clameter@sgi.com wrote:
> > 
> > > +	BUG_ON(KMALLOC_MIN_SIZE > 256 ||
> > > +		(KMALLOC_MIN_SIZE & (KMALLOC_MIN_SIZE - 1)));
> > 
> > BUILD_BUG_ON?
> > 
> Does not matter. That code is __init.

Finding out at compile time is better.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 11/26] SLUB: Add support for kmem_cache_ops
  2007-06-19 20:58   ` Pekka Enberg
@ 2007-06-19 22:32     ` Christoph Lameter
  0 siblings, 0 replies; 73+ messages in thread
From: Christoph Lameter @ 2007-06-19 22:32 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: akpm, linux-kernel, linux-mm, suresh.b.siddha

On Tue, 19 Jun 2007, Pekka Enberg wrote:

> On 6/18/07, clameter@sgi.com <clameter@sgi.com> wrote:
> > We use the parameter formerly used by the destructor to pass an optional
> > pointer to a kmem_cache_ops structure to kmem_cache_create.
> > 
> > kmem_cache_ops is created as empty. Later patches populate kmem_cache_ops.
> 
> I like kmem_cache_ops but I don't like this patch. I know it's painful
> but we really want the introduction patch to fix up the API (move ctor
> to kmem_cache_ops and do the callers).

That can be done later. The effort does not increase because of this 
patch. If you have the time please do such a patch.

 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 05/26] Slab allocators: Cleanup zeroing allocations
  2007-06-19 21:00   ` Matt Mackall
@ 2007-06-19 22:33     ` Christoph Lameter
  2007-06-20  6:14       ` Pekka J Enberg
  0 siblings, 1 reply; 73+ messages in thread
From: Christoph Lameter @ 2007-06-19 22:33 UTC (permalink / raw)
  To: Matt Mackall; +Cc: akpm, linux-kernel, linux-mm, Pekka Enberg, suresh.b.siddha

On Tue, 19 Jun 2007, Matt Mackall wrote:

> I worry a bit about adding another branch checking __GFP_ZERO in such
> a hot path for SLAB/SLUB.

It's checking the gfpflags variable on the stack, in a recently touched
cacheline.
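
For context, the branch being discussed has roughly this shape (a sketch
only, not the actual hunk from the patch; the helper name and the objsize
field are assumptions for illustration):

#include <linux/slab.h>
#include <linux/string.h>

static inline void *slab_post_alloc(struct kmem_cache *s, gfp_t gfpflags,
					void *object)
{
	/*
	 * The extra test Matt refers to: a single flag check against a
	 * value that was just used and sits in a register or a hot stack
	 * slot.
	 */
	if (unlikely((gfpflags & __GFP_ZERO) && object))
		memset(object, 0, s->objsize);
	return object;
}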



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 10/26] SLUB: Faster more efficient slab determination for __kmalloc.
  2007-06-19 22:29       ` Andrew Morton
@ 2007-06-19 22:38         ` Christoph Lameter
  2007-06-19 22:46           ` Andrew Morton
  0 siblings, 1 reply; 73+ messages in thread
From: Christoph Lameter @ 2007-06-19 22:38 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-mm, Pekka Enberg, suresh.b.siddha

On Tue, 19 Jun 2007, Andrew Morton wrote:

> On Tue, 19 Jun 2007 15:22:36 -0700 (PDT)
> Christoph Lameter <clameter@sgi.com> wrote:
> 
> > On Tue, 19 Jun 2007, Andrew Morton wrote:
> > 
> > > On Mon, 18 Jun 2007 02:58:48 -0700
> > > clameter@sgi.com wrote:
> > > 
> > > > +	BUG_ON(KMALLOC_MIN_SIZE > 256 ||
> > > > +		(KMALLOC_MIN_SIZE & (KMALLOC_MIN_SIZE - 1)));
> > > 
> > > BUILD_BUG_ON?
> > > 
> > Does not matter. That code is __init.
> 
> Finding out at compile time is better.

Ok and BUILD_BUG_ON really works? Had some bad experiences with it.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.22-rc4-mm2/mm/slub.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/mm/slub.c	2007-06-19 15:36:57.000000000 -0700
+++ linux-2.6.22-rc4-mm2/mm/slub.c	2007-06-19 15:37:05.000000000 -0700
@@ -3079,7 +3079,7 @@ void __init kmem_cache_init(void)
 	 * Make sure that nothing crazy happens if someone starts tinkering
 	 * around with ARCH_KMALLOC_MINALIGN
 	 */
-	BUG_ON(KMALLOC_MIN_SIZE > 256 ||
+	BUILD_BUG_ON(KMALLOC_MIN_SIZE > 256 ||
 		(KMALLOC_MIN_SIZE & (KMALLOC_MIN_SIZE - 1)));
 
 	for (i = 8; i < KMALLOC_MIN_SIZE;i++)
 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 10/26] SLUB: Faster more efficient slab determination for __kmalloc.
  2007-06-19 22:38         ` Christoph Lameter
@ 2007-06-19 22:46           ` Andrew Morton
  2007-06-25  6:41             ` Nick Piggin
  0 siblings, 1 reply; 73+ messages in thread
From: Andrew Morton @ 2007-06-19 22:46 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-kernel, linux-mm, Pekka Enberg, suresh.b.siddha

On Tue, 19 Jun 2007 15:38:01 -0700 (PDT)
Christoph Lameter <clameter@sgi.com> wrote:

> Ok and BUILD_BUG_ON really works? Had some bad experiences with it.

hm, I don't recall any problems, apart from its very obscure error
reporting.

But if it breaks, we get an opportunity to fix it ;)

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 26/26] SLUB: Place kmem_cache_cpu structures in a NUMA aware way.
  2007-06-18  9:59 ` [patch 26/26] SLUB: Place kmem_cache_cpu structures in a NUMA aware way clameter
@ 2007-06-19 23:17   ` Christoph Lameter
  0 siblings, 0 replies; 73+ messages in thread
From: Christoph Lameter @ 2007-06-19 23:17 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-mm, Pekka Enberg, suresh.b.siddha

Some fixups to this patch:


Fix issues with per cpu kmem_cache_cpu arrays.

1. During cpu bootstrap we also need to bootstrap the per cpu array
   for the cpu in SLUB.
   kmem_cache_init is called while only a single cpu is marked online.

2. The size determination of the kmem_cache array is wrong for UP.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 mm/slub.c |   28 +++++++++++++++++++---------
 1 file changed, 19 insertions(+), 9 deletions(-)

Index: linux-2.6.22-rc4-mm2/mm/slub.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/mm/slub.c	2007-06-19 15:38:22.000000000 -0700
+++ linux-2.6.22-rc4-mm2/mm/slub.c	2007-06-19 16:13:17.000000000 -0700
@@ -2016,21 +2016,28 @@ static int alloc_kmem_cache_cpus(struct 
 	return 1;
 }
 
+/*
+ * Initialize the per cpu array.
+ */
+static void init_alloc_cpu_cpu(int cpu)
+{
+	int i;
+
+	for (i = NR_KMEM_CACHE_CPU - 1; i >= 0; i--)
+		free_kmem_cache_cpu(&per_cpu(kmem_cache_cpu, cpu)[i], cpu);
+}
+
 static void __init init_alloc_cpu(void)
 {
 	int cpu;
-	int i;
 
-	for_each_online_cpu(cpu) {
-		for (i = NR_KMEM_CACHE_CPU - 1; i >= 0; i--)
-			free_kmem_cache_cpu(&per_cpu(kmem_cache_cpu, cpu)[i],
-								cpu);
-	}
+	for_each_online_cpu(cpu)
+		init_alloc_cpu_cpu(cpu);
 }
 
 #else
 static inline void free_kmem_cache_cpus(struct kmem_cache *s) {}
-static inline void init_alloc_cpu(struct kmem_cache *s) {}
+static inline void init_alloc_cpu(void) {}
 
 static inline int alloc_kmem_cache_cpus(struct kmem_cache *s, gfp_t flags)
 {
@@ -3094,10 +3101,12 @@ void __init kmem_cache_init(void)
 
 #ifdef CONFIG_SMP
 	register_cpu_notifier(&slab_notifier);
-#endif
-
 	kmem_size = offsetof(struct kmem_cache, cpu_slab) +
 				nr_cpu_ids * sizeof(struct kmem_cache_cpu *);
+#else
+	kmem_size = sizeof(struct kmem_cache);
+#endif
+
 
 	printk(KERN_INFO "SLUB: Genslabs=%d, HWalign=%d, Order=%d-%d,"
 		" MinObjects=%d, CPUs=%d, Nodes=%d\n",
@@ -3244,6 +3253,7 @@ static int __cpuinit slab_cpuup_callback
 	switch (action) {
 	case CPU_UP_PREPARE:
 	case CPU_UP_PREPARE_FROZEN:
+		init_alloc_cpu_cpu(cpu);
 		down_read(&slub_lock);
 		list_for_each_entry(s, &slab_caches, list)
 			s->cpu_slab[cpu] = alloc_kmem_cache_cpu(cpu,

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 05/26] Slab allocators: Cleanup zeroing allocations
  2007-06-19 22:33     ` Christoph Lameter
@ 2007-06-20  6:14       ` Pekka J Enberg
  0 siblings, 0 replies; 73+ messages in thread
From: Pekka J Enberg @ 2007-06-20  6:14 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Matt Mackall, akpm, linux-kernel, linux-mm, suresh.b.siddha

On Tue, 19 Jun 2007, Matt Mackall wrote:
> > I worry a bit about adding another branch checking __GFP_ZERO in such
> > a hot path for SLAB/SLUB.

On Tue, 19 Jun 2007, Christoph Lameter wrote:
> It's checking the gfpflags variable on the stack, in a recently touched
> cacheline.

The variable could be in a register too, but it's the _branch 
instruction_ that is a bit worrisome, especially for embedded devices (think 
slob). I haven't measured this, so consider this pure speculation and 
hand-waving on my part.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 10/26] SLUB: Faster more efficient slab determination for __kmalloc.
  2007-06-19 22:46           ` Andrew Morton
@ 2007-06-25  6:41             ` Nick Piggin
  0 siblings, 0 replies; 73+ messages in thread
From: Nick Piggin @ 2007-06-25  6:41 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, linux-kernel, linux-mm, Pekka Enberg, suresh.b.siddha

Andrew Morton wrote:
> On Tue, 19 Jun 2007 15:38:01 -0700 (PDT)
> Christoph Lameter <clameter@sgi.com> wrote:
> 
> 
>>Ok and BUILD_BUG_ON really works? Had some bad experiences with it.
> 
> 
> hm, I don't recall any problems, apart from its very obscure error
> reporting.
> 
> But if it breaks, we get an opportunity to fix it ;)

It doesn't work outside function scope, which can be annoying. The
workaround is to just create a dummy function and put the BUILD_BUG_ON
inside that.
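
A sketch of the workaround Nick describes (the constant and function
names here are made up for illustration, not taken from the patch):

#include <linux/kernel.h>

#define MY_MIN_SIZE 8	/* stand-in for a constant like KMALLOC_MIN_SIZE */

/*
 * BUILD_BUG_ON() expands to a statement, so it cannot be placed at file
 * scope.  Putting it inside a dummy function that is never called still
 * lets the compiler evaluate the condition at build time.
 */
static inline void build_time_checks(void)
{
	BUILD_BUG_ON(MY_MIN_SIZE > 256);
	BUILD_BUG_ON(MY_MIN_SIZE & (MY_MIN_SIZE - 1));
}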

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 12/26] SLUB: Slab defragmentation core
  2007-06-18  9:58 ` [patch 12/26] SLUB: Slab defragmentation core clameter
@ 2007-06-26  8:18   ` Andrew Morton
  2007-06-26 18:19     ` Christoph Lameter
  2007-06-26 19:13   ` Nish Aravamudan
  1 sibling, 1 reply; 73+ messages in thread
From: Andrew Morton @ 2007-06-26  8:18 UTC (permalink / raw)
  To: clameter; +Cc: linux-kernel, linux-mm, Pekka Enberg, suresh.b.siddha

On Mon, 18 Jun 2007 02:58:50 -0700 clameter@sgi.com wrote:

> Slab defragmentation occurs either
> 
> 1. Unconditionally when kmem_cache_shrink is called on slab by the kernel
>    calling kmem_cache_shrink or slabinfo triggering slab shrinking. This
>    form performs defragmentation on all nodes of a NUMA system.
> 
> 2. Conditionally when kmem_cache_defrag(<percentage>, <node>) is called.
> 
>    The defragmentation is only performed if the fragmentation of the slab
>    is higher than the specified percentage. Fragmentation ratios are measured
>    by calculating the percentage of objects in use compared to the total
>    number of objects that the slab cache could hold.
> 
>    kmem_cache_defrag takes a node parameter. This can either be -1 if
>    defragmentation should be performed on all nodes, or a node number.
>    If a node number was specified then defragmentation is only performed
>    on a specific node.
> 
>    Slab defragmentation is a memory intensive operation that can be
>    sped up in a NUMA system if mostly node local memory is accessed. That
>    is the case if we have just run reclaim on a node.
> 
> For defragmentation SLUB first generates a sorted list of partial slabs.
> Sorting is performed according to the number of objects allocated.
> Thus the slabs with the least objects will be at the end.
> 
> We extract slabs off the tail of that list until we have either reached a
> minimum number of slabs or until we encounter a slab that has more than a
> quarter of its objects allocated. Then we attempt to remove the objects
> from each of the slabs taken.
> 
> In order for a slabcache to support defragmentation a couple of functions
> must be defined via kmem_cache_ops. These are
> 
> void *get(struct kmem_cache *s, int nr, void **objects)
> 
> 	Must obtain a reference to the listed objects. SLUB guarantees that
> 	the objects are still allocated. However, other threads may be blocked
> 	in slab_free attempting to free objects in the slab. These may succeed
> 	as soon as get() returns to the slab allocator. The function must
> 	be able to detect the situation and void the attempts to handle such
> 	objects (by for example voiding the corresponding entry in the objects
> 	array).
> 
> 	No slab operations may be performed in get_reference(). Interrupts

s/get_reference/get/, yes?

> 	are disabled. What can be done is very limited. The slab lock
> 	for the page with the object is taken. Any attempt to perform a slab
> 	operation may lead to a deadlock.
> 
> 	get() returns a private pointer that is passed to kick. Should we
> 	be unable to obtain all references then that pointer may indicate
> 	to the kick() function that it should not attempt any object removal
> 	or move but simply remove the reference counts.
> 
> void kick(struct kmem_cache *, int nr, void **objects, void *get_result)
> 
> 	After SLUB has established references to the objects in a
> 	slab it will drop all locks and then use kick() to move objects out
> 	of the slab. The existence of the object is guaranteed by virtue of
> 	the earlier obtained references via get(). The callback may perform
> 	any slab operation since no locks are held at the time of call.
> 
> 	The callback should remove the object from the slab in some way. This
> 	may be accomplished by reclaiming the object and then running
> 	kmem_cache_free() or reallocating it and then running
> 	kmem_cache_free(). Reallocation is advantageous because the partial
> 	slabs were just sorted to have the partial slabs with the most objects
> 	first. Reallocation is likely to result in filling up a slab in
> 	addition to freeing up one slab so that it also can be removed from
> 	the partial list.
> 
> 	Kick() does not return a result. SLUB will check the number of
> 	remaining objects in the slab. If all objects were removed then
> 	we know that the operation was successful.
> 
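
For reference while reading the review below, a minimal sketch of the
callback pair the quoted changelog describes (the structure layout is
inferred from the signatures above and from patch 11/26, not copied from
the patch itself):

#include <linux/slab.h>

struct kmem_cache_ops {
	/* Pin the listed objects; returns a cookie later passed to kick(). */
	void *(*get)(struct kmem_cache *s, int nr, void **objects);
	/* Move or free the pinned objects; no slab locks are held here. */
	void (*kick)(struct kmem_cache *s, int nr, void **objects,
							void *private);
};
/* Registered through kmem_cache_create(), in the argument slot formerly
 * used for the destructor (see patch 11/26). */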

Nice changelog ;)

> +static int __kmem_cache_vacate(struct kmem_cache *s,
> +		struct page *page, unsigned long flags, void *scratch)
> +{
> +	void **vector = scratch;
> +	void *p;
> +	void *addr = page_address(page);
> +	DECLARE_BITMAP(map, s->objects);

A variable-sized local.  We have a few of these in-kernel.

What's the worst-case here?  With 4k pages and 4-byte slab it's 128 bytes
of stack?  Seems acceptable.

(What's the smallest sized object slub will create?  4 bytes?)



To hold off a concurrent free while defragging, the code relies upon
slab_lock() on the current page, yes?

But slab_lock() isn't taken for slabs whose objects are larger than PAGE_SIZE. 
How's that handled?



Overall: looks good.  It'd be nice to get a buffer_head shrinker in place,
see how that goes from a proof-of-concept POV.


How much testing has been done on this code, and of what form, and with
what results?


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 15/26] Slab defrag: Support generic defragmentation for inode slab caches
  2007-06-18  9:58 ` [patch 15/26] Slab defrag: Support generic defragmentation for inode slab caches clameter
@ 2007-06-26  8:18   ` Andrew Morton
  2007-06-26 18:21     ` Christoph Lameter
  2007-06-26 19:28     ` Christoph Lameter
  0 siblings, 2 replies; 73+ messages in thread
From: Andrew Morton @ 2007-06-26  8:18 UTC (permalink / raw)
  To: clameter; +Cc: linux-kernel, linux-mm, Pekka Enberg, suresh.b.siddha

On Mon, 18 Jun 2007 02:58:53 -0700 clameter@sgi.com wrote:

> This implements the ability to remove inodes in a particular slab
> from inode cache. In order to remove an inode we may have to write out
> the pages of an inode, the inode itself and remove the dentries referring
> to the node.
> 
> Provide generic functionality that can be used by filesystems that have
> their own inode caches to also tie into the defragmentation functions
> that are made available here.

Yes, this is tricky stuff.  I have vague ancestral memories that the sort
of inode work which you refer to here can cause various deadlocks, lockdep
warnings and such nasties if we attempt to call it from the wrong
context (ie: from within fs code).

Possibly we could prevent that by skipping all this code if the caller
didn't have __GFP_FS.


I trust all the code in kick_inodes() was carefully copied from
prune_icache() and such places - I didn't check it.


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 21/26] Slab defragmentation: support dentry defragmentation
  2007-06-18  9:58 ` [patch 21/26] Slab defragmentation: support dentry defragmentation clameter
@ 2007-06-26  8:18   ` Andrew Morton
  2007-06-26 18:23     ` Christoph Lameter
  0 siblings, 1 reply; 73+ messages in thread
From: Andrew Morton @ 2007-06-26  8:18 UTC (permalink / raw)
  To: clameter; +Cc: linux-kernel, linux-mm, Pekka Enberg, suresh.b.siddha

On Mon, 18 Jun 2007 02:58:59 -0700 clameter@sgi.com wrote:

> get() uses the dcache lock and then works with dget_locked to obtain a
> reference to the dentry. An additional complication is that the dentry
> may be in the process of being freed or it may just have been allocated.
> We add an additional flag to d_flags to be able to determine the
> status of an object.
> 
> kick() is called after get() has been used and after the slab has dropped
> all of its own locks. The dentry pruning for unused entries works in a
> straightforward way.
> 
> ...
>
> +/*
> + * Slab has dropped all the locks. Get rid of the
> + * refcount we obtained earlier and also rid of the
> + * object.
> + */
> +static void kick_dentries(struct kmem_cache *s, int nr, void **v, void *private)
> +{
> +	struct dentry *dentry;
> +	int abort = 0;
> +	int i;
> +
> +	/*
> +	 * First invalidate the dentries without holding the dcache lock
> +	 */
> +	for (i = 0; i < nr; i++) {
> +		dentry = v[i];
> +
> +		if (dentry)
> +			d_invalidate(dentry);
> +	}
> +
> +	/*
> +	 * If we are the last one holding a reference then the dentries can
> +	 * be freed. We  need the dcache_lock.
> +	 */
> +	spin_lock(&dcache_lock);
> +	for (i = 0; i < nr; i++) {
> +		dentry = v[i];
> +		if (!dentry)
> +			continue;
> +
> +		if (abort)
> +			goto put_dentry;
> +
> +		spin_lock(&dentry->d_lock);
> +		if (atomic_read(&dentry->d_count) > 1) {
> +			/*
> +			 * Reference count was increased.
> +			 * We need to abandon the freeing of
> +			 * objects.
> +			 */
> +			abort = 1;

It's unobvious why the entire shrink effort is abandoned if one busy dentry
is encountered.  Please flesh the comment out explaining this.

> +			spin_unlock(&dentry->d_lock);
> +put_dentry:
> +			spin_unlock(&dcache_lock);
> +			dput(dentry);
> +			spin_lock(&dcache_lock);
> +			continue;
> +		}
> +
> +		/* Remove from LRU */
> +		if (!list_empty(&dentry->d_lru)) {
> +			dentry_stat.nr_unused--;
> +			list_del_init(&dentry->d_lru);
> +		}
> +		/* Drop the entry */
> +		prune_one_dentry(dentry, 1);
> +	}
> +	spin_unlock(&dcache_lock);
> +
> +	/*
> +	 * dentries are freed using RCU so we need to wait until RCU
> +	 * operations are complete
> +	 */
> +	if (!abort)
> +		synchronize_rcu();
> +}


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 12/26] SLUB: Slab defragmentation core
  2007-06-26  8:18   ` Andrew Morton
@ 2007-06-26 18:19     ` Christoph Lameter
  2007-06-26 18:38       ` Andrew Morton
  0 siblings, 1 reply; 73+ messages in thread
From: Christoph Lameter @ 2007-06-26 18:19 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-mm, Pekka Enberg, suresh.b.siddha

On Tue, 26 Jun 2007, Andrew Morton wrote:

> > 	No slab operations may be performed in get_reference(). Interrupts
> 
> s/get_reference/get/, yes?

Correct.

> (What's the smallest sized object slub will create?  4 bytes?)

__alignof__(unsigned long long)

> To hold off a concurrent free while defragging, the code relies upon
> slab_lock() on the current page, yes?

Right.
 
> But slab_lock() isn't taken for slabs whose objects are larger than 
> PAGE_SIZE. How's that handled?

slab lock is always taken. How did you get that idea?

> Overall: looks good.  It'd be nice to get a buffer_head shrinker in place,
> see how that goes from a proof-of-concept POV.

Ok.

> How much testing has been done on this code, and of what form, and with
> what results?

I posted them in the intro of the last full post and then Michal 
Piotrowski did some stress tests.

See http://marc.info/?l=linux-mm&m=118125373320855&w=2


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 15/26] Slab defrag: Support generic defragmentation for inode slab caches
  2007-06-26  8:18   ` Andrew Morton
@ 2007-06-26 18:21     ` Christoph Lameter
  2007-06-26 19:28     ` Christoph Lameter
  1 sibling, 0 replies; 73+ messages in thread
From: Christoph Lameter @ 2007-06-26 18:21 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-mm, Pekka Enberg, suresh.b.siddha

On Tue, 26 Jun 2007, Andrew Morton wrote:

> > Provide generic functionality that can be used by filesystems that have
> > their own inode caches to also tie into the defragmentation functions
> > that are made available here.
> 
> Yes, this is tricky stuff.  I have vague ancestral memories that the sort
> of inode work which you refer to here can cause various deadlocks, lockdep
> warnings and such nasties if we attempt to call it from the wrong
> context (ie: from within fs code).

Right, that is likely the reason why Michal did his stress test...
 
> Possibly we could prevent that by skipping all this code if the caller
> didn't have __GFP_FS.

We do. Look at the earlier patch.

> I trust all the code in kick_inodes() was carefully copied from
> prune_icache() and such places - I didn't check it.

Yup, I tried to remain faithful to that. We could increase the usefulness if 
I could take more liberties with the code in order to actually move an 
item instead of simply reclaiming. But it's better to first have a proven 
correct solution before doing more work on that.



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 21/26] Slab defragmentation: support dentry defragmentation
  2007-06-26  8:18   ` Andrew Morton
@ 2007-06-26 18:23     ` Christoph Lameter
  0 siblings, 0 replies; 73+ messages in thread
From: Christoph Lameter @ 2007-06-26 18:23 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-mm, Pekka Enberg, suresh.b.siddha

On Tue, 26 Jun 2007, Andrew Morton wrote:

> > +			 * objects.
> > +			 */
> > +			abort = 1;
> 
> It's unobvious why the entire shrink effort is abandoned if one busy dentry
> is encountered.  Please flesh the comment out explaining this.

If one item is busy then we cannot reclaim the slab, so what would be the 
use of continuing the effort? I thought I put that into the description? I 
can put that into the code too.


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 12/26] SLUB: Slab defragmentation core
  2007-06-26 18:19     ` Christoph Lameter
@ 2007-06-26 18:38       ` Andrew Morton
  2007-06-26 18:52         ` Christoph Lameter
  0 siblings, 1 reply; 73+ messages in thread
From: Andrew Morton @ 2007-06-26 18:38 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-kernel, linux-mm, Pekka Enberg, suresh.b.siddha

On Tue, 26 Jun 2007 11:19:26 -0700 (PDT)
Christoph Lameter <clameter@sgi.com> wrote:

>  
> > But slab_lock() isn't taken for slabs whose objects are larger than 
> > PAGE_SIZE. How's that handled?
> 
> slab lock is always taken. How did you get that idea?

Damned if I know.  Perhaps by reading slob.c instead of slub.c.  When can
we start deleting some slab implementations?

> > How much testing has been done on this code, and of what form, and with
> > what results?
> 
> I posted them in the intro of the last full post and then Michael 
> Piotrowski did some stress tests.
> 
> See http://marc.info/?l=linux-mm&m=118125373320855&w=2

hm, OK, thin.

I think we'll need to come up with a better-than-usual test plan for this
change.  One starting point might be to ask what in-the-field problem
you're trying to address here, and what the results were.


Also, what are the risks of meltdowns in this code?  For example, it
reaches the magical 30% ratio, tries to do defrag, but the defrag is for
some reason unsuccessful and it then tries to run defrag again, etc.

And that was "for example"!  Are there other such potential problems in
there?  There usually are, with memory reclaim.


(Should slab_defrag_ratio be per-slab rather than global?)

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 12/26] SLUB: Slab defragmentation core
  2007-06-26 18:38       ` Andrew Morton
@ 2007-06-26 18:52         ` Christoph Lameter
  0 siblings, 0 replies; 73+ messages in thread
From: Christoph Lameter @ 2007-06-26 18:52 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-mm, Pekka Enberg, suresh.b.siddha

On Tue, 26 Jun 2007, Andrew Morton wrote:

> Damned if I know.  Perhaps by reading slob.c instead of slub.c.  When can
> we start deleting some slab implementations?

Probably after we switch to SLUB in 2.6.23 and then address all the 
eventual complaints and issues that come up.

> > See http://marc.info/?l=linux-mm&m=118125373320855&w=2
> 
> hm, OK, thin.
> 
> I think we'll need to come up with a better-than-usual test plan for this
> change.  One starting point might be to ask what in-the-field problem
> you're trying to address here, and what the results were.

The typical scenario is the unmounting of a volume with a large number of 
entries. Anything that uses a large number of inodes and then shifts the
load so that the memory needs to be used for a different purpose. 
Currently those cases lead to trapping a lot of memory in dentry / inode 
caches.

Note that the approach may also support memory compaction by Mel.
It may allow us to get rid of the RECLAIMABLE category and thus simplify
his code.

> Also, what are the risks of meltdowns in this code?  For example, it
> reaches the magical 30% ratio, tries to do defrag, but the defrag is for
> some reason unsuccessful and it then tries to run defrag again, etc.

That could occur if something keeps holding extra references to 
dentries and inodes for a long time. Same issue as with page migration. 
Migrates again and again.

The issue is to some extent avoided by putting slabs that we were not able
to handle at the top of the partial list, meaning these slabs will soon be
grabbed and used for allocations. So they will fill up and be protected from
new attempts until they have first been filled up and then aged on the 
partial list.

Another measure to avoid that issue is that we abandon attempts at the
first sign of trouble. That limits the overhead. If we get into some
strange scenario where the slabs are unreclaimable then we will not retry.

Yet another measure is to not attempt anything if the number of
partial slabs is below a certain minimum. We will never attempt to
handle all partial slabs; some problem slabs may stick around without
causing additional reclaim.

> And that was "for example"!  Are there other such potential problems in
> there?  There usually are, with memory reclaim.

It's difficult to foresee all these issues. I have tried to cover what I 
could imagine.
 
> (Should slab_defrag_ratio be per-slab rather than global?)

I have not seen scenarios that would justify that change. The inode and
dentry caches are closely related, and it's easier to manage a single
global number.
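
As a rough illustration of what a single global number implies, here is a
hedged sketch in plain C; apart from sysctl_slab_defrag_ratio the names are
invented, and the comparison reflects one plausible reading (defragment a
cache only while its object usage has fallen below the ratio) rather than
the patch's exact test:

int sysctl_slab_defrag_ratio = 30;      /* assumed default */

static int cache_is_defrag_candidate(unsigned long objects_in_use,
                                     unsigned long capacity)
{
        if (!capacity)
                return 0;
        return objects_in_use * 100 / capacity < sysctl_slab_defrag_ratio;
}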


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 12/26] SLUB: Slab defragmentation core
  2007-06-18  9:58 ` [patch 12/26] SLUB: Slab defragmentation core clameter
  2007-06-26  8:18   ` Andrew Morton
@ 2007-06-26 19:13   ` Nish Aravamudan
  2007-06-26 19:19     ` Christoph Lameter
  1 sibling, 1 reply; 73+ messages in thread
From: Nish Aravamudan @ 2007-06-26 19:13 UTC (permalink / raw)
  To: clameter; +Cc: akpm, linux-kernel, linux-mm, Pekka Enberg, suresh.b.siddha

On 6/18/07, clameter@sgi.com <clameter@sgi.com> wrote:
> Slab defragmentation occurs either
>
> 1. Unconditionally when kmem_cache_shrink is called on slab by the kernel
>    calling kmem_cache_shrink or slabinfo triggering slab shrinking. This
>    form performs defragmentation on all nodes of a NUMA system.
>
> 2. Conditionally when kmem_cache_defrag(<percentage>, <node>) is called.
>
>    The defragmentation is only performed if the fragmentation of the slab
>    is higher than the specified percentage. Fragmentation ratios are measured
>    by calculating the percentage of objects in use compared to the total
>    number of objects that the slab cache could hold.
>
>    kmem_cache_defrag takes a node parameter. This can either be -1 if
>    defragmentation should be performed on all nodes, or a node number.
>    If a node number was specified then defragmentation is only performed
>    on a specific node.

Hrm, isn't -1 usually 'this node' for NUMA systems? Maybe nr_node_ids
or MAX_NUMNODES should mean 'all nodes'?

Perhaps these would be served with some #defines?

#define NUMA_THISNODE_ID (-1)
#define NUMA_ALLNODES_ID (MAX_NUMNODES)

or something?

Thanks,
Nish

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 12/26] SLUB: Slab defragmentation core
  2007-06-26 19:13   ` Nish Aravamudan
@ 2007-06-26 19:19     ` Christoph Lameter
  0 siblings, 0 replies; 73+ messages in thread
From: Christoph Lameter @ 2007-06-26 19:19 UTC (permalink / raw)
  To: Nish Aravamudan
  Cc: akpm, linux-kernel, linux-mm, Pekka Enberg, suresh.b.siddha

On Tue, 26 Jun 2007, Nish Aravamudan wrote:

> >    kmem_cache_defrag takes a node parameter. This can either be -1 if
> >    defragmentation should be performed on all nodes, or a node number.
> >    If a node number was specified then defragmentation is only performed
> >    on a specific node.
> 
> Hrm, isn't -1 usually 'this node' for NUMA systems? Maybe nr_node_ids
> or MAX_NUMNODES should mean 'all nodes'?

-1 means that no node was specified; what happens in that case depends on
the function. For "this node" you can use numa_node_id().
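
A short usage sketch of the interface as described above; the prototype
spelling and return type are assumptions taken from the changelog, not
copied from the patch:

/* Assumed prototype: kmem_cache_defrag(<percentage>, <node>). */
int kmem_cache_defrag(int percentage, int node);

static void defrag_usage_sketch(void)
{
        /* Defragment sufficiently fragmented caches on all nodes. */
        kmem_cache_defrag(30, -1);

        /* Restrict the pass to the local node. */
        kmem_cache_defrag(30, numa_node_id());
}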

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 15/26] Slab defrag: Support generic defragmentation for inode slab caches
  2007-06-26  8:18   ` Andrew Morton
  2007-06-26 18:21     ` Christoph Lameter
@ 2007-06-26 19:28     ` Christoph Lameter
  2007-06-26 19:37       ` Andrew Morton
  1 sibling, 1 reply; 73+ messages in thread
From: Christoph Lameter @ 2007-06-26 19:28 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-mm, Pekka Enberg, suresh.b.siddha

On Tue, 26 Jun 2007, Andrew Morton wrote:

> Yes, this is tricky stuff.  I have vague ancestral memories that the sort
> of inode work which you refer to here can cause various deadlocks, lockdep
> warnings and such nasties if we attempt to call it from the wrong
> context (ie: from within fs code).

Right. Michal's test flushed one such issue out.

> Possibly we could prevent that by skipping all this code if the caller
> didn't have __GFP_FS.

There is no check in vmscan.c as I thought earlier.


Slab defragmentation: Only perform slab defrag if __GFP_FS is clear

Avoids slab defragmentation being triggered from filesystem operations.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 mm/vmscan.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

Index: linux-2.6.22-rc4-mm2/mm/vmscan.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/mm/vmscan.c	2007-06-26 12:25:28.000000000 -0700
+++ linux-2.6.22-rc4-mm2/mm/vmscan.c	2007-06-26 12:26:18.000000000 -0700
@@ -233,8 +233,9 @@ unsigned long shrink_slab(unsigned long 
 		shrinker->nr += total_scan;
 	}
 	up_read(&shrinker_rwsem);
-	kmem_cache_defrag(sysctl_slab_defrag_ratio,
-		zone ? zone_to_nid(zone) : -1);
+	if (!(gfp_mask & __GFP_FS))
+		kmem_cache_defrag(sysctl_slab_defrag_ratio,
+			zone ? zone_to_nid(zone) : -1);
 	return ret;
 }
 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 15/26] Slab defrag: Support generic defragmentation for inode slab caches
  2007-06-26 19:28     ` Christoph Lameter
@ 2007-06-26 19:37       ` Andrew Morton
  2007-06-26 19:41         ` Christoph Lameter
  0 siblings, 1 reply; 73+ messages in thread
From: Andrew Morton @ 2007-06-26 19:37 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-kernel, linux-mm, Pekka Enberg, suresh.b.siddha

On Tue, 26 Jun 2007 12:28:50 -0700 (PDT)
Christoph Lameter <clameter@sgi.com> wrote:

> On Tue, 26 Jun 2007, Andrew Morton wrote:
> 
> > Yes, this is tricky stuff.  I have vague ancestral memories that the sort
> > of inode work which you refer to here can cause various deadlocks, lockdep
> > warnings and such nasties if we attempt to call it from the wrong
> > context (ie: from within fs code).
> 
> Right. Michael's test flushed one such issue out.
> 
> > Possibly we could prevent that by skipping all this code if the caller
> > didn't have __GFP_FS.
> 
> There is no check in vmscan.c as I thought earlier.
> 
> 
> Slab defragmentation: Only perform slab defrag if __GFP_FS is clear
> 
> Avoids slab defragmentation being triggered from filesystem operations.
> 
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
> 
> ---
>  mm/vmscan.c |    5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> Index: linux-2.6.22-rc4-mm2/mm/vmscan.c
> ===================================================================
> --- linux-2.6.22-rc4-mm2.orig/mm/vmscan.c	2007-06-26 12:25:28.000000000 -0700
> +++ linux-2.6.22-rc4-mm2/mm/vmscan.c	2007-06-26 12:26:18.000000000 -0700
> @@ -233,8 +233,9 @@ unsigned long shrink_slab(unsigned long 
>  		shrinker->nr += total_scan;
>  	}
>  	up_read(&shrinker_rwsem);
> -	kmem_cache_defrag(sysctl_slab_defrag_ratio,
> -		zone ? zone_to_nid(zone) : -1);
> +	if (!(gfp_mask & __GFP_FS))
> +		kmem_cache_defrag(sysctl_slab_defrag_ratio,
> +			zone ? zone_to_nid(zone) : -1);
>  	return ret;
>  }

This is inverted: __GFP_FS is set if we may perform fs operations.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 15/26] Slab defrag: Support generic defragmentation for inode slab caches
  2007-06-26 19:37       ` Andrew Morton
@ 2007-06-26 19:41         ` Christoph Lameter
  0 siblings, 0 replies; 73+ messages in thread
From: Christoph Lameter @ 2007-06-26 19:41 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-mm, Pekka Enberg, suresh.b.siddha

On Tue, 26 Jun 2007, Andrew Morton wrote:

> This is inverted: __GFP_FS is set if we may perform fs operations.

Sigh. 



Slab defragmentation: Only perform slab defrag if __GFP_FS is set

Avoids slab defragmentation being triggered from filesystem operations.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 mm/vmscan.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

Index: linux-2.6.22-rc4-mm2/mm/vmscan.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/mm/vmscan.c	2007-06-26 12:25:28.000000000 -0700
+++ linux-2.6.22-rc4-mm2/mm/vmscan.c	2007-06-26 12:40:44.000000000 -0700
@@ -233,8 +233,9 @@ unsigned long shrink_slab(unsigned long 
 		shrinker->nr += total_scan;
 	}
 	up_read(&shrinker_rwsem);
-	kmem_cache_defrag(sysctl_slab_defrag_ratio,
-		zone ? zone_to_nid(zone) : -1);
+	if (gfp_mask & __GFP_FS)
+		kmem_cache_defrag(sysctl_slab_defrag_ratio,
+			zone ? zone_to_nid(zone) : -1);
 	return ret;
 }
 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [patch 06/26] Slab allocators: Replace explicit zeroing with __GFP_ZERO
  2007-06-18  9:58 ` [patch 06/26] Slab allocators: Replace explicit zeroing with __GFP_ZERO clameter
  2007-06-19 20:55   ` Pekka Enberg
@ 2007-06-28  6:09   ` Andrew Morton
  1 sibling, 0 replies; 73+ messages in thread
From: Andrew Morton @ 2007-06-28  6:09 UTC (permalink / raw)
  To: clameter; +Cc: linux-kernel, linux-mm, Pekka Enberg, suresh.b.siddha

On Mon, 18 Jun 2007 02:58:44 -0700 clameter@sgi.com wrote:

> kmalloc_node() and kmem_cache_alloc_node() were not available in
> a zeroing variant in the past. But with __GFP_ZERO it is possible
> now to do zeroing while allocating.
> 
> Use __GFP_ZERO to remove the explicit clearing of memory via memset wherever
> we can.
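
For context, the conversion the quoted patch performs looks roughly like
this; "struct foo", the flags and "node" are placeholders rather than code
taken from the patch:

struct foo *p;

/* Before: allocate on a node, then clear the object by hand. */
p = kmalloc_node(sizeof(*p), GFP_KERNEL, node);
if (p)
        memset(p, 0, sizeof(*p));

/* After: pass __GFP_ZERO and let the allocator do the zeroing. */
p = kmalloc_node(sizeof(*p), GFP_KERNEL | __GFP_ZERO, node);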

I'm getting random ugly slab corruptions from this, with CONFIG_SLAB=y, an
excerpt of which is below.

It could be damage from Paul's numa-for-slob patch, dunno.  I don't think
I've runtime tested
slab-allocators-replace-explicit-zeroing-with-__gfp_zero.patch before.

I'll keep shedding sl[aou]b patches until this lot stabilises, sorry.


initcall 0xc0513de0 ran for 0 msecs: cn_proc_init+0x0/0x40()
Calling initcall 0xc0513f20: serial8250_init+0x0/0x130()
Serial: 8250/16550 driver $Revision: 1.90 $ 4 ports, IRQ sharing disabled
slab error in cache_alloc_debugcheck_after(): cache `size-32': double free, or memory outside object was overwritten
 [<c0103e9a>] show_trace_log_lvl+0x1a/0x30
 [<c0104b42>] show_trace+0x12/0x20
 [<c0104bb6>] dump_stack+0x16/0x20
 [<c0177266>] __slab_error+0x26/0x30
 [<c017780a>] cache_alloc_debugcheck_after+0xda/0x1c0
 [<c0178c3d>] __kmalloc_track_caller+0xbd/0x150
 [<c0164429>] __kzalloc+0x19/0x50
 [<c02849fd>] kobject_get_path+0x5d/0xc0
 [<c02d1968>] dev_uevent+0x108/0x390
 [<c02856be>] kobject_uevent_env+0x24e/0x470
 [<c02858ea>] kobject_uevent+0xa/0x10
 [<c02d16cf>] device_add+0x45f/0x5d0
 [<c02d1852>] device_register+0x12/0x20
 [<c02d1e56>] device_create+0x86/0xb0
 [<c02afacf>] tty_register_device+0x6f/0xf0
 [<c02cb207>] uart_add_one_port+0x1f7/0x2f0
 [<c0514007>] serial8250_init+0xe7/0x130
 [<c04fc622>] kernel_init+0x132/0x300
 [<c0103ad7>] kernel_thread_helper+0x7/0x10
 =======================
c2e8aba8: redzone 1:0x0, redzone 2:0x9f911029d74e35b
initcall 0xc0513f20: serial8250_init+0x0/0x130() returned 0.
initcall 0xc0513f20 ran for 4 msecs: serial8250_init+0x0/0x130()
Calling initcall 0xc0514050: serial8250_pnp_init+0x0/0x10()
00:0b: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
00:0c: ttyS1 at I/O 0x2f8 (irq = 3) is a 16550A
initcall 0xc0514050: serial8250_pnp_init+0x0/0x10() returned 0.
initcall 0xc0514050 ran for 3 msecs: serial8250_pnp_init+0x0/0x10()
Calling initcall 0xc0514060: serial8250_pci_init+0x0/0x20()
initcall 0xc0514060: serial8250_pci_init+0x0/0x20() returned 0.
initcall 0xc0514060 ran for 0 msecs: serial8250_pci_init+0x0/0x20()
Calling initcall 0xc05141e0: isa_bus_init+0x0/0x40()
initcall 0xc05141e0: isa_bus_init+0x0/0x40() returned 0.
initcall 0xc05141e0 ran for 0 msecs: isa_bus_init+0x0/0x40()
Calling initcall 0xc02d94f0: topology_sysfs_init+0x0/0x50()
initcall 0xc02d94f0: topology_sysfs_init+0x0/0x50() returned 0.
initcall 0xc02d94f0 ran for 0 msecs: topology_sysfs_init+0x0/0x50()
Calling initcall 0xc0514590: floppy_init+0x0/0xf10()
Floppy drive(s): fd0 is 1.44M
FDC 0 is a post-1991 82077
initcall 0xc0514590: floppy_init+0x0/0xf10() returned 0.
initcall 0xc0514590 ran for 20 msecs: floppy_init+0x0/0xf10()
Calling initcall 0xc05154f0: rd_init+0x0/0x1e0()
slab error in cache_alloc_debugcheck_after(): cache `size-64': double free, or memory outside object was overwritten
 [<c0103e9a>] show_trace_log_lvl+0x1a/0x30
 [<c0104b42>] show_trace+0x12/0x20
 [<c0104bb6>] dump_stack+0x16/0x20
 [<c0177266>] __slab_error+0x26/0x30
 [<c017780a>] cache_alloc_debugcheck_after+0xda/0x1c0
 [<c0178af0>] __kmalloc+0xc0/0x150
 [<c017a712>] percpu_populate+0x22/0x30
 [<c017a75f>] __percpu_populate_mask+0x3f/0x80
 [<c017a7e4>] __percpu_alloc_mask+0x44/0x80
 [<c027d130>] alloc_disk_node+0x30/0xb0
 [<c027d1bd>] alloc_disk+0xd/0x10
 [<c051552a>] rd_init+0x3a/0x1e0
 [<c04fc622>] kernel_init+0x132/0x300
 [<c0103ad7>] kernel_thread_helper+0x7/0x10
 =======================
c2f7fbd0: redzone 1:0x0, redzone 2:0x9f911029d74e35b
RAMDISK driver initialized: 16 RAM disks of 4096K size 1024 blocksize
initcall 0xc05154f0: rd_init+0x0/0x1e0() returned 0.
initcall 0xc05154f0 ran for 8 msecs: rd_init+0x0/0x1e0()
Calling initcall 0xc05156f0: loop_init+0x0/0x180()
slab error in cache_alloc_debugcheck_after(): cache `size-64': double free, or memory outside object was overwritten
 [<c0103e9a>] show_trace_log_lvl+0x1a/0x30
 [<c0104b42>] show_trace+0x12/0x20
 [<c0104bb6>] dump_stack+0x16/0x20
 [<c0177266>] __slab_error+0x26/0x30
 [<c017780a>] cache_alloc_debugcheck_after+0xda/0x1c0
 [<c0178af0>] __kmalloc+0xc0/0x150
 [<c017a712>] percpu_populate+0x22/0x30
 [<c017a75f>] __percpu_populate_mask+0x3f/0x80
 [<c017a7e4>] __percpu_alloc_mask+0x44/0x80
 [<c027d130>] alloc_disk_node+0x30/0xb0
 [<c027d1bd>] alloc_disk+0xd/0x10
 [<c02e16a1>] loop_alloc+0x51/0x110
 [<c0515781>] loop_init+0x91/0x180
 [<c04fc622>] kernel_init+0x132/0x300
 [<c0103ad7>] kernel_thread_helper+0x7/0x10
 =======================
c2f28650: redzone 1:0x0, redzone 2:0x9f911029d74e35b
loop: module loaded
initcall 0xc05156f0: loop_init+0x0/0x180() returned 0.
initcall 0xc05156f0 ran for 7 msecs: loop_init+0x0/0x180()
Calling initcall 0xc0515870: e100_init_module+0x0/0x60()
e100: Intel(R) PRO/100 Network Driver, 3.5.17-k4-NAPI
e100: Copyright(c) 1999-2006 Intel Corporation
e100: eth0: e100_probe: addr 0xfc5ff000, irq 11, MAC addr 00:90:27:70:14:CD
initcall 0xc0515870: e100_init_module+0x0/0x60() returned 0.
initcall 0xc0515870 ran for 22 msecs: e100_init_module+0x0/0x60()
Calling initcall 0xc0515940: net_olddevs_init+0x0/0x90()
initcall 0xc0515940: net_olddevs_init+0x0/0x90() returned 0.
initcall 0xc0515940 ran for 0 msecs: net_olddevs_init+0x0/0x90()
Calling initcall 0xc05159d0: loopback_init+0x0/0x10()
initcall 0xc05159d0: loopback_init+0x0/0x10() returned 0.
initcall 0xc05159d0 ran for 0 msecs: loopback_init+0x0/0x10()
Calling initcall 0xc05159e0: dummy_init_module+0x0/0x140()
initcall 0xc05159e0: dummy_init_module+0x0/0x140() returned 0.
initcall 0xc05159e0 ran for 0 msecs: dummy_init_module+0x0/0x140()
Calling initcall 0xc0515b20: rtl8169_init_module+0x0/0x20()
initcall 0xc0515b20: rtl8169_init_module+0x0/0x20() returned 0.
initcall 0xc0515b20 ran for 0 msecs: rtl8169_init_module+0x0/0x20()
Calling initcall 0xc02e9d70: init_netconsole+0x0/0x80()
netconsole: device eth0 not up yet, forcing it
e100: eth0: e100_watchdog: link up, 100Mbps, full-duplex
netconsole: carrier detect appears untrustworthy, waiting 4 seconds
console [netcon0] enabled
netconsole: network logging started
initcall 0xc02e9d70: init_netconsole+0x0/0x80() returned 0.
initcall 0xc02e9d70 ran for 3965 msecs: init_netconsole+0x0/0x80()
Calling initcall 0xc0515b40: piix_ide_init+0x0/0xd0()
initcall 0xc0515b40: piix_ide_init+0x0/0xd0() returned 0.
initcall 0xc0515b40 ran for 0 msecs: piix_ide_init+0x0/0xd0()
Calling initcall 0xc0515c10: generic_ide_init+0x0/0x20()
initcall 0xc0515c10: generic_ide_init+0x0/0x20() returned 0.
initcall 0xc0515c10 ran for 0 msecs: generic_ide_init+0x0/0x20()
Calling initcall 0xc0515e40: ide_init+0x0/0x70()
Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
PIIX4: IDE controller at PCI slot 0000:00:07.1
PIIX4: chipset revision 1
PIIX4: not 100% native mode: will probe irqs later
    ide0: BM-DMA at 0xffa0-0xffa7, BIOS settings: hda:pio, hdb:pio
    ide1: BM-DMA at 0xffa8-0xffaf, BIOS settings: hdc:DMA, hdd:pio
Clocksource tsc unstable (delta = 68016953 ns)
Time: jiffies clocksource has been installed.
hdc: MAXTOR 6L080J4, ATA DISK drive
ide1 at 0x170-0x177,0x376 on irq 15
slab error in cache_alloc_debugcheck_after(): cache `size-64': double free, or memory outside object was overwritten
 [<c0103e9a>] show_trace_log_lvl+0x1a/0x30
 [<c0104b42>] show_trace+0x12/0x20
 [<c0104bb6>] dump_stack+0x16/0x20
 [<c0177266>] __slab_error+0x26/0x30
 [<c017780a>] cache_alloc_debugcheck_after+0xda/0x1c0
 [<c0178d4d>] kmem_cache_zalloc+0x7d/0x120
 [<c02f64d8>] __ide_add_setting+0x68/0x130
 [<c02f6699>] ide_add_generic_settings+0xf9/0x2e0
 [<c02f182d>] hwif_init+0xfd/0x360
 [<c02f1c76>] probe_hwif_init_with_fixup+0x16/0x90
 [<c02f4203>] ide_setup_pci_device+0x83/0xb0
 [<c02e9f7d>] piix_init_one+0x1d/0x20
 [<c05164a9>] ide_scan_pcidev+0x39/0x70
 [<c0516507>] ide_scan_pcibus+0x27/0xf0
 [<c0515e83>] ide_init+0x43/0x70
 [<c04fc622>] kernel_init+0x132/0x300
 [<c0103ad7>] kernel_thread_helper+0x7/0x10
 =======================
c2f28bd0: redzone 1:0x0, redzone 2:0x9f911029d74e35b
initcall 0xc0515e40: ide_init+0x0/0x70() returned 0.
initcall 0xc0515e40 ran for 1462 msecs: ide_init+0x0/0x70()
Calling initcall 0xc05165d0: ide_generic_init+0x0/0x10()
initcall 0xc05165d0: ide_generic_init+0x0/0x10() returned 0.
initcall 0xc05165d0 ran for 545 msecs: ide_generic_init+0x0/0x10()
Calling initcall 0xc05165e0: idedisk_init+0x0/0x10()
hdc: max request size: 128KiB
hdc: 156355584 sectors (80054 MB) w/1819KiB Cache, CHS=65535/16/63, UDMA(33)<3>slab error in cache_alloc_debugcheck_after(): cache `size-256': double free, or memory outside object was overwritten
 [<c0103e9a>] show_trace_log_lvl+0x1a/0x30
 [<c0104b42>] show_trace+0x12/0x20
 [<c0104bb6>] dump_stack+0x16/0x20
 [<c0177266>] __slab_error+0x26/0x30
 [<c017780a>] cache_alloc_debugcheck_after+0xda/0x1c0
 [<c0178c3d>] __kmalloc_track_caller+0xbd/0x150
 [<c034b62b>] __alloc_skb+0x4b/0x100
 [<c035cecc>] find_skb+0x3c/0x80
 [<c035db8b>] netpoll_send_udp+0x2b/0x280
 [<c02e9e3c>] write_msg+0x4c/0x80
 [<c011fec0>] __call_console_drivers+0x60/0x70
 [<c011ff1b>] _call_console_drivers+0x4b/0x90
 [<c0120274>] release_console_sem+0x1b4/0x240
 [<c0120733>] vprintk+0x1e3/0x350
 [<c01208bb>] printk+0x1b/0x20
 [<c02f4c4f>] ide_dma_verbose+0x11f/0x190
 [<c02f8173>] ide_disk_probe+0x5f3/0x6f0
 [<c02ea972>] generic_ide_probe+0x22/0x30
 [<c02d3a0d>] driver_probe_device+0x8d/0x190
 [<c02d3c9b>] __driver_attach+0xbb/0xc0
 [<c02d2d59>] bus_for_each_dev+0x49/0x70
 [<c02d3879>] driver_attach+0x19/0x20
 [<c02d315f>] bus_add_driver+0x7f/0x1b0
 [<c02d3e65>] driver_register+0x45/0x80
 [<c05165ed>] idedisk_init+0xd/0x10
 [<c04fc622>] kernel_init+0x132/0x300
 [<c0103ad7>] kernel_thread_helper+0x7/0x10


^ permalink raw reply	[flat|nested] 73+ messages in thread

end of thread, other threads:[~2007-06-28  6:10 UTC | newest]

Thread overview: 73+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-06-18  9:58 [patch 00/26] Current slab allocator / SLUB patch queue clameter
2007-06-18  9:58 ` [patch 01/26] SLUB Debug: Fix initial object debug state of NUMA bootstrap objects clameter
2007-06-18  9:58 ` [patch 02/26] Slab allocators: Consolidate code for krealloc in mm/util.c clameter
2007-06-18 20:03   ` Pekka Enberg
2007-06-18  9:58 ` [patch 03/26] Slab allocators: Consistent ZERO_SIZE_PTR support and NULL result semantics clameter
2007-06-18 20:08   ` Pekka Enberg
2007-06-18  9:58 ` [patch 04/26] Slab allocators: Support __GFP_ZERO in all allocators clameter
2007-06-18 10:09   ` Paul Mundt
2007-06-18 16:17     ` Christoph Lameter
2007-06-18 20:11   ` Pekka Enberg
2007-06-18  9:58 ` [patch 05/26] Slab allocators: Cleanup zeroing allocations clameter
2007-06-18 20:16   ` Pekka Enberg
2007-06-18 20:26     ` Pekka Enberg
2007-06-18 22:34       ` Christoph Lameter
2007-06-19  5:48         ` Pekka Enberg
2007-06-18 21:55     ` Christoph Lameter
2007-06-19 21:00   ` Matt Mackall
2007-06-19 22:33     ` Christoph Lameter
2007-06-20  6:14       ` Pekka J Enberg
2007-06-18  9:58 ` [patch 06/26] Slab allocators: Replace explicit zeroing with __GFP_ZERO clameter
2007-06-19 20:55   ` Pekka Enberg
2007-06-28  6:09   ` Andrew Morton
2007-06-18  9:58 ` [patch 07/26] SLUB: Add some more inlines and #ifdef CONFIG_SLUB_DEBUG clameter
2007-06-18  9:58 ` [patch 08/26] SLUB: Extract dma_kmalloc_cache from get_cache clameter
2007-06-18  9:58 ` [patch 09/26] SLUB: Do proper locking during dma slab creation clameter
2007-06-18  9:58 ` [patch 10/26] SLUB: Faster more efficient slab determination for __kmalloc clameter
2007-06-19 20:08   ` Andrew Morton
2007-06-19 22:22     ` Christoph Lameter
2007-06-19 22:29       ` Andrew Morton
2007-06-19 22:38         ` Christoph Lameter
2007-06-19 22:46           ` Andrew Morton
2007-06-25  6:41             ` Nick Piggin
2007-06-18  9:58 ` [patch 11/26] SLUB: Add support for kmem_cache_ops clameter
2007-06-19 20:58   ` Pekka Enberg
2007-06-19 22:32     ` Christoph Lameter
2007-06-18  9:58 ` [patch 12/26] SLUB: Slab defragmentation core clameter
2007-06-26  8:18   ` Andrew Morton
2007-06-26 18:19     ` Christoph Lameter
2007-06-26 18:38       ` Andrew Morton
2007-06-26 18:52         ` Christoph Lameter
2007-06-26 19:13   ` Nish Aravamudan
2007-06-26 19:19     ` Christoph Lameter
2007-06-18  9:58 ` [patch 13/26] SLUB: Extend slabinfo to support -D and -C options clameter
2007-06-18  9:58 ` [patch 14/26] SLUB: Logic to trigger slab defragmentation from memory reclaim clameter
2007-06-18  9:58 ` [patch 15/26] Slab defrag: Support generic defragmentation for inode slab caches clameter
2007-06-26  8:18   ` Andrew Morton
2007-06-26 18:21     ` Christoph Lameter
2007-06-26 19:28     ` Christoph Lameter
2007-06-26 19:37       ` Andrew Morton
2007-06-26 19:41         ` Christoph Lameter
2007-06-18  9:58 ` [patch 16/26] Slab defragmentation: Support defragmentation for extX filesystem inodes clameter
2007-06-18  9:58 ` [patch 17/26] Slab defragmentation: Support inode defragmentation for xfs clameter
2007-06-18  9:58 ` [patch 18/26] Slab defragmentation: Support procfs inode defragmentation clameter
2007-06-18  9:58 ` [patch 19/26] Slab defragmentation: Support reiserfs " clameter
2007-06-18  9:58 ` [patch 20/26] Slab defragmentation: Support inode defragmentation for sockets clameter
2007-06-18  9:58 ` [patch 21/26] Slab defragmentation: support dentry defragmentation clameter
2007-06-26  8:18   ` Andrew Morton
2007-06-26 18:23     ` Christoph Lameter
2007-06-18  9:59 ` [patch 22/26] SLUB: kmem_cache_vacate to support page allocator memory defragmentation clameter
2007-06-18  9:59 ` [patch 23/26] SLUB: Move sysfs operations outside of slub_lock clameter
2007-06-18  9:59 ` [patch 24/26] SLUB: Avoid page struct cacheline bouncing due to remote frees to cpu slab clameter
2007-06-18  9:59 ` [patch 25/26] SLUB: Add an object counter to the kmem_cache_cpu structure clameter
2007-06-18  9:59 ` [patch 26/26] SLUB: Place kmem_cache_cpu structures in a NUMA aware way clameter
2007-06-19 23:17   ` Christoph Lameter
2007-06-18 11:57 ` [patch 00/26] Current slab allocator / SLUB patch queue Michal Piotrowski
2007-06-18 16:46   ` Christoph Lameter
2007-06-18 17:38     ` Michal Piotrowski
2007-06-18 18:05       ` Christoph Lameter
2007-06-18 18:58         ` Michal Piotrowski
2007-06-18 19:00           ` Christoph Lameter
2007-06-18 19:09             ` Michal Piotrowski
2007-06-18 19:19               ` Christoph Lameter
2007-06-18 20:43                 ` Michal Piotrowski
