* [RFC V2 SLEB 00/14] The Enhanced(hopefully) Slab Allocator
@ 2010-05-21 21:14 Christoph Lameter
  2010-05-21 21:14 ` [RFC V2 SLEB 01/14] slab: Introduce a constant for an unspecified node Christoph Lameter
                   ` (15 more replies)
  0 siblings, 16 replies; 89+ messages in thread
From: Christoph Lameter @ 2010-05-21 21:14 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: linux-mm

(V2 some more work as time permitted this week)

SLEB is a merging of SLUB with some queuing concepts from SLAB and a new way
of managing objects in the slabs using bitmaps. It uses a percpu queue so that
free operations can be properly buffered and a bitmap for managing the
free/allocated state in the slabs. It is slightly less space efficient than
SLUB (due to the need to place large bitmaps, sized a few words, in slab
pages that hold more than BITS_PER_LONG objects) but in general it competes
well with SLUB (and therefore also with SLOB) in terms of memory wastage.
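
As a rough worked example of the space cost (based on the oo_make()
calculation in patch 07, and assuming 4KB pages and 64 bit longs):

	128 byte objects: 4096/128 = 32 <= BITS_PER_LONG, so the bitmap fits
	into the page->freelist word and no slab space is lost.

	32 byte objects: 4096/32 = 128 > BITS_PER_LONG, so the bitmap has to
	fit into the slab page itself:

		objects = ((4096/8) * 64) / ((32/8) * 64 + 1) = 32768/257 = 127

	which leaves 4096 - 127*32 = 32 bytes, of which BITS_TO_LONGS(127) = 2
	longs (16 bytes) hold the bitmap; that is one object less per page
	than SLUB would store for the same object size.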

It does not have the excessive memory requirements of SLAB because
there is no slab management structure and there are no alien caches. Under
NUMA the remote shared caches are used instead (which may have their own
issues).

The SLAB scheme of not touching the object during management is adopted.
SLEB can efficiently free and allocate cache cold objects without
causing cache misses.
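
To illustrate (condensed from the slab_free()/drain_objects() code in patch
07, with debugging, statistics and queue overflow handling left out): a free
only stores the pointer in the per cpu queue, and draining the queue later
marks the object free by setting a bit computed from the pointer value, so
the object memory itself is never read or written:

	/* fast path in slab_free(): queue the pointer, do not touch *x */
	c->object[c->objects++] = x;

	/* later, in drain_objects(): flip the object's bit in its slab */
	__set_bit((x - page_address(page)) / s->size, map(page));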

There are numerous SLAB schemes that are not supported. Those could be
added if needed and if they really make a difference.

WARNING: This has only been run successfully with hackbench in kvm instances
so far, but it works with NUMA, SMP and UP there.

V1->V2: Add NUMA capabilities. Refine queue size configuration (not complete).
   Tested in UP, SMP and NUMA.


* [RFC V2 SLEB 01/14] slab: Introduce a constant for an unspecified node.
  2010-05-21 21:14 [RFC V2 SLEB 00/14] The Enhanced(hopefully) Slab Allocator Christoph Lameter
@ 2010-05-21 21:14 ` Christoph Lameter
  2010-06-07 21:44   ` David Rientjes
  2010-05-21 21:14 ` [RFC V2 SLEB 02/14] SLUB: Constants need UL Christoph Lameter
                   ` (14 subsequent siblings)
  15 siblings, 1 reply; 89+ messages in thread
From: Christoph Lameter @ 2010-05-21 21:14 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: linux-mm

[-- Attachment #1: slab_node_unspecified --]
[-- Type: text/plain, Size: 2752 bytes --]

kmalloc_node() and friends can be passed the constant -1 to indicate
that no choice was made for the node from which the object needs to
come.

Add a named constant for this.
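
A caller with no node preference can then write, for example (illustrative
only; the patch itself only converts the internal users in mm/slub.c):

	/* equivalent to passing the bare -1 */
	p = kmalloc_node(size, GFP_KERNEL, SLAB_NODE_UNSPECIFIED);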

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 include/linux/slab.h |    2 ++
 mm/slub.c            |   10 +++++-----
 2 files changed, 7 insertions(+), 5 deletions(-)

Index: linux-2.6/include/linux/slab.h
===================================================================
--- linux-2.6.orig/include/linux/slab.h	2010-04-27 12:31:57.000000000 -0500
+++ linux-2.6/include/linux/slab.h	2010-04-27 12:32:26.000000000 -0500
@@ -92,6 +92,8 @@
 #define ZERO_OR_NULL_PTR(x) ((unsigned long)(x) <= \
 				(unsigned long)ZERO_SIZE_PTR)
 
+#define SLAB_NODE_UNSPECIFIED (-1L)
+
 /*
  * struct kmem_cache related prototypes
  */
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2010-04-27 12:32:30.000000000 -0500
+++ linux-2.6/mm/slub.c	2010-04-27 12:33:37.000000000 -0500
@@ -1081,7 +1081,7 @@ static inline struct page *alloc_slab_pa
 
 	flags |= __GFP_NOTRACK;
 
-	if (node == -1)
+	if (node == SLAB_NODE_UNSPECIFIED)
 		return alloc_pages(flags, order);
 	else
 		return alloc_pages_node(node, flags, order);
@@ -1731,7 +1731,7 @@ static __always_inline void *slab_alloc(
 
 void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags)
 {
-	void *ret = slab_alloc(s, gfpflags, -1, _RET_IP_);
+	void *ret = slab_alloc(s, gfpflags, SLAB_NODE_UNSPECIFIED, _RET_IP_);
 
 	trace_kmem_cache_alloc(_RET_IP_, ret, s->objsize, s->size, gfpflags);
 
@@ -1742,7 +1742,7 @@ EXPORT_SYMBOL(kmem_cache_alloc);
 #ifdef CONFIG_TRACING
 void *kmem_cache_alloc_notrace(struct kmem_cache *s, gfp_t gfpflags)
 {
-	return slab_alloc(s, gfpflags, -1, _RET_IP_);
+	return slab_alloc(s, gfpflags, SLAB_NODE_UNSPECIFIED, _RET_IP_);
 }
 EXPORT_SYMBOL(kmem_cache_alloc_notrace);
 #endif
@@ -2740,7 +2740,7 @@ void *__kmalloc(size_t size, gfp_t flags
 	if (unlikely(ZERO_OR_NULL_PTR(s)))
 		return s;
 
-	ret = slab_alloc(s, flags, -1, _RET_IP_);
+	ret = slab_alloc(s, flags, SLAB_NODE_UNSPECIFIED, _RET_IP_);
 
 	trace_kmalloc(_RET_IP_, ret, size, s->size, flags);
 
@@ -3324,7 +3324,7 @@ void *__kmalloc_track_caller(size_t size
 	if (unlikely(ZERO_OR_NULL_PTR(s)))
 		return s;
 
-	ret = slab_alloc(s, gfpflags, -1, caller);
+	ret = slab_alloc(s, gfpflags, SLAB_NODE_UNSPECIFIED, caller);
 
 	/* Honor the call site pointer we recieved. */
 	trace_kmalloc(caller, ret, size, s->size, gfpflags);


* [RFC V2 SLEB 02/14] SLUB: Constants need UL
  2010-05-21 21:14 [RFC V2 SLEB 00/14] The Enhanced(hopefully) Slab Allocator Christoph Lameter
  2010-05-21 21:14 ` [RFC V2 SLEB 01/14] slab: Introduce a constant for an unspecified node Christoph Lameter
@ 2010-05-21 21:14 ` Christoph Lameter
  2010-05-21 21:14 ` [RFC V2 SLEB 03/14] SLUB: Use kmem_cache flags to detect if Slab is in debugging mode Christoph Lameter
                   ` (13 subsequent siblings)
  15 siblings, 0 replies; 89+ messages in thread
From: Christoph Lameter @ 2010-05-21 21:14 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: linux-mm

[-- Attachment #1: slub_constant_ul --]
[-- Type: text/plain, Size: 1095 bytes --]

The UL suffix is missing from some internal flag constants. Add it to
conform to how slab.h defines its constants.

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 mm/slub.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2010-04-27 12:39:36.000000000 -0500
+++ linux-2.6/mm/slub.c	2010-04-27 12:41:05.000000000 -0500
@@ -170,8 +170,8 @@
 #define MAX_OBJS_PER_PAGE	65535 /* since page.objects is u16 */
 
 /* Internal SLUB flags */
-#define __OBJECT_POISON		0x80000000 /* Poison object */
-#define __SYSFS_ADD_DEFERRED	0x40000000 /* Not yet visible via sysfs */
+#define __OBJECT_POISON		0x80000000UL /* Poison object */
+#define __SYSFS_ADD_DEFERRED	0x40000000UL /* Not yet visible via sysfs */
 
 static int kmem_size = sizeof(struct kmem_cache);
 


* [RFC V2 SLEB 03/14] SLUB: Use kmem_cache flags to detect if Slab is in debugging mode.
  2010-05-21 21:14 [RFC V2 SLEB 00/14] The Enhanced(hopefully) Slab Allocator Christoph Lameter
  2010-05-21 21:14 ` [RFC V2 SLEB 01/14] slab: Introduce a constant for an unspecified node Christoph Lameter
  2010-05-21 21:14 ` [RFC V2 SLEB 02/14] SLUB: Constants need UL Christoph Lameter
@ 2010-05-21 21:14 ` Christoph Lameter
  2010-06-08  3:57   ` David Rientjes
  2010-05-21 21:14 ` [RFC V2 SLEB 04/14] SLUB: discard_slab_unlock Christoph Lameter
                   ` (12 subsequent siblings)
  15 siblings, 1 reply; 89+ messages in thread
From: Christoph Lameter @ 2010-05-21 21:14 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: linux-mm

[-- Attachment #1: slub_debug_on --]
[-- Type: text/plain, Size: 3934 bytes --]

The cacheline with the flags is reachable from the hot paths after the
percpu allocator changes went in. So there is no longer any need to put a
flag into each slab page. Get rid of the SlubDebug page flag and use the
flags in struct kmem_cache instead.

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 include/linux/page-flags.h |    1 -
 mm/slub.c                  |   33 ++++++++++++---------------------
 2 files changed, 12 insertions(+), 22 deletions(-)

Index: linux-2.6/include/linux/page-flags.h
===================================================================
--- linux-2.6.orig/include/linux/page-flags.h	2010-04-27 12:47:10.000000000 -0500
+++ linux-2.6/include/linux/page-flags.h	2010-04-27 12:47:21.000000000 -0500
@@ -215,7 +215,6 @@ PAGEFLAG(SwapBacked, swapbacked) __CLEAR
 __PAGEFLAG(SlobFree, slob_free)
 
 __PAGEFLAG(SlubFrozen, slub_frozen)
-__PAGEFLAG(SlubDebug, slub_debug)
 
 /*
  * Private page markings that may be used by the filesystem that owns the page
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2010-04-27 12:41:05.000000000 -0500
+++ linux-2.6/mm/slub.c	2010-04-27 13:15:32.000000000 -0500
@@ -107,11 +107,17 @@
  * 			the fast path and disables lockless freelists.
  */
 
+#define SLAB_DEBUG_FLAGS (SLAB_RED_ZONE | SLAB_POISON | SLAB_STORE_USER | \
+		SLAB_TRACE | SLAB_DEBUG_FREE)
+
+static inline int debug_on(struct kmem_cache *s)
+{
 #ifdef CONFIG_SLUB_DEBUG
-#define SLABDEBUG 1
+	return unlikely(s->flags & SLAB_DEBUG_FLAGS);
 #else
-#define SLABDEBUG 0
+	return 0;
 #endif
+}
 
 /*
  * Issues still to be resolved:
@@ -1165,9 +1171,6 @@ static struct page *new_slab(struct kmem
 	inc_slabs_node(s, page_to_nid(page), page->objects);
 	page->slab = s;
 	page->flags |= 1 << PG_slab;
-	if (s->flags & (SLAB_DEBUG_FREE | SLAB_RED_ZONE | SLAB_POISON |
-			SLAB_STORE_USER | SLAB_TRACE))
-		__SetPageSlubDebug(page);
 
 	start = page_address(page);
 
@@ -1194,14 +1197,13 @@ static void __free_slab(struct kmem_cach
 	int order = compound_order(page);
 	int pages = 1 << order;
 
-	if (unlikely(SLABDEBUG && PageSlubDebug(page))) {
+	if (debug_on(s)) {
 		void *p;
 
 		slab_pad_check(s, page);
 		for_each_object(p, s, page_address(page),
 						page->objects)
 			check_object(s, page, p, 0);
-		__ClearPageSlubDebug(page);
 	}
 
 	kmemcheck_free_shadow(page, compound_order(page));
@@ -1419,8 +1421,7 @@ static void unfreeze_slab(struct kmem_ca
 			stat(s, tail ? DEACTIVATE_TO_TAIL : DEACTIVATE_TO_HEAD);
 		} else {
 			stat(s, DEACTIVATE_FULL);
-			if (SLABDEBUG && PageSlubDebug(page) &&
-						(s->flags & SLAB_STORE_USER))
+			if (debug_on(s) && (s->flags & SLAB_STORE_USER))
 				add_full(n, page);
 		}
 		slab_unlock(page);
@@ -1628,7 +1629,7 @@ load_freelist:
 	object = c->page->freelist;
 	if (unlikely(!object))
 		goto another_slab;
-	if (unlikely(SLABDEBUG && PageSlubDebug(c->page)))
+	if (debug_on(s))
 		goto debug;
 
 	c->freelist = get_freepointer(s, object);
@@ -1787,7 +1788,7 @@ static void __slab_free(struct kmem_cach
 	stat(s, FREE_SLOWPATH);
 	slab_lock(page);
 
-	if (unlikely(SLABDEBUG && PageSlubDebug(page)))
+	if (debug_on(s))
 		goto debug;
 
 checks_ok:
@@ -3400,16 +3401,6 @@ static void validate_slab_slab(struct km
 	} else
 		printk(KERN_INFO "SLUB %s: Skipped busy slab 0x%p\n",
 			s->name, page);
-
-	if (s->flags & DEBUG_DEFAULT_FLAGS) {
-		if (!PageSlubDebug(page))
-			printk(KERN_ERR "SLUB %s: SlubDebug not set "
-				"on slab 0x%p\n", s->name, page);
-	} else {
-		if (PageSlubDebug(page))
-			printk(KERN_ERR "SLUB %s: SlubDebug set on "
-				"slab 0x%p\n", s->name, page);
-	}
 }
 
 static int validate_slab_node(struct kmem_cache *s,


* [RFC V2 SLEB 04/14] SLUB: discard_slab_unlock
  2010-05-21 21:14 [RFC V2 SLEB 00/14] The Enhanced(hopefully) Slab Allocator Christoph Lameter
                   ` (2 preceding siblings ...)
  2010-05-21 21:14 ` [RFC V2 SLEB 03/14] SLUB: Use kmem_cache flags to detect if Slab is in debugging mode Christoph Lameter
@ 2010-05-21 21:14 ` Christoph Lameter
  2010-05-21 21:14 ` [RFC V2 SLEB 05/14] SLUB: is_kmalloc_cache Christoph Lameter
                   ` (11 subsequent siblings)
  15 siblings, 0 replies; 89+ messages in thread
From: Christoph Lameter @ 2010-05-21 21:14 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: linux-mm

[-- Attachment #1: slub_discard_unlock --]
[-- Type: text/plain, Size: 1719 bytes --]

The sequence of unlocking a slab and then freeing it occurs multiple times.
Put the common code into a single function.

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 mm/slub.c |   16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2010-05-20 17:16:27.000000000 -0500
+++ linux-2.6/mm/slub.c	2010-05-20 17:16:29.000000000 -0500
@@ -1268,6 +1268,13 @@ static __always_inline int slab_trylock(
 	return rc;
 }
 
+static void discard_slab_unlock(struct kmem_cache *s,
+	struct page *page)
+{
+	slab_unlock(page);
+	discard_slab(s, page);
+}
+
 /*
  * Management of partially allocated slabs
  */
@@ -1441,9 +1448,8 @@ static void unfreeze_slab(struct kmem_ca
 			add_partial(n, page, 1);
 			slab_unlock(page);
 		} else {
-			slab_unlock(page);
 			stat(s, FREE_SLAB);
-			discard_slab(s, page);
+			discard_slab_unlock(s, page);
 		}
 	}
 }
@@ -1826,9 +1832,8 @@ slab_empty:
 		remove_partial(s, page);
 		stat(s, FREE_REMOVE_PARTIAL);
 	}
-	slab_unlock(page);
 	stat(s, FREE_SLAB);
-	discard_slab(s, page);
+	discard_slab_unlock(s, page);
 	return;
 
 debug:
@@ -2905,8 +2910,7 @@ int kmem_cache_shrink(struct kmem_cache 
 				 */
 				list_del(&page->lru);
 				n->nr_partial--;
-				slab_unlock(page);
-				discard_slab(s, page);
+				discard_slab_unlock(s, page);
 			} else {
 				list_move(&page->lru,
 				slabs_by_inuse + page->inuse);


* [RFC V2 SLEB 05/14] SLUB: is_kmalloc_cache
  2010-05-21 21:14 [RFC V2 SLEB 00/14] The Enhanced(hopefully) Slab Allocator Christoph Lameter
                   ` (3 preceding siblings ...)
  2010-05-21 21:14 ` [RFC V2 SLEB 04/14] SLUB: discard_slab_unlock Christoph Lameter
@ 2010-05-21 21:14 ` Christoph Lameter
  2010-06-08  8:54   ` David Rientjes
  2010-05-21 21:14 ` [RFC V2 SLEB 06/14] SLUB: Get rid of the kmalloc_node slab Christoph Lameter
                   ` (10 subsequent siblings)
  15 siblings, 1 reply; 89+ messages in thread
From: Christoph Lameter @ 2010-05-21 21:14 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: linux-mm

[-- Attachment #1: slub_is_kmalloc_cache --]
[-- Type: text/plain, Size: 1614 bytes --]

The determination of whether a kmem_cache is one of the kmalloc caches
occurs in multiple places. Factor the check out into an is_kmalloc_cache()
helper.

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>


---
 mm/slub.c |   10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2010-05-12 14:46:58.000000000 -0500
+++ linux-2.6/mm/slub.c	2010-05-12 14:49:37.000000000 -0500
@@ -312,6 +312,11 @@ static inline int oo_objects(struct kmem
 	return x.x & OO_MASK;
 }
 
+static int is_kmalloc_cache(struct kmem_cache *s)
+{
+	return (s < kmalloc_caches + KMALLOC_CACHES && s >= kmalloc_caches);
+}
+
 #ifdef CONFIG_SLUB_DEBUG
 /*
  * Debug settings:
@@ -2076,7 +2081,7 @@ static DEFINE_PER_CPU(struct kmem_cache_
 
 static inline int alloc_kmem_cache_cpus(struct kmem_cache *s, gfp_t flags)
 {
-	if (s < kmalloc_caches + KMALLOC_CACHES && s >= kmalloc_caches)
+	if (is_kmalloc_cache(s))
 		/*
 		 * Boot time creation of the kmalloc array. Use static per cpu data
 		 * since the per cpu allocator is not available yet.
@@ -2158,8 +2163,7 @@ static int init_kmem_cache_nodes(struct 
 	int node;
 	int local_node;
 
-	if (slab_state >= UP && (s < kmalloc_caches ||
-			s >= kmalloc_caches + KMALLOC_CACHES))
+	if (slab_state >= UP && !is_kmalloc_cache(s))
 		local_node = page_to_nid(virt_to_page(s));
 	else
 		local_node = 0;


* [RFC V2 SLEB 06/14] SLUB: Get rid of the kmalloc_node slab
  2010-05-21 21:14 [RFC V2 SLEB 00/14] The Enhanced(hopefully) Slab Allocator Christoph Lameter
                   ` (4 preceding siblings ...)
  2010-05-21 21:14 ` [RFC V2 SLEB 05/14] SLUB: is_kmalloc_cache Christoph Lameter
@ 2010-05-21 21:14 ` Christoph Lameter
  2010-06-09  6:14   ` David Rientjes
  2010-05-21 21:14 ` [RFC V2 SLEB 07/14] SLEB: The Enhanced Slab Allocator Christoph Lameter
                   ` (9 subsequent siblings)
  15 siblings, 1 reply; 89+ messages in thread
From: Christoph Lameter @ 2010-05-21 21:14 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: linux-mm

[-- Attachment #1: sled_remove_kmalloc_cache_node --]
[-- Type: text/plain, Size: 4034 bytes --]

Currently bootstrap works with a dedicated kmalloc_node slab. We can avoid
creating that slab and boot using allocations from a suitably sized slab in
the kmalloc array instead. This is necessary if we later want to size
kmem_cache structures dynamically.
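
In effect (a condensed view of the hunks below), kmem_cache_node structures
are now obtained through the generic kmalloc interface, and the bootstrap
cache is simply the kmalloc slot that fits them:

	int i = kmalloc_index(sizeof(struct kmem_cache_node));

	create_kmalloc_cache(&kmalloc_caches[i], "bootstrap",
		sizeof(struct kmem_cache_node), GFP_NOWAIT);

	/* node structures are then allocated and freed generically */
	n = kmalloc_node(sizeof(struct kmem_cache_node), gfpflags, node);
	...
	kfree(n);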

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 mm/slub.c |   39 ++++++++++++++++++++++++---------------
 1 file changed, 24 insertions(+), 15 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2010-05-20 14:26:53.000000000 -0500
+++ linux-2.6/mm/slub.c	2010-05-20 14:37:19.000000000 -0500
@@ -2111,10 +2111,11 @@ static void early_kmem_cache_node_alloc(
 	struct page *page;
 	struct kmem_cache_node *n;
 	unsigned long flags;
+	int i = kmalloc_index(sizeof(struct kmem_cache_node));
 
-	BUG_ON(kmalloc_caches->size < sizeof(struct kmem_cache_node));
+	BUG_ON(kmalloc_caches[i].size < sizeof(struct kmem_cache_node));
 
-	page = new_slab(kmalloc_caches, gfpflags, node);
+	page = new_slab(kmalloc_caches + i, gfpflags, node);
 
 	BUG_ON(!page);
 	if (page_to_nid(page) != node) {
@@ -2126,15 +2127,15 @@ static void early_kmem_cache_node_alloc(
 
 	n = page->freelist;
 	BUG_ON(!n);
-	page->freelist = get_freepointer(kmalloc_caches, n);
+	page->freelist = get_freepointer(kmalloc_caches + i, n);
 	page->inuse++;
-	kmalloc_caches->node[node] = n;
+	kmalloc_caches[i].node[node] = n;
 #ifdef CONFIG_SLUB_DEBUG
-	init_object(kmalloc_caches, n, 1);
-	init_tracking(kmalloc_caches, n);
+	init_object(kmalloc_caches + i, n, 1);
+	init_tracking(kmalloc_caches + i, n);
 #endif
-	init_kmem_cache_node(n, kmalloc_caches);
-	inc_slabs_node(kmalloc_caches, node, page->objects);
+	init_kmem_cache_node(n, kmalloc_caches + i);
+	inc_slabs_node(kmalloc_caches + i, node, page->objects);
 
 	/*
 	 * lockdep requires consistent irq usage for each lock
@@ -2152,8 +2153,9 @@ static void free_kmem_cache_nodes(struct
 
 	for_each_node_state(node, N_NORMAL_MEMORY) {
 		struct kmem_cache_node *n = s->node[node];
+
 		if (n && n != &s->local_node)
-			kmem_cache_free(kmalloc_caches, n);
+			kfree(n);
 		s->node[node] = NULL;
 	}
 }
@@ -2178,8 +2180,8 @@ static int init_kmem_cache_nodes(struct 
 				early_kmem_cache_node_alloc(gfpflags, node);
 				continue;
 			}
-			n = kmem_cache_alloc_node(kmalloc_caches,
-							gfpflags, node);
+			n = kmalloc_node(sizeof(struct kmem_cache_node), gfpflags,
+				node);
 
 			if (!n) {
 				free_kmem_cache_nodes(s);
@@ -2574,6 +2576,12 @@ static struct kmem_cache *create_kmalloc
 {
 	unsigned int flags = 0;
 
+	if (s->size) {
+		s->name = name;
+		/* Already created */
+		return s;
+	}
+
 	if (gfp_flags & SLUB_DMA)
 		flags = SLAB_CACHE_DMA;
 
@@ -2978,7 +2986,7 @@ static void slab_mem_offline_callback(vo
 			BUG_ON(slabs_node(s, offline_node));
 
 			s->node[offline_node] = NULL;
-			kmem_cache_free(kmalloc_caches, n);
+			kfree(n);
 		}
 	}
 	up_read(&slub_lock);
@@ -3011,7 +3019,7 @@ static int slab_mem_going_online_callbac
 		 *      since memory is not yet available from the node that
 		 *      is brought up.
 		 */
-		n = kmem_cache_alloc(kmalloc_caches, GFP_KERNEL);
+		n = kmalloc(sizeof(struct kmem_cache_node), GFP_KERNEL);
 		if (!n) {
 			ret = -ENOMEM;
 			goto out;
@@ -3068,9 +3076,10 @@ void __init kmem_cache_init(void)
 	 * struct kmem_cache_node's. There is special bootstrap code in
 	 * kmem_cache_open for slab_state == DOWN.
 	 */
-	create_kmalloc_cache(&kmalloc_caches[0], "kmem_cache_node",
+	i = kmalloc_index(sizeof(struct kmem_cache_node));
+	create_kmalloc_cache(&kmalloc_caches[i], "bootstrap",
 		sizeof(struct kmem_cache_node), GFP_NOWAIT);
-	kmalloc_caches[0].refcount = -1;
+	kmalloc_caches[i].refcount = -1;
 	caches++;
 
 	hotplug_memory_notifier(slab_memory_callback, SLAB_CALLBACK_PRI);


* [RFC V2 SLEB 07/14] SLEB: The Enhanced Slab Allocator
  2010-05-21 21:14 [RFC V2 SLEB 00/14] The Enhanced(hopefully) Slab Allocator Christoph Lameter
                   ` (5 preceding siblings ...)
  2010-05-21 21:14 ` [RFC V2 SLEB 06/14] SLUB: Get rid of the kmalloc_node slab Christoph Lameter
@ 2010-05-21 21:14 ` Christoph Lameter
  2010-05-21 21:15 ` [RFC V2 SLEB 08/14] SLEB: Resize cpu queue Christoph Lameter
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 89+ messages in thread
From: Christoph Lameter @ 2010-05-21 21:14 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: linux-mm

[-- Attachment #1: sled_core --]
[-- Type: text/plain, Size: 43230 bytes --]

SLEB is a merging of SLUB with some queuing concepts from SLAB and a new way
of managing objects in the slabs using bitmaps. It uses a percpu queue so that
free operations can be properly buffered and a bitmap for managing the
free/allocated state in the slabs. It is slightly less space efficient than
SLUB (due to the need to place large bitmaps, sized a few words, in some
slab pages) but in general it competes well with SLUB's space use.
The storage format avoids the slab management structure that SLAB needs for
each slab page, so the metadata is more compact and easily fits
into a cacheline.
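
For orientation (a condensed view of the helpers added below): the only
per-slab state is the object count plus a free bitmap, which lives in the
page struct for small slabs and at the end of the slab page otherwise:

	static inline unsigned long *map(struct page *page)
	{
		if (page->objects <= BITS_PER_LONG)
			return (unsigned long *)&page->freelist;
		else
			return page->freelist;	/* placed after the objects */
	}

	/* an object's free/allocated state is one bit, indexed by offset */
	test_bit((p - page_address(page)) / s->size, map(page));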

The SLAB scheme of not touching the object during management is adopted.
SLEB can efficiently free and allocate cache cold objects.

There are numerous SLAB schemes that are not supported. Those could be
added if needed and if they really make a difference.

WARNING: This has only run successfully in a kvm instance so far.

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 include/linux/slub_def.h |   11 
 mm/slub.c                |  912 +++++++++++++++++++++--------------------------
 2 files changed, 415 insertions(+), 508 deletions(-)

Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h	2010-05-20 16:59:09.000000000 -0500
+++ linux-2.6/include/linux/slub_def.h	2010-05-20 17:22:20.000000000 -0500
@@ -34,13 +34,16 @@ enum stat_item {
 	ORDER_FALLBACK,		/* Number of times fallback was necessary */
 	NR_SLUB_STAT_ITEMS };
 
+#define BOOT_QUEUE_SIZE 50
+#define BOOT_BATCH_SIZE 25
+
 struct kmem_cache_cpu {
-	void **freelist;	/* Pointer to first free per cpu object */
-	struct page *page;	/* The slab from which we are allocating */
-	int node;		/* The node of the page (or -1 for debug) */
 #ifdef CONFIG_SLUB_STATS
 	unsigned stat[NR_SLUB_STAT_ITEMS];
 #endif
+	int objects;		/* Number of objects available */
+	int node;		/* The node of the page (or -1 for debug) */
+	void *object[BOOT_QUEUE_SIZE];		/* List of objects */
 };
 
 struct kmem_cache_node {
@@ -72,9 +75,7 @@ struct kmem_cache {
 	unsigned long flags;
 	int size;		/* The size of an object including meta data */
 	int objsize;		/* The size of an object without meta data */
-	int offset;		/* Free pointer offset. */
 	struct kmem_cache_order_objects oo;
-
 	/*
 	 * Avoid an extra cache line for UP, SMP and for the node local to
 	 * struct kmem_cache.
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2010-05-20 17:16:35.000000000 -0500
+++ linux-2.6/mm/slub.c	2010-05-20 17:22:20.000000000 -0500
@@ -84,27 +84,6 @@
  * minimal so we rely on the page allocators per cpu caches for
  * fast frees and allocs.
  *
- * Overloading of page flags that are otherwise used for LRU management.
- *
- * PageActive 		The slab is frozen and exempt from list processing.
- * 			This means that the slab is dedicated to a purpose
- * 			such as satisfying allocations for a specific
- * 			processor. Objects may be freed in the slab while
- * 			it is frozen but slab_free will then skip the usual
- * 			list operations. It is up to the processor holding
- * 			the slab to integrate the slab into the slab lists
- * 			when the slab is no longer needed.
- *
- * 			One use of this flag is to mark slabs that are
- * 			used for allocations. Then such a slab becomes a cpu
- * 			slab. The cpu slab may be equipped with an additional
- * 			freelist that allows lockless access to
- * 			free objects in addition to the regular freelist
- * 			that requires the slab lock.
- *
- * PageError		Slab requires special handling due to debug
- * 			options set. This moves	slab handling out of
- * 			the fast path and disables lockless freelists.
  */
 
 #define SLAB_DEBUG_FLAGS (SLAB_RED_ZONE | SLAB_POISON | SLAB_STORE_USER | \
@@ -267,38 +246,71 @@ static inline int check_valid_pointer(st
 	return 1;
 }
 
-static inline void *get_freepointer(struct kmem_cache *s, void *object)
-{
-	return *(void **)(object + s->offset);
-}
-
-static inline void set_freepointer(struct kmem_cache *s, void *object, void *fp)
-{
-	*(void **)(object + s->offset) = fp;
-}
-
 /* Loop over all objects in a slab */
 #define for_each_object(__p, __s, __addr, __objects) \
 	for (__p = (__addr); __p < (__addr) + (__objects) * (__s)->size;\
 			__p += (__s)->size)
 
-/* Scan freelist */
-#define for_each_free_object(__p, __s, __free) \
-	for (__p = (__free); __p; __p = get_freepointer((__s), __p))
-
 /* Determine object index from a given position */
 static inline int slab_index(void *p, struct kmem_cache *s, void *addr)
 {
 	return (p - addr) / s->size;
 }
 
+static inline int map_in_page_struct(struct page *page)
+{
+	return page->objects <= BITS_PER_LONG;
+}
+
+static inline unsigned long *map(struct page *page)
+{
+	if (map_in_page_struct(page))
+		return (unsigned long *)&page->freelist;
+	else
+		return page->freelist;
+}
+
+static inline int map_size(struct page *page)
+{
+	return BITS_TO_LONGS(page->objects);
+}
+
+static inline int available(struct page *page)
+{
+	return bitmap_weight(map(page), page->objects);
+}
+
+static inline int all_objects_available(struct page *page)
+{
+	return bitmap_full(map(page), page->objects);
+}
+
+static inline int all_objects_used(struct page *page)
+{
+	return bitmap_empty(map(page), page->objects);
+}
+
+static inline int inuse(struct page *page)
+{
+	return page->objects - available(page);
+}
+
 static inline struct kmem_cache_order_objects oo_make(int order,
 						unsigned long size)
 {
-	struct kmem_cache_order_objects x = {
-		(order << OO_SHIFT) + (PAGE_SIZE << order) / size
-	};
+	struct kmem_cache_order_objects x;
+	unsigned long objects;
+	unsigned long page_size = PAGE_SIZE << order;
+	unsigned long ws = sizeof(unsigned long);
+
+	objects = page_size / size;
+
+	if (objects > BITS_PER_LONG)
+		/* Bitmap must fit into the slab as well */
+		objects = ((page_size / ws) * BITS_PER_LONG) /
+			((size / ws) * BITS_PER_LONG + 1);
 
+	x.x = (order << OO_SHIFT) + objects;
 	return x;
 }
 
@@ -370,10 +382,7 @@ static struct track *get_track(struct km
 {
 	struct track *p;
 
-	if (s->offset)
-		p = object + s->offset + sizeof(void *);
-	else
-		p = object + s->inuse;
+	p = object + s->inuse;
 
 	return p + alloc;
 }
@@ -421,8 +430,8 @@ static void print_tracking(struct kmem_c
 
 static void print_page_info(struct page *page)
 {
-	printk(KERN_ERR "INFO: Slab 0x%p objects=%u used=%u fp=0x%p flags=0x%04lx\n",
-		page, page->objects, page->inuse, page->freelist, page->flags);
+	printk(KERN_ERR "INFO: Slab 0x%p objects=%u new=%u fp=0x%p flags=0x%04lx\n",
+		page, page->objects, available(page), page->freelist, page->flags);
 
 }
 
@@ -461,8 +470,8 @@ static void print_trailer(struct kmem_ca
 
 	print_page_info(page);
 
-	printk(KERN_ERR "INFO: Object 0x%p @offset=%tu fp=0x%p\n\n",
-			p, p - addr, get_freepointer(s, p));
+	printk(KERN_ERR "INFO: Object 0x%p @offset=%tu\n\n",
+			p, p - addr);
 
 	if (p > addr + 16)
 		print_section("Bytes b4", p - 16, 16);
@@ -473,10 +482,7 @@ static void print_trailer(struct kmem_ca
 		print_section("Redzone", p + s->objsize,
 			s->inuse - s->objsize);
 
-	if (s->offset)
-		off = s->offset + sizeof(void *);
-	else
-		off = s->inuse;
+	off = s->inuse;
 
 	if (s->flags & SLAB_STORE_USER)
 		off += 2 * sizeof(struct track);
@@ -570,8 +576,6 @@ static int check_bytes_and_report(struct
  *
  * object address
  * 	Bytes of the object to be managed.
- * 	If the freepointer may overlay the object then the free
- * 	pointer is the first word of the object.
  *
  * 	Poisoning uses 0x6b (POISON_FREE) and the last byte is
  * 	0xa5 (POISON_END)
@@ -587,9 +591,8 @@ static int check_bytes_and_report(struct
  * object + s->inuse
  * 	Meta data starts here.
  *
- * 	A. Free pointer (if we cannot overwrite object on free)
- * 	B. Tracking data for SLAB_STORE_USER
- * 	C. Padding to reach required alignment boundary or at mininum
+ * 	A. Tracking data for SLAB_STORE_USER
+ * 	B. Padding to reach required alignment boundary or at mininum
  * 		one word if debugging is on to be able to detect writes
  * 		before the word boundary.
  *
@@ -607,10 +610,6 @@ static int check_pad_bytes(struct kmem_c
 {
 	unsigned long off = s->inuse;	/* The end of info */
 
-	if (s->offset)
-		/* Freepointer is placed after the object. */
-		off += sizeof(void *);
-
 	if (s->flags & SLAB_STORE_USER)
 		/* We also have user information there */
 		off += 2 * sizeof(struct track);
@@ -635,15 +634,42 @@ static int slab_pad_check(struct kmem_ca
 		return 1;
 
 	start = page_address(page);
-	length = (PAGE_SIZE << compound_order(page));
-	end = start + length;
-	remainder = length % s->size;
+	end = start + (PAGE_SIZE << compound_order(page));
+
+	/* Check for special case of bitmap at the end of the page */
+	if (!map_in_page_struct(page)) {
+		if ((u8 *)page->freelist > start && (u8 *)page->freelist < end)
+			end = page->freelist;
+		else
+			slab_err(s, page, "pagemap pointer invalid =%p start=%p end=%p objects=%d",
+				page->freelist, start, end, page->objects);
+	}
+
+	length = end - start;
+	remainder = length - page->objects * s->size;
 	if (!remainder)
 		return 1;
 
 	fault = check_bytes(end - remainder, POISON_INUSE, remainder);
-	if (!fault)
-		return 1;
+	if (!fault) {
+		u8 *freelist_end;
+
+		if (map_in_page_struct(page))
+			return 1;
+
+		end = start + (PAGE_SIZE << compound_order(page));
+		freelist_end = page->freelist + map_size(page);
+		remainder = end - freelist_end;
+
+		if (!remainder)
+			return 1;
+
+		fault = check_bytes(freelist_end, POISON_INUSE,
+				remainder);
+		if (!fault)
+			return 1;
+	}
+
 	while (end > fault && end[-1] == POISON_INUSE)
 		end--;
 
@@ -686,25 +712,6 @@ static int check_object(struct kmem_cach
 		 */
 		check_pad_bytes(s, page, p);
 	}
-
-	if (!s->offset && active)
-		/*
-		 * Object and freepointer overlap. Cannot check
-		 * freepointer while object is allocated.
-		 */
-		return 1;
-
-	/* Check free pointer validity */
-	if (!check_valid_pointer(s, page, get_freepointer(s, p))) {
-		object_err(s, page, p, "Freepointer corrupt");
-		/*
-		 * No choice but to zap it and thus lose the remainder
-		 * of the free objects in this slab. May cause
-		 * another error because the object count is now wrong.
-		 */
-		set_freepointer(s, p, NULL);
-		return 0;
-	}
 	return 1;
 }
 
@@ -725,51 +732,45 @@ static int check_slab(struct kmem_cache 
 			s->name, page->objects, maxobj);
 		return 0;
 	}
-	if (page->inuse > page->objects) {
-		slab_err(s, page, "inuse %u > max %u",
-			s->name, page->inuse, page->objects);
-		return 0;
-	}
+
 	/* Slab_pad_check fixes things up after itself */
 	slab_pad_check(s, page);
 	return 1;
 }
 
 /*
- * Determine if a certain object on a page is on the freelist. Must hold the
- * slab lock to guarantee that the chains are in a consistent state.
+ * Determine if a certain object on a page is on the free map.
  */
-static int on_freelist(struct kmem_cache *s, struct page *page, void *search)
+static int object_marked_free(struct kmem_cache *s, struct page *page, void *search)
+{
+	return test_bit(slab_index(search, s, page_address(page)), map(page));
+}
+
+/* Verify the integrity of the metadata in a slab page */
+static int verify_slab(struct kmem_cache *s, struct page *page)
 {
 	int nr = 0;
-	void *fp = page->freelist;
-	void *object = NULL;
 	unsigned long max_objects;
+	void *start = page_address(page);
+	unsigned long size = PAGE_SIZE << compound_order(page);
 
-	while (fp && nr <= page->objects) {
-		if (fp == search)
-			return 1;
-		if (!check_valid_pointer(s, page, fp)) {
-			if (object) {
-				object_err(s, page, object,
-					"Freechain corrupt");
-				set_freepointer(s, object, NULL);
-				break;
-			} else {
-				slab_err(s, page, "Freepointer corrupt");
-				page->freelist = NULL;
-				page->inuse = page->objects;
-				slab_fix(s, "Freelist cleared");
-				return 0;
-			}
-			break;
-		}
-		object = fp;
-		fp = get_freepointer(s, object);
-		nr++;
+	nr = available(page);
+
+	if (map_in_page_struct(page))
+		max_objects = size / s->size;
+	else {
+		if (page->freelist <= start || page->freelist >= start + size) {
+			slab_err(s, page, "Invalid pointer to bitmap of free objects max_objects=%d!",
+				page->objects);
+			/* Switch to bitmap in page struct */
+			page->objects = max_objects = BITS_PER_LONG;
+			page->freelist = 0L;
+			slab_fix(s, "Slab sized for %d objects. All objects marked in use.",
+				BITS_PER_LONG);
+		} else
+			max_objects = ((void *)page->freelist - start) / s->size;
 	}
 
-	max_objects = (PAGE_SIZE << compound_order(page)) / s->size;
 	if (max_objects > MAX_OBJS_PER_PAGE)
 		max_objects = MAX_OBJS_PER_PAGE;
 
@@ -778,24 +779,19 @@ static int on_freelist(struct kmem_cache
 			"should be %d", page->objects, max_objects);
 		page->objects = max_objects;
 		slab_fix(s, "Number of objects adjusted.");
+		return 0;
 	}
-	if (page->inuse != page->objects - nr) {
-		slab_err(s, page, "Wrong object count. Counter is %d but "
-			"counted were %d", page->inuse, page->objects - nr);
-		page->inuse = page->objects - nr;
-		slab_fix(s, "Object count adjusted.");
-	}
-	return search == NULL;
+	return 1;
 }
 
 static void trace(struct kmem_cache *s, struct page *page, void *object,
 								int alloc)
 {
 	if (s->flags & SLAB_TRACE) {
-		printk(KERN_INFO "TRACE %s %s 0x%p inuse=%d fp=0x%p\n",
+		printk(KERN_INFO "TRACE %s %s 0x%p free=%d fp=0x%p\n",
 			s->name,
 			alloc ? "alloc" : "free",
-			object, page->inuse,
+			object, available(page),
 			page->freelist);
 
 		if (!alloc)
@@ -808,14 +804,19 @@ static void trace(struct kmem_cache *s, 
 /*
  * Tracking of fully allocated slabs for debugging purposes.
  */
-static void add_full(struct kmem_cache_node *n, struct page *page)
+static inline void add_full(struct kmem_cache *s,
+		struct kmem_cache_node *n, struct page *page)
 {
+
+	if (!(s->flags & SLAB_STORE_USER))
+		return;
+
 	spin_lock(&n->list_lock);
 	list_add(&page->lru, &n->full);
 	spin_unlock(&n->list_lock);
 }
 
-static void remove_full(struct kmem_cache *s, struct page *page)
+static inline void remove_full(struct kmem_cache *s, struct page *page)
 {
 	struct kmem_cache_node *n;
 
@@ -876,25 +877,30 @@ static void setup_object_debug(struct km
 	init_tracking(s, object);
 }
 
-static int alloc_debug_processing(struct kmem_cache *s, struct page *page,
+static int alloc_debug_processing(struct kmem_cache *s,
 					void *object, unsigned long addr)
 {
+	struct page *page = virt_to_head_page(object);
+
 	if (!check_slab(s, page))
 		goto bad;
 
-	if (!on_freelist(s, page, object)) {
-		object_err(s, page, object, "Object already allocated");
+	if (!check_valid_pointer(s, page, object)) {
+		object_err(s, page, object, "Pointer check fails");
 		goto bad;
 	}
 
-	if (!check_valid_pointer(s, page, object)) {
-		object_err(s, page, object, "Freelist Pointer check fails");
+	if (object_marked_free(s, page, object)) {
+		object_err(s, page, object, "Allocated object still marked free in slab");
 		goto bad;
 	}
 
 	if (!check_object(s, page, object, 0))
 		goto bad;
 
+	if (!verify_slab(s, page))
+		goto bad;
+
 	/* Success perform special debug activities for allocs */
 	if (s->flags & SLAB_STORE_USER)
 		set_track(s, object, TRACK_ALLOC, addr);
@@ -910,15 +916,16 @@ bad:
 		 * as used avoids touching the remaining objects.
 		 */
 		slab_fix(s, "Marking all objects used");
-		page->inuse = page->objects;
-		page->freelist = NULL;
+		bitmap_zero(map(page), page->objects);
 	}
 	return 0;
 }
 
-static int free_debug_processing(struct kmem_cache *s, struct page *page,
+static int free_debug_processing(struct kmem_cache *s,
 					void *object, unsigned long addr)
 {
+	struct page *page = virt_to_head_page(object);
+
 	if (!check_slab(s, page))
 		goto fail;
 
@@ -927,7 +934,7 @@ static int free_debug_processing(struct 
 		goto fail;
 	}
 
-	if (on_freelist(s, page, object)) {
+	if (object_marked_free(s, page, object)) {
 		object_err(s, page, object, "Object already free");
 		goto fail;
 	}
@@ -950,13 +957,11 @@ static int free_debug_processing(struct 
 		goto fail;
 	}
 
-	/* Special debug activities for freeing objects */
-	if (!PageSlubFrozen(page) && !page->freelist)
-		remove_full(s, page);
 	if (s->flags & SLAB_STORE_USER)
 		set_track(s, object, TRACK_FREE, addr);
 	trace(s, page, object, 0);
 	init_object(s, object, 0);
+	verify_slab(s, page);
 	return 1;
 
 fail:
@@ -1061,7 +1066,8 @@ static inline int slab_pad_check(struct 
 			{ return 1; }
 static inline int check_object(struct kmem_cache *s, struct page *page,
 			void *object, int active) { return 1; }
-static inline void add_full(struct kmem_cache_node *n, struct page *page) {}
+static inline void add_full(struct kmem_cache *s,
+		struct kmem_cache_node *n, struct page *page) {}
 static inline unsigned long kmem_cache_flags(unsigned long objsize,
 	unsigned long flags, const char *name,
 	void (*ctor)(void *))
@@ -1163,8 +1169,8 @@ static struct page *new_slab(struct kmem
 {
 	struct page *page;
 	void *start;
-	void *last;
 	void *p;
+	unsigned long size;
 
 	BUG_ON(flags & GFP_SLAB_BUG_MASK);
 
@@ -1176,23 +1182,20 @@ static struct page *new_slab(struct kmem
 	inc_slabs_node(s, page_to_nid(page), page->objects);
 	page->slab = s;
 	page->flags |= 1 << PG_slab;
-
 	start = page_address(page);
+	size = PAGE_SIZE << compound_order(page);
 
 	if (unlikely(s->flags & SLAB_POISON))
-		memset(start, POISON_INUSE, PAGE_SIZE << compound_order(page));
+		memset(start, POISON_INUSE, size);
 
-	last = start;
-	for_each_object(p, s, start, page->objects) {
-		setup_object(s, page, last);
-		set_freepointer(s, last, p);
-		last = p;
-	}
-	setup_object(s, page, last);
-	set_freepointer(s, last, NULL);
+	if (!map_in_page_struct(page))
+		page->freelist = start + page->objects * s->size;
+
+	bitmap_fill(map(page), page->objects);
+
+	for_each_object(p, s, start, page->objects)
+		setup_object(s, page, p);
 
-	page->freelist = start;
-	page->inuse = 0;
 out:
 	return page;
 }
@@ -1316,7 +1319,6 @@ static inline int lock_and_freeze_slab(s
 	if (slab_trylock(page)) {
 		list_del(&page->lru);
 		n->nr_partial--;
-		__SetPageSlubFrozen(page);
 		return 1;
 	}
 	return 0;
@@ -1415,113 +1417,133 @@ static struct page *get_partial(struct k
 }
 
 /*
- * Move a page back to the lists.
- *
- * Must be called with the slab lock held.
- *
- * On exit the slab lock will have been dropped.
+ * Move the vector of objects back to the slab pages they came from
  */
-static void unfreeze_slab(struct kmem_cache *s, struct page *page, int tail)
+void drain_objects(struct kmem_cache *s, void **object, int nr)
 {
-	struct kmem_cache_node *n = get_node(s, page_to_nid(page));
+	int i;
 
-	__ClearPageSlubFrozen(page);
-	if (page->inuse) {
+	for (i = 0 ; i < nr; ) {
 
-		if (page->freelist) {
-			add_partial(n, page, tail);
-			stat(s, tail ? DEACTIVATE_TO_TAIL : DEACTIVATE_TO_HEAD);
-		} else {
-			stat(s, DEACTIVATE_FULL);
-			if (debug_on(s) && (s->flags & SLAB_STORE_USER))
-				add_full(n, page);
+		void *p = object[i];
+		struct page *page = virt_to_head_page(p);
+		void *addr = page_address(page);
+		unsigned long size = PAGE_SIZE << compound_order(page);
+		int was_fully_allocated;
+		unsigned long *m;
+		unsigned long offset;
+
+		if (debug_on(s) && !PageSlab(page)) {
+			object_err(s, page, object[i], "Object from non-slab page");
+			i++;
+			continue;
 		}
-		slab_unlock(page);
-	} else {
-		stat(s, DEACTIVATE_EMPTY);
-		if (n->nr_partial < s->min_partial) {
+
+		slab_lock(page);
+		m = map(page);
+		was_fully_allocated = bitmap_empty(m, page->objects);
+
+		offset = p - addr;
+
+
+		while (i < nr) {
+
+			int bit;
+			unsigned long new_offset;
+
+			if (offset >= size)
+				break;
+
+			if (debug_on(s) && offset % s->size) {
+				object_err(s, page, object[i], "Misaligned object");
+				i++;
+				new_offset = object[i] - addr;
+				continue;
+			}
+
+			bit = offset / s->size;
+
 			/*
-			 * Adding an empty slab to the partial slabs in order
-			 * to avoid page allocator overhead. This slab needs
-			 * to come after the other slabs with objects in
-			 * so that the others get filled first. That way the
-			 * size of the partial list stays small.
-			 *
-			 * kmem_cache_shrink can reclaim any empty slabs from
-			 * the partial list.
-			 */
-			add_partial(n, page, 1);
-			slab_unlock(page);
-		} else {
-			stat(s, FREE_SLAB);
-			discard_slab_unlock(s, page);
+			 * Fast loop to fold a sequence of objects into the slab
+			 * avoiding division and virt_to_head_page()
+  			 */
+			do {
+
+				if (debug_on(s)) {
+					if (unlikely(__test_and_set_bit(bit, m)))
+						object_err(s, page, object[i], "Double free");
+				} else
+					__set_bit(bit, m);
+
+				i++;
+				bit++;
+				offset += s->size;
+				new_offset = object[i] - addr;
+
+			} while (i < nr && new_offset ==  offset);
+
+			offset = new_offset;
 		}
-	}
-}
 
-/*
- * Remove the cpu slab
- */
-static void deactivate_slab(struct kmem_cache *s, struct kmem_cache_cpu *c)
-{
-	struct page *page = c->page;
-	int tail = 1;
+		if (bitmap_full(m, page->objects)) {
 
-	if (page->freelist)
-		stat(s, DEACTIVATE_REMOTE_FREES);
-	/*
-	 * Merge cpu freelist into slab freelist. Typically we get here
-	 * because both freelists are empty. So this is unlikely
-	 * to occur.
-	 */
-	while (unlikely(c->freelist)) {
-		void **object;
+			/* All objects are available now */
+			if (!was_fully_allocated)
+
+				remove_partial(s, page);
+			else
+				remove_full(s, page);
+
+			discard_slab_unlock(s, page);
 
-		tail = 0;	/* Hot objects. Put the slab first */
+  		} else {
 
-		/* Retrieve object from cpu_freelist */
-		object = c->freelist;
-		c->freelist = get_freepointer(s, c->freelist);
+			/* Some objects are available now */
+			if (was_fully_allocated) {
 
-		/* And put onto the regular freelist */
-		set_freepointer(s, object, page->freelist);
-		page->freelist = object;
-		page->inuse--;
+				/* Slab had no free objects but has some now */
+				remove_full(s, page);
+				add_partial(get_node(s, page_to_nid(page)), page, 1);
+				stat(s, FREE_REMOVE_PARTIAL);
+			}
+			slab_unlock(page);
+		}
 	}
-	c->page = NULL;
-	unfreeze_slab(s, page, tail);
 }
 
-static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c)
+/*
+ * Drain all objects from a per cpu queue
+ */
+static void flush_cpu_objects(struct kmem_cache *s, struct kmem_cache_cpu *c)
 {
+	drain_objects(s, c->object, c->objects);
+	c->objects = 0;
 	stat(s, CPUSLAB_FLUSH);
-	slab_lock(c->page);
-	deactivate_slab(s, c);
 }
 
 /*
- * Flush cpu slab.
+ * Flush cpu objects.
  *
  * Called from IPI handler with interrupts disabled.
  */
-static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu)
+static void __flush_cpu_objects(void *d)
 {
-	struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
+	struct kmem_cache *s = d;
+	struct kmem_cache_cpu *c = __this_cpu_ptr(s->cpu_slab);
 
-	if (likely(c && c->page))
-		flush_slab(s, c);
+	if (c->objects)
+		flush_cpu_objects(s, c);
 }
 
-static void flush_cpu_slab(void *d)
+static void flush_all(struct kmem_cache *s)
 {
-	struct kmem_cache *s = d;
-
-	__flush_cpu_slab(s, smp_processor_id());
+	on_each_cpu(__flush_cpu_objects, s, 1);
 }
 
-static void flush_all(struct kmem_cache *s)
+struct kmem_cache_cpu *alloc_kmem_cache_cpu(struct kmem_cache *s, int n)
 {
-	on_each_cpu(flush_cpu_slab, s, 1);
+	return __alloc_percpu(sizeof(struct kmem_cache_cpu),
+			__alignof__(struct kmem_cache_cpu));
 }
 
 /*
@@ -1539,7 +1561,7 @@ static inline int node_match(struct kmem
 
 static int count_free(struct page *page)
 {
-	return page->objects - page->inuse;
+	return available(page);
 }
 
 static unsigned long count_partial(struct kmem_cache_node *n,
@@ -1601,144 +1623,128 @@ slab_out_of_memory(struct kmem_cache *s,
 }
 
 /*
- * Slow path. The lockless freelist is empty or we need to perform
- * debugging duties.
- *
- * Interrupts are disabled.
- *
- * Processing is still very fast if new objects have been freed to the
- * regular freelist. In that case we simply take over the regular freelist
- * as the lockless freelist and zap the regular freelist.
- *
- * If that is not working then we fall back to the partial lists. We take the
- * first element of the freelist as the object to allocate now and move the
- * rest of the freelist to the lockless freelist.
- *
- * And if we were unable to get a new slab from the partial slab lists then
- * we need to allocate a new slab. This is the slowest path since it involves
- * a call to the page allocator and the setup of a new slab.
+ * Retrieve pointers to nr objects from a slab into the object array.
+ * Slab must be locked.
  */
-static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
-			  unsigned long addr, struct kmem_cache_cpu *c)
+void retrieve_objects(struct kmem_cache *s, struct page *page, void **object, int nr)
 {
-	void **object;
-	struct page *new;
-
-	/* We handle __GFP_ZERO in the caller */
-	gfpflags &= ~__GFP_ZERO;
+	void *addr = page_address(page);
+	unsigned long *m = map(page);
 
-	if (!c->page)
-		goto new_slab;
+	while (nr > 0) {
+		int i = find_first_bit(m, page->objects);
+		void *a;
 
-	slab_lock(c->page);
-	if (unlikely(!node_match(c, node)))
-		goto another_slab;
-
-	stat(s, ALLOC_REFILL);
-
-load_freelist:
-	object = c->page->freelist;
-	if (unlikely(!object))
-		goto another_slab;
-	if (debug_on(s))
-		goto debug;
-
-	c->freelist = get_freepointer(s, object);
-	c->page->inuse = c->page->objects;
-	c->page->freelist = NULL;
-	c->node = page_to_nid(c->page);
-unlock_out:
-	slab_unlock(c->page);
-	stat(s, ALLOC_SLOWPATH);
-	return object;
+		VM_BUG_ON(i >= page->objects);
 
-another_slab:
-	deactivate_slab(s, c);
+		__clear_bit(i, m);
+		a = addr + i * s->size;
 
-new_slab:
-	new = get_partial(s, gfpflags, node);
-	if (new) {
-		c->page = new;
-		stat(s, ALLOC_FROM_PARTIAL);
-		goto load_freelist;
-	}
-
-	if (gfpflags & __GFP_WAIT)
-		local_irq_enable();
-
-	new = new_slab(s, gfpflags, node);
-
-	if (gfpflags & __GFP_WAIT)
-		local_irq_disable();
-
-	if (new) {
-		c = __this_cpu_ptr(s->cpu_slab);
-		stat(s, ALLOC_SLAB);
-		if (c->page)
-			flush_slab(s, c);
-		slab_lock(new);
-		__SetPageSlubFrozen(new);
-		c->page = new;
-		goto load_freelist;
+		/*
+		 * Fast loop to get a sequence of objects out of the slab
+		 * without find_first_bit() and multiplication
+		 */
+		do {
+			nr--;
+			object[nr] = a;
+			a += s->size;
+			i++;
+		} while (nr > 0 && i < page->objects && __test_and_clear_bit(i, m));
 	}
-	if (!(gfpflags & __GFP_NOWARN) && printk_ratelimit())
-		slab_out_of_memory(s, gfpflags, node);
-	return NULL;
-debug:
-	if (!alloc_debug_processing(s, c->page, object, addr))
-		goto another_slab;
-
-	c->page->inuse++;
-	c->page->freelist = get_freepointer(s, object);
-	c->node = -1;
-	goto unlock_out;
 }
 
-/*
- * Inlined fastpath so that allocation functions (kmalloc, kmem_cache_alloc)
- * have the fastpath folded into their functions. So no function call
- * overhead for requests that can be satisfied on the fastpath.
- *
- * The fastpath works by first checking if the lockless freelist can be used.
- * If not then __slab_alloc is called for slow processing.
- *
- * Otherwise we can simply pick the next object from the lockless free list.
- */
-static __always_inline void *slab_alloc(struct kmem_cache *s,
+static void *slab_alloc(struct kmem_cache *s,
 		gfp_t gfpflags, int node, unsigned long addr)
 {
 	void **object;
 	struct kmem_cache_cpu *c;
 	unsigned long flags;
 
-	gfpflags &= gfp_allowed_mask;
-
 	lockdep_trace_alloc(gfpflags);
 	might_sleep_if(gfpflags & __GFP_WAIT);
 
 	if (should_failslab(s->objsize, gfpflags, s->flags))
 		return NULL;
 
+redo:
 	local_irq_save(flags);
 	c = __this_cpu_ptr(s->cpu_slab);
-	object = c->freelist;
-	if (unlikely(!object || !node_match(c, node)))
+	if (unlikely(!c->objects || !node_match(c, node))) {
 
-		object = __slab_alloc(s, gfpflags, node, addr, c);
+		gfpflags &= gfp_allowed_mask;
 
-	else {
-		c->freelist = get_freepointer(s, object);
+		if (unlikely(!node_match(c, node))) {
+			flush_cpu_objects(s, c);
+			c->node = node;
+		}
+
+		while (c->objects < BOOT_BATCH_SIZE) {
+			struct page *new;
+			int d;
+
+			new = get_partial(s, gfpflags & ~__GFP_ZERO, node);
+			if (unlikely(!new)) {
+
+				if (gfpflags & __GFP_WAIT)
+					local_irq_enable();
+
+				new = new_slab(s, gfpflags, node);
+
+				if (gfpflags & __GFP_WAIT)
+					local_irq_disable();
+
+				/* process may have moved to different cpu */
+				c = __this_cpu_ptr(s->cpu_slab);
+
+ 				if (!new) {
+					if (!c->objects)
+						goto oom;
+					break;
+				}
+				stat(s, ALLOC_SLAB);
+				slab_lock(new);
+			} else
+				stat(s, ALLOC_FROM_PARTIAL);
+
+			d = min(BOOT_BATCH_SIZE - c->objects, available(new));
+			retrieve_objects(s, new, c->object + c->objects, d);
+			c->objects += d;
+
+			if (!all_objects_used(new))
+
+				add_partial(get_node(s, page_to_nid(new)), new, 1);
+
+			else
+				add_full(s, get_node(s, page_to_nid(new)), new);
+
+			slab_unlock(new);
+		}
+		stat(s, ALLOC_SLOWPATH);
+
+	} else
 		stat(s, ALLOC_FASTPATH);
+
+	object = c->object[--c->objects];
+
+	if (unlikely(debug_on(s))) {
+		if (!alloc_debug_processing(s, object, addr))
+			goto redo;
 	}
 	local_irq_restore(flags);
 
-	if (unlikely(gfpflags & __GFP_ZERO) && object)
+	if (unlikely(gfpflags & __GFP_ZERO))
 		memset(object, 0, s->objsize);
 
 	kmemcheck_slab_alloc(s, gfpflags, object, s->objsize);
 	kmemleak_alloc_recursive(object, s->objsize, 1, s->flags, gfpflags);
 
 	return object;
+
+oom:
+	local_irq_restore(flags);
+	if (!(gfpflags & __GFP_NOWARN) && printk_ratelimit())
+		slab_out_of_memory(s, gfpflags, node);
+	return NULL;
 }
 
 void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags)
@@ -1782,113 +1788,52 @@ void *kmem_cache_alloc_node_notrace(stru
 EXPORT_SYMBOL(kmem_cache_alloc_node_notrace);
 #endif
 
-/*
- * Slow patch handling. This may still be called frequently since objects
- * have a longer lifetime than the cpu slabs in most processing loads.
- *
- * So we still attempt to reduce cache line usage. Just take the slab
- * lock and free the item. If there is no additional partial page
- * handling required then we can return immediately.
- */
-static void __slab_free(struct kmem_cache *s, struct page *page,
+static void slab_free(struct kmem_cache *s,
 			void *x, unsigned long addr)
 {
-	void *prior;
-	void **object = (void *)x;
-
-	stat(s, FREE_SLOWPATH);
-	slab_lock(page);
-
-	if (debug_on(s))
-		goto debug;
-
-checks_ok:
-	prior = page->freelist;
-	set_freepointer(s, object, prior);
-	page->freelist = object;
-	page->inuse--;
-
-	if (unlikely(PageSlubFrozen(page))) {
-		stat(s, FREE_FROZEN);
-		goto out_unlock;
-	}
-
-	if (unlikely(!page->inuse))
-		goto slab_empty;
-
-	/*
-	 * Objects left in the slab. If it was not on the partial list before
-	 * then add it.
-	 */
-	if (unlikely(!prior)) {
-		add_partial(get_node(s, page_to_nid(page)), page, 1);
-		stat(s, FREE_ADD_PARTIAL);
-	}
-
-out_unlock:
-	slab_unlock(page);
-	return;
-
-slab_empty:
-	if (prior) {
-		/*
-		 * Slab still on the partial list.
-		 */
-		remove_partial(s, page);
-		stat(s, FREE_REMOVE_PARTIAL);
-	}
-	stat(s, FREE_SLAB);
-	discard_slab_unlock(s, page);
-	return;
-
-debug:
-	if (!free_debug_processing(s, page, x, addr))
-		goto out_unlock;
-	goto checks_ok;
-}
-
-/*
- * Fastpath with forced inlining to produce a kfree and kmem_cache_free that
- * can perform fastpath freeing without additional function calls.
- *
- * The fastpath is only possible if we are freeing to the current cpu slab
- * of this processor. This typically the case if we have just allocated
- * the item before.
- *
- * If fastpath is not possible then fall back to __slab_free where we deal
- * with all sorts of special processing.
- */
-static __always_inline void slab_free(struct kmem_cache *s,
-			struct page *page, void *x, unsigned long addr)
-{
 	void **object = (void *)x;
 	struct kmem_cache_cpu *c;
 	unsigned long flags;
 
 	kmemleak_free_recursive(x, s->flags);
+
 	local_irq_save(flags);
 	c = __this_cpu_ptr(s->cpu_slab);
+
 	kmemcheck_slab_free(s, object, s->objsize);
 	debug_check_no_locks_freed(object, s->objsize);
+
 	if (!(s->flags & SLAB_DEBUG_OBJECTS))
 		debug_check_no_obj_freed(object, s->objsize);
-	if (likely(page == c->page && c->node >= 0)) {
-		set_freepointer(s, object, c->freelist);
-		c->freelist = object;
-		stat(s, FREE_FASTPATH);
+
+	if (unlikely(c->objects >= BOOT_QUEUE_SIZE)) {
+
+		int t = min(BOOT_BATCH_SIZE, c->objects);
+
+		drain_objects(s, c->object, t);
+
+		c->objects -= t;
+		if (c->objects)
+			memcpy(c->object, c->object + t,
+					c->objects * sizeof(void *));
+
+		stat(s, FREE_SLOWPATH);
 	} else
-		__slab_free(s, page, x, addr);
+		stat(s, FREE_FASTPATH);
+
+	if (unlikely(debug_on(s))
+			&& !free_debug_processing(s, x, addr))
+		goto out;
+
+	c->object[c->objects++] = object;
 
+out:
 	local_irq_restore(flags);
 }
 
 void kmem_cache_free(struct kmem_cache *s, void *x)
 {
-	struct page *page;
-
-	page = virt_to_head_page(x);
-
-	slab_free(s, page, x, _RET_IP_);
+	slab_free(s, x, _RET_IP_);
 
 	trace_kmem_cache_free(_RET_IP_, x);
 }
@@ -1906,11 +1851,6 @@ static struct page *get_object_page(cons
 }
 
 /*
- * Object placement in a slab is made very easy because we always start at
- * offset 0. If we tune the size of the object to the alignment then we can
- * get the required alignment by putting one properly sized object after
- * another.
- *
  * Notice that the allocation order determines the sizes of the per cpu
  * caches. Each processor has always one slab available for allocations.
  * Increasing the allocation order reduces the number of times that slabs
@@ -2005,7 +1945,7 @@ static inline int calculate_order(int si
 	 */
 	min_objects = slub_min_objects;
 	if (!min_objects)
-		min_objects = 4 * (fls(nr_cpu_ids) + 1);
+		min_objects = min(BITS_PER_LONG, 4 * (fls(nr_cpu_ids) + 1));
 	max_objects = (PAGE_SIZE << slub_max_order)/size;
 	min_objects = min(min_objects, max_objects);
 
@@ -2083,12 +2023,12 @@ static inline int alloc_kmem_cache_cpus(
 {
 	if (is_kmalloc_cache(s))
 		/*
-		 * Boot time creation of the kmalloc array. Use static per cpu data
-		 * since the per cpu allocator is not available yet.
+		 * Kmalloc caches have statically defined per cpu caches
 		 */
 		s->cpu_slab = kmalloc_percpu + (s - kmalloc_caches);
 	else
-		s->cpu_slab =  alloc_percpu(struct kmem_cache_cpu);
+
+		s->cpu_slab =  alloc_kmem_cache_cpu(s, BOOT_QUEUE_SIZE);
 
 	if (!s->cpu_slab)
 		return 0;
@@ -2125,10 +2065,7 @@ static void early_kmem_cache_node_alloc(
 				"in order to be able to continue\n");
 	}
 
-	n = page->freelist;
-	BUG_ON(!n);
-	page->freelist = get_freepointer(kmalloc_caches + i, n);
-	page->inuse++;
+	retrieve_objects(kmalloc_caches + i, page, (void **)&n, 1);
 	kmalloc_caches[i].node[node] = n;
 #ifdef CONFIG_SLUB_DEBUG
 	init_object(kmalloc_caches + i, n, 1);
@@ -2165,7 +2102,7 @@ static int init_kmem_cache_nodes(struct 
 	int node;
 	int local_node;
 
-	if (slab_state >= UP && !is_kmalloc_cache(s))
+	if (slab_state >= UP && (!is_kmalloc_cache(s)))
 		local_node = page_to_nid(virt_to_page(s));
 	else
 		local_node = 0;
@@ -2222,10 +2159,11 @@ static void set_min_partial(struct kmem_
 static int calculate_sizes(struct kmem_cache *s, int forced_order)
 {
 	unsigned long flags = s->flags;
-	unsigned long size = s->objsize;
+	unsigned long size;
 	unsigned long align = s->align;
 	int order;
 
+	size = s->objsize;
 	/*
 	 * Round up object size to the next word boundary. We can only
 	 * place the free pointer at word boundaries and this determines
@@ -2257,24 +2195,10 @@ static int calculate_sizes(struct kmem_c
 
 	/*
 	 * With that we have determined the number of bytes in actual use
-	 * by the object. This is the potential offset to the free pointer.
+	 * by the object.
 	 */
 	s->inuse = size;
 
-	if (((flags & (SLAB_DESTROY_BY_RCU | SLAB_POISON)) ||
-		s->ctor)) {
-		/*
-		 * Relocate free pointer after the object if it is not
-		 * permitted to overwrite the first word of the object on
-		 * kmem_cache_free.
-		 *
-		 * This is the case if we do RCU, have a constructor or
-		 * destructor or are poisoning the objects.
-		 */
-		s->offset = size;
-		size += sizeof(void *);
-	}
-
 #ifdef CONFIG_SLUB_DEBUG
 	if (flags & SLAB_STORE_USER)
 		/*
@@ -2360,7 +2284,6 @@ static int kmem_cache_open(struct kmem_c
 		 */
 		if (get_order(s->size) > get_order(s->objsize)) {
 			s->flags &= ~DEBUG_METADATA_FLAGS;
-			s->offset = 0;
 			if (!calculate_sizes(s, -1))
 				goto error;
 		}
@@ -2385,9 +2308,9 @@ static int kmem_cache_open(struct kmem_c
 error:
 	if (flags & SLAB_PANIC)
 		panic("Cannot create slab %s size=%lu realsize=%u "
-			"order=%u offset=%u flags=%lx\n",
+			"order=%u flags=%lx\n",
 			s->name, (unsigned long)size, s->size, oo_order(s->oo),
-			s->offset, flags);
+			flags);
 	return 0;
 }
 
@@ -2441,17 +2364,13 @@ static void list_slab_objects(struct kme
 #ifdef CONFIG_SLUB_DEBUG
 	void *addr = page_address(page);
 	void *p;
-	DECLARE_BITMAP(map, page->objects);
 
-	bitmap_zero(map, page->objects);
 	slab_err(s, page, "%s", text);
 	slab_lock(page);
-	for_each_free_object(p, s, page->freelist)
-		set_bit(slab_index(p, s, addr), map);
 
 	for_each_object(p, s, addr, page->objects) {
 
-		if (!test_bit(slab_index(p, s, addr), map)) {
+		if (!test_bit(slab_index(p, s, addr), map(page))) {
 			printk(KERN_ERR "INFO: Object 0x%p @offset=%tu\n",
 							p, p - addr);
 			print_tracking(s, p);
@@ -2471,7 +2390,7 @@ static void free_partial(struct kmem_cac
 
 	spin_lock_irqsave(&n->list_lock, flags);
 	list_for_each_entry_safe(page, h, &n->partial, lru) {
-		if (!page->inuse) {
+		if (all_objects_available(page)) {
 			list_del(&page->lru);
 			discard_slab(s, page);
 			n->nr_partial--;
@@ -2866,7 +2785,7 @@ void kfree(const void *x)
 		put_page(page);
 		return;
 	}
-	slab_free(page->slab, page, object, _RET_IP_);
+	slab_free(page->slab, object, _RET_IP_);
 }
 EXPORT_SYMBOL(kfree);
 
@@ -2914,7 +2833,7 @@ int kmem_cache_shrink(struct kmem_cache 
 		 * list_lock. page->inuse here is the upper limit.
 		 */
 		list_for_each_entry_safe(page, t, &n->partial, lru) {
-			if (!page->inuse && slab_trylock(page)) {
+			if (all_objects_available(page) && slab_trylock(page)) {
 				/*
 				 * Must hold slab lock here because slab_free
 				 * may have freed the last object and be
@@ -2925,7 +2844,7 @@ int kmem_cache_shrink(struct kmem_cache 
 				discard_slab_unlock(s, page);
 			} else {
 				list_move(&page->lru,
-				slabs_by_inuse + page->inuse);
+				slabs_by_inuse + inuse(page));
 			}
 		}
 
@@ -3312,7 +3231,7 @@ static int __cpuinit slab_cpuup_callback
 		down_read(&slub_lock);
 		list_for_each_entry(s, &slab_caches, list) {
 			local_irq_save(flags);
-			__flush_cpu_slab(s, cpu);
+			flush_cpu_objects(s, per_cpu_ptr(s->cpu_slab ,cpu));
 			local_irq_restore(flags);
 		}
 		up_read(&slub_lock);
@@ -3375,7 +3294,7 @@ void *__kmalloc_node_track_caller(size_t
 #ifdef CONFIG_SLUB_DEBUG
 static int count_inuse(struct page *page)
 {
-	return page->inuse;
+	return inuse(page);
 }
 
 static int count_total(struct page *page)
@@ -3383,54 +3302,52 @@ static int count_total(struct page *page
 	return page->objects;
 }
 
-static int validate_slab(struct kmem_cache *s, struct page *page,
-						unsigned long *map)
+static int validate_slab(struct kmem_cache *s, struct page *page)
 {
 	void *p;
 	void *addr = page_address(page);
+	unsigned long *m = map(page);
+	unsigned long errors = 0;
 
-	if (!check_slab(s, page) ||
-			!on_freelist(s, page, NULL))
+	if (!check_slab(s, page) || !verify_slab(s, page))
 		return 0;
 
-	/* Now we know that a valid freelist exists */
-	bitmap_zero(map, page->objects);
+	for_each_object(p, s, addr, page->objects) {
+		int bit = slab_index(p, s, addr);
+		int used = !test_bit(bit, m);
 
-	for_each_free_object(p, s, page->freelist) {
-		set_bit(slab_index(p, s, addr), map);
-		if (!check_object(s, page, p, 0))
-			return 0;
+		if (!check_object(s, page, p, used))
+			errors++;
 	}
 
-	for_each_object(p, s, addr, page->objects)
-		if (!test_bit(slab_index(p, s, addr), map))
-			if (!check_object(s, page, p, 1))
-				return 0;
-	return 1;
+	return errors;
 }
 
-static void validate_slab_slab(struct kmem_cache *s, struct page *page,
-						unsigned long *map)
+static unsigned long validate_slab_slab(struct kmem_cache *s, struct page *page)
 {
+	unsigned long errors = 0;
+
 	if (slab_trylock(page)) {
-		validate_slab(s, page, map);
+		errors = validate_slab(s, page);
 		slab_unlock(page);
 	} else
 		printk(KERN_INFO "SLUB %s: Skipped busy slab 0x%p\n",
 			s->name, page);
+	return errors;
 }
 
 static int validate_slab_node(struct kmem_cache *s,
-		struct kmem_cache_node *n, unsigned long *map)
+		struct kmem_cache_node *n)
 {
 	unsigned long count = 0;
 	struct page *page;
 	unsigned long flags;
+	unsigned long errors = 0;
 
 	spin_lock_irqsave(&n->list_lock, flags);
 
 	list_for_each_entry(page, &n->partial, lru) {
-		validate_slab_slab(s, page, map);
+		errors += validate_slab_slab(s, page);
 		count++;
 	}
 	if (count != n->nr_partial)
@@ -3441,7 +3358,7 @@ static int validate_slab_node(struct kme
 		goto out;
 
 	list_for_each_entry(page, &n->full, lru) {
-		validate_slab_slab(s, page, map);
+		errors += validate_slab_slab(s, page);
 		count++;
 	}
 	if (count != atomic_long_read(&n->nr_slabs))
@@ -3451,26 +3368,20 @@ static int validate_slab_node(struct kme
 
 out:
 	spin_unlock_irqrestore(&n->list_lock, flags);
-	return count;
+	return errors;
 }
 
 static long validate_slab_cache(struct kmem_cache *s)
 {
 	int node;
 	unsigned long count = 0;
-	unsigned long *map = kmalloc(BITS_TO_LONGS(oo_objects(s->max)) *
-				sizeof(unsigned long), GFP_KERNEL);
-
-	if (!map)
-		return -ENOMEM;
 
 	flush_all(s);
 	for_each_node_state(node, N_NORMAL_MEMORY) {
 		struct kmem_cache_node *n = get_node(s, node);
 
-		count += validate_slab_node(s, n, map);
+		count += validate_slab_node(s, n);
 	}
-	kfree(map);
 	return count;
 }
 
@@ -3662,15 +3573,10 @@ static void process_slab(struct loc_trac
 		struct page *page, enum track_item alloc)
 {
 	void *addr = page_address(page);
-	DECLARE_BITMAP(map, page->objects);
 	void *p;
 
-	bitmap_zero(map, page->objects);
-	for_each_free_object(p, s, page->freelist)
-		set_bit(slab_index(p, s, addr), map);
-
 	for_each_object(p, s, addr, page->objects)
-		if (!test_bit(slab_index(p, s, addr), map))
+		if (!test_bit(slab_index(p, s, addr), map(page)))
 			add_location(t, s, get_track(s, p, alloc));
 }
 
@@ -3794,11 +3700,11 @@ static ssize_t show_slab_objects(struct 
 			if (!c || c->node < 0)
 				continue;
 
-			if (c->page) {
-					if (flags & SO_TOTAL)
-						x = c->page->objects;
+			if (c->objects) {
+				if (flags & SO_TOTAL)
+					x = 0;
 				else if (flags & SO_OBJECTS)
-					x = c->page->inuse;
+					x = c->objects;
 				else
 					x = 1;
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 89+ messages in thread

* [RFC V2 SLEB 08/14] SLEB: Resize cpu queue
  2010-05-21 21:14 [RFC V2 SLEB 00/14] The Enhanced(hopefully) Slab Allocator Christoph Lameter
                   ` (6 preceding siblings ...)
  2010-05-21 21:14 ` [RFC V2 SLEB 07/14] SLEB: The Enhanced Slab Allocator Christoph Lameter
@ 2010-05-21 21:15 ` Christoph Lameter
  2010-05-21 21:15 ` [RFC V2 SLEB 09/14] SLED: Get rid of useless function Christoph Lameter
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 89+ messages in thread
From: Christoph Lameter @ 2010-05-21 21:15 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: linux-mm

[-- Attachment #1: sled_resize --]
[-- Type: text/plain, Size: 7910 bytes --]

Allow resizing of the cpu queue and batch sizes. Resizing queues is only
possible for non-kmalloc slabs, since kmalloc slabs have statically
allocated per cpu queues (to avoid bootstrap issues). Resizing involves
reallocating the per cpu structures and is done by replicating the
basic steps of how SLAB does it.

Careful: This means that the ->cpu_slab pointer is only
guaranteed to be stable if interrupts are disabled.
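
For illustration, here is a rough, userspace-only sketch of the resize
ordering (shrink the advertised limit before switching storage, grow it
only afterwards). The names below (cpu_queue, drain_all, resize_queue)
are made up for the sketch and are not the kernel symbols; the percpu
allocation, the flush IPIs and slub_lock are all elided:

#include <stdio.h>
#include <stdlib.h>

struct cpu_queue {
	int limit;		/* models s->queue */
	int nr;			/* objects currently queued */
	void **obj;		/* queue storage */
};

static void drain_all(struct cpu_queue *q)
{
	/* stand-in for flushing the queued objects back to the slabs */
	q->nr = 0;
}

static void resize_queue(struct cpu_queue *q, int new_limit)
{
	void **storage = calloc(new_limit, sizeof(void *));

	/*
	 * Shrink the advertised limit before switching storage so that
	 * concurrent frees can never overrun a smaller replacement array.
	 */
	if (new_limit < q->limit)
		q->limit = new_limit;

	drain_all(q);			/* old storage is now empty */
	free(q->obj);
	q->obj = storage;

	/* Grow only once the larger storage is actually in place. */
	if (new_limit > q->limit)
		q->limit = new_limit;
}

int main(void)
{
	struct cpu_queue q = { .limit = 8, .obj = calloc(8, sizeof(void *)) };

	resize_queue(&q, 4);
	printf("limit now %d\n", q.limit);
	resize_queue(&q, 16);
	printf("limit now %d\n", q.limit);
	free(q.obj);
	return 0;
}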

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 include/linux/slub_def.h |    2 
 mm/slub.c                |  152 +++++++++++++++++++++++++++++++++++++++++++----
 2 files changed, 143 insertions(+), 11 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2010-05-20 14:40:08.000000000 -0500
+++ linux-2.6/mm/slub.c	2010-05-20 14:40:17.000000000 -0500
@@ -1521,6 +1521,11 @@ static void flush_cpu_objects(struct kme
 	stat(s, CPUSLAB_FLUSH);
 }
 
+struct flush_control {
+	struct kmem_cache *s;
+	struct kmem_cache_cpu *c;
+};
+
 /*
  * Flush cpu objects.
  *
@@ -1528,24 +1533,77 @@ static void flush_cpu_objects(struct kme
  */
 static void __flush_cpu_objects(void *d)
 {
-	struct kmem_cache *s = d;
-	struct kmem_cache_cpu *c = __this_cpu_ptr(s->cpu_slab);
+	struct flush_control *f = d;
+	struct kmem_cache_cpu *c = __this_cpu_ptr(f->c);
 
 	if (c->objects)
-		flush_cpu_objects(s, c);
+		flush_cpu_objects(f->s, c);
 }
 
 static void flush_all(struct kmem_cache *s)
 {
-	on_each_cpu(__flush_cpu_objects, s, 1);
+	struct flush_control f = { s, s->cpu_slab};
+
+	on_each_cpu(__flush_cpu_objects, &f, 1);
 }
 
 struct kmem_cache_cpu *alloc_kmem_cache_cpu(struct kmem_cache *s, int n)
 {
-	return __alloc_percpu(sizeof(struct kmem_cache_cpu),
+	return __alloc_percpu(
+			sizeof(struct kmem_cache_cpu) + sizeof(void *) * (n - BOOT_QUEUE_SIZE),
 			__alignof__(struct kmem_cache_cpu));
 }
 
+static void resize_cpu_queue(struct kmem_cache *s, int queue)
+{
+
+	if (is_kmalloc_cache(s)) {
+		if (queue < BOOT_QUEUE_SIZE) {
+			s->queue = queue;
+			if (s->batch > queue)
+				s->batch = queue;
+		} else {
+			/* More than max. Go to max allowed */
+			s->queue = BOOT_QUEUE_SIZE;
+			s->batch = BOOT_BATCH_SIZE;
+		}
+	} else {
+		struct kmem_cache_cpu *n = alloc_kmem_cache_cpu(s, queue);
+		struct flush_control f;
+
+		/* Create the new cpu queue and then free the old one */
+		down_write(&slub_lock);
+		f.s = s;
+		f.c = s->cpu_slab;
+
+		/* We can only shrink the queue here since the new
+		 * queue size may be smaller and there may be concurrent
+		 * slab operations. The update of the queue must be seen
+		 * before the change of the location of the percpu queue.
+		 *
+		 * Note that the queue may contain more objects than the
+		 * queue size after this operation.
+		 */
+		if (queue < s->queue) {
+			s->queue = queue;
+			barrier();
+		}
+		s->cpu_slab = n;
+		on_each_cpu(__flush_cpu_objects, &f, 1);
+
+		/*
+		 * If the queue needs to be extended then we deferred
+		 * the update until now when the larger sized queue
+		 * has been allocated and is working.
+		 */
+		if (queue > s->queue)
+			s->queue = queue;
+
+		up_write(&slub_lock);
+		free_percpu(f.c);
+	}
+}
+
 /*
  * Check if the objects in a per cpu structure fit numa
  * locality expectations.
@@ -1678,7 +1736,7 @@ redo:
 			c->node = node;
 		}
 
-		while (c->objects < BOOT_BATCH_SIZE) {
+		while (c->objects < s->batch) {
 			struct page *new;
 			int d;
 
@@ -1706,7 +1764,7 @@ redo:
 			} else
 				stat(s, ALLOC_FROM_PARTIAL);
 
-			d = min(BOOT_BATCH_SIZE - c->objects, available(new));
+			d = min(s->batch - c->objects, available(new));
 			retrieve_objects(s, new, c->object + c->objects, d);
 			c->objects += d;
 
@@ -1806,9 +1864,9 @@ static void slab_free(struct kmem_cache 
 	if (!(s->flags & SLAB_DEBUG_OBJECTS))
 		debug_check_no_obj_freed(object, s->objsize);
 
-	if (unlikely(c->objects >= BOOT_QUEUE_SIZE)) {
+	if (unlikely(c->objects >= s->queue)) {
 
-		int t = min(BOOT_BATCH_SIZE, c->objects);
+		int t = min(s->batch, c->objects);
 
 		drain_objects(s, c->object, t);
 
@@ -2028,7 +2086,7 @@ static inline int alloc_kmem_cache_cpus(
 		s->cpu_slab = kmalloc_percpu + (s - kmalloc_caches);
 	else
 
-		s->cpu_slab =  alloc_kmem_cache_cpu(s, BOOT_QUEUE_SIZE);
+		s->cpu_slab =  alloc_kmem_cache_cpu(s, s->queue);
 
 	if (!s->cpu_slab)
 		return 0;
@@ -2263,6 +2321,26 @@ static int calculate_sizes(struct kmem_c
 
 }
 
+/* Autotuning of the per cpu queueing */
+void initial_cpu_queue_setup(struct kmem_cache *s)
+{
+	if (s->size > PAGE_SIZE)
+		s->queue = 8;
+	else if (s->size > 1024)
+		s->queue = 24;
+	else if (s->size > 256)
+		s->queue = 54;
+	else
+		s->queue = 120;
+
+	if (is_kmalloc_cache(s) && s->queue > BOOT_QUEUE_SIZE) {
+		/* static so cap it */
+		s->queue = BOOT_QUEUE_SIZE;
+	}
+
+	s->batch = (s->queue + 1) / 2;
+}
+
 static int kmem_cache_open(struct kmem_cache *s, gfp_t gfpflags,
 		const char *name, size_t size,
 		size_t align, unsigned long flags,
@@ -2298,6 +2376,7 @@ static int kmem_cache_open(struct kmem_c
 #ifdef CONFIG_NUMA
 	s->remote_node_defrag_ratio = 1000;
 #endif
+	initial_cpu_queue_setup(s);
 	if (!init_kmem_cache_nodes(s, gfpflags & ~SLUB_DMA))
 		goto error;
 
@@ -3855,6 +3934,55 @@ static ssize_t min_partial_store(struct 
 }
 SLAB_ATTR(min_partial);
 
+static ssize_t cpu_queue_size_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%u\n", s->queue);
+}
+
+static ssize_t cpu_queue_size_store(struct kmem_cache *s,
+			 const char *buf, size_t length)
+{
+	unsigned long queue;
+	int err;
+
+	err = strict_strtoul(buf, 10, &queue);
+	if (err)
+		return err;
+
+	if (queue > 10000 || queue < 4)
+		return -EINVAL;
+
+	if (s->batch > queue)
+		s->batch = queue;
+
+	resize_cpu_queue(s, queue);
+	return length;
+}
+SLAB_ATTR(cpu_queue_size);
+
+static ssize_t cpu_batch_size_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%u\n", s->batch);
+}
+
+static ssize_t cpu_batch_size_store(struct kmem_cache *s,
+			 const char *buf, size_t length)
+{
+	unsigned long batch;
+	int err;
+
+	err = strict_strtoul(buf, 10, &batch);
+	if (err)
+		return err;
+
+	if (batch > s->queue || batch < 4)
+		return -EINVAL;
+
+	s->batch = batch;
+	return length;
+}
+SLAB_ATTR(cpu_batch_size);
+
 static ssize_t ctor_show(struct kmem_cache *s, char *buf)
 {
 	if (s->ctor) {
@@ -4204,6 +4332,8 @@ static struct attribute *slab_attrs[] = 
 	&objs_per_slab_attr.attr,
 	&order_attr.attr,
 	&min_partial_attr.attr,
+	&cpu_queue_size_attr.attr,
+	&cpu_batch_size_attr.attr,
 	&objects_attr.attr,
 	&objects_partial_attr.attr,
 	&total_objects_attr.attr,
@@ -4561,7 +4691,7 @@ static int s_show(struct seq_file *m, vo
 	seq_printf(m, "%-17s %6lu %6lu %6u %4u %4d", s->name, nr_inuse,
 		   nr_objs, s->size, oo_objects(s->oo),
 		   (1 << oo_order(s->oo)));
-	seq_printf(m, " : tunables %4u %4u %4u", 0, 0, 0);
+	seq_printf(m, " : tunables %4u %4u %4u", s->queue, s->batch, 0);
 	seq_printf(m, " : slabdata %6lu %6lu %6lu", nr_slabs, nr_slabs,
 		   0UL);
 	seq_putc(m, '\n');
Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h	2010-05-20 14:39:20.000000000 -0500
+++ linux-2.6/include/linux/slub_def.h	2010-05-20 14:40:17.000000000 -0500
@@ -76,6 +76,8 @@ struct kmem_cache {
 	int size;		/* The size of an object including meta data */
 	int objsize;		/* The size of an object without meta data */
 	struct kmem_cache_order_objects oo;
+	int queue;		/* per cpu queue size */
+	int batch;		/* batch size */
 	/*
 	 * Avoid an extra cache line for UP, SMP and for the node local to
 	 * struct kmem_cache.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 89+ messages in thread

* [RFC V2 SLEB 09/14] SLED: Get rid of useless function
  2010-05-21 21:14 [RFC V2 SLEB 00/14] The Enhanced(hopefully) Slab Allocator Christoph Lameter
                   ` (7 preceding siblings ...)
  2010-05-21 21:15 ` [RFC V2 SLEB 08/14] SLEB: Resize cpu queue Christoph Lameter
@ 2010-05-21 21:15 ` Christoph Lameter
  2010-05-21 21:15 ` [RFC V2 SLEB 10/14] SLEB: Remove MAX_OBJS limitation Christoph Lameter
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 89+ messages in thread
From: Christoph Lameter @ 2010-05-21 21:15 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: linux-mm

[-- Attachment #1: sled_drop_count_free --]
[-- Type: text/plain, Size: 1749 bytes --]

count_free() == available()

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 mm/slub.c |   11 +++--------
 1 file changed, 3 insertions(+), 8 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2010-05-20 14:40:17.000000000 -0500
+++ linux-2.6/mm/slub.c	2010-05-20 14:40:22.000000000 -0500
@@ -1617,11 +1617,6 @@ static inline int node_match(struct kmem
 	return 1;
 }
 
-static int count_free(struct page *page)
-{
-	return available(page);
-}
-
 static unsigned long count_partial(struct kmem_cache_node *n,
 					int (*get_count)(struct page *))
 {
@@ -1670,7 +1665,7 @@ slab_out_of_memory(struct kmem_cache *s,
 		if (!n)
 			continue;
 
-		nr_free  = count_partial(n, count_free);
+		nr_free  = count_partial(n, available);
 		nr_slabs = node_nr_slabs(n);
 		nr_objs  = node_nr_objs(n);
 
@@ -3802,7 +3797,7 @@ static ssize_t show_slab_objects(struct 
 			x = atomic_long_read(&n->total_objects);
 		else if (flags & SO_OBJECTS)
 			x = atomic_long_read(&n->total_objects) -
-				count_partial(n, count_free);
+				count_partial(n, available);
 
 			else
 				x = atomic_long_read(&n->nr_slabs);
@@ -4683,7 +4678,7 @@ static int s_show(struct seq_file *m, vo
 		nr_partials += n->nr_partial;
 		nr_slabs += atomic_long_read(&n->nr_slabs);
 		nr_objs += atomic_long_read(&n->total_objects);
-		nr_free += count_partial(n, count_free);
+		nr_free += count_partial(n, available);
 	}
 
 	nr_inuse = nr_objs - nr_free;

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 89+ messages in thread

* [RFC V2 SLEB 10/14] SLEB: Remove MAX_OBJS limitation
  2010-05-21 21:14 [RFC V2 SLEB 00/14] The Enhanced(hopefully) Slab Allocator Christoph Lameter
                   ` (8 preceding siblings ...)
  2010-05-21 21:15 ` [RFC V2 SLEB 09/14] SLED: Get rid of useless function Christoph Lameter
@ 2010-05-21 21:15 ` Christoph Lameter
  2010-05-21 21:15 ` [RFC V2 SLEB 11/14] SLEB: Add per node cache (with a fixed size for now) Christoph Lameter
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 89+ messages in thread
From: Christoph Lameter @ 2010-05-21 21:15 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: linux-mm

[-- Attachment #1: sled_unlimited_objects --]
[-- Type: text/plain, Size: 2267 bytes --]

The "inuse" field in the page struct is no longer needed. Extend the
objects field to 32 bits, allowing a practically unlimited number of
objects per slab page.

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 include/linux/mm_types.h |    5 +----
 mm/slub.c                |    7 -------
 2 files changed, 1 insertion(+), 11 deletions(-)

Index: linux-2.6/include/linux/mm_types.h
===================================================================
--- linux-2.6.orig/include/linux/mm_types.h	2010-05-20 14:26:48.000000000 -0500
+++ linux-2.6/include/linux/mm_types.h	2010-05-20 14:40:25.000000000 -0500
@@ -40,10 +40,7 @@ struct page {
 					 * to show when page is mapped
 					 * & limit reverse map searches.
 					 */
-		struct {		/* SLUB */
-			u16 inuse;
-			u16 objects;
-		};
+		u32 objects;		/* SLEB */
 	};
 	union {
 	    struct {
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2010-05-20 14:40:22.000000000 -0500
+++ linux-2.6/mm/slub.c	2010-05-20 14:40:25.000000000 -0500
@@ -152,7 +152,6 @@ static inline int debug_on(struct kmem_c
 
 #define OO_SHIFT	16
 #define OO_MASK		((1 << OO_SHIFT) - 1)
-#define MAX_OBJS_PER_PAGE	65535 /* since page.objects is u16 */
 
 /* Internal SLUB flags */
 #define __OBJECT_POISON		0x80000000UL /* Poison object */
@@ -771,9 +770,6 @@ static int verify_slab(struct kmem_cache
 			max_objects = ((void *)page->freelist - start) / s->size;
 	}
 
-	if (max_objects > MAX_OBJS_PER_PAGE)
-		max_objects = MAX_OBJS_PER_PAGE;
-
 	if (page->objects != max_objects) {
 		slab_err(s, page, "Wrong number of objects. Found %d but "
 			"should be %d", page->objects, max_objects);
@@ -1959,9 +1955,6 @@ static inline int slab_order(int size, i
 	int rem;
 	int min_order = slub_min_order;
 
-	if ((PAGE_SIZE << min_order) / size > MAX_OBJS_PER_PAGE)
-		return get_order(size * MAX_OBJS_PER_PAGE) - 1;
-
 	for (order = max(min_order,
 				fls(min_objects * size - 1) - PAGE_SHIFT);
 			order <= max_order; order++) {

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 89+ messages in thread

* [RFC V2 SLEB 11/14] SLEB: Add per node cache (with a fixed size for now)
  2010-05-21 21:14 [RFC V2 SLEB 00/14] The Enhanced(hopefully) Slab Allocator Christoph Lameter
                   ` (9 preceding siblings ...)
  2010-05-21 21:15 ` [RFC V2 SLEB 10/14] SLEB: Remove MAX_OBJS limitation Christoph Lameter
@ 2010-05-21 21:15 ` Christoph Lameter
  2010-05-21 21:15 ` [RFC V2 SLEB 12/14] SLEB: Make the size of the shared cache configurable Christoph Lameter
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 89+ messages in thread
From: Christoph Lameter @ 2010-05-21 21:15 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: linux-mm

[-- Attachment #1: sled_shared_static --]
[-- Type: text/plain, Size: 3581 bytes --]

The per node cache has the function of the shared cache in SLAB. However,
it will also perform the role of the alien cache in the future.

If the per cpu queues are exhausted then the shared cache will be consulted
first before acquiring objects directly from the slab pages.

On free, objects are first pushed into the shared cache before being
freed directly to the slab pages.

Both methods allow other processes running on the same node to pick up
freed objects that may still be cache hot in shared caches. No
approximation of the actual topology is done though. It is simply assumed
that all processors on a node derive some benefit from acquiring an object
that has been used on another processor.

This uses an initial static size for the shared cache.
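
A minimal userspace model of that hand-off is sketched below (illustrative
only: the fixed capacity stands in for BOOT_QUEUE_SIZE, and shared_lock as
well as the fallback to the slab pages are left out):

#include <stdio.h>
#include <string.h>

#define SHARED_CAP 16			/* stands in for BOOT_QUEUE_SIZE */

struct node_cache {
	int nr;
	void *obj[SHARED_CAP];
};

/* Free path: stash as many objects as fit, return how many were taken. */
static int push_shared(struct node_cache *n, void **obj, int nr)
{
	int d = nr < SHARED_CAP - n->nr ? nr : SHARED_CAP - n->nr;

	if (d > 0) {
		memcpy(n->obj + n->nr, obj, d * sizeof(void *));
		n->nr += d;
	}
	return d;
}

/* Alloc path: refill up to 'batch' objects from the shared cache. */
static int pull_shared(struct node_cache *n, void **obj, int batch)
{
	int d = batch < n->nr ? batch : n->nr;

	if (d > 0) {
		memcpy(obj, n->obj + n->nr - d, d * sizeof(void *));
		n->nr -= d;
	}
	return d;
}

int main(void)
{
	struct node_cache n = { 0 };
	void *freed[4] = { &n, &n, &n, &n }, *got[4];

	printf("pushed %d\n", push_shared(&n, freed, 4));
	printf("pulled %d\n", pull_shared(&n, got, 2));
	return 0;
}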

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 include/linux/slub_def.h |    3 +++
 mm/slub.c                |   41 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 44 insertions(+)

Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h	2010-05-21 13:08:11.000000000 -0500
+++ linux-2.6/include/linux/slub_def.h	2010-05-21 13:08:24.000000000 -0500
@@ -55,6 +55,9 @@ struct kmem_cache_node {
 	atomic_long_t total_objects;
 	struct list_head full;
 #endif
+	int objects;		/* Objects in the per node cache  */
+	spinlock_t shared_lock;	/* Serialization for per node cache */
+	void *object[BOOT_QUEUE_SIZE];
 };
 
 /*
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2010-05-21 13:08:12.000000000 -0500
+++ linux-2.6/mm/slub.c	2010-05-21 13:08:24.000000000 -0500
@@ -1418,6 +1418,22 @@ static struct page *get_partial(struct k
 void drain_objects(struct kmem_cache *s, void **object, int nr)
 {
 	int i;
+	struct kmem_cache_node *n = get_node(s, numa_node_id());
+
+	/* First drain to shared cache if its there */
+	if (n->objects < BOOT_QUEUE_SIZE) {
+		int d;
+
+		spin_lock(&n->shared_lock);
+		d = min(nr, BOOT_QUEUE_SIZE - n->objects);
+		if (d > 0) {
+			memcpy(n->object + n->objects, object, d * sizeof(void *));
+			n->objects += d;
+			nr -= d;
+			object += d;
+		}
+		spin_unlock(&n->shared_lock);
+	}
 
 	for (i = 0 ; i < nr; ) {
 
@@ -1725,6 +1741,29 @@ redo:
 		if (unlikely(!node_match(c, node))) {
 			flush_cpu_objects(s, c);
 			c->node = node;
+		} else {
+			struct kmem_cache_node *n = get_node(s, c->node);
+
+			/*
+			 * Node specified is matching the stuff that we cache,
+			 * so we could retrieve objects from the shared cache
+			 * of the indicated node if there would be anything
+			 * there.
+			 */
+			if (n->objects) {
+				int d;
+
+				spin_lock(&n->shared_lock);
+				d = min(min(s->batch, BOOT_QUEUE_SIZE), n->objects);
+				if (d > 0) {
+					memcpy(c->object + c->objects,
+						n->object + n->objects - d,
+						d * sizeof(void *));
+					n->objects -= d;
+					c->objects += d;
+				}
+				spin_unlock(&n->shared_lock);
+			}
 		}
 
 		while (c->objects < s->batch) {
@@ -2061,6 +2100,8 @@ init_kmem_cache_node(struct kmem_cache_n
 	atomic_long_set(&n->total_objects, 0);
 	INIT_LIST_HEAD(&n->full);
 #endif
+	spin_lock_init(&n->shared_lock);
+	n->objects = 0;
 }
 
 static DEFINE_PER_CPU(struct kmem_cache_cpu, kmalloc_percpu[KMALLOC_CACHES]);

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 89+ messages in thread

* [RFC V2 SLEB 12/14] SLEB: Make the size of the shared cache configurable
  2010-05-21 21:14 [RFC V2 SLEB 00/14] The Enhanced(hopefully) Slab Allocator Christoph Lameter
                   ` (10 preceding siblings ...)
  2010-05-21 21:15 ` [RFC V2 SLEB 11/14] SLEB: Add per node cache (with a fixed size for now) Christoph Lameter
@ 2010-05-21 21:15 ` Christoph Lameter
  2010-05-21 21:15 ` [RFC V2 SLEB 13/14] SLEB: Enhanced NUMA support Christoph Lameter
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 89+ messages in thread
From: Christoph Lameter @ 2010-05-21 21:15 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: linux-mm

[-- Attachment #1: sled_shared_dynamic --]
[-- Type: text/plain, Size: 6393 bytes --]

This makes the size of the shared array configurable. Note that this is a bit
problematic and there are likely unresolved race conditions: the
kmem_cache->node[x] pointers become unstable if interrupts are allowed.
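
As a rough userspace model (not the kernel code) of the replace-and-copy
step: build the new node structure, copy the counters and the queued
object pointers, publish the new pointer, then free the old one. The race
window while somebody still holds the old pointer is exactly what remains
unresolved, so no locking is attempted here:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct node_cache {
	int capacity;
	int nr;
	void *obj[];		/* flexible array sized by 'capacity' */
};

static struct node_cache *resize_node(struct node_cache **slot, int capacity)
{
	struct node_cache *old = *slot;
	struct node_cache *new = malloc(sizeof(*new) +
					capacity * sizeof(void *));

	if (!new)
		return old;

	new->capacity = capacity;
	new->nr = old->nr < capacity ? old->nr : capacity;
	memcpy(new->obj, old->obj, new->nr * sizeof(void *));

	*slot = new;		/* readers must not cache the old pointer */
	free(old);
	return new;
}

int main(void)
{
	struct node_cache *n = malloc(sizeof(*n) + 4 * sizeof(void *));

	n->capacity = 4;
	n->nr = 2;
	n->obj[0] = n->obj[1] = n;

	resize_node(&n, 8);
	printf("capacity %d, nr %d\n", n->capacity, n->nr);
	free(n);
	return 0;
}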

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 include/linux/slub_def.h |    3 +
 mm/slub.c                |  133 +++++++++++++++++++++++++++++++++++++++--------
 2 files changed, 116 insertions(+), 20 deletions(-)

Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h	2010-05-21 13:17:14.000000000 -0500
+++ linux-2.6/include/linux/slub_def.h	2010-05-21 13:47:41.000000000 -0500
@@ -81,11 +81,14 @@ struct kmem_cache {
 	struct kmem_cache_order_objects oo;
 	int queue;		/* per cpu queue size */
 	int batch;		/* batch size */
+	int shared;		/* Shared queue size */
+#ifndef CONFIG_NUMA
 	/*
 	 * Avoid an extra cache line for UP, SMP and for the node local to
 	 * struct kmem_cache.
 	 */
 	struct kmem_cache_node local_node;
+#endif
 
 	/* Allocation and freeing of slabs */
 	struct kmem_cache_order_objects max;
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2010-05-21 13:17:14.000000000 -0500
+++ linux-2.6/mm/slub.c	2010-05-21 13:48:01.000000000 -0500
@@ -1754,7 +1754,7 @@ redo:
 				int d;
 
 				spin_lock(&n->shared_lock);
-				d = min(min(s->batch, BOOT_QUEUE_SIZE), n->objects);
+				d = min(min(s->batch, s->shared), n->objects);
 				if (d > 0) {
 					memcpy(c->object + c->objects,
 						n->object + n->objects - d,
@@ -1864,6 +1864,7 @@ void *kmem_cache_alloc_node(struct kmem_
 	return ret;
 }
 EXPORT_SYMBOL(kmem_cache_alloc_node);
+
 #endif
 
 #ifdef CONFIG_TRACING
@@ -2176,10 +2177,7 @@ static void free_kmem_cache_nodes(struct
 	int node;
 
 	for_each_node_state(node, N_NORMAL_MEMORY) {
-		struct kmem_cache_node *n = s->node[node];
-
-		if (n && n != &s->local_node)
-			kfree(n);
+		kfree(s->node[node]);
 		s->node[node] = NULL;
 	}
 }
@@ -2197,27 +2195,96 @@ static int init_kmem_cache_nodes(struct 
 	for_each_node_state(node, N_NORMAL_MEMORY) {
 		struct kmem_cache_node *n;
 
-		if (local_node == node)
-			n = &s->local_node;
-		else {
-			if (slab_state == DOWN) {
-				early_kmem_cache_node_alloc(gfpflags, node);
-				continue;
-			}
-			n = kmalloc_node(sizeof(struct kmem_cache_node), gfpflags,
-				node);
-
-			if (!n) {
-				free_kmem_cache_nodes(s);
-				return 0;
-			}
+		if (slab_state == DOWN) {
+			early_kmem_cache_node_alloc(gfpflags, node);
+			continue;
+		}
+		n = kmalloc_node(sizeof(struct kmem_cache_node), gfpflags,
+			node);
 
+		if (!n) {
+			free_kmem_cache_nodes(s);
+			return 0;
 		}
 		s->node[node] = n;
 		init_kmem_cache_node(n, s);
 	}
 	return 1;
 }
+
+static void resize_shared_queue(struct kmem_cache *s, int shared)
+{
+
+	if (is_kmalloc_cache(s)) {
+		if (shared < BOOT_QUEUE_SIZE) {
+			s->shared = shared;
+		} else {
+			/* More than max. Go to max allowed */
+			s->shared = BOOT_QUEUE_SIZE;
+		}
+	} else {
+		int node;
+
+		/* Create the new cpu queue and then free the old one */
+		down_write(&slub_lock);
+
+		/* We can only shrink the queue here since the new
+		 * queue size may be smaller and there may be concurrent
+		 * slab operations. The update of the queue must be seen
+		 * before the change of the location of the percpu queue.
+		 *
+		 * Note that the queue may contain more objects than the
+		 * queue size after this operation.
+		 */
+		if (shared < s->shared) {
+			s->shared = shared;
+			barrier();
+		}
+
+
+		/* Serialization has not been worked out yet */
+		for_each_online_node(node) {
+			struct kmem_cache_node *n = get_node(s, node);
+			struct kmem_cache_node *nn =
+				kmalloc_node(sizeof(struct kmem_cache_node),
+					GFP_KERNEL, node);
+
+			init_kmem_cache_node(nn, s);
+			s->node[node] = nn;
+
+			spin_lock(&nn->list_lock);
+			list_move(&n->partial, &nn->partial);
+#ifdef CONFIG_SLUB_DEBUG
+			list_move(&n->full, &nn->full);
+#endif
+			spin_unlock(&nn->list_lock);
+
+			nn->nr_partial = n->nr_partial;
+#ifdef CONFIG_SLUB_DEBUG
+			nn->nr_slabs = n->nr_slabs;
+			nn->total_objects = n->total_objects;
+#endif
+
+			spin_lock(&nn->shared_lock);
+			nn->objects = n->objects;
+			memcpy(&nn->object, n->object, nn->objects * sizeof(void *));
+			spin_unlock(&nn->shared_lock);
+
+			kfree(n);
+		}
+		/*
+		 * If the queue needs to be extended then we deferred
+		 * the update until now when the larger sized queue
+		 * has been allocated and is working.
+		 */
+		if (shared > s->shared)
+			s->shared = shared;
+
+		up_write(&slub_lock);
+	}
+}
+
 #else
 static void free_kmem_cache_nodes(struct kmem_cache *s)
 {
@@ -3989,6 +4056,31 @@ static ssize_t cpu_queue_size_store(stru
 }
 SLAB_ATTR(cpu_queue_size);
 
+#ifdef CONFIG_NUMA
+static ssize_t shared_queue_size_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%u\n", s->shared);
+}
+
+static ssize_t shared_queue_size_store(struct kmem_cache *s,
+			 const char *buf, size_t length)
+{
+	unsigned long queue;
+	int err;
+
+	err = strict_strtoul(buf, 10, &queue);
+	if (err)
+		return err;
+
+	if (queue > 10000 || queue < s->batch)
+		return -EINVAL;
+
+	resize_shared_queue(s, queue);
+	return length;
+}
+SLAB_ATTR(shared_queue_size);
+#endif
+
 static ssize_t cpu_batch_size_show(struct kmem_cache *s, char *buf)
 {
 	return sprintf(buf, "%u\n", s->batch);
@@ -4388,6 +4480,7 @@ static struct attribute *slab_attrs[] = 
 	&cache_dma_attr.attr,
 #endif
 #ifdef CONFIG_NUMA
+	&shared_queue_size_attr.attr,
 	&remote_node_defrag_ratio_attr.attr,
 #endif
 #ifdef CONFIG_SLUB_STATS
@@ -4720,7 +4813,7 @@ static int s_show(struct seq_file *m, vo
 	seq_printf(m, "%-17s %6lu %6lu %6u %4u %4d", s->name, nr_inuse,
 		   nr_objs, s->size, oo_objects(s->oo),
 		   (1 << oo_order(s->oo)));
-	seq_printf(m, " : tunables %4u %4u %4u", s->queue, s->batch, 0);
+	seq_printf(m, " : tunables %4u %4u %4u", s->queue, s->batch, s->shared);
 	seq_printf(m, " : slabdata %6lu %6lu %6lu", nr_slabs, nr_slabs,
 		   0UL);
 	seq_putc(m, '\n');

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 89+ messages in thread

* [RFC V2 SLEB 13/14] SLEB: Enhanced NUMA support
  2010-05-21 21:14 [RFC V2 SLEB 00/14] The Enhanced(hopefully) Slab Allocator Christoph Lameter
                   ` (11 preceding siblings ...)
  2010-05-21 21:15 ` [RFC V2 SLEB 12/14] SLEB: Make the size of the shared cache configurable Christoph Lameter
@ 2010-05-21 21:15 ` Christoph Lameter
  2010-05-21 21:15 ` [RFC V2 SLEB 14/14] SLEB: Allocate off node objects from remote shared caches Christoph Lameter
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 89+ messages in thread
From: Christoph Lameter @ 2010-05-21 21:15 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: linux-mm

[-- Attachment #1: sled_numa --]
[-- Type: text/plain, Size: 3885 bytes --]

Before this patch all queues in SLEB may contain mixed objects (from any node).
This remains the case even with this patch unless the slab cache has
SLAB_MEM_SPREAD set.

For SLAB_MEM_SPREAD slabs an ordering by locality is enforced and objects are
managed per NUMA node (like SLAB). Cpu queues then only contain objects from
the local node. Alien objects (from non-local nodes) are freed into the shared
cache of the remote node (this avoids alien caches but introduces cache cold
objects into the shared cache).

This also adds object level NUMA functionality like in SLAB that can be
managed via cpusets or memory policies.
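
The node selection itself boils down to roughly the following (userspace
sketch; the *_hint() helpers are hypothetical placeholders for the
cpuset_mem_spread_node()/slab_node() calls used in the patch, and
SLAB_NODE_UNSPECIFIED is modelled as -1):

#include <stdio.h>

#define NODE_UNSPECIFIED	(-1)

static int mem_spread_cache;	/* cache created with SLAB_MEM_SPREAD? */
static int in_irq_context;	/* models in_interrupt() */

static int cpuset_spread_hint(void) { return 1; }	/* placeholder */
static int mempolicy_hint(void) { return -1; }		/* placeholder */

static int find_target_node(int requested)
{
	if (mem_spread_cache && !in_irq_context &&
	    requested == NODE_UNSPECIFIED) {
		int node = cpuset_spread_hint();

		if (node >= 0)
			return node;
		node = mempolicy_hint();
		if (node >= 0)
			return node;
	}
	return requested;	/* explicit requests are honoured as-is */
}

int main(void)
{
	mem_spread_cache = 1;
	printf("unspecified -> node %d\n", find_target_node(NODE_UNSPECIFIED));
	printf("explicit 3  -> node %d\n", find_target_node(3));
	return 0;
}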

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 mm/slub.c |   70 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 70 insertions(+)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2010-05-20 16:57:14.000000000 -0500
+++ linux-2.6/mm/slub.c	2010-05-20 16:57:37.000000000 -0500
@@ -1718,6 +1718,24 @@ void retrieve_objects(struct kmem_cache 
 	}
 }
 
+static inline int find_numa_node(struct kmem_cache *s, int selected_node)
+{
+#ifdef CONFIG_NUMA
+	if (s->flags & SLAB_MEM_SPREAD &&
+			!in_interrupt() &&
+			selected_node == SLAB_NODE_UNSPECIFIED) {
+
+		if (cpuset_do_slab_mem_spread())
+			return cpuset_mem_spread_node();
+
+		if (current->mempolicy)
+			return slab_node(current->mempolicy);
+	}
+#endif
+	return selected_node;
+}
+
+
 static void *slab_alloc(struct kmem_cache *s,
 		gfp_t gfpflags, int node, unsigned long addr)
 {
@@ -1732,6 +1750,7 @@ static void *slab_alloc(struct kmem_cach
 		return NULL;
 
 redo:
+	node = find_numa_node(s, node);
 	local_irq_save(flags);
 	c = __this_cpu_ptr(s->cpu_slab);
 	if (unlikely(!c->objects || !node_match(c, node))) {
@@ -1877,6 +1896,54 @@ void *kmem_cache_alloc_node_notrace(stru
 EXPORT_SYMBOL(kmem_cache_alloc_node_notrace);
 #endif
 
+int numa_off_node_free(struct kmem_cache *s, void *x)
+{
+#ifdef CONFIG_NUMA
+	if (s->flags & SLAB_MEM_SPREAD) {
+		int node = page_to_nid(virt_to_page(x));
+		/*
+		 * Slab requires object level control of locality. We can only
+		 * keep objects from the local node in the per cpu queue;
+		 * foreign objects must not be freed to the queue.
+		 *
+		 * If we encounter a free of an off node object then we free
+		 * it to the shared cache of that node. This places a cache
+		 * cold object into that queue though. But using the queue
+		 * is much more effective than going directly into the slab.
+		 *
+		 * Alternate approach: Call drain_objects directly for a single
+		 * object. (Drain objects would have to be fixed to not save
+		 * to the local shared mem cache by default).
+		 */
+		if (node != numa_node_id()) {
+			struct kmem_cache_node *n = get_node(s, node);
+redo:
+			if (n->objects >= s->shared) {
+				int t = min(s->batch, n->objects);
+
+				drain_objects(s, n->object, t);
+
+				n->objects -= t;
+				if (n->objects)
+					memmove(n->object, n->object + t,
+						n->objects * sizeof(void *));
+			}
+			spin_lock(&n->shared_lock);
+			if (n->objects < s->shared) {
+				n->object[n->objects++] = x;
+				x = NULL;
+			}
+			spin_unlock(&n->shared_lock);
+			if (x)
+				goto redo;
+			return 1;
+		}
+	}
+#endif
+	return 0;
+}
+
+
 static void slab_free(struct kmem_cache *s,
 			void *x, unsigned long addr)
 {
@@ -1895,6 +1962,9 @@ static void slab_free(struct kmem_cache 
 	if (!(s->flags & SLAB_DEBUG_OBJECTS))
 		debug_check_no_obj_freed(object, s->objsize);
 
+	if (numa_off_node_free(s, x))
+		goto out;
+
 	if (unlikely(c->objects >= s->queue)) {
 
 		int t = min(s->batch, c->objects);

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 89+ messages in thread

* [RFC V2 SLEB 14/14] SLEB: Allocate off node objects from remote shared caches
  2010-05-21 21:14 [RFC V2 SLEB 00/14] The Enhanced(hopefully) Slab Allocator Christoph Lameter
                   ` (12 preceding siblings ...)
  2010-05-21 21:15 ` [RFC V2 SLEB 13/14] SLEB: Enhanced NUMA support Christoph Lameter
@ 2010-05-21 21:15 ` Christoph Lameter
  2010-05-22  8:37 ` [RFC V2 SLEB 00/14] The Enhanced(hopefully) Slab Allocator Pekka Enberg
  2010-05-24  7:03 ` Nick Piggin
  15 siblings, 0 replies; 89+ messages in thread
From: Christoph Lameter @ 2010-05-21 21:15 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: linux-mm

[-- Attachment #1: sled_off_node_from_shared --]
[-- Type: text/plain, Size: 7316 bytes --]

This is in a draft state.

Leave the cpu queue alone for off node accesses and go directly to the
remote shared cache for allocations.
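
For reference, a userspace-only illustration of the choice left under
"#if 0" in the draft below: pop the hot object from the tail of the remote
shared array, or take the cold one from the head and shift the rest down.
The draft prefers the cold object so the remote node keeps its cache hot
ones; the helper names here are made up:

#include <stdio.h>
#include <string.h>

/* Take the coldest object: head of the array, shift the remainder down. */
static void *take_cold(void **obj, int *nr)
{
	void *x = obj[0];

	(*nr)--;
	memmove(obj, obj + 1, *nr * sizeof(void *));
	return x;
}

/* Take the hottest object: pop from the tail. */
static void *take_hot(void **obj, int *nr)
{
	return obj[--(*nr)];
}

int main(void)
{
	int a = 1, b = 2, c = 3;
	void *q[3] = { &a, &b, &c };
	int nr = 3;

	printf("cold: %d\n", *(int *)take_cold(q, &nr));
	printf("hot:  %d\n", *(int *)take_hot(q, &nr));
	return 0;
}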

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 include/linux/slub_def.h |    1 
 mm/slub.c                |  184 ++++++++++++++++++++++++++++++++++++-----------
 2 files changed, 142 insertions(+), 43 deletions(-)

Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h	2010-05-21 15:30:47.000000000 -0500
+++ linux-2.6/include/linux/slub_def.h	2010-05-21 15:34:45.000000000 -0500
@@ -42,7 +42,6 @@ struct kmem_cache_cpu {
 	unsigned stat[NR_SLUB_STAT_ITEMS];
 #endif
 	int objects;		/* Number of objects available */
-	int node;		/* The node of the page (or -1 for debug) */
 	void *object[BOOT_QUEUE_SIZE];		/* List of objects */
 };
 
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2010-05-21 15:30:47.000000000 -0500
+++ linux-2.6/mm/slub.c	2010-05-21 15:37:04.000000000 -0500
@@ -1616,19 +1616,6 @@ static void resize_cpu_queue(struct kmem
 	}
 }
 
-/*
- * Check if the objects in a per cpu structure fit numa
- * locality expectations.
- */
-static inline int node_match(struct kmem_cache_cpu *c, int node)
-{
-#ifdef CONFIG_NUMA
-	if (node != -1 && c->node != node)
-		return 0;
-#endif
-	return 1;
-}
-
 static unsigned long count_partial(struct kmem_cache_node *n,
 					int (*get_count)(struct page *))
 {
@@ -1718,9 +1705,9 @@ void retrieve_objects(struct kmem_cache 
 	}
 }
 
+#ifdef CONFIG_NUMA
 static inline int find_numa_node(struct kmem_cache *s, int selected_node)
 {
-#ifdef CONFIG_NUMA
 	if (s->flags & SLAB_MEM_SPREAD &&
 			!in_interrupt() &&
 			selected_node == SLAB_NODE_UNSPECIFIED) {
@@ -1731,10 +1718,113 @@ static inline int find_numa_node(struct 
 		if (current->mempolicy)
 			return slab_node(current->mempolicy);
 	}
-#endif
 	return selected_node;
 }
 
+/*
+ * Try to allocate a partial slab from a specific node.
+ */
+static struct page *__get_partial_node(struct kmem_cache_node *n)
+{
+	struct page *page;
+
+	if (!n->nr_partial)
+		return NULL;
+
+	list_for_each_entry(page, &n->partial, lru)
+		if (lock_and_freeze_slab(n, page))
+			goto out;
+	page = NULL;
+out:
+	return page;
+}
+
+
+void *off_node_alloc(struct kmem_cache *s, int node, gfp_t gfpflags)
+{
+	void *object = NULL;
+	struct kmem_cache_node *n = get_node(s, node);
+
+	spin_lock(&n->shared_lock);
+
+	while (!object) {
+		/* Direct allocation from remote shared cache */
+		if (n->objects) {
+#if 0
+			/* Taking a hot object remotely  */
+			object = n->object[--n->objects];
+#else
+			/* Take a cold object from the remote shared cache */
+			object = n->object[0];
+			n->objects--;
+			memmove(n->object, n->object + 1, n->objects * sizeof(void *));
+#endif
+			break;
+		}
+
+		while (n->objects < s->batch) {
+			struct page *new;
+			int d;
+
+			/* Should be getting cold remote page !! This is hot */
+			new = __get_partial_node(n);
+			if (unlikely(!new)) {
+
+				spin_unlock(&n->shared_lock);
+
+				if (gfpflags & __GFP_WAIT)
+					local_irq_enable();
+
+				new = new_slab(s, gfpflags, node);
+
+				if (gfpflags & __GFP_WAIT)
+					local_irq_disable();
+
+				spin_lock(&n->shared_lock);
+
+ 				if (!new)
+					goto out;
+
+				stat(s, ALLOC_SLAB);
+				slab_lock(new);
+			} else
+				stat(s, ALLOC_FROM_PARTIAL);
+
+			d = min(s->batch - n->objects, available(new));
+			retrieve_objects(s, new, n->object + n->objects, d);
+			n->objects += d;
+
+			if (!all_objects_used(new))
+
+				add_partial(get_node(s, page_to_nid(new)), new, 1);
+
+			else
+				add_full(s, get_node(s, page_to_nid(new)), new);
+
+			slab_unlock(new);
+		}
+	}
+out:
+	spin_unlock(&n->shared_lock);
+	return object;
+}
+
+/*
+ * Check if the objects in a per cpu structure fit numa
+ * locality expectations.
+ */
+static inline int node_local(int node)
+{
+	if (node != -1 && numa_node_id() != node)
+		return 0;
+	return 1;
+}
+
+#else
+static inline int find_numa_node(struct kmem_cache *s, int selected_node) { return selected_node; }
+static inline void *off_node_alloc(struct kmem_cache *s, int node, gfp_t gfpflags) { return NULL; }
+static inline int node_local(int node) { return 1; }
+#endif
 
 static void *slab_alloc(struct kmem_cache *s,
 		gfp_t gfpflags, int node, unsigned long addr)
@@ -1753,36 +1843,41 @@ redo:
 	node = find_numa_node(s, node);
 	local_irq_save(flags);
 	c = __this_cpu_ptr(s->cpu_slab);
-	if (unlikely(!c->objects || !node_match(c, node))) {
+	if (unlikely(!c->objects || !node_local(node))) {
+
+		struct kmem_cache_node *n;
 
 		gfpflags &= gfp_allowed_mask;
 
-		if (unlikely(!node_match(c, node))) {
-			flush_cpu_objects(s, c);
-			c->node = node;
-		} else {
-			struct kmem_cache_node *n = get_node(s, c->node);
+		if (unlikely(!node_local(node))) {
+			object = off_node_alloc(s, node, gfpflags);
+			if (!object)
+				goto oom;
+			else
+				goto got_object;
+		}
 
-			/*
-			 * Node specified is matching the stuff that we cache,
-			 * so we could retrieve objects from the shared cache
-			 * of the indicated node if there would be anything
-			 * there.
-			 */
-			if (n->objects) {
-				int d;
+		n = get_node(s, numa_node_id());
 
-				spin_lock(&n->shared_lock);
-				d = min(min(s->batch, s->shared), n->objects);
-				if (d > 0) {
-					memcpy(c->object + c->objects,
-						n->object + n->objects - d,
-						d * sizeof(void *));
-					n->objects -= d;
-					c->objects += d;
-				}
-				spin_unlock(&n->shared_lock);
+		/*
+		 * Node specified is matching the stuff that we cache,
+		 * so we could retrieve objects from the shared cache
+		 * of the indicated node if there would be anything
+		 * there.
+		 */
+		if (n->objects) {
+			int d;
+
+			spin_lock(&n->shared_lock);
+			d = min(min(s->batch, s->shared), n->objects);
+			if (d > 0) {
+				memcpy(c->object + c->objects,
+					n->object + n->objects - d,
+					d * sizeof(void *));
+				n->objects -= d;
+				c->objects += d;
 			}
+			spin_unlock(&n->shared_lock);
 		}
 
 		while (c->objects < s->batch) {
@@ -1833,6 +1928,8 @@ redo:
 
 	object = c->object[--c->objects];
 
+got_object:
+
 	if (unlikely(debug_on(s))) {
 		if (!alloc_debug_processing(s, object, addr))
 			goto redo;
@@ -1962,8 +2059,10 @@ static void slab_free(struct kmem_cache 
 	if (!(s->flags & SLAB_DEBUG_OBJECTS))
 		debug_check_no_obj_freed(object, s->objsize);
 
+#ifdef CONFIG_NUMA
 	if (numa_off_node_free(s, x))
 		goto out;
+#endif
 
 	if (unlikely(c->objects >= s->queue)) {
 
@@ -3941,8 +4040,9 @@ static ssize_t show_slab_objects(struct 
 
 		for_each_possible_cpu(cpu) {
 			struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
+			int node = cpu_to_node(cpu);
 
-			if (!c || c->node < 0)
+			if (!c)
 				continue;
 
 			if (c->objects) {
@@ -3954,9 +4054,9 @@ static ssize_t show_slab_objects(struct 
 					x = 1;
 
 				total += x;
-				nodes[c->node] += x;
+				nodes[node] += x;
 			}
-			per_cpu[c->node]++;
+			per_cpu[node]++;
 		}
 	}
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC V2 SLEB 00/14] The Enhanced(hopefully) Slab Allocator
  2010-05-21 21:14 [RFC V2 SLEB 00/14] The Enhanced(hopefully) Slab Allocator Christoph Lameter
                   ` (13 preceding siblings ...)
  2010-05-21 21:15 ` [RFC V2 SLEB 14/14] SLEB: Allocate off node objects from remote shared caches Christoph Lameter
@ 2010-05-22  8:37 ` Pekka Enberg
  2010-05-24  7:03 ` Nick Piggin
  15 siblings, 0 replies; 89+ messages in thread
From: Pekka Enberg @ 2010-05-22  8:37 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-mm, David Rientjes, Zhang Yanmin, Nick Piggin

On Sat, May 22, 2010 at 12:14 AM, Christoph Lameter <cl@linux.com> wrote:
> SLEB is a merging of SLUB with some queuing concepts from SLAB and a new way
> of managing objects in the slabs using bitmaps. It uses a percpu queue so that
> free operations can be properly buffered and a bitmap for managing the
> free/allocated state in the slabs. It is slightly more inefficient than
> SLUB (due to the need to place large bitmaps --sized a few words--in some
> slab pages if there are more than BITS_PER_LONG objects in a slab page) but
> in general does compete well with SLUB (and therefore also with SLOB)
> in terms of memory wastage.

I merged patches 1-7 to "sleb/core" branch of slab.git if people want
to test them:

http://git.kernel.org/?p=linux/kernel/git/penberg/slab-2.6.git;a=shortlog;h=refs/heads/sleb/core

I didn't put them in linux-next for obvious reasons.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC V2 SLEB 00/14] The Enhanced(hopefully) Slab Allocator
  2010-05-21 21:14 [RFC V2 SLEB 00/14] The Enhanced(hopefully) Slab Allocator Christoph Lameter
                   ` (14 preceding siblings ...)
  2010-05-22  8:37 ` [RFC V2 SLEB 00/14] The Enhanced(hopefully) Slab Allocator Pekka Enberg
@ 2010-05-24  7:03 ` Nick Piggin
  2010-05-24 15:06   ` Christoph Lameter
  15 siblings, 1 reply; 89+ messages in thread
From: Nick Piggin @ 2010-05-24  7:03 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Pekka Enberg, linux-mm

Well I'm glad you've conceded that queues are useful for high
performance computing, and that higher order allocations are not
a free and unlimited resource.

I hope we can move forward now with some objective, testable
comparisons and criteria for selecting one main slab allocator.

On Fri, May 21, 2010 at 04:14:52PM -0500, Christoph Lameter wrote:
> (V2 some more work as time permitted this week)
> 
> SLEB is a merging of SLUB with some queuing concepts from SLAB and a new way
> of managing objects in the slabs using bitmaps. It uses a percpu queue so that
> free operations can be properly buffered and a bitmap for managing the
> free/allocated state in the slabs. It is slightly more inefficient than
> SLUB (due to the need to place large bitmaps --sized a few words--in some
> slab pages if there are more than BITS_PER_LONG objects in a slab page) but
> in general does compete well with SLUB (and therefore also with SLOB) 
> in terms of memory wastage.
> 
> It does not have the excessive memory requirements of SLAB because
> there is no slab management structure nor alien caches. Under NUMA
> the remote shared caches are used instead (which may have its issues).
> 
> The SLAB scheme of not touching the object during management is adopted.
> SLEB can efficiently free and allocate cache cold objects without
> causing cache misses.
> 
> There are numerous SLAB schemes that are not supported. Those could be
> added if needed and if they really make a difference.
> 
> WARNING: This only ran successfully using hackbench in kvm instances so far.
> But works with NUMA, SMP and UP there.
> 
> V1->V2 Add NUMA capabilities. Refine queue size configurations (not complete).
>    Test in UP, SMP, NUMA
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC V2 SLEB 00/14] The Enhanced(hopefully) Slab Allocator
  2010-05-24  7:03 ` Nick Piggin
@ 2010-05-24 15:06   ` Christoph Lameter
  2010-05-25  2:06     ` Nick Piggin
  0 siblings, 1 reply; 89+ messages in thread
From: Christoph Lameter @ 2010-05-24 15:06 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Christoph Lameter, Pekka Enberg, linux-mm

On Mon, 24 May 2010, Nick Piggin wrote:

> Well I'm glad you've conceded that queues are useful for high
> performance computing, and that higher order allocations are not
> a free and unlimited resource.

Ahem. I have never made any such claim and would never make them. And
"conceding" something ???

The "unqueueing" was the result of excessive queue handling in SLAB, and
the higher order allocations are a natural move in HPC to gain performance.

> I hope we can move forward now with some objective, testable
> comparisons and criteria for selecting one main slab allocator.

If we can find criteria that are universally agreed upon then yes, but that is
doubtful.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC V2 SLEB 00/14] The Enhanced(hopefully) Slab Allocator
  2010-05-24 15:06   ` Christoph Lameter
@ 2010-05-25  2:06     ` Nick Piggin
  2010-05-25  6:55       ` Pekka Enberg
  2010-05-25 14:13       ` Christoph Lameter
  0 siblings, 2 replies; 89+ messages in thread
From: Nick Piggin @ 2010-05-25  2:06 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Christoph Lameter, Pekka Enberg, linux-mm

On Mon, May 24, 2010 at 10:06:08AM -0500, Christoph Lameter wrote:
> On Mon, 24 May 2010, Nick Piggin wrote:
> 
> > Well I'm glad you've conceded that queues are useful for high
> > performance computing, and that higher order allocations are not
> > a free and unlimited resource.
> 
> Ahem. I have never made any such claim and would never make them. And
> "conceding" something ???

Well, you were quite vocal about the subject.

 
> The "unqueueing" was the result of excessive queue handling in SLAB, and
> the higher order allocations are a natural move in HPC to gain performance.

This is the kind of handwaving that needs to be put into a testable
form. I repeatedly asked you for examples of where the jitter is
excessive or where the TLB improvements help, but you never provided
any testable case. I'm not saying they don't exist, but we have to be
rational about this.

 
> > I hope we can move forward now with some objective, testable
> > comparisons and criteria for selecting one main slab allocator.
> 
> > If we can find criteria that are universally agreed upon then yes, but that is
> doubtful.

I think we can agree that perfect is the enemy of good, and that no
allocator will do the perfect thing for everybody. I think we have to
come up with a way to a single allocator.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC V2 SLEB 00/14] The Enhanced(hopefully) Slab Allocator
  2010-05-25  2:06     ` Nick Piggin
@ 2010-05-25  6:55       ` Pekka Enberg
  2010-05-25  7:07         ` Nick Piggin
  2010-05-25 14:13       ` Christoph Lameter
  1 sibling, 1 reply; 89+ messages in thread
From: Pekka Enberg @ 2010-05-25  6:55 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Christoph Lameter, Christoph Lameter, linux-mm

On Tue, May 25, 2010 at 5:06 AM, Nick Piggin <npiggin@suse.de> wrote:
>> If we can find criteria that are universally agreed upon then yes, but that is
>> doubtful.
>
> I think we can agree that perfect is the enemy of good, and that no
> allocator will do the perfect thing for everybody. I think we have to
> come up with a way to a single allocator.

Yes. The most interesting bit about SLEB for me is the
freelist handling as bitmaps, not necessarily the "queuing" part. If
the latter also helps some workloads, it's a bonus for sure.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC V2 SLEB 00/14] The Enhanced(hopefully) Slab Allocator
  2010-05-25  6:55       ` Pekka Enberg
@ 2010-05-25  7:07         ` Nick Piggin
  2010-05-25  8:03             ` Pekka Enberg
  0 siblings, 1 reply; 89+ messages in thread
From: Nick Piggin @ 2010-05-25  7:07 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: Christoph Lameter, Christoph Lameter, linux-mm

On Tue, May 25, 2010 at 09:55:28AM +0300, Pekka Enberg wrote:
> On Tue, May 25, 2010 at 5:06 AM, Nick Piggin <npiggin@suse.de> wrote:
> >> If we can find criteria that are universally agreed upon then yes, but that is
> >> doubtful.
> >
> > I think we can agree that perfect is the enemy of good, and that no
> > allocator will do the perfect thing for everybody. I think we have to
> > come up with a way to a single allocator.
> 
> Yes. The most interesting bit about SLEB for me is the
> freelist handling as bitmaps, not necessarily the "queuing" part. If
> the latter also helps some workloads, it's a bonus for sure.

Agreed it is all interesting, but I think we have to have a rational
path toward having just one.

There is nothing to stop incremental changes or tweaks on top of that
allocator, even to the point of completely changing the allocation
scheme. It is inevitable that with changes in workloads, SMP/NUMA, and
cache/memory costs and hierarchies, the best slab allocation schemes
will change over time.

I think it is more important to have one allocator than trying to get
the absolute most perfect one for everybody. That way changes are
carefully and slowly reviewed and merged, with results to justify the
change. This way everybody is testing the same thing, and bisection will
work. The situation with SLUB is already a nightmare because now each
allocator has half the testing and half the work put into it.


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC V2 SLEB 00/14] The Enhanced(hopefully) Slab Allocator
  2010-05-25  7:07         ` Nick Piggin
@ 2010-05-25  8:03             ` Pekka Enberg
  0 siblings, 0 replies; 89+ messages in thread
From: Pekka Enberg @ 2010-05-25  8:03 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Christoph Lameter, Christoph Lameter, linux-mm, LKML,
	Andrew Morton, Linus Torvalds, David Rientjes, Zhang Yanmin,
	Matthew Wilcox, Matt Mackall

Hi Nick,

On Tue, May 25, 2010 at 10:07 AM, Nick Piggin <npiggin@suse.de> wrote:
> There is nothing to stop incremental changes or tweaks on top of that
> allocator, even to the point of completely changing the allocation
> scheme. It is inevitable that with changes in workloads, SMP/NUMA, and
> cache/memory costs and hierarchies, the best slab allocation schemes
> will change over time.

Agreed.

On Tue, May 25, 2010 at 10:07 AM, Nick Piggin <npiggin@suse.de> wrote:
> I think it is more important to have one allocator than trying to get
> the absolute most perfect one for everybody. That way changes are
> carefully and slowly reviewed and merged, with results to justify the
> change. This way everybody is testing the same thing, and bisection will
> work. The situation with SLUB is already a nightmare because now each
> allocator has half the testing and half the work put into it.

I wouldn't say it's a nightmare, but yes, it could be better. From my
point of view SLUB is the base of whatever the future will be because
the code is much cleaner and simpler than SLAB. That's why I find
Christoph's work on SLEB more interesting than SLQB, for example,
because it's building on top of something that's mature and stable.

That said, are you proposing that even without further improvements to
SLUB, we should go ahead and, for example, remove SLAB from Kconfig
for v2.6.36 and see if we can just delete the whole thing from, say,
v2.6.38?

                        Pekka

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC V2 SLEB 00/14] The Enhanced(hopefully) Slab Allocator
  2010-05-25  8:03             ` Pekka Enberg
@ 2010-05-25  8:16               ` Nick Piggin
  -1 siblings, 0 replies; 89+ messages in thread
From: Nick Piggin @ 2010-05-25  8:16 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Christoph Lameter, Christoph Lameter, linux-mm, LKML,
	Andrew Morton, Linus Torvalds, David Rientjes, Zhang Yanmin,
	Matthew Wilcox, Matt Mackall

On Tue, May 25, 2010 at 11:03:49AM +0300, Pekka Enberg wrote:
> Hi Nick,
> 
> On Tue, May 25, 2010 at 10:07 AM, Nick Piggin <npiggin@suse.de> wrote:
> > There is nothing to stop incremental changes or tweaks on top of that
> > allocator, even to the point of completely changing the allocation
> > scheme. It is inevitable that with changes in workloads, SMP/NUMA, and
> > cache/memory costs and hierarchies, the best slab allocation schemes
> > will change over time.
> 
> Agreed.
> 
> On Tue, May 25, 2010 at 10:07 AM, Nick Piggin <npiggin@suse.de> wrote:
> > I think it is more important to have one allocator than trying to get
> > the absolute most perfect one for everybody. That way changes are
> > carefully and slowly reviewed and merged, with results to justify the
> > change. This way everybody is testing the same thing, and bisection will
> > work. The situation with SLUB is already a nightmare because now each
> > allocator has half the testing and half the work put into it.
> 
> I wouldn't say it's a nightmare, but yes, it could be better. From my
> point of view SLUB is the base of whatever the future will be because
> the code is much cleaner and simpler than SLAB. That's why I find
> Christoph's work on SLEB more interesting than SLQB, for example,
> because it's building on top of something that's mature and stable.

I don't think SLUB ever proved itself very well. The selling points
were some untestable handwaving about how queueing is bad and jitter
is bad, ignoring the fact that queues could be shortened and periodic
reaping disabled at runtime with SLAB style of allocator. It also
has relied heavily on higher order allocations which put great strain
on hugepage allocations and page reclaim (witness the big slowdown
in low memory conditions when tmpfs was using higher order allocations
via SLUB).
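
Just to make the runtime tuning concrete, a minimal userspace sketch
(assuming a CONFIG_SLAB kernel whose /proc/slabinfo accepts the documented
"name limit batchcount sharedfactor" tunables line; the cache name and
numbers below are purely illustrative, and it needs root):

#include <stdio.h>

int main(void)
{
        FILE *f = fopen("/proc/slabinfo", "w");

        if (!f) {
                perror("/proc/slabinfo");       /* not SLAB, or not root */
                return 1;
        }
        /* shrink the dentry cache's per-CPU queue limit and batch size */
        fprintf(f, "dentry 12 4 0\n");
        if (fclose(f)) {
                perror("tunables write rejected");
                return 1;
        }
        return 0;
}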


> That said, are you proposing that even without further improvements to
> SLUB, we should go ahead and, for example, remove SLAB from Kconfig
> for v2.6.36 and see if we can just delete the whole thing from, say,
> v2.6.38?

SLUB has not been able to displace SLAB for a long time due to
performance and higher order allocation problems.

I think "clean code" is very important, but the hardest thing to
get right by far is the actual allocation and freeing strategies. So
it's crazy to base such a choice on code cleanliness. If that's the
deciding factor, then I can provide a patch to modernise SLAB and then
we can remove SLUB and start incremental improvements from there.
 

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC V2 SLEB 00/14] The Enhanced(hopefully) Slab Allocator
  2010-05-25  8:16               ` Nick Piggin
@ 2010-05-25  9:19                 ` Pekka Enberg
  -1 siblings, 0 replies; 89+ messages in thread
From: Pekka Enberg @ 2010-05-25  9:19 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Christoph Lameter, Christoph Lameter, linux-mm, LKML,
	Andrew Morton, Linus Torvalds, David Rientjes, Zhang Yanmin,
	Matthew Wilcox, Matt Mackall, Mel Gorman

Hi Nick,

On Tue, May 25, 2010 at 11:16 AM, Nick Piggin <npiggin@suse.de> wrote:
> I don't think SLUB ever proved itself very well. The selling points
> were some untestable handwaving about how queueing is bad and jitter
> is bad, ignoring the fact that queues could be shortened and periodic
> reaping disabled at runtime with SLAB style of allocator. It also
> has relied heavily on higher order allocations which put great strain
> on hugepage allocations and page reclaim (witness the big slowdown
> in low memory conditions when tmpfs was using higher order allocations
> via SLUB).

The main selling point for SLUB was NUMA. Has the situation changed?
Reliance on higher order allocations isn't that relevant if we're
anyway discussing ways to change allocation strategy.

On Tue, May 25, 2010 at 11:16 AM, Nick Piggin <npiggin@suse.de> wrote:
> SLUB has not been able to displace SLAB for a long time due to
> performance and higher order allocation problems.
>
> I think "clean code" is very important, but the hardest thing to
> get right by far is the actual allocation and freeing strategies. So
> it's crazy to base such a choice on code cleanliness. If that's the
> deciding factor, then I can provide a patch to modernise SLAB and then
> we can remove SLUB and start incremental improvements from there.

I'm more than happy to take in patches to clean up SLAB but I think
you're underestimating the required effort. What SLUB has going for
it:

  - No NUMA alien caches
  - No special lockdep handling required
  - Debugging support is better
  - Cpuset interactions are simpler
  - Memory hotplug is more mature
  - Many more contributors to SLUB than to SLAB

I was one of the people cleaning up SLAB when SLUB was merged and
based on that experience I'm strongly in favor of SLUB as a base.

                        Pekka

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC V2 SLEB 00/14] The Enhanced(hopefully) Slab Allocator
  2010-05-25  9:19                 ` Pekka Enberg
@ 2010-05-25  9:34                   ` Nick Piggin
  -1 siblings, 0 replies; 89+ messages in thread
From: Nick Piggin @ 2010-05-25  9:34 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Christoph Lameter, Christoph Lameter, linux-mm, LKML,
	Andrew Morton, Linus Torvalds, David Rientjes, Zhang Yanmin,
	Matthew Wilcox, Matt Mackall, Mel Gorman

On Tue, May 25, 2010 at 12:19:09PM +0300, Pekka Enberg wrote:
> Hi Nick,
> 
> On Tue, May 25, 2010 at 11:16 AM, Nick Piggin <npiggin@suse.de> wrote:
> > I don't think SLUB ever proved itself very well. The selling points
> > were some untestable handwaving about how queueing is bad and jitter
> > is bad, ignoring the fact that queues could be shortened and periodic
> > reaping disabled at runtime with SLAB style of allocator. It also
> > has relied heavily on higher order allocations which put great strain
> > on hugepage allocations and page reclaim (witness the big slowdown
> > in low memory conditions when tmpfs was using higher order allocations
> > via SLUB).
> 
> The main selling point for SLUB was NUMA. Has the situation changed?

Well one problem with SLAB was really just those alien caches. AFAIK
they were added by Christoph Lameter (maybe wrong), and I didn't ever
actually see much justification for them in the changelog. noaliencache
can be and is used on bigger machines, and SLES and RHEL kernels are
using SLAB on production NUMA systems up to thousands of CPU Altixes,
and have been looking at working on SGI's UV, and hundreds of cores
POWER7 etc.

I have not seen NUMA benchmarks showing SLUB is significantly better.
I haven't done much testing myself, mind you. But from indications, we
could probably quite easily drop the alien caches setup and do like a
simpler single remote freeing queue per CPU or something like that.
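
To make the idea concrete, here is a toy userspace sketch (not kernel code;
the structure names, array sizes and batch size are made up) of a single
per-CPU remote freeing queue: frees from the owning CPU go straight to its
local freelist, while cross-CPU frees are batched on the owner's remote
queue and handed back in bulk.

#include <stdio.h>
#include <stdlib.h>

#define NR_CPUS       4
#define LOCAL_SIZE    64
#define REMOTE_BATCH  8

struct cpu_cache {
        void *local[LOCAL_SIZE];        /* local freelist, LIFO */
        int nr_local;
        void *remote[REMOTE_BATCH];     /* objects freed by other CPUs */
        int nr_remote;
};

static struct cpu_cache caches[NR_CPUS];

/* hand batched remote frees back to the owner's local freelist */
static void drain_remote(struct cpu_cache *c)
{
        while (c->nr_remote && c->nr_local < LOCAL_SIZE)
                c->local[c->nr_local++] = c->remote[--c->nr_remote];
}

/* model freeing @obj, owned by CPU @owner, from CPU @cpu */
static void model_free(int cpu, int owner, void *obj)
{
        struct cpu_cache *c = &caches[owner];

        if (cpu == owner) {
                if (c->nr_local < LOCAL_SIZE)
                        c->local[c->nr_local++] = obj;  /* fast local path */
                /* (a real allocator would return the slab page here) */
                return;
        }
        c->remote[c->nr_remote++] = obj;                /* cross-CPU free */
        if (c->nr_remote == REMOTE_BATCH)
                drain_remote(c);                        /* batched hand-back */
}

int main(void)
{
        int i;

        for (i = 0; i < 10; i++)                /* CPU 1 frees CPU 0's objects */
                model_free(1, 0, malloc(32));
        drain_remote(&caches[0]);
        printf("cpu0: %d local, %d remote pending\n",
               caches[0].nr_local, caches[0].nr_remote);
        return 0;
}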


> Reliance on higher order allocations isn't that relevant if we're
> anyway discussing ways to change allocation strategy.

Then it's just going through more churn and adding untested code to
get where SLAB already is (top performance without higher order
allocations). So it is very relevant if we're considering how to get
to a single allocator.

 
> On Tue, May 25, 2010 at 11:16 AM, Nick Piggin <npiggin@suse.de> wrote:
> > SLUB has not been able to displace SLAB for a long time due to
> > performance and higher order allocation problems.
> >
> > I think "clean code" is very important, but the hardest thing to
> > get right by far is the actual allocation and freeing strategies. So
> > it's crazy to base such a choice on code cleanliness. If that's the
> > deciding factor, then I can provide a patch to modernise SLAB and then
> > we can remove SLUB and start incremental improvements from there.
> 
> I'm more than happy to take in patches to clean up SLAB but I think
> you're underestimating the required effort. What SLUB has going for
> it:
> 
>   - No NUMA alien caches
>   - No special lockdep handling required
>   - Debugging support is better
>   - Cpuset interactions are simpler
>   - Memory hotplug is more mature

I don't think any of this is much of a problem. It was only a problem because we
put in SLUB, and so half these new features were added to it and people
weren't adding them to SLAB.


>   - Many more contributors to SLUB than to SLAB

In large part because it is less mature. But also because it seems to be
seen as the allocator of the future.

Problem is that SLUB was never able to prove why it should be merged.
The code cleanliness issue is really trivial in comparison to how much
head scratching and work goes into analysing the performance.

It *really* is not required to completely replace a whole subsystem like
this to make progress. Even if we make relatively large changes,
everyone gets to use and test them, and it's so easy to bisect and
work out how changes interact and change behaviour. Compare that with
the problems we have when someone says that SLUB has a performance
regression against SLAB.


> I was one of the people cleaning up SLAB when SLUB was merged and
> based on that experience I'm strongly in favor of SLUB as a base.

I think we should: modernise SLAB code, add missing debug features,
possibly turn off alien caches by default, chuck out SLUB, and then
require that future changes have some reasonable bar set to justify
them.

I would not be at all against adding changes that transform SLAB to
SLUB or SLEB or SLQB. That's how it really should be done in the
first place.


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC V2 SLEB 00/14] The Enhanced(hopefully) Slab Allocator
  2010-05-25  9:34                   ` Nick Piggin
@ 2010-05-25  9:53                     ` Pekka Enberg
  -1 siblings, 0 replies; 89+ messages in thread
From: Pekka Enberg @ 2010-05-25  9:53 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Christoph Lameter, Christoph Lameter, linux-mm, LKML,
	Andrew Morton, Linus Torvalds, David Rientjes, Zhang Yanmin,
	Matthew Wilcox, Matt Mackall, Mel Gorman

Hi Nick,

On Tue, May 25, 2010 at 12:34 PM, Nick Piggin <npiggin@suse.de> wrote:
>> The main selling point for SLUB was NUMA. Has the situation changed?
>
> Well one problem with SLAB was really just those alien caches. AFAIK
> they were added by Christoph Lameter (maybe wrong), and I didn't ever
> actually see much justification for them in the changelog. noaliencache
> can be and is used on bigger machines, and SLES and RHEL kernels are
> using SLAB on production NUMA systems up to thousands of CPU Altixes,
> and have been looking at working on SGI's UV, and hundreds of cores
> POWER7 etc.

Yes, Christoph and some other people introduced alien caches IIRC for
big iron SGI boxes. As for benchmarks, commit
e498be7dafd72fd68848c1eef1575aa7c5d658df ("Numa-aware slab allocator
V5") mentions AIM.

On Tue, May 25, 2010 at 12:34 PM, Nick Piggin <npiggin@suse.de> wrote:
> I have not seen NUMA benchmarks showing SLUB is significantly better.
> I haven't done much testing myself, mind you. But from indications, we
> could probably quite easily drop the alien caches setup and do like a
> simpler single remote freeing queue per CPU or something like that.

Commit 81819f0fc8285a2a5a921c019e3e3d7b6169d225 ("SLUB core") mentions
kernbench improvements.

Other than these two data points, I unfortunately don't have any as I
wasn't involved with merging of either of the patches. If other NUMA
people know better, please feel free to share the data.

On Tue, May 25, 2010 at 11:16 AM, Nick Piggin <npiggin@suse.de> wrote:
> I think we should: modernise SLAB code, add missing debug features,
> possibly turn off alien caches by default, chuck out SLUB, and then
> require that future changes have some reasonable bar set to justify
> them.
>
> I would not be at all against adding changes that transform SLAB to
> SLUB or SLEB or SLQB. That's how it really should be done in the
> first place.

Like I said, as a maintainer I'm happy to merge patches to modernize
SLAB but I still think you're underestimating the effort especially
considering the fact that we can't afford many performance regressions
there either. I guess trying to get rid of alien caches would be the
first logical step there.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC V2 SLEB 00/14] The Enhanced(hopefully) Slab Allocator
  2010-05-25  8:03             ` Pekka Enberg
@ 2010-05-25 10:02               ` David Rientjes
  -1 siblings, 0 replies; 89+ messages in thread
From: David Rientjes @ 2010-05-25 10:02 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Nick Piggin, Christoph Lameter, Christoph Lameter, linux-mm,
	LKML, Andrew Morton, Linus Torvalds, Zhang Yanmin,
	Matthew Wilcox, Matt Mackall

On Tue, 25 May 2010, Pekka Enberg wrote:

> I wouldn't say it's a nightmare, but yes, it could be better. From my
> point of view SLUB is the base of whatever the future will be because
> the code is much cleaner and simpler than SLAB.

The code may be much cleaner and simpler than slab, but nobody (to date) 
has addressed the significant netperf TCP_RR regression that slub has, for 
example.  I worked on a patchset to do that for a while but it wasn't 
popular because it added some increments to the fastpath for tracking 
data.

I think it's great to have clean and simple code, but even considering its 
use is a non-starter when the entire kernel is significantly slower for 
certain networking loads.

> That's why I find
> Christoph's work on SLEB more interesting than SLQB, for example,
> because it's building on top of something that's mature and stable.
> 
> That said, are you proposing that even without further improvements to
> SLUB, we should go ahead and, for example, remove SLAB from Kconfig
> for v2.6.36 and see if we can just delete the whole thing from, say,
> v2.6.38?
> 

We use slab internally specifically because of the slub regressions.  
Removing it from the kernel at this point would be the equivalent of 
saying that Linux cares about certain workloads more than others since 
there are clearly benchmarks that show slub to be inferior in pure 
performance numbers.  I'd love for us to switch to slub but we can't take 
the performance hit.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC V2 SLEB 00/14] The Enhanced(hopefully) Slab Allocator
  2010-05-25  8:16               ` Nick Piggin
@ 2010-05-25 10:07                 ` David Rientjes
  -1 siblings, 0 replies; 89+ messages in thread
From: David Rientjes @ 2010-05-25 10:07 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Pekka Enberg, Christoph Lameter, Christoph Lameter, linux-mm,
	LKML, Andrew Morton, Linus Torvalds, Zhang Yanmin,
	Matthew Wilcox, Matt Mackall

On Tue, 25 May 2010, Nick Piggin wrote:

> I don't think SLUB ever proved itself very well. The selling points
> were some untestable handwaving about how queueing is bad and jitter
> is bad, ignoring the fact that queues could be shortened and periodic
> reaping disabled at runtime with SLAB style of allocator. It also
> has relied heavily on higher order allocations which put great strain
> on hugepage allocations and page reclaim (witness the big slowdown
> in low memory conditions when tmpfs was using higher order allocations
> via SLUB).
> 

I agree that the higher order allocations are a major problem and slub 
relies heavily on them for being able to utilize both the allocation and 
freeing fastpaths for a number of caches.  For systems with a very large 
amount of memory that isn't fully utilized and fragmentation isn't an 
issue, this works fine, but for users who use all their memory and do some 
amount of reclaim it comes at a significant cost.  The cpu slab thrashing 
problem that I identified with the netperf TCP_RR benchmark can be heavily 
reduced by tuning certain kmalloc caches to allocate higher order slabs, 
but that makes it very difficult to run with hugepages and makes the 
allocation slowpath even slower.  There are commandline workarounds to prevent slub 
from using these higher order allocations, but the performance of the 
allocator then suffers as a result.
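
For what it's worth, a small userspace sketch (assuming a SLUB kernel with
sysfs mounted and the usual per-cache "order" attribute under
/sys/kernel/slab) that lists the caches currently leaning on higher order
pages:

#include <stdio.h>
#include <dirent.h>

int main(void)
{
        const char *base = "/sys/kernel/slab";
        struct dirent *de;
        DIR *dir = opendir(base);
        char path[512];
        FILE *f;
        int order;

        if (!dir) {
                perror(base);                   /* not a SLUB kernel? */
                return 1;
        }
        while ((de = readdir(dir)) != NULL) {
                if (de->d_name[0] == '.')
                        continue;
                snprintf(path, sizeof(path), "%s/%s/order", base, de->d_name);
                f = fopen(path, "r");
                if (!f)
                        continue;
                if (fscanf(f, "%d", &order) == 1 && order > 0)
                        printf("%-24s order %d\n", de->d_name, order);
                fclose(f);
        }
        closedir(dir);
        return 0;
}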

> SLUB has not been able to displace SLAB for a long time due to
> performance and higher order allocation problems.
> 

Completely agreed.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC V2 SLEB 00/14] The Enhanced(hopefully) Slab Allocator
  2010-05-25  9:53                     ` Pekka Enberg
@ 2010-05-25 10:19                       ` Nick Piggin
  -1 siblings, 0 replies; 89+ messages in thread
From: Nick Piggin @ 2010-05-25 10:19 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Christoph Lameter, Christoph Lameter, linux-mm, LKML,
	Andrew Morton, Linus Torvalds, David Rientjes, Zhang Yanmin,
	Matthew Wilcox, Matt Mackall, Mel Gorman

On Tue, May 25, 2010 at 12:53:43PM +0300, Pekka Enberg wrote:
> Hi Nick,
> 
> On Tue, May 25, 2010 at 12:34 PM, Nick Piggin <npiggin@suse.de> wrote:
> >> The main selling point for SLUB was NUMA. Has the situation changed?
> >
> > Well one problem with SLAB was really just those alien caches. AFAIK
> > they were added by Christoph Lameter (maybe wrong), and I didn't ever
> > actually see much justification for them in the changelog. noaliencache
> > can be and is used on bigger machines, and SLES and RHEL kernels are
> > using SLAB on production NUMA systems up to thousands of CPU Altixes,
> > and have been looking at working on SGI's UV, and hundreds of cores
> > POWER7 etc.
> 
> Yes, Christoph and some other people introduced alien caches IIRC for
> big iron SGI boxes. As for benchmarks, commit
> e498be7dafd72fd68848c1eef1575aa7c5d658df ("Numa-aware slab allocator
> V5") mentions AIM.

It's quite a change with a lot of things. But there are definitely
other ways we can improve this without having a huge dumb crossbar
for remote frees.

 
> On Tue, May 25, 2010 at 12:34 PM, Nick Piggin <npiggin@suse.de> wrote:
> > I have not seen NUMA benchmarks showing SLUB is significantly better.
> > I haven't done much testing myself, mind you. But from indications, we
> > could probably quite easily drop the alien caches setup and do like a
> > simpler single remote freeing queue per CPU or something like that.
> 
> Commit 81819f0fc8285a2a5a921c019e3e3d7b6169d225 ("SLUB core") mentions
> kernbench improvements.

I haven't measured anything like that. Kernbench for me has never
had slab show up anywhere near the top of the profiles (it's always the
page fault, teardown, and page allocator paths).

Must have been a pretty specific configuration, but anyway I don't
know that it is realistic.

 
> Other than these two data points, I unfortunately don't have any as I
> wasn't involved with merging of either of the patches. If other NUMA
> people know better, please feel free to share the data.

A lot of people are finding SLAB is still required for performance
reasons. We did not want to change it in SLES11, for example, because
of performance concerns. Not sure about RHEL6?

 
> On Tue, May 25, 2010 at 11:16 AM, Nick Piggin <npiggin@suse.de> wrote:
> > I think we should: modernise SLAB code, add missing debug features,
> > possibly turn off alien caches by default, chuck out SLUB, and then
> > require that future changes have some reasonable bar set to justify
> > them.
> >
> > I would not be at all against adding changes that transform SLAB to
> > SLUB or SLEB or SLQB. That's how it really should be done in the
> > first place.
> 
> Like I said, as a maintainer I'm happy to merge patches to modernize
> SLAB

I think that would be most productive at this point. I will volunteer
to do it.

As much as I would like to see SLQB be merged :) I think the best
option is to go with SLAB because it is very well tested and very
very well performing.

If Christoph or you or I or anyone have genuine improvements to make
to the core algorithms, then the best thing to do will just be to
make incremental changes to SLAB.


> but I still think you're underestimating the effort especially
> considering the fact that we can't afford many performance regressions
> there either. I guess trying to get rid of alien caches would be the
> first logical step there.

There are several aspects to this. I think the first one will be to
actually modernize the code style, simplify the bootstrap process and
static memory allocations (SLQB goes even further than SLUB in this
regard), and to pull in debug features from SLUB.

These steps should be made without any changes to core algorithms.
Alien caches can easily be disabled and at present they are really
only a problem for big Altixes where it is a known parameter to tune.

From that point, I think we should concede that SLUB has not fulfilled
performance promises, and make SLAB the default.


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC V2 SLEB 00/14] The Enhanced(hopefully) Slab Allocator
  2010-05-25 10:19                       ` Nick Piggin
@ 2010-05-25 10:45                         ` Pekka Enberg
  -1 siblings, 0 replies; 89+ messages in thread
From: Pekka Enberg @ 2010-05-25 10:45 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Christoph Lameter, Christoph Lameter, linux-mm, LKML,
	Andrew Morton, Linus Torvalds, David Rientjes, Zhang Yanmin,
	Matthew Wilcox, Matt Mackall, Mel Gorman

Hi Nick,

On Tue, May 25, 2010 at 1:19 PM, Nick Piggin <npiggin@suse.de> wrote:
>> Like I said, as a maintainer I'm happy to merge patches to modernize
>> SLAB
>
> I think that would be most productive at this point. I will volunteer
> to do it.

OK, great!

> As much as I would like to see SLQB be merged :) I think the best
> option is to go with SLAB because it is very well tested and very
> very well performing.

I would have liked to see SLQB merged as well but it just didn't happen.

> If Christoph or you or I or anyone have genuine improvements to make
> to the core algorithms, then the best thing to do will just be to
> make incremental changes to SLAB.

I don't see the problem in improving SLUB even if we start modernizing
SLAB. Do you? I'm obviously biased towards SLUB still for the reasons
I already mentioned. I don't want to be a blocker for progress so if I
turn out to be a problem, we should consider changing the
maintainer(s). ;-)

> There are several aspects to this. I think the first one will be to
> actually modernize the code style, simplify the bootstrap process and
> static memory allocations (SLQB goes even further than SLUB in this
> regard), and to pull in debug features from SLUB.
>
> These steps should be made without any changes to core algorithms.
> Alien caches can easily be disabled and at present they are really
> only a problem for big Altixes where it is a known parameter to tune.
>
> From that point, I think we should concede that SLUB has not fulfilled
> performance promises, and make SLAB the default.

Sure. I don't care which allocator "wins" if we actually are able to get there.

                        Pekka

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC V2 SLEB 00/14] The Enhanced(hopefully) Slab Allocator
  2010-05-25 10:02               ` David Rientjes
@ 2010-05-25 10:47                 ` Pekka Enberg
  -1 siblings, 0 replies; 89+ messages in thread
From: Pekka Enberg @ 2010-05-25 10:47 UTC (permalink / raw)
  To: David Rientjes
  Cc: Nick Piggin, Christoph Lameter, Christoph Lameter, linux-mm,
	LKML, Andrew Morton, Linus Torvalds, Zhang Yanmin,
	Matthew Wilcox, Matt Mackall

Hi David,

On Tue, May 25, 2010 at 1:02 PM, David Rientjes <rientjes@google.com> wrote:
>> I wouldn't say it's a nightmare, but yes, it could be better. From my
>> point of view SLUB is the base of whatever the future will be because
>> the code is much cleaner and simpler than SLAB.
>
> The code may be much cleaner and simpler than slab, but nobody (to date)
> has addressed the significant netperf TCP_RR regression that slub has, for
> example.  I worked on a patchset to do that for a while but it wasn't
> popular because it added some increments to the fastpath for tracking
> data.

Yes and IIRC I asked you to resend the series because while I care a
lot about performance regressions, I simply don't have the time or the
hardware to reproduce and fix the weird cases you're seeing.

                        Pekka

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC V2 SLEB 00/14] The Enhanced(hopefully) Slab Allocator
  2010-05-25 10:45                         ` Pekka Enberg
@ 2010-05-25 11:06                           ` Nick Piggin
  -1 siblings, 0 replies; 89+ messages in thread
From: Nick Piggin @ 2010-05-25 11:06 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Christoph Lameter, Christoph Lameter, linux-mm, LKML,
	Andrew Morton, Linus Torvalds, David Rientjes, Zhang Yanmin,
	Matthew Wilcox, Matt Mackall, Mel Gorman

On Tue, May 25, 2010 at 01:45:07PM +0300, Pekka Enberg wrote:
> Hi Nick,
> 
> On Tue, May 25, 2010 at 1:19 PM, Nick Piggin <npiggin@suse.de> wrote:
> >> Like I said, as a maintainer I'm happy to merge patches to modernize
> >> SLAB
> >
> > I think that would be most productive at this point. I will volunteer
> > to do it.
> 
> OK, great!
> 
> > As much as I would like to see SLQB be merged :) I think the best
> > option is to go with SLAB because it is very well tested and very
> > very well performing.
> 
> I would have liked to see SLQB merged as well but it just didn't happen.

It seemed a bit counterproductive if the goal is to have one allocator.
I think it still has merit, but I should really practice what I preach
and propose incremental improvements to SLAB.

 
> > If Christoph or you or I or anyone have genuine improvements to make
> > to the core algorithms, then the best thing to do will just be do
> > make incremental changes to SLAB.
> 
> I don't see the problem in improving SLUB even if we start modernizing
> SLAB. Do you? I'm obviously biased towards SLUB still for the reasons
> I already mentioned. I don't want to be a blocker for progress so if I
> turn out to be a problem, we should consider changing the
> maintainer(s). ;-)

I think it just has not proven itself at this point. We have most
production kernels (at least the performance-sensitive ones that I'm
aware of) running on SLAB, and if it is conceded that the lack of
queueing and the reliance on higher-order allocations are a problem,
then I think it is far better to bite the bullet now and drop SLUB so
we can have a single allocator, rather than adding SLAB-like queueing
and other big changes to it. Then make incremental improvements to SLAB.

I have no problems at all with trying new ideas, but really, they
should be done in SLAB as incremental improvements. Everywhere we
take that approach, things seem to work better than when we do
wholesale rip-and-replace.

I don't want Christoph (or myself, or you) to stop testing new ideas,
but really there are almost no good reasons why they can't be done
as incremental patches.

With SLAB code cleaned up, there will be even fewer reasons.


> > There are several aspects to this. I think the first one will be to
> > actually modernize the code style, simplify the bootstrap process and
> > static memory allocations (SLQB goes even further than SLUB in this
> > regard), and to pull in debug features from SLUB.
> >
> > These steps should be made without any changes to core algorithms.
> > Alien caches can easily be disabled and at present they are really
> > only a problem for big Altixes where it is a known parameter to tune.
> >
> > From that point, I think we should concede that SLUB has not fulfilled
> > performance promises, and make SLAB the default.
> 
> Sure. I don't care which allocator "wins" if we actually are able to get there.

SLUB is already behind the 8 ball here. So is SLQB, I don't mind saying,
because it has had much, much less testing.



^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC V2 SLEB 00/14] The Enhanced(hopefully) Slab Allocator
  2010-05-25  2:06     ` Nick Piggin
  2010-05-25  6:55       ` Pekka Enberg
@ 2010-05-25 14:13       ` Christoph Lameter
  2010-05-25 14:34         ` Nick Piggin
  2010-05-25 14:40         ` Nick Piggin
  1 sibling, 2 replies; 89+ messages in thread
From: Christoph Lameter @ 2010-05-25 14:13 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Christoph Lameter, Pekka Enberg, linux-mm

On Tue, 25 May 2010, Nick Piggin wrote:

> On Mon, May 24, 2010 at 10:06:08AM -0500, Christoph Lameter wrote:
> > On Mon, 24 May 2010, Nick Piggin wrote:
> >
> > > Well I'm glad you've conceded that queues are useful for high
> > > performance computing, and that higher order allocations are not
> > > a free and unlimited resource.
> >
> > Ahem. I have never made any such claim and would never make them. And
> > "conceding" something ???
>
> Well, you were quite vocal about the subject.

I was always vocal about the huge number of queues and the complexity
that comes with alien caches etc. The alien caches were introduced over my
objections within the development team that did the NUMA slab. But even SLUB
has "queues", as many have repeatedly pointed out. The queueing is
different, though, in order to minimize excessive NUMA queueing. IMHO the
NUMA design of SLAB has fundamental problems because it implements its own
"NUMAness" aside from the page allocator. I had to put lots of band-aids on
the NUMA functionality in SLAB to make it correct.

One of the key things in SLEB is the question of how to deal with the alien
issue. So far I think the best compromise would be to use the shared
caches of the remote node as a stand-in for the alien cache. The problem is
that we will then free cache-cold objects to the remote shared cache.
Maybe that can be addressed by freeing to the end of the queue instead of
freeing to the top.

> > The "unqueueing" was the result of excessive queue handling in SLAB, and
> > the higher order allocations are a natural move in HPC to gain performance.
>
> This is the kind of handwaving that needs to be put into a testable
> form. I repeatedly asked you for examples of where the jitter is
> excessive or where the TLB improvements help, but you never provided
> any testable case. I'm not saying they don't exist, but we have to be
> rational about this.

The initial test that showed the improvements was on IA64 (16K page size)
and that was the measurement that was accepted for the initial merge. Mel
was able to verify those numbers.

While it will easily be possible to have fewer higher-order allocations
with SLEB, I still think that higher-order allocations are desirable to
increase data locality and reduce TLB pressure. It is easy, though, to set
the defaults to order 1 (like SLAB) and then allow manual override if
desired.

Fundamentally it is still the case that memory sizes are increasing and
that management overhead of 4K pages will therefore increasingly become an
issue. Support for larger page sizes and huge pages is critical for all
kernel components to compete in the future.

> > > I hope we can move forward now with some objective, testable
> > > comparisons and criteria for selecting one main slab allocator.
> >
> > If we can find criteria that are universally agreed upon then yes, but
> > that is doubtful.
>
> I think we can agree that perfect is the enemy of good, and that no
> allocator will do the perfect thing for everybody. I think we have to
> come up with a way to a single allocator.

Yes, but SLAB is not really the way to go. The code is too messy. That's why
I think the best way to go at this point is to merge the clean SLUB design,
add the SLAB features needed, and try to keep the NUMA stuff cleaner.

I am not entirely sure that I want to get rid of SLUB. Certainly if you
want minimal latency (like in my field) and more determinism then you
would want a SLUB design instead of periodic queue handling. Also SLUB has
a minimal memory footprint due to the linked list architecture.

The queues sacrifice a lot there. The linked list does not allow managing
cache-cold objects like SLAB does, because you always need to touch the
object, and this will cause regressions against SLAB. I think this is also
one of the weaknesses of SLQB.





^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC V2 SLEB 00/14] The Enhanced(hopefully) Slab Allocator
  2010-05-25 14:13       ` Christoph Lameter
@ 2010-05-25 14:34         ` Nick Piggin
  2010-05-25 14:43           ` Nick Piggin
  2010-05-25 14:48           ` Christoph Lameter
  2010-05-25 14:40         ` Nick Piggin
  1 sibling, 2 replies; 89+ messages in thread
From: Nick Piggin @ 2010-05-25 14:34 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Christoph Lameter, Pekka Enberg, linux-mm

On Tue, May 25, 2010 at 09:13:37AM -0500, Christoph Lameter wrote:
> On Tue, 25 May 2010, Nick Piggin wrote:
> 
> > On Mon, May 24, 2010 at 10:06:08AM -0500, Christoph Lameter wrote:
> > This is the kind of handwaving that needs to be put into a testable
> > form. I repeatedly asked you for examples of where the jitter is
> > excessive or where the TLB improvements help, but you never provided
> > any testable case. I'm not saying they don't exist, but we have to be
> > rational about this.
> 
> The initial test that showed the improvements was on IA64 (16K page size)
> and that was the measurement that was accepted for the initial merge. Mel
> was able to verify those numbers.

And there is nothing to prevent a SLAB type allocator from using higher
order allocations, except for the fact that it usually wouldn't because
far more often than not it is a bad idea.

Also, people actually want to use hugepages in userspace. The more that
other allocations use them, the worse problems with fragmentation and
reclaim become.

 
> While it will be easily possible to have less higher order allocations
> with SLEB I still think that higher order allocations are desirable to
> increase data locality and TLB pressure. Its easy though to set the
> defaults to order 1 (like SLAB) though and then allow manual override if
> desired.
> 
> Fundamentally it is still the case that memory sizes are increasing and
> that management overhead of 4K pages will therefore increasingly become an
> issue. Support for larger page sizes and huge pages is critical for all
> kernel components to compete in the future.

Numbers haven't really shown that SLUB is better because of higher order
allocations. Besides, as I said, higher order allocations can be used
by others.

 
> > > > I hope we can move forward now with some objective, testable
> > > > comparisons and criteria for selecting one main slab allocator.
> > >
> > > If we can find criteria that are universally agreed upon then yes, but
> > > that is doubtful.
> >
> > I think we can agree that perfect is the enemy of good, and that no
> > allocator will do the perfect thing for everybody. I think we have to
> > come up with a way to a single allocator.
> 
> Yes but SLAB is not really the way to go. The code is too messy. Thats why

That's a weak reason. SLUB has taken years to prove that it's not a
suitable replacement, so more big changes to it are not going to make it
more suitable now. We should just admit the rip-and-replace idea has
failed, and go with more reasonable incremental improvements rather
than subject everyone to another round of testing.

This is why I stopped pushing SLQB TBH, even though it showed some
promise.

The hard part is clearly NOT the code cleanup. It is the design and
all the testing and tuning.


> I think the best way to go at this point is to merge the clean SLUB design
> and add the SLAB features needed and try to keep the NUMA stuff cleaner.

I think it is to get rid of SLUB and add SLUB features gradually to
SLAB if/when they prove themselves.

 
> I am not entirely sure that I want to get rid of SLUB. Certainly if you
> want minimal latency (like in my field) and more determinism then you
> would want a SLUB design instead of periodic queue handling. Also SLUB has
> a minimal memory footprint due to the linked list architecture.

I disagree completely. The queues can be shrunk to a similar size as
the SLUB queues (which are just implicit by design), and periodic
shrinking can be disabled like SLUB. It's not a fundamental design
property.

Also, there were no numbers or test cases, simply handwaving. I don't
disagree it might be a problem, but the way to solve problems is to
provide a test case or numbers.


> The queues sacrifice a lot there. The linked list does not allow managing
> cache cold objects like SLAB does because you always need to touch the
> object and this will cause regressions against SLAB. I think this is also
> one of the weaknesses of SLQB.

But this is just more handwaving. That's what got us into this situation
we are in now.

What we know is that SLAB is still used by all high performance
enterprise distros (and google). And it is used by Altixes in production
as well as all other large NUMA machines that Linux runs on.

Given that information, how can you still say that SLUB+more big changes
is the right way to proceed?


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC V2 SLEB 00/14] The Enhanced(hopefully) Slab Allocator
  2010-05-25 14:13       ` Christoph Lameter
  2010-05-25 14:34         ` Nick Piggin
@ 2010-05-25 14:40         ` Nick Piggin
  2010-05-25 14:48           ` Christoph Lameter
  1 sibling, 1 reply; 89+ messages in thread
From: Nick Piggin @ 2010-05-25 14:40 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Christoph Lameter, Pekka Enberg, linux-mm

On Tue, May 25, 2010 at 09:13:37AM -0500, Christoph Lameter wrote:
> On Tue, 25 May 2010, Nick Piggin wrote:
> 
> > On Mon, May 24, 2010 at 10:06:08AM -0500, Christoph Lameter wrote:
> > > On Mon, 24 May 2010, Nick Piggin wrote:
> > >
> > > > Well I'm glad you've conceded that queues are useful for high
> > > > performance computing, and that higher order allocations are not
> > > > a free and unlimited resource.
> > >
> > > Ahem. I have never made any such claim and would never make them. And
> > > "conceding" something ???
> >
> > Well, you were quite vocal about the subject.
> 
> I was always vocal about the huge amounts of queues and the complexity
> coming with alien caches etc. The alien caches were introduced against my
> objections on the development team that did the NUMA slab. But even SLUB
> has "queues" as many have repeatedly pointed out. The queuing is
> different though in order to minimize excessive NUMA queueing. IMHO the
> NUMA design of SLAB has fundamental problems because it implements its own
> "NUMAness" aside from the page allocator.

And by the way I disagreed completely that this is a problem. And you
never demonstrated that it is a problem.

It's totally unproductive to say things like it implements its own
"NUMAness" aside from the page allocator. I can say SLUB implements its
own "numaness" because it is checking for objects matching NUMA
requirements too.


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC V2 SLEB 00/14] The Enhanced(hopefully) Slab Allocator
  2010-05-25 14:34         ` Nick Piggin
@ 2010-05-25 14:43           ` Nick Piggin
  2010-05-25 14:48           ` Christoph Lameter
  1 sibling, 0 replies; 89+ messages in thread
From: Nick Piggin @ 2010-05-25 14:43 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Christoph Lameter, Pekka Enberg, linux-mm

On Wed, May 26, 2010 at 12:34:09AM +1000, Nick Piggin wrote:
> On Tue, May 25, 2010 at 09:13:37AM -0500, Christoph Lameter wrote:
> > The queues sacrifice a lot there. The linked list does not allow managing
> > cache cold objects like SLAB does because you always need to touch the
> > object and this will cause regressions against SLAB. I think this is also
> > one of the weaknesses of SLQB.
> 
> But this is just more handwaving. That's what got us into this situation
> we are in now.
> 
> What we know is that SLAB is still used by all high performance
> enterprise distros (and google). And it is used by Altixes in production
> as well as all other large NUMA machines that Linux runs on.
> 
> Given that information, how can you still say that SLUB+more big changes
> is the right way to proceed?

Might I add that once the SLAB code is cleaned up, you can always propose
improvements from SLUB, or any other ideas for it, which we can carefully
test and merge in slowly as bisectable changes to the slab allocator whose
performance is our benchmark.

In fact, if you have better ideas in SLEB, I would encourage it.



^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC V2 SLEB 00/14] The Enhanced(hopefully) Slab Allocator
  2010-05-25 14:34         ` Nick Piggin
  2010-05-25 14:43           ` Nick Piggin
@ 2010-05-25 14:48           ` Christoph Lameter
  2010-05-25 15:11             ` Nick Piggin
  1 sibling, 1 reply; 89+ messages in thread
From: Christoph Lameter @ 2010-05-25 14:48 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Christoph Lameter, Pekka Enberg, linux-mm

On Wed, 26 May 2010, Nick Piggin wrote:

> > The initial test that showed the improvements was on IA64 (16K page size)
> > and that was the measurement that was accepted for the initial merge. Mel
> > was able to verify those numbers.
>
> And there is nothing to prevent a SLAB type allocator from using higher
> order allocations, except for the fact that it usually wouldn't because
> far more often than not it is a bad idea.

16K is the base page size on IA64. Higher order allocations are a pressing
issue for the kernel given growing memory sizes and we are slowly but
surely making progress with defrag etc.
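
For anyone who wants to see how scarce higher-order pages already are on a
given box, the buddy allocator's own statistics are enough (nothing
SLEB- or SLUB-specific is assumed here):

# Each numeric column is the number of free blocks of that order,
# starting at order 0, reported per node and zone.
cat /proc/buddyinfo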

> > Fundamentally it is still the case that memory sizes are increasing and
> > that management overhead of 4K pages will therefore increasingly become an
> > issue. Support for larger page sizes and huge pages is critical for all
> > kernel components to compete in the future.
>
> Numbers haven't really shown that SLUB is better because of higher order
> allocations. Besides, as I said, higher order allocations can be used
> by others.

Boot with huge page support (slub_min_order=9) and you will see a
performance increase on many loads.
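
A minimal sketch of trying that (slub_min_order and slub_max_order are the
real SLUB boot parameters; the values and the caches inspected below are
only illustrative):

# Append to the kernel command line; slub_max_order bounds the orders SLUB
# will consider, so it needs raising as well:
#     slub_min_order=9 slub_max_order=10
# After boot, check the per-cache order SLUB actually chose:
grep -H . /sys/kernel/slab/kmalloc-*/order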

> Also, there were no numbers or test cases, simply handwaving. I don't
> disagree it might be a problem, but the way to solve problems is to
> provide a test case or numbers.

The reason that the alien caches made it into SLAB was performance
numbers that showed that the design "must" be this way. I prefer a clear,
maintainable design over some numbers (which invariably show the bias of
the tester for certain loads).

> Given that information, how can you still say that SLUB+more big changes
> is the right way to proceed?

Have you looked at the SLAB code?

Also please stop exaggerating. There are no immediate plans to replace
SLAB. We are exploring a possible solution.

If the SLEB idea pans out and we can replicate SLAB (and SLUB) performance,
then we will have to think about replacing SLAB / SLUB at some point. So
far this is just a rickety thing that barely works, where there is some
hope that the SLAB/SLUB conundrum may be solved by this approach.




^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC V2 SLEB 00/14] The Enhanced(hopefully) Slab Allocator
  2010-05-25 14:40         ` Nick Piggin
@ 2010-05-25 14:48           ` Christoph Lameter
  2010-05-25 15:12             ` Nick Piggin
  0 siblings, 1 reply; 89+ messages in thread
From: Christoph Lameter @ 2010-05-25 14:48 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Christoph Lameter, Pekka Enberg, linux-mm

On Wed, 26 May 2010, Nick Piggin wrote:

> And by the way I disagreed completely that this is a problem. And you
> never demonstrated that it is a problem.
>
> It's totally unproductive to say things like it implements its own
> "NUMAness" aside from the page allocator. I can say SLUB implements its
> own "numaness" because it is checking for objects matching NUMA
> requirements too.

SLAB implements NUMA policies etc. in the SLAB logic. It has its own rotor
now.


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC V2 SLEB 00/14] The Enhanced(hopefully) Slab Allocator
  2010-05-25 14:48           ` Christoph Lameter
@ 2010-05-25 15:11             ` Nick Piggin
  2010-05-25 15:28               ` Christoph Lameter
  0 siblings, 1 reply; 89+ messages in thread
From: Nick Piggin @ 2010-05-25 15:11 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Christoph Lameter, Pekka Enberg, linux-mm

On Tue, May 25, 2010 at 09:48:01AM -0500, Christoph Lameter wrote:
> On Wed, 26 May 2010, Nick Piggin wrote:
> 
> > > The initial test that showed the improvements was on IA64 (16K page size)
> > > and that was the measurement that was accepted for the initial merge. Mel
> > > was able to verify those numbers.
> >
> > And there is nothing to prevent a SLAB type allocator from using higher
> > order allocations, except for the fact that it usually wouldn't because
> > far more often than not it is a bad idea.
> 
> 16K is the base page size on IA64. Higher order allocations are a pressing
> issue for the kernel given growing memory sizes and we are slowly but
> surely making progress with defrag etc.

You do not understand. There is nothing *preventing* other designs of
allocators from using higher order allocations. The problem is that
SLUB is *forced* to use them due to its limited queueing capabilities.

You keep spinning this as a good thing for SLUB design when it is not.

 
> > > Fundamentally it is still the case that memory sizes are increasing and
> > > that management overhead of 4K pages will therefore increasingly become an
> > > issue. Support for larger page sizes and huge pages is critical for all
> > > kernel components to compete in the future.
> >
> > Numbers haven't really shown that SLUB is better because of higher order
> > allocations. Besides, as I said, higher order allocations can be used
> > by others.
> 
> Boot with huge page support (slub_min_order=9) and you will see a
> performance increase on many loads.

Pretty ridiculous.

 
> > Also, there were no numbers or test cases, simply handwaving. I don't
> > disagree it might be a problem, but the way to solve problems is to
> > provide a test case or numbers.
> 
> The reason that the alien caches made it into SLAB were performance
> numbers that showed that the design "must" be this way. I prefer a clear
> maintainable design over some numbers (that invariably show the bias of
> the tester for certain loads).

I don't really agree. There are a number of other possible ways to
improve it, including fewer remote freeing queues.

For the slab allocator, if anything, I'm pretty sure that numbers
actually are the most important criterion. We are talking about a few
thousand lines of self-contained code that services almost all the rest
of the kernel.

 
> > Given that information, how can you still say that SLUB+more big changes
> > is the right way to proceed?
> 
> Have you looked at the SLAB code?

Of course. Have you had a look at the SLUB numbers and reports of
failures?

 
> Also please stop exaggerating. There are no immediate plans to replace
> SLAB. We are exploring a possible solution.

Good, because it cannot be replaced; in fact, I am proposing to replace
SLUB. I have heard no good reasons why not.

 
> If the SLEB idea pans out and we can replicate SLAB (and SLUB) performance
> then we will have to think about replacing SLAB / SLUB at some point. So
> far this is just a riggedy thing that barely works where there is some
> hope that the SLAB - SLUB conumdrum may be solved by the approach.
 
SLUB has gone back to the drawing board because its original design
cannot support high enough performance to replace SLAB. This gives
us the opportunity to do what we should have done from the start, and
incrementally improve SLAB code.

I repeat for the Nth time that there is nothing stopping you from
adding SLUB ideas into SLAB. This is how it should have been done
from the start.

How is it possibly better to instead start from the known suboptimal
code and make changes to it? What exactly is your concern with
making incremental changes to SLAB?


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC V2 SLEB 00/14] The Enhanced(hopefully) Slab Allocator
  2010-05-25 14:48           ` Christoph Lameter
@ 2010-05-25 15:12             ` Nick Piggin
  0 siblings, 0 replies; 89+ messages in thread
From: Nick Piggin @ 2010-05-25 15:12 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Christoph Lameter, Pekka Enberg, linux-mm

On Tue, May 25, 2010 at 09:48:56AM -0500, Christoph Lameter wrote:
> On Wed, 26 May 2010, Nick Piggin wrote:
> 
> > And by the way I disagreed completely that this is a problem. And you
> > never demonstrated that it is a problem.
> >
> > It's totally unproductive to say things like it implements its own
> > "NUMAness" aside from the page allocator. I can say SLUB implements its
> > own "numaness" because it is checking for objects matching NUMA
> > requirements too.
> 
> SLAB implement numa policies etc in the SLAB logic. It has its own rotor
> now.

We all know. I am saying it is unproductive because you just claim
that it is some fundamental problem without explaining why it is a problem.


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC V2 SLEB 00/14] The Enhanced(hopefully) Slab Allocator
  2010-05-25 10:45                         ` Pekka Enberg
@ 2010-05-25 15:13                           ` Linus Torvalds
  -1 siblings, 0 replies; 89+ messages in thread
From: Linus Torvalds @ 2010-05-25 15:13 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Nick Piggin, Christoph Lameter, Christoph Lameter, linux-mm,
	LKML, Andrew Morton, David Rientjes, Zhang Yanmin,
	Matthew Wilcox, Matt Mackall, Mel Gorman



On Tue, 25 May 2010, Pekka Enberg wrote:
> 
> I would have liked to see SLQB merged as well but it just didn't happen.

And it's not going to. I'm not going to merge YASA that will stay around 
for years, not improve on anything, and will just mean that there are some 
bugs that developers don't see because they depend on some subtle 
interaction with the sl*b allocator.

We've got three. That's at least one too many. We're not adding any new 
ones until we've gotten rid of at least one old one.

		Linus

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC V2 SLEB 00/14] The Enhanced(hopefully) Slab Allocator
  2010-05-25 15:11             ` Nick Piggin
@ 2010-05-25 15:28               ` Christoph Lameter
  2010-05-25 15:37                 ` Nick Piggin
  0 siblings, 1 reply; 89+ messages in thread
From: Christoph Lameter @ 2010-05-25 15:28 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Christoph Lameter, Pekka Enberg, linux-mm

On Wed, 26 May 2010, Nick Piggin wrote:

> You do not understand. There is nothing *preventing* other designs of
> allocators from using higher order allocations. The problem is that
> SLUB is *forced* to use them due to its limited queueing capabilities.

SLUB's use of higher order allocations is *optional*. The limited queueing is
advantageous within the framework of SLUB because NUMA locality checks are
simplified and locking is localized to a single page, increasing
concurrency.

> You keep spinning this as a good thing for SLUB design when it is not.

It is a good design decision. You have an irrational fear of higher order
allocations.

> > The reason that the alien caches made it into SLAB were performance
> > numbers that showed that the design "must" be this way. I prefer a clear
> > maintainable design over some numbers (that invariably show the bias of
> > the tester for certain loads).
>
> I don't really agree. There are a number of other possible ways to
> improve it, including fewer remote freeing queues.

You disagree with the history of the allocator?

> How is it possibly better to instead start from the known suboptimal
> code and make changes to it? What exactly is your concern with
> making incremental changes to SLAB?

I am not sure why you want me to repeat what I already said. Guess we
should stop this conversation since it is deteriorating.




^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC V2 SLEB 00/14] The Enhanced(hopefully) Slab Allocator
  2010-05-25 15:28               ` Christoph Lameter
@ 2010-05-25 15:37                 ` Nick Piggin
  2010-05-27 14:24                   ` Christoph Lameter
  0 siblings, 1 reply; 89+ messages in thread
From: Nick Piggin @ 2010-05-25 15:37 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Christoph Lameter, Pekka Enberg, linux-mm, Linus Torvalds

On Tue, May 25, 2010 at 10:28:11AM -0500, Christoph Lameter wrote:
> On Wed, 26 May 2010, Nick Piggin wrote:
> 
> > You do not understand. There is nothing *preventing* other designs of
> > allocators from using higher order allocations. The problem is that
> > SLUB is *forced* to use them due to its limited queueing capabilities.
> 
> SLUBs use of higher order allocation is *optional*. The limited queuing is
> advantageous within the framework of SLUB because NUMA locality checks are
> simplified and locking is localized to a single page increasing
> concurrency.

It's not optional if performance sucks without it. People want to have
a well performing slab allocator and also not have the downsides of it
using higher order allocations.

Look at what David said about Google's kernel for a concrete example.

 
> > You keep spinning this as a good thing for SLUB design when it is not.
> 
> It is a good design decision. You have an irrational fear of higher order
> allocations.

No.

 
> > > The reason that the alien caches made it into SLAB were performance
> > > numbers that showed that the design "must" be this way. I prefer a clear
> > > maintainable design over some numbers (that invariably show the bias of
> > > the tester for certain loads).
> >
> > I don't really agree. There are a number of other possible ways to
> > improve it, including fewer remote freeing queues.
> 
> You disagree with the history of the allocator?

I don't agree with you saying that it "must" be that way. There are
other ways to improve things there.


> > How is it possibly better to instead start from the known suboptimal
> > code and make changes to it? What exactly is your concern with
> > making incremental changes to SLAB?
> 
> I am not sure why you want me to repeat what I already said. Guess we
> should stop this conversation since it is deteriorating.

You never answered these questions adequately. These are the 2 most
important things because if I can address your concerns with them,
then we can go ahead and throw out SLUB and make incremental
improvements from there instead.


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC V2 SLEB 00/14] The Enhanced(hopefully) Slab Allocator
  2010-05-25 15:13                           ` Linus Torvalds
@ 2010-05-25 15:43                             ` Nick Piggin
  -1 siblings, 0 replies; 89+ messages in thread
From: Nick Piggin @ 2010-05-25 15:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Pekka Enberg, Christoph Lameter, Christoph Lameter, linux-mm,
	LKML, Andrew Morton, David Rientjes, Zhang Yanmin,
	Matthew Wilcox, Matt Mackall, Mel Gorman

On Tue, May 25, 2010 at 08:13:50AM -0700, Linus Torvalds wrote:
> 
> 
> On Tue, 25 May 2010, Pekka Enberg wrote:
> > 
> > I would have liked to see SLQB merged as well but it just didn't happen.
> 
> And it's not going to. I'm not going to merge YASA that will stay around 
> for years, not improve on anything, and will just mean that there are some 
> bugs that developers don't see because they depend on some subtle 
> interaction with the sl*b allocator.
> 
> We've got three. That's at least one too many. We're not adding any new 
> ones until we've gotten rid of at least one old one.

Agreed, and I realized that a while back (hence I stopped pushing SLQB).
SLAB is simply a good allocator that is very, very hard to beat. The
fact that a lot of places are still using SLAB despite the real
secondary advantages of SLUB (cleaner code, better debugging support)
indicates to me that we should go back and start from there.

What is sad is all this duplicate (and unsynchronized and not always
complete) work implementing things in both allocators[*] and the
split testing base.

As far as I can see, there was never a good reason to replace SLAB
rather than clean up its code and make incremental improvements.


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC V2 SLEB 00/14] The Enhanced(hopefully) Slab Allocator
  2010-05-25 15:43                             ` Nick Piggin
@ 2010-05-25 17:02                               ` Pekka Enberg
  -1 siblings, 0 replies; 89+ messages in thread
From: Pekka Enberg @ 2010-05-25 17:02 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, Christoph Lameter, Christoph Lameter, linux-mm,
	LKML, Andrew Morton, David Rientjes, Zhang Yanmin,
	Matthew Wilcox, Matt Mackall, Mel Gorman

Hi Nick,

On Tue, May 25, 2010 at 6:43 PM, Nick Piggin <npiggin@suse.de> wrote:
> As far as I can see, there was never a good reason to replace SLAB
> rather than clean up its code and make incremental improvements.

I'm not totally convinced but I guess we're about to find that out.
How do you propose we benchmark SLAB while we clean it up and change
things to make sure we don't make the same mistakes as we did with
SLUB (i.e. miss an important workload like TPC-C)?

                        Pekka

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC V2 SLEB 00/14] The Enhanced(hopefully) Slab Allocator
  2010-05-25 17:02                               ` Pekka Enberg
@ 2010-05-25 17:19                                 ` Nick Piggin
  -1 siblings, 0 replies; 89+ messages in thread
From: Nick Piggin @ 2010-05-25 17:19 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Linus Torvalds, Christoph Lameter, Christoph Lameter, linux-mm,
	LKML, Andrew Morton, David Rientjes, Zhang Yanmin,
	Matthew Wilcox, Matt Mackall, Mel Gorman

On Tue, May 25, 2010 at 08:02:32PM +0300, Pekka Enberg wrote:
> Hi Nick,
> 
> On Tue, May 25, 2010 at 6:43 PM, Nick Piggin <npiggin@suse.de> wrote:
> > As far as I can see, there was never a good reason to replace SLAB
> > rather than clean up its code and make incremental improvements.
> 
> I'm not totally convinced but I guess we're about to find that out.
> How do you propose we benchmark SLAB while we clean it up

Well, the first pass will be code cleanups and bootstrap simplifications.
Then comes looking at what debugging features were implemented in SLUB but
not SLAB, and what will be useful to bring over from there.

At this point the aim would be for the actual allocation behaviour with
non-debug settings to be unchanged. Hopefully this removes everyone's
(apparently) largest gripe, that the code is crufty.

Next would be to add some options to tweak queue sizes and disable
cache reaping at runtime, for the benefit of the low-jitter crowd, and
see if any further hotplug fixes are required.
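
For reference, SLAB's existing runtime knob for the queue sizes is the
writable /proc/slabinfo interface, so part of this is extending and
documenting what is already there (the cache name and values below are
only an example):

# A tuning write has the form: <cache name> <limit> <batchcount> <shared>
echo "dentry 12 4 0" > /proc/slabinfo
grep dentry /proc/slabinfo	# the "tunables" columns show the result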

Then would be to propose incremental improvements to actual algorithm.
For example, replacing the alien cache crossbar with a lighter weight
or more scalable structure.
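
As a starting point, the knob that exists today is the boot option that
switches the alien caches off wholesale (SLAB only; the grep afterwards is
just a rough sanity check and its output will vary):

# Add to the kernel command line:
#     noaliencache
# then confirm the allocator setup after boot:
dmesg | grep -i -e slab -e alien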


> and change
> things to make sure we don't make the same mistakes as we did with
> SLUB (i.e. miss an important workload like TPC-C)?

Obviously it is impossible to make forward progress and also catch
all regressions before release. This fact means that we have to be
able to cope with them as well as possible.

We get two benefits from starting with SLAB. Firstly, we get a larger
testing base. Secondly, if behaviour regresses we get a simple (i.e. git
revert) formula for getting back from bad behaviour to good behaviour.

I don't anticipate a huge number of functional changes to SLAB here,
though. It's surprisingly hard to do better than it. Alien caches are
one area; configurable higher-order allocation support and jitter
reduction are other possibilities.

If we do get a big proposed change in the pipeline, then we have to
eat it somehow, but AFAIKS we've still got a better foundation than
starting with a completely new allocator and feeling around in the
dark trying to move it past SLAB in terms of performance.



^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC V2 SLEB 00/14] The Enhanced(hopefully) Slab Allocator
  2010-05-25 17:19                                 ` Nick Piggin
@ 2010-05-25 17:35                                   ` Pekka Enberg
  -1 siblings, 0 replies; 89+ messages in thread
From: Pekka Enberg @ 2010-05-25 17:35 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, Christoph Lameter, Christoph Lameter, linux-mm,
	LKML, Andrew Morton, David Rientjes, Zhang Yanmin,
	Matthew Wilcox, Matt Mackall, Mel Gorman

On Tue, May 25, 2010 at 8:19 PM, Nick Piggin <npiggin@suse.de> wrote:
>> I'm not totally convinced but I guess we're about to find that out.
>> How do you propose we benchmark SLAB while we clean it up
>
> Well the first pass will be code cleanups, bootstrap simplifications.
> Then looking at what debugging features were implemented in SLUB but not
> SLAB and what will be useful to bring over from there.

Bootstrap might be easy to clean up but the biggest source of cruft
comes from the deeply inlined, complex allocation paths. Cleaning
those up is bound to cause performance regressions if you're not
careful.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC V2 SLEB 00/14] The Enhanced(hopefully) Slab Allocator
  2010-05-25 17:35                                   ` Pekka Enberg
@ 2010-05-25 17:40                                     ` Nick Piggin
  -1 siblings, 0 replies; 89+ messages in thread
From: Nick Piggin @ 2010-05-25 17:40 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Linus Torvalds, Christoph Lameter, Christoph Lameter, linux-mm,
	LKML, Andrew Morton, David Rientjes, Zhang Yanmin,
	Matthew Wilcox, Matt Mackall, Mel Gorman

On Tue, May 25, 2010 at 08:35:05PM +0300, Pekka Enberg wrote:
> On Tue, May 25, 2010 at 8:19 PM, Nick Piggin <npiggin@suse.de> wrote:
> >> I'm not totally convinced but I guess we're about to find that out.
> >> How do you propose we benchmark SLAB while we clean it up
> >
> > Well the first pass will be code cleanups, bootstrap simplifications.
> > Then looking at what debugging features were implemented in SLUB but not
> > SLAB and what will be useful to bring over from there.
> 
> Bootstrap might be easy to clean up but the biggest source of cruft
> comes from the deeply inlined, complex allocation paths. Cleaning
> those up is bound to cause performance regressions if you're not
> careful.

Oh I see what you mean, just straight-line code speed regressions
could bite us when doing cleanups.

That's possible. I'll keep a close eye on generated asm.
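
One low-tech way of doing that while the cleanups go in (the .before/.after
file names are made up; the single-target kbuild rule and bloat-o-meter are
the stock tools):

# Emit the assembly for just the allocator and diff it between steps:
make mm/slab.s
cp mm/slab.s slab.s.before	# repeat after the next cleanup patch
diff -u slab.s.before mm/slab.s | less
# Symbol-size changes across a whole build:
scripts/bloat-o-meter vmlinux.before vmlinux.after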


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC V2 SLEB 00/14] The Enhanced(hopefully) Slab Allocator
  2010-05-25 10:47                 ` Pekka Enberg
@ 2010-05-25 19:57                   ` David Rientjes
  -1 siblings, 0 replies; 89+ messages in thread
From: David Rientjes @ 2010-05-25 19:57 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Nick Piggin, Christoph Lameter, Christoph Lameter, linux-mm,
	LKML, Andrew Morton, Linus Torvalds, Zhang Yanmin,
	Matthew Wilcox, Matt Mackall

On Tue, 25 May 2010, Pekka Enberg wrote:

> > The code may be much cleaner and simpler than slab, but nobody (to date)
> > has addressed the significant netperf TCP_RR regression that slub has, for
> > example. I worked on a patchset to do that for a while but it wasn't
> > popular because it added some increments to the fastpath for tracking
> > data.
> 
> Yes and IIRC I asked you to resend the series because while I care a
> lot about performance regressions, I simply don't have the time or the
> hardware to reproduce and fix the weird cases you're seeing.
> 

My patchset still never attained parity with slab even though it improved 
slub's performance for that specific benchmark on my 16-core machine with 
64G of memory:

	# threads	SLAB		SLUB		SLUB+patchset
	16		69892		71592		69505
	32		126490		95373		119731
	48		138050		113072		125014
	64		169240		149043		158919
	80		192294		172035		179679
	96		197779		187849		192154
	112		217283		204962		209988
	128		229848		217547		223507
	144		238550		232369		234565
	160		250333		239871		244789
	176		256878		242712		248971
	192		261611		243182		255596

CONFIG_SLUB_STATS demonstrates that the kmalloc-256 and kmalloc-2048 caches
are performing quite poorly without the changes:

	cache		ALLOC_FASTPATH	ALLOC_SLOWPATH
	kmalloc-256	98125871	31585955
	kmalloc-2048	77243698	52347453

	cache		FREE_FASTPATH	FREE_SLOWPATH
	kmalloc-256	173624		129538000
	kmalloc-2048	90520		129500630
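
(These counters come from the per-cache sysfs files that CONFIG_SLUB_STATS
adds; a loop along these lines pulls the same numbers, assuming the same
two caches are of interest:)

for c in kmalloc-256 kmalloc-2048; do
	for s in alloc_fastpath alloc_slowpath free_fastpath free_slowpath; do
		printf '%s %s: ' "$c" "$s"; cat /sys/kernel/slab/$c/$s
	done
done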

When you have these types of results, it's obvious why slub is failing to
achieve the same performance as slab.  With the slub fastpath percpu work
that has been done recently, it might be possible to resurrect my patchset
and get more positive feedback because the penalty won't be as
significant, but the point is that slub still fails to achieve the same
results that slab can with heavy networking loads.  Thus, I think any
discussion about removing slab is premature until slab is no longer shown
to be a clear winner in comparison to its replacement, whether that is slub,
slqb, sleb, or another allocator.  I agree that slub is clearly better in
terms of maintainability, but we simply can't use it because of its
performance for networking loads.

If you want to duplicate these results on machines with a larger number of 
cores, just download netperf, run with CONFIG_SLUB on both netserver and 
netperf machines, and use this script:

#!/bin/bash

TIME=60				# seconds
HOSTNAME=<hostname>		# netserver

NR_CPUS=$(grep ^processor /proc/cpuinfo | wc -l)
echo NR_CPUS=$NR_CPUS

# Launch $1 concurrent netperf TCP_RR clients against $HOSTNAME.
run_netperf() {
	for i in $(seq 1 $1); do
		netperf -H $HOSTNAME -t TCP_RR -l $TIME &
	done
}

ITERATIONS=0
while [ $ITERATIONS -lt 12 ]; do
	RATE=0
	ITERATIONS=$[$ITERATIONS + 1]
	THREADS=$[$NR_CPUS * $ITERATIONS]
	# The transaction rate is the 6th column of each netperf result line.
	RESULTS=$(run_netperf $THREADS | grep -v '[a-zA-Z]' | awk '{ print $6 }')

	# Sum the per-client rates, truncating the decimal part.
	for j in $RESULTS; do
		RATE=$[$RATE + ${j/.*}]
	done
	echo threads=$THREADS rate=$RATE
done

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC V2 SLEB 00/14] The Enhanced(hopefully) Slab Allocator
  2010-05-25 15:37                 ` Nick Piggin
@ 2010-05-27 14:24                   ` Christoph Lameter
  2010-05-27 14:37                     ` Nick Piggin
  0 siblings, 1 reply; 89+ messages in thread
From: Christoph Lameter @ 2010-05-27 14:24 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Christoph Lameter, Pekka Enberg, linux-mm, Linus Torvalds

On Wed, 26 May 2010, Nick Piggin wrote:

> > > > The reason that the alien caches made it into SLAB were performance
> > > > numbers that showed that the design "must" be this way. I prefer a clear
> > > > maintainable design over some numbers (that invariably show the bias of
> > > > the tester for certain loads).
> > >
> > > I don't really agree. There are a number of other possible ways to
> > > improve it, including fewer remote freeing queues.
> >
> > You disagree with the history of the allocator?
>
> I don't agree with you saying that it "must" be that way. There are
> other ways to improve things there.

People told me that it "must" be this way. Could not convince them
otherwise at the time. I never wanted it to be that way and have been
looking for other ways ever since. SLUB is a result of trying something
different.

> then we can go ahead and throw out SLUB and make incremental
> improvements from there instead.

I am just amazed at the twists and turns by you. Didn't you write SLQB on
the basis of SLUB? And then it was abandoned? If you really believe this
and want to get this done then please invest some time in SLAB to get it
cleaned up. I have some doubt that you are aware of the difficulties that
you will encounter.





^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC V2 SLEB 00/14] The Enhanced(hopefully) Slab Allocator
  2010-05-27 14:24                   ` Christoph Lameter
@ 2010-05-27 14:37                     ` Nick Piggin
  2010-05-27 15:52                       ` Christoph Lameter
  0 siblings, 1 reply; 89+ messages in thread
From: Nick Piggin @ 2010-05-27 14:37 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Christoph Lameter, Pekka Enberg, linux-mm, Linus Torvalds

On Thu, May 27, 2010 at 09:24:28AM -0500, Christoph Lameter wrote:
> On Wed, 26 May 2010, Nick Piggin wrote:
> 
> > > > > The reason that the alien caches made it into SLAB were performance
> > > > > numbers that showed that the design "must" be this way. I prefer a clear
> > > > > maintainable design over some numbers (that invariably show the bias of
> > > > > the tester for certain loads).
> > > >
> > > > I don't really agree. There are a number of other possible ways to
> > > > improve it, including fewer remote freeing queues.
> > >
> > > You disagree with the history of the allocator?
> >
> > I don't agree with you saying that it "must" be that way. There are
> > other ways to improve things there.
> 
> People told me that it "must" be this way. Could not convince them
> otherwise at the time.

So again there were no numbers, just handwaving?


> I never wanted it to be that way and have been
> looking for other ways ever since. SLUB is a result of trying something
> different.
> 
> > then we can go ahead and throw out SLUB and make incremental
> > improvements from there instead.
> 
> I am just amazed at the twists and turns by you. Didn't you write SLQB on
> the basis of SLUB? And then it was abandoned? If you really believe this

Sure I hoped it would be able to conclusively beat SLAB, and I'd
thought it might be a good idea. I stopped pushing it because I
realized that incremental improvements to SLAB would likely be a
far better idea.


> and want to get this done then please invest some time in SLAB to get it
> cleaned up. I have some doubt that you are aware of the difficulties that
> you will encounter.

I am working on it. We'll see.


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC V2 SLEB 00/14] The Enhanced(hopefully) Slab Allocator
  2010-05-27 14:37                     ` Nick Piggin
@ 2010-05-27 15:52                       ` Christoph Lameter
  2010-05-27 16:07                         ` Nick Piggin
  0 siblings, 1 reply; 89+ messages in thread
From: Christoph Lameter @ 2010-05-27 15:52 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Christoph Lameter, Pekka Enberg, linux-mm, Linus Torvalds

On Fri, 28 May 2010, Nick Piggin wrote:

> > I am just amazed at the twists and turns by you. Didn't you write SLQB on
> > the basis of SLUB? And then it was abandoned? If you really believe this
>
> Sure I hoped it would be able to conclusively beat SLAB, and I'd
> thought it might be a good idea. I stopped pushing it because I
> realized that incremental improvements to SLAB would likely be a
> far better idea.

It looked to me as if there was a major conceptual issue with the linked
lists used for objects that impacted performance plus unresolved issues
with crashes on boot. I did not see you work on SLAB improvements. Seemed
that other things had higher priority. The work on slab allocators in
general is not well funded, not high priority and is a side issue. The
time that I can spend on this is also limited.

> > and want to get this done then please invest some time in SLAB to get it
> > cleaned up. I have some doubt that you are aware of the difficulties that
> > you will encounter.
>
> I am working on it. We'll see.

I think we agree on one thing regardless of SLAB or SLUB as a base: It
would be good to put the best approaches together to form a superior slab
allocator. I just think it's much easier to do given the mature and clean
code base in SLUB. If we both work on this then this may coalesce at some
point.

The main gripes with SLAB:

- Code base difficult to maintain. Has grown over almost 2 decades.
- Alien caches need to be kept under control. Various hacky ways
  are implemented to bypass that problem.
- Locking issues because of long hold times of the per node lock. SLUB
  has locking on the per page level. This is important for high numbers
  of threads per node: Westmere already has 12, EX 24, and it may grow
  from there.
- Debugging features and recovery mechanisms.
- Off or on page slab metadata causes space wastage, complex allocation
  and locking, and alignment issues. SLEB replaces that metadata structure
  with a bitfield in the page struct. This may also save access to an
  additional cacheline and maybe allow freeing of objects to a slab page
  without taking locks (see the sketch below).
- Variable and function naming is confusing.
- OS noise caused by periodic cache cleaning (which requires scans over
  all caches of all slabs on every processor).
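
To make the bitfield point concrete, here is a minimal sketch of
bitmap-managed object state in a slab page. The structure and function
names are hypothetical, chosen only to illustrate the idea; this is not
SLEB's actual code:

/*
 * One bit per object, set = free.  Neither allocation nor free has to
 * read or write the object itself, which keeps cache cold objects cold.
 */
#include <linux/bitops.h>

struct sleb_page_meta {                 /* imagined to live in struct page */
        unsigned long objmap[2];        /* covers up to 2 * BITS_PER_LONG objects */
        unsigned int nr_objects;
        unsigned int objsize;
        void *objects;                  /* start of the object area */
};

static void *sleb_alloc_from_page(struct sleb_page_meta *m)
{
        unsigned long i = find_first_bit(m->objmap, m->nr_objects);

        if (i >= m->nr_objects)
                return NULL;            /* no free object in this page */
        __clear_bit(i, m->objmap);
        return m->objects + i * m->objsize;
}

static void sleb_free_to_page(struct sleb_page_meta *m, void *object)
{
        unsigned long i = (object - m->objects) / m->objsize;

        __set_bit(i, m->objmap);
}

Locking around the bitmap (or using the atomic set_bit()/test_and_clear_bit()
variants instead) is deliberately left out of the sketch.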


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC V2 SLEB 00/14] The Enhanced(hopefully) Slab Allocator
  2010-05-27 15:52                       ` Christoph Lameter
@ 2010-05-27 16:07                         ` Nick Piggin
  2010-05-27 16:57                           ` Christoph Lameter
  0 siblings, 1 reply; 89+ messages in thread
From: Nick Piggin @ 2010-05-27 16:07 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Christoph Lameter, Pekka Enberg, linux-mm, Linus Torvalds

On Thu, May 27, 2010 at 10:52:52AM -0500, Christoph Lameter wrote:
> On Fri, 28 May 2010, Nick Piggin wrote:
> 
> > > I am just amazed at the twists and turns by you. Didn't you write SLQB on
> > > the basis of SLUB? And then it was abandoned? If you really believe this
> >
> > Sure I hoped it would be able to conclusively beat SLAB, and I'd
> > thought it might be a good idea. I stopped pushing it because I
> > realized that incremental improvements to SLAB would likely be a
> > far better idea.
> 
> It looked to me as if there was a major conceptual issue with the linked
> lists used for objects that impacted performance

With SLQB's linked list? No. Single threaded cache hot performance was
the same (+/- a couple of cycles IIRC) as SLUB on your microbenchmark.
On Intel's OLTP workload it was as good as SLAB.

The linked lists were similar to SLOB/SLUB IIRC.


> plus unresolved issues
> with crashes on boot.

That was due to a hack using a per_cpu definition for a node field (some
systems have node numbers that are not equal to any CPU number). I don't
think there were any problems left.


> I did not see you work on SLAB improvements. Seemed
> that other things had higher priority. The work on slab allocators in
> general is not well funded, not high priority and is a side issue. The
> time that I can spend on this is also limited.

I heard SLUB was just about to get there with the new per-cpu accessors.
That didn't seem to help too much in the real world. I would have liked more
time on SLAB but unfortunately have not had it until now.

It seems that it is *still* the best and most mature allocator we have
for most users, and the most widely deployed one. So AFAIKS it still
makes sense to incrementally improve it rather than take something
else.


> > > and want to get this done then please invest some time in SLAB to get it
> > > cleaned up. I have some doubt that you are aware of the difficulties that
> > > you will encounter.
> >
> > I am working on it. We'll see.
> 
> I think we agree on one thing regardless of SLAB or SLUB as a base: It
> would be good to put the best approaches together to form a superior slab
> allocator. I just think it's much easier to do given the mature and clean
> code base in SLUB. If we both work on this then this may coalesce at some
> point.

And I've listed my gripes with SLUB countless times, so I won't any more.

 
> The main gripes with SLAB:
> 
> - Code base difficult to maintain. Has grown over almost 2 decades.
> - Alien caches need to be kept under control. Various hacky ways
>   are implemented to bypass that problem.
> - Locking issues because of long hold times of the per node lock. SLUB
>   has locking on the per page level. This is important for high numbers
>   of threads per node: Westmere already has 12, EX 24, and it may grow
>   from there.
> - Debugging features and recovery mechanisms.
> - Off or on page slab metadata causes space wastage, complex allocation
>   and locking, and alignment issues. SLEB replaces that metadata structure
>   with a bitfield in the page struct. This may also save access to an
>   additional cacheline and maybe allow freeing of objects to a slab page
>   without taking locks.
> - Variable and function naming is confusing.
> - OS noise caused by periodic cache cleaning (which requires scans over
>   all caches of all slabs on every processor).



^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC V2 SLEB 00/14] The Enhanced(hopefully) Slab Allocator
  2010-05-27 16:07                         ` Nick Piggin
@ 2010-05-27 16:57                           ` Christoph Lameter
  2010-05-28  8:39                             ` Nick Piggin
  0 siblings, 1 reply; 89+ messages in thread
From: Christoph Lameter @ 2010-05-27 16:57 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Christoph Lameter, Pekka Enberg, linux-mm, Linus Torvalds

On Fri, 28 May 2010, Nick Piggin wrote:

> > > realized that incremental improvements to SLAB would likely be a
> > > far better idea.
> >
> > It looked to me as if there was a major conceptual issue with the linked
> > lists used for objects that impacted performance
>
> With SLQB's linked list? No. Single threaded cache hot performance was
> the same (+/- a couple of cycles IIRC) as SLUB on your microbenchmark.
> On Intel's OLTP workload it was as good as SLAB.
>
> The linked lists were similar to SLOB/SLUB IIRC.

Yes that is the problem. So it did not address the cache cold
regressions in SLUB. SLQB mostly addressed the slow path frequency on
free.

The design of SLAB is superior for cache cold objects since SLAB does
not touch the objects on alloc and free (if one requires similar
cache cold performance from other slab allocators). That's why I cleaned
up the per cpu queueing concept in SLAB (easy now with the percpu
allocator and operations) and came up with SLEB. At the same time this
also addresses the slowpath issues on free. I am not entirely sure how to
deal with the NUMAness but I want to focus more on machines with low node
counts.
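
For contrast, the classic in-object freelist used by SLOB and SLUB looks
roughly like the sketch below (hypothetical helper names, not the
allocators' real code); note that both operations have to touch the object
itself, pulling a possibly cache cold line into the CPU cache:

static inline void *freelist_pop(void **freelist)
{
        void *object = *freelist;

        if (object)
                *freelist = *(void **)object;   /* next pointer is read from the object */
        return object;
}

static inline void freelist_push(void **freelist, void *object)
{
        *(void **)object = *freelist;           /* next pointer is written into the object */
        *freelist = object;
}

A queue (as in SLAB) or a bitmap (as proposed for SLEB) avoids those object
accesses entirely.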

The problem with SLAB was that so far the "incremental improvements" have
led to further deterioration in the maintainability of the code. Multiple
people have tried going the route that you propose.


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC V2 SLEB 00/14] The Enhanced(hopefully) Slab Allocator
  2010-05-27 16:57                           ` Christoph Lameter
@ 2010-05-28  8:39                             ` Nick Piggin
  0 siblings, 0 replies; 89+ messages in thread
From: Nick Piggin @ 2010-05-28  8:39 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Christoph Lameter, Pekka Enberg, linux-mm, Linus Torvalds

On Thu, May 27, 2010 at 11:57:54AM -0500, Christoph Lameter wrote:
> On Fri, 28 May 2010, Nick Piggin wrote:
> 
> > > > realized that incremental improvements to SLAB would likely be a
> > > > far better idea.
> > >
> > > It looked to me as if there was a major conceptual issue with the linked
> > > lists used for objects that impacted performance
> >
> > With SLQB's linked list? No. Single threaded cache hot performance was
> > the same (+/- a couple of cycles IIRC) as SLUB on your microbenchmark.
> > On Intel's OLTP workload it was as good as SLAB.
> >
> > The linked lists were similar to SLOB/SLUB IIRC.
> 
> Yes that is the problem. So it did not address the cache cold
> regressions in SLUB. SLQB mostly addressed the slow path frequency on
> free.

This is going a bit off topic considering that I'm not pushing SLQB
or any concept from SLQB (just yet at least). As far as I know there
were no cache cold regressions in SLQB.


> The design of SLAB is superior for cache cold objects since SLAB does
> not touch the objects on alloc and free (if one requires similar
> cache cold performance from other slab allocators). That's why I cleaned
> up the per cpu queueing concept in SLAB (easy now with the percpu
> allocator and operations) and came up with SLEB. At the same time this
> also addresses the slowpath issues on free. I am not entirely sure how to
> deal with the NUMAness but I want to focus more on machines with low node
> counts.
> 
> The problem with SLAB was that so far the "incremental improvements" have
> led to further deterioration in the maintainability of the code. Multiple
> people have tried going the route that you propose.


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC V2 SLEB 01/14] slab: Introduce a constant for a unspecified node.
  2010-05-21 21:14 ` [RFC V2 SLEB 01/14] slab: Introduce a constant for a unspecified node Christoph Lameter
@ 2010-06-07 21:44   ` David Rientjes
  2010-06-07 22:30     ` Christoph Lameter
  0 siblings, 1 reply; 89+ messages in thread
From: David Rientjes @ 2010-06-07 21:44 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Pekka Enberg, linux-mm

On Fri, 21 May 2010, Christoph Lameter wrote:

> kmalloc_node() and friends can be passed a constant -1 to indicate
> that no choice was made for the node from which the object needs to
> come.
> 
> Add a constant for this.
> 

I think it would be better to simply use the generic NUMA_NO_NODE for this 
purpose, which is identical to how hugetlb, pxm mappings, etc, use it to 
specify no specific node affinity.
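
For illustration, a call site that currently passes a bare -1 would then
read as follows (hypothetical caller; NUMA_NO_NODE comes from
<linux/numa.h> and is defined as -1):

#include <linux/numa.h>
#include <linux/slab.h>

/* No preference for the node the object comes from. */
static void *alloc_from_any_node(size_t size)
{
        return kmalloc_node(size, GFP_KERNEL, NUMA_NO_NODE);
}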


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC V2 SLEB 01/14] slab: Introduce a constant for a unspecified node.
  2010-06-07 21:44   ` David Rientjes
@ 2010-06-07 22:30     ` Christoph Lameter
  2010-06-08  5:41       ` Pekka Enberg
  0 siblings, 1 reply; 89+ messages in thread
From: Christoph Lameter @ 2010-06-07 22:30 UTC (permalink / raw)
  To: David Rientjes; +Cc: Christoph Lameter, Pekka Enberg, linux-mm

On Mon, 7 Jun 2010, David Rientjes wrote:

> On Fri, 21 May 2010, Christoph Lameter wrote:
>
> > kmalloc_node() and friends can be passed a constant -1 to indicate
> > that no choice was made for the node from which the object needs to
> > come.
> >
> > Add a constant for this.
> >
>
> I think it would be better to simply use the generic NUMA_NO_NODE for this
> purpose, which is identical to how hugetlb, pxm mappings, etc, use it to
> specify no specific node affinity.

Ok will do that in the next release.


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC V2 SLEB 03/14] SLUB: Use kmem_cache flags to detect if Slab is in debugging mode.
  2010-05-21 21:14 ` [RFC V2 SLEB 03/14] SLUB: Use kmem_cache flags to detect if Slab is in debugging mode Christoph Lameter
@ 2010-06-08  3:57   ` David Rientjes
  0 siblings, 0 replies; 89+ messages in thread
From: David Rientjes @ 2010-06-08  3:57 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Pekka Enberg, linux-mm

On Fri, 21 May 2010, Christoph Lameter wrote:

> Index: linux-2.6/mm/slub.c
> ===================================================================
> --- linux-2.6.orig/mm/slub.c	2010-04-27 12:41:05.000000000 -0500
> +++ linux-2.6/mm/slub.c	2010-04-27 13:15:32.000000000 -0500
> @@ -107,11 +107,17 @@
>   * 			the fast path and disables lockless freelists.
>   */
>  
> +#define SLAB_DEBUG_FLAGS (SLAB_RED_ZONE | SLAB_POISON | SLAB_STORE_USER | \
> +		SLAB_TRACE | SLAB_DEBUG_FREE)
> +
> +static inline int debug_on(struct kmem_cache *s)
> +{
>  #ifdef CONFIG_SLUB_DEBUG
> -#define SLABDEBUG 1
> +	return unlikely(s->flags & SLAB_DEBUG_FLAGS);
>  #else
> -#define SLABDEBUG 0
> +	return 0;
> +	return 0;
>  #endif
> +}
>  
>  /*
>   * Issues still to be resolved:

Nice optimization!  I'd recommend a non-generic name for this check, 
though, such as cache_debug_on().
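
For illustration, the helper with the suggested name would read (same body
as the hunk above, only the name is changed):

#define SLAB_DEBUG_FLAGS (SLAB_RED_ZONE | SLAB_POISON | SLAB_STORE_USER | \
		SLAB_TRACE | SLAB_DEBUG_FREE)

static inline int cache_debug_on(struct kmem_cache *s)
{
#ifdef CONFIG_SLUB_DEBUG
	return unlikely(s->flags & SLAB_DEBUG_FLAGS);
#else
	return 0;
#endif
}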


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC V2 SLEB 01/14] slab: Introduce a constant for a unspecified node.
  2010-06-07 22:30     ` Christoph Lameter
@ 2010-06-08  5:41       ` Pekka Enberg
  2010-06-08  6:20         ` David Rientjes
  0 siblings, 1 reply; 89+ messages in thread
From: Pekka Enberg @ 2010-06-08  5:41 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: David Rientjes, Christoph Lameter, linux-mm

On Tue, Jun 8, 2010 at 1:30 AM, Christoph Lameter
<cl@linux-foundation.org> wrote:
> On Mon, 7 Jun 2010, David Rientjes wrote:
>
>> On Fri, 21 May 2010, Christoph Lameter wrote:
>>
>> > kmalloc_node() and friends can be passed a constant -1 to indicate
>> > that no choice was made for the node from which the object needs to
>> > come.
>> >
>> > Add a constant for this.
>> >
>>
>> I think it would be better to simply use the generic NUMA_NO_NODE for this
>> purpose, which is identical to how hugetlb, pxm mappings, etc, use it to
>> specify no specific node affinity.
>
> Ok will do that in the next release.

Patches 1-5 are queued for 2.6.36 so please send an incremental patch
on top of 'slub/cleanups' branch of slab.git.


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC V2 SLEB 01/14] slab: Introduce a constant for a unspecified node.
  2010-06-08  5:41       ` Pekka Enberg
@ 2010-06-08  6:20         ` David Rientjes
  2010-06-08  6:34           ` Pekka Enberg
  0 siblings, 1 reply; 89+ messages in thread
From: David Rientjes @ 2010-06-08  6:20 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: Christoph Lameter, linux-mm

On Tue, 8 Jun 2010, Pekka Enberg wrote:

> > Ok will do that in the next release.
> 
> Patches 1-5 are queued for 2.6.36 so please send an incremental patch
> on top of 'slub/cleanups' branch of slab.git.
> 

An incremental patch in this case would change everything that the 
original patch did, so it'd probably be best to simply revert and queue 
the updated version.


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC V2 SLEB 01/14] slab: Introduce a constant for a unspecified node.
  2010-06-08  6:20         ` David Rientjes
@ 2010-06-08  6:34           ` Pekka Enberg
  2010-06-08 23:35             ` David Rientjes
  0 siblings, 1 reply; 89+ messages in thread
From: Pekka Enberg @ 2010-06-08  6:34 UTC (permalink / raw)
  To: David Rientjes; +Cc: Christoph Lameter, linux-mm

Hi David,

On Tue, 8 Jun 2010, Pekka Enberg wrote:
>> > Ok will do that in the next release.
>>
>> Patches 1-5 are queued for 2.6.36 so please send an incremental patch
>> on top of 'slub/cleanups' branch of slab.git.

On Tue, Jun 8, 2010 at 9:20 AM, David Rientjes <rientjes@google.com> wrote:
> An incremental patch in this case would change everything that the
> original patch did, so it'd probably be best to simply revert and queue
> the updated version.

If I revert it, we end up with two commits instead of one. And I
really prefer not to *rebase* a topic branch even though it might be
doable for a small tree like slab.git.


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC V2 SLEB 05/14] SLUB: is_kmalloc_cache
  2010-05-21 21:14 ` [RFC V2 SLEB 05/14] SLUB: is_kmalloc_cache Christoph Lameter
@ 2010-06-08  8:54   ` David Rientjes
  0 siblings, 0 replies; 89+ messages in thread
From: David Rientjes @ 2010-06-08  8:54 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Pekka Enberg, linux-mm

On Fri, 21 May 2010, Christoph Lameter wrote:

> Index: linux-2.6/mm/slub.c
> ===================================================================
> --- linux-2.6.orig/mm/slub.c	2010-05-12 14:46:58.000000000 -0500
> +++ linux-2.6/mm/slub.c	2010-05-12 14:49:37.000000000 -0500
> @@ -312,6 +312,11 @@ static inline int oo_objects(struct kmem
>  	return x.x & OO_MASK;
>  }
>  
> +static int is_kmalloc_cache(struct kmem_cache *s)
> +{
> +	return (s < kmalloc_caches + KMALLOC_CACHES && s >= kmalloc_caches);
> +}
> +
>  #ifdef CONFIG_SLUB_DEBUG
>  /*
>   * Debug settings:
> @@ -2076,7 +2081,7 @@ static DEFINE_PER_CPU(struct kmem_cache_
>  
>  static inline int alloc_kmem_cache_cpus(struct kmem_cache *s, gfp_t flags)
>  {
> -	if (s < kmalloc_caches + KMALLOC_CACHES && s >= kmalloc_caches)
> +	if (is_kmalloc_cache(s))
>  		/*
>  		 * Boot time creation of the kmalloc array. Use static per cpu data
>  		 * since the per cpu allocator is not available yet.
> @@ -2158,8 +2163,7 @@ static int init_kmem_cache_nodes(struct 
>  	int node;
>  	int local_node;
>  
> -	if (slab_state >= UP && (s < kmalloc_caches ||
> -			s >= kmalloc_caches + KMALLOC_CACHES))
> +	if (slab_state >= UP && !is_kmalloc_cache(s))
>  		local_node = page_to_nid(virt_to_page(s));
>  	else
>  		local_node = 0;

Looks good, how about extending it to dma_kmalloc_cache() as well?
---
diff --git a/mm/slub.c b/mm/slub.c
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2641,13 +2641,12 @@ static noinline struct kmem_cache *dma_kmalloc_cache(int index, gfp_t flags)
 	text = kasprintf(flags & ~SLUB_DMA, "kmalloc_dma-%d",
 			 (unsigned int)realsize);
 
-	s = NULL;
 	for (i = 0; i < KMALLOC_CACHES; i++)
 		if (!kmalloc_caches[i].size)
 			break;
 
-	BUG_ON(i >= KMALLOC_CACHES);
 	s = kmalloc_caches + i;
+	BUG_ON(!is_kmalloc_cache(s));
 
 	/*
 	 * Must defer sysfs creation to a workqueue because we don't know


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC V2 SLEB 01/14] slab: Introduce a constant for a unspecified node.
  2010-06-08  6:34           ` Pekka Enberg
@ 2010-06-08 23:35             ` David Rientjes
  2010-06-09  5:55                 ` Pekka Enberg
  0 siblings, 1 reply; 89+ messages in thread
From: David Rientjes @ 2010-06-08 23:35 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: Christoph Lameter, linux-mm

On Tue, 8 Jun 2010, Pekka Enberg wrote:

> > An incremental patch in this case would change everything that the
> > original patch did, so it'd probably be best to simply revert and queue
> > the updated version.
> 
> If I revert it, we end up with two commits instead of one. And I
> really prefer not to *rebase* a topic branch even though it might be
> doable for a small tree like slab.git.
> 

I commented on improvements for three of the five patches you've added as 
slub cleanups and Christoph has shown an interest in proposing them again 
(perhaps separating patches 1-5 out as a separate set of cleanups?), so
it's probably cleaner to just reset and reapply with the revisions.  

Let me know if my suggested changes should be add-on patches to 
Christoph's first five and I'll come up with a three patch series to do 
just that.


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC V2 SLEB 01/14] slab: Introduce a constant for a unspecified  node.
  2010-06-08 23:35             ` David Rientjes
@ 2010-06-09  5:55                 ` Pekka Enberg
  0 siblings, 0 replies; 89+ messages in thread
From: Pekka Enberg @ 2010-06-09  5:55 UTC (permalink / raw)
  To: David Rientjes; +Cc: Christoph Lameter, linux-mm, LKML, Ingo Molnar

Hi David,

(I'm adding LKML and Ingo to CC.)

On Tue, 8 Jun 2010, Pekka Enberg wrote:
>> > An incremental patch in this case would change everything that the
>> > original patch did, so it'd probably be best to simply revert and queue
>> > the updated version.
>>
>> If I revert it, we end up with two commits instead of one. And I
>> really prefer not to *rebase* a topic branch even though it might be
>> doable for a small tree like slab.git.

On Wed, Jun 9, 2010 at 2:35 AM, David Rientjes <rientjes@google.com> wrote:
> I commented on improvements for three of the five patches you've added as
> slub cleanups and Christoph has shown an interest in proposing them again
> (perhaps separating patches 1-5 out as a separate set of cleanups?), so
> it's probably cleaner to just reset and reapply with the revisions.

As I said, we can probably get away with that in slab.git because
we're so small but that doesn't work in general.

If we ignore how painful the actual rebase operation is
(there's a 'sleb/core' branch that shares the commits), I don't think
the revised history is 'cleaner' by any means. The current patches are
known to be good (I've tested them), but if I just replace them, all
the testing effort is basically wasted. So if I need to do a
git-bisect, for example, I don't benefit one bit from having tested
the original patches.

The other issue is patch metadata. If I just nuke the existing
patches, I could also be dropping important stuff like Tested-by or
Reported-by tags. Yes, I realize that in this particular case there
are none, but the approach works only as long as you remember exactly
what you merged.

There are probably other benefits for larger trees but those two are
enough for me to keep my published branches append-only.

On Wed, Jun 9, 2010 at 2:35 AM, David Rientjes <rientjes@google.com> wrote:
> Let me know if my suggested changes should be add-on patches to
> Christoph's first five and I'll come up with a three patch series to do
> just that.

Yes, I really would prefer incremental patches on top of the
'slub/cleanups' branch.

                        Pekka

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC V2 SLEB 06/14] SLUB: Get rid of the kmalloc_node slab
  2010-05-21 21:14 ` [RFC V2 SLEB 06/14] SLUB: Get rid of the kmalloc_node slab Christoph Lameter
@ 2010-06-09  6:14   ` David Rientjes
  2010-06-09 16:14     ` Christoph Lameter
  0 siblings, 1 reply; 89+ messages in thread
From: David Rientjes @ 2010-06-09  6:14 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Pekka Enberg, linux-mm

On Fri, 21 May 2010, Christoph Lameter wrote:

> Currently bootstrap works with the kmalloc_node slab.

s/kmalloc_node/kmem_cache_node/

> We can avoid
> creating that slab and boot using allocation from a kmalloc array slab
> instead. This is necessary for the future if we want to dynamically
> size kmem_cache structures.
> 
> Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
> 
> ---
>  mm/slub.c |   39 ++++++++++++++++++++++++---------------
>  1 file changed, 24 insertions(+), 15 deletions(-)
> 
> Index: linux-2.6/mm/slub.c
> ===================================================================
> --- linux-2.6.orig/mm/slub.c	2010-05-20 14:26:53.000000000 -0500
> +++ linux-2.6/mm/slub.c	2010-05-20 14:37:19.000000000 -0500
> @@ -2111,10 +2111,11 @@ static void early_kmem_cache_node_alloc(
>  	struct page *page;
>  	struct kmem_cache_node *n;
>  	unsigned long flags;
> +	int i = kmalloc_index(sizeof(struct kmem_cache_node));
>  

const int?


Maybe even better would be

	struct kmem_cache *s = kmalloc_caches + i;

to make the rest of this easier?

> -	BUG_ON(kmalloc_caches->size < sizeof(struct kmem_cache_node));
> +	BUG_ON(kmalloc_caches[i].size < sizeof(struct kmem_cache_node));
>  
> -	page = new_slab(kmalloc_caches, gfpflags, node);
> +	page = new_slab(kmalloc_caches + i, gfpflags, node);
>  
>  	BUG_ON(!page);
>  	if (page_to_nid(page) != node) {
> @@ -2126,15 +2127,15 @@ static void early_kmem_cache_node_alloc(
>  
>  	n = page->freelist;
>  	BUG_ON(!n);

I don't think we need this BUG_ON() anymore, but that's a seperate issue.

> -	page->freelist = get_freepointer(kmalloc_caches, n);
> +	page->freelist = get_freepointer(kmalloc_caches + i, n);
>  	page->inuse++;
> -	kmalloc_caches->node[node] = n;
> +	kmalloc_caches[i].node[node] = n;
>  #ifdef CONFIG_SLUB_DEBUG
> -	init_object(kmalloc_caches, n, 1);
> -	init_tracking(kmalloc_caches, n);
> +	init_object(kmalloc_caches + i, n, 1);
> +	init_tracking(kmalloc_caches + i, n);
>  #endif
> -	init_kmem_cache_node(n, kmalloc_caches);
> -	inc_slabs_node(kmalloc_caches, node, page->objects);
> +	init_kmem_cache_node(n, kmalloc_caches + i);
> +	inc_slabs_node(kmalloc_caches + i, node, page->objects);
>  
>  	/*
>  	 * lockdep requires consistent irq usage for each lock
> @@ -2152,8 +2153,9 @@ static void free_kmem_cache_nodes(struct
>  
>  	for_each_node_state(node, N_NORMAL_MEMORY) {
>  		struct kmem_cache_node *n = s->node[node];
> +
>  		if (n && n != &s->local_node)
> -			kmem_cache_free(kmalloc_caches, n);
> +			kfree(n);
>  		s->node[node] = NULL;
>  	}
>  }
> @@ -2178,8 +2180,8 @@ static int init_kmem_cache_nodes(struct 
>  				early_kmem_cache_node_alloc(gfpflags, node);
>  				continue;
>  			}
> -			n = kmem_cache_alloc_node(kmalloc_caches,
> -							gfpflags, node);
> +			n = kmalloc_node(sizeof(struct kmem_cache_node), gfpflags,
> +				node);
>  
>  			if (!n) {
>  				free_kmem_cache_nodes(s);
> @@ -2574,6 +2576,12 @@ static struct kmem_cache *create_kmalloc
>  {
>  	unsigned int flags = 0;
>  
> +	if (s->size) {
> +		s->name = name;

Do we need this?  The iteration at the end of kmem_cache_init() should 
reset this kmalloc cache to have the standard kmalloc-<size> name, so I 
don't think we need to reset "bootstrap" here.

> +		/* Already created */
> +		return s;
> +	}
> +
>  	if (gfp_flags & SLUB_DMA)
>  		flags = SLAB_CACHE_DMA;
>  
> @@ -2978,7 +2986,7 @@ static void slab_mem_offline_callback(vo
>  			BUG_ON(slabs_node(s, offline_node));
>  
>  			s->node[offline_node] = NULL;
> -			kmem_cache_free(kmalloc_caches, n);
> +			kfree(n);
>  		}
>  	}
>  	up_read(&slub_lock);
> @@ -3011,7 +3019,7 @@ static int slab_mem_going_online_callbac
>  		 *      since memory is not yet available from the node that
>  		 *      is brought up.
>  		 */
> -		n = kmem_cache_alloc(kmalloc_caches, GFP_KERNEL);
> +		n = kmalloc(sizeof(struct kmem_cache_node), GFP_KERNEL);
>  		if (!n) {
>  			ret = -ENOMEM;
>  			goto out;
> @@ -3068,9 +3076,10 @@ void __init kmem_cache_init(void)
>  	 * struct kmem_cache_node's. There is special bootstrap code in
>  	 * kmem_cache_open for slab_state == DOWN.
>  	 */
> -	create_kmalloc_cache(&kmalloc_caches[0], "kmem_cache_node",
> +	i = kmalloc_index(sizeof(struct kmem_cache_node));
> +	create_kmalloc_cache(&kmalloc_caches[i], "bootstrap",
>  		sizeof(struct kmem_cache_node), GFP_NOWAIT);
> -	kmalloc_caches[0].refcount = -1;
> +	kmalloc_caches[i].refcount = -1;
>  	caches++;
>  
>  	hotplug_memory_notifier(slab_memory_callback, SLAB_CALLBACK_PRI);

So kmalloc_caches[0] will never be used after this change, then?


We could also remove the gfp_t argument to create_kmalloc_cache(); it's
not used for anything other than GFP_NOWAIT anymore.


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC V2 SLEB 01/14] slab: Introduce a constant for a unspecified node.
  2010-06-09  5:55                 ` Pekka Enberg
@ 2010-06-09  6:20                   ` David Rientjes
  -1 siblings, 0 replies; 89+ messages in thread
From: David Rientjes @ 2010-06-09  6:20 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: Christoph Lameter, linux-mm, LKML, Ingo Molnar

On Wed, 9 Jun 2010, Pekka Enberg wrote:

> As I said, we can probably get away with that in slab.git because
> we're so small but that doesn't work in general.
> 
> If we ignore how painful the actual rebase operation is
> (there's a 'sleb/core' branch that shares the commits), I don't think
> the revised history is 'cleaner' by any means. The current patches are
> known to be good (I've tested them), but if I just replace them, all
> the testing effort is basically wasted. So if I need to do a
> git-bisect, for example, I don't benefit one bit from having tested
> the original patches.
> 
> The other issue is patch metadata. If I just nuke the existing
> patches, I could also be dropping important stuff like Tested-by or
> Reported-by tags. Yes, I realize that in this particular case there
> are none, but the approach works only as long as you remember exactly
> what you merged.
> 
> There are probably other benefits for larger trees but those two are
> enough for me to keep my published branches append-only.
> 

I wasn't really trying to suggest an alternative way to do it for all git
trees; I just thought that, since Christoph wanted to repropose these
changes in another set and there's no harm in doing it within slab.git
right now, you'd have no problem making an exception in this case just
for a cleaner history later.

If you'd like to keep a commit that is then completely obsoleted by 
another commit when it's on the tip of your tree right now and could 
be reverted with minimal work simply to follow this general principle, 
that's fine :)

> > Let me know if my suggested changes should be add-on patches to
> > Christoph's first five and I'll come up with a three patch series to do
> > just that.
> 
> Yes, I really would prefer incremental patches on top of the
> 'slub/cleanups' branch.
> 

Ok then, I'll send incremental changes based on my feedback of patches 
1-5.  Thanks!

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC V2 SLEB 06/14] SLUB: Get rid of the kmalloc_node slab
  2010-06-09  6:14   ` David Rientjes
@ 2010-06-09 16:14     ` Christoph Lameter
  2010-06-09 16:26       ` Pekka Enberg
  0 siblings, 1 reply; 89+ messages in thread
From: Christoph Lameter @ 2010-06-09 16:14 UTC (permalink / raw)
  To: David Rientjes; +Cc: Pekka Enberg, linux-mm


The patch needs a rework since it sometimes calculates the wrong kmalloc
slab. Value needs to be rounded up to the next kmalloc slab size. This
problem shows up if CONFIG_SLUB_DEBUG is enabled.

Please do not merge patches that are marked "RFC". That usually means
that I am not satisfied with their quality yet.



^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC V2 SLEB 06/14] SLUB: Get rid of the kmalloc_node slab
  2010-06-09 16:14     ` Christoph Lameter
@ 2010-06-09 16:26       ` Pekka Enberg
  2010-06-10  6:07         ` Pekka Enberg
  0 siblings, 1 reply; 89+ messages in thread
From: Pekka Enberg @ 2010-06-09 16:26 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: David Rientjes, linux-mm

Hi Christoph,

On Wed, Jun 9, 2010 at 7:14 PM, Christoph Lameter
<cl@linux-foundation.org> wrote:
> The patch needs a rework since it sometimes calculates the wrong kmalloc
> slab. Value needs to be rounded up to the next kmalloc slab size. This
> problem shows up if CONFIG_SLUB_DEBUG is enabled.
>
> Please do not merge patches that are marked "RFC". That usually means
> that I am not satisfied with their quality yet.

I actually _asked_ you whether or not it's OK to merge patches 1-5. Do
you want to guess what you said? I have it all in writing stashed in
my mailbox if you want to refresh your memory. ;-)

                        Pekka


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC V2 SLEB 06/14] SLUB: Get rid of the kmalloc_node slab
  2010-06-09 16:26       ` Pekka Enberg
@ 2010-06-10  6:07         ` Pekka Enberg
  0 siblings, 0 replies; 89+ messages in thread
From: Pekka Enberg @ 2010-06-10  6:07 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: David Rientjes, linux-mm

On Wed, Jun 9, 2010 at 7:14 PM, Christoph Lameter
> <cl@linux-foundation.org> wrote:
>> The patch needs a rework since it sometimes calculates the wrong kmalloc
>> slab. Value needs to be rounded up to the next kmalloc slab size. This
>> problem shows up if CONFIG_SLUB_DEBUG is enabled.
>>
>> Please do not merge patches that are marked "RFC". That usually means
>> that I am not satisfied with their quality yet.

On Wed, Jun 9, 2010 at 7:26 PM, Pekka Enberg <penberg@cs.helsinki.fi> wrote:
> I actually _asked_ you whether or not it's OK to merge patches 1-5. Do
> you want to guess what you said? I have it all in writing stashed in
> my mailbox if you want to refresh your memory. ;-)

I only now realized you were talking about patch number six which
shouldn't have been merged. Sorry for that, Christoph.

                          Pekka


^ permalink raw reply	[flat|nested] 89+ messages in thread

end of thread

Thread overview: 89+ messages
2010-05-21 21:14 [RFC V2 SLEB 00/14] The Enhanced(hopefully) Slab Allocator Christoph Lameter
2010-05-21 21:14 ` [RFC V2 SLEB 01/14] slab: Introduce a constant for a unspecified node Christoph Lameter
2010-06-07 21:44   ` David Rientjes
2010-06-07 22:30     ` Christoph Lameter
2010-06-08  5:41       ` Pekka Enberg
2010-06-08  6:20         ` David Rientjes
2010-06-08  6:34           ` Pekka Enberg
2010-06-08 23:35             ` David Rientjes
2010-06-09  5:55               ` Pekka Enberg
2010-06-09  5:55                 ` Pekka Enberg
2010-06-09  6:20                 ` David Rientjes
2010-06-09  6:20                   ` David Rientjes
2010-05-21 21:14 ` [RFC V2 SLEB 02/14] SLUB: Constants need UL Christoph Lameter
2010-05-21 21:14 ` [RFC V2 SLEB 03/14] SLUB: Use kmem_cache flags to detect if Slab is in debugging mode Christoph Lameter
2010-06-08  3:57   ` David Rientjes
2010-05-21 21:14 ` [RFC V2 SLEB 04/14] SLUB: discard_slab_unlock Christoph Lameter
2010-05-21 21:14 ` [RFC V2 SLEB 05/14] SLUB: is_kmalloc_cache Christoph Lameter
2010-06-08  8:54   ` David Rientjes
2010-05-21 21:14 ` [RFC V2 SLEB 06/14] SLUB: Get rid of the kmalloc_node slab Christoph Lameter
2010-06-09  6:14   ` David Rientjes
2010-06-09 16:14     ` Christoph Lameter
2010-06-09 16:26       ` Pekka Enberg
2010-06-10  6:07         ` Pekka Enberg
2010-05-21 21:14 ` [RFC V2 SLEB 07/14] SLEB: The Enhanced Slab Allocator Christoph Lameter
2010-05-21 21:15 ` [RFC V2 SLEB 08/14] SLEB: Resize cpu queue Christoph Lameter
2010-05-21 21:15 ` [RFC V2 SLEB 09/14] SLED: Get rid of useless function Christoph Lameter
2010-05-21 21:15 ` [RFC V2 SLEB 10/14] SLEB: Remove MAX_OBJS limitation Christoph Lameter
2010-05-21 21:15 ` [RFC V2 SLEB 11/14] SLEB: Add per node cache (with a fixed size for now) Christoph Lameter
2010-05-21 21:15 ` [RFC V2 SLEB 12/14] SLEB: Make the size of the shared cache configurable Christoph Lameter
2010-05-21 21:15 ` [RFC V2 SLEB 13/14] SLEB: Enhanced NUMA support Christoph Lameter
2010-05-21 21:15 ` [RFC V2 SLEB 14/14] SLEB: Allocate off node objects from remote shared caches Christoph Lameter
2010-05-22  8:37 ` [RFC V2 SLEB 00/14] The Enhanced(hopefully) Slab Allocator Pekka Enberg
2010-05-24  7:03 ` Nick Piggin
2010-05-24 15:06   ` Christoph Lameter
2010-05-25  2:06     ` Nick Piggin
2010-05-25  6:55       ` Pekka Enberg
2010-05-25  7:07         ` Nick Piggin
2010-05-25  8:03           ` Pekka Enberg
2010-05-25  8:03             ` Pekka Enberg
2010-05-25  8:16             ` Nick Piggin
2010-05-25  8:16               ` Nick Piggin
2010-05-25  9:19               ` Pekka Enberg
2010-05-25  9:19                 ` Pekka Enberg
2010-05-25  9:34                 ` Nick Piggin
2010-05-25  9:34                   ` Nick Piggin
2010-05-25  9:53                   ` Pekka Enberg
2010-05-25  9:53                     ` Pekka Enberg
2010-05-25 10:19                     ` Nick Piggin
2010-05-25 10:19                       ` Nick Piggin
2010-05-25 10:45                       ` Pekka Enberg
2010-05-25 10:45                         ` Pekka Enberg
2010-05-25 11:06                         ` Nick Piggin
2010-05-25 11:06                           ` Nick Piggin
2010-05-25 15:13                         ` Linus Torvalds
2010-05-25 15:13                           ` Linus Torvalds
2010-05-25 15:43                           ` Nick Piggin
2010-05-25 15:43                             ` Nick Piggin
2010-05-25 17:02                             ` Pekka Enberg
2010-05-25 17:02                               ` Pekka Enberg
2010-05-25 17:19                               ` Nick Piggin
2010-05-25 17:19                                 ` Nick Piggin
2010-05-25 17:35                                 ` Pekka Enberg
2010-05-25 17:35                                   ` Pekka Enberg
2010-05-25 17:40                                   ` Nick Piggin
2010-05-25 17:40                                     ` Nick Piggin
2010-05-25 10:07               ` David Rientjes
2010-05-25 10:07                 ` David Rientjes
2010-05-25 10:02             ` David Rientjes
2010-05-25 10:02               ` David Rientjes
2010-05-25 10:47               ` Pekka Enberg
2010-05-25 10:47                 ` Pekka Enberg
2010-05-25 19:57                 ` David Rientjes
2010-05-25 19:57                   ` David Rientjes
2010-05-25 14:13       ` Christoph Lameter
2010-05-25 14:34         ` Nick Piggin
2010-05-25 14:43           ` Nick Piggin
2010-05-25 14:48           ` Christoph Lameter
2010-05-25 15:11             ` Nick Piggin
2010-05-25 15:28               ` Christoph Lameter
2010-05-25 15:37                 ` Nick Piggin
2010-05-27 14:24                   ` Christoph Lameter
2010-05-27 14:37                     ` Nick Piggin
2010-05-27 15:52                       ` Christoph Lameter
2010-05-27 16:07                         ` Nick Piggin
2010-05-27 16:57                           ` Christoph Lameter
2010-05-28  8:39                             ` Nick Piggin
2010-05-25 14:40         ` Nick Piggin
2010-05-25 14:48           ` Christoph Lameter
2010-05-25 15:12             ` Nick Piggin
