* [slub p4 0/7] slub: per cpu partial lists V4
@ 2011-08-09 21:12 Christoph Lameter
  2011-08-09 21:12 ` [slub p4 1/7] slub: free slabs without holding locks (V2) Christoph Lameter
                   ` (8 more replies)
  0 siblings, 9 replies; 20+ messages in thread
From: Christoph Lameter @ 2011-08-09 21:12 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: David Rientjes, Andi Kleen, tj, Metathronius Galabant,
	Matt Mackall, Eric Dumazet, Adrian Drzewiecki, linux-kernel

V3->V4 : Use a singly linked per cpu list instead of a per cpu array.
	This results in improvements even for the single threaded
	case. I think this is ready for more widespread testing (-next?)
	The number of partial pages per cpu is configurable via
	/sys/kernel/slab/<name>/cpu_partial
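
	A minimal userspace sketch of poking that knob (the cache name
	"kmalloc-64" is just an example; any directory under /sys/kernel/slab
	works, and the write requires root):

	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		char buf[32];
		ssize_t n;
		/* example cache; pick any cache present on your system */
		int fd = open("/sys/kernel/slab/kmalloc-64/cpu_partial", O_RDWR);

		if (fd < 0) {
			perror("open");
			return 1;
		}

		n = read(fd, buf, sizeof(buf) - 1);
		if (n > 0) {
			buf[n] = '\0';
			printf("current cpu_partial: %s", buf);
		}

		/* raise the per cpu partial limit to 60 objects */
		if (pwrite(fd, "60\n", 3, 0) < 0)
			perror("pwrite");

		close(fd);
		return 0;
	}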

V2->V3 : Work on the todo list. Still some work to be done to reduce
         code impact and make this all cleaner. (Pekka: patch 1-3
         are cleanup patches of general usefulness. You got #1 already
         2+3 could be picked up w/o any issue).

The following patchset introduces per cpu partial lists which allow
a performance increase of roughly 10-20% with hackbench on my Sandybridge
processor.

These lists help to avoid per node locking overhead. Allocator latency
could be further reduced by making these operations work without
disabling interrupts (like the fastpath and the free slowpath) but that
is another project.

It is interesting to note that BSD has gone to a scheme with partial
pages only per cpu (source: Adrian). Transfer of cpu ownership is
done using IPIs, which is probably too much overhead for our taste. The
approach here keeps the per node partial lists, essentially meaning the
"pages" in there have no cpu owner.
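
As a rough mental model of what the later patches add, here is a stripped-down
userspace sketch. This is not the kernel code: the real implementation links
frozen slabs through struct page fields and updates them with cmpxchg (see
patch 6); names and counters below are simplified.

#include <stdio.h>

/* Toy stand-in for a slab page; the kernel reuses struct page fields. */
struct partial_page {
	struct partial_page *next;	/* next frozen partial slab */
	int pobjects;			/* approx. free objects from here down */
};

/* Toy stand-in for kmem_cache_cpu: one list head per cpu. */
struct cpu_cache {
	struct partial_page *partial;
};

/* Free path: push a partially empty slab without touching the node lock. */
static void put_cpu_partial(struct cpu_cache *c, struct partial_page *page,
			    int free_objects)
{
	page->pobjects = free_objects +
			 (c->partial ? c->partial->pobjects : 0);
	page->next = c->partial;
	c->partial = page;
}

/* Alloc slow path: pop one cached slab and allocate from it. */
static struct partial_page *take_cpu_partial(struct cpu_cache *c)
{
	struct partial_page *page = c->partial;

	if (page)
		c->partial = page->next;
	return page;
}

int main(void)
{
	struct cpu_cache c = { 0 };
	struct partial_page a = { 0 }, b = { 0 };

	put_cpu_partial(&c, &a, 3);
	put_cpu_partial(&c, &b, 5);
	printf("objects cached per cpu: %d\n", c.partial->pobjects);	/* 8 */
	take_cpu_partial(&c);
	printf("after one refill: %d\n", c.partial->pobjects);		/* 3 */
	return 0;
}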


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [slub p4 1/7] slub: free slabs without holding locks (V2)
  2011-08-09 21:12 [slub p4 0/7] slub: per cpu partial lists V4 Christoph Lameter
@ 2011-08-09 21:12 ` Christoph Lameter
  2011-08-20 10:32   ` Pekka Enberg
  2011-08-09 21:12 ` [slub p4 2/7] slub: Remove useless statements in __slab_alloc Christoph Lameter
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 20+ messages in thread
From: Christoph Lameter @ 2011-08-09 21:12 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: David Rientjes, Andi Kleen, tj, Metathronius Galabant,
	Matt Mackall, Eric Dumazet, Adrian Drzewiecki, linux-kernel

[-- Attachment #1: slub_free_wo_locks --]
[-- Type: text/plain, Size: 3062 bytes --]

There are two situations in which slub holds a lock while releasing
pages:

	A. During kmem_cache_shrink()
	B. During kmem_cache_close()

For A, build a list while holding the lock and then release the pages
later. In case B we are the last remaining user of the slab, so
there is no need to take the list_lock.

After this patch all calls to the page allocator to free pages are
done without holding any spinlocks. kmem_cache_destroy() will still
hold the slub_lock semaphore.

V1->V2. Remove kfree. Avoid locking in free_partial.
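
The trick used for case A is the generic one for getting work out of a
critical section: unlink the victims onto a private list under the lock, then
free them after the unlock. A self-contained userspace sketch of that idiom
(plain pthreads and malloc, not the slub code):

#include <pthread.h>
#include <stdlib.h>

struct node {
	struct node *next;
	int in_use;
};

static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;
static struct node *partial;	/* shared list, protected by list_lock */

/* Remove all unused nodes but call free() only after dropping the lock. */
static void shrink(void)
{
	struct node *discard = NULL;	/* private list, no locking needed */
	struct node **pp, *n;

	pthread_mutex_lock(&list_lock);
	for (pp = &partial; (n = *pp) != NULL; ) {
		if (!n->in_use) {
			*pp = n->next;		/* unlink while locked */
			n->next = discard;
			discard = n;
		} else
			pp = &n->next;
	}
	pthread_mutex_unlock(&list_lock);

	while (discard) {		/* the "page allocator" calls, lockless */
		n = discard;
		discard = discard->next;
		free(n);
	}
}

int main(void)
{
	int i;

	for (i = 0; i < 4; i++) {
		struct node *n = calloc(1, sizeof(*n));

		n->in_use = i & 1;
		n->next = partial;
		partial = n;
	}
	shrink();
	return 0;
}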

Signed-off-by: Christoph Lameter <cl@linux.com>

---
 mm/slub.c |   26 +++++++++++++-------------
 1 file changed, 13 insertions(+), 13 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2011-08-09 13:01:59.071582163 -0500
+++ linux-2.6/mm/slub.c	2011-08-09 13:05:00.051582012 -0500
@@ -2970,13 +2970,13 @@ static void list_slab_objects(struct kme
 
 /*
  * Attempt to free all partial slabs on a node.
+ * This is called from kmem_cache_close(). We must be the last thread
+ * using the cache and therefore we do not need to lock anymore.
  */
 static void free_partial(struct kmem_cache *s, struct kmem_cache_node *n)
 {
-	unsigned long flags;
 	struct page *page, *h;
 
-	spin_lock_irqsave(&n->list_lock, flags);
 	list_for_each_entry_safe(page, h, &n->partial, lru) {
 		if (!page->inuse) {
 			remove_partial(n, page);
@@ -2986,7 +2986,6 @@ static void free_partial(struct kmem_cac
 				"Objects remaining on kmem_cache_close()");
 		}
 	}
-	spin_unlock_irqrestore(&n->list_lock, flags);
 }
 
 /*
@@ -3020,6 +3019,7 @@ void kmem_cache_destroy(struct kmem_cach
 	s->refcount--;
 	if (!s->refcount) {
 		list_del(&s->list);
+		up_write(&slub_lock);
 		if (kmem_cache_close(s)) {
 			printk(KERN_ERR "SLUB %s: %s called for cache that "
 				"still has objects.\n", s->name, __func__);
@@ -3028,8 +3028,8 @@ void kmem_cache_destroy(struct kmem_cach
 		if (s->flags & SLAB_DESTROY_BY_RCU)
 			rcu_barrier();
 		sysfs_slab_remove(s);
-	}
-	up_write(&slub_lock);
+	} else
+		up_write(&slub_lock);
 }
 EXPORT_SYMBOL(kmem_cache_destroy);
 
@@ -3347,23 +3347,23 @@ int kmem_cache_shrink(struct kmem_cache
 		 * list_lock. page->inuse here is the upper limit.
 		 */
 		list_for_each_entry_safe(page, t, &n->partial, lru) {
-			if (!page->inuse) {
-				remove_partial(n, page);
-				discard_slab(s, page);
-			} else {
-				list_move(&page->lru,
-				slabs_by_inuse + page->inuse);
-			}
+			list_move(&page->lru, slabs_by_inuse + page->inuse);
+			if (!page->inuse)
+				n->nr_partial--;
 		}
 
 		/*
 		 * Rebuild the partial list with the slabs filled up most
 		 * first and the least used slabs at the end.
 		 */
-		for (i = objects - 1; i >= 0; i--)
+		for (i = objects - 1; i > 0; i--)
 			list_splice(slabs_by_inuse + i, n->partial.prev);
 
 		spin_unlock_irqrestore(&n->list_lock, flags);
+
+		/* Release empty slabs */
+		list_for_each_entry_safe(page, t, slabs_by_inuse, lru)
+			discard_slab(s, page);
 	}
 
 	kfree(slabs_by_inuse);


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [slub p4 2/7] slub: Remove useless statements in __slab_alloc
  2011-08-09 21:12 [slub p4 0/7] slub: per cpu partial lists V4 Christoph Lameter
  2011-08-09 21:12 ` [slub p4 1/7] slub: free slabs without holding locks (V2) Christoph Lameter
@ 2011-08-09 21:12 ` Christoph Lameter
  2011-08-20 10:44   ` Pekka Enberg
  2011-08-09 21:12 ` [slub p4 3/7] slub: Prepare inuse field in new_slab() Christoph Lameter
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 20+ messages in thread
From: Christoph Lameter @ 2011-08-09 21:12 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: David Rientjes, torvalds, Andi Kleen, tj, Metathronius Galabant,
	Matt Mackall, Eric Dumazet, Adrian Drzewiecki, linux-kernel

[-- Attachment #1: remove_useless_page_null --]
[-- Type: text/plain, Size: 1155 bytes --]

Two statements in __slab_alloc() do not have any effect.

1. c->page is already set to NULL by deactivate_slab() called right before.

2. gfpflags are masked in new_slab() before being passed to the page
   allocator. There is no need to mask gfpflags in __slab_alloc() in
   particular since the most frequent processing in __slab_alloc() does not
   require the use of a gfp mask.

Cc: torvalds@linux-foundation.org
Signed-off-by: Christoph Lameter <cl@linux.com>

---
 mm/slub.c |    4 ----
 1 file changed, 4 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2011-08-01 11:03:15.000000000 -0500
+++ linux-2.6/mm/slub.c	2011-08-01 11:04:06.385859038 -0500
@@ -2064,9 +2064,6 @@ static void *__slab_alloc(struct kmem_ca
 	c = this_cpu_ptr(s->cpu_slab);
 #endif
 
-	/* We handle __GFP_ZERO in the caller */
-	gfpflags &= ~__GFP_ZERO;
-
 	page = c->page;
 	if (!page)
 		goto new_slab;
@@ -2163,7 +2160,6 @@ debug:
 
 	c->freelist = get_freepointer(s, object);
 	deactivate_slab(s, c);
-	c->page = NULL;
 	c->node = NUMA_NO_NODE;
 	local_irq_restore(flags);
 	return object;


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [slub p4 3/7] slub: Prepare inuse field in new_slab()
  2011-08-09 21:12 [slub p4 0/7] slub: per cpu partial lists V4 Christoph Lameter
  2011-08-09 21:12 ` [slub p4 1/7] slub: free slabs without holding locks (V2) Christoph Lameter
  2011-08-09 21:12 ` [slub p4 2/7] slub: Remove useless statements in __slab_alloc Christoph Lameter
@ 2011-08-09 21:12 ` Christoph Lameter
  2011-08-09 21:12 ` [slub p4 4/7] slub: pass kmem_cache_cpu pointer to get_partial() Christoph Lameter
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 20+ messages in thread
From: Christoph Lameter @ 2011-08-09 21:12 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: David Rientjes, Andi Kleen, tj, Metathronius Galabant,
	Matt Mackall, Eric Dumazet, Adrian Drzewiecki, linux-kernel

[-- Attachment #1: new_slab --]
[-- Type: text/plain, Size: 1180 bytes --]

inuse will always be set to page->objects. There is no point in
initializing the field to zero in new_slab() and then overwriting
the value in __slab_alloc().

Signed-off-by: Christoph Lameter <cl@linux.com>


---
 mm/slub.c |    5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2011-08-09 13:05:06.211582007 -0500
+++ linux-2.6/mm/slub.c	2011-08-09 13:05:07.091582007 -0500
@@ -1447,7 +1447,7 @@ static struct page *new_slab(struct kmem
 	set_freepointer(s, last, NULL);
 
 	page->freelist = start;
-	page->inuse = 0;
+	page->inuse = page->objects;
 	page->frozen = 1;
 out:
 	return page;
@@ -2139,7 +2139,6 @@ new_slab:
 		 */
 		object = page->freelist;
 		page->freelist = NULL;
-		page->inuse = page->objects;
 
 		stat(s, ALLOC_SLAB);
 		c->node = page_to_nid(page);
@@ -2681,7 +2680,7 @@ static void early_kmem_cache_node_alloc(
 	n = page->freelist;
 	BUG_ON(!n);
 	page->freelist = get_freepointer(kmem_cache_node, n);
-	page->inuse++;
+	page->inuse = 1;
 	page->frozen = 0;
 	kmem_cache_node->node[node] = n;
 #ifdef CONFIG_SLUB_DEBUG


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [slub p4 4/7] slub: pass kmem_cache_cpu pointer to get_partial()
  2011-08-09 21:12 [slub p4 0/7] slub: per cpu partial lists V4 Christoph Lameter
                   ` (2 preceding siblings ...)
  2011-08-09 21:12 ` [slub p4 3/7] slub: Prepare inuse field in new_slab() Christoph Lameter
@ 2011-08-09 21:12 ` Christoph Lameter
  2011-08-09 21:12 ` [slub p4 5/7] slub: return object pointer from get_partial() / new_slab() Christoph Lameter
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 20+ messages in thread
From: Christoph Lameter @ 2011-08-09 21:12 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: David Rientjes, Andi Kleen, tj, Metathronius Galabant,
	Matt Mackall, Eric Dumazet, Adrian Drzewiecki, linux-kernel

[-- Attachment #1: push_c_into_get_partial --]
[-- Type: text/plain, Size: 3456 bytes --]

Pass the kmem_cache_cpu pointer to get_partial(). That way
we can avoid the this_cpu_write() statements.

Signed-off-by: Christoph Lameter <cl@linux.com>


---
 mm/slub.c |   30 +++++++++++++++---------------
 1 file changed, 15 insertions(+), 15 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2011-08-01 11:04:26.025858912 -0500
+++ linux-2.6/mm/slub.c	2011-08-01 11:04:29.985858887 -0500
@@ -1557,7 +1557,8 @@ static inline void remove_partial(struct
  * Must hold list_lock.
  */
 static inline int acquire_slab(struct kmem_cache *s,
-		struct kmem_cache_node *n, struct page *page)
+		struct kmem_cache_node *n, struct page *page,
+		struct kmem_cache_cpu *c)
 {
 	void *freelist;
 	unsigned long counters;
@@ -1586,9 +1587,9 @@ static inline int acquire_slab(struct km
 
 	if (freelist) {
 		/* Populate the per cpu freelist */
-		this_cpu_write(s->cpu_slab->freelist, freelist);
-		this_cpu_write(s->cpu_slab->page, page);
-		this_cpu_write(s->cpu_slab->node, page_to_nid(page));
+		c->freelist = freelist;
+		c->page = page;
+		c->node = page_to_nid(page);
 		return 1;
 	} else {
 		/*
@@ -1606,7 +1607,7 @@ static inline int acquire_slab(struct km
  * Try to allocate a partial slab from a specific node.
  */
 static struct page *get_partial_node(struct kmem_cache *s,
-					struct kmem_cache_node *n)
+		struct kmem_cache_node *n, struct kmem_cache_cpu *c)
 {
 	struct page *page;
 
@@ -1621,7 +1622,7 @@ static struct page *get_partial_node(str
 
 	spin_lock(&n->list_lock);
 	list_for_each_entry(page, &n->partial, lru)
-		if (acquire_slab(s, n, page))
+		if (acquire_slab(s, n, page, c))
 			goto out;
 	page = NULL;
 out:
@@ -1632,7 +1633,8 @@ out:
 /*
  * Get a page from somewhere. Search in increasing NUMA distances.
  */
-static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags)
+static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags,
+		struct kmem_cache_cpu *c)
 {
 #ifdef CONFIG_NUMA
 	struct zonelist *zonelist;
@@ -1672,7 +1674,7 @@ static struct page *get_any_partial(stru
 
 		if (n && cpuset_zone_allowed_hardwall(zone, flags) &&
 				n->nr_partial > s->min_partial) {
-			page = get_partial_node(s, n);
+			page = get_partial_node(s, n, c);
 			if (page) {
 				put_mems_allowed();
 				return page;
@@ -1687,16 +1689,17 @@ static struct page *get_any_partial(stru
 /*
  * Get a partial page, lock it and return it.
  */
-static struct page *get_partial(struct kmem_cache *s, gfp_t flags, int node)
+static struct page *get_partial(struct kmem_cache *s, gfp_t flags, int node,
+		struct kmem_cache_cpu *c)
 {
 	struct page *page;
 	int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
 
-	page = get_partial_node(s, get_node(s, searchnode));
+	page = get_partial_node(s, get_node(s, searchnode), c);
 	if (page || node != NUMA_NO_NODE)
 		return page;
 
-	return get_any_partial(s, flags);
+	return get_any_partial(s, flags, c);
 }
 
 #ifdef CONFIG_PREEMPT
@@ -1765,9 +1768,6 @@ void init_kmem_cache_cpus(struct kmem_ca
 	for_each_possible_cpu(cpu)
 		per_cpu_ptr(s->cpu_slab, cpu)->tid = init_tid(cpu);
 }
-/*
- * Remove the cpu slab
- */
 
 /*
  * Remove the cpu slab
@@ -2116,7 +2116,7 @@ load_freelist:
 	return object;
 
 new_slab:
-	page = get_partial(s, gfpflags, node);
+	page = get_partial(s, gfpflags, node, c);
 	if (page) {
 		stat(s, ALLOC_FROM_PARTIAL);
 		object = c->freelist;


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [slub p4 5/7] slub: return object pointer from get_partial() / new_slab().
  2011-08-09 21:12 [slub p4 0/7] slub: per cpu partial lists V4 Christoph Lameter
                   ` (3 preceding siblings ...)
  2011-08-09 21:12 ` [slub p4 4/7] slub: pass kmem_cache_cpu pointer to get_partial() Christoph Lameter
@ 2011-08-09 21:12 ` Christoph Lameter
  2011-08-09 21:12 ` [slub p4 6/7] slub: per cpu cache for partial pages Christoph Lameter
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 20+ messages in thread
From: Christoph Lameter @ 2011-08-09 21:12 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: David Rientjes, Andi Kleen, tj, Metathronius Galabant,
	Matt Mackall, Eric Dumazet, Adrian Drzewiecki, linux-kernel

[-- Attachment #1: object_instead_of_page_return --]
[-- Type: text/plain, Size: 7402 bytes --]

There is no need anymore to return the pointer to a slab page from get_partial()
since the page reference can be stored in the kmem_cache_cpu structure's "page" field.

Return an object pointer instead.

That in turn allows a simplification of the spaghetti code in __slab_alloc().

Signed-off-by: Christoph Lameter <cl@linux.com>


---
 mm/slub.c |  133 ++++++++++++++++++++++++++++++++++----------------------------
 1 file changed, 73 insertions(+), 60 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2011-08-01 11:04:29.985858887 -0500
+++ linux-2.6/mm/slub.c	2011-08-01 11:04:33.755858864 -0500
@@ -1554,9 +1554,11 @@ static inline void remove_partial(struct
  * Lock slab, remove from the partial list and put the object into the
  * per cpu freelist.
  *
+ * Returns a list of objects or NULL if it fails.
+ *
  * Must hold list_lock.
  */
-static inline int acquire_slab(struct kmem_cache *s,
+static inline void *acquire_slab(struct kmem_cache *s,
 		struct kmem_cache_node *n, struct page *page,
 		struct kmem_cache_cpu *c)
 {
@@ -1587,10 +1589,11 @@ static inline int acquire_slab(struct km
 
 	if (freelist) {
 		/* Populate the per cpu freelist */
-		c->freelist = freelist;
 		c->page = page;
 		c->node = page_to_nid(page);
-		return 1;
+		stat(s, ALLOC_FROM_PARTIAL);
+
+		return freelist;
 	} else {
 		/*
 		 * Slab page came from the wrong list. No object to allocate
@@ -1599,17 +1602,18 @@ static inline int acquire_slab(struct km
 		 */
 		printk(KERN_ERR "SLUB: %s : Page without available objects on"
 			" partial list\n", s->name);
-		return 0;
+		return NULL;
 	}
 }
 
 /*
  * Try to allocate a partial slab from a specific node.
  */
-static struct page *get_partial_node(struct kmem_cache *s,
+static void *get_partial_node(struct kmem_cache *s,
 		struct kmem_cache_node *n, struct kmem_cache_cpu *c)
 {
 	struct page *page;
+	void *object;
 
 	/*
 	 * Racy check. If we mistakenly see no partial slabs then we
@@ -1621,13 +1625,15 @@ static struct page *get_partial_node(str
 		return NULL;
 
 	spin_lock(&n->list_lock);
-	list_for_each_entry(page, &n->partial, lru)
-		if (acquire_slab(s, n, page, c))
+	list_for_each_entry(page, &n->partial, lru) {
+		object = acquire_slab(s, n, page, c);
+		if (object)
 			goto out;
-	page = NULL;
+	}
+	object = NULL;
 out:
 	spin_unlock(&n->list_lock);
-	return page;
+	return object;
 }
 
 /*
@@ -1641,7 +1647,7 @@ static struct page *get_any_partial(stru
 	struct zoneref *z;
 	struct zone *zone;
 	enum zone_type high_zoneidx = gfp_zone(flags);
-	struct page *page;
+	void *object;
 
 	/*
 	 * The defrag ratio allows a configuration of the tradeoffs between
@@ -1674,10 +1680,10 @@ static struct page *get_any_partial(stru
 
 		if (n && cpuset_zone_allowed_hardwall(zone, flags) &&
 				n->nr_partial > s->min_partial) {
-			page = get_partial_node(s, n, c);
-			if (page) {
+			object = get_partial_node(s, n, c);
+			if (object) {
 				put_mems_allowed();
-				return page;
+				return object;
 			}
 		}
 	}
@@ -1689,15 +1695,15 @@ static struct page *get_any_partial(stru
 /*
  * Get a partial page, lock it and return it.
  */
-static struct page *get_partial(struct kmem_cache *s, gfp_t flags, int node,
+static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
 		struct kmem_cache_cpu *c)
 {
-	struct page *page;
+	void *object;
 	int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
 
-	page = get_partial_node(s, get_node(s, searchnode), c);
-	if (page || node != NUMA_NO_NODE)
-		return page;
+	object = get_partial_node(s, get_node(s, searchnode), c);
+	if (object || node != NUMA_NO_NODE)
+		return object;
 
 	return get_any_partial(s, flags, c);
 }
@@ -2027,6 +2033,35 @@ slab_out_of_memory(struct kmem_cache *s,
 	}
 }
 
+static inline void *new_slab_objects(struct kmem_cache *s, gfp_t flags,
+			int node, struct kmem_cache_cpu **pc)
+{
+	void *object;
+	struct kmem_cache_cpu *c;
+	struct page *page = new_slab(s, flags, node);
+
+	if (page) {
+		c = __this_cpu_ptr(s->cpu_slab);
+		if (c->page)
+			flush_slab(s, c);
+
+		/*
+		 * No other reference to the page yet so we can
+		 * muck around with it freely without cmpxchg
+		 */
+		object = page->freelist;
+		page->freelist = NULL;
+
+		stat(s, ALLOC_SLAB);
+		c->node = page_to_nid(page);
+		c->page = page;
+		*pc = c;
+	} else
+		object = NULL;
+
+	return object;
+}
+
 /*
  * Slow path. The lockless freelist is empty or we need to perform
  * debugging duties.
@@ -2049,7 +2084,6 @@ static void *__slab_alloc(struct kmem_ca
 			  unsigned long addr, struct kmem_cache_cpu *c)
 {
 	void **object;
-	struct page *page;
 	unsigned long flags;
 	struct page new;
 	unsigned long counters;
@@ -2064,8 +2098,7 @@ static void *__slab_alloc(struct kmem_ca
 	c = this_cpu_ptr(s->cpu_slab);
 #endif
 
-	page = c->page;
-	if (!page)
+	if (!c->page)
 		goto new_slab;
 
 	if (unlikely(!node_match(c, node))) {
@@ -2077,8 +2110,8 @@ static void *__slab_alloc(struct kmem_ca
 	stat(s, ALLOC_SLOWPATH);
 
 	do {
-		object = page->freelist;
-		counters = page->counters;
+		object = c->page->freelist;
+		counters = c->page->counters;
 		new.counters = counters;
 		VM_BUG_ON(!new.frozen);
 
@@ -2090,12 +2123,12 @@ static void *__slab_alloc(struct kmem_ca
 		 *
 		 * If there are objects left then we retrieve them
 		 * and use them to refill the per cpu queue.
-		*/
+		 */
 
-		new.inuse = page->objects;
+		new.inuse = c->page->objects;
 		new.frozen = object != NULL;
 
-	} while (!__cmpxchg_double_slab(s, page,
+	} while (!__cmpxchg_double_slab(s, c->page,
 			object, counters,
 			NULL, new.counters,
 			"__slab_alloc"));
@@ -2109,53 +2142,33 @@ static void *__slab_alloc(struct kmem_ca
 	stat(s, ALLOC_REFILL);
 
 load_freelist:
-	VM_BUG_ON(!page->frozen);
 	c->freelist = get_freepointer(s, object);
 	c->tid = next_tid(c->tid);
 	local_irq_restore(flags);
 	return object;
 
 new_slab:
-	page = get_partial(s, gfpflags, node, c);
-	if (page) {
-		stat(s, ALLOC_FROM_PARTIAL);
-		object = c->freelist;
+	object = get_partial(s, gfpflags, node, c);
 
-		if (kmem_cache_debug(s))
-			goto debug;
-		goto load_freelist;
-	}
+	if (unlikely(!object)) {
 
-	page = new_slab(s, gfpflags, node);
+		object = new_slab_objects(s, gfpflags, node, &c);
 
-	if (page) {
-		c = __this_cpu_ptr(s->cpu_slab);
-		if (c->page)
-			flush_slab(s, c);
+		if (unlikely(!object)) {
+			if (!(gfpflags & __GFP_NOWARN) && printk_ratelimit())
+				slab_out_of_memory(s, gfpflags, node);
 
-		/*
-		 * No other reference to the page yet so we can
-		 * muck around with it freely without cmpxchg
-		 */
-		object = page->freelist;
-		page->freelist = NULL;
-
-		stat(s, ALLOC_SLAB);
-		c->node = page_to_nid(page);
-		c->page = page;
+			local_irq_restore(flags);
+			return NULL;
+		}
+	}
 
-		if (kmem_cache_debug(s))
-			goto debug;
+	if (likely(!kmem_cache_debug(s)))
 		goto load_freelist;
-	}
-	if (!(gfpflags & __GFP_NOWARN) && printk_ratelimit())
-		slab_out_of_memory(s, gfpflags, node);
-	local_irq_restore(flags);
-	return NULL;
 
-debug:
-	if (!object || !alloc_debug_processing(s, page, object, addr))
-		goto new_slab;
+	/* Only entered in the debug case */
+	if (!alloc_debug_processing(s, c->page, object, addr))
+		goto new_slab;	/* Slab failed checks. Next slab needed */
 
 	c->freelist = get_freepointer(s, object);
 	deactivate_slab(s, c);


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [slub p4 6/7] slub: per cpu cache for partial pages
  2011-08-09 21:12 [slub p4 0/7] slub: per cpu partial lists V4 Christoph Lameter
                   ` (4 preceding siblings ...)
  2011-08-09 21:12 ` [slub p4 5/7] slub: return object pointer from get_partial() / new_slab() Christoph Lameter
@ 2011-08-09 21:12 ` Christoph Lameter
  2011-08-20 10:40   ` Pekka Enberg
       [not found]   ` <CAF1ivSaH9fh6_QvuBkLc5t=zC4mPEAD5ZzsxOuPruDwG9MiZzw@mail.gmail.com>
  2011-08-09 21:12 ` [slub p4 7/7] slub: update slabinfo tools to report per cpu partial list statistics Christoph Lameter
                   ` (2 subsequent siblings)
  8 siblings, 2 replies; 20+ messages in thread
From: Christoph Lameter @ 2011-08-09 21:12 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: David Rientjes, Andi Kleen, tj, Metathronius Galabant,
	Matt Mackall, Eric Dumazet, Adrian Drzewiecki, linux-kernel

[-- Attachment #1: per_cpu_partial --]
[-- Type: text/plain, Size: 17179 bytes --]

Allow filling out the rest of the kmem_cache_cpu cacheline with pointers to
partial pages. The partial page list is used in slab_free() to avoid
taking the per node lock.

In __slab_alloc() we can then take multiple partial pages off the per
node partial list in one go, reducing node lock pressure.

We can also use the per cpu partial list in slab_alloc() to avoid scanning
partial lists for pages with free objects.

The main effect of a per cpu partial list is that the per node list_lock
is taken for batches of partial pages instead of individual ones.
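
To make that effect concrete, here is a toy single-threaded userspace model
of the locking difference (iteration counts are made up; a plain pthread
mutex stands in for the node list_lock, and BATCH mirrors the 30-object
cpu_partial default chosen for small objects further down):

#include <pthread.h>
#include <stdio.h>

#define BATCH 30	/* cf. s->cpu_partial for small objects */

static pthread_mutex_t node_lock = PTHREAD_MUTEX_INITIALIZER;
static int node_list_len;
static int lock_acquisitions;

/* Unbatched: one lock round trip per freed page. */
static void free_page_direct(void)
{
	pthread_mutex_lock(&node_lock);
	node_list_len++;
	lock_acquisitions++;
	pthread_mutex_unlock(&node_lock);
}

/* Batched: queue on the "per cpu" side, take the node lock once per BATCH. */
static void free_page_batched(int *queued)
{
	if (++(*queued) < BATCH)
		return;

	pthread_mutex_lock(&node_lock);
	node_list_len += *queued;
	lock_acquisitions++;
	pthread_mutex_unlock(&node_lock);
	*queued = 0;
}

int main(void)
{
	int queued = 0, i;

	for (i = 0; i < 3000; i++)
		free_page_direct();
	printf("direct:  %d lock acquisitions\n", lock_acquisitions);

	lock_acquisitions = 0;
	for (i = 0; i < 3000; i++)
		free_page_batched(&queued);
	printf("batched: %d lock acquisitions\n", lock_acquisitions);
	return 0;
}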

Potential future enhancements:

1. The pickup from the partial list could perhaps be done without disabling
   interrupts with some work. The free path already puts the page into the
   per cpu partial list without disabling interrupts.

2. __slab_free() may have some code paths that could use optimization.

Performance:

				Before		After
./hackbench 100 process 200000
				Time: 1953.047	1564.614
./hackbench 100 process 20000
				Time: 207.176   156.940
./hackbench 100 process 20000
				Time: 204.468	156.940
./hackbench 100 process 20000
				Time: 204.879	158.772
./hackbench 10 process 20000
				Time: 20.153	15.853
./hackbench 10 process 20000
				Time: 20.153	15.986
./hackbench 10 process 20000
				Time: 19.363	16.111
./hackbench 1 process 20000
				Time: 2.518	2.307
./hackbench 1 process 20000
				Time: 2.258	2.339
./hackbench 1 process 20000
				Time: 2.864	2.163

Signed-off-by: Christoph Lameter <cl@linux.com>


---
 include/linux/mm_types.h |   14 +
 include/linux/slub_def.h |    4 
 mm/slub.c                |  339 ++++++++++++++++++++++++++++++++++++++++-------
 3 files changed, 309 insertions(+), 48 deletions(-)

Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h	2011-08-05 12:06:57.561873039 -0500
+++ linux-2.6/include/linux/slub_def.h	2011-08-09 13:05:13.181582001 -0500
@@ -36,12 +36,15 @@ enum stat_item {
 	ORDER_FALLBACK,		/* Number of times fallback was necessary */
 	CMPXCHG_DOUBLE_CPU_FAIL,/* Failure of this_cpu_cmpxchg_double */
 	CMPXCHG_DOUBLE_FAIL,	/* Number of times that cmpxchg double did not match */
+	CPU_PARTIAL_ALLOC,	/* Used cpu partial on alloc */
+	CPU_PARTIAL_FREE,	/* Used cpu partial on free */
 	NR_SLUB_STAT_ITEMS };
 
 struct kmem_cache_cpu {
 	void **freelist;	/* Pointer to next available object */
 	unsigned long tid;	/* Globally unique transaction id */
 	struct page *page;	/* The slab from which we are allocating */
+	struct page *partial;	/* Partially allocated frozen slabs */
 	int node;		/* The node of the page (or -1 for debug) */
 #ifdef CONFIG_SLUB_STATS
 	unsigned stat[NR_SLUB_STAT_ITEMS];
@@ -79,6 +82,7 @@ struct kmem_cache {
 	int size;		/* The size of an object including meta data */
 	int objsize;		/* The size of an object without meta data */
 	int offset;		/* Free pointer offset. */
+	int cpu_partial;	/* Number of per cpu partial pages to keep around */
 	struct kmem_cache_order_objects oo;
 
 	/* Allocation and freeing of slabs */
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2011-08-09 13:05:12.321582002 -0500
+++ linux-2.6/mm/slub.c	2011-08-09 13:05:13.181582001 -0500
@@ -1560,7 +1560,7 @@ static inline void remove_partial(struct
  */
 static inline void *acquire_slab(struct kmem_cache *s,
 		struct kmem_cache_node *n, struct page *page,
-		struct kmem_cache_cpu *c)
+		int mode)
 {
 	void *freelist;
 	unsigned long counters;
@@ -1575,7 +1575,8 @@ static inline void *acquire_slab(struct
 		freelist = page->freelist;
 		counters = page->counters;
 		new.counters = counters;
-		new.inuse = page->objects;
+		if (mode)
+			new.inuse = page->objects;
 
 		VM_BUG_ON(new.frozen);
 		new.frozen = 1;
@@ -1586,34 +1587,20 @@ static inline void *acquire_slab(struct
 			"lock and freeze"));
 
 	remove_partial(n, page);
-
-	if (freelist) {
-		/* Populate the per cpu freelist */
-		c->page = page;
-		c->node = page_to_nid(page);
-		stat(s, ALLOC_FROM_PARTIAL);
-
-		return freelist;
-	} else {
-		/*
-		 * Slab page came from the wrong list. No object to allocate
-		 * from. Put it onto the correct list and continue partial
-		 * scan.
-		 */
-		printk(KERN_ERR "SLUB: %s : Page without available objects on"
-			" partial list\n", s->name);
-		return NULL;
-	}
+	return freelist;
 }
 
+static int put_cpu_partial(struct kmem_cache *s, struct page *page, int drain);
+
 /*
  * Try to allocate a partial slab from a specific node.
  */
 static void *get_partial_node(struct kmem_cache *s,
 		struct kmem_cache_node *n, struct kmem_cache_cpu *c)
 {
-	struct page *page;
-	void *object;
+	struct page *page, *page2;
+	void *object = NULL;
+	int count = 0;
 
 	/*
 	 * Racy check. If we mistakenly see no partial slabs then we
@@ -1625,13 +1612,28 @@ static void *get_partial_node(struct kme
 		return NULL;
 
 	spin_lock(&n->list_lock);
-	list_for_each_entry(page, &n->partial, lru) {
-		object = acquire_slab(s, n, page, c);
-		if (object)
-			goto out;
+	list_for_each_entry_safe(page, page2, &n->partial, lru) {
+		void *t = acquire_slab(s, n, page, count == 0);
+		int available;
+
+		if (!t)
+			break;
+
+		if (!count) {
+			c->page = page;
+			c->node = page_to_nid(page);
+			stat(s, ALLOC_FROM_PARTIAL);
+			count++;
+			object = t;
+			available =  page->objects - page->inuse;
+		} else {
+			page->freelist = t;
+			available = put_cpu_partial(s, page, 0);
+		}
+		if (kmem_cache_debug(s) || available > s->cpu_partial / 2)
+			break;
+
 	}
-	object = NULL;
-out:
 	spin_unlock(&n->list_lock);
 	return object;
 }
@@ -1926,6 +1928,123 @@ redo:
 	}
 }
 
+/* Unfreeze all the cpu partial slabs */
+static void unfreeze_partials(struct kmem_cache *s)
+{
+	struct kmem_cache_node *n = NULL;
+	struct kmem_cache_cpu *c = this_cpu_ptr(s->cpu_slab);
+	struct page *page;
+
+	while ((page = c->partial)) {
+		enum slab_modes { M_PARTIAL, M_FREE };
+		enum slab_modes l, m;
+		struct page new;
+		struct page old;
+
+		c->partial = page->next;
+		l = M_FREE;
+
+		do {
+
+			old.freelist = page->freelist;
+			old.counters = page->counters;
+			VM_BUG_ON(!old.frozen);
+
+			new.counters = old.counters;
+			new.freelist = old.freelist;
+
+			new.frozen = 0;
+
+			if (!new.inuse && (!n || n->nr_partial < s->min_partial))
+				m = M_FREE;
+			else {
+				struct kmem_cache_node *n2 = get_node(s,
+							page_to_nid(page));
+
+				m = M_PARTIAL;
+				if (n != n2) {
+					if (n)
+						spin_unlock(&n->list_lock);
+
+					n = n2;
+					spin_lock(&n->list_lock);
+				}
+			}
+
+			if (l != m) {
+				if (l == M_PARTIAL)
+					remove_partial(n, page);
+				else
+					add_partial(n, page, 1);
+
+				l = m;
+			}
+
+		} while (!cmpxchg_double_slab(s, page,
+				old.freelist, old.counters,
+				new.freelist, new.counters,
+				"unfreezing slab"));
+
+		if (m == M_FREE) {
+			stat(s, DEACTIVATE_EMPTY);
+			discard_slab(s, page);
+			stat(s, FREE_SLAB);
+		}
+	}
+
+	if (n)
+		spin_unlock(&n->list_lock);
+}
+
+/*
+ * Put a page that was just frozen (in __slab_free) into a partial page
+ * slot if available. This is done without interrupts disabled and without
+ * preemption disabled. The cmpxchg is racy and may put the partial page
+ * onto a random cpu's partial slot.
+ *
+ * If we did not find a slot then simply move all the partials to the
+ * per node partial list.
+ */
+int put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
+{
+	struct page *oldpage;
+	int pages;
+	int pobjects;
+
+	do {
+		pages = 0;
+		pobjects = 0;
+		oldpage = this_cpu_read(s->cpu_slab->partial);
+
+		if (oldpage) {
+			pobjects = oldpage->pobjects;
+			pages = oldpage->pages;
+			if (drain && pobjects > s->cpu_partial) {
+				unsigned long flags;
+				/*
+				 * partial array is full. Move the existing
+				 * set to the per node partial list.
+				 */
+				local_irq_save(flags);
+				unfreeze_partials(s);
+				local_irq_restore(flags);
+				pobjects = 0;
+				pages = 0;
+			}
+		}
+
+		pages++;
+		pobjects += page->objects - page->inuse;
+
+		page->pages = pages;
+		page->pobjects = pobjects;
+		page->next = oldpage;
+
+	} while (this_cpu_cmpxchg(s->cpu_slab->partial, oldpage, page) != oldpage);
+	stat(s, CPU_PARTIAL_FREE);
+	return pobjects;
+}
+
 static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c)
 {
 	stat(s, CPUSLAB_FLUSH);
@@ -1941,8 +2060,12 @@ static inline void __flush_cpu_slab(stru
 {
 	struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
 
-	if (likely(c && c->page))
-		flush_slab(s, c);
+	if (likely(c)) {
+		if (c->page)
+			flush_slab(s, c);
+
+		unfreeze_partials(s);
+	}
 }
 
 static void flush_cpu_slab(void *d)
@@ -2066,8 +2189,6 @@ static inline void *new_slab_objects(str
  * Slow path. The lockless freelist is empty or we need to perform
  * debugging duties.
  *
- * Interrupts are disabled.
- *
  * Processing is still very fast if new objects have been freed to the
  * regular freelist. In that case we simply take over the regular freelist
  * as the lockless freelist and zap the regular freelist.
@@ -2100,7 +2221,7 @@ static void *__slab_alloc(struct kmem_ca
 
 	if (!c->page)
 		goto new_slab;
-
+redo:
 	if (unlikely(!node_match(c, node))) {
 		stat(s, ALLOC_NODE_MISMATCH);
 		deactivate_slab(s, c);
@@ -2133,7 +2254,7 @@ static void *__slab_alloc(struct kmem_ca
 			NULL, new.counters,
 			"__slab_alloc"));
 
-	if (unlikely(!object)) {
+	if (!object) {
 		c->page = NULL;
 		stat(s, DEACTIVATE_BYPASS);
 		goto new_slab;
@@ -2148,6 +2269,17 @@ load_freelist:
 	return object;
 
 new_slab:
+
+	if (c->partial) {
+		c->page = c->partial;
+		c->partial = c->page->next;
+		c->node = page_to_nid(c->page);
+		stat(s, CPU_PARTIAL_ALLOC);
+		c->freelist = NULL;
+		goto redo;
+	}
+
+	/* Then do expensive stuff like retrieving pages from the partial lists */
 	object = get_partial(s, gfpflags, node, c);
 
 	if (unlikely(!object)) {
@@ -2341,16 +2473,29 @@ static void __slab_free(struct kmem_cach
 		was_frozen = new.frozen;
 		new.inuse--;
 		if ((!new.inuse || !prior) && !was_frozen && !n) {
-                        n = get_node(s, page_to_nid(page));
-			/*
-			 * Speculatively acquire the list_lock.
-			 * If the cmpxchg does not succeed then we may
-			 * drop the list_lock without any processing.
-			 *
-			 * Otherwise the list_lock will synchronize with
-			 * other processors updating the list of slabs.
-			 */
-                        spin_lock_irqsave(&n->list_lock, flags);
+
+			if (!kmem_cache_debug(s) && !prior)
+
+				/*
+				 * Slab was on no list before and will be partially empty
+				 * We can defer the list move and instead freeze it.
+				 */
+				new.frozen = 1;
+
+			else { /* Needs to be taken off a list */
+
+	                        n = get_node(s, page_to_nid(page));
+				/*
+				 * Speculatively acquire the list_lock.
+				 * If the cmpxchg does not succeed then we may
+				 * drop the list_lock without any processing.
+				 *
+				 * Otherwise the list_lock will synchronize with
+				 * other processors updating the list of slabs.
+				 */
+				spin_lock_irqsave(&n->list_lock, flags);
+
+			}
 		}
 		inuse = new.inuse;
 
@@ -2360,7 +2505,15 @@ static void __slab_free(struct kmem_cach
 		"__slab_free"));
 
 	if (likely(!n)) {
-                /*
+
+		/*
+		 * If we just froze the page then put it onto the
+		 * per cpu partial list.
+		 */
+       		if (new.frozen && !was_frozen)
+			put_cpu_partial(s, page, 1);
+
+         	/*
 		 * The list lock was not taken therefore no list
 		 * activity can be necessary.
 		 */
@@ -2429,7 +2582,6 @@ static __always_inline void slab_free(st
 	slab_free_hook(s, x);
 
 redo:
-
 	/*
 	 * Determine the currently cpus per cpu slab.
 	 * The cpu may change afterward. However that does not matter since
@@ -2919,7 +3071,34 @@ static int kmem_cache_open(struct kmem_c
 	 * The larger the object size is, the more pages we want on the partial
 	 * list to avoid pounding the page allocator excessively.
 	 */
-	set_min_partial(s, ilog2(s->size));
+	set_min_partial(s, ilog2(s->size) / 2);
+
+	/*
+	 * cpu_partial determines the maximum number of objects kept in the
+	 * per cpu partial lists of a processor.
+	 *
+	 * Per cpu partial lists mainly contain slabs that just have one
+	 * object freed. If they are used for allocation then they can be
+	 * filled up again with minimal effort. The slab will never hit the
+	 * per node partial lists and therefore no locking will be required.
+	 *
+	 * This setting also determines
+	 *
+	 * A) The number of objects from per cpu partial slabs dumped to the
+	 *    per node list when we reach the limit.
+	 * B) The number of objects in partial slabs to extract from the
+	 *    per node list when we run out of per cpu objects. We only fetch 50%
+	 *    to keep some capacity around for frees.
+	 */
+	if (s->size >= PAGE_SIZE)
+		s->cpu_partial = 2;
+	else if (s->size >= 1024)
+		s->cpu_partial = 6;
+	else if (s->size >= 256)
+		s->cpu_partial = 13;
+	else
+		s->cpu_partial = 30;
+
 	s->refcount = 1;
 #ifdef CONFIG_NUMA
 	s->remote_node_defrag_ratio = 1000;
@@ -4327,6 +4506,7 @@ static ssize_t show_slab_objects(struct
 
 		for_each_possible_cpu(cpu) {
 			struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
+			struct page *page;
 
 			if (!c || c->node < 0)
 				continue;
@@ -4342,6 +4522,13 @@ static ssize_t show_slab_objects(struct
 				total += x;
 				nodes[c->node] += x;
 			}
+			page = c->partial;
+
+			if (page) {
+				x = page->pobjects;
+                                total += x;
+                                nodes[c->node] += x;
+			}
 			per_cpu[c->node]++;
 		}
 	}
@@ -4493,6 +4680,27 @@ static ssize_t min_partial_store(struct
 }
 SLAB_ATTR(min_partial);
 
+static ssize_t cpu_partial_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%u\n", s->cpu_partial);
+}
+
+static ssize_t cpu_partial_store(struct kmem_cache *s, const char *buf,
+				 size_t length)
+{
+	unsigned long objects;
+	int err;
+
+	err = strict_strtoul(buf, 10, &objects);
+	if (err)
+		return err;
+
+	s->cpu_partial = objects;
+	flush_all(s);
+	return length;
+}
+SLAB_ATTR(cpu_partial);
+
 static ssize_t ctor_show(struct kmem_cache *s, char *buf)
 {
 	if (!s->ctor)
@@ -4531,6 +4739,37 @@ static ssize_t objects_partial_show(stru
 }
 SLAB_ATTR_RO(objects_partial);
 
+static ssize_t slabs_cpu_partial_show(struct kmem_cache *s, char *buf)
+{
+	int objects = 0;
+	int pages = 0;
+	int cpu;
+	int len;
+
+	for_each_online_cpu(cpu) {
+		struct page *page = per_cpu_ptr(s->cpu_slab, cpu)->partial;
+
+		if (page) {
+			pages += page->pages;
+			objects += page->pobjects;
+		}
+	}
+
+	len = sprintf(buf, "%d(%d)", objects, pages);
+
+#ifdef CONFIG_SMP
+	for_each_online_cpu(cpu) {
+		struct page *page = per_cpu_ptr(s->cpu_slab, cpu) ->partial;
+
+		if (page && len < PAGE_SIZE - 20)
+			len += sprintf(buf + len, " C%d=%d(%d)", cpu,
+				page->pobjects, page->pages);
+	}
+#endif
+	return len + sprintf(buf + len, "\n");
+}
+SLAB_ATTR_RO(slabs_cpu_partial);
+
 static ssize_t reclaim_account_show(struct kmem_cache *s, char *buf)
 {
 	return sprintf(buf, "%d\n", !!(s->flags & SLAB_RECLAIM_ACCOUNT));
@@ -4853,6 +5092,8 @@ STAT_ATTR(DEACTIVATE_BYPASS, deactivate_
 STAT_ATTR(ORDER_FALLBACK, order_fallback);
 STAT_ATTR(CMPXCHG_DOUBLE_CPU_FAIL, cmpxchg_double_cpu_fail);
 STAT_ATTR(CMPXCHG_DOUBLE_FAIL, cmpxchg_double_fail);
+STAT_ATTR(CPU_PARTIAL_ALLOC, cpu_partial_alloc);
+STAT_ATTR(CPU_PARTIAL_FREE, cpu_partial_free);
 #endif
 
 static struct attribute *slab_attrs[] = {
@@ -4861,6 +5102,7 @@ static struct attribute *slab_attrs[] =
 	&objs_per_slab_attr.attr,
 	&order_attr.attr,
 	&min_partial_attr.attr,
+	&cpu_partial_attr.attr,
 	&objects_attr.attr,
 	&objects_partial_attr.attr,
 	&partial_attr.attr,
@@ -4873,6 +5115,7 @@ static struct attribute *slab_attrs[] =
 	&destroy_by_rcu_attr.attr,
 	&shrink_attr.attr,
 	&reserved_attr.attr,
+	&slabs_cpu_partial_attr.attr,
 #ifdef CONFIG_SLUB_DEBUG
 	&total_objects_attr.attr,
 	&slabs_attr.attr,
@@ -4914,6 +5157,8 @@ static struct attribute *slab_attrs[] =
 	&order_fallback_attr.attr,
 	&cmpxchg_double_fail_attr.attr,
 	&cmpxchg_double_cpu_fail_attr.attr,
+	&cpu_partial_alloc_attr.attr,
+	&cpu_partial_free_attr.attr,
 #endif
 #ifdef CONFIG_FAILSLAB
 	&failslab_attr.attr,
Index: linux-2.6/include/linux/mm_types.h
===================================================================
--- linux-2.6.orig/include/linux/mm_types.h	2011-08-05 12:06:57.571873039 -0500
+++ linux-2.6/include/linux/mm_types.h	2011-08-09 13:05:13.201582001 -0500
@@ -79,9 +79,21 @@ struct page {
 	};
 
 	/* Third double word block */
-	struct list_head lru;		/* Pageout list, eg. active_list
+	union {
+		struct list_head lru;	/* Pageout list, eg. active_list
 					 * protected by zone->lru_lock !
 					 */
+		struct {		/* slub per cpu partial pages */
+			struct page *next;	/* Next partial slab */
+#ifdef CONFIG_64BIT
+			int pages;	/* Nr of partial slabs left */
+			int pobjects;	/* Approximate # of objects */
+#else
+			short int pages;
+			short int pobjects;
+#endif
+		};
+	};
 
 	/* Remainder is not double word aligned */
 	union {


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [slub p4 7/7] slub: update slabinfo tools to report per cpu partial list statistics
  2011-08-09 21:12 [slub p4 0/7] slub: per cpu partial lists V4 Christoph Lameter
                   ` (5 preceding siblings ...)
  2011-08-09 21:12 ` [slub p4 6/7] slub: per cpu cache for partial pages Christoph Lameter
@ 2011-08-09 21:12 ` Christoph Lameter
  2011-08-13 18:28 ` [slub p4 0/7] slub: per cpu partial lists V4 David Rientjes
  2011-08-20 10:48 ` Pekka Enberg
  8 siblings, 0 replies; 20+ messages in thread
From: Christoph Lameter @ 2011-08-09 21:12 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: David Rientjes, Andi Kleen, tj, Metathronius Galabant,
	Matt Mackall, Eric Dumazet, Adrian Drzewiecki, linux-kernel

[-- Attachment #1: update_slabinfo_cp --]
[-- Type: text/plain, Size: 1808 bytes --]

Update the slabinfo tool to report the stats on per cpu partial list usage.

Signed-off-by: Christoph Lameter <cl@linux.com>


---
 tools/slub/slabinfo.c |    8 ++++++++
 1 file changed, 8 insertions(+)

Index: linux-2.6/tools/slub/slabinfo.c
===================================================================
--- linux-2.6.orig/tools/slub/slabinfo.c	2011-08-09 14:36:34.245334733 -0500
+++ linux-2.6/tools/slub/slabinfo.c	2011-08-09 16:01:30.392982425 -0500
@@ -42,6 +42,7 @@ struct slabinfo {
 	unsigned long deactivate_remote_frees, order_fallback;
 	unsigned long cmpxchg_double_cpu_fail, cmpxchg_double_fail;
 	unsigned long alloc_node_mismatch, deactivate_bypass;
+	unsigned long cpu_partial_alloc, cpu_partial_free;
 	int numa[MAX_NODES];
 	int numa_partial[MAX_NODES];
 } slabinfo[MAX_SLABS];
@@ -455,6 +456,11 @@ static void slab_stats(struct slabinfo *
 		s->alloc_from_partial * 100 / total_alloc,
 		s->free_remove_partial * 100 / total_free);
 
+	printf("Cpu partial list     %8lu %8lu %3lu %3lu\n",
+		s->cpu_partial_alloc, s->cpu_partial_free,
+		s->cpu_partial_alloc * 100 / total_alloc,
+		s->cpu_partial_free * 100 / total_free);
+
 	printf("RemoteObj/SlabFrozen %8lu %8lu %3lu %3lu\n",
 		s->deactivate_remote_frees, s->free_frozen,
 		s->deactivate_remote_frees * 100 / total_alloc,
@@ -1209,6 +1215,8 @@ static void read_slab_dir(void)
 			slab->order_fallback = get_obj("order_fallback");
 			slab->cmpxchg_double_cpu_fail = get_obj("cmpxchg_double_cpu_fail");
 			slab->cmpxchg_double_fail = get_obj("cmpxchg_double_fail");
+			slab->cpu_partial_alloc = get_obj("cpu_partial_alloc");
+			slab->cpu_partial_free = get_obj("cpu_partial_free");
 			slab->alloc_node_mismatch = get_obj("alloc_node_mismatch");
 			slab->deactivate_bypass = get_obj("deactivate_bypass");
 			chdir("..");


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [slub p4 0/7] slub: per cpu partial lists V4
  2011-08-09 21:12 [slub p4 0/7] slub: per cpu partial lists V4 Christoph Lameter
                   ` (6 preceding siblings ...)
  2011-08-09 21:12 ` [slub p4 7/7] slub: update slabinfo tools to report per cpu partial list statistics Christoph Lameter
@ 2011-08-13 18:28 ` David Rientjes
  2011-08-15  8:44   ` Pekka Enberg
  2011-08-15 14:29   ` Christoph Lameter
  2011-08-20 10:48 ` Pekka Enberg
  8 siblings, 2 replies; 20+ messages in thread
From: David Rientjes @ 2011-08-13 18:28 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Pekka Enberg, Andi Kleen, tj, Metathronius Galabant,
	Matt Mackall, Eric Dumazet, Adrian Drzewiecki, linux-kernel

On Tue, 9 Aug 2011, Christoph Lameter wrote:

> The following patchset introduces per cpu partial lists which allow
> a performance increase of around ~10-20% with hackbench on my Sandybridge
> processor.
> 
> These lists help to avoid per node locking overhead. Allocator latency
> could be further reduced by making these operations work without
> disabling interrupts (like the fastpath and the free slowpath) but that
> is another project.
> 
> It is interesting to note that BSD has gone to a scheme with partial
> pages only per cpu (source: Adrian). Transfer of cpu ownerships is
> done using IPIs. Probably too much overhead for our taste. The approach
> here keeps the per node partial lists essentially meaning the "pages"
> in there have no cpu owner.
> 

I'm currently 35,000 feet above Chicago going about 611 mph, so what 
better time to benchmark this patchset on my netperf testing rack!

	threads		before		after
	 16		78031		74714  (-4.3%)
	 32		118269		115810 (-2.1%)
	 48		150787		150165 (-0.4%)
	 64		189932		187766 (-1.1%)
	 80		221189		223682 (+1.1%)
	 96		239807		246222 (+2.7%)
	112		262135		271329 (+3.5%)
	128		273612		286782 (+4.8%)
	144		280009		293943 (+5.0%)
	160		285972		299798 (+4.8%)

I'll review the patchset in detail, especially the cleanups and 
optimizations, when my wifi isn't so sketchy.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [slub p4 0/7] slub: per cpu partial lists V4
  2011-08-13 18:28 ` [slub p4 0/7] slub: per cpu partial lists V4 David Rientjes
@ 2011-08-15  8:44   ` Pekka Enberg
  2011-08-15 14:29   ` Christoph Lameter
  1 sibling, 0 replies; 20+ messages in thread
From: Pekka Enberg @ 2011-08-15  8:44 UTC (permalink / raw)
  To: David Rientjes
  Cc: Christoph Lameter, Andi Kleen, tj, Metathronius Galabant,
	Matt Mackall, Eric Dumazet, Adrian Drzewiecki, linux-kernel

On 8/13/11 9:28 PM, David Rientjes wrote:
> On Tue, 9 Aug 2011, Christoph Lameter wrote:
>
>> The following patchset introduces per cpu partial lists which allow
>> a performance increase of around ~10-20% with hackbench on my Sandybridge
>> processor.
>>
>> These lists help to avoid per node locking overhead. Allocator latency
>> could be further reduced by making these operations work without
>> disabling interrupts (like the fastpath and the free slowpath) but that
>> is another project.
>>
>> It is interesting to note that BSD has gone to a scheme with partial
>> pages only per cpu (source: Adrian). Transfer of cpu ownerships is
>> done using IPIs. Probably too much overhead for our taste. The approach
>> here keeps the per node partial lists essentially meaning the "pages"
>> in there have no cpu owner.
>>
>
> I'm currently 35,000 feet above Chicago going about 611 mph, so what
> better time to benchmark this patchset on my netperf testing rack!
>
> 	threads		before		after
> 	 16		78031		74714  (-4.3%)
> 	 32		118269		115810 (-2.1%)
> 	 48		150787		150165 (-0.4%)
> 	 64		189932		187766 (-1.1%)
> 	 80		221189		223682 (+1.1%)
> 	 96		239807		246222 (+2.7%)
> 	112		262135		271329 (+3.5%)
> 	128		273612		286782 (+4.8%)
> 	144		280009		293943 (+5.0%)
> 	160		285972		299798 (+4.8%)
>
> I'll review the patchset in detail, especially the cleanups and
> optimizations, when my wifi isn't so sketchy.

Andi, it'd be interesting to know your results for v4 of this patchset. 
I'm hoping to get the patches reviewed and merged to linux-next this week.

			Pekka

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [slub p4 0/7] slub: per cpu partial lists V4
  2011-08-13 18:28 ` [slub p4 0/7] slub: per cpu partial lists V4 David Rientjes
  2011-08-15  8:44   ` Pekka Enberg
@ 2011-08-15 14:29   ` Christoph Lameter
  1 sibling, 0 replies; 20+ messages in thread
From: Christoph Lameter @ 2011-08-15 14:29 UTC (permalink / raw)
  To: David Rientjes
  Cc: Pekka Enberg, Andi Kleen, tj, Metathronius Galabant,
	Matt Mackall, Eric Dumazet, Adrian Drzewiecki, linux-kernel

On Sat, 13 Aug 2011, David Rientjes wrote:

> I'm currently 35,000 feet above Chicago going about 611 mph, so what
> better time to benchmark this patchset on my netperf testing rack!
>
> 	threads		before		after
> 	 16		78031		74714  (-4.3%)
> 	 32		118269		115810 (-2.1%)
> 	 48		150787		150165 (-0.4%)
> 	 64		189932		187766 (-1.1%)
> 	 80		221189		223682 (+1.1%)
> 	 96		239807		246222 (+2.7%)
> 	112		262135		271329 (+3.5%)
> 	128		273612		286782 (+4.8%)
> 	144		280009		293943 (+5.0%)
> 	160		285972		299798 (+4.8%)

The higher the contention, the better the performance. But the -4% in the
uncontended case is worrisome.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [slub p4 1/7] slub: free slabs without holding locks (V2)
  2011-08-09 21:12 ` [slub p4 1/7] slub: free slabs without holding locks (V2) Christoph Lameter
@ 2011-08-20 10:32   ` Pekka Enberg
  2011-08-20 15:58     ` Christoph Lameter
  0 siblings, 1 reply; 20+ messages in thread
From: Pekka Enberg @ 2011-08-20 10:32 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-kernel, rientjes

On Tue, 9 Aug 2011, Christoph Lameter wrote:
> There are two situations in which slub holds a lock while releasing
> pages:
>
> 	A. During kmem_cache_shrink()
> 	B. During kmem_cache_close()
>
> For A build a list while holding the lock and then release the pages
> later. In case of B we are the last remaining user of the slab so
> there is no need to take the listlock.
>
> After this patch all calls to the page allocator to free pages are
> done without holding any spinlocks. kmem_cache_destroy() will still
> hold the slub_lock semaphore.
>
> V1->V2. Remove kfree. Avoid locking in free_partial.
>
> Signed-off-by: Christoph Lameter <cl@linux.com>
>
> ---
> mm/slub.c |   26 +++++++++++++-------------
> 1 file changed, 13 insertions(+), 13 deletions(-)
>
> Index: linux-2.6/mm/slub.c
> ===================================================================
> --- linux-2.6.orig/mm/slub.c	2011-08-09 13:01:59.071582163 -0500
> +++ linux-2.6/mm/slub.c	2011-08-09 13:05:00.051582012 -0500
> @@ -2970,13 +2970,13 @@ static void list_slab_objects(struct kme
>
> /*
>  * Attempt to free all partial slabs on a node.
> + * This is called from kmem_cache_close(). We must be the last thread
> + * using the cache and therefore we do not need to lock anymore.
>  */
> static void free_partial(struct kmem_cache *s, struct kmem_cache_node *n)
> {

Is it possible to somehow verify that we're the last thread using the 
cache when SLUB debugging is enabled? It'd be useful for tracking down 
callers that violate this assumption.

 			Pekka

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [slub p4 6/7] slub: per cpu cache for partial pages
  2011-08-09 21:12 ` [slub p4 6/7] slub: per cpu cache for partial pages Christoph Lameter
@ 2011-08-20 10:40   ` Pekka Enberg
  2011-08-20 16:00     ` Christoph Lameter
       [not found]   ` <CAF1ivSaH9fh6_QvuBkLc5t=zC4mPEAD5ZzsxOuPruDwG9MiZzw@mail.gmail.com>
  1 sibling, 1 reply; 20+ messages in thread
From: Pekka Enberg @ 2011-08-20 10:40 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-kernel, rientjes

> @@ -2919,7 +3071,34 @@ static int kmem_cache_open(struct kmem_c
>  	 * The larger the object size is, the more pages we want on the partial
>  	 * list to avoid pounding the page allocator excessively.
>  	 */
> -	set_min_partial(s, ilog2(s->size));
> +	set_min_partial(s, ilog2(s->size) / 2);

Why do we want to make minimum size smaller?

> +
> +	/*
> +	 * cpu_partial determined the maximum number of objects kept in the
> +	 * per cpu partial lists of a processor.
> +	 *
> +	 * Per cpu partial lists mainly contain slabs that just have one
> +	 * object freed. If they are used for allocation then they can be
> +	 * filled up again with minimal effort. The slab will never hit the
> +	 * per node partial lists and therefore no locking will be required.
> +	 *
> +	 * This setting also determines
> +	 *
> +	 * A) The number of objects from per cpu partial slabs dumped to the
> +	 *    per node list when we reach the limit.
> +	 * B) The number of objects in partial partial slabs to extract from the
> +	 *    per node list when we run out of per cpu objects. We only fetch 50%
> +	 *    to keep some capacity around for frees.
> +	 */
> +	if (s->size >= PAGE_SIZE)
> +		s->cpu_partial = 2;
> +	else if (s->size >= 1024)
> +		s->cpu_partial = 6;
> +	else if (s->size >= 256)
> +		s->cpu_partial = 13;
> +	else
> +		s->cpu_partial = 30;

How did you come up with these limits?

> Index: linux-2.6/include/linux/mm_types.h
> ===================================================================
> --- linux-2.6.orig/include/linux/mm_types.h	2011-08-05 12:06:57.571873039 -0500
> +++ linux-2.6/include/linux/mm_types.h	2011-08-09 13:05:13.201582001 -0500
> @@ -79,9 +79,21 @@ struct page {
>  	};
>
>  	/* Third double word block */
> -	struct list_head lru;		/* Pageout list, eg. active_list
> +	union {
> +		struct list_head lru;	/* Pageout list, eg. active_list
>  					 * protected by zone->lru_lock !
>  					 */
> +		struct {		/* slub per cpu partial pages */
> +			struct page *next;	/* Next partial slab */
> +#ifdef CONFIG_64BIT
> +			int pages;	/* Nr of partial slabs left */
> +			int pobjects;	/* Approximate # of objects */
> +#else
> +			short int pages;
> +			short int pobjects;
> +#endif
> +		};
> +	};

Why are the sizes different on 32-bit and 64-bit? Does this change 'struct
page' size?

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [slub p4 2/7] slub: Remove useless statements in __slab_alloc
  2011-08-09 21:12 ` [slub p4 2/7] slub: Remove useless statements in __slab_alloc Christoph Lameter
@ 2011-08-20 10:44   ` Pekka Enberg
  2011-08-20 16:01     ` Christoph Lameter
  0 siblings, 1 reply; 20+ messages in thread
From: Pekka Enberg @ 2011-08-20 10:44 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-kernel, torvalds, rientjes

On Tue, 9 Aug 2011, Christoph Lameter wrote:
> Two statements in __slab_alloc() do not have any effect.
>
> 1. c->page is already set to NULL by deactivate_slab() called right before.
>
> 2. gfpflags are masked in new_slab() before being passed to the page
>   allocator. There is no need to mask gfpflags in __slab_alloc in particular
>   since most frequent processing in __slab_alloc does not require the use of a
>   gfpmask.
>
> Cc: torvalds@linux-foundation.org
> Signed-off-by: Christoph Lameter <cl@linux.com>

Linus wasn't actually on the CC of the email.

> ---
> mm/slub.c |    4 ----
> 1 file changed, 4 deletions(-)
>
> Index: linux-2.6/mm/slub.c
> ===================================================================
> --- linux-2.6.orig/mm/slub.c	2011-08-01 11:03:15.000000000 -0500
> +++ linux-2.6/mm/slub.c	2011-08-01 11:04:06.385859038 -0500
> @@ -2064,9 +2064,6 @@ static void *__slab_alloc(struct kmem_ca
> 	c = this_cpu_ptr(s->cpu_slab);
> #endif
>
> -	/* We handle __GFP_ZERO in the caller */
> -	gfpflags &= ~__GFP_ZERO;
> -

This is the part Linus felt strongly about in the past. It's not needed 
now that we mask GFP flags in new_slab() and the code path is pretty hot 
on workloads where SLAB traditionally has had better performance.

> 	page = c->page;
> 	if (!page)
> 		goto new_slab;
> @@ -2163,7 +2160,6 @@ debug:
>
> 	c->freelist = get_freepointer(s, object);
> 	deactivate_slab(s, c);
> -	c->page = NULL;
> 	c->node = NUMA_NO_NODE;
> 	local_irq_restore(flags);
> 	return object;
>
>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [slub p4 0/7] slub: per cpu partial lists V4
  2011-08-09 21:12 [slub p4 0/7] slub: per cpu partial lists V4 Christoph Lameter
                   ` (7 preceding siblings ...)
  2011-08-13 18:28 ` [slub p4 0/7] slub: per cpu partial lists V4 David Rientjes
@ 2011-08-20 10:48 ` Pekka Enberg
  8 siblings, 0 replies; 20+ messages in thread
From: Pekka Enberg @ 2011-08-20 10:48 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-kernel, rientjes, andi, mgorman, akpm, torvalds

On Tue, 9 Aug 2011, Christoph Lameter wrote:
> V3->V4 : Use a single linked per cpu list instead of a per cpu array.
> 	This results in improvements even for the single threaded
> 	case. I think this is ready for more widespread testing (-next?)
> 	The number of partial pages per cpu is configurable via
> 	/sys/kernel/slab/<name>/cpu_partial
>
> V2->V3 : Work on the todo list. Still some work to be done to reduce
>         code impact and make this all cleaner. (Pekka: patch 1-3
>         are cleanup patches of general usefulness. You got #1 already
>         2+3 could be picked up w/o any issue).
>
> The following patchset introduces per cpu partial lists which allow
> a performance increase of around ~10-20% with hackbench on my Sandybridge
> processor.
>
> These lists help to avoid per node locking overhead. Allocator latency
> could be further reduced by making these operations work without
> disabling interrupts (like the fastpath and the free slowpath) but that
> is another project.
>
> It is interesting to note that BSD has gone to a scheme with partial
> pages only per cpu (source: Adrian). Transfer of cpu ownerships is
> done using IPIs. Probably too much overhead for our taste. The approach
> here keeps the per node partial lists essentially meaning the "pages"
> in there have no cpu owner.

These patches are now in the slub/partial branch of slab.git. I'll probably 
queue them for linux-next early next week if they don't explode on my 
machine.

 			Pekka

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [slub p4 1/7] slub: free slabs without holding locks (V2)
  2011-08-20 10:32   ` Pekka Enberg
@ 2011-08-20 15:58     ` Christoph Lameter
  0 siblings, 0 replies; 20+ messages in thread
From: Christoph Lameter @ 2011-08-20 15:58 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: linux-kernel, rientjes

On Sat, 20 Aug 2011, Pekka Enberg wrote:

> > static void free_partial(struct kmem_cache *s, struct kmem_cache_node *n)
> > {
>
> Is it possible to somehow verify that we're the last thread using the cache
> when SLUB debugging is enabled? It'd be useful for tracking down callers that
> violate this assumption.

Hmmm... We do not track "threads" using slabs. I do not really know what
the "thread" entity that would access a slab would be. A subsystem? A kernel
thread? A user thread? All of these can access any slab at any time.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [slub p4 6/7] slub: per cpu cache for partial pages
  2011-08-20 10:40   ` Pekka Enberg
@ 2011-08-20 16:00     ` Christoph Lameter
  0 siblings, 0 replies; 20+ messages in thread
From: Christoph Lameter @ 2011-08-20 16:00 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: linux-kernel, rientjes

On Sat, 20 Aug 2011, Pekka Enberg wrote:

> > @@ -2919,7 +3071,34 @@ static int kmem_cache_open(struct kmem_c
> >  	 * The larger the object size is, the more pages we want on the
> > partial
> >  	 * list to avoid pounding the page allocator excessively.
> >  	 */
> > -	set_min_partial(s, ilog2(s->size));
> > +	set_min_partial(s, ilog2(s->size) / 2);
>
> Why do we want to make minimum size smaller?

Because we are getting additional partial pages cached for each processor.
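
For example (illustrative size, ignoring any clamping set_min_partial() may
apply): a cache with s->size == 1024 previously kept ilog2(1024) = 10 slabs
as the per node partial floor; with partial pages now also cached per cpu
the floor drops to 10 / 2 = 5.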

> > +	 */
> > +	if (s->size >= PAGE_SIZE)
> > +		s->cpu_partial = 2;
> > +	else if (s->size >= 1024)
> > +		s->cpu_partial = 6;
> > +	else if (s->size >= 256)
> > +		s->cpu_partial = 13;
> > +	else
> > +		s->cpu_partial = 30;
>
> How did you come up with these limits?

These are the per cpu queue limits of SLAB.

> > +		struct {		/* slub per cpu partial pages */
> > +			struct page *next;	/* Next partial slab */
> > +#ifdef CONFIG_64BIT
> > +			int pages;	/* Nr of partial slabs left */
> > +			int pobjects;	/* Approximate # of objects */
> > +#else
> > +			short int pages;
> > +			short int pobjects;
> > +#endif
> > +		};
> > +	};
>
> Why are the sizes different on 32-bit and 64-bit? Does this change 'struct
> page' size?

sizeof(int) == sizeof(long) / 2 on 64 bit,
sizeof(int) == sizeof(long) on 32 bit.

Without the #ifdef the page struct could get bigger on 32 bit.
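
A rough way to see the arithmetic, assuming these fields have to fit in the
space of the two-pointer lru list head they share a union with (an assumption
on my part, not visible in the quoted hunk):

	64 bit: next (8) + int pages (4) + int pobjects (4)       = 16 bytes
	32 bit: next (4) + short pages (2) + short pobjects (2)   =  8 bytes
	32 bit with plain ints: 4 + 4 + 4                         = 12 bytes

The first two match sizeof(struct list_head) on the respective architecture;
the last would grow the union and with it struct page.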


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [slub p4 2/7] slub: Remove useless statements in __slab_alloc
  2011-08-20 10:44   ` Pekka Enberg
@ 2011-08-20 16:01     ` Christoph Lameter
  0 siblings, 0 replies; 20+ messages in thread
From: Christoph Lameter @ 2011-08-20 16:01 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: linux-kernel, torvalds, rientjes

On Sat, 20 Aug 2011, Pekka Enberg wrote:

> > Cc: torvalds@linux-foundation.org
> > Signed-off-by: Christoph Lameter <cl@linux.com>
>
> Linus wasn't actually on the CC of the email.

Quilt usually does that automatically when it finds a Cc: in the patch.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [slub p4 6/7] slub: per cpu cache for partial pages
       [not found]   ` <CAF1ivSaH9fh6_QvuBkLc5t=zC4mPEAD5ZzsxOuPruDwG9MiZzw@mail.gmail.com>
@ 2011-08-24  7:26     ` Lin Ming
  2011-08-24 13:57       ` Christoph Lameter
  0 siblings, 1 reply; 20+ messages in thread
From: Lin Ming @ 2011-08-24  7:26 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Pekka Enberg, David Rientjes, Andi Kleen, tj,
	Metathronius Galabant, Matt Mackall, Eric Dumazet,
	Adrian Drzewiecki, Li, Shaohua, lkml

On Wed, Aug 10, 2011 at 5:12 AM, Christoph Lameter <cl@linux.com> wrote:
> Allow filling out the rest of the kmem_cache_cpu cacheline with pointers to
> partial pages. The partial page list is used in slab_free() to avoid
> per node lock taking.
> 
> In __slab_alloc() we can then take multiple partial pages off the per
> node partial list in one go, reducing node lock pressure.
> 
> We can also use the per cpu partial list in slab_alloc() to avoid scanning
> partial lists for pages with free objects.
> 
> The main effect of a per cpu partial list is that the per node list_lock
> is taken for batches of partial pages instead of individual ones.
> 
> Potential future enhancements:
> 
> 1. The pickup from the partial list could perhaps be done without disabling
>   interrupts with some work. The free path already puts the page into the
>   per cpu partial list without disabling interrupts.

Nice patches!

Could you share possible ways for this potential enhancement?

Thanks,
Lin Ming

> 
> 2. __slab_free() may have some code paths that could use optimization.



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [slub p4 6/7] slub: per cpu cache for partial pages
  2011-08-24  7:26     ` Lin Ming
@ 2011-08-24 13:57       ` Christoph Lameter
  0 siblings, 0 replies; 20+ messages in thread
From: Christoph Lameter @ 2011-08-24 13:57 UTC (permalink / raw)
  To: Lin Ming
  Cc: Pekka Enberg, David Rientjes, Andi Kleen, tj,
	Metathronius Galabant, Matt Mackall, Eric Dumazet,
	Adrian Drzewiecki, Li, Shaohua, lkml

On Wed, 24 Aug 2011, Lin Ming wrote:

> > Potential future enhancements:
> >
> > 1. The pickup from the partial list could perhaps be done without disabling
> >   interrupts with some work. The free path already puts the page into the
> >   per cpu partial list without disabling interrupts.
>
> Nice patches!
>
> Could you share possible ways for this potential enhancement?

In order to avoid disabling interrupts in the allocation paths, the state
that has to be kept in kmem_cache_cpu needs to be minimized so that it can
be swapped atomically using a this_cpu_cmpxchg. This means getting rid of
the node and slab fields, I guess. I am working on some patches to that
effect.

Once that is done, one can take over the per cpu slab with a
this_cpu_cmpxchg_double and then use that information to update the slab
page etc.
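
As a minimal sketch of the kind of atomic per cpu exchange meant here,
modeled on the existing lockless allocation fastpath in slab_alloc()
(paraphrased, and the slow path takeover itself is still speculative):

	redo:
		/* Paraphrased from the current fastpath, not from this series. */
		tid = c->tid;			/* per cpu transaction id */
		object = c->freelist;
		...
		/*
		 * Exchange (freelist, tid) in one shot without disabling
		 * interrupts. If an interrupt or a migration slipped in, the
		 * tid no longer matches, the cmpxchg fails and we retry.
		 */
		if (unlikely(!this_cpu_cmpxchg_double(
				s->cpu_slab->freelist, s->cpu_slab->tid,
				object, tid,
				get_freepointer_safe(s, object), next_tid(tid))))
			goto redo;

The enhancement described above would shrink kmem_cache_cpu enough that a
similar cmpxchg_double in the slow path could also take over c->page,
removing the remaining local_irq_save()/local_irq_restore() pairs.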


^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2011-08-24 13:58 UTC | newest]

Thread overview: 20+ messages
2011-08-09 21:12 [slub p4 0/7] slub: per cpu partial lists V4 Christoph Lameter
2011-08-09 21:12 ` [slub p4 1/7] slub: free slabs without holding locks (V2) Christoph Lameter
2011-08-20 10:32   ` Pekka Enberg
2011-08-20 15:58     ` Christoph Lameter
2011-08-09 21:12 ` [slub p4 2/7] slub: Remove useless statements in __slab_alloc Christoph Lameter
2011-08-20 10:44   ` Pekka Enberg
2011-08-20 16:01     ` Christoph Lameter
2011-08-09 21:12 ` [slub p4 3/7] slub: Prepare inuse field in new_slab() Christoph Lameter
2011-08-09 21:12 ` [slub p4 4/7] slub: pass kmem_cache_cpu pointer to get_partial() Christoph Lameter
2011-08-09 21:12 ` [slub p4 5/7] slub: return object pointer from get_partial() / new_slab() Christoph Lameter
2011-08-09 21:12 ` [slub p4 6/7] slub: per cpu cache for partial pages Christoph Lameter
2011-08-20 10:40   ` Pekka Enberg
2011-08-20 16:00     ` Christoph Lameter
     [not found]   ` <CAF1ivSaH9fh6_QvuBkLc5t=zC4mPEAD5ZzsxOuPruDwG9MiZzw@mail.gmail.com>
2011-08-24  7:26     ` Lin Ming
2011-08-24 13:57       ` Christoph Lameter
2011-08-09 21:12 ` [slub p4 7/7] slub: update slabinfo tools to report per cpu partial list statistics Christoph Lameter
2011-08-13 18:28 ` [slub p4 0/7] slub: per cpu partial lists V4 David Rientjes
2011-08-15  8:44   ` Pekka Enberg
2011-08-15 14:29   ` Christoph Lameter
2011-08-20 10:48 ` Pekka Enberg
