* [PATCH 1/3] slub: set a criteria for slub node partial adding
@ 2011-12-02  8:23 ` Alex Shi
  0 siblings, 0 replies; 78+ messages in thread
From: Alex Shi @ 2011-12-02  8:23 UTC (permalink / raw)
  To: cl, penberg; +Cc: linux-kernel, linux-mm

From: Alex Shi <alexs@intel.com>

More than once, performance regressions turned out to be caused by whether
slub adds a slab to the head or the tail of the node partial list. That
inspired me to tune the node partial adding and to set a criterion for
choosing the head or tail position when a partial slab is added.
My experiments show that when a slab's used objects are less than 1/4 of
its total objects, adding it to the head gives about a 1.5% improvement on
netperf loopback testing with 2048 clients, on both our 4-socket and
2-socket platforms, including Sandy Bridge and Core2.

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 mm/slub.c |   18 ++++++++----------
 1 files changed, 8 insertions(+), 10 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index ed3334d..c419e80 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1461,14 +1461,13 @@ static void discard_slab(struct kmem_cache *s, struct page *page)
  *
  * list_lock must be held.
  */
-static inline void add_partial(struct kmem_cache_node *n,
-				struct page *page, int tail)
+static inline void add_partial(struct kmem_cache_node *n, struct page *page)
 {
 	n->nr_partial++;
-	if (tail == DEACTIVATE_TO_TAIL)
-		list_add_tail(&page->lru, &n->partial);
-	else
+	if (page->inuse <= page->objects / 4)
 		list_add(&page->lru, &n->partial);
+	else
+		list_add_tail(&page->lru, &n->partial);
 }
 
 /*
@@ -1829,7 +1828,7 @@ redo:
 
 		if (m == M_PARTIAL) {
 
-			add_partial(n, page, tail);
+			add_partial(n, page);
 			stat(s, tail);
 
 		} else if (m == M_FULL) {
@@ -1904,8 +1903,7 @@ static void unfreeze_partials(struct kmem_cache *s)
 				if (l == M_PARTIAL)
 					remove_partial(n, page);
 				else
-					add_partial(n, page,
-						DEACTIVATE_TO_TAIL);
+					add_partial(n, page);
 
 				l = m;
 			}
@@ -2476,7 +2474,7 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
 		 */
 		if (unlikely(!prior)) {
 			remove_full(s, page);
-			add_partial(n, page, DEACTIVATE_TO_TAIL);
+			add_partial(n, page);
 			stat(s, FREE_ADD_PARTIAL);
 		}
 	}
@@ -2793,7 +2791,7 @@ static void early_kmem_cache_node_alloc(int node)
 	init_kmem_cache_node(n, kmem_cache_node);
 	inc_slabs_node(kmem_cache_node, node, page->objects);
 
-	add_partial(n, page, DEACTIVATE_TO_HEAD);
+	add_partial(n, page);
 }
 
 static void free_kmem_cache_nodes(struct kmem_cache *s)
-- 
1.7.0.1


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH 2/3] slub: remove unnecessary statistics, deactivate_to_head/tail
  2011-12-02  8:23 ` Alex Shi
@ 2011-12-02  8:23   ` Alex Shi
  -1 siblings, 0 replies; 78+ messages in thread
From: Alex Shi @ 2011-12-02  8:23 UTC (permalink / raw)
  To: cl, penberg; +Cc: linux-kernel, linux-mm

From: Alex Shi <alexs@intel.com>

Since the head or tail position is now decided automatically in
add_partial(), these statistics are no longer needed.

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 include/linux/slub_def.h |    2 --
 mm/slub.c                |   11 ++---------
 2 files changed, 2 insertions(+), 11 deletions(-)

diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index a32bcfd..509841a 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -29,8 +29,6 @@ enum stat_item {
 	CPUSLAB_FLUSH,		/* Abandoning of the cpu slab */
 	DEACTIVATE_FULL,	/* Cpu slab was full when deactivated */
 	DEACTIVATE_EMPTY,	/* Cpu slab was empty when deactivated */
-	DEACTIVATE_TO_HEAD,	/* Cpu slab was moved to the head of partials */
-	DEACTIVATE_TO_TAIL,	/* Cpu slab was moved to the tail of partials */
 	DEACTIVATE_REMOTE_FREES,/* Slab contained remotely freed objects */
 	DEACTIVATE_BYPASS,	/* Implicit deactivation */
 	ORDER_FALLBACK,		/* Number of times fallback was necessary */
diff --git a/mm/slub.c b/mm/slub.c
index c419e80..65d901f 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1717,13 +1717,11 @@ static void deactivate_slab(struct kmem_cache *s, struct kmem_cache_cpu *c)
 	enum slab_modes l = M_NONE, m = M_NONE;
 	void *freelist;
 	void *nextfree;
-	int tail = DEACTIVATE_TO_HEAD;
 	struct page new;
 	struct page old;
 
 	if (page->freelist) {
 		stat(s, DEACTIVATE_REMOTE_FREES);
-		tail = DEACTIVATE_TO_TAIL;
 	}
 
 	c->tid = next_tid(c->tid);
@@ -1826,12 +1824,11 @@ redo:
 
 			remove_full(s, page);
 
-		if (m == M_PARTIAL) {
+		if (m == M_PARTIAL)
 
 			add_partial(n, page);
-			stat(s, tail);
 
-		} else if (m == M_FULL) {
+		else if (m == M_FULL) {
 
 			stat(s, DEACTIVATE_FULL);
 			add_full(s, n, page);
@@ -5023,8 +5020,6 @@ STAT_ATTR(FREE_SLAB, free_slab);
 STAT_ATTR(CPUSLAB_FLUSH, cpuslab_flush);
 STAT_ATTR(DEACTIVATE_FULL, deactivate_full);
 STAT_ATTR(DEACTIVATE_EMPTY, deactivate_empty);
-STAT_ATTR(DEACTIVATE_TO_HEAD, deactivate_to_head);
-STAT_ATTR(DEACTIVATE_TO_TAIL, deactivate_to_tail);
 STAT_ATTR(DEACTIVATE_REMOTE_FREES, deactivate_remote_frees);
 STAT_ATTR(DEACTIVATE_BYPASS, deactivate_bypass);
 STAT_ATTR(ORDER_FALLBACK, order_fallback);
@@ -5088,8 +5083,6 @@ static struct attribute *slab_attrs[] = {
 	&cpuslab_flush_attr.attr,
 	&deactivate_full_attr.attr,
 	&deactivate_empty_attr.attr,
-	&deactivate_to_head_attr.attr,
-	&deactivate_to_tail_attr.attr,
 	&deactivate_remote_frees_attr.attr,
 	&deactivate_bypass_attr.attr,
 	&order_fallback_attr.attr,
-- 
1.7.0.1


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH 3/3] slub: fill per cpu partial only when free objects larger than one quarter
  2011-12-02  8:23   ` Alex Shi
@ 2011-12-02  8:23     ` Alex Shi
  -1 siblings, 0 replies; 78+ messages in thread
From: Alex Shi @ 2011-12-02  8:23 UTC (permalink / raw)
  To: cl, penberg; +Cc: linux-kernel, linux-mm

From: Alex Shi <alexs@intel.com>

Set a selection criterion for filling the per cpu partial list in the slow
allocation path, and check the remaining PCP space before filling, even
though that data may come from another CPU.
The patch brings another 1.5% performance increase on netperf loopback
testing on our 4-socket and 2-socket machines, including Sandy Bridge and
Core2.

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 mm/slub.c |   43 +++++++++++++++++++++++++++++++------------
 1 files changed, 31 insertions(+), 12 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 65d901f..72df387 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1542,25 +1542,44 @@ static void *get_partial_node(struct kmem_cache *s,
 
 	spin_lock(&n->list_lock);
 	list_for_each_entry_safe(page, page2, &n->partial, lru) {
-		void *t = acquire_slab(s, n, page, object == NULL);
 		int available;
+		void *t;
+		struct page *oldpage;
+		int pobjects;
 
-		if (!t)
-			break;
 
 		if (!object) {
-			c->page = page;
-			c->node = page_to_nid(page);
-			stat(s, ALLOC_FROM_PARTIAL);
-			object = t;
-			available =  page->objects - page->inuse;
+			t = acquire_slab(s, n, page, object == NULL);
+			if (!t)
+				break;
+			else {
+				c->page = page;
+				c->node = page_to_nid(page);
+				stat(s, ALLOC_FROM_PARTIAL);
+				object = t;
+			}
 		} else {
-			page->freelist = t;
-			available = put_cpu_partial(s, page, 0);
+			oldpage = this_cpu_read(s->cpu_slab->partial);
+			pobjects = oldpage ? oldpage->pobjects : 0;
+
+			if (pobjects > s->cpu_partial / 2)
+				break;
+
+			available =  page->objects - page->inuse;
+			if (available >= s->cpu_partial / 4) {
+				t = acquire_slab(s, n, page, object == NULL);
+				if (!t)
+					break;
+				else {
+					page->freelist = t;
+					if (put_cpu_partial(s, page, 0) >
+							s->cpu_partial / 2)
+						break;
+				}
+			}
 		}
-		if (kmem_cache_debug(s) || available > s->cpu_partial / 2)
+		if (kmem_cache_debug(s))
 			break;
-
 	}
 	spin_unlock(&n->list_lock);
 	return object;
-- 
1.7.0.1


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* Re: [PATCH 1/3] slub: set a criteria for slub node partial adding
  2011-12-02  8:23 ` Alex Shi
@ 2011-12-02 11:36   ` Eric Dumazet
  -1 siblings, 0 replies; 78+ messages in thread
From: Eric Dumazet @ 2011-12-02 11:36 UTC (permalink / raw)
  To: Alex Shi; +Cc: cl, penberg, linux-kernel, linux-mm

On Friday 02 December 2011 at 16:23 +0800, Alex Shi wrote:
> From: Alex Shi <alexs@intel.com>
> 
> More than once, performance regressions turned out to be caused by whether
> slub adds a slab to the head or the tail of the node partial list. That
> inspired me to tune the node partial adding and to set a criterion for
> choosing the head or tail position when a partial slab is added.
> My experiments show that when a slab's used objects are less than 1/4 of
> its total objects, adding it to the head gives about a 1.5% improvement on
> netperf loopback testing with 2048 clients, on both our 4-socket and
> 2-socket platforms, including Sandy Bridge and Core2.
> 
> Signed-off-by: Alex Shi <alex.shi@intel.com>
> ---
>  mm/slub.c |   18 ++++++++----------
>  1 files changed, 8 insertions(+), 10 deletions(-)
> 

netperf (loopback or ethernet) is a known stress test for slub, and your
patch removes code that might hurt netperf, but benefit real workload.

Have you tried instead this far less intrusive solution ?

if (tail == DEACTIVATE_TO_TAIL ||
    page->inuse > page->objects / 4)
         list_add_tail(&page->lru, &n->partial);
else
         list_add(&page->lru, &n->partial);




^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 1/3] slub: set a criteria for slub node partial adding
  2011-12-02  8:23 ` Alex Shi
@ 2011-12-02 14:43   ` Christoph Lameter
  -1 siblings, 0 replies; 78+ messages in thread
From: Christoph Lameter @ 2011-12-02 14:43 UTC (permalink / raw)
  To: Alex Shi; +Cc: penberg, linux-kernel, linux-mm

On Fri, 2 Dec 2011, Alex Shi wrote:

> From: Alex Shi <alexs@intel.com>
>
> More than once, performance regressions turned out to be caused by whether
> slub adds a slab to the head or the tail of the node partial list. That
> inspired me to tune the node partial adding and to set a criterion for
> choosing the head or tail position when a partial slab is added.
> My experiments show that when a slab's used objects are less than 1/4 of
> its total objects, adding it to the head gives about a 1.5% improvement on
> netperf loopback testing with 2048 clients, on both our 4-socket and
> 2-socket platforms, including Sandy Bridge and Core2.

The number of free objects in a slab may have nothing to do with cache
hotness of all objects in the slab. You can only be sure that one object
(the one that was freed) is cache hot. Netperf may use them in sequence
and therefore you are likely to get series of frees on the same slab
page. How are other benchmarks affected by this change?

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 2/3] slub: remove unnecessary statistics, deactivate_to_head/tail
  2011-12-02  8:23   ` Alex Shi
@ 2011-12-02 14:44     ` Christoph Lameter
  -1 siblings, 0 replies; 78+ messages in thread
From: Christoph Lameter @ 2011-12-02 14:44 UTC (permalink / raw)
  To: Alex Shi; +Cc: penberg, linux-kernel, linux-mm

On Fri, 2 Dec 2011, Alex Shi wrote:

> Since the head or tail position is now decided automatically in
> add_partial(), these statistics are no longer needed.

You need to update tools/slub/slabinfo.c as well.


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 1/3] slub: set a criteria for slub node partial adding
  2011-12-02 11:36   ` Eric Dumazet
@ 2011-12-02 20:02     ` Christoph Lameter
  -1 siblings, 0 replies; 78+ messages in thread
From: Christoph Lameter @ 2011-12-02 20:02 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Alex Shi, penberg, linux-kernel, linux-mm

On Fri, 2 Dec 2011, Eric Dumazet wrote:

> netperf (loopback or ethernet) is a known stress test for slub, and your
> patch removes code that might hurt netperf, but benefit real workload.
>
> Have you tried instead this far less intrusive solution ?
>
> if (tail == DEACTIVATE_TO_TAIL ||
>     page->inuse > page->objects / 4)
>          list_add_tail(&page->lru, &n->partial);
> else
>          list_add(&page->lru, &n->partial);

One could also move this logic to reside outside of the call to
add_partial(). This is called mostly from __slab_free() so the logic could
be put in there.
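
For illustration, a minimal sketch of how that head/tail choice could be made
directly in the __slab_free() free path, based on the 3.2-era code shown in
patch 1/3 (only a sketch of the idea, not an actual proposal from this thread):

	if (unlikely(!prior)) {
		remove_full(s, page);
		/*
		 * Decide head vs. tail at the free site instead of inside
		 * add_partial(): mostly-free slabs go to the head, mostly
		 * used slabs go to the tail.
		 */
		n->nr_partial++;
		if (page->inuse <= page->objects / 4)
			list_add(&page->lru, &n->partial);
		else
			list_add_tail(&page->lru, &n->partial);
		stat(s, FREE_ADD_PARTIAL);
	}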


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 1/3] slub: set a criteria for slub node partial adding
  2011-12-02 20:02     ` Christoph Lameter
@ 2011-12-05  2:21       ` Shaohua Li
  -1 siblings, 0 replies; 78+ messages in thread
From: Shaohua Li @ 2011-12-05  2:21 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Eric Dumazet, Shi, Alex, penberg, linux-kernel, linux-mm

On Sat, 2011-12-03 at 04:02 +0800, Christoph Lameter wrote:
> On Fri, 2 Dec 2011, Eric Dumazet wrote:
> 
> > netperf (loopback or ethernet) is a known stress test for slub, and your
> > patch removes code that might hurt netperf, but benefit real workload.
> >
> > Have you tried instead this far less intrusive solution ?
> >
> > if (tail == DEACTIVATE_TO_TAIL ||
> >     page->inuse > page->objects / 4)
> >          list_add_tail(&page->lru, &n->partial);
> > else
> >          list_add(&page->lru, &n->partial);
> 
> One could also move this logic to reside outside of the call to
> add_partial(). This is called mostly from __slab_free() so the logic could
> be put in there.
I'm wondering where the improvement comes from. The newly added partial
page almost always has only a few free objects (I thought the inuse <
objects/4 case is not common), which is why we add it to the list tail.


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 1/3] slub: set a criteria for slub node partial adding
  2011-12-02 11:36   ` Eric Dumazet
@ 2011-12-05  3:28     ` Alex,Shi
  -1 siblings, 0 replies; 78+ messages in thread
From: Alex,Shi @ 2011-12-05  3:28 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: cl, penberg, linux-kernel, linux-mm

On Fri, 2011-12-02 at 19:36 +0800, Eric Dumazet wrote:
> On Friday 02 December 2011 at 16:23 +0800, Alex Shi wrote:
> > From: Alex Shi <alexs@intel.com>
> > 
> > More than once, performance regressions turned out to be caused by whether
> > slub adds a slab to the head or the tail of the node partial list. That
> > inspired me to tune the node partial adding and to set a criterion for
> > choosing the head or tail position when a partial slab is added.
> > My experiments show that when a slab's used objects are less than 1/4 of
> > its total objects, adding it to the head gives about a 1.5% improvement on
> > netperf loopback testing with 2048 clients, on both our 4-socket and
> > 2-socket platforms, including Sandy Bridge and Core2.
> > 
> > Signed-off-by: Alex Shi <alex.shi@intel.com>
> > ---
> >  mm/slub.c |   18 ++++++++----------
> >  1 files changed, 8 insertions(+), 10 deletions(-)
> > 
> 
> netperf (loopback or ethernet) is a known stress test for slub, and your
> patch removes code that might hurt netperf, but benefit real workload.
> 
> Have you tried instead this far less intrusive solution ?
> 
> if (tail == DEACTIVATE_TO_TAIL ||
>     page->inuse > page->objects / 4)
>          list_add_tail(&page->lru, &n->partial);
> else
>          list_add(&page->lru, &n->partial);

For loopback netperf, it shows no clear performance change on any
platform.
For hackbench testing, it is a bit worse (0.5~1%) on a 2P NHM machine, but
it helps by about 2% on a 4P (8 cores * 2 SMT) NHM machine.

I had thought there would not be much hot/cold cache effect left after the
per cpu partial adding, but it seems that for hackbench the node partial
list still has a significant effect.


> 
> 
> 



^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 1/3] slub: set a criteria for slub node partial adding
  2011-12-02 14:43   ` Christoph Lameter
@ 2011-12-05  9:22     ` Alex,Shi
  -1 siblings, 0 replies; 78+ messages in thread
From: Alex,Shi @ 2011-12-05  9:22 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: penberg, linux-kernel, linux-mm, Andi Kleen

On Fri, 2011-12-02 at 22:43 +0800, Christoph Lameter wrote:
> On Fri, 2 Dec 2011, Alex Shi wrote:
> 
> > From: Alex Shi <alexs@intel.com>
> >
> > More than once, performance regressions turned out to be caused by whether
> > slub adds a slab to the head or the tail of the node partial list. That
> > inspired me to tune the node partial adding and to set a criterion for
> > choosing the head or tail position when a partial slab is added.
> > My experiments show that when a slab's used objects are less than 1/4 of
> > its total objects, adding it to the head gives about a 1.5% improvement on
> > netperf loopback testing with 2048 clients, on both our 4-socket and
> > 2-socket platforms, including Sandy Bridge and Core2.
> 
> The number of free objects in a slab may have nothing to do with cache
> hotness of all objects in the slab. You can only be sure that one object
> (the one that was freed) is cache hot. Netperf may use them in sequence
> and therefore you are likely to get series of frees on the same slab
> page. How are other benchmarks affected by this change?

The previous testing was based on 3.2-rc1; it showed no clear change in
hackbench performance, while netperf got some benefit. But it seems the
results changed after the irqsafe_cpu_cmpxchg patch. I am collecting the
new results.

As to the cache-hot benefit, my understanding is that if the same object
is reused, its contents will be refilled from memory anyway, but reuse
does save a CPU cache line replacement.

On the other hand, consider the lock contention on node->list_lock, as
explained in commit 130655ef0979: more free objects reduce contention on
this lock. Balancing the two is a bit tricky. :(




^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 1/3] slub: set a criteria for slub node partial adding
  2011-12-02 20:02     ` Christoph Lameter
@ 2011-12-05 10:01       ` Alex,Shi
  -1 siblings, 0 replies; 78+ messages in thread
From: Alex,Shi @ 2011-12-05 10:01 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Eric Dumazet, penberg, linux-kernel, linux-mm

On Sat, 2011-12-03 at 04:02 +0800, Christoph Lameter wrote:
> On Fri, 2 Dec 2011, Eric Dumazet wrote:
> 
> > netperf (loopback or ethernet) is a known stress test for slub, and your
> > patch removes code that might hurt netperf, but benefit real workload.
> >
> > Have you tried instead this far less intrusive solution ?
> >
> > if (tail == DEACTIVATE_TO_TAIL ||
> >     page->inuse > page->objects / 4)
> >          list_add_tail(&page->lru, &n->partial);
> > else
> >          list_add(&page->lru, &n->partial);
> 
> One could also move this logic to reside outside of the call to
> add_partial(). This is called mostly from __slab_free() so the logic could
> be put in there.
> 

After the per cpu partial (PCP) change, add_partial() is only used via
put_cpu_partial() -> unfreeze_partials() when debugging is not enabled. If
we need to make a change, I guess it is better to do it in that function.

BTW, I collected some data with the PCP statistics patch below. I will be
very glad if you like it.

[alexs@lkp-ne04 ~]$ sudo grep . /sys/kernel/slab/kmalloc-256/*

/sys/kernel/slab/kmalloc-256/alloc_from_partial:4955645 
/sys/kernel/slab/kmalloc-256/alloc_from_pcp:6753981 
...
/sys/kernel/slab/kmalloc-256/pcp_from_free:11743977 
/sys/kernel/slab/kmalloc-256/pcp_from_node:5948883 
...
/sys/kernel/slab/kmalloc-256/unfreeze_pcp:834262 


--------------

From aa754e20b81cb9f5ab63800a084858d25c18db31 Mon Sep 17 00:00:00 2001
From: Alex shi <alex.shi@intel.com>
Date: Tue, 6 Dec 2011 01:49:16 +0800
Subject: [PATCH] slub: per cpu partial statistics collection

PCP statistics are not collected in detail at the moment. Add and rename
some counters for this.

changed:
cpu_partial_alloc --> alloc_from_pcp,
cpu_partial_free  --> pcp_from_free, /* pcp refilled from slab free */

added:
pcp_from_node, /* pcp refilled from node partial */
unfreeze_pcp,  /* unfreeze pcp */

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 include/linux/slub_def.h |    8 +++++---
 mm/slub.c                |   22 ++++++++++++++--------
 tools/slub/slabinfo.c    |   12 ++++++------
 3 files changed, 25 insertions(+), 17 deletions(-)

diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index a32bcfd..1c2669b 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -21,7 +21,7 @@ enum stat_item {
 	FREE_FROZEN,		/* Freeing to frozen slab */
 	FREE_ADD_PARTIAL,	/* Freeing moves slab to partial list */
 	FREE_REMOVE_PARTIAL,	/* Freeing removes last object */
-	ALLOC_FROM_PARTIAL,	/* Cpu slab acquired from partial list */
+	ALLOC_FROM_PARTIAL,	/* Cpu slab acquired from node partial list */
 	ALLOC_SLAB,		/* Cpu slab acquired from page allocator */
 	ALLOC_REFILL,		/* Refill cpu slab from slab freelist */
 	ALLOC_NODE_MISMATCH,	/* Switching cpu slab */
@@ -36,8 +36,10 @@ enum stat_item {
 	ORDER_FALLBACK,		/* Number of times fallback was necessary */
 	CMPXCHG_DOUBLE_CPU_FAIL,/* Failure of this_cpu_cmpxchg_double */
 	CMPXCHG_DOUBLE_FAIL,	/* Number of times that cmpxchg double did not match */
-	CPU_PARTIAL_ALLOC,	/* Used cpu partial on alloc */
-	CPU_PARTIAL_FREE,	/* USed cpu partial on free */
+	ALLOC_FROM_PCP,		/* Used cpu partial on alloc */
+	PCP_FROM_FREE,		/* Fill cpu partial from free */
+	PCP_FROM_NODE,		/* Fill cpu partial from node partial */
+	UNFREEZE_PCP,		/* Unfreeze per cpu partial */
 	NR_SLUB_STAT_ITEMS };
 
 struct kmem_cache_cpu {
diff --git a/mm/slub.c b/mm/slub.c
index ed3334d..5843846 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1558,6 +1558,7 @@ static void *get_partial_node(struct kmem_cache *s,
 		} else {
 			page->freelist = t;
 			available = put_cpu_partial(s, page, 0);
+			stat(s, PCP_FROM_NODE);
 		}
 		if (kmem_cache_debug(s) || available > s->cpu_partial / 2)
 			break;
@@ -1968,6 +1969,7 @@ int put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
 				local_irq_restore(flags);
 				pobjects = 0;
 				pages = 0;
+				stat(s, UNFREEZE_PCP);
 			}
 		}
 
@@ -1979,7 +1981,6 @@ int put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
 		page->next = oldpage;
 
 	} while (irqsafe_cpu_cmpxchg(s->cpu_slab->partial, oldpage, page) != oldpage);
-	stat(s, CPU_PARTIAL_FREE);
 	return pobjects;
 }
 
@@ -2212,7 +2213,7 @@ new_slab:
 		c->page = c->partial;
 		c->partial = c->page->next;
 		c->node = page_to_nid(c->page);
-		stat(s, CPU_PARTIAL_ALLOC);
+		stat(s, ALLOC_FROM_PCP);
 		c->freelist = NULL;
 		goto redo;
 	}
@@ -2448,9 +2449,10 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
 		 * If we just froze the page then put it onto the
 		 * per cpu partial list.
 		 */
-		if (new.frozen && !was_frozen)
+		if (new.frozen && !was_frozen) {
 			put_cpu_partial(s, page, 1);
-
+			stat(s, PCP_FROM_FREE);
+		}
 		/*
 		 * The list lock was not taken therefore no list
 		 * activity can be necessary.
@@ -5032,8 +5034,10 @@ STAT_ATTR(DEACTIVATE_BYPASS, deactivate_bypass);
 STAT_ATTR(ORDER_FALLBACK, order_fallback);
 STAT_ATTR(CMPXCHG_DOUBLE_CPU_FAIL, cmpxchg_double_cpu_fail);
 STAT_ATTR(CMPXCHG_DOUBLE_FAIL, cmpxchg_double_fail);
-STAT_ATTR(CPU_PARTIAL_ALLOC, cpu_partial_alloc);
-STAT_ATTR(CPU_PARTIAL_FREE, cpu_partial_free);
+STAT_ATTR(ALLOC_FROM_PCP, alloc_from_pcp);
+STAT_ATTR(PCP_FROM_FREE, pcp_from_free);
+STAT_ATTR(PCP_FROM_NODE, pcp_from_node);
+STAT_ATTR(UNFREEZE_PCP, unfreeze_pcp);
 #endif
 
 static struct attribute *slab_attrs[] = {
@@ -5097,8 +5101,10 @@ static struct attribute *slab_attrs[] = {
 	&order_fallback_attr.attr,
 	&cmpxchg_double_fail_attr.attr,
 	&cmpxchg_double_cpu_fail_attr.attr,
-	&cpu_partial_alloc_attr.attr,
-	&cpu_partial_free_attr.attr,
+	&alloc_from_pcp_attr.attr,
+	&pcp_from_free_attr.attr,
+	&pcp_from_node_attr.attr,
+	&unfreeze_pcp_attr.attr,
 #endif
 #ifdef CONFIG_FAILSLAB
 	&failslab_attr.attr,
diff --git a/tools/slub/slabinfo.c b/tools/slub/slabinfo.c
index 164cbcf..d8f67f0 100644
--- a/tools/slub/slabinfo.c
+++ b/tools/slub/slabinfo.c
@@ -42,7 +42,7 @@ struct slabinfo {
 	unsigned long deactivate_remote_frees, order_fallback;
 	unsigned long cmpxchg_double_cpu_fail, cmpxchg_double_fail;
 	unsigned long alloc_node_mismatch, deactivate_bypass;
-	unsigned long cpu_partial_alloc, cpu_partial_free;
+	unsigned long alloc_from_pcp, pcp_from_free;
 	int numa[MAX_NODES];
 	int numa_partial[MAX_NODES];
 } slabinfo[MAX_SLABS];
@@ -457,9 +457,9 @@ static void slab_stats(struct slabinfo *s)
 		s->free_remove_partial * 100 / total_free);
 
 	printf("Cpu partial list     %8lu %8lu %3lu %3lu\n",
-		s->cpu_partial_alloc, s->cpu_partial_free,
-		s->cpu_partial_alloc * 100 / total_alloc,
-		s->cpu_partial_free * 100 / total_free);
+		s->alloc_from_pcp, s->pcp_from_free,
+		s->alloc_from_pcp * 100 / total_alloc,
+		s->pcp_from_free * 100 / total_free);
 
 	printf("RemoteObj/SlabFrozen %8lu %8lu %3lu %3lu\n",
 		s->deactivate_remote_frees, s->free_frozen,
@@ -1215,8 +1215,8 @@ static void read_slab_dir(void)
 			slab->order_fallback = get_obj("order_fallback");
 			slab->cmpxchg_double_cpu_fail = get_obj("cmpxchg_double_cpu_fail");
 			slab->cmpxchg_double_fail = get_obj("cmpxchg_double_fail");
-			slab->cpu_partial_alloc = get_obj("cpu_partial_alloc");
-			slab->cpu_partial_free = get_obj("cpu_partial_free");
+			slab->alloc_from_pcp = get_obj("alloc_from_pcp");
+			slab->pcp_from_free = get_obj("pcp_from_free");
 			slab->alloc_node_mismatch = get_obj("alloc_node_mismatch");
 			slab->deactivate_bypass = get_obj("deactivate_bypass");
 			chdir("..");
-- 
1.7.0.1




^ permalink raw reply related	[flat|nested] 78+ messages in thread

* Re: [PATCH 1/3] slub: set a criteria for slub node partial adding
  2011-12-05  9:22     ` Alex,Shi
@ 2011-12-06 21:06       ` David Rientjes
  -1 siblings, 0 replies; 78+ messages in thread
From: David Rientjes @ 2011-12-06 21:06 UTC (permalink / raw)
  To: Alex,Shi; +Cc: Christoph Lameter, penberg, linux-kernel, linux-mm, Andi Kleen

On Mon, 5 Dec 2011, Alex,Shi wrote:

> The previous testing was based on 3.2-rc1; it showed no clear change in
> hackbench performance, while netperf got some benefit. But it seems the
> results changed after the irqsafe_cpu_cmpxchg patch. I am collecting the
> new results.
> 

netperf will also degrade with this change on some machines, there's no 
clear heuristic that can be used to benefit all workloads when deciding 
where to add a partial slab into the list.  Cache hotness is great but 
your patch doesn't address situations where frees happen to a partial slab 
such that they may be entirely free (or at least below your 1:4 inuse to 
nr_objs threshold) at the time you want to deactivate the cpu slab.

I had a patchset that iterated the partial list and found the "most free" 
partial slab (and terminated prematurely if a threshold had been reached, 
much like yours) and selected that one, and it helped netperf 2-3% in my 
testing.  So I disagree with determining where to add a partial slab to 
the list at the time of free because it doesn't infer its state at the 
time of cpu slab deactivation.
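
For illustration only, a rough sketch of that kind of "most free" selection
over the node partial list, using the page->inuse/page->objects fields from
the patches above (the threshold and loop shape are illustrative, not David's
actual patchset):

	struct page *page, *best = NULL;
	int best_free = 0;

	list_for_each_entry(page, &n->partial, lru) {
		int free = page->objects - page->inuse;

		if (free > best_free) {
			best = page;
			best_free = free;
		}
		/* Terminate early once a slab is "free enough". */
		if (best_free >= page->objects / 2)
			break;
	}
	/* best (if non-NULL) is the partial slab to acquire. */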

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 2/3] slub: remove unnecessary statistics, deactivate_to_head/tail
  2011-12-02  8:23   ` Alex Shi
@ 2011-12-06 21:08     ` David Rientjes
  -1 siblings, 0 replies; 78+ messages in thread
From: David Rientjes @ 2011-12-06 21:08 UTC (permalink / raw)
  To: Alex Shi; +Cc: cl, penberg, linux-kernel, linux-mm

On Fri, 2 Dec 2011, Alex Shi wrote:

> From: Alex Shi <alexs@intel.com>
> 
> Since the head or tail were automaticly decided in add_partial(),
> we didn't need this statistics again.
> 

Umm, we shouldn't need to remove these statistics at all: if there is 
logic in add_partial() to determine whether to add it to the head or tail, 
the stats can still be incremented there appropriately.  It would actually 
be helpful to cite those stats for your netperf benchmarking when 
determining whether patches should be merged or not.
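
Concretely, a minimal sketch of what that could look like (purely an
illustration, not a posted patch: it assumes the 1:4 criteria from patch
1/3 live inside add_partial(), and it adds an extra "s" argument so that
stat() can be called there; callers would need updating accordingly):

/*
 * Sketch only: keep the DEACTIVATE_TO_HEAD/TO_TAIL counters even when
 * add_partial() picks the position itself.  list_lock must be held.
 */
static inline void add_partial(struct kmem_cache *s,
				struct kmem_cache_node *n, struct page *page)
{
	n->nr_partial++;
	if (page->inuse <= page->objects / 4) {
		list_add(&page->lru, &n->partial);
		stat(s, DEACTIVATE_TO_HEAD);
	} else {
		list_add_tail(&page->lru, &n->partial);
		stat(s, DEACTIVATE_TO_TAIL);
	}
}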

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 1/3] slub: set a criteria for slub node partial adding
  2011-12-06 21:06       ` David Rientjes
@ 2011-12-07  5:11         ` Shaohua Li
  -1 siblings, 0 replies; 78+ messages in thread
From: Shaohua Li @ 2011-12-07  5:11 UTC (permalink / raw)
  To: David Rientjes
  Cc: Shi, Alex, Christoph Lameter, penberg, linux-kernel, linux-mm,
	Andi Kleen

On Wed, 2011-12-07 at 05:06 +0800, David Rientjes wrote:
> On Mon, 5 Dec 2011, Alex,Shi wrote:
> 
> > Previous testing depends on 3.2-rc1, that show hackbench performance has
> > no clear change, and netperf get some benefit. But seems after
> > irqsafe_cpu_cmpxchg patch, the result has some change. I am collecting
> > these results. 
> > 
> 
> netperf will also degrade with this change on some machines, there's no 
> clear heuristic that can be used to benefit all workloads when deciding 
> where to add a partial slab into the list.  Cache hotness is great but 
> your patch doesn't address situations where frees happen to a partial slab 
> such that they may be entirely free (or at least below your 1:4 inuse to 
> nr_objs threshold) at the time you want to deactivate the cpu slab.
> 
> I had a patchset that iterated the partial list and found the "most free" 
> partial slab (and terminated prematurely if a threshold had been reached, 
> much like yours) and selected that one, and it helped netperf 2-3% in my 
> testing.  So I disagree with determining where to add a partial slab to 
> the list at the time of free because it doesn't infer its state at the 
> time of cpu slab deactivation.
interesting. I did a similar experiment before (trying to sort the pages
according to their free object count), but it appears quite hard. The free
count of a page is dynamic, e.g. more objects can be freed while the page
is on the partial list. And in the netperf test, the partial list can get
very, very long. Can you post your patch?  I definitely want to look at it.
What I do see with the partial list is that it wastes a lot of memory. My
test shows about 50% of the memory is wasted. I'm thinking of not always
fetching the oldest page from the partial list, because the chance that
all objects of the oldest page get freed is high. I haven't done any tests
yet; I wonder if it could be helpful.
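
As a rough illustration of that last idea (a sketch only, against the
3.2-era list layout; get_newest_partial_sketch() is a made-up name, not
an existing function): instead of always taking the slab at the head of
n->partial, the allocation path could take the most recently queued one,
so the older slabs at the head get a chance to drain completely.

/*
 * Sketch only: return the most recently queued partial slab (the list
 * tail, assuming most slabs are added with list_add_tail()), leaving
 * the older slabs near the head alone so they may become fully free.
 * Caller must hold n->list_lock.
 */
static struct page *get_newest_partial_sketch(struct kmem_cache_node *n)
{
	if (list_empty(&n->partial))
		return NULL;

	return list_entry(n->partial.prev, struct page, lru);
}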


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 1/3] slub: set a criteria for slub node partial adding
  2011-12-07  5:11         ` Shaohua Li
@ 2011-12-07  7:28           ` David Rientjes
  -1 siblings, 0 replies; 78+ messages in thread
From: David Rientjes @ 2011-12-07  7:28 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Shi, Alex, Christoph Lameter, penberg, linux-kernel, linux-mm,
	Andi Kleen

On Wed, 7 Dec 2011, Shaohua Li wrote:

> interesting. I did similar experiment before (try to sort the page
> according to free number), but it appears quite hard. The free number of
> a page is dynamic, eg more slabs can be freed when the page is in
> partial list. And in netperf test, the partial list could be very very
> long. Can you post your patch, I definitely what to look at it.

It was over a couple of years ago and the slub code has changed 
significantly since then, but you can see the general concept of the "slab 
thrashing" problem with netperf and my solution back then:

	http://marc.info/?l=linux-kernel&m=123839191416478
	http://marc.info/?l=linux-kernel&m=123839203016592
	http://marc.info/?l=linux-kernel&m=123839202916583

I also had a separate patchset that, instead of this approach, would just 
iterate through the partial list in get_partial_node() looking for 
anything where the number of free objects met a certain threshold, which 
still defaulted to 25% and instantly picked it.  The overhead was taking 
slab_lock() for each page, but that was nullified by the performance 
speedup of using the alloc fastpath a majority of the time for both 
kmalloc-256 and kmalloc-2k when in the past it had only been able to serve 
one or two allocs.  If no partial slab met the threshold, the slab_lock() 
is held of the partial slab with the most free objects and returned 
instead.

> What I have about the partial list is it wastes a lot of memory.

That's not going to be helped with the above approach since we typically 
try to fill a partial slab with many free objects, but it also won't be 
severely impacted because if the threshold is kept small enough, then we 
simply return the first partial slab that meets the criteria.  That allows 
the partial slabs at the end of the list to hopefully become mostly free.

And, for completeness, there's also a possibility that you have some 
completely free slabs on the partial list that could be freed back to the 
buddy allocator by decreasing min_partial by way of 
/sys/kernel/slab/cache/min_partial, at the risk of performance, and then 
invoking /sys/kernel/slab/cache/shrink to free the unused slabs.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 1/3] slub: set a criteria for slub node partial adding
  2011-12-02 14:43   ` Christoph Lameter
@ 2011-12-09  8:30     ` Alex,Shi
  -1 siblings, 0 replies; 78+ messages in thread
From: Alex,Shi @ 2011-12-09  8:30 UTC (permalink / raw)
  To: Christoph Lameter, David Rientjes
  Cc: penberg, linux-kernel, linux-mm, Eric Dumazet

On Fri, 2011-12-02 at 22:43 +0800, Christoph Lameter wrote:
> On Fri, 2 Dec 2011, Alex Shi wrote:
> 
> > From: Alex Shi <alexs@intel.com>
> >
> > Times performance regression were due to slub add to node partial head
> > or tail. That inspired me to do tunning on the node partial adding, to
> > set a criteria for head or tail position selection when do partial
> > adding.
> > My experiment show, when used objects is less than 1/4 total objects
> > of slub performance will get about 1.5% improvement on netperf loopback
> > testing with 2048 clients, wherever on our 4 or 2 sockets platforms,
> > includes sandbridge or core2.
> 
> The number of free objects in a slab may have nothing to do with cache
> hotness of all objects in the slab. You can only be sure that one object
> (the one that was freed) is cache hot. Netperf may use them in sequence
> and therefore you are likely to get series of frees on the same slab
> page. How are other benchmarks affected by this change?


I did some experiments on the add_partial() judgment against rc4, such as
putting the slab at the head or tail of the node partial list according to
its free objects, or, as Eric suggested, combining that with the external
tail parameter, like below: 

        n->nr_partial++;
-       if (tail == DEACTIVATE_TO_TAIL)
+       if (tail == DEACTIVATE_TO_TAIL || 
+               page->inuse > page->objects /2)
                list_add_tail(&page->lru, &n->partial);
        else
                list_add(&page->lru, &n->partial);

But the result was not what I had expected. When we put all slabs at the
tail of the node partial list, we get the best performance, even if it is
just a slight improvement. 

{
        n->nr_partial++;
-       if (tail == DEACTIVATE_TO_TAIL)
-               list_add_tail(&page->lru, &n->partial);
-       else
-               list_add(&page->lru, &n->partial);
+       list_add_tail(&page->lru, &n->partial);
 }
 
This change brings about a 2% improvement on our WSM-EP machine and a 1%
improvement on our SNB-EP and NHM-EX machines, with no clear effect on the
Core2 machine, for the hackbench process benchmark:

 	./hackbench 100 process 2000 
 
For multi-client loopback netperf, there is only a questionable 1%
improvement on our 2-socket machine, and the others show no clear effect. 

But when I check the deactivate_to_head/to_tail statistics on the original
code, to_head is hit just hundreds or thousands of times, while to_tail is
hit tens of millions of times. 

David, would you like to try the above change, moving all slabs to the
tail of the node partial list? 

add_partial statistics collection patch: 
---
commit 1ff731282acb521f3a7c2e3fb94d35ec4d0ff07e
Author: Alex Shi <alex.shi@intel.com>
Date:   Fri Dec 9 18:12:14 2011 +0800

    slub: statistics collection for add_partial

diff --git a/mm/slub.c b/mm/slub.c
index 5843846..a2b1143 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1904,10 +1904,11 @@ static void unfreeze_partials(struct kmem_cache *s)
 			if (l != m) {
 				if (l == M_PARTIAL)
 					remove_partial(n, page);
-				else
+				else {
 					add_partial(n, page,
 						DEACTIVATE_TO_TAIL);
-
+					stat(s, DEACTIVATE_TO_TAIL);
+				}
 				l = m;
 			}
 
@@ -2480,6 +2481,7 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
 			remove_full(s, page);
 			add_partial(n, page, DEACTIVATE_TO_TAIL);
 			stat(s, FREE_ADD_PARTIAL);
+			stat(s, DEACTIVATE_TO_TAIL);
 		}
 	}
 	spin_unlock_irqrestore(&n->list_lock, flags);




^ permalink raw reply related	[flat|nested] 78+ messages in thread

* Re: [PATCH 1/3] slub: set a criteria for slub node partial adding
  2011-12-09  8:30     ` Alex,Shi
@ 2011-12-09 10:10       ` David Rientjes
  -1 siblings, 0 replies; 78+ messages in thread
From: David Rientjes @ 2011-12-09 10:10 UTC (permalink / raw)
  To: Alex,Shi; +Cc: Christoph Lameter, penberg, linux-kernel, linux-mm, Eric Dumazet

On Fri, 9 Dec 2011, Alex,Shi wrote:

> I did some experiments on add_partial judgment against rc4, like to put
> the slub into node partial head or tail according to free objects, or
> like Eric's suggest to combine the external parameter, like below: 
> 
>         n->nr_partial++;
> -       if (tail == DEACTIVATE_TO_TAIL)
> +       if (tail == DEACTIVATE_TO_TAIL || 
> +               page->inuse > page->objects /2)
>                 list_add_tail(&page->lru, &n->partial);
>         else
>                 list_add(&page->lru, &n->partial);
> 
> But the result is out of my expectation before.

I don't think you'll get consistent results for all workloads with 
something like this: some things may appear better and other things may 
appear worse.  That's why I've always disagreed with determining whether 
it should be added to the head or to the tail at the time of deactivation: 
you know nothing about frees happening to that slab subsequent to the 
decision you've made.  The only thing that's guaranteed is that you've 
thrown cache-hot objects out the window and potentially increased the 
number of internally fragmented slabs and/or created unnecessarily long 
partial lists.

> Now we set all of slub
> into the tail of node partial, we get the best performance, even it is
> just a slight improvement. 
> 
> {
>         n->nr_partial++;
> -       if (tail == DEACTIVATE_TO_TAIL)
> -               list_add_tail(&page->lru, &n->partial);
> -       else
> -               list_add(&page->lru, &n->partial);
> +       list_add_tail(&page->lru, &n->partial);
>  }
>  
> This change can bring about 2% improvement on our WSM-ep machine, and 1%
> improvement on our SNB-ep and NHM-ex machine. and no clear effect for
> core2 machine. on hackbench process benchmark.
> 
>  	./hackbench 100 process 2000 
>  
> For multiple clients loopback netperf, only a suspicious 1% improvement
> on our 2 sockets machine. and others have no clear effect. 
> 
> But, when I check the deactivate_to_head/to_tail statistics on original
> code, the to_head is just hundreds or thousands times, while to_tail is
> called about teens millions times. 
> 
> David, could you like to try above change? move all slub to partial
> tail. 
> 
> add_partial statistics collection patch: 
> ---
> commit 1ff731282acb521f3a7c2e3fb94d35ec4d0ff07e
> Author: Alex Shi <alex.shi@intel.com>
> Date:   Fri Dec 9 18:12:14 2011 +0800
> 
>     slub: statistics collection for add_partial
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index 5843846..a2b1143 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1904,10 +1904,11 @@ static void unfreeze_partials(struct kmem_cache *s)
>  			if (l != m) {
>  				if (l == M_PARTIAL)
>  					remove_partial(n, page);
> -				else
> +				else {
>  					add_partial(n, page,
>  						DEACTIVATE_TO_TAIL);
> -
> +					stat(s, DEACTIVATE_TO_TAIL);
> +				}
>  				l = m;
>  			}
>  
> @@ -2480,6 +2481,7 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
>  			remove_full(s, page);
>  			add_partial(n, page, DEACTIVATE_TO_TAIL);
>  			stat(s, FREE_ADD_PARTIAL);
> +			stat(s, DEACTIVATE_TO_TAIL);
>  		}
>  	}
>  	spin_unlock_irqrestore(&n->list_lock, flags);

Not sure what you're asking me to test; you would like this:

	{
	        n->nr_partial++;
	-       if (tail == DEACTIVATE_TO_TAIL)
	-               list_add_tail(&page->lru, &n->partial);
	-       else
	-               list_add(&page->lru, &n->partial);
	+       list_add_tail(&page->lru, &n->partial);
	}

with the statistics patch above?  I typically run with CONFIG_SLUB_STATS 
disabled since it impacts performance so heavily and I'm not sure what 
information you're looking for with regards to those stats.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* RE: [PATCH 1/3] slub: set a criteria for slub node partial adding
  2011-12-09 10:10       ` David Rientjes
@ 2011-12-09 13:40         ` Shi, Alex
  -1 siblings, 0 replies; 78+ messages in thread
From: Shi, Alex @ 2011-12-09 13:40 UTC (permalink / raw)
  To: David Rientjes
  Cc: Christoph Lameter, penberg, linux-kernel, linux-mm, Eric Dumazet

> > I did some experiments on add_partial judgment against rc4, like to put
> > the slub into node partial head or tail according to free objects, or
> > like Eric's suggest to combine the external parameter, like below:
> >
> >         n->nr_partial++;
> > -       if (tail == DEACTIVATE_TO_TAIL)
> > +       if (tail == DEACTIVATE_TO_TAIL ||
> > +               page->inuse > page->objects /2)
> >                 list_add_tail(&page->lru, &n->partial);
> >         else
> >                 list_add(&page->lru, &n->partial);
> >
> > But the result is out of my expectation before.
> 
> I don't think you'll get consistent results for all workloads with
> something like this, some things may appear better and other things may
> appear worse.  That's why I've always disagreed with determining whether
> it should be added to the head or to the tail at the time of deactivation:
> you know nothing about frees happening to that slab subsequent to the
> decision you've made.  The only thing that's guaranteed is that you've
> through cache hot objects out the window and potentially increased the
> amount of internally fragmented slabs and/or unnecessarily long partial
> lists.

Saying the result was not my original expectation doesn't mean my data has 
a problem. :) 
Of course any testing has some variation in its results, but the benchmarks 
are run accordingly, and there are a lot of techniques for tuning your 
testing so that its standard deviation stays acceptable: syncing the system 
into a clean state, shutting down unnecessary services, using separate 
working disks for the test, and so on.  For this data, on my SNB-EP machine 
(the following numbers do not stand for Intel, they are just my personal 
data): four hackbench runs with this patch give 5.59, 5.475, 5.47833 and 
5.504, while further runs on the original rc4 range from 5.54 to 5.61.  The 
standard deviation shows the results are stable and believable on my side. 
But since, among our hundreds of benchmarks, only hackbench and loopback 
netperf are sensitive to slub changes, and since it seems you did some 
testing on this, I thought you might like to double-confirm with real data. 

In fact, I also collected 'perf stat' cache miss and cache reference data, 
but those numbers fluctuate too much and are not as stable as hackbench 
itself.
 
> Not sure what you're asking me to test, you would like this:
> 
> 	{
> 	        n->nr_partial++;
> 	-       if (tail == DEACTIVATE_TO_TAIL)
> 	-               list_add_tail(&page->lru, &n->partial);
> 	-       else
> 	-               list_add(&page->lru, &n->partial);
> 	+       list_add_tail(&page->lru, &n->partial);
> 	}
> 
> with the statistics patch above?  I typically run with CONFIG_SLUB_STATS
> disabled since it impacts performance so heavily and I'm not sure what
> information you're looking for with regards to those stats.

No.  When you collect data, please disable SLUB_STATS in the kernel config. 
The _to_head statistics collection patch just tells you that the statistics 
I collected do not include the add_partial() call in 
early_kmem_cache_node_alloc(); the other add_partial() call sites are 
covered.  Of course, a kernel with statistics enabled cannot be used to 
measure performance. 

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 1/3] slub: set a criteria for slub node partial adding
  2011-12-07  7:28           ` David Rientjes
@ 2011-12-12  2:43             ` Shaohua Li
  -1 siblings, 0 replies; 78+ messages in thread
From: Shaohua Li @ 2011-12-12  2:43 UTC (permalink / raw)
  To: David Rientjes
  Cc: Shi, Alex, Christoph Lameter, penberg, linux-kernel, linux-mm,
	Andi Kleen

On Wed, 2011-12-07 at 15:28 +0800, David Rientjes wrote:
> On Wed, 7 Dec 2011, Shaohua Li wrote:
> 
> > interesting. I did similar experiment before (try to sort the page
> > according to free number), but it appears quite hard. The free number of
> > a page is dynamic, eg more slabs can be freed when the page is in
> > partial list. And in netperf test, the partial list could be very very
> > long. Can you post your patch, I definitely what to look at it.
> 
> It was over a couple of years ago and the slub code has changed 
> significantly since then, but you can see the general concept of the "slab 
> thrashing" problem with netperf and my solution back then:
> 
> 	http://marc.info/?l=linux-kernel&m=123839191416478
> 	http://marc.info/?l=linux-kernel&m=123839203016592
> 	http://marc.info/?l=linux-kernel&m=123839202916583
> 
> I also had a separate patchset that, instead of this approach, would just 
> iterate through the partial list in get_partial_node() looking for 
> anything where the number of free objects met a certain threshold, which 
> still defaulted to 25% and instantly picked it.  The overhead was taking 
> slab_lock() for each page, but that was nullified by the performance 
> speedup of using the alloc fastpath a majority of the time for both 
> kmalloc-256 and kmalloc-2k when in the past it had only been able to serve 
> one or two allocs.  If no partial slab met the threshold, the slab_lock() 
> is held of the partial slab with the most free objects and returned 
> instead.
With the per-cpu partial list, I didn't see any workload which is still
suffering from the list lock, so I suppose neither the anti-thrashing
approach nor the pick-a-25%-used-slab approach would help. The per-cpu
partial list is flushed wholesale after s->cpu_partial objects are freed;
this is a little aggressive, because the per-cpu partial list then needs
to be refilled again soon after an allocation. I experimented with
separate per-cpu alloc/free partial lists, which can avoid this, but again
I didn't see any workload still suffering from the list lock issue, even
with netperf, which stresses slub heavily. Did you see such a workload?
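
For reference, a simplified sketch of that flush behaviour (not the
verbatim 3.2 code: put_cpu_partial_sketch() is an illustrative name, and
the real put_cpu_partial() uses a this_cpu_cmpxchg() loop plus irq
handling that are dropped here):

/*
 * Simplified illustration: frozen slabs pile up on the per-cpu partial
 * list until the running count of their free objects exceeds
 * s->cpu_partial, at which point the whole list is unfrozen back onto
 * the node partial list in one go.
 */
static void put_cpu_partial_sketch(struct kmem_cache *s, struct page *page)
{
	struct page *old = this_cpu_read(s->cpu_slab->partial);
	int pobjects = old ? old->pobjects : 0;

	if (pobjects > s->cpu_partial) {
		unfreeze_partials(s);	/* flush everything back to the node */
		old = NULL;
		pobjects = 0;
	}

	page->next = old;
	page->pobjects = pobjects + page->objects - page->inuse;
	this_cpu_write(s->cpu_slab->partial, page);
}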


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 1/3] slub: set a criteria for slub node partial adding
  2011-12-12  2:43             ` Shaohua Li
@ 2011-12-12  4:14               ` Alex,Shi
  -1 siblings, 0 replies; 78+ messages in thread
From: Alex,Shi @ 2011-12-12  4:14 UTC (permalink / raw)
  To: Li, Shaohua
  Cc: David Rientjes, Christoph Lameter, penberg, linux-kernel,
	linux-mm, Andi Kleen

On Mon, 2011-12-12 at 10:43 +0800, Li, Shaohua wrote:
> On Wed, 2011-12-07 at 15:28 +0800, David Rientjes wrote:
> > On Wed, 7 Dec 2011, Shaohua Li wrote:
> > 
> > > interesting. I did similar experiment before (try to sort the page
> > > according to free number), but it appears quite hard. The free number of
> > > a page is dynamic, eg more slabs can be freed when the page is in
> > > partial list. And in netperf test, the partial list could be very very
> > > long. Can you post your patch, I definitely what to look at it.
> > 
> > It was over a couple of years ago and the slub code has changed 
> > significantly since then, but you can see the general concept of the "slab 
> > thrashing" problem with netperf and my solution back then:
> > 
> > 	http://marc.info/?l=linux-kernel&m=123839191416478
> > 	http://marc.info/?l=linux-kernel&m=123839203016592
> > 	http://marc.info/?l=linux-kernel&m=123839202916583
> > 
> > I also had a separate patchset that, instead of this approach, would just 
> > iterate through the partial list in get_partial_node() looking for 
> > anything where the number of free objects met a certain threshold, which 
> > still defaulted to 25% and instantly picked it.  The overhead was taking 
> > slab_lock() for each page, but that was nullified by the performance 
> > speedup of using the alloc fastpath a majority of the time for both 
> > kmalloc-256 and kmalloc-2k when in the past it had only been able to serve 
> > one or two allocs.  If no partial slab met the threshold, the slab_lock() 
> > is held of the partial slab with the most free objects and returned 
> > instead.
> With the per-cpu partial list, I didn't see any workload which is still
> suffering from the list lock, 

The merge error that you fixed in 3.2-rc1 for the hackbench regression was
due to adding slabs to the head of the node partial list. And the hackbench
data shows the node partial list is still heavily used for allocation: 

/sys/kernel/slab/kmalloc-256/alloc_fastpath:225208640 
/sys/kernel/slab/kmalloc-256/alloc_from_partial:5276300 
/sys/kernel/slab/kmalloc-256/alloc_from_pcp:8326041 



^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 1/3] slub: set a criteria for slub node partial adding
  2011-12-12  4:35                 ` Shaohua Li
@ 2011-12-12  4:25                   ` Alex,Shi
  -1 siblings, 0 replies; 78+ messages in thread
From: Alex,Shi @ 2011-12-12  4:25 UTC (permalink / raw)
  To: Li, Shaohua
  Cc: David Rientjes, Christoph Lameter, penberg, linux-kernel,
	linux-mm, Andi Kleen

On Mon, 2011-12-12 at 12:35 +0800, Li, Shaohua wrote:
> On Mon, 2011-12-12 at 12:14 +0800, Shi, Alex wrote:
> > On Mon, 2011-12-12 at 10:43 +0800, Li, Shaohua wrote:
> > > On Wed, 2011-12-07 at 15:28 +0800, David Rientjes wrote:
> > > > On Wed, 7 Dec 2011, Shaohua Li wrote:
> > > > 
> > > > > interesting. I did similar experiment before (try to sort the page
> > > > > according to free number), but it appears quite hard. The free number of
> > > > > a page is dynamic, eg more slabs can be freed when the page is in
> > > > > partial list. And in netperf test, the partial list could be very very
> > > > > long. Can you post your patch, I definitely what to look at it.
> > > > 
> > > > It was over a couple of years ago and the slub code has changed 
> > > > significantly since then, but you can see the general concept of the "slab 
> > > > thrashing" problem with netperf and my solution back then:
> > > > 
> > > > 	http://marc.info/?l=linux-kernel&m=123839191416478
> > > > 	http://marc.info/?l=linux-kernel&m=123839203016592
> > > > 	http://marc.info/?l=linux-kernel&m=123839202916583
> > > > 
> > > > I also had a separate patchset that, instead of this approach, would just 
> > > > iterate through the partial list in get_partial_node() looking for 
> > > > anything where the number of free objects met a certain threshold, which 
> > > > still defaulted to 25% and instantly picked it.  The overhead was taking 
> > > > slab_lock() for each page, but that was nullified by the performance 
> > > > speedup of using the alloc fastpath a majority of the time for both 
> > > > kmalloc-256 and kmalloc-2k when in the past it had only been able to serve 
> > > > one or two allocs.  If no partial slab met the threshold, the slab_lock() 
> > > > is held of the partial slab with the most free objects and returned 
> > > > instead.
> > > With the per-cpu partial list, I didn't see any workload which is still
> > > suffering from the list lock, 
> > 
> > The merge error that you fixed in 3.2-rc1 for hackbench regression is
> > due to add slub to node partial head. And data of hackbench show node
> > partial is still heavy used in allocation. 
> The patch is already in base kernel, did you mean even with it you still
> saw the list locking issue with latest kernel?
> 

Yes, list_lock still hurts performance. It would be helpful if you could do
some optimization for it. 



^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 1/3] slub: set a criteria for slub node partial adding
  2011-12-12  4:14               ` Alex,Shi
@ 2011-12-12  4:35                 ` Shaohua Li
  -1 siblings, 0 replies; 78+ messages in thread
From: Shaohua Li @ 2011-12-12  4:35 UTC (permalink / raw)
  To: Shi, Alex
  Cc: David Rientjes, Christoph Lameter, penberg, linux-kernel,
	linux-mm, Andi Kleen

On Mon, 2011-12-12 at 12:14 +0800, Shi, Alex wrote:
> On Mon, 2011-12-12 at 10:43 +0800, Li, Shaohua wrote:
> > On Wed, 2011-12-07 at 15:28 +0800, David Rientjes wrote:
> > > On Wed, 7 Dec 2011, Shaohua Li wrote:
> > > 
> > > > interesting. I did similar experiment before (try to sort the page
> > > > according to free number), but it appears quite hard. The free number of
> > > > a page is dynamic, eg more slabs can be freed when the page is in
> > > > partial list. And in netperf test, the partial list could be very very
> > > > long. Can you post your patch, I definitely what to look at it.
> > > 
> > > It was over a couple of years ago and the slub code has changed 
> > > significantly since then, but you can see the general concept of the "slab 
> > > thrashing" problem with netperf and my solution back then:
> > > 
> > > 	http://marc.info/?l=linux-kernel&m=123839191416478
> > > 	http://marc.info/?l=linux-kernel&m=123839203016592
> > > 	http://marc.info/?l=linux-kernel&m=123839202916583
> > > 
> > > I also had a separate patchset that, instead of this approach, would just 
> > > iterate through the partial list in get_partial_node() looking for 
> > > anything where the number of free objects met a certain threshold, which 
> > > still defaulted to 25% and instantly picked it.  The overhead was taking 
> > > slab_lock() for each page, but that was nullified by the performance 
> > > speedup of using the alloc fastpath a majority of the time for both 
> > > kmalloc-256 and kmalloc-2k when in the past it had only been able to serve 
> > > one or two allocs.  If no partial slab met the threshold, the slab_lock() 
> > > is held of the partial slab with the most free objects and returned 
> > > instead.
> > With the per-cpu partial list, I didn't see any workload which is still
> > suffering from the list lock, 
> 
> The merge error that you fixed in 3.2-rc1 for hackbench regression is
> due to add slub to node partial head. And data of hackbench show node
> partial is still heavy used in allocation. 
The patch is already in the base kernel; did you mean that even with it you
still see the list locking issue on the latest kernel?


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 1/3] slub: set a criteria for slub node partial adding
  2011-12-12  4:25                   ` Alex,Shi
@ 2011-12-12  4:48                     ` Shaohua Li
  -1 siblings, 0 replies; 78+ messages in thread
From: Shaohua Li @ 2011-12-12  4:48 UTC (permalink / raw)
  To: Alex,Shi
  Cc: David Rientjes, Christoph Lameter, penberg, linux-kernel,
	linux-mm, Andi Kleen

On Mon, 2011-12-12 at 12:25 +0800, Alex,Shi wrote:
> On Mon, 2011-12-12 at 12:35 +0800, Li, Shaohua wrote:
> > On Mon, 2011-12-12 at 12:14 +0800, Shi, Alex wrote:
> > > On Mon, 2011-12-12 at 10:43 +0800, Li, Shaohua wrote:
> > > > On Wed, 2011-12-07 at 15:28 +0800, David Rientjes wrote:
> > > > > On Wed, 7 Dec 2011, Shaohua Li wrote:
> > > > > 
> > > > > > interesting. I did similar experiment before (try to sort the page
> > > > > > according to free number), but it appears quite hard. The free number of
> > > > > > a page is dynamic, eg more slabs can be freed when the page is in
> > > > > > partial list. And in netperf test, the partial list could be very very
> > > > > > long. Can you post your patch, I definitely what to look at it.
> > > > > 
> > > > > It was over a couple of years ago and the slub code has changed 
> > > > > significantly since then, but you can see the general concept of the "slab 
> > > > > thrashing" problem with netperf and my solution back then:
> > > > > 
> > > > > 	http://marc.info/?l=linux-kernel&m=123839191416478
> > > > > 	http://marc.info/?l=linux-kernel&m=123839203016592
> > > > > 	http://marc.info/?l=linux-kernel&m=123839202916583
> > > > > 
> > > > > I also had a separate patchset that, instead of this approach, would just 
> > > > > iterate through the partial list in get_partial_node() looking for 
> > > > > anything where the number of free objects met a certain threshold, which 
> > > > > still defaulted to 25% and instantly picked it.  The overhead was taking 
> > > > > slab_lock() for each page, but that was nullified by the performance 
> > > > > speedup of using the alloc fastpath a majority of the time for both 
> > > > > kmalloc-256 and kmalloc-2k when in the past it had only been able to serve 
> > > > > one or two allocs.  If no partial slab met the threshold, the slab_lock() 
> > > > > is held of the partial slab with the most free objects and returned 
> > > > > instead.
> > > > With the per-cpu partial list, I didn't see any workload which is still
> > > > suffering from the list lock, 
> > > 
> > > The merge error that you fixed in 3.2-rc1 for hackbench regression is
> > > due to add slub to node partial head. And data of hackbench show node
> > > partial is still heavy used in allocation. 
> > The patch is already in base kernel, did you mean even with it you still
> > saw the list locking issue with latest kernel?
> > 
> 
> Yes, list_lock still hurt performance. It will be helpful if you can do
> some optimize for it. 
Please post the data and the workload. In my test, I didn't see the locking
take significant time with perf, and the slub stats you posted in your last
mail show that most allocations go through the fast path.


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 1/3] slub: set a criteria for slub node partial adding
  2011-12-12  2:43             ` Shaohua Li
@ 2011-12-12  6:09               ` Eric Dumazet
  -1 siblings, 0 replies; 78+ messages in thread
From: Eric Dumazet @ 2011-12-12  6:09 UTC (permalink / raw)
  To: Shaohua Li
  Cc: David Rientjes, Shi, Alex, Christoph Lameter, penberg,
	linux-kernel, linux-mm, Andi Kleen

On Monday 12 December 2011 at 10:43 +0800, Shaohua Li wrote:

> With the per-cpu partial list, I didn't see any workload which is still
> suffering from the list lock, so I suppose both the trashing approach
> and pick 25% used slab approach don't help. The per-cpu partial list
> flushes the whole per-cpu partial list after s->cpu_partial objects are
> freed, this is a little aggressive, because the per-cpu partial list
> need refilled again soon after an allocation. I had experiment to have
> separate per-cpu alloc/free partial list, which can avoid this. but
> again, I didn't see any workload still suffering list lock issue even
> with netperf which stress slub much. did you see such workload?

You can have such a workload using pktgen, to force stress on receiver,
and packet losses because of socket backlog limit reached.

Sender machine : pktgen (sending UDP packets to target)

Receiver :
 - SMP machine,
 - mono threaded application reading packets as fast as possible.
 - But softirq handler running on a separate cpu.

Result: only part of the allocated skbs (two slub allocations per skb) are
delivered to the socket; the others are freed by the softirq handler.
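
A mono-threaded reader of the kind described above can be as simple as the
sketch below, pinned with taskset to a cpu other than the one taking the
softirqs; the port number and buffer size are arbitrary choices for
illustration, not taken from this setup.

/* Minimal mono-threaded UDP sink; port and buffer size are arbitrary. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

int main(void)
{
	char buf[2048];
	struct sockaddr_in addr;
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	if (fd < 0) {
		perror("socket");
		return 1;
	}
	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_ANY);
	addr.sin_port = htons(12345);		/* arbitrary port */
	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		perror("bind");
		return 1;
	}
	for (;;)				/* drain as fast as possible */
		recv(fd, buf, sizeof(buf), 0);
}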


Using net-next tree (where some drivers use the new build_skb() service
right before packet being delivered to network stack)

CPU0: softirq handler
CPU4: application cpu

$ cd /sys/kernel/slab/:t-0002048 ; grep . *
aliases:0
align:8
grep: alloc_calls: Function not implemented
alloc_fastpath:8199220 C0=8196915 C1=306 C2=63 C3=297 C4=319 C5=550
C6=722 C7=48
alloc_from_partial:13931406 C0=13931401 C3=1 C5=4
alloc_node_mismatch:0
alloc_refill:70871657 C0=70871629 C1=2 C3=3 C4=9 C5=11 C6=3
alloc_slab:1335 C0=1216 C1=17 C2=2 C3=15 C4=17 C5=22 C6=44 C7=2
alloc_slowpath:155455299 C0=155455144 C1=18 C2=1 C3=21 C4=27 C5=40 C6=47
C7=1
cache_dma:0
cmpxchg_double_cpu_fail:0
cmpxchg_double_fail:27341 C0=12769 C4=14572
cpu_partial:6
cpu_partial_alloc:70650909 C0=70650899 C3=3 C4=2 C5=4 C6=1
cpu_partial_free:136279924 C0=71504388 C1=13 C2=1 C3=52 C4=64775461 C5=6
C6=2 C7=1
cpuslab_flush:0
cpu_slabs:29
deactivate_bypass:84583642 C0=84583515 C1=16 C2=1 C3=18 C4=18 C5=29
C6=44 C7=1
deactivate_empty:570 C0=80 C3=34 C4=456
deactivate_full:0
deactivate_remote_frees:0
deactivate_to_head:0
deactivate_to_tail:0
destroy_by_rcu:0
free_add_partial:0
grep: free_calls: Function not implemented
free_fastpath:89153 C0=88972 C1=34 C2=35 C3=27 C4=12 C5=23 C6=30 C7=20
free_frozen:5971363 C0=554097 C1=196 C2=14 C3=730 C4=5416278 C5=16 C6=19
C7=13
free_remove_partial:401 C1=1 C4=400
free_slab:971 C0=80 C1=1 C3=34 C4=856
free_slowpath:92913113 C0=21090357 C1=212 C2=15 C3=784 C4=71821691 C5=19
C6=21 C7=14
hwcache_align:0
min_partial:5
objects:1873
object_size:2048
objects_partial:945
objs_per_slab:16
order:3
order_fallback:0
partial:306
poison:0
reclaim_account:0
red_zone:0
reserved:0
sanity_checks:0
slabs:364
slabs_cpu_partial:21(21) C0=3(3) C1=6(6) C2=1(1) C3=7(7) C5=2(2) C6=1(1)
C7=1(1)
slab_size:2048
store_user:0
total_objects:5824
trace:0

$ cd /sys/kernel/slab/skbuff_head_cache ; grep . *
aliases:6
align:8
grep: alloc_calls: Function not implemented
alloc_fastpath:89181782 C0=89173048 C1=1599 C2=1357 C3=2140 C4=802 C5=675 C6=638 C7=1523
alloc_from_partial:412658 C0=412658
alloc_node_mismatch:0
alloc_refill:593417 C0=593189 C1=19 C2=15 C3=24 C4=51 C5=18 C6=17 C7=84
alloc_slab:2831313 C0=2831285 C1=2 C2=2 C3=2 C4=2 C5=12 C6=4 C7=4
alloc_slowpath:4430371 C0=4430112 C1=20 C2=17 C3=25 C4=57 C5=31 C6=21 C7=88
cache_dma:0
cmpxchg_double_cpu_fail:0
cmpxchg_double_fail:1 C0=1
cpu_partial:30
cpu_partial_alloc:592991 C0=592981 C2=1 C4=5 C5=2 C6=1 C7=1
cpu_partial_free:4429836 C0=592981 C1=25 C2=19 C3=23 C4=3836767 C5=6 C6=8 C7=7
cpuslab_flush:0
cpu_slabs:107
deactivate_bypass:3836954 C0=3836923 C1=1 C2=2 C3=1 C4=6 C5=13 C6=4 C7=4
deactivate_empty:2831168 C4=2831168
deactivate_full:0
deactivate_remote_frees:0
deactivate_to_head:0
deactivate_to_tail:0
destroy_by_rcu:0
free_add_partial:0
grep: free_calls: Function not implemented
free_fastpath:21192924 C0=21186268 C1=1420 C2=1204 C3=1966 C4=572 C5=349 C6=380 C7=765
free_frozen:67988498 C0=516 C1=121 C2=85 C3=841 C4=67986468 C5=215 C6=76 C7=176
free_remove_partial:18 C4=18
free_slab:2831186 C4=2831186
free_slowpath:71825749 C0=609 C1=146 C2=104 C3=864 C4=71823538 C5=221 C6=84 C7=183
hwcache_align:0
min_partial:5
objects:2494
object_size:192
objects_partial:121
objs_per_slab:21
order:0
order_fallback:0
partial:14
poison:0
reclaim_account:0
red_zone:0
reserved:0
sanity_checks:0
slabs:127
slabs_cpu_partial:99(99) C1=25(25) C2=18(18) C3=23(23) C4=16(16) C5=4(4) C6=7(7) C7=6(6)
slab_size:192
store_user:0
total_objects:2667
trace:0
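
As a rough cross-check of how little of this load the allocation fastpath
absorbs, the share can be computed directly from the counters above (a
throwaway calculation with the :t-0002048 numbers from this dump hard-coded):

#include <stdio.h>

int main(void)
{
	/* alloc_fastpath and alloc_slowpath for :t-0002048 above */
	double fast = 8199220.0;
	double slow = 155455299.0;

	printf("slowpath share: %.1f%%\n", 100.0 * slow / (fast + slow));
	return 0;	/* prints 95.0% for these counters */
}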



^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 1/3] slub: set a criteria for slub node partial adding
  2011-12-12  4:48                     ` Shaohua Li
@ 2011-12-12  6:17                       ` Alex,Shi
  -1 siblings, 0 replies; 78+ messages in thread
From: Alex,Shi @ 2011-12-12  6:17 UTC (permalink / raw)
  To: Li, Shaohua
  Cc: David Rientjes, Christoph Lameter, penberg, linux-kernel,
	linux-mm, Andi Kleen

> > > > > With the per-cpu partial list, I didn't see any workload which is still
> > > > > suffering from the list lock, 
> > > > 
> > > > The merge error that you fixed in 3.2-rc1 for hackbench regression is
> > > > due to add slub to node partial head. And data of hackbench show node
> > > > partial is still heavy used in allocation. 
> > > The patch is already in base kernel, did you mean even with it you still
> > > saw the list locking issue with latest kernel?
> > > 
> > 
> > Yes, list_lock still hurt performance. It will be helpful if you can do
> > some optimize for it. 
> please post data and the workload. In my test, I didn't see the locking
> takes significant time with perf. the slub stat you posted in last mail
> shows most allocation goes the fast path.
> 

The workload is 'hackbench 100 process 2000'.

How can it be proved that no time is spent contending on list_lock?

Does the following profile, together with the slub stats I posted earlier,
say anything about slub allocation?

113517 total                                      0.0164
 11065 unix_stream_recvmsg                        5.9425
 10862 copy_user_generic_string                 169.7188
  6011 __alloc_skb                               18.7844
  5857 __kmalloc_node_track_caller               14.3203
  5748 unix_stream_sendmsg                        5.4124
  4841 kmem_cache_alloc_node                     17.3513
  4494 skb_release_head_state                    34.0455
  3697 __slab_free                                5.3194


Actually, we know the lock contention is there, because any change to the
node partial list causes a clear performance change in the hackbench process
benchmark. But as for direct evidence: any debug instrumentation puts heavy
pressure on the lock itself and makes the data unreliable. So I'd like to
hear any suggestions for how to measure it cleanly.



^ permalink raw reply	[flat|nested] 78+ messages in thread

* RE: [PATCH 1/3] slub: set a criteria for slub node partial adding
  2011-12-09 10:10       ` David Rientjes
@ 2011-12-13 13:01         ` Shi, Alex
  -1 siblings, 0 replies; 78+ messages in thread
From: Shi, Alex @ 2011-12-13 13:01 UTC (permalink / raw)
  To: Shi, Alex, David Rientjes
  Cc: Christoph Lameter, penberg, linux-kernel, linux-mm, Eric Dumazet

> > > -       if (tail == DEACTIVATE_TO_TAIL)
> > > +       if (tail == DEACTIVATE_TO_TAIL ||
> > > +               page->inuse > page->objects /2)
> > >                 list_add_tail(&page->lru, &n->partial);
> > >         else
> > >                 list_add(&page->lru, &n->partial);
> > >

> > with the statistics patch above?  I typically run with CONFIG_SLUB_STATS
> > disabled since it impacts performance so heavily and I'm not sure what
> > information you're looking for with regards to those stats.
> 
> NO, when you collect data, please close SLUB_STAT in kernel config.  _to_head
> statistics collection patch just tell you, I collected the statistics not include
> add_partial in early_kmem_cache_node_alloc(). And other places of
> add_partial were covered. Of course, the kernel with statistic can not be used
> to measure performance.

David, did you have time to give it a try?
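
For reference, this is roughly what the variant quoted at the top of this
message looks like when folded into add_partial() (a sketch against the
3.2-era mm/slub.c, not a tested patch):

static inline void add_partial(struct kmem_cache_node *n,
				struct page *page, int tail)
{
	n->nr_partial++;
	/* slabs that are more than half used go to the tail */
	if (tail == DEACTIVATE_TO_TAIL || page->inuse > page->objects / 2)
		list_add_tail(&page->lru, &n->partial);
	else
		list_add(&page->lru, &n->partial);
}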

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 1/3] slub: set a criteria for slub node partial adding
  2011-12-12  2:43             ` Shaohua Li
@ 2011-12-14  1:29               ` David Rientjes
  -1 siblings, 0 replies; 78+ messages in thread
From: David Rientjes @ 2011-12-14  1:29 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Shi, Alex, Christoph Lameter, penberg, linux-kernel, linux-mm,
	Andi Kleen

On Mon, 12 Dec 2011, Shaohua Li wrote:

> With the per-cpu partial list, I didn't see any workload which is still
> suffering from the list lock, so I suppose both the trashing approach
> and pick 25% used slab approach don't help.

This doesn't necessarily have anything to do with contention on list_lock, 
it has to do with the fact that ~99% of allocations come from the slowpath 
since the cpu slab only has one free object when it is activated, that's 
what the statistics indicated for kmalloc-256 and kmalloc-2k.  That's what 
I called "slab thrashing": the continual deactivation of the cpu slab and 
picking from the partial list that would only have one or two free objects 
causing the vast majority of allocations to require the slowpath.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* RE: [PATCH 1/3] slub: set a criteria for slub node partial adding
  2011-12-09 13:40         ` Shi, Alex
@ 2011-12-14  1:38           ` David Rientjes
  -1 siblings, 0 replies; 78+ messages in thread
From: David Rientjes @ 2011-12-14  1:38 UTC (permalink / raw)
  To: Shi, Alex
  Cc: Christoph Lameter, penberg, linux-kernel, linux-mm, Eric Dumazet

On Fri, 9 Dec 2011, Shi, Alex wrote:

> Of course any testing may have result variation. But it is benchmark 
> accordingly, and there are lot technical to tuning your testing to make 
> its stand division acceptable, like to sync your system in a clear 
> status, to close unnecessary services, to use separate working disks for 
> your testing etc. etc. For this data, like on my SNB-EP machine, (the 
> following data is not stands for Intel, it is just my personal data). 

I always run benchmarks on freshly booted machines and disable all but the
most basic and required userspace in my testing environment, so I can assure
you that my comparison of slab and slub on netperf TCP_RR isn't skewed by
noise from userspace.

> 4 times result of hackbench on this patch are 5.59, 5.475, 5.47833, 
> 5.504

I haven't been running hackbench benchmarks, sorry.  I was always under 
the assumption that slub still was slightly better than slab with 
hackbench since that was used as justification for it becoming the default 
allocator and also because Christoph had patches merged recently which 
improved its performance on slub.  I've been speaking only about my 
history with netperf TCP_RR when using slub.

> > Not sure what you're asking me to test, you would like this:
> > 
> > 	{
> > 	        n->nr_partial++;
> > 	-       if (tail == DEACTIVATE_TO_TAIL)
> > 	-               list_add_tail(&page->lru, &n->partial);
> > 	-       else
> > 	-               list_add(&page->lru, &n->partial);
> > 	+       list_add_tail(&page->lru, &n->partial);
> > 	}
> > 
> > with the statistics patch above?  I typically run with CONFIG_SLUB_STATS
> > disabled since it impacts performance so heavily and I'm not sure what
> > information you're looking for with regards to those stats.
> 
> NO, when you collect data, please close SLUB_STAT in kernel config.  
> _to_head statistics collection patch just tell you, I collected the 
> statistics not include add_partial in early_kmem_cache_node_alloc(). And 
> other places of add_partial were covered. Of course, the kernel with 
> statistic can not be used to measure performance. 
> 

Ok, I'll benchmark netperf TCP_RR comparing Linus' latest -git both with
and without the above change.  It was confusing because you had three
diffs in your email, so I wasn't sure which one, or which combination, you
wanted me to try :)

^ permalink raw reply	[flat|nested] 78+ messages in thread

* RE: [PATCH 1/3] slub: set a criteria for slub node partial adding
  2011-12-14  1:38           ` David Rientjes
@ 2011-12-14  2:36             ` David Rientjes
  -1 siblings, 0 replies; 78+ messages in thread
From: David Rientjes @ 2011-12-14  2:36 UTC (permalink / raw)
  To: Shi, Alex
  Cc: Christoph Lameter, penberg, linux-kernel, linux-mm, Eric Dumazet

On Tue, 13 Dec 2011, David Rientjes wrote:

> > > 	{
> > > 	        n->nr_partial++;
> > > 	-       if (tail == DEACTIVATE_TO_TAIL)
> > > 	-               list_add_tail(&page->lru, &n->partial);
> > > 	-       else
> > > 	-               list_add(&page->lru, &n->partial);
> > > 	+       list_add_tail(&page->lru, &n->partial);
> > > 	}
> > > 

2 machines (one netserver, one netperf) both with 16 cores, 64GB memory 
with netperf-2.4.5 comparing Linus' -git with and without this patch:

	threads		SLUB		SLUB+patch
	 16		116614		117213 (+0.5%)
	 32		216436		215065 (-0.6%)
	 48		299991		299399 (-0.2%)
	 64		373753		374617 (+0.2%)
	 80		435688		435765 (UNCH)
	 96		494630		496590 (+0.4%)
	112		546766		546259 (-0.1%)

This suggests the difference is within the noise, so this patch neither 
helps nor hurts netperf on my setup, as expected.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 1/3] slub: set a criteria for slub node partial adding
  2011-12-14  2:43                 ` Shaohua Li
@ 2011-12-14  2:38                   ` David Rientjes
  -1 siblings, 0 replies; 78+ messages in thread
From: David Rientjes @ 2011-12-14  2:38 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Shi, Alex, Christoph Lameter, penberg, linux-kernel, linux-mm,
	Andi Kleen

On Wed, 14 Dec 2011, Shaohua Li wrote:

> if vast majority of allocation needs picking from partial list of node,
> the list_lock will have contention too. But I'd say avoiding the slab
> thrashing does increase fastpath.

Right, that's why my 2009 patchset would attempt to grab a partial slab
whose number of free objects met a certain threshold before falling back to
the others, and it improved performance somewhat.  This was with the
per-node partial lists, however, and the slowpath has been significantly
rewritten since then.
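
The 2009 series itself is not reproduced in this thread, but the selection
policy described above (take the first partial slab whose free-object count
meets a percentage threshold, otherwise fall back to the one with the most
free objects) might be shaped roughly like this:

/*
 * Untested sketch of the selection policy described above, using the
 * current field names; in the original series each candidate was
 * inspected under slab_lock() and the threshold defaulted to 25%.
 * list_lock must be held by the caller.
 */
static struct page *pick_partial(struct kmem_cache_node *n, int pct)
{
	struct page *page, *best = NULL;
	int best_avail = 0;

	list_for_each_entry(page, &n->partial, lru) {
		int avail = page->objects - page->inuse;

		if (avail * 100 >= page->objects * pct)
			return page;		/* meets the threshold */
		if (avail > best_avail) {
			best_avail = avail;
			best = page;
		}
	}
	return best;	/* else the slab with the most free objects */
}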

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 1/3] slub: set a criteria for slub node partial adding
  2011-12-14  1:29               ` David Rientjes
@ 2011-12-14  2:43                 ` Shaohua Li
  -1 siblings, 0 replies; 78+ messages in thread
From: Shaohua Li @ 2011-12-14  2:43 UTC (permalink / raw)
  To: David Rientjes
  Cc: Shi, Alex, Christoph Lameter, penberg, linux-kernel, linux-mm,
	Andi Kleen

On Wed, 2011-12-14 at 09:29 +0800, David Rientjes wrote:
> On Mon, 12 Dec 2011, Shaohua Li wrote:
> 
> > With the per-cpu partial list, I didn't see any workload which is still
> > suffering from the list lock, so I suppose both the trashing approach
> > and pick 25% used slab approach don't help.
> 
> This doesn't necessarily have anything to do with contention on list_lock, 
> it has to do with the fact that ~99% of allocations come from the slowpath 
> since the cpu slab only has one free object when it is activated, that's 
> what the statistics indicated for kmalloc-256 and kmalloc-2k.  That's what 
> I called "slab thrashing": the continual deactivation of the cpu slab and 
> picking from the partial list that would only have one or two free objects 
> causing the vast majority of allocations to require the slowpath.
If the vast majority of allocations need to pick from the node's partial
list, the list_lock will see contention too. But I'd say avoiding the slab
thrashing does increase fastpath usage; how much it can improve performance
I don't know. The slowpath (the case not involving list_lock, i.e. picking
from the per-cpu partial list) is already _very_ fast these days.


^ permalink raw reply	[flat|nested] 78+ messages in thread

* RE: [PATCH 1/3] slub: set a criteria for slub node partial adding
  2011-12-14  2:36             ` David Rientjes
@ 2011-12-14  6:06               ` Alex,Shi
  -1 siblings, 0 replies; 78+ messages in thread
From: Alex,Shi @ 2011-12-14  6:06 UTC (permalink / raw)
  To: David Rientjes
  Cc: Christoph Lameter, penberg, linux-kernel, linux-mm, Eric Dumazet

On Wed, 2011-12-14 at 10:36 +0800, David Rientjes wrote:
> On Tue, 13 Dec 2011, David Rientjes wrote:
> 
> > > > 	{
> > > > 	        n->nr_partial++;
> > > > 	-       if (tail == DEACTIVATE_TO_TAIL)
> > > > 	-               list_add_tail(&page->lru, &n->partial);
> > > > 	-       else
> > > > 	-               list_add(&page->lru, &n->partial);
> > > > 	+       list_add_tail(&page->lru, &n->partial);
> > > > 	}
> > > > 
> 
> 2 machines (one netserver, one netperf) both with 16 cores, 64GB memory 
> with netperf-2.4.5 comparing Linus' -git with and without this patch:
> 
> 	threads		SLUB		SLUB+patch
> 	 16		116614		117213 (+0.5%)
> 	 32		216436		215065 (-0.6%)
> 	 48		299991		299399 (-0.2%)
> 	 64		373753		374617 (+0.2%)
> 	 80		435688		435765 (UNCH)
> 	 96		494630		496590 (+0.4%)
> 	112		546766		546259 (-0.1%)
> 
> This suggests the difference is within the noise, so this patch neither 
> helps nor hurts netperf on my setup, as expected.

Thanks for the data. Netperf over a real network can hardly put enough
pressure on SLUB; as I mentioned before, I also didn't find a real
performance change in my loopback netperf testing.

I retested hackbench again: about a 1% performance increase still exists on
my 2-socket SNB/WSM and 4-socket NHM machines, and there is no performance
drop on the other machines.

Christoph, what comments would you like to offer on these results or on
this code change?




^ permalink raw reply	[flat|nested] 78+ messages in thread

* RE: [PATCH 1/3] slub: set a criteria for slub node partial adding
  2011-12-14  6:06               ` Alex,Shi
@ 2011-12-14  6:44                 ` Eric Dumazet
  -1 siblings, 0 replies; 78+ messages in thread
From: Eric Dumazet @ 2011-12-14  6:44 UTC (permalink / raw)
  To: Alex,Shi
  Cc: David Rientjes, Christoph Lameter, penberg, linux-kernel, linux-mm

On Wednesday 14 December 2011 at 14:06 +0800, Alex,Shi wrote:
> On Wed, 2011-12-14 at 10:36 +0800, David Rientjes wrote:
> > On Tue, 13 Dec 2011, David Rientjes wrote:
> > 
> > > > > 	{
> > > > > 	        n->nr_partial++;
> > > > > 	-       if (tail == DEACTIVATE_TO_TAIL)
> > > > > 	-               list_add_tail(&page->lru, &n->partial);
> > > > > 	-       else
> > > > > 	-               list_add(&page->lru, &n->partial);
> > > > > 	+       list_add_tail(&page->lru, &n->partial);
> > > > > 	}
> > > > > 
> > 
> > 2 machines (one netserver, one netperf) both with 16 cores, 64GB memory 
> > with netperf-2.4.5 comparing Linus' -git with and without this patch:
> > 
> > 	threads		SLUB		SLUB+patch
> > 	 16		116614		117213 (+0.5%)
> > 	 32		216436		215065 (-0.6%)
> > 	 48		299991		299399 (-0.2%)
> > 	 64		373753		374617 (+0.2%)
> > 	 80		435688		435765 (UNCH)
> > 	 96		494630		496590 (+0.4%)
> > 	112		546766		546259 (-0.1%)
> > 
> > This suggests the difference is within the noise, so this patch neither 
> > helps nor hurts netperf on my setup, as expected.
> 
> Thanks for the data. Real netperf is hard to give enough press on SLUB.
> but as I mentioned before, I also didn't find real performance change on
> my loopback netperf testing. 
> 
> I retested hackbench again. about 1% performance increase still exists
> on my 2 sockets SNB/WSM and 4 sockets NHM.  and no performance drop for
> other machines. 
> 
> Christoph, what's comments you like to offer for the results or for this
> code change? 

I believe a far more aggressive mechanism is needed to help these
workloads.

Please note that the COLD/HOT page concept is not used very much in the
kernel, because it's not really obvious that some decisions are always
good (or maybe this is just not well known).

We should try to batch things a bit, instead of doing a very small unit
of work in the slow path.

We now have a very fast fastpath, but an inefficient slow path.

SLAB has a little cache per cpu; we could add one to SLUB for freed
objects not belonging to the current slab. This could avoid all this
activate/deactivate overhead.
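
One possible shape of such a per-cpu cache, purely as an illustration of the
suggestion above; the names and batch size are made up, and preemption,
draining and all locking details are ignored:

#define FREE_CACHE_BATCH 16	/* made-up batch size */

struct slub_free_cache {
	void *objects[FREE_CACHE_BATCH];
	unsigned int nr;
};

static DEFINE_PER_CPU(struct slub_free_cache, slub_free_cache);

/*
 * Would be called from the free path when the object does not belong to
 * the current cpu slab; returns false when the batch is full and the
 * caller has to fall back to the existing __slab_free() slow path,
 * which is also where the batch would be drained.
 */
static bool cache_remote_free(struct kmem_cache *s, void *object)
{
	struct slub_free_cache *fc = this_cpu_ptr(&slub_free_cache);

	if (fc->nr < FREE_CACHE_BATCH) {
		fc->objects[fc->nr++] = object;
		return true;
	}
	return false;
}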




^ permalink raw reply	[flat|nested] 78+ messages in thread

* RE: [PATCH 1/3] slub: set a criteria for slub node partial adding
  2011-12-14  6:44                 ` Eric Dumazet
@ 2011-12-14  6:47                   ` Pekka Enberg
  -1 siblings, 0 replies; 78+ messages in thread
From: Pekka Enberg @ 2011-12-14  6:47 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alex,Shi, David Rientjes, Christoph Lameter, linux-kernel, linux-mm

On Wed, 14 Dec 2011, Eric Dumazet wrote:
> We should try to batch things a bit, instead of doing a very small unit
> of work in slow path.
>
> We now have a very fast fastpath, but inefficient slow path.
>
> SLAB has a litle cache per cpu, we could add one to SLUB for freed
> objects, not belonging to current slab. This could avoid all these
> activate/deactivate overhead.

Yeah, this is definitely worth looking at.

 			Pekka

^ permalink raw reply	[flat|nested] 78+ messages in thread

* RE: [PATCH 1/3] slub: set a criteria for slub node partial adding
  2011-12-14  6:44                 ` Eric Dumazet
@ 2011-12-14  6:56                   ` Alex,Shi
  -1 siblings, 0 replies; 78+ messages in thread
From: Alex,Shi @ 2011-12-14  6:56 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Rientjes, Christoph Lameter, penberg, linux-kernel, linux-mm


> > Thanks for the data. Real netperf is hard to give enough press on SLUB.
> > but as I mentioned before, I also didn't find real performance change on
> > my loopback netperf testing. 
> > 
> > I retested hackbench again. about 1% performance increase still exists
> > on my 2 sockets SNB/WSM and 4 sockets NHM.  and no performance drop for
> > other machines. 
> > 
> > Christoph, what's comments you like to offer for the results or for this
> > code change? 
> 
> I believe far more aggressive mechanism is needed to help these
> workloads.
> 
> Please note that the COLD/HOT page concept is not very well used in
> kernel, because its not really obvious that some decisions are always
> good (or maybe this is not well known)

I hope Christoph knows everything about SLUB. :)
> 
> We should try to batch things a bit, instead of doing a very small unit
> of work in slow path.
> 
> We now have a very fast fastpath, but inefficient slow path.
> 
> SLAB has a litle cache per cpu, we could add one to SLUB for freed
> objects, not belonging to current slab. This could avoid all these
> activate/deactivate overhead.

Maybe it's worth a try, or maybe Christoph has already studied this?



^ permalink raw reply	[flat|nested] 78+ messages in thread

* RE: [PATCH 1/3] slub: set a criteria for slub node partial adding
  2011-12-14  6:47                   ` Pekka Enberg
@ 2011-12-14 14:53                     ` Christoph Lameter
  -1 siblings, 0 replies; 78+ messages in thread
From: Christoph Lameter @ 2011-12-14 14:53 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Eric Dumazet, Alex,Shi, David Rientjes, linux-kernel, linux-mm

On Wed, 14 Dec 2011, Pekka Enberg wrote:

> On Wed, 14 Dec 2011, Eric Dumazet wrote:
> > We should try to batch things a bit, instead of doing a very small unit
> > of work in the slow path.
> >
> > We now have a very fast fastpath, but an inefficient slow path.
> >
> > SLAB has a little cache per cpu; we could add one to SLUB for freed
> > objects that do not belong to the current slab. This could avoid all this
> > activate/deactivate overhead.
>
> Yeah, this is definitely worth looking at.

We have been down this road repeatedly. Nick tried it, I tried it, and
neither attempt got us to something we liked. Please consult the archives.

There was a whole patch series I did last year introducing per-cpu caches,
which ended up in the "unified" patches. Please see the archives for the
various attempts.


^ permalink raw reply	[flat|nested] 78+ messages in thread

* RE: [PATCH 1/3] slub: set a criteria for slub node partial adding
  2011-12-14  6:56                   ` Alex,Shi
@ 2011-12-14 14:59                     ` Christoph Lameter
  -1 siblings, 0 replies; 78+ messages in thread
From: Christoph Lameter @ 2011-12-14 14:59 UTC (permalink / raw)
  To: Alex,Shi; +Cc: Eric Dumazet, David Rientjes, penberg, linux-kernel, linux-mm

On Wed, 14 Dec 2011, Alex,Shi wrote:

> > Please note that the COLD/HOT page concept is not used very much in the
> > kernel, because it's not really obvious that some decisions are always
> > good (or maybe this is just not well known).
>
> Hopefully Christoph knows everything about SLUB. :)

Well, yes, we have gone back and forth on hot/cold page handling repeatedly
in the page allocator as well. Caching is not always good. There are
particular loads that usually do very well with caching; others do not.
Caching can cause useless processing and pollute the caches. It is also a
source of OS noise, because cache maintenance happens at times that are
random (from the application's point of view) and unwanted.

> > We should try to batch things a bit, instead of doing a very small unit
> > of work in the slow path.
> >
> > We now have a very fast fastpath, but an inefficient slow path.
> >
> > SLAB has a little cache per cpu; we could add one to SLUB for freed
> > objects that do not belong to the current slab. This could avoid all this
> > activate/deactivate overhead.
>
> Maybe it's worth trying, or maybe Christoph has already studied this?

Many people have done patchsets like this. There are various permutations
of SL?B (I don't remember them all: SLEB, SLXB, SLQB, etc.) that have been
proposed over the years. Caches tend to grow and get rather numerous (see
SLAB), and the design of SLUB was meant to counter that. There is a reason
it was called SLUB: the U stands for Unqueued, and it was intended to avoid
the excessive caching problems that I ended up with when reworking SLAB for
NUMA support.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* RE: [PATCH 1/3] slub: set a criteria for slub node partial adding
  2011-12-14 14:59                     ` Christoph Lameter
@ 2011-12-14 17:33                       ` Eric Dumazet
  -1 siblings, 0 replies; 78+ messages in thread
From: Eric Dumazet @ 2011-12-14 17:33 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Alex,Shi, David Rientjes, penberg, linux-kernel, linux-mm

On Wednesday, 14 December 2011 at 08:59 -0600, Christoph Lameter wrote:

> Many people have done patchsets like this. 

Things have changed a lot recently. There is room for improvement.

At least we can exchange ideas _before_ coding a new patchset?

> There are various permutations
> of SL?B (I don't remember them all: SLEB, SLXB, SLQB, etc.) that have been
> proposed over the years. Caches tend to grow and get rather numerous (see
> SLAB), and the design of SLUB was meant to counter that. There is a reason
> it was called SLUB: the U stands for Unqueued, and it was intended to avoid
> the excessive caching problems that I ended up with when reworking SLAB for
> NUMA support.

The current 'one active slab' per cpu is a one-level cache.

It really is a _queue_ containing a fair amount of objects.

'Unqueued' in SLUB is marketing hype :=)

When we have one producer (say, a network interrupt handler) feeding
millions of network packets to N consumers (other cpus), each free goes
through the slow path. They all want to touch page->freelist, and they
slow down the producer as well because of false sharing.

Furthermore, when the producer hits socket queue limits, it mostly frees
skbs that were allocated in the 'not very recent past', and its own
freeing also hits the slow path (because the skb's memory blocks are no
longer in the current active slab). It competes with the frees done by
consumers as well.
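
(A simplified model of that contention, as a hedged illustration only: the
struct and function names below are invented for this sketch, and the real
__slab_free() path also updates the page counters and may take n->list_lock.
The point is just that every remote free ends up doing a compare-and-swap on
the same per-page word.)

#include <stdatomic.h>

struct page_like {
	_Atomic(void *) freelist;	/* head of this slab's free-object list */
	unsigned int inuse;		/* neighbouring field: false-sharing victim */
};

static void remote_free(struct page_like *page, void *object)
{
	void *old = atomic_load_explicit(&page->freelist, memory_order_relaxed);

	do {
		/* link the freed object in front of the current head */
		*(void **)object = old;
	} while (!atomic_compare_exchange_weak_explicit(&page->freelist, &old,
							object,
							memory_order_release,
							memory_order_relaxed));
}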

Adding a second _small_ per-cpu cache that queues X objects would help keep
the active slab around longer and more 'private' to a given cpu (its
'struct page' not touched too often by other cpus).

It would limit the number of cache line misses we currently take because of
conflicting accesses to page->freelist just to push one _single_ object
(and, to a lesser extent, to n->list_lock).

My initial idea would be to use a cache of 4 slots per cpu, but be able
to queue many objects per slot, if they all belong to the same slab/page.

In case we must make room in the cache (all slots occupied), we take one
slot and dequeue all of its objects in one round. No extra latency compared
to the current scheme.
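
A minimal userspace sketch of this slot-cache idea (hypothetical names and
sizes, not actual SLUB code; flush_slot_to_slab() stands in for whatever
would hand a whole batch back to its slab under the appropriate locking):

#include <stddef.h>

#define NR_FREE_SLOTS 4			/* hypothetical: 4 slots per cpu */

struct slab_page;			/* stand-in for struct page */

/* One slot batches freed objects that all belong to the same slab/page.
 * Objects are chained through their first word, SLUB-freelist style. */
struct free_slot {
	struct slab_page *page;		/* slab these objects came from */
	void *freelist;			/* singly linked freed objects */
	unsigned int count;		/* number of queued objects */
};

struct percpu_free_cache {
	struct free_slot slot[NR_FREE_SLOTS];
};

/* Stub: a real implementation would hand the whole batch back to its slab
 * in one round, touching page->freelist / n->list_lock once per batch. */
static void flush_slot_to_slab(struct slab_page *page, void *freelist,
			       unsigned int count)
{
	(void)page; (void)freelist; (void)count;
}

static void slot_push(struct free_slot *s, struct slab_page *page, void *object)
{
	*(void **)object = s->freelist;	/* chain object into the slot */
	s->freelist = object;
	s->page = page;
	s->count++;
}

/* Free one object into the per-cpu slot cache: reuse a slot that already
 * holds objects of the same page, else take an empty slot, else evict the
 * fullest slot in a single round to make room. */
static void cache_free(struct percpu_free_cache *c, struct slab_page *page,
		       void *object)
{
	struct free_slot *victim = &c->slot[0];
	int i;

	for (i = 0; i < NR_FREE_SLOTS; i++) {
		struct free_slot *s = &c->slot[i];

		if (s->count == 0 || s->page == page) {
			slot_push(s, page, object);
			return;
		}
		if (s->count > victim->count)	/* one possible eviction policy */
			victim = s;
	}

	flush_slot_to_slab(victim->page, victim->freelist, victim->count);
	victim->freelist = NULL;
	victim->count = 0;
	slot_push(victim, page, object);
}

With something along these lines, the common case touches only per-cpu data,
and page->freelist (or n->list_lock) is hit once per flushed batch instead of
once per object.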




^ permalink raw reply	[flat|nested] 78+ messages in thread

* RE: [PATCH 1/3] slub: set a criteria for slub node partial adding
  2011-12-14 17:33                       ` Eric Dumazet
@ 2011-12-14 18:26                         ` Christoph Lameter
  -1 siblings, 0 replies; 78+ messages in thread
From: Christoph Lameter @ 2011-12-14 18:26 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Alex,Shi, David Rientjes, penberg, linux-kernel, linux-mm

On Wed, 14 Dec 2011, Eric Dumazet wrote:

> > Many people have done patchsets like this.
>
> Things changed a lot recently. There is room for improvements.
>
> At least we can exchange ideas _before_ coding a new patchset ?

Sure, but I hope we do not simply rehash what has been said before and
recode what others have already coded in years past.

> > There are various permutations
> > of SL?B (I don't remember them all: SLEB, SLXB, SLQB, etc.) that have been
> > proposed over the years. Caches tend to grow and get rather numerous (see
> > SLAB), and the design of SLUB was meant to counter that. There is a reason
> > it was called SLUB: the U stands for Unqueued, and it was intended to avoid
> > the excessive caching problems that I ended up with when reworking SLAB for
> > NUMA support.
>
> The current 'one active slab' per cpu is a one-level cache.
>
> It really is a _queue_ containing a fair amount of objects.

In some sense you are right. It is a set of objects linked together. That
can be called a queue, and it has certain cache-hot effects. It is not a
queue in the SLAB sense, meaning a configurable array of pointers.
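
(To make the contrast concrete, a hedged sketch with simplified field names,
roughly corresponding to SLUB's struct kmem_cache_cpu and SLAB's struct
array_cache rather than copying their actual definitions:)

/* SLUB-style: the "queue" is the slab's own freelist; each free object's
 * first word points to the next free object, so no separate array exists. */
struct slab_page;

struct slub_cpu_slab {
	void *freelist;			/* linked free objects inside the page */
	struct slab_page *page;		/* the one active slab for this cpu */
};

/* SLAB-style: a configurable array of pointers to free objects that may
 * come from many different slabs; 'limit' is a per-cache tunable. */
struct slab_array_cache {
	unsigned int avail;		/* objects currently in the array */
	unsigned int limit;		/* configurable capacity */
	void *entry[];			/* object pointers, from any slab */
};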

> My initial idea would be to use a cache of 4 slots per cpu, but be able
> to queue many objects per slot, if they all belong to the same slab/page.

Nick did that. Please read up on his work. I think it was named SLQB.

^ permalink raw reply	[flat|nested] 78+ messages in thread

end of thread, other threads:[~2011-12-14 18:26 UTC | newest]

Thread overview: 78+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-12-02  8:23 [PATCH 1/3] slub: set a criteria for slub node partial adding Alex Shi
2011-12-02  8:23 ` Alex Shi
2011-12-02  8:23 ` [PATCH 2/3] slub: remove unnecessary statistics, deactivate_to_head/tail Alex Shi
2011-12-02  8:23   ` Alex Shi
2011-12-02  8:23   ` [PATCH 3/3] slub: fill per cpu partial only when free objects larger than one quarter Alex Shi
2011-12-02  8:23     ` Alex Shi
2011-12-02 14:44   ` [PATCH 2/3] slub: remove unnecessary statistics, deactivate_to_head/tail Christoph Lameter
2011-12-02 14:44     ` Christoph Lameter
2011-12-06 21:08   ` David Rientjes
2011-12-06 21:08     ` David Rientjes
2011-12-02 11:36 ` [PATCH 1/3] slub: set a criteria for slub node partial adding Eric Dumazet
2011-12-02 11:36   ` Eric Dumazet
2011-12-02 20:02   ` Christoph Lameter
2011-12-02 20:02     ` Christoph Lameter
2011-12-05  2:21     ` Shaohua Li
2011-12-05  2:21       ` Shaohua Li
2011-12-05 10:01     ` Alex,Shi
2011-12-05 10:01       ` Alex,Shi
2011-12-05  3:28   ` Alex,Shi
2011-12-05  3:28     ` Alex,Shi
2011-12-02 14:43 ` Christoph Lameter
2011-12-02 14:43   ` Christoph Lameter
2011-12-05  9:22   ` Alex,Shi
2011-12-05  9:22     ` Alex,Shi
2011-12-06 21:06     ` David Rientjes
2011-12-06 21:06       ` David Rientjes
2011-12-07  5:11       ` Shaohua Li
2011-12-07  5:11         ` Shaohua Li
2011-12-07  7:28         ` David Rientjes
2011-12-07  7:28           ` David Rientjes
2011-12-12  2:43           ` Shaohua Li
2011-12-12  2:43             ` Shaohua Li
2011-12-12  4:14             ` Alex,Shi
2011-12-12  4:14               ` Alex,Shi
2011-12-12  4:35               ` Shaohua Li
2011-12-12  4:35                 ` Shaohua Li
2011-12-12  4:25                 ` Alex,Shi
2011-12-12  4:25                   ` Alex,Shi
2011-12-12  4:48                   ` Shaohua Li
2011-12-12  4:48                     ` Shaohua Li
2011-12-12  6:17                     ` Alex,Shi
2011-12-12  6:17                       ` Alex,Shi
2011-12-12  6:09             ` Eric Dumazet
2011-12-12  6:09               ` Eric Dumazet
2011-12-14  1:29             ` David Rientjes
2011-12-14  1:29               ` David Rientjes
2011-12-14  2:43               ` Shaohua Li
2011-12-14  2:43                 ` Shaohua Li
2011-12-14  2:38                 ` David Rientjes
2011-12-14  2:38                   ` David Rientjes
2011-12-09  8:30   ` Alex,Shi
2011-12-09  8:30     ` Alex,Shi
2011-12-09 10:10     ` David Rientjes
2011-12-09 10:10       ` David Rientjes
2011-12-09 13:40       ` Shi, Alex
2011-12-09 13:40         ` Shi, Alex
2011-12-14  1:38         ` David Rientjes
2011-12-14  1:38           ` David Rientjes
2011-12-14  2:36           ` David Rientjes
2011-12-14  2:36             ` David Rientjes
2011-12-14  6:06             ` Alex,Shi
2011-12-14  6:06               ` Alex,Shi
2011-12-14  6:44               ` Eric Dumazet
2011-12-14  6:44                 ` Eric Dumazet
2011-12-14  6:47                 ` Pekka Enberg
2011-12-14  6:47                   ` Pekka Enberg
2011-12-14 14:53                   ` Christoph Lameter
2011-12-14 14:53                     ` Christoph Lameter
2011-12-14  6:56                 ` Alex,Shi
2011-12-14  6:56                   ` Alex,Shi
2011-12-14 14:59                   ` Christoph Lameter
2011-12-14 14:59                     ` Christoph Lameter
2011-12-14 17:33                     ` Eric Dumazet
2011-12-14 17:33                       ` Eric Dumazet
2011-12-14 18:26                       ` Christoph Lameter
2011-12-14 18:26                         ` Christoph Lameter
2011-12-13 13:01       ` Shi, Alex
2011-12-13 13:01         ` Shi, Alex
