* [PATCH 1/2] mm/slub: optimize alloc/free fastpath by removing preemption on/off
@ 2015-01-05  1:36 ` Joonsoo Kim
  0 siblings, 0 replies; 26+ messages in thread
From: Joonsoo Kim @ 2015-01-05  1:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	linux-mm, linux-kernel, Jesper Dangaard Brouer, rostedt,
	Thomas Gleixner

We had to insert a preempt enable/disable pair in the fastpath a while
ago in order to guarantee that tid and kmem_cache_cpu are retrieved on
the same cpu. This is only a problem for CONFIG_PREEMPT, where the
scheduler can move the process to another cpu while the data is being
retrieved.

Now, I have a solution that removes the preempt enable/disable from the
fastpath. If the tid matches kmem_cache_cpu's tid after tid and
kmem_cache_cpu have been retrieved by separate this_cpu operations, they
were retrieved on the same cpu. If they don't match, we just retry.

With this guarantee, the preemption enable/disable isn't needed at all,
even with CONFIG_PREEMPT, so this patch removes it.
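
To make the idea concrete, here is a condensed sketch of the whole
scheme. It is a paraphrase for illustration, not the exact mm/slub.c
code; get_freepointer() and next_tid() are the existing SLUB helpers and
the slowpath checks are omitted:

redo:
        do {
                /* 1. snapshot the per-cpu transaction id ... */
                tid = this_cpu_read(s->cpu_slab->tid);
                /* 2. ... then the per-cpu slab structure itself */
                c = this_cpu_ptr(s->cpu_slab);
                /*
                 * If we migrated between the two reads (only possible
                 * with CONFIG_PREEMPT), the ids disagree and we retry.
                 */
        } while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));
        barrier();      /* compiler barrier: keep the reads of c after tid */

        object = c->freelist;
        page = c->page;

        /*
         * tid changes on every alloc/free, so the final update succeeds
         * only if we are still on the cpu whose tid we snapshotted and no
         * other operation touched the freelist in between; otherwise the
         * whole sequence is redone.
         */
        if (unlikely(!this_cpu_cmpxchg_double(
                        s->cpu_slab->freelist, s->cpu_slab->tid,
                        object, tid,
                        get_freepointer(s, object), next_tid(tid))))
                goto redo;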

I saw roughly a 5% win in a fast-path loop over kmem_cache_alloc/free
with CONFIG_PREEMPT (14.821 ns -> 14.049 ns).

Below are the results of Christoph's slab_test, as reported by
Jesper Dangaard Brouer.

* Before

 Single thread testing
 =====================
 1. Kmalloc: Repeatedly allocate then free test
 10000 times kmalloc(8) -> 49 cycles kfree -> 62 cycles
 10000 times kmalloc(16) -> 48 cycles kfree -> 64 cycles
 10000 times kmalloc(32) -> 53 cycles kfree -> 70 cycles
 10000 times kmalloc(64) -> 64 cycles kfree -> 77 cycles
 10000 times kmalloc(128) -> 74 cycles kfree -> 84 cycles
 10000 times kmalloc(256) -> 84 cycles kfree -> 114 cycles
 10000 times kmalloc(512) -> 83 cycles kfree -> 116 cycles
 10000 times kmalloc(1024) -> 81 cycles kfree -> 120 cycles
 10000 times kmalloc(2048) -> 104 cycles kfree -> 136 cycles
 10000 times kmalloc(4096) -> 142 cycles kfree -> 165 cycles
 10000 times kmalloc(8192) -> 238 cycles kfree -> 226 cycles
 10000 times kmalloc(16384) -> 403 cycles kfree -> 264 cycles
 2. Kmalloc: alloc/free test
 10000 times kmalloc(8)/kfree -> 68 cycles
 10000 times kmalloc(16)/kfree -> 68 cycles
 10000 times kmalloc(32)/kfree -> 69 cycles
 10000 times kmalloc(64)/kfree -> 68 cycles
 10000 times kmalloc(128)/kfree -> 68 cycles
 10000 times kmalloc(256)/kfree -> 68 cycles
 10000 times kmalloc(512)/kfree -> 74 cycles
 10000 times kmalloc(1024)/kfree -> 75 cycles
 10000 times kmalloc(2048)/kfree -> 74 cycles
 10000 times kmalloc(4096)/kfree -> 74 cycles
 10000 times kmalloc(8192)/kfree -> 75 cycles
 10000 times kmalloc(16384)/kfree -> 510 cycles

* After

 Single thread testing
 =====================
 1. Kmalloc: Repeatedly allocate then free test
 10000 times kmalloc(8) -> 46 cycles kfree -> 61 cycles
 10000 times kmalloc(16) -> 46 cycles kfree -> 63 cycles
 10000 times kmalloc(32) -> 49 cycles kfree -> 69 cycles
 10000 times kmalloc(64) -> 57 cycles kfree -> 76 cycles
 10000 times kmalloc(128) -> 66 cycles kfree -> 83 cycles
 10000 times kmalloc(256) -> 84 cycles kfree -> 110 cycles
 10000 times kmalloc(512) -> 77 cycles kfree -> 114 cycles
 10000 times kmalloc(1024) -> 80 cycles kfree -> 116 cycles
 10000 times kmalloc(2048) -> 102 cycles kfree -> 131 cycles
 10000 times kmalloc(4096) -> 135 cycles kfree -> 163 cycles
 10000 times kmalloc(8192) -> 238 cycles kfree -> 218 cycles
 10000 times kmalloc(16384) -> 399 cycles kfree -> 262 cycles
 2. Kmalloc: alloc/free test
 10000 times kmalloc(8)/kfree -> 65 cycles
 10000 times kmalloc(16)/kfree -> 66 cycles
 10000 times kmalloc(32)/kfree -> 65 cycles
 10000 times kmalloc(64)/kfree -> 66 cycles
 10000 times kmalloc(128)/kfree -> 66 cycles
 10000 times kmalloc(256)/kfree -> 71 cycles
 10000 times kmalloc(512)/kfree -> 72 cycles
 10000 times kmalloc(1024)/kfree -> 71 cycles
 10000 times kmalloc(2048)/kfree -> 71 cycles
 10000 times kmalloc(4096)/kfree -> 71 cycles
 10000 times kmalloc(8192)/kfree -> 65 cycles
 10000 times kmalloc(16384)/kfree -> 511 cycles

Most of the results are better than before.

Note that this change slightly worsens performance with !CONFIG_PREEMPT,
by roughly 0.3%. Implementing each case separately would help
performance, but since the difference is so marginal I didn't do that.
Keeping the same code for all cases also helps maintenance.
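
For reference, the per-configuration split that was considered and
rejected would look roughly like this (hypothetical, not part of this
patch):

#ifdef CONFIG_PREEMPT
        /* Migration is possible: pin tid and c together via the retry loop. */
        do {
                tid = this_cpu_read(s->cpu_slab->tid);
                c = this_cpu_ptr(s->cpu_slab);
        } while (unlikely(tid != c->tid));
#else
        /* No involuntary migration here; a single pass is enough. */
        c = this_cpu_ptr(s->cpu_slab);
        tid = c->tid;
#endif
        barrier();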

Tested-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
---
 mm/slub.c |   26 +++++++++++++-------------
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index fe376fe..0624608 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2398,13 +2398,15 @@ redo:
 	 * reading from one cpu area. That does not matter as long
 	 * as we end up on the original cpu again when doing the cmpxchg.
 	 *
-	 * Preemption is disabled for the retrieval of the tid because that
-	 * must occur from the current processor. We cannot allow rescheduling
-	 * on a different processor between the determination of the pointer
-	 * and the retrieval of the tid.
+	 * We should guarantee that tid and kmem_cache are retrieved on
+	 * the same cpu. It could be different if CONFIG_PREEMPT so we need
+	 * to check if it is matched or not.
 	 */
-	preempt_disable();
-	c = this_cpu_ptr(s->cpu_slab);
+	do {
+		tid = this_cpu_read(s->cpu_slab->tid);
+		c = this_cpu_ptr(s->cpu_slab);
+	} while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));
+	barrier();
 
 	/*
 	 * The transaction ids are globally unique per cpu and per operation on
@@ -2412,8 +2414,6 @@ redo:
 	 * occurs on the right processor and that there was no operation on the
 	 * linked list in between.
 	 */
-	tid = c->tid;
-	preempt_enable();
 
 	object = c->freelist;
 	page = c->page;
@@ -2659,11 +2659,11 @@ redo:
 	 * data is retrieved via this pointer. If we are on the same cpu
 	 * during the cmpxchg then the free will succedd.
 	 */
-	preempt_disable();
-	c = this_cpu_ptr(s->cpu_slab);
-
-	tid = c->tid;
-	preempt_enable();
+	do {
+		tid = this_cpu_read(s->cpu_slab->tid);
+		c = this_cpu_ptr(s->cpu_slab);
+	} while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));
+	barrier();
 
 	if (likely(page == c->page)) {
 		set_freepointer(s, object, c->freelist);
-- 
1.7.9.5


* [PATCH 2/2] mm: don't use compound_head() in virt_to_head_page()
  2015-01-05  1:36 ` Joonsoo Kim
@ 2015-01-05  1:36   ` Joonsoo Kim
  -1 siblings, 0 replies; 26+ messages in thread
From: Joonsoo Kim @ 2015-01-05  1:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	linux-mm, linux-kernel, Jesper Dangaard Brouer, rostedt,
	Thomas Gleixner

compound_head() is implemented on the assumption that there can be a
race condition when checking the tail flag. This assumption only holds
when we access an arbitrarily positioned struct page.

The situation in which virt_to_head_page() is called is different.
We call virt_to_head_page() only within the range of allocated pages,
so there is no race condition on the tail flag. In this case, we don't
need to handle the race and can reduce the overhead slightly.
This patch implements compound_head_fast(), which is similar to
compound_head() except for the tail-flag race handling, and then makes
virt_to_head_page() use this optimized function to improve performance.
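
For contrast, the race-tolerant lookup that compound_head() has to do
looks roughly like this (a paraphrase for illustration, not the verbatim
include/linux/mm.h code):

static inline struct page *compound_head(struct page *page)
{
        if (unlikely(PageTail(page))) {
                struct page *head = page->first_page;

                /*
                 * Re-check the tail flag after reading first_page: the
                 * compound page could have been split under us, in which
                 * case first_page may already be stale.
                 */
                smp_rmb();
                if (likely(PageTail(page)))
                        return head;
        }
        return page;
}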

I saw a 1.8% win in a fast-path loop over kmem_cache_alloc/free
(14.063 ns -> 13.810 ns) when the target object is on a tail page.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
---
 include/linux/mm.h |   10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f80d019..0460e2e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -453,6 +453,13 @@ static inline struct page *compound_head(struct page *page)
 	return page;
 }
 
+static inline struct page *compound_head_fast(struct page *page)
+{
+	if (unlikely(PageTail(page)))
+		return page->first_page;
+	return page;
+}
+
 /*
  * The atomic page->_mapcount, starts from -1: so that transitions
  * both from it and to it can be tracked, using atomic_inc_and_test
@@ -531,7 +538,8 @@ static inline void get_page(struct page *page)
 static inline struct page *virt_to_head_page(const void *x)
 {
 	struct page *page = virt_to_page(x);
-	return compound_head(page);
+
+	return compound_head_fast(page);
 }
 
 /*
-- 
1.7.9.5


* Re: [PATCH 1/2] mm/slub: optimize alloc/free fastpath by removing preemption on/off
  2015-01-05  1:36 ` Joonsoo Kim
@ 2015-01-05 14:52   ` Christoph Lameter
  -1 siblings, 0 replies; 26+ messages in thread
From: Christoph Lameter @ 2015-01-05 14:52 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, Pekka Enberg, David Rientjes, linux-mm,
	linux-kernel, Jesper Dangaard Brouer, rostedt, Thomas Gleixner

On Mon, 5 Jan 2015, Joonsoo Kim wrote:

> Now, I have a solution that removes the preempt enable/disable from the
> fastpath. If the tid matches kmem_cache_cpu's tid after tid and
> kmem_cache_cpu have been retrieved by separate this_cpu operations, they
> were retrieved on the same cpu. If they don't match, we just retry.

Acked-by: Christoph Lameter <cl@linux.com>

* Re: [PATCH 2/2] mm: don't use compound_head() in virt_to_head_page()
  2015-01-05  1:36   ` Joonsoo Kim
@ 2015-01-05 14:53     ` Christoph Lameter
  -1 siblings, 0 replies; 26+ messages in thread
From: Christoph Lameter @ 2015-01-05 14:53 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, Pekka Enberg, David Rientjes, linux-mm,
	linux-kernel, Jesper Dangaard Brouer, rostedt, Thomas Gleixner

On Mon, 5 Jan 2015, Joonsoo Kim wrote:

> This patch implements compound_head_fast(), which is similar to
> compound_head() except for the tail-flag race handling, and then makes
> virt_to_head_page() use this optimized function to improve performance.

Yeah, that is how it was before and how it should be.

Acked-by: Christoph Lameter <cl@linux.com>

* Re: [PATCH 1/2] mm/slub: optimize alloc/free fastpath by removing preemption on/off
  2015-01-05  1:36 ` Joonsoo Kim
@ 2015-01-06  3:03   ` Davidlohr Bueso
  -1 siblings, 0 replies; 26+ messages in thread
From: Davidlohr Bueso @ 2015-01-06  3:03 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, Christoph Lameter, Pekka Enberg, David Rientjes,
	linux-mm, linux-kernel, Jesper Dangaard Brouer, rostedt,
	Thomas Gleixner

On Mon, 2015-01-05 at 10:36 +0900, Joonsoo Kim wrote:
> -	preempt_disable();
> -	c = this_cpu_ptr(s->cpu_slab);
> +	do {
> +		tid = this_cpu_read(s->cpu_slab->tid);
> +		c = this_cpu_ptr(s->cpu_slab);
> +	} while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));
> +	barrier();

I don't see the compiler reordering the object/page stores below, since c
is updated in the loop anyway. Is this really necessary (same goes for
slab_free)? The generated code by gcc 4.8 looks correct without it.
Additionally, the implied barriers for preemption control aren't really
the same semantics used here (if that is actually the reason why you are
using them).

Thanks,
Davidlohr


* Re: [PATCH 1/2] mm/slub: optimize alloc/free fastpath by removing preemption on/off
  2015-01-06  3:03   ` Davidlohr Bueso
@ 2015-01-06  8:09     ` Joonsoo Kim
  -1 siblings, 0 replies; 26+ messages in thread
From: Joonsoo Kim @ 2015-01-06  8:09 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Andrew Morton, Christoph Lameter, Pekka Enberg, David Rientjes,
	linux-mm, linux-kernel, Jesper Dangaard Brouer, rostedt,
	Thomas Gleixner

On Mon, Jan 05, 2015 at 07:03:12PM -0800, Davidlohr Bueso wrote:
> On Mon, 2015-01-05 at 10:36 +0900, Joonsoo Kim wrote:
> > -	preempt_disable();
> > -	c = this_cpu_ptr(s->cpu_slab);
> > +	do {
> > +		tid = this_cpu_read(s->cpu_slab->tid);
> > +		c = this_cpu_ptr(s->cpu_slab);
> > +	} while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));
> > +	barrier();
> 
> I don't see the compiler reordering the object/page stores below, since c
> is updated in the loop anyway. Is this really necessary (same goes for
> slab_free)? The generated code by gcc 4.8 looks correct without it.
> Additionally, the implied barriers for preemption control aren't really
> the same semantics used here (if that is actually the reason why you are
> using them).

Hello,

I'd like to use tid as a pivot, so it should be fetched before anything
is fetched from c. Is that impossible without barrier(), even with
!CONFIG_PREEMPT?

Thanks.

* Re: [PATCH 1/2] mm/slub: optimize alloc/free fastpath by removing preemption on/off
  2015-01-06  8:09     ` Joonsoo Kim
@ 2015-01-06 17:02       ` Davidlohr Bueso
  -1 siblings, 0 replies; 26+ messages in thread
From: Davidlohr Bueso @ 2015-01-06 17:02 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, Christoph Lameter, Pekka Enberg, David Rientjes,
	linux-mm, linux-kernel, Jesper Dangaard Brouer, rostedt,
	Thomas Gleixner

On Tue, 2015-01-06 at 17:09 +0900, Joonsoo Kim wrote:
> On Mon, Jan 05, 2015 at 07:03:12PM -0800, Davidlohr Bueso wrote:
> > On Mon, 2015-01-05 at 10:36 +0900, Joonsoo Kim wrote:
> > > -	preempt_disable();
> > > -	c = this_cpu_ptr(s->cpu_slab);
> > > +	do {
> > > +		tid = this_cpu_read(s->cpu_slab->tid);
> > > +		c = this_cpu_ptr(s->cpu_slab);
> > > +	} while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));
> > > +	barrier();
> > 
> > I don't see the compiler reordering the object/page stores below, since c
> > is updated in the loop anyway. Is this really necessary (same goes for
> > slab_free)? The generated code by gcc 4.8 looks correct without it.
> > Additionally, the implied barriers for preemption control aren't really
> > the same semantics used here (if that is actually the reason why you are
> > using them).
> 
> Hello,
> 
> I'd like to use tid as a pivot, so it should be fetched before anything
> is fetched from c. Is that impossible without barrier(), even with
> !CONFIG_PREEMPT?

You'd need a smp_wmb() in between tid and c in the loop then, which
looks quite unpleasant. All in all disabling preemption isn't really
that expensive, and you should redo your performance number if you go
this way.


* Re: [PATCH 1/2] mm/slub: optimize alloc/free fastpath by removing preemption on/off
  2015-01-06 17:02       ` Davidlohr Bueso
@ 2015-01-08  7:44         ` Joonsoo Kim
  -1 siblings, 0 replies; 26+ messages in thread
From: Joonsoo Kim @ 2015-01-08  7:44 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Andrew Morton, Christoph Lameter, Pekka Enberg, David Rientjes,
	linux-mm, linux-kernel, Jesper Dangaard Brouer, rostedt,
	Thomas Gleixner

On Tue, Jan 06, 2015 at 09:02:17AM -0800, Davidlohr Bueso wrote:
> On Tue, 2015-01-06 at 17:09 +0900, Joonsoo Kim wrote:
> > On Mon, Jan 05, 2015 at 07:03:12PM -0800, Davidlohr Bueso wrote:
> > > On Mon, 2015-01-05 at 10:36 +0900, Joonsoo Kim wrote:
> > > > -	preempt_disable();
> > > > -	c = this_cpu_ptr(s->cpu_slab);
> > > > +	do {
> > > > +		tid = this_cpu_read(s->cpu_slab->tid);
> > > > +		c = this_cpu_ptr(s->cpu_slab);
> > > > +	} while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));
> > > > +	barrier();
> > > 
> > > I don't see the compiler reordering the object/page stores below, since c
> > > is updated in the loop anyway. Is this really necessary (same goes for
> > > slab_free)? The generated code by gcc 4.8 looks correct without it.
> > > Additionally, the implied barriers for preemption control aren't really
> > > the same semantics used here (if that is actually the reason why you are
> > > using them).
> > 
> > Hello,
> > 
> > I'd like to use tid as a pivot, so it should be fetched before anything
> > is fetched from c. Is that impossible without barrier(), even with
> > !CONFIG_PREEMPT?
> 
> You'd need a smp_wmb() in between tid and c in the loop then, which
> looks quite unpleasant. All in all disabling preemption isn't really
> that expensive, and you should redo your performance number if you go
> this way.

This barrier() is not for read/write synchronization between cpus.
In the successful case, all read/write operations on cpu_slab happen on
the correct cpu. What I need to guarantee here is that the fetches are
not reordered, for the correctness of the algorithm. In this case,
barrier() seems enough to me. Am I wrong?
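
Concretely, the clearest case is the !CONFIG_PREEMPT build, where the
while-condition is constant-false and the "loop" collapses to a single
pass; a sketch of the hazard:

        tid = this_cpu_read(s->cpu_slab->tid);
        c = this_cpu_ptr(s->cpu_slab);
        /*
         * c does not depend on tid, so without barrier() the compiler may
         * load c->freelist/c->page before tid.  barrier() is compiler-only
         * ordering; no cpu memory fence is needed for this.
         */
        barrier();

        object = c->freelist;
        page = c->page;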

Thanks.

* Re: [PATCH 1/2] mm/slub: optimize alloc/free fastpath by removing preemption on/off
  2015-01-08  7:44         ` Joonsoo Kim
@ 2015-01-09  3:34           ` Christoph Lameter
  -1 siblings, 0 replies; 26+ messages in thread
From: Christoph Lameter @ 2015-01-09  3:34 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Davidlohr Bueso, Andrew Morton, Pekka Enberg, David Rientjes,
	linux-mm, linux-kernel, Jesper Dangaard Brouer, rostedt,
	Thomas Gleixner

On Thu, 8 Jan 2015, Joonsoo Kim wrote:

> > You'd need a smp_wmb() in between tid and c in the loop then, which
> > looks quite unpleasant. All in all disabling preemption isn't really
> > that expensive, and you should redo your performance number if you go
> > this way.
>
> This barrier() is not for read/write synchronization between cpus.
> In the successful case, all read/write operations on cpu_slab happen on
> the correct cpu. What I need to guarantee here is that the fetches are
> not reordered, for the correctness of the algorithm. In this case,
> barrier() seems enough to me. Am I wrong?

You are right.


* Re: [PATCH 1/2] mm/slub: optimize alloc/free fastpath by removing preemption on/off
  2015-01-06  2:25     ` Steven Rostedt
@ 2015-01-06  8:27       ` Joonsoo Kim
  -1 siblings, 0 replies; 26+ messages in thread
From: Joonsoo Kim @ 2015-01-06  8:27 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Hillf Danton, Andrew Morton, 'Christoph Lameter',
	'Pekka Enberg', 'David Rientjes',
	linux-kernel, linux-mm, 'Jesper Dangaard Brouer'

On Mon, Jan 05, 2015 at 09:25:02PM -0500, Steven Rostedt wrote:
> On Tue, 6 Jan 2015 10:32:47 +0900
> Joonsoo Kim <iamjoonsoo.kim@lge.com> wrote:
> 
> 
> > > > +++ b/mm/slub.c
> > > > @@ -2398,13 +2398,15 @@ redo:
> > > >  	 * reading from one cpu area. That does not matter as long
> > > >  	 * as we end up on the original cpu again when doing the cmpxchg.
> > > >  	 *
> > > > -	 * Preemption is disabled for the retrieval of the tid because that
> > > > -	 * must occur from the current processor. We cannot allow rescheduling
> > > > -	 * on a different processor between the determination of the pointer
> > > > -	 * and the retrieval of the tid.
> > > > +	 * We should guarantee that tid and kmem_cache are retrieved on
> > > > +	 * the same cpu. It could be different if CONFIG_PREEMPT so we need
> > > > +	 * to check if it is matched or not.
> > > >  	 */
> > > > -	preempt_disable();
> > > > -	c = this_cpu_ptr(s->cpu_slab);
> > > > +	do {
> > > > +		tid = this_cpu_read(s->cpu_slab->tid);
> > > > +		c = this_cpu_ptr(s->cpu_slab);
> > > > +	} while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));
> > > > +	barrier();
> > > 
> > > Help maintenance more if barrier is documented in commit message.
> > 
> > Hello,
> > 
> > Okay. Will add some information about this barrier in commit message.
> 
> A comment in the commit message is useless. Adding a small comment
> above the barrier() call itself would be much more useful.

Okay. Will do.

Thanks.


* Re: [PATCH 1/2] mm/slub: optimize alloc/free fastpath by removing preemption on/off
  2015-01-06  1:32   ` Joonsoo Kim
@ 2015-01-06  2:25     ` Steven Rostedt
  -1 siblings, 0 replies; 26+ messages in thread
From: Steven Rostedt @ 2015-01-06  2:25 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Hillf Danton, Andrew Morton, 'Christoph Lameter',
	'Pekka Enberg', 'David Rientjes',
	linux-kernel, linux-mm, 'Jesper Dangaard Brouer'

On Tue, 6 Jan 2015 10:32:47 +0900
Joonsoo Kim <iamjoonsoo.kim@lge.com> wrote:


> > > +++ b/mm/slub.c
> > > @@ -2398,13 +2398,15 @@ redo:
> > >  	 * reading from one cpu area. That does not matter as long
> > >  	 * as we end up on the original cpu again when doing the cmpxchg.
> > >  	 *
> > > -	 * Preemption is disabled for the retrieval of the tid because that
> > > -	 * must occur from the current processor. We cannot allow rescheduling
> > > -	 * on a different processor between the determination of the pointer
> > > -	 * and the retrieval of the tid.
> > > +	 * We should guarantee that tid and kmem_cache are retrieved on
> > > +	 * the same cpu. It could be different if CONFIG_PREEMPT so we need
> > > +	 * to check if it is matched or not.
> > >  	 */
> > > -	preempt_disable();
> > > -	c = this_cpu_ptr(s->cpu_slab);
> > > +	do {
> > > +		tid = this_cpu_read(s->cpu_slab->tid);
> > > +		c = this_cpu_ptr(s->cpu_slab);
> > > +	} while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));
> > > +	barrier();
> > 
> > Help maintenance more if barrier is documented in commit message.
> 
> Hello,
> 
> Okay. Will add some information about this barrier in commit message.

A comment in the commit message is useless. Adding a small comment
above the barrier() call itself would be much more useful.
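
For instance, something along these lines above the call (illustrative
wording only, not necessarily what gets merged):

        do {
                tid = this_cpu_read(s->cpu_slab->tid);
                c = this_cpu_ptr(s->cpu_slab);
        } while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));

        /*
         * tid must be fetched before any other field of the per-cpu slab:
         * the cmpxchg at the end of the fastpath compares against this tid,
         * so freelist/page values read after it are either consistent with
         * it or the cmpxchg fails and the operation is retried.  barrier()
         * only constrains the compiler; no cross-cpu ordering is required.
         */
        barrier();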

-- Steve

* Re: [PATCH 1/2] mm/slub: optimize alloc/free fastpath by removing preemption on/off
  2015-01-05  8:37 ` Hillf Danton
@ 2015-01-06  1:32   ` Joonsoo Kim
  -1 siblings, 0 replies; 26+ messages in thread
From: Joonsoo Kim @ 2015-01-06  1:32 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Andrew Morton, 'Christoph Lameter',
	'Pekka Enberg', 'David Rientjes',
	linux-kernel, linux-mm, Steven Rostedt,
	'Jesper Dangaard Brouer'

On Mon, Jan 05, 2015 at 04:37:35PM +0800, Hillf Danton wrote:
> > 
> > We had to insert a preempt enable/disable pair in the fastpath a while
> > ago in order to guarantee that tid and kmem_cache_cpu are retrieved on
> > the same cpu. This is only a problem for CONFIG_PREEMPT, where the
> > scheduler can move the process to another cpu while the data is being
> > retrieved.
> > 
> > Now, I have a solution that removes the preempt enable/disable from the
> > fastpath. If the tid matches kmem_cache_cpu's tid after tid and
> > kmem_cache_cpu have been retrieved by separate this_cpu operations, they
> > were retrieved on the same cpu. If they don't match, we just retry.
> > 
> > With this guarantee, the preemption enable/disable isn't needed at all,
> > even with CONFIG_PREEMPT, so this patch removes it.
> > 
> > I saw roughly a 5% win in a fast-path loop over kmem_cache_alloc/free
> > with CONFIG_PREEMPT (14.821 ns -> 14.049 ns).
> > 
> > Below are the results of Christoph's slab_test, as reported by
> > Jesper Dangaard Brouer.
> > 
> > * Before
> > 
> >  Single thread testing
> >  =====================
> >  1. Kmalloc: Repeatedly allocate then free test
> >  10000 times kmalloc(8) -> 49 cycles kfree -> 62 cycles
> >  10000 times kmalloc(16) -> 48 cycles kfree -> 64 cycles
> >  10000 times kmalloc(32) -> 53 cycles kfree -> 70 cycles
> >  10000 times kmalloc(64) -> 64 cycles kfree -> 77 cycles
> >  10000 times kmalloc(128) -> 74 cycles kfree -> 84 cycles
> >  10000 times kmalloc(256) -> 84 cycles kfree -> 114 cycles
> >  10000 times kmalloc(512) -> 83 cycles kfree -> 116 cycles
> >  10000 times kmalloc(1024) -> 81 cycles kfree -> 120 cycles
> >  10000 times kmalloc(2048) -> 104 cycles kfree -> 136 cycles
> >  10000 times kmalloc(4096) -> 142 cycles kfree -> 165 cycles
> >  10000 times kmalloc(8192) -> 238 cycles kfree -> 226 cycles
> >  10000 times kmalloc(16384) -> 403 cycles kfree -> 264 cycles
> >  2. Kmalloc: alloc/free test
> >  10000 times kmalloc(8)/kfree -> 68 cycles
> >  10000 times kmalloc(16)/kfree -> 68 cycles
> >  10000 times kmalloc(32)/kfree -> 69 cycles
> >  10000 times kmalloc(64)/kfree -> 68 cycles
> >  10000 times kmalloc(128)/kfree -> 68 cycles
> >  10000 times kmalloc(256)/kfree -> 68 cycles
> >  10000 times kmalloc(512)/kfree -> 74 cycles
> >  10000 times kmalloc(1024)/kfree -> 75 cycles
> >  10000 times kmalloc(2048)/kfree -> 74 cycles
> >  10000 times kmalloc(4096)/kfree -> 74 cycles
> >  10000 times kmalloc(8192)/kfree -> 75 cycles
> >  10000 times kmalloc(16384)/kfree -> 510 cycles
> > 
> > * After
> > 
> >  Single thread testing
> >  =====================
> >  1. Kmalloc: Repeatedly allocate then free test
> >  10000 times kmalloc(8) -> 46 cycles kfree -> 61 cycles
> >  10000 times kmalloc(16) -> 46 cycles kfree -> 63 cycles
> >  10000 times kmalloc(32) -> 49 cycles kfree -> 69 cycles
> >  10000 times kmalloc(64) -> 57 cycles kfree -> 76 cycles
> >  10000 times kmalloc(128) -> 66 cycles kfree -> 83 cycles
> >  10000 times kmalloc(256) -> 84 cycles kfree -> 110 cycles
> >  10000 times kmalloc(512) -> 77 cycles kfree -> 114 cycles
> >  10000 times kmalloc(1024) -> 80 cycles kfree -> 116 cycles
> >  10000 times kmalloc(2048) -> 102 cycles kfree -> 131 cycles
> >  10000 times kmalloc(4096) -> 135 cycles kfree -> 163 cycles
> >  10000 times kmalloc(8192) -> 238 cycles kfree -> 218 cycles
> >  10000 times kmalloc(16384) -> 399 cycles kfree -> 262 cycles
> >  2. Kmalloc: alloc/free test
> >  10000 times kmalloc(8)/kfree -> 65 cycles
> >  10000 times kmalloc(16)/kfree -> 66 cycles
> >  10000 times kmalloc(32)/kfree -> 65 cycles
> >  10000 times kmalloc(64)/kfree -> 66 cycles
> >  10000 times kmalloc(128)/kfree -> 66 cycles
> >  10000 times kmalloc(256)/kfree -> 71 cycles
> >  10000 times kmalloc(512)/kfree -> 72 cycles
> >  10000 times kmalloc(1024)/kfree -> 71 cycles
> >  10000 times kmalloc(2048)/kfree -> 71 cycles
> >  10000 times kmalloc(4096)/kfree -> 71 cycles
> >  10000 times kmalloc(8192)/kfree -> 65 cycles
> >  10000 times kmalloc(16384)/kfree -> 511 cycles
> > 
> > Most of the results are better than before.
> > 
> > Note that this change slightly worsens performance with !CONFIG_PREEMPT,
> > by roughly 0.3%. Implementing each case separately would help
> > performance, but since the difference is so marginal I didn't do that.
> > Keeping the same code for all cases also helps maintenance.
> > 
> > Tested-by: Jesper Dangaard Brouer <brouer@redhat.com>
> > Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> > ---
> >  mm/slub.c |   26 +++++++++++++-------------
> >  1 file changed, 13 insertions(+), 13 deletions(-)
> > 
> > diff --git a/mm/slub.c b/mm/slub.c
> > index fe376fe..0624608 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -2398,13 +2398,15 @@ redo:
> >  	 * reading from one cpu area. That does not matter as long
> >  	 * as we end up on the original cpu again when doing the cmpxchg.
> >  	 *
> > -	 * Preemption is disabled for the retrieval of the tid because that
> > -	 * must occur from the current processor. We cannot allow rescheduling
> > -	 * on a different processor between the determination of the pointer
> > -	 * and the retrieval of the tid.
> > +	 * We should guarantee that tid and kmem_cache are retrieved on
> > +	 * the same cpu. It could be different if CONFIG_PREEMPT so we need
> > +	 * to check if it is matched or not.
> >  	 */
> > -	preempt_disable();
> > -	c = this_cpu_ptr(s->cpu_slab);
> > +	do {
> > +		tid = this_cpu_read(s->cpu_slab->tid);
> > +		c = this_cpu_ptr(s->cpu_slab);
> > +	} while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));
> > +	barrier();
> 
> Help maintenance more if barrier is documented in commit message.

Hello,

Okay. Will add some information about this barrier in commit message.

Thanks.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/2] mm/slub: optimize alloc/free fastpath by removing preemption on/off
@ 2015-01-06  1:32   ` Joonsoo Kim
  0 siblings, 0 replies; 26+ messages in thread
From: Joonsoo Kim @ 2015-01-06  1:32 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Andrew Morton, 'Christoph Lameter',
	'Pekka Enberg', 'David Rientjes',
	linux-kernel, linux-mm, Steven Rostedt,
	'Jesper Dangaard Brouer'

On Mon, Jan 05, 2015 at 04:37:35PM +0800, Hillf Danton wrote:
> > 
> > We had to insert a preempt enable/disable in the fastpath a while ago
> > in order to guarantee that tid and kmem_cache_cpu are retrieved on the
> > same cpu. It is the problem only for CONFIG_PREEMPT in which scheduler
> > can move the process to other cpu during retrieving data.
> > 
> > Now, I reach the solution to remove preempt enable/disable in the fastpath.
> > If tid is matched with kmem_cache_cpu's tid after tid and kmem_cache_cpu
> > are retrieved by separate this_cpu operation, it means that they are
> > retrieved on the same cpu. If not matched, we just have to retry it.
> > 
> > With this guarantee, preemption enable/disable isn't need at all even if
> > CONFIG_PREEMPT, so this patch removes it.
> > 
> > I saw roughly 5% win in a fast-path loop over kmem_cache_alloc/free
> > in CONFIG_PREEMPT. (14.821 ns -> 14.049 ns)
> > 
> > Below is the result of Christoph's slab_test reported by
> > Jesper Dangaard Brouer.
> > 
> > * Before
> > 
> >  Single thread testing
> >  =====================
> >  1. Kmalloc: Repeatedly allocate then free test
> >  10000 times kmalloc(8) -> 49 cycles kfree -> 62 cycles
> >  10000 times kmalloc(16) -> 48 cycles kfree -> 64 cycles
> >  10000 times kmalloc(32) -> 53 cycles kfree -> 70 cycles
> >  10000 times kmalloc(64) -> 64 cycles kfree -> 77 cycles
> >  10000 times kmalloc(128) -> 74 cycles kfree -> 84 cycles
> >  10000 times kmalloc(256) -> 84 cycles kfree -> 114 cycles
> >  10000 times kmalloc(512) -> 83 cycles kfree -> 116 cycles
> >  10000 times kmalloc(1024) -> 81 cycles kfree -> 120 cycles
> >  10000 times kmalloc(2048) -> 104 cycles kfree -> 136 cycles
> >  10000 times kmalloc(4096) -> 142 cycles kfree -> 165 cycles
> >  10000 times kmalloc(8192) -> 238 cycles kfree -> 226 cycles
> >  10000 times kmalloc(16384) -> 403 cycles kfree -> 264 cycles
> >  2. Kmalloc: alloc/free test
> >  10000 times kmalloc(8)/kfree -> 68 cycles
> >  10000 times kmalloc(16)/kfree -> 68 cycles
> >  10000 times kmalloc(32)/kfree -> 69 cycles
> >  10000 times kmalloc(64)/kfree -> 68 cycles
> >  10000 times kmalloc(128)/kfree -> 68 cycles
> >  10000 times kmalloc(256)/kfree -> 68 cycles
> >  10000 times kmalloc(512)/kfree -> 74 cycles
> >  10000 times kmalloc(1024)/kfree -> 75 cycles
> >  10000 times kmalloc(2048)/kfree -> 74 cycles
> >  10000 times kmalloc(4096)/kfree -> 74 cycles
> >  10000 times kmalloc(8192)/kfree -> 75 cycles
> >  10000 times kmalloc(16384)/kfree -> 510 cycles
> > 
> > * After
> > 
> >  Single thread testing
> >  =====================
> >  1. Kmalloc: Repeatedly allocate then free test
> >  10000 times kmalloc(8) -> 46 cycles kfree -> 61 cycles
> >  10000 times kmalloc(16) -> 46 cycles kfree -> 63 cycles
> >  10000 times kmalloc(32) -> 49 cycles kfree -> 69 cycles
> >  10000 times kmalloc(64) -> 57 cycles kfree -> 76 cycles
> >  10000 times kmalloc(128) -> 66 cycles kfree -> 83 cycles
> >  10000 times kmalloc(256) -> 84 cycles kfree -> 110 cycles
> >  10000 times kmalloc(512) -> 77 cycles kfree -> 114 cycles
> >  10000 times kmalloc(1024) -> 80 cycles kfree -> 116 cycles
> >  10000 times kmalloc(2048) -> 102 cycles kfree -> 131 cycles
> >  10000 times kmalloc(4096) -> 135 cycles kfree -> 163 cycles
> >  10000 times kmalloc(8192) -> 238 cycles kfree -> 218 cycles
> >  10000 times kmalloc(16384) -> 399 cycles kfree -> 262 cycles
> >  2. Kmalloc: alloc/free test
> >  10000 times kmalloc(8)/kfree -> 65 cycles
> >  10000 times kmalloc(16)/kfree -> 66 cycles
> >  10000 times kmalloc(32)/kfree -> 65 cycles
> >  10000 times kmalloc(64)/kfree -> 66 cycles
> >  10000 times kmalloc(128)/kfree -> 66 cycles
> >  10000 times kmalloc(256)/kfree -> 71 cycles
> >  10000 times kmalloc(512)/kfree -> 72 cycles
> >  10000 times kmalloc(1024)/kfree -> 71 cycles
> >  10000 times kmalloc(2048)/kfree -> 71 cycles
> >  10000 times kmalloc(4096)/kfree -> 71 cycles
> >  10000 times kmalloc(8192)/kfree -> 65 cycles
> >  10000 times kmalloc(16384)/kfree -> 511 cycles
> > 
> > Most of the results are better than before.
> > 
> > Note that this change slightly worses performance in !CONFIG_PREEMPT,
> > roughly 0.3%. Implementing each case separately would help performance,
> > but, since it's so marginal, I didn't do that. This would help
> > maintanance since we have same code for all cases.
> > 
> > Tested-by: Jesper Dangaard Brouer <brouer@redhat.com>
> > Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> > ---
> >  mm/slub.c |   26 +++++++++++++-------------
> >  1 file changed, 13 insertions(+), 13 deletions(-)
> > 
> > diff --git a/mm/slub.c b/mm/slub.c
> > index fe376fe..0624608 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -2398,13 +2398,15 @@ redo:
> >  	 * reading from one cpu area. That does not matter as long
> >  	 * as we end up on the original cpu again when doing the cmpxchg.
> >  	 *
> > -	 * Preemption is disabled for the retrieval of the tid because that
> > -	 * must occur from the current processor. We cannot allow rescheduling
> > -	 * on a different processor between the determination of the pointer
> > -	 * and the retrieval of the tid.
> > +	 * We must guarantee that tid and kmem_cache_cpu are retrieved on
> > +	 * the same cpu. Under CONFIG_PREEMPT they could come from different
> > +	 * cpus, so check whether they match and retry if they do not.
> >  	 */
> > -	preempt_disable();
> > -	c = this_cpu_ptr(s->cpu_slab);
> > +	do {
> > +		tid = this_cpu_read(s->cpu_slab->tid);
> > +		c = this_cpu_ptr(s->cpu_slab);
> > +	} while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));
> > +	barrier();
> 
> It would help maintenance more if the barrier were documented in the commit message.

Hello,

Okay. I will add some information about this barrier to the commit message.
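
Roughly, I have something like the following in mind for the comment on the
retry loop (just a sketch, the exact wording and placement aren't final):

	do {
		tid = this_cpu_read(s->cpu_slab->tid);
		c = this_cpu_ptr(s->cpu_slab);
	} while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));

	/*
	 * Sketch, not final wording: barrier() is a compiler barrier.  It
	 * keeps the compiler from moving the loads of c->freelist and
	 * c->page above the tid read.  Because tid is read first, a
	 * successful this_cpu_cmpxchg_double() on (freelist, tid) proves
	 * that no alloc/free ran on this cpu and no migration happened
	 * between the tid read and the cmpxchg, so the values read after
	 * tid form a consistent snapshot.  Otherwise the cmpxchg fails
	 * and we redo the fastpath.
	 */
	barrier();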

Thanks.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/2] mm/slub: optimize alloc/free fastpath by removing preemption on/off
@ 2015-01-05  8:37 ` Hillf Danton
  0 siblings, 0 replies; 26+ messages in thread
From: Hillf Danton @ 2015-01-05  8:37 UTC (permalink / raw)
  To: 'Joonsoo Kim'
  Cc: Andrew Morton, 'Christoph Lameter',
	'Pekka Enberg', 'David Rientjes',
	linux-kernel, linux-mm, Steven Rostedt,
	'Jesper Dangaard Brouer'

> 
> We had to insert a preempt enable/disable pair in the fastpath a while ago
> in order to guarantee that tid and kmem_cache_cpu are retrieved on the
> same cpu. This is a problem only for CONFIG_PREEMPT, where the scheduler
> can move the process to another cpu while the data is being retrieved.
> 
> Now I have a solution that removes the preempt enable/disable from the
> fastpath. If the tid matches kmem_cache_cpu's tid after the two are
> retrieved by separate this_cpu operations, they were retrieved on the
> same cpu. If they don't match, we simply retry.
> 
> With this guarantee, preemption enable/disable isn't needed at all, even
> with CONFIG_PREEMPT, so this patch removes it.
> 
> I saw a roughly 5% win in a fastpath loop over kmem_cache_alloc/free
> with CONFIG_PREEMPT (14.821 ns -> 14.049 ns).
> 
> Below are the results of Christoph's slab_test, reported by
> Jesper Dangaard Brouer.
> 
> * Before
> 
>  Single thread testing
>  =====================
>  1. Kmalloc: Repeatedly allocate then free test
>  10000 times kmalloc(8) -> 49 cycles kfree -> 62 cycles
>  10000 times kmalloc(16) -> 48 cycles kfree -> 64 cycles
>  10000 times kmalloc(32) -> 53 cycles kfree -> 70 cycles
>  10000 times kmalloc(64) -> 64 cycles kfree -> 77 cycles
>  10000 times kmalloc(128) -> 74 cycles kfree -> 84 cycles
>  10000 times kmalloc(256) -> 84 cycles kfree -> 114 cycles
>  10000 times kmalloc(512) -> 83 cycles kfree -> 116 cycles
>  10000 times kmalloc(1024) -> 81 cycles kfree -> 120 cycles
>  10000 times kmalloc(2048) -> 104 cycles kfree -> 136 cycles
>  10000 times kmalloc(4096) -> 142 cycles kfree -> 165 cycles
>  10000 times kmalloc(8192) -> 238 cycles kfree -> 226 cycles
>  10000 times kmalloc(16384) -> 403 cycles kfree -> 264 cycles
>  2. Kmalloc: alloc/free test
>  10000 times kmalloc(8)/kfree -> 68 cycles
>  10000 times kmalloc(16)/kfree -> 68 cycles
>  10000 times kmalloc(32)/kfree -> 69 cycles
>  10000 times kmalloc(64)/kfree -> 68 cycles
>  10000 times kmalloc(128)/kfree -> 68 cycles
>  10000 times kmalloc(256)/kfree -> 68 cycles
>  10000 times kmalloc(512)/kfree -> 74 cycles
>  10000 times kmalloc(1024)/kfree -> 75 cycles
>  10000 times kmalloc(2048)/kfree -> 74 cycles
>  10000 times kmalloc(4096)/kfree -> 74 cycles
>  10000 times kmalloc(8192)/kfree -> 75 cycles
>  10000 times kmalloc(16384)/kfree -> 510 cycles
> 
> * After
> 
>  Single thread testing
>  =====================
>  1. Kmalloc: Repeatedly allocate then free test
>  10000 times kmalloc(8) -> 46 cycles kfree -> 61 cycles
>  10000 times kmalloc(16) -> 46 cycles kfree -> 63 cycles
>  10000 times kmalloc(32) -> 49 cycles kfree -> 69 cycles
>  10000 times kmalloc(64) -> 57 cycles kfree -> 76 cycles
>  10000 times kmalloc(128) -> 66 cycles kfree -> 83 cycles
>  10000 times kmalloc(256) -> 84 cycles kfree -> 110 cycles
>  10000 times kmalloc(512) -> 77 cycles kfree -> 114 cycles
>  10000 times kmalloc(1024) -> 80 cycles kfree -> 116 cycles
>  10000 times kmalloc(2048) -> 102 cycles kfree -> 131 cycles
>  10000 times kmalloc(4096) -> 135 cycles kfree -> 163 cycles
>  10000 times kmalloc(8192) -> 238 cycles kfree -> 218 cycles
>  10000 times kmalloc(16384) -> 399 cycles kfree -> 262 cycles
>  2. Kmalloc: alloc/free test
>  10000 times kmalloc(8)/kfree -> 65 cycles
>  10000 times kmalloc(16)/kfree -> 66 cycles
>  10000 times kmalloc(32)/kfree -> 65 cycles
>  10000 times kmalloc(64)/kfree -> 66 cycles
>  10000 times kmalloc(128)/kfree -> 66 cycles
>  10000 times kmalloc(256)/kfree -> 71 cycles
>  10000 times kmalloc(512)/kfree -> 72 cycles
>  10000 times kmalloc(1024)/kfree -> 71 cycles
>  10000 times kmalloc(2048)/kfree -> 71 cycles
>  10000 times kmalloc(4096)/kfree -> 71 cycles
>  10000 times kmalloc(8192)/kfree -> 65 cycles
>  10000 times kmalloc(16384)/kfree -> 511 cycles
> 
> Most of the results are better than before.
> 
> Note that this change slightly worsens performance in !CONFIG_PREEMPT,
> by roughly 0.3%. Implementing each case separately would help performance,
> but since the difference is so marginal, I didn't do that. Keeping the
> same code for all cases also helps maintenance.
> 
> Tested-by: Jesper Dangaard Brouer <brouer@redhat.com>
> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> ---
>  mm/slub.c |   26 +++++++++++++-------------
>  1 file changed, 13 insertions(+), 13 deletions(-)
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index fe376fe..0624608 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2398,13 +2398,15 @@ redo:
>  	 * reading from one cpu area. That does not matter as long
>  	 * as we end up on the original cpu again when doing the cmpxchg.
>  	 *
> -	 * Preemption is disabled for the retrieval of the tid because that
> -	 * must occur from the current processor. We cannot allow rescheduling
> -	 * on a different processor between the determination of the pointer
> -	 * and the retrieval of the tid.
> +	 * We must guarantee that tid and kmem_cache_cpu are retrieved on
> +	 * the same cpu. Under CONFIG_PREEMPT they could come from different
> +	 * cpus, so check whether they match and retry if they do not.
>  	 */
> -	preempt_disable();
> -	c = this_cpu_ptr(s->cpu_slab);
> +	do {
> +		tid = this_cpu_read(s->cpu_slab->tid);
> +		c = this_cpu_ptr(s->cpu_slab);
> +	} while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));
> +	barrier();

It would help maintenance more if the barrier were documented in the commit message.
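
For example, the hazard could be spelled out along these lines (only a
sketch of the scenario as I understand it, not a request for this exact
wording):

	/*
	 * Sketch: without barrier() the compiler may hoist the loads of
	 * c->freelist and c->page above the tid read.  An alloc/free
	 * completing in between then leaves freelist/page and tid
	 * describing different states, and the cmpxchg on (freelist, tid)
	 * can only verify that each captured value still matches; it
	 * cannot detect that the values were captured at different times.
	 * Reading tid first, with barrier() after the retry loop, closes
	 * that window.
	 */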
> 
>  	/*
>  	 * The transaction ids are globally unique per cpu and per operation on
> @@ -2412,8 +2414,6 @@ redo:
>  	 * occurs on the right processor and that there was no operation on the
>  	 * linked list in between.
>  	 */
> -	tid = c->tid;
> -	preempt_enable();
> 
>  	object = c->freelist;
>  	page = c->page;
> @@ -2659,11 +2659,11 @@ redo:
>  	 * data is retrieved via this pointer. If we are on the same cpu
>  	 * during the cmpxchg then the free will succedd.
>  	 */
> -	preempt_disable();
> -	c = this_cpu_ptr(s->cpu_slab);
> -
> -	tid = c->tid;
> -	preempt_enable();
> +	do {
> +		tid = this_cpu_read(s->cpu_slab->tid);
> +		c = this_cpu_ptr(s->cpu_slab);
> +	} while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));
> +	barrier();
> 
ditto
>  	if (likely(page == c->page)) {
>  		set_freepointer(s, object, c->freelist);
> --
> 1.7.9.5


^ permalink raw reply	[flat|nested] 26+ messages in thread


end of thread, other threads:[~2015-01-09  3:34 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-01-05  1:36 [PATCH 1/2] mm/slub: optimize alloc/free fastpath by removing preemption on/off Joonsoo Kim
2015-01-05  1:36 ` Joonsoo Kim
2015-01-05  1:36 ` [PATCH 2/2] mm: don't use compound_head() in virt_to_head_page() Joonsoo Kim
2015-01-05  1:36   ` Joonsoo Kim
2015-01-05 14:53   ` Christoph Lameter
2015-01-05 14:53     ` Christoph Lameter
2015-01-05 14:52 ` [PATCH 1/2] mm/slub: optimize alloc/free fastpath by removing preemption on/off Christoph Lameter
2015-01-05 14:52   ` Christoph Lameter
2015-01-06  3:03 ` Davidlohr Bueso
2015-01-06  3:03   ` Davidlohr Bueso
2015-01-06  8:09   ` Joonsoo Kim
2015-01-06  8:09     ` Joonsoo Kim
2015-01-06 17:02     ` Davidlohr Bueso
2015-01-06 17:02       ` Davidlohr Bueso
2015-01-08  7:44       ` Joonsoo Kim
2015-01-08  7:44         ` Joonsoo Kim
2015-01-09  3:34         ` Christoph Lameter
2015-01-09  3:34           ` Christoph Lameter
2015-01-05  8:37 Hillf Danton
2015-01-05  8:37 ` Hillf Danton
2015-01-06  1:32 ` Joonsoo Kim
2015-01-06  1:32   ` Joonsoo Kim
2015-01-06  2:25   ` Steven Rostedt
2015-01-06  2:25     ` Steven Rostedt
2015-01-06  8:27     ` Joonsoo Kim
2015-01-06  8:27       ` Joonsoo Kim
