* [PATCH v2 1/2] mm/slub: optimize alloc/free fastpath by removing preemption on/off
@ 2015-01-15  7:40 Joonsoo Kim
  2015-01-15  7:40 ` [PATCH v2 2/2] mm: don't use compound_head() in virt_to_head_page() Joonsoo Kim
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Joonsoo Kim @ 2015-01-15  7:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	linux-mm, linux-kernel, Jesper Dangaard Brouer, rostedt,
	Thomas Gleixner

We had to insert a preempt enable/disable in the fastpath a while ago
in order to guarantee that tid and kmem_cache_cpu are retrieved on the
same cpu. This is only a problem for CONFIG_PREEMPT, where the scheduler
can move the process to another cpu while the data is being retrieved.

Now, I have a solution that removes preempt enable/disable from the fastpath.
If tid still matches kmem_cache_cpu's tid after tid and kmem_cache_cpu
are retrieved by separate this_cpu operations, it means that they were
retrieved on the same cpu. If they don't match, we just retry.

With this guarantee, preemption enable/disable isn't needed at all, even
with CONFIG_PREEMPT, so this patch removes it.
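
For reference, the shape of the change in both fastpaths, condensed from
the diff below (the real code also adds a barrier() and explanatory
comments):

	/* Old scheme: pin the task to one cpu while reading both values. */
	preempt_disable();
	c = this_cpu_ptr(s->cpu_slab);
	tid = c->tid;
	preempt_enable();

	/*
	 * New scheme: read speculatively; a tid mismatch means we migrated
	 * between the two reads, so retry until both come from the same cpu.
	 */
	do {
		tid = this_cpu_read(s->cpu_slab->tid);
		c = this_cpu_ptr(s->cpu_slab);
	} while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));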

I saw roughly a 5% win in a fast-path loop over kmem_cache_alloc/free
with CONFIG_PREEMPT (14.821 ns -> 14.049 ns).

Below is the result of Christoph's slab_test reported by
Jesper Dangaard Brouer.

* Before

 Single thread testing
 =====================
 1. Kmalloc: Repeatedly allocate then free test
 10000 times kmalloc(8) -> 49 cycles kfree -> 62 cycles
 10000 times kmalloc(16) -> 48 cycles kfree -> 64 cycles
 10000 times kmalloc(32) -> 53 cycles kfree -> 70 cycles
 10000 times kmalloc(64) -> 64 cycles kfree -> 77 cycles
 10000 times kmalloc(128) -> 74 cycles kfree -> 84 cycles
 10000 times kmalloc(256) -> 84 cycles kfree -> 114 cycles
 10000 times kmalloc(512) -> 83 cycles kfree -> 116 cycles
 10000 times kmalloc(1024) -> 81 cycles kfree -> 120 cycles
 10000 times kmalloc(2048) -> 104 cycles kfree -> 136 cycles
 10000 times kmalloc(4096) -> 142 cycles kfree -> 165 cycles
 10000 times kmalloc(8192) -> 238 cycles kfree -> 226 cycles
 10000 times kmalloc(16384) -> 403 cycles kfree -> 264 cycles
 2. Kmalloc: alloc/free test
 10000 times kmalloc(8)/kfree -> 68 cycles
 10000 times kmalloc(16)/kfree -> 68 cycles
 10000 times kmalloc(32)/kfree -> 69 cycles
 10000 times kmalloc(64)/kfree -> 68 cycles
 10000 times kmalloc(128)/kfree -> 68 cycles
 10000 times kmalloc(256)/kfree -> 68 cycles
 10000 times kmalloc(512)/kfree -> 74 cycles
 10000 times kmalloc(1024)/kfree -> 75 cycles
 10000 times kmalloc(2048)/kfree -> 74 cycles
 10000 times kmalloc(4096)/kfree -> 74 cycles
 10000 times kmalloc(8192)/kfree -> 75 cycles
 10000 times kmalloc(16384)/kfree -> 510 cycles

* After

 Single thread testing
 =====================
 1. Kmalloc: Repeatedly allocate then free test
 10000 times kmalloc(8) -> 46 cycles kfree -> 61 cycles
 10000 times kmalloc(16) -> 46 cycles kfree -> 63 cycles
 10000 times kmalloc(32) -> 49 cycles kfree -> 69 cycles
 10000 times kmalloc(64) -> 57 cycles kfree -> 76 cycles
 10000 times kmalloc(128) -> 66 cycles kfree -> 83 cycles
 10000 times kmalloc(256) -> 84 cycles kfree -> 110 cycles
 10000 times kmalloc(512) -> 77 cycles kfree -> 114 cycles
 10000 times kmalloc(1024) -> 80 cycles kfree -> 116 cycles
 10000 times kmalloc(2048) -> 102 cycles kfree -> 131 cycles
 10000 times kmalloc(4096) -> 135 cycles kfree -> 163 cycles
 10000 times kmalloc(8192) -> 238 cycles kfree -> 218 cycles
 10000 times kmalloc(16384) -> 399 cycles kfree -> 262 cycles
 2. Kmalloc: alloc/free test
 10000 times kmalloc(8)/kfree -> 65 cycles
 10000 times kmalloc(16)/kfree -> 66 cycles
 10000 times kmalloc(32)/kfree -> 65 cycles
 10000 times kmalloc(64)/kfree -> 66 cycles
 10000 times kmalloc(128)/kfree -> 66 cycles
 10000 times kmalloc(256)/kfree -> 71 cycles
 10000 times kmalloc(512)/kfree -> 72 cycles
 10000 times kmalloc(1024)/kfree -> 71 cycles
 10000 times kmalloc(2048)/kfree -> 71 cycles
 10000 times kmalloc(4096)/kfree -> 71 cycles
 10000 times kmalloc(8192)/kfree -> 65 cycles
 10000 times kmalloc(16384)/kfree -> 511 cycles

Most of the results are better than before.

Note that this change slightly worsens performance with !CONFIG_PREEMPT,
by roughly 0.3%. Implementing each case separately (see the sketch below)
would recover that, but, since the difference is so marginal, I didn't do
it. Keeping the same code for all cases also helps maintainability.
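
For reference, the separate implementation I decided against would look
roughly like this (hypothetical sketch, not part of the patch):

#ifdef CONFIG_PREEMPT
	/*
	 * Preemption can migrate us between the two reads: retry until
	 * tid and c are observed on the same cpu.
	 */
	do {
		tid = this_cpu_read(s->cpu_slab->tid);
		c = this_cpu_ptr(s->cpu_slab);
	} while (unlikely(tid != c->tid));
#else
	/* Without kernel preemption we cannot migrate here; one pass suffices. */
	tid = this_cpu_read(s->cpu_slab->tid);
	c = this_cpu_ptr(s->cpu_slab);
#endif

With !CONFIG_PREEMPT, the IS_ENABLED() test in the unified loop already
lets the compiler drop the retry check.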

Change from v1: add comment about barrier() usage

Acked-by: Christoph Lameter <cl@linux.com>
Tested-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
---
 mm/slub.c |   35 +++++++++++++++++++++++------------
 1 file changed, 23 insertions(+), 12 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index fe376fe..ceee1d7 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2398,13 +2398,24 @@ redo:
 	 * reading from one cpu area. That does not matter as long
 	 * as we end up on the original cpu again when doing the cmpxchg.
 	 *
-	 * Preemption is disabled for the retrieval of the tid because that
-	 * must occur from the current processor. We cannot allow rescheduling
-	 * on a different processor between the determination of the pointer
-	 * and the retrieval of the tid.
+	 * We should guarantee that tid and kmem_cache are retrieved on
+	 * the same cpu. It could be different if CONFIG_PREEMPT so we need
+	 * to check if it is matched or not.
 	 */
-	preempt_disable();
-	c = this_cpu_ptr(s->cpu_slab);
+	do {
+		tid = this_cpu_read(s->cpu_slab->tid);
+		c = this_cpu_ptr(s->cpu_slab);
+	} while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));
+
+	/*
+	 * Irqless object alloc/free alogorithm used here depends on sequence
+	 * of fetching cpu_slab's data. tid should be fetched before anything
+	 * on c to guarantee that object and page associated with previous tid
+	 * won't be used with current tid. If we fetch tid first, object and
+	 * page could be one associated with next tid and our alloc/free
+	 * request will be failed. In this case, we will retry. So, no problem.
+	 */
+	barrier();
 
 	/*
 	 * The transaction ids are globally unique per cpu and per operation on
@@ -2412,8 +2423,6 @@ redo:
 	 * occurs on the right processor and that there was no operation on the
 	 * linked list in between.
 	 */
-	tid = c->tid;
-	preempt_enable();
 
 	object = c->freelist;
 	page = c->page;
@@ -2659,11 +2668,13 @@ redo:
 	 * data is retrieved via this pointer. If we are on the same cpu
 	 * during the cmpxchg then the free will succedd.
 	 */
-	preempt_disable();
-	c = this_cpu_ptr(s->cpu_slab);
+	do {
+		tid = this_cpu_read(s->cpu_slab->tid);
+		c = this_cpu_ptr(s->cpu_slab);
+	} while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));
 
-	tid = c->tid;
-	preempt_enable();
+	/* Same with comment on barrier() in slab_alloc_node() */
+	barrier();
 
 	if (likely(page == c->page)) {
 		set_freepointer(s, object, c->freelist);
-- 
1.7.9.5



* [PATCH v2 2/2] mm: don't use compound_head() in virt_to_head_page()
  2015-01-15  7:40 [PATCH v2 1/2] mm/slub: optimize alloc/free fastpath by removing preemption on/off Joonsoo Kim
@ 2015-01-15  7:40 ` Joonsoo Kim
  2015-01-16  1:16   ` Andrew Morton
  2015-01-15  8:10 ` [PATCH v2 1/2] mm/slub: optimize alloc/free fastpath by removing preemption on/off Jesper Dangaard Brouer
  2015-01-16  1:16 ` Andrew Morton
  2 siblings, 1 reply; 16+ messages in thread
From: Joonsoo Kim @ 2015-01-15  7:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	linux-mm, linux-kernel, Jesper Dangaard Brouer, rostedt,
	Thomas Gleixner

compound_head() is implemented with the assumption that there could be
a race condition when checking the tail flag. This assumption is only
true when we try to access an arbitrarily positioned struct page.

The situation in which virt_to_head_page() is called is a different
case: we call virt_to_head_page() only on the range of allocated pages,
so there is no race condition on the tail flag. In this case, we don't
need to handle the race condition, and we can reduce overhead slightly.
This patch implements compound_head_fast(), which is similar to
compound_head() except for the tail-flag race handling, and then makes
virt_to_head_page() use this optimized function to improve performance.
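
For comparison, the race-tolerant path in compound_head() that this
bypasses looks roughly like the following (paraphrased, not the verbatim
mm.h source):

	if (unlikely(PageTail(page))) {
		struct page *head = page->first_page;

		/*
		 * first_page may be stale if the compound page is being
		 * split concurrently, so re-check PageTail after a read
		 * barrier before trusting it.
		 */
		smp_rmb();
		if (likely(PageTail(page)))
			return head;
	}
	return page;

compound_head_fast() drops the barrier and the re-check, which is where
the small win below comes from.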

I saw a 1.8% win in a fast-path loop over kmem_cache_alloc/free
(14.063 ns -> 13.810 ns) when the target object is on a tail page.

Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
---
 include/linux/mm.h |   10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f80d019..0460e2e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -453,6 +453,13 @@ static inline struct page *compound_head(struct page *page)
 	return page;
 }
 
+static inline struct page *compound_head_fast(struct page *page)
+{
+	if (unlikely(PageTail(page)))
+		return page->first_page;
+	return page;
+}
+
 /*
  * The atomic page->_mapcount, starts from -1: so that transitions
  * both from it and to it can be tracked, using atomic_inc_and_test
@@ -531,7 +538,8 @@ static inline void get_page(struct page *page)
 static inline struct page *virt_to_head_page(const void *x)
 {
 	struct page *page = virt_to_page(x);
-	return compound_head(page);
+
+	return compound_head_fast(page);
 }
 
 /*
-- 
1.7.9.5



* Re: [PATCH v2 1/2] mm/slub: optimize alloc/free fastpath by removing preemption on/off
  2015-01-15  7:40 [PATCH v2 1/2] mm/slub: optimize alloc/free fastpath by removing preemption on/off Joonsoo Kim
  2015-01-15  7:40 ` [PATCH v2 2/2] mm: don't use compound_head() in virt_to_head_page() Joonsoo Kim
@ 2015-01-15  8:10 ` Jesper Dangaard Brouer
  2015-01-16  1:16 ` Andrew Morton
  2 siblings, 0 replies; 16+ messages in thread
From: Jesper Dangaard Brouer @ 2015-01-15  8:10 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, Christoph Lameter, Pekka Enberg, David Rientjes,
	linux-mm, linux-kernel, rostedt, Thomas Gleixner, brouer

On Thu, 15 Jan 2015 16:40:32 +0900
Joonsoo Kim <iamjoonsoo.kim@lge.com> wrote:

[...]
> 
> I saw roughly a 5% win in a fast-path loop over kmem_cache_alloc/free
> with CONFIG_PREEMPT (14.821 ns -> 14.049 ns).
> 
> Below is the result of Christoph's slab_test reported by
> Jesper Dangaard Brouer.
>
[...]

Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>

> Acked-by: Christoph Lameter <cl@linux.com>
> Tested-by: Jesper Dangaard Brouer <brouer@redhat.com>
> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> ---
>  mm/slub.c |   35 +++++++++++++++++++++++------------
>  1 file changed, 23 insertions(+), 12 deletions(-)
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index fe376fe..ceee1d7 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2398,13 +2398,24 @@ redo:
[...]
>  	 */
> -	preempt_disable();
> -	c = this_cpu_ptr(s->cpu_slab);
> +	do {
> +		tid = this_cpu_read(s->cpu_slab->tid);
> +		c = this_cpu_ptr(s->cpu_slab);
> +	} while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));
> +
> +	/*
> +	 * Irqless object alloc/free alogorithm used here depends on sequence

Spelling of algorithm contains a typo ^^ 

> +	 * of fetching cpu_slab's data. tid should be fetched before anything
> +	 * on c to guarantee that object and page associated with previous tid
> +	 * won't be used with current tid. If we fetch tid first, object and
> +	 * page could be one associated with next tid and our alloc/free
> +	 * request will be failed. In this case, we will retry. So, no problem.
> +	 */
> +	barrier();

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


* Re: [PATCH v2 1/2] mm/slub: optimize alloc/free fastpath by removing preemption on/off
  2015-01-15  7:40 [PATCH v2 1/2] mm/slub: optimize alloc/free fastpath by removing preemption on/off Joonsoo Kim
  2015-01-15  7:40 ` [PATCH v2 2/2] mm: don't use compound_head() in virt_to_head_page() Joonsoo Kim
  2015-01-15  8:10 ` [PATCH v2 1/2] mm/slub: optimize alloc/free fastpath by removing preemption on/off Jesper Dangaard Brouer
@ 2015-01-16  1:16 ` Andrew Morton
  2015-01-16  1:30   ` Steven Rostedt
  2015-01-16  3:28   ` Christoph Lameter
  2 siblings, 2 replies; 16+ messages in thread
From: Andrew Morton @ 2015-01-16  1:16 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Christoph Lameter, Pekka Enberg, David Rientjes, linux-mm,
	linux-kernel, Jesper Dangaard Brouer, rostedt, Thomas Gleixner

On Thu, 15 Jan 2015 16:40:32 +0900 Joonsoo Kim <iamjoonsoo.kim@lge.com> wrote:

> We had to insert a preempt enable/disable in the fastpath a while ago
> in order to guarantee that tid and kmem_cache_cpu are retrieved on the
> same cpu. This is only a problem for CONFIG_PREEMPT, where the scheduler
> can move the process to another cpu while the data is being retrieved.
> 
> Now, I have a solution that removes preempt enable/disable from the fastpath.
> If tid still matches kmem_cache_cpu's tid after tid and kmem_cache_cpu
> are retrieved by separate this_cpu operations, it means that they were
> retrieved on the same cpu. If they don't match, we just retry.
> 
> With this guarantee, preemption enable/disable isn't needed at all, even
> with CONFIG_PREEMPT, so this patch removes it.
> 
> I saw roughly a 5% win in a fast-path loop over kmem_cache_alloc/free
> with CONFIG_PREEMPT (14.821 ns -> 14.049 ns).

I'm surprised.  preempt_disable/enable are pretty fast.  I wonder why
this makes a measurable difference.  Perhaps preempt_enable()'s call to
preempt_schedule() added pain?
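
For context, with CONFIG_PREEMPT the pair roughly expands to the
following (paraphrased; exact definitions vary by kernel version):

#define preempt_disable() \
do { \
	preempt_count_inc(); \
	barrier(); \
} while (0)

/*
 * The dec-and-test drops into the scheduler when the count reaches zero
 * with a reschedule pending, which is the part that is not free.
 */
#define preempt_enable() \
do { \
	barrier(); \
	if (unlikely(preempt_count_dec_and_test())) \
		__preempt_schedule(); \
} while (0)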



* Re: [PATCH v2 2/2] mm: don't use compound_head() in virt_to_head_page()
  2015-01-15  7:40 ` [PATCH v2 2/2] mm: don't use compound_head() in virt_to_head_page() Joonsoo Kim
@ 2015-01-16  1:16   ` Andrew Morton
  2015-01-16  3:30     ` Christoph Lameter
  2015-01-19  6:16     ` Joonsoo Kim
  0 siblings, 2 replies; 16+ messages in thread
From: Andrew Morton @ 2015-01-16  1:16 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Christoph Lameter, Pekka Enberg, David Rientjes, linux-mm,
	linux-kernel, Jesper Dangaard Brouer, rostedt, Thomas Gleixner

On Thu, 15 Jan 2015 16:40:33 +0900 Joonsoo Kim <iamjoonsoo.kim@lge.com> wrote:

> compound_head() is implemented with the assumption that there could be
> a race condition when checking the tail flag. This assumption is only
> true when we try to access an arbitrarily positioned struct page.
> 
> The situation in which virt_to_head_page() is called is a different
> case: we call virt_to_head_page() only on the range of allocated pages,
> so there is no race condition on the tail flag. In this case, we don't
> need to handle the race condition, and we can reduce overhead slightly.
> This patch implements compound_head_fast(), which is similar to
> compound_head() except for the tail-flag race handling, and then makes
> virt_to_head_page() use this optimized function to improve performance.
> 
> I saw a 1.8% win in a fast-path loop over kmem_cache_alloc/free
> (14.063 ns -> 13.810 ns) when the target object is on a tail page.
>
> ...
>
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -453,6 +453,13 @@ static inline struct page *compound_head(struct page *page)
>  	return page;
>  }
>  
> +static inline struct page *compound_head_fast(struct page *page)
> +{
> +	if (unlikely(PageTail(page)))
> +		return page->first_page;
> +	return page;
> +}

Can we please have some code comments which let people know when they
should and shouldn't use compound_head_fast()?  I shouldn't have to say
this :(

>  /*
>   * The atomic page->_mapcount, starts from -1: so that transitions
>   * both from it and to it can be tracked, using atomic_inc_and_test
> @@ -531,7 +538,8 @@ static inline void get_page(struct page *page)
>  static inline struct page *virt_to_head_page(const void *x)
>  {
>  	struct page *page = virt_to_page(x);
> -	return compound_head(page);
> +
> +	return compound_head_fast(page);

And perhaps some explanation here as to why virt_to_head_page() can
safely use compound_head_fast().  There's an assumption here that
nobody will be dismantling the compound page while virt_to_head_page()
is in progress, yes?  And this assumption also holds for the calling
code, because otherwise the virt_to_head_page() return value is kinda
meaningless.

This is tricky stuff - let's spell it out carefully.
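
The kind of comment being asked for might read something like this
(illustrative only, not the actual v3 wording):

static inline struct page *virt_to_head_page(const void *x)
{
	struct page *page = virt_to_page(x);

	/*
	 * We don't need the tail-flag race handling of compound_head()
	 * here: the caller must hold whatever keeps this allocated page
	 * alive, so the compound page cannot be split or freed while we
	 * look at it.
	 */
	return compound_head_fast(page);
}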


* Re: [PATCH v2 1/2] mm/slub: optimize alloc/free fastpath by removing preemption on/off
  2015-01-16  1:16 ` Andrew Morton
@ 2015-01-16  1:30   ` Steven Rostedt
  2015-01-16  3:27     ` Christoph Lameter
  2015-01-16  3:28   ` Christoph Lameter
  1 sibling, 1 reply; 16+ messages in thread
From: Steven Rostedt @ 2015-01-16  1:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Joonsoo Kim, Christoph Lameter, Pekka Enberg, David Rientjes,
	linux-mm, linux-kernel, Jesper Dangaard Brouer, Thomas Gleixner,
	Peter Zijlstra

On Thu, 15 Jan 2015 17:16:34 -0800
Andrew Morton <akpm@linux-foundation.org> wrote:

> > I saw roughly a 5% win in a fast-path loop over kmem_cache_alloc/free
> > with CONFIG_PREEMPT (14.821 ns -> 14.049 ns).
> 
> I'm surprised.  preempt_disable/enable are pretty fast.  I wonder why
> this makes a measurable difference.  Perhaps preempt_enable()'s call
> to preempt_schedule() added pain?

When profiling function tracing I discovered that accessing preempt_count
was actually quite expensive, even just to read. But it may not be as
bad since Peter Zijlstra converted preempt_count to a per_cpu variable.
Although, IIRC, the perf profiling showed that the access to the %gs
register was where the time was being consumed, and %gs is what
I believe per_cpu variables still use.

-- Steve


* Re: [PATCH v2 1/2] mm/slub: optimize alloc/free fastpath by removing preemption on/off
  2015-01-16  1:30   ` Steven Rostedt
@ 2015-01-16  3:27     ` Christoph Lameter
  2015-01-16  3:51       ` Steven Rostedt
  0 siblings, 1 reply; 16+ messages in thread
From: Christoph Lameter @ 2015-01-16  3:27 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Andrew Morton, Joonsoo Kim, Pekka Enberg, David Rientjes,
	linux-mm, linux-kernel, Jesper Dangaard Brouer, Thomas Gleixner,
	Peter Zijlstra

On Thu, 15 Jan 2015, Steven Rostedt wrote:

> When profiling function tracing I discovered that accessing preempt_count
> was actually quite expensive, even just to read. But it may not be as
> bad since Peter Zijlstra converted preempt_count to a per_cpu variable.
> Although, IIRC, the perf profiling showed that the access to the %gs
> register was where the time was being consumed, and %gs is what
> I believe per_cpu variables still use.

The %gs register is not used since the address of the per cpu area is
available as one of the first fields in the per cpu areas.



* Re: [PATCH v2 1/2] mm/slub: optimize alloc/free fastpath by removing preemption on/off
  2015-01-16  1:16 ` Andrew Morton
  2015-01-16  1:30   ` Steven Rostedt
@ 2015-01-16  3:28   ` Christoph Lameter
  1 sibling, 0 replies; 16+ messages in thread
From: Christoph Lameter @ 2015-01-16  3:28 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Joonsoo Kim, Pekka Enberg, David Rientjes, linux-mm,
	linux-kernel, Jesper Dangaard Brouer, rostedt, Thomas Gleixner

On Thu, 15 Jan 2015, Andrew Morton wrote:

> > I saw roughly a 5% win in a fast-path loop over kmem_cache_alloc/free
> > with CONFIG_PREEMPT (14.821 ns -> 14.049 ns).
>
> I'm surprised.  preempt_disable/enable are pretty fast.  I wonder why
> this makes a measurable difference.  Perhaps preempt_enable()'s call to
> preempt_schedule() added pain?

The rest of the fastpath is already highly optimized. That is why
something like preempt enable/disable has such a disproportionate effect.



* Re: [PATCH v2 2/2] mm: don't use compound_head() in virt_to_head_page()
  2015-01-16  1:16   ` Andrew Morton
@ 2015-01-16  3:30     ` Christoph Lameter
  2015-01-19  6:16     ` Joonsoo Kim
  1 sibling, 0 replies; 16+ messages in thread
From: Christoph Lameter @ 2015-01-16  3:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Joonsoo Kim, Pekka Enberg, David Rientjes, linux-mm,
	linux-kernel, Jesper Dangaard Brouer, rostedt, Thomas Gleixner

On Thu, 15 Jan 2015, Andrew Morton wrote:

> And perhaps some explanation here as to why virt_to_head_page() can
> safely use compound_head_fast().  There's an assumption here that
> nobody will be dismantling the compound page while virt_to_head_page()
> is in progress, yes?  And this assumption also holds for the calling
> code, because otherwise the virt_to_head_page() return value is kinda
> meaningless.

I think this assumption is pretty natural to make. A compound_head() that
works well while a compound page is being dismantled should be marked
specially, and Joonsoo's definition should be the standard.




* Re: [PATCH v2 1/2] mm/slub: optimize alloc/free fastpath by removing preemption on/off
  2015-01-16  3:27     ` Christoph Lameter
@ 2015-01-16  3:51       ` Steven Rostedt
  2015-01-16  3:57         ` Christoph Lameter
  2015-01-16  4:04         ` Steven Rostedt
  0 siblings, 2 replies; 16+ messages in thread
From: Steven Rostedt @ 2015-01-16  3:51 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Joonsoo Kim, Pekka Enberg, David Rientjes,
	linux-mm, linux-kernel, Jesper Dangaard Brouer, Thomas Gleixner,
	Peter Zijlstra

On Thu, 15 Jan 2015 21:27:14 -0600 (CST)
Christoph Lameter <cl@linux.com> wrote:

> 
> The %gs register is not used since the address of the per cpu area is
> available as one of the first fields in the per cpu areas.

Have you disassembled your code?

Looking at put_cpu_partial() from 3.19-rc3 where it does:

		oldpage = this_cpu_read(s->cpu_slab->partial);

I get:

		mov    %gs:0x18(%rax),%rdx

Looks to me that %gs is used.


I haven't done benchmarks in a while, so perhaps accessing the %gs
segment isn't as expensive as I saw it before. I'll have to profile
function tracing on my i7 and see where things are slow again.


-- Steve


* Re: [PATCH v2 1/2] mm/slub: optimize alloc/free fastpath by removing preemption on/off
  2015-01-16  3:51       ` Steven Rostedt
@ 2015-01-16  3:57         ` Christoph Lameter
  2015-01-16  4:07           ` Steven Rostedt
  2015-01-16  4:04         ` Steven Rostedt
  1 sibling, 1 reply; 16+ messages in thread
From: Christoph Lameter @ 2015-01-16  3:57 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Andrew Morton, Joonsoo Kim, Pekka Enberg, David Rientjes,
	linux-mm, linux-kernel, Jesper Dangaard Brouer, Thomas Gleixner,
	Peter Zijlstra

> I get:
>
> 		mov    %gs:0x18(%rax),%rdx
>
> Looks to me that %gs is used.

%gs is used as a segment prefix. That does not add significant cycles.
Retrieving the content of %gs and loading it into another register would
be expensive in terms of cpu cycles.



* Re: [PATCH v2 1/2] mm/slub: optimize alloc/free fastpath by removing preemption on/off
  2015-01-16  3:51       ` Steven Rostedt
  2015-01-16  3:57         ` Christoph Lameter
@ 2015-01-16  4:04         ` Steven Rostedt
  1 sibling, 0 replies; 16+ messages in thread
From: Steven Rostedt @ 2015-01-16  4:04 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Joonsoo Kim, Pekka Enberg, David Rientjes,
	linux-mm, linux-kernel, Jesper Dangaard Brouer, Thomas Gleixner,
	Peter Zijlstra

On Thu, 15 Jan 2015 22:51:30 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> 
> I haven't done benchmarks in a while, so perhaps accessing the %gs
> segment isn't as expensive as I saw it before. I'll have to profile
> function tracing on my i7 and see where things are slow again.

I just ran it on my i7, and yeah, the %gs access isn't much worse than
any of the other instructions. I had an old box that recently died that
I did my last benchmarks on, so that was probably why it made such a
difference.

-- Steve



* Re: [PATCH v2 1/2] mm/slub: optimize alloc/free fastpath by removing preemption on/off
  2015-01-16  3:57         ` Christoph Lameter
@ 2015-01-16  4:07           ` Steven Rostedt
  2015-01-16 13:40             ` Eric Dumazet
  0 siblings, 1 reply; 16+ messages in thread
From: Steven Rostedt @ 2015-01-16  4:07 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Joonsoo Kim, Pekka Enberg, David Rientjes,
	linux-mm, linux-kernel, Jesper Dangaard Brouer, Thomas Gleixner,
	Peter Zijlstra

On Thu, 15 Jan 2015 21:57:58 -0600 (CST)
Christoph Lameter <cl@linux.com> wrote:

> > I get:
> >
> > 		mov    %gs:0x18(%rax),%rdx
> >
> > Looks to me that %gs is used.
> 
> %gs is used as a segment prefix. That does not add significant cycles.
> Retrieving the content of %gs and loading it into another register
> would be expensive in terms of cpu cycles.

OK, maybe that's what I saw in my previous benchmarks. Again, that was
a while ago.

-- Steve


* Re: [PATCH v2 1/2] mm/slub: optimize alloc/free fastpath by removing preemption on/off
  2015-01-16  4:07           ` Steven Rostedt
@ 2015-01-16 13:40             ` Eric Dumazet
  2015-01-16 16:37               ` Steven Rostedt
  0 siblings, 1 reply; 16+ messages in thread
From: Eric Dumazet @ 2015-01-16 13:40 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Christoph Lameter, Andrew Morton, Joonsoo Kim, Pekka Enberg,
	David Rientjes, linux-mm, linux-kernel, Jesper Dangaard Brouer,
	Thomas Gleixner, Peter Zijlstra

On Thu, 2015-01-15 at 23:07 -0500, Steven Rostedt wrote:
> On Thu, 15 Jan 2015 21:57:58 -0600 (CST)
> Christoph Lameter <cl@linux.com> wrote:
> 
> > > I get:
> > >
> > > 		mov    %gs:0x18(%rax),%rdx
> > >
> > > Looks to me that %gs is used.
> > 
> > %gs is used as a segment prefix. That does not add significant cycles.
> > Retrieving the content of %gs and loading it into another register
> > would be expensive in terms of cpu cycles.
> 
> OK, maybe that's what I saw in my previous benchmarks. Again, that was
> a while ago.

I made the same observation about 3 years ago, on old cpus.




* Re: [PATCH v2 1/2] mm/slub: optimize alloc/free fastpath by removing preemption on/off
  2015-01-16 13:40             ` Eric Dumazet
@ 2015-01-16 16:37               ` Steven Rostedt
  0 siblings, 0 replies; 16+ messages in thread
From: Steven Rostedt @ 2015-01-16 16:37 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Christoph Lameter, Andrew Morton, Joonsoo Kim, Pekka Enberg,
	David Rientjes, linux-mm, linux-kernel, Jesper Dangaard Brouer,
	Thomas Gleixner, Peter Zijlstra

On Fri, 16 Jan 2015 05:40:59 -0800
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> I made the same observation about 3 years ago, on old cpus.
> 

Thank you for letting me know. I was thinking I was going insane!

(yeah yeah, there's lots of people who will still say that I've already
gone insane, but at least I know my memory is still intact)

-- Steve


* Re: [PATCH v2 2/2] mm: don't use compound_head() in virt_to_head_page()
  2015-01-16  1:16   ` Andrew Morton
  2015-01-16  3:30     ` Christoph Lameter
@ 2015-01-19  6:16     ` Joonsoo Kim
  1 sibling, 0 replies; 16+ messages in thread
From: Joonsoo Kim @ 2015-01-19  6:16 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Pekka Enberg, David Rientjes, linux-mm,
	linux-kernel, Jesper Dangaard Brouer, rostedt, Thomas Gleixner

On Thu, Jan 15, 2015 at 05:16:46PM -0800, Andrew Morton wrote:
> On Thu, 15 Jan 2015 16:40:33 +0900 Joonsoo Kim <iamjoonsoo.kim@lge.com> wrote:
> 
> > compound_head() is implemented with the assumption that there could be
> > a race condition when checking the tail flag. This assumption is only
> > true when we try to access an arbitrarily positioned struct page.
> > 
> > The situation in which virt_to_head_page() is called is a different
> > case: we call virt_to_head_page() only on the range of allocated pages,
> > so there is no race condition on the tail flag. In this case, we don't
> > need to handle the race condition, and we can reduce overhead slightly.
> > This patch implements compound_head_fast(), which is similar to
> > compound_head() except for the tail-flag race handling, and then makes
> > virt_to_head_page() use this optimized function to improve performance.
> > 
> > I saw a 1.8% win in a fast-path loop over kmem_cache_alloc/free
> > (14.063 ns -> 13.810 ns) when the target object is on a tail page.
> >
> > ...
> >
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -453,6 +453,13 @@ static inline struct page *compound_head(struct page *page)
> >  	return page;
> >  }
> >  
> > +static inline struct page *compound_head_fast(struct page *page)
> > +{
> > +	if (unlikely(PageTail(page)))
> > +		return page->first_page;
> > +	return page;
> > +}
> 
> Can we please have some code comments which let people know when they
> should and shouldn't use compound_head_fast()?  I shouldn't have to say
> this :(

Okay.
> 
> >  /*
> >   * The atomic page->_mapcount, starts from -1: so that transitions
> >   * both from it and to it can be tracked, using atomic_inc_and_test
> > @@ -531,7 +538,8 @@ static inline void get_page(struct page *page)
> >  static inline struct page *virt_to_head_page(const void *x)
> >  {
> >  	struct page *page = virt_to_page(x);
> > -	return compound_head(page);
> > +
> > +	return compound_head_fast(page);
> 
> And perhaps some explanation here as to why virt_to_head_page() can
> safely use compound_head_fast().  There's an assumption here that
> nobody will be dismantling the compound page while virt_to_head_page()
> is in progress, yes?  And this assumption also holds for the calling
> code, because otherwise the virt_to_head_page() return value is kinda
> meaningless.
> 
> This is tricky stuff - let's spell it out carefully.

Okay.

I already sent v3 and it would have proper code comments.

Thanks.


