* [PATCH] slub: prefetch next freelist pointer in slab_alloc()
@ 2011-12-16 15:25 Eric Dumazet
  2011-12-16 16:31 ` Christoph Lameter
  2011-12-18 22:47 ` David Rientjes
  0 siblings, 2 replies; 9+ messages in thread
From: Eric Dumazet @ 2011-12-16 15:25 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-kernel, Pekka Enberg, David Rientjes, Alex Shi, Shaohua Li,
	Matt Mackall

Recycling a page is a problem: the freelist link chain is hot on the
cpu(s) which freed the objects, and possibly very cold on the cpu
currently owning the slab.

Adding a prefetch of the cache line containing the pointer to the next
object in slab_alloc() helps a lot in many workloads, in particular
asymmetric ones (allocations done on one cpu, frees on other cpus).
The added cost is only three machine instructions.
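
To illustrate the pattern outside the kernel, here is a minimal,
hypothetical userspace sketch (invented names, not the actual mm/slub.c
code, which follows below); __builtin_prefetch() is the gcc builtin
that the kernel's generic prefetch() falls back to:

struct object {
	struct object *next;	/* freelist link stored inside the object */
};

static struct object *alloc_from_freelist(struct object **freelist)
{
	struct object *obj = *freelist;

	if (obj) {
		*freelist = obj->next;
		/*
		 * Start pulling in the cache line holding the next
		 * object's link now, so the next allocation does not
		 * stall on a line still owned by the freeing cpu.
		 */
		__builtin_prefetch(*freelist);
	}
	return obj;
}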

Example results on my dual-socket quad-core HT machine (Intel E5540
@2.53GHz, 16 logical cpus, 2 memory nodes), 64-bit kernel.

Before patch:

# perf stat -r 32 hackbench 50 process 4000 >/dev/null

 Performance counter stats for 'hackbench 50 process 4000' (32 runs):

     327577,471718 task-clock                #   15,821 CPUs utilized            ( +-  0,64% )
        28 866 491 context-switches          #    0,088 M/sec                    ( +-  1,80% )
         1 506 929 CPU-migrations            #    0,005 M/sec                    ( +-  3,24% )
           127 151 page-faults               #    0,000 M/sec                    ( +-  0,16% )
   829 399 813 448 cycles                    #    2,532 GHz                      ( +-  0,64% )
   580 664 691 740 stalled-cycles-frontend   #   70,01% frontend cycles idle     ( +-  0,71% )
   197 431 700 448 stalled-cycles-backend    #   23,80% backend  cycles idle     ( +-  1,03% )
   503 548 648 975 instructions              #    0,61  insns per cycle        
                                             #    1,15  stalled cycles per insn  ( +-  0,46% )
    95 780 068 471 branches                  #  292,389 M/sec                    ( +-  0,48% )
     1 426 407 916 branch-misses             #    1,49% of all branches          ( +-  1,35% )

      20,705679994 seconds time elapsed                                          ( +-  0,64% )

After patch:

# perf stat -r 32 hackbench 50 process 4000 >/dev/null

 Performance counter stats for 'hackbench 50 process 4000' (32 runs):

     286236,542804 task-clock                #   15,786 CPUs utilized            ( +-  1,32% )
        19 703 372 context-switches          #    0,069 M/sec                    ( +-  4,99% )
         1 658 249 CPU-migrations            #    0,006 M/sec                    ( +-  6,62% )
           126 776 page-faults               #    0,000 M/sec                    ( +-  0,12% )
   724 636 593 213 cycles                    #    2,532 GHz                      ( +-  1,32% )
   499 320 714 837 stalled-cycles-frontend   #   68,91% frontend cycles idle     ( +-  1,47% )
   156 555 126 809 stalled-cycles-backend    #   21,60% backend  cycles idle     ( +-  2,22% )
   463 897 792 661 instructions              #    0,64  insns per cycle        
                                             #    1,08  stalled cycles per insn  ( +-  0,94% )
    87 717 352 563 branches                  #  306,451 M/sec                    ( +-  0,99% )
       941 738 280 branch-misses             #    1,07% of all branches          ( +-  3,35% )

      18,132070670 seconds time elapsed                                          ( +-  1,30% )
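
In short: about a 12% reduction in elapsed time (20.71 s to 18.13 s),
with ~13% fewer cycles and roughly a third fewer branch misses.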

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Pekka Enberg <penberg@kernel.org>
CC: Matt Mackall <mpm@selenic.com> 
CC: David Rientjes <rientjes@google.com>
CC: "Alex,Shi" <alex.shi@intel.com>
CC: Shaohua Li <shaohua.li@intel.com>
---
 mm/slub.c |   10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/mm/slub.c b/mm/slub.c
index ed3334d..16b850d 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -269,6 +269,11 @@ static inline void *get_freepointer(struct kmem_cache *s, void *object)
 	return *(void **)(object + s->offset);
 }
 
+static void prefetch_freepointer(const struct kmem_cache *s, void *object)
+{
+	prefetch(object + s->offset);
+}
+
 static inline void *get_freepointer_safe(struct kmem_cache *s, void *object)
 {
 	void *p;
@@ -2292,6 +2297,8 @@ redo:
 		object = __slab_alloc(s, gfpflags, node, addr, c);
 
 	else {
+		void *next_object = get_freepointer_safe(s, object);
+
 		/*
 		 * The cmpxchg will only match if there was no additional
 		 * operation and if we are on the right processor.
@@ -2307,11 +2314,12 @@ redo:
 		if (unlikely(!irqsafe_cpu_cmpxchg_double(
 				s->cpu_slab->freelist, s->cpu_slab->tid,
 				object, tid,
-				get_freepointer_safe(s, object), next_tid(tid)))) {
+				next_object, next_tid(tid)))) {
 
 			note_cmpxchg_failure("slab_alloc", s, tid);
 			goto redo;
 		}
+		prefetch_freepointer(s, next_object);
 		stat(s, ALLOC_FASTPATH);
 	}
 




* Re: [PATCH] slub: prefetch next freelist pointer in slab_alloc()
  2011-12-16 15:25 [PATCH] slub: prefetch next freelist pointer in slab_alloc() Eric Dumazet
@ 2011-12-16 16:31 ` Christoph Lameter
  2011-12-16 17:18   ` Eric Dumazet
  2012-01-24 19:54   ` Pekka Enberg
  2011-12-18 22:47 ` David Rientjes
  1 sibling, 2 replies; 9+ messages in thread
From: Christoph Lameter @ 2011-12-16 16:31 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: linux-kernel, Pekka Enberg, David Rientjes, Alex Shi, Shaohua Li,
	Matt Mackall

On Fri, 16 Dec 2011, Eric Dumazet wrote:

> Recycling a page is a problem: the freelist link chain is hot on the
> cpu(s) which freed the objects, and possibly very cold on the cpu
> currently owning the slab.

Good idea. How do the tcp benchmarks look after this?

Looks sane.

Acked-by: Christoph Lameter <cl@linux.com>


* Re: [PATCH] slub: prefetch next freelist pointer in slab_alloc()
  2011-12-16 16:31 ` Christoph Lameter
@ 2011-12-16 17:18   ` Eric Dumazet
  2011-12-17 22:56     ` Eric Dumazet
  2012-01-24 19:54   ` Pekka Enberg
  1 sibling, 1 reply; 9+ messages in thread
From: Eric Dumazet @ 2011-12-16 17:18 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-kernel, Pekka Enberg, David Rientjes, Alex Shi, Shaohua Li,
	Matt Mackall

On Friday, 16 December 2011 at 10:31 -0600, Christoph Lameter wrote:
> On Fri, 16 Dec 2011, Eric Dumazet wrote:
> 
> > Recycling a page is a problem: the freelist link chain is hot on the
> > cpu(s) which freed the objects, and possibly very cold on the cpu
> > currently owning the slab.
> 
> Good idea. How do the tcp benchmarks look after this?
> 
> Looks sane.
> 
> Acked-by: Christoph Lameter <cl@linux.com>

Thanks!

I wouldn't expect TCP to show a huge win (most of the cpu is consumed
in the tcp stack, not in memory allocations), but still...

[I expect a much better gain on a UDP load, where memory allocator
costs are relatively higher.]

$ cat netperf.sh
# run 32 netperf TCP_RR instances in parallel against the remote host
for i in `seq 1 32`
do
 netperf -H 192.168.20.110 -v 0 -l -100000 -t TCP_RR &
done
wait

If cpu0 handles the network interrupts and the other cpus run the applications:

Before

 Performance counter stats for './netperf.sh':

      38001,927957 task-clock                #    2,344 CPUs utilized          
         3 306 138 context-switches          #    0,087 M/sec                  
                79 CPU-migrations            #    0,000 M/sec                  
             9 656 page-faults               #    0,000 M/sec                  
    83 564 329 446 cycles                    #    2,199 GHz                    
    61 350 744 867 stalled-cycles-frontend   #   73,42% frontend cycles idle   
    34 907 541 687 stalled-cycles-backend    #   41,77% backend  cycles idle   
    44 739 971 752 instructions              #    0,54  insns per cycle        
                                             #    1,37  stalled cycles per insn
     8 662 005 669 branches                  #  227,936 M/sec                  
       249 555 153 branch-misses             #    2,88% of all branches        

      16,214220448 seconds time elapsed

After:

 Performance counter stats for './netperf.sh':

      37035,347847 task-clock                #    2,374 CPUs utilized          
         3 314 540 context-switches          #    0,089 M/sec                  
               131 CPU-migrations            #    0,000 M/sec                  
             9 691 page-faults               #    0,000 M/sec                  
    81 783 678 294 cycles                    #    2,208 GHz                    
    59 595 242 695 stalled-cycles-frontend   #   72,87% frontend cycles idle   
    34 367 813 304 stalled-cycles-backend    #   42,02% backend  cycles idle   
    44 698 853 546 instructions              #    0,55  insns per cycle        
                                             #    1,33  stalled cycles per insn
     8 654 940 308 branches                  #  233,694 M/sec                  
       245 578 562 branch-misses             #    2,84% of all branches        

      15,597940419 seconds time elapsed
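
That is roughly a 4% reduction in wall time (16.21 s to 15.60 s),
consistent with the modest win expected for TCP above.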




* Re: [PATCH] slub: prefetch next freelist pointer in slab_alloc()
  2011-12-16 17:18   ` Eric Dumazet
@ 2011-12-17 22:56     ` Eric Dumazet
  0 siblings, 0 replies; 9+ messages in thread
From: Eric Dumazet @ 2011-12-17 22:56 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-kernel, Pekka Enberg, David Rientjes, Alex Shi, Shaohua Li,
	Matt Mackall

On Friday, 16 December 2011 at 18:18 +0100, Eric Dumazet wrote:

> I wouldn't expect TCP to show a huge win (most of the cpu is consumed
> in the tcp stack, not in memory allocations), but still...
> 
> [I expect a much better gain on a UDP load, where memory allocator
> costs are relatively higher.]

Update on the benchmarks: the UDP results are really good.

UDP test: one cpu (cpu0) handles the NIC irqs, one cpu (cpu1) runs a
single-threaded UDP receiver (it only receives UDP messages, no xmits)

NUMA machine, fed with 1,000,000 64-byte packets per second (from
another pktgen machine)

cpu0/cpu1 are on different sockets, to force cache line bouncing and
stress SLUB (allocations done on cpu0, frees on cpu1)

bnx2x adapter (using the new build_skb() service for low memory
latency, available in the net-next tree)


Before the slub prefetch patch:
	590,000 messages received per second by the application,
	410,000 drops per second.


After the slub prefetch patch:
	740,000 messages received per second by the application,
	260,000 drops per second.
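
That is about 25% more messages delivered to the application
(740,000 vs 590,000), with drops cut by roughly 37%.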


[ If the application runs on cpu2 (same socket as cpu0), it can receive
920,000 pps (after the patch) instead of 890,000 pps (before). ]

Thanks




* Re: [PATCH] slub: prefetch next freelist pointer in slab_alloc()
  2011-12-16 15:25 [PATCH] slub: prefetch next freelist pointer in slab_alloc() Eric Dumazet
  2011-12-16 16:31 ` Christoph Lameter
@ 2011-12-18 22:47 ` David Rientjes
  1 sibling, 0 replies; 9+ messages in thread
From: David Rientjes @ 2011-12-18 22:47 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Christoph Lameter, linux-kernel, Pekka Enberg, Alex Shi,
	Shaohua Li, Matt Mackall

On Fri, 16 Dec 2011, Eric Dumazet wrote:

> Recycling a page is a problem: the freelist link chain is hot on the
> cpu(s) which freed the objects, and possibly very cold on the cpu
> currently owning the slab.
> 
> Adding a prefetch of the cache line containing the pointer to the next
> object in slab_alloc() helps a lot in many workloads, in particular
> asymmetric ones (allocations done on one cpu, frees on other cpus).
> The added cost is only three machine instructions.
> 
> Example results on my dual-socket quad-core HT machine (Intel E5540
> @2.53GHz, 16 logical cpus, 2 memory nodes), 64-bit kernel.
> 
> Before patch :
> 
> # perf stat -r 32 hackbench 50 process 4000 >/dev/null
> 
>  Performance counter stats for 'hackbench 50 process 4000' (32 runs):
> 
>      327577,471718 task-clock                #   15,821 CPUs utilized            ( +-  0,64% )
>         28 866 491 context-switches          #    0,088 M/sec                    ( +-  1,80% )
>          1 506 929 CPU-migrations            #    0,005 M/sec                    ( +-  3,24% )
>            127 151 page-faults               #    0,000 M/sec                    ( +-  0,16% )
>    829 399 813 448 cycles                    #    2,532 GHz                      ( +-  0,64% )
>    580 664 691 740 stalled-cycles-frontend   #   70,01% frontend cycles idle     ( +-  0,71% )
>    197 431 700 448 stalled-cycles-backend    #   23,80% backend  cycles idle     ( +-  1,03% )
>    503 548 648 975 instructions              #    0,61  insns per cycle        
>                                              #    1,15  stalled cycles per insn  ( +-  0,46% )
>     95 780 068 471 branches                  #  292,389 M/sec                    ( +-  0,48% )
>      1 426 407 916 branch-misses             #    1,49% of all branches          ( +-  1,35% )
> 
>       20,705679994 seconds time elapsed                                          ( +-  0,64% )
> 
> After patch :
> 
> # perf stat -r 32 hackbench 50 process 4000 >/dev/null
> 
>  Performance counter stats for 'hackbench 50 process 4000' (32 runs):
> 
>      286236,542804 task-clock                #   15,786 CPUs utilized            ( +-  1,32% )
>         19 703 372 context-switches          #    0,069 M/sec                    ( +-  4,99% )
>          1 658 249 CPU-migrations            #    0,006 M/sec                    ( +-  6,62% )
>            126 776 page-faults               #    0,000 M/sec                    ( +-  0,12% )
>    724 636 593 213 cycles                    #    2,532 GHz                      ( +-  1,32% )
>    499 320 714 837 stalled-cycles-frontend   #   68,91% frontend cycles idle     ( +-  1,47% )
>    156 555 126 809 stalled-cycles-backend    #   21,60% backend  cycles idle     ( +-  2,22% )
>    463 897 792 661 instructions              #    0,64  insns per cycle        
>                                              #    1,08  stalled cycles per insn  ( +-  0,94% )
>     87 717 352 563 branches                  #  306,451 M/sec                    ( +-  0,99% )
>        941 738 280 branch-misses             #    1,07% of all branches          ( +-  3,35% )
> 
>       18,132070670 seconds time elapsed                                          ( +-  1,30% )
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> CC: Pekka Enberg <penberg@kernel.org>
> CC: Matt Mackall <mpm@selenic.com> 
> CC: David Rientjes <rientjes@google.com>
> CC: "Alex,Shi" <alex.shi@intel.com>
> CC: Shaohua Li <shaohua.li@intel.com>

Acked-by: David Rientjes <rientjes@google.com>


* Re: [PATCH] slub: prefetch next freelist pointer in slab_alloc()
  2011-12-16 16:31 ` Christoph Lameter
  2011-12-16 17:18   ` Eric Dumazet
@ 2012-01-24 19:54   ` Pekka Enberg
  2012-01-30 21:32     ` Geert Uytterhoeven
  1 sibling, 1 reply; 9+ messages in thread
From: Pekka Enberg @ 2012-01-24 19:54 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Eric Dumazet, linux-kernel, David Rientjes, Alex Shi, Shaohua Li,
	Matt Mackall

On Fri, 16 Dec 2011, Eric Dumazet wrote:
>> Recycling a page is a problem: the freelist link chain is hot on the
>> cpu(s) which freed the objects, and possibly very cold on the cpu
>> currently owning the slab.

On Fri, 16 Dec 2011, Christoph Lameter wrote:
> Good idea. How do the tcp benchmarks look after this?
>
> Looks sane.
>
> Acked-by: Christoph Lameter <cl@linux.com>

Applied, thanks!


* Re: [PATCH] slub: prefetch next freelist pointer in slab_alloc()
  2012-01-24 19:54   ` Pekka Enberg
@ 2012-01-30 21:32     ` Geert Uytterhoeven
  2012-01-30 21:53       ` Christoph Lameter
  0 siblings, 1 reply; 9+ messages in thread
From: Geert Uytterhoeven @ 2012-01-30 21:32 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Christoph Lameter, Eric Dumazet, linux-kernel, David Rientjes,
	Alex Shi, Shaohua Li, Matt Mackall, Linux-Next

On Tue, Jan 24, 2012 at 20:54, Pekka Enberg <penberg@kernel.org> wrote:
> On Fri, 16 Dec 2011, Eric Dumazet wrote:
>>> Recycling a page is a problem: the freelist link chain is hot on the
>>> cpu(s) which freed the objects, and possibly very cold on the cpu
>>> currently owning the slab.
>
> On Fri, 16 Dec 2011, Christoph Lameter wrote:
>> Good idea. How do the tcp benchmarks look after this?
>>
>> Looks sane.
>>
>> Acked-by: Christoph Lameter <cl@linux.com>
>
> Applied, thanks!

m68k/allmodconfig at http://kisskb.ellerman.id.au/kisskb/buildresult/5527349/

mm/slub.c:274: error: implicit declaration of function 'prefetch'

Sorry, didn't notice it earlier due to other build breakage in -next.

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds


* Re: [PATCH] slub: prefetch next freelist pointer in slab_alloc()
  2012-01-30 21:32     ` Geert Uytterhoeven
@ 2012-01-30 21:53       ` Christoph Lameter
  2012-02-09 20:00         ` Geert Uytterhoeven
  0 siblings, 1 reply; 9+ messages in thread
From: Christoph Lameter @ 2012-01-30 21:53 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: Pekka Enberg, Eric Dumazet, linux-kernel, David Rientjes,
	Alex Shi, Shaohua Li, Matt Mackall, Linux-Next

On Mon, 30 Jan 2012, Geert Uytterhoeven wrote:

> On Tue, Jan 24, 2012 at 20:54, Pekka Enberg <penberg@kernel.org> wrote:
> > On Fri, 16 Dec 2011, Eric Dumazet wrote:
> >>> Recycling a page is a problem: the freelist link chain is hot on the
> >>> cpu(s) which freed the objects, and possibly very cold on the cpu
> >>> currently owning the slab.
> >
> > On Fri, 16 Dec 2011, Christoph Lameter wrote:
> >> Good idea. How do the tcp benchmarks look after this?
> >>
> >> Looks sane.
> >>
> >> Acked-by: Christoph Lameter <cl@linux.com>
> >
> > Applied, thanks!
>
> m68k/allmodconfig at http://kisskb.ellerman.id.au/kisskb/buildresult/5527349/
>
> mm/slub.c:274: error: implicit declaration of function 'prefetch'
>
> Sorry, didn't notice it earlier due to other build breakage in -next.

Does this fix it?
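
For background: on architectures that do not implement their own
prefetch(), <linux/prefetch.h> provides a generic fallback along these
lines (roughly):

#ifndef ARCH_HAS_PREFETCH
#define prefetch(x) __builtin_prefetch(x)
#endif

so pulling in that header should be all m68k needs.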


Subject: slub: include <linux/prefetch.h> for prefetch()

Otherwise the m68k build breaks.

Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Christoph Lameter <cl@linux.com>


---
 mm/slub.c |    1 +
 1 file changed, 1 insertion(+)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2012-01-13 10:04:06.000000000 -0600
+++ linux-2.6/mm/slub.c	2012-01-30 15:51:55.000000000 -0600
@@ -29,6 +29,7 @@
 #include <linux/math64.h>
 #include <linux/fault-inject.h>
 #include <linux/stacktrace.h>
+#include <linux/prefetch.h>

 #include <trace/events/kmem.h>



* Re: [PATCH] slub: prefetch next freelist pointer in slab_alloc()
  2012-01-30 21:53       ` Christoph Lameter
@ 2012-02-09 20:00         ` Geert Uytterhoeven
  0 siblings, 0 replies; 9+ messages in thread
From: Geert Uytterhoeven @ 2012-02-09 20:00 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Pekka Enberg, Eric Dumazet, linux-kernel, David Rientjes,
	Alex Shi, Shaohua Li, Matt Mackall, Linux-Next

On Mon, Jan 30, 2012 at 22:53, Christoph Lameter <cl@linux.com> wrote:
> On Mon, 30 Jan 2012, Geert Uytterhoeven wrote:
>> m68k/allmodconfig at http://kisskb.ellerman.id.au/kisskb/buildresult/5527349/
>>
>> mm/slub.c:274: error: implicit declaration of function 'prefetch'
>>
>> Sorry, didn't notice it earlier due to other build breakage in -next.
>
> Does this fix it?

Yep. Thx!

> Subject: slub: include <linux/prefetch.h> for prefetch()
>
> Otherwise the m68k build breaks.
>
> Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
> Signed-off-by: Christoph Lameter <cl@linux.com>

Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>

> ---
>  mm/slub.c |    1 +
>  1 file changed, 1 insertion(+)
>
> Index: linux-2.6/mm/slub.c
> ===================================================================
> --- linux-2.6.orig/mm/slub.c    2012-01-13 10:04:06.000000000 -0600
> +++ linux-2.6/mm/slub.c 2012-01-30 15:51:55.000000000 -0600
> @@ -29,6 +29,7 @@
>  #include <linux/math64.h>
>  #include <linux/fault-inject.h>
>  #include <linux/stacktrace.h>
> +#include <linux/prefetch.h>
>
>  #include <trace/events/kmem.h>

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds
