From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1759941Ab1LPPZp (ORCPT ); Fri, 16 Dec 2011 10:25:45 -0500 Received: from mail-ww0-f44.google.com ([74.125.82.44]:56235 "EHLO mail-ww0-f44.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755826Ab1LPPZj (ORCPT ); Fri, 16 Dec 2011 10:25:39 -0500 Message-ID: <1324049134.25554.29.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC> Subject: [PATCH] slub: prefetch next freelist pointer in slab_alloc() From: Eric Dumazet To: Christoph Lameter Cc: linux-kernel , Pekka Enberg , David Rientjes , "Alex,Shi" , Shaohua Li , Matt Mackall Date: Fri, 16 Dec 2011 16:25:34 +0100 Content-Type: text/plain; charset="UTF-8" X-Mailer: Evolution 3.2.1- Content-Transfer-Encoding: 7bit Mime-Version: 1.0 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Recycling a page is a problem, since freelist link chain is hot on cpu(s) which freed objects, and possibly very cold on cpu currently owning slab. Adding a prefetch of cache line containing the pointer to next object in slab_alloc() helps a lot in many workloads, in particular on assymetric ones (allocations done on one cpu, frees on another cpus). Added cost is three machine instructions only. Examples on my dual socket quad core ht machine (Intel CPU E5540 @2.53GHz) (16 logical cpus, 2 memory nodes), 64bit kernel. Before patch : # perf stat -r 32 hackbench 50 process 4000 >/dev/null Performance counter stats for 'hackbench 50 process 4000' (32 runs): 327577,471718 task-clock # 15,821 CPUs utilized ( +- 0,64% ) 28 866 491 context-switches # 0,088 M/sec ( +- 1,80% ) 1 506 929 CPU-migrations # 0,005 M/sec ( +- 3,24% ) 127 151 page-faults # 0,000 M/sec ( +- 0,16% ) 829 399 813 448 cycles # 2,532 GHz ( +- 0,64% ) 580 664 691 740 stalled-cycles-frontend # 70,01% frontend cycles idle ( +- 0,71% ) 197 431 700 448 stalled-cycles-backend # 23,80% backend cycles idle ( +- 1,03% ) 503 548 648 975 instructions # 0,61 insns per cycle # 1,15 stalled cycles per insn ( +- 0,46% ) 95 780 068 471 branches # 292,389 M/sec ( +- 0,48% ) 1 426 407 916 branch-misses # 1,49% of all branches ( +- 1,35% ) 20,705679994 seconds time elapsed ( +- 0,64% ) After patch : # perf stat -r 32 hackbench 50 process 4000 >/dev/null Performance counter stats for 'hackbench 50 process 4000' (32 runs): 286236,542804 task-clock # 15,786 CPUs utilized ( +- 1,32% ) 19 703 372 context-switches # 0,069 M/sec ( +- 4,99% ) 1 658 249 CPU-migrations # 0,006 M/sec ( +- 6,62% ) 126 776 page-faults # 0,000 M/sec ( +- 0,12% ) 724 636 593 213 cycles # 2,532 GHz ( +- 1,32% ) 499 320 714 837 stalled-cycles-frontend # 68,91% frontend cycles idle ( +- 1,47% ) 156 555 126 809 stalled-cycles-backend # 21,60% backend cycles idle ( +- 2,22% ) 463 897 792 661 instructions # 0,64 insns per cycle # 1,08 stalled cycles per insn ( +- 0,94% ) 87 717 352 563 branches # 306,451 M/sec ( +- 0,99% ) 941 738 280 branch-misses # 1,07% of all branches ( +- 3,35% ) 18,132070670 seconds time elapsed ( +- 1,30% ) Signed-off-by: Eric Dumazet CC: Pekka Enberg CC: Matt Mackall CC: David Rientjes CC: "Alex,Shi" CC: Shaohua Li --- mm/slub.c | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/mm/slub.c b/mm/slub.c index ed3334d..16b850d 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -269,6 +269,11 @@ static inline void *get_freepointer(struct kmem_cache *s, void *object) return *(void **)(object + s->offset); } +static void prefetch_freepointer(const struct kmem_cache *s, void *object) +{ + prefetch(object + s->offset); +} + static inline void *get_freepointer_safe(struct kmem_cache *s, void *object) { void *p; @@ -2292,6 +2297,8 @@ redo: object = __slab_alloc(s, gfpflags, node, addr, c); else { + void *next_object = get_freepointer_safe(s, object); + /* * The cmpxchg will only match if there was no additional * operation and if we are on the right processor. @@ -2307,11 +2314,12 @@ redo: if (unlikely(!irqsafe_cpu_cmpxchg_double( s->cpu_slab->freelist, s->cpu_slab->tid, object, tid, - get_freepointer_safe(s, object), next_tid(tid)))) { + next_object, next_tid(tid)))) { note_cmpxchg_failure("slab_alloc", s, tid); goto redo; } + prefetch_freepointer(s, next_object); stat(s, ALLOC_FASTPATH); }