From: Vlastimil Babka <vbabka@suse.cz>
To: Hyeonggon Yoo <42.hyeyoo@gmail.com>, linux-mm@kvack.org
Cc: Christoph Lameter <cl@linux.com>,
	Pekka Enberg <penberg@kernel.org>,
	David Rientjes <rientjes@google.com>,
	Joonsoo Kim <iamjoonsoo.kim@lge.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH v2] mm, slub: Use prefetchw instead of prefetch
Date: Tue, 19 Oct 2021 09:11:54 +0200
Message-ID: <bf496398-d42f-05dc-927d-b4c601bd2d19@suse.cz>
In-Reply-To: <20211011144331.70084-1-42.hyeyoo@gmail.com>

On 10/11/21 16:43, Hyeonggon Yoo wrote:
> commit 0ad9500e16fe ("slub: prefetch next freelist pointer in
> slab_alloc()") introduced prefetch_freepointer() because when other cpu(s)
> have freed objects into a page that the current cpu owns, the freelist link
> is hot on the cpu(s) that freed the objects and possibly very cold on the
> current cpu.
> 
> But if the freelist link chain is hot on the cpu(s) that freed the objects,
> it's better to invalidate their copies of that chain, because they're not
> going to access it again within a short time.
> 
> So use prefetchw instead of prefetch. On supported architectures like x86
> and arm, it invalidates other cached copies of a cache line when
> prefetching it.
> 
> Before:
> 
> Time: 91.677
> 
>  Performance counter stats for 'hackbench -g 100 -l 10000':
>         1462938.07 msec cpu-clock                 #   15.908 CPUs utilized
>           18072550      context-switches          #   12.354 K/sec
>            1018814      cpu-migrations            #  696.416 /sec
>             104558      page-faults               #   71.471 /sec
>      1580035699271      cycles                    #    1.080 GHz                      (54.51%)
>      2003670016013      instructions              #    1.27  insn per cycle           (54.31%)
>         5702204863      branch-misses                                                 (54.28%)
>       643368500985      cache-references          #  439.778 M/sec                    (54.26%)
>        18475582235      cache-misses              #    2.872 % of all cache refs      (54.28%)
>       642206796636      L1-dcache-loads           #  438.984 M/sec                    (46.87%)
>        18215813147      L1-dcache-load-misses     #    2.84% of all L1-dcache accesses  (46.83%)
>       653842996501      dTLB-loads                #  446.938 M/sec                    (46.63%)
>         3227179675      dTLB-load-misses          #    0.49% of all dTLB cache accesses  (46.85%)
>       537531951350      iTLB-loads                #  367.433 M/sec                    (54.33%)
>          114750630      iTLB-load-misses          #    0.02% of all iTLB cache accesses  (54.37%)
>       630135543177      L1-icache-loads           #  430.733 M/sec                    (46.80%)
>        22923237620      L1-icache-load-misses     #    3.64% of all L1-icache accesses  (46.76%)
> 
>       91.964452802 seconds time elapsed
> 
>       43.416742000 seconds user
>     1422.441123000 seconds sys
> 
> After:
> 
> Time: 90.220
> 
>  Performance counter stats for 'hackbench -g 100 -l 10000':
>         1437418.48 msec cpu-clock                 #   15.880 CPUs utilized
>           17694068      context-switches          #   12.310 K/sec
>             958257      cpu-migrations            #  666.651 /sec
>             100604      page-faults               #   69.989 /sec
>      1583259429428      cycles                    #    1.101 GHz                      (54.57%)
>      2004002484935      instructions              #    1.27  insn per cycle           (54.37%)
>         5594202389      branch-misses                                                 (54.36%)
>       643113574524      cache-references          #  447.409 M/sec                    (54.39%)
>        18233791870      cache-misses              #    2.835 % of all cache refs      (54.37%)
>       640205852062      L1-dcache-loads           #  445.386 M/sec                    (46.75%)
>        17968160377      L1-dcache-load-misses     #    2.81% of all L1-dcache accesses  (46.79%)
>       651747432274      dTLB-loads                #  453.415 M/sec                    (46.59%)
>         3127124271      dTLB-load-misses          #    0.48% of all dTLB cache accesses  (46.75%)
>       535395273064      iTLB-loads                #  372.470 M/sec                    (54.38%)
>          113500056      iTLB-load-misses          #    0.02% of all iTLB cache accesses  (54.35%)
>       628871845924      L1-icache-loads           #  437.501 M/sec                    (46.80%)
>        22585641203      L1-icache-load-misses     #    3.59% of all L1-icache accesses  (46.79%)
> 
>       90.514819303 seconds time elapsed
> 
>       43.877656000 seconds user
>     1397.176001000 seconds sys

I wouldn't expect such a noticeable difference. Maybe it would diminish when
repeating the runs and taking the average. But I guess it's at least not
worse with prefetchw, so...
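
For scale, here is a minimal userspace sketch (not part of the patch) that
computes the relative change of a few counters between the two runs quoted
above. The helper names are made up for illustration, and the figures are
copied from the quoted perf output. With a single run on each side this only
quantifies the delta, which comes out on the order of 1 to 2 percent; it says
nothing about run-to-run variance.

```c
#include <assert.h>
#include <stdio.h>

/* Relative change from `before` to `after`, in percent (negative = decrease). */
double rel_change_pct(double before, double after)
{
	assert(before != 0.0);
	return (after - before) / before * 100.0;
}

/* Print one before/after comparison line (hypothetical helper). */
void report(const char *name, double before, double after)
{
	printf("%-24s %+.2f%%\n", name, rel_change_pct(before, after));
}

/* Figures copied from the two perf stat runs quoted above. */
void report_quoted_runs(void)
{
	report("hackbench Time", 91.677, 90.220);
	report("sys seconds", 1422.441123, 1397.176001);
	report("cache-misses", 18475582235.0, 18233791870.0);
	report("L1-dcache-load-misses", 18215813147.0, 17968160377.0);
}
```

Calling report_quoted_runs() from a main() prints one line per counter.
Judging whether the difference is real would need several runs per kernel and
a comparison of the averages against their spread.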

> Link: https://lkml.org/lkml/2021/10/8/598 
> Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  mm/slub.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index 3d2025f7163b..ce3d8b11215c 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -354,7 +354,7 @@ static inline void *get_freepointer(struct kmem_cache *s, void *object)
>  
>  static void prefetch_freepointer(const struct kmem_cache *s, void *object)
>  {
> -	prefetch(object + s->offset);
> +	prefetchw(object + s->offset);
>  }
>  
>  static inline void *get_freepointer_safe(struct kmem_cache *s, void *object)
>
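
To illustrate the idea outside the kernel, here is a rough userspace sketch of
popping objects off an intrusive freelist while prefetching the next link with
write intent. Everything in it (struct free_obj, pop_free(),
build_freelist(), prefetch_for_write()) is hypothetical and only mimics the
shape of the SLUB fast path; it is not the kernel code. It assumes GCC or
Clang, where the second argument of __builtin_prefetch() set to 1 signals
write intent. Whether that becomes an exclusive-state fetch depends on the
target CPU and compiler flags.

```c
#include <stddef.h>

/*
 * Hypothetical free object: the link to the next free object is stored
 * inside the object itself, similar to SLUB keeping it at object + s->offset.
 */
struct free_obj {
	struct free_obj *next;
	char payload[56];
};

/* Write-intent prefetch; GCC/Clang builtin, second argument 1 = "will write". */
static inline void prefetch_for_write(const void *p)
{
	__builtin_prefetch(p, 1);
}

/* Pop the head of the list and prefetch the new head's link for write. */
struct free_obj *pop_free(struct free_obj **head)
{
	struct free_obj *obj = *head;

	if (!obj)
		return NULL;
	*head = obj->next;
	if (*head)
		prefetch_for_write(&(*head)->next);
	return obj;
}

/* Thread n objects from a caller-provided array into a list; returns the head. */
struct free_obj *build_freelist(struct free_obj *objs, size_t n)
{
	for (size_t i = 0; i < n; i++)
		objs[i].next = (i + 1 < n) ? &objs[i + 1] : NULL;
	return n ? &objs[0] : NULL;
}
```

The prefetch is only a hint and does not change what pop_free() returns, so
any effect has to be measured, for example with perf stat as in the quoted
runs.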