Re: [PATCH RFC] mm+net: allow to set kmem_cache create flag for SLAB_NEVER_MERGE

From: Jesper Dangaard Brouer <jbrouer@redhat.com>
To: Vlastimil Babka <vbabka@suse.cz>,
	netdev@vger.kernel.org, linux-mm@kvack.org
Cc: brouer@redhat.com, Christoph Lameter <cl@linux.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Mel Gorman <mgorman@techsingularity.net>,
	Joonsoo Kim <iamjoonsoo.kim@lge.com>,
	penberg@kernel.org, Jakub Kicinski <kuba@kernel.org>,
	"David S. Miller" <davem@davemloft.net>,
	edumazet@google.com, pabeni@redhat.com,
	Matthew Wilcox <willy@infradead.org>,
	Hyeonggon Yoo <42.hyeyoo@gmail.com>,
	Roman Gushchin <roman.gushchin@linux.dev>,
	David Sterba <dsterba@suse.com>
Subject: Re: [PATCH RFC] mm+net: allow to set kmem_cache create flag for SLAB_NEVER_MERGE
Date: Wed, 31 May 2023 15:59:26 +0200	[thread overview]
Message-ID: <fe64a059-f4c8-5983-3b08-f84552d1ce61@redhat.com> (raw)
In-Reply-To: <81597717-0fed-5fd0-37d0-857d976b9d40@suse.cz>

On 31/05/2023 14.03, Vlastimil Babka wrote:
> On 1/17/23 14:40, Jesper Dangaard Brouer wrote:
>> Allow API users of kmem_cache_create to specify that they don't want
>> any slab merge or aliasing (with similar sized objects). Use this in
>> network stack and kfence_test.
>>
>> The SKB (sk_buff) kmem_cache slab is critical for network performance.
>> Network stack uses kmem_cache_{alloc,free}_bulk APIs to gain
>> performance by amortising the alloc/free cost.
>>
>> For the bulk API to perform efficiently the slub fragmentation need to
>> be low. Especially for the SLUB allocator, the efficiency of bulk free
>> API depend on objects belonging to the same slab (page).
>>
>> When running different network performance microbenchmarks, I started
>> to notice that performance was reduced (slightly) when machines had
>> longer uptimes. I believe the cause was 'skbuff_head_cache' got
>> aliased/merged into the general slub for 256 bytes sized objects (with
>> my kernel config, without CONFIG_HARDENED_USERCOPY).
>>
>> For SKB kmem_cache network stack have reasons for not merging, but it
>> varies depending on kernel config (e.g. CONFIG_HARDENED_USERCOPY).
>> We want to explicitly set SLAB_NEVER_MERGE for this kmem_cache.
>>
>> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> 
> Since this idea was revived by David [1], and neither patch worked as is,
> but yours was more complete and first, I have fixed it up as below. The
> skbuff part itself will be best submitted separately afterwards so we don't
> get conflicts between trees etc. Comments?
> 

Thanks for following up on this! :-)
I like the adjustments, ACKed below.

I'm okay with submitting changes to net/core/skbuff.c separately.

> ----8<----
>  From 485d3f58f3e797306b803102573e7f1367af2ad2 Mon Sep 17 00:00:00 2001
> From: Jesper Dangaard Brouer <brouer@redhat.com>
> Date: Tue, 17 Jan 2023 14:40:00 +0100
> Subject: [PATCH] mm/slab: introduce kmem_cache flag SLAB_NO_MERGE
> 
> Allow API users of kmem_cache_create to specify that they don't want
> any slab merge or aliasing (with similar sized objects). Use this in
> kfence_test.
> 
> The SKB (sk_buff) kmem_cache slab is critical for network performance.
> Network stack uses kmem_cache_{alloc,free}_bulk APIs to gain
> performance by amortising the alloc/free cost.
> 
> For the bulk API to perform efficiently the slub fragmentation need to
> be low. Especially for the SLUB allocator, the efficiency of bulk free
> API depend on objects belonging to the same slab (page).
> 
> When running different network performance microbenchmarks, I started
> to notice that performance was reduced (slightly) when machines had
> longer uptimes. I believe the cause was 'skbuff_head_cache' got
> aliased/merged into the general slub for 256 bytes sized objects (with
> my kernel config, without CONFIG_HARDENED_USERCOPY).
> 
> For SKB kmem_cache network stack have reasons for not merging, but it
> varies depending on kernel config (e.g. CONFIG_HARDENED_USERCOPY).
> We want to explicitly set SLAB_NO_MERGE for this kmem_cache.
> 
> Another use case for the flag has been described by David Sterba [1]:
> 
>> This can be used for more fine grained control over the caches or for
>> debugging builds where separate slabs can verify that no objects leak.
> 
>> The slab_nomerge boot option is too coarse and would need to be
>> enabled on all testing hosts. There are some other ways how to disable
>> merging, e.g. a slab constructor but this disables poisoning besides
>> that it adds additional overhead. Other flags are internal and may
>> have other semantics.
> 
>> A concrete example what motivates the flag. During 'btrfs balance'
>> slab top reported huge increase in caches like
> 
>>   1330095 1330095 100%    0.10K  34105       39    136420K Acpi-ParseExt
>>   1734684 1734684 100%    0.14K  61953       28    247812K pid_namespace
>>   8244036 6873075  83%    0.11K 229001       36    916004K khugepaged_mm_slot
> 
>> which was confusing and that it's because of slab merging was not the
>> first idea.  After rebooting with slab_nomerge all the caches were
>> from btrfs_ namespace as expected.
> 
> [1] https://lore.kernel.org/all/20230524101748.30714-1-dsterba@suse.com/
> 
> [ vbabka@suse.cz: rename to SLAB_NO_MERGE, change the flag value to the
>    one proposed by David so it does not collide with internal SLAB/SLUB
>    flags, write a comment for the flag, expand changelog, drop the skbuff
>    part to be handled spearately ]
> 
> Reported-by: David Sterba <dsterba@suse.com>
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Acked-by: Jesper Dangaard Brouer <brouer@redhat.com

> ---
>   include/linux/slab.h    | 12 ++++++++++++
>   mm/kfence/kfence_test.c |  7 +++----
>   mm/slab.h               |  5 +++--
>   mm/slab_common.c        |  2 +-
>   4 files changed, 19 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index 6b3e155b70bf..72bc906d8bc7 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -106,6 +106,18 @@
>   /* Avoid kmemleak tracing */
>   #define SLAB_NOLEAKTRACE	((slab_flags_t __force)0x00800000U)
>   
> +/*
> + * Prevent merging with compatible kmem caches. This flag should be used
> + * cautiously. Valid use cases:
> + *
> + * - caches created for self-tests (e.g. kunit)
> + * - general caches created and used by a subsystem, only when a
> + *   (subsystem-specific) debug option is enabled
> + * - performance critical caches, should be very rare and consulted with slab
> + *   maintainers, and not used together with CONFIG_SLUB_TINY
> + */
> +#define SLAB_NO_MERGE		((slab_flags_t __force)0x01000000U)
> +
>   /* Fault injection mark */
>   #ifdef CONFIG_FAILSLAB
>   # define SLAB_FAILSLAB		((slab_flags_t __force)0x02000000U)
> diff --git a/mm/kfence/kfence_test.c b/mm/kfence/kfence_test.c
> index 6aee19a79236..9e008a336d9f 100644
> --- a/mm/kfence/kfence_test.c
> +++ b/mm/kfence/kfence_test.c
> @@ -191,11 +191,10 @@ static size_t setup_test_cache(struct kunit *test, size_t size, slab_flags_t fla
>   	kunit_info(test, "%s: size=%zu, ctor=%ps\n", __func__, size, ctor);
>   
>   	/*
> -	 * Use SLAB_NOLEAKTRACE to prevent merging with existing caches. Any
> -	 * other flag in SLAB_NEVER_MERGE also works. Use SLAB_ACCOUNT to
> -	 * allocate via memcg, if enabled.
> +	 * Use SLAB_NO_MERGE to prevent merging with existing caches.
> +	 * Use SLAB_ACCOUNT to allocate via memcg, if enabled.
>   	 */
> -	flags |= SLAB_NOLEAKTRACE | SLAB_ACCOUNT;
> +	flags |= SLAB_NO_MERGE | SLAB_ACCOUNT;
>   	test_cache = kmem_cache_create("test", size, 1, flags, ctor);
>   	KUNIT_ASSERT_TRUE_MSG(test, test_cache, "could not create cache");
>   
> diff --git a/mm/slab.h b/mm/slab.h
> index f01ac256a8f5..9005ddc51cf8 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -294,11 +294,11 @@ static inline bool is_kmalloc_cache(struct kmem_cache *s)
>   #if defined(CONFIG_SLAB)
>   #define SLAB_CACHE_FLAGS (SLAB_MEM_SPREAD | SLAB_NOLEAKTRACE | \
>   			  SLAB_RECLAIM_ACCOUNT | SLAB_TEMPORARY | \
> -			  SLAB_ACCOUNT)
> +			  SLAB_ACCOUNT | SLAB_NO_MERGE)
>   #elif defined(CONFIG_SLUB)
>   #define SLAB_CACHE_FLAGS (SLAB_NOLEAKTRACE | SLAB_RECLAIM_ACCOUNT | \
>   			  SLAB_TEMPORARY | SLAB_ACCOUNT | \
> -			  SLAB_NO_USER_FLAGS | SLAB_KMALLOC)
> +			  SLAB_NO_USER_FLAGS | SLAB_KMALLOC | SLAB_NO_MERGE)
>   #else
>   #define SLAB_CACHE_FLAGS (SLAB_NOLEAKTRACE)
>   #endif
> @@ -319,6 +319,7 @@ static inline bool is_kmalloc_cache(struct kmem_cache *s)
>   			      SLAB_TEMPORARY | \
>   			      SLAB_ACCOUNT | \
>   			      SLAB_KMALLOC | \
> +			      SLAB_NO_MERGE | \
>   			      SLAB_NO_USER_FLAGS)
>   
>   bool __kmem_cache_empty(struct kmem_cache *);
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index 607249785c07..0e0a617eae7d 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -47,7 +47,7 @@ static DECLARE_WORK(slab_caches_to_rcu_destroy_work,
>    */
>   #define SLAB_NEVER_MERGE (SLAB_RED_ZONE | SLAB_POISON | SLAB_STORE_USER | \
>   		SLAB_TRACE | SLAB_TYPESAFE_BY_RCU | SLAB_NOLEAKTRACE | \
> -		SLAB_FAILSLAB | kasan_never_merge())
> +		SLAB_FAILSLAB | SLAB_NO_MERGE | kasan_never_merge())
>   
>   #define SLAB_MERGE_SAME (SLAB_RECLAIM_ACCOUNT | SLAB_CACHE_DMA | \
>   			 SLAB_CACHE_DMA32 | SLAB_ACCOUNT)