From: Feng Tang <feng.tang@intel.com>
To: Hyeonggon Yoo <42.hyeyoo@gmail.com>,
	Vlastimil Babka <vbabka@suse.cz>,
	David Rientjes <rientjes@google.com>
Cc: "Sang, Oliver" <oliver.sang@intel.com>,
	Jay Patel <jaypatel@linux.ibm.com>,
	"oe-lkp@lists.linux.dev" <oe-lkp@lists.linux.dev>,
	lkp <lkp@intel.com>, "linux-mm@kvack.org" <linux-mm@kvack.org>,
	"Huang, Ying" <ying.huang@intel.com>,
	"Yin, Fengwei" <fengwei.yin@intel.com>,
	"cl@linux.com" <cl@linux.com>,
	"penberg@kernel.org" <penberg@kernel.org>,
	"rientjes@google.com" <rientjes@google.com>,
	"iamjoonsoo.kim@lge.com" <iamjoonsoo.kim@lge.com>,
	"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
	"vbabka@suse.cz" <vbabka@suse.cz>,
	"aneesh.kumar@linux.ibm.com" <aneesh.kumar@linux.ibm.com>,
	"tsahu@linux.ibm.com" <tsahu@linux.ibm.com>,
	"piyushs@linux.ibm.com" <piyushs@linux.ibm.com>
Subject: Re: [PATCH] [RFC PATCH v2]mm/slub: Optimize slub memory usage
Date: Tue, 25 Jul 2023 17:12:06 +0800
Message-ID: <ZL+R5kJpnHMUgGY2@feng-clx>
In-Reply-To: <CAB=+i9SNS-Z8-WARiivMBy5gibZDCkpS+sk8v+2awvyffAwB8g@mail.gmail.com>

On Tue, Jul 25, 2023 at 12:13:56PM +0900, Hyeonggon Yoo wrote:
[...]
> >
> > I ran the reproduce command on a local 2-socket box:
> >
> > "/usr/bin/hackbench" "-g" "128" "-f" "20" "--process" "-l" "30000" "-s" "100"
> >
> > And found that 2 kmem_caches are heavily stressed: 'kmalloc-cg-512' and
> > 'skbuff_head_cache'. Only the order of 'kmalloc-cg-512' was reduced
> > from 3 to 2 with the patch, while its 'cpu_partial_slabs' was bumped
> > from 2 to 4. The settings of 'skbuff_head_cache' were kept unchanged.
> >
> > And this agrees with the perf-profile info from 0Day's report, namely that
> > the 'list_lock' contention is increased with the patch:
> >
> >     13.71%    13.70%  [kernel.kallsyms]         [k] native_queued_spin_lock_slowpath                            -      -
> > 5.80% native_queued_spin_lock_slowpath;_raw_spin_lock_irqsave;__unfreeze_partials;skb_release_data;consume_skb;unix_stream_read_generic;unix_stream_recvmsg;sock_recvmsg;sock_read_iter;vfs_read;ksys_read;do_syscall_64;entry_SYSCALL_64_after_hwframe;__libc_read
> > 5.56% native_queued_spin_lock_slowpath;_raw_spin_lock_irqsave;get_partial_node.part.0;___slab_alloc.constprop.0;__kmem_cache_alloc_node;__kmalloc_node_track_caller;kmalloc_reserve;__alloc_skb;alloc_skb_with_frags;sock_alloc_send_pskb;unix_stream_sendmsg;sock_write_iter;vfs_write;ksys_write;do_syscall_64;entry_SYSCALL_64_after_hwframe;__libc_write
> 
> Oh... neither of the assumptions was true.
> AFAICS it's a case where decreasing the slab order increases lock contention.
> 
> The number of cached objects per CPU is mostly the same (not exactly the
> same, because the cpu slab is not accounted for),

Yes, this makes sense!
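
(Back-of-envelope, assuming 4KB pages and ignoring per-object metadata
overhead: an order-3 slab holds 32768/512 = 64 objects of
'kmalloc-cg-512' and an order-2 slab holds 16384/512 = 32, so 2 partial
slabs of order 3 and 4 partial slabs of order 2 both cache about 128
objects per CPU.)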

> but the lower order increases the
> number of slabs
> to process when taking slabs (get_partial_node()) and when flushing the
> current cpu partial list (put_cpu_partial() -> __unfreeze_partials()).
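
To make the cost concrete, here is a heavily simplified sketch of the
pattern both paths share (illustrative only, not the actual mainline
code): each slab moved on or off the node partial list is handled under
the per-node 'list_lock', so halving the slab order roughly doubles the
iterations done while the contended lock is held:

	spin_lock_irqsave(&n->list_lock, flags);
	list_for_each_entry_safe(slab, tmp, &n->partial, slab_list) {
		remove_partial(n, slab);  /* one list operation per slab */
		if (++taken >= wanted)    /* half-size slabs => twice the */
			break;            /* iterations for the same objects */
	}
	spin_unlock_irqrestore(&n->list_lock, flags);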
> 
> Can we do better in this situation? Improve __unfreeze_partials()?

We can check that. IMHO, the current MIN_PARTIAL and MAX_PARTIAL are too
small as global parameters, especially for server platforms with
hundreds of GBs or even TBs of memory.
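
For reference, the current values are hardcoded in mm/slub.c (comments
paraphrased from my reading of the source):

	#define MIN_PARTIAL 5	/* partial slabs kept per node, even if empty */
	#define MAX_PARTIAL 10	/* above this, kmem_cache_shrink starts
				 * sorting/discarding partial slabs */

i.e. one fixed pair of limits, regardless of how much memory a node has.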

As for 'list_lock', I'm thinking of bumping the number of per-cpu
objects in set_cpu_partial(), or at least giving users an option to do
that for server platforms with a huge amount of memory. I will do some
tests around it, and let 0Day's performance testing framework monitor
for any regression.
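
For context, set_cpu_partial() currently sizes the per-cpu partial list
purely by object size; a sketch from my reading of mm/slub.c (treat the
exact thresholds as illustrative, and note it is all compiled out
without CONFIG_SLUB_CPU_PARTIAL):

	static void set_cpu_partial(struct kmem_cache *s)
	{
	#ifdef CONFIG_SLUB_CPU_PARTIAL
		unsigned int nr_objects;

		if (!kmem_cache_has_cpu_partial(s))
			nr_objects = 0;		/* e.g. debug caches */
		else if (s->size >= PAGE_SIZE)
			nr_objects = 6;
		else if (s->size >= 1024)
			nr_objects = 24;
		else if (s->size >= 256)
			nr_objects = 52;
		else
			nr_objects = 120;

		slub_set_cpu_partial(s, nr_objects);
	#endif
	}

Nothing here scales with the machine size. And since 'cpu_partial' is
already writable via /sys/kernel/slab/<cache>/cpu_partial, the first
experiments should not even need a rebuild.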

Thanks,
Feng

> 
> > Also, I tried restoring slub_max_order to 3, and the regression was
> > gone.
> >
> >  static unsigned int slub_max_order =
> > -       IS_ENABLED(CONFIG_SLUB_TINY) ? 1 : 2;
> > +       IS_ENABLED(CONFIG_SLUB_TINY) ? 1 : 3;
> >  static unsigned int slub_min_objects;
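
BTW, for anyone else reproducing this: the order can also be restored
without a rebuild by booting with slub_max_order=3 on the kernel
command line.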


Thread overview: 25+ messages
2023-06-28  9:57 [PATCH] [RFC PATCH v2]mm/slub: Optimize slub memory usage Jay Patel
2023-07-03  0:13 ` David Rientjes
2023-07-03  8:39   ` Jay Patel
2023-07-09 14:42   ` Hyeonggon Yoo
2023-07-12 13:06 ` Vlastimil Babka
2023-07-20 10:30   ` Jay Patel
2023-07-17 13:41 ` kernel test robot
2023-07-18  6:43   ` Hyeonggon Yoo
2023-07-20  3:00     ` Oliver Sang
2023-07-20 12:59       ` Hyeonggon Yoo
2023-07-20 13:46         ` Hyeonggon Yoo
2023-07-20 14:15           ` Hyeonggon Yoo
2023-07-24  2:39             ` Oliver Sang
2023-07-31  9:49               ` Hyeonggon Yoo
2023-07-20 13:49         ` Feng Tang
2023-07-20 15:05           ` Hyeonggon Yoo
2023-07-21 14:50             ` Binder Makin
2023-07-21 15:39               ` Hyeonggon Yoo
2023-07-21 18:31                 ` Binder Makin
2023-07-24 14:35             ` Feng Tang
2023-07-25  3:13               ` Hyeonggon Yoo
2023-07-25  9:12                 ` Feng Tang [this message]
2023-08-29  8:30                   ` Feng Tang
2023-07-26 10:06                 ` Vlastimil Babka
2023-08-10 10:38                   ` Jay Patel
