From: Chengming Zhou <zhouchengming@bytedance.com>
To: Vlastimil Babka <vbabka@suse.cz>,
	David Rientjes <rientjes@google.com>,
	Jianfeng Wang <jianfeng.w.wang@oracle.com>
Cc: cl@linux.com, penberg@kernel.org, iamjoonsoo.kim@lge.com,
	akpm@linux-foundation.org, roman.gushchin@linux.dev,
	42.hyeyoo@gmail.com, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH] slub: avoid scanning all partial slabs in get_slabinfo()
Date: Mon, 19 Feb 2024 17:29:02 +0800	[thread overview]
Message-ID: <5cf40e33-d1ae-4ac9-9d01-559b86f853a8@bytedance.com> (raw)
In-Reply-To: <fee76a21-fbc5-4ad8-b4bf-ba8a8e7cee8f@suse.cz>

On 2024/2/19 16:30, Vlastimil Babka wrote:
> On 2/18/24 20:25, David Rientjes wrote:
>> On Thu, 15 Feb 2024, Jianfeng Wang wrote:
>>
>>> When reading "/proc/slabinfo", the kernel needs to report the number of
>>> free objects for each kmem_cache. The current implementation relies on
>>> count_partial(), which counts the number of free objects by scanning each
>>> kmem_cache_node's partial slab list and summing free objects from all
>>> partial slabs in the list. This process must hold the per-kmem_cache_node
>>> spinlock with IRQs disabled. Consequently, it can block slab allocation
>>> requests on other CPU cores and cause timeouts for network devices etc.,
>>> if the partial slab list is long. In production, even the NMI watchdog can
>>> be triggered because some slab caches have a long partial list: e.g.,
>>> for "buffer_head", the number of partial slabs was observed to be ~1M
>>> in one kmem_cache_node. This problem was also observed by several

Not sure if this situation is normal? The partial list may be very fragmented, right?

SLUB completely depends on timing order to place partial slabs on the node
list, which may be suboptimal in some cases. Maybe we could introduce an
anti-fragmentation mechanism like the fullness grouping in zsmalloc, with
multiple lists based on fullness? Just some random thoughts... :)
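
To make that concrete, here is a hypothetical sketch (invented names, not
from any existing patch) of what fullness grouping could look like for the
node partial lists, similar to zsmalloc's fullness groups:

#define NR_FULLNESS_GROUPS	4

struct kmem_cache_node_grouped {
	spinlock_t	list_lock;
	/* partial[0] = almost empty ... partial[3] = almost full */
	struct list_head partial[NR_FULLNESS_GROUPS];
	unsigned long	nr_partial;
};

/* map a slab's inuse/objects ratio to one of the fullness buckets */
static unsigned int slab_fullness_group(unsigned int inuse,
					unsigned int objects)
{
	return inuse * NR_FULLNESS_GROUPS / (objects + 1);
}

Allocation could then prefer the fuller buckets, so nearly-empty slabs
drain and get freed back to the page allocator, which would also keep each
list shorter.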

>>> others [1-2] in the past.
>>>
>>> The fix is to maintain a counter of free objects for each kmem_cache.
>>> Then, in get_slabinfo(), use the counter rather than count_partial()
>>> when reporting the number of free objects for a slab cache. A per-cpu
>>> counter is used to minimize atomic or locking operations.
>>>
>>> Benchmark: run hackbench on a dual-socket 72-CPU bare metal machine
>>> with 256 GB memory and Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.3 GHz.
>>> The command is "hackbench 18 thread 20000". Each group gets 10 runs.
>>>
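
As a rough illustration of that approach (not the actual patch; the
free_objects field and helper names below are invented), using the
kernel's generic percpu_counter API:

#include <linux/percpu_counter.h>

/*
 * Imagined extra state on struct kmem_cache:
 *	struct percpu_counter free_objects;
 */

/* alloc/free paths: touches only a CPU-local count in the common case */
static inline void note_free_objects(struct kmem_cache *s, s64 delta)
{
	percpu_counter_add(&s->free_objects, delta);
}

/* get_slabinfo(): approximate sum, no partial list walk, no list_lock */
static unsigned long count_free_objects(struct kmem_cache *s)
{
	return percpu_counter_sum_positive(&s->free_objects);
}

The per-cpu batching is what keeps the fast-path cost low, at the price of
the reported sum being approximate.
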
>>
>> This seems particularly intrusive for the common path to optimize for 
>> reading of /proc/slabinfo, and that's shown in the benchmark result.
>>
>> Could you discuss the /proc/slabinfo usage model a bit?  It's not clear if 
>> this is being continuously read, or whether even a single read in 
>> isolation is problematic.
>>
>> That said, optimizing for reading /proc/slabinfo at the cost of runtime 
>> performance degradation doesn't sound like the right trade-off.
> 
> It should be possible to make this overhead smaller by restricting the
> counter only to partial list slabs, as [2] did. This would keep it out of
> the fast paths, where it's really not acceptable.
> Note [2] used atomic_long_t and the percpu counters used here should be
> lower overhead. So basically try to get the best of both attempts.

Right, the current count_partial() also only iterates over slabs on the
node partial list; it doesn't include slabs on the cpu partial lists. So
this new percpu counter should likewise only count slabs on the node
partial list. Then the overhead should be lower.
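
Roughly like this sketch (the partial_free_objs percpu counter is invented
here; add_partial()/remove_partial() in mm/slub.c would call these, and
both already run under n->list_lock, off the fast paths):

static inline void partial_counter_add(struct kmem_cache_node *n,
				       struct slab *slab)
{
	lockdep_assert_held(&n->list_lock);
	percpu_counter_add(&n->partial_free_objs,
			   slab->objects - slab->inuse);
}

static inline void partial_counter_sub(struct kmem_cache_node *n,
				       struct slab *slab)
{
	lockdep_assert_held(&n->list_lock);
	percpu_counter_sub(&n->partial_free_objs,
			   slab->objects - slab->inuse);
}

get_slabinfo() could then sum the per-node counters instead of walking the
lists. One caveat: objects can still be freed into a slab while it sits on
the partial list, so the counter would drift unless those paths update it
too; how much inaccuracy is acceptable is part of the trade-off.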

