From: Xunlei Pang <>
To: Vlastimil Babka <>,
	Xunlei Pang <>,
	Christoph Lameter <>,
	Christoph Lameter <>,
	Pekka Enberg <>, Roman Gushchin <>,
	Konstantin Khlebnikov <>,
	David Rientjes <>,
	Matthew Wilcox <>,
	Shu Ming <>,
	Andrew Morton <>
	Wen Yang <>,
	James Wang <>
Subject: Re: [PATCH v4 1/3] mm/slub: Introduce two counters for partial objects
Date: Thu, 18 Mar 2021 12:52:15 +0800
Message-ID: <>
In-Reply-To: <>

On 3/18/21 2:45 AM, Vlastimil Babka wrote:
> On 3/17/21 8:54 AM, Xunlei Pang wrote:
>> The node list_lock in count_partial() spends long time iterating
>> in case of large amount of partial page lists, which can cause
>> a thundering herd effect on list_lock contention.
>> We have HSF RT(High-speed Service Framework Response-Time) monitors,
>> the RT figures fluctuated randomly, then we deployed a tool detecting
>> "irq off" and "preempt off" to dump the culprit's calltrace, capturing
>> the list_lock cost nearly 100ms with irq off issued by "ss", this also
>> caused network timeouts.
>> This patch introduces two counters to maintain the actual number
>> of partial objects dynamically instead of iterating the partial
>> page lists with list_lock held.
>> New counters of kmem_cache_node: partial_free_objs, partial_total_objs.
>> The main operations are under list_lock in slow path, its performance
>> impact is expected to be minimal except the __slab_free() path.
>> The only concern of introducing partial counter is that partial_free_objs
>> may cause cacheline contention and false sharing issues in case of same
>> SLUB concurrent __slab_free(), so define it to be a percpu counter and
>> place it carefully.
> Hm I wonder, is it possible that this will eventually overflow/underflow the
> counter on some CPU? (I guess practically only on 32bit). Maybe the operations
> that are already done under n->list_lock should flush the percpu counter to a
> shared counter?

You are right, thanks a lot for noticing this.
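
To make the suggestion concrete, here is a minimal userspace sketch of
folding a per-CPU delta into a shared counter from a path that already
holds the lock. It is only an illustration under stated assumptions, not
the code from the patch: struct node_counter, counter_add(),
counter_flush_locked(), counter_read() and the fixed NR_CPUS array are
hypothetical names, the array stands in for a real percpu variable, and
the model is single-threaded so it ignores the races a kernel version
would have to handle.

```c
#include <assert.h>

#define NR_CPUS 4 /* hypothetical CPU count for this sketch */

struct node_counter {
	long shared;         /* would be protected by n->list_lock */
	long delta[NR_CPUS]; /* stand-in for a percpu variable */
};

/* Lockless path (e.g. a free): only this CPU's delta is touched. */
static void counter_add(struct node_counter *c, int cpu, long v)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	c->delta[cpu] += v;
}

/*
 * Slow path, caller is assumed to hold the list lock: fold this CPU's
 * delta into the shared counter so the delta cannot grow without bound.
 */
static void counter_flush_locked(struct node_counter *c, int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	c->shared += c->delta[cpu];
	c->delta[cpu] = 0;
}

/* Reader: shared value plus whatever has not been flushed yet. */
static long counter_read(const struct node_counter *c)
{
	long sum = c->shared;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		sum += c->delta[cpu];
	return sum;
}
```

In this model the total returned by counter_read() is the same before
and after a flush; the flush only moves the value out of the per-CPU
slot, which is what keeps an individual slot away from overflow or
underflow.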

> ...
>> @@ -3039,6 +3066,13 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
>>  		head, new.counters,
>>  		"__slab_free"));
>> +	if (!was_frozen && prior) {
>> +		if (n)
>> +			__update_partial_free(n, cnt);
>> +		else
>> +			__update_partial_free(get_node(s, page_to_nid(page)), cnt);
>> +	}
> I would guess this is the part that makes your measurements notice that
> (although tiny) difference. We didn't need to obtain the node pointer before and
> now we do. And that is really done just for the per-node breakdown in "objects"
> and "objects_partial" files under /sys/kernel/slab - distinguishing nodes is not
> needed for /proc/slabinfo. So that kinda justifies putting this under a new
> CONFIG as you did. Although perhaps somebody interested in these kind of stats
> would enable CONFIG_SLUB_STATS anyway, so that's still an option to use instead
> of introducing a new oddly specific CONFIG? At least until somebody comes up and
> presents an use case where they want the per-node breakdowns in /sys but cannot
> But I'm also still thinking about simply counting all free objects (for the
> purposes of accurate <active_objs> in /proc/slabinfo) as a percpu variable in
> struct kmem_cache itself. That would basically put this_cpu_add() in all the
> fast paths, but AFAICS thanks to the segment register it doesn't mean disabling
> interrupts nor a LOCK operation, so maybe it wouldn't be that bad? And it
> shouldn't need to deal with these node pointers. So maybe that would be
> acceptable for CONFIG_SLUB_DEBUG? Guess I'll have to try...

The percpu operation itself should be fine. The cost looks like a
cacheline ping-pong issue caused by the extra percpu counter access, so
making the counter cacheline aligned helps a little according to my tests.
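
As a rough illustration of what cacheline alignment buys, the userspace
sketch below places a counter at the start of its own cache line so that
updates to it do not share a line with a neighbouring hot field. The
struct layout, the field nr_partial as the neighbour, and the 64-byte
line size are assumptions for the example, not the layout used in the
patch; in the kernel the equivalent annotation would be
____cacheline_aligned_in_smp rather than C11 alignas.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define CACHELINE_SIZE 64 /* assumption: 64-byte cache lines */

struct node_stats {
	long nr_partial; /* hypothetical hot field updated under the lock */
	/* alignas pushes the counter to the start of the next cache line */
	alignas(CACHELINE_SIZE) long partial_free_objs;
};

/* Compile-time check that the counter really starts a cache line. */
static_assert(offsetof(struct node_stats, partial_free_objs) %
	      CACHELINE_SIZE == 0,
	      "counter must start on a cache line boundary");

/* Byte offset of the counter inside the struct. */
static size_t counter_offset(void)
{
	return offsetof(struct node_stats, partial_free_objs);
}
```

The price is padding: with these assumptions the struct grows to two
full cache lines even though it only holds two longs.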

Thread overview: 11+ messages
2021-03-17  7:54 [PATCH v4 0/3] mm/slub: Fix count_partial() problem Xunlei Pang
2021-03-17  7:54 ` [PATCH v4 1/3] mm/slub: Introduce two counters for partial objects Xunlei Pang
2021-03-17 18:45   ` Vlastimil Babka
2021-03-18  4:52     ` Xunlei Pang [this message]
2021-03-18 12:18   ` Vlastimil Babka
2021-03-18 12:56     ` Xunlei Pang
2021-03-22  1:46       ` Shu Ming
2021-03-22 10:22         ` Vlastimil Babka
2021-03-29  1:58           ` Shu Ming
2021-03-17  7:54 ` [PATCH v4 2/3] percpu: Export per_cpu_sum() Xunlei Pang
2021-03-17  7:54 ` [PATCH v4 3/3] mm/slub: Get rid of count_partial() Xunlei Pang
