From: Dave Hansen <dave.hansen@intel.com>
To: kernel test robot <oliver.sang@intel.com>, Nadav Amit <namit@vmware.com>
Cc: Ingo Molnar <mingo@kernel.org>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	LKML <linux-kernel@vger.kernel.org>,
	lkp@lists.01.org, lkp@intel.com, ying.huang@intel.com,
	feng.tang@intel.com, zhengjun.xing@linux.intel.com,
	fengwei.yin@intel.com
Subject: Re: [x86/mm/tlb] 6035152d8e: will-it-scale.per_thread_ops -13.2% regression
Date: Thu, 17 Mar 2022 11:38:46 -0700	[thread overview]
Message-ID: <c85ae95a-6603-ca0d-a653-b3f2f7069e20@intel.com> (raw)
In-Reply-To: <20220317090415.GE735@xsang-OptiPlex-9020>

On 3/17/22 02:04, kernel test robot wrote:
> FYI, we noticed a -13.2% regression of will-it-scale.per_thread_ops due to commit:
...
> commit: 6035152d8eebe16a5bb60398d3e05dc7799067b0 ("x86/mm/tlb: Open-code on_each_cpu_cond_mask() for tlb_is_not_lazy()")
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
...
>      24.77 ±  2%      +8.1       32.86 ±  3%  perf-profile.self.cycles-pp.llist_add_batch


tl;dr: This commit made the tlb_is_not_lazy() check happen earlier.
That earlier check can miss threads _going_ lazy because of mmap_lock
contention.  Fewer lazy threads means more IPIs and lower performance.

===

There's a lot of noise in that profile, but I filtered most of it out.
The main thing is that, somehow, the llist_add() in
smp_call_function_many_cond() got more expensive.  Either we're doing
more of them or the cacheline is bouncing around more.

Turns out that we're sending *more* IPIs with this patch applied than
without.  That shouldn't happen since the old code did the same exact
logical check:

	if (cond_func && !cond_func(cpu, info))
		continue;

and the new code does:

	if (tlb_is_not_lazy(cpu))
		...

where cond_func==tlb_is_not_lazy.

So, what's the difference?  Timing.  With the old scheme, if a CPU
enters lazy mode between native_flush_tlb_others() and
the loop in smp_call_function_many_cond(), it won't get an IPI and won't
need to do the llist_add().

I stuck some printk()s in there and can confirm that the
earlier-calculated mask always seems to have more bits set, at least
when running will-it-scale tests that induce TLB flush IPIs.

I was kinda surprised that there were so many threads going idle with a
cpu-eating micro like this.  But, it makes sense since they're
contending on mmap_lock.  Basically, since TLB-flushing operations like
mmap() hold mmap_lock for write they tend to *force* other threads into
idle.  Idle threads are lazy and they tend to _become_ lazy around the
time that the flushing starts.

This new "early lazy check" behavior could theoretically work both ways.
If threads tended to be waking up from idle when TLB flushes were being
sent, this would tend to reduce the number of IPIs.  But, since they
tend to be going to sleep, it increases the number of IPIs.

Anybody have a better theory?  I think we should probably revert the commit.

  reply	other threads:[~2022-03-17 18:39 UTC|newest]

Thread overview: 22+ messages
2022-03-17  9:04 [x86/mm/tlb] 6035152d8e: will-it-scale.per_thread_ops -13.2% regression kernel test robot
2022-03-17 18:38 ` Dave Hansen [this message]
2022-03-17 19:02   ` Nadav Amit
2022-03-17 19:11     ` Dave Hansen
2022-03-17 20:32       ` Nadav Amit
2022-03-17 20:49         ` Dave Hansen
2022-03-18  2:56           ` Oliver Sang
2022-03-18  0:16         ` Dave Hansen
2022-03-18  0:20           ` Nadav Amit
2022-03-18  0:45             ` Dave Hansen
2022-03-18  3:02               ` Nadav Amit
