From: Nadav Amit <namit@vmware.com>
To: Dave Hansen <dave.hansen@intel.com>
Cc: kernel test robot <oliver.sang@intel.com>,
	Ingo Molnar <mingo@kernel.org>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	LKML <linux-kernel@vger.kernel.org>,
	"lkp@lists.01.org" <lkp@lists.01.org>,
	"lkp@intel.com" <lkp@intel.com>,
	"ying.huang@intel.com" <ying.huang@intel.com>,
	"feng.tang@intel.com" <feng.tang@intel.com>,
	"zhengjun.xing@linux.intel.com" <zhengjun.xing@linux.intel.com>,
	"fengwei.yin@intel.com" <fengwei.yin@intel.com>,
	Andy Lutomirski <luto@kernel.org>
Subject: Re: [x86/mm/tlb] 6035152d8e: will-it-scale.per_thread_ops -13.2% regression
Date: Fri, 18 Mar 2022 03:02:49 +0000	[thread overview]
Message-ID: <DB87059A-66EC-43A6-82C5-2D890F7A02AF@vmware.com> (raw)
In-Reply-To: <d4f62008-faa7-2931-5690-f29f9544b81b@intel.com>



> On Mar 17, 2022, at 5:45 PM, Dave Hansen <dave.hansen@intel.com> wrote:
> 
> On 3/17/22 17:20, Nadav Amit wrote:
>> I don’t have other data right now. Let me run some measurements later
>> tonight. I understand your explanation, but I still do not see how
>> much “later” can the lazy check be that it really matters. Just
>> strange.
> 
> These will-it-scale tests are really brutal.  They're usually sitting in
> really tight kernel entry/exit loops.  Everything is pounding on kernel
> locks and bouncing cachelines around like crazy.  It might only be a few
> thousand cycles between two successive kernel entries.
> 
> Things like the call_single_queue cacheline have to be dragged from
> other CPUs *and* there are locks that you can spin on.  While a thread
> is doing all this spinning, it is forcing more and more threads into the
> lazy TLB state.  The longer you spin, the more threads have entered the
> kernel, contended on the mmap_lock and gone idle.
> 
> Is it really surprising that a loop that can take hundreds of locks can
> take a long time?
> 
>                for_each_cpu(cpu, cfd->cpumask) {
>                        csd_lock(csd);
> 			...
> 		}

Thanks for the clarification. It took me some time to work through it. Yes, my
patch should be reverted.

I think I now see what you are talking about: this loop can take a lot of
time, which I did not appreciate before. I am not sure it is exactly as you
describe (unless I am missing something), so while I think you are right, let
me share my understanding.

So let’s go over what overheads are induced (or not) in the loop:

(1) Contended csd_lock(): csd_lock() is not a cross-core lock. In this
workload, which does not use asynchronous IPIs, it is not contended.
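
For reference, here is a rough user-space sketch of what the csd_lock() /
csd_unlock() handshake boils down to (types, field names and flag values are
illustrative, not the actual kernel/smp.c code): the "lock" is just a flag in
the CSD that the sender sets before queueing the request and the target clears
after servicing it, so the sender only ever waits if its previous request on
that CSD is still in flight.

#include <stdatomic.h>

#define CSD_FLAG_LOCK 0x01u

struct csd {
	_Atomic unsigned int flags;
	void (*func)(void *info);
	void *info;
};

/* Sender side: wait until the previous user of this CSD released it. */
static void csd_lock_wait(struct csd *csd)
{
	while (atomic_load_explicit(&csd->flags, memory_order_acquire) &
	       CSD_FLAG_LOCK)
		;	/* spins only if the CSD is still in flight */
}

static void csd_lock(struct csd *csd)
{
	csd_lock_wait(csd);
	atomic_store_explicit(&csd->flags, CSD_FLAG_LOCK,
			      memory_order_relaxed);
}

/* Target side: after running csd->func(), release the CSD. */
static void csd_unlock(struct csd *csd)
{
	atomic_store_explicit(&csd->flags, 0, memory_order_release);
}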

(2) Cache misses on csd_lock(): I am not sure this really induces overhead,
or at least that it has to.

On one hand, although two CSDs can reside in the same cacheline, we eventually
run csd_lock_wait() to check that our SMP call was served, so the CSD's
cacheline should end up in the IPI initiator's cache (ready for the next
invocation). On the other hand, since csd_lock_wait() performs no write
access, MESI can leave the CSD cacheline in the shared state, and the next
csd_lock() invocation would still require an invalidation.

I would presume that, since there are no data dependencies, the CPU can
speculatively continue execution past csd_lock() even on a cache miss, so I
don’t see it really inducing overhead.

(3) Cacheline misses on llist_add(): this makes perfect sense, and I don’t
see an easy way around it, especially since on x86 FAA and CAS apparently
have similar latency.
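
To make (3) concrete, here is a minimal user-space sketch of the CAS-based
push that llist_add() performs (types and names are simplified, not the
kernel's): whichever atomic primitive is used, the update needs the queue-head
cacheline in exclusive state, so it has to be pulled from whichever CPU
touched the queue last.

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct llist_node {
	struct llist_node *next;
};

struct llist_head {
	_Atomic(struct llist_node *) first;
};

/* Push @node onto @head; returns true if the list was previously empty,
 * i.e. the caller is the one who has to kick the target CPU. */
static bool llist_add_sketch(struct llist_node *node, struct llist_head *head)
{
	struct llist_node *first = atomic_load_explicit(&head->first,
							memory_order_relaxed);

	do {
		node->next = first;
	} while (!atomic_compare_exchange_weak_explicit(&head->first, &first,
							node,
							memory_order_release,
							memory_order_relaxed));
	return first == NULL;
}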

(4) cpu_tlbstate_shared.is_lazy: well, this read is only added back once you
revert the patch.
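
Regarding (4), what the revert adds back is essentially a read of another
CPU's per-CPU lazy flag in the IPI target-selection path. A condensed
user-space stand-in, with made-up names, just to show where the extra remote
cacheline access per candidate CPU comes from:

#include <stdatomic.h>
#include <stdbool.h>

#define NR_CPUS 64

/* Stand-in for per-CPU cpu_tlbstate_shared.is_lazy; in the kernel this
 * lives in each CPU's per-CPU area, so the IPI sender reading it is a
 * remote cacheline access for every candidate CPU in the loop. */
static _Atomic bool cpu_is_lazy[NR_CPUS];

/* The condition the revert restores: skip CPUs in lazy TLB mode, since
 * they will flush when they switch back to the mm anyway. */
static bool tlb_is_not_lazy_sketch(int cpu)
{
	return !atomic_load_explicit(&cpu_is_lazy[cpu],
				     memory_order_relaxed);
}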

One thing to note, although I am not sure it is really relevant here (since
your explanation about cores stuck on mmap_lock makes sense), is that there
are two possible reasons for fewer IPIs:

(a) Skipped shootdowns (i.e., you find remote CPUs to be lazy); and
(b) Saved IPIs (i.e., llist_add() finds that another IPI is already pending;
    see the sketch below).
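
A condensed model of where (b) comes from in the generic SMP code, reusing
llist_add_sketch() from above (send_ipi() is a hypothetical stand-in for the
arch IPI hook, not a real kernel function): only the enqueue that turns the
target's call_single_queue from empty to non-empty raises an interrupt; later
enqueues piggy-back on the IPI that is already on its way.

void send_ipi(int cpu);		/* hypothetical arch hook, declaration only */

static void queue_remote_call(int cpu, struct llist_head *queue,
			      struct llist_node *node)
{
	/* llist_add_sketch() returns true only for the sender that found
	 * the queue empty; every other sender skips the IPI entirely. */
	if (llist_add_sketch(node, queue))
		send_ipi(cpu);
}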

I have some vague ideas on how to shorten the loop in
smp_call_function_many_cond(), but I guess for now the revert is the way to
go.

Thanks for your patience. Yes, revert.
