From: Linus Torvalds <torvalds@linux-foundation.org>
To: Feng Tang <feng.tang@intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	kernel test robot <rong.a.chen@intel.com>,
	Ingo Molnar <mingo@kernel.org>,
	Vince Weaver <vincent.weaver@maine.edu>,
	Jiri Olsa <jolsa@kernel.org>,
	Alexander Shishkin <alexander.shishkin@linux.intel.com>,
	Arnaldo Carvalho de Melo <acme@kernel.org>,
	Arnaldo Carvalho de Melo <acme@redhat.com>,
	"Naveen N. Rao" <naveen.n.rao@linux.vnet.ibm.com>,
	Ravi Bangoria <ravi.bangoria@linux.ibm.com>,
	Stephane Eranian <eranian@google.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	LKML <linux-kernel@vger.kernel.org>,
	lkp@lists.01.org, andi.kleen@intel.com,
	"Huang, Ying" <ying.huang@intel.com>
Subject: Re: [LKP] Re: [perf/x86] 81ec3f3c4c: will-it-scale.per_process_ops -5.5% regression
Date: Mon, 24 Feb 2020 11:24:41 -0800	[thread overview]
Message-ID: <CAHk-=wjkSb1OkiCSn_fzf2v7A=K0bNsUEeQa+06XMhTO+oQUaA@mail.gmail.com> (raw)
In-Reply-To: <20200224021915.GC5061@shbuild999.sh.intel.com>

On Sun, Feb 23, 2020 at 6:19 PM Feng Tang <feng.tang@intel.com> wrote:
>
> > What was it without the alignment?
>
> For 5.0-rc6:
>         ffffffff8225b4c0 d types__ptrace
>         ffffffff8225b4e0 D root_user
>         ffffffff8225b580 D init_user_ns
>
> For 5.0-rc6 + 81ec3f3c4c4
>         ffffffff8225b580 d types__ptrace
>         ffffffff8225b5a0 D root_user
>         ffffffff8225b640 D init_user_ns
>
> The sigpending and __count are in the same cacheline.

Ok, so they used to be 32-byte aligned, and making it 64-byte aligned
changed something.
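
(If we wanted to take the link-order luck out of the equation, the
obvious knob would be to pin the structure to a cacheline boundary.
A purely hypothetical sketch - this is NOT what the kernel does
today, and the field list is trimmed to the v5.0-era layout:

    struct user_struct {
            refcount_t __count;     /* reference count ("lock xadd" hotspot) */
            atomic_t processes;     /* how many processes does this user have? */
            atomic_t sigpending;    /* pending signal count ("lock inc" hotspot) */
            /* ... */
    } ____cacheline_aligned;        /* force every instance onto a 64-byte boundary */

That would at least make the two hot fields land at a fixed place in
the line no matter what the surrounding data sections do.)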

None of it makes any sense, though, since as you say, the two fields
you see having cache movement are still in the same cacheline.

The only difference ends up being whether they are in the first or
second half of the cacheline.

I thought that Cascade Lake ends up having full-cacheline transfers at
all caching levels, though, so even that shouldn't matter.

That said, it's a 2-socket system, so maybe there's something in the
cache transfer between sockets that cares which half of the cacheline
goes first.

Or maybe some critical-word-first logic that is simply buggy (or just
has unfortunate interactions).

I did try your little micro-benchmark on my desktop (small
8-core/16-thread 9900K CPU), just to verify the hotspots.

It does show that the fact that we have *two* atomics is a big deal:
the profiles show __sigqueue_alloc as having about half the cost being
that initial "lock xadd" for the refcount update, and a quarter being
the "lock inc" for the sigpending update.

The sigpending update is cheaper, because clearly the cacheline is
generally on the same core (since we just got it for the refcount).

The dequeuing side isn't quite as clear in the profiles, because the
"lock decl" is in __dequeue_signal(), and then we have that free_uid ->
refcount_dec_and_lock_irqsave() chain down to the 'lock cmpxchg',
which is the combined lock-and-decrement (it's basically
refcount_dec_not_one()).
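
For reference, that chain is essentially (paraphrased from
kernel/user.c of the same era):

    void free_uid(struct user_struct *up)
    {
            unsigned long flags;

            if (!up)
                    return;

            /*
             * The "lock cmpxchg" loop: drop the reference locklessly in
             * the common case, and take uidhash_lock only when the count
             * would actually hit zero.
             */
            if (refcount_dec_and_lock_irqsave(&up->__count, &uidhash_lock, &flags))
                    free_user(up, flags);
    }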

The rest is xsave/restore and the userspace return (which is very
expensive due to the Intel CPU bugs - 30% of all CPU cycles are on
that stupid 'verw').
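
(That 'verw' comes from the MDS mitigation on the exit-to-userspace
path; the whole thing is roughly this, from arch/x86 - the
memory-operand form of VERW is the one documented to flush the
microarchitectural buffers:

    static inline void mds_clear_cpu_buffers(void)
    {
            static const u16 ds = __KERNEL_DS;

            /* Has to be the memory-operand variant of VERW */
            asm volatile("verw %[ds]" : : [ds] "m" (ds) : "cc");
    }

so every return to user space eats that microcoded flush.)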

I'm guessing this might be another reason Cascade Lake shows the
problem: maybe Intel fixed the CPU bug there, and thus the contention
is much more visible because it's no longer being hidden by the
mitigation overhead?

ANYWAY.

Considering that none of the numbers make any sense at all, I think
that what's going on is (WARNING: wild handwaving commences) that this
is just extremely timing-sensitive to just _when_ the cacheline
transfer happens: depending on pure bad luck, you can get into a
situation where a transfer happens between the two locked accesses
(particularly on the dequeuing path, where they aren't right next to
each other), so instead of doing both accesses under the same
cacheline ownership, you get a bounce in between them.

And maybe there is some data transfer path where the cacheline is
transferred as two 32-byte transfers, and if the two words are in the
"high" 32 bytes, it takes longer to get them initially, and then it's
also likelier that you end up losing it again between accesses.

Yeah, if we could harness the energy from that handwaving, we could
probably power a small village.

I don't know. This does not seem to be a particularly serious load.
But it does feel like it should be possible to combine the two atomic
accesses into one, where you don't need to do the refcount thing
except for the case where sigpending goes from zero to non-zero (and
back to zero again).
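
Something along these lines, maybe - a purely hypothetical sketch,
ignoring the 0<->1 transition races that a real version would have to
handle:

    /* enqueue: only the first pending signal pins the user_struct */
    if (atomic_inc_return(&user->sigpending) == 1)
            get_uid(user);

    /* dequeue/free: only the last pending signal drops the pin */
    if (atomic_dec_and_test(&user->sigpending))
            free_uid(user);

That would make the common enqueue/dequeue path a single locked
instruction on the cacheline instead of two.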

But is it worth spending resources on it?

It might be a good idea to ask a hardware person why that 32-byte
cacheline placement might matter on that platform.

Does anybody else have any ideas?

                Linus
