* [PATCH] x86,switch_mm: skip atomic operations for init_mm
@ 2018-06-01 12:28 Rik van Riel
  2018-06-01 15:11 ` Andy Lutomirski
  0 siblings, 1 reply; 18+ messages in thread
From: Rik van Riel @ 2018-06-01 12:28 UTC (permalink / raw)
  To: linux-kernel; +Cc: Song Liu, kernel-team, mingo, luto, tglx, x86

Song noticed switch_mm_irqs_off taking a lot of CPU time in recent
kernels, using 2.4% of a 48 CPU system during a netperf-to-localhost run.
Digging into the profile, we noticed that cpumask_clear_cpu and
cpumask_set_cpu together account for about half of the CPU time spent in
switch_mm_irqs_off.

However, the CPUs running netperf end up switching back and forth
between netperf and the idle task, which does not require changes
to the mm_cpumask. Furthermore, the init_mm cpumask ends up being
the most heavily contended one in the system.

Skipping cpumask_clear_cpu and cpumask_set_cpu for init_mm
(mostly the idle task) reduced the CPU use of switch_mm_irqs_off
from 2.4% to 1.9% of the CPU, with the following netperf
command line:

./super_netperf 300 -P 0 -t TCP_RR -p 8888 -H kerneltest008.09.atn1 -l 30 \
     -- -r 300,300 -o -s 1M,1M -S 1M,1M

perf output w/o this patch:
    1.26%  netserver        [kernel.vmlinux]          [k] switch_mm_irqs_off
    1.17%  swapper          [kernel.vmlinux]          [k] switch_mm_irqs_off

perf output w/ this patch:
    1.01%  swapper          [kernel.vmlinux]          [k] switch_mm_irqs_off
    0.88%  netserver        [kernel.vmlinux]          [k] switch_mm_irqs_off

Netperf throughput is about the same before and after.

Signed-off-by: Rik van Riel <riel@surriel.com>
Reported-and-tested-by: Song Liu <songliubraving@fb.com>
---
 arch/x86/mm/tlb.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index e055d1a06699..c8f9c550f7ec 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -288,12 +288,14 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 		/* Stop remote flushes for the previous mm */
 		VM_WARN_ON_ONCE(!cpumask_test_cpu(cpu, mm_cpumask(real_prev)) &&
 				real_prev != &init_mm);
-		cpumask_clear_cpu(cpu, mm_cpumask(real_prev));
+		if (real_prev != &init_mm)
+			cpumask_clear_cpu(cpu, mm_cpumask(real_prev));
 
 		/*
 		 * Start remote flushes and then read tlb_gen.
 		 */
-		cpumask_set_cpu(cpu, mm_cpumask(next));
+		if (next != &init_mm)
+			cpumask_set_cpu(cpu, mm_cpumask(next));
 		next_tlb_gen = atomic64_read(&next->context.tlb_gen);
 
 		choose_new_asid(next, next_tlb_gen, &new_asid, &need_flush);


* Re: [PATCH] x86,switch_mm: skip atomic operations for init_mm
  2018-06-01 12:28 [PATCH] x86,switch_mm: skip atomic operations for init_mm Rik van Riel
@ 2018-06-01 15:11 ` Andy Lutomirski
  2018-06-01 18:22   ` Rik van Riel
  0 siblings, 1 reply; 18+ messages in thread
From: Andy Lutomirski @ 2018-06-01 15:11 UTC (permalink / raw)
  To: riel
  Cc: LKML, songliubraving, kernel-team, Ingo Molnar,
	Andrew Lutomirski, Thomas Gleixner, X86 ML, Peter Zijlstra,
	Mike Galbraith

On Fri, Jun 1, 2018 at 5:28 AM Rik van Riel <riel@surriel.com> wrote:
>
> Song noticed switch_mm_irqs_off taking a lot of CPU time in recent
> kernels, using 2.4% of a 48 CPU system during a netperf to localhost run.
> Digging into the profile, we noticed that cpumask_clear_cpu and
> cpumask_set_cpu together take about half of the CPU time taken by
> switch_mm_irqs_off.
>
> However, the CPUs running netperf end up switching back and forth
> between netperf and the idle task, which does not require changes
> to the mm_cpumask. Furthermore, the init_mm cpumask ends up being
> the most heavily contended one in the system.
>
> Skipping cpumask_clear_cpu and cpumask_set_cpu for init_mm
> (mostly the idle task) reduced CPU use of switch_mm_irqs_off
> from 2.4% of the CPU to 1.9% of the CPU, with the following
> netperf commandline:

I'm conceptually fine with this change.  Does mm_cpumask(&init_mm) end
up in a deterministic state?

Mike, depending on exactly what's going on with your benchmark, this
might help recover a bit of your performance, too.

--Andy


* Re: [PATCH] x86,switch_mm: skip atomic operations for init_mm
  2018-06-01 15:11 ` Andy Lutomirski
@ 2018-06-01 18:22   ` Rik van Riel
  2018-06-01 18:48     ` Mike Galbraith
  0 siblings, 1 reply; 18+ messages in thread
From: Rik van Riel @ 2018-06-01 18:22 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: LKML, songliubraving, kernel-team, Ingo Molnar, Thomas Gleixner,
	X86 ML, Peter Zijlstra, Mike Galbraith


On Fri, 2018-06-01 at 08:11 -0700, Andy Lutomirski wrote:
> On Fri, Jun 1, 2018 at 5:28 AM Rik van Riel <riel@surriel.com> wrote:
> > 
> > Song noticed switch_mm_irqs_off taking a lot of CPU time in recent
> > kernels, using 2.4% of a 48 CPU system during a netperf to localhost
> > run.
> > Digging into the profile, we noticed that cpumask_clear_cpu and
> > cpumask_set_cpu together take about half of the CPU time taken by
> > switch_mm_irqs_off.
> > 
> > However, the CPUs running netperf end up switching back and forth
> > between netperf and the idle task, which does not require changes
> > to the mm_cpumask. Furthermore, the init_mm cpumask ends up being
> > the most heavily contended one in the system.
> > 
> > Skipping cpumask_clear_cpu and cpumask_set_cpu for init_mm
> > (mostly the idle task) reduced CPU use of switch_mm_irqs_off
> > from 2.4% of the CPU to 1.9% of the CPU, with the following
> > netperf commandline:
> 
> I'm conceptually fine with this change.  Does mm_cpumask(&init_mm)
> end
> up in a deterministic state?

Given that we do not touch mm_cpumask(&init_mm)
any more, and that bitmask never appears to be
used for things like tlb shootdowns (kernel TLB
shootdowns simply go to everybody), I suspect
it ends up in whatever state it is initialized
to on startup.

I had not looked into this much, because it does
not appear to be used for anything.
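
As an illustration of why: kernel-range flushes are
simply broadcast to every online CPU and never
consult mm_cpumask(&init_mm). A simplified sketch
(not the exact code from arch/x86/mm/tlb.c, and the
function names here are made up):

static void flush_one_cpu(void *unused)
{
	/* Flush this CPU's TLB, kernel mappings included. */
	__flush_tlb_all();
}

static void flush_kernel_tlbs_everywhere(void)
{
	/* on_each_cpu() IPIs every online CPU; no mm_cpumask lookup. */
	on_each_cpu(flush_one_cpu, NULL, 1);
}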

> Mike, depending on exactly what's going on with your benchmark, this
> might help recover a bit of your performance, too.

It will be interesting to know how this change
impacts others.

-- 
All Rights Reversed.



* Re: [PATCH] x86,switch_mm: skip atomic operations for init_mm
  2018-06-01 18:22   ` Rik van Riel
@ 2018-06-01 18:48     ` Mike Galbraith
  2018-06-01 19:43       ` Rik van Riel
  0 siblings, 1 reply; 18+ messages in thread
From: Mike Galbraith @ 2018-06-01 18:48 UTC (permalink / raw)
  To: Rik van Riel, Andy Lutomirski
  Cc: LKML, songliubraving, kernel-team, Ingo Molnar, Thomas Gleixner,
	X86 ML, Peter Zijlstra

On Fri, 2018-06-01 at 14:22 -0400, Rik van Riel wrote:
> On Fri, 2018-06-01 at 08:11 -0700, Andy Lutomirski wrote:
> > On Fri, Jun 1, 2018 at 5:28 AM Rik van Riel <riel@surriel.com> wrote:
> > > 
> > > Song noticed switch_mm_irqs_off taking a lot of CPU time in recent
> > > kernels, using 2.4% of a 48 CPU system during a netperf to localhost
> > > run.
> > > Digging into the profile, we noticed that cpumask_clear_cpu and
> > > cpumask_set_cpu together take about half of the CPU time taken by
> > > switch_mm_irqs_off.
> > > 
> > > However, the CPUs running netperf end up switching back and forth
> > > between netperf and the idle task, which does not require changes
> > > to the mm_cpumask. Furthermore, the init_mm cpumask ends up being
> > > the most heavily contended one in the system.
> > > 
> > > Skipping cpumask_clear_cpu and cpumask_set_cpu for init_mm
> > > (mostly the idle task) reduced CPU use of switch_mm_irqs_off
> > > from 2.4% of the CPU to 1.9% of the CPU, with the following
> > > netperf commandline:
> > 
> > I'm conceptually fine with this change.  Does mm_cpumask(&init_mm)
> > end
> > up in a deterministic state?
> 
> Given that we do not touch mm_cpumask(&init_mm)
> any more, and that bitmask never appears to be
> used for things like tlb shootdowns (kernel TLB
> shootdowns simply go to everybody), I suspect
> it ends up in whatever state it is initialized
> to on startup.
> 
> I had not looked into this much, because it does
> not appear to be used for anything.
> 
> > Mike, depending on exactly what's going on with your benchmark, this
> > might help recover a bit of your performance, too.
> 
> It will be interesting to know how this change
> impacts others.

previous pipe-test numbers
4.13.16         2.024978 usecs/loop -- avg 2.045250 977.9 KHz
4.14.47         2.234518 usecs/loop -- avg 2.227716 897.8 KHz
4.15.18         2.287815 usecs/loop -- avg 2.295858 871.1 KHz
4.16.13         2.286036 usecs/loop -- avg 2.279057 877.6 KHz
4.17.0.g88a8676 2.288231 usecs/loop -- avg 2.288917 873.8 KHz

new numbers
4.17.0.g0512e01 2.268629 usecs/loop -- avg 2.269493 881.3 KHz
4.17.0.g0512e01 2.035401 usecs/loop -- avg 2.038341 981.2 KHz +andy
4.17.0.g0512e01 2.238701 usecs/loop -- avg 2.231828 896.1 KHz -andy+rik

There might be something there with your change, Rik, but it's small
enough to be wary of variance.  Andy's "invert the return of
tlb_defer_switch_to_init_mm()" is OTOH pretty clear.
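
For reference, the heuristic in question is quoted in full in
Rik's patch further down the thread; assuming "invert the return"
just means negating the existing PCID check, the variant tested
here would be roughly:

static inline bool tlb_defer_switch_to_init_mm(void)
{
	/* Inverted: with PCID, defer the switch to init_mm (stay lazy). */
	return static_cpu_has(X86_FEATURE_PCID);
}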

	-Mike


* Re: [PATCH] x86,switch_mm: skip atomic operations for init_mm
  2018-06-01 18:48     ` Mike Galbraith
@ 2018-06-01 19:43       ` Rik van Riel
  2018-06-01 20:03         ` Andy Lutomirski
  0 siblings, 1 reply; 18+ messages in thread
From: Rik van Riel @ 2018-06-01 19:43 UTC (permalink / raw)
  To: Mike Galbraith, Andy Lutomirski
  Cc: LKML, songliubraving, kernel-team, Ingo Molnar, Thomas Gleixner,
	X86 ML, Peter Zijlstra


On Fri, 2018-06-01 at 20:48 +0200, Mike Galbraith wrote:
> On Fri, 2018-06-01 at 14:22 -0400, Rik van Riel wrote:
> > On Fri, 2018-06-01 at 08:11 -0700, Andy Lutomirski wrote:
> > > On Fri, Jun 1, 2018 at 5:28 AM Rik van Riel <riel@surriel.com>
> > > wrote:
> > > > 
> > > > Song noticed switch_mm_irqs_off taking a lot of CPU time in
> > > > recent
> > > > kernels, using 2.4% of a 48 CPU system during a netperf to
> > > > localhost
> > > > run.
> > > > Digging into the profile, we noticed that cpumask_clear_cpu and
> > > > cpumask_set_cpu together take about half of the CPU time taken
> > > > by
> > > > switch_mm_irqs_off.
> > > > 
> > > > However, the CPUs running netperf end up switching back and
> > > > forth
> > > > between netperf and the idle task, which does not require
> > > > changes
> > > > to the mm_cpumask. Furthermore, the init_mm cpumask ends up
> > > > being
> > > > the most heavily contended one in the system.
> > > > 
> > > > Skipping cpumask_clear_cpu and cpumask_set_cpu for init_mm
> > > > (mostly the idle task) reduced CPU use of switch_mm_irqs_off
> > > > from 2.4% of the CPU to 1.9% of the CPU, with the following
> > > > netperf commandline:
> > > 
> > > I'm conceptually fine with this change.  Does
> > > mm_cpumask(&init_mm)
> > > end
> > > up in a deterministic state?
> > 
> > Given that we do not touch mm_cpumask(&init_mm)
> > any more, and that bitmask never appears to be
> > used for things like tlb shootdowns (kernel TLB
> > shootdowns simply go to everybody), I suspect
> > it ends up in whatever state it is initialized
> > to on startup.
> > 
> > I had not looked into this much, because it does
> > not appear to be used for anything.
> > 
> > > Mike, depending on exactly what's going on with your benchmark,
> > > this
> > > might help recover a bit of your performance, too.
> > 
> > It will be interesting to know how this change
> > impacts others.
> 
> previous pipe-test numbers
> 4.13.16         2.024978 usecs/loop -- avg 2.045250 977.9 KHz
> 4.14.47         2.234518 usecs/loop -- avg 2.227716 897.8 KHz
> 4.15.18         2.287815 usecs/loop -- avg 2.295858 871.1 KHz
> 4.16.13         2.286036 usecs/loop -- avg 2.279057 877.6 KHz
> 4.17.0.g88a8676 2.288231 usecs/loop -- avg 2.288917 873.8 KHz
> 
> new numbers
> 4.17.0.g0512e01 2.268629 usecs/loop -- avg 2.269493 881.3 KHz
> 4.17.0.g0512e01 2.035401 usecs/loop -- avg 2.038341 981.2 KHz +andy
> 4.17.0.g0512e01 2.238701 usecs/loop -- avg 2.231828 896.1 KHz
> -andy+rik
> 
> There might be something there with your change Rik, but it's small
> enough to be wary of variance.  Andy's "invert the return of
> tlb_defer_switch_to_init_mm()" is OTOH pretty clear.

If inverting the return value of that function helps
some systems, chances are the other value might help
other systems.

That makes you wonder whether it might make sense
to always switch to lazy TLB mode, and only call
switch_mm at TLB flush time, regardless of whether
the CPU supports PCID...

-- 
All Rights Reversed.



* Re: [PATCH] x86,switch_mm: skip atomic operations for init_mm
  2018-06-01 19:43       ` Rik van Riel
@ 2018-06-01 20:03         ` Andy Lutomirski
  2018-06-01 20:35           ` Rik van Riel
  2018-06-02  3:39           ` Mike Galbraith
  0 siblings, 2 replies; 18+ messages in thread
From: Andy Lutomirski @ 2018-06-01 20:03 UTC (permalink / raw)
  To: riel
  Cc: Mike Galbraith, Andrew Lutomirski, LKML, songliubraving,
	kernel-team, Ingo Molnar, Thomas Gleixner, X86 ML,
	Peter Zijlstra

On Fri, Jun 1, 2018 at 12:43 PM Rik van Riel <riel@surriel.com> wrote:
>
> On Fri, 2018-06-01 at 20:48 +0200, Mike Galbraith wrote:
> > On Fri, 2018-06-01 at 14:22 -0400, Rik van Riel wrote:
> > > On Fri, 2018-06-01 at 08:11 -0700, Andy Lutomirski wrote:
> > > > On Fri, Jun 1, 2018 at 5:28 AM Rik van Riel <riel@surriel.com>
> > > > wrote:
> > > > >
> > > > > Song noticed switch_mm_irqs_off taking a lot of CPU time in
> > > > > recent
> > > > > kernels, using 2.4% of a 48 CPU system during a netperf to
> > > > > localhost
> > > > > run.
> > > > > Digging into the profile, we noticed that cpumask_clear_cpu and
> > > > > cpumask_set_cpu together take about half of the CPU time taken
> > > > > by
> > > > > switch_mm_irqs_off.
> > > > >
> > > > > However, the CPUs running netperf end up switching back and
> > > > > forth
> > > > > between netperf and the idle task, which does not require
> > > > > changes
> > > > > to the mm_cpumask. Furthermore, the init_mm cpumask ends up
> > > > > being
> > > > > the most heavily contended one in the system.
> > > > >
> > > > > Skipping cpumask_clear_cpu and cpumask_set_cpu for init_mm
> > > > > (mostly the idle task) reduced CPU use of switch_mm_irqs_off
> > > > > from 2.4% of the CPU to 1.9% of the CPU, with the following
> > > > > netperf commandline:
> > > >
> > > > I'm conceptually fine with this change.  Does
> > > > mm_cpumask(&init_mm)
> > > > end
> > > > up in a deterministic state?
> > >
> > > Given that we do not touch mm_cpumask(&init_mm)
> > > any more, and that bitmask never appears to be
> > > used for things like tlb shootdowns (kernel TLB
> > > shootdowns simply go to everybody), I suspect
> > > it ends up in whatever state it is initialized
> > > to on startup.
> > >
> > > I had not looked into this much, because it does
> > > not appear to be used for anything.
> > >
> > > > Mike, depending on exactly what's going on with your benchmark,
> > > > this
> > > > might help recover a bit of your performance, too.
> > >
> > > It will be interesting to know how this change
> > > impacts others.
> >
> > previous pipe-test numbers
> > 4.13.16         2.024978 usecs/loop -- avg 2.045250 977.9 KHz
> > 4.14.47         2.234518 usecs/loop -- avg 2.227716 897.8 KHz
> > 4.15.18         2.287815 usecs/loop -- avg 2.295858 871.1 KHz
> > 4.16.13         2.286036 usecs/loop -- avg 2.279057 877.6 KHz
> > 4.17.0.g88a8676 2.288231 usecs/loop -- avg 2.288917 873.8 KHz
> >
> > new numbers
> > 4.17.0.g0512e01 2.268629 usecs/loop -- avg 2.269493 881.3 KHz
> > 4.17.0.g0512e01 2.035401 usecs/loop -- avg 2.038341 981.2 KHz +andy
> > 4.17.0.g0512e01 2.238701 usecs/loop -- avg 2.231828 896.1 KHz
> > -andy+rik
> >
> > There might be something there with your change Rik, but it's small
> > enough to be wary of variance.  Andy's "invert the return of
> > tlb_defer_switch_to_init_mm()" is OTOH pretty clear.
>
> If inverting the return value of that function helps
> some systems, chances are the other value might help
> other systems.
>
> That makes you wonder whether it might make sense
> to always switch to lazy TLB mode, and only call
> switch_mm at TLB flush time, regardless of whether
> the CPU supports PCID...
>

Mike, you never did say: do you have PCID on your CPU?  Also, what is
your workload doing to cause so many switches back and forth between
init_mm and a task?

The point of the optimization is that switching to init_mm should be
fairly fast on a PCID system, whereas an IPI to do the deferred flush
is very expensive regardless of PCID.  I wonder if we could do
something fancy where we stay on the task mm for short idles but
switch to init_mm before going deeply idle.  We'd also want to switch
to init_mm when we go idle due to migrating the previously running
task, I think.
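
A very rough sketch of that idea -- the hook point, the latency
threshold, and the helper name are all made up for illustration, and
this would have to run with interrupts off once the idle governor has
picked a state:

/* Hypothetical helper, not an existing kernel function or call site. */
static void maybe_drop_lazy_mm(unsigned int exit_latency_us)
{
	/* Shallow idle: keep the user mm loaded and stay lazy. */
	if (exit_latency_us < 100)
		return;

	/* Deep idle: switch to init_mm up front, as a deferred flush would. */
	if (this_cpu_read(cpu_tlbstate.loaded_mm) != &init_mm)
		switch_mm_irqs_off(NULL, &init_mm, NULL);
}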

Previously, it was hard to make any decisions based on the target idle
state because the idle code chose its target state so late in the
idling process that the CPU was halfway shut down already.  But I
think this is fixed now.

--Andy


* Re: [PATCH] x86,switch_mm: skip atomic operations for init_mm
  2018-06-01 20:03         ` Andy Lutomirski
@ 2018-06-01 20:35           ` Rik van Riel
  2018-06-01 21:21             ` Andy Lutomirski
  2018-06-02  3:39           ` Mike Galbraith
  1 sibling, 1 reply; 18+ messages in thread
From: Rik van Riel @ 2018-06-01 20:35 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Mike Galbraith, LKML, songliubraving, kernel-team, Ingo Molnar,
	Thomas Gleixner, X86 ML, Peter Zijlstra


On Fri, 2018-06-01 at 13:03 -0700, Andy Lutomirski wrote:
> Mike, you never did say: do you have PCID on your CPU?  Also, what is
> your workload doing to cause so many switches back and forth between
> init_mm and a task.
> 
> The point of the optimization is that switching to init_mm() should
> be
> fairly fast on a PCID system, whereas an IPI to do the deferred flush
> is very expensive regardless of PCID. 

While I am sure that bit is true, Song and I
observed about 4x as much CPU use in the atomic
operations in cpumask_clear_cpu and cpumask_set_cpu
(inside switch_mm_irqs_off) as in the %cr3 reload
itself.

Given how expensive those cpumask updates are,
lazy TLB mode might always be worth it, especially
on larger systems.

-- 
All Rights Reversed.



* Re: [PATCH] x86,switch_mm: skip atomic operations for init_mm
  2018-06-01 20:35           ` Rik van Riel
@ 2018-06-01 21:21             ` Andy Lutomirski
  2018-06-01 22:13               ` Rik van Riel
  0 siblings, 1 reply; 18+ messages in thread
From: Andy Lutomirski @ 2018-06-01 21:21 UTC (permalink / raw)
  To: riel
  Cc: Andrew Lutomirski, Mike Galbraith, LKML, songliubraving,
	kernel-team, Ingo Molnar, Thomas Gleixner, X86 ML,
	Peter Zijlstra

On Fri, Jun 1, 2018 at 1:35 PM Rik van Riel <riel@surriel.com> wrote:
>
> On Fri, 2018-06-01 at 13:03 -0700, Andy Lutomirski wrote:
> > Mike, you never did say: do you have PCID on your CPU?  Also, what is
> > your workload doing to cause so many switches back and forth between
> > init_mm and a task.
> >
> > The point of the optimization is that switching to init_mm() should
> > be
> > fairly fast on a PCID system, whereas an IPI to do the deferred flush
> > is very expensive regardless of PCID.
>
> While I am sure that bit is true, Song and I
> observed about 4x as much CPU use in the atomic
> operations in cpumask_clear_cpu and cpumask_set_cpu
> (inside switch_mm_irqs_off) as we saw CPU used
> in the %cr3 reload itself.
>
> Given how expensive those cpumask updates are,
> lazy TLB mode might always be worth it, especially
> on larger systems.
>

Hmm.  I wonder if there's a more clever data structure than a bitmap
that we could be using here.  Each CPU only ever needs to be in one
mm's cpumask, and each cpu only ever changes its own state in the
bitmask.  And writes are much less common than reads for most
workloads.


* Re: [PATCH] x86,switch_mm: skip atomic operations for init_mm
  2018-06-01 21:21             ` Andy Lutomirski
@ 2018-06-01 22:13               ` Rik van Riel
  2018-06-02  3:35                 ` Andy Lutomirski
  0 siblings, 1 reply; 18+ messages in thread
From: Rik van Riel @ 2018-06-01 22:13 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Mike Galbraith, LKML, songliubraving, kernel-team, Ingo Molnar,
	Thomas Gleixner, X86 ML, Peter Zijlstra

On Fri, 1 Jun 2018 14:21:58 -0700
Andy Lutomirski <luto@kernel.org> wrote:

> Hmm.  I wonder if there's a more clever data structure than a bitmap
> that we could be using here.  Each CPU only ever needs to be in one
> mm's cpumask, and each cpu only ever changes its own state in the
> bitmask.  And writes are much less common than reads for most
> workloads.

It would be easy enough to add an mm_struct pointer to the
per-cpu tlbstate struct, and iterate over those.

However, that would be an orthogonal change to optimizing
lazy TLB mode. 
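
As a rough, untested sketch of the iteration side, using the
existing cpu_tlbstate.loaded_mm field for illustration (a real
version would want READ_ONCE() and some care around CPUs going
on- and offline):

static void build_flush_mask(struct mm_struct *mm, struct cpumask *mask)
{
	int cpu;

	cpumask_clear(mask);
	for_each_online_cpu(cpu) {
		/* Only CPUs currently running this mm need the flush IPI. */
		if (per_cpu(cpu_tlbstate.loaded_mm, cpu) == mm)
			cpumask_set_cpu(cpu, mask);
	}
}

The read side then has to scan every CPU on every flush, which is
the cost Andy points out further down the thread.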

Does the (untested) patch below make sense as a potential
improvement to the lazy TLB heuristic?

---8<---
Subject: x86,tlb: workload dependent per CPU lazy TLB switch

Lazy TLB mode trades the cost of flushing the TLB and touching
mm_cpumask(&init_mm) at context switch time against the risk of
incurring a remote TLB flush IPI while in lazy TLB mode.

Whether this pays off is likely to be workload dependent more than
anything else. However, the current heuristic keys off hardware type.

This patch changes the lazy TLB mode heuristic to a dynamic, per-CPU
decision, dependent on whether we recently received a remote TLB
shootdown while in lazy TLB mode.

This is a very simple heuristic. When a CPU receives a remote TLB
shootdown IPI while in lazy TLB mode, a counter in the same cache
line is set to 16. Every time we skip lazy TLB mode, the counter
is decremented.

While the counter is zero (no recent TLB flush IPIs), allow lazy TLB mode.

Signed-off-by: Rik van Riel <riel@surriel.com>
---
 arch/x86/include/asm/tlbflush.h | 32 ++++++++++++++++----------------
 arch/x86/mm/tlb.c               |  4 ++++
 2 files changed, 20 insertions(+), 16 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 6690cd3fc8b1..f06a934e317d 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -148,22 +148,6 @@ static inline unsigned long build_cr3_noflush(pgd_t *pgd, u16 asid)
 #define __flush_tlb_one_user(addr) __native_flush_tlb_one_user(addr)
 #endif
 
-static inline bool tlb_defer_switch_to_init_mm(void)
-{
-	/*
-	 * If we have PCID, then switching to init_mm is reasonably
-	 * fast.  If we don't have PCID, then switching to init_mm is
-	 * quite slow, so we try to defer it in the hopes that we can
-	 * avoid it entirely.  The latter approach runs the risk of
-	 * receiving otherwise unnecessary IPIs.
-	 *
-	 * This choice is just a heuristic.  The tlb code can handle this
-	 * function returning true or false regardless of whether we have
-	 * PCID.
-	 */
-	return !static_cpu_has(X86_FEATURE_PCID);
-}
-
 struct tlb_context {
 	u64 ctx_id;
 	u64 tlb_gen;
@@ -179,6 +163,7 @@ struct tlb_state {
 	struct mm_struct *loaded_mm;
 	u16 loaded_mm_asid;
 	u16 next_asid;
+	u16 flushed_while_lazy;
 	/* last user mm's ctx id */
 	u64 last_ctx_id;
 
@@ -246,6 +231,21 @@ struct tlb_state {
 };
 DECLARE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate);
 
+static inline bool tlb_defer_switch_to_init_mm(void)
+{
+	/*
+	 * If the CPU recently received a TLB flush IPI while in lazy
+	 * TLB mode, do a straight switch to the idle task, and skip
+	 * lazy TLB mode for now.
+	 */
+	if (this_cpu_read(cpu_tlbstate.flushed_while_lazy)) {
+		this_cpu_dec(cpu_tlbstate.flushed_while_lazy);
+		return false;
+	}
+
+	return true;
+}
+
 /* Initialize cr4 shadow for this CPU. */
 static inline void cr4_init_shadow(void)
 {
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index e055d1a06699..d8b0b7b236f3 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -454,7 +454,11 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
 		 * paging-structure cache to avoid speculatively reading
 		 * garbage into our TLB.  Since switching to init_mm is barely
 		 * slower than a minimal flush, just switch to init_mm.
+		 *
+		 * Skip lazy TLB mode for the next 16 context switches,
+		 * in case more TLB flush IPIs are coming.
 		 */
+		this_cpu_write(cpu_tlbstate.flushed_while_lazy, 16);
 		switch_mm_irqs_off(NULL, &init_mm, NULL);
 		return;
 	}


* Re: [PATCH] x86,switch_mm: skip atomic operations for init_mm
  2018-06-01 22:13               ` Rik van Riel
@ 2018-06-02  3:35                 ` Andy Lutomirski
  2018-06-02  5:04                   ` Rik van Riel
  0 siblings, 1 reply; 18+ messages in thread
From: Andy Lutomirski @ 2018-06-02  3:35 UTC (permalink / raw)
  To: riel
  Cc: Andrew Lutomirski, Mike Galbraith, LKML, songliubraving,
	kernel-team, Ingo Molnar, Thomas Gleixner, X86 ML,
	Peter Zijlstra

On Fri, Jun 1, 2018 at 3:13 PM Rik van Riel <riel@surriel.com> wrote:
>
> On Fri, 1 Jun 2018 14:21:58 -0700
> Andy Lutomirski <luto@kernel.org> wrote:
>
> > Hmm.  I wonder if there's a more clever data structure than a bitmap
> > that we could be using here.  Each CPU only ever needs to be in one
> > mm's cpumask, and each cpu only ever changes its own state in the
> > bitmask.  And writes are much less common than reads for most
> > workloads.
>
> It would be easy enough to add an mm_struct pointer to the
> per-cpu tlbstate struct, and iterate over those.
>
> However, that would be an orthogonal change to optimizing
> lazy TLB mode.
>
> Does the (untested) patch below make sense as a potential
> improvement to the lazy TLB heuristic?
>
> ---8<---
> Subject: x86,tlb: workload dependent per CPU lazy TLB switch
>
> Lazy TLB mode is a tradeoff between flushing the TLB and touching
> the mm_cpumask(&init_mm) at context switch time, versus potentially
> incurring a remote TLB flush IPI while in lazy TLB mode.
>
> Whether this pays off is likely to be workload dependent more than
> anything else. However, the current heuristic keys off hardware type.
>
> This patch changes the lazy TLB mode heuristic to a dynamic, per-CPU
> decision, dependent on whether we recently received a remote TLB
> shootdown while in lazy TLB mode.
>
> This is a very simple heuristic. When a CPU receives a remote TLB
> shootdown IPI while in lazy TLB mode, a counter in the same cache
> line is set to 16. Every time we skip lazy TLB mode, the counter
> is decremented.
>
> While the counter is zero (no recent TLB flush IPIs), allow lazy TLB mode.

Hmm, cute.  That's not a bad idea at all.  It would be nice to get
some kind of real benchmark on both PCID and !PCID.  If nothing else,
I would expect the threshold (16 in your patch) to want to be lower on
PCID systems.

--Andy


* Re: [PATCH] x86,switch_mm: skip atomic operations for init_mm
  2018-06-01 20:03         ` Andy Lutomirski
  2018-06-01 20:35           ` Rik van Riel
@ 2018-06-02  3:39           ` Mike Galbraith
  1 sibling, 0 replies; 18+ messages in thread
From: Mike Galbraith @ 2018-06-02  3:39 UTC (permalink / raw)
  To: Andy Lutomirski, riel
  Cc: LKML, songliubraving, kernel-team, Ingo Molnar, Thomas Gleixner,
	X86 ML, Peter Zijlstra

On Fri, 2018-06-01 at 13:03 -0700, Andy Lutomirski wrote:
> 
> Mike, you never did say: do you have PCID on your CPU?

Yes.

>   Also, what is
> your workload doing to cause so many switches back and forth between
> init_mm and a task.

pipe-test measures pipe round trip, does nearly nothing but schedule.  

	-Mike


* Re: [PATCH] x86,switch_mm: skip atomic operations for init_mm
  2018-06-02  3:35                 ` Andy Lutomirski
@ 2018-06-02  5:04                   ` Rik van Riel
  2018-06-02 20:14                     ` Andy Lutomirski
  0 siblings, 1 reply; 18+ messages in thread
From: Rik van Riel @ 2018-06-02  5:04 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Mike Galbraith, LKML, songliubraving, kernel-team, Ingo Molnar,
	Thomas Gleixner, X86 ML, Peter Zijlstra


On Fri, 2018-06-01 at 20:35 -0700, Andy Lutomirski wrote:
> On Fri, Jun 1, 2018 at 3:13 PM Rik van Riel <riel@surriel.com> wrote:
> > 
> > On Fri, 1 Jun 2018 14:21:58 -0700
> > Andy Lutomirski <luto@kernel.org> wrote:
> > 
> > > Hmm.  I wonder if there's a more clever data structure than a
> > > bitmap
> > > that we could be using here.  Each CPU only ever needs to be in
> > > one
> > > mm's cpumask, and each cpu only ever changes its own state in the
> > > bitmask.  And writes are much less common than reads for most
> > > workloads.
> > 
> > It would be easy enough to add an mm_struct pointer to the
> > per-cpu tlbstate struct, and iterate over those.
> > 
> > However, that would be an orthogonal change to optimizing
> > lazy TLB mode.
> > 
> > Does the (untested) patch below make sense as a potential
> > improvement to the lazy TLB heuristic?
> > 
> > ---8<---
> > Subject: x86,tlb: workload dependent per CPU lazy TLB switch
> > 
> > Lazy TLB mode is a tradeoff between flushing the TLB and touching
> > the mm_cpumask(&init_mm) at context switch time, versus potentially
> > incurring a remote TLB flush IPI while in lazy TLB mode.
> > 
> > Whether this pays off is likely to be workload dependent more than
> > anything else. However, the current heuristic keys off hardware
> > type.
> > 
> > This patch changes the lazy TLB mode heuristic to a dynamic, per-
> > CPU
> > decision, dependent on whether we recently received a remote TLB
> > shootdown while in lazy TLB mode.
> > 
> > This is a very simple heuristic. When a CPU receives a remote TLB
> > shootdown IPI while in lazy TLB mode, a counter in the same cache
> > line is set to 16. Every time we skip lazy TLB mode, the counter
> > is decremented.
> > 
> > While the counter is zero (no recent TLB flush IPIs), allow lazy
> > TLB mode.
> 
> Hmm, cute.  That's not a bad idea at all.  It would be nice to get
> some kind of real benchmark on both PCID and !PCID.  If nothing else,
> I would expect the threshold (16 in your patch) to want to be lower
> on
> PCID systems.

That depends on how well we manage to get rid of
the cpumask manipulation overhead. On the PCID
system where we first found this issue, the atomic
accesses to the mm_cpumask took about 4x as much
CPU time as the TLB invalidation itself.

That kinda limits how much the cheaper TLB
flushes actually help :)

I agree this code should get some testing.

-- 
All Rights Reversed.



* Re: [PATCH] x86,switch_mm: skip atomic operations for init_mm
  2018-06-02  5:04                   ` Rik van Riel
@ 2018-06-02 20:14                     ` Andy Lutomirski
  2018-06-03  0:51                       ` Song Liu
  0 siblings, 1 reply; 18+ messages in thread
From: Andy Lutomirski @ 2018-06-02 20:14 UTC (permalink / raw)
  To: riel
  Cc: Andrew Lutomirski, Mike Galbraith, LKML, songliubraving,
	kernel-team, Ingo Molnar, Thomas Gleixner, X86 ML,
	Peter Zijlstra

On Fri, Jun 1, 2018 at 10:04 PM Rik van Riel <riel@surriel.com> wrote:
>
> On Fri, 2018-06-01 at 20:35 -0700, Andy Lutomirski wrote:
> > On Fri, Jun 1, 2018 at 3:13 PM Rik van Riel <riel@surriel.com> wrote:
> > >
> > > On Fri, 1 Jun 2018 14:21:58 -0700
> > > Andy Lutomirski <luto@kernel.org> wrote:
> > >
> > > > Hmm.  I wonder if there's a more clever data structure than a
> > > > bitmap
> > > > that we could be using here.  Each CPU only ever needs to be in
> > > > one
> > > > mm's cpumask, and each cpu only ever changes its own state in the
> > > > bitmask.  And writes are much less common than reads for most
> > > > workloads.
> > >
> > > It would be easy enough to add an mm_struct pointer to the
> > > per-cpu tlbstate struct, and iterate over those.
> > >
> > > However, that would be an orthogonal change to optimizing
> > > lazy TLB mode.
> > >
> > > Does the (untested) patch below make sense as a potential
> > > improvement to the lazy TLB heuristic?
> > >
> > > ---8<---
> > > Subject: x86,tlb: workload dependent per CPU lazy TLB switch
> > >
> > > Lazy TLB mode is a tradeoff between flushing the TLB and touching
> > > the mm_cpumask(&init_mm) at context switch time, versus potentially
> > > incurring a remote TLB flush IPI while in lazy TLB mode.
> > >
> > > Whether this pays off is likely to be workload dependent more than
> > > anything else. However, the current heuristic keys off hardware
> > > type.
> > >
> > > This patch changes the lazy TLB mode heuristic to a dynamic, per-
> > > CPU
> > > decision, dependent on whether we recently received a remote TLB
> > > shootdown while in lazy TLB mode.
> > >
> > > This is a very simple heuristic. When a CPU receives a remote TLB
> > > shootdown IPI while in lazy TLB mode, a counter in the same cache
> > > line is set to 16. Every time we skip lazy TLB mode, the counter
> > > is decremented.
> > >
> > > While the counter is zero (no recent TLB flush IPIs), allow lazy
> > > TLB mode.
> >
> > Hmm, cute.  That's not a bad idea at all.  It would be nice to get
> > some kind of real benchmark on both PCID and !PCID.  If nothing else,
> > I would expect the threshold (16 in your patch) to want to be lower
> > on
> > PCID systems.
>
> That depends on how well we manage to get rid of
> the cpumask manipulation overhead. On the PCID
> system we first found this issue, the atomic
> accesses to the mm_cpumask took about 4x as much
> CPU time as the TLB invalidation itself.
>
> That kinda limits how much the cost of cheaper
> TLB flushes actually help :)
>
> I agree this code should get some testing.
>

Just to check: in the workload where you're seeing this problem, are
you using an mm with many threads?  I would imagine that, if you only
have one or two threads, the bit operations aren't so bad.

I wonder if just having a whole cacheline per node for the cpumask
would solve the problem.  I don't love the idea of having every flush
operation scan cpu_tlbstate for every single CPU -- we'll end up with
nasty contention on the cpu_tlbstate cache lines on some workloads.
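
The per-node layout could look something like the sketch below -- purely
an illustration of the idea, with made-up names, and keeping a full
bitmap per node for simplicity where a real version would only store
each node's own CPUs:

/*
 * One cacheline-aligned piece of the mask per NUMA node: a CPU only
 * ever sets or clears its own bit in its own node's piece, so context
 * switches never bounce a cache line between nodes; readers OR the
 * pieces together when they need the full mask.
 */
struct mm_cpumask_shard {
	unsigned long bits[BITS_TO_LONGS(NR_CPUS)];
} ____cacheline_aligned_in_smp;

struct mm_cpumask_pernode {
	struct mm_cpumask_shard node[MAX_NUMNODES];
};

static inline void mm_cpumask_pernode_set(struct mm_cpumask_pernode *m, int cpu)
{
	set_bit(cpu, m->node[cpu_to_node(cpu)].bits);
}

static inline void mm_cpumask_pernode_clear(struct mm_cpumask_pernode *m, int cpu)
{
	clear_bit(cpu, m->node[cpu_to_node(cpu)].bits);
}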


* Re: [PATCH] x86,switch_mm: skip atomic operations for init_mm
  2018-06-02 20:14                     ` Andy Lutomirski
@ 2018-06-03  0:51                       ` Song Liu
  2018-06-03  1:38                         ` Rik van Riel
  0 siblings, 1 reply; 18+ messages in thread
From: Song Liu @ 2018-06-03  0:51 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: riel, Mike Galbraith, LKML, Kernel Team, Ingo Molnar,
	Thomas Gleixner, X86 ML, Peter Zijlstra



> On Jun 2, 2018, at 1:14 PM, Andy Lutomirski <luto@kernel.org> wrote:
> 
> On Fri, Jun 1, 2018 at 10:04 PM Rik van Riel <riel@surriel.com> wrote:
>> 
>> On Fri, 2018-06-01 at 20:35 -0700, Andy Lutomirski wrote:
>>> On Fri, Jun 1, 2018 at 3:13 PM Rik van Riel <riel@surriel.com> wrote:
>>>> 
>>>> On Fri, 1 Jun 2018 14:21:58 -0700
>>>> Andy Lutomirski <luto@kernel.org> wrote:
>>>> 
>>>>> Hmm.  I wonder if there's a more clever data structure than a
>>>>> bitmap
>>>>> that we could be using here.  Each CPU only ever needs to be in
>>>>> one
>>>>> mm's cpumask, and each cpu only ever changes its own state in the
>>>>> bitmask.  And writes are much less common than reads for most
>>>>> workloads.
>>>> 
>>>> It would be easy enough to add an mm_struct pointer to the
>>>> per-cpu tlbstate struct, and iterate over those.
>>>> 
>>>> However, that would be an orthogonal change to optimizing
>>>> lazy TLB mode.
>>>> 
>>>> Does the (untested) patch below make sense as a potential
>>>> improvement to the lazy TLB heuristic?
>>>> 
>>>> ---8<---
>>>> Subject: x86,tlb: workload dependent per CPU lazy TLB switch
>>>> 
>>>> Lazy TLB mode is a tradeoff between flushing the TLB and touching
>>>> the mm_cpumask(&init_mm) at context switch time, versus potentially
>>>> incurring a remote TLB flush IPI while in lazy TLB mode.
>>>> 
>>>> Whether this pays off is likely to be workload dependent more than
>>>> anything else. However, the current heuristic keys off hardware
>>>> type.
>>>> 
>>>> This patch changes the lazy TLB mode heuristic to a dynamic, per-
>>>> CPU
>>>> decision, dependent on whether we recently received a remote TLB
>>>> shootdown while in lazy TLB mode.
>>>> 
>>>> This is a very simple heuristic. When a CPU receives a remote TLB
>>>> shootdown IPI while in lazy TLB mode, a counter in the same cache
>>>> line is set to 16. Every time we skip lazy TLB mode, the counter
>>>> is decremented.
>>>> 
>>>> While the counter is zero (no recent TLB flush IPIs), allow lazy
>>>> TLB mode.
>>> 
>>> Hmm, cute.  That's not a bad idea at all.  It would be nice to get
>>> some kind of real benchmark on both PCID and !PCID.  If nothing else,
>>> I would expect the threshold (16 in your patch) to want to be lower
>>> on
>>> PCID systems.
>> 
>> That depends on how well we manage to get rid of
>> the cpumask manipulation overhead. On the PCID
>> system we first found this issue, the atomic
>> accesses to the mm_cpumask took about 4x as much
>> CPU time as the TLB invalidation itself.
>> 
>> That kinda limits how much the cost of cheaper
>> TLB flushes actually help :)
>> 
>> I agree this code should get some testing.
>> 
> 
> Just to check: in the workload where you're seeing this problem, are
> you using an mm with many threads?  I would imagine that, if you only
> have one or two threads, the bit operations aren't so bad.

Yes, we are running netperf/netserver with 300 threads. We don't see
this much overhead with real workloads.

Here are some test results. The tests are on a 2-socket system with E5-2678 v3,
so 48 logical cores with PCID. I tested 3 kernels (and tuned
flushed_while_lazy):

Baseline: 0512e01345 in upstream
Baseline + lazy-tlb: 0512e01345 + "x86,tlb: workload dependent per CPU lazy TLB switch"   
Baseline + both patches: 0512e01345 + "x86,tlb: workload dependent per CPU lazy TLB switch" + "x86,switch_mm: skip atomic operations for init_mm"    

The tests were run with the following options: 

./super_netperf 300 -P 0 -t TCP_RR -p 8888 -H <target_host> -l 60 -- -r 100,100 -o -s 1M,1M -S 1M,1M

Here are the results:
                       flushed_while_lazy    Throughput     %-cpu-on-switch_mm_irqs_off() 

Baseline                      N/A             2.02756e+06         2.82%    (1.54% in ctx of netserver + 1.28% in ctx of swapper)
Baseline + lazy-tlb            16             1.92706e+06         1.19%    (only in ctx of swapper, same for all cases below)
Baseline + lazy-tlb            4              2.03863e+06         1.12%    Good option 1
Baseline + lazy-tlb            2              1.93847e+06         1.22%
Baseline + both patches        64             1.92219e+06         1.09%
Baseline + both patches        16             1.93597e+06         1.15%
Baseline + both patches        8              1.92175e+06         1.11%
Baseline + both patches        4              2.03465e+06         1.09%    Good option 2
Baseline + both patches        2              2.03603e+06         1.10%    Good option 3
Baseline + both patches        1              1.9241e+06          1.06%

I highlighted 3 good options above. They are about the same for this 
test case. 

Thanks,
Song


* Re: [PATCH] x86,switch_mm: skip atomic operations for init_mm
  2018-06-03  0:51                       ` Song Liu
@ 2018-06-03  1:38                         ` Rik van Riel
  2018-06-06 18:17                           ` Andy Lutomirski
  0 siblings, 1 reply; 18+ messages in thread
From: Rik van Riel @ 2018-06-03  1:38 UTC (permalink / raw)
  To: Song Liu, Andy Lutomirski
  Cc: Mike Galbraith, LKML, Kernel Team, Ingo Molnar, Thomas Gleixner,
	X86 ML, Peter Zijlstra


On Sun, 2018-06-03 at 00:51 +0000, Song Liu wrote:

> > Just to check: in the workload where you're seeing this problem,
> > are
> > you using an mm with many threads?  I would imagine that, if you
> > only
> > have one or two threads, the bit operations aren't so bad.
> 
> Yes, we are running netperf/netserver with 300 threads. We don't see
> this much overhead in with real workload. 

We may not, but there are some crazy workloads out
there in the world. Think of some Java programs with
thousands of threads, causing a million context
switches a second on a large system.

I like Andy's idea of having one cache line with
a cpumask per node. That seems like it will have
fewer downsides for tasks with fewer threads running
on giant systems.

I'll throw out the code I was working on, and look
into implementing that :)

-- 
All Rights Reversed.



* Re: [PATCH] x86,switch_mm: skip atomic operations for init_mm
  2018-06-03  1:38                         ` Rik van Riel
@ 2018-06-06 18:17                           ` Andy Lutomirski
  2018-06-06 19:00                             ` Rik van Riel
  0 siblings, 1 reply; 18+ messages in thread
From: Andy Lutomirski @ 2018-06-06 18:17 UTC (permalink / raw)
  To: Rik van Riel
  Cc: songliubraving, Andrew Lutomirski, Mike Galbraith, LKML,
	kernel-team, Ingo Molnar, Thomas Gleixner, X86 ML,
	Peter Zijlstra

On Sat, Jun 2, 2018 at 6:38 PM Rik van Riel <riel@surriel.com> wrote:
>
> On Sun, 2018-06-03 at 00:51 +0000, Song Liu wrote:
>
> > > Just to check: in the workload where you're seeing this problem,
> > > are
> > > you using an mm with many threads?  I would imagine that, if you
> > > only
> > > have one or two threads, the bit operations aren't so bad.
> >
> > Yes, we are running netperf/netserver with 300 threads. We don't see
> > this much overhead in with real workload.
>
> We may not, but there are some crazy workloads out
> there in the world. Think of some Java programs with
> thousands of threads, causing a million context
> switches a second on a large system.
>
> I like Andy's idea of having one cache line with
> a cpumask per node. That seems like it will have
> fewer downsides for tasks with fewer threads running
> on giant systems.
>
> I'll throw out the code I was working on, and look
> into implementing that :)
>

I'm not sure you should throw your patch out.  It's a decent idea, too.


* Re: [PATCH] x86,switch_mm: skip atomic operations for init_mm
  2018-06-06 18:17                           ` Andy Lutomirski
@ 2018-06-06 19:00                             ` Rik van Riel
  2018-06-06 19:23                               ` Andy Lutomirski
  0 siblings, 1 reply; 18+ messages in thread
From: Rik van Riel @ 2018-06-06 19:00 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: songliubraving, Mike Galbraith, LKML, kernel-team, Ingo Molnar,
	Thomas Gleixner, X86 ML, Peter Zijlstra


On Wed, 2018-06-06 at 11:17 -0700, Andy Lutomirski wrote:
> On Sat, Jun 2, 2018 at 6:38 PM Rik van Riel <riel@surriel.com> wrote:
> > 
> > On Sun, 2018-06-03 at 00:51 +0000, Song Liu wrote:
> > 
> > > > Just to check: in the workload where you're seeing this
> > > > problem,
> > > > are
> > > > you using an mm with many threads?  I would imagine that, if
> > > > you
> > > > only
> > > > have one or two threads, the bit operations aren't so bad.
> > > 
> > > Yes, we are running netperf/netserver with 300 threads. We don't
> > > see
> > > this much overhead in with real workload.
> > 
> > We may not, but there are some crazy workloads out
> > there in the world. Think of some Java programs with
> > thousands of threads, causing a million context
> > switches a second on a large system.
> > 
> > I like Andy's idea of having one cache line with
> > a cpumask per node. That seems like it will have
> > fewer downsides for tasks with fewer threads running
> > on giant systems.
> > 
> > I'll throw out the code I was working on, and look
> > into implementing that :)
> > 
> 
> I'm not sure you should throw your patch out.  It's a decent idea,
> too.

Oh, I still have it saved, but the cpumask per
NUMA node looks like it could have a big impact,
with less guesswork or side effects.

-- 
All Rights Reversed.



* Re: [PATCH] x86,switch_mm: skip atomic operations for init_mm
  2018-06-06 19:00                             ` Rik van Riel
@ 2018-06-06 19:23                               ` Andy Lutomirski
  0 siblings, 0 replies; 18+ messages in thread
From: Andy Lutomirski @ 2018-06-06 19:23 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrew Lutomirski, songliubraving, Mike Galbraith, LKML,
	kernel-team, Ingo Molnar, Thomas Gleixner, X86 ML,
	Peter Zijlstra

On Wed, Jun 6, 2018 at 12:00 PM Rik van Riel <riel@surriel.com> wrote:
>
> On Wed, 2018-06-06 at 11:17 -0700, Andy Lutomirski wrote:
> > On Sat, Jun 2, 2018 at 6:38 PM Rik van Riel <riel@surriel.com> wrote:
> > >
> > > On Sun, 2018-06-03 at 00:51 +0000, Song Liu wrote:
> > >
> > > > > Just to check: in the workload where you're seeing this
> > > > > problem,
> > > > > are
> > > > > you using an mm with many threads?  I would imagine that, if
> > > > > you
> > > > > only
> > > > > have one or two threads, the bit operations aren't so bad.
> > > >
> > > > Yes, we are running netperf/netserver with 300 threads. We don't
> > > > see
> > > > this much overhead in with real workload.
> > >
> > > We may not, but there are some crazy workloads out
> > > there in the world. Think of some Java programs with
> > > thousands of threads, causing a million context
> > > switches a second on a large system.
> > >
> > > I like Andy's idea of having one cache line with
> > > a cpumask per node. That seems like it will have
> > > fewer downsides for tasks with fewer threads running
> > > on giant systems.
> > >
> > > I'll throw out the code I was working on, and look
> > > into implementing that :)
> > >
> >
> > I'm not sure you should throw your patch out.  It's a decent idea,
> > too.
>
> Oh, I still have it saved, but the cpumask per
> NUMA node looks like it could have a big impact,
> with less guesswork or side effects.
>

Also, even with your other patch, we'd still have a win from the
improved data structure -- switching back and forth between init_mm
and something else is definitely not the only time we hammer the
cpumask cache lines.

