* [PATCH RFC UGLY] x86,mm,sched: make lazy TLB mode even lazier
@ 2016-08-25 19:04 Rik van Riel
  2016-08-25 19:42 ` H. Peter Anvin
  2016-08-28  8:11 ` Andy Lutomirski
  0 siblings, 2 replies; 22+ messages in thread
From: Rik van Riel @ 2016-08-25 19:04 UTC (permalink / raw)
  To: serebrin; +Cc: mingo, peterz, hpa, linux-kernel, luto, bp, mgorman, tglx

Subject: x86,mm,sched: make lazy TLB mode even lazier

Lazy TLB mode can result in an idle CPU being woken up for a TLB
flush, when all it really needed to do was flush %cr3 before the
next context switch.

This is mostly fine on bare metal, though sub-optimal from a power
saving point of view, and deeper C states could make TLB flushes
take a little longer than desired.

On virtual machines, the pain can be much worse, especially if a
currently non-running VCPU is woken up for a TLB invalidation
IPI, on a CPU that is busy running another task. It could take
a while before that IPI is handled, leading to performance issues.

This patch is still ugly, and the sched.h include needs to be cleaned
up a lot (how would the scheduler people like to see the context switch
blocking abstracted?)

This patch deals with the issue by introducing a third tlb state,
TLBSTATE_FLUSH, which causes %cr3 to be flushed at the next
context switch. A CPU is transitioned from TLBSTATE_LAZY to
TLBSTATE_FLUSH with the rq lock held, to prevent context switches.

Nothing is done for a CPU that is already in TLBSTATE_FLUSH mode.

This patch is totally untested, because I am at a conference right
now, and Benjamin has the test case :)

Signed-off-by: Rik van Riel <riel@redhat.com>
Reported-by: Benjamin Serebrin <serebrin@google.com>
---
 arch/x86/include/asm/tlbflush.h |  1 +
 arch/x86/mm/tlb.c               | 38 +++++++++++++++++++++++++++++++++++---
 2 files changed, 36 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 4e5be94e079a..5ae8e4b174f8 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -310,6 +310,7 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 
 #define TLBSTATE_OK	1
 #define TLBSTATE_LAZY	2
+#define TLBSTATE_FLUSH	3
 
 static inline void reset_lazy_tlbstate(void)
 {
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 5643fd0b1a7d..5b4cda49ac0c 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -6,6 +6,7 @@
 #include <linux/interrupt.h>
 #include <linux/module.h>
 #include <linux/cpu.h>
+#include "../../../kernel/sched/sched.h"
 
 #include <asm/tlbflush.h>
 #include <asm/mmu_context.h>
@@ -140,10 +141,12 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 	}
 #ifdef CONFIG_SMP
 	  else {
+		int oldstate = this_cpu_read(cpu_tlbstate.state);
 		this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
 		BUG_ON(this_cpu_read(cpu_tlbstate.active_mm) != next);
 
-		if (!cpumask_test_cpu(cpu, mm_cpumask(next))) {
+		if (oldstate == TLBSTATE_FLUSH ||
+				!cpumask_test_cpu(cpu, mm_cpumask(next))) {
 			/*
 			 * On established mms, the mm_cpumask is only changed
 			 * from irq context, from ptep_clear_flush() while in
@@ -242,11 +245,29 @@ static void flush_tlb_func(void *info)
 
 }
 
+/*
+ * This function moves a CPU from TLBSTATE_LAZY to TLBSTATE_FLUSH, which
+ * will force it to flush %cr3 at the next context switch, effectively
+ * doing a delayed TLB flush for a CPU in lazy TLB mode.
+ * This takes the runqueue lock to protect against the race condition
+ * of the target CPU rescheduling while we change its TLB state.
+ * Do nothing if the TLB state is already set to TLBSTATE_FLUSH.
+ */
+static void set_lazy_tlbstate_flush(int cpu) {
+	if (per_cpu(cpu_tlbstate.state, cpu) == TLBSTATE_LAZY) {
+		raw_spin_lock(&cpu_rq(cpu)->lock);
+		if (per_cpu(cpu_tlbstate.state, cpu) == TLBSTATE_LAZY)
+			per_cpu(cpu_tlbstate.state, cpu) = TLBSTATE_FLUSH;
+		raw_spin_unlock(&cpu_rq(cpu)->lock);
+	}
+}
+
 void native_flush_tlb_others(const struct cpumask *cpumask,
 				 struct mm_struct *mm, unsigned long start,
 				 unsigned long end)
 {
 	struct flush_tlb_info info;
+	unsigned int cpu;
 
 	if (end == 0)
 		end = start + PAGE_SIZE;
@@ -262,8 +283,6 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 				(end - start) >> PAGE_SHIFT);
 
 	if (is_uv_system()) {
-		unsigned int cpu;
-
 		cpu = smp_processor_id();
 		cpumask = uv_flush_tlb_others(cpumask, mm, start, end, cpu);
 		if (cpumask)
@@ -271,6 +290,19 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 								&info, 1);
 		return;
 	}
+
+	/*
+	 * Instead of sending IPIs to CPUs in lazy TLB mode, move that
+	 * CPUs TLB state to TLBSTATE_FLUSH, causing the TLB to be flushed
+	 * at the next context switch.
+	 */
+	for_each_cpu(cpu, cpumask) {
+		if (per_cpu(cpu_tlbstate.state, cpu) != TLBSTATE_OK) {
+			set_lazy_tlbstate_flush(cpu);
+			cpumask_clear_cpu(cpu, (struct cpumask *)cpumask);
+		}
+	}
+
 	smp_call_function_many(cpumask, flush_tlb_func, &info, 1);
 }
 


* Re: [PATCH RFC UGLY] x86,mm,sched: make lazy TLB mode even lazier
  2016-08-25 19:04 [PATCH RFC UGLY] x86,mm,sched: make lazy TLB mode even lazier Rik van Riel
@ 2016-08-25 19:42 ` H. Peter Anvin
  2016-08-25 19:52   ` Rik van Riel
                     ` (3 more replies)
  2016-08-28  8:11 ` Andy Lutomirski
  1 sibling, 4 replies; 22+ messages in thread
From: H. Peter Anvin @ 2016-08-25 19:42 UTC (permalink / raw)
  To: Rik van Riel, serebrin
  Cc: mingo, peterz, linux-kernel, luto, bp, mgorman, tglx

On August 25, 2016 12:04:59 PM PDT, Rik van Riel <riel@redhat.com> wrote:
>Subject: x86,mm,sched: make lazy TLB mode even lazier
>
>Lazy TLB mode can result in an idle CPU being woken up for a TLB
>flush, when all it really needed to do was flush %cr3 before the
>next context switch.
>
>This is mostly fine on bare metal, though sub-optimal from a power
>saving point of view, and deeper C states could make TLB flushes
>take a little longer than desired.
>
>On virtual machines, the pain can be much worse, especially if a
>currently non-running VCPU is woken up for a TLB invalidation
>IPI, on a CPU that is busy running another task. It could take
>a while before that IPI is handled, leading to performance issues.
>
>This patch is still ugly, and the sched.h include needs to be cleaned
>up a lot (how would the scheduler people like to see the context switch
>blocking abstracted?)
>
>This patch deals with the issue by introducing a third tlb state,
>TLBSTATE_FLUSH, which causes %cr3 to be flushed at the next
>context switch. A CPU is transitioned from TLBSTATE_LAZY to
>TLBSTATE_FLUSH with the rq lock held, to prevent context switches.
>
>Nothing is done for a CPU that is already in TLBSTATE_FLUH mode.
>
>This patch is totally untested, because I am at a conference right
>now, and Benjamin has the test case :)
>
>Signed-off-by: Rik van Riel <riel@redhat.com>
>Reported-by: Benjamin Serebrin <serebrin@google.com>
>---
> arch/x86/include/asm/tlbflush.h |  1 +
>arch/x86/mm/tlb.c               | 38
>+++++++++++++++++++++++++++++++++++---
> 2 files changed, 36 insertions(+), 3 deletions(-)
>
>diff --git a/arch/x86/include/asm/tlbflush.h
>b/arch/x86/include/asm/tlbflush.h
>index 4e5be94e079a..5ae8e4b174f8 100644
>--- a/arch/x86/include/asm/tlbflush.h
>+++ b/arch/x86/include/asm/tlbflush.h
>@@ -310,6 +310,7 @@ void native_flush_tlb_others(const struct cpumask
>*cpumask,
> 
> #define TLBSTATE_OK	1
> #define TLBSTATE_LAZY	2
>+#define TLBSTATE_FLUSH	3
> 
> static inline void reset_lazy_tlbstate(void)
> {
>diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
>index 5643fd0b1a7d..5b4cda49ac0c 100644
>--- a/arch/x86/mm/tlb.c
>+++ b/arch/x86/mm/tlb.c
>@@ -6,6 +6,7 @@
> #include <linux/interrupt.h>
> #include <linux/module.h>
> #include <linux/cpu.h>
>+#include "../../../kernel/sched/sched.h"
> 
> #include <asm/tlbflush.h>
> #include <asm/mmu_context.h>
>@@ -140,10 +141,12 @@ void switch_mm_irqs_off(struct mm_struct *prev,
>struct mm_struct *next,
> 	}
> #ifdef CONFIG_SMP
> 	  else {
>+		int oldstate = this_cpu_read(cpu_tlbstate.state);
> 		this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
> 		BUG_ON(this_cpu_read(cpu_tlbstate.active_mm) != next);
> 
>-		if (!cpumask_test_cpu(cpu, mm_cpumask(next))) {
>+		if (oldstate == TLBSTATE_FLUSH ||
>+				!cpumask_test_cpu(cpu, mm_cpumask(next))) {
> 			/*
> 			 * On established mms, the mm_cpumask is only changed
> 			 * from irq context, from ptep_clear_flush() while in
>@@ -242,11 +245,29 @@ static void flush_tlb_func(void *info)
> 
> }
> 
>+/*
>+ * This function moves a CPU from TLBSTATE_LAZY to TLBSTATE_FLUSH,
>which
>+ * will force it to flush %cr3 at the next context switch, effectively
>+ * doing a delayed TLB flush for a CPU in lazy TLB mode.
>+ * This takes the runqueue lock to protect against the race condition
>+ * of the target CPU rescheduling while we change its TLB state.
>+ * Do nothing if the TLB state is already set to TLBSTATE_FLUSH.
>+ */
>+static void set_lazy_tlbstate_flush(int cpu) {
>+	if (per_cpu(cpu_tlbstate.state, cpu) == TLBSTATE_LAZY) {
>+		raw_spin_lock(&cpu_rq(cpu)->lock);
>+		if (per_cpu(cpu_tlbstate.state, cpu) == TLBSTATE_LAZY)
>+			per_cpu(cpu_tlbstate.state, cpu) = TLBSTATE_FLUSH;
>+		raw_spin_unlock(&cpu_rq(cpu)->lock);
>+	}
>+}
>+
> void native_flush_tlb_others(const struct cpumask *cpumask,
> 				 struct mm_struct *mm, unsigned long start,
> 				 unsigned long end)
> {
> 	struct flush_tlb_info info;
>+	unsigned int cpu;
> 
> 	if (end == 0)
> 		end = start + PAGE_SIZE;
>@@ -262,8 +283,6 @@ void native_flush_tlb_others(const struct cpumask
>*cpumask,
> 				(end - start) >> PAGE_SHIFT);
> 
> 	if (is_uv_system()) {
>-		unsigned int cpu;
>-
> 		cpu = smp_processor_id();
> 		cpumask = uv_flush_tlb_others(cpumask, mm, start, end, cpu);
> 		if (cpumask)
>@@ -271,6 +290,19 @@ void native_flush_tlb_others(const struct cpumask
>*cpumask,
> 								&info, 1);
> 		return;
> 	}
>+
>+	/*
>+	 * Instead of sending IPIs to CPUs in lazy TLB mode, move that
>+	 * CPUs TLB state to TLBSTATE_FLUSH, causing the TLB to be flushed
>+	 * at the next context switch.
>+	 */
>+	for_each_cpu(cpu, cpumask) {
>+		if (per_cpu(cpu_tlbstate.state, cpu) != TLBSTATE_OK) {
>+			set_lazy_tlbstate_flush(cpu);
>+			cpumask_clear_cpu(cpu, (struct cpumask *)cpumask);
>+		}
>+	}
>+
> 	smp_call_function_many(cpumask, flush_tlb_func, &info, 1);
> }
> 

Why grabbing a lock instead of cmpxchg?
-- 
Sent from my Android device with K-9 Mail. Please excuse brevity and formatting.


* Re: [PATCH RFC UGLY] x86,mm,sched: make lazy TLB mode even lazier
  2016-08-25 19:42 ` H. Peter Anvin
@ 2016-08-25 19:52   ` Rik van Riel
  2016-08-25 20:15   ` [PATCH RFC v2] " Rik van Riel
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 22+ messages in thread
From: Rik van Riel @ 2016-08-25 19:52 UTC (permalink / raw)
  To: H. Peter Anvin, serebrin
  Cc: mingo, peterz, linux-kernel, luto, bp, mgorman, tglx

On Thu, 2016-08-25 at 12:42 -0700, H. Peter Anvin wrote:
> On August 25, 2016 12:04:59 PM PDT, Rik van Riel <riel@redhat.com>
> wrote:
> > Subject: x86,mm,sched: make lazy TLB mode even lazier
> > 
> > Lazy TLB mode can result in an idle CPU being woken up for a TLB
> > flush, when all it really needed to do was flush %cr3 before the
> > next context switch.
> > 
> > This is mostly fine on bare metal, though sub-optimal from a power
> > saving point of view, and deeper C states could make TLB flushes
> > take a little longer than desired.
> > 
> > On virtual machines, the pain can be much worse, especially if a
> > currently non-running VCPU is woken up for a TLB invalidation
> > IPI, on a CPU that is busy running another task. It could take
> > a while before that IPI is handled, leading to performance issues.
> > 
> > This patch is still ugly, and the sched.h include needs to be
> > cleaned
> > up a lot (how would the scheduler people like to see the context
> > switch
> > blocking abstracted?)
> > 
> > This patch deals with the issue by introducing a third tlb state,
> > TLBSTATE_FLUSH, which causes %cr3 to be flushed at the next
> > context switch. A CPU is transitioned from TLBSTATE_LAZY to
> > TLBSTATE_FLUSH with the rq lock held, to prevent context switches.
> > 
> > Nothing is done for a CPU that is already in TLBSTATE_FLUH mode.
> > 
> > This patch is totally untested, because I am at a conference right
> > now, and Benjamin has the test case :)
> > 
> > Signed-off-by: Rik van Riel <riel@redhat.com>
> > Reported-by: Benjamin Serebrin <serebrin@google.com>
> > ---
> > arch/x86/include/asm/tlbflush.h |  1 +
> > arch/x86/mm/tlb.c               | 38
> > +++++++++++++++++++++++++++++++++++---
> > 2 files changed, 36 insertions(+), 3 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/tlbflush.h
> > b/arch/x86/include/asm/tlbflush.h
> > index 4e5be94e079a..5ae8e4b174f8 100644
> > --- a/arch/x86/include/asm/tlbflush.h
> > +++ b/arch/x86/include/asm/tlbflush.h
> > @@ -310,6 +310,7 @@ void native_flush_tlb_others(const struct
> > cpumask
> > *cpumask,
> > 
> > #define TLBSTATE_OK	1
> > #define TLBSTATE_LAZY	2
> > +#define TLBSTATE_FLUSH	3
> > 
> > static inline void reset_lazy_tlbstate(void)
> > {
> > diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> > index 5643fd0b1a7d..5b4cda49ac0c 100644
> > --- a/arch/x86/mm/tlb.c
> > +++ b/arch/x86/mm/tlb.c
> > @@ -6,6 +6,7 @@
> > #include <linux/interrupt.h>
> > #include <linux/module.h>
> > #include <linux/cpu.h>
> > +#include "../../../kernel/sched/sched.h"
> > 
> > #include <asm/tlbflush.h>
> > #include <asm/mmu_context.h>
> > @@ -140,10 +141,12 @@ void switch_mm_irqs_off(struct mm_struct
> > *prev,
> > struct mm_struct *next,
> > 	}
> > #ifdef CONFIG_SMP
> > 	  else {
> > +		int oldstate = this_cpu_read(cpu_tlbstate.state);
> > 		this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
> > 		BUG_ON(this_cpu_read(cpu_tlbstate.active_mm) != next);
> > 
> > -		if (!cpumask_test_cpu(cpu, mm_cpumask(next))) {
> > +		if (oldstate == TLBSTATE_FLUSH ||
> > +				!cpumask_test_cpu(cpu,
> > mm_cpumask(next))) {
> > 			/*
> > 			 * On established mms, the mm_cpumask is only
> > changed
> > 			 * from irq context, from ptep_clear_flush()
> > while in
> > @@ -242,11 +245,29 @@ static void flush_tlb_func(void *info)
> > 
> > }
> > 
> > +/*
> > + * This function moves a CPU from TLBSTATE_LAZY to TLBSTATE_FLUSH,
> > which
> > + * will force it to flush %cr3 at the next context switch,
> > effectively
> > + * doing a delayed TLB flush for a CPU in lazy TLB mode.
> > + * This takes the runqueue lock to protect against the race
> > condition
> > + * of the target CPU rescheduling while we change its TLB state.
> > + * Do nothing if the TLB state is already set to TLBSTATE_FLUSH.
> > + */
> > +static void set_lazy_tlbstate_flush(int cpu) {
> > +	if (per_cpu(cpu_tlbstate.state, cpu) == TLBSTATE_LAZY) {
> > +		raw_spin_lock(&cpu_rq(cpu)->lock);
> > +		if (per_cpu(cpu_tlbstate.state, cpu) ==
> > TLBSTATE_LAZY)
> > +			per_cpu(cpu_tlbstate.state, cpu) =
> > TLBSTATE_FLUSH;
> > +		raw_spin_unlock(&cpu_rq(cpu)->lock);
> > +	}
> > +}
> > +
> > void native_flush_tlb_others(const struct cpumask *cpumask,
> > 				 struct mm_struct *mm, unsigned long
> > start,
> > 				 unsigned long end)
> > {
> > 	struct flush_tlb_info info;
> > +	unsigned int cpu;
> > 
> > 	if (end == 0)
> > 		end = start + PAGE_SIZE;
> > @@ -262,8 +283,6 @@ void native_flush_tlb_others(const struct
> > cpumask
> > *cpumask,
> > 				(end - start) >> PAGE_SHIFT);
> > 
> > 	if (is_uv_system()) {
> > -		unsigned int cpu;
> > -
> > 		cpu = smp_processor_id();
> > 		cpumask = uv_flush_tlb_others(cpumask, mm, start, end,
> > cpu);
> > 		if (cpumask)
> > @@ -271,6 +290,19 @@ void native_flush_tlb_others(const struct
> > cpumask
> > *cpumask,
> > 								&info,
> > 1);
> > 		return;
> > 	}
> > +
> > +	/*
> > +	 * Instead of sending IPIs to CPUs in lazy TLB mode, move
> > that
> > +	 * CPUs TLB state to TLBSTATE_FLUSH, causing the TLB to be
> > flushed
> > +	 * at the next context switch.
> > +	 */
> > +	for_each_cpu(cpu, cpumask) {
> > +		if (per_cpu(cpu_tlbstate.state, cpu) !=
> > TLBSTATE_OK) {
> > +			set_lazy_tlbstate_flush(cpu);
> > +			cpumask_clear_cpu(cpu, (struct cpumask
> > *)cpumask);
> > +		}
> > +	}
> > +
> > 	smp_call_function_many(cpumask, flush_tlb_func, &info, 1);
> > }
> > 
> 
> Why grabbing a lock instead of cmpxchg?

Good point, cmpxchg would work for doing a
LAZY -> FLUSH transition.

Additionally, my RFC patch has a race condition,
in that it checks for !TLBSTATE_OK outside of the
lock, and clears the CPU from the cpumask outside
of the lock.

We can indeed do the LAZY -> FLUSH transition with
cmpxchg, and I should do that.

We do not need to check against a FLUSH -> OK
transition that happens while we are in
native_flush_tlb_others, because the page tables
will have been modified before, and the TLB is
flushed at context switch time.

We do not need to worry about not catching an
OK -> LAZY transition, either. In that case,
the CPU will simply stay in the bitmap, and
get a TLB flush IPI.

We can also continue doing nothing if a CPU
is already in TLBSTATE_FLUSH mode.

However, a LAZY -> OK transition needs to be
caught, because if that happens during a
native_flush_tlb_others, a CPU may do a context
switch without a TLB flush.

Your cmpxchg idea would catch that last transition,
and allow the CPU to stay in the bitmap, so it can
have its TLB flushed via an IPI.

I will change set_lazy_tlbstate_flush from a void
to a bool, use cmpxchg, and do the cpumask_clear_cpu
only if the cmpxchg from LAZY -> FLUSH succeeds.
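
Something like this (still untested, just to show the shape I have
in mind):

	static bool set_lazy_tlbstate_flush(int cpu)
	{
		int *tlbstate = &per_cpu(cpu_tlbstate.state, cpu);

		/*
		 * True means the IPI can be skipped: either the cmpxchg
		 * moved the CPU from LAZY to FLUSH, or it was already in
		 * FLUSH. A CPU that raced back to TLBSTATE_OK gets the IPI.
		 */
		return cmpxchg(tlbstate, TLBSTATE_LAZY, TLBSTATE_FLUSH) !=
								TLBSTATE_OK;
	}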

Thanks for the feedback, Peter!

This also gets rid of the ugly bits of internal
scheduler knowledge.


* [PATCH RFC v2] x86,mm,sched: make lazy TLB mode even lazier
  2016-08-25 19:42 ` H. Peter Anvin
  2016-08-25 19:52   ` Rik van Riel
@ 2016-08-25 20:15   ` Rik van Riel
  2016-08-25 21:01   ` [PATCH RFC v3] " Rik van Riel
  2016-08-29 16:08   ` [PATCH RFC UGLY] " Rik van Riel
  3 siblings, 0 replies; 22+ messages in thread
From: Rik van Riel @ 2016-08-25 20:15 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: serebrin, mingo, peterz, linux-kernel, luto, bp, mgorman, tglx

Now using cmpxchg and closing a race, thanks to Peter Anvin.

---8<---

Subject: x86,mm,sched: make lazy TLB mode even lazier

Lazy TLB mode can result in an idle CPU being woken up for a TLB
flush, when all it really needed to do was flush %cr3 before the
next context switch.

This is mostly fine on bare metal, though sub-optimal from a power
saving point of view, and deeper C states could make TLB flushes
take a little longer than desired.

On virtual machines, the pain can be much worse, especially if a
currently non-running VCPU is woken up for a TLB invalidation
IPI, on a CPU that is busy running another task. It could take
a while before that IPI is handled, leading to performance issues.

This patch deals with the issue by introducing a third tlb state,
TLBSTATE_FLUSH, which causes %cr3 to be flushed at the next
context switch.

If there was a race transitioning a CPU from TLBSTATE_LAZY to
TLBSTATE_FLUSH, an invalidation IPI will be sent.

Nothing is done for a CPU that is already in TLBSTATE_FLUSH mode.

This patch is totally untested, because I am at a conference right
now, and Benjamin has the test case :)

Benjamin, does this help your issue?

Signed-off-by: Rik van Riel <riel@redhat.com>
Reported-by: Benjamin Serebrin <serebrin@google.com>
---
 arch/x86/include/asm/tlbflush.h |  1 +
 arch/x86/mm/tlb.c               | 41 ++++++++++++++++++++++++++++++++++++++---
 2 files changed, 39 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 4e5be94e079a..5ae8e4b174f8 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -310,6 +310,7 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 
 #define TLBSTATE_OK	1
 #define TLBSTATE_LAZY	2
+#define TLBSTATE_FLUSH	3
 
 static inline void reset_lazy_tlbstate(void)
 {
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 5643fd0b1a7d..8dcc0947681c 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -140,10 +140,12 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 	}
 #ifdef CONFIG_SMP
 	  else {
+		int oldstate = this_cpu_read(cpu_tlbstate.state);
 		this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
 		BUG_ON(this_cpu_read(cpu_tlbstate.active_mm) != next);
 
-		if (!cpumask_test_cpu(cpu, mm_cpumask(next))) {
+		if (oldstate == TLBSTATE_FLUSH ||
+				!cpumask_test_cpu(cpu, mm_cpumask(next))) {
 			/*
 			 * On established mms, the mm_cpumask is only changed
 			 * from irq context, from ptep_clear_flush() while in
@@ -242,11 +244,33 @@ static void flush_tlb_func(void *info)
 
 }
 
+/*
+ * This function moves a CPU from TLBSTATE_LAZY to TLBSTATE_FLUSH, which
+ * will force it to flush %cr3 at the next context switch, effectively
+ * doing a delayed TLB flush for a CPU in lazy TLB mode.
+ * Do nothing if the TLB state is already set to TLBSTATE_FLUSH.
+ */
+static bool set_lazy_tlbstate_flush(int cpu) {
+	int *tlbstate = &per_cpu(cpu_tlbstate.state, cpu);
+	bool skipflush = true;
+	if (*tlbstate == TLBSTATE_LAZY) {
+		/*
+		 * If cmpxchg fails, the CPU may have context switched from
+		 * TLBSTATE_LAZY to TLBSTATE_OK. Send a TLB flush IPI.
+		 */
+		if (cmpxchg(tlbstate, TLBSTATE_LAZY, TLBSTATE_FLUSH) !=
+								TLBSTATE_LAZY)
+			skipflush = false;
+	}
+	return skipflush;
+}
+
 void native_flush_tlb_others(const struct cpumask *cpumask,
 				 struct mm_struct *mm, unsigned long start,
 				 unsigned long end)
 {
 	struct flush_tlb_info info;
+	unsigned int cpu;
 
 	if (end == 0)
 		end = start + PAGE_SIZE;
@@ -262,8 +286,6 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 				(end - start) >> PAGE_SHIFT);
 
 	if (is_uv_system()) {
-		unsigned int cpu;
-
 		cpu = smp_processor_id();
 		cpumask = uv_flush_tlb_others(cpumask, mm, start, end, cpu);
 		if (cpumask)
@@ -271,6 +293,19 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 								&info, 1);
 		return;
 	}
+
+	/*
+	 * Instead of sending IPIs to CPUs in lazy TLB mode, move that
+	 * CPUs TLB state to TLBSTATE_FLUSH, causing the TLB to be flushed
+	 * at the next context switch.
+	 */
+	for_each_cpu(cpu, cpumask) {
+		if (per_cpu(cpu_tlbstate.state, cpu) != TLBSTATE_OK) {
+			if (set_lazy_tlbstate_flush(cpu))
+				cpumask_clear_cpu(cpu, (struct cpumask *)cpumask);
+		}
+	}
+
 	smp_call_function_many(cpumask, flush_tlb_func, &info, 1);
 }
 


* [PATCH RFC v3] x86,mm,sched: make lazy TLB mode even lazier
  2016-08-25 19:42 ` H. Peter Anvin
  2016-08-25 19:52   ` Rik van Riel
  2016-08-25 20:15   ` [PATCH RFC v2] " Rik van Riel
@ 2016-08-25 21:01   ` Rik van Riel
  2016-08-27  8:03     ` Ingo Molnar
  2016-08-29 16:08   ` [PATCH RFC UGLY] " Rik van Riel
  3 siblings, 1 reply; 22+ messages in thread
From: Rik van Riel @ 2016-08-25 21:01 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: serebrin, mingo, peterz, linux-kernel, luto, bp, mgorman, tglx

On Thu, 25 Aug 2016 12:42:15 -0700
"H. Peter Anvin" <hpa@zytor.com> wrote:

> Why grabbing a lock instead of cmpxchg?

... and some more cleanups later, this might actually be
good to merge, assuming it works for Benjamin :)

---8<---

Subject: x86,mm,sched: make lazy TLB mode even lazier

Lazy TLB mode can result in an idle CPU being woken up for a TLB
flush, when all it really needed to do was flush %cr3 before the
next context switch.

This is mostly fine on bare metal, though sub-optimal from a power
saving point of view, and deeper C states could make TLB flushes
take a little longer than desired.

On virtual machines, the pain can be much worse, especially if a
currently non-running VCPU is woken up for a TLB invalidation
IPI, on a CPU that is busy running another task. It could take
a while before that IPI is handled, leading to performance issues.

This patch deals with the issue by introducing a third tlb state,
TLBSTATE_FLUSH, which causes %cr3 to be flushed at the next
context switch.

A CPU that transitions from TLBSTATE_LAZY to TLBSTATE_OK during
the attempted transition to TLBSTATE_FLUSH will get a TLB flush
IPI, just like a CPU that was in TLBSTATE_OK to begin with.

Nothing is done for a CPU that is already in TLBSTATE_FLUSH mode.

This patch is totally untested, because I am at a conference right
now, and Benjamin has the test case :)

Benjamin, does this help your issue?

Signed-off-by: Rik van Riel <riel@redhat.com>
Reported-by: Benjamin Serebrin <serebrin@google.com>
---
 arch/x86/include/asm/tlbflush.h |  1 +
 arch/x86/mm/tlb.c               | 47 ++++++++++++++++++++++++++++++++++++++---
 2 files changed, 45 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 4e5be94e079a..5ae8e4b174f8 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -310,6 +310,7 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 
 #define TLBSTATE_OK	1
 #define TLBSTATE_LAZY	2
+#define TLBSTATE_FLUSH	3
 
 static inline void reset_lazy_tlbstate(void)
 {
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 5643fd0b1a7d..4352db65a129 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -140,10 +140,12 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 	}
 #ifdef CONFIG_SMP
 	  else {
+		int oldstate = this_cpu_read(cpu_tlbstate.state);
 		this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
 		BUG_ON(this_cpu_read(cpu_tlbstate.active_mm) != next);
 
-		if (!cpumask_test_cpu(cpu, mm_cpumask(next))) {
+		if (oldstate == TLBSTATE_FLUSH ||
+				!cpumask_test_cpu(cpu, mm_cpumask(next))) {
 			/*
 			 * On established mms, the mm_cpumask is only changed
 			 * from irq context, from ptep_clear_flush() while in
@@ -242,11 +244,42 @@ static void flush_tlb_func(void *info)
 
 }
 
+/*
+ * Determine whether a CPU's TLB needs to be flushed now, or whether the
+ * flush can be delayed until the next context switch, by changing the
+ * tlbstate from TLBSTATE_LAZY to TLBSTATE_FLUSH.
+ */
+static bool lazy_tlb_can_skip_flush(int cpu) {
+	int *tlbstate = &per_cpu(cpu_tlbstate.state, cpu);
+	int old;
+
+	/* A task on the CPU is actively using the mm. Flush the TLB. */
+	if (*tlbstate == TLBSTATE_OK)
+		return false;
+
+	/* The TLB will be flushed on the next context switch. */
+	if (*tlbstate == TLBSTATE_FLUSH)
+		return true;
+
+	/*
+	 * The CPU is in TLBSTATE_LAZY, which could context switch back
+	 * to TLBSTATE_OK, re-using the TLB state without a TLB flush.
+	 * In that case, a TLB flush IPI needs to be sent.
+	 *
+	 * Otherwise, the TLB state is now TLBSTATE_FLUSH, and the
+	 * TLB flush IPI can be skipped.
+	 */
+	old = cmpxchg(tlbstate, TLBSTATE_LAZY, TLBSTATE_FLUSH);
+
+	return old != TLBSTATE_OK;
+}
+
 void native_flush_tlb_others(const struct cpumask *cpumask,
 				 struct mm_struct *mm, unsigned long start,
 				 unsigned long end)
 {
 	struct flush_tlb_info info;
+	unsigned int cpu;
 
 	if (end == 0)
 		end = start + PAGE_SIZE;
@@ -262,8 +295,6 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 				(end - start) >> PAGE_SHIFT);
 
 	if (is_uv_system()) {
-		unsigned int cpu;
-
 		cpu = smp_processor_id();
 		cpumask = uv_flush_tlb_others(cpumask, mm, start, end, cpu);
 		if (cpumask)
@@ -271,6 +302,16 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 								&info, 1);
 		return;
 	}
+
+	/*
+	 * Instead of sending IPIs to CPUs in lazy TLB mode, move that
+	 * CPUs TLB state to TLBSTATE_FLUSH, causing the TLB to be flushed
+	 * at the next context switch.
+	 */
+	for_each_cpu(cpu, cpumask)
+		if (lazy_tlb_can_skip_flush(cpu))
+			cpumask_clear_cpu(cpu, (struct cpumask *)cpumask);
+
 	smp_call_function_many(cpumask, flush_tlb_func, &info, 1);
 }
 


 


* Re: [PATCH RFC v3] x86,mm,sched: make lazy TLB mode even lazier
  2016-08-25 21:01   ` [PATCH RFC v3] " Rik van Riel
@ 2016-08-27  8:03     ` Ingo Molnar
  2016-08-27 23:02       ` Linus Torvalds
  2016-08-29 15:24       ` [PATCH RFC v3] " Rik van Riel
  0 siblings, 2 replies; 22+ messages in thread
From: Ingo Molnar @ 2016-08-27  8:03 UTC (permalink / raw)
  To: Rik van Riel
  Cc: H. Peter Anvin, serebrin, peterz, linux-kernel, luto, bp,
	mgorman, tglx, Linus Torvalds


* Rik van Riel <riel@redhat.com> wrote:

> On Thu, 25 Aug 2016 12:42:15 -0700
> "H. Peter Anvin" <hpa@zytor.com> wrote:
> 
> > Why grabbing a lock instead of cmpxchg?
> 
> ... and some more cleanups later, this might actually be
> good to merge, assuming it works for Benjamin :)
> 
> ---8<---

LGTM in principle (it's a pretty clever trick!), just some minor stylistic nits:

> 
> Subject: x86,mm,sched: make lazy TLB mode even lazier
> 
> Lazy TLB mode can result in an idle CPU being woken up for a TLB
> flush, when all it really needed to do was flush %cr3 before the
> next context switch.

s/%cr3/CR3

> This is mostly fine on bare metal, though sub-optimal from a power
> saving point of view, and deeper C states could make TLB flushes
> take a little longer than desired.

s/C state/C-state/

> On virtual machines, the pain can be much worse, especially if a
> currently non-running VCPU is woken up for a TLB invalidation
> IPI, on a CPU that is busy running another task. It could take
> a while before that IPI is handled, leading to performance issues.

So I was going to suggest:

 s/VCPU/vCPU

But then, out of curiosity, I ran:

   triton:~/tip> for N in $(git log v3.0.. | grep -i vcpu | sed 's/[^a-zA-Z\d\s]/ /g'); do echo $N; done | grep -iw vcpu  | sort | uniq -c | sort -n
      3 Vcpu
    125 vCPU
    527 VCPU
   1611 vcpu

So never mind, I guess we missed that boat by a few years already ;-)

> 
> This patch deals with the issue by introducing a third tlb state,
> TLBSTATE_FLUSH, which causes %cr3 to be flushed at the next
> context switch.

s/tlb state/TLB state

> 
> A CPU that transitions from TLBSTATE_LAZY to TLBSTATE_OK during
> the attempted transition to TLBSTATE_FLUSH will get a TLB flush
> IPI, just like a CPU that was in TLBSTATE_OK to begin with.
> 
> Nothing is done for a CPU that is already in TLBSTATE_FLUSH mode.
> 
> This patch is totally untested, because I am at a conference right
> now, and Benjamin has the test case :)
> 
> Benjamin, does this help your issue?
>
> Signed-off-by: Rik van Riel <riel@redhat.com>
> Reported-by: Benjamin Serebrin <serebrin@google.com>
> ---
>  arch/x86/include/asm/tlbflush.h |  1 +
>  arch/x86/mm/tlb.c               | 47 ++++++++++++++++++++++++++++++++++++++---
>  2 files changed, 45 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> index 4e5be94e079a..5ae8e4b174f8 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -310,6 +310,7 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
>  
>  #define TLBSTATE_OK	1
>  #define TLBSTATE_LAZY	2
> +#define TLBSTATE_FLUSH	3
>  
>  static inline void reset_lazy_tlbstate(void)
>  {
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index 5643fd0b1a7d..4352db65a129 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -140,10 +140,12 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
>  	}
>  #ifdef CONFIG_SMP
>  	  else {
> +		int oldstate = this_cpu_read(cpu_tlbstate.state);
>  		this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);

Please separate local variable definition blocks from statements by an extra 
newline.

>  		BUG_ON(this_cpu_read(cpu_tlbstate.active_mm) != next);
>  
> -		if (!cpumask_test_cpu(cpu, mm_cpumask(next))) {
> +		if (oldstate == TLBSTATE_FLUSH ||
> +				!cpumask_test_cpu(cpu, mm_cpumask(next))) {
>  			/*
>  			 * On established mms, the mm_cpumask is only changed
>  			 * from irq context, from ptep_clear_flush() while in
> @@ -242,11 +244,42 @@ static void flush_tlb_func(void *info)
>  
>  }
>  
> +/*
> + * Determine whether a CPU's TLB needs to be flushed now, or whether the
> + * flush can be delayed until the next context switch, by changing the
> + * tlbstate from TLBSTATE_LAZY to TLBSTATE_FLUSH.
> + */
> +static bool lazy_tlb_can_skip_flush(int cpu) {
> +	int *tlbstate = &per_cpu(cpu_tlbstate.state, cpu);
> +	int old;

Yeah, so this is not what a standard function definition looks like.
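
I.e. the opening brace of a function definition goes on a line of
its own:

	static bool lazy_tlb_can_skip_flush(int cpu)
	{
		/* ... body unchanged ... */
	}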

> +	/* A task on the CPU is actively using the mm. Flush the TLB. */
> +	if (*tlbstate == TLBSTATE_OK)
> +		return false;
> +
> +	/* The TLB will be flushed on the next context switch. */
> +	if (*tlbstate == TLBSTATE_FLUSH)
> +		return true;
> +
> +	/*
> +	 * The CPU is in TLBSTATE_LAZY, which could context switch back
> +	 * to TLBSTATE_OK, re-using the TLB state without a TLB flush.
> +	 * In that case, a TLB flush IPI needs to be sent.
> +	 *
> +	 * Otherwise, the TLB state is now TLBSTATE_FLUSH, and the
> +	 * TLB flush IPI can be skipped.
> +	 */
> +	old = cmpxchg(tlbstate, TLBSTATE_LAZY, TLBSTATE_FLUSH);
> +
> +	return old != TLBSTATE_OK;
> +}
> +
>  void native_flush_tlb_others(const struct cpumask *cpumask,
>  				 struct mm_struct *mm, unsigned long start,
>  				 unsigned long end)
>  {
>  	struct flush_tlb_info info;
> +	unsigned int cpu;
>  
>  	if (end == 0)
>  		end = start + PAGE_SIZE;
> @@ -262,8 +295,6 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
>  				(end - start) >> PAGE_SHIFT);
>  
>  	if (is_uv_system()) {
> -		unsigned int cpu;
> -
>  		cpu = smp_processor_id();
>  		cpumask = uv_flush_tlb_others(cpumask, mm, start, end, cpu);
>  		if (cpumask)
> @@ -271,6 +302,16 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
>  								&info, 1);
>  		return;
>  	}
> +
> +	/*
> +	 * Instead of sending IPIs to CPUs in lazy TLB mode, move that
> +	 * CPUs TLB state to TLBSTATE_FLUSH, causing the TLB to be flushed
> +	 * at the next context switch.

s/CPUs/CPU's

> +	 */
> +	for_each_cpu(cpu, cpumask)
> +		if (lazy_tlb_can_skip_flush(cpu))
> +			cpumask_clear_cpu(cpu, (struct cpumask *)cpumask);

Please remove the 'const' from the cpumask type definition instead of this ugly 
cast!

Plus this needs one more pair of curly braces and it's perfect! ;-)

I'd also like to wait for the Tested-by from Benjamin before we can
proceed.

Thanks,

	Ingo


* Re: [PATCH RFC v3] x86,mm,sched: make lazy TLB mode even lazier
  2016-08-27  8:03     ` Ingo Molnar
@ 2016-08-27 23:02       ` Linus Torvalds
  2016-08-30 19:53         ` [PATCH RFC v4] " Rik van Riel
  2016-08-31 16:27         ` [PATCH RFC v6] " Rik van Riel
  2016-08-29 15:24       ` [PATCH RFC v3] " Rik van Riel
  1 sibling, 2 replies; 22+ messages in thread
From: Linus Torvalds @ 2016-08-27 23:02 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Rik van Riel, H. Peter Anvin, serebrin, Peter Zijlstra,
	Linux Kernel Mailing List, Andrew Lutomirski, Borislav Petkov,
	Mel Gorman, Thomas Gleixner

Yeah, with those small fixes from Ingo, I definitely don't think this
looks hacky at all. This all seems to be exactly what we should always
have done.

The only remaining comment is that I'd make that
lazy_tlb_can_skip_flush() function just use a switch table for the
tlbstate comparisons rather than the repeated conditionals.
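
Something along these lines (untested, just to illustrate the shape):

	static bool lazy_tlb_can_skip_flush(int cpu)
	{
		int *tlbstate = &per_cpu(cpu_tlbstate.state, cpu);

		switch (*tlbstate) {
		case TLBSTATE_FLUSH:
			/* Already queued; the next context switch flushes. */
			return true;
		case TLBSTATE_LAZY:
			/*
			 * Try to defer the flush to the next context switch;
			 * if the CPU raced back to TLBSTATE_OK, send the IPI
			 * after all.
			 */
			return cmpxchg(tlbstate, TLBSTATE_LAZY,
				       TLBSTATE_FLUSH) != TLBSTATE_OK;
		default:
			/* TLBSTATE_OK: a task is actively using the mm. */
			return false;
		}
	}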

I'd love to see the results from Benjamin - maybe it helps a lot, and
maybe it doesn't. But regardless, the patch makes sense to me.

So with the small fixups: Ack.

             Linus


* Re: [PATCH RFC UGLY] x86,mm,sched: make lazy TLB mode even lazier
  2016-08-25 19:04 [PATCH RFC UGLY] x86,mm,sched: make lazy TLB mode even lazier Rik van Riel
  2016-08-25 19:42 ` H. Peter Anvin
@ 2016-08-28  8:11 ` Andy Lutomirski
  2016-08-29 14:54   ` Rik van Riel
  1 sibling, 1 reply; 22+ messages in thread
From: Andy Lutomirski @ 2016-08-28  8:11 UTC (permalink / raw)
  To: Rik van Riel
  Cc: H. Peter Anvin, Mel Gorman, linux-kernel, Thomas Gleixner,
	Peter Zijlstra, Benjamin Serebrin, Ingo Molnar, Borislav Petkov

On Aug 25, 2016 9:06 PM, "Rik van Riel" <riel@redhat.com> wrote:
>
> Subject: x86,mm,sched: make lazy TLB mode even lazier
>
> Lazy TLB mode can result in an idle CPU being woken up for a TLB
> flush, when all it really needed to do was flush %cr3 before the
> next context switch.
>
> This is mostly fine on bare metal, though sub-optimal from a power
> saving point of view, and deeper C states could make TLB flushes
> take a little longer than desired.
>
> On virtual machines, the pain can be much worse, especially if a
> currently non-running VCPU is woken up for a TLB invalidation
> IPI, on a CPU that is busy running another task. It could take
> a while before that IPI is handled, leading to performance issues.
>
> This patch is still ugly, and the sched.h include needs to be cleaned
> up a lot (how would the scheduler people like to see the context switch
> blocking abstracted?)
>
> This patch deals with the issue by introducing a third tlb state,
> TLBSTATE_FLUSH, which causes %cr3 to be flushed at the next
> context switch. A CPU is transitioned from TLBSTATE_LAZY to
> TLBSTATE_FLUSH with the rq lock held, to prevent context switches.
>
> Nothing is done for a CPU that is already in TLBSTATE_FLUH mode.
>
> This patch is totally untested, because I am at a conference right
> now, and Benjamin has the test case :)
>

I haven't had a chance to seriously read the code yet, but what
happens when the mm is deleted outright?  Or is the idea that a
reference is held until all the lazy users are gone, too?

On PCID systems (still need to get that code upstream...), I wonder if
we could go the other way and stop being lazy, as cr3 writes can be
much faster.

--Andy


* Re: [PATCH RFC UGLY] x86,mm,sched: make lazy TLB mode even lazier
  2016-08-28  8:11 ` Andy Lutomirski
@ 2016-08-29 14:54   ` Rik van Riel
  2016-08-29 23:55     ` Andy Lutomirski
  0 siblings, 1 reply; 22+ messages in thread
From: Rik van Riel @ 2016-08-29 14:54 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: H. Peter Anvin, Mel Gorman, linux-kernel, Thomas Gleixner,
	Peter Zijlstra, Benjamin Serebrin, Ingo Molnar, Borislav Petkov


On Sun, 2016-08-28 at 01:11 -0700, Andy Lutomirski wrote:
> On Aug 25, 2016 9:06 PM, "Rik van Riel" <riel@redhat.com> wrote:
> > 
> > Subject: x86,mm,sched: make lazy TLB mode even lazier
> > 
> > Lazy TLB mode can result in an idle CPU being woken up for a TLB
> > flush, when all it really needed to do was flush %cr3 before the
> > next context switch.
> > 
> > This is mostly fine on bare metal, though sub-optimal from a power
> > saving point of view, and deeper C states could make TLB flushes
> > take a little longer than desired.
> > 
> > On virtual machines, the pain can be much worse, especially if a
> > currently non-running VCPU is woken up for a TLB invalidation
> > IPI, on a CPU that is busy running another task. It could take
> > a while before that IPI is handled, leading to performance issues.
> > 
> > This patch is still ugly, and the sched.h include needs to be
> > cleaned
> > up a lot (how would the scheduler people like to see the context
> > switch
> > blocking abstracted?)
> > 
> > This patch deals with the issue by introducing a third tlb state,
> > TLBSTATE_FLUSH, which causes %cr3 to be flushed at the next
> > context switch. A CPU is transitioned from TLBSTATE_LAZY to
> > TLBSTATE_FLUSH with the rq lock held, to prevent context switches.
> > 
> > Nothing is done for a CPU that is already in TLBSTATE_FLUH mode.
> > 
> > This patch is totally untested, because I am at a conference right
> > now, and Benjamin has the test case :)
> > 
> 
> I haven't had a chance to seriously read the code yet, but what
> happens when the mm is deleted outright?  Or is the idea that a
> reference is held until all the lazy users are gone, too?

Worst case we send a TLB flush to a CPU that does
not need it.

As not sending an IPI will be faster than sending
one, I do not think the tradeoff will be much
different for a system with PCID.

-- 

All Rights Reversed.



* Re: [PATCH RFC v3] x86,mm,sched: make lazy TLB mode even lazier
  2016-08-27  8:03     ` Ingo Molnar
  2016-08-27 23:02       ` Linus Torvalds
@ 2016-08-29 15:24       ` Rik van Riel
  1 sibling, 0 replies; 22+ messages in thread
From: Rik van Riel @ 2016-08-29 15:24 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: H. Peter Anvin, serebrin, peterz, linux-kernel, luto, bp,
	mgorman, tglx, Linus Torvalds


On Sat, 2016-08-27 at 10:03 +0200, Ingo Molnar wrote:
> * Rik van Riel <riel@redhat.com> wrote:
> 
> > On Thu, 25 Aug 2016 12:42:15 -0700
> > "H. Peter Anvin" <hpa@zytor.com> wrote:
> > 
> > > Why grabbing a lock instead of cmpxchg?
> > 
> > ... and some more cleanups later, this might actually be
> > good to merge, assuming it works for Benjamin :)
> > 
> > ---8<---
> 
> LGTM in principle (it's a pretty clever trick!), just some minor
> stylistic nits:

Thanks for the review. I have applied the stylistic nits, and
turned lazy_tlb_can_skip_flush into a big switch statement as
suggested by Linus.

> > +	 */
> > +	for_each_cpu(cpu, cpumask)
> > +		if (lazy_tlb_can_skip_flush(cpu))
> > +			cpumask_clear_cpu(cpu, (struct cpumask
> > *)cpumask);
> 
> Please remove the 'const' from the cpumask type definition instead of
> this ugly 
> cast!

I played around with this on Thursday already, and ran out of
time to clean that up before going to the next talk. This will
be fixed in the next version. 

> I'd also like to wait for the Tested-by from Benjamin as well before
> we can 
> proceeed.

Agreed.

Ben, a new version is coming up real soon.

-- 

All Rights Reversed.



* Re: [PATCH RFC UGLY] x86,mm,sched: make lazy TLB mode even lazier
  2016-08-25 19:42 ` H. Peter Anvin
                     ` (2 preceding siblings ...)
  2016-08-25 21:01   ` [PATCH RFC v3] " Rik van Riel
@ 2016-08-29 16:08   ` Rik van Riel
  3 siblings, 0 replies; 22+ messages in thread
From: Rik van Riel @ 2016-08-29 16:08 UTC (permalink / raw)
  To: H. Peter Anvin, serebrin
  Cc: mingo, peterz, torvalds, linux-kernel, luto, bp, mgorman, tglx


On Thu, 2016-08-25 at 12:42 -0700, H. Peter Anvin wrote:

> > +static void set_lazy_tlbstate_flush(int cpu) {
> > +	if (per_cpu(cpu_tlbstate.state, cpu) == TLBSTATE_LAZY) {
> > +		raw_spin_lock(&cpu_rq(cpu)->lock);
> > +		if (per_cpu(cpu_tlbstate.state, cpu) ==
> > TLBSTATE_LAZY)
> > +			per_cpu(cpu_tlbstate.state, cpu) =
> > TLBSTATE_FLUSH;
> > +		raw_spin_unlock(&cpu_rq(cpu)->lock);
> > +	}
> > +}
> > +
> > 
> Why grabbing a lock instead of cmpxchg?

The second and third version of the patch had cmpxchg,
instead of grabbing the remote CPU's runqueue lock,
but I am no longer convinced it is safe.

At TLB invalidation time, we have this:

        int *tlbstate = &per_cpu(cpu_tlbstate.state, cpu);
        int old;

        switch (*tlbstate) {
        case TLBSTATE_LAZY:
                /*
                 * The CPU is in TLBSTATE_LAZY, which could context switch back
                 * to TLBSTATE_OK, re-using the old TLB state without a flush.
                 * If that happened, send a TLB flush IPI.
                 *
                 * Otherwise, the state is now TLBSTATE_FLUSH, and TLB will
                 * be flushed at the next context switch. Skip the IPI.
                 */ 
                old = cmpxchg(tlbstate, TLBSTATE_LAZY, TLBSTATE_FLUSH);
                return old != TLBSTATE_OK;

At context switch time, we have this:

                int oldstate = this_cpu_read(cpu_tlbstate.state);

                this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
                BUG_ON(this_cpu_read(cpu_tlbstate.active_mm) != next);

                if (oldstate == TLBSTATE_FLUSH ||
                                !cpumask_test_cpu(cpu, mm_cpumask(next))) {

In each case, the read will happen before the write, because
they are to the same address.

If the invalidate and context switch happen concurrently,
the writes can be ordered in two directions:

1) The cmpxchg in the TLB flush code happens after the
this_cpu_write in the context switch code. This is safe.

2) The cmpxchg in the TLB flush code happens before the
this_cpu_write in the context switch code. This is broken.

I can see two ways to fix that:
1) Change the write in the context switch code to a
   cmpxchg. I do not know how expensive this is on
   modern CPUs, or whether the overhead of doing this
   is unacceptable (or even noticeable, considering the
   cache line needs to be acquired for write anyway).
2) Acquire the runqueue lock of the remote CPU from the
   (much rarer?) TLB flush code, in order to ensure it
   does not run concurrently with the context switch
   code.

Any preferences?
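
For option 1, the context switch side would look something like this
(untested sketch, not the final patch):

		int *tlbstate = this_cpu_ptr(&cpu_tlbstate.state);
		int oldstate = *tlbstate;

		/*
		 * Only the TLBSTATE_LAZY case needs the cmpxchg; if it
		 * fails, the flush code just moved us to TLBSTATE_FLUSH,
		 * and the CR3 reload below takes care of the flush.
		 */
		if (oldstate == TLBSTATE_LAZY)
			oldstate = cmpxchg(tlbstate, TLBSTATE_LAZY,
					   TLBSTATE_OK);

		this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
		BUG_ON(this_cpu_read(cpu_tlbstate.active_mm) != next);

		if (oldstate == TLBSTATE_FLUSH ||
				!cpumask_test_cpu(cpu, mm_cpumask(next))) {
			/* reload CR3 and flush, as before */
		}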

-- 

All Rights Reversed.



* Re: [PATCH RFC UGLY] x86,mm,sched: make lazy TLB mode even lazier
  2016-08-29 14:54   ` Rik van Riel
@ 2016-08-29 23:55     ` Andy Lutomirski
  2016-08-30  1:14       ` H. Peter Anvin
  0 siblings, 1 reply; 22+ messages in thread
From: Andy Lutomirski @ 2016-08-29 23:55 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Thomas Gleixner, Benjamin Serebrin, Borislav Petkov, Ingo Molnar,
	linux-kernel, Mel Gorman, H. Peter Anvin, Peter Zijlstra

On Aug 29, 2016 7:54 AM, "Rik van Riel" <riel@redhat.com> wrote:
>
> On Sun, 2016-08-28 at 01:11 -0700, Andy Lutomirski wrote:
> > On Aug 25, 2016 9:06 PM, "Rik van Riel" <riel@redhat.com> wrote:
> > >
> > > Subject: x86,mm,sched: make lazy TLB mode even lazier
> > >
> > > Lazy TLB mode can result in an idle CPU being woken up for a TLB
> > > flush, when all it really needed to do was flush %cr3 before the
> > > next context switch.
> > >
> > > This is mostly fine on bare metal, though sub-optimal from a power
> > > saving point of view, and deeper C states could make TLB flushes
> > > take a little longer than desired.
> > >
> > > On virtual machines, the pain can be much worse, especially if a
> > > currently non-running VCPU is woken up for a TLB invalidation
> > > IPI, on a CPU that is busy running another task. It could take
> > > a while before that IPI is handled, leading to performance issues.
> > >
> > > This patch is still ugly, and the sched.h include needs to be
> > > cleaned
> > > up a lot (how would the scheduler people like to see the context
> > > switch
> > > blocking abstracted?)
> > >
> > > This patch deals with the issue by introducing a third tlb state,
> > > TLBSTATE_FLUSH, which causes %cr3 to be flushed at the next
> > > context switch. A CPU is transitioned from TLBSTATE_LAZY to
> > > TLBSTATE_FLUSH with the rq lock held, to prevent context switches.
> > >
> > > Nothing is done for a CPU that is already in TLBSTATE_FLUH mode.
> > >
> > > This patch is totally untested, because I am at a conference right
> > > now, and Benjamin has the test case :)
> > >
> >
> > I haven't had a chance to seriously read the code yet, but what
> > happens when the mm is deleted outright?  Or is the idea that a
> > reference is held until all the lazy users are gone, too?
>
> Worst case we send a TLB flush to a CPU that does
> not need it.
>
> As not sending an IPI will be faster than sending
> one, I do not think the tradeoff will be much
> different for a system with PCID.

If we were fully non-lazy, we wouldn't need to send these IPIs at all,
right?  We would just keep cr3 pointing at swapper_pg_dir when not
actively using the mm.  The problem with doing that without PCID is
that cr3 writes are really slow.  Or am I missing something?


* Re: [PATCH RFC UGLY] x86,mm,sched: make lazy TLB mode even lazier
  2016-08-29 23:55     ` Andy Lutomirski
@ 2016-08-30  1:14       ` H. Peter Anvin
  2016-08-30 18:23         ` Andy Lutomirski
  0 siblings, 1 reply; 22+ messages in thread
From: H. Peter Anvin @ 2016-08-30  1:14 UTC (permalink / raw)
  To: Andy Lutomirski, Rik van Riel
  Cc: Thomas Gleixner, Benjamin Serebrin, Borislav Petkov, Ingo Molnar,
	linux-kernel, Mel Gorman, Peter Zijlstra

On August 29, 2016 4:55:02 PM PDT, Andy Lutomirski <luto@amacapital.net> wrote:
>On Aug 29, 2016 7:54 AM, "Rik van Riel" <riel@redhat.com> wrote:
>>
>> On Sun, 2016-08-28 at 01:11 -0700, Andy Lutomirski wrote:
>> > On Aug 25, 2016 9:06 PM, "Rik van Riel" <riel@redhat.com> wrote:
>> > >
>> > > Subject: x86,mm,sched: make lazy TLB mode even lazier
>> > >
>> > > Lazy TLB mode can result in an idle CPU being woken up for a TLB
>> > > flush, when all it really needed to do was flush %cr3 before the
>> > > next context switch.
>> > >
>> > > This is mostly fine on bare metal, though sub-optimal from a
>power
>> > > saving point of view, and deeper C states could make TLB flushes
>> > > take a little longer than desired.
>> > >
>> > > On virtual machines, the pain can be much worse, especially if a
>> > > currently non-running VCPU is woken up for a TLB invalidation
>> > > IPI, on a CPU that is busy running another task. It could take
>> > > a while before that IPI is handled, leading to performance
>issues.
>> > >
>> > > This patch is still ugly, and the sched.h include needs to be
>> > > cleaned
>> > > up a lot (how would the scheduler people like to see the context
>> > > switch
>> > > blocking abstracted?)
>> > >
>> > > This patch deals with the issue by introducing a third tlb state,
>> > > TLBSTATE_FLUSH, which causes %cr3 to be flushed at the next
>> > > context switch. A CPU is transitioned from TLBSTATE_LAZY to
>> > > TLBSTATE_FLUSH with the rq lock held, to prevent context
>switches.
>> > >
>> > > Nothing is done for a CPU that is already in TLBSTATE_FLUH mode.
>> > >
>> > > This patch is totally untested, because I am at a conference
>right
>> > > now, and Benjamin has the test case :)
>> > >
>> >
>> > I haven't had a chance to seriously read the code yet, but what
>> > happens when the mm is deleted outright?  Or is the idea that a
>> > reference is held until all the lazy users are gone, too?
>>
>> Worst case we send a TLB flush to a CPU that does
>> not need it.
>>
>> As not sending an IPI will be faster than sending
>> one, I do not think the tradeoff will be much
>> different for a system with PCID.
>
>If we were fully non-lazy, we wouldn't need to send these IPIs at all,
>right?  We would just keep cr3 pointing at swapper_pg_dir when not
>actively using the mm.  The problem with doing that without PCID is
>that cr3 writes are really slow.  Or am I missing something?

Writing cr3 on a PCID system doesn't (necessarily) flush the TLB context.  The whole reason for PCIDs is to *enable* lazy TLB by not making it necessary to flush a TLB context during the running of another process.  As such, this methodology should help a PCID system even more: we can remember if we need to flush a TLB context during the scheduling of said task, without needing any IPI.
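
Concretely (illustrative only; the helper name and arguments here are
made up): with CR4.PCIDE set, bit 63 of the value written to CR3 is the
"no flush" bit, so switching page tables can preserve the TLB entries
tagged with the target PCID:

	/* Hypothetical sketch, not in-tree code. */
	static inline void load_cr3_noflush(unsigned long pgd_pa, u16 pcid)
	{
		unsigned long cr3 = pgd_pa | pcid | (1UL << 63);

		asm volatile("mov %0, %%cr3" : : "r" (cr3) : "memory");
	}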
-- 
Sent from my Android device with K-9 Mail. Please excuse brevity and formatting.


* Re: [PATCH RFC UGLY] x86,mm,sched: make lazy TLB mode even lazier
  2016-08-30  1:14       ` H. Peter Anvin
@ 2016-08-30 18:23         ` Andy Lutomirski
  0 siblings, 0 replies; 22+ messages in thread
From: Andy Lutomirski @ 2016-08-30 18:23 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Rik van Riel, Thomas Gleixner, Benjamin Serebrin,
	Borislav Petkov, Ingo Molnar, linux-kernel, Mel Gorman,
	Peter Zijlstra

On Mon, Aug 29, 2016 at 6:14 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> On August 29, 2016 4:55:02 PM PDT, Andy Lutomirski <luto@amacapital.net> wrote:
>>On Aug 29, 2016 7:54 AM, "Rik van Riel" <riel@redhat.com> wrote:
>>>
>>> On Sun, 2016-08-28 at 01:11 -0700, Andy Lutomirski wrote:
>>> > On Aug 25, 2016 9:06 PM, "Rik van Riel" <riel@redhat.com> wrote:
>>> > >
>>> > > Subject: x86,mm,sched: make lazy TLB mode even lazier
>>> > >
>>> > > Lazy TLB mode can result in an idle CPU being woken up for a TLB
>>> > > flush, when all it really needed to do was flush %cr3 before the
>>> > > next context switch.
>>> > >
>>> > > This is mostly fine on bare metal, though sub-optimal from a
>>power
>>> > > saving point of view, and deeper C states could make TLB flushes
>>> > > take a little longer than desired.
>>> > >
>>> > > On virtual machines, the pain can be much worse, especially if a
>>> > > currently non-running VCPU is woken up for a TLB invalidation
>>> > > IPI, on a CPU that is busy running another task. It could take
>>> > > a while before that IPI is handled, leading to performance
>>issues.
>>> > >
>>> > > This patch is still ugly, and the sched.h include needs to be
>>> > > cleaned
>>> > > up a lot (how would the scheduler people like to see the context
>>> > > switch
>>> > > blocking abstracted?)
>>> > >
>>> > > This patch deals with the issue by introducing a third tlb state,
>>> > > TLBSTATE_FLUSH, which causes %cr3 to be flushed at the next
>>> > > context switch. A CPU is transitioned from TLBSTATE_LAZY to
>>> > > TLBSTATE_FLUSH with the rq lock held, to prevent context
>>switches.
>>> > >
>>> > > Nothing is done for a CPU that is already in TLBSTATE_FLUH mode.
>>> > >
>>> > > This patch is totally untested, because I am at a conference
>>right
>>> > > now, and Benjamin has the test case :)
>>> > >
>>> >
>>> > I haven't had a chance to seriously read the code yet, but what
>>> > happens when the mm is deleted outright?  Or is the idea that a
>>> > reference is held until all the lazy users are gone, too?
>>>
>>> Worst case we send a TLB flush to a CPU that does
>>> not need it.
>>>
>>> As not sending an IPI will be faster than sending
>>> one, I do not think the tradeoff will be much
>>> different for a system with PCID.
>>
>>If we were fully non-lazy, we wouldn't need to send these IPIs at all,
>>right?  We would just keep cr3 pointing at swapper_pg_dir when not
>>actively using the mm.  The problem with doing that without PCID is
>>that cr3 writes are really slow.  Or am I missing something?
>
> Writing cr3 on a PCID system doesn't (necessarily) flush the TLB context.  The whole reason for PCIDs is to *enable* lazy TLB by not making it necessary to flush a TLB context during the running of another process.  As such, this methodology should help a PCID system even more: we can remember if we need to flush a TLB context during the scheduling of said task, without needing any IPI.

What I mean, more precisely, is: when unusing an mm, if we have PCID,
we could actually switch to swapper_pg_dir without flushing the TLB.
Then, when we resume the old task, we can use the tracking (that I add
in my patches) to decide when to flush them.

I'm not sure this would actually improve matters in any meaningful way.

--Andy


* [PATCH RFC v4] x86,mm,sched: make lazy TLB mode even lazier
  2016-08-27 23:02       ` Linus Torvalds
@ 2016-08-30 19:53         ` Rik van Riel
  2016-08-30 21:09           ` [PATCH RFC v5] " Rik van Riel
  2016-08-31 16:27         ` [PATCH RFC v6] " Rik van Riel
  1 sibling, 1 reply; 22+ messages in thread
From: Rik van Riel @ 2016-08-30 19:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, H. Peter Anvin, serebrin, Peter Zijlstra,
	Linux Kernel Mailing List, Andrew Lutomirski, Borislav Petkov,
	Mel Gorman, Thomas Gleixner

On Sat, 27 Aug 2016 16:02:25 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> The only remaining comment is that I'd make that
> lazy_tlb_can_skip_flush() function just use a switch table for the
> tlbstate comparisons rather than the repeated conditionals.

After staring at the code for an hour or so yesterday, I found a race
condition. It took me a few minutes to realize we can fix it with a
cmpxchg at context switch time, and then most of a day to realize that
we only need that cmpxchg in the context switch code if the old tlb
state is TLBSTATE_LAZY.

Context switch times when the tlb state is something else should be
unaffected.

The 4th version of the patch (below) closes that race condition, and
includes the improvements suggested by Ingo and you.
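
For illustration only (this is not part of the patch), here is a small
userspace model of that race, using C11 atomics in place of the kernel's
cmpxchg. It plays out the two possible orders in which the flusher's and
the context switch's cmpxchg can land, and shows that exactly one side
ends up doing the flush either way:

/*
 * Userspace model of the LAZY/FLUSH race.  The TLBSTATE_* names and the
 * return-value logic mirror the patch below; everything else here is
 * illustrative, not kernel code.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

enum { TLBSTATE_OK = 1, TLBSTATE_LAZY, TLBSTATE_FLUSH };

static _Atomic int tlbstate = TLBSTATE_LAZY;

/* Flusher side: defer the flush if the CPU is still lazy, else send an IPI. */
static bool can_skip_ipi(void)
{
	int old = TLBSTATE_LAZY;

	atomic_compare_exchange_strong(&tlbstate, &old, TLBSTATE_FLUSH);
	return old != TLBSTATE_OK;	/* OK means the CPU raced back in */
}

/* Context switch side: leave lazy mode; flush if a flush was deferred to us. */
static bool switch_needs_flush(void)
{
	int old = TLBSTATE_LAZY;

	atomic_compare_exchange_strong(&tlbstate, &old, TLBSTATE_OK);
	atomic_store(&tlbstate, TLBSTATE_OK);
	return old == TLBSTATE_FLUSH;
}

int main(void)
{
	/* Order 1: the context switch's cmpxchg lands first. */
	bool flush_at_switch = switch_needs_flush();	/* LAZY -> OK */
	bool ipi_sent = !can_skip_ipi();		/* sees OK, sends IPI */
	printf("order 1: flush at switch %d, IPI sent %d\n",
	       flush_at_switch, ipi_sent);

	/* Order 2: the flusher's cmpxchg lands first. */
	atomic_store(&tlbstate, TLBSTATE_LAZY);
	ipi_sent = !can_skip_ipi();			/* LAZY -> FLUSH, no IPI */
	flush_at_switch = switch_needs_flush();		/* sees FLUSH, flushes */
	printf("order 2: flush at switch %d, IPI sent %d\n",
	       flush_at_switch, ipi_sent);

	/* In both orders, exactly one of the two sides performs the flush. */
	return 0;
}

The switch-statement version in the patch is the same logic, just split
across lazy_tlb_can_skip_flush() and switch_mm_irqs_off().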
 
> I'd love to see the results from Benjamin - maybe it helps a lot, and
> maybe it doesn't. But regardless, the patch makes sense to me.

I would love to see test results from Ben, as well.  Ben? :)

---8<---

Subject: x86,mm,sched: make lazy TLB mode even lazier

Lazy TLB mode can result in an idle CPU being woken up for a TLB
flush, when all it really needed to do was flush %CR3 before the
next context switch.

This is mostly fine on bare metal, though sub-optimal from a power
saving point of view, and deeper C-states could make TLB flushes
take a little longer than desired.

On virtual machines, the pain can be much worse, especially if a
currently non-running VCPU is woken up for a TLB invalidation
IPI, on a CPU that is busy running another task. It could take
a while before that IPI is handled, leading to performance issues.

This patch deals with the issue by introducing a third TLB state,
TLBSTATE_FLUSH, which causes %CR3 to be flushed at the next
context switch.

A CPU that transitions from TLBSTATE_LAZY to TLBSTATE_OK during
the attempted transition to TLBSTATE_FLUSH will get a TLB flush
IPI, just like a CPU that was in TLBSTATE_OK to begin with.

Nothing is done for a CPU that is already in TLBSTATE_FLUSH mode.

Signed-off-by: Rik van Riel <riel@redhat.com>
Reported-by: Benjamin Serebrin <serebrin@google.com>
---
 arch/x86/include/asm/tlbflush.h |  3 +-
 arch/x86/include/asm/uv/uv.h    |  6 ++--
 arch/x86/mm/tlb.c               | 64 ++++++++++++++++++++++++++++++++++++++---
 arch/x86/platform/uv/tlb_uv.c   |  2 +-
 4 files changed, 66 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 4e5be94e079a..c3dbacbc49be 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -304,12 +304,13 @@ extern void flush_tlb_kernel_range(unsigned long start, unsigned long end);
 
 #define flush_tlb()	flush_tlb_current_task()
 
-void native_flush_tlb_others(const struct cpumask *cpumask,
+void native_flush_tlb_others(struct cpumask *cpumask,
 				struct mm_struct *mm,
 				unsigned long start, unsigned long end);
 
 #define TLBSTATE_OK	1
 #define TLBSTATE_LAZY	2
+#define TLBSTATE_FLUSH	3
 
 static inline void reset_lazy_tlbstate(void)
 {
diff --git a/arch/x86/include/asm/uv/uv.h b/arch/x86/include/asm/uv/uv.h
index 062921ef34e9..7e83cc633ba1 100644
--- a/arch/x86/include/asm/uv/uv.h
+++ b/arch/x86/include/asm/uv/uv.h
@@ -13,7 +13,7 @@ extern int is_uv_system(void);
 extern void uv_cpu_init(void);
 extern void uv_nmi_init(void);
 extern void uv_system_init(void);
-extern const struct cpumask *uv_flush_tlb_others(const struct cpumask *cpumask,
+extern struct cpumask *uv_flush_tlb_others(struct cpumask *cpumask,
 						 struct mm_struct *mm,
 						 unsigned long start,
 						 unsigned long end,
@@ -25,8 +25,8 @@ static inline enum uv_system_type get_uv_system_type(void) { return UV_NONE; }
 static inline int is_uv_system(void)	{ return 0; }
 static inline void uv_cpu_init(void)	{ }
 static inline void uv_system_init(void)	{ }
-static inline const struct cpumask *
-uv_flush_tlb_others(const struct cpumask *cpumask, struct mm_struct *mm,
+static inline struct cpumask *
+uv_flush_tlb_others(struct cpumask *cpumask, struct mm_struct *mm,
 		    unsigned long start, unsigned long end, unsigned int cpu)
 { return cpumask; }
 
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 5643fd0b1a7d..634248b38db9 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -140,10 +140,24 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 	}
 #ifdef CONFIG_SMP
 	  else {
+		int *tlbstate = this_cpu_ptr(&cpu_tlbstate.state);
+		int oldstate = *tlbstate;
+
+		if (unlikely(oldstate == TLBSTATE_LAZY)) {
+			/*
+			 * The TLB flush code (lazy_tlb_can_skip_flush) can
+			 * move the TLB state to TLBSTATE_FLUSH concurrently
+			 * with a context switch. Using cmpxchg here will catch
+			 * that transition, causing a TLB flush below.
+			 */
+			oldstate = cmpxchg(tlbstate, oldstate, TLBSTATE_OK);
+		}
 		this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
+
 		BUG_ON(this_cpu_read(cpu_tlbstate.active_mm) != next);
 
-		if (!cpumask_test_cpu(cpu, mm_cpumask(next))) {
+		if (oldstate == TLBSTATE_FLUSH ||
+				!cpumask_test_cpu(cpu, mm_cpumask(next))) {
 			/*
 			 * On established mms, the mm_cpumask is only changed
 			 * from irq context, from ptep_clear_flush() while in
@@ -242,11 +256,44 @@ static void flush_tlb_func(void *info)
 
 }
 
-void native_flush_tlb_others(const struct cpumask *cpumask,
+/*
+ * Determine whether a CPU's TLB needs to be flushed now, or whether the
+ * flush can be delayed until the next context switch, by changing the
+ * tlbstate from TLBSTATE_LAZY to TLBSTATE_FLUSH.
+ */
+static bool lazy_tlb_can_skip_flush(int cpu)
+{
+	int *tlbstate = &per_cpu(cpu_tlbstate.state, cpu);
+	int old;
+
+	switch (*tlbstate) {
+	case TLBSTATE_FLUSH:
+		/* The TLB will be flushed on the next context switch. */
+		return true;
+	case TLBSTATE_LAZY:
+		/*
+		 * The CPU is in TLBSTATE_LAZY, which could context switch back
+		 * to TLBSTATE_OK, re-using the old TLB state without a flush.
+		 * If that happened, send a TLB flush IPI.
+		 *
+		 * Otherwise, the state is now TLBSTATE_FLUSH, and TLB will
+		 * be flushed at the next context switch. Skip the IPI.
+		 */
+		old = cmpxchg(tlbstate, TLBSTATE_LAZY, TLBSTATE_FLUSH);
+		return old != TLBSTATE_OK;
+	case TLBSTATE_OK:
+	default:
+		/* A task on the CPU is actively using the mm. Flush the TLB. */
+		return false;
+	}
+}
+
+void native_flush_tlb_others(struct cpumask *cpumask,
 				 struct mm_struct *mm, unsigned long start,
 				 unsigned long end)
 {
 	struct flush_tlb_info info;
+	unsigned int cpu;
 
 	if (end == 0)
 		end = start + PAGE_SIZE;
@@ -262,8 +309,6 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 				(end - start) >> PAGE_SHIFT);
 
 	if (is_uv_system()) {
-		unsigned int cpu;
-
 		cpu = smp_processor_id();
 		cpumask = uv_flush_tlb_others(cpumask, mm, start, end, cpu);
 		if (cpumask)
@@ -271,6 +316,17 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 								&info, 1);
 		return;
 	}
+
+	/*
+	 * Instead of sending IPIs to CPUs in lazy TLB mode, move that
+	 * CPU's TLB state to TLBSTATE_FLUSH, causing the TLB to be flushed
+	 * at the next context switch.
+	 */
+	for_each_cpu(cpu, cpumask) {
+		if (lazy_tlb_can_skip_flush(cpu))
+			cpumask_clear_cpu(cpu, cpumask);
+	}
+
 	smp_call_function_many(cpumask, flush_tlb_func, &info, 1);
 }
 
diff --git a/arch/x86/platform/uv/tlb_uv.c b/arch/x86/platform/uv/tlb_uv.c
index fdb4d42b4ce5..7a2221a81e77 100644
--- a/arch/x86/platform/uv/tlb_uv.c
+++ b/arch/x86/platform/uv/tlb_uv.c
@@ -1090,7 +1090,7 @@ static int set_distrib_bits(struct cpumask *flush_mask, struct bau_control *bcp,
  * Returns pointer to cpumask if some remote flushing remains to be
  * done.  The returned pointer is valid till preemption is re-enabled.
  */
-const struct cpumask *uv_flush_tlb_others(const struct cpumask *cpumask,
+struct cpumask *uv_flush_tlb_others(struct cpumask *cpumask,
 						struct mm_struct *mm,
 						unsigned long start,
 						unsigned long end,

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH RFC v5] x86,mm,sched: make lazy TLB mode even lazier
  2016-08-30 19:53         ` [PATCH RFC v4] " Rik van Riel
@ 2016-08-30 21:09           ` Rik van Riel
  0 siblings, 0 replies; 22+ messages in thread
From: Rik van Riel @ 2016-08-30 21:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, H. Peter Anvin, serebrin, Peter Zijlstra,
	Linux Kernel Mailing List, Andrew Lutomirski, Borislav Petkov,
	Mel Gorman, Thomas Gleixner

On Tue, 30 Aug 2016 15:53:32 -0400
Rik van Riel <riel@redhat.com> wrote:

> On Sat, 27 Aug 2016 16:02:25 -0700
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> > The only remaining comment is that I'd make that
> > lazy_tlb_can_skip_flush() function just use a switch table for the
> > tlbstate comparisons rather than the repeated conditionals.  
> 
> After staring at the code for an hour or so yesterday, I found a race
> condition. It took me a few minutes to realize we can fix it with a
> cmpxchg at context switch time, and then most of a day to realize that
> we only need that cmpxchg in the context switch code if the old tlb
> state is TLBSTATE_LAZY.
> 
> Context switch times when the tlb state is something else should be
> unaffected.
> 
> The 4th version of the patch (below) closes that race condition, and
> includes the improvements suggested by Ingo and you.
>  
> > I'd love to see the results from Benjamin - maybe it helps a lot, and
> > maybe it doesn't. But regardless, the patch makes sense to me.  
> 
> I would love to see test results from Ben, as well.  Ben? :)

The kbuild test robot helpfully reminded me that I forgot to enable
CONFIG_PARAVIRT in my build. v5 adds a one-line change in paravirt_types.h,
to match the changed prototype for flush_tlb_others.

Sorry about the noise.

---8<---

Subject: x86,mm,sched: make lazy TLB mode even lazier

Lazy TLB mode can result in an idle CPU being woken up for a TLB
flush, when all it really needed to do was flush %CR3 before the
next context switch.

This is mostly fine on bare metal, though sub-optimal from a power
saving point of view, and deeper C-states could make TLB flushes
take a little longer than desired.

On virtual machines, the pain can be much worse, especially if a
currently non-running VCPU is woken up for a TLB invalidation
IPI, on a CPU that is busy running another task. It could take
a while before that IPI is handled, leading to performance issues.

This patch deals with the issue by introducing a third TLB state,
TLBSTATE_FLUSH, which causes %CR3 to be flushed at the next
context switch.

A CPU that transitions from TLBSTATE_LAZY to TLBSTATE_OK during
the attempted transition to TLBSTATE_FLUSH will get a TLB flush
IPI, just like a CPU that was in TLBSTATE_OK to begin with.

Nothing is done for a CPU that is already in TLBSTATE_FLUSH mode.

Signed-off-by: Rik van Riel <riel@redhat.com>
Reported-by: Benjamin Serebrin <serebrin@google.com>
---
 arch/x86/include/asm/paravirt_types.h |  2 +-
 arch/x86/include/asm/tlbflush.h       |  3 +-
 arch/x86/include/asm/uv/uv.h          |  6 ++--
 arch/x86/mm/tlb.c                     | 64 ++++++++++++++++++++++++++++++++---
 arch/x86/platform/uv/tlb_uv.c         |  2 +-
 5 files changed, 67 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 7fa9e7740ba3..b7e695c90c43 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -225,7 +225,7 @@ struct pv_mmu_ops {
 	void (*flush_tlb_user)(void);
 	void (*flush_tlb_kernel)(void);
 	void (*flush_tlb_single)(unsigned long addr);
-	void (*flush_tlb_others)(const struct cpumask *cpus,
+	void (*flush_tlb_others)(struct cpumask *cpus,
 				 struct mm_struct *mm,
 				 unsigned long start,
 				 unsigned long end);
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 4e5be94e079a..c3dbacbc49be 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -304,12 +304,13 @@ extern void flush_tlb_kernel_range(unsigned long start, unsigned long end);
 
 #define flush_tlb()	flush_tlb_current_task()
 
-void native_flush_tlb_others(const struct cpumask *cpumask,
+void native_flush_tlb_others(struct cpumask *cpumask,
 				struct mm_struct *mm,
 				unsigned long start, unsigned long end);
 
 #define TLBSTATE_OK	1
 #define TLBSTATE_LAZY	2
+#define TLBSTATE_FLUSH	3
 
 static inline void reset_lazy_tlbstate(void)
 {
diff --git a/arch/x86/include/asm/uv/uv.h b/arch/x86/include/asm/uv/uv.h
index 062921ef34e9..7e83cc633ba1 100644
--- a/arch/x86/include/asm/uv/uv.h
+++ b/arch/x86/include/asm/uv/uv.h
@@ -13,7 +13,7 @@ extern int is_uv_system(void);
 extern void uv_cpu_init(void);
 extern void uv_nmi_init(void);
 extern void uv_system_init(void);
-extern const struct cpumask *uv_flush_tlb_others(const struct cpumask *cpumask,
+extern struct cpumask *uv_flush_tlb_others(struct cpumask *cpumask,
 						 struct mm_struct *mm,
 						 unsigned long start,
 						 unsigned long end,
@@ -25,8 +25,8 @@ static inline enum uv_system_type get_uv_system_type(void) { return UV_NONE; }
 static inline int is_uv_system(void)	{ return 0; }
 static inline void uv_cpu_init(void)	{ }
 static inline void uv_system_init(void)	{ }
-static inline const struct cpumask *
-uv_flush_tlb_others(const struct cpumask *cpumask, struct mm_struct *mm,
+static inline struct cpumask *
+uv_flush_tlb_others(struct cpumask *cpumask, struct mm_struct *mm,
 		    unsigned long start, unsigned long end, unsigned int cpu)
 { return cpumask; }
 
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 5643fd0b1a7d..634248b38db9 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -140,10 +140,24 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 	}
 #ifdef CONFIG_SMP
 	  else {
+		int *tlbstate = this_cpu_ptr(&cpu_tlbstate.state);
+		int oldstate = *tlbstate;
+
+		if (unlikely(oldstate == TLBSTATE_LAZY)) {
+			/*
+			 * The TLB flush code (lazy_tlb_can_skip_flush) can
+			 * move the TLB state to TLBSTATE_FLUSH concurrently
+			 * with a context switch. Using cmpxchg here will catch
+			 * that transition, causing a TLB flush below.
+			 */
+			oldstate = cmpxchg(tlbstate, oldstate, TLBSTATE_OK);
+		}
 		this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
+
 		BUG_ON(this_cpu_read(cpu_tlbstate.active_mm) != next);
 
-		if (!cpumask_test_cpu(cpu, mm_cpumask(next))) {
+		if (oldstate == TLBSTATE_FLUSH ||
+				!cpumask_test_cpu(cpu, mm_cpumask(next))) {
 			/*
 			 * On established mms, the mm_cpumask is only changed
 			 * from irq context, from ptep_clear_flush() while in
@@ -242,11 +256,44 @@ static void flush_tlb_func(void *info)
 
 }
 
-void native_flush_tlb_others(const struct cpumask *cpumask,
+/*
+ * Determine whether a CPU's TLB needs to be flushed now, or whether the
+ * flush can be delayed until the next context switch, by changing the
+ * tlbstate from TLBSTATE_LAZY to TLBSTATE_FLUSH.
+ */
+static bool lazy_tlb_can_skip_flush(int cpu)
+{
+	int *tlbstate = &per_cpu(cpu_tlbstate.state, cpu);
+	int old;
+
+	switch (*tlbstate) {
+	case TLBSTATE_FLUSH:
+		/* The TLB will be flushed on the next context switch. */
+		return true;
+	case TLBSTATE_LAZY:
+		/*
+		 * The CPU is in TLBSTATE_LAZY, which could context switch back
+		 * to TLBSTATE_OK, re-using the old TLB state without a flush.
+		 * If that happened, send a TLB flush IPI.
+		 *
+		 * Otherwise, the state is now TLBSTATE_FLUSH, and TLB will
+		 * be flushed at the next context switch. Skip the IPI.
+		 */
+		old = cmpxchg(tlbstate, TLBSTATE_LAZY, TLBSTATE_FLUSH);
+		return old != TLBSTATE_OK;
+	case TLBSTATE_OK:
+	default:
+		/* A task on the CPU is actively using the mm. Flush the TLB. */
+		return false;
+	}
+}
+
+void native_flush_tlb_others(struct cpumask *cpumask,
 				 struct mm_struct *mm, unsigned long start,
 				 unsigned long end)
 {
 	struct flush_tlb_info info;
+	unsigned int cpu;
 
 	if (end == 0)
 		end = start + PAGE_SIZE;
@@ -262,8 +309,6 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 				(end - start) >> PAGE_SHIFT);
 
 	if (is_uv_system()) {
-		unsigned int cpu;
-
 		cpu = smp_processor_id();
 		cpumask = uv_flush_tlb_others(cpumask, mm, start, end, cpu);
 		if (cpumask)
@@ -271,6 +316,17 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 								&info, 1);
 		return;
 	}
+
+	/*
+	 * Instead of sending IPIs to CPUs in lazy TLB mode, move that
+	 * CPU's TLB state to TLBSTATE_FLUSH, causing the TLB to be flushed
+	 * at the next context switch.
+	 */
+	for_each_cpu(cpu, cpumask) {
+		if (lazy_tlb_can_skip_flush(cpu))
+			cpumask_clear_cpu(cpu, cpumask);
+	}
+
 	smp_call_function_many(cpumask, flush_tlb_func, &info, 1);
 }
 
diff --git a/arch/x86/platform/uv/tlb_uv.c b/arch/x86/platform/uv/tlb_uv.c
index fdb4d42b4ce5..7a2221a81e77 100644
--- a/arch/x86/platform/uv/tlb_uv.c
+++ b/arch/x86/platform/uv/tlb_uv.c
@@ -1090,7 +1090,7 @@ static int set_distrib_bits(struct cpumask *flush_mask, struct bau_control *bcp,
  * Returns pointer to cpumask if some remote flushing remains to be
  * done.  The returned pointer is valid till preemption is re-enabled.
  */
-const struct cpumask *uv_flush_tlb_others(const struct cpumask *cpumask,
+struct cpumask *uv_flush_tlb_others(struct cpumask *cpumask,
 						struct mm_struct *mm,
 						unsigned long start,
 						unsigned long end,

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH RFC v6] x86,mm,sched: make lazy TLB mode even lazier
  2016-08-27 23:02       ` Linus Torvalds
  2016-08-30 19:53         ` [PATCH RFC v4] " Rik van Riel
@ 2016-08-31 16:27         ` Rik van Riel
  2016-09-08  6:56           ` Ingo Molnar
  1 sibling, 1 reply; 22+ messages in thread
From: Rik van Riel @ 2016-08-31 16:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, H. Peter Anvin, serebrin, Peter Zijlstra,
	Linux Kernel Mailing List, Andrew Lutomirski, Borislav Petkov,
	Mel Gorman, Thomas Gleixner

On Sat, 27 Aug 2016 16:02:25 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> Yeah, with those small fixes from Ingo, I definitely don't think this
> looks hacky at all. This all seems to be exactly what we should always
> have done.

OK, so I was too tired yesterday to do kernel hacking, and
missed yet another bit (xen_flush_tlb_others). Sigh.

Otherwise, the patch is identical.

Looking forward to Ben's test results.

---8<---

Subject: x86,mm,sched: make lazy TLB mode even lazier

Lazy TLB mode can result in an idle CPU being woken up for a TLB
flush, when all it really needed to do was flush %CR3 before the
next context switch.

This is mostly fine on bare metal, though sub-optimal from a power
saving point of view, and deeper C-states could make TLB flushes
take a little longer than desired.

On virtual machines, the pain can be much worse, especially if a
currently non-running VCPU is woken up for a TLB invalidation
IPI, on a CPU that is busy running another task. It could take
a while before that IPI is handled, leading to performance issues.

This patch deals with the issue by introducing a third TLB state,
TLBSTATE_FLUSH, which causes %CR3 to be flushed at the next
context switch.

A CPU that transitions from TLBSTATE_LAZY to TLBSTATE_OK during
the attempted transition to TLBSTATE_FLUSH will get a TLB flush
IPI, just like a CPU that was in TLBSTATE_OK to begin with.

Nothing is done for a CPU that is already in TLBSTATE_FLUSH mode.

Signed-off-by: Rik van Riel <riel@redhat.com>
Reported-by: Benjamin Serebrin <serebrin@google.com>
---
 arch/x86/include/asm/paravirt_types.h |  2 +-
 arch/x86/include/asm/tlbflush.h       |  3 +-
 arch/x86/include/asm/uv/uv.h          |  6 ++--
 arch/x86/mm/tlb.c                     | 64 ++++++++++++++++++++++++++++++++---
 arch/x86/platform/uv/tlb_uv.c         |  2 +-
 arch/x86/xen/mmu.c                    |  2 +-
 6 files changed, 68 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 7fa9e7740ba3..b7e695c90c43 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -225,7 +225,7 @@ struct pv_mmu_ops {
 	void (*flush_tlb_user)(void);
 	void (*flush_tlb_kernel)(void);
 	void (*flush_tlb_single)(unsigned long addr);
-	void (*flush_tlb_others)(const struct cpumask *cpus,
+	void (*flush_tlb_others)(struct cpumask *cpus,
 				 struct mm_struct *mm,
 				 unsigned long start,
 				 unsigned long end);
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 4e5be94e079a..c3dbacbc49be 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -304,12 +304,13 @@ extern void flush_tlb_kernel_range(unsigned long start, unsigned long end);
 
 #define flush_tlb()	flush_tlb_current_task()
 
-void native_flush_tlb_others(const struct cpumask *cpumask,
+void native_flush_tlb_others(struct cpumask *cpumask,
 				struct mm_struct *mm,
 				unsigned long start, unsigned long end);
 
 #define TLBSTATE_OK	1
 #define TLBSTATE_LAZY	2
+#define TLBSTATE_FLUSH	3
 
 static inline void reset_lazy_tlbstate(void)
 {
diff --git a/arch/x86/include/asm/uv/uv.h b/arch/x86/include/asm/uv/uv.h
index 062921ef34e9..7e83cc633ba1 100644
--- a/arch/x86/include/asm/uv/uv.h
+++ b/arch/x86/include/asm/uv/uv.h
@@ -13,7 +13,7 @@ extern int is_uv_system(void);
 extern void uv_cpu_init(void);
 extern void uv_nmi_init(void);
 extern void uv_system_init(void);
-extern const struct cpumask *uv_flush_tlb_others(const struct cpumask *cpumask,
+extern struct cpumask *uv_flush_tlb_others(struct cpumask *cpumask,
 						 struct mm_struct *mm,
 						 unsigned long start,
 						 unsigned long end,
@@ -25,8 +25,8 @@ static inline enum uv_system_type get_uv_system_type(void) { return UV_NONE; }
 static inline int is_uv_system(void)	{ return 0; }
 static inline void uv_cpu_init(void)	{ }
 static inline void uv_system_init(void)	{ }
-static inline const struct cpumask *
-uv_flush_tlb_others(const struct cpumask *cpumask, struct mm_struct *mm,
+static inline struct cpumask *
+uv_flush_tlb_others(struct cpumask *cpumask, struct mm_struct *mm,
 		    unsigned long start, unsigned long end, unsigned int cpu)
 { return cpumask; }
 
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 5643fd0b1a7d..634248b38db9 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -140,10 +140,24 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 	}
 #ifdef CONFIG_SMP
 	  else {
+		int *tlbstate = this_cpu_ptr(&cpu_tlbstate.state);
+		int oldstate = *tlbstate;
+
+		if (unlikely(oldstate == TLBSTATE_LAZY)) {
+			/*
+			 * The TLB flush code (lazy_tlb_can_skip_flush) can
+			 * move the TLB state to TLBSTATE_FLUSH concurrently
+			 * with a context switch. Using cmpxchg here will catch
+			 * that transition, causing a TLB flush below.
+			 */
+			oldstate = cmpxchg(tlbstate, oldstate, TLBSTATE_OK);
+		}
 		this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
+
 		BUG_ON(this_cpu_read(cpu_tlbstate.active_mm) != next);
 
-		if (!cpumask_test_cpu(cpu, mm_cpumask(next))) {
+		if (oldstate == TLBSTATE_FLUSH ||
+				!cpumask_test_cpu(cpu, mm_cpumask(next))) {
 			/*
 			 * On established mms, the mm_cpumask is only changed
 			 * from irq context, from ptep_clear_flush() while in
@@ -242,11 +256,44 @@ static void flush_tlb_func(void *info)
 
 }
 
-void native_flush_tlb_others(const struct cpumask *cpumask,
+/*
+ * Determine whether a CPU's TLB needs to be flushed now, or whether the
+ * flush can be delayed until the next context switch, by changing the
+ * tlbstate from TLBSTATE_LAZY to TLBSTATE_FLUSH.
+ */
+static bool lazy_tlb_can_skip_flush(int cpu)
+{
+	int *tlbstate = &per_cpu(cpu_tlbstate.state, cpu);
+	int old;
+
+	switch (*tlbstate) {
+	case TLBSTATE_FLUSH:
+		/* The TLB will be flushed on the next context switch. */
+		return true;
+	case TLBSTATE_LAZY:
+		/*
+		 * The CPU is in TLBSTATE_LAZY, which could context switch back
+		 * to TLBSTATE_OK, re-using the old TLB state without a flush.
+		 * If that happened, send a TLB flush IPI.
+		 *
+		 * Otherwise, the state is now TLBSTATE_FLUSH, and TLB will
+		 * be flushed at the next context switch. Skip the IPI.
+		 */
+		old = cmpxchg(tlbstate, TLBSTATE_LAZY, TLBSTATE_FLUSH);
+		return old != TLBSTATE_OK;
+	case TLBSTATE_OK:
+	default:
+		/* A task on the CPU is actively using the mm. Flush the TLB. */
+		return false;
+	}
+}
+
+void native_flush_tlb_others(struct cpumask *cpumask,
 				 struct mm_struct *mm, unsigned long start,
 				 unsigned long end)
 {
 	struct flush_tlb_info info;
+	unsigned int cpu;
 
 	if (end == 0)
 		end = start + PAGE_SIZE;
@@ -262,8 +309,6 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 				(end - start) >> PAGE_SHIFT);
 
 	if (is_uv_system()) {
-		unsigned int cpu;
-
 		cpu = smp_processor_id();
 		cpumask = uv_flush_tlb_others(cpumask, mm, start, end, cpu);
 		if (cpumask)
@@ -271,6 +316,17 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 								&info, 1);
 		return;
 	}
+
+	/*
+	 * Instead of sending IPIs to CPUs in lazy TLB mode, move that
+	 * CPU's TLB state to TLBSTATE_FLUSH, causing the TLB to be flushed
+	 * at the next context switch.
+	 */
+	for_each_cpu(cpu, cpumask) {
+		if (lazy_tlb_can_skip_flush(cpu))
+			cpumask_clear_cpu(cpu, cpumask);
+	}
+
 	smp_call_function_many(cpumask, flush_tlb_func, &info, 1);
 }
 
diff --git a/arch/x86/platform/uv/tlb_uv.c b/arch/x86/platform/uv/tlb_uv.c
index fdb4d42b4ce5..7a2221a81e77 100644
--- a/arch/x86/platform/uv/tlb_uv.c
+++ b/arch/x86/platform/uv/tlb_uv.c
@@ -1090,7 +1090,7 @@ static int set_distrib_bits(struct cpumask *flush_mask, struct bau_control *bcp,
  * Returns pointer to cpumask if some remote flushing remains to be
  * done.  The returned pointer is valid till preemption is re-enabled.
  */
-const struct cpumask *uv_flush_tlb_others(const struct cpumask *cpumask,
+struct cpumask *uv_flush_tlb_others(struct cpumask *cpumask,
 						struct mm_struct *mm,
 						unsigned long start,
 						unsigned long end,
diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index 67433714b791..0e3e5969527f 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1370,7 +1370,7 @@ static void xen_flush_tlb_single(unsigned long addr)
 	preempt_enable();
 }
 
-static void xen_flush_tlb_others(const struct cpumask *cpus,
+static void xen_flush_tlb_others(struct cpumask *cpus,
 				 struct mm_struct *mm, unsigned long start,
 				 unsigned long end)
 {

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH RFC v6] x86,mm,sched: make lazy TLB mode even lazier
  2016-08-31 16:27         ` [PATCH RFC v6] " Rik van Riel
@ 2016-09-08  6:56           ` Ingo Molnar
  2016-09-09  0:09             ` Benjamin Serebrin
  0 siblings, 1 reply; 22+ messages in thread
From: Ingo Molnar @ 2016-09-08  6:56 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Linus Torvalds, H. Peter Anvin, serebrin, Peter Zijlstra,
	Linux Kernel Mailing List, Andrew Lutomirski, Borislav Petkov,
	Mel Gorman, Thomas Gleixner


* Rik van Riel <riel@redhat.com> wrote:

> On Sat, 27 Aug 2016 16:02:25 -0700
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> > Yeah, with those small fixes from Ingo, I definitely don't think this
> > looks hacky at all. This all seems to be exactly what we should always
> > have done.
> 
> OK, so I was too tired yesterday to do kernel hacking, and
> missed yet another bit (xen_flush_tlb_others). Sigh.
> 
> Otherwise, the patch is identical.
> 
> Looking forward to Ben's test results.

Gentle ping to Ben.

I can also apply this without waiting for the test result, the patch looks sane 
enough to me.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RFC v6] x86,mm,sched: make lazy TLB mode even lazier
  2016-09-08  6:56           ` Ingo Molnar
@ 2016-09-09  0:09             ` Benjamin Serebrin
  2016-09-09  4:39               ` Andy Lutomirski
  0 siblings, 1 reply; 22+ messages in thread
From: Benjamin Serebrin @ 2016-09-09  0:09 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Rik van Riel, Linus Torvalds, H. Peter Anvin, Peter Zijlstra,
	Linux Kernel Mailing List, Andrew Lutomirski, Borislav Petkov,
	Mel Gorman, Thomas Gleixner

Sorry for the delay, I was eaten by a grue.

I found that my initial study did not actually measure the number of
TLB shootdown IPIs sent per TLB shootdown.  I think the intuition was
correct but I didn't actually observe what I thought I had; my
original use of probe points was incorrect.  However, after fixing my
methodology, I'm having trouble proving that the existing Lazy TLB
mode is working properly.



I've spent some time trying to reproduce this in a microbenchmark.
One thread does mmap, touch page, munmap, while other threads in the
same process are configured to either busy-spin or busy-spin and
yield.  All threads set their own affinity to a unique cpu, and the
system is otherwise idle.  I look at the per-cpu delta of the TLB and
CAL lines of /proc/interrupts over the run of the microbenchmark.

Let's say I have 4 spin threads that never yield.  The mmap thread
does N unmaps.  I observe each spin-thread core receives N (+/-  small
noise) TLB shootdown interrupts, and the total TLB interrupt count is
4N (+/- small noise).  This is expected behavior.

Then I add some synchronization:  the unmap thread rendezvouses with
all the spinners, and when they are all ready, the spinners busy-spin
for D milliseconds and then yield (pthread_yield, sched_yield produce
identical results, though I'm not confident here that this is the
right yield).  Meanwhile, the unmap thread busy-spins for D+E
milliseconds and then does M map/touch/unmaps.  (D, E are single-digit
milliseconds).  The idea here is that the unmap happens a little while
after the spinners yielded; the kernel should be in the user process'
mm but lazy TLB mode should defer TLB flushes.  It seems that lazy
mode on each CPU should take 1 interrupt and then suppress subsequent
interrupts.
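
A minimal sketch of this kind of microbenchmark (illustrative only, not
the actual harness; the rendezvous and the D/E busy-wait timing are
omitted, and the spinners either spin forever or spin and yield):

/* Build: gcc -O2 -pthread tlb_bench.c -o tlb_bench
 * Needs at least NSPIN + 1 CPUs; spinners run on CPUs 1..NSPIN. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <string.h>
#include <sys/mman.h>

#define NSPIN	4		/* number of spinner threads */
#define NUNMAP	100000		/* map/touch/unmap iterations (N) */

static volatile int stop;
static int do_yield;		/* 0: busy-spin, 1: busy-spin + sched_yield() */

static void pin_to_cpu(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *spinner(void *arg)
{
	pin_to_cpu((int)(long)arg);
	while (!stop) {
		if (do_yield)
			sched_yield();
	}
	return NULL;
}

int main(int argc, char **argv)
{
	pthread_t tid[NSPIN];
	long i;

	do_yield = (argc > 1 && !strcmp(argv[1], "yield"));

	for (i = 0; i < NSPIN; i++)
		pthread_create(&tid[i], NULL, spinner, (void *)(i + 1));

	pin_to_cpu(0);		/* unmap thread on CPU 0 */

	/* Each munmap of a touched page triggers a TLB shootdown. */
	for (i = 0; i < NUNMAP; i++) {
		char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (p == MAP_FAILED)
			return 1;
		p[0] = 1;	/* touch, so the TLB has an entry to shoot down */
		munmap(p, 4096);
	}

	stop = 1;
	for (i = 0; i < NSPIN; i++)
		pthread_join(tid[i], NULL);
	return 0;
}

Comparing the per-CPU TLB line of /proc/interrupts before and after a
run, for both the plain-spin and yield variants, gives the shootdown
counts discussed above.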

I expect lazy TLB invalidation to take 1 interrupt on each spinner
CPU, per rendezvous sequence, and I expect Rik's extra-lazy version to
take 0.  I see M in all cases.  This leads me to wonder if I'm failing
to trigger lazy TLB invalidation, or if lazy TLB invalidation is not
working as intended.

I get similar results using perf record on probe points: I filter by
CPU number and count the number of IPIs sent between each pair of
probe points in the TLB flush routines.  I put probe points on
flush_tlb_mm_range and flush_tlb_mm_range%return.  To count the number
of IPIs sent in a VM that uses x2APIC physical mode, probing
native_x2apic_icr_write or __x2apic_send_IPI_dest is usually
convenient, if it doesn't get inlined away (which sometimes happens),
since that function is called once per target CPU in the cpu_mask of
__x2apic_send_IPI_mask.  I then filter the perf script output to look
at the distribution of CPUs targeted per TLB shootdown.


Rik's patch definitely looks correct, but I can't yet cite the gains.

Thanks!
Ben





On Wed, Sep 7, 2016 at 11:56 PM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Rik van Riel <riel@redhat.com> wrote:
>
>> On Sat, 27 Aug 2016 16:02:25 -0700
>> Linus Torvalds <torvalds@linux-foundation.org> wrote:
>>
>> > Yeah, with those small fixes from Ingo, I definitely don't think this
>> > looks hacky at all. This all seems to be exactly what we should always
>> > have done.
>>
>> OK, so I was too tired yesterday to do kernel hacking, and
>> missed yet another bit (xen_flush_tlb_others). Sigh.
>>
>> Otherwise, the patch is identical.
>>
>> Looking forward to Ben's test results.
>
> Gentle ping to Ben.
>
> I can also apply this without waiting for the test result, the patch looks sane
> enough to me.
>
> Thanks,
>
>         Ingo

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RFC v6] x86,mm,sched: make lazy TLB mode even lazier
  2016-09-09  0:09             ` Benjamin Serebrin
@ 2016-09-09  4:39               ` Andy Lutomirski
  2016-09-09  7:44                 ` Peter Zijlstra
  0 siblings, 1 reply; 22+ messages in thread
From: Andy Lutomirski @ 2016-09-09  4:39 UTC (permalink / raw)
  To: Benjamin Serebrin
  Cc: Ingo Molnar, Rik van Riel, Linus Torvalds, H. Peter Anvin,
	Peter Zijlstra, Linux Kernel Mailing List, Andrew Lutomirski,
	Borislav Petkov, Mel Gorman, Thomas Gleixner

On Thu, Sep 8, 2016 at 5:09 PM, Benjamin Serebrin <serebrin@google.com> wrote:
> Sorry for the delay, I was eaten by a grue.
>
> I found that my initial study did not actually measure the number of
> TLB shootdown IPIs sent per TLB shootdown.  I think the intuition was
> correct but I didn't actually observe what I thought I had; my
> original use of probe points was incorrect.  However, after fixing my
> methodology, I'm having trouble proving that the existing Lazy TLB
> mode is working properly.
>
>
>
> I've spent some time trying to reproduce this in a microbenchmark.
> One thread does mmap, touch page, munmap, while other threads in the
> same process are configured to either busy-spin or busy-spin and
> yield.  All threads set their own affinity to a unique cpu, and the
> system is otherwise idle.  I look at the per-cpu delta of the TLB and
> CAL lines of /proc/interrupts over the run of the microbenchmark.
>
> Let's say I have 4 spin threads that never yield.  The mmap thread
> does N unmaps.  I observe each spin-thread core receives N (+/-  small
> noise) TLB shootdown interrupts, and the total TLB interrupt count is
> 4N (+/- small noise).  This is expected behavior.
>
> Then I add some synchronization:  the unmap thread rendezvouses with
> all the spinners, and when they are all ready, the spinners busy-spin
> for D milliseconds and then yield (pthread_yield, sched_yield produce
> identical results, though I'm not confident here that this is the
> right yield).  Meanwhile, the unmap thread busy-spins for D+E
> milliseconds and then does M map/touch/unmaps.  (D, E are single-digit
> milliseconds).  The idea here is that the unmap happens a little while
> after the spinners yielded; the kernel should be in the user process'
> mm but lazy TLB mode should defer TLB flushes.  It seems that lazy
> mode on each CPU should take 1 interrupt and then suppress subsequent
> interrupts.

If they're busy threads, shouldn't the yield return immediately
because the threads are still ready to run?  Lazy TLB won't do much
unless you get the kernel in some state where it's running in the
context of a different kernel thread and hasn't switched to
swapper_pg_dir.  IIRC idle works like that, but you'd need to actually
sleep to go idle.

--Andy

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RFC v6] x86,mm,sched: make lazy TLB mode even lazier
  2016-09-09  4:39               ` Andy Lutomirski
@ 2016-09-09  7:44                 ` Peter Zijlstra
  2017-04-25  3:30                   ` Andy Lutomirski
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2016-09-09  7:44 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Benjamin Serebrin, Ingo Molnar, Rik van Riel, Linus Torvalds,
	H. Peter Anvin, Linux Kernel Mailing List, Andrew Lutomirski,
	Borislav Petkov, Mel Gorman, Thomas Gleixner

On Thu, Sep 08, 2016 at 09:39:45PM -0700, Andy Lutomirski wrote:
> If they're busy threads, shouldn't the yield return immediately
> because the threads are still ready to run?  Lazy TLB won't do much
> unless you get the kernel in some state where it's running in the
> context of a different kernel thread and hasn't switched to
> swapper_pg_dir.  IIRC idle works like that, but you'd need to actually
> sleep to go idle.

Right, a task doing:

	for (;;) sched_yield();

esp. when it's the only runnable thread on the CPU, is a busy thread. It
will not enter switch_mm(), which was where the invalidate hook was
placed IIRC.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RFC v6] x86,mm,sched: make lazy TLB mode even lazier
  2016-09-09  7:44                 ` Peter Zijlstra
@ 2017-04-25  3:30                   ` Andy Lutomirski
  0 siblings, 0 replies; 22+ messages in thread
From: Andy Lutomirski @ 2017-04-25  3:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Benjamin Serebrin, Ingo Molnar, Rik van Riel, Linus Torvalds,
	H. Peter Anvin, Linux Kernel Mailing List, Andrew Lutomirski,
	Borislav Petkov, Mel Gorman, Thomas Gleixner

On Fri, Sep 9, 2016 at 12:44 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, Sep 08, 2016 at 09:39:45PM -0700, Andy Lutomirski wrote:
>> If they're busy threads, shouldn't the yield return immediately
>> because the threads are still ready to run?  Lazy TLB won't do much
>> unless you get the kernel in some state where it's running in the
>> context of a different kernel thread and hasn't switched to
>> swapper_pg_dir.  IIRC idle works like that, but you'd need to actually
>> sleep to go idle.
>
> Right, a task doing:
>
>         for (;;) sched_yield();
>
> esp. when it's the only runnable thread on the CPU, is a busy thread. It
> will not enter switch_mm(), which was where the invalidate hook was
> placed IIRC.

Hi all-

I'm guessing that this patch got abandoned, at least temporarily.  I'm
currently polishing up my PCID series, and I think it might be worth
revisiting this on top of my PCID rework.  The relevant major
infrastructure change I'm making with my PCID code is that I'm adding
an atomic64_t to each mm_context_t that gets incremented every time a
flush on that mm is requested.  With that change, we might be able to
get away with simply removing a cpu from mm_cpumask immediately when
it enters lazy mode, and adding a hook to the scheduler to revalidate
the TLB state when switching mms after having been lazy.
Revalidation would just check that the counter hasn't changed.
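
Roughly, the bookkeeping could look something like the sketch below.
The tlb_flush_count field and both helpers are invented for
illustration and this will not compile as-is; the actual PCID series
may structure it quite differently.

/* Hypothetical field added to mm_context_t: atomic64_t tlb_flush_count; */

/* Per-CPU: the flush generation this CPU's TLB last caught up with.
 * (A real implementation would track this per active mm, not globally
 * per CPU; simplified here for illustration.) */
static DEFINE_PER_CPU(u64, tlb_flush_seen);

/* Anyone requesting a flush of @mm bumps the counter; lazy CPUs that
 * have already been dropped from mm_cpumask(mm) get no IPI at all. */
static void note_flush_request(struct mm_struct *mm)
{
	atomic64_inc(&mm->context.tlb_flush_count);
}

/* At switch_mm(), when coming back to @mm after having been lazy:
 * flush only if a flush was requested in the meantime. */
static void revalidate_tlb(struct mm_struct *mm)
{
	u64 now = atomic64_read(&mm->context.tlb_flush_count);

	if (this_cpu_read(tlb_flush_seen) != now) {
		local_flush_tlb();
		this_cpu_write(tlb_flush_seen, now);
	}
}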

--Andy

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2017-04-25  3:30 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-08-25 19:04 [PATCH RFC UGLY] x86,mm,sched: make lazy TLB mode even lazier Rik van Riel
2016-08-25 19:42 ` H. Peter Anvin
2016-08-25 19:52   ` Rik van Riel
2016-08-25 20:15   ` [PATCH RFC v2] " Rik van Riel
2016-08-25 21:01   ` [PATCH RFC v3] " Rik van Riel
2016-08-27  8:03     ` Ingo Molnar
2016-08-27 23:02       ` Linus Torvalds
2016-08-30 19:53         ` [PATCH RFC v4] " Rik van Riel
2016-08-30 21:09           ` [PATCH RFC v5] " Rik van Riel
2016-08-31 16:27         ` [PATCH RFC v6] " Rik van Riel
2016-09-08  6:56           ` Ingo Molnar
2016-09-09  0:09             ` Benjamin Serebrin
2016-09-09  4:39               ` Andy Lutomirski
2016-09-09  7:44                 ` Peter Zijlstra
2017-04-25  3:30                   ` Andy Lutomirski
2016-08-29 15:24       ` [PATCH RFC v3] " Rik van Riel
2016-08-29 16:08   ` [PATCH RFC UGLY] " Rik van Riel
2016-08-28  8:11 ` Andy Lutomirski
2016-08-29 14:54   ` Rik van Riel
2016-08-29 23:55     ` Andy Lutomirski
2016-08-30  1:14       ` H. Peter Anvin
2016-08-30 18:23         ` Andy Lutomirski
