* [RFC PATCH 1/2] x86/ibpb: Skip IBPB when we switch back to same user process
From: Tim Chen @ 2018-01-25  0:36 UTC
  To: linux-kernel
  Cc: Tim Chen, KarimAllah Ahmed, Andi Kleen, Andrea Arcangeli,
	Andy Lutomirski, Arjan van de Ven, Ashok Raj, Asit Mallick,
	Borislav Petkov, Dan Williams, Dave Hansen, David Woodhouse,
	Greg Kroah-Hartman, H . Peter Anvin, Ingo Molnar,
	Janakarajan Natarajan, Joerg Roedel, Jun Nakajima, Laura Abbott,
	Linus Torvalds, Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	rkrcmar, Thomas Gleixner, Tom Lendacky, x86

These two patches provide an optimization that skips IBPB in this
commonly encountered scenario, where we switch to a kernel idle thread
and then back to the original process:
process A -> idle -> process A

In such a scenario, we do not have to issue IBPB even though the
process is non-dumpable, as we are switching back to the same process
after a hiatus.

The cost is an extra pointer to track the last mm we were using before
switching to the init_mm used by idle.  But avoiding the extra IBPB
is probably worth the extra memory for such a common scenario.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 arch/x86/include/asm/tlbflush.h |  2 ++
 arch/x86/mm/tlb.c               | 13 +++++++++----
 2 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 3effd3c..c5e325f 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -174,6 +174,8 @@ struct tlb_state {
 	struct mm_struct *loaded_mm;
 	u16 loaded_mm_asid;
 	u16 next_asid;
+	/* last user mm */
+	struct mm_struct *last_usr_mm;
 
 	/*
 	 * We can be in one of several states:
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index d5a35fe..86ed07f 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -220,6 +220,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 	} else {
 		u16 new_asid;
 		bool need_flush;
+		struct mm_struct *last = this_cpu_read(cpu_tlbstate.last_usr_mm);
 
 		/*
 		 * Avoid user/user BTB poisoning by flushing the branch predictor
@@ -229,15 +230,17 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
                  * As an optimization flush indirect branches only when
                  * switching into processes that disable dumping.
                  *
-                 * This will not flush branches when switching into kernel
-                 * threads, but it would flush them when switching to the
-                 * idle thread and back.
+		 * This will not flush branches when switching into kernel
+		 * threads. It will also not flush if we switch to idle
+		 * thread and back to the same process. It will flush if we
+		 * switch to a different non-dumpable process.
                  *
                  * It might be useful to have a one-off cache here
                  * to also not flush the idle case, but we would need some
                  * kind of stable sequence number to remember the previous mm.
 		 */
-		if (tsk && tsk->mm && get_dumpable(tsk->mm) != SUID_DUMP_USER)
+		if (tsk && tsk->mm && (tsk->mm != last)
+			&& get_dumpable(tsk->mm) != SUID_DUMP_USER)
 			indirect_branch_prediction_barrier();
 
 		if (IS_ENABLED(CONFIG_VMAP_STACK)) {
@@ -288,6 +291,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 			trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, 0);
 		}
 
+		if (next != &init_mm && next != last)
+			this_cpu_write(cpu_tlbstate.last_usr_mm, next);
 		this_cpu_write(cpu_tlbstate.loaded_mm, next);
 		this_cpu_write(cpu_tlbstate.loaded_mm_asid, new_asid);
 	}
-- 
2.9.4


* [RFC PATCH 2/2] x86/ibpb: Prevent missed IBPB flush
From: Tim Chen @ 2018-01-25  0:36 UTC
  To: linux-kernel
  Cc: Tim Chen, KarimAllah Ahmed, Andi Kleen, Andrea Arcangeli,
	Andy Lutomirski, Arjan van de Ven, Ashok Raj, Asit Mallick,
	Borislav Petkov, Dan Williams, Dave Hansen, David Woodhouse,
	Greg Kroah-Hartman, H . Peter Anvin, Ingo Molnar,
	Janakarajan Natarajan, Joerg Roedel, Jun Nakajima, Laura Abbott,
	Linus Torvalds, Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	rkrcmar, Thomas Gleixner, Tom Lendacky, x86

It is possible that the last user mm that we recorded for a cpu was
released, and a new mm with an identical address was allocated by the
time we check it again.  We would then incorrectly skip the IBPB flush
for the process with the new mm.

This case is difficult to exploit, as we have to exit() a process on a
cpu, free the mm, and fork() the victim so that it reuses the mm
pointer on that cpu.  The attacker needs the old mm to get recycled
for the newly forked process while no other process runs on the target
cpu.

Nevertheless, the patch below is one way to close this hole, by adding
a ref count to prevent the last user mm from being released.  It does
add ref counting overhead, and the extra memory cost of keeping an mm
(though not the VMAs and most of the page tables) around longer than
we otherwise need to.  Better solutions are welcome.

Suggested-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 arch/x86/mm/tlb.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 86ed07f..3bdaa10 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -7,6 +7,7 @@
 #include <linux/export.h>
 #include <linux/cpu.h>
 #include <linux/debugfs.h>
+#include <linux/sched/mm.h>
 
 #include <asm/tlbflush.h>
 #include <asm/mmu_context.h>
@@ -291,8 +292,17 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 			trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, 0);
 		}
 
-		if (next != &init_mm && next != last)
+		if (next != &init_mm && next != last) {
+			if (last != NULL)
+				mmdrop(last);
+			/*
+			 * Keep 'next' allocated until we switch to another mm.
+			 * This keeps us from missing a flush of the branch predictors
+			 * if 'next' gets freed and reallocated.
+			 */
+			mmgrab(next);
 			this_cpu_write(cpu_tlbstate.last_usr_mm, next);
+		}
 		this_cpu_write(cpu_tlbstate.loaded_mm, next);
 		this_cpu_write(cpu_tlbstate.loaded_mm_asid, new_asid);
 	}
@@ -370,6 +380,7 @@ void initialize_tlbstate_and_flush(void)
 	write_cr3(build_cr3(mm->pgd, 0));
 
 	/* Reinitialize tlbstate. */
+	this_cpu_write(cpu_tlbstate.last_usr_mm, NULL);
 	this_cpu_write(cpu_tlbstate.loaded_mm_asid, 0);
 	this_cpu_write(cpu_tlbstate.next_asid, 1);
 	this_cpu_write(cpu_tlbstate.ctxs[0].ctx_id, mm->context.ctx_id);
-- 
2.9.4


* Re: [RFC PATCH 2/2] x86/ibpb: Prevent missed IBPB flush
From: David Woodhouse @ 2018-01-25  8:20 UTC
  To: Tim Chen, linux-kernel
  Cc: KarimAllah Ahmed, Andi Kleen, Andrea Arcangeli, Andy Lutomirski,
	Arjan van de Ven, Ashok Raj, Asit Mallick, Borislav Petkov,
	Dan Williams, Dave Hansen, Greg Kroah-Hartman, H . Peter Anvin,
	Ingo Molnar, Janakarajan Natarajan, Joerg Roedel, Jun Nakajima,
	Laura Abbott, Linus Torvalds, Masami Hiramatsu, Paolo Bonzini,
	Peter Zijlstra, rkrcmar, Thomas Gleixner, Tom Lendacky, x86


On Wed, 2018-01-24 at 16:36 -0800, Tim Chen wrote:
> It is possible that the last user mm that we recorded for a cpu was
> released, and a new mm with an identical address was allocated by the
> time we check it again.  We would then incorrectly skip the IBPB flush
> for the process with the new mm.
> 
> This case is difficult to exploit, as we have to exit() a process on a
> cpu, free the mm, and fork() the victim so that it reuses the mm
> pointer on that cpu.  The attacker needs the old mm to get recycled
> for the newly forked process while no other process runs on the target
> cpu.

That's what it takes to have the victim process leak information into
the cache. In order to *harvest* that information, the attacker must
then get run on the same CPU again? And since her first process had to
exit, as described above, she needs a new process for that?

I confess, with all the other wildly theoretical loopholes that exist,
I wasn't losing much sleep over this one.

> Nevertheless, the patch below is one way to close this hole, by adding
> a ref count to prevent the last user mm from being released.  It does
> add ref counting overhead, and the extra memory cost of keeping an mm
> (though not the VMAs and most of the page tables) around longer than
> we otherwise need to.  Better solutions are welcome.

This has no upper bound on the amount of time the user mm gets held,
right? If a given CPU remains idle forever (and what happens if it's
taken offline?) we'll never do that mmdrop()?

* Re: [RFC PATCH 1/2] x86/ibpb: Skip IBPB when we switch back to same user process
From: Peter Zijlstra @ 2018-01-25  8:58 UTC
  To: Tim Chen
  Cc: linux-kernel, KarimAllah Ahmed, Andi Kleen, Andrea Arcangeli,
	Andy Lutomirski, Arjan van de Ven, Ashok Raj, Asit Mallick,
	Borislav Petkov, Dan Williams, Dave Hansen, David Woodhouse,
	Greg Kroah-Hartman, H . Peter Anvin, Ingo Molnar,
	Janakarajan Natarajan, Joerg Roedel, Jun Nakajima, Laura Abbott,
	Linus Torvalds, Masami Hiramatsu, Paolo Bonzini, rkrcmar,
	Thomas Gleixner, Tom Lendacky, x86

On Wed, Jan 24, 2018 at 04:36:41PM -0800, Tim Chen wrote:
> These two patches provide an optimization that skips IBPB in this
> commonly encountered scenario, where we switch to a kernel idle thread
> and then back to the original process:
> process A -> idle -> process A
> 
> In such a scenario, we do not have to issue IBPB even though the
> process is non-dumpable, as we are switching back to the same process
> after a hiatus.
> 
> The cost is an extra pointer to track the last mm we were using before
> switching to the init_mm used by idle.  But avoiding the extra IBPB
> is probably worth the extra memory for such a common scenario.

So we already track active_mm for kernel threads. I can't immediately
see where this fails for idle and your changelog doesn't say.

> @@ -229,15 +230,17 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
>                   * As an optimization flush indirect branches only when
>                   * switching into processes that disable dumping.
>                   *
> -                 * This will not flush branches when switching into kernel
> -                 * threads, but it would flush them when switching to the
> -                 * idle thread and back.
> +		 * This will not flush branches when switching into kernel
> +		 * threads. It will also not flush if we switch to idle
> +		 * thread and back to the same process. It will flush if we
> +		 * switch to a different non-dumpable process.

Whitespace damage.

>                   *
>                   * It might be useful to have a one-off cache here
>                   * to also not flush the idle case, but we would need some
>                   * kind of stable sequence number to remember the previous mm.
>  		 */
> -		if (tsk && tsk->mm && get_dumpable(tsk->mm) != SUID_DUMP_USER)
> +		if (tsk && tsk->mm && (tsk->mm != last)
> +			&& get_dumpable(tsk->mm) != SUID_DUMP_USER)

Broken coding style, operators go at the end of the previous line.

>  			indirect_branch_prediction_barrier();
>  
>  		if (IS_ENABLED(CONFIG_VMAP_STACK)) {


* Re: [RFC PATCH 1/2] x86/ibpb: Skip IBPB when we switch back to same user process
From: Peter Zijlstra @ 2018-01-25  9:22 UTC
  To: Tim Chen
  Cc: linux-kernel, KarimAllah Ahmed, Andi Kleen, Andrea Arcangeli,
	Andy Lutomirski, Arjan van de Ven, Ashok Raj, Asit Mallick,
	Borislav Petkov, Dan Williams, Dave Hansen, David Woodhouse,
	Greg Kroah-Hartman, H . Peter Anvin, Ingo Molnar,
	Janakarajan Natarajan, Joerg Roedel, Jun Nakajima, Laura Abbott,
	Linus Torvalds, Masami Hiramatsu, Paolo Bonzini, rkrcmar,
	Thomas Gleixner, Tom Lendacky, x86

On Thu, Jan 25, 2018 at 09:58:20AM +0100, Peter Zijlstra wrote:
> On Wed, Jan 24, 2018 at 04:36:41PM -0800, Tim Chen wrote:
> > These two patches provide an optimization that skips IBPB in this
> > commonly encountered scenario, where we switch to a kernel idle thread
> > and then back to the original process:
> > process A -> idle -> process A
> > 
> > In such a scenario, we do not have to issue IBPB even though the
> > process is non-dumpable, as we are switching back to the same process
> > after a hiatus.
> > 
> > The cost is an extra pointer to track the last mm we were using before
> > switching to the init_mm used by idle.  But avoiding the extra IBPB
> > is probably worth the extra memory for such a common scenario.
> 
> So we already track active_mm for kernel threads. I can't immediately
> see where this fails for idle and your changelog doesn't say.

idle_task_exit() explicitly switches back to init_mm when we take the
CPU offline; this very much suggests the active_mm thing works for idle
too.
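
For reference, idle_task_exit() looked roughly like this at the time
(paraphrased from kernel/sched/core.c, not a verbatim quote):

	void idle_task_exit(void)
	{
		struct mm_struct *mm = current->active_mm;

		BUG_ON(cpu_online(smp_processor_id()));

		/* the dying cpu's idle task drops back to init_mm here */
		if (mm != &init_mm) {
			switch_mm(mm, &init_mm, current);
			finish_arch_post_lock_switch();
		}
		mmdrop(mm);
	}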

This means that 'A -> idle -> A' should never pass through switch_mm to
begin with.

Please clarify how you think it does.


* Re: [RFC PATCH 1/2] x86/ibpb: Skip IBPB when we switch back to same user process
From: Arjan van de Ven @ 2018-01-25 13:21 UTC
  To: Peter Zijlstra, Tim Chen
  Cc: linux-kernel, KarimAllah Ahmed, Andi Kleen, Andrea Arcangeli,
	Andy Lutomirski, Ashok Raj, Asit Mallick, Borislav Petkov,
	Dan Williams, Dave Hansen, David Woodhouse, Greg Kroah-Hartman,
	H . Peter Anvin, Ingo Molnar, Janakarajan Natarajan,
	Joerg Roedel, Jun Nakajima, Laura Abbott, Linus Torvalds,
	Masami Hiramatsu, Paolo Bonzini, rkrcmar, Thomas Gleixner,
	Tom Lendacky, x86

> 
> This means that 'A -> idle -> A' should never pass through switch_mm to
> begin with.
> 
> Please clarify how you think it does.
> 

the idle code does leave_mm() to avoid having to IPI CPUs in deep sleep states
for a tlb flush.

(trust me, you really want this: sequentially IPI'ing a pile of cores in a deep
sleep state just to flush a tlb that's empty, the performance of that is horrific)
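
For context, this is the intel_idle path in question (it also shows up
in the hunk that the patch further down this thread removes):

	/*
	 * leave_mm() to avoid costly and often unnecessary wakeups
	 * for flushing the user TLB's associated with the active mm.
	 */
	if (state->flags & CPUIDLE_FLAG_TLB_FLUSHED)
		leave_mm(cpu);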


* Re: [RFC PATCH 1/2] x86/ibpb: Skip IBPB when we switch back to same user process
From: Peter Zijlstra @ 2018-01-25 13:50 UTC
  To: Arjan van de Ven
  Cc: Tim Chen, linux-kernel, KarimAllah Ahmed, Andi Kleen,
	Andrea Arcangeli, Andy Lutomirski, Ashok Raj, Asit Mallick,
	Borislav Petkov, Dan Williams, Dave Hansen, David Woodhouse,
	Greg Kroah-Hartman, H . Peter Anvin, Ingo Molnar,
	Janakarajan Natarajan, Joerg Roedel, Jun Nakajima, Laura Abbott,
	Linus Torvalds, Masami Hiramatsu, Paolo Bonzini, rkrcmar,
	Thomas Gleixner, Tom Lendacky, x86

On Thu, Jan 25, 2018 at 05:21:30AM -0800, Arjan van de Ven wrote:
> > 
> > This means that 'A -> idle -> A' should never pass through switch_mm to
> > begin with.
> > 
> > Please clarify how you think it does.
> > 
> 
> the idle code does leave_mm() to avoid having to IPI CPUs in deep sleep states
> for a tlb flush.

The intel_idle code does, not the idle code. This is squirreled away in
some driver :/

> (trust me, you really want this: sequentially IPI'ing a pile of cores in a deep
> sleep state just to flush a tlb that's empty, the performance of that is horrific)

Hurmph. I'd rather fix that some other way than leave_mm(), this is
piling special on special.


* Re: [RFC PATCH 1/2] x86/ibpb: Skip IBPB when we switch back to same user process
From: Arjan van de Ven @ 2018-01-25 14:07 UTC
  To: Peter Zijlstra
  Cc: Tim Chen, linux-kernel, KarimAllah Ahmed, Andi Kleen,
	Andrea Arcangeli, Andy Lutomirski, Ashok Raj, Asit Mallick,
	Borislav Petkov, Dan Williams, Dave Hansen, David Woodhouse,
	Greg Kroah-Hartman, H . Peter Anvin, Ingo Molnar,
	Janakarajan Natarajan, Joerg Roedel, Jun Nakajima, Laura Abbott,
	Linus Torvalds, Masami Hiramatsu, Paolo Bonzini, rkrcmar,
	Thomas Gleixner, Tom Lendacky, x86

On 1/25/2018 5:50 AM, Peter Zijlstra wrote:
> On Thu, Jan 25, 2018 at 05:21:30AM -0800, Arjan van de Ven wrote:
>>>
>>> This means that 'A -> idle -> A' should never pass through switch_mm to
>>> begin with.
>>>
>>> Please clarify how you think it does.
>>>
>>
>> the idle code does leave_mm() to avoid having to IPI CPUs in deep sleep states
>> for a tlb flush.
> 
> The intel_idle code does, not the idle code. This is squirreled away in
> some driver :/

afaik (but haven't looked in a while) acpi drivers did too
> 
>> (trust me, you really want this: sequentially IPI'ing a pile of cores in a deep
>> sleep state just to flush a tlb that's empty, the performance of that is horrific)
> 
> Hurmph. I'd rather fix that some other way than leave_mm(), this is
> piling special on special.
> 
the problem was tricky, but of course if something better is possible let's figure this out

the problem is that an IPI to an idle cpu is both power inefficient and will take time;
exit from a deep C state can take somewhere in the 50 to 100 usec range (it varies by
many things, but for thinking abstractly about the problem one should generally round
up to nice round numbers)

if you have, say, 64 cores that had the mm at some point, but 63 are idle, the 64th
really does not want to IPI each of those 63 serially (technically this does not need
to be serial, but IPI code is tricky and some things end up serializing this a bit)
and take the 100 usec hit 63 times. Actually, even if it's not serialized, even ONE
hit of 100 usec is unpleasant.
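
(back-of-envelope with those round numbers: 63 wakeups x 100 usec is
roughly 6.3 msec for a single tlb flush if fully serialized, and still
~100 usec of added latency even if perfectly parallel)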

so a CPU that goes idle wants to "unsubscribe" itself from those IPIs as a general objective.

but not getting flush IPIs is only safe if the TLBs in the CPU have nothing that such an
IPI would want to flush, so the TLB needs to be empty of those things.

the only way to do THAT is to switch to an mm that is safe; a leave_mm() does this, but I'm sure other
options exist.

note: while a CPU that is in a deeper C state will itself flush the TLB, you don't know if you will
actually enter that deep a state at the time the OS makes its decisions (if an interrupt comes in
the cycle before mwait, mwait becomes a nop, for example). In addition, once you wake up, you don't
want the CPU to start filling the TLBs with invalid data, so you can't really just set a bit and
flush after leaving idle.


* Re: [RFC PATCH 1/2] x86/ibpb: Skip IBPB when we switch back to same user process
From: Peter Zijlstra @ 2018-01-25 16:41 UTC
  To: Arjan van de Ven
  Cc: Tim Chen, linux-kernel, KarimAllah Ahmed, Andi Kleen,
	Andrea Arcangeli, Andy Lutomirski, Ashok Raj, Asit Mallick,
	Borislav Petkov, Dan Williams, Dave Hansen, David Woodhouse,
	Greg Kroah-Hartman, H . Peter Anvin, Ingo Molnar,
	Janakarajan Natarajan, Joerg Roedel, Jun Nakajima, Laura Abbott,
	Linus Torvalds, Masami Hiramatsu, Paolo Bonzini, rkrcmar,
	Thomas Gleixner, Tom Lendacky, x86

On Thu, Jan 25, 2018 at 06:07:07AM -0800, Arjan van de Ven wrote:
> On 1/25/2018 5:50 AM, Peter Zijlstra wrote:
> > On Thu, Jan 25, 2018 at 05:21:30AM -0800, Arjan van de Ven wrote:
> > > > 
> > > > This means that 'A -> idle -> A' should never pass through switch_mm to
> > > > begin with.
> > > > 
> > > > Please clarify how you think it does.
> > > > 
> > > 
> > > the idle code does leave_mm() to avoid having to IPI CPUs in deep sleep states
> > > for a tlb flush.
> > 
> > The intel_idle code does, not the idle code. This is squirreled away in
> > some driver :/
> 
> afaik (but haven't looked in a while) acpi drivers did too

Only makes it worse.. drivers shouldn't be frobbing with things like
this.

> > > (trust me, you really want this: sequentially IPI'ing a pile of cores in a deep
> > > sleep state just to flush a tlb that's empty, the performance of that is horrific)
> > 
> > Hurmph. I'd rather fix that some other way than leave_mm(), this is
> > piling special on special.
> > 
> the problem was tricky, but of course if something better is possible let's figure this out

How about something like the below? It boots with "nopcid" appended to
the cmdline.

Andy, could you pretty please have a look at this? This is fickle code
at best and I'm sure I messed _something_ up.

The idea is simple, do what we do for virt. Don't send IPI's to CPUs
that don't need them (in virt's case because the vCPU isn't running, in
our case because we're not in fact running a user process), but mark the
CPU as having needed a TLB flush.

Then when we leave the special mode (switching back to a user task),
check our flag and invalidate TLBs if required.

All of this is conditional on tlb_defer_switch_to_init_mm() (aka !PCID)
because otherwise we already switch to init_mm under the hood (we retain
active_mm) unconditionally.

This way active_mm is preserved and we can use something like:

  if (prev != next)
    ibpb();

in switch_mm_irqs_off().


---
 arch/x86/include/asm/acpi.h        |  2 --
 arch/x86/include/asm/mmu_context.h |  1 +
 arch/x86/include/asm/tlbflush.h    |  4 +++-
 arch/x86/mm/tlb.c                  | 38 +++++++++++++++++++++++++++++++++++---
 drivers/acpi/processor_idle.c      |  2 --
 drivers/idle/intel_idle.c          |  8 --------
 kernel/sched/core.c                | 28 +++++++++++++++-------------
 7 files changed, 54 insertions(+), 29 deletions(-)

diff --git a/arch/x86/include/asm/acpi.h b/arch/x86/include/asm/acpi.h
index 8d0ec9df1cbe..72d867f6b518 100644
--- a/arch/x86/include/asm/acpi.h
+++ b/arch/x86/include/asm/acpi.h
@@ -150,8 +150,6 @@ static inline void disable_acpi(void) { }
 extern int x86_acpi_numa_init(void);
 #endif /* CONFIG_ACPI_NUMA */
 
-#define acpi_unlazy_tlb(x)	leave_mm(x)
-
 #ifdef CONFIG_ACPI_APEI
 static inline pgprot_t arch_apei_get_mem_attribute(phys_addr_t addr)
 {
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index c931b88982a0..009c6a450e70 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -180,6 +180,7 @@ static inline void switch_ldt(struct mm_struct *prev, struct mm_struct *next)
 }
 
 void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk);
+void leave_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk);
 
 static inline int init_new_context(struct task_struct *tsk,
 				   struct mm_struct *mm)
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 3effd3c994af..948c0997e6ab 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -189,8 +189,10 @@ struct tlb_state {
 	 *    We're heuristically guessing that the CR3 load we
 	 *    skipped more than makes up for the overhead added by
 	 *    lazy mode.
+	 *
+	 *    XXX
 	 */
-	bool is_lazy;
+	u8 is_lazy;
 
 	/*
 	 * If set we changed the page tables in such a way that we
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index a1561957dccb..f94767d16b24 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -139,7 +139,6 @@ void leave_mm(int cpu)
 
 	switch_mm(NULL, &init_mm, NULL);
 }
-EXPORT_SYMBOL_GPL(leave_mm);
 
 void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 	       struct task_struct *tsk)
@@ -304,12 +303,21 @@ void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
 		 * old mm loaded and only switch to init_mm when
 		 * tlb_remove_page() happens.
 		 */
-		this_cpu_write(cpu_tlbstate.is_lazy, true);
+		this_cpu_write(cpu_tlbstate.is_lazy, 1);
 	} else {
-		switch_mm(NULL, &init_mm, NULL);
+		switch_mm_irqs_off(NULL, &init_mm, NULL);
 	}
 }
 
+void leave_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
+{
+	if (!tlb_defer_switch_to_init_mm())
+		return;
+
+	if (xchg(this_cpu_ptr(&cpu_tlbstate.is_lazy), 0) == 2)
+		switch_mm_irqs_off(NULL, &init_mm, NULL);
+}
+
 /*
  * Call this when reinitializing a CPU.  It fixes the following potential
  * problems:
@@ -496,9 +504,13 @@ static void flush_tlb_func_remote(void *info)
 	flush_tlb_func_common(f, false, TLB_REMOTE_SHOOTDOWN);
 }
 
+static DEFINE_PER_CPU(cpumask_var_t, __tlb_mask);
+
 void native_flush_tlb_others(const struct cpumask *cpumask,
 			     const struct flush_tlb_info *info)
 {
+	struct cpumask *flushmask = this_cpu_cpumask_var_ptr(__tlb_mask);
+
 	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
 	if (info->end == TLB_FLUSH_ALL)
 		trace_tlb_flush(TLB_REMOTE_SEND_IPI, TLB_FLUSH_ALL);
@@ -531,6 +543,19 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 					       (void *)info, 1);
 		return;
 	}
+
+	if (tlb_defer_switch_to_init_mm() && flushmask) {
+		int cpu;
+
+		cpumask_copy(flushmask, cpumask);
+		for_each_cpu(cpu, flushmask) {
+			if (cmpxchg(per_cpu_ptr(&cpu_tlbstate.is_lazy, cpu), 1, 2) >= 1)
+				__cpumask_clear_cpu(cpu, flushmask);
+		}
+
+		cpumask = flushmask;
+	}
+
 	smp_call_function_many(cpumask, flush_tlb_func_remote,
 			       (void *)info, 1);
 }
@@ -688,6 +713,13 @@ static const struct file_operations fops_tlbflush = {
 
 static int __init create_tlb_single_page_flush_ceiling(void)
 {
+	int cpu;
+
+	for_each_possible_cpu(cpu) {
+		zalloc_cpumask_var_node(per_cpu_ptr(&__tlb_mask, cpu),
+					GFP_KERNEL, cpu_to_node(cpu));
+	}
+
 	debugfs_create_file("tlb_single_page_flush_ceiling", S_IRUSR | S_IWUSR,
 			    arch_debugfs_dir, NULL, &fops_tlbflush);
 	return 0;
diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
index d50a7b6ccddd..2736e25e9dc6 100644
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -710,8 +710,6 @@ static DEFINE_RAW_SPINLOCK(c3_lock);
 static void acpi_idle_enter_bm(struct acpi_processor *pr,
 			       struct acpi_processor_cx *cx, bool timer_bc)
 {
-	acpi_unlazy_tlb(smp_processor_id());
-
 	/*
 	 * Must be done before busmaster disable as we might need to
 	 * access HPET !
diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index f0b06b14e782..920e719156db 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -913,17 +913,9 @@ static __cpuidle int intel_idle(struct cpuidle_device *dev,
 	struct cpuidle_state *state = &drv->states[index];
 	unsigned long eax = flg2MWAIT(state->flags);
 	unsigned int cstate;
-	int cpu = smp_processor_id();
 
 	cstate = (((eax) >> MWAIT_SUBSTATE_SIZE) & MWAIT_CSTATE_MASK) + 1;
 
-	/*
-	 * leave_mm() to avoid costly and often unnecessary wakeups
-	 * for flushing the user TLB's associated with the active mm.
-	 */
-	if (state->flags & CPUIDLE_FLAG_TLB_FLUSHED)
-		leave_mm(cpu);
-
 	if (!(lapic_timer_reliable_states & (1 << (cstate))))
 		tick_broadcast_enter();
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d17c5da523a0..74c51da7301a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2750,12 +2750,8 @@ static __always_inline struct rq *
 context_switch(struct rq *rq, struct task_struct *prev,
 	       struct task_struct *next, struct rq_flags *rf)
 {
-	struct mm_struct *mm, *oldmm;
-
 	prepare_task_switch(rq, prev, next);
 
-	mm = next->mm;
-	oldmm = prev->active_mm;
 	/*
 	 * For paravirt, this is coupled with an exit in switch_to to
 	 * combine the page table reload and the switch backend into
@@ -2763,16 +2759,22 @@ context_switch(struct rq *rq, struct task_struct *prev,
 	 */
 	arch_start_context_switch(prev);
 
-	if (!mm) {
-		next->active_mm = oldmm;
-		mmgrab(oldmm);
-		enter_lazy_tlb(oldmm, next);
-	} else
-		switch_mm_irqs_off(oldmm, mm, next);
+	if (!next->mm) {	/* to kernel */
+		next->active_mm = prev->active_mm;
+
+		if (prev->mm) {	/* from user */
+			enter_lazy_tlb(prev->active_mm, next);
+			mmgrab(prev->active_mm);
+		}
+	} else {		/* to user */
+		if (!prev->mm) { /* from kernel */
+			/* will mmdrop() in finish_task_switch(). */
+			leave_lazy_tlb(prev->active_mm, NULL);
+			rq->prev_mm = prev->active_mm;
+			prev->active_mm = NULL;
+		}
 
-	if (!prev->mm) {
-		prev->active_mm = NULL;
-		rq->prev_mm = oldmm;
+		switch_mm_irqs_off(prev->active_mm, next->mm, next);
 	}
 
 	rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);


* Re: [RFC PATCH 2/2] x86/ibpb: Prevent missed IBPB flush
From: Tim Chen @ 2018-01-25 16:56 UTC
  To: David Woodhouse, linux-kernel
  Cc: KarimAllah Ahmed, Andi Kleen, Andrea Arcangeli, Andy Lutomirski,
	Arjan van de Ven, Ashok Raj, Asit Mallick, Borislav Petkov,
	Dan Williams, Dave Hansen, Greg Kroah-Hartman, H . Peter Anvin,
	Ingo Molnar, Janakarajan Natarajan, Joerg Roedel, Jun Nakajima,
	Laura Abbott, Linus Torvalds, Masami Hiramatsu, Paolo Bonzini,
	Peter Zijlstra, rkrcmar, Thomas Gleixner, Tom Lendacky, x86

On 01/25/2018 12:20 AM, David Woodhouse wrote:
> On Wed, 2018-01-24 at 16:36 -0800, Tim Chen wrote:
>> It is possible that the last user mm that we recorded for a cpu was
>> released, and a new mm with an identical address was allocated by the
>> time we check it again.  We would then incorrectly skip the IBPB flush
>> for the process with the new mm.
>>
>> This case is difficult to exploit, as we have to exit() a process on a
>> cpu, free the mm, and fork() the victim so that it reuses the mm
>> pointer on that cpu.  The attacker needs the old mm to get recycled
>> for the newly forked process while no other process runs on the target
>> cpu.
> 
> That's what it takes to have the victim process leak information into
> the cache. In order to *harvest* that information, the attacker must
> then get run on the same CPU again? And since her first process had to
> exit, as described above, she needs a new process for that?
> 
> I confess, with all the other wildly theoretical loopholes that exist,
> I wasn't losing much sleep over this one.
> 
>> Nevertheless, the patch below is one way to close this hole, by adding
>> a ref count to prevent the last user mm from being released.  It does
>> add ref counting overhead, and the extra memory cost of keeping an mm
>> (though not the VMAs and most of the page tables) around longer than
>> we otherwise need to.  Better solutions are welcome.
> 
> This has no upper bound on the amount of time the user mm gets held,
> right? If a given CPU remains idle forever (and what happens if it's
> taken offline?) we'll never do that mmdrop()?
> 

The downside of this approach is that we hold on to the mm longer
than we need to.

Yes, the offline path does need to be fixed up.

Tim


* Re: [RFC PATCH 1/2] x86/ibpb: Skip IBPB when we switch back to same user process
From: Andy Lutomirski @ 2018-01-25 17:04 UTC
  To: Peter Zijlstra
  Cc: Arjan van de Ven, Tim Chen, LKML, KarimAllah Ahmed, Andi Kleen,
	Andrea Arcangeli, Andy Lutomirski, Ashok Raj, Asit Mallick,
	Borislav Petkov, Dan Williams, Dave Hansen, David Woodhouse,
	Greg Kroah-Hartman, H . Peter Anvin, Ingo Molnar,
	Janakarajan Natarajan, Joerg Roedel, Jun Nakajima, Laura Abbott,
	Linus Torvalds, Masami Hiramatsu, Paolo Bonzini, Radim Krcmar,
	Thomas Gleixner, Tom Lendacky, X86 ML

On Thu, Jan 25, 2018 at 8:41 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, Jan 25, 2018 at 06:07:07AM -0800, Arjan van de Ven wrote:
>> On 1/25/2018 5:50 AM, Peter Zijlstra wrote:
>> > On Thu, Jan 25, 2018 at 05:21:30AM -0800, Arjan van de Ven wrote:
>> > > >
>> > > > This means that 'A -> idle -> A' should never pass through switch_mm to
>> > > > begin with.
>> > > >
>> > > > Please clarify how you think it does.
>> > > >
>> > >
>> > > the idle code does leave_mm() to avoid having to IPI CPUs in deep sleep states
>> > > for a tlb flush.
>> >
>> > The intel_idle code does, not the idle code. This is squirreled away in
>> > some driver :/
>>
>> afaik (but haven't looked in a while) acpi drivers did too
>
> Only makes it worse.. drivers shouldn't be frobbing with things like
> this.
>
>> > > (trust me, you really want this: sequentially IPI'ing a pile of cores in a deep
>> > > sleep state just to flush a tlb that's empty, the performance of that is horrific)
>> >
>> > Hurmph. I'd rather fix that some other way than leave_mm(), this is
>> > piling special on special.
>> >
>> the problem was tricky, but of course if something better is possible let's figure this out
>
> How about something like the below? It boots with "nopcid" appended to
> the cmdline.
>
> Andy, could you pretty please have a look at this? This is fickle code
> at best and I'm sure I messed _something_ up.
>
> The idea is simple, do what we do for virt. Don't send IPI's to CPUs
> that don't need them (in virt's case because the vCPU isn't running, in
> our case because we're not in fact running a user process), but mark the
> CPU as having needed a TLB flush.

I haven't tried to fully decipher the patch, but I think the idea is
wrong.  (I think it's the same wrong idea that Rik and I both had and
that I got into Linus' tree for a while...)  The problem is that it's
not actually correct to run indefinitely in kernel mode using stale
cached page table data.  The stale PTEs themselves are fine, but the
stale intermediate translations can cause the CPU to speculatively
load complete garbage into the TLB, and that's bad (and causes MCEs on
AMD CPUs).

I think we only really have two choices: tlb_defer_switch_to_init_mm()
== true and tlb_defer_switch_to_init_mm() == false.  The current
heuristic is to not defer if we have PCID, because loading CR3 is
reasonably fast.
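
For reference, that heuristic looked roughly like this around v4.14
(a paraphrased sketch, not a verbatim quote of the tree):

	static inline bool tlb_defer_switch_to_init_mm(void)
	{
		/*
		 * With PCID, switching to init_mm is cheap, so do it
		 * eagerly.  Without PCID, a CR3 write is expensive, so
		 * defer it and hope to avoid it entirely; the price is
		 * that a lazy cpu keeps receiving TLB flush IPIs.
		 */
		return !static_cpu_has(X86_FEATURE_PCID);
	}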



>  void native_flush_tlb_others(const struct cpumask *cpumask,
>                              const struct flush_tlb_info *info)
>  {
> +       struct cpumask *flushmask = this_cpu_cpumask_var_ptr(__tlb_mask);
> +
>         count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
>         if (info->end == TLB_FLUSH_ALL)
>                 trace_tlb_flush(TLB_REMOTE_SEND_IPI, TLB_FLUSH_ALL);
> @@ -531,6 +543,19 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
>                                                (void *)info, 1);
>                 return;
>         }
> +
> +       if (tlb_defer_switch_to_init_mm() && flushmask) {
> +               int cpu;
> +
> +               cpumask_copy(flushmask, cpumask);
> +               for_each_cpu(cpu, flushmask) {
> +                       if (cmpxchg(per_cpu_ptr(&cpu_tlbstate.is_lazy, cpu), 1, 2) >= 1)
> +                               __cpumask_clear_cpu(cpu, flushmask);

If this code path here executes and we're flushing because we just
removed a reference to a page table and we're about to free the page
table, then the CPU that we didn't notify by IPI can start using
whatever gets written to the pagetable after it's freed, and that's
bad :(

> +               }
> +
> +               cpumask = flushmask;
> +       }
> +
>         smp_call_function_many(cpumask, flush_tlb_func_remote,
>                                (void *)info, 1);
>  }


* Re: [RFC PATCH 1/2] x86/ibpb: Skip IBPB when we switch back to same user process
From: Arjan van de Ven @ 2018-01-25 17:24 UTC
  To: Peter Zijlstra
  Cc: Tim Chen, linux-kernel, KarimAllah Ahmed, Andi Kleen,
	Andrea Arcangeli, Andy Lutomirski, Ashok Raj, Asit Mallick,
	Borislav Petkov, Dan Williams, Dave Hansen, David Woodhouse,
	Greg Kroah-Hartman, H . Peter Anvin, Ingo Molnar,
	Janakarajan Natarajan, Joerg Roedel, Jun Nakajima, Laura Abbott,
	Linus Torvalds, Masami Hiramatsu, Paolo Bonzini, rkrcmar,
	Thomas Gleixner, Tom Lendacky, x86

> The idea is simple, do what we do for virt. Don't send IPI's to CPUs
> that don't need them (in virt's case because the vCPU isn't running, in
> our case because we're not in fact running a user process), but mark the
> CPU as having needed a TLB flush.

I am really uncomfortable with that idea.
You really can't run code safely on a cpu where the TLBs in the CPU are invalid
or where a CPU that does (partial) page walks would install invalid PTEs either
through actual or through speculative execution.

(in the virt case there's a cheat: since the code is not actually running,
there isn't a cpu with live TLBs. You can't do that same cheat for this case)


* Re: [RFC PATCH 1/2] x86/ibpb: Skip IBPB when we switch back to same user process
From: Peter Zijlstra @ 2018-01-25 18:18 UTC
  To: Andy Lutomirski
  Cc: Arjan van de Ven, Tim Chen, LKML, KarimAllah Ahmed, Andi Kleen,
	Andrea Arcangeli, Ashok Raj, Asit Mallick, Borislav Petkov,
	Dan Williams, Dave Hansen, David Woodhouse, Greg Kroah-Hartman,
	H . Peter Anvin, Ingo Molnar, Janakarajan Natarajan,
	Joerg Roedel, Jun Nakajima, Laura Abbott, Linus Torvalds,
	Masami Hiramatsu, Paolo Bonzini, Radim Krcmar, Thomas Gleixner,
	Tom Lendacky, X86 ML

On Thu, Jan 25, 2018 at 09:04:21AM -0800, Andy Lutomirski wrote:
> I haven't tried to fully decipher the patch, but I think the idea is
> wrong.  (I think it's the same wrong idea that Rik and I both had and
> that I got into Linus' tree for a while...)  The problem is that it's
> not actually correct to run indefinitely in kernel mode using stale
> cached page table data.  The stale PTEs themselves are fine, but the
> stale intermediate translations can cause the CPU to speculatively
> load complete garbage into the TLB, and that's bad (and causes MCEs on
> AMD CPUs).

Urggh.. indeed :/

> I think we only really have two choices: tlb_defer_switch_to_init_mm()
> == true and tlb_defer_switch_to_init_mm() == false.  The current
> heuristic is to not defer if we have PCID, because loading CR3 is
> reasonably fast.

I just _really_ _really_ hate idle drivers doing leave_mm(). I don't
suppose limiting the !IPI case to just the idle case would be correct
either, because between waking from idle and testing our 'should I have
invalidated' bit it can (however unlikely) speculate into stale TLB
entries too..


* Re: [RFC PATCH 1/2] x86/ibpb: Skip IBPB when we switch back to same user process
From: Tim Chen @ 2018-01-25 19:32 UTC
  To: Peter Zijlstra, Andy Lutomirski
  Cc: Arjan van de Ven, LKML, KarimAllah Ahmed, Andi Kleen,
	Andrea Arcangeli, Ashok Raj, Asit Mallick, Borislav Petkov,
	Dan Williams, Dave Hansen, David Woodhouse, Greg Kroah-Hartman,
	H . Peter Anvin, Ingo Molnar, Janakarajan Natarajan,
	Joerg Roedel, Jun Nakajima, Laura Abbott, Linus Torvalds,
	Masami Hiramatsu, Paolo Bonzini, Radim Krcmar, Thomas Gleixner,
	Tom Lendacky, X86 ML

On 01/25/2018 10:18 AM, Peter Zijlstra wrote:
> On Thu, Jan 25, 2018 at 09:04:21AM -0800, Andy Lutomirski wrote:
>> I haven't tried to fully decipher the patch, but I think the idea is
>> wrong.  (I think it's the same wrong idea that Rik and I both had and
>> that I got into Linus' tree for a while...)  The problem is that it's
>> not actually correct to run indefinitely in kernel mode using stale
>> cached page table data.  The stale PTEs themselves are fine, but the
>> stale intermediate translations can cause the CPU to speculatively
>> load complete garbage into the TLB, and that's bad (and causes MCEs on
>> AMD CPUs).
> 
> Urggh.. indeed :/
> 
>> I think we only really have two choices: tlb_defer_switch_to_init_mm()
>> == true and tlb_defer_switch_to_init_mm() == false.  The current
>> heuristic is to not defer if we have PCID, because loading CR3 is
>> reasonably fast.
> 
> I just _really_ _really_ hate idle drivers doing leave_mm(). I don't
> suppose limiting the !IPI case to just the idle case would be correct
> either, because between waking from idle and testing our 'should I have
> invalidated' bit it can (however unlikely) speculate into stale TLB
> entries too..
> 
> 

Peter, 

This patch is not ideal as it comes with the caveats that
patch 2 tries to close.  I put it out here to see if it can prompt
people to come up with a better solution. Keeping active_mm around would
have been cleaner but it looks like there are issues that Andy mentioned.

The "A -> idle -> A" case would not trigger IBPB if tlb_defer_switch_to_init_mm()
is true (non-PCID) as we do not change the mm.

This patch tries to address the case when we do switch to init_mm and back.
Do you still have objections to the approach in this patch
to save the last active mm before switching to init_mm?

Tim


* Re: [RFC PATCH 1/2] x86/ibpb: Skip IBPB when we switch back to same user process
From: Arjan van de Ven @ 2018-01-25 19:34 UTC
  To: Tim Chen, Peter Zijlstra, Andy Lutomirski
  Cc: LKML, KarimAllah Ahmed, Andi Kleen, Andrea Arcangeli, Ashok Raj,
	Asit Mallick, Borislav Petkov, Dan Williams, Dave Hansen,
	David Woodhouse, Greg Kroah-Hartman, H . Peter Anvin,
	Ingo Molnar, Janakarajan Natarajan, Joerg Roedel, Jun Nakajima,
	Laura Abbott, Linus Torvalds, Masami Hiramatsu, Paolo Bonzini,
	Radim Krcmar, Thomas Gleixner, Tom Lendacky, X86 ML

> 
> This patch tries to address the case when we do switch to init_mm and back.
> Do you still have objections to the approach in this patch
> to save the last active mm before switching to init_mm?

how do you know the last active mm did not go away and get reused by a new process with new content?
(other than by taking a reference, which has other side effects)


* Re: [RFC PATCH 1/2] x86/ibpb: Skip IBPB when we switch back to same user process
From: Dave Hansen @ 2018-01-25 19:45 UTC
  To: Arjan van de Ven, Tim Chen, Peter Zijlstra, Andy Lutomirski
  Cc: LKML, KarimAllah Ahmed, Andi Kleen, Andrea Arcangeli, Ashok Raj,
	Asit Mallick, Borislav Petkov, Dan Williams, David Woodhouse,
	Greg Kroah-Hartman, H . Peter Anvin, Ingo Molnar,
	Janakarajan Natarajan, Joerg Roedel, Jun Nakajima, Laura Abbott,
	Linus Torvalds, Masami Hiramatsu, Paolo Bonzini, Radim Krcmar,
	Thomas Gleixner, Tom Lendacky, X86 ML

On 01/25/2018 11:34 AM, Arjan van de Ven wrote:
>> This patch tries to address the case when we do switch to init_mm
>> and back. Do you still have objections to the approach in this
>> patch to save the last active mm before switching to init_mm?
> 
> how do you know the last active mm did not go away and get reused by a
> new process with new content?
> (other than by taking a reference, which has other side effects)

We couldn't think of an easy way to prevent mm reuse other than taking a
reference.  Think of it this way: the mm getting run poisons a CPU.  We
can either go at exit() time and do IBPB on every CPU that might have
poison from the mm.  Or, we do IBPB once on each CPU the first time the
mm runs there to make sure that no old poison is still around.

Both of those require per-cpu state in the mm, kinda like the TLB
tracking.  That will not be fun to get right.  It also adds overhead to
the common-case exit() or fork() paths.  Also not fun.
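
A rough sketch of the second variant, purely illustrative (the mask
field and its placement are hypothetical, and real code would also
have to handle mask initialization and cpu hotplug):

	/* hypothetical per-mm mask of cpus this mm has already run on */
	struct mm_struct {
		/* ... */
		cpumask_t ibpb_done_mask;
	};

	/* in switch_mm_irqs_off(), when switching to 'next': */
	if (!cpumask_test_and_set_cpu(smp_processor_id(),
				      &next->ibpb_done_mask))
		indirect_branch_prediction_barrier();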

The refcount just eats a little memory for the mm *itself* but none of
the actual expensive stuff: VMAs or page tables that hang off the mm.
It's also zero-cost at fork/exit.  The going-to-idle cost is manageable
and is *certainly* outweighed by the cost of even one extra IBPB that we
would otherwise have to do.


* Re: [RFC PATCH 1/2] x86/ibpb: Skip IBPB when we switch back to same user process
From: Peter Zijlstra @ 2018-01-25 20:46 UTC
  To: Tim Chen
  Cc: Andy Lutomirski, Arjan van de Ven, LKML, KarimAllah Ahmed,
	Andi Kleen, Andrea Arcangeli, Ashok Raj, Asit Mallick,
	Borislav Petkov, Dan Williams, Dave Hansen, David Woodhouse,
	Greg Kroah-Hartman, H . Peter Anvin, Ingo Molnar,
	Janakarajan Natarajan, Joerg Roedel, Jun Nakajima, Laura Abbott,
	Linus Torvalds, Masami Hiramatsu, Paolo Bonzini, Radim Krcmar,
	Thomas Gleixner, Tom Lendacky, X86 ML

On Thu, Jan 25, 2018 at 11:32:46AM -0800, Tim Chen wrote:
> 
> This patch is not ideal as it comes with the caveats that
> patch 2 tries to close.  I put it out here to see if it can prompt
> people to come up with a better solution. Keeping active_mm around would
> have been cleaner but it looks like there are issues that Andy mentioned.
> 
> The "A -> idle -> A" case would not trigger IBPB if tlb_defer_switch_to_init_mm()
> is true (non-PCID) as we do not change the mm.
> 
> This patch tries to address the case when we do switch to init_mm and back.
> Do you still have objections to the approach in this patch
> to save the last active mm before switching to init_mm?

I still think the existing active_mm is sufficient. Something like:

  switch_mm()
  {
	...
	if (prev && next != prev)
		ibpb();
	...
  }

should work. Because while the idle crud does leave_mm() and PCID does
enter_lazy_tlb() and both end up doing: switch_mm(NULL, &init_mm, NULL),
nothing there affects tsk->active_mm.

So over the "A -> idle -> A" transition, active_mm should actually track
what you want.


* Re: [RFC PATCH 1/2] x86/ibpb: Skip IBPB when we switch back to same user process
From: Andy Lutomirski @ 2018-01-25 21:04 UTC
  To: Peter Zijlstra
  Cc: Tim Chen, Andy Lutomirski, Arjan van de Ven, LKML,
	KarimAllah Ahmed, Andi Kleen, Andrea Arcangeli, Ashok Raj,
	Asit Mallick, Borislav Petkov, Dan Williams, Dave Hansen,
	David Woodhouse, Greg Kroah-Hartman, H . Peter Anvin,
	Ingo Molnar, Janakarajan Natarajan, Joerg Roedel, Jun Nakajima,
	Laura Abbott, Linus Torvalds, Masami Hiramatsu, Paolo Bonzini,
	Radim Krcmar, Thomas Gleixner, Tom Lendacky, X86 ML

On Thu, Jan 25, 2018 at 12:46 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, Jan 25, 2018 at 11:32:46AM -0800, Tim Chen wrote:
>>
>> This patch is not ideal as it comes with the caveats that
>> patch 2 tries to close.  I put it out here to see if it can prompt
>> people to come up with a better solution. Keeping active_mm around would
>> have been cleaner but it looks like there are issues that Andy mentioned.
>>
>> The "A -> idle -> A" case would not trigger IBPB if tlb_defer_switch_to_init_mm()
>> is true (non-PCID) as we do not change the mm.
>>
>> This patch tries to address the case when we do switch to init_mm and back.
>> Do you still have objections to the approach in this patch
>> to save the last active mm before switching to init_mm?
>
> I still think the existing active_mm is sufficient. Something like:
>
>   switch_mm()
>   {
>         ...
>         if (prev && next != prev)
>                 ibpb();
>         ...
>   }
>
> should work. Because while the idle crud does leave_mm() and PCID does
> enter_lazy_tlb() and both end up doing: switch_mm(NULL, &init_mm, NULL),
> nothing there affects tsk->active_mm.
>
> So over the "A -> idle -> A" transition, active_mm should actually track
> what you want.
>
>

Can we please not rely on any of the active_mm shit?  That thing has
really weird semantics and should just die.

That being said, just stashing last_user_mm without any refcounting
should be fine.  After all, the only thing anyone does with it is
comparing to next, and next is always alive.  Or we could use
last_user_ctx_id, since we already have a never-reused ctx_id for each
mm on x86.
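
Something along these lines (untested sketch; last_ctx_id as a new
cpu_tlbstate field is an assumption, but mm->context.ctx_id already
exists on x86 and is never reused):

	if (tsk && tsk->mm &&
	    tsk->mm->context.ctx_id != this_cpu_read(cpu_tlbstate.last_ctx_id) &&
	    get_dumpable(tsk->mm) != SUID_DUMP_USER)
		indirect_branch_prediction_barrier();

	/* ... and once the switch is done: */
	this_cpu_write(cpu_tlbstate.last_ctx_id, next->context.ctx_id);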

--Andy


* Re: [RFC PATCH 1/2] x86/ibpb: Skip IBPB when we switch back to same user process
From: Andi Kleen @ 2018-01-25 21:57 UTC
  To: Andy Lutomirski
  Cc: Peter Zijlstra, Tim Chen, Arjan van de Ven, LKML,
	KarimAllah Ahmed, Andrea Arcangeli, Ashok Raj, Asit Mallick,
	Borislav Petkov, Dan Williams, Dave Hansen, David Woodhouse,
	Greg Kroah-Hartman, H . Peter Anvin, Ingo Molnar,
	Janakarajan Natarajan, Joerg Roedel, Jun Nakajima

Andy Lutomirski <luto@kernel.org> writes:
>
> That being said, just stashing last_user_mm without any refcounting
> should be fine.

If last_user_mm is freed and reallocated by a different process,
then that would miss the IBPB incorrectly.

-Andi


* Re: [RFC PATCH 1/2] x86/ibpb: Skip IBPB when we switch back to same user process
From: Andy Lutomirski @ 2018-01-25 22:01 UTC
  To: Andi Kleen
  Cc: Andy Lutomirski, Peter Zijlstra, Tim Chen, Arjan van de Ven,
	LKML, KarimAllah Ahmed, Andrea Arcangeli, Ashok Raj,
	Asit Mallick, Borislav Petkov, Dan Williams, Dave Hansen,
	David Woodhouse, Greg Kroah-Hartman, H . Peter Anvin,
	Ingo Molnar, Janakarajan Natarajan, Joerg Roedel, Jun Nakajima

On Thu, Jan 25, 2018 at 1:57 PM, Andi Kleen <ak@linux.intel.com> wrote:
> Andy Lutomirski <luto@kernel.org> writes:
>>
>> That being said, just stashing last_user_mm without any refcounting
>> should be fine.
>
> If last_user_mm is freed and reallocated by a different process,
> then that would miss the IBPB incorrectly.
>

Hmm, right.  So ctx_id it is.

--Andy


* Re: [RFC PATCH 1/2] x86/ibpb: Skip IBPB when we switch back to same user process
From: Tim Chen @ 2018-01-25 23:07 UTC
  To: Andy Lutomirski, Andi Kleen
  Cc: Peter Zijlstra, Arjan van de Ven, LKML, KarimAllah Ahmed,
	Andrea Arcangeli, Ashok Raj, Asit Mallick, Borislav Petkov,
	Dan Williams, Dave Hansen, David Woodhouse, Greg Kroah-Hartman,
	H . Peter Anvin, Ingo Molnar, Janakarajan Natarajan,
	Joerg Roedel, Jun Nakajima

On 01/25/2018 02:01 PM, Andy Lutomirski wrote:
> On Thu, Jan 25, 2018 at 1:57 PM, Andi Kleen <ak@linux.intel.com> wrote:
>> Andy Lutomirski <luto@kernel.org> writes:
>>>
>>> That being said, just stashing last_user_mm without any refcounting
>>> should be fine.
>>
>> If last_user_mm is freed and reallocated by a different process,
>> then that would miss the IBPB incorrectly.
>>
> 
> Hmm, right.  So ctx_id it is.
> 
> --Andy
> 
Thanks.  Using ctx_id is a pretty clean approach.  I will refresh
this patch and drop the second patch.

Tim


* Re: [RFC PATCH 1/2] x86/ibpb: Skip IBPB when we switch back to same user process
From: Peter Zijlstra @ 2018-01-26  0:01 UTC
  To: Andy Lutomirski
  Cc: Tim Chen, Arjan van de Ven, LKML, KarimAllah Ahmed, Andi Kleen,
	Andrea Arcangeli, Ashok Raj, Asit Mallick, Borislav Petkov,
	Dan Williams, Dave Hansen, David Woodhouse, Greg Kroah-Hartman,
	H . Peter Anvin, Ingo Molnar, Janakarajan Natarajan,
	Joerg Roedel, Jun Nakajima, Laura Abbott, Linus Torvalds,
	Masami Hiramatsu, Paolo Bonzini, Radim Krcmar, Thomas Gleixner,
	Tom Lendacky, X86 ML

On Thu, Jan 25, 2018 at 01:04:23PM -0800, Andy Lutomirski wrote:
> Can we please not rely on any of the active_mm shit?  That thing has
> really weird semantics and should just die.

I don't agree on the weird semantics. It's simply the last user mm.

I won't mind seeing it go as it would reduce the number of atomics on
the schedule path, but it's really not terrible.


Thread overview: 22+ messages
2018-01-25  0:36 [RFC PATCH 1/2] x86/ibpb: Skip IBPB when we switch back to same user process Tim Chen
2018-01-25  0:36 ` [RFC PATCH 2/2] x86/ibpb: Prevent missed IBPB flush Tim Chen
2018-01-25  8:20   ` David Woodhouse
2018-01-25 16:56     ` Tim Chen
2018-01-25  8:58 ` [RFC PATCH 1/2] x86/ibpb: Skip IBPB when we switch back to same user process Peter Zijlstra
2018-01-25  9:22   ` Peter Zijlstra
2018-01-25 13:21     ` Arjan van de Ven
2018-01-25 13:50       ` Peter Zijlstra
2018-01-25 14:07         ` Arjan van de Ven
2018-01-25 16:41           ` Peter Zijlstra
2018-01-25 17:04             ` Andy Lutomirski
2018-01-25 18:18               ` Peter Zijlstra
2018-01-25 19:32                 ` Tim Chen
2018-01-25 19:34                   ` Arjan van de Ven
2018-01-25 19:45                     ` Dave Hansen
2018-01-25 20:46                   ` Peter Zijlstra
2018-01-25 21:04                     ` Andy Lutomirski
2018-01-25 21:57                       ` Andi Kleen
2018-01-25 22:01                         ` Andy Lutomirski
2018-01-25 23:07                           ` Tim Chen
2018-01-26  0:01                       ` Peter Zijlstra
2018-01-25 17:24             ` Arjan van de Ven
