[patch] x86, mm: avoid stale tlb entries by clearing prev mm

All of lore.kernel.org
 help / color / mirror / Atom feed

* [patch] x86, mm: avoid stale tlb entries by clearing prev mm_cpumask after switching mm
@ 2011-02-02 20:07 Suresh Siddha
  2011-02-03  1:23 ` Andi Kleen
  2011-02-03  1:55 ` Linus Torvalds
  0 siblings, 2 replies; 17+ messages in thread
From: Suresh Siddha @ 2011-02-02 20:07 UTC (permalink / raw)
  To: H. Peter Anvin, Ingo Molnar, Thomas Gleixner, Linus Torvalds
  Cc: LKML, Mallick, Asit K

For the prev mm that is handing over the cpu to another mm, clear the cpu
from the mm_cpumask(prev) after the cr3 is changed.

Otherwise, clearing the mm_cpumask early will avoid the flush tlb IPI's while
the cr3 and TLB's are still pointing to the prev mm. And this window can lead
to the stale (global) TLB entries.

Marking it for -stable, though we haven't seen any reported failure that
can be attributed to this.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: stable@kernel.org	[v2.6.32+]
---
 arch/x86/include/asm/mmu_context.h |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 4a2d4e0..8b5393e 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -36,8 +36,6 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 	unsigned cpu = smp_processor_id();
 
 	if (likely(prev != next)) {
-		/* stop flush ipis for the previous mm */
-		cpumask_clear_cpu(cpu, mm_cpumask(prev));
 #ifdef CONFIG_SMP
 		percpu_write(cpu_tlbstate.state, TLBSTATE_OK);
 		percpu_write(cpu_tlbstate.active_mm, next);
@@ -47,6 +45,9 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 		/* Re-load page tables */
 		load_cr3(next->pgd);
 
+		/* stop flush ipis for the previous mm */
+		cpumask_clear_cpu(cpu, mm_cpumask(prev));
+
 		/*
 		 * load the LDT, if the LDT is different:
 		 */



^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [patch] x86, mm: avoid stale tlb entries by clearing prev mm_cpumask after switching mm
  2011-02-02 20:07 [patch] x86, mm: avoid stale tlb entries by clearing prev mm_cpumask after switching mm Suresh Siddha
@ 2011-02-03  1:23 ` Andi Kleen
  2011-02-03  1:55   ` Suresh Siddha
  2011-02-03  1:55 ` Linus Torvalds
  1 sibling, 1 reply; 17+ messages in thread
From: Andi Kleen @ 2011-02-03  1:23 UTC (permalink / raw)
  To: Suresh Siddha
  Cc: H. Peter Anvin, Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	LKML, Mallick, Asit K

Suresh Siddha <suresh.b.siddha@intel.com> writes:

> For the prev mm that is handing over the cpu to another mm, clear the cpu
> from the mm_cpumask(prev) after the cr3 is changed.
>
> Otherwise, clearing the mm_cpumask early will avoid the flush tlb IPI's while
> the cr3 and TLB's are still pointing to the prev mm. And this window can lead
> to the stale (global) TLB entries.
>
> Marking it for -stable, though we haven't seen any reported failure that
> can be attributed to this.

Would it be safer to add a memory barrier between the load_cr3 and the
cpumask_clear_cpu()? As far as I can see cpumask_clear_cpu doesn't
imply a general one and load_cr3 doesn't either. There's this
__force_order hack in system.h, but I don't think it will enforce
order here.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [patch] x86, mm: avoid stale tlb entries by clearing prev mm_cpumask after switching mm
  2011-02-03  1:23 ` Andi Kleen
@ 2011-02-03  1:55   ` Suresh Siddha
  2011-02-03  2:15     ` Andi Kleen
  0 siblings, 1 reply; 17+ messages in thread
From: Suresh Siddha @ 2011-02-03  1:55 UTC (permalink / raw)
  To: Andi Kleen
  Cc: H. Peter Anvin, Ingo Molnar, Thomas Gleixner, Linus Torvalds,
	LKML, Mallick, Asit K

On Wed, 2011-02-02 at 17:23 -0800, Andi Kleen wrote:
> Suresh Siddha <suresh.b.siddha@intel.com> writes:
> 
> > For the prev mm that is handing over the cpu to another mm, clear the cpu
> > from the mm_cpumask(prev) after the cr3 is changed.
> >
> > Otherwise, clearing the mm_cpumask early will avoid the flush tlb IPI's while
> > the cr3 and TLB's are still pointing to the prev mm. And this window can lead
> > to the stale (global) TLB entries.
> >
> > Marking it for -stable, though we haven't seen any reported failure that
> > can be attributed to this.
> 
> Would it be safer to add a memory barrier between the load_cr3 and the
> cpumask_clear_cpu()? As far as I can see cpumask_clear_cpu doesn't
> imply a general one and load_cr3 doesn't either. There's this
> __force_order hack in system.h, but I don't think it will enforce
> order here.

I thought "asm volatile" is going to take care of that.

If not, then we have issues even today. no?

thanks,
suresh


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [patch] x86, mm: avoid stale tlb entries by clearing prev mm_cpumask after switching mm
  2011-02-02 20:07 [patch] x86, mm: avoid stale tlb entries by clearing prev mm_cpumask after switching mm Suresh Siddha
  2011-02-03  1:23 ` Andi Kleen
@ 2011-02-03  1:55 ` Linus Torvalds
  2011-02-03  4:03   ` Linus Torvalds
  1 sibling, 1 reply; 17+ messages in thread
From: Linus Torvalds @ 2011-02-03  1:55 UTC (permalink / raw)
  To: Suresh Siddha
  Cc: H. Peter Anvin, Ingo Molnar, Thomas Gleixner, LKML, Mallick, Asit K

On Wed, Feb 2, 2011 at 12:07 PM, Suresh Siddha
<suresh.b.siddha@intel.com> wrote:
> For the prev mm that is handing over the cpu to another mm, clear the cpu
> from the mm_cpumask(prev) after the cr3 is changed.
>
> Otherwise, clearing the mm_cpumask early will avoid the flush tlb IPI's while
> the cr3 and TLB's are still pointing to the prev mm. And this window can lead
> to the stale (global) TLB entries.

Why?

This looks pointless. Explain why this matters. Global entries are
never per-mm, so any global entries can never care about the
mm_cpumask.

And for any normal entries it doesn't matter if the IPI gets lost,
since the TLB will be flushed (immediately afterwards) by the cr3
write.

So no, I will not accept this patch, and it shouldn't be marked for
stable. Unless you can show exactly what is the point of it, and
exactly what the bug is.

                       Linus

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [patch] x86, mm: avoid stale tlb entries by clearing prev mm_cpumask after switching mm
  2011-02-03  1:55   ` Suresh Siddha
@ 2011-02-03  2:15     ` Andi Kleen
  2011-02-03  4:12       ` Linus Torvalds
  0 siblings, 1 reply; 17+ messages in thread
From: Andi Kleen @ 2011-02-03  2:15 UTC (permalink / raw)
  To: Suresh Siddha
  Cc: Andi Kleen, H. Peter Anvin, Ingo Molnar, Thomas Gleixner,
	Linus Torvalds, LKML, Mallick, Asit K

> I thought "asm volatile" is going to take care of that.

asm volatile just prevents deletion:

>>
The `volatile' keyword indicates that the instruction has important
side-effects.  GCC will not delete a volatile `asm' if it is reachable.
(The instruction can still be deleted if GCC can prove that
control-flow will never reach the location of the instruction.)  Note
that even a volatile `asm' instruction can be moved relative to other
code, including across jump instructions.
<<

> If not, then we have issues even today. no?

Well yes. It depends on the compiler if it actually moves something.
So most likely nothing was reordered yet, but it's safer to prevent
it.

I normally go by the rule that if correctness requires a specific
order without explicit side effects I add memory barriers.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [patch] x86, mm: avoid stale tlb entries by clearing prev mm_cpumask after switching mm
  2011-02-03  1:55 ` Linus Torvalds
@ 2011-02-03  4:03   ` Linus Torvalds
  2011-02-03 18:27     ` Suresh Siddha
  0 siblings, 1 reply; 17+ messages in thread
From: Linus Torvalds @ 2011-02-03  4:03 UTC (permalink / raw)
  To: Suresh Siddha
  Cc: H. Peter Anvin, Ingo Molnar, Thomas Gleixner, LKML, Mallick, Asit K

On Wed, Feb 2, 2011 at 5:55 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Wed, Feb 2, 2011 at 12:07 PM, Suresh Siddha
> <suresh.b.siddha@intel.com> wrote:
>> For the prev mm that is handing over the cpu to another mm, clear the cpu
>> from the mm_cpumask(prev) after the cr3 is changed.
>>
>> Otherwise, clearing the mm_cpumask early will avoid the flush tlb IPI's while
>> the cr3 and TLB's are still pointing to the prev mm. And this window can lead
>> to the stale (global) TLB entries.
>
> Why?
>
> This looks pointless. Explain why this matters. Global entries are
> never per-mm, so any global entries can never care about the
> mm_cpumask.
>
> And for any normal entries it doesn't matter if the IPI gets lost,
> since the TLB will be flushed (immediately afterwards) by the cr3
> write.

Actually, for normal entries I could well imagine the code that wants
to do a flush before freeing the page caring.

So I think the _patch_ may be correct, but the changelog is definitely
not correct, and needs serious surgery to explain what the bug that
this fixes actually is.

                         Linus

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [patch] x86, mm: avoid stale tlb entries by clearing prev mm_cpumask after switching mm
  2011-02-03  2:15     ` Andi Kleen
@ 2011-02-03  4:12       ` Linus Torvalds
  2011-02-03  4:22         ` H. Peter Anvin
  0 siblings, 1 reply; 17+ messages in thread
From: Linus Torvalds @ 2011-02-03  4:12 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Suresh Siddha, H. Peter Anvin, Ingo Molnar, Thomas Gleixner,
	LKML, Mallick, Asit K

On Wed, Feb 2, 2011 at 6:15 PM, Andi Kleen <andi@firstfloor.org> wrote:
>> I thought "asm volatile" is going to take care of that.
>
> asm volatile just prevents deletion:

No. We also assume that gcc does the sane thing.

If "asm volatile" means that the asm can still be moved around
arbitrarily (including across other asm volatiles etc), then the whole
"volatile" has no meaning at all.

So there's no point in pointing to known-bogus gcc documentation.
Afaik, there's at least one bugzilla entry about that bogus
documentation by hpa, and the traditional meaning - and documentation
- of "volatile" on asms was that they wouldn't be moved around
"significantly".

For example, from the previous time we talked about asm volatiles, hpa said:

  "The other thing that the documentation *does* specifically make
clear is that an "asm volatile" has an implicit dependency on all
memory (as an input, as opposed to an output/clobber.)"

(quoting some C++ documentation), and the fact is that without rules
like that, "volatile" really is totally meaningless on asm's, and it
would be a totally pointless keyword (since EVERY SINGLE USE would
have to be replaced by a memory clobber).

So right now, we pretty much depend on "asm volatile" (a) not being
re-ordered wrt other asm volatiles and (b) having that dependency on
memory.

Iirc, the gcc people even agreed on this. Peter may remember details better..

                       Linus

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [patch] x86, mm: avoid stale tlb entries by clearing prev mm_cpumask after switching mm
  2011-02-03  4:12       ` Linus Torvalds
@ 2011-02-03  4:22         ` H. Peter Anvin
  0 siblings, 0 replies; 17+ messages in thread
From: H. Peter Anvin @ 2011-02-03  4:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andi Kleen, Suresh Siddha, Ingo Molnar, Thomas Gleixner, LKML,
	Mallick, Asit K

On 02/02/2011 08:12 PM, Linus Torvalds wrote:
> 
> So right now, we pretty much depend on "asm volatile" (a) not being
> re-ordered wrt other asm volatiles and (b) having that dependency on
> memory.
> 
> Iirc, the gcc people even agreed on this. Peter may remember details better..
> 

Yes, IIRC an "asm volatile" is assumed to be a universal consumer of
memory (it is only a universal producer if the "memory" clobber is
used), as well as being ordered with respect to other volatile operations.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [patch] x86, mm: avoid stale tlb entries by clearing prev mm_cpumask after switching mm
  2011-02-03  4:03   ` Linus Torvalds
@ 2011-02-03 18:27     ` Suresh Siddha
  2011-02-03 19:13       ` Linus Torvalds
  0 siblings, 1 reply; 17+ messages in thread
From: Suresh Siddha @ 2011-02-03 18:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: H. Peter Anvin, Ingo Molnar, Thomas Gleixner, LKML, Mallick, Asit K

On Wed, 2011-02-02 at 20:03 -0800, Linus Torvalds wrote:
> On Wed, Feb 2, 2011 at 5:55 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> > This looks pointless. Explain why this matters. Global entries are
> > never per-mm, so any global entries can never care about the
> > mm_cpumask.
> >
> > And for any normal entries it doesn't matter if the IPI gets lost,
> > since the TLB will be flushed (immediately afterwards) by the cr3
> > write.
> 
> Actually, for normal entries I could well imagine the code that wants
> to do a flush before freeing the page caring.
> 
> So I think the _patch_ may be correct, but the changelog is definitely
> not correct, and needs serious surgery to explain what the bug that
> this fixes actually is.

Linus, I updated the changelog to explain the failing case in more
detail. Please review. Thanks.

---
From: Suresh Siddha <suresh.b.siddha@intel.com>
Subject: x86, mm: avoid stale tlb entries by clearing prev mm_cpumask after switching mm

Clearing the cpu in prev's mm_cpumask early will avoid the flush tlb IPI's while
the cr3 is still pointing to the prev mm. And this window can lead
to the stale (global) TLB entries in the scenarios like the one mentioned
below.

T1. CPU-1 is context switching from mm1 to mm2 context and got a NMI etc
between the point of clearing the cpu from the mm_cpumask(mm1) and before
reloading the cr3 with the new mm2.

T2. CPU-2 is tearing down a specific vma for mm1 and will proceed with flushing
the TLB for mm1. It doesn't send the flush TLB to CPU-1 as it doesn't see that
cpu listed in the mm_cpumask(mm1)

T3. After the TLB flush is complete, CPU-2 goes ahead and frees the
page-table pages associated with the removed vma mapping.

T4. CPU-2 now allocates those freed page-table pages for something else.

T5. As the CR3 and TLB caches for mm1 is still active on CPU-1, CPU-1 can
potentially speculate and walk through the page-table caches and can
insert new TLB entries. As the page-table pages are already freed and being
used on CPU-2, this page walk can potentially insert a stale global TLB entry 
depending on the contents of the page that is being used on CPU-2

T6. This stale TLB entry being global will be active across future CR3
changes and can result in weird memory corruption etc.

To avoid this issue, for the prev mm that is handing over the cpu to another mm,
clear the cpu from the mm_cpumask(prev) after the cr3 is changed.

Marking it for -stable, though we haven't seen any reported failure that
can be attributed to this.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: stable@kernel.org	[v2.6.32+]
---
 arch/x86/include/asm/mmu_context.h |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 4a2d4e0..8b5393e 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -36,8 +36,6 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 	unsigned cpu = smp_processor_id();

 	if (likely(prev != next)) {
-		/* stop flush ipis for the previous mm */
-		cpumask_clear_cpu(cpu, mm_cpumask(prev));
 #ifdef CONFIG_SMP
 		percpu_write(cpu_tlbstate.state, TLBSTATE_OK);
 		percpu_write(cpu_tlbstate.active_mm, next);
@@ -47,6 +45,9 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 		/* Re-load page tables */
 		load_cr3(next->pgd);

+		/* stop flush ipis for the previous mm */
+		cpumask_clear_cpu(cpu, mm_cpumask(prev));
+
 		/*
 		 * load the LDT, if the LDT is different:
 		 */

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [patch] x86, mm: avoid stale tlb entries by clearing prev mm_cpumask after switching mm
  2011-02-03 18:27     ` Suresh Siddha
@ 2011-02-03 19:13       ` Linus Torvalds
  2011-02-03 19:34         ` Suresh Siddha
  0 siblings, 1 reply; 17+ messages in thread
From: Linus Torvalds @ 2011-02-03 19:13 UTC (permalink / raw)
  To: Suresh Siddha
  Cc: H. Peter Anvin, Ingo Molnar, Thomas Gleixner, LKML, Mallick, Asit K

On Thu, Feb 3, 2011 at 10:27 AM, Suresh Siddha
<suresh.b.siddha@intel.com> wrote:
>> Actually, for normal entries I could well imagine the code that wants
>> to do a flush before freeing the page caring.
>>
>> So I think the _patch_ may be correct, but the changelog is definitely
>> not correct, and needs serious surgery to explain what the bug that
>> this fixes actually is.
>
> Linus, I updated the changelog to explain the failing case in more
> detail. Please review. Thanks.

So this explains the bug, and the explanation looks good. Except for
one (large) detail: it has nothing to do with "stale" entries.

The problematic entries are simply _bogus_, not stale. They were never
valid to begin with. So the subject and the initial part is very
misleading.

The global bit seems to be largely irrelevant too, except for the fact
that a global bogus entry obviously stays around and causes way more
trouble. But I could imagine that there could be trouble with entries
that have conflicting PSE or cacheability issues (causing machine
checks or something) even if they are ephemeral and not global.

                                  Linus

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [patch] x86, mm: avoid stale tlb entries by clearing prev mm_cpumask after switching mm
  2011-02-03 19:13       ` Linus Torvalds
@ 2011-02-03 19:34         ` Suresh Siddha
  2011-02-03 19:48           ` Linus Torvalds
  0 siblings, 1 reply; 17+ messages in thread
From: Suresh Siddha @ 2011-02-03 19:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: H. Peter Anvin, Ingo Molnar, Thomas Gleixner, LKML, Mallick, Asit K

On Thu, 2011-02-03 at 11:13 -0800, Linus Torvalds wrote:
> On Thu, Feb 3, 2011 at 10:27 AM, Suresh Siddha
> <suresh.b.siddha@intel.com> wrote:
> >> Actually, for normal entries I could well imagine the code that wants
> >> to do a flush before freeing the page caring.
> >>
> >> So I think the _patch_ may be correct, but the changelog is definitely
> >> not correct, and needs serious surgery to explain what the bug that
> >> this fixes actually is.
> >
> > Linus, I updated the changelog to explain the failing case in more
> > detail. Please review. Thanks.
> 
> So this explains the bug, and the explanation looks good. Except for
> one (large) detail: it has nothing to do with "stale" entries.
> 
> The problematic entries are simply _bogus_, not stale. They were never
> valid to begin with. So the subject and the initial part is very
> misleading.

True. 'stale' is the wrong word. Do you want me to send a corrected one
by replacing it with 'bogus'?

> The global bit seems to be largely irrelevant too, except for the fact
> that a global bogus entry obviously stays around and causes way more
> trouble. But I could imagine that there could be trouble with entries
> that have conflicting PSE or cacheability issues (causing machine
> checks or something) even if they are ephemeral and not global.

my understanding is that unless we end up using that TLB entry, we will
not have the issues like machine checks due to cacheability issues etc.
If it is not global, upcoming cr3 change will flush it and meanwhile I
don't think there is a scenario where we refer to these user-addresses.

thanks,
suresh


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [patch] x86, mm: avoid stale tlb entries by clearing prev mm_cpumask after switching mm
  2011-02-03 19:34         ` Suresh Siddha
@ 2011-02-03 19:48           ` Linus Torvalds
  2011-02-03 20:20             ` Suresh Siddha
  0 siblings, 1 reply; 17+ messages in thread
From: Linus Torvalds @ 2011-02-03 19:48 UTC (permalink / raw)
  To: Suresh Siddha
  Cc: H. Peter Anvin, Ingo Molnar, Thomas Gleixner, LKML, Mallick, Asit K

On Thu, Feb 3, 2011 at 11:34 AM, Suresh Siddha
<suresh.b.siddha@intel.com> wrote:
>
> True. 'stale' is the wrong word. Do you want me to send a corrected one
> by replacing it with 'bogus'?

Please.

> my understanding is that unless we end up using that TLB entry, we will
> not have the issues like machine checks due to cacheability issues etc.
> If it is not global, upcoming cr3 change will flush it and meanwhile I
> don't think there is a scenario where we refer to these user-addresses.

Quite possible. The situation I envisioned was the same speculative
memory access that causes the TLB fill to also cause a cache fill -
for a noncacheable region (because the bogus TLB entry sets the random
address to cacheable).

And then what happens when somebody else accesses the same memory
noncacheably (through a valid TLB entry), and finds it in the cache?

I dunno. Not really important. The important part is the "possible
random bogus TLB entry", the fact that the CPU can act strangely after
that is pretty much a given.

                            Linus

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [patch] x86, mm: avoid stale tlb entries by clearing prev mm_cpumask after switching mm
  2011-02-03 19:48           ` Linus Torvalds
@ 2011-02-03 20:20             ` Suresh Siddha
  2011-02-03 21:06               ` Ingo Molnar
  0 siblings, 1 reply; 17+ messages in thread
From: Suresh Siddha @ 2011-02-03 20:20 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: H. Peter Anvin, Ingo Molnar, Thomas Gleixner, LKML, Mallick, Asit K

On Thu, 2011-02-03 at 11:48 -0800, Linus Torvalds wrote:
> On Thu, Feb 3, 2011 at 11:34 AM, Suresh Siddha
> <suresh.b.siddha@intel.com> wrote:
> >
> > True. 'stale' is the wrong word. Do you want me to send a corrected one
> > by replacing it with 'bogus'?
> 
> Please.
> 
> > my understanding is that unless we end up using that TLB entry, we will
> > not have the issues like machine checks due to cacheability issues etc.
> > If it is not global, upcoming cr3 change will flush it and meanwhile I
> > don't think there is a scenario where we refer to these user-addresses.
> 
> Quite possible. The situation I envisioned was the same speculative
> memory access that causes the TLB fill to also cause a cache fill -
> for a noncacheable region (because the bogus TLB entry sets the random
> address to cacheable).
> 
> And then what happens when somebody else accesses the same memory
> noncacheably (through a valid TLB entry), and finds it in the cache?
> 
> I dunno. Not really important. The important part is the "possible
> random bogus TLB entry", the fact that the CPU can act strangely after
> that is pretty much a given.
> 

Ok. Updated patch appended.

thanks,
suresh
---

From: Suresh Siddha <suresh.b.siddha@intel.com>
Subject: x86, mm: avoid bogus tlb entries by clearing prev mm_cpumask after switching mm

Clearing the cpu in prev's mm_cpumask early will avoid the flush tlb IPI's while
the cr3 is still pointing to the prev mm. And this window can lead
to the possibility of bogus TLB fills resulting in strange failures.
One such problematic scenario is mentioned below.

T1. CPU-1 is context switching from mm1 to mm2 context and got a NMI etc
between the point of clearing the cpu from the mm_cpumask(mm1) and before
reloading the cr3 with the new mm2.

T2. CPU-2 is tearing down a specific vma for mm1 and will proceed with flushing
the TLB for mm1. It doesn't send the flush TLB to CPU-1 as it doesn't see that
cpu listed in the mm_cpumask(mm1).

T3. After the TLB flush is complete, CPU-2 goes ahead and frees the
page-table pages associated with the removed vma mapping.

T4. CPU-2 now allocates those freed page-table pages for something else.

T5. As the CR3 and TLB caches for mm1 is still active on CPU-1, CPU-1 can
potentially speculate and walk through the page-table caches and can
insert new TLB entries. As the page-table pages are already freed and being
used on CPU-2, this page walk can potentially insert a bogus global TLB entry 
depending on the (random) contents of the page that is being used on CPU-2.

T6. This bogus TLB entry being global will be active across future CR3
changes and can result in weird memory corruption etc.

To avoid this issue, for the prev mm that is handing over the cpu to another mm,
clear the cpu from the mm_cpumask(prev) after the cr3 is changed.

Marking it for -stable, though we haven't seen any reported failure that
can be attributed to this.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: stable@kernel.org	[v2.6.32+]
---
 arch/x86/include/asm/mmu_context.h |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 4a2d4e0..8b5393e 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -36,8 +36,6 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 	unsigned cpu = smp_processor_id();

 	if (likely(prev != next)) {
-		/* stop flush ipis for the previous mm */
-		cpumask_clear_cpu(cpu, mm_cpumask(prev));
 #ifdef CONFIG_SMP
 		percpu_write(cpu_tlbstate.state, TLBSTATE_OK);
 		percpu_write(cpu_tlbstate.active_mm, next);
@@ -47,6 +45,9 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 		/* Re-load page tables */
 		load_cr3(next->pgd);

+		/* stop flush ipis for the previous mm */
+		cpumask_clear_cpu(cpu, mm_cpumask(prev));
+
 		/*
 		 * load the LDT, if the LDT is different:
 		 */

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [patch] x86, mm: avoid stale tlb entries by clearing prev mm_cpumask after switching mm
  2011-02-03 20:20             ` Suresh Siddha
@ 2011-02-03 21:06               ` Ingo Molnar
  2011-02-03 21:33                 ` Linus Torvalds
  0 siblings, 1 reply; 17+ messages in thread
From: Ingo Molnar @ 2011-02-03 21:06 UTC (permalink / raw)
  To: Suresh Siddha
  Cc: Linus Torvalds, H. Peter Anvin, Thomas Gleixner, LKML, Mallick, Asit K


* Suresh Siddha <suresh.b.siddha@intel.com> wrote:

> On Thu, 2011-02-03 at 11:48 -0800, Linus Torvalds wrote:
> > On Thu, Feb 3, 2011 at 11:34 AM, Suresh Siddha
> > <suresh.b.siddha@intel.com> wrote:
> > >
> > > True. 'stale' is the wrong word. Do you want me to send a corrected one
> > > by replacing it with 'bogus'?
> > 
> > Please.
> > 
> > > my understanding is that unless we end up using that TLB entry, we will
> > > not have the issues like machine checks due to cacheability issues etc.
> > > If it is not global, upcoming cr3 change will flush it and meanwhile I
> > > don't think there is a scenario where we refer to these user-addresses.
> > 
> > Quite possible. The situation I envisioned was the same speculative
> > memory access that causes the TLB fill to also cause a cache fill -
> > for a noncacheable region (because the bogus TLB entry sets the random
> > address to cacheable).
> > 
> > And then what happens when somebody else accesses the same memory
> > noncacheably (through a valid TLB entry), and finds it in the cache?
> > 
> > I dunno. Not really important. The important part is the "possible
> > random bogus TLB entry", the fact that the CPU can act strangely after
> > that is pretty much a given.
> > 
> 
> Ok. Updated patch appended.

Linus, the patch fine to me too.

Acked-by: Ingo Molnar <mingo@elte.hu>

Do you want to apply it or should I?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [patch] x86, mm: avoid stale tlb entries by clearing prev mm_cpumask after switching mm
  2011-02-03 21:06               ` Ingo Molnar
@ 2011-02-03 21:33                 ` Linus Torvalds
  2011-02-03 21:39                   ` Ingo Molnar
  0 siblings, 1 reply; 17+ messages in thread
From: Linus Torvalds @ 2011-02-03 21:33 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Suresh Siddha, H. Peter Anvin, Thomas Gleixner, LKML, Mallick, Asit K

On Thu, Feb 3, 2011 at 1:06 PM, Ingo Molnar <mingo@elte.hu> wrote:
>
> Do you want to apply it or should I?

I already did (but I now amended it to have your ack and pushed it out).

                   Linus

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [patch] x86, mm: avoid stale tlb entries by clearing prev mm_cpumask after switching mm
  2011-02-03 21:33                 ` Linus Torvalds
@ 2011-02-03 21:39                   ` Ingo Molnar
  0 siblings, 0 replies; 17+ messages in thread
From: Ingo Molnar @ 2011-02-03 21:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Suresh Siddha, H. Peter Anvin, Thomas Gleixner, LKML, Mallick, Asit K

* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Thu, Feb 3, 2011 at 1:06 PM, Ingo Molnar <mingo@elte.hu> wrote:
> >
> > Do you want to apply it or should I?
> 
> I already did (but I now amended it to have your ack and pushed it out).

Cool, thanks!

	Ingo

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: [patch] x86, mm: avoid stale tlb entries by clearing prev mm_cpumask after switching mm
       [not found]   ` <BLU157-w548B9D2F8EDD8CA9DB6CD3DAAC0@phx.gbl>
@ 2011-04-15 11:58     ` MaoXiaoyun
  0 siblings, 0 replies; 17+ messages in thread
From: MaoXiaoyun @ 2011-04-15 11:58 UTC (permalink / raw)
  To: linux-kernel







Hi:

Recently I've met a kernel bug. 
Kernel version: 2.6.32.26, from git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commit;h=bb1a15e55ec665a64c8a9c6bd699b1f16ac01f
I think the crash might related to this patch.
Since now TLB state change to TLBSTATE_OK(mmu_context.h:40) is before cpumask_clear_cpu(line 49).
Could it possible that right after execute line 40 of mmu_context.h, 
CPU go to IPI interrupt to try to flush the mm, but find the TLB state happened to be TLBSTATE_OK.

Crash log attached. 
Thanks.

arch/x86/include/asm/mmu_context.h

33 static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
34 <+++<+++<+++ struct task_struct *tsk)
35 {
36 <+++unsigned cpu = smp_processor_id();
37 
38 <+++if (likely(prev != next)) {
39 #ifdef CONFIG_SMP
40 <+++<+++percpu_write(cpu_tlbstate.state, TLBSTATE_OK);
41 <+++<+++percpu_write(cpu_tlbstate.active_mm, next);
42 #endif
43 <+++<+++cpumask_set_cpu(cpu, mm_cpumask(next));
44 
45 <+++<+++/* Re-load page tables */
46 <+++<+++load_cr3(next->pgd);
47 
48 <+++<+++/* stop flush ipis for the previous mm */
49 <+++<+++cpumask_clear_cpu(cpu, mm_cpumask(prev)); 



===============crash log==========================
INIT: Id "s0" respawning too fast: disabled for 5 minutes
__ratelimit: 14 callbacks suppressed
blktap_sysfs_destroy
blktap_sysfs_destroy
------------[ cut here ]------------
kernel BUG at arch/x86/mm/tlb.c:61!
invalid opcode: 0000 [#1] SMP 
last sysfs file: /sys/devices/system/xen_memory/xen_memory0/info/current_kb
CPU 1 
Modules linked in: 8021q garp xen_netback xen_blkback blktap blkback_pagemap nbd bridge stp llc autofs4 ipmi_devintf ipmi_si ipmi_msghandler lockd sunrpc bonding ipv6 xenfs dm_multipath video output sbs sbshc parport_pc lp parport ses enclosure snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device serio_raw bnx2 snd_pcm_oss snd_mixer_oss snd_pcm snd_timer iTCO_wdt snd soundcore snd_page_alloc i2c_i801 iTCO_vendor_support i2c_core pcspkr pata_acpi ata_generic ata_piix shpchp mptsas mptscsih mptbase [last unloaded: freq_table]
Pid: 25581, comm: khelper Not tainted 2.6.32.36fixxen #1 Tecal RH2285 
RIP: e030:[<ffffffff8103a3cb>] [<ffffffff8103a3cb>] leave_mm+0x15/0x46
RSP: e02b:ffff88002805be48 EFLAGS: 00010046
RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffff88015f8e2da0
RDX: ffff88002805be78 RSI: 0000000000000000 RDI: 0000000000000001
RBP: ffff88002805be48 R08: ffff88009d662000 R09: dead000000200200
R10: dead000000100100 R11: ffffffff814472b2 R12: ffff88009bfc1880
R13: ffff880028063020 R14: 00000000000004f6 R15: 0000000000000000
FS: 00007f62362d66e0(0000) GS:ffff880028058000(0000) knlGS:0000000000000000
CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000003aabc11909 CR3: 000000009b8ca000 CR4: 0000000000002660
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process khelper (pid: 25581, threadinfo ffff88007691e000, task ffff88009b92db40)
Stack:
ffff88002805be68 ffffffff8100e4ae 0000000000000001 ffff88009d733b88
<0> ffff88002805be98 ffffffff81087224 ffff88002805be78 ffff88002805be78
<0> ffff88015f808360 00000000000004f6 ffff88002805bea8 ffffffff81010108
Call Trace:
<IRQ> 
[<ffffffff8100e4ae>] drop_other_mm_ref+0x2a/0x53
[<ffffffff81087224>] generic_smp_call_function_single_interrupt+0xd8/0xfc
[<ffffffff81010108>] xen_call_function_single_interrupt+0x13/0x28
[<ffffffff810a936a>] handle_IRQ_event+0x66/0x120
[<ffffffff810aac5b>] handle_percpu_irq+0x41/0x6e
[<ffffffff8128c1c0>] __xen_evtchn_do_upcall+0x1ab/0x27d
[<ffffffff8128dd11>] xen_evtchn_do_upcall+0x33/0x46
[<ffffffff81013efe>] xen_do_hypervisor_callback+0x1e/0x30
<EOI> 
[<ffffffff814472b2>] ? _spin_unlock_irqrestore+0x15/0x17
[<ffffffff8100f8cf>] ? xen_restore_fl_direct_end+0x0/0x1
[<ffffffff81113f71>] ? flush_old_exec+0x3ac/0x500
[<ffffffff81150dc5>] ? load_elf_binary+0x0/0x17ef
[<ffffffff81150dc5>] ? load_elf_binary+0x0/0x17ef
[<ffffffff8115115d>] ? load_elf_binary+0x398/0x17ef
[<ffffffff81042fcf>] ? need_resched+0x23/0x2d
[<ffffffff811f4648>] ? process_measurement+0xc0/0xd7
[<ffffffff81150dc5>] ? load_elf_binary+0x0/0x17ef
[<ffffffff81113094>] ? search_binary_handler+0xc8/0x255
[<ffffffff81114362>] ? do_execve+0x1c3/0x29e
[<ffffffff8101155d>] ? sys_execve+0x43/0x5d
[<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
[<ffffffff81013e28>] ? kernel_execve+0x68/0xd0
[<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
[<ffffffff8100f8cf>] ? xen_restore_fl_direct_end+0x0/0x1
[<ffffffff8106fb64>] ? ____call_usermodehelper+0x113/0x11e
[<ffffffff81013daa>] ? child_rip+0xa/0x20
[<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
[<ffffffff81012f91>] ? int_ret_from_sys_call+0x7/0x1b
[<ffffffff8101371d>] ? retint_restore_args+0x5/0x6
[<ffffffff81013da0>] ? child_rip+0x0/0x20
Code: 41 5e 41 5f c9 c3 55 48 89 e5 0f 1f 44 00 00 e8 17 ff ff ff c9 c3 55 48 89 e5 0f 1f 44 00 00 65 8b 04 25 c8 55 01 00 ff c8 75 04 <0f> 0b eb fe 65 48 8b 34 25 c0 55 01 00 48 81 c6 b8 02 00 00 e8 
RIP [<ffffffff8103a3cb>] leave_mm+0x15/0x46
RSP <ffff88002805be48>
---[ end trace ce9cee6832a9c503 ]---
Kernel panic - not syncing: Fatal exception in interrupt
Pid: 25581, comm: khelper Tainted: G D 2.6.32.36fixxen #1
Call Trace:
<IRQ> [<ffffffff8105682e>] panic+0xe0/0x19a
[<ffffffff8144008a>] ? init_amd+0x296/0x37a
[<ffffffff8100f17d>] ? xen_force_evtchn_callback+0xd/0xf
[<ffffffff8100f8e2>] ? check_events+0x12/0x20
[<ffffffff8100f8cf>] ? xen_restore_fl_direct_end+0x0/0x1
[<ffffffff81056487>] ? print_oops_end_marker+0x23/0x25
[<ffffffff81448185>] oops_end+0xb6/0xc6
[<ffffffff810166e5>] die+0x5a/0x63
[<ffffffff81447a5c>] do_trap+0x115/0x124
[<ffffffff810148e6>] do_invalid_op+0x9c/0xa5
[<ffffffff8103a3cb>] ? leave_mm+0x15/0x46
[<ffffffff8100f6fa>] ? xen_clocksource_read+0x21/0x23
[<ffffffff8100f26c>] ? HYPERVISOR_vcpu_op+0xf/0x11
[<ffffffff8100f767>] ? xen_vcpuop_set_next_event+0x52/0x67
[<ffffffff81080bfa>] ? clockevents_program_event+0x78/0x81
[<ffffffff81013b3b>] invalid_op+0x1b/0x20
[<ffffffff814472b2>] ? _spin_unlock_irqrestore+0x15/0x17
[<ffffffff8103a3cb>] ? leave_mm+0x15/0x46
[<ffffffff8100e4ae>] drop_other_mm_ref+0x2a/0x53
[<ffffffff81087224>] generic_smp_call_function_single_interrupt+0xd8/0xfc
[<ffffffff81010108>] xen_call_function_single_interrupt+0x13/0x28
[<ffffffff810a936a>] handle_IRQ_event+0x66/0x120
[<ffffffff810aac5b>] handle_percpu_irq+0x41/0x6e
[<ffffffff8128c1c0>] __xen_evtchn_do_upcall+0x1ab/0x27d
[<ffffffff8128dd11>] xen_evtchn_do_upcall+0x33/0x46
[<ffffffff81013efe>] xen_do_hypervisor_callback+0x1e/0x30
<EOI> [<ffffffff814472b2>] ? _spin_unlock_irqrestore+0x15/0x17
[<ffffffff8100f8cf>] ? xen_restore_fl_direct_end+0x0/0x1
[<ffffffff81113f71>] ? flush_old_exec+0x3ac/0x500
[<ffffffff81150dc5>] ? load_elf_binary+0x0/0x17ef
[<ffffffff81150dc5>] ? load_elf_binary+0x0/0x17ef
[<ffffffff8115115d>] ? load_elf_binary+0x398/0x17ef
[<ffffffff81042fcf>] ? need_resched+0x23/0x2d
[<ffffffff811f4648>] ? process_measurement+0xc0/0xd7
[<ffffffff81150dc5>] ? load_elf_binary+0x0/0x17ef
[<ffffffff81113094>] ? search_binary_handler+0xc8/0x255
[<ffffffff81114362>] ? do_execve+0x1c3/0x29e
[<ffffffff8101155d>] ? sys_execve+0x43/0x5d
[<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
[<ffffffff81013e28>] ? kernel_execve+0x68/0xd0
[<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
[<ffffffff8100f8cf>] ? xen_restore_fl_direct_end+0x0/0x1
[<ffffffff8106fb64>] ? ____call_usermodehelper+0x113/0x11e
[<ffffffff81013daa>] ? child_rip+0xa/0x20
[<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
[<ffffffff81012f91>] ? int_ret_from_sys_call+0x7/0x1b
[<ffffffff8101371d>] ? retint_restore_args+0x5/0x6
[<ffffffff81013da0>] ? child_rip+0x0/0x20 		 	   		  

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2011-04-15 11:58 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-02-02 20:07 [patch] x86, mm: avoid stale tlb entries by clearing prev mm_cpumask after switching mm Suresh Siddha
2011-02-03  1:23 ` Andi Kleen
2011-02-03  1:55   ` Suresh Siddha
2011-02-03  2:15     ` Andi Kleen
2011-02-03  4:12       ` Linus Torvalds
2011-02-03  4:22         ` H. Peter Anvin
2011-02-03  1:55 ` Linus Torvalds
2011-02-03  4:03   ` Linus Torvalds
2011-02-03 18:27     ` Suresh Siddha
2011-02-03 19:13       ` Linus Torvalds
2011-02-03 19:34         ` Suresh Siddha
2011-02-03 19:48           ` Linus Torvalds
2011-02-03 20:20             ` Suresh Siddha
2011-02-03 21:06               ` Ingo Molnar
2011-02-03 21:33                 ` Linus Torvalds
2011-02-03 21:39                   ` Ingo Molnar
     [not found] <BLU157-w495F35A445E063E33DC9E8DAAC0@phx.gbl>
     [not found] ` <BLU157-w41A610FD3413183801060CDAAC0@phx.gbl>
     [not found]   ` <BLU157-w548B9D2F8EDD8CA9DB6CD3DAAC0@phx.gbl>
2011-04-15 11:58     ` MaoXiaoyun

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.