linux-kernel.vger.kernel.org archive mirror
* [PATCH - sort of] x86: Livelock in handle_pte_fault
@ 2013-05-17  8:42 Stanislav Meduna
  2013-05-22  0:39 ` Steven Rostedt
  0 siblings, 1 reply; 35+ messages in thread
From: Stanislav Meduna @ 2013-05-17  8:42 UTC (permalink / raw)
  To: linux-rt-users, linux-kernel
  Cc: rostedt, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86

Hi all,

I don't know whether this is linux-rt specific or applies to
the mainline too, so I'll repeat some things the linux-rt
readers already know.

Environment:

- Geode LX or Celeron M
- _not_ CONFIG_SMP
- linux 3.4 with realtime patches and full preempt configured
- an application consisting of several mostly RR-class threads
- the application runs with mlockall()
- there is no swap

Problem:

- after several hours to 1-2 weeks some of the threads start to loop
  in the following way

  0d...0 62811.755382: function:  do_page_fault
  0....0 62811.755386: function:     handle_mm_fault
  0....0 62811.755389: function:        handle_pte_fault
  0d...0 62811.755394: function:  do_page_fault
  0....0 62811.755396: function:     handle_mm_fault
  0....0 62811.755398: function:        handle_pte_fault
  0d...0 62811.755402: function:  do_page_fault
  0....0 62811.755404: function:     handle_mm_fault
  0....0 62811.755406: function:        handle_pte_fault

  and stay in the loop until the RT throttling gets activated.
  One of the faulting addresses was in code (after returning
  from a syscall), a second one in stack (inside put_user right
  before a syscall ends), both were surely mapped.

- After RT throttler activates it somehow magically fixes itself,
  probably (not verified) because another _process_ gets scheduled.
  When throttled the RR and FF threads are not allowed to run for
  a while (20 ms in my configuration). The livelock lasts around
  1-3 seconds, and there is a SCHED_OTHER process that runs each
  2 seconds.

- Kernel threads with higher priority than the faulting one (linux-rt
  irq threads) run normally. A higher priority user thread from the
  same process gets scheduled and then enters the same faulting loop.

- in ps -o min_flt,maj_flt the number of minor page faults
  for the offending thread skyrockets to hundreds of thousands
  (normally it stays zero as everything is already mapped
  when it is started)

- The code in handle_pte_fault proceeds through the
    entry = pte_mkyoung(entry);
  line and the following
    ptep_set_access_flags
  returns zero.

- The livelock is extremely timing sensitive - different workloads
  cause it not to happen at all or far later.

- I was able to make this happen a bit faster (once per ~4 hours)
  with the rt thread repeatedly causing the kernel to try to
  invoke modprobe to load a missing module - so there is a load
  of kworker-s launching modprobes (in case anyone wonders how it
  can happen: this was a bug in our application with invalid level
  specified for setsockopt causing searching for TCP congestion
  module instead of setting SO_LINGER)

- the symptoms are similar to
    http://lkml.indiana.edu/hypermail/linux/kernel/1103.0/01364.html
  which got fixed by
    https://lkml.org/lkml/2011/3/15/516
  but this fix does not apply to the processors in question

- the patch below _seems_ to fix it, or at least massively delay it -
  the testcase now runs for 2.5 days instead of 4 hours. I doubt
  it is the proper patch (it brutally reloads the CR3 every time
  a thread with userspace mapping is switched to). I just got the
  suspicion that there is some way the kernel forgets to update
  the memory mapping when going from a userspace thread through
  some kernel ones back to another userspace one and tried to make
  sure the mapping is always reloaded.

- the whole history starts at
    http://www.spinics.net/lists/linux-rt-users/msg09758.html
  I originally thought the problem is in timerfd and hunted it
  in several places until I learned to use the tracing infrastructure
  and started to pin it down with trace prints etc :)

- A trace file of the hang is at
  http://www.meduna.org/tmp/trace.mmfaulthang.dat.gz
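
For anyone curious about the setsockopt mix-up mentioned above, here is
a minimal sketch of how such a bug can arise. It assumes Linux on x86,
where SO_LINGER and TCP_CONGESTION both appear to be defined as 13. The
function names are hypothetical and this is not the application's code.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Correct call: SO_LINGER belongs to the SOL_SOCKET level. */
int set_linger(int fd, int seconds)
{
	struct linger lg = { .l_onoff = 1, .l_linger = seconds };

	return setsockopt(fd, SOL_SOCKET, SO_LINGER, &lg, sizeof(lg));
}

/*
 * Buggy variant (illustration only): with level IPPROTO_TCP the same
 * option number is interpreted as a TCP option. Where SO_LINGER and
 * TCP_CONGESTION share a value, the struct bytes are taken as the name
 * of a congestion control algorithm.
 */
int set_linger_wrong_level(int fd, int seconds)
{
	struct linger lg = { .l_onoff = 1, .l_linger = seconds };

	return setsockopt(fd, IPPROTO_TCP, SO_LINGER, &lg, sizeof(lg));
}

/* Returns 1 if the two option numbers collide on this platform. */
int linger_collides_with_tcp_congestion(void)
{
	return SO_LINGER == TCP_CONGESTION;
}
```

With the wrong level the kernel may go looking for a congestion control
module of that (garbage) name, which would match the modprobe storm
described above.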

Does this ring a bell with someone?

Thanks
                                              Stano




diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 6902152..3d54a15 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -54,21 +54,23 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 		if (unlikely(prev->context.ldt != next->context.ldt))
 			load_LDT_nolock(&next->context);
 	}
-#ifdef CONFIG_SMP
 	else {
+#ifdef CONFIG_SMP
 		percpu_write(cpu_tlbstate.state, TLBSTATE_OK);
 		BUG_ON(percpu_read(cpu_tlbstate.active_mm) != next);

 		if (!cpumask_test_and_set_cpu(cpu, mm_cpumask(next))) {
+#endif
 			/* We were in lazy tlb mode and leave_mm disabled
 			 * tlb flush IPI delivery. We must reload CR3
 			 * to make sure to use no freed page tables.
 			 */
 			load_cr3(next->pgd);
 			load_LDT_nolock(&next->context);
+#ifdef CONFIG_SMP
 		}
-	}
 #endif
+	}
 }

 #define activate_mm(prev, next)



* Re: [PATCH - sort of] x86: Livelock in handle_pte_fault
  2013-05-17  8:42 [PATCH - sort of] x86: Livelock in handle_pte_fault Stanislav Meduna
@ 2013-05-22  0:39 ` Steven Rostedt
  2013-05-22  7:32   ` Stanislav Meduna
  2013-05-22 12:33   ` Rik van Riel
  0 siblings, 2 replies; 35+ messages in thread
From: Steven Rostedt @ 2013-05-22  0:39 UTC (permalink / raw)
  To: Stanislav Meduna
  Cc: linux-rt-users, linux-kernel, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86, riel

On Fri, 2013-05-17 at 10:42 +0200, Stanislav Meduna wrote:
> Hi all,
> 
> I don't know whether this is linux-rt specific or applies to
> the mainline too, so I'll repeat some things the linux-rt
> readers already know.
> 
> Environment:
> 
> - Geode LX or Celeron M
> - _not_ CONFIG_SMP
> - linux 3.4 with realtime patches and full preempt configured
> - an application consisting of several mostly RR-class threads

The threads do a mlockall too right? I'm not sure mlock will lock memory
for a new thread's stack.

> - the application runs with mlockall()

With both MCL_FUTURE and MCL_CURRENT set, right?

> - there is no swap

Hmm, doesn't mean that code can't be swapped out, as it is just mapped
from the file it came from. But you'd think mlockall would prevent that.

> 
> Problem:
> 
> - after several hours to 1-2 weeks some of the threads start to loop
>   in the following way
> 
>   0d...0 62811.755382: function:  do_page_fault
>   0....0 62811.755386: function:     handle_mm_fault
>   0....0 62811.755389: function:        handle_pte_fault
>   0d...0 62811.755394: function:  do_page_fault
>   0....0 62811.755396: function:     handle_mm_fault
>   0....0 62811.755398: function:        handle_pte_fault
>   0d...0 62811.755402: function:  do_page_fault
>   0....0 62811.755404: function:     handle_mm_fault
>   0....0 62811.755406: function:        handle_pte_fault
> 
>   and stay in the loop until the RT throttling gets activated.
>   One of the faulting addresses was in code (after returning
>   from a syscall), a second one in stack (inside put_user right
>   before a syscall ends), both were surely mapped.
> 
> - After RT throttler activates it somehow magically fixes itself,
>   probably (not verified) because another _process_ gets scheduled.
>   When throttled the RR and FF threads are not allowed to run for
>   a while (20 ms in my configuration). The livelocks lasts around
>   1-3 seconds, and there is a SCHED_OTHER process that runs each
>   2 seconds.

Hmm, if there was a missed TLB flush, and we are faulting due to a bad
TLB table, and it goes into an infinite faulting loop, the only thing
that will stop it is the RT throttle. Then a new task gets scheduled,
and we flush the TLB and everything is fine again.

> 
> - Kernel threads with higher priority than the faulting one (linux-rt
>   irq threads) run normally. A higher priority user thread from the
>   same process gets scheduled and then enters the same faulting loop.

Kernel threads share the mm, and won't cause a reload of the CR3.

> 
> - in ps -o min_flt,maj_flt the number of minor page faults
>   for the offending thread skyrockets to hundreds of thousands
>   (normally it stays zero as everything is already mapped
>   when it is started)
> 
> - The code in handle_pte_fault proceeds through the
>     entry = pte_mkyoung(entry);
>   line and the following
>     ptep_set_access_flags
>   returns zero.
> 
> - The livelock is extremely timing sensitive - different workloads
>   cause it not to happen at all or far later.
> 
> - I was able to make this happen a bit faster (once per ~4 hours)
>   with the rt thread repeatly causing the kernel to try to
>   invoke modprobe to load a missing module - so there is a load
>   of kworker-s launching modprobes (in case anyone wonders how it
>   can happen: this was a bug in our application with invalid level
>   specified for setsockopt causing searching for TCP congestion
>   module instead of setting SO_LINGER)

Note that modules are in vmalloc space, and do fault in. But it also
changes the PGD.

> 
> - the symptoms are similar to
>     http://lkml.indiana.edu/hypermail/linux/kernel/1103.0/01364.html
>   which got fixed by
>     https://lkml.org/lkml/2011/3/15/516
>   but this fix does not apply to the processors in question
> 
> - the patch below _seems_ to fix it, or at least massively delay it -
>   the testcase now runs for 2.5 days instead of 4 hours. I doubt
>   it is the proper patch (it brutally reloads the CR3 every time
>   a thread with userspace mapping is switched to). I just got the
>   suspicion that there is some way the kernel forgets to update
>   the memory mapping when going from an userpace thread through
>   some kernel ones back to another userspace one and tried to make
>   sure the mapping is always reloaded.

Seems a bit extreme. Looks to me there's a missing flush TLB somewhere.

Do you have a reproducer you can share? That way, maybe we can all share
the joy.

-- Steve

> 
> - the whole history starts at
>     http://www.spinics.net/lists/linux-rt-users/msg09758.html
>   I originally thought the problem is in timerfd and hunted it
>   in several places until I learned to use the tracing infrastructure
>   and started to pin it down with trace prints etc :)
> 
> - A trace file of the hang is at
>   http://www.meduna.org/tmp/trace.mmfaulthang.dat.gz
> 
> Does this ring a bell with someone?
> 
> Thanks
>                                               Stano
> 
> 
> 
> 
> diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
> index 6902152..3d54a15 100644
> --- a/arch/x86/include/asm/mmu_context.h
> +++ b/arch/x86/include/asm/mmu_context.h
> @@ -54,21 +54,23 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
>  		if (unlikely(prev->context.ldt != next->context.ldt))
>  			load_LDT_nolock(&next->context);
>  	}
> -#ifdef CONFIG_SMP
>  	else {
> +#ifdef CONFIG_SMP
>  		percpu_write(cpu_tlbstate.state, TLBSTATE_OK);
>  		BUG_ON(percpu_read(cpu_tlbstate.active_mm) != next);
> 
>  		if (!cpumask_test_and_set_cpu(cpu, mm_cpumask(next))) {
> +#endif
>  			/* We were in lazy tlb mode and leave_mm disabled
>  			 * tlb flush IPI delivery. We must reload CR3
>  			 * to make sure to use no freed page tables.
>  			 */
>  			load_cr3(next->pgd);
>  			load_LDT_nolock(&next->context);
> +#ifdef CONFIG_SMP
>  		}
> -	}
>  #endif
> +	}
>  }
> 
>  #define activate_mm(prev, next)




* Re: [PATCH - sort of] x86: Livelock in handle_pte_fault
  2013-05-22  0:39 ` Steven Rostedt
@ 2013-05-22  7:32   ` Stanislav Meduna
  2013-05-22 12:33   ` Rik van Riel
  1 sibling, 0 replies; 35+ messages in thread
From: Stanislav Meduna @ 2013-05-22  7:32 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-rt-users, linux-kernel, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86, riel

On 22.05.2013 02:39, Steven Rostedt wrote:

> The threads do a mlockall too right? I'm not sure mlock will lock memory
> for a new thread's stack.

They don't. However,
https://rt.wiki.kernel.org/index.php/Threaded_RT-application_with_memory_locking_and_stack_handling_example
claims

  "Threads started after a call to mlockall(MCL_CURRENT | MCL_FUTURE) will
  generate page faults immediately since the new stack is immediately forced
  to RAM (due to the MCL_FUTURE flag)."

and as ps -o min_flt reports zero page faults for the threads,
I think that is indeed the case.

Anyway, both particular addresses were surely mapped long before
the fault.
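
For reference, a minimal sketch of the kind of setup being discussed,
along the lines of the wiki example. This is an assumption about the
general pattern, not the application's actual code; the function names
and STACK_PREFAULT_BYTES are made up for illustration.

```c
#include <stddef.h>
#include <sys/mman.h>

#define STACK_PREFAULT_BYTES (64 * 1024)	/* assumed size, tune per thread */

/* Lock all current and future mappings; returns 0 on success, -1 on error. */
int lock_all_memory(void)
{
	return mlockall(MCL_CURRENT | MCL_FUTURE);
}

/*
 * Touch a chunk of stack so that its pages are already faulted in
 * before the time-critical code runs. Returns the number of bytes touched.
 */
size_t prefault_stack(void)
{
	volatile unsigned char dummy[STACK_PREFAULT_BYTES];
	size_t i;

	for (i = 0; i < sizeof(dummy); i++)
		dummy[i] = 0;
	return sizeof(dummy);
}
```

lock_all_memory() would be called once early in main(), and
prefault_stack() at the start of each thread that must not fault later.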

>> - the application runs with mlockall()
> 
> With both MCL_FUTURE and MCL_CURRENT set, right?

Yes.

>> - there is no swap
> 
> Hmm, doesn't mean that code can't be swapped out, as it is just mapped
> from the file it came from. But you'd think mlockall would prevent that.

mlockall also forces the stack to be mapped immediately, so it does
not generate page faults when expanding incrementally.

> Seems a bit extreme. Looks to me there's a missing flush TLB somewhere.

Probably.

One interesting thing: the test for "need to reload something"
looks a bit different for the ARM architecture in
arch/arm/include/asm/mmu_context.h:

  if (!cpumask_test_and_set_cpu(cpu, mm_cpumask(next)) || prev != next) {

and they also do something for the
  !CONFIG_SMP && !cpumask_test_and_set_cpu(cpu, mm_cpumask(next))
case. I don't know what exactly the semantics of mm_cpumask are,
but the difference is suspicious.

> Do you have a reproducer you can share. That way, maybe we can all share
> the joy.

Unfortunately not and I have really tried :( If I get new ideas, I will
try again.

Thanks
-- 
                                                   Stano



* Re: [PATCH - sort of] x86: Livelock in handle_pte_fault
  2013-05-22  0:39 ` Steven Rostedt
  2013-05-22  7:32   ` Stanislav Meduna
@ 2013-05-22 12:33   ` Rik van Riel
  2013-05-22 15:01     ` Linus Torvalds
  1 sibling, 1 reply; 35+ messages in thread
From: Rik van Riel @ 2013-05-22 12:33 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Stanislav Meduna, linux-rt-users, linux-kernel, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, Linus Torvalds, Hai Huang

[-- Attachment #1: Type: text/plain, Size: 2854 bytes --]

On 05/21/2013 08:39 PM, Steven Rostedt wrote:
> On Fri, 2013-05-17 at 10:42 +0200, Stanislav Meduna wrote:
>> Hi all,
>>
>> I don't know whether this is linux-rt specific or applies to
>> the mainline too, so I'll repeat some things the linux-rt
>> readers already know.
>>
>> Environment:
>>
>> - Geode LX or Celeron M
>> - _not_ CONFIG_SMP
>> - linux 3.4 with realtime patches and full preempt configured
>> - an application consisting of several mostly RR-class threads
>
> The threads do a mlockall too right? I'm not sure mlock will lock memory
> for a new thread's stack.
>
>> - the application runs with mlockall()
>
> With both MCL_FUTURE and MCL_CURRENT set, right?
>
>> - there is no swap
>
> Hmm, doesn't mean that code can't be swapped out, as it is just mapped
> from the file it came from. But you'd think mlockall would prevent that.
>
>>
>> Problem:
>>
>> - after several hours to 1-2 weeks some of the threads start to loop
>>    in the following way
>>
>>    0d...0 62811.755382: function:  do_page_fault
>>    0....0 62811.755386: function:     handle_mm_fault
>>    0....0 62811.755389: function:        handle_pte_fault
>>    0d...0 62811.755394: function:  do_page_fault
>>    0....0 62811.755396: function:     handle_mm_fault
>>    0....0 62811.755398: function:        handle_pte_fault
>>    0d...0 62811.755402: function:  do_page_fault
>>    0....0 62811.755404: function:     handle_mm_fault
>>    0....0 62811.755406: function:        handle_pte_fault
>>
>>    and stay in the loop until the RT throttling gets activated.
>>    One of the faulting addresses was in code (after returning
>>    from a syscall), a second one in stack (inside put_user right
>>    before a syscall ends), both were surely mapped.
>>
>> - After RT throttler activates it somehow magically fixes itself,
>>    probably (not verified) because another _process_ gets scheduled.
>>    When throttled the RR and FF threads are not allowed to run for
>>    a while (20 ms in my configuration). The livelocks lasts around
>>    1-3 seconds, and there is a SCHED_OTHER process that runs each
>>    2 seconds.
>
> Hmm, if there was a missed TLB flush, and we are faulting due to a bad
> TLB table, and it goes into an infinite faulting loop, the only thing
> that will stop it is the RT throttle. Then a new task gets scheduled,
> and we flush the TLB and everything is fine again.

That sounds like maybe we DO want a TLB flush on spurious
page faults, so we get rid of this problem.

Last fall we thought this problem could not happen on x86,
but your bug report suggests that it might.

We can get flush_tlb_fix_spurious_fault to do a local TLB
invalidate of just the address in question by removing the
x86-specific dummy version, falling back to the asm-generic
version that does something.

Can you test the attached patch?
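
The fallback relies on the usual override pattern: the generic header
supplies a default only if the architecture has not already defined the
macro. A stand-alone sketch of that pattern follows; the stub function
and the counter are hypothetical stand-ins, and the real definitions
live in the kernel headers.

```c
#include <stddef.h>

static int flushes;	/* counts simulated single-page TLB flushes */

/* Stand-in for the kernel's flush_tlb_page(); it only counts calls here. */
static void stub_flush_tlb_page(void *vma, unsigned long address)
{
	(void)vma;
	(void)address;
	flushes++;
}

/*
 * "Arch header": an empty definition like the following models the
 * x86 dummy that the patch removes.
 *
 *   #define flush_tlb_fix_spurious_fault(vma, address) do { } while (0)
 */

/* "Generic header": default used when the arch did not provide its own. */
#ifndef flush_tlb_fix_spurious_fault
#define flush_tlb_fix_spurious_fault(vma, address) \
	stub_flush_tlb_page(vma, address)
#endif

/* Returns how many flushes one spurious fault triggers in this build. */
int flushes_per_spurious_fault(void)
{
	flushes = 0;
	flush_tlb_fix_spurious_fault(NULL, 0x1000UL);
	return flushes;
}
```

With the dummy definition in place the macro expands to nothing, so no
flush would happen at all.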

-- 
All rights reversed

[-- Attachment #2: flush-tlb-on-spurious-fault.patch --]
[-- Type: text/x-patch, Size: 1003 bytes --]

Subject: x86,mm: flush TLB on spurious fault

It appears that certain x86 CPUs do not automatically flush the
TLB entry that caused a page fault, causing spurious faults to
loop forever under certain circumstances.

Remove the dummy flush_tlb_fix_spurious_fault define, so x86
falls back to the asm-generic version, which does do a local
TLB flush.

Signed-off-by: Rik van Riel <riel@redhat.com>
Reported-by: Stanislav Meduna <stano@meduna.org>
---
 arch/x86/include/asm/pgtable.h | 2 --
 1 file changed, 2 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 1e67223..43e7966 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -729,8 +729,6 @@ static inline void ptep_set_wrprotect(struct mm_struct *mm,
 	pte_update(mm, addr, ptep);
 }
 
-#define flush_tlb_fix_spurious_fault(vma, address) do { } while (0)
-
 #define mk_pmd(page, pgprot)   pfn_pmd(page_to_pfn(page), (pgprot))
 
 #define  __HAVE_ARCH_PMDP_SET_ACCESS_FLAGS


* Re: [PATCH - sort of] x86: Livelock in handle_pte_fault
  2013-05-22 12:33   ` Rik van Riel
@ 2013-05-22 15:01     ` Linus Torvalds
  2013-05-22 17:41       ` [PATCH] mm: fix up a spurious page fault whenever it happens Rik van Riel
  0 siblings, 1 reply; 35+ messages in thread
From: Linus Torvalds @ 2013-05-22 15:01 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Steven Rostedt, Stanislav Meduna, linux-rt-users, linux-kernel,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	the arch/x86 maintainers, Hai Huang

On Wed, May 22, 2013 at 5:33 AM, Rik van Riel <riel@redhat.com> wrote:
>
> That sounds like maybe we DO want a TLB flush on spurious
> page faults, so we get rid of this problem.

Hmm. If it was just the Geode, I wouldn't be surprised. But with a Celeron too?

Anyway, worth testing..

> We can get flush_tlb_fix_spurious_fault to do a local TLB
> invalidate of just the address in question by removing the
> x86-specific dummy version, falling back to the asm-generic
> version that does something.
>
> Can you test the attached patch?

I think you should also remove the

        if (flags & FAULT_FLAG_WRITE)

test in handle_pte_fault(). Because if it's spurious, it might happen
on reads too, I think.
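
To illustrate the concern, a toy user-space model (not kernel code). It
assumes a CPU that keeps a stale TLB entry across the fault and a flush
that actually removes it; both are assumptions of the model, and all
names in it are hypothetical.

```c
#include <stdbool.h>

#define MAX_FAULTS 1000	/* cap so the model terminates; a real loop would not */

struct toy_cpu {
	bool tlb_stale;	/* TLB entry disagrees with the (good) PTE */
};

/* Fault handler tail: the PTE needs no change, so only a flush can help. */
static void toy_handle_fault(struct toy_cpu *cpu, bool is_write,
			     bool flush_only_on_write)
{
	if (!flush_only_on_write || is_write)
		cpu->tlb_stale = false;	/* models flush_tlb_fix_spurious_fault() */
}

/* Retries one access until it succeeds; returns the number of faults taken. */
int toy_access(bool is_write, bool flush_only_on_write)
{
	struct toy_cpu cpu = { .tlb_stale = true };
	int faults = 0;

	while (cpu.tlb_stale && faults < MAX_FAULTS) {
		faults++;
		toy_handle_fault(&cpu, is_write, flush_only_on_write);
	}
	return faults;
}
```

In this model a read fault with the FAULT_FLAG_WRITE test in place runs
into the cap, while every other combination takes a single fault.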

RT people - does RT do anything special with the page tables?

Stanislav, the patch you sent out may well work, but it's damned odd.
On UP, we don't do the leave_mm() optimization that makes that code
necessary. So I agree with Rik that it's more likely somewhere else
(and infinite page faults do imply the TLB not getting flushed by the
page fault exception), and your patch might just be working around it
by simply flushing the TLB at least when switching between threads,
which still happens.

                   Linus


* [PATCH] mm: fix up a spurious page fault whenever it happens
  2013-05-22 15:01     ` Linus Torvalds
@ 2013-05-22 17:41       ` Rik van Riel
  2013-05-22 18:04         ` Stanislav Meduna
  0 siblings, 1 reply; 35+ messages in thread
From: Rik van Riel @ 2013-05-22 17:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Steven Rostedt, Stanislav Meduna, linux-rt-users, linux-kernel,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	the arch/x86 maintainers, Hai Huang

On Wed, 22 May 2013 08:01:43 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Wed, May 22, 2013 at 5:33 AM, Rik van Riel <riel@redhat.com> wrote:

> > Can you test the attached patch?
> 
> I think you should also remove the
> 
>         if (flags & FAULT_FLAG_WRITE)
> 
> test in handle_pte_fault(). Because if it's spurious, it might happen
> on reads too, I think.

Here you are. I wonder if the conditional was put in because we
originally did a global TLB flush (with IPIs) from the spurious
fault handler...

Stanislav, could you add this patch to your test?

---8<---
Subject: [PATCH] mm: fix up a spurious page fault whenever it happens

The kernel currently only handles spurious page faults when they
"should" happen, but potentially this is not the only situation
where they could happen.

The spurious fault handler only flushes an entry from the local
TLB; this should be a rare event with minimal side effects.

This patch removes the conditional, allowing the spurious fault
handler to execute whenever a spurious page fault happens, which
should eliminate infinite page fault loops.

Signed-off-by: Rik van Riel <riel@redhat.com>
Reported-by: Stanislav Meduna <stano@meduna.org>
---
 mm/memory.c | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 6dc1882..962477d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3744,13 +3744,11 @@ int handle_pte_fault(struct mm_struct *mm,
 		update_mmu_cache(vma, address, pte);
 	} else {
 		/*
-		 * This is needed only for protection faults but the arch code
-		 * is not yet telling us if this is a protection fault or not.
-		 * This still avoids useless tlb flushes for .text page faults
-		 * with threads.
+		 * The page table entry is good, but the CPU generated a
+		 * spurious fault. Invalidate the corresponding TLB entry
+		 * on this CPU, so the next access can succeed.
 		 */
-		if (flags & FAULT_FLAG_WRITE)
-			flush_tlb_fix_spurious_fault(vma, address);
+		flush_tlb_fix_spurious_fault(vma, address);
 	}
 unlock:
 	pte_unmap_unlock(pte, ptl);


* Re: [PATCH] mm: fix up a spurious page fault whenever it happens
  2013-05-22 17:41       ` [PATCH] mm: fix up a spurious page fault whenever it happens Rik van Riel
@ 2013-05-22 18:04         ` Stanislav Meduna
  2013-05-22 18:11           ` Steven Rostedt
  0 siblings, 1 reply; 35+ messages in thread
From: Stanislav Meduna @ 2013-05-22 18:04 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Linus Torvalds, Steven Rostedt, linux-rt-users, linux-kernel,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	the arch/x86 maintainers, Hai Huang

On 22.05.2013 19:41, Rik van Riel wrote:

>> I think you should also remove the
>>
>>         if (flags & FAULT_FLAG_WRITE)

Done

>>> Can you test the attached patch?

Nope. Fails with the same symptoms, min_flt skyrockets,
the throttler activates and after 2 seconds all is well
again.

This is on Geode LX, I don't have the Celeron M at hand now.

Thanks
-- 
                                     Stano



* Re: [PATCH] mm: fix up a spurious page fault whenever it happens
  2013-05-22 18:04         ` Stanislav Meduna
@ 2013-05-22 18:11           ` Steven Rostedt
  2013-05-22 18:21             ` Stanislav Meduna
  0 siblings, 1 reply; 35+ messages in thread
From: Steven Rostedt @ 2013-05-22 18:11 UTC (permalink / raw)
  To: Stanislav Meduna
  Cc: Rik van Riel, Linus Torvalds, linux-rt-users, linux-kernel,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	the arch/x86 maintainers, Hai Huang

On Wed, 2013-05-22 at 20:04 +0200, Stanislav Meduna wrote:
> On 22.05.2013 19:41, Rik van Riel wrote:
> 
> >> I think you should also remove the
> >>
> >>         if (flags & FAULT_FLAG_WRITE)
> 
> Done
> 
> >>> Can you test the attached patch?
> 
> Nope. Fails with the same symptoms, min_flt skyrockets,
> the throttler activates and after 2 seconds all is well
> again.
> 
> This is on Geode LX, I don't have the Celeron M at the hand now.
> 

Did you apply both patches? Without the first one, this one is
meaningless.

-- Steve




* Re: [PATCH] mm: fix up a spurious page fault whenever it happens
  2013-05-22 18:11           ` Steven Rostedt
@ 2013-05-22 18:21             ` Stanislav Meduna
  2013-05-22 18:35               ` Rik van Riel
  0 siblings, 1 reply; 35+ messages in thread
From: Stanislav Meduna @ 2013-05-22 18:21 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Rik van Riel, Linus Torvalds, linux-rt-users, linux-kernel,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	the arch/x86 maintainers, Hai Huang

On 22.05.2013 20:11, Steven Rostedt wrote:

> Did you apply both patches? Without the first one, this one is
> meaningless.

Sure.

BTW, back when I tried to pinpoint it I also tried adding
  flush_tlb_page(vma, address)
at the beginning of handle_pte_fault, which as I read should
be basically the same. It did not change anything.
I did mention it in some previous mail but forgot
to include it again in the summary - sorry :/

-- 
                                        Stano



* Re: [PATCH] mm: fix up a spurious page fault whenever it happens
  2013-05-22 18:21             ` Stanislav Meduna
@ 2013-05-22 18:35               ` Rik van Riel
  2013-05-22 18:42                 ` H. Peter Anvin
  2013-05-22 18:47                 ` Stanislav Meduna
  0 siblings, 2 replies; 35+ messages in thread
From: Rik van Riel @ 2013-05-22 18:35 UTC (permalink / raw)
  To: Stanislav Meduna
  Cc: Steven Rostedt, Linus Torvalds, linux-rt-users, linux-kernel,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	the arch/x86 maintainers, Hai Huang

On 05/22/2013 02:21 PM, Stanislav Meduna wrote:
> On 22.05.2013 20:11, Steven Rostedt wrote:
>
>> Did you apply both patches? Without the first one, this one is
>> meaningless.
>
> Sure.
>
> BTW, back when I tried to pinpoint it I also tried adding
>    flush_tlb_page(vma, address)
> at the beginning of handle_pte_fault, which as I read should
> be basically the same. It did not not change anything.

I'm stumped.

If the Geode knows how to flush single TLB entries, it
should do that when flush_tlb_page is called.

If it does not know, it should throw an invalid instruction
exception, and not quietly complete the instruction without
doing anything.



* Re: [PATCH] mm: fix up a spurious page fault whenever it happens
  2013-05-22 18:35               ` Rik van Riel
@ 2013-05-22 18:42                 ` H. Peter Anvin
  2013-05-22 18:43                   ` Rik van Riel
  2013-05-22 18:47                 ` Stanislav Meduna
  1 sibling, 1 reply; 35+ messages in thread
From: H. Peter Anvin @ 2013-05-22 18:42 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Stanislav Meduna, Steven Rostedt, Linus Torvalds, linux-rt-users,
	linux-kernel, Thomas Gleixner, Ingo Molnar,
	the arch/x86 maintainers, Hai Huang

On 05/22/2013 11:35 AM, Rik van Riel wrote:
> On 05/22/2013 02:21 PM, Stanislav Meduna wrote:
>> On 22.05.2013 20:11, Steven Rostedt wrote:
>>
>>> Did you apply both patches? Without the first one, this one is
>>> meaningless.
>>
>> Sure.
>>
>> BTW, back when I tried to pinpoint it I also tried adding
>>    flush_tlb_page(vma, address)
>> at the beginning of handle_pte_fault, which as I read should
>> be basically the same. It did not not change anything.
> 
> I'm stumped.
> 
> If the Geode knows how to flush single TLB entries, it
> should do that when flush_tlb_page is called.
> 
> If it does not know, it should throw an invalid instruction
> exception, and not quietly complete the instruction without
> doing anything.
> 

Some CPUs have had errata when it comes to flushing large pages that
have been split into small pages by hardware, e.g. due to MTRR
conflicts.  In that case, fragments of the large page may have been left
in the TLB.

Could that explain what you are seeing?

	-hpa



* Re: [PATCH] mm: fix up a spurious page fault whenever it happens
  2013-05-22 18:42                 ` H. Peter Anvin
@ 2013-05-22 18:43                   ` Rik van Riel
  2013-05-23  8:07                     ` Stanislav Meduna
  0 siblings, 1 reply; 35+ messages in thread
From: Rik van Riel @ 2013-05-22 18:43 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Stanislav Meduna, Steven Rostedt, Linus Torvalds, linux-rt-users,
	linux-kernel, Thomas Gleixner, Ingo Molnar,
	the arch/x86 maintainers, Hai Huang

On 05/22/2013 02:42 PM, H. Peter Anvin wrote:
> On 05/22/2013 11:35 AM, Rik van Riel wrote:
>> On 05/22/2013 02:21 PM, Stanislav Meduna wrote:
>>> On 22.05.2013 20:11, Steven Rostedt wrote:
>>>
>>>> Did you apply both patches? Without the first one, this one is
>>>> meaningless.
>>>
>>> Sure.
>>>
>>> BTW, back when I tried to pinpoint it I also tried adding
>>>     flush_tlb_page(vma, address)
>>> at the beginning of handle_pte_fault, which as I read should
>>> be basically the same. It did not not change anything.
>>
>> I'm stumped.
>>
>> If the Geode knows how to flush single TLB entries, it
>> should do that when flush_tlb_page is called.
>>
>> If it does not know, it should throw an invalid instruction
>> exception, and not quietly complete the instruction without
>> doing anything.
>>
>
> Some CPUs have had errata when it comes to flushing large pages that
> have been split into small pages by hardware, e.g. due to MTRR
> conflicts.  In that case, fragments of the large page may have been left
> in the TLB.
>
> Could that explain what you are seeing?

That would be testable by changing __native_flush_tlb_single()
to call __flush_tlb(), instead of doing an invlpg instruction.

In other words, make the code look like this, for testing:

static inline void __native_flush_tlb_single(unsigned long addr)
{
         __flush_tlb();
}

This on top of the other two patches.


* Re: [PATCH] mm: fix up a spurious page fault whenever it happens
  2013-05-22 18:35               ` Rik van Riel
  2013-05-22 18:42                 ` H. Peter Anvin
@ 2013-05-22 18:47                 ` Stanislav Meduna
  1 sibling, 0 replies; 35+ messages in thread
From: Stanislav Meduna @ 2013-05-22 18:47 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Steven Rostedt, Linus Torvalds, linux-rt-users, linux-kernel,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	the arch/x86 maintainers, Hai Huang

On 22.05.2013 20:35, Rik van Riel wrote:

> I'm stumped.
> 
> If the Geode knows how to flush single TLB entries, it
> should do that when flush_tlb_page is called.
> 
> If it does not know, it should throw an invalid instruction
> exception, and not quietly complete the instruction without
> doing anything.

Could it be that the problem is not stale TLB, but a page directory
that is somehow invalid, e.g. belonging to the previous modprobe
(or whatever) instead of the running process?

My patch does load_cr3(next->pgd); so it explicitly loads something
there.

> In other words, make the code look like this, for testing:
>
> static inline void __native_flush_tlb_single(unsigned long addr)
> {
>         __flush_tlb();
> }

Yup, will try it.

Thanks
-- 
                                           Stano


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] mm: fix up a spurious page fault whenever it happens
  2013-05-22 18:43                   ` Rik van Riel
@ 2013-05-23  8:07                     ` Stanislav Meduna
  2013-05-23 12:19                       ` Rik van Riel
  2013-05-23 14:45                       ` Linus Torvalds
  0 siblings, 2 replies; 35+ messages in thread
From: Stanislav Meduna @ 2013-05-23  8:07 UTC (permalink / raw)
  To: Rik van Riel
  Cc: H. Peter Anvin, Steven Rostedt, Linus Torvalds, linux-rt-users,
	linux-kernel, Thomas Gleixner, Ingo Molnar,
	the arch/x86 maintainers, Hai Huang

On 22.05.2013 20:43, Rik van Riel wrote:

>> Some CPUs have had errata when it comes to flushing large pages that
>> have been split into small pages by hardware, e.g. due to MTRR
>> conflicts.  In that case, fragments of the large page may have been left
>> in the TLB.

Can I somehow find if this is the case? The memory mapping
for the failing process has two regions slightly larger than
4 MB - code and heap.

The process also does not access any funny memory regions
from userspace - it is basically networking (both TCP/IP
and raw sockets) and crunching of the data received.
No mmapped devices or something like that.

> static inline void __native_flush_tlb_single(unsigned long addr)
> {
>         __flush_tlb();
> }
> 
> This on top of the other two patches.

It did not crash overnight, but it also does not show any
minor fault counted for the threads, so I'm afraid the situation
just did not happen - there should be at least one visible in
the ps -o min_flt output, right?

I will give it some more testing time.

Thanks
-- 
                                             Stano


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] mm: fix up a spurious page fault whenever it happens
  2013-05-23  8:07                     ` Stanislav Meduna
@ 2013-05-23 12:19                       ` Rik van Riel
  2013-05-23 13:29                         ` Steven Rostedt
  2013-05-24  8:29                         ` Stanislav Meduna
  2013-05-23 14:45                       ` Linus Torvalds
  1 sibling, 2 replies; 35+ messages in thread
From: Rik van Riel @ 2013-05-23 12:19 UTC (permalink / raw)
  To: Stanislav Meduna
  Cc: H. Peter Anvin, Steven Rostedt, Linus Torvalds, linux-rt-users,
	linux-kernel, Thomas Gleixner, Ingo Molnar,
	the arch/x86 maintainers, Hai Huang

On 05/23/2013 04:07 AM, Stanislav Meduna wrote:
> On 22.05.2013 20:43, Rik van Riel wrote:
>
>>> Some CPUs have had errata when it comes to flushing large pages that
>>> have been split into small pages by hardware, e.g. due to MTRR
>>> conflicts.  In that case, fragments of the large page may have been left
>>> in the TLB.
>
> Can I somehow find if this is the case? The memory mapping
> for the failing process has two regions slightly larger than
> 4 MB - code and heap.
>
> The process also does not access any funny memory regions
> from userspace - it is basically networking (both TCP/IP
> and raw sockets) and crunching of the data received.
> No mmapped devices or something like that.
>
>> static inline void __native_flush_tlb_single(unsigned long addr)
>> {
>>          __flush_tlb();
>> }
>>
>> This on top of the other two patches.
>
> It did not crash overnight, but it also does not show any
> minor fault counted for the threads, so I'm afraid the situation
> just did not happen - there should be at least one visible in
> the ps -o min_flt output, right?

If all the page faults are done by the main thread,
and the TLB gets properly flushed now, the other
threads might not see minor faults.

> I will give it some more testing time.

That is a good idea.

Now to figure out how we properly fix this
issue in the kernel...

We can add a bit in the architecture bits that
we use to check against other CPU and system
errata, and conditionally flush the whole TLB
from __native_flush_tlb_single().

The question is, how do we identify what CPUs
need the extra flushing?

And in what circumstances do they require it?
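
The conditional flush described above could be sketched roughly as
follows. This is a user-space model and not kernel code: tlb_model,
cpu_needs_full_flush and the model_*() helpers are hypothetical names,
and the flag merely stands in for whatever errata bit would end up
being used.

```c
#include <stdbool.h>
#include <stddef.h>

#define TLB_SLOTS 8

/* Toy TLB: each slot caches one virtual page number, 0 means empty. */
struct tlb_model {
	unsigned long vpn[TLB_SLOTS];
};

/* Hypothetical errata flag, decided once at boot from the CPU type. */
static bool cpu_needs_full_flush;

/* Drop every cached translation (stands in for a full TLB flush). */
static void model_flush_all(struct tlb_model *tlb)
{
	for (size_t i = 0; i < TLB_SLOTS; i++)
		tlb->vpn[i] = 0;
}

/* Drop one translation (stands in for invlpg), or all of them on quirky CPUs. */
static void model_flush_single(struct tlb_model *tlb, unsigned long vpn)
{
	if (cpu_needs_full_flush) {
		model_flush_all(tlb);
		return;
	}
	for (size_t i = 0; i < TLB_SLOTS; i++)
		if (tlb->vpn[i] == vpn)
			tlb->vpn[i] = 0;
}

/* Count the translations that are still cached. */
static size_t model_cached(const struct tlb_model *tlb)
{
	size_t n = 0;

	for (size_t i = 0; i < TLB_SLOTS; i++)
		if (tlb->vpn[i] != 0)
			n++;
	return n;
}
```

The flag is read on every single-page flush, so the open questions
remain the ones above: which CPUs set it, and under which conditions.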

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] mm: fix up a spurious page fault whenever it happens
  2013-05-23 12:19                       ` Rik van Riel
@ 2013-05-23 13:29                         ` Steven Rostedt
  2013-05-23 15:06                           ` H. Peter Anvin
  2013-05-24  8:29                         ` Stanislav Meduna
  1 sibling, 1 reply; 35+ messages in thread
From: Steven Rostedt @ 2013-05-23 13:29 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Stanislav Meduna, H. Peter Anvin, Linus Torvalds, linux-rt-users,
	linux-kernel, Thomas Gleixner, Ingo Molnar,
	the arch/x86 maintainers, Hai Huang

On Thu, 2013-05-23 at 08:19 -0400, Rik van Riel wrote:

> We can add a bit in the architecture bits that
> we use to check against other CPU and system
> errata, and conditionally flush the whole TLB
> from __native_flush_tlb_single().

If we find that some CPUs have issues and others do not, and we can
determine this by checking the CPU type at run time, I would strongly
suggest using the jump_label infrastructure to do the branches. I know
this is early to suggest something like this, but I just wanted to put
it in your head ;-)

-- Steve



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] mm: fix up a spurious page fault whenever it happens
  2013-05-23  8:07                     ` Stanislav Meduna
  2013-05-23 12:19                       ` Rik van Riel
@ 2013-05-23 14:45                       ` Linus Torvalds
  2013-05-23 14:50                         ` Linus Torvalds
  1 sibling, 1 reply; 35+ messages in thread
From: Linus Torvalds @ 2013-05-23 14:45 UTC (permalink / raw)
  To: Stanislav Meduna
  Cc: Rik van Riel, H. Peter Anvin, Steven Rostedt, linux-rt-users,
	linux-kernel, Thomas Gleixner, Ingo Molnar,
	the arch/x86 maintainers, Hai Huang

On Thu, May 23, 2013 at 1:07 AM, Stanislav Meduna <stano@meduna.org> wrote:
>
> It did not crash overnight, but it also does not show any
> minor fault counted for the threads

Page faults that don't cause us to map a page (ie a spurious one, or
one that just updates dirty/accessed bits) don't show up as even minor
faults. Think of the major/minor as "mapping activity" not a page
fault count.

So if this is due to some stuck TLB entry, that wouldn't show up anyway.

                    Linus

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] mm: fix up a spurious page fault whenever it happens
  2013-05-23 14:45                       ` Linus Torvalds
@ 2013-05-23 14:50                         ` Linus Torvalds
  2013-05-23 15:03                           ` Stanislav Meduna
  0 siblings, 1 reply; 35+ messages in thread
From: Linus Torvalds @ 2013-05-23 14:50 UTC (permalink / raw)
  To: Stanislav Meduna
  Cc: Rik van Riel, H. Peter Anvin, Steven Rostedt, linux-rt-users,
	linux-kernel, Thomas Gleixner, Ingo Molnar,
	the arch/x86 maintainers, Hai Huang

On Thu, May 23, 2013 at 7:45 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Page faults that don't cause us to map a page (ie a spurious one, or
> one that just updates dirty/accessed bits) don't show up as even minor
> faults. Think of the major/minor as "mapping activity" not a page
> fault count.

Actually, I take that back. We always update either min_flt or maj_flt. My bad.

Another question: I'm assuming this is all 32-bit, is it with PAE
enabled? That changes some of the TLB flushing, and we had one bug
related to that, maybe there are others..

            Linus

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] mm: fix up a spurious page fault whenever it happens
  2013-05-23 14:50                         ` Linus Torvalds
@ 2013-05-23 15:03                           ` Stanislav Meduna
  0 siblings, 0 replies; 35+ messages in thread
From: Stanislav Meduna @ 2013-05-23 15:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Rik van Riel, H. Peter Anvin, Steven Rostedt, linux-rt-users,
	linux-kernel, Thomas Gleixner, Ingo Molnar,
	the arch/x86 maintainers, Hai Huang

On 23.05.2013 16:50, Linus Torvalds wrote:

> Another question: I'm assuming this is all 32-bit, is it with PAE
> enabled? That changes some of the TLB flushing, and we had one bug
> related to that, maybe there are others..

32 bit, no PAE.

-- 
                                           Stano


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] mm: fix up a spurious page fault whenever it happens
  2013-05-23 13:29                         ` Steven Rostedt
@ 2013-05-23 15:06                           ` H. Peter Anvin
  2013-05-23 15:27                             ` Steven Rostedt
  0 siblings, 1 reply; 35+ messages in thread
From: H. Peter Anvin @ 2013-05-23 15:06 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Rik van Riel, Stanislav Meduna, Linus Torvalds, linux-rt-users,
	linux-kernel, Thomas Gleixner, Ingo Molnar,
	the arch/x86 maintainers, Hai Huang

On 05/23/2013 06:29 AM, Steven Rostedt wrote:
> On Thu, 2013-05-23 at 08:19 -0400, Rik van Riel wrote:
> 
>> We can add a bit in the architecture bits that
>> we use to check against other CPU and system
>> errata, and conditionally flush the whole TLB
>> from __native_flush_tlb_single().
> 
> If we find that some CPUs have issues and others do not, and we can
> determine this by checking the CPU type at run time, I would strongly
> suggest using the jump_label infrastructure to do the branches. I know
> this is early to suggest something like this, but I just wanted to put
> it in your head ;-)
> 

We don't even need the jump_label infrastructure -- we have
static_cpu_has*() which actually predates jump_label although it uses
the same underlying ideas.

	-hpa



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] mm: fix up a spurious page fault whenever it happens
  2013-05-23 15:06                           ` H. Peter Anvin
@ 2013-05-23 15:27                             ` Steven Rostedt
  2013-05-23 17:24                               ` H. Peter Anvin
  0 siblings, 1 reply; 35+ messages in thread
From: Steven Rostedt @ 2013-05-23 15:27 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Rik van Riel, Stanislav Meduna, Linus Torvalds, linux-rt-users,
	linux-kernel, Thomas Gleixner, Ingo Molnar,
	the arch/x86 maintainers, Hai Huang

On Thu, 2013-05-23 at 08:06 -0700, H. Peter Anvin wrote:

> We don't even need the jump_label infrastructure -- we have
> static_cpu_has*() which actually predates jump_label although it uses
> the same underlying ideas.

Ah right. I wonder if it would be worth consolidating a lot of these
"modifying of code" infrastructures. Which reminds me, I need to update
text_poke() to do things similar to what ftrace does, and get rid of the
stop machine code.

-- Steve



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] mm: fix up a spurious page fault whenever it happens
  2013-05-23 15:27                             ` Steven Rostedt
@ 2013-05-23 17:24                               ` H. Peter Anvin
  2013-05-23 17:36                                 ` Steven Rostedt
  0 siblings, 1 reply; 35+ messages in thread
From: H. Peter Anvin @ 2013-05-23 17:24 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Rik van Riel, Stanislav Meduna, Linus Torvalds, linux-rt-users,
	linux-kernel, Thomas Gleixner, Ingo Molnar,
	the arch/x86 maintainers, Hai Huang

On 05/23/2013 08:27 AM, Steven Rostedt wrote:
> On Thu, 2013-05-23 at 08:06 -0700, H. Peter Anvin wrote:
> 
>> We don't even need the jump_label infrastructure -- we have
>> static_cpu_has*() which actually predates jump_label although it uses
>> the same underlying ideas.
> 
> Ah right. I wonder if it would be worth consolidating a lot of these
> "modifying of code" infrastructures. Which reminds me, I need to update
> text_poke() to do things similar to what ftrace does, and get rid of the
> stop machine code.
> 

Well, static_cpu_has*() just uses the alternatives infrastructure.

	-hpa



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] mm: fix up a spurious page fault whenever it happens
  2013-05-23 17:24                               ` H. Peter Anvin
@ 2013-05-23 17:36                                 ` Steven Rostedt
  2013-05-23 17:38                                   ` H. Peter Anvin
  0 siblings, 1 reply; 35+ messages in thread
From: Steven Rostedt @ 2013-05-23 17:36 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Rik van Riel, Stanislav Meduna, Linus Torvalds, linux-rt-users,
	linux-kernel, Thomas Gleixner, Ingo Molnar,
	the arch/x86 maintainers, Hai Huang

On Thu, 2013-05-23 at 10:24 -0700, H. Peter Anvin wrote:
> On 05/23/2013 08:27 AM, Steven Rostedt wrote:
> > On Thu, 2013-05-23 at 08:06 -0700, H. Peter Anvin wrote:
> > 
> >> We don't even need the jump_label infrastructure -- we have
> >> static_cpu_has*() which actually predates jump_label although it uses
> >> the same underlying ideas.
> > 
> > Ah right. I wonder if it would be worth consolidating a lot of these
> > "modifying of code" infrastructures. Which reminds me, I need to update
> > text_poke() to do things similar to what ftrace does, and get rid of the
> > stop machine code.
> > 
> 
> Well, static_cpu_has*() just uses the alternatives infrastructure.

And as it's a boot time change only, it's not quite in the category of
jump_labels and function tracing.

-- Steve



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] mm: fix up a spurious page fault whenever it happens
  2013-05-23 17:36                                 ` Steven Rostedt
@ 2013-05-23 17:38                                   ` H. Peter Anvin
  0 siblings, 0 replies; 35+ messages in thread
From: H. Peter Anvin @ 2013-05-23 17:38 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Rik van Riel, Stanislav Meduna, Linus Torvalds, linux-rt-users,
	linux-kernel, Thomas Gleixner, Ingo Molnar,
	the arch/x86 maintainers, Hai Huang

On 05/23/2013 10:36 AM, Steven Rostedt wrote:
> On Thu, 2013-05-23 at 10:24 -0700, H. Peter Anvin wrote:
>> On 05/23/2013 08:27 AM, Steven Rostedt wrote:
>>> On Thu, 2013-05-23 at 08:06 -0700, H. Peter Anvin wrote:
>>>
>>>> We don't even need the jump_label infrastructure -- we have
>>>> static_cpu_has*() which actually predates jump_label although it uses
>>>> the same underlying ideas.
>>>
>>> Ah right. I wonder if it would be worth consolidating a lot of these
>>> "modifying of code" infrastructures. Which reminds me, I need to update
>>> text_poke() to do things similar to what ftrace does, and get rid of the
>>> stop machine code.
>>>
>>
>> Well, static_cpu_has*() just uses the alternatives infrastructure.
> 
> And as it's a boot time change only, it's not quite in the category of
> jump_labels and function tracing.
> 

Right.

	-hpa



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] mm: fix up a spurious page fault whenever it happens
  2013-05-23 12:19                       ` Rik van Riel
  2013-05-23 13:29                         ` Steven Rostedt
@ 2013-05-24  8:29                         ` Stanislav Meduna
  2013-05-24 10:28                           ` Stanislav Meduna
  2013-05-24 13:06                           ` Rik van Riel
  1 sibling, 2 replies; 35+ messages in thread
From: Stanislav Meduna @ 2013-05-24  8:29 UTC (permalink / raw)
  To: Rik van Riel
  Cc: H. Peter Anvin, Steven Rostedt, Linus Torvalds, linux-rt-users,
	linux-kernel, Thomas Gleixner, Ingo Molnar,
	the arch/x86 maintainers, Hai Huang

On 23.05.2013 14:19, Rik van Riel wrote:

>>> static inline void __native_flush_tlb_single(unsigned long addr)
>>> {
>>>          __flush_tlb();
>>> }
> 
>> I will give it some more testing time.
> 
> That is a good idea.

Still no crash, so this one indeed seems to change things.

If I understand it correctly, these patches fix the problem
when it happens and we still don't know why the TLB is stale
in the first place - whether there is (also) a genuine bug
or whether we are hitting some chip errata, right?


For the record, the cpuinfo for my present test system:

processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 5
model           : 10
model name      : Geode(TM) Integrated Processor by AMD PCS
stepping        : 2
microcode       : 0x88a93d
cpu MHz         : 498.042
cache size      : 128 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu de pse tsc msr cx8 sep pge cmov clflush mmx
                  mmxext 3dnowext 3dnow
bogomips        : 996.08
clflush size    : 32
cache_alignment : 32
address sizes   : 32 bits physical, 32 bits virtual
power management:

and for the Celeron M where I can unfortunately reproduce
it much less often (days to weeks).

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 13
model name      : Intel(R) Celeron(R) M processor         1.00GHz
stepping        : 8
cpu MHz         : 1000.011
cache size      : 512 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
                  mca cmov clflush dts acpi mmx fxsr sse sse2 ss tm
                  pbe nx bts
bogomips        : 2000.02
clflush size    : 64
cache_alignment : 64
address sizes   : 32 bits physical, 32 bits virtual
power management:


Thanks
-- 
                                           Stano


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] mm: fix up a spurious page fault whenever it happens
  2013-05-24  8:29                         ` Stanislav Meduna
@ 2013-05-24 10:28                           ` Stanislav Meduna
  2013-05-24 13:06                           ` Rik van Riel
  1 sibling, 0 replies; 35+ messages in thread
From: Stanislav Meduna @ 2013-05-24 10:28 UTC (permalink / raw)
  To: Rik van Riel
  Cc: H. Peter Anvin, Steven Rostedt, Linus Torvalds, linux-rt-users,
	linux-kernel, Thomas Gleixner, Ingo Molnar,
	the arch/x86 maintainers, Hai Huang

On 24.05.2013 10:29, Stanislav Meduna wrote:

>>>> static inline void __native_flush_tlb_single(unsigned long addr)
>>>> {
>>>>          __flush_tlb();
>>>> }
>>
>>> I will give it some more testing time.
>>
>> That is a good idea.
> 
> Still no crash, so this one indeed seems to change things.

Take that back, now crashed as well, it just took longer.
min_flt of two threads jumped from zero to 1848 (lower prio)
and 735993 (higher prio, preempted the first one) respectively,
1.7 seconds hang.
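
A jump like this can also be watched from inside the application
instead of through ps. The following is a minimal sketch, assuming
Linux and getrusage() with RUSAGE_THREAD; thread_minflt() and
minflt_delta() are hypothetical helper names, not part of the
application discussed here.

```c
#define _GNU_SOURCE
#include <sys/resource.h>

/* Linux value; only used if the feature macro above came too late. */
#ifndef RUSAGE_THREAD
#define RUSAGE_THREAD 1
#endif

/* Return the calling thread's minor fault count, or -1 on error. */
static long thread_minflt(void)
{
	struct rusage ru;

	if (getrusage(RUSAGE_THREAD, &ru) != 0)
		return -1;
	return ru.ru_minflt;
}

/* Return the minor faults taken since *last and update *last. */
static long minflt_delta(long *last)
{
	long now = thread_minflt();
	long delta = (now < 0 || *last < 0) ? 0 : now - *last;

	*last = now;
	return delta;
}
```

With mlockall() in effect a thread in steady state should see a delta
of zero, so any large delta between two samples marks the fault loop.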

-- 
                                        Stano


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] mm: fix up a spurious page fault whenever it happens
  2013-05-24  8:29                         ` Stanislav Meduna
  2013-05-24 10:28                           ` Stanislav Meduna
@ 2013-05-24 13:06                           ` Rik van Riel
  2013-05-24 13:55                             ` Stanislav Meduna
  2013-06-16 21:34                             ` Stanislav Meduna
  1 sibling, 2 replies; 35+ messages in thread
From: Rik van Riel @ 2013-05-24 13:06 UTC (permalink / raw)
  To: Stanislav Meduna
  Cc: H. Peter Anvin, Steven Rostedt, Linus Torvalds, linux-rt-users,
	linux-kernel, Thomas Gleixner, Ingo Molnar,
	the arch/x86 maintainers, Hai Huang

On 05/24/2013 04:29 AM, Stanislav Meduna wrote:
> On 23.05.2013 14:19, Rik van Riel wrote:
>
>>>> static inline void __native_flush_tlb_single(unsigned long addr)
>>>> {
>>>>           __flush_tlb();
>>>> }
>>
>>> I will give it some more testing time.
>>
>> That is a good idea.
>
> Still no crash, so this one indeed seems to change things.
>
> If I understand it correctly, these patches fix the problem
> when it happens and we still don't know why the TLB is stale
> in the first place - whether there is (also) a genuine bug
> or whether we are hitting some chip errata, right?

Just to rule something out, are you using
transparent huge pages on those systems?

That could result in a mix of 4MB and 4kB
mappings, sometimes of the same memory.
The page tables would only ever contain
one of those mappings, but if we have some
kind of TLB problem, we might preserve a
large mapping across a page breakup, or
a small one across a page collapse...


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] mm: fix up a spurious page fault whenever it happens
  2013-05-24 13:06                           ` Rik van Riel
@ 2013-05-24 13:55                             ` Stanislav Meduna
  2013-05-24 14:23                               ` Stanislav Meduna
  2013-06-16 21:34                             ` Stanislav Meduna
  1 sibling, 1 reply; 35+ messages in thread
From: Stanislav Meduna @ 2013-05-24 13:55 UTC (permalink / raw)
  To: Rik van Riel
  Cc: H. Peter Anvin, Steven Rostedt, Linus Torvalds, linux-rt-users,
	linux-kernel, Thomas Gleixner, Ingo Molnar,
	the arch/x86 maintainers, Hai Huang

On 24.05.2013 15:06, Rik van Riel wrote:

> Just to rule something out, are you using
> transparent huge pages on those systems?

On my present test system they are configured in, but I am
not using them.

# cat /proc/meminfo | grep Huge
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       4096 kB

However during my (many) previous experiments the problem
also happened with kernels that did not have it configured.

Thanks
-- 
                                      Stano


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] mm: fix up a spurious page fault whenever it happens
  2013-05-24 13:55                             ` Stanislav Meduna
@ 2013-05-24 14:23                               ` Stanislav Meduna
  0 siblings, 0 replies; 35+ messages in thread
From: Stanislav Meduna @ 2013-05-24 14:23 UTC (permalink / raw)
  To: Rik van Riel
  Cc: H. Peter Anvin, Steven Rostedt, Linus Torvalds, linux-rt-users,
	linux-kernel, Thomas Gleixner, Ingo Molnar,
	the arch/x86 maintainers, Hai Huang

On 24.05.2013 15:55, Stanislav Meduna wrote:

>> Just to rule something out, are you using
>> transparent huge pages on those systems?
> 
> On my present test system they are configured in, but I am
> not using them.

Ah, _transparent_ huge pages. No, that is not enabled.
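
The HugePages_* lines in /proc/meminfo describe the explicit hugetlb
pool, not transparent huge pages. On kernels built with THP support
the mode is exposed in /sys/kernel/mm/transparent_hugepage/enabled,
with the active choice in brackets. A small sketch for reading it;
thp_selected() and thp_mode() are hypothetical helper names.

```c
#include <stdio.h>
#include <string.h>

/*
 * Copy the bracketed word from a line such as "always [madvise] never"
 * into out. Returns 0 on success, -1 if there is none or it does not fit.
 */
static int thp_selected(const char *line, char *out, size_t outlen)
{
	const char *open = strchr(line, '[');
	const char *close = open ? strchr(open, ']') : NULL;
	size_t len;

	if (!open || !close)
		return -1;
	len = (size_t)(close - open - 1);
	if (len + 1 > outlen)
		return -1;
	memcpy(out, open + 1, len);
	out[len] = '\0';
	return 0;
}

/* Read the current THP mode; returns -1 if the file cannot be read. */
static int thp_mode(char *out, size_t outlen)
{
	char line[128];
	FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/enabled", "r");

	if (!f)
		return -1;
	if (!fgets(line, sizeof(line), f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	return thp_selected(line, out, outlen);
}
```

A missing file suggests that the kernel was built without THP at all.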

-- 
                                         Stano


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] mm: fix up a spurious page fault whenever it happens
  2013-05-24 13:06                           ` Rik van Riel
  2013-05-24 13:55                             ` Stanislav Meduna
@ 2013-06-16 21:34                             ` Stanislav Meduna
  2013-06-18 19:13                               ` Stanislav Meduna
  1 sibling, 1 reply; 35+ messages in thread
From: Stanislav Meduna @ 2013-06-16 21:34 UTC (permalink / raw)
  To: Rik van Riel
  Cc: H. Peter Anvin, Steven Rostedt, Linus Torvalds, linux-rt-users,
	linux-kernel, Thomas Gleixner, Ingo Molnar,
	the arch/x86 maintainers, Hai Huang

Hi all,

I was able to reproduce the page fault problem with
a relatively simple application, for now on the
Geode platform. It can be downloaded at

  http://www.meduna.org/tmp/PageFault.tar.gz

Basically the test application does:

- 4 threads that do nothing but periodically sleep
- 1 thread looping in a timerfd loop doing nothing
- 4 threads doing nonblocking TCP connects to an address
  in the local network that does not exist, i.e. all that
  happens are ARP requests.
- additionally a non-existing TCP congestion algorithm is
  requested resulting in repeated futile requests to load
  the module. This looks to be an important part in reproducing
  it, but the problem also occasionally happened with kernels
  that did not have modules enabled at all, so it is
  probably just pushing some probabilities.
- the application is statically linked - this might or might
  not be relevant, I just wanted the text-segment to be bigger
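
The connect part of the list above can be sketched as follows. This
is not the code from the tarball, only an illustration of the
technique; start_probe() and its parameters are hypothetical, and the
address and algorithm name are whatever the caller passes in.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/*
 * Start one nonblocking connect attempt to addr:port after asking for the
 * congestion algorithm cc_name. Returns the socket, or -1 on a setup error.
 * *cc_errno receives the errno of the TCP_CONGESTION request (0 if accepted).
 */
static int start_probe(const char *addr, unsigned short port,
		       const char *cc_name, int *cc_errno)
{
	struct sockaddr_in sa;
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd < 0)
		return -1;

	memset(&sa, 0, sizeof(sa));
	sa.sin_family = AF_INET;
	sa.sin_port = htons(port);
	if (inet_pton(AF_INET, addr, &sa.sin_addr) != 1 ||
	    fcntl(fd, F_SETFL, O_NONBLOCK) < 0) {
		close(fd);
		return -1;
	}

	/* A name the kernel does not know is expected to be rejected here. */
	*cc_errno = 0;
	if (setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION,
		       cc_name, strlen(cc_name)) < 0)
		*cc_errno = errno;

	/* EINPROGRESS is the normal outcome of a nonblocking connect. */
	if (connect(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0 &&
	    errno != EINPROGRESS) {
		close(fd);
		return -1;
	}
	return fd;
}
```

The caller is expected to close the returned socket after a timeout
and start over, which keeps both the ARP requests and the rejected
algorithm requests coming.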

I know it is a weird mix, I was just trying to mimic what
our application did in the form that was able to trigger
the faults most often.

In my few tests this reproducibly triggered the problem within hours,
at most a day.

My feeling is that the problem is triggered best if there
is little network traffic and no other connections to the
machine, but this is only a subjective feeling.

The kernel configuration, cpuinfo, meminfo and lspci
are included in the tarball. The kernel configuration is not
very clean, it is a kernel intended to work on both Geode
and Celeron and is also a snapshot of what reproduced the
problem the best.

The environment is a current 3.4-rt with the following tweaks:

 chrt -f -p 37 <pid of ksoftirqd/0>
 chrt -o -p 0 <pid of irq/14-pata>  [because of a pata_cs5536 bug]
 renice -15 <pid of irq/14-pata>
 ulimit -s 512

Before compiling change the CONNECT_ADDR define to an address
that is in the local LAN but is not present.

Other than this application a lightweight mix of usual Debian
processes is running. There are no servers except openssh and ntp.
A shell script that wakes up every 2 seconds and does some
housekeeping is running; that is probably what recovers the system
when it enters the page-fault loop followed by the
RT throttling.

Right now a test with the same kernel with preempt none
is running to see whether the problem also happens with this
application there (due to the timing sensitivity only a positive
result is significant). I did not have a chance to test
on an Intel processor yet.

Thanks
-- 
                                       Stano


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] mm: fix up a spurious page fault whenever it happens
  2013-06-16 21:34                             ` Stanislav Meduna
@ 2013-06-18 19:13                               ` Stanislav Meduna
  2013-06-19  5:20                                 ` Linus Torvalds
  0 siblings, 1 reply; 35+ messages in thread
From: Stanislav Meduna @ 2013-06-18 19:13 UTC (permalink / raw)
  To: Rik van Riel
  Cc: H. Peter Anvin, Steven Rostedt, Linus Torvalds, linux-rt-users,
	linux-kernel, Thomas Gleixner, Ingo Molnar,
	the arch/x86 maintainers, Hai Huang

On 16.06.2013 23:34, Stanislav Meduna wrote:

> Right now a test with the same kernel with preempt none
> is running to see whether the problem also happens with this
> application there (due to the timing sensitivity only a positive
> result is significant).

No crash in 2 days running with preempt none...

-- 
                                              Stano


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] mm: fix up a spurious page fault whenever it happens
  2013-06-18 19:13                               ` Stanislav Meduna
@ 2013-06-19  5:20                                 ` Linus Torvalds
  2013-06-19  7:36                                   ` Stanislav Meduna
  0 siblings, 1 reply; 35+ messages in thread
From: Linus Torvalds @ 2013-06-19  5:20 UTC (permalink / raw)
  To: Stanislav Meduna
  Cc: Rik van Riel, H. Peter Anvin, Steven Rostedt, linux-rt-users,
	linux-kernel, Thomas Gleixner, Ingo Molnar,
	the arch/x86 maintainers, Hai Huang

On Tue, Jun 18, 2013 at 9:13 AM, Stanislav Meduna <stano@meduna.org> wrote:
>
> No crash in 2 days running with preempt none...

Is this UP?

There's the fast_tlb race that Peter fixed in commit 29eb77825cc7
("arch, mm: Remove tlb_fast_mode()"). I'm not seeing how it would
cause infinite TLB faults, but it definitely causes potentially
incoherent TLB contents. And afaik it only happens with
CONFIG_PREEMPT, and on UP systems. Which sounds like it might match
your setup...

                Linus

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] mm: fix up a spurious page fault whenever it happens
  2013-06-19  5:20                                 ` Linus Torvalds
@ 2013-06-19  7:36                                   ` Stanislav Meduna
  2013-06-19  8:06                                     ` Peter Zijlstra
  0 siblings, 1 reply; 35+ messages in thread
From: Stanislav Meduna @ 2013-06-19  7:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Rik van Riel, H. Peter Anvin, Steven Rostedt, linux-rt-users,
	linux-kernel, Thomas Gleixner, Ingo Molnar,
	the arch/x86 maintainers, Hai Huang, Peter Zijlstra

On 19.06.2013 07:20, Linus Torvalds wrote:

>> No crash in 2 days running with preempt none...
> 
> Is this UP?

Yes it is.

> There's the fast_tlb race that Peter fixed in commit 29eb77825cc7
> ("arch, mm: Remove tlb_fast_mode()"). I'm not seeing how it would
> cause infinite TLB faults, but it definitely causes potentially
> incoherent TLB contents. And afaik it only happens with
> CONFIG_PREEMPT, and on UP systems. Which sounds like it might match
> your setup...

Oh, thank you for the pointer, this indeed looks interesting.

Unfortunately the patch is far from applying to 3.4, which
I am using, and I know too little about what is involved here
to backport it. I will test it when (if) it gets to the 3.4(-rt)
(or when I find some spare time to play with the newer kernel
on that system).

Thanks
-- 
                                     Stano


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] mm: fix up a spurious page fault whenever it happens
  2013-06-19  7:36                                   ` Stanislav Meduna
@ 2013-06-19  8:06                                     ` Peter Zijlstra
  2013-06-20 17:50                                       ` Stanislav Meduna
  0 siblings, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2013-06-19  8:06 UTC (permalink / raw)
  To: Stanislav Meduna
  Cc: Linus Torvalds, Rik van Riel, H. Peter Anvin, Steven Rostedt,
	linux-rt-users, linux-kernel, Thomas Gleixner, Ingo Molnar,
	the arch/x86 maintainers, Hai Huang

On Wed, Jun 19, 2013 at 09:36:39AM +0200, Stanislav Meduna wrote:
> On 19.06.2013 07:20, Linus Torvalds wrote:
> 
> >> No crash in 2 days running with preempt none...
> > 
> > Is this UP?
> 
> Yes it is.
> 
> > There's the fast_tlb race that Peter fixed in commit 29eb77825cc7
> > ("arch, mm: Remove tlb_fast_mode()"). I'm not seeing how it would
> > cause infinite TLB faults, but it definitely causes potentially
> > incoherent TLB contents. And afaik it only happens with
> > CONFIG_PREEMPT, and on UP systems. Which sounds like it might match
> > your setup...
> 
> Oh, thank you for the pointer, this indeed looks interesting.
> 
> > Unfortunately the patch is far from applying to 3.4, which
> > I am using, and I know too little about what is involved here
> > to backport it. I will test it when (if) it gets to the 3.4(-rt)
> (or when I find some spare time to play with the newer kernel
> on that system).

The easiest way to test for your system is to ensure tlb_fast_mode()
returns an unconditional 0.
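
Why the return value matters can be shown with a rough user-space
model. It assumes that in fast mode each page is freed immediately,
while otherwise the free is deferred until after the TLB flush. The
gather_model struct and the model_*() functions are hypothetical
names and do not reproduce the kernel's mmu_gather code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy gather state: pages queued for freeing plus a record of the outcome. */
struct gather_model {
	bool fast_mode;            /* free pages at once instead of batching */
	size_t queued;             /* pages waiting for the TLB flush */
	size_t freed_before_flush; /* freed while a stale entry may still exist */
	size_t freed_after_flush;  /* freed once the TLB is known to be clean */
};

/* Called for every page that is being unmapped. */
static void model_remove_page(struct gather_model *g)
{
	if (g->fast_mode) {
		/* Freed right away; the TLB has not been flushed yet. */
		g->freed_before_flush++;
		return;
	}
	g->queued++;
}

/* Called once at the end: flush the TLB, then free what was queued. */
static void model_finish(struct gather_model *g)
{
	g->freed_after_flush += g->queued;
	g->queued = 0;
}
```

In the model, anything counted in freed_before_flush could be reused
while a stale translation still points at it, if the task is
preempted between the free and the flush. With fast mode off that
counter stays at zero.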

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] mm: fix up a spurious page fault whenever it happens
  2013-06-19  8:06                                     ` Peter Zijlstra
@ 2013-06-20 17:50                                       ` Stanislav Meduna
  0 siblings, 0 replies; 35+ messages in thread
From: Stanislav Meduna @ 2013-06-20 17:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Rik van Riel, H. Peter Anvin, Steven Rostedt,
	linux-rt-users, linux-kernel, Thomas Gleixner, Ingo Molnar,
	the arch/x86 maintainers, Hai Huang

On 19.06.2013 10:06, Peter Zijlstra wrote:

>> On 19.06.2013 07:20, Linus Torvalds wrote:
>>> There's the fast_tlb race that Peter fixed in commit 29eb77825cc7
>>> ("arch, mm: Remove tlb_fast_mode()"). I'm not seeing how it would
>>> cause infinite TLB faults, but it definitely causes potentially
>>> incoherent TLB contents. And afaik it only happens with
>>> CONFIG_PREEMPT, and on UP systems. Which sounds like it might match
>>> your setup...

> The easiest way to test for your system is to ensure tlb_fast_mode()
> returns an unconditional 0.

Nope. Got the faults also with tlb_fast_mode() returning 0, this time
after ~10 hours. So there still has to be something...

Regards
-- 
                                        Stano




Thread overview: 35+ messages
2013-05-17  8:42 [PATCH - sort of] x86: Livelock in handle_pte_fault Stanislav Meduna
2013-05-22  0:39 ` Steven Rostedt
2013-05-22  7:32   ` Stanislav Meduna
2013-05-22 12:33   ` Rik van Riel
2013-05-22 15:01     ` Linus Torvalds
2013-05-22 17:41       ` [PATCH] mm: fix up a spurious page fault whenever it happens Rik van Riel
2013-05-22 18:04         ` Stanislav Meduna
2013-05-22 18:11           ` Steven Rostedt
2013-05-22 18:21             ` Stanislav Meduna
2013-05-22 18:35               ` Rik van Riel
2013-05-22 18:42                 ` H. Peter Anvin
2013-05-22 18:43                   ` Rik van Riel
2013-05-23  8:07                     ` Stanislav Meduna
2013-05-23 12:19                       ` Rik van Riel
2013-05-23 13:29                         ` Steven Rostedt
2013-05-23 15:06                           ` H. Peter Anvin
2013-05-23 15:27                             ` Steven Rostedt
2013-05-23 17:24                               ` H. Peter Anvin
2013-05-23 17:36                                 ` Steven Rostedt
2013-05-23 17:38                                   ` H. Peter Anvin
2013-05-24  8:29                         ` Stanislav Meduna
2013-05-24 10:28                           ` Stanislav Meduna
2013-05-24 13:06                           ` Rik van Riel
2013-05-24 13:55                             ` Stanislav Meduna
2013-05-24 14:23                               ` Stanislav Meduna
2013-06-16 21:34                             ` Stanislav Meduna
2013-06-18 19:13                               ` Stanislav Meduna
2013-06-19  5:20                                 ` Linus Torvalds
2013-06-19  7:36                                   ` Stanislav Meduna
2013-06-19  8:06                                     ` Peter Zijlstra
2013-06-20 17:50                                       ` Stanislav Meduna
2013-05-23 14:45                       ` Linus Torvalds
2013-05-23 14:50                         ` Linus Torvalds
2013-05-23 15:03                           ` Stanislav Meduna
2013-05-22 18:47                 ` Stanislav Meduna
