* [PATCH - sort of] x86: Livelock in handle_pte_fault
@ 2013-05-17 8:42 Stanislav Meduna
  2013-05-22 0:39 ` Steven Rostedt
  0 siblings, 1 reply; 35+ messages in thread
From: Stanislav Meduna @ 2013-05-17 8:42 UTC (permalink / raw)
  To: linux-rt-users, linux-kernel
  Cc: rostedt, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86

Hi all,

I don't know whether this is linux-rt specific or applies to
the mainline too, so I'll repeat some things the linux-rt
readers already know.

Environment:

- Geode LX or Celeron M
- _not_ CONFIG_SMP
- linux 3.4 with realtime patches and full preempt configured
- an application consisting of several mostly RR-class threads
- the application runs with mlockall()
- there is no swap

Problem:

- after several hours to 1-2 weeks some of the threads start to loop
  in the following way

  0d...0 62811.755382: function: do_page_fault
  0....0 62811.755386: function: handle_mm_fault
  0....0 62811.755389: function: handle_pte_fault
  0d...0 62811.755394: function: do_page_fault
  0....0 62811.755396: function: handle_mm_fault
  0....0 62811.755398: function: handle_pte_fault
  0d...0 62811.755402: function: do_page_fault
  0....0 62811.755404: function: handle_mm_fault
  0....0 62811.755406: function: handle_pte_fault

  and stay in the loop until the RT throttling gets activated.
  One of the faulting addresses was in code (after returning
  from a syscall), a second one in stack (inside put_user right
  before a syscall ends), both were surely mapped.

- After RT throttler activates it somehow magically fixes itself,
  probably (not verified) because another _process_ gets scheduled.
  When throttled the RR and FF threads are not allowed to run for
  a while (20 ms in my configuration). The livelock lasts around
  1-3 seconds, and there is a SCHED_OTHER process that runs each
  2 seconds.

- Kernel threads with higher priority than the faulting one (linux-rt
  irq threads) run normally. A higher priority user thread from the
  same process gets scheduled and then enters the same faulting loop.

- in ps -o min_flt,maj_flt the number of minor page faults
  for the offending thread skyrockets to hundreds of thousands
  (normally it stays zero as everything is already mapped
  when it is started)

- The code in handle_pte_fault proceeds through the
    entry = pte_mkyoung(entry);
  line and the following
    ptep_set_access_flags
  returns zero.

- The livelock is extremely timing sensitive - different workloads
  cause it not to happen at all or far later.

- I was able to make this happen a bit faster (once per ~4 hours)
  with the rt thread repeatedly causing the kernel to try to
  invoke modprobe to load a missing module - so there is a load
  of kworker-s launching modprobes (in case anyone wonders how it
  can happen: this was a bug in our application with invalid level
  specified for setsockopt causing searching for TCP congestion
  module instead of setting SO_LINGER)

- the symptoms are similar to
  http://lkml.indiana.edu/hypermail/linux/kernel/1103.0/01364.html
  which got fixed by
  https://lkml.org/lkml/2011/3/15/516
  but this fix does not apply to the processors in question

- the patch below _seems_ to fix it, or at least massively delay it -
  the testcase now runs for 2.5 days instead of 4 hours. I doubt
  it is the proper patch (it brutally reloads the CR3 every time
  a thread with userspace mapping is switched to). I just got the
  suspicion that there is some way the kernel forgets to update
  the memory mapping when going from a userspace thread through
  some kernel ones back to another userspace one and tried to make
  sure the mapping is always reloaded.

- the whole history starts at
  http://www.spinics.net/lists/linux-rt-users/msg09758.html
  I originally thought the problem is in timerfd and hunted it
  in several places until I learned to use the tracing infrastructure
  and started to pin it down with trace prints etc :)

- A trace file of the hang is at
  http://www.meduna.org/tmp/trace.mmfaulthang.dat.gz

Does this ring a bell with someone?

Thanks
Stano

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 6902152..3d54a15 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -54,21 +54,23 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 		if (unlikely(prev->context.ldt != next->context.ldt))
 			load_LDT_nolock(&next->context);
 	}
-#ifdef CONFIG_SMP
 	else {
+#ifdef CONFIG_SMP
 		percpu_write(cpu_tlbstate.state, TLBSTATE_OK);
 		BUG_ON(percpu_read(cpu_tlbstate.active_mm) != next);
 
 		if (!cpumask_test_and_set_cpu(cpu, mm_cpumask(next))) {
+#endif
 			/* We were in lazy tlb mode and leave_mm disabled
 			 * tlb flush IPI delivery. We must reload CR3
 			 * to make sure to use no freed page tables.
 			 */
 			load_cr3(next->pgd);
 			load_LDT_nolock(&next->context);
+#ifdef CONFIG_SMP
 		}
-	}
 #endif
+	}
 }
 
 #define activate_mm(prev, next)

^ permalink raw reply related [flat|nested] 35+ messages in thread
* Re: [PATCH - sort of] x86: Livelock in handle_pte_fault 2013-05-17 8:42 [PATCH - sort of] x86: Livelock in handle_pte_fault Stanislav Meduna @ 2013-05-22 0:39 ` Steven Rostedt 2013-05-22 7:32 ` Stanislav Meduna 2013-05-22 12:33 ` Rik van Riel 0 siblings, 2 replies; 35+ messages in thread From: Steven Rostedt @ 2013-05-22 0:39 UTC (permalink / raw) To: Stanislav Meduna Cc: linux-rt-users, linux-kernel, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86, riel On Fri, 2013-05-17 at 10:42 +0200, Stanislav Meduna wrote: > Hi all, > > I don't know whether this is linux-rt specific or applies to > the mainline too, so I'll repeat some things the linux-rt > readers already know. > > Environment: > > - Geode LX or Celeron M > - _not_ CONFIG_SMP > - linux 3.4 with realtime patches and full preempt configured > - an application consisting of several mostly RR-class threads The threads do a mlockall too right? I'm not sure mlock will lock memory for a new thread's stack. > - the application runs with mlockall() With both MCL_FUTURE and MCL_CURRENT set, right? > - there is no swap Hmm, doesn't mean that code can't be swapped out, as it is just mapped from the file it came from. But you'd think mlockall would prevent that. > > Problem: > > - after several hours to 1-2 weeks some of the threads start to loop > in the following way > > 0d...0 62811.755382: function: do_page_fault > 0....0 62811.755386: function: handle_mm_fault > 0....0 62811.755389: function: handle_pte_fault > 0d...0 62811.755394: function: do_page_fault > 0....0 62811.755396: function: handle_mm_fault > 0....0 62811.755398: function: handle_pte_fault > 0d...0 62811.755402: function: do_page_fault > 0....0 62811.755404: function: handle_mm_fault > 0....0 62811.755406: function: handle_pte_fault > > and stay in the loop until the RT throttling gets activated. 
> One of the faulting addresses was in code (after returning > from a syscall), a second one in stack (inside put_user right > before a syscall ends), both were surely mapped. > > - After RT throttler activates it somehow magically fixes itself, > probably (not verified) because another _process_ gets scheduled. > When throttled the RR and FF threads are not allowed to run for > a while (20 ms in my configuration). The livelocks lasts around > 1-3 seconds, and there is a SCHED_OTHER process that runs each > 2 seconds. Hmm, if there was a missed TLB flush, and we are faulting due to a bad TLB table, and it goes into an infinite faulting loop, the only thing that will stop it is the RT throttle. Then a new task gets scheduled, and we flush the TLB and everything is fine again. > > - Kernel threads with higher priority than the faulting one (linux-rt > irq threads) run normally. A higher priority user thread from the > same process gets scheduled and then enters the same faulting loop. Kernel threads share the mm, and wont cause a reload of the CR3. > > - in ps -o min_flt,maj_flt the number of minor page faults > for the offending thread skyrockets to hundreds of thousands > (normally it stays zero as everything is already mapped > when it is started) > > - The code in handle_pte_fault proceeds through the > entry = pte_mkyoung(entry); > line and the following > ptep_set_access_flags > returns zero. > > - The livelock is extremely timing sensitive - different workloads > cause it not to happen at all or far later. 
> > - I was able to make this happen a bit faster (once per ~4 hours) > with the rt thread repeatly causing the kernel to try to > invoke modprobe to load a missing module - so there is a load > of kworker-s launching modprobes (in case anyone wonders how it > can happen: this was a bug in our application with invalid level > specified for setsockopt causing searching for TCP congestion > module instead of setting SO_LINGER) Note, that modules are in vmalloc space, and do fault in. But it also changes the PGD. > > - the symptoms are similar to > http://lkml.indiana.edu/hypermail/linux/kernel/1103.0/01364.html > which got fixed by > https://lkml.org/lkml/2011/3/15/516 > but this fix does not apply to the processors in question > > - the patch below _seems_ to fix it, or at least massively delay it - > the testcase now runs for 2.5 days instead of 4 hours. I doubt > it is the proper patch (it brutally reloads the CR3 every time > a thread with userspace mapping is switched to). I just got the > suspicion that there is some way the kernel forgets to update > the memory mapping when going from an userpace thread through > some kernel ones back to another userspace one and tried to make > sure the mapping is always reloaded. Seems a bit extreme. Looks to me there's a missing flush TLB somewhere. Do you have a reproducer you can share. That way, maybe we can all share the joy. -- Steve > > - the whole history starts at > http://www.spinics.net/lists/linux-rt-users/msg09758.html > I originally thought the problem is in timerfd and hunted it > in several places until I learned to use the tracing infrastructure > and started to pin it down with trace prints etc :) > > - A trace file of the hang is at > http://www.meduna.org/tmp/trace.mmfaulthang.dat.gz > > Does this ring a bell with someone? 
> > Thanks > Stano > > > > > diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h > index 6902152..3d54a15 100644 > --- a/arch/x86/include/asm/mmu_context.h > +++ b/arch/x86/include/asm/mmu_context.h > @@ -54,21 +54,23 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next, > if (unlikely(prev->context.ldt != next->context.ldt)) > load_LDT_nolock(&next->context); > } > -#ifdef CONFIG_SMP > else { > +#ifdef CONFIG_SMP > percpu_write(cpu_tlbstate.state, TLBSTATE_OK); > BUG_ON(percpu_read(cpu_tlbstate.active_mm) != next); > > if (!cpumask_test_and_set_cpu(cpu, mm_cpumask(next))) { > +#endif > /* We were in lazy tlb mode and leave_mm disabled > * tlb flush IPI delivery. We must reload CR3 > * to make sure to use no freed page tables. > */ > load_cr3(next->pgd); > load_LDT_nolock(&next->context); > +#ifdef CONFIG_SMP > } > - } > #endif > + } > } > > #define activate_mm(prev, next) ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH - sort of] x86: Livelock in handle_pte_fault
  2013-05-22 0:39 ` Steven Rostedt
@ 2013-05-22 7:32 ` Stanislav Meduna
  2013-05-22 12:33 ` Rik van Riel
  1 sibling, 0 replies; 35+ messages in thread
From: Stanislav Meduna @ 2013-05-22 7:32 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-rt-users, linux-kernel, Thomas Gleixner, Ingo Molnar,
     H. Peter Anvin, x86, riel

On 22.05.2013 02:39, Steven Rostedt wrote:

> The threads do a mlockall too right? I'm not sure mlock will lock memory
> for a new thread's stack.

They don't. However,
https://rt.wiki.kernel.org/index.php/Threaded_RT-application_with_memory_locking_and_stack_handling_example
claims

  "Threads started after a call to mlockall(MCL_CURRENT | MCL_FUTURE)
  will generate page faults immediately since the new stack is
  immediately forced to RAM (due to the MCL_FUTURE flag)."

and as ps -o min_flt reports zero page faults for the threads,
I think that is the case here too.

Anyway, both particular addresses were surely mapped long
before the fault.

>> - the application runs with mlockall()
>
> With both MCL_FUTURE and MCL_CURRENT set, right?

Yes.

>> - there is no swap
>
> Hmm, doesn't mean that code can't be swapped out, as it is just mapped
> from the file it came from. But you'd think mlockall would prevent that.

mlockall also forces the stack to be mapped immediately, so it does
not generate page faults while expanding incrementally.

> Seems a bit extreme. Looks to me there's a missing flush TLB somewhere.

Probably. One interesting thing: the test for "need to reload something"
looks a bit different for the ARM architecture in
arch/arm/include/asm/mmu_context.h:

  if (!cpumask_test_and_set_cpu(cpu, mm_cpumask(next)) || prev != next) {

and they do something also for the
  !CONFIG_SMP && !cpumask_test_and_set_cpu(cpu, mm_cpumask(next))
case. I don't know what exactly the semantics of mm_cpumask are, but
the difference is suspicious.

> Do you have a reproducer you can share. That way, maybe we can all share
> the joy.

Unfortunately not and I have really tried :( If I get new ideas,
I will try again.

Thanks
--
Stano

^ permalink raw reply [flat|nested] 35+ messages in thread
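For anyone who wants to check the mlockall() and min_flt observations above from inside a test program, here is a minimal userspace sketch. It is not taken from the application discussed in this thread; the helper names (lock_all_memory, read_fault_counts, minor_faults_touching) are made up for illustration. It only assumes the documented Linux mlockall(2), getrusage(2) and mmap(2) calls, and it falls back to process-wide counters if RUSAGE_THREAD is not visible.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

/* Per-thread counters where available (Linux), process-wide otherwise. */
#ifdef RUSAGE_THREAD
#define FAULT_SCOPE RUSAGE_THREAD
#else
#define FAULT_SCOPE RUSAGE_SELF
#endif

/* Same numbers ps reports as min_flt / maj_flt. */
struct fault_counts {
	long minor;
	long major;
};

/* Lock current and future mappings. Returns 0 on success, -1 on failure
 * (for example when RLIMIT_MEMLOCK is too low for an unprivileged user). */
static int lock_all_memory(void)
{
	return mlockall(MCL_CURRENT | MCL_FUTURE);
}

/* Read the fault counters. Returns 0 on success, -1 on failure. */
static int read_fault_counts(struct fault_counts *out)
{
	struct rusage ru;

	if (getrusage(FAULT_SCOPE, &ru) != 0)
		return -1;
	out->minor = ru.ru_minflt;
	out->major = ru.ru_majflt;
	return 0;
}

/* Map npages of anonymous memory, write one byte per page, and return
 * how many minor faults that cost. Returns -1 if anything failed. */
static long minor_faults_touching(size_t npages)
{
	long page = sysconf(_SC_PAGESIZE);
	struct fault_counts before, after;
	volatile char *buf;
	size_t len, off;
	void *map;

	assert(page > 0);
	len = npages * (size_t)page;

	map = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (map == MAP_FAILED)
		return -1;
	buf = map;

	if (read_fault_counts(&before) != 0) {
		munmap(map, len);
		return -1;
	}
	for (off = 0; off < len; off += (size_t)page)
		buf[off] = 1;
	if (read_fault_counts(&after) != 0) {
		munmap(map, len);
		return -1;
	}

	munmap(map, len);
	return after.minor - before.minor;
}
```

If lock_all_memory() succeeded beforehand, the delta from minor_faults_touching() would be expected to stay at or near zero, since MCL_FUTURE populates the mapping up front; without the lock, roughly one minor fault per touched page is the usual picture. A thread whose counter keeps climbing with everything locked is the symptom described in this thread.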
* Re: [PATCH - sort of] x86: Livelock in handle_pte_fault 2013-05-22 0:39 ` Steven Rostedt 2013-05-22 7:32 ` Stanislav Meduna @ 2013-05-22 12:33 ` Rik van Riel 2013-05-22 15:01 ` Linus Torvalds 1 sibling, 1 reply; 35+ messages in thread From: Rik van Riel @ 2013-05-22 12:33 UTC (permalink / raw) To: Steven Rostedt Cc: Stanislav Meduna, linux-rt-users, linux-kernel, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86, Linus Torvalds, Hai Huang [-- Attachment #1: Type: text/plain, Size: 2854 bytes --] On 05/21/2013 08:39 PM, Steven Rostedt wrote: > On Fri, 2013-05-17 at 10:42 +0200, Stanislav Meduna wrote: >> Hi all, >> >> I don't know whether this is linux-rt specific or applies to >> the mainline too, so I'll repeat some things the linux-rt >> readers already know. >> >> Environment: >> >> - Geode LX or Celeron M >> - _not_ CONFIG_SMP >> - linux 3.4 with realtime patches and full preempt configured >> - an application consisting of several mostly RR-class threads > > The threads do a mlockall too right? I'm not sure mlock will lock memory > for a new thread's stack. > >> - the application runs with mlockall() > > With both MCL_FUTURE and MCL_CURRENT set, right? > >> - there is no swap > > Hmm, doesn't mean that code can't be swapped out, as it is just mapped > from the file it came from. But you'd think mlockall would prevent that. > >> >> Problem: >> >> - after several hours to 1-2 weeks some of the threads start to loop >> in the following way >> >> 0d...0 62811.755382: function: do_page_fault >> 0....0 62811.755386: function: handle_mm_fault >> 0....0 62811.755389: function: handle_pte_fault >> 0d...0 62811.755394: function: do_page_fault >> 0....0 62811.755396: function: handle_mm_fault >> 0....0 62811.755398: function: handle_pte_fault >> 0d...0 62811.755402: function: do_page_fault >> 0....0 62811.755404: function: handle_mm_fault >> 0....0 62811.755406: function: handle_pte_fault >> >> and stay in the loop until the RT throttling gets activated. 
>> One of the faulting addresses was in code (after returning >> from a syscall), a second one in stack (inside put_user right >> before a syscall ends), both were surely mapped. >> >> - After RT throttler activates it somehow magically fixes itself, >> probably (not verified) because another _process_ gets scheduled. >> When throttled the RR and FF threads are not allowed to run for >> a while (20 ms in my configuration). The livelocks lasts around >> 1-3 seconds, and there is a SCHED_OTHER process that runs each >> 2 seconds. > > Hmm, if there was a missed TLB flush, and we are faulting due to a bad > TLB table, and it goes into an infinite faulting loop, the only thing > that will stop it is the RT throttle. Then a new task gets scheduled, > and we flush the TLB and everything is fine again. That sounds like maybe we DO want a TLB flush on spurious page faults, so we get rid of this problem. Last fall we thought this problem could not happen on x86, but your bug report suggests that it might. We can get flush_tlb_fix_spurious_fault to do a local TLB invalidate of just the address in question by removing the x86-specific dummy version, falling back to the asm-generic version that does something. Can you test the attached patch? -- All rights reversed [-- Attachment #2: flush-tlb-on-spurious-fault.patch --] [-- Type: text/x-patch, Size: 1003 bytes --] Subject: x86,mm: flush TLB on spurious fault It appears that certain x86 CPUs do not automatically flush the TLB entry that caused a page fault, causing spurious faults to loop forever under certain circumstances. Remove the dummy flush_tlb_fix_spurious_fault define, so x86 falls back to the asm-generic version, which does do a local TLB flush. 
Signed-off-by: Rik van Riel <riel@redhat.com>
Reported-by: Stanislav Meduna <stano@meduna.org>
---
 arch/x86/include/asm/pgtable.h | 2 --
 1 file changed, 2 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 1e67223..43e7966 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -729,8 +729,6 @@ static inline void ptep_set_wrprotect(struct mm_struct *mm,
 	pte_update(mm, addr, ptep);
 }
 
-#define flush_tlb_fix_spurious_fault(vma, address) do { } while (0)
-
 #define mk_pmd(page, pgprot) pfn_pmd(page_to_pfn(page), (pgprot))
 
 #define __HAVE_ARCH_PMDP_SET_ACCESS_FLAGS

^ permalink raw reply related [flat|nested] 35+ messages in thread
* Re: [PATCH - sort of] x86: Livelock in handle_pte_fault 2013-05-22 12:33 ` Rik van Riel @ 2013-05-22 15:01 ` Linus Torvalds 2013-05-22 17:41 ` [PATCH] mm: fix up a spurious page fault whenever it happens Rik van Riel 0 siblings, 1 reply; 35+ messages in thread From: Linus Torvalds @ 2013-05-22 15:01 UTC (permalink / raw) To: Rik van Riel Cc: Steven Rostedt, Stanislav Meduna, linux-rt-users, linux-kernel, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, the arch/x86 maintainers, Hai Huang On Wed, May 22, 2013 at 5:33 AM, Rik van Riel <riel@redhat.com> wrote: > > That sounds like maybe we DO want a TLB flush on spurious > page faults, so we get rid of this problem. Hmm. If it was just the Geode, I wouldn't be surprised. But with a Celeron too? Anyway, worth testing.. > We can get flush_tlb_fix_spurious_fault to do a local TLB > invalidate of just the address in question by removing the > x86-specific dummy version, falling back to the asm-generic > version that does something. > > Can you test the attached patch? I think you should also remove the if (flags & FAULT_FLAG_WRITE) test in handle_pte_fault(). Because if it's spurious, it might happen on reads too, I think. RT people - does RT do anything special with the page tables? Stanislav, the patch you sent out may well work, but it's damned odd. On UP, we don't do the leave_mm() optimization that makes that code necessary. So I agree with Rik that it's more likely somewhere else (and infinite page faults do imply the TLB not getting flushed by the page fault exception), and your patch might just be working around it by simply flushing the TLB at least when switching between threads, which still happens. Linus ^ permalink raw reply [flat|nested] 35+ messages in thread
* [PATCH] mm: fix up a spurious page fault whenever it happens 2013-05-22 15:01 ` Linus Torvalds @ 2013-05-22 17:41 ` Rik van Riel 2013-05-22 18:04 ` Stanislav Meduna 0 siblings, 1 reply; 35+ messages in thread From: Rik van Riel @ 2013-05-22 17:41 UTC (permalink / raw) To: Linus Torvalds Cc: Steven Rostedt, Stanislav Meduna, linux-rt-users, linux-kernel, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, the arch/x86 maintainers, Hai Huang On Wed, 22 May 2013 08:01:43 -0700 Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Wed, May 22, 2013 at 5:33 AM, Rik van Riel <riel@redhat.com> wrote: > > Can you test the attached patch? > > I think you should also remove the > > if (flags & FAULT_FLAG_WRITE) > > test in handle_pte_fault(). Because if it's spurious, it might happen > on reads too, I think. Here you are. I wonder if the conditional was put in because we originally did a global TLB flush (with IPIs) from the spurious fault handler... Stanislav, could you add this patch to your test? ---8<--- Subject: [PATCH] mm: fix up a spurious page fault whenever it happens The kernel currently only handles spurious page faults when they "should" happen, but potentially this is not the only situation where they could happen. The spurious fault handler only flushes an entry from the local TLB; this should be a rare event with minimal side effects. This patch removes the conditional, allowing the spurious fault handler to execute whenever a spurious page fault happens, which should eliminate infinite page fault loops. 
Signed-off-by: Rik van Riel <riel@redhat.com>
Reported-by: Stanislav Meduna <stano@meduna.org>
---
 mm/memory.c | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 6dc1882..962477d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3744,13 +3744,11 @@ int handle_pte_fault(struct mm_struct *mm,
 		update_mmu_cache(vma, address, pte);
 	} else {
 		/*
-		 * This is needed only for protection faults but the arch code
-		 * is not yet telling us if this is a protection fault or not.
-		 * This still avoids useless tlb flushes for .text page faults
-		 * with threads.
+		 * The page table entry is good, but the CPU generated a
+		 * spurious fault. Invalidate the corresponding TLB entry
+		 * on this CPU, so the next access can succeed.
 		 */
-		if (flags & FAULT_FLAG_WRITE)
-			flush_tlb_fix_spurious_fault(vma, address);
+		flush_tlb_fix_spurious_fault(vma, address);
 	}
 unlock:
 	pte_unmap_unlock(pte, ptl);

^ permalink raw reply related [flat|nested] 35+ messages in thread
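The reasoning behind this patch can be illustrated with a small userspace toy model. This is only a sketch of the hypothesis, not kernel code: the struct, the flag values and the toy_* names are invented, and it assumes a CPU that keeps a stale "not present" TLB entry across the page fault, which is exactly the point in question. The page table entry is already good, so the handler's access-flag update changes nothing (the "ptep_set_access_flags returns zero" case from the original report), and the only way out of the loop is to drop the cached entry.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag values for the toy model only. */
#define TOY_PTE_PRESENT  0x1u
#define TOY_PTE_ACCESSED 0x2u

struct toy_mmu {
	unsigned pte;    /* what the page table says */
	unsigned tlb;    /* what the CPU has cached for this address */
	bool tlb_valid;  /* false means the next access walks the page table */
};

/* CPU side: use the cached entry if there is one, otherwise refill it
 * from the page table. Returns true if the access succeeds. */
static bool toy_access(struct toy_mmu *m)
{
	if (!m->tlb_valid) {
		m->tlb = m->pte;
		m->tlb_valid = true;
	}
	return (m->tlb & TOY_PTE_PRESENT) != 0;
}

/* Kernel side: mimics the tail of handle_pte_fault(). Returns true if
 * the page table entry had to be changed. */
static bool toy_handle_fault(struct toy_mmu *m, bool flush_on_spurious)
{
	unsigned entry = m->pte | TOY_PTE_ACCESSED;  /* like pte_mkyoung() */
	bool changed = (entry != m->pte);            /* like ptep_set_access_flags() */

	m->pte = entry;
	if (!changed && flush_on_spurious)
		m->tlb_valid = false;  /* like flush_tlb_fix_spurious_fault() */
	return changed;
}

/* Retry the access until it succeeds or max_faults is reached.
 * Returns the number of faults taken. */
static unsigned toy_run(bool flush_on_spurious, unsigned max_faults)
{
	/* The PTE is fine, but the cached entry is stale (not present). */
	struct toy_mmu m = {
		.pte = TOY_PTE_PRESENT | TOY_PTE_ACCESSED,
		.tlb = 0,
		.tlb_valid = true,
	};
	unsigned faults = 0;

	while (faults < max_faults && !toy_access(&m)) {
		toy_handle_fault(&m, flush_on_spurious);
		faults++;
	}
	return faults;
}
```

In this model toy_run(false, n) takes all n faults, while toy_run(true, n) takes one fault and then proceeds. Note that the model only covers the stale-entry hypothesis; it says nothing about whether a single-address flush is actually effective on the hardware in question.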
* Re: [PATCH] mm: fix up a spurious page fault whenever it happens
  2013-05-22 17:41 ` [PATCH] mm: fix up a spurious page fault whenever it happens Rik van Riel
@ 2013-05-22 18:04 ` Stanislav Meduna
  2013-05-22 18:11 ` Steven Rostedt
  0 siblings, 1 reply; 35+ messages in thread
From: Stanislav Meduna @ 2013-05-22 18:04 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Linus Torvalds, Steven Rostedt, linux-rt-users, linux-kernel,
     Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
     the arch/x86 maintainers, Hai Huang

On 22.05.2013 19:41, Rik van Riel wrote:

>> I think you should also remove the
>>
>> if (flags & FAULT_FLAG_WRITE)

Done

>>> Can you test the attached patch?

Nope. Fails with the same symptoms, min_flt skyrockets,
the throttler activates and after 2 seconds all is well
again.

This is on Geode LX, I don't have the Celeron M at hand now.

Thanks
--
Stano

^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH] mm: fix up a spurious page fault whenever it happens 2013-05-22 18:04 ` Stanislav Meduna @ 2013-05-22 18:11 ` Steven Rostedt 2013-05-22 18:21 ` Stanislav Meduna 0 siblings, 1 reply; 35+ messages in thread From: Steven Rostedt @ 2013-05-22 18:11 UTC (permalink / raw) To: Stanislav Meduna Cc: Rik van Riel, Linus Torvalds, linux-rt-users, linux-kernel, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, the arch/x86 maintainers, Hai Huang On Wed, 2013-05-22 at 20:04 +0200, Stanislav Meduna wrote: > On 22.05.2013 19:41, Rik van Riel wrote: > > >> I think you should also remove the > >> > >> if (flags & FAULT_FLAG_WRITE) > > Done > > >>> Can you test the attached patch? > > Nope. Fails with the same symptoms, min_flt skyrockets, > the throttler activates and after 2 seconds all is well > again. > > This is on Geode LX, I don't have the Celeron M at the hand now. > Did you apply both patches? Without the first one, this one is meaningless. -- Steve ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH] mm: fix up a spurious page fault whenever it happens
  2013-05-22 18:11 ` Steven Rostedt
@ 2013-05-22 18:21 ` Stanislav Meduna
  2013-05-22 18:35 ` Rik van Riel
  0 siblings, 1 reply; 35+ messages in thread
From: Stanislav Meduna @ 2013-05-22 18:21 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Rik van Riel, Linus Torvalds, linux-rt-users, linux-kernel,
     Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
     the arch/x86 maintainers, Hai Huang

On 22.05.2013 20:11, Steven Rostedt wrote:

> Did you apply both patches? Without the first one, this one is
> meaningless.

Sure.

BTW, back when I tried to pinpoint it I also tried adding
  flush_tlb_page(vma, address)
at the beginning of handle_pte_fault, which as I read should
be basically the same. It did not change anything.

I did mention it in some previous mail but forgot to include
it again in the summary - sorry :/

--
Stano

^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH] mm: fix up a spurious page fault whenever it happens 2013-05-22 18:21 ` Stanislav Meduna @ 2013-05-22 18:35 ` Rik van Riel 2013-05-22 18:42 ` H. Peter Anvin 2013-05-22 18:47 ` Stanislav Meduna 0 siblings, 2 replies; 35+ messages in thread From: Rik van Riel @ 2013-05-22 18:35 UTC (permalink / raw) To: Stanislav Meduna Cc: Steven Rostedt, Linus Torvalds, linux-rt-users, linux-kernel, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, the arch/x86 maintainers, Hai Huang On 05/22/2013 02:21 PM, Stanislav Meduna wrote: > On 22.05.2013 20:11, Steven Rostedt wrote: > >> Did you apply both patches? Without the first one, this one is >> meaningless. > > Sure. > > BTW, back when I tried to pinpoint it I also tried adding > flush_tlb_page(vma, address) > at the beginning of handle_pte_fault, which as I read should > be basically the same. It did not not change anything. I'm stumped. If the Geode knows how to flush single TLB entries, it should do that when flush_tlb_page is called. If it does not know, it should throw an invalid instruction exception, and not quietly complete the instruction without doing anything. ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH] mm: fix up a spurious page fault whenever it happens 2013-05-22 18:35 ` Rik van Riel @ 2013-05-22 18:42 ` H. Peter Anvin 2013-05-22 18:43 ` Rik van Riel 2013-05-22 18:47 ` Stanislav Meduna 1 sibling, 1 reply; 35+ messages in thread From: H. Peter Anvin @ 2013-05-22 18:42 UTC (permalink / raw) To: Rik van Riel Cc: Stanislav Meduna, Steven Rostedt, Linus Torvalds, linux-rt-users, linux-kernel, Thomas Gleixner, Ingo Molnar, the arch/x86 maintainers, Hai Huang On 05/22/2013 11:35 AM, Rik van Riel wrote: > On 05/22/2013 02:21 PM, Stanislav Meduna wrote: >> On 22.05.2013 20:11, Steven Rostedt wrote: >> >>> Did you apply both patches? Without the first one, this one is >>> meaningless. >> >> Sure. >> >> BTW, back when I tried to pinpoint it I also tried adding >> flush_tlb_page(vma, address) >> at the beginning of handle_pte_fault, which as I read should >> be basically the same. It did not not change anything. > > I'm stumped. > > If the Geode knows how to flush single TLB entries, it > should do that when flush_tlb_page is called. > > If it does not know, it should throw an invalid instruction > exception, and not quietly complete the instruction without > doing anything. > Some CPUs have had errata when it comes to flushing large pages that have been split into small pages by hardware, e.g. due to MTRR conflicts. In that case, fragments of the large page may have been left in the TLB. Could that explain what you are seeing? -hpa ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH] mm: fix up a spurious page fault whenever it happens
  2013-05-22 18:42 ` H. Peter Anvin
@ 2013-05-22 18:43 ` Rik van Riel
  2013-05-23 8:07 ` Stanislav Meduna
  0 siblings, 1 reply; 35+ messages in thread
From: Rik van Riel @ 2013-05-22 18:43 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Stanislav Meduna, Steven Rostedt, Linus Torvalds, linux-rt-users,
     linux-kernel, Thomas Gleixner, Ingo Molnar,
     the arch/x86 maintainers, Hai Huang

On 05/22/2013 02:42 PM, H. Peter Anvin wrote:
> On 05/22/2013 11:35 AM, Rik van Riel wrote:
>> On 05/22/2013 02:21 PM, Stanislav Meduna wrote:
>>> On 22.05.2013 20:11, Steven Rostedt wrote:
>>>
>>>> Did you apply both patches? Without the first one, this one is
>>>> meaningless.
>>>
>>> Sure.
>>>
>>> BTW, back when I tried to pinpoint it I also tried adding
>>> flush_tlb_page(vma, address)
>>> at the beginning of handle_pte_fault, which as I read should
>>> be basically the same. It did not not change anything.
>>
>> I'm stumped.
>>
>> If the Geode knows how to flush single TLB entries, it
>> should do that when flush_tlb_page is called.
>>
>> If it does not know, it should throw an invalid instruction
>> exception, and not quietly complete the instruction without
>> doing anything.
>>
>
> Some CPUs have had errata when it comes to flushing large pages that
> have been split into small pages by hardware, e.g. due to MTRR
> conflicts. In that case, fragments of the large page may have been left
> in the TLB.
>
> Could that explain what you are seeing?

That would be testable by changing __native_flush_tlb_single()
to call __flush_tlb(), instead of doing an invlpg instruction.

In other words, make the code look like this, for testing:

static inline void __native_flush_tlb_single(unsigned long addr)
{
	__flush_tlb();
}

This on top of the other two patches.

^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH] mm: fix up a spurious page fault whenever it happens 2013-05-22 18:43 ` Rik van Riel @ 2013-05-23 8:07 ` Stanislav Meduna 2013-05-23 12:19 ` Rik van Riel 2013-05-23 14:45 ` Linus Torvalds 0 siblings, 2 replies; 35+ messages in thread From: Stanislav Meduna @ 2013-05-23 8:07 UTC (permalink / raw) To: Rik van Riel Cc: H. Peter Anvin, Steven Rostedt, Linus Torvalds, linux-rt-users, linux-kernel, Thomas Gleixner, Ingo Molnar, the arch/x86 maintainers, Hai Huang On 22.05.2013 20:43, Rik van Riel wrote: >> Some CPUs have had errata when it comes to flushing large pages that >> have been split into small pages by hardware, e.g. due to MTRR >> conflicts. In that case, fragments of the large page may have been left >> in the TLB. Can I somehow find if this is the case? The memory mapping for the failing process has two regions slightly larger than 4 MB - code and heap. The process also does not access any funny memory regions from userspace - it is basically networking (both TCP/IP and raw sockets) and crunching of the data received. No mmapped devices or something like that. > static inline void __native_flush_tlb_single(unsigned long addr) > { > __flush_tlb(); > } > > This on top of the other two patches. It did not crash overnight, but it also does not show any minor fault counted for the threads, so I'm afraid the situation just did not happen - there should be at least one visible in the ps -o min_flt output, right? I will give it some more testing time. Thanks -- Stano ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH] mm: fix up a spurious page fault whenever it happens
  2013-05-23 8:07 ` Stanislav Meduna
@ 2013-05-23 12:19 ` Rik van Riel
  2013-05-23 13:29 ` Steven Rostedt
  2013-05-24 8:29 ` Stanislav Meduna
  2013-05-23 14:45 ` Linus Torvalds
  1 sibling, 2 replies; 35+ messages in thread
From: Rik van Riel @ 2013-05-23 12:19 UTC (permalink / raw)
  To: Stanislav Meduna
  Cc: H. Peter Anvin, Steven Rostedt, Linus Torvalds, linux-rt-users,
     linux-kernel, Thomas Gleixner, Ingo Molnar,
     the arch/x86 maintainers, Hai Huang

On 05/23/2013 04:07 AM, Stanislav Meduna wrote:
> On 22.05.2013 20:43, Rik van Riel wrote:
>
>>> Some CPUs have had errata when it comes to flushing large pages that
>>> have been split into small pages by hardware, e.g. due to MTRR
>>> conflicts. In that case, fragments of the large page may have been left
>>> in the TLB.
>
> Can I somehow find if this is the case? The memory mapping
> for the failing process has two regions slightly larger than
> 4 MB - code and heap.
>
> The process also does not access any funny memory regions
> from userspace - it is basically networking (both TCP/IP
> and raw sockets) and crunching of the data received.
> No mmapped devices or something like that.
>
>> static inline void __native_flush_tlb_single(unsigned long addr)
>> {
>> __flush_tlb();
>> }
>>
>> This on top of the other two patches.
>
> It did not crash overnight, but it also does not show any
> minor fault counted for the threads, so I'm afraid the situation
> just did not happen - there should be at least one visible in
> the ps -o min_flt output, right?

If all the page faults are done by the main thread, and the
TLB gets properly flushed now, the other threads might not
see minor faults.

> I will give it some more testing time.

That is a good idea.

Now to figure out how we properly fix this issue in the kernel...

We can add a bit in the architecture bits that
we use to check against other CPU and system
errata, and conditionally flush the whole TLB
from __native_flush_tlb_single().

The question is, how do we identify what CPUs need the
extra flushing? And in what circumstances do they require
it?

^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH] mm: fix up a spurious page fault whenever it happens
From: Steven Rostedt @ 2013-05-23 13:29 UTC
To: Rik van Riel
Cc: Stanislav Meduna, H. Peter Anvin, Linus Torvalds, linux-rt-users, linux-kernel, Thomas Gleixner, Ingo Molnar, the arch/x86 maintainers, Hai Huang

On Thu, 2013-05-23 at 08:19 -0400, Rik van Riel wrote:

> We can add a bit in the architecture bits that
> we use to check against other CPU and system
> errata, and conditionally flush the whole TLB
> from __native_flush_tlb_single().

If we find that some CPUs have issues and others do not, and we can
determine this by checking the CPU type at run time, I would strongly
suggest using the jump_label infrastructure to do the branches. I know
this is early to suggest something like this, but I just wanted to put
it in your head ;-)

-- Steve
* Re: [PATCH] mm: fix up a spurious page fault whenever it happens
From: H. Peter Anvin @ 2013-05-23 15:06 UTC
To: Steven Rostedt
Cc: Rik van Riel, Stanislav Meduna, Linus Torvalds, linux-rt-users, linux-kernel, Thomas Gleixner, Ingo Molnar, the arch/x86 maintainers, Hai Huang

On 05/23/2013 06:29 AM, Steven Rostedt wrote:
> On Thu, 2013-05-23 at 08:19 -0400, Rik van Riel wrote:
>
>> We can add a bit in the architecture bits that
>> we use to check against other CPU and system
>> errata, and conditionally flush the whole TLB
>> from __native_flush_tlb_single().
>
> If we find that some CPUs have issues and others do not, and we can
> determine this by checking the CPU type at run time, I would strongly
> suggest using the jump_label infrastructure to do the branches. I know
> this is early to suggest something like this, but I just wanted to put
> it in your head ;-)
>

We don't even need the jump_label infrastructure -- we have
static_cpu_has*() which actually predates jump_label although it uses
the same underlying ideas.

	-hpa
* Re: [PATCH] mm: fix up a spurious page fault whenever it happens
From: Steven Rostedt @ 2013-05-23 15:27 UTC
To: H. Peter Anvin
Cc: Rik van Riel, Stanislav Meduna, Linus Torvalds, linux-rt-users, linux-kernel, Thomas Gleixner, Ingo Molnar, the arch/x86 maintainers, Hai Huang

On Thu, 2013-05-23 at 08:06 -0700, H. Peter Anvin wrote:

> We don't even need the jump_label infrastructure -- we have
> static_cpu_has*() which actually predates jump_label although it uses
> the same underlying ideas.

Ah right. I wonder if it would be worth consolidating a lot of these
"modifying of code" infrastructures. Which reminds me, I need to update
text_poke() to do things similar to what ftrace does, and get rid of the
stop machine code.

-- Steve
* Re: [PATCH] mm: fix up a spurious page fault whenever it happens
From: H. Peter Anvin @ 2013-05-23 17:24 UTC
To: Steven Rostedt
Cc: Rik van Riel, Stanislav Meduna, Linus Torvalds, linux-rt-users, linux-kernel, Thomas Gleixner, Ingo Molnar, the arch/x86 maintainers, Hai Huang

On 05/23/2013 08:27 AM, Steven Rostedt wrote:
> On Thu, 2013-05-23 at 08:06 -0700, H. Peter Anvin wrote:
>
>> We don't even need the jump_label infrastructure -- we have
>> static_cpu_has*() which actually predates jump_label although it uses
>> the same underlying ideas.
>
> Ah right. I wonder if it would be worth consolidating a lot of these
> "modifying of code" infrastructures. Which reminds me, I need to update
> text_poke() to do things similar to what ftrace does, and get rid of the
> stop machine code.
>

Well, static_cpu_has*() just uses the alternatives infrastructure.

	-hpa
* Re: [PATCH] mm: fix up a spurious page fault whenever it happens
From: Steven Rostedt @ 2013-05-23 17:36 UTC
To: H. Peter Anvin
Cc: Rik van Riel, Stanislav Meduna, Linus Torvalds, linux-rt-users, linux-kernel, Thomas Gleixner, Ingo Molnar, the arch/x86 maintainers, Hai Huang

On Thu, 2013-05-23 at 10:24 -0700, H. Peter Anvin wrote:
> On 05/23/2013 08:27 AM, Steven Rostedt wrote:
> > On Thu, 2013-05-23 at 08:06 -0700, H. Peter Anvin wrote:
> >
> >> We don't even need the jump_label infrastructure -- we have
> >> static_cpu_has*() which actually predates jump_label although it uses
> >> the same underlying ideas.
> >
> > Ah right. I wonder if it would be worth consolidating a lot of these
> > "modifying of code" infrastructures. Which reminds me, I need to update
> > text_poke() to do things similar to what ftrace does, and get rid of the
> > stop machine code.
> >
>
> Well, static_cpu_has*() just uses the alternatives infrastructure.

And as it's a boot time change only, it's not quite in the category of
jump_labels and function tracing.

-- Steve
* Re: [PATCH] mm: fix up a spurious page fault whenever it happens
From: H. Peter Anvin @ 2013-05-23 17:38 UTC
To: Steven Rostedt
Cc: Rik van Riel, Stanislav Meduna, Linus Torvalds, linux-rt-users, linux-kernel, Thomas Gleixner, Ingo Molnar, the arch/x86 maintainers, Hai Huang

On 05/23/2013 10:36 AM, Steven Rostedt wrote:
> On Thu, 2013-05-23 at 10:24 -0700, H. Peter Anvin wrote:
>> On 05/23/2013 08:27 AM, Steven Rostedt wrote:
>>> On Thu, 2013-05-23 at 08:06 -0700, H. Peter Anvin wrote:
>>>
>>>> We don't even need the jump_label infrastructure -- we have
>>>> static_cpu_has*() which actually predates jump_label although it uses
>>>> the same underlying ideas.
>>>
>>> Ah right. I wonder if it would be worth consolidating a lot of these
>>> "modifying of code" infrastructures. Which reminds me, I need to update
>>> text_poke() to do things similar to what ftrace does, and get rid of the
>>> stop machine code.
>>>
>>
>> Well, static_cpu_has*() just uses the alternatives infrastructure.
>
> And as it's a boot time change only, it's not quite in the category of
> jump_labels and function tracing.
>

Right.

	-hpa
* Re: [PATCH] mm: fix up a spurious page fault whenever it happens
From: Stanislav Meduna @ 2013-05-24 8:29 UTC
To: Rik van Riel
Cc: H. Peter Anvin, Steven Rostedt, Linus Torvalds, linux-rt-users, linux-kernel, Thomas Gleixner, Ingo Molnar, the arch/x86 maintainers, Hai Huang

On 23.05.2013 14:19, Rik van Riel wrote:

>>> static inline void __native_flush_tlb_single(unsigned long addr)
>>> {
>>>         __flush_tlb();
>>> }
>
>> I will give it some more testing time.
>
> That is a good idea.

Still no crash, so this one indeed seems to change things.

If I understand it correctly, these patches fix the problem
when it happens and we still don't know why the TLB is stale
in the first place - whether there is (also) a genuine bug
or whether we are hitting some chip errata, right?

For the record the cpuinfo for my present testsystem:

processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 5
model           : 10
model name      : Geode(TM) Integrated Processor by AMD PCS
stepping        : 2
microcode       : 0x88a93d
cpu MHz         : 498.042
cache size      : 128 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu de pse tsc msr cx8 sep pge cmov clflush mmx mmxext 3dnowext 3dnow
bogomips        : 996.08
clflush size    : 32
cache_alignment : 32
address sizes   : 32 bits physical, 32 bits virtual
power management:

and for the Celeron M where I can unfortunately reproduce
it much less often (days to weeks).

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 13
model name      : Intel(R) Celeron(R) M processor 1.00GHz
stepping        : 8
cpu MHz         : 1000.011
cache size      : 512 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov clflush dts acpi mmx fxsr sse sse2 ss tm pbe nx bts
bogomips        : 2000.02
clflush size    : 64
cache_alignment : 64
address sizes   : 32 bits physical, 32 bits virtual
power management:

Thanks
--
Stano
* Re: [PATCH] mm: fix up a spurious page fault whenever it happens
From: Stanislav Meduna @ 2013-05-24 10:28 UTC
To: Rik van Riel
Cc: H. Peter Anvin, Steven Rostedt, Linus Torvalds, linux-rt-users, linux-kernel, Thomas Gleixner, Ingo Molnar, the arch/x86 maintainers, Hai Huang

On 24.05.2013 10:29, Stanislav Meduna wrote:

>>>> static inline void __native_flush_tlb_single(unsigned long addr)
>>>> {
>>>>         __flush_tlb();
>>>> }
>>
>>> I will give it some more testing time.
>>
>> That is a good idea.
>
> Still no crash, so this one indeed seems to change things.

Take that back, it has now crashed as well, it just took longer.

min_flt of two threads jumped from zero to 1848 (lower prio)
and 735993 (higher prio, preempted the first one) respectively,
with a hang of 1.7 seconds.

--
Stano
* Re: [PATCH] mm: fix up a spurious page fault whenever it happens
From: Rik van Riel @ 2013-05-24 13:06 UTC
To: Stanislav Meduna
Cc: H. Peter Anvin, Steven Rostedt, Linus Torvalds, linux-rt-users, linux-kernel, Thomas Gleixner, Ingo Molnar, the arch/x86 maintainers, Hai Huang

On 05/24/2013 04:29 AM, Stanislav Meduna wrote:
> On 23.05.2013 14:19, Rik van Riel wrote:
>
>>>> static inline void __native_flush_tlb_single(unsigned long addr)
>>>> {
>>>>         __flush_tlb();
>>>> }
>>
>>> I will give it some more testing time.
>>
>> That is a good idea.
>
> Still no crash, so this one indeed seems to change things.
>
> If I understand it correctly, these patches fix the problem
> when it happens and we still don't know why the TLB is stale
> in the first place - whether there is (also) a genuine bug
> or whether we are hitting some chip errata, right?

Just to rule something out, are you using
transparent huge pages on those systems?

That could result in a mix of 4MB and 4kB
mappings, sometimes of the same memory.
The page tables would only ever contain
one of those mappings, but if we have some
kind of TLB problem, we might preserve a
large mapping across a page breakup, or a
small one across a page collapse...
* Re: [PATCH] mm: fix up a spurious page fault whenever it happens
From: Stanislav Meduna @ 2013-05-24 13:55 UTC
To: Rik van Riel
Cc: H. Peter Anvin, Steven Rostedt, Linus Torvalds, linux-rt-users, linux-kernel, Thomas Gleixner, Ingo Molnar, the arch/x86 maintainers, Hai Huang

On 24.05.2013 15:06, Rik van Riel wrote:

> Just to rule something out, are you using
> transparent huge pages on those systems?

On my present test system they are configured in, but I am
not using them.

# cat /proc/meminfo | grep Huge
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       4096 kB

However during my (many) previous experiments the problem
also happened with kernels that did not have it configured.

Thanks
--
Stano
* Re: [PATCH] mm: fix up a spurious page fault whenever it happens
From: Stanislav Meduna @ 2013-05-24 14:23 UTC
To: Rik van Riel
Cc: H. Peter Anvin, Steven Rostedt, Linus Torvalds, linux-rt-users, linux-kernel, Thomas Gleixner, Ingo Molnar, the arch/x86 maintainers, Hai Huang

On 24.05.2013 15:55, Stanislav Meduna wrote:

>> Just to rule something out, are you using
>> transparent huge pages on those systems?
>
> On my present test system they are configured in, but I am
> not using them.

Ah, _transparent_ huge pages. No, that is not enabled.

--
Stano
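The mix-up above is easy to make: the HugePages_* lines in /proc/meminfo describe the explicit hugetlb pool, while the transparent huge page policy is exposed separately through sysfs. As a minimal sketch (not part of the thread's patches), the helper below reads /sys/kernel/mm/transparent_hugepage/enabled and extracts the bracketed word that marks the active mode. The function names thp_selected_mode and thp_current_mode are made up for this illustration, and the file is assumed to be missing on kernels built without THP support.

```c
#include <stdio.h>
#include <string.h>

/* Policy knob present on kernels built with CONFIG_TRANSPARENT_HUGEPAGE. */
#define THP_ENABLED_PATH "/sys/kernel/mm/transparent_hugepage/enabled"

/*
 * Copy the bracketed word from a line such as "always [madvise] never"
 * into out. Returns 0 on success, -1 if there is no complete [word]
 * or it does not fit into out.
 */
int thp_selected_mode(const char *line, char *out, size_t outlen)
{
    const char *open = strchr(line, '[');
    const char *close = open ? strchr(open, ']') : NULL;
    size_t len;

    if (!open || !close)
        return -1;
    len = (size_t)(close - open - 1);
    if (len == 0 || len >= outlen)
        return -1;
    memcpy(out, open + 1, len);
    out[len] = '\0';
    return 0;
}

/*
 * Read the current THP mode from sysfs. Returns 0 and fills out on
 * success, 1 if the file cannot be opened (for example because THP is
 * not built in), -1 if the contents cannot be parsed.
 */
int thp_current_mode(char *out, size_t outlen)
{
    char line[128];
    FILE *f = fopen(THP_ENABLED_PATH, "r");

    if (!f)
        return 1;
    if (!fgets(line, sizeof(line), f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    return thp_selected_mode(line, out, outlen);
}
```

Calling thp_current_mode() from a small main() and printing the result gives a direct answer to the question asked above, independent of the hugetlb counters.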
* Re: [PATCH] mm: fix up a spurious page fault whenever it happens
From: Stanislav Meduna @ 2013-06-16 21:34 UTC
To: Rik van Riel
Cc: H. Peter Anvin, Steven Rostedt, Linus Torvalds, linux-rt-users, linux-kernel, Thomas Gleixner, Ingo Molnar, the arch/x86 maintainers, Hai Huang

Hi all,

I was able to reproduce the page fault problem with a relatively
simple application, for now on the Geode platform. It can be
downloaded at

  http://www.meduna.org/tmp/PageFault.tar.gz

Basically the test application does:

- 4 threads that do nothing but periodically sleep

- 1 thread looping in a timerfd loop doing nothing

- 4 threads doing nonblocking TCP connects to an address
  in the local network that does not exist, i.e. all that happens
  are ARP requests.

- additionally a non-existing TCP congestion algorithm is requested
  resulting in repeated futile requests to load the module.
  This looks to be an important part in reproducing it, but
  the problem also occasionally happened with kernels that
  did not have modules enabled at all, so it is probably just
  pushing some probabilities.

- the application is statically linked - this might or might not
  be relevant, I just wanted the text-segment to be bigger

I know it is a weird mix, I was just trying to mimic what
our application did in the form that was able to trigger
the faults most often.

In my few tests this repeatably triggered the problem in hours,
at most a day. My feeling is that the problem is triggered best
if there is little network traffic and no other connections to
the machine, but this is only a subjective feeling.

The kernel configuration, cpuinfo, meminfo and lspci are included
in the tarball. The kernel configuration is not very clean, it is
a kernel intended to work on both Geode and Celeron and is also
a snapshot of what reproduced the problem the best.

The environment is a current 3.4-rt with the following tweaks:

  chrt -f -p 37 <pid of ksoftirqd/0>
  chrt -o -p 0 <pid of irq/14-pata>   [because of a pata_cs5536 bug]
  renice -15 <pid of irq/14-pata>
  ulimit -s 512

Before compiling change the CONNECT_ADDR define to an address
that is in the local LAN but is not present.

Other than this application a lightweight mix of usual Debian
processes is running. There are no servers except openssh and ntp.
A shell script that wakes every 2 seconds and does some housekeeping
is running, that probably recovers the system when it enters
the page-fault loop followed by the RT throttling.

Right now a test with the same kernel with preempt none
is running to see whether the problem also happens with this
application there (due to the timing sensitivity only a positive
result has a significance).

I did not have a chance to test on an Intel processor yet.

Thanks
--
Stano
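For readers who do not want to unpack the tarball, here is a minimal sketch of just the congestion-algorithm ingredient described above. It is not taken from PageFault.tar.gz; the function name try_congestion_algorithm and the algorithm name used in the usage note are invented for illustration. It relies only on the standard TCP_CONGESTION socket option, and the call is expected to fail for an unknown name; whether the kernel also attempts a module load on the way depends on its configuration and on the caller's privileges.

```c
#include <errno.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Ask the kernel for a congestion control algorithm by name on a fresh
 * TCP socket. Returns 0 if the kernel accepted the name, otherwise the
 * errno reported by socket() or setsockopt().
 */
int try_congestion_algorithm(const char *name)
{
    int err = 0;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return errno;
    if (setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION,
                   name, (socklen_t)strlen(name)) < 0)
        err = errno;
    close(fd);
    return err;
}
```

Calling try_congestion_algorithm("no_such_cc") in a loop from a realtime thread would approximate that one part of the load; the sleeping threads, the timerfd loop and the nonblocking connects from the list above are not covered by this sketch.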
* Re: [PATCH] mm: fix up a spurious page fault whenever it happens
From: Stanislav Meduna @ 2013-06-18 19:13 UTC
To: Rik van Riel
Cc: H. Peter Anvin, Steven Rostedt, Linus Torvalds, linux-rt-users, linux-kernel, Thomas Gleixner, Ingo Molnar, the arch/x86 maintainers, Hai Huang

On 16.06.2013 23:34, Stanislav Meduna wrote:

> Right now a test with the same kernel with preempt none
> is running to see whether the problem also happens with this
> application there (due to the timing sensitivity only a positive
> result has a significance).

No crash in 2 days running with preempt none...

--
Stano
* Re: [PATCH] mm: fix up a spurious page fault whenever it happens
From: Linus Torvalds @ 2013-06-19 5:20 UTC
To: Stanislav Meduna
Cc: Rik van Riel, H. Peter Anvin, Steven Rostedt, linux-rt-users, linux-kernel, Thomas Gleixner, Ingo Molnar, the arch/x86 maintainers, Hai Huang

On Tue, Jun 18, 2013 at 9:13 AM, Stanislav Meduna <stano@meduna.org> wrote:
>
> No crash in 2 days running with preempt none...

Is this UP?

There's the fast_tlb race that Peter fixed in commit 29eb77825cc7
("arch, mm: Remove tlb_fast_mode()"). I'm not seeing how it would
cause infinite TLB faults, but it definitely causes potentially
incoherent TLB contents. And afaik it only happens with
CONFIG_PREEMPT, and on UP systems. Which sounds like it might match
your setup...

                 Linus
* Re: [PATCH] mm: fix up a spurious page fault whenever it happens
From: Stanislav Meduna @ 2013-06-19 7:36 UTC
To: Linus Torvalds
Cc: Rik van Riel, H. Peter Anvin, Steven Rostedt, linux-rt-users, linux-kernel, Thomas Gleixner, Ingo Molnar, the arch/x86 maintainers, Hai Huang, Peter Zijlstra

On 19.06.2013 07:20, Linus Torvalds wrote:

>> No crash in 2 days running with preempt none...
>
> Is this UP?

Yes it is.

> There's the fast_tlb race that Peter fixed in commit 29eb77825cc7
> ("arch, mm: Remove tlb_fast_mode()"). I'm not seeing how it would
> cause infinite TLB faults, but it definitely causes potentially
> incoherent TLB contents. And afaik it only happens with
> CONFIG_PREEMPT, and on UP systems. Which sounds like it might match
> your setup...

Oh, thank you for the pointer, this indeed looks interesting.

Unfortunately the patch massively does not apply to 3.4 which
I am using and I know too little what all is involved here
to backport it. I will test it when (if) it gets to the 3.4(-rt)
(or when I find some spare time to play with the newer kernel
on that system).

Thanks
--
Stano
* Re: [PATCH] mm: fix up a spurious page fault whenever it happens
From: Peter Zijlstra @ 2013-06-19 8:06 UTC
To: Stanislav Meduna
Cc: Linus Torvalds, Rik van Riel, H. Peter Anvin, Steven Rostedt, linux-rt-users, linux-kernel, Thomas Gleixner, Ingo Molnar, the arch/x86 maintainers, Hai Huang

On Wed, Jun 19, 2013 at 09:36:39AM +0200, Stanislav Meduna wrote:
> On 19.06.2013 07:20, Linus Torvalds wrote:
>
> >> No crash in 2 days running with preempt none...
> >
> > Is this UP?
>
> Yes it is.
>
> > There's the fast_tlb race that Peter fixed in commit 29eb77825cc7
> > ("arch, mm: Remove tlb_fast_mode()"). I'm not seeing how it would
> > cause infinite TLB faults, but it definitely causes potentially
> > incoherent TLB contents. And afaik it only happens with
> > CONFIG_PREEMPT, and on UP systems. Which sounds like it might match
> > your setup...
>
> Oh, thank you for the pointer, this indeed looks interesting.
>
> Unfortunately the patch massively does not apply to 3.4 which
> I am using and I know too little what all is involved here
> to backport it. I will test it when (if) it gets to the 3.4(-rt)
> (or when I find some spare time to play with the newer kernel
> on that system).

The easiest way to test for your system is to ensure tlb_fast_mode()
returns an unconditional 0.
* Re: [PATCH] mm: fix up a spurious page fault whenever it happens
From: Stanislav Meduna @ 2013-06-20 17:50 UTC
To: Peter Zijlstra
Cc: Linus Torvalds, Rik van Riel, H. Peter Anvin, Steven Rostedt, linux-rt-users, linux-kernel, Thomas Gleixner, Ingo Molnar, the arch/x86 maintainers, Hai Huang

On 19.06.2013 10:06, Peter Zijlstra wrote:

>> On 19.06.2013 07:20, Linus Torvalds wrote:
>>> There's the fast_tlb race that Peter fixed in commit 29eb77825cc7
>>> ("arch, mm: Remove tlb_fast_mode()"). I'm not seeing how it would
>>> cause infinite TLB faults, but it definitely causes potentially
>>> incoherent TLB contents. And afaik it only happens with
>>> CONFIG_PREEMPT, and on UP systems. Which sounds like it might match
>>> your setup...

> The easiest way to test for your system is to ensure tlb_fast_mode()
> returns an unconditional 0.

Nope. Got the faults also with tlb_fast_mode() returning 0,
this time after ~10 hours. So there still has to be something...

Regards
--
Stano
* Re: [PATCH] mm: fix up a spurious page fault whenever it happens
From: Linus Torvalds @ 2013-05-23 14:45 UTC
To: Stanislav Meduna
Cc: Rik van Riel, H. Peter Anvin, Steven Rostedt, linux-rt-users, linux-kernel, Thomas Gleixner, Ingo Molnar, the arch/x86 maintainers, Hai Huang

On Thu, May 23, 2013 at 1:07 AM, Stanislav Meduna <stano@meduna.org> wrote:
>
> It did not crash overnight, but it also does not show any
> minor fault counted for the threads

Page faults that don't cause us to map a page (ie a spurious one, or
one that just updates dirty/accessed bits) don't show up as even minor
faults. Think of the major/minor as "mapping activity" not a page
fault count.

So if this is due to some stuck TLB entry, that wouldn't show up anyway.

                Linus
* Re: [PATCH] mm: fix up a spurious page fault whenever it happens
From: Linus Torvalds @ 2013-05-23 14:50 UTC
To: Stanislav Meduna
Cc: Rik van Riel, H. Peter Anvin, Steven Rostedt, linux-rt-users, linux-kernel, Thomas Gleixner, Ingo Molnar, the arch/x86 maintainers, Hai Huang

On Thu, May 23, 2013 at 7:45 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> Page faults that don't cause us to map a page (ie a spurious one, or
> one that just updates dirty/accessed bits) don't show up as even minor
> faults. Think of the major/minor as "mapping activity" not a page
> fault count.

Actually, I take that back. We always update either min_flt or maj_flt. My bad.

Another question: I'm assuming this is all 32-bit, is it with PAE
enabled? That changes some of the TLB flushing, and we had one bug
related to that, maybe there are others..

                   Linus
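Since every handled fault ends up in one of the two counters, an application can also watch them from the inside rather than through ps -o min_flt,maj_flt. The following is only a sketch under that assumption: it snapshots the process-wide counters with getrusage(RUSAGE_SELF), and the struct and function names (fault_counts, read_fault_counts) are invented here. Linux additionally offers RUSAGE_THREAD for per-thread numbers, which is closer to what the per-thread ps output in this thread shows, but it needs _GNU_SOURCE and is left out to keep the example portable.

```c
#include <sys/resource.h>

/* Minor and major page fault counters as reported by getrusage(). */
struct fault_counts {
    long minor;
    long major;
};

/*
 * Snapshot the fault counters of the calling process.
 * Returns 0 on success, -1 if getrusage() fails.
 */
int read_fault_counts(struct fault_counts *out)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    out->minor = ru.ru_minflt;
    out->major = ru.ru_majflt;
    return 0;
}
```

Taking one snapshot before and one after a suspect code region and comparing the minor counts is enough to tell a fully mapped, mlockall()ed process (difference near zero) from one stuck in the kind of fault loop reported here, where the count grows by hundreds of thousands.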
* Re: [PATCH] mm: fix up a spurious page fault whenever it happens
From: Stanislav Meduna @ 2013-05-23 15:03 UTC
To: Linus Torvalds
Cc: Rik van Riel, H. Peter Anvin, Steven Rostedt, linux-rt-users, linux-kernel, Thomas Gleixner, Ingo Molnar, the arch/x86 maintainers, Hai Huang

On 23.05.2013 16:50, Linus Torvalds wrote:

> Another question: I'm assuming this is all 32-bit, is it with PAE
> enabled? That changes some of the TLB flushing, and we had one bug
> related to that, maybe there are others..

32 bit, no PAE.

--
Stano
* Re: [PATCH] mm: fix up a spurious page fault whenever it happens
From: Stanislav Meduna @ 2013-05-22 18:47 UTC
To: Rik van Riel
Cc: Steven Rostedt, Linus Torvalds, linux-rt-users, linux-kernel, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, the arch/x86 maintainers, Hai Huang

On 22.05.2013 20:35, Rik van Riel wrote:

> I'm stumped.
>
> If the Geode knows how to flush single TLB entries, it
> should do that when flush_tlb_page is called.
>
> If it does not know, it should throw an invalid instruction
> exception, and not quietly complete the instruction without
> doing anything.

Could it be that the problem is not a stale TLB, but a page directory
that is somehow invalid, e.g. belonging to the previous modprobe
(or whatever) instead of the running process? My patch does
load_cr3(next->pgd); so it explicitly loads something there.

> In other words, make the code look like this, for testing:
>
> static inline void __native_flush_tlb_single(unsigned long addr)
> {
>         __flush_tlb();
> }

Yup, will try it.

Thanks
--
Stano