Message-ID: <5195ED8B.7060002@meduna.org>
Date: Fri, 17 May 2013 10:42:51 +0200
From: Stanislav Meduna
To: linux-rt-users@vger.kernel.org, linux-kernel@vger.kernel.org
CC: rostedt@goodmis.org, Thomas Gleixner, Ingo Molnar,
    "H. Peter Anvin", x86@kernel.org
Subject: [PATCH - sort of] x86: Livelock in handle_pte_fault

Hi all,

I don't know whether this is linux-rt specific or applies to the
mainline too, so I'll repeat some things the linux-rt readers
already know.

Environment:

- Geode LX or Celeron M
- _not_ CONFIG_SMP
- Linux 3.4 with realtime patches and full preempt configured
- an application consisting of several mostly RR-class threads
- the application runs with mlockall()
- there is no swap

Problem:

- After several hours to 1-2 weeks, some of the threads start to loop
  in the following way

  0d...0 62811.755382: function: do_page_fault
  0....0 62811.755386: function: handle_mm_fault
  0....0 62811.755389: function: handle_pte_fault
  0d...0 62811.755394: function: do_page_fault
  0....0 62811.755396: function: handle_mm_fault
  0....0 62811.755398: function: handle_pte_fault
  0d...0 62811.755402: function: do_page_fault
  0....0 62811.755404: function: handle_mm_fault
  0....0 62811.755406: function: handle_pte_fault

  and stay in the loop until the RT throttling gets activated. One of
  the faulting addresses was in code (after returning from a syscall),
  a second one on the stack (inside put_user right before a syscall
  ends); both were surely mapped.

- After the RT throttler activates, the problem somehow magically
  fixes itself, probably (not verified) because another _process_ gets
  scheduled. When throttled, the RR and FF threads are not allowed to
  run for a while (20 ms in my configuration). The livelock lasts
  around 1-3 seconds, and there is a SCHED_OTHER process that runs
  every 2 seconds.

- Kernel threads with a higher priority than the faulting one
  (linux-rt irq threads) run normally. A higher-priority user thread
  from the same process gets scheduled and then enters the same
  faulting loop.
- In ps -o min_flt,maj_flt the number of minor page faults for the
  offending thread skyrockets to hundreds of thousands (normally it
  stays at zero, as everything is already mapped when the thread
  starts).

- The code in handle_pte_fault proceeds through the

	entry = pte_mkyoung(entry);

  line and the following ptep_set_access_flags returns zero.

- The livelock is extremely timing sensitive - different workloads
  cause it not to happen at all or only far later.

- I was able to make it happen a bit faster (once per ~4 hours) with
  the rt thread repeatedly causing the kernel to try to invoke
  modprobe to load a missing module - so there is a load of kworkers
  launching modprobes. (In case anyone wonders how that can happen:
  it was a bug in our application - an invalid level passed to
  setsockopt caused a search for a TCP congestion module instead of
  setting SO_LINGER; see the sketch at the end of this mail.)

- The symptoms are similar to
  http://lkml.indiana.edu/hypermail/linux/kernel/1103.0/01364.html
  which got fixed by https://lkml.org/lkml/2011/3/15/516, but that fix
  does not apply to the processors in question.

- The patch below _seems_ to fix it, or at least massively delay it -
  the testcase now runs for 2.5 days instead of 4 hours. I doubt it is
  the proper patch (it brutally reloads CR3 every time a thread with a
  userspace mapping is switched to). I just got the suspicion that
  there is some way the kernel forgets to update the memory mapping
  when going from a userspace thread through some kernel ones back to
  another userspace one, and tried to make sure the mapping is always
  reloaded.

- The whole history starts at
  http://www.spinics.net/lists/linux-rt-users/msg09758.html
  I originally thought the problem was in timerfd and hunted for it in
  several places until I learned to use the tracing infrastructure and
  started to pin it down with trace prints etc :)

- A trace file of the hang is at
  http://www.meduna.org/tmp/trace.mmfaulthang.dat.gz

Does this ring a bell with someone?

Thanks
  Stano


diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 6902152..3d54a15 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -54,21 +54,23 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 		if (unlikely(prev->context.ldt != next->context.ldt))
 			load_LDT_nolock(&next->context);
 	}
-#ifdef CONFIG_SMP
 	else {
+#ifdef CONFIG_SMP
 		percpu_write(cpu_tlbstate.state, TLBSTATE_OK);
 		BUG_ON(percpu_read(cpu_tlbstate.active_mm) != next);
 
 		if (!cpumask_test_and_set_cpu(cpu, mm_cpumask(next))) {
+#endif
 			/* We were in lazy tlb mode and leave_mm disabled
 			 * tlb flush IPI delivery. We must reload CR3
 			 * to make sure to use no freed page tables.
 			 */
 			load_cr3(next->pgd);
 			load_LDT_nolock(&next->context);
+#ifdef CONFIG_SMP
 		}
-	}
 #endif
+	}
 }
 
 #define activate_mm(prev, next)
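
PS: In case the setsockopt mixup mentioned above sounds too exotic,
here is a minimal reconstructed sketch (not our actual application
code, just the shape of the bug). On x86 SO_LINGER and TCP_CONGESTION
happen to share the numeric value 13, so passing IPPROTO_TCP instead
of SOL_SOCKET as the level turns a linger request into an attempt to
switch the congestion control algorithm, and the kernel (given
CAP_NET_ADMIN) tries to modprobe a tcp_* module named after whatever
bytes the struct linger happens to contain.

/*
 * Reconstructed sketch of the application bug - not our real code.
 * The wrong level makes optname 13 mean TCP_CONGESTION instead of
 * SO_LINGER, so the kernel looks for (and tries to load) a congestion
 * control module instead of setting the linger option.
 */
#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

int main(void)
{
	int fd = socket(AF_INET, SOCK_STREAM, 0);
	struct linger lg = { .l_onoff = 1, .l_linger = 0 };

	if (fd < 0)
		return 1;

	/* intended call:
	 * setsockopt(fd, SOL_SOCKET, SO_LINGER, &lg, sizeof(lg)); */

	/* buggy call: wrong level, so this ends up in the TCP_CONGESTION
	 * handler and triggers a modprobe of a nonexistent tcp_* module */
	if (setsockopt(fd, IPPROTO_TCP, SO_LINGER, &lg, sizeof(lg)) < 0)
		perror("setsockopt");

	return 0;
}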