* TLB and PTE coherency during munmap
From: Max Filippov
Date: 2013-05-26 2:42 UTC
To: linux-arch, linux-mm, linux-xtensa
Cc: Chris Zankel, Marc Gauthier

Hello arch and mm people.

Is it intentional that threads of a process that invoked the munmap syscall
can see TLB entries pointing to already freed pages, or is it a bug?

I'm talking about zap_pmd_range and zap_pte_range:

  zap_pmd_range
    zap_pte_range
      arch_enter_lazy_mmu_mode
      ptep_get_and_clear_full
      tlb_remove_tlb_entry
      __tlb_remove_page
      arch_leave_lazy_mmu_mode
    cond_resched

With the default arch_{enter,leave}_lazy_mmu_mode, tlb_remove_tlb_entry
and __tlb_remove_page there is a loop in zap_pte_range that clears PTEs
and frees the corresponding pages but doesn't flush the TLB, and a
surrounding loop in zap_pmd_range that calls cond_resched. If a thread of
the same process gets scheduled at that point, it is able to see TLB
entries pointing to already freed physical pages.

I noticed this on the xtensa arch after I added a check, run before
returning to userspace, that the TLB contents agree with the page tables
of the current mm. This check reliably fires with the LTP test mtest05,
which maps, unmaps and accesses memory from multiple threads.

Is there anything wrong in my description, maybe something specific to my
arch, or does this issue really exist? I've also noticed that a lot of
arches use the default implementations of the involved functions; does
that mean that any/all of them have this issue too?

-- 
Thanks.
-- Max
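For illustration, a much-simplified userspace analogue of the LTP mtest05/mmstress
workload referenced above (hypothetical test code, not the LTP source): a few
threads repeatedly map, touch and unmap anonymous memory in one address space,
which is the pattern that exposes stale TLB entries if the kernel frees pages
before flushing the TLB. It only generates the racy workload; the actual
detection in this thread is done by the kernel-side checks.

#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define NTHREADS 4
#define MAPSIZE  (64 * 4096)

/* Each thread repeatedly maps a region, faults every page in by writing
 * to it, and unmaps it again, so map/unmap and faults keep racing within
 * a single mm, similar in spirit to mmstress test1/test2. */
static void *stress(void *arg)
{
	for (int i = 0; i < 10000; i++) {
		char *p = mmap(NULL, MAPSIZE, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (p == MAP_FAILED)
			exit(1);
		memset(p, 0xa5, MAPSIZE);	/* touch every page */
		munmap(p, MAPSIZE);
	}
	return NULL;
}

int main(void)
{
	pthread_t t[NTHREADS];

	for (int i = 0; i < NTHREADS; i++)
		pthread_create(&t[i], NULL, stress, NULL);
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(t[i], NULL);
	return 0;
}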
* Re: TLB and PTE coherency during munmap
From: Max Filippov
Date: 2013-05-28 7:10 UTC
To: Peter Zijlstra, KAMEZAWA Hiroyuki, linux-arch, linux-mm
Cc: Ralf Baechle, Chris Zankel, Marc Gauthier, linux-xtensa, Hugh Dickins

On Sun, May 26, 2013 at 6:50 AM, Max Filippov <jcmvbkbc@gmail.com> wrote:
> Hello arch and mm people.
>
> Is it intentional that threads of a process that invoked munmap syscall
> can see TLB entries pointing to already freed pages, or it is a bug?
>
> I'm talking about zap_pmd_range and zap_pte_range:
>
> zap_pmd_range
>   zap_pte_range
>     arch_enter_lazy_mmu_mode
>     ptep_get_and_clear_full
>     tlb_remove_tlb_entry
>     __tlb_remove_page
>     arch_leave_lazy_mmu_mode
>   cond_resched
>
> With the default arch_{enter,leave}_lazy_mmu_mode, tlb_remove_tlb_entry
> and __tlb_remove_page there is a loop in the zap_pte_range that clears
> PTEs and frees corresponding pages, but doesn't flush TLB, and
> surrounding loop in the zap_pmd_range that calls cond_resched. If a thread
> of the same process gets scheduled then it is able to see TLB entries
> pointing to already freed physical pages.
>
> I've noticed that with xtensa arch when I added a test before returning to
> userspace checking that TLB contents agrees with page tables of the
> current mm. This check reliably fires with the LTP test mtest05 that
> maps, unmaps and accesses memory from multiple threads.
>
> Is there anything wrong in my description, maybe something specific to
> my arch, or this issue really exists?

Hi,

I've made a similar checking function for MIPS (because qemu is my only
choice and it simulates the MIPS TLB) and ran my tests on the mips-malta
machine in qemu. With MIPS I can also see this issue. I hope I did it
right; the patch at the bottom is for reference. The test I run and the
diagnostic output are as follows:

# ./runltp -p -q -T 100 -s mtest05
...
mmstress    0  TINFO  :  test2: Test case tests the race condition between simultaneous write faults in the same address space.
[  439.010000] 14: 70d68000: 03178000/00000000
mmstress    2  TPASS  :  TEST 2 Passed
...
mmstress    0  TINFO  :  test2: Test case tests the race condition between simultaneous write faults in the same address space.
[  947.390000] 10: 6f9d2000: 03639000/00000000
[  947.390000] 10: 6f9d3000: 03638000/00000000
mmstress    2  TPASS  :  TEST 2 Passed
...
mmstress    0  TINFO  :  test1: Test case tests the race condition between simultaneous read faults in the same address space.
[ 1922.680000] 10: 68e12000: 03b59000/00000000
[ 1922.680000] 10: 68e13000: 03b58000/00000000
mmstress    1  TPASS  :  TEST 1 Passed
...

To me it looks like the cond_resched in zap_pmd_range is the root cause
of this issue (leaving the SMP case aside for now). It was introduced in
the commit

commit 97a894136f29802da19a15541de3c019e1ca147e
Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date:   Tue May 24 17:12:04 2011 -0700

    mm: Remove i_mmap_lock lockbreak

Peter, Kamezawa, other reviewers of that commit, could you please comment?

------8<------
From 29324d01c0c95b11fbfe509d9b0eae42cab6d380 Mon Sep 17 00:00:00 2001
From: Max Filippov <jcmvbkbc@gmail.com>
Date: Tue, 28 May 2013 08:06:03 +0400
Subject: [PATCH] mips: check TLB sanity on return to userspace

Signed-off-by: Max Filippov <jcmvbkbc@gmail.com>
---
 arch/mips/kernel/entry.S |    1 +
 arch/mips/lib/dump_tlb.c |   81 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 82 insertions(+), 0 deletions(-)

diff --git a/arch/mips/kernel/entry.S b/arch/mips/kernel/entry.S
index e578685..82881ac 100644
--- a/arch/mips/kernel/entry.S
+++ b/arch/mips/kernel/entry.S
@@ -49,6 +49,7 @@ resume_userspace:
 	local_irq_disable	# make sure we dont miss an
 				# interrupt setting need_resched
 				# between sampling and return
+	jal	check_tlb_all
 	LONG_L	a2, TI_FLAGS($28)	# current->work
 	andi	t0, a2, _TIF_WORK_MASK	# (ignoring syscall_trace)
 	bnez	t0, work_pending
diff --git a/arch/mips/lib/dump_tlb.c b/arch/mips/lib/dump_tlb.c
index 32b9f21..629e38b 100644
--- a/arch/mips/lib/dump_tlb.c
+++ b/arch/mips/lib/dump_tlb.c
@@ -6,6 +6,8 @@
  */
 #include <linux/kernel.h>
 #include <linux/mm.h>
+#include <linux/sched.h>
+#include <linux/mm_types.h>
 
 #include <asm/mipsregs.h>
 #include <asm/page.h>
@@ -111,3 +113,82 @@ void dump_tlb_all(void)
 {
 	dump_tlb(0, current_cpu_data.tlbsize - 1);
 }
+
+static inline pte_t *get_pte_for_vaddr(unsigned vaddr)
+{
+	struct task_struct *task = get_current();
+	struct mm_struct *mm = task->mm;
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte;
+
+	if (!mm)
+		mm = task->active_mm;
+	pgd = pgd_offset(mm, vaddr);
+	if (pgd_none_or_clear_bad(pgd))
+		return NULL;
+	pud = pud_offset(pgd, vaddr);
+	if (pud_none_or_clear_bad(pud))
+		return NULL;
+	pmd = pmd_offset(pud, vaddr);
+	if (pmd_none_or_clear_bad(pmd))
+		return NULL;
+	pte = pte_offset_map(pmd, vaddr);
+	return pte;
+}
+
+static inline void check_tlb_entry(unsigned i, unsigned long va,
+		unsigned long long entrylo)
+{
+	if (entrylo & 2) {
+		pte_t *pte = get_pte_for_vaddr(va);
+		unsigned long pte_pa = pte ? (pte_pfn(*pte) << PAGE_SHIFT) : 0;
+		unsigned long tlb_pa = (entrylo << 6) & PAGE_MASK;
+		if (pte_pa != tlb_pa)
+			pr_err("%d: %08lx: %08lx/%08lx\n",
+					i, va, tlb_pa, pte_pa);
+	}
+}
+static void check_tlb(unsigned last)
+{
+	unsigned long s_entryhi, asid;
+	unsigned int s_index, s_pagemask, i;
+	struct {
+		unsigned long entryhi;
+		unsigned long entrylo0, entrylo1;
+	} c[64];
+
+	BUG_ON(last > ARRAY_SIZE(c));
+	s_pagemask = read_c0_pagemask();
+	s_entryhi = read_c0_entryhi();
+	s_index = read_c0_index();
+	asid = s_entryhi & 0xff;
+
+	for (i = 0; i < last; ++i) {
+		write_c0_index(i);
+		BARRIER();
+		tlb_read();
+		BARRIER();
+		c[i].entryhi = read_c0_entryhi();
+		c[i].entrylo0 = read_c0_entrylo0();
+		c[i].entrylo1 = read_c0_entrylo1();
+	}
+	for (i = 0; i < last; ++i) {
+		/* Unused entries have a virtual address of CKSEG0. */
+		if ((c[i].entryhi & ~0x1ffffUL) != CKSEG0
+				&& (c[i].entryhi & 0xff) == asid) {
+			unsigned long va = c[i].entryhi & ~0x1fffUL;
+			check_tlb_entry(i, va, c[i].entrylo0);
+			check_tlb_entry(i, va + PAGE_SIZE, c[i].entrylo1);
+		}
+	}
+	write_c0_entryhi(s_entryhi);
+	write_c0_index(s_index);
+	write_c0_pagemask(s_pagemask);
+}
+
+void check_tlb_all(void)
+{
+	check_tlb(current_cpu_data.tlbsize);
+}
-- 
1.7.7.6

-- 
Thanks.
-- Max
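To read the diagnostic lines quoted above against the pr_err() format used in
check_tlb_entry() in the patch, one example line decoded (annotation added here
for clarity, not part of the original mail):

/* "%d: %08lx: %08lx/%08lx\n"  ->  i: va: tlb_pa/pte_pa
 *
 *   14: 70d68000: 03178000/00000000
 *    |      |         |        `-- pte_pa: 0, the PTE has already been cleared
 *    |      |         `----------- tlb_pa: the TLB still maps phys 0x03178000
 *    |      `--------------------- va:     userspace virtual address
 *    `---------------------------- i:      TLB entry index
 *
 * i.e. the TLB entry is still live after the page behind it was unmapped
 * and freed.
 */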
* Re: TLB and PTE coherency during munmap
From: Peter Zijlstra
Date: 2013-05-29 12:27 UTC
To: Max Filippov
Cc: KAMEZAWA Hiroyuki, linux-arch, linux-mm, Ralf Baechle, Chris Zankel, Marc Gauthier, linux-xtensa, Hugh Dickins

On Tue, May 28, 2013 at 11:10:25AM +0400, Max Filippov wrote:
> On Sun, May 26, 2013 at 6:50 AM, Max Filippov <jcmvbkbc@gmail.com> wrote:
> > Hello arch and mm people.
> >
> > Is it intentional that threads of a process that invoked munmap syscall
> > can see TLB entries pointing to already freed pages, or it is a bug?
> >
> > I'm talking about zap_pmd_range and zap_pte_range:
> >
> > zap_pmd_range
> >   zap_pte_range
> >     arch_enter_lazy_mmu_mode
> >     ptep_get_and_clear_full
> >     tlb_remove_tlb_entry
> >     __tlb_remove_page
> >     arch_leave_lazy_mmu_mode
> >   cond_resched
> >
> > With the default arch_{enter,leave}_lazy_mmu_mode, tlb_remove_tlb_entry
> > and __tlb_remove_page there is a loop in the zap_pte_range that clears
> > PTEs and frees corresponding pages, but doesn't flush TLB, and
> > surrounding loop in the zap_pmd_range that calls cond_resched. If a thread
> > of the same process gets scheduled then it is able to see TLB entries
> > pointing to already freed physical pages.
> >
> > I've noticed that with xtensa arch when I added a test before returning to
> > userspace checking that TLB contents agrees with page tables of the
> > current mm. This check reliably fires with the LTP test mtest05 that
> > maps, unmaps and accesses memory from multiple threads.
> >
> > Is there anything wrong in my description, maybe something specific to
> > my arch, or this issue really exists?
>
> Hi,
>
> I've made similar checking function for MIPS (because qemu is my only choice
> and it simulates MIPS TLB) and ran my tests on mips-malta machine in qemu.
> With MIPS I can also see this issue. I hope I did it right, the patch at the
> bottom is for the reference. The test I run and the diagnostic output are as
> follows:
>
> To me it looks like the cond_resched in the zap_pmd_range is the root cause
> of this issue (let alone SMP case for now). It was introduced in the commit
>
> commit 97a894136f29802da19a15541de3c019e1ca147e
> Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Date:   Tue May 24 17:12:04 2011 -0700
>
>     mm: Remove i_mmap_lock lockbreak
>
> Peter, Kamezawa, other reviewers of that commit, could you please comment?

Are you all running UP systems? I suppose the preemptible muck
invalidated the assumption that UP systems are 'easy'.

If you make tlb_fast_mode() return an unconditional false, does it all
work again?
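For clarity, the experiment suggested here amounts to the following throwaway
change to the helper in include/asm-generic/tlb.h (the unmodified helper is
visible in the patch later in this thread); a debugging hack to test the
theory, not a proposed fix:

static inline int tlb_fast_mode(struct mmu_gather *tlb)
{
	/* Always report "slow" mode: pages are then batched in the
	 * mmu_gather and freed only after the TLB flush, even on UP. */
	return 0;
}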
* Re: TLB and PTE coherency during munmap
From: Vineet Gupta
Date: 2013-05-29 12:42 UTC
To: Peter Zijlstra
Cc: Max Filippov, KAMEZAWA Hiroyuki, linux-arch, linux-mm, Ralf Baechle, Chris Zankel, Marc Gauthier, linux-xtensa, Hugh Dickins

On 05/29/2013 05:57 PM, Peter Zijlstra wrote:
> On Tue, May 28, 2013 at 11:10:25AM +0400, Max Filippov wrote:
>> On Sun, May 26, 2013 at 6:50 AM, Max Filippov <jcmvbkbc@gmail.com> wrote:
>>> Hello arch and mm people.
>>>
>>> Is it intentional that threads of a process that invoked munmap syscall
>>> can see TLB entries pointing to already freed pages, or it is a bug?
>>>
>>> I'm talking about zap_pmd_range and zap_pte_range:
>>>
>>> zap_pmd_range
>>>   zap_pte_range
>>>     arch_enter_lazy_mmu_mode
>>>     ptep_get_and_clear_full
>>>     tlb_remove_tlb_entry
>>>     __tlb_remove_page
>>>     arch_leave_lazy_mmu_mode
>>>   cond_resched
>>>
>>> With the default arch_{enter,leave}_lazy_mmu_mode, tlb_remove_tlb_entry
>>> and __tlb_remove_page there is a loop in the zap_pte_range that clears
>>> PTEs and frees corresponding pages, but doesn't flush TLB, and
>>> surrounding loop in the zap_pmd_range that calls cond_resched. If a thread
>>> of the same process gets scheduled then it is able to see TLB entries
>>> pointing to already freed physical pages.
>>>
>>> I've noticed that with xtensa arch when I added a test before returning to
>>> userspace checking that TLB contents agrees with page tables of the
>>> current mm. This check reliably fires with the LTP test mtest05 that
>>> maps, unmaps and accesses memory from multiple threads.
>>>
>>> Is there anything wrong in my description, maybe something specific to
>>> my arch, or this issue really exists?
>>
>> Hi,
>>
>> I've made similar checking function for MIPS (because qemu is my only choice
>> and it simulates MIPS TLB) and ran my tests on mips-malta machine in qemu.
>> With MIPS I can also see this issue. I hope I did it right, the patch at the
>> bottom is for the reference. The test I run and the diagnostic output are as
>> follows:
>>
>> To me it looks like the cond_resched in the zap_pmd_range is the root cause
>> of this issue (let alone SMP case for now). It was introduced in the commit
>>
>> commit 97a894136f29802da19a15541de3c019e1ca147e
>> Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
>> Date:   Tue May 24 17:12:04 2011 -0700
>>
>>     mm: Remove i_mmap_lock lockbreak
>>
>> Peter, Kamezawa, other reviewers of that commit, could you please comment?
>
> Are you all running UP systems? I suppose the preemptible muck
> invalidated the assumption that UP systems are 'easy'.
>
> If you make tlb_fast_mode() return an unconditional false, does it all
> work again?

It seems tlb_fast_mode() only affects the page free batching and won't affect
the TLB flushes themselves, unless of course the batching runs out of space.

FWIW, prior to your commit d16dfc550f5326 "mm: mmu_gather rework" there was a
tlb_finish_mmu() right before the need_resched(), which would have handled the
current situation. My proposal (please see my earlier email in this thread) is
to reuse the force_flush logic in zap_pte_range() to do this.

-Vineet
* Re: TLB and PTE coherency during munmap
From: Peter Zijlstra
Date: 2013-05-29 12:47 UTC
To: Vineet Gupta
Cc: Max Filippov, KAMEZAWA Hiroyuki, linux-arch, linux-mm, Ralf Baechle, Chris Zankel, Marc Gauthier, linux-xtensa, Hugh Dickins

On Wed, May 29, 2013 at 06:12:15PM +0530, Vineet Gupta wrote:
> It seems tlb_fast_mode() only affects the page free batching and won't affect the
> TLB flush themselves unless ofcourse the batching runs out of space.

It does, it will keep the pages around until after a flush. So you'll no
longer free pages before having done a TLB flush.

> FWIW, prior to your commit d16dfc550f5326 "mm: mmu_gather rework"
> tlb_finish_mmu() right before the need_resced() which would have handled the
> current situation. My proposal - please see my earlier email in thread is to reuse
> the force_flush logic in zap_pte_range() to do this.

I don't have earlier emails as my lkml folder is somewhat broken atm :/
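Schematically, what the non-fast (batched) mode buys, per the reply above; the
function below is an illustrative outline only, and release_gathered_pages() is
a made-up name standing in for the real page-freeing step of tlb_flush_mmu():

/* Outline of a batched flush: the TLB is invalidated first, and only
 * afterwards are the pages gathered in the mmu_gather returned to the
 * allocator, so no thread can use a TLB entry naming a freed page. */
static void batched_flush_outline(struct mmu_gather *tlb)
{
	/* 1. invalidate stale TLB entries */
	tlb_flush(tlb);
	/* 2. only now free the batched pages (hypothetical helper name) */
	release_gathered_pages(tlb);
}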
* Re: TLB and PTE coherency during munmap
From: Peter Zijlstra
Date: 2013-05-29 17:51 UTC
To: Vineet Gupta
Cc: Max Filippov, KAMEZAWA Hiroyuki, linux-arch, linux-mm, Ralf Baechle, Chris Zankel, Marc Gauthier, linux-xtensa, Hugh Dickins

What about something like this?

---
 include/asm-generic/tlb.h |   11 ++++++++++-
 mm/memory.c               |   17 ++++++++++++++++-
 2 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index b1b1fa6..651b1cf 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -116,6 +116,7 @@ struct mmu_gather {
 
 static inline int tlb_fast_mode(struct mmu_gather *tlb)
 {
+#ifndef CONFIG_PREEMPT
 #ifdef CONFIG_SMP
 	return tlb->fast_mode;
 #else
@@ -124,7 +125,15 @@ static inline int tlb_fast_mode(struct mmu_gather *tlb)
 	 * and page free order so much..
 	 */
 	return 1;
-#endif
+#endif /* CONFIG_SMP */
+#else /* CONFIG_PREEMPT */
+	/*
+	 * Since mmu_gather is preemptible, preemptible kernels are like SMP
+	 * kernels, we must batch to make sure we invalidate TLBs before we
+	 * free the pages.
+	 */
+	return 0;
+#endif /* CONFIG_PREEMPT */
 }
 
 void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, bool fullmm);
diff --git a/mm/memory.c b/mm/memory.c
index 6dc1882..e915af2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -384,6 +384,21 @@ void tlb_remove_table(struct mmu_gather *tlb, void *table)
 
 #endif /* CONFIG_HAVE_RCU_TABLE_FREE */
 
+static inline void cond_resched_tlb(struct mmu_gather *tlb)
+{
+#ifndef CONFIG_PREEMPT
+	/*
+	 * For full preempt kernels we must do regular batching like
+	 * SMP, see tlb_fast_mode(). For !PREEMPT we can 'cheat' and
+	 * do a flush before our voluntary 'yield'.
+	 */
+	if (need_resched()) {
+		tlb_flush_mmu(tlb);
+		cond_resched();
+	}
+#endif
+}
+
 /*
  * If a p?d_bad entry is found while walking page tables, report
  * the error, before resetting entry to p?d_none. Usually (but
@@ -1264,7 +1279,7 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
 			goto next;
 		next = zap_pte_range(tlb, vma, pmd, addr, next, details);
 next:
-		cond_resched();
+		cond_resched_tlb(tlb);
 	} while (pmd++, addr = next, addr != end);
 
 	return addr;
* Re: TLB and PTE coherency during munmap
From: Catalin Marinas
Date: 2013-05-29 22:04 UTC
To: Peter Zijlstra
Cc: Vineet Gupta, Max Filippov, KAMEZAWA Hiroyuki, linux-arch, linux-mm, Ralf Baechle, Chris Zankel, Marc Gauthier, linux-xtensa, Hugh Dickins

On 29 May 2013 18:51, Peter Zijlstra <peterz@infradead.org> wrote:
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -384,6 +384,21 @@ void tlb_remove_table(struct mmu_gather *tlb, void *table)
>
>  #endif /* CONFIG_HAVE_RCU_TABLE_FREE */
>
> +static inline void cond_resched_tlb(struct mmu_gather *tlb)
> +{
> +#ifndef CONFIG_PREEMPT
> +	/*
> +	 * For full preempt kernels we must do regular batching like
> +	 * SMP, see tlb_fast_mode(). For !PREEMPT we can 'cheat' and
> +	 * do a flush before our voluntary 'yield'.
> +	 */
> +	if (need_resched()) {
> +		tlb_flush_mmu(tlb);
> +		cond_resched();
> +	}
> +#endif
> +}

Does it matter that in the CONFIG_PREEMPT case, you no longer call
cond_resched()? I guess we can just rely on the kernel full preemption
to reschedule as needed.

-- 
Catalin
* Re: TLB and PTE coherency during munmap
From: Peter Zijlstra
Date: 2013-05-30 6:48 UTC
To: Catalin Marinas
Cc: Vineet Gupta, Max Filippov, KAMEZAWA Hiroyuki, linux-arch, linux-mm, Ralf Baechle, Chris Zankel, Marc Gauthier, linux-xtensa, Hugh Dickins

On Wed, May 29, 2013 at 11:04:35PM +0100, Catalin Marinas wrote:
> On 29 May 2013 18:51, Peter Zijlstra <peterz@infradead.org> wrote:
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -384,6 +384,21 @@ void tlb_remove_table(struct mmu_gather *tlb, void *table)
> >
> >  #endif /* CONFIG_HAVE_RCU_TABLE_FREE */
> >
> > +static inline void cond_resched_tlb(struct mmu_gather *tlb)
> > +{
> > +#ifndef CONFIG_PREEMPT
> > +	/*
> > +	 * For full preempt kernels we must do regular batching like
> > +	 * SMP, see tlb_fast_mode(). For !PREEMPT we can 'cheat' and
> > +	 * do a flush before our voluntary 'yield'.
> > +	 */
> > +	if (need_resched()) {
> > +		tlb_flush_mmu(tlb);
> > +		cond_resched();
> > +	}
> > +#endif
> > +}
>
> Does it matter that in the CONFIG_PREEMPT case, you no longer call
> cond_resched()? I guess we can just rely on the kernel full preemption
> to reschedule as needed.

Exactly, the preempt_enable from the spin_unlock in pte_unmap_unlock()
will most likely immediately trigger a preemption. And since we do full
batching for PREEMPT, doing extra flushes would be detrimental to
performance -- however unlikely it is that we'll still see the
need_resched() there.
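For reference, the chain being relied on here, written out schematically
(paraphrased from the generic locking code, not quoted from this thread;
details vary by kernel version):

/* With CONFIG_PREEMPT=y:
 *
 *   pte_unmap_unlock(pte, ptl)
 *     -> spin_unlock(ptl)
 *          -> preempt_enable()
 *               -> preempt_schedule()  if need_resched() became true while
 *                                      the PTE lock held preemption off
 *
 * so the voluntary cond_resched() in zap_pmd_range() adds nothing there.
 */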
* Re: TLB and PTE coherency during munmap
From: Vineet Gupta
Date: 2013-05-30 5:04 UTC
To: Peter Zijlstra
Cc: Max Filippov, KAMEZAWA Hiroyuki, linux-arch, linux-mm, Ralf Baechle, Chris Zankel, Marc Gauthier, linux-xtensa, Hugh Dickins

On 05/29/2013 11:21 PM, Peter Zijlstra wrote:
> What about something like this?
>
> ---
>  include/asm-generic/tlb.h |   11 ++++++++++-
>  mm/memory.c               |   17 ++++++++++++++++-
>  2 files changed, 26 insertions(+), 2 deletions(-)
>
> diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
> index b1b1fa6..651b1cf 100644
> --- a/include/asm-generic/tlb.h
> +++ b/include/asm-generic/tlb.h
> @@ -116,6 +116,7 @@ struct mmu_gather {
>
>  static inline int tlb_fast_mode(struct mmu_gather *tlb)
>  {
> +#ifndef CONFIG_PREEMPT
>  #ifdef CONFIG_SMP
>  	return tlb->fast_mode;
>  #else
> @@ -124,7 +125,15 @@ static inline int tlb_fast_mode(struct mmu_gather *tlb)
>  	 * and page free order so much..
>  	 */
>  	return 1;
> -#endif
> +#endif /* CONFIG_SMP */
> +#else /* CONFIG_PREEMPT */
> +	/*
> +	 * Since mmu_gather is preemptible, preemptible kernels are like SMP
> +	 * kernels, we must batch to make sure we invalidate TLBs before we
> +	 * free the pages.
> +	 */
> +	return 0;
> +#endif /* CONFIG_PREEMPT */
>  }

So this adds the page batching logic to small/simpler UP systems - but it's
a necessary evil :-(

>  void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, bool fullmm);
> diff --git a/mm/memory.c b/mm/memory.c
> index 6dc1882..e915af2 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -384,6 +384,21 @@ void tlb_remove_table(struct mmu_gather *tlb, void *table)
>
>  #endif /* CONFIG_HAVE_RCU_TABLE_FREE */
>
> +static inline void cond_resched_tlb(struct mmu_gather *tlb)
> +{
> +#ifndef CONFIG_PREEMPT
> +	/*
> +	 * For full preempt kernels we must do regular batching like
> +	 * SMP, see tlb_fast_mode(). For !PREEMPT we can 'cheat' and
> +	 * do a flush before our voluntary 'yield'.
> +	 */
> +	if (need_resched()) {

This is really neat: without this check, a @fullmm flush (exit/execve) would
have suffered multiple full TLB flushes in the loop; now you do that only if
rescheduling is needed, meaning only in the case where we have the potential
race condition that Max was seeing. Cool!

> +		tlb_flush_mmu(tlb);
> +		cond_resched();
> +	}
> +#endif
> +}
> +
>  /*
>   * If a p?d_bad entry is found while walking page tables, report
>   * the error, before resetting entry to p?d_none. Usually (but
> @@ -1264,7 +1279,7 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
>  			goto next;
>  		next = zap_pte_range(tlb, vma, pmd, addr, next, details);
>  next:
> -		cond_resched();
> +		cond_resched_tlb(tlb);
>  	} while (pmd++, addr = next, addr != end);
>
>  	return addr;

BTW, since we are on the topic, it seems that we are missing a tlb_fast_mode()
check in one spot - unless it is tied to the rcu table free stuff.

-------------->
From: Vineet Gupta <vgupta@synopsys.com>
Date: Thu, 30 May 2013 10:25:30 +0530
Subject: [PATCH] mm: tlb_fast_mode check missing in tlb_finish_mmu()

Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
---
 mm/memory.c |    3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index d9d5fd9..569ffe1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -269,6 +269,9 @@ void tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long e
 	/* keep the page table cache within bounds */
 	check_pgt_cache();
 
+	if (tlb_fast_mode(tlb))
+		return;
+
 	for (batch = tlb->local.next; batch; batch = next) {
 		next = batch->next;
 		free_pages((unsigned long)batch, 0);
-- 
1.7.10.4
* Re: TLB and PTE coherency during munmap
From: Peter Zijlstra
Date: 2013-05-30 6:56 UTC
To: Vineet Gupta
Cc: Max Filippov, KAMEZAWA Hiroyuki, linux-arch, linux-mm, Ralf Baechle, Chris Zankel, Marc Gauthier, linux-xtensa, Hugh Dickins

On Thu, May 30, 2013 at 10:34:53AM +0530, Vineet Gupta wrote:
> On 05/29/2013 11:21 PM, Peter Zijlstra wrote:
> > What about something like this?
> >
> > ---
> >  include/asm-generic/tlb.h |   11 ++++++++++-
> >  mm/memory.c               |   17 ++++++++++++++++-
> >  2 files changed, 26 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
> > index b1b1fa6..651b1cf 100644
> > --- a/include/asm-generic/tlb.h
> > +++ b/include/asm-generic/tlb.h
> > @@ -116,6 +116,7 @@ struct mmu_gather {
> >
> >  static inline int tlb_fast_mode(struct mmu_gather *tlb)
> >  {
> > +#ifndef CONFIG_PREEMPT
> >  #ifdef CONFIG_SMP
> >  	return tlb->fast_mode;
> >  #else
> > @@ -124,7 +125,15 @@ static inline int tlb_fast_mode(struct mmu_gather *tlb)
> >  	 * and page free order so much..
> >  	 */
> >  	return 1;
> > -#endif
> > +#endif /* CONFIG_SMP */
> > +#else /* CONFIG_PREEMPT */
> > +	/*
> > +	 * Since mmu_gather is preemptible, preemptible kernels are like SMP
> > +	 * kernels, we must batch to make sure we invalidate TLBs before we
> > +	 * free the pages.
> > +	 */
> > +	return 0;
> > +#endif /* CONFIG_PREEMPT */
> >  }
>
> So this adds the page batching logic to small/simpler UP systems - but it's
> necessary evil :-(

Well, only if you have CONFIG_PREEMPT=y, at which point UP isn't really
_that_ simple anymore.

> >  void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, bool fullmm);
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 6dc1882..e915af2 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -384,6 +384,21 @@ void tlb_remove_table(struct mmu_gather *tlb, void *table)
> >
> >  #endif /* CONFIG_HAVE_RCU_TABLE_FREE */
> >
> > +static inline void cond_resched_tlb(struct mmu_gather *tlb)
> > +{
> > +#ifndef CONFIG_PREEMPT
> > +	/*
> > +	 * For full preempt kernels we must do regular batching like
> > +	 * SMP, see tlb_fast_mode(). For !PREEMPT we can 'cheat' and
> > +	 * do a flush before our voluntary 'yield'.
> > +	 */
> > +	if (need_resched()) {
>
> This is really neat: w/o this check, a @fullmm flush (exit/execve) would have
> suffered multiple full TLB flushes in the loop, now you do that only if a
> scheduling was needed - meaning only in the case when we have the potential race
> condition which Max was seeing. Cool !

Glad you like it ;-)

> > +		tlb_flush_mmu(tlb);
> > +		cond_resched();
> > +	}
> > +#endif
> > +}
> > +
> >  /*
> >   * If a p?d_bad entry is found while walking page tables, report
> >   * the error, before resetting entry to p?d_none. Usually (but
> > @@ -1264,7 +1279,7 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
> >  			goto next;
> >  		next = zap_pte_range(tlb, vma, pmd, addr, next, details);
> >  next:
> > -		cond_resched();
> > +		cond_resched_tlb(tlb);
> >  	} while (pmd++, addr = next, addr != end);
> >
> >  	return addr;
>
> BTW, since we are on the topic, it seems that we are missing tlb_fast_mode() in
> one spot - unless it is tied to rcu table free stuff.
>
> -------------->
> From: Vineet Gupta <vgupta@synopsys.com>
> Date: Thu, 30 May 2013 10:25:30 +0530
> Subject: [PATCH] mm: tlb_fast_mode check missing in tlb_finish_mmu()
>
> Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
> ---
>  mm/memory.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index d9d5fd9..569ffe1 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -269,6 +269,9 @@ void tlb_finish_mmu(struct mmu_gather *tlb, unsigned long
> start, unsigned long e
>  	/* keep the page table cache within bounds */
>  	check_pgt_cache();
>
> +	if (tlb_fast_mode(tlb))
> +		return;
> +
>  	for (batch = tlb->local.next; batch; batch = next) {
>  		next = batch->next;
>  		free_pages((unsigned long)batch, 0);

Yes, I think that is possible. It would shrink the code a little when
fast_mode was unconditionally 1 -- i.e. simple UP ;-).
* Re: TLB and PTE coherency during munmap @ 2013-05-30 7:00 ` Vineet Gupta 0 siblings, 0 replies; 70+ messages in thread From: Vineet Gupta @ 2013-05-30 7:00 UTC (permalink / raw) To: Peter Zijlstra Cc: Max Filippov, KAMEZAWA Hiroyuki, linux-arch, linux-mm, Ralf Baechle, Chris Zankel, Marc Gauthier, linux-xtensa, Hugh Dickins On 05/30/2013 12:26 PM, Peter Zijlstra wrote: > On Thu, May 30, 2013 at 10:34:53AM +0530, Vineet Gupta wrote: >> On 05/29/2013 11:21 PM, Peter Zijlstra wrote: >> >> BTW, since we are on the topic, it seems that we are missing tlb_fast_mode() in >> one spot - unless it is tied to rcu table free stuff. >> >> --------------> >> From: Vineet Gupta <vgupta@synopsys.com> >> Date: Thu, 30 May 2013 10:25:30 +0530 >> Subject: [PATCH] mm: tlb_fast_mode check missing in tlb_finish_mmu() >> >> Signed-off-by: Vineet Gupta <vgupta@synopsys.com> >> --- >> mm/memory.c | 3 +++ >> 1 file changed, 3 insertions(+) >> >> diff --git a/mm/memory.c b/mm/memory.c >> index d9d5fd9..569ffe1 100644 >> --- a/mm/memory.c >> +++ b/mm/memory.c >> @@ -269,6 +269,9 @@ void tlb_finish_mmu(struct mmu_gather *tlb, unsigned long >> start, unsigned long e >> /* keep the page table cache within bounds */ >> check_pgt_cache(); >> >> + if (tlb_fast_mode(tlb)) >> + return; >> + >> for (batch = tlb->local.next; batch; batch = next) { >> next = batch->next; >> free_pages((unsigned long)batch, 0); > Yes I think that is possible. It would shrink the code a little when > fast_mode was unconditionally 1 -- ie. simple UP ;-). Exactly ! Can you please revert with a Reviewed-by/Acked-by so I can formally send it over to linux-mm list. Thx, -Vineet ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: TLB and PTE coherency during munmap @ 2013-05-30 11:03 ` Peter Zijlstra 0 siblings, 0 replies; 70+ messages in thread From: Peter Zijlstra @ 2013-05-30 11:03 UTC (permalink / raw) To: Vineet Gupta Cc: Max Filippov, KAMEZAWA Hiroyuki, linux-arch, linux-mm, Ralf Baechle, Chris Zankel, Marc Gauthier, linux-xtensa, Hugh Dickins On Thu, May 30, 2013 at 12:30:51PM +0530, Vineet Gupta wrote: > On 05/30/2013 12:26 PM, Peter Zijlstra wrote: > > On Thu, May 30, 2013 at 10:34:53AM +0530, Vineet Gupta wrote: > >> On 05/29/2013 11:21 PM, Peter Zijlstra wrote: > >> > >> BTW, since we are on the topic, it seems that we are missing tlb_fast_mode() in > >> one spot - unless it is tied to rcu table free stuff. > >> > >> --------------> > >> From: Vineet Gupta <vgupta@synopsys.com> > >> Date: Thu, 30 May 2013 10:25:30 +0530 > >> Subject: [PATCH] mm: tlb_fast_mode check missing in tlb_finish_mmu() > >> > >> Signed-off-by: Vineet Gupta <vgupta@synopsys.com> > >> --- > >> mm/memory.c | 3 +++ > >> 1 file changed, 3 insertions(+) > >> > >> diff --git a/mm/memory.c b/mm/memory.c > >> index d9d5fd9..569ffe1 100644 > >> --- a/mm/memory.c > >> +++ b/mm/memory.c > >> @@ -269,6 +269,9 @@ void tlb_finish_mmu(struct mmu_gather *tlb, unsigned long > >> start, unsigned long e > >> /* keep the page table cache within bounds */ > >> check_pgt_cache(); > >> > >> + if (tlb_fast_mode(tlb)) > >> + return; > >> + > >> for (batch = tlb->local.next; batch; batch = next) { > >> next = batch->next; > >> free_pages((unsigned long)batch, 0); > > Yes I think that is possible. It would shrink the code a little when > > fast_mode was unconditionally 1 -- ie. simple UP ;-). > > Exactly ! Can you please revert with a Reviewed-by/Acked-by so I can formally send > it over to linux-mm list. s/revert/reply/? Acked-by: Peter Zijlstra <peterz@infradead.org> ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: TLB and PTE coherency during munmap 2013-05-29 17:51 ` Peter Zijlstra @ 2013-05-31 4:09 ` Max Filippov -1 siblings, 0 replies; 70+ messages in thread From: Max Filippov @ 2013-05-31 4:09 UTC (permalink / raw) To: Peter Zijlstra Cc: Vineet Gupta, KAMEZAWA Hiroyuki, linux-arch, linux-mm, Ralf Baechle, Chris Zankel, Marc Gauthier, linux-xtensa, Hugh Dickins Hi Peter, On Wed, May 29, 2013 at 9:51 PM, Peter Zijlstra <peterz@infradead.org> wrote: > What about something like this? With that patch I still get mtest05 firing my TLB/PTE incoherency check in the UP PREEMPT_VOLUNTARY configuration. This happens after zap_pte_range completion in the end of unmap_region because of rescheduling called in the following call chain: unmap_region free_pgtables unlink_anon_vmas lock_anon_vma_root down_write might_sleep might_resched _cond_resched > > --- > include/asm-generic/tlb.h | 11 ++++++++++- > mm/memory.c | 17 ++++++++++++++++- > 2 files changed, 26 insertions(+), 2 deletions(-) > > diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h > index b1b1fa6..651b1cf 100644 > --- a/include/asm-generic/tlb.h > +++ b/include/asm-generic/tlb.h > @@ -116,6 +116,7 @@ struct mmu_gather { > > static inline int tlb_fast_mode(struct mmu_gather *tlb) > { > +#ifndef CONFIG_PREEMPT > #ifdef CONFIG_SMP > return tlb->fast_mode; > #else > @@ -124,7 +125,15 @@ static inline int tlb_fast_mode(struct mmu_gather *tlb) > * and page free order so much.. > */ > return 1; > -#endif > +#endif /* CONFIG_SMP */ > +#else /* CONFIG_PREEMPT */ > + /* > + * Since mmu_gather is preemptible, preemptible kernels are like SMP > + * kernels, we must batch to make sure we invalidate TLBs before we > + * free the pages. > + */ > + return 0; > +#endif /* CONFIG_PREEMPT */ > } > > void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, bool fullmm); > diff --git a/mm/memory.c b/mm/memory.c > index 6dc1882..e915af2 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -384,6 +384,21 @@ void tlb_remove_table(struct mmu_gather *tlb, void *table) > > #endif /* CONFIG_HAVE_RCU_TABLE_FREE */ > > +static inline void cond_resched_tlb(struct mmu_gather *tlb) > +{ > +#ifndef CONFIG_PREEMPT > + /* > + * For full preempt kernels we must do regular batching like > + * SMP, see tlb_fast_mode(). For !PREEMPT we can 'cheat' and > + * do a flush before our voluntary 'yield'. > + */ > + if (need_resched()) { > + tlb_flush_mmu(tlb); > + cond_resched(); > + } > +#endif > +} > + > /* > * If a p?d_bad entry is found while walking page tables, report > * the error, before resetting entry to p?d_none. Usually (but > @@ -1264,7 +1279,7 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb, > goto next; > next = zap_pte_range(tlb, vma, pmd, addr, next, details); > next: > - cond_resched(); > + cond_resched_tlb(tlb); > } while (pmd++, addr = next, addr != end); > > return addr; > > -- Thanks. -- Max ^ permalink raw reply [flat|nested] 70+ messages in thread
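For readers without LTP at hand, the map/unmap/access pattern that mtest05-style tests exercise can be sketched in user space roughly as below. This is an illustrative reproducer only, not code from this thread and not the actual LTP source; the incoherency itself is only observable with a kernel-side check such as the TLB/PTE consistency test Max describes, and every name in the sketch is made up for the example.

/*
 * Illustrative user-space sketch (assumed names, not from this thread).
 * One thread touches a mapping while another munmap()s it.
 * Build with: gcc -O2 -pthread reproducer.c
 */
#include <pthread.h>
#include <setjmp.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define MAP_LEN (64 * 4096)

static __thread sigjmp_buf fault_jmp;

static void segv_handler(int sig)
{
	(void)sig;
	siglongjmp(fault_jmp, 1);
}

static void *toucher(void *arg)
{
	volatile char *p = arg;
	int i;

	/* Faulting here is the correct outcome once the region is gone;
	 * the bug under discussion would be reading an already-freed page
	 * through a stale TLB entry without any fault at all. */
	if (sigsetjmp(fault_jmp, 1) == 0)
		for (i = 0; i < MAP_LEN; i += 4096)
			(void)p[i];
	return NULL;
}

int main(void)
{
	int iter;

	signal(SIGSEGV, segv_handler);

	for (iter = 0; iter < 10000; iter++) {
		pthread_t t;
		char *p = mmap(NULL, MAP_LEN, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (p == MAP_FAILED) {
			perror("mmap");
			return 1;
		}
		memset(p, 0xaa, MAP_LEN);

		pthread_create(&t, NULL, toucher, p); /* reader thread */
		munmap(p, MAP_LEN);                   /* races with the reader */
		pthread_join(t, NULL);
	}
	puts("done");
	return 0;
}

On a correctly behaving kernel the reader either sees the still-live mapping or takes a clean fault; it must never read stale data through a translation for a page that has already been handed back to the allocator.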
* Re: TLB and PTE coherency during munmap 2013-05-31 4:09 ` Max Filippov @ 2013-05-31 7:55 ` Peter Zijlstra -1 siblings, 0 replies; 70+ messages in thread From: Peter Zijlstra @ 2013-05-31 7:55 UTC (permalink / raw) To: Max Filippov Cc: Vineet Gupta, KAMEZAWA Hiroyuki, linux-arch, linux-mm, Ralf Baechle, Chris Zankel, Marc Gauthier, linux-xtensa, Hugh Dickins On Fri, May 31, 2013 at 08:09:17AM +0400, Max Filippov wrote: > Hi Peter, > > On Wed, May 29, 2013 at 9:51 PM, Peter Zijlstra <peterz@infradead.org> wrote: > > What about something like this? > > With that patch I still get mtest05 firing my TLB/PTE incoherency check > in the UP PREEMPT_VOLUNTARY configuration. This happens after > zap_pte_range completion in the end of unmap_region because of > rescheduling called in the following call chain: > > unmap_region > free_pgtables > unlink_anon_vmas > lock_anon_vma_root > down_write > might_sleep > might_resched > _cond_resched > Hurm, yeah. Catching all regular blocking primitives and making it maintainable is going to be a problem :/ I suppose the easiest thing is simply killing fast_mode for now. ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: TLB and PTE coherency during munmap @ 2013-06-03 9:05 ` Peter Zijlstra 0 siblings, 0 replies; 70+ messages in thread From: Peter Zijlstra @ 2013-06-03 9:05 UTC (permalink / raw) To: Max Filippov Cc: Vineet Gupta, KAMEZAWA Hiroyuki, linux-arch, linux-mm, Ralf Baechle, Chris Zankel, Marc Gauthier, linux-xtensa, Hugh Dickins, Thomas Gleixner, Linus Torvalds, Andrew Morton, Ingo Molnar On Fri, May 31, 2013 at 08:09:17AM +0400, Max Filippov wrote: > Hi Peter, > > On Wed, May 29, 2013 at 9:51 PM, Peter Zijlstra <peterz@infradead.org> wrote: > > What about something like this? > > With that patch I still get mtest05 firing my TLB/PTE incoherency check > in the UP PREEMPT_VOLUNTARY configuration. This happens after > zap_pte_range completion in the end of unmap_region because of > rescheduling called in the following call chain: OK, so there two options; completely kill off fast-mode or something like the below where we add magic to the scheduler :/ I'm aware people might object to something like the below -- but since its a possibility I thought we ought to at least mention it. For those new to the thread; the problem is that since the introduction of preemptible mmu_gather the traditional UP fast-mode is broken. Fast-mode is where we free the pages first and flush TLBs later. This is not a problem if there's no concurrency, but obviously if you can preempt there now is. I think I prefer completely killing off fast-mode esp. since UP seems to go the way of the Dodo and it does away with an exception in the mmu_gather code. Anyway; opinions? Linus, Thomas, Ingo? --- nclude/asm-generic/tlb.h | 19 +++++++++++++++---- include/linux/sched.h | 1 + kernel/sched/core.c | 9 +++++++++ mm/memory.c | 3 +++ 4 files changed, 28 insertions(+), 4 deletions(-) diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h index b1b1fa6..8d84154 100644 --- a/include/asm-generic/tlb.h +++ b/include/asm-generic/tlb.h @@ -116,15 +116,26 @@ struct mmu_gather { static inline int tlb_fast_mode(struct mmu_gather *tlb) { +#ifdef CONFIG_PREEMPT + /* + * We don't want to add to the schedule fast path for preemptible + * kernels; disable fast mode unconditionally. + */ + return 0; +#endif + #ifdef CONFIG_SMP + /* + * We can only use fast mode if there's a single CPU online; + * otherwise SMP might trip over stale TLB entries. + */ return tlb->fast_mode; -#else +#endif + /* - * For UP we don't need to worry about TLB flush - * and page free order so much.. + * Non-preempt UP can do fast mode unconditionally. */ return 1; -#endif } void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, bool fullmm); diff --git a/include/linux/sched.h b/include/linux/sched.h index 178a8d9..3dc6930 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1416,6 +1416,7 @@ struct task_struct { unsigned int sequential_io; unsigned int sequential_io_avg; #endif + struct mmu_gather *tlb; }; /* Future-safe accessor for struct task_struct's cpus_allowed. */ diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 36f85be..6829b78 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -2374,6 +2374,15 @@ static void __sched __schedule(void) struct rq *rq; int cpu; +#ifndef CONFIG_PREEMPT + /* + * We always force batched mmu_gather for preemptible kernels in order + * to minimize scheduling delays. See tlb_fast_mode(). 
+ */ + if (current->tlb && tlb_fast_mode(current->tlb)) + tlb_flush_mmu(current->tlb); +#endif + need_resched: preempt_disable(); cpu = smp_processor_id(); diff --git a/mm/memory.c b/mm/memory.c index d7d54a1..8925578 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -230,6 +230,7 @@ void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, bool fullmm) #ifdef CONFIG_HAVE_RCU_TABLE_FREE tlb->batch = NULL; #endif + current->tlb = tlb; } void tlb_flush_mmu(struct mmu_gather *tlb) @@ -274,6 +275,8 @@ void tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long e free_pages((unsigned long)batch, 0); } tlb->local.next = NULL; + + current->tlb = NULL; } /* __tlb_remove_page ^ permalink raw reply related [flat|nested] 70+ messages in thread
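For those skimming, the ordering difference described above can be modelled entirely in user space. The sketch below is a toy model, not kernel code: model_free_page(), model_flush_tlb() and the "sibling thread" observer are stand-ins invented for the example, and the only point it makes is that fast mode frees the page before the TLB flush, so a preemption in that window exposes a stale translation to freed memory.

/* Toy user-space model of fast-mode vs. batched ordering (assumed names). */
#include <stdbool.h>
#include <stdio.h>

struct page { int data; bool freed; };
struct tlb_entry { struct page *page; bool valid; };

static struct page pg = { 42, false };
static struct tlb_entry tlb = { &pg, true };

static void model_free_page(struct page *p) { p->freed = true; p->data = -1; }
static void model_flush_tlb(void) { tlb.valid = false; }

/* What a preempted sibling thread would observe at this point. */
static void model_sibling_access(const char *when)
{
	if (tlb.valid)
		printf("%s: stale TLB entry, reads %d from a %s page\n",
		       when, tlb.page->data, tlb.page->freed ? "FREED" : "live");
	else
		printf("%s: no TLB entry, page-table walk, clean fault\n", when);
}

int main(void)
{
	/* Fast mode: free the page first, flush the TLB later. */
	model_free_page(&pg);
	model_sibling_access("fast mode, preempted between free and flush");
	model_flush_tlb();

	/* Batched mode: flush the TLB first, only then free the page. */
	pg.data = 42; pg.freed = false; tlb.valid = true;
	model_flush_tlb();
	model_sibling_access("batched mode, preempted after the flush");
	model_free_page(&pg);

	return 0;
}

Running it prints a use-after-free observation for the first ordering and a clean fault for the second, which is exactly why batching (or killing fast mode) is required once the gather can be preempted.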
* Re: TLB and PTE coherency during munmap 2013-06-03 9:05 ` Peter Zijlstra @ 2013-06-03 9:16 ` Ingo Molnar -1 siblings, 0 replies; 70+ messages in thread From: Ingo Molnar @ 2013-06-03 9:16 UTC (permalink / raw) To: Peter Zijlstra Cc: Max Filippov, Vineet Gupta, KAMEZAWA Hiroyuki, linux-arch, linux-mm, Ralf Baechle, Chris Zankel, Marc Gauthier, linux-xtensa, Hugh Dickins, Thomas Gleixner, Linus Torvalds, Andrew Morton * Peter Zijlstra <peterz@infradead.org> wrote: > On Fri, May 31, 2013 at 08:09:17AM +0400, Max Filippov wrote: > > Hi Peter, > > > > On Wed, May 29, 2013 at 9:51 PM, Peter Zijlstra <peterz@infradead.org> wrote: > > > What about something like this? > > > > With that patch I still get mtest05 firing my TLB/PTE incoherency > > check in the UP PREEMPT_VOLUNTARY configuration. This happens after > > zap_pte_range completion in the end of unmap_region because of > > rescheduling called in the following call chain: > > OK, so there two options; completely kill off fast-mode or something > like the below where we add magic to the scheduler :/ > > I'm aware people might object to something like the below -- but since > its a possibility I thought we ought to at least mention it. > > For those new to the thread; the problem is that since the introduction > of preemptible mmu_gather the traditional UP fast-mode is broken. > Fast-mode is where we free the pages first and flush TLBs later. This is > not a problem if there's no concurrency, but obviously if you can > preempt there now is. > > I think I prefer completely killing off fast-mode esp. since UP seems to > go the way of the Dodo and it does away with an exception in the > mmu_gather code. > > Anyway; opinions? Linus, Thomas, Ingo? Since UP kernels have not been packaged up by major distros for years, and since the live-patching of SMP kernels (the SMP alternative-instructions patching machinery) does away with a big chunk of the SMP cost, I guess UP kernels are slowly becoming like TINY_RCU: interesting but not really a primary design goal? ( Another reason for reducing SMP vs. UP complexity in this area would be the fact that we had a few bad regressions lately - the TLB code is not getting simpler, and bugs are getting discovered and fixed slower. ) At least that's the x86 perspective. ARM might still see it differently? Thanks, Ingo ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: TLB and PTE coherency during munmap 2013-06-03 9:16 ` Ingo Molnar @ 2013-06-03 10:01 ` Catalin Marinas -1 siblings, 0 replies; 70+ messages in thread From: Catalin Marinas @ 2013-06-03 10:01 UTC (permalink / raw) To: Ingo Molnar Cc: Peter Zijlstra, Max Filippov, Vineet Gupta, KAMEZAWA Hiroyuki, linux-arch, linux-mm, Ralf Baechle, Chris Zankel, Marc Gauthier, linux-xtensa, Hugh Dickins, Thomas Gleixner, Linus Torvalds, Andrew Morton On 3 June 2013 10:16, Ingo Molnar <mingo@kernel.org> wrote: > > * Peter Zijlstra <peterz@infradead.org> wrote: > >> On Fri, May 31, 2013 at 08:09:17AM +0400, Max Filippov wrote: >> > Hi Peter, >> > >> > On Wed, May 29, 2013 at 9:51 PM, Peter Zijlstra <peterz@infradead.org> wrote: >> > > What about something like this? >> > >> > With that patch I still get mtest05 firing my TLB/PTE incoherency >> > check in the UP PREEMPT_VOLUNTARY configuration. This happens after >> > zap_pte_range completion in the end of unmap_region because of >> > rescheduling called in the following call chain: >> >> OK, so there two options; completely kill off fast-mode or something >> like the below where we add magic to the scheduler :/ >> >> I'm aware people might object to something like the below -- but since >> its a possibility I thought we ought to at least mention it. >> >> For those new to the thread; the problem is that since the introduction >> of preemptible mmu_gather the traditional UP fast-mode is broken. >> Fast-mode is where we free the pages first and flush TLBs later. This is >> not a problem if there's no concurrency, but obviously if you can >> preempt there now is. >> >> I think I prefer completely killing off fast-mode esp. since UP seems to >> go the way of the Dodo and it does away with an exception in the >> mmu_gather code. >> >> Anyway; opinions? Linus, Thomas, Ingo? > > Since UP kernels have not been packaged up by major distros for years, and > since the live-patching of SMP kernels (the SMP alternative-instructions > patching machinery) does away with a big chunk of the SMP cost, I guess UP > kernels are slowly becoming like TINY_RCU: interesting but not really a > primary design goal? > > ( Another reason for reducing SMP vs. UP complexity in this area would be > the fact that we had a few bad regressions lately - the TLB code is not > getting simpler, and bugs are getting discovered and fixed slower. ) > > At least that's the x86 perspective. ARM might still see it differently? On ARM there is a lot of ongoing work on single zImage for multiple SoCs and this implies SMP kernels. There is an SMP_ON_UP feature which does run-time code patching to optimise the UP case in a few places. Regarding tlb_fast_mode(), the ARM-specific implementation is always 0 on ARMv7 even if UP because of speculative TLB loads (the MMU could pretty much act as a separate processor). Catalin ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: TLB and PTE coherency during munmap @ 2013-06-03 10:04 ` Peter Zijlstra 0 siblings, 0 replies; 70+ messages in thread From: Peter Zijlstra @ 2013-06-03 10:04 UTC (permalink / raw) To: Catalin Marinas Cc: Ingo Molnar, Max Filippov, Vineet Gupta, KAMEZAWA Hiroyuki, linux-arch, linux-mm, Ralf Baechle, Chris Zankel, Marc Gauthier, linux-xtensa, Hugh Dickins, Thomas Gleixner, Linus Torvalds, Andrew Morton On Mon, Jun 03, 2013 at 11:01:39AM +0100, Catalin Marinas wrote: > On ARM there is a lot of ongoing work on single zImage for multiple > SoCs and this implies SMP kernels. There is an SMP_ON_UP feature which > does run-time code patching to optimise the UP case in a few places. > > Regarding tlb_fast_mode(), the ARM-specific implementation is always 0 > on ARMv7 even if UP because of speculative TLB loads (the MMU could > pretty much act as a separate processor). Oh right.. I should really refresh the mmu_gather unification patches so all archs are using the generic code. /me ups on todo list. ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: TLB and PTE coherency during munmap 2013-06-03 10:04 ` Peter Zijlstra @ 2013-06-03 10:09 ` Catalin Marinas -1 siblings, 0 replies; 70+ messages in thread From: Catalin Marinas @ 2013-06-03 10:09 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, Max Filippov, Vineet Gupta, KAMEZAWA Hiroyuki, linux-arch, linux-mm, Ralf Baechle, Chris Zankel, Marc Gauthier, linux-xtensa, Hugh Dickins, Thomas Gleixner, Linus Torvalds, Andrew Morton On Mon, Jun 03, 2013 at 11:04:44AM +0100, Peter Zijlstra wrote: > On Mon, Jun 03, 2013 at 11:01:39AM +0100, Catalin Marinas wrote: > > On ARM there is a lot of ongoing work on single zImage for multiple > > SoCs and this implies SMP kernels. There is an SMP_ON_UP feature which > > does run-time code patching to optimise the UP case in a few places. > > > > Regarding tlb_fast_mode(), the ARM-specific implementation is always 0 > > on ARMv7 even if UP because of speculative TLB loads (the MMU could > > pretty much act as a separate processor). > > Oh right.. I should really refresh the mmu_gather unification patches so > all archs are using the generic code. That would be great! I'll help with testing/review (or if it needs anything else) from an arm/arm64 perspective. Thanks. -- Catalin ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: TLB and PTE coherency during munmap @ 2013-06-04 9:52 ` Peter Zijlstra 0 siblings, 0 replies; 70+ messages in thread From: Peter Zijlstra @ 2013-06-04 9:52 UTC (permalink / raw) To: Max Filippov Cc: Vineet Gupta, KAMEZAWA Hiroyuki, linux-arch, linux-mm, Ralf Baechle, Chris Zankel, Marc Gauthier, linux-xtensa, Hugh Dickins, Thomas Gleixner, Linus Torvalds, Andrew Morton, Ingo Molnar, Tony Luck On Mon, Jun 03, 2013 at 11:05:01AM +0200, Peter Zijlstra wrote: > On Fri, May 31, 2013 at 08:09:17AM +0400, Max Filippov wrote: > > Hi Peter, > > > > On Wed, May 29, 2013 at 9:51 PM, Peter Zijlstra <peterz@infradead.org> wrote: > > > What about something like this? > > > > With that patch I still get mtest05 firing my TLB/PTE incoherency check > > in the UP PREEMPT_VOLUNTARY configuration. This happens after > > zap_pte_range completion in the end of unmap_region because of > > rescheduling called in the following call chain: > > OK, so there two options; completely kill off fast-mode or something like the > below where we add magic to the scheduler :/ > > I'm aware people might object to something like the below -- but since its a > possibility I thought we ought to at least mention it. > > For those new to the thread; the problem is that since the introduction of > preemptible mmu_gather the traditional UP fast-mode is broken. Fast-mode is > where we free the pages first and flush TLBs later. This is not a problem if > there's no concurrency, but obviously if you can preempt there now is. > > I think I prefer completely killing off fast-mode esp. since UP seems to go the > way of the Dodo and it does away with an exception in the mmu_gather code. > > Anyway; opinions? Linus, Thomas, Ingo? And here's the patch that makes fast mode go *poof*.. --- arch/arm/include/asm/tlb.h | 27 ++++----------------------- arch/ia64/include/asm/tlb.h | 41 ++++++++--------------------------------- include/asm-generic/tlb.h | 17 +---------------- mm/memory.c | 9 --------- 4 files changed, 13 insertions(+), 81 deletions(-) diff --git a/arch/arm/include/asm/tlb.h b/arch/arm/include/asm/tlb.h index 99a1951..bdf2b84 100644 --- a/arch/arm/include/asm/tlb.h +++ b/arch/arm/include/asm/tlb.h @@ -33,18 +33,6 @@ #include <asm/pgalloc.h> #include <asm/tlbflush.h> -/* - * We need to delay page freeing for SMP as other CPUs can access pages - * which have been removed but not yet had their TLB entries invalidated. - * Also, as ARMv7 speculative prefetch can drag new entries into the TLB, - * we need to apply this same delaying tactic to ensure correct operation. 
- */ -#if defined(CONFIG_SMP) || defined(CONFIG_CPU_32v7) -#define tlb_fast_mode(tlb) 0 -#else -#define tlb_fast_mode(tlb) 1 -#endif - #define MMU_GATHER_BUNDLE 8 /* @@ -112,12 +100,10 @@ static inline void __tlb_alloc_page(struct mmu_gather *tlb) static inline void tlb_flush_mmu(struct mmu_gather *tlb) { tlb_flush(tlb); - if (!tlb_fast_mode(tlb)) { - free_pages_and_swap_cache(tlb->pages, tlb->nr); - tlb->nr = 0; - if (tlb->pages == tlb->local) - __tlb_alloc_page(tlb); - } + free_pages_and_swap_cache(tlb->pages, tlb->nr); + tlb->nr = 0; + if (tlb->pages == tlb->local) + __tlb_alloc_page(tlb); } static inline void @@ -178,11 +164,6 @@ tlb_end_vma(struct mmu_gather *tlb, struct vm_area_struct *vma) static inline int __tlb_remove_page(struct mmu_gather *tlb, struct page *page) { - if (tlb_fast_mode(tlb)) { - free_page_and_swap_cache(page); - return 1; /* avoid calling tlb_flush_mmu */ - } - tlb->pages[tlb->nr++] = page; VM_BUG_ON(tlb->nr > tlb->max); return tlb->max - tlb->nr; diff --git a/arch/ia64/include/asm/tlb.h b/arch/ia64/include/asm/tlb.h index c3ffe3e..ef3a9de 100644 --- a/arch/ia64/include/asm/tlb.h +++ b/arch/ia64/include/asm/tlb.h @@ -46,12 +46,6 @@ #include <asm/tlbflush.h> #include <asm/machvec.h> -#ifdef CONFIG_SMP -# define tlb_fast_mode(tlb) ((tlb)->nr == ~0U) -#else -# define tlb_fast_mode(tlb) (1) -#endif - /* * If we can't allocate a page to make a big batch of page pointers * to work on, then just handle a few from the on-stack structure. @@ -60,7 +54,7 @@ struct mmu_gather { struct mm_struct *mm; - unsigned int nr; /* == ~0U => fast mode */ + unsigned int nr; unsigned int max; unsigned char fullmm; /* non-zero means full mm flush */ unsigned char need_flush; /* really unmapped some PTEs? */ @@ -103,6 +97,7 @@ extern struct ia64_tr_entry *ia64_idtrs[NR_CPUS]; static inline void ia64_tlb_flush_mmu (struct mmu_gather *tlb, unsigned long start, unsigned long end) { + unsigned long i; unsigned int nr; if (!tlb->need_flush) @@ -141,13 +136,11 @@ ia64_tlb_flush_mmu (struct mmu_gather *tlb, unsigned long start, unsigned long e /* lastly, release the freed pages */ nr = tlb->nr; - if (!tlb_fast_mode(tlb)) { - unsigned long i; - tlb->nr = 0; - tlb->start_addr = ~0UL; - for (i = 0; i < nr; ++i) - free_page_and_swap_cache(tlb->pages[i]); - } + + tlb->nr = 0; + tlb->start_addr = ~0UL; + for (i = 0; i < nr; ++i) + free_page_and_swap_cache(tlb->pages[i]); } static inline void __tlb_alloc_page(struct mmu_gather *tlb) @@ -167,20 +160,7 @@ tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, unsigned int full_m tlb->mm = mm; tlb->max = ARRAY_SIZE(tlb->local); tlb->pages = tlb->local; - /* - * Use fast mode if only 1 CPU is online. - * - * It would be tempting to turn on fast-mode for full_mm_flush as well. But this - * doesn't work because of speculative accesses and software prefetching: the page - * table of "mm" may (and usually is) the currently active page table and even - * though the kernel won't do any user-space accesses during the TLB shoot down, a - * compiler might use speculation or lfetch.fault on what happens to be a valid - * user-space address. This in turn could trigger a TLB miss fault (or a VHPT - * walk) and re-insert a TLB entry we just removed. Slow mode avoids such - * problems. (We could make fast-mode work by switching the current task to a - * different "mm" during the shootdown.) --davidm 08/02/2002 - */ - tlb->nr = (num_online_cpus() == 1) ? 
~0U : 0; + tlb->nr = 0; tlb->fullmm = full_mm_flush; tlb->start_addr = ~0UL; } @@ -214,11 +194,6 @@ static inline int __tlb_remove_page(struct mmu_gather *tlb, struct page *page) { tlb->need_flush = 1; - if (tlb_fast_mode(tlb)) { - free_page_and_swap_cache(page); - return 1; /* avoid calling tlb_flush_mmu */ - } - if (!tlb->nr && tlb->pages == tlb->local) __tlb_alloc_page(tlb); diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h index b1b1fa6..13821c3 100644 --- a/include/asm-generic/tlb.h +++ b/include/asm-generic/tlb.h @@ -97,11 +97,9 @@ struct mmu_gather { unsigned long start; unsigned long end; unsigned int need_flush : 1, /* Did free PTEs */ - fast_mode : 1; /* No batching */ - /* we are in the middle of an operation to clear * a full mm and can make some optimizations */ - unsigned int fullmm : 1, + fullmm : 1, /* we have performed an operation which * requires a complete flush of the tlb */ need_flush_all : 1; @@ -114,19 +112,6 @@ struct mmu_gather { #define HAVE_GENERIC_MMU_GATHER -static inline int tlb_fast_mode(struct mmu_gather *tlb) -{ -#ifdef CONFIG_SMP - return tlb->fast_mode; -#else - /* - * For UP we don't need to worry about TLB flush - * and page free order so much.. - */ - return 1; -#endif -} - void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, bool fullmm); void tlb_flush_mmu(struct mmu_gather *tlb); void tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, diff --git a/mm/memory.c b/mm/memory.c index d7d54a1..95d0cce 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -220,7 +220,6 @@ void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, bool fullmm) tlb->start = -1UL; tlb->end = 0; tlb->need_flush = 0; - tlb->fast_mode = (num_possible_cpus() == 1); tlb->local.next = NULL; tlb->local.nr = 0; tlb->local.max = ARRAY_SIZE(tlb->__pages); @@ -244,9 +243,6 @@ void tlb_flush_mmu(struct mmu_gather *tlb) tlb_table_flush(tlb); #endif - if (tlb_fast_mode(tlb)) - return; - for (batch = &tlb->local; batch; batch = batch->next) { free_pages_and_swap_cache(batch->pages, batch->nr); batch->nr = 0; @@ -288,11 +284,6 @@ int __tlb_remove_page(struct mmu_gather *tlb, struct page *page) VM_BUG_ON(!tlb->need_flush); - if (tlb_fast_mode(tlb)) { - free_page_and_swap_cache(page); - return 1; /* avoid calling tlb_flush_mmu() */ - } - batch = tlb->active; batch->pages[batch->nr++] = page; if (batch->nr == batch->max) { ^ permalink raw reply related [flat|nested] 70+ messages in thread
* Re: TLB and PTE coherency during munmap @ 2013-06-05 0:05 ` Linus Torvalds 0 siblings, 0 replies; 70+ messages in thread From: Linus Torvalds @ 2013-06-05 0:05 UTC (permalink / raw) To: Peter Zijlstra Cc: Max Filippov, Vineet Gupta, KAMEZAWA Hiroyuki, linux-arch, linux-mm, Ralf Baechle, Chris Zankel, Marc Gauthier, linux-xtensa, Hugh Dickins, Thomas Gleixner, Andrew Morton, Ingo Molnar, Tony Luck On Tue, 4 Jun 2013, Peter Zijlstra wrote: > > And here's the patch that makes fast mode go *poof*.. Let's just do this. Mind sending a patch with proper changelog and sign-off? Linus ^ permalink raw reply [flat|nested] 70+ messages in thread
* [PATCH] arch, mm: Remove tlb_fast_mode() @ 2013-06-05 10:26 ` Peter Zijlstra 0 siblings, 0 replies; 70+ messages in thread From: Peter Zijlstra @ 2013-06-05 10:26 UTC (permalink / raw) To: Linus Torvalds Cc: Max Filippov, Vineet Gupta, KAMEZAWA Hiroyuki, linux-arch, linux-mm, Ralf Baechle, Chris Zankel, Marc Gauthier, linux-xtensa, Hugh Dickins, Thomas Gleixner, Andrew Morton, Ingo Molnar, Tony Luck On Wed, Jun 05, 2013 at 09:05:03AM +0900, Linus Torvalds wrote: > > > On Tue, 4 Jun 2013, Peter Zijlstra wrote: > > > > And here's the patch that makes fast mode go *poof*.. > > Let's just do this. Mind sending a patch with proper changelog and > sign-off? Sure, as an extra I even compile tested it! ;-) (including arm and ia64). --- Subject: arch, mm: Remove tlb_fast_mode() From: Peter Zijlstra <peterz@infradead.org> Since the introduction of preemptible mmu_gather TLB fast mode has been broken. TLB fast mode relies on there being absolutely no concurrency; it frees pages first and invalidates TLBs later. However now we can get concurrency and stuff goes *bang*. This patch removes all tlb_fast_mode() code; it was found the better option vs trying to patch the hole by entangling tlb invalidation with the scheduler. Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Russell King <linux@arm.linux.org.uk> Cc: Tony Luck <tony.luck@intel.com> Reported-by: Max Filippov <jcmvbkbc@gmail.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> --- arch/arm/include/asm/tlb.h | 27 ++++----------------------- arch/ia64/include/asm/tlb.h | 41 ++++++++--------------------------------- include/asm-generic/tlb.h | 17 +---------------- mm/memory.c | 9 --------- 4 files changed, 13 insertions(+), 81 deletions(-) --- a/arch/arm/include/asm/tlb.h +++ b/arch/arm/include/asm/tlb.h @@ -33,18 +33,6 @@ #include <asm/pgalloc.h> #include <asm/tlbflush.h> -/* - * We need to delay page freeing for SMP as other CPUs can access pages - * which have been removed but not yet had their TLB entries invalidated. - * Also, as ARMv7 speculative prefetch can drag new entries into the TLB, - * we need to apply this same delaying tactic to ensure correct operation. - */ -#if defined(CONFIG_SMP) || defined(CONFIG_CPU_32v7) -#define tlb_fast_mode(tlb) 0 -#else -#define tlb_fast_mode(tlb) 1 -#endif - #define MMU_GATHER_BUNDLE 8 /* @@ -112,12 +100,10 @@ static inline void __tlb_alloc_page(stru static inline void tlb_flush_mmu(struct mmu_gather *tlb) { tlb_flush(tlb); - if (!tlb_fast_mode(tlb)) { - free_pages_and_swap_cache(tlb->pages, tlb->nr); - tlb->nr = 0; - if (tlb->pages == tlb->local) - __tlb_alloc_page(tlb); - } + free_pages_and_swap_cache(tlb->pages, tlb->nr); + tlb->nr = 0; + if (tlb->pages == tlb->local) + __tlb_alloc_page(tlb); } static inline void @@ -178,11 +164,6 @@ tlb_end_vma(struct mmu_gather *tlb, stru static inline int __tlb_remove_page(struct mmu_gather *tlb, struct page *page) { - if (tlb_fast_mode(tlb)) { - free_page_and_swap_cache(page); - return 1; /* avoid calling tlb_flush_mmu */ - } - tlb->pages[tlb->nr++] = page; VM_BUG_ON(tlb->nr > tlb->max); return tlb->max - tlb->nr; --- a/arch/ia64/include/asm/tlb.h +++ b/arch/ia64/include/asm/tlb.h @@ -46,12 +46,6 @@ #include <asm/tlbflush.h> #include <asm/machvec.h> -#ifdef CONFIG_SMP -# define tlb_fast_mode(tlb) ((tlb)->nr == ~0U) -#else -# define tlb_fast_mode(tlb) (1) -#endif - /* * If we can't allocate a page to make a big batch of page pointers * to work on, then just handle a few from the on-stack structure. 
@@ -60,7 +54,7 @@ struct mmu_gather { struct mm_struct *mm; - unsigned int nr; /* == ~0U => fast mode */ + unsigned int nr; unsigned int max; unsigned char fullmm; /* non-zero means full mm flush */ unsigned char need_flush; /* really unmapped some PTEs? */ @@ -103,6 +97,7 @@ extern struct ia64_tr_entry *ia64_idtrs[ static inline void ia64_tlb_flush_mmu (struct mmu_gather *tlb, unsigned long start, unsigned long end) { + unsigned long i; unsigned int nr; if (!tlb->need_flush) @@ -141,13 +136,11 @@ ia64_tlb_flush_mmu (struct mmu_gather *t /* lastly, release the freed pages */ nr = tlb->nr; - if (!tlb_fast_mode(tlb)) { - unsigned long i; - tlb->nr = 0; - tlb->start_addr = ~0UL; - for (i = 0; i < nr; ++i) - free_page_and_swap_cache(tlb->pages[i]); - } + + tlb->nr = 0; + tlb->start_addr = ~0UL; + for (i = 0; i < nr; ++i) + free_page_and_swap_cache(tlb->pages[i]); } static inline void __tlb_alloc_page(struct mmu_gather *tlb) @@ -167,20 +160,7 @@ tlb_gather_mmu(struct mmu_gather *tlb, s tlb->mm = mm; tlb->max = ARRAY_SIZE(tlb->local); tlb->pages = tlb->local; - /* - * Use fast mode if only 1 CPU is online. - * - * It would be tempting to turn on fast-mode for full_mm_flush as well. But this - * doesn't work because of speculative accesses and software prefetching: the page - * table of "mm" may (and usually is) the currently active page table and even - * though the kernel won't do any user-space accesses during the TLB shoot down, a - * compiler might use speculation or lfetch.fault on what happens to be a valid - * user-space address. This in turn could trigger a TLB miss fault (or a VHPT - * walk) and re-insert a TLB entry we just removed. Slow mode avoids such - * problems. (We could make fast-mode work by switching the current task to a - * different "mm" during the shootdown.) --davidm 08/02/2002 - */ - tlb->nr = (num_online_cpus() == 1) ? ~0U : 0; + tlb->nr = 0; tlb->fullmm = full_mm_flush; tlb->start_addr = ~0UL; } @@ -214,11 +194,6 @@ static inline int __tlb_remove_page(stru { tlb->need_flush = 1; - if (tlb_fast_mode(tlb)) { - free_page_and_swap_cache(page); - return 1; /* avoid calling tlb_flush_mmu */ - } - if (!tlb->nr && tlb->pages == tlb->local) __tlb_alloc_page(tlb); --- a/include/asm-generic/tlb.h +++ b/include/asm-generic/tlb.h @@ -97,11 +97,9 @@ struct mmu_gather { unsigned long start; unsigned long end; unsigned int need_flush : 1, /* Did free PTEs */ - fast_mode : 1; /* No batching */ - /* we are in the middle of an operation to clear * a full mm and can make some optimizations */ - unsigned int fullmm : 1, + fullmm : 1, /* we have performed an operation which * requires a complete flush of the tlb */ need_flush_all : 1; @@ -114,19 +112,6 @@ struct mmu_gather { #define HAVE_GENERIC_MMU_GATHER -static inline int tlb_fast_mode(struct mmu_gather *tlb) -{ -#ifdef CONFIG_SMP - return tlb->fast_mode; -#else - /* - * For UP we don't need to worry about TLB flush - * and page free order so much.. 
- */ - return 1; -#endif -} - void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, bool fullmm); void tlb_flush_mmu(struct mmu_gather *tlb); void tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, --- a/mm/memory.c +++ b/mm/memory.c @@ -220,7 +220,6 @@ void tlb_gather_mmu(struct mmu_gather *t tlb->start = -1UL; tlb->end = 0; tlb->need_flush = 0; - tlb->fast_mode = (num_possible_cpus() == 1); tlb->local.next = NULL; tlb->local.nr = 0; tlb->local.max = ARRAY_SIZE(tlb->__pages); @@ -244,9 +243,6 @@ void tlb_flush_mmu(struct mmu_gather *tl tlb_table_flush(tlb); #endif - if (tlb_fast_mode(tlb)) - return; - for (batch = &tlb->local; batch; batch = batch->next) { free_pages_and_swap_cache(batch->pages, batch->nr); batch->nr = 0; @@ -288,11 +284,6 @@ int __tlb_remove_page(struct mmu_gather VM_BUG_ON(!tlb->need_flush); - if (tlb_fast_mode(tlb)) { - free_page_and_swap_cache(page); - return 1; /* avoid calling tlb_flush_mmu() */ - } - batch = tlb->active; batch->pages[batch->nr++] = page; if (batch->nr == batch->max) { ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: TLB and PTE coherency during munmap @ 2013-05-31 1:40 ` Max Filippov 0 siblings, 0 replies; 70+ messages in thread From: Max Filippov @ 2013-05-31 1:40 UTC (permalink / raw) To: Peter Zijlstra Cc: KAMEZAWA Hiroyuki, linux-arch, linux-mm, Ralf Baechle, Chris Zankel, Marc Gauthier, linux-xtensa, Hugh Dickins On Wed, May 29, 2013 at 4:27 PM, Peter Zijlstra <peterz@infradead.org> wrote: > On Tue, May 28, 2013 at 11:10:25AM +0400, Max Filippov wrote: >> On Sun, May 26, 2013 at 6:50 AM, Max Filippov <jcmvbkbc@gmail.com> wrote: >> > Hello arch and mm people. >> > >> > Is it intentional that threads of a process that invoked munmap syscall >> > can see TLB entries pointing to already freed pages, or it is a bug? >> > >> > I'm talking about zap_pmd_range and zap_pte_range: >> > >> > zap_pmd_range >> > zap_pte_range >> > arch_enter_lazy_mmu_mode >> > ptep_get_and_clear_full >> > tlb_remove_tlb_entry >> > __tlb_remove_page >> > arch_leave_lazy_mmu_mode >> > cond_resched >> > >> > With the default arch_{enter,leave}_lazy_mmu_mode, tlb_remove_tlb_entry >> > and __tlb_remove_page there is a loop in the zap_pte_range that clears >> > PTEs and frees corresponding pages, but doesn't flush TLB, and >> > surrounding loop in the zap_pmd_range that calls cond_resched. If a thread >> > of the same process gets scheduled then it is able to see TLB entries >> > pointing to already freed physical pages. >> > >> > I've noticed that with xtensa arch when I added a test before returning to >> > userspace checking that TLB contents agrees with page tables of the >> > current mm. This check reliably fires with the LTP test mtest05 that >> > maps, unmaps and accesses memory from multiple threads. >> > >> > Is there anything wrong in my description, maybe something specific to >> > my arch, or this issue really exists? >> >> Hi, >> >> I've made similar checking function for MIPS (because qemu is my only choice >> and it simulates MIPS TLB) and ran my tests on mips-malta machine in qemu. >> With MIPS I can also see this issue. I hope I did it right, the patch at the >> bottom is for the reference. The test I run and the diagnostic output are as >> follows: >> >> To me it looks like the cond_resched in the zap_pmd_range is the root cause >> of this issue (let alone SMP case for now). It was introduced in the commit >> >> commit 97a894136f29802da19a15541de3c019e1ca147e >> Author: Peter Zijlstra <a.p.zijlstra@chello.nl> >> Date: Tue May 24 17:12:04 2011 -0700 >> >> mm: Remove i_mmap_lock lockbreak >> >> Peter, Kamezawa, other reviewers of that commit, could you please comment? > > Are you all running UP systems? I suppose the preemptible muck Yes, xtensa SMP support is in the middle of forward porting/cleanup process. > invalidated the assumption that UP systems are 'easy'. > > If you make tlb_fast_mode() return an unconditional false, does it all > work again? Actually the only thing that was visibly broken is the test for TLB/PTE coherency, but that incoherency appears to be intentional (and the test is too simple for that). Unconditionally returning false from tlb_fast_mode() doesn't fix that test, but AFAICS it fixes underlying page freeing. -- Thanks. -- Max ^ permalink raw reply [flat|nested] 70+ messages in thread
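For reference, the experiment suggested above (making tlb_fast_mode() return an unconditional false) amounts to forcing the generic mmu_gather onto the batching path, roughly as follows. This is a minimal sketch against the include/asm-generic/tlb.h of that time, not the patch that was eventually merged:

static inline int tlb_fast_mode(struct mmu_gather *tlb)
{
	/*
	 * Always batch: pages are only freed after tlb_flush_mmu() has
	 * flushed the TLB, never directly in __tlb_remove_page().
	 */
	return 0;
}

With that, a thread scheduled in at cond_resched() may still see stale TLB entries, but they can only point at pages still held in the gather batch, not at pages already returned to the allocator.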
* Re: TLB and PTE coherency during munmap @ 2013-05-28 14:34 ` Konrad Rzeszutek Wilk 0 siblings, 0 replies; 70+ messages in thread From: Konrad Rzeszutek Wilk @ 2013-05-28 14:34 UTC (permalink / raw) To: Max Filippov Cc: linux-arch, linux-mm, linux-xtensa, Chris Zankel, Marc Gauthier On Sun, May 26, 2013 at 06:50:46AM +0400, Max Filippov wrote: > Hello arch and mm people. > > Is it intentional that threads of a process that invoked munmap syscall > can see TLB entries pointing to already freed pages, or it is a bug? > > I'm talking about zap_pmd_range and zap_pte_range: > > zap_pmd_range > zap_pte_range > arch_enter_lazy_mmu_mode > ptep_get_and_clear_full > tlb_remove_tlb_entry > __tlb_remove_page > arch_leave_lazy_mmu_mode > cond_resched > > With the default arch_{enter,leave}_lazy_mmu_mode, tlb_remove_tlb_entry > and __tlb_remove_page there is a loop in the zap_pte_range that clears > PTEs and frees corresponding pages, but doesn't flush TLB, and > surrounding loop in the zap_pmd_range that calls cond_resched. If a thread > of the same process gets scheduled then it is able to see TLB entries > pointing to already freed physical pages. The idea behind the lazy MMU subsystem is that it does not need to flush the TLB all the time and allow one to do PTE manipulations in a "batch mode". Meaning there are stray entries - and one has to be diligient about not using them. Here is the relvant comment from the Linux header: /* * A facility to provide lazy MMU batching. This allows PTE updates and * page invalidations to be delayed until a call to leave lazy MMU mode * is issued. Some architectures may benefit from doing this, and it is * beneficial for both shadow and direct mode hypervisors, which may batch * the PTE updates which happen during this window. Note that using this * interface requires that read hazards be removed from the code. A read * hazard could result in the direct mode hypervisor case, since the actual * write to the page tables may not yet have taken place, so reads though * a raw PTE pointer after it has been modified are not guaranteed to be * up to date. This mode can only be entered and left under the protection of * the page table locks for all page tables which may be modified. In the UP * case, this is required so that preemption is disabled, and in the SMP case, * it must synchronize the delayed page table writes properly on other CPUs. */ This means that eventually when arch_leave_lazy_mmu_mode or arch_flush_lazy_mmu_mode is called, the PTE updates _should_ be flushed (aka, TLB flush if needed on the altered PTE entries). > > I've noticed that with xtensa arch when I added a test before returning to > userspace checking that TLB contents agrees with page tables of the > current mm. This check reliably fires with the LTP test mtest05 that > maps, unmaps and accesses memory from multiple threads. > > Is there anything wrong in my description, maybe something specific to > my arch, or this issue really exists? > > I've also noticed that there are a lot of arches with default implementations > of the involved functions, does that mean that any/all of them have this > issue too? > > -- > Thanks. > -- Max > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> > ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: TLB and PTE coherency during munmap 2013-05-28 14:34 ` Konrad Rzeszutek Wilk @ 2013-05-29 3:23 ` Max Filippov -1 siblings, 0 replies; 70+ messages in thread From: Max Filippov @ 2013-05-29 3:23 UTC (permalink / raw) To: Konrad Rzeszutek Wilk Cc: linux-arch, linux-mm, linux-xtensa, Chris Zankel, Marc Gauthier On Tue, May 28, 2013 at 6:34 PM, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote: > On Sun, May 26, 2013 at 06:50:46AM +0400, Max Filippov wrote: >> Hello arch and mm people. >> >> Is it intentional that threads of a process that invoked munmap syscall >> can see TLB entries pointing to already freed pages, or it is a bug? >> >> I'm talking about zap_pmd_range and zap_pte_range: >> >> zap_pmd_range >> zap_pte_range >> arch_enter_lazy_mmu_mode >> ptep_get_and_clear_full >> tlb_remove_tlb_entry >> __tlb_remove_page >> arch_leave_lazy_mmu_mode >> cond_resched >> >> With the default arch_{enter,leave}_lazy_mmu_mode, tlb_remove_tlb_entry >> and __tlb_remove_page there is a loop in the zap_pte_range that clears >> PTEs and frees corresponding pages, but doesn't flush TLB, and >> surrounding loop in the zap_pmd_range that calls cond_resched. If a thread >> of the same process gets scheduled then it is able to see TLB entries >> pointing to already freed physical pages. > > The idea behind the lazy MMU subsystem is that it does not need to flush > the TLB all the time and allow one to do PTE manipulations in a "batch mode". > Meaning there are stray entries - and one has to be diligient about not using them. Yes, I got it, IOW TLB entries must either be flushed before userspace can see them, or the underlying pages must not be freed. > Here is the relvant comment from the Linux header: > > /* > * A facility to provide lazy MMU batching. This allows PTE updates and > * page invalidations to be delayed until a call to leave lazy MMU mode > * is issued. Some architectures may benefit from doing this, and it is > * beneficial for both shadow and direct mode hypervisors, which may batch > * the PTE updates which happen during this window. Note that using this > * interface requires that read hazards be removed from the code. A read > * hazard could result in the direct mode hypervisor case, since the actual > * write to the page tables may not yet have taken place, so reads though > * a raw PTE pointer after it has been modified are not guaranteed to be > * up to date. This mode can only be entered and left under the protection of > * the page table locks for all page tables which may be modified. In the UP > * case, this is required so that preemption is disabled, and in the SMP case, > * it must synchronize the delayed page table writes properly on other CPUs. > */ > > This means that eventually when arch_leave_lazy_mmu_mode or > arch_flush_lazy_mmu_mode is called, the PTE updates _should_ be flushed > (aka, TLB flush if needed on the altered PTE entries). Should (: But I only see powerpc, sparc and x86 defining __HAVE_ARCH_ENTER_LAZY_MMU_MODE, so this does not apply to all remaining arches. -- Thanks. -- Max ^ permalink raw reply [flat|nested] 70+ messages in thread
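For reference, architectures that do not define __HAVE_ARCH_ENTER_LAZY_MMU_MODE get no-op hooks from the generic header, along these lines (a sketch of include/asm-generic/pgtable.h from that era; the exact text in the tree may differ):

#ifndef __HAVE_ARCH_ENTER_LAZY_MMU_MODE
#define arch_enter_lazy_mmu_mode()	do {} while (0)
#define arch_leave_lazy_mmu_mode()	do {} while (0)
#define arch_flush_lazy_mmu_mode()	do {} while (0)
#endif

so on those architectures arch_leave_lazy_mmu_mode() cannot be what flushes the stale entries.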
* Re: TLB and PTE coherency during munmap @ 2013-05-28 15:16 ` Michal Hocko 0 siblings, 0 replies; 70+ messages in thread From: Michal Hocko @ 2013-05-28 15:16 UTC (permalink / raw) To: Max Filippov Cc: linux-arch, linux-mm, linux-xtensa, Chris Zankel, Marc Gauthier On Sun 26-05-13 06:50:46, Max Filippov wrote: > Hello arch and mm people. > > Is it intentional that threads of a process that invoked munmap syscall > can see TLB entries pointing to already freed pages, or it is a bug? > > I'm talking about zap_pmd_range and zap_pte_range: > > zap_pmd_range > zap_pte_range > arch_enter_lazy_mmu_mode > ptep_get_and_clear_full > tlb_remove_tlb_entry > __tlb_remove_page > arch_leave_lazy_mmu_mode > cond_resched > > With the default arch_{enter,leave}_lazy_mmu_mode, tlb_remove_tlb_entry > and __tlb_remove_page there is a loop in the zap_pte_range that clears > PTEs and frees corresponding pages, The page is not freed at that time (at least not for the generic mmu_gather implementation). It is stored into mmu_gather and then freed along with the tlb flush in tlb_flush_mmu. [...] -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: TLB and PTE coherency during munmap 2013-05-28 15:16 ` Michal Hocko @ 2013-05-28 15:23 ` Catalin Marinas -1 siblings, 0 replies; 70+ messages in thread From: Catalin Marinas @ 2013-05-28 15:23 UTC (permalink / raw) To: Michal Hocko Cc: Max Filippov, linux-arch, linux-mm, linux-xtensa, Chris Zankel, Marc Gauthier On 28 May 2013 16:16, Michal Hocko <mhocko@suse.cz> wrote: > On Sun 26-05-13 06:50:46, Max Filippov wrote: >> Is it intentional that threads of a process that invoked munmap syscall >> can see TLB entries pointing to already freed pages, or it is a bug? >> >> I'm talking about zap_pmd_range and zap_pte_range: >> >> zap_pmd_range >> zap_pte_range >> arch_enter_lazy_mmu_mode >> ptep_get_and_clear_full >> tlb_remove_tlb_entry >> __tlb_remove_page >> arch_leave_lazy_mmu_mode >> cond_resched >> >> With the default arch_{enter,leave}_lazy_mmu_mode, tlb_remove_tlb_entry >> and __tlb_remove_page there is a loop in the zap_pte_range that clears >> PTEs and frees corresponding pages, > > The page is not freed at that time (at least not for the generic > mmu_gather implementation). It is stored into mmu_gather and then freed > along with the tlb flush in tlb_flush_mmu. Actually for the UP case, the page gets freed in __tlb_remove_page() since tlb_fast_mode() is 1. -- Catalin ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: TLB and PTE coherency during munmap 2013-05-26 2:42 TLB and PTE coherency during munmap Max Filippov 2013-05-26 2:50 ` Max Filippov @ 2013-05-28 14:35 ` Catalin Marinas 2013-05-29 4:15 ` Max Filippov ` (2 more replies) 1 sibling, 3 replies; 70+ messages in thread From: Catalin Marinas @ 2013-05-28 14:35 UTC (permalink / raw) To: Max Filippov Cc: linux-arch, linux-mm, linux-xtensa, Chris Zankel, Marc Gauthier Max, On 26 May 2013 03:42, Max Filippov <jcmvbkbc@gmail.com> wrote: > Hello arch and mm people. > > Is it intentional that threads of a process that invoked munmap syscall > can see TLB entries pointing to already freed pages, or it is a bug? If it happens, this would be a bug. It means that a process can access a physical page that has been allocated to something else, possibly kernel data. > I'm talking about zap_pmd_range and zap_pte_range: > > zap_pmd_range > zap_pte_range > arch_enter_lazy_mmu_mode > ptep_get_and_clear_full > tlb_remove_tlb_entry > __tlb_remove_page > arch_leave_lazy_mmu_mode > cond_resched > > With the default arch_{enter,leave}_lazy_mmu_mode, tlb_remove_tlb_entry > and __tlb_remove_page there is a loop in the zap_pte_range that clears > PTEs and frees corresponding pages, but doesn't flush TLB, and > surrounding loop in the zap_pmd_range that calls cond_resched. If a thread > of the same process gets scheduled then it is able to see TLB entries > pointing to already freed physical pages. It looks to me like cond_resched() here introduces a possible bug but it depends on the actual arch code, especially the __tlb_remove_tlb_entry() function. On ARM we record the range in tlb_remove_tlb_entry() and queue the pages to be removed in __tlb_remove_page(). It pretty much acts like tlb_fast_mode() == 0 even for the UP case (which is also needed for hardware speculative TLB loads). The tlb_finish_mmu() takes care of whatever pages are left to be freed. With a dummy __tlb_remove_tlb_entry() and tlb_fast_mode() == 1, cond_resched() in zap_pmd_range() would cause problems. I think possible workarounds: 1. tlb_fast_mode() always returning 0. 2. add a tlb_flush_mmu(tlb) before cond_resched() in zap_pmd_range(). 3. implement __tlb_remove_tlb_entry() on xtensa to always flush the tlb (which is probably costly). 4. drop the cond_resched() (not sure about preemptible kernels though). I would vote for 1 but let's see what the mm people say. -- Catalin ^ permalink raw reply [flat|nested] 70+ messages in thread
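For comparison, option 2 above would be roughly the following change at the end of the zap_pmd_range() loop body (an untested sketch only; the exact loop differed between kernel versions):

		next = zap_pte_range(tlb, vma, pmd, addr, next, details);
		/*
		 * Drain the gather before we can be preempted: flush the
		 * TLB, then free the batched pages.
		 */
		tlb_flush_mmu(tlb);
		cond_resched();

In the generic implementation tlb_flush_mmu() returns early when nothing was unmapped since the last flush, so calling it unconditionally before every cond_resched() mainly costs the extra flushes on large unmaps.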
* Re: TLB and PTE coherency during munmap 2013-05-28 14:35 ` Catalin Marinas @ 2013-05-29 4:15 ` Max Filippov 2013-05-29 11:53 ` Vineet Gupta 2013-05-29 12:00 ` Vineet Gupta 2 siblings, 0 replies; 70+ messages in thread From: Max Filippov @ 2013-05-29 4:15 UTC (permalink / raw) To: Catalin Marinas Cc: linux-arch, linux-mm, linux-xtensa, Chris Zankel, Marc Gauthier Hi Catalin, On Tue, May 28, 2013 at 6:35 PM, Catalin Marinas <catalin.marinas@arm.com> wrote: > Max, > > On 26 May 2013 03:42, Max Filippov <jcmvbkbc@gmail.com> wrote: >> Hello arch and mm people. >> >> Is it intentional that threads of a process that invoked munmap syscall >> can see TLB entries pointing to already freed pages, or it is a bug? > > If it happens, this would be a bug. It means that a process can access > a physical page that has been allocated to something else, possibly > kernel data. > >> I'm talking about zap_pmd_range and zap_pte_range: >> >> zap_pmd_range >> zap_pte_range >> arch_enter_lazy_mmu_mode >> ptep_get_and_clear_full >> tlb_remove_tlb_entry >> __tlb_remove_page >> arch_leave_lazy_mmu_mode >> cond_resched >> >> With the default arch_{enter,leave}_lazy_mmu_mode, tlb_remove_tlb_entry >> and __tlb_remove_page there is a loop in the zap_pte_range that clears >> PTEs and frees corresponding pages, but doesn't flush TLB, and >> surrounding loop in the zap_pmd_range that calls cond_resched. If a thread >> of the same process gets scheduled then it is able to see TLB entries >> pointing to already freed physical pages. > > It looks to me like cond_resched() here introduces a possible bug but > it depends on the actual arch code, especially the > __tlb_remove_tlb_entry() function. On ARM we record the range in > tlb_remove_tlb_entry() and queue the pages to be removed in > __tlb_remove_page(). It pretty much acts like tlb_fast_mode() == 0 > even for the UP case (which is also needed for hardware speculative > TLB loads). The tlb_finish_mmu() takes care of whatever pages are left > to be freed. > > With a dummy __tlb_remove_tlb_entry() and tlb_fast_mode() == 1, > cond_resched() in zap_pmd_range() would cause problems. So, looks like most architectures in the UP configuration should have this issue (unless they flush TLB in the switch_mm, even when switching to the same mm): tlb_remove_tlb_entry __tlb_remove_tlb_entry __tlb_remove_page __HAVE_ARCH_ENTER_LAZY_MMU_MODE non-default non-trivial non-default defined alpha arc arm yes yes arm64 yes yes avr32 blackfin c6x cris frv h8300 hexagon ia64 yes yes yes Kconfig m32r m68k metag microblaze mips mn10300 openrisc parisc powerpc yes yes s390 yes yes (a) score sh yes yes (a) sparc yes tile um yes yes yes unicore32 x86 yes xtensa (a) __tlb_remove_page == free_page_and_swap_cache > I think possible workarounds: > > 1. tlb_fast_mode() always returning 0. > 2. add a tlb_flush_mmu(tlb) before cond_resched() in zap_pmd_range(). > 3. implement __tlb_remove_tlb_entry() on xtensa to always flush the > tlb (which is probably costly). > 4. drop the cond_resched() (not sure about preemptible kernels though). > > I would vote for 1 but let's see what the mm people say. -- Thanks. -- Max ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: TLB and PTE coherency during munmap @ 2013-05-29 10:15 ` Catalin Marinas 0 siblings, 0 replies; 70+ messages in thread From: Catalin Marinas @ 2013-05-29 10:15 UTC (permalink / raw) To: Max Filippov Cc: linux-arch, linux-mm, linux-xtensa, Chris Zankel, Marc Gauthier On Wed, May 29, 2013 at 05:15:28AM +0100, Max Filippov wrote: > On Tue, May 28, 2013 at 6:35 PM, Catalin Marinas <catalin.marinas@arm.com> wrote: > > On 26 May 2013 03:42, Max Filippov <jcmvbkbc@gmail.com> wrote: > >> Is it intentional that threads of a process that invoked munmap syscall > >> can see TLB entries pointing to already freed pages, or it is a bug? > > > > If it happens, this would be a bug. It means that a process can access > > a physical page that has been allocated to something else, possibly > > kernel data. > > > >> I'm talking about zap_pmd_range and zap_pte_range: > >> > >> zap_pmd_range > >> zap_pte_range > >> arch_enter_lazy_mmu_mode > >> ptep_get_and_clear_full > >> tlb_remove_tlb_entry > >> __tlb_remove_page > >> arch_leave_lazy_mmu_mode > >> cond_resched > >> > >> With the default arch_{enter,leave}_lazy_mmu_mode, tlb_remove_tlb_entry > >> and __tlb_remove_page there is a loop in the zap_pte_range that clears > >> PTEs and frees corresponding pages, but doesn't flush TLB, and > >> surrounding loop in the zap_pmd_range that calls cond_resched. If a thread > >> of the same process gets scheduled then it is able to see TLB entries > >> pointing to already freed physical pages. > > > > It looks to me like cond_resched() here introduces a possible bug but > > it depends on the actual arch code, especially the > > __tlb_remove_tlb_entry() function. On ARM we record the range in > > tlb_remove_tlb_entry() and queue the pages to be removed in > > __tlb_remove_page(). It pretty much acts like tlb_fast_mode() == 0 > > even for the UP case (which is also needed for hardware speculative > > TLB loads). The tlb_finish_mmu() takes care of whatever pages are left > > to be freed. > > > > With a dummy __tlb_remove_tlb_entry() and tlb_fast_mode() == 1, > > cond_resched() in zap_pmd_range() would cause problems. > > So, looks like most architectures in the UP configuration should have > this issue (unless they flush TLB in the switch_mm, even when switching > to the same mm): switch_mm() wouldn't be called if switching to the same mm. You could do it in switch_to() but it's not efficient (or before returning to user space on the same processor). Do you happen to have a user-space test for this? Something like one thread does an mmap(), writes some poison value, munmap(). The other thread keeps checking the poison value while trapping and ignoring any SIGSEGV. If it's working correctly, the second thread should either get a SIGSEGV or read the poison value. -- Catalin ^ permalink raw reply [flat|nested] 70+ messages in thread
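A user-space reproducer along those lines could look roughly like the sketch below (hypothetical and untested here; names, sizes and the poison pattern are arbitrary, and a stale-TLB read may still return the poison value if the freed page has not yet been reused, so a run that stays silent proves nothing):

/*
 * tlb-munmap-test.c: sketch of the reproducer described above.
 * Build: gcc -O2 -pthread tlb-munmap-test.c -o tlb-munmap-test
 */
#include <pthread.h>
#include <setjmp.h>
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define POISON	0x5a5a5a5aU

static void *probe_addr;		/* fixed address probed by both threads */
static volatile int stop;
static sigjmp_buf segv_env;		/* only the reader thread is expected to fault */

static void segv_handler(int sig)
{
	(void)sig;
	siglongjmp(segv_env, 1);	/* SIGSEGV just means the page is unmapped */
}

static void *mapper(void *arg)
{
	long pagesz = sysconf(_SC_PAGESIZE);

	(void)arg;
	while (!stop) {
		void *p = mmap(probe_addr, pagesz, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
		if (p == MAP_FAILED)
			continue;
		memset(p, 0x5a, pagesz);	/* poison the fresh page */
		munmap(p, pagesz);
	}
	return NULL;
}

static void *reader(void *arg)
{
	(void)arg;
	while (!stop) {
		uint32_t v;

		if (sigsetjmp(segv_env, 1))
			continue;	/* took a fault, which is fine */
		v = *(volatile uint32_t *)probe_addr;
		/*
		 * 0 can be seen on a freshly zero-filled page before the
		 * poison write; anything else means we read through a stale
		 * TLB entry into a page that was freed and reused.
		 */
		if (v != POISON && v != 0) {
			printf("stale data: 0x%08x\n", v);
			stop = 1;
		}
	}
	return NULL;
}

int main(void)
{
	long pagesz = sysconf(_SC_PAGESIZE);
	struct sigaction sa = { .sa_handler = segv_handler };
	pthread_t a, b;

	/* Pick a page-aligned address to fight over, then leave it unmapped. */
	probe_addr = mmap(NULL, pagesz, PROT_NONE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	munmap(probe_addr, pagesz);

	sigemptyset(&sa.sa_mask);
	sigaction(SIGSEGV, &sa, NULL);

	pthread_create(&a, NULL, mapper, NULL);
	pthread_create(&b, NULL, reader, NULL);
	sleep(30);
	stop = 1;
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	return 0;
}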
* Re: TLB and PTE coherency during munmap @ 2013-05-31 1:26 ` Max Filippov 0 siblings, 0 replies; 70+ messages in thread From: Max Filippov @ 2013-05-31 1:26 UTC (permalink / raw) To: Catalin Marinas Cc: linux-arch, linux-mm, linux-xtensa, Chris Zankel, Marc Gauthier On Wed, May 29, 2013 at 2:15 PM, Catalin Marinas <catalin.marinas@arm.com> wrote: > On Wed, May 29, 2013 at 05:15:28AM +0100, Max Filippov wrote: >> On Tue, May 28, 2013 at 6:35 PM, Catalin Marinas <catalin.marinas@arm.com> wrote: >> > On 26 May 2013 03:42, Max Filippov <jcmvbkbc@gmail.com> wrote: >> >> Is it intentional that threads of a process that invoked munmap syscall >> >> can see TLB entries pointing to already freed pages, or it is a bug? >> > >> > If it happens, this would be a bug. It means that a process can access >> > a physical page that has been allocated to something else, possibly >> > kernel data. >> > >> >> I'm talking about zap_pmd_range and zap_pte_range: >> >> >> >> zap_pmd_range >> >> zap_pte_range >> >> arch_enter_lazy_mmu_mode >> >> ptep_get_and_clear_full >> >> tlb_remove_tlb_entry >> >> __tlb_remove_page >> >> arch_leave_lazy_mmu_mode >> >> cond_resched >> >> >> >> With the default arch_{enter,leave}_lazy_mmu_mode, tlb_remove_tlb_entry >> >> and __tlb_remove_page there is a loop in the zap_pte_range that clears >> >> PTEs and frees corresponding pages, but doesn't flush TLB, and >> >> surrounding loop in the zap_pmd_range that calls cond_resched. If a thread >> >> of the same process gets scheduled then it is able to see TLB entries >> >> pointing to already freed physical pages. >> > >> > It looks to me like cond_resched() here introduces a possible bug but >> > it depends on the actual arch code, especially the >> > __tlb_remove_tlb_entry() function. On ARM we record the range in >> > tlb_remove_tlb_entry() and queue the pages to be removed in >> > __tlb_remove_page(). It pretty much acts like tlb_fast_mode() == 0 >> > even for the UP case (which is also needed for hardware speculative >> > TLB loads). The tlb_finish_mmu() takes care of whatever pages are left >> > to be freed. >> > >> > With a dummy __tlb_remove_tlb_entry() and tlb_fast_mode() == 1, >> > cond_resched() in zap_pmd_range() would cause problems. >> >> So, looks like most architectures in the UP configuration should have >> this issue (unless they flush TLB in the switch_mm, even when switching >> to the same mm): > > switch_mm() wouldn't be called if switching to the same mm. You could do Hmm... Strange, but as far as I can tell from the context_switch it would. > it in switch_to() but it's not efficient (or before returning to user > space on the same processor). > > Do you happen to have a user-space test for this? Something like one I only had mtest05 from LTP that triggered TLB/PTE inconsistency, but not anything that would really try to peek at the freed page. I can make such test though. > thread does an mmap(), writes some poison value, munmap(). The other > thread keeps checking the poison value while trapping and ignoring any > SIGSEGV. If it's working correctly, the second thread should either get > a SIGSEGV or read the poison value. -- Thanks. -- Max ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: TLB and PTE coherency during munmap @ 2013-05-31 9:06 ` Catalin Marinas 0 siblings, 0 replies; 70+ messages in thread From: Catalin Marinas @ 2013-05-31 9:06 UTC (permalink / raw) To: Max Filippov Cc: linux-arch, linux-mm, linux-xtensa, Chris Zankel, Marc Gauthier On Fri, May 31, 2013 at 02:26:27AM +0100, Max Filippov wrote: > On Wed, May 29, 2013 at 2:15 PM, Catalin Marinas > <catalin.marinas@arm.com> wrote: > > On Wed, May 29, 2013 at 05:15:28AM +0100, Max Filippov wrote: > >> On Tue, May 28, 2013 at 6:35 PM, Catalin Marinas <catalin.marinas@arm.com> wrote: > >> > On 26 May 2013 03:42, Max Filippov <jcmvbkbc@gmail.com> wrote: > >> >> Is it intentional that threads of a process that invoked munmap syscall > >> >> can see TLB entries pointing to already freed pages, or it is a bug? > >> > > >> > If it happens, this would be a bug. It means that a process can access > >> > a physical page that has been allocated to something else, possibly > >> > kernel data. > >> > > >> >> I'm talking about zap_pmd_range and zap_pte_range: > >> >> > >> >> zap_pmd_range > >> >> zap_pte_range > >> >> arch_enter_lazy_mmu_mode > >> >> ptep_get_and_clear_full > >> >> tlb_remove_tlb_entry > >> >> __tlb_remove_page > >> >> arch_leave_lazy_mmu_mode > >> >> cond_resched > >> >> > >> >> With the default arch_{enter,leave}_lazy_mmu_mode, tlb_remove_tlb_entry > >> >> and __tlb_remove_page there is a loop in the zap_pte_range that clears > >> >> PTEs and frees corresponding pages, but doesn't flush TLB, and > >> >> surrounding loop in the zap_pmd_range that calls cond_resched. If a thread > >> >> of the same process gets scheduled then it is able to see TLB entries > >> >> pointing to already freed physical pages. > >> > > >> > It looks to me like cond_resched() here introduces a possible bug but > >> > it depends on the actual arch code, especially the > >> > __tlb_remove_tlb_entry() function. On ARM we record the range in > >> > tlb_remove_tlb_entry() and queue the pages to be removed in > >> > __tlb_remove_page(). It pretty much acts like tlb_fast_mode() == 0 > >> > even for the UP case (which is also needed for hardware speculative > >> > TLB loads). The tlb_finish_mmu() takes care of whatever pages are left > >> > to be freed. > >> > > >> > With a dummy __tlb_remove_tlb_entry() and tlb_fast_mode() == 1, > >> > cond_resched() in zap_pmd_range() would cause problems. > >> > >> So, looks like most architectures in the UP configuration should have > >> this issue (unless they flush TLB in the switch_mm, even when switching > >> to the same mm): > > > > switch_mm() wouldn't be called if switching to the same mm. You could do > > Hmm... Strange, but as far as I can tell from the context_switch it would. You are right. On ARM we have a prev != next check but that's in the switch_mm() implementation. -- Catalin ^ permalink raw reply [flat|nested] 70+ messages in thread
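For readers following the switch_mm() exchange above, trimmed-down paraphrases of the scheduler path and of ARM's switch_mm() of that era (not verbatim kernel source; locking, lazy-TLB bookkeeping and SMP details omitted) show where the same-mm case is and is not filtered:

/* Paraphrase of context_switch() (kernel/sched/core.c, ~v3.9): for a user
 * task switch_mm() is called unconditionally, even when the next thread
 * belongs to the same mm as the previous one. */
static inline void context_switch(struct rq *rq, struct task_struct *prev,
				  struct task_struct *next)
{
	struct mm_struct *mm = next->mm;
	struct mm_struct *oldmm = prev->active_mm;

	if (!mm) {			/* kernel thread: borrow oldmm */
		next->active_mm = oldmm;
		atomic_inc(&oldmm->mm_count);
		enter_lazy_tlb(oldmm, next);
	} else				/* user task: always called,   */
		switch_mm(oldmm, mm, next);	/* even if oldmm == mm */
	/* ... switch_to(), finish_task_switch() ... */
}

/* Paraphrase of ARM's switch_mm() (arch/arm/include/asm/mmu_context.h):
 * the prev != next filter lives here, in the architecture code. */
static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
			     struct task_struct *tsk)
{
	unsigned int cpu = smp_processor_id();

	if (!cpumask_test_and_set_cpu(cpu, mm_cpumask(next)) || prev != next)
		check_and_switch_context(next, tsk);	/* ASID/TTBR switch */
}

So an architecture could, as suggested earlier, hide a flush for the same-mm case inside its switch_mm(), at the cost of doing work on every context switch rather than only on the munmap path.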
* Re: TLB and PTE coherency during munmap @ 2013-06-03 9:16 ` Max Filippov 0 siblings, 0 replies; 70+ messages in thread From: Max Filippov @ 2013-06-03 9:16 UTC (permalink / raw) To: Catalin Marinas Cc: linux-arch, linux-mm, linux-xtensa, Chris Zankel, Marc Gauthier On Fri, May 31, 2013 at 5:26 AM, Max Filippov <jcmvbkbc@gmail.com> wrote: > On Wed, May 29, 2013 at 2:15 PM, Catalin Marinas > <catalin.marinas@arm.com> wrote: >> On Wed, May 29, 2013 at 05:15:28AM +0100, Max Filippov wrote: >>> On Tue, May 28, 2013 at 6:35 PM, Catalin Marinas <catalin.marinas@arm.com> wrote: >>> > On 26 May 2013 03:42, Max Filippov <jcmvbkbc@gmail.com> wrote: >>> >> Is it intentional that threads of a process that invoked munmap syscall >>> >> can see TLB entries pointing to already freed pages, or it is a bug? >>> > >>> > If it happens, this would be a bug. It means that a process can access >>> > a physical page that has been allocated to something else, possibly >>> > kernel data. >>> > >>> >> I'm talking about zap_pmd_range and zap_pte_range: >>> >> >>> >> zap_pmd_range >>> >> zap_pte_range >>> >> arch_enter_lazy_mmu_mode >>> >> ptep_get_and_clear_full >>> >> tlb_remove_tlb_entry >>> >> __tlb_remove_page >>> >> arch_leave_lazy_mmu_mode >>> >> cond_resched >>> >> >>> >> With the default arch_{enter,leave}_lazy_mmu_mode, tlb_remove_tlb_entry >>> >> and __tlb_remove_page there is a loop in the zap_pte_range that clears >>> >> PTEs and frees corresponding pages, but doesn't flush TLB, and >>> >> surrounding loop in the zap_pmd_range that calls cond_resched. If a thread >>> >> of the same process gets scheduled then it is able to see TLB entries >>> >> pointing to already freed physical pages. >>> > >>> > It looks to me like cond_resched() here introduces a possible bug but >>> > it depends on the actual arch code, especially the >>> > __tlb_remove_tlb_entry() function. On ARM we record the range in >>> > tlb_remove_tlb_entry() and queue the pages to be removed in >>> > __tlb_remove_page(). It pretty much acts like tlb_fast_mode() == 0 >>> > even for the UP case (which is also needed for hardware speculative >>> > TLB loads). The tlb_finish_mmu() takes care of whatever pages are left >>> > to be freed. >>> > >>> > With a dummy __tlb_remove_tlb_entry() and tlb_fast_mode() == 1, >>> > cond_resched() in zap_pmd_range() would cause problems. >>> >>> So, looks like most architectures in the UP configuration should have >>> this issue (unless they flush TLB in the switch_mm, even when switching >>> to the same mm): >> >> switch_mm() wouldn't be called if switching to the same mm. You could do > > Hmm... Strange, but as far as I can tell from the context_switch it would. > >> it in switch_to() but it's not efficient (or before returning to user >> space on the same processor). >> >> Do you happen to have a user-space test for this? Something like one > > I only had mtest05 from LTP that triggered TLB/PTE inconsistency, but > not anything that would really try to peek at the freed page. I can make > such test though. > >> thread does an mmap(), writes some poison value, munmap(). The other >> thread keeps checking the poison value while trapping and ignoring any >> SIGSEGV. If it's working correctly, the second thread should either get >> a SIGSEGV or read the poison value. I've made a number of such tests and had them running for a couple of days. Checking thread never read anything other than poison value so far. -- Thanks. -- Max ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: TLB and PTE coherency during munmap 2013-05-28 14:35 ` Catalin Marinas 2013-05-29 4:15 ` Max Filippov @ 2013-05-29 11:53 ` Vineet Gupta 2013-05-29 12:00 ` Vineet Gupta 2 siblings, 0 replies; 70+ messages in thread From: Vineet Gupta @ 2013-05-29 11:53 UTC (permalink / raw) To: Catalin Marinas Cc: Max Filippov, linux-arch, linux-mm, linux-xtensa, Chris Zankel, Marc Gauthier On 05/28/2013 08:05 PM, Catalin Marinas wrote: > Max, > > On 26 May 2013 03:42, Max Filippov <jcmvbkbc@gmail.com> wrote: >> Hello arch and mm people. >> >> Is it intentional that threads of a process that invoked munmap syscall >> can see TLB entries pointing to already freed pages, or it is a bug? > > If it happens, this would be a bug. It means that a process can access > a physical page that has been allocated to something else, possibly > kernel data. > >> I'm talking about zap_pmd_range and zap_pte_range: >> >> zap_pmd_range >> zap_pte_range >> arch_enter_lazy_mmu_mode >> ptep_get_and_clear_full >> tlb_remove_tlb_entry >> __tlb_remove_page >> arch_leave_lazy_mmu_mode >> cond_resched >> >> With the default arch_{enter,leave}_lazy_mmu_mode, tlb_remove_tlb_entry >> and __tlb_remove_page there is a loop in the zap_pte_range that clears >> PTEs and frees corresponding pages, but doesn't flush TLB, and >> surrounding loop in the zap_pmd_range that calls cond_resched. If a thread >> of the same process gets scheduled then it is able to see TLB entries >> pointing to already freed physical pages. > > It looks to me like cond_resched() here introduces a possible bug but > it depends on the actual arch code, especially the > __tlb_remove_tlb_entry() function. On ARM we record the range in > tlb_remove_tlb_entry() and queue the pages to be removed in > __tlb_remove_page(). It pretty much acts like tlb_fast_mode() == 0 > even for the UP case (which is also needed for hardware speculative > TLB loads). The tlb_finish_mmu() takes care of whatever pages are left > to be freed. > > With a dummy __tlb_remove_tlb_entry() and tlb_fast_mode() == 1, > cond_resched() in zap_pmd_range() would cause problems. > > I think possible workarounds: > > 1. tlb_fast_mode() always returning 0. This might add needless page free batching logic so not very lucrative. > 2. add a tlb_flush_mmu(tlb) before cond_resched() in zap_pmd_range(). For !fullmm flushes it might be no-op on some arches (atleast on ARC) as we use tlb_end_vma() to do TLB range flush. And flushing the entire TLB would be excessive though. Actually zap_pte_range() already has logic to flush the TLB range (if batching runs out of space). Can we re-use that infrastructure to make sure zap_pte_range() does it's share of TLB flushing before returning and going into cond_resched(). However with that, we need to prevent tlb_end_vma()/tlb_finish_mmu() from duplicating the range flush - which can be done by clearing tlb->need_flush. Now simplistically this will cause even the fullmm flushes (simple ASID increment on ARC/ARM..) to become TLB walks to flush the individual entries so we can do this for only for !fullmm, assuming that cond_resched() can potentially cause an exit'ing task's thread to be scheduled in and reuse the entries. Let me go off cook a patch to see if this might work. -Vineet ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: TLB and PTE coherency during munmap @ 2013-05-29 12:00 ` Vineet Gupta 0 siblings, 0 replies; 70+ messages in thread From: Vineet Gupta @ 2013-05-29 12:00 UTC (permalink / raw) To: Catalin Marinas Cc: Max Filippov, linux-arch, linux-mm, linux-xtensa, Chris Zankel, Marc Gauthier [Resending with fixed linux-mm address] On 05/28/2013 08:05 PM, Catalin Marinas wrote: > Max, > > On 26 May 2013 03:42, Max Filippov <jcmvbkbc@gmail.com> wrote: >> Hello arch and mm people. >> >> Is it intentional that threads of a process that invoked munmap syscall >> can see TLB entries pointing to already freed pages, or it is a bug? > > If it happens, this would be a bug. It means that a process can access > a physical page that has been allocated to something else, possibly > kernel data. > >> I'm talking about zap_pmd_range and zap_pte_range: >> >> zap_pmd_range >> zap_pte_range >> arch_enter_lazy_mmu_mode >> ptep_get_and_clear_full >> tlb_remove_tlb_entry >> __tlb_remove_page >> arch_leave_lazy_mmu_mode >> cond_resched >> >> With the default arch_{enter,leave}_lazy_mmu_mode, tlb_remove_tlb_entry >> and __tlb_remove_page there is a loop in the zap_pte_range that clears >> PTEs and frees corresponding pages, but doesn't flush TLB, and >> surrounding loop in the zap_pmd_range that calls cond_resched. If a thread >> of the same process gets scheduled then it is able to see TLB entries >> pointing to already freed physical pages. > > It looks to me like cond_resched() here introduces a possible bug but > it depends on the actual arch code, especially the > __tlb_remove_tlb_entry() function. On ARM we record the range in > tlb_remove_tlb_entry() and queue the pages to be removed in > __tlb_remove_page(). It pretty much acts like tlb_fast_mode() == 0 > even for the UP case (which is also needed for hardware speculative > TLB loads). The tlb_finish_mmu() takes care of whatever pages are left > to be freed. > > With a dummy __tlb_remove_tlb_entry() and tlb_fast_mode() == 1, > cond_resched() in zap_pmd_range() would cause problems. > > I think possible workarounds: > > 1. tlb_fast_mode() always returning 0. This might add needless page free batching logic so not very lucrative. > 2. add a tlb_flush_mmu(tlb) before cond_resched() in zap_pmd_range(). For !fullmm flushes it might be no-op on some arches (atleast on ARC) as we use tlb_end_vma() to do TLB range flush. And flushing the entire TLB would be excessive though. Actually zap_pte_range() already has logic to flush the TLB range (if batching runs out of space). Can we re-use that infrastructure to make sure zap_pte_range() does it's share of TLB flushing before returning and going into cond_resched(). However with that, we need to prevent tlb_end_vma()/tlb_finish_mmu() from duplicating the range flush - which can be done by clearing tlb->need_flush. Now simplistically this will cause even the fullmm flushes (simple ASID increment on ARC/ARM..) to become TLB walks to flush the individual entries so we can do this for only for !fullmm, assuming that cond_resched() can potentially cause an exit'ing task's thread to be scheduled in and reuse the entries. Let me go off cook a patch to see if this might work. -Vineet ^ permalink raw reply [flat|nested] 70+ messages in thread
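To make workaround 2 concrete, the change being considered would look roughly like the sketch below against mm/memory.c of that era (untested, line numbers omitted, surrounding context approximate; the !fullmm check reflects the point above about not penalizing exit-time teardown):

--- a/mm/memory.c
+++ b/mm/memory.c
@@ zap_pmd_range() @@
 		next = zap_pte_range(tlb, vma, pmd, addr, next, details);
 next:
+		/*
+		 * Make sure the TLB is flushed for the PTEs cleared so far
+		 * before cond_resched() can run another thread of this mm,
+		 * so it cannot reach pages that are about to be freed
+		 * through stale TLB entries.
+		 */
+		if (!tlb->fullmm)
+			tlb_flush_mmu(tlb);
 		cond_resched();
 	} while (pmd++, addr = next, addr != end);

Whether tlb_flush_mmu() here amounts to a no-op, a range flush or a full flush depends on the architecture's mmu_gather implementation, which is the trade-off discussed above; the other direction, batching everywhere, is what the "arch, mm: Remove tlb_fast_mode()" patch listed in the thread overview pursues.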
* Re: TLB and PTE coherency during munmap @ 2013-06-07 2:21 George Spelvin 0 siblings, 0 replies; 70+ messages in thread From: George Spelvin @ 2013-06-07 2:21 UTC (permalink / raw) To: peterz; +Cc: linux, linux-kernel

It's a bug in the original code, but as long as you're cleaning it up:

@@ -103,6 +97,7 @@ extern struct ia64_tr_entry *ia64_idtrs[NR_CPUS];
 static inline void
 ia64_tlb_flush_mmu (struct mmu_gather *tlb, unsigned long start, unsigned long end)
 {
+	unsigned long i;
 	unsigned int nr;
 
 	if (!tlb->need_flush)
@@ -141,13 +136,11 @@ ia64_tlb_flush_mmu (struct mmu_gather *tlb, unsigned long start, unsigned long e
 
 	/* lastly, release the freed pages */
 	nr = tlb->nr;
-	if (!tlb_fast_mode(tlb)) {
-		unsigned long i;
-		tlb->nr = 0;
-		tlb->start_addr = ~0UL;
-		for (i = 0; i < nr; ++i)
-			free_page_and_swap_cache(tlb->pages[i]);
-	}
+
+	tlb->nr = 0;
+	tlb->start_addr = ~0UL;
+	for (i = 0; i < nr; ++i)
+		free_page_and_swap_cache(tlb->pages[i]);
 }

How about making that first hunk:

@@ -103,6 +97,6 @@ extern struct ia64_tr_entry *ia64_idtrs[NR_CPUS];
 static inline void
 ia64_tlb_flush_mmu (struct mmu_gather *tlb, unsigned long start, unsigned long end)
 {
-	unsigned int nr;
+	unsigned int nr, i;
 
 	if (!tlb->need_flush)

Because i does not need to be 64 bits; tlb->nr is only 32.

^ permalink raw reply [flat|nested] 70+ messages in thread
end of thread, other threads:[~2013-06-07 2:28 UTC | newest] Thread overview: 70+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2013-05-26 2:42 TLB and PTE coherency during munmap Max Filippov 2013-05-26 2:50 ` Max Filippov 2013-05-26 2:50 ` Max Filippov 2013-05-28 7:10 ` Max Filippov 2013-05-28 7:10 ` Max Filippov 2013-05-29 12:27 ` Peter Zijlstra 2013-05-29 12:27 ` Peter Zijlstra 2013-05-29 12:42 ` Vineet Gupta 2013-05-29 12:42 ` Vineet Gupta 2013-05-29 12:47 ` Peter Zijlstra 2013-05-29 12:47 ` Peter Zijlstra 2013-05-29 17:51 ` Peter Zijlstra 2013-05-29 17:51 ` Peter Zijlstra 2013-05-29 22:04 ` Catalin Marinas 2013-05-29 22:04 ` Catalin Marinas 2013-05-30 6:48 ` Peter Zijlstra 2013-05-30 6:48 ` Peter Zijlstra 2013-05-30 5:04 ` Vineet Gupta 2013-05-30 5:04 ` Vineet Gupta 2013-05-30 6:56 ` Peter Zijlstra 2013-05-30 6:56 ` Peter Zijlstra 2013-05-30 7:00 ` Vineet Gupta 2013-05-30 7:00 ` Vineet Gupta 2013-05-30 11:03 ` Peter Zijlstra 2013-05-30 11:03 ` Peter Zijlstra 2013-05-31 4:09 ` Max Filippov 2013-05-31 4:09 ` Max Filippov 2013-05-31 7:55 ` Peter Zijlstra 2013-05-31 7:55 ` Peter Zijlstra 2013-06-03 9:05 ` Peter Zijlstra 2013-06-03 9:05 ` Peter Zijlstra 2013-06-03 9:16 ` Ingo Molnar 2013-06-03 9:16 ` Ingo Molnar 2013-06-03 10:01 ` Catalin Marinas 2013-06-03 10:01 ` Catalin Marinas 2013-06-03 10:04 ` Peter Zijlstra 2013-06-03 10:04 ` Peter Zijlstra 2013-06-03 10:09 ` Catalin Marinas 2013-06-03 10:09 ` Catalin Marinas 2013-06-04 9:52 ` Peter Zijlstra 2013-06-04 9:52 ` Peter Zijlstra 2013-06-05 0:05 ` Linus Torvalds 2013-06-05 0:05 ` Linus Torvalds 2013-06-05 10:26 ` [PATCH] arch, mm: Remove tlb_fast_mode() Peter Zijlstra 2013-06-05 10:26 ` Peter Zijlstra 2013-05-31 1:40 ` TLB and PTE coherency during munmap Max Filippov 2013-05-31 1:40 ` Max Filippov 2013-05-28 14:34 ` Konrad Rzeszutek Wilk 2013-05-28 14:34 ` Konrad Rzeszutek Wilk 2013-05-29 3:23 ` Max Filippov 2013-05-29 3:23 ` Max Filippov 2013-05-28 15:16 ` Michal Hocko 2013-05-28 15:16 ` Michal Hocko 2013-05-28 15:23 ` Catalin Marinas 2013-05-28 15:23 ` Catalin Marinas 2013-05-28 14:35 ` Catalin Marinas 2013-05-29 4:15 ` Max Filippov 2013-05-29 4:15 ` Max Filippov 2013-05-29 10:15 ` Catalin Marinas 2013-05-29 10:15 ` Catalin Marinas 2013-05-31 1:26 ` Max Filippov 2013-05-31 1:26 ` Max Filippov 2013-05-31 9:06 ` Catalin Marinas 2013-05-31 9:06 ` Catalin Marinas 2013-06-03 9:16 ` Max Filippov 2013-06-03 9:16 ` Max Filippov 2013-05-29 11:53 ` Vineet Gupta 2013-05-29 12:00 ` Vineet Gupta 2013-05-29 12:00 ` Vineet Gupta 2013-06-07 2:21 George Spelvin