From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Shan Hai <haishan.bai@gmail.com>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>,
	paulus@samba.org, tglx@linutronix.de, walken@google.com,
	dhowells@redhat.com, cmetcalf@tilera.com, tony.luck@intel.com,
	akpm@linux-foundation.org, linuxppc-dev@lists.ozlabs.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH 1/1] Fixup write permission of TLB on powerpc e500 core
Date: Mon, 18 Jul 2011 14:01:31 +1000
Message-ID: <1310961691.25044.274.camel@pasglop>
In-Reply-To: <1310944453.25044.262.camel@pasglop>

On Mon, 2011-07-18 at 09:14 +1000, Benjamin Herrenschmidt wrote:

> In fact, with such a flag, we could probably avoid the ifdef entirely, and
> always go toward the PTE fixup path when called in such a fixup case, my gut
> feeling is that this is going to be seldom enough not to hurt x86 measurably
> but we'll have to try it out.
>
> That leads to that even less tested patch:

And here's a version that builds and fixes a bug or two (still not
tested :-)

Shan, can you verify whether that fixes the problem for you ?

I also had a cursory glance at the ARM code and it seems to rely on the
same stuff as embedded powerpc does for dirty/young updates, so in
theory it should exhibit the same problem.

I suspect the scenario is rare enough in practice in embedded workloads
that nobody noticed until now.

Cheers,
Ben.

mm/futex: Fix use of gup() to "fixup" failing atomic user accesses

The futex code uses atomic (page fault disabled) accesses to user space,
and when they fail, uses get_user_pages() to "fixup" the PTE and try again.

However, on archs with SW tracking of the dirty and young bits, this will
not work properly as neither of the above will perform the necessary fixup
of those bits.

There's also a possible corner case with archs that rely on
handle_pte_fault() to invalidate the TLB for "spurious" faults (though
I don't know which arch actually needs that). Those would break the
same way.

This fixes it by factoring out the fixup code from handle_pte_fault()
into a separate function, and using it from within gup as well, whenever
the FOLL_FIXFAULT flag has been passed to it. The futex code is modified
to pass that flag.

This doesn't change the "normal" gup case (and thus avoids the overhead
of doing that tracking).

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9670f71..8a76694 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1546,6 +1546,7 @@ struct page *follow_page(struct vm_area_struct *, unsigned long address,
 #define FOLL_MLOCK	0x40	/* mark page as mlocked */
 #define FOLL_SPLIT	0x80	/* don't return transhuge pages, split them */
 #define FOLL_HWPOISON	0x100	/* check page is hwpoisoned */
+#define FOLL_FIXFAULT	0x200	/* fixup after a fault (PTE dirty/young upd) */
 
 typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
 			void *data);
diff --git a/kernel/futex.c b/kernel/futex.c
index fe28dc2..02adff7 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -355,8 +355,8 @@ static int fault_in_user_writeable(u32 __user *uaddr)
 	int ret;
 
 	down_read(&mm->mmap_sem);
-	ret = get_user_pages(current, mm, (unsigned long)uaddr,
-			     1, 1, 0, NULL, NULL);
+	ret = __get_user_pages(current, mm, (unsigned long)uaddr, 1,
+			       FOLL_WRITE | FOLL_FIXFAULT, NULL, NULL, NULL);
 	up_read(&mm->mmap_sem);
 
 	return ret < 0 ? ret : 0;
diff --git a/mm/memory.c b/mm/memory.c
index 40b7531..3c4d502 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1419,6 +1419,29 @@ int zap_vma_ptes(struct vm_area_struct *vma, unsigned long address,
 }
 EXPORT_SYMBOL_GPL(zap_vma_ptes);
 
+static void handle_pte_sw_young_dirty(struct vm_area_struct *vma,
+				      unsigned long address,
+				      pte_t *ptep, int write)
+{
+	pte_t entry = *ptep;
+
+	if (write)
+		entry = pte_mkdirty(entry);
+	entry = pte_mkyoung(entry);
+	if (ptep_set_access_flags(vma, address, ptep, entry, write)) {
+		update_mmu_cache(vma, address, ptep);
+	} else {
+		/*
+		 * This is needed only for protection faults but the arch code
+		 * is not yet telling us if this is a protection fault or not.
+		 * This still avoids useless tlb flushes for .text page faults
+		 * with threads.
+		 */
+		if (write)
+			flush_tlb_fix_spurious_fault(vma, address);
+	}
+}
+
 /**
  * follow_page - look up a page descriptor from a user-virtual address
  * @vma: vm_area_struct mapping @address
@@ -1514,6 +1537,10 @@ split_fallthrough:
 
 	if (flags & FOLL_GET)
 		get_page(page);
+
+	if (flags & FOLL_FIXFAULT)
+		handle_pte_sw_young_dirty(vma, address, ptep,
+					  flags & FOLL_WRITE);
 	if (flags & FOLL_TOUCH) {
 		if ((flags & FOLL_WRITE) &&
 		    !pte_dirty(pte) && !PageDirty(page))
@@ -1525,6 +1552,7 @@ split_fallthrough:
 		 */
 		mark_page_accessed(page);
 	}
+
 	if ((flags & FOLL_MLOCK) && (vma->vm_flags & VM_LOCKED)) {
 		/*
 		 * The preliminary mapping check is mainly to avoid the
@@ -3358,21 +3386,8 @@ int handle_pte_fault(struct mm_struct *mm,
 		if (!pte_write(entry))
 			return do_wp_page(mm, vma, address,
 					pte, pmd, ptl, entry);
-		entry = pte_mkdirty(entry);
-	}
-	entry = pte_mkyoung(entry);
-	if (ptep_set_access_flags(vma, address, pte, entry, flags & FAULT_FLAG_WRITE)) {
-		update_mmu_cache(vma, address, pte);
-	} else {
-		/*
-		 * This is needed only for protection faults but the arch code
-		 * is not yet telling us if this is a protection fault or not.
-		 * This still avoids useless tlb flushes for .text page faults
-		 * with threads.
-		 */
-		if (flags & FAULT_FLAG_WRITE)
-			flush_tlb_fix_spurious_fault(vma, address);
 	}
+	handle_pte_sw_young_dirty(vma, address, pte, flags & FAULT_FLAG_WRITE);
 unlock:
 	pte_unmap_unlock(pte, ptl);
 	return 0;
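
The failure mode described in the commit message can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not kernel code: every toy_* and TOY_* name is hypothetical, the "MMU" is a single integer, and the model assumes that an atomic user write succeeds only once both the dirty and young bits are set, which is the behaviour the message attributes to archs with SW tracking of those bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only: every toy_* / TOY_* name is hypothetical, not a kernel API. */
typedef struct { uint32_t val; } toy_pte_t;

#define TOY_PRESENT 0x1u
#define TOY_WRITE   0x2u	/* mapping permits writes */
#define TOY_DIRTY   0x4u	/* maintained by software in this model */
#define TOY_YOUNG   0x8u	/* maintained by software in this model */

/* These return a new value rather than modifying their argument in place. */
static toy_pte_t toy_mkdirty(toy_pte_t pte) { pte.val |= TOY_DIRTY; return pte; }
static toy_pte_t toy_mkyoung(toy_pte_t pte) { pte.val |= TOY_YOUNG; return pte; }

/*
 * Models an atomic (page faults disabled) user write: in this model the
 * access only succeeds once young and dirty have been set by software.
 */
static bool toy_atomic_write_ok(const toy_pte_t *ptep)
{
	const uint32_t need = TOY_PRESENT | TOY_WRITE | TOY_DIRTY | TOY_YOUNG;

	assert(ptep != NULL);
	return (ptep->val & need) == need;
}

/* Models the fixup: set dirty (for writes) and young, then store the PTE. */
static void toy_fixup_young_dirty(toy_pte_t *ptep, bool write)
{
	toy_pte_t entry = *ptep;

	if (write)
		entry = toy_mkdirty(entry);
	entry = toy_mkyoung(entry);
	*ptep = entry;
}

/*
 * Models "fault in, then retry". The page is already present and
 * writable, so a plain lookup leaves the PTE untouched; only the fixup
 * variant updates it. Returns whether the retried write succeeds.
 */
static bool toy_fault_in_and_retry(toy_pte_t *ptep, bool fixfault)
{
	if (toy_atomic_write_ok(ptep))
		return true;
	if (fixfault)
		toy_fixup_young_dirty(ptep, true);
	return toy_atomic_write_ok(ptep);
}
```

In this model, calling toy_fault_in_and_retry() on a present, writable entry with neither bit set keeps failing when fixfault is false and succeeds when it is true, which mirrors why the futex path in the patch passes FOLL_FIXFAULT.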