From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1044597AbdDWJw6 (ORCPT ); Sun, 23 Apr 2017 05:52:58 -0400 Received: from mail-wr0-f196.google.com ([209.85.128.196]:36616 "EHLO mail-wr0-f196.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756391AbdDWJwt (ORCPT ); Sun, 23 Apr 2017 05:52:49 -0400 Date: Sun, 23 Apr 2017 11:52:44 +0200 From: Ingo Molnar To: Dan Williams Cc: "Kirill A. Shutemov" , Catalin Marinas , "Aneesh Kumar K.V" , steve.capper@linaro.org, Thomas Gleixner , Peter Zijlstra , Linux Kernel Mailing List , Andrew Morton , "Kirill A. Shutemov" , "H. Peter Anvin" , Dave Hansen , Borislav Petkov , Rik van Riel , dann.frazier@canonical.com, Linus Torvalds , Michal Hocko , linux-tip-commits@vger.kernel.org Subject: [PATCH] Revert "x86/mm/gup: Switch GUP to the generic get_user_page_fast() implementation" Message-ID: <20170423095244.7brzd7wcpag7es2c@gmail.com> References: <20170421141628.ruxxnq54jvuhiqnz@node.shutemov.name> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: NeoMutt/20170113 (1.7.2) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org * Dan Williams wrote: > > I can't find the issue either. > > > > Is it something reproducible without hardware? In KVM? > > You can do it in KVM, just boot with the memmap=ss!nn parameter to > simulate pmem. In this case I'm booting with memmap=4G!8G, you should > also specify "nokaslr". > > > If yes, could you share the test-case? > > Yes, run: > > ./autogen.sh > ./configure CFLAGS='-g -O0' --prefix=/usr --sysconfdir=/etc > --libdir=/usr/lib64 > make TESTS=device-dax check > > ...from a checkout of the ndctl project: > > https://github.com/pmem/ndctl > > Let me know if you run into any problems getting the test to build or run. > > > > >> [ 35.423841] WARNING: CPU: 8 PID: 245 at lib/percpu-refcount.c:155 > >> percpu_ref_switch_to_atomic_rcu+0x1f5/0x200 > >> [ 35.425328] percpu ref (dax_pmem_percpu_release [dax_pmem]) <= 0 > >> (0) after switching to atomic Since the bug appears to be pretty severe (GUP race causing possible memory corruption that could affect a lot of code), and the merge window is awfully close, plus the reproducer appears to be pretty quick, I've queued up the revert below for the time being, to not block the rest of the pending tip:x86/mm changes. I'd have loved to see this conversion in v4.12, but not at any cost. Thanks, Ingo ==================> >>From 6dd29b3df975582ef429b5b93c899e6575785940 Mon Sep 17 00:00:00 2001 From: Ingo Molnar Date: Sun, 23 Apr 2017 11:37:17 +0200 Subject: [PATCH] Revert "x86/mm/gup: Switch GUP to the generic get_user_page_fast() implementation" This reverts commit 2947ba054a4dabbd82848728d765346886050029. Dan Williams reported dax-pmem kernel warnings with the following signature: WARNING: CPU: 8 PID: 245 at lib/percpu-refcount.c:155 percpu_ref_switch_to_atomic_rcu+0x1f5/0x200 percpu ref (dax_pmem_percpu_release [dax_pmem]) <= 0 (0) after switching to atomic ... and bisected it to this commit, which suggests possible memory corruption caused by the x86 fast-GUP conversion. He also pointed out: " This is similar to the backtrace when we were not properly handling pud faults and was fixed with this commit: 220ced1676c4 "mm: fix get_user_pages() vs device-dax pud mappings" I've found some missing _devmap checks in the generic get_user_pages_fast() path, but this does not fix the regression [...] " So given that there are known bugs, and a pretty robust looking bisection points to this commit suggesting that are unknown bugs in the conversion as well, revert it for the time being - we'll re-try in v4.13. Reported-by: Dan Williams Cc: Andrew Morton Cc: Borislav Petkov Cc: Catalin Marinas Cc: Kirill A. Shutemov Cc: Linus Torvalds Cc: Michal Hocko Cc: Peter Zijlstra Cc: Rik van Riel Cc: Thomas Gleixner Cc: aneesh.kumar@linux.vnet.ibm.com Cc: dann.frazier@canonical.com Cc: dave.hansen@intel.com Cc: steve.capper@linaro.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar --- arch/arm/Kconfig | 2 +- arch/arm64/Kconfig | 2 +- arch/powerpc/Kconfig | 2 +- arch/x86/Kconfig | 3 - arch/x86/include/asm/mmu_context.h | 12 + arch/x86/include/asm/pgtable-3level.h | 47 ---- arch/x86/include/asm/pgtable.h | 53 ---- arch/x86/include/asm/pgtable_64.h | 16 +- arch/x86/mm/Makefile | 2 +- arch/x86/mm/gup.c | 496 ++++++++++++++++++++++++++++++++++ mm/Kconfig | 2 +- mm/gup.c | 10 +- 12 files changed, 519 insertions(+), 128 deletions(-) diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig index 454fadd077ad..0d4e71b42c77 100644 --- a/arch/arm/Kconfig +++ b/arch/arm/Kconfig @@ -1666,7 +1666,7 @@ config ARCH_SELECT_MEMORY_MODEL config HAVE_ARCH_PFN_VALID def_bool ARCH_HAS_HOLES_MEMORYMODEL || !SPARSEMEM -config HAVE_GENERIC_GUP +config HAVE_GENERIC_RCU_GUP def_bool y depends on ARM_LPAE diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index af62bf79721a..3741859765cf 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -205,7 +205,7 @@ config GENERIC_CALIBRATE_DELAY config ZONE_DMA def_bool y -config HAVE_GENERIC_GUP +config HAVE_GENERIC_RCU_GUP def_bool y config ARCH_DMA_ADDR_T_64BIT diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 3a716b2dcde9..97a8bc8a095c 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -135,7 +135,7 @@ config PPC select HAVE_FUNCTION_GRAPH_TRACER select HAVE_FUNCTION_TRACER select HAVE_GCC_PLUGINS - select HAVE_GENERIC_GUP + select HAVE_GENERIC_RCU_GUP select HAVE_HW_BREAKPOINT if PERF_EVENTS && (PPC_BOOK3S || PPC_8xx) select HAVE_IDE select HAVE_IOREMAP_PROT diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index a641b900fc1f..2bde14451e54 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -2789,9 +2789,6 @@ config X86_DMA_REMAP bool depends on STA2X11 -config HAVE_GENERIC_GUP - def_bool y - source "net/Kconfig" source "drivers/Kconfig" diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h index 6e933d2d88d9..68b329d77b3a 100644 --- a/arch/x86/include/asm/mmu_context.h +++ b/arch/x86/include/asm/mmu_context.h @@ -220,6 +220,18 @@ static inline int vma_pkey(struct vm_area_struct *vma) } #endif +static inline bool __pkru_allows_pkey(u16 pkey, bool write) +{ + u32 pkru = read_pkru(); + + if (!__pkru_allows_read(pkru, pkey)) + return false; + if (write && !__pkru_allows_write(pkru, pkey)) + return false; + + return true; +} + /* * We only want to enforce protection keys on the current process * because we effectively have no access to PKRU for other diff --git a/arch/x86/include/asm/pgtable-3level.h b/arch/x86/include/asm/pgtable-3level.h index c8821bab938f..50d35e3185f5 100644 --- a/arch/x86/include/asm/pgtable-3level.h +++ b/arch/x86/include/asm/pgtable-3level.h @@ -212,51 +212,4 @@ static inline pud_t native_pudp_get_and_clear(pud_t *pudp) #define __pte_to_swp_entry(pte) ((swp_entry_t){ (pte).pte_high }) #define __swp_entry_to_pte(x) ((pte_t){ { .pte_high = (x).val } }) -#define gup_get_pte gup_get_pte -/* - * WARNING: only to be used in the get_user_pages_fast() implementation. - * - * With get_user_pages_fast(), we walk down the pagetables without taking - * any locks. For this we would like to load the pointers atomically, - * but that is not possible (without expensive cmpxchg8b) on PAE. What - * we do have is the guarantee that a PTE will only either go from not - * present to present, or present to not present or both -- it will not - * switch to a completely different present page without a TLB flush in - * between; something that we are blocking by holding interrupts off. - * - * Setting ptes from not present to present goes: - * - * ptep->pte_high = h; - * smp_wmb(); - * ptep->pte_low = l; - * - * And present to not present goes: - * - * ptep->pte_low = 0; - * smp_wmb(); - * ptep->pte_high = 0; - * - * We must ensure here that the load of pte_low sees 'l' iff pte_high - * sees 'h'. We load pte_high *after* loading pte_low, which ensures we - * don't see an older value of pte_high. *Then* we recheck pte_low, - * which ensures that we haven't picked up a changed pte high. We might - * have gotten rubbish values from pte_low and pte_high, but we are - * guaranteed that pte_low will not have the present bit set *unless* - * it is 'l'. Because get_user_pages_fast() only operates on present ptes - * we're safe. - */ -static inline pte_t gup_get_pte(pte_t *ptep) -{ - pte_t pte; - - do { - pte.pte_low = ptep->pte_low; - smp_rmb(); - pte.pte_high = ptep->pte_high; - smp_rmb(); - } while (unlikely(pte.pte_low != ptep->pte_low)); - - return pte; -} - #endif /* _ASM_X86_PGTABLE_3LEVEL_H */ diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index 942482ac36a8..f5af95a0c6b8 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -244,11 +244,6 @@ static inline int pud_devmap(pud_t pud) return 0; } #endif - -static inline int pgd_devmap(pgd_t pgd) -{ - return 0; -} #endif #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ @@ -1190,54 +1185,6 @@ static inline u16 pte_flags_pkey(unsigned long pte_flags) #endif } -static inline bool __pkru_allows_pkey(u16 pkey, bool write) -{ - u32 pkru = read_pkru(); - - if (!__pkru_allows_read(pkru, pkey)) - return false; - if (write && !__pkru_allows_write(pkru, pkey)) - return false; - - return true; -} - -/* - * 'pteval' can come from a PTE, PMD or PUD. We only check - * _PAGE_PRESENT, _PAGE_USER, and _PAGE_RW in here which are the - * same value on all 3 types. - */ -static inline bool __pte_access_permitted(unsigned long pteval, bool write) -{ - unsigned long need_pte_bits = _PAGE_PRESENT|_PAGE_USER; - - if (write) - need_pte_bits |= _PAGE_RW; - - if ((pteval & need_pte_bits) != need_pte_bits) - return 0; - - return __pkru_allows_pkey(pte_flags_pkey(pteval), write); -} - -#define pte_access_permitted pte_access_permitted -static inline bool pte_access_permitted(pte_t pte, bool write) -{ - return __pte_access_permitted(pte_val(pte), write); -} - -#define pmd_access_permitted pmd_access_permitted -static inline bool pmd_access_permitted(pmd_t pmd, bool write) -{ - return __pte_access_permitted(pmd_val(pmd), write); -} - -#define pud_access_permitted pud_access_permitted -static inline bool pud_access_permitted(pud_t pud, bool write) -{ - return __pte_access_permitted(pud_val(pud), write); -} - #include #endif /* __ASSEMBLY__ */ diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h index 12ea31274eb6..9991224f6238 100644 --- a/arch/x86/include/asm/pgtable_64.h +++ b/arch/x86/include/asm/pgtable_64.h @@ -227,20 +227,6 @@ extern void cleanup_highmap(void); extern void init_extra_mapping_uc(unsigned long phys, unsigned long size); extern void init_extra_mapping_wb(unsigned long phys, unsigned long size); -#define gup_fast_permitted gup_fast_permitted -static inline bool gup_fast_permitted(unsigned long start, int nr_pages, - int write) -{ - unsigned long len, end; - - len = (unsigned long)nr_pages << PAGE_SHIFT; - end = start + len; - if (end < start) - return false; - if (end >> __VIRTUAL_MASK_SHIFT) - return false; - return true; -} - #endif /* !__ASSEMBLY__ */ + #endif /* _ASM_X86_PGTABLE_64_H */ diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile index 0fbdcb64f9f8..96d2b847e09e 100644 --- a/arch/x86/mm/Makefile +++ b/arch/x86/mm/Makefile @@ -2,7 +2,7 @@ KCOV_INSTRUMENT_tlb.o := n obj-y := init.o init_$(BITS).o fault.o ioremap.o extable.o pageattr.o mmap.o \ - pat.o pgtable.o physaddr.o setup_nx.o tlb.o + pat.o pgtable.o physaddr.o gup.o setup_nx.o tlb.o # Make sure __phys_addr has no stackprotector nostackp := $(call cc-option, -fno-stack-protector) diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c new file mode 100644 index 000000000000..456dfdfd2249 --- /dev/null +++ b/arch/x86/mm/gup.c @@ -0,0 +1,496 @@ +/* + * Lockless get_user_pages_fast for x86 + * + * Copyright (C) 2008 Nick Piggin + * Copyright (C) 2008 Novell Inc. + */ +#include +#include +#include +#include +#include +#include + +#include +#include + +static inline pte_t gup_get_pte(pte_t *ptep) +{ +#ifndef CONFIG_X86_PAE + return READ_ONCE(*ptep); +#else + /* + * With get_user_pages_fast, we walk down the pagetables without taking + * any locks. For this we would like to load the pointers atomically, + * but that is not possible (without expensive cmpxchg8b) on PAE. What + * we do have is the guarantee that a pte will only either go from not + * present to present, or present to not present or both -- it will not + * switch to a completely different present page without a TLB flush in + * between; something that we are blocking by holding interrupts off. + * + * Setting ptes from not present to present goes: + * ptep->pte_high = h; + * smp_wmb(); + * ptep->pte_low = l; + * + * And present to not present goes: + * ptep->pte_low = 0; + * smp_wmb(); + * ptep->pte_high = 0; + * + * We must ensure here that the load of pte_low sees l iff pte_high + * sees h. We load pte_high *after* loading pte_low, which ensures we + * don't see an older value of pte_high. *Then* we recheck pte_low, + * which ensures that we haven't picked up a changed pte high. We might + * have got rubbish values from pte_low and pte_high, but we are + * guaranteed that pte_low will not have the present bit set *unless* + * it is 'l'. And get_user_pages_fast only operates on present ptes, so + * we're safe. + * + * gup_get_pte should not be used or copied outside gup.c without being + * very careful -- it does not atomically load the pte or anything that + * is likely to be useful for you. + */ + pte_t pte; + +retry: + pte.pte_low = ptep->pte_low; + smp_rmb(); + pte.pte_high = ptep->pte_high; + smp_rmb(); + if (unlikely(pte.pte_low != ptep->pte_low)) + goto retry; + + return pte; +#endif +} + +static void undo_dev_pagemap(int *nr, int nr_start, struct page **pages) +{ + while ((*nr) - nr_start) { + struct page *page = pages[--(*nr)]; + + ClearPageReferenced(page); + put_page(page); + } +} + +/* + * 'pteval' can come from a pte, pmd, pud or p4d. We only check + * _PAGE_PRESENT, _PAGE_USER, and _PAGE_RW in here which are the + * same value on all 4 types. + */ +static inline int pte_allows_gup(unsigned long pteval, int write) +{ + unsigned long need_pte_bits = _PAGE_PRESENT|_PAGE_USER; + + if (write) + need_pte_bits |= _PAGE_RW; + + if ((pteval & need_pte_bits) != need_pte_bits) + return 0; + + /* Check memory protection keys permissions. */ + if (!__pkru_allows_pkey(pte_flags_pkey(pteval), write)) + return 0; + + return 1; +} + +/* + * The performance critical leaf functions are made noinline otherwise gcc + * inlines everything into a single function which results in too much + * register pressure. + */ +static noinline int gup_pte_range(pmd_t pmd, unsigned long addr, + unsigned long end, int write, struct page **pages, int *nr) +{ + struct dev_pagemap *pgmap = NULL; + int nr_start = *nr, ret = 0; + pte_t *ptep, *ptem; + + /* + * Keep the original mapped PTE value (ptem) around since we + * might increment ptep off the end of the page when finishing + * our loop iteration. + */ + ptem = ptep = pte_offset_map(&pmd, addr); + do { + pte_t pte = gup_get_pte(ptep); + struct page *page; + + /* Similar to the PMD case, NUMA hinting must take slow path */ + if (pte_protnone(pte)) + break; + + if (!pte_allows_gup(pte_val(pte), write)) + break; + + if (pte_devmap(pte)) { + pgmap = get_dev_pagemap(pte_pfn(pte), pgmap); + if (unlikely(!pgmap)) { + undo_dev_pagemap(nr, nr_start, pages); + break; + } + } else if (pte_special(pte)) + break; + + VM_BUG_ON(!pfn_valid(pte_pfn(pte))); + page = pte_page(pte); + get_page(page); + put_dev_pagemap(pgmap); + SetPageReferenced(page); + pages[*nr] = page; + (*nr)++; + + } while (ptep++, addr += PAGE_SIZE, addr != end); + if (addr == end) + ret = 1; + pte_unmap(ptem); + + return ret; +} + +static inline void get_head_page_multiple(struct page *page, int nr) +{ + VM_BUG_ON_PAGE(page != compound_head(page), page); + VM_BUG_ON_PAGE(page_count(page) == 0, page); + page_ref_add(page, nr); + SetPageReferenced(page); +} + +static int __gup_device_huge(unsigned long pfn, unsigned long addr, + unsigned long end, struct page **pages, int *nr) +{ + int nr_start = *nr; + struct dev_pagemap *pgmap = NULL; + + do { + struct page *page = pfn_to_page(pfn); + + pgmap = get_dev_pagemap(pfn, pgmap); + if (unlikely(!pgmap)) { + undo_dev_pagemap(nr, nr_start, pages); + return 0; + } + SetPageReferenced(page); + pages[*nr] = page; + get_page(page); + put_dev_pagemap(pgmap); + (*nr)++; + pfn++; + } while (addr += PAGE_SIZE, addr != end); + return 1; +} + +static int __gup_device_huge_pmd(pmd_t pmd, unsigned long addr, + unsigned long end, struct page **pages, int *nr) +{ + unsigned long fault_pfn; + + fault_pfn = pmd_pfn(pmd) + ((addr & ~PMD_MASK) >> PAGE_SHIFT); + return __gup_device_huge(fault_pfn, addr, end, pages, nr); +} + +static int __gup_device_huge_pud(pud_t pud, unsigned long addr, + unsigned long end, struct page **pages, int *nr) +{ + unsigned long fault_pfn; + + fault_pfn = pud_pfn(pud) + ((addr & ~PUD_MASK) >> PAGE_SHIFT); + return __gup_device_huge(fault_pfn, addr, end, pages, nr); +} + +static noinline int gup_huge_pmd(pmd_t pmd, unsigned long addr, + unsigned long end, int write, struct page **pages, int *nr) +{ + struct page *head, *page; + int refs; + + if (!pte_allows_gup(pmd_val(pmd), write)) + return 0; + + VM_BUG_ON(!pfn_valid(pmd_pfn(pmd))); + if (pmd_devmap(pmd)) + return __gup_device_huge_pmd(pmd, addr, end, pages, nr); + + /* hugepages are never "special" */ + VM_BUG_ON(pmd_flags(pmd) & _PAGE_SPECIAL); + + refs = 0; + head = pmd_page(pmd); + page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT); + do { + VM_BUG_ON_PAGE(compound_head(page) != head, page); + pages[*nr] = page; + (*nr)++; + page++; + refs++; + } while (addr += PAGE_SIZE, addr != end); + get_head_page_multiple(head, refs); + + return 1; +} + +static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end, + int write, struct page **pages, int *nr) +{ + unsigned long next; + pmd_t *pmdp; + + pmdp = pmd_offset(&pud, addr); + do { + pmd_t pmd = *pmdp; + + next = pmd_addr_end(addr, end); + if (pmd_none(pmd)) + return 0; + if (unlikely(pmd_large(pmd) || !pmd_present(pmd))) { + /* + * NUMA hinting faults need to be handled in the GUP + * slowpath for accounting purposes and so that they + * can be serialised against THP migration. + */ + if (pmd_protnone(pmd)) + return 0; + if (!gup_huge_pmd(pmd, addr, next, write, pages, nr)) + return 0; + } else { + if (!gup_pte_range(pmd, addr, next, write, pages, nr)) + return 0; + } + } while (pmdp++, addr = next, addr != end); + + return 1; +} + +static noinline int gup_huge_pud(pud_t pud, unsigned long addr, + unsigned long end, int write, struct page **pages, int *nr) +{ + struct page *head, *page; + int refs; + + if (!pte_allows_gup(pud_val(pud), write)) + return 0; + + VM_BUG_ON(!pfn_valid(pud_pfn(pud))); + if (pud_devmap(pud)) + return __gup_device_huge_pud(pud, addr, end, pages, nr); + + /* hugepages are never "special" */ + VM_BUG_ON(pud_flags(pud) & _PAGE_SPECIAL); + + refs = 0; + head = pud_page(pud); + page = head + ((addr & ~PUD_MASK) >> PAGE_SHIFT); + do { + VM_BUG_ON_PAGE(compound_head(page) != head, page); + pages[*nr] = page; + (*nr)++; + page++; + refs++; + } while (addr += PAGE_SIZE, addr != end); + get_head_page_multiple(head, refs); + + return 1; +} + +static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end, + int write, struct page **pages, int *nr) +{ + unsigned long next; + pud_t *pudp; + + pudp = pud_offset(&p4d, addr); + do { + pud_t pud = *pudp; + + next = pud_addr_end(addr, end); + if (pud_none(pud)) + return 0; + if (unlikely(pud_large(pud))) { + if (!gup_huge_pud(pud, addr, next, write, pages, nr)) + return 0; + } else { + if (!gup_pmd_range(pud, addr, next, write, pages, nr)) + return 0; + } + } while (pudp++, addr = next, addr != end); + + return 1; +} + +static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end, + int write, struct page **pages, int *nr) +{ + unsigned long next; + p4d_t *p4dp; + + p4dp = p4d_offset(&pgd, addr); + do { + p4d_t p4d = *p4dp; + + next = p4d_addr_end(addr, end); + if (p4d_none(p4d)) + return 0; + BUILD_BUG_ON(p4d_large(p4d)); + if (!gup_pud_range(p4d, addr, next, write, pages, nr)) + return 0; + } while (p4dp++, addr = next, addr != end); + + return 1; +} + +/* + * Like get_user_pages_fast() except its IRQ-safe in that it won't fall + * back to the regular GUP. + */ +int __get_user_pages_fast(unsigned long start, int nr_pages, int write, + struct page **pages) +{ + struct mm_struct *mm = current->mm; + unsigned long addr, len, end; + unsigned long next; + unsigned long flags; + pgd_t *pgdp; + int nr = 0; + + start &= PAGE_MASK; + addr = start; + len = (unsigned long) nr_pages << PAGE_SHIFT; + end = start + len; + if (unlikely(!access_ok(write ? VERIFY_WRITE : VERIFY_READ, + (void __user *)start, len))) + return 0; + + /* + * XXX: batch / limit 'nr', to avoid large irq off latency + * needs some instrumenting to determine the common sizes used by + * important workloads (eg. DB2), and whether limiting the batch size + * will decrease performance. + * + * It seems like we're in the clear for the moment. Direct-IO is + * the main guy that batches up lots of get_user_pages, and even + * they are limited to 64-at-a-time which is not so many. + */ + /* + * This doesn't prevent pagetable teardown, but does prevent + * the pagetables and pages from being freed on x86. + * + * So long as we atomically load page table pointers versus teardown + * (which we do on x86, with the above PAE exception), we can follow the + * address down to the the page and take a ref on it. + */ + local_irq_save(flags); + pgdp = pgd_offset(mm, addr); + do { + pgd_t pgd = *pgdp; + + next = pgd_addr_end(addr, end); + if (pgd_none(pgd)) + break; + if (!gup_p4d_range(pgd, addr, next, write, pages, &nr)) + break; + } while (pgdp++, addr = next, addr != end); + local_irq_restore(flags); + + return nr; +} + +/** + * get_user_pages_fast() - pin user pages in memory + * @start: starting user address + * @nr_pages: number of pages from start to pin + * @write: whether pages will be written to + * @pages: array that receives pointers to the pages pinned. + * Should be at least nr_pages long. + * + * Attempt to pin user pages in memory without taking mm->mmap_sem. + * If not successful, it will fall back to taking the lock and + * calling get_user_pages(). + * + * Returns number of pages pinned. This may be fewer than the number + * requested. If nr_pages is 0 or negative, returns 0. If no pages + * were pinned, returns -errno. + */ +int get_user_pages_fast(unsigned long start, int nr_pages, int write, + struct page **pages) +{ + struct mm_struct *mm = current->mm; + unsigned long addr, len, end; + unsigned long next; + pgd_t *pgdp; + int nr = 0; + + start &= PAGE_MASK; + addr = start; + len = (unsigned long) nr_pages << PAGE_SHIFT; + + end = start + len; + if (end < start) + goto slow_irqon; + +#ifdef CONFIG_X86_64 + if (end >> __VIRTUAL_MASK_SHIFT) + goto slow_irqon; +#endif + + /* + * XXX: batch / limit 'nr', to avoid large irq off latency + * needs some instrumenting to determine the common sizes used by + * important workloads (eg. DB2), and whether limiting the batch size + * will decrease performance. + * + * It seems like we're in the clear for the moment. Direct-IO is + * the main guy that batches up lots of get_user_pages, and even + * they are limited to 64-at-a-time which is not so many. + */ + /* + * This doesn't prevent pagetable teardown, but does prevent + * the pagetables and pages from being freed on x86. + * + * So long as we atomically load page table pointers versus teardown + * (which we do on x86, with the above PAE exception), we can follow the + * address down to the the page and take a ref on it. + */ + local_irq_disable(); + pgdp = pgd_offset(mm, addr); + do { + pgd_t pgd = *pgdp; + + next = pgd_addr_end(addr, end); + if (pgd_none(pgd)) + goto slow; + if (!gup_p4d_range(pgd, addr, next, write, pages, &nr)) + goto slow; + } while (pgdp++, addr = next, addr != end); + local_irq_enable(); + + VM_BUG_ON(nr != (end - start) >> PAGE_SHIFT); + return nr; + + { + int ret; + +slow: + local_irq_enable(); +slow_irqon: + /* Try to get the remaining pages with get_user_pages */ + start += nr << PAGE_SHIFT; + pages += nr; + + ret = get_user_pages_unlocked(start, + (end - start) >> PAGE_SHIFT, + pages, write ? FOLL_WRITE : 0); + + /* Have to be a bit careful with return values */ + if (nr > 0) { + if (ret < 0) + ret = nr; + else + ret += nr; + } + + return ret; + } +} diff --git a/mm/Kconfig b/mm/Kconfig index c89f472b658c..9b8fccb969dc 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -137,7 +137,7 @@ config HAVE_MEMBLOCK_NODE_MAP config HAVE_MEMBLOCK_PHYS_MAP bool -config HAVE_GENERIC_GUP +config HAVE_GENERIC_RCU_GUP bool config ARCH_DISCARD_MEMBLOCK diff --git a/mm/gup.c b/mm/gup.c index 2559a3987de7..527ec2c6cca3 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -1155,7 +1155,7 @@ struct page *get_dump_page(unsigned long addr) #endif /* CONFIG_ELF_CORE */ /* - * Generic Fast GUP + * Generic RCU Fast GUP * * get_user_pages_fast attempts to pin user pages by walking the page * tables directly and avoids taking locks. Thus the walker needs to be @@ -1176,8 +1176,8 @@ struct page *get_dump_page(unsigned long addr) * Before activating this code, please be aware that the following assumptions * are currently made: * - * *) Either HAVE_RCU_TABLE_FREE is enabled, and tlb_remove_table() is used to - * free pages containing page tables or TLB flushing requires IPI broadcast. + * *) HAVE_RCU_TABLE_FREE is enabled, and tlb_remove_table is used to free + * pages containing page tables. * * *) ptes can be read atomically by the architecture. * @@ -1187,7 +1187,7 @@ struct page *get_dump_page(unsigned long addr) * * This code is based heavily on the PowerPC implementation by Nick Piggin. */ -#ifdef CONFIG_HAVE_GENERIC_GUP +#ifdef CONFIG_HAVE_GENERIC_RCU_GUP #ifndef gup_get_pte /* @@ -1677,4 +1677,4 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write, return ret; } -#endif /* CONFIG_HAVE_GENERIC_GUP */ +#endif /* CONFIG_HAVE_GENERIC_RCU_GUP */