* [PATCH V4 0/6] RCU get_user_pages_fast and __get_user_pages_fast
From: Steve Capper @ 2014-09-26 14:03 UTC
  To: linux-arm-kernel, catalin.marinas, linux, linux-arch, linux-mm
  Cc: will.deacon, gary.robertson, christoffer.dall, peterz,
	anders.roxell, akpm, dann.frazier, mark.rutland, mgorman, hughd,
	Steve Capper

Hello,
This series implements general forms of get_user_pages_fast and
__get_user_pages_fast in core code and activates them for arm and arm64.

These are required for Transparent HugePages to function correctly, as
a futex on a THP tail page will otherwise result in an infinite loop (due
to the core implementation of __get_user_pages_fast always returning 0).

Unfortunately, a futex on a THP tail page can be quite common for certain
workloads; thus THP is unreliable without a __get_user_pages_fast
implementation.
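
To make the failure mode concrete, here is a minimal sketch (simplified
from the THP tail-page handling in get_futex_key of kernels from this
era; not code from this series). With a stub __get_user_pages_fast that
always returns 0, the retry below can never make progress:

	/*
	 * Sketch only: get_futex_key re-pins a THP tail page via the
	 * fast walker and retries on failure, so a stub that always
	 * returns 0 spins forever.
	 */
again:
	if (get_user_pages_fast(address, 1, 1, &page) < 1)
		return -EFAULT;

	if (unlikely(PageTail(page))) {
		put_page(page);
		local_irq_disable();
		/* With the stub that always returns 0, this never pins: */
		if (__get_user_pages_fast(address, 1, 1, &page) != 1) {
			local_irq_enable();
			goto again;	/* hence the infinite loop */
		}
		local_irq_enable();
	}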

This series may also be beneficial for direct-IO heavy workloads and
certain KVM workloads.

I appreciate that the merge window is coming very soon, and am posting
this revision on the off-chance that it gets the nod for 3.18. (The changes
thus far have been minimal and the feedback I've got has been mainly
positive).

Changes since PATCH V3 (mainly addressing comments from Hugh Dickins):
 * Added pte_numa and pmd_numa calls.
 * Added comments to clarify what assumptions are being made by the
   implementation.
 * Cleaned up formatting for checkpatch.
 * As these changes are mainly cosmetic, I've retained the Tested-by
   and Reviewed-by tags.

Changes since PATCH V2 are:
 * spelt `PATCH' correctly in the subject prefix this time. :-(
 * Added acks, tested-bys and reviewed-bys.
 * Cleanup of patch #6 with pud_pte and pud_pmd helpers.
 * Switched config option from HAVE_RCU_GUP to HAVE_GENERIC_RCU_GUP.

Changes since PATCH V1 are:
 * Rebase to 3.17-rc1
 * Switched to kick_all_cpus_sync as suggested by Mark Rutland.

The main changes since RFC V5 are:
 * Rebased against 3.16-rc1.
 * pmd_present is no longer tested by gup_huge_pmd and gup_huge_pud,
   because the entry must be present for these leaf functions to be
   called.
 * Rather than assume puds can be re-cast as pmds, a separate
   function pud_write is instead used by the core gup.
 * ARM activation logic changed; it now activates RCU_TABLE_FREE and
   RCU_GUP only when running with LPAE.

The main changes since RFC V4 are:
 * corrected the arm64 logic so it now correctly rcu-frees page
   table backing pages.
 * rcu free logic relaxed for pre-ARMv7 ARM as we need an IPI to
   invalidate TLBs anyway.
 * rebased to 3.15-rc3 (some minor changes were needed to allow it to merge).
 * dropped Catalin's mmu_gather patch as that's been merged already.

This series has been tested with LTP mm tests and some custom futex tests
that exercise the futex on THP tail case, on both an Arndale board and
a Juno board. Debug counters were also temporarily employed to ensure that
the RCU_TABLE_FREE logic was behaving as expected.

Cheers,
--
Steve


Steve Capper (6):
  mm: Introduce a general RCU get_user_pages_fast.
  arm: mm: Introduce special ptes for LPAE
  arm: mm: Enable HAVE_RCU_TABLE_FREE logic
  arm: mm: Enable RCU fast_gup
  arm64: mm: Enable HAVE_RCU_TABLE_FREE logic
  arm64: mm: Enable RCU fast_gup

 arch/arm/Kconfig                      |   5 +
 arch/arm/include/asm/pgtable-2level.h |   2 +
 arch/arm/include/asm/pgtable-3level.h |  15 ++
 arch/arm/include/asm/pgtable.h        |   6 +-
 arch/arm/include/asm/tlb.h            |  38 +++-
 arch/arm/mm/flush.c                   |  15 ++
 arch/arm64/Kconfig                    |   4 +
 arch/arm64/include/asm/pgtable.h      |  21 +-
 arch/arm64/include/asm/tlb.h          |  20 +-
 arch/arm64/mm/flush.c                 |  15 ++
 mm/Kconfig                            |   3 +
 mm/gup.c                              | 354 ++++++++++++++++++++++++++++++++++
 12 files changed, 488 insertions(+), 10 deletions(-)

-- 
1.9.3



* [PATCH V4 1/6] mm: Introduce a general RCU get_user_pages_fast.
From: Steve Capper @ 2014-09-26 14:03 UTC
  To: linux-arm-kernel, catalin.marinas, linux, linux-arch, linux-mm
  Cc: will.deacon, gary.robertson, christoffer.dall, peterz,
	anders.roxell, akpm, dann.frazier, mark.rutland, mgorman, hughd,
	Steve Capper

get_user_pages_fast attempts to pin user pages by walking the page
tables directly and avoids taking locks. Thus the walker needs to be
protected from page table pages being freed from under it, and needs
to block any THP splits.

One way to achieve this is to have the walker disable interrupts, and
rely on IPIs from the TLB flushing code blocking before the page table
pages are freed.

On some platforms TLB invalidations are broadcast in hardware, so the
TLB flushing code doesn't necessarily need to broadcast IPIs;
spuriously broadcasting IPIs can hurt system performance if done too
often.

This problem has been solved on PowerPC and Sparc by batching up page
table pages belonging to more than one mm_user, then scheduling an
rcu_sched callback to free the pages. This RCU page table free logic
has been promoted to core code and is activated when one enables
HAVE_RCU_TABLE_FREE. Unfortunately, these architectures implement
their own get_user_pages_fast routines.
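
As a sketch of that mechanism (an assumed shape, not code from this
patch): with HAVE_RCU_TABLE_FREE enabled, the architecture's tlb code
hands page table pages to tlb_remove_table, which batches them and
defers the actual free to an rcu_sched callback:

	/*
	 * Sketch only: queue the page table page rather than freeing
	 * it directly; the real free happens after an RCU grace
	 * period, which an interrupts-off walker holds off.
	 */
	static inline void __pte_free_tlb(struct mmu_gather *tlb,
					  pgtable_t pte, unsigned long addr)
	{
		pgtable_page_dtor(pte);
		tlb_remove_table(tlb, pte);
	}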

The RCU page table free logic, coupled with an IPI broadcast on THP
split (which is a rare event), allows one to protect a page table
walker by merely disabling interrupts during the walk.

This patch provides a general RCU implementation of get_user_pages_fast
that can be used by architectures that perform hardware broadcast of
TLB invalidations.

It is based heavily on the PowerPC implementation by Nick Piggin.
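
An architecture opts in via Kconfig; a minimal sketch of the selection
(assumed here for illustration; the later patches in this series wire
this up properly for arm and arm64):

	config ARM64
		select HAVE_GENERIC_RCU_GUP
		select HAVE_RCU_TABLE_FREE
		...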

Signed-off-by: Steve Capper <steve.capper@linaro.org>
Tested-by: Dann Frazier <dann.frazier@canonical.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
---
Changed in V4:
 * Added pte_numa and pmd_numa calls.
 * Added comments to clarify what assumptions are being made by the
   implementation.
 * Cleaned up formatting for checkpatch.

Catalin, I've kept your Reviewed-by; please shout if you dislike the
pte_numa and pmd_numa calls.
---
 mm/Kconfig |   3 +
 mm/gup.c   | 354 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 357 insertions(+)

diff --git a/mm/Kconfig b/mm/Kconfig
index 886db21..0ceb8a5 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -137,6 +137,9 @@ config HAVE_MEMBLOCK_NODE_MAP
 config HAVE_MEMBLOCK_PHYS_MAP
 	boolean
 
+config HAVE_GENERIC_RCU_GUP
+	boolean
+
 config ARCH_DISCARD_MEMBLOCK
 	boolean
 
diff --git a/mm/gup.c b/mm/gup.c
index 91d044b..35c0160 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -10,6 +10,10 @@
 #include <linux/swap.h>
 #include <linux/swapops.h>
 
+#include <linux/sched.h>
+#include <linux/rwsem.h>
+#include <asm/pgtable.h>
+
 #include "internal.h"
 
 static struct page *no_page_table(struct vm_area_struct *vma,
@@ -672,3 +676,353 @@ struct page *get_dump_page(unsigned long addr)
 	return page;
 }
 #endif /* CONFIG_ELF_CORE */
+
+/**
+ * Generic RCU Fast GUP
+ *
+ * get_user_pages_fast attempts to pin user pages by walking the page
+ * tables directly and avoids taking locks. Thus the walker needs to be
+ * protected from page table pages being freed from under it, and should
+ * block any THP splits.
+ *
+ * One way to achieve this is to have the walker disable interrupts, and
+ * rely on IPIs from the TLB flushing code blocking before the page table
+ * pages are freed. This is unsuitable for architectures that do not need
+ * to broadcast an IPI when invalidating TLBs.
+ *
+ * Another way to achieve this is to batch up the page-table-containing pages
+ * belonging to more than one mm_user, then rcu_sched a callback to free those
+ * pages. Disabling interrupts will allow the fast_gup walker to both block
+ * the rcu_sched callback, and an IPI that we broadcast for splitting THPs
+ * (which is a relatively rare event). The code below adopts this strategy.
+ *
+ * Before activating this code, please be aware that the following assumptions
+ * are currently made:
+ *
+ *  *) HAVE_RCU_TABLE_FREE is enabled, and tlb_remove_table is used to free
+ *      pages containing page tables.
+ *
+ *  *) THP splits will broadcast an IPI; this can be achieved by overriding
+ *      pmdp_splitting_flush.
+ *
+ *  *) ptes can be read atomically by the architecture.
+ *
+ *  *) access_ok is sufficient to validate userspace address ranges.
+ *
+ * The last two assumptions can be relaxed by the addition of helper functions.
+ *
+ * This code is based heavily on the PowerPC implementation by Nick Piggin.
+ */
+#ifdef CONFIG_HAVE_GENERIC_RCU_GUP
+
+#ifdef __HAVE_ARCH_PTE_SPECIAL
+static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
+			 int write, struct page **pages, int *nr)
+{
+	pte_t *ptep, *ptem;
+	int ret = 0;
+
+	ptem = ptep = pte_offset_map(&pmd, addr);
+	do {
+		/*
+		 * In the line below we are assuming that the pte can be read
+		 * atomically. If this is not the case for your architecture,
+		 * please wrap this in a helper function!
+		 *
+		 * for an example see gup_get_pte in arch/x86/mm/gup.c
+		 */
+		pte_t pte = ACCESS_ONCE(*ptep);
+		struct page *page;
+
+		/*
+		 * Similar to the PMD case below, NUMA hinting must take slow
+		 * path
+		 */
+		if (!pte_present(pte) || pte_special(pte) ||
+			pte_numa(pte) || (write && !pte_write(pte)))
+			goto pte_unmap;
+
+		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
+		page = pte_page(pte);
+
+		if (!page_cache_get_speculative(page))
+			goto pte_unmap;
+
+		if (unlikely(pte_val(pte) != pte_val(*ptep))) {
+			put_page(page);
+			goto pte_unmap;
+		}
+
+		pages[*nr] = page;
+		(*nr)++;
+
+	} while (ptep++, addr += PAGE_SIZE, addr != end);
+
+	ret = 1;
+
+pte_unmap:
+	pte_unmap(ptem);
+	return ret;
+}
+#else
+
+/*
+ * If we can't determine whether or not a pte is special, then fail immediately
+ * for ptes. Note, we can still pin HugeTLB and THP as these are guaranteed not
+ * to be special.
+ *
+ * For a futex to be placed on a THP tail page, get_futex_key requires a
+ * __get_user_pages_fast implementation that can pin pages. Thus it's still
+ * useful to have gup_huge_pmd even if we can't operate on ptes.
+ */
+static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
+			 int write, struct page **pages, int *nr)
+{
+	return 0;
+}
+#endif /* __HAVE_ARCH_PTE_SPECIAL */
+
+static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
+		unsigned long end, int write, struct page **pages, int *nr)
+{
+	struct page *head, *page, *tail;
+	int refs;
+
+	if (write && !pmd_write(orig))
+		return 0;
+
+	refs = 0;
+	head = pmd_page(orig);
+	page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
+	tail = page;
+	do {
+		VM_BUG_ON_PAGE(compound_head(page) != head, page);
+		pages[*nr] = page;
+		(*nr)++;
+		page++;
+		refs++;
+	} while (addr += PAGE_SIZE, addr != end);
+
+	if (!page_cache_add_speculative(head, refs)) {
+		*nr -= refs;
+		return 0;
+	}
+
+	if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
+		*nr -= refs;
+		while (refs--)
+			put_page(head);
+		return 0;
+	}
+
+	/*
+	 * Any tail pages need their mapcount reference taken before we
+	 * return. (This allows the THP code to bump their ref count when
+	 * they are split into base pages).
+	 */
+	while (refs--) {
+		if (PageTail(tail))
+			get_huge_page_tail(tail);
+		tail++;
+	}
+
+	return 1;
+}
+
+static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
+		unsigned long end, int write, struct page **pages, int *nr)
+{
+	struct page *head, *page, *tail;
+	int refs;
+
+	if (write && !pud_write(orig))
+		return 0;
+
+	refs = 0;
+	head = pud_page(orig);
+	page = head + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
+	tail = page;
+	do {
+		VM_BUG_ON_PAGE(compound_head(page) != head, page);
+		pages[*nr] = page;
+		(*nr)++;
+		page++;
+		refs++;
+	} while (addr += PAGE_SIZE, addr != end);
+
+	if (!page_cache_add_speculative(head, refs)) {
+		*nr -= refs;
+		return 0;
+	}
+
+	if (unlikely(pud_val(orig) != pud_val(*pudp))) {
+		*nr -= refs;
+		while (refs--)
+			put_page(head);
+		return 0;
+	}
+
+	while (refs--) {
+		if (PageTail(tail))
+			get_huge_page_tail(tail);
+		tail++;
+	}
+
+	return 1;
+}
+
+static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
+		int write, struct page **pages, int *nr)
+{
+	unsigned long next;
+	pmd_t *pmdp;
+
+	pmdp = pmd_offset(&pud, addr);
+	do {
+		pmd_t pmd = ACCESS_ONCE(*pmdp);
+
+		next = pmd_addr_end(addr, end);
+		if (pmd_none(pmd) || pmd_trans_splitting(pmd))
+			return 0;
+
+		if (unlikely(pmd_trans_huge(pmd) || pmd_huge(pmd))) {
+			/*
+			 * NUMA hinting faults need to be handled in the GUP
+			 * slowpath for accounting purposes and so that they
+			 * can be serialised against THP migration.
+			 */
+			if (pmd_numa(pmd))
+				return 0;
+
+			if (!gup_huge_pmd(pmd, pmdp, addr, next, write,
+				pages, nr))
+				return 0;
+
+		} else if (!gup_pte_range(pmd, addr, next, write, pages, nr))
+				return 0;
+	} while (pmdp++, addr = next, addr != end);
+
+	return 1;
+}
+
+static int gup_pud_range(pgd_t *pgdp, unsigned long addr, unsigned long end,
+		int write, struct page **pages, int *nr)
+{
+	unsigned long next;
+	pud_t *pudp;
+
+	pudp = pud_offset(pgdp, addr);
+	do {
+		pud_t pud = ACCESS_ONCE(*pudp);
+
+		next = pud_addr_end(addr, end);
+		if (pud_none(pud))
+			return 0;
+		if (pud_huge(pud)) {
+			if (!gup_huge_pud(pud, pudp, addr, next, write,
+					pages, nr))
+				return 0;
+		} else if (!gup_pmd_range(pud, addr, next, write, pages, nr))
+			return 0;
+	} while (pudp++, addr = next, addr != end);
+
+	return 1;
+}
+
+/*
+ * Like get_user_pages_fast() except it's IRQ-safe in that it won't fall
+ * back to the regular GUP. It will only return non-negative values.
+ */
+int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
+			  struct page **pages)
+{
+	struct mm_struct *mm = current->mm;
+	unsigned long addr, len, end;
+	unsigned long next, flags;
+	pgd_t *pgdp;
+	int nr = 0;
+
+	start &= PAGE_MASK;
+	addr = start;
+	len = (unsigned long) nr_pages << PAGE_SHIFT;
+	end = start + len;
+
+	if (unlikely(!access_ok(write ? VERIFY_WRITE : VERIFY_READ,
+					start, len)))
+		return 0;
+
+	/*
+	 * Disable interrupts, we use the nested form as we can already
+	 * have interrupts disabled by get_futex_key.
+	 *
+	 * With interrupts disabled, we block page table pages from being
+	 * freed from under us. See mmu_gather_tlb in asm-generic/tlb.h
+	 * for more details.
+	 *
+	 * We do not adopt an rcu_read_lock(.) here as we also want to
+	 * block IPIs that come from THPs splitting.
+	 */
+
+	local_irq_save(flags);
+	pgdp = pgd_offset(mm, addr);
+	do {
+		next = pgd_addr_end(addr, end);
+		if (pgd_none(*pgdp))
+			break;
+		else if (!gup_pud_range(pgdp, addr, next, write, pages, &nr))
+			break;
+	} while (pgdp++, addr = next, addr != end);
+	local_irq_restore(flags);
+
+	return nr;
+}
+
+/**
+ * get_user_pages_fast() - pin user pages in memory
+ * @start:	starting user address
+ * @nr_pages:	number of pages from start to pin
+ * @write:	whether pages will be written to
+ * @pages:	array that receives pointers to the pages pinned.
+ *		Should be at least nr_pages long.
+ *
+ * Attempt to pin user pages in memory without taking mm->mmap_sem.
+ * If not successful, it will fall back to taking the lock and
+ * calling get_user_pages().
+ *
+ * Returns number of pages pinned. This may be fewer than the number
+ * requested. If nr_pages is 0 or negative, returns 0. If no pages
+ * were pinned, returns -errno.
+ */
+int get_user_pages_fast(unsigned long start, int nr_pages, int write,
+			struct page **pages)
+{
+	struct mm_struct *mm = current->mm;
+	int nr, ret;
+
+	start &= PAGE_MASK;
+	nr = __get_user_pages_fast(start, nr_pages, write, pages);
+	ret = nr;
+
+	if (nr < nr_pages) {
+		/* Try to get the remaining pages with get_user_pages */
+		start += nr << PAGE_SHIFT;
+		pages += nr;
+
+		down_read(&mm->mmap_sem);
+		ret = get_user_pages(current, mm, start,
+				     nr_pages - nr, write, 0, pages, NULL);
+		up_read(&mm->mmap_sem);
+
+		/* Have to be a bit careful with return values */
+		if (nr > 0) {
+			if (ret < 0)
+				ret = nr;
+			else
+				ret += nr;
+		}
+	}
+
+	return ret;
+}
+
+#endif /* CONFIG_HAVE_GENERIC_RCU_GUP */
-- 
1.9.3



* [PATCH V4 2/6] arm: mm: Introduce special ptes for LPAE
From: Steve Capper @ 2014-09-26 14:03 UTC
  To: linux-arm-kernel, catalin.marinas, linux, linux-arch, linux-mm
  Cc: will.deacon, gary.robertson, christoffer.dall, peterz,
	anders.roxell, akpm, dann.frazier, mark.rutland, mgorman, hughd,
	Steve Capper

We need a mechanism to tag ptes as being special; this indicates that
no attempt should be made to access the underlying struct page *
associated with the pte. This is used by the fast_gup when operating on
ptes, as it has no means to access VMAs (which also contain this
information) locklessly.

The L_PTE_SPECIAL bit is already allocated for LPAE, this patch modifies
pte_special and pte_mkspecial to make use of it, and defines
__HAVE_ARCH_PTE_SPECIAL.

This patch also excludes special ptes from the icache/dcache sync logic.
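
For context, the generic walker introduced in patch 1 relies on these
helpers to bail out of the lockless path; a simplified extract from
gup_pte_range in that patch:

	pte_t pte = ACCESS_ONCE(*ptep);

	if (!pte_present(pte) || pte_special(pte) ||
	    (write && !pte_write(pte)))
		goto pte_unmap;	/* defer to the slow, VMA-aware path */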

Signed-off-by: Steve Capper <steve.capper@linaro.org>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
---
 arch/arm/include/asm/pgtable-2level.h | 2 ++
 arch/arm/include/asm/pgtable-3level.h | 7 +++++++
 arch/arm/include/asm/pgtable.h        | 6 ++----
 3 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/arch/arm/include/asm/pgtable-2level.h b/arch/arm/include/asm/pgtable-2level.h
index 219ac88..f027941 100644
--- a/arch/arm/include/asm/pgtable-2level.h
+++ b/arch/arm/include/asm/pgtable-2level.h
@@ -182,6 +182,8 @@ static inline pmd_t *pmd_offset(pud_t *pud, unsigned long addr)
 #define pmd_addr_end(addr,end) (end)
 
 #define set_pte_ext(ptep,pte,ext) cpu_set_pte_ext(ptep,pte,ext)
+#define pte_special(pte)	(0)
+static inline pte_t pte_mkspecial(pte_t pte) { return pte; }
 
 /*
  * We don't have huge page support for short descriptors, for the moment
diff --git a/arch/arm/include/asm/pgtable-3level.h b/arch/arm/include/asm/pgtable-3level.h
index 06e0bc0..16122d4 100644
--- a/arch/arm/include/asm/pgtable-3level.h
+++ b/arch/arm/include/asm/pgtable-3level.h
@@ -213,6 +213,13 @@ static inline pmd_t *pmd_offset(pud_t *pud, unsigned long addr)
 #define pmd_isclear(pmd, val)	(!(pmd_val(pmd) & (val)))
 
 #define pmd_young(pmd)		(pmd_isset((pmd), PMD_SECT_AF))
+#define pte_special(pte)	(pte_isset((pte), L_PTE_SPECIAL))
+static inline pte_t pte_mkspecial(pte_t pte)
+{
+	pte_val(pte) |= L_PTE_SPECIAL;
+	return pte;
+}
+#define	__HAVE_ARCH_PTE_SPECIAL
 
 #define __HAVE_ARCH_PMD_WRITE
 #define pmd_write(pmd)		(pmd_isclear((pmd), L_PMD_SECT_RDONLY))
diff --git a/arch/arm/include/asm/pgtable.h b/arch/arm/include/asm/pgtable.h
index 01baef0..90aa4583 100644
--- a/arch/arm/include/asm/pgtable.h
+++ b/arch/arm/include/asm/pgtable.h
@@ -226,7 +226,6 @@ static inline pte_t *pmd_page_vaddr(pmd_t pmd)
 #define pte_dirty(pte)		(pte_isset((pte), L_PTE_DIRTY))
 #define pte_young(pte)		(pte_isset((pte), L_PTE_YOUNG))
 #define pte_exec(pte)		(pte_isclear((pte), L_PTE_XN))
-#define pte_special(pte)	(0)
 
 #define pte_valid_user(pte)	\
 	(pte_valid(pte) && pte_isset((pte), L_PTE_USER) && pte_young(pte))
@@ -245,7 +244,8 @@ static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
 	unsigned long ext = 0;
 
 	if (addr < TASK_SIZE && pte_valid_user(pteval)) {
-		__sync_icache_dcache(pteval);
+		if (!pte_special(pteval))
+			__sync_icache_dcache(pteval);
 		ext |= PTE_EXT_NG;
 	}
 
@@ -264,8 +264,6 @@ PTE_BIT_FUNC(mkyoung,   |= L_PTE_YOUNG);
 PTE_BIT_FUNC(mkexec,   &= ~L_PTE_XN);
 PTE_BIT_FUNC(mknexec,   |= L_PTE_XN);
 
-static inline pte_t pte_mkspecial(pte_t pte) { return pte; }
-
 static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
 {
 	const pteval_t mask = L_PTE_XN | L_PTE_RDONLY | L_PTE_USER |
-- 
1.9.3


@@ -245,7 +244,8 @@ static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
 	unsigned long ext = 0;
 
 	if (addr < TASK_SIZE && pte_valid_user(pteval)) {
-		__sync_icache_dcache(pteval);
+		if (!pte_special(pteval))
+			__sync_icache_dcache(pteval);
 		ext |= PTE_EXT_NG;
 	}
 
@@ -264,8 +264,6 @@ PTE_BIT_FUNC(mkyoung,   |= L_PTE_YOUNG);
 PTE_BIT_FUNC(mkexec,   &= ~L_PTE_XN);
 PTE_BIT_FUNC(mknexec,   |= L_PTE_XN);
 
-static inline pte_t pte_mkspecial(pte_t pte) { return pte; }
-
 static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
 {
 	const pteval_t mask = L_PTE_XN | L_PTE_RDONLY | L_PTE_USER |
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 103+ messages in thread

* [PATCH V4 3/6] arm: mm: Enable HAVE_RCU_TABLE_FREE logic
@ 2014-09-26 14:03   ` Steve Capper
  0 siblings, 0 replies; 103+ messages in thread
From: Steve Capper @ 2014-09-26 14:03 UTC (permalink / raw)
  To: linux-arm-kernel, catalin.marinas, linux, linux-arch, linux-mm
  Cc: will.deacon, gary.robertson, christoffer.dall, peterz,
	anders.roxell, akpm, dann.frazier, mark.rutland, mgorman, hughd,
	Steve Capper

In order to implement fast_get_user_pages we need to ensure that the
page table walker is protected from page table pages being freed from
under it.

This patch enables HAVE_RCU_TABLE_FREE; any page table pages belonging
to address spaces with multiple users will be freed via call_rcu_sched.
This means that disabling interrupts will block the free and protect
the fast gup page walker.
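
In outline, the fast_gup walker relies on the pattern sketched below.
This is a minimal illustration only, not part of the patch; it assumes
the page table pages being walked are freed via the call_rcu_sched path
described above:

    unsigned long flags;

    /*
     * An interrupts-off region acts as an rcu_sched read-side
     * critical section: the call_rcu_sched grace period, and hence
     * the deferred free of a page table page, cannot complete while
     * the walk is in progress.
     */
    local_irq_save(flags);
    /* ... dereference page table entries here ... */
    local_irq_restore(flags);	/* deferred frees may now run */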

Signed-off-by: Steve Capper <steve.capper@linaro.org>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
---
 arch/arm/Kconfig           |  1 +
 arch/arm/include/asm/tlb.h | 38 ++++++++++++++++++++++++++++++++++++--
 2 files changed, 37 insertions(+), 2 deletions(-)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index c49a775..cc740d2 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -60,6 +60,7 @@ config ARM
 	select HAVE_PERF_EVENTS
 	select HAVE_PERF_REGS
 	select HAVE_PERF_USER_STACK_DUMP
+	select HAVE_RCU_TABLE_FREE if (SMP && ARM_LPAE)
 	select HAVE_REGS_AND_STACK_ACCESS_API
 	select HAVE_SYSCALL_TRACEPOINTS
 	select HAVE_UID16
diff --git a/arch/arm/include/asm/tlb.h b/arch/arm/include/asm/tlb.h
index f1a0dac..3cadb72 100644
--- a/arch/arm/include/asm/tlb.h
+++ b/arch/arm/include/asm/tlb.h
@@ -35,12 +35,39 @@
 
 #define MMU_GATHER_BUNDLE	8
 
+#ifdef CONFIG_HAVE_RCU_TABLE_FREE
+static inline void __tlb_remove_table(void *_table)
+{
+	free_page_and_swap_cache((struct page *)_table);
+}
+
+struct mmu_table_batch {
+	struct rcu_head		rcu;
+	unsigned int		nr;
+	void			*tables[0];
+};
+
+#define MAX_TABLE_BATCH		\
+	((PAGE_SIZE - sizeof(struct mmu_table_batch)) / sizeof(void *))
+
+extern void tlb_table_flush(struct mmu_gather *tlb);
+extern void tlb_remove_table(struct mmu_gather *tlb, void *table);
+
+#define tlb_remove_entry(tlb, entry)	tlb_remove_table(tlb, entry)
+#else
+#define tlb_remove_entry(tlb, entry)	tlb_remove_page(tlb, entry)
+#endif /* CONFIG_HAVE_RCU_TABLE_FREE */
+
 /*
  * TLB handling.  This allows us to remove pages from the page
  * tables, and efficiently handle the TLB issues.
  */
 struct mmu_gather {
 	struct mm_struct	*mm;
+#ifdef CONFIG_HAVE_RCU_TABLE_FREE
+	struct mmu_table_batch	*batch;
+	unsigned int		need_flush;
+#endif
 	unsigned int		fullmm;
 	struct vm_area_struct	*vma;
 	unsigned long		start, end;
@@ -101,6 +128,9 @@ static inline void __tlb_alloc_page(struct mmu_gather *tlb)
 static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
 {
 	tlb_flush(tlb);
+#ifdef CONFIG_HAVE_RCU_TABLE_FREE
+	tlb_table_flush(tlb);
+#endif
 }
 
 static inline void tlb_flush_mmu_free(struct mmu_gather *tlb)
@@ -129,6 +159,10 @@ tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, unsigned long start
 	tlb->pages = tlb->local;
 	tlb->nr = 0;
 	__tlb_alloc_page(tlb);
+
+#ifdef CONFIG_HAVE_RCU_TABLE_FREE
+	tlb->batch = NULL;
+#endif
 }
 
 static inline void
@@ -205,7 +239,7 @@ static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t pte,
 	tlb_add_flush(tlb, addr + SZ_1M);
 #endif
 
-	tlb_remove_page(tlb, pte);
+	tlb_remove_entry(tlb, pte);
 }
 
 static inline void __pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmdp,
@@ -213,7 +247,7 @@ static inline void __pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmdp,
 {
 #ifdef CONFIG_ARM_LPAE
 	tlb_add_flush(tlb, addr);
-	tlb_remove_page(tlb, virt_to_page(pmdp));
+	tlb_remove_entry(tlb, virt_to_page(pmdp));
 #endif
 }
 
-- 
1.9.3

* [PATCH V4 4/6] arm: mm: Enable RCU fast_gup
@ 2014-09-26 14:03   ` Steve Capper
  0 siblings, 0 replies; 103+ messages in thread
From: Steve Capper @ 2014-09-26 14:03 UTC (permalink / raw)
  To: linux-arm-kernel, catalin.marinas, linux, linux-arch, linux-mm
  Cc: will.deacon, gary.robertson, christoffer.dall, peterz,
	anders.roxell, akpm, dann.frazier, mark.rutland, mgorman, hughd,
	Steve Capper

Activate the RCU fast_gup for ARM. We also need to force THP splits to
broadcast an IPI so that the splits block on the fast_gup page walker.
As THP splits are comparatively rare, this should not lead to a
noticeable performance degradation.

The prerequisite functions pud_write and pud_page are also added.
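
For context, the generic walker that this IPI serialises against simply
refuses to touch a splitting pmd. A condensed sketch of that check,
mirroring gup_pmd_range() from patch 1 of this series (not code added
by this patch):

    pmd_t pmd = ACCESS_ONCE(*pmdp);

    if (pmd_none(pmd) || pmd_trans_splitting(pmd))
            return 0;	/* bail; caller falls back to the slow gup path */

Because the walker runs with interrupts disabled, kick_all_cpus_sync()
cannot return until the walk has finished, so the split cannot complete
underneath it.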

Signed-off-by: Steve Capper <steve.capper@linaro.org>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
---
 arch/arm/Kconfig                      |  4 ++++
 arch/arm/include/asm/pgtable-3level.h |  8 ++++++++
 arch/arm/mm/flush.c                   | 15 +++++++++++++++
 3 files changed, 27 insertions(+)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index cc740d2..0e5b47f 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -1645,6 +1645,10 @@ config ARCH_SELECT_MEMORY_MODEL
 config HAVE_ARCH_PFN_VALID
 	def_bool ARCH_HAS_HOLES_MEMORYMODEL || !SPARSEMEM
 
+config HAVE_GENERIC_RCU_GUP
+	def_bool y
+	depends on ARM_LPAE
+
 config HIGHMEM
 	bool "High Memory Support"
 	depends on MMU
diff --git a/arch/arm/include/asm/pgtable-3level.h b/arch/arm/include/asm/pgtable-3level.h
index 16122d4..a31ecdad 100644
--- a/arch/arm/include/asm/pgtable-3level.h
+++ b/arch/arm/include/asm/pgtable-3level.h
@@ -224,6 +224,8 @@ static inline pte_t pte_mkspecial(pte_t pte)
 #define __HAVE_ARCH_PMD_WRITE
 #define pmd_write(pmd)		(pmd_isclear((pmd), L_PMD_SECT_RDONLY))
 #define pmd_dirty(pmd)		(pmd_isset((pmd), L_PMD_SECT_DIRTY))
+#define pud_page(pud)		pmd_page(__pmd(pud_val(pud)))
+#define pud_write(pud)		pmd_write(__pmd(pud_val(pud)))
 
 #define pmd_hugewillfault(pmd)	(!pmd_young(pmd) || !pmd_write(pmd))
 #define pmd_thp_or_huge(pmd)	(pmd_huge(pmd) || pmd_trans_huge(pmd))
@@ -231,6 +233,12 @@ static inline pte_t pte_mkspecial(pte_t pte)
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 #define pmd_trans_huge(pmd)	(pmd_val(pmd) && !pmd_table(pmd))
 #define pmd_trans_splitting(pmd) (pmd_isset((pmd), L_PMD_SECT_SPLITTING))
+
+#ifdef CONFIG_HAVE_RCU_TABLE_FREE
+#define __HAVE_ARCH_PMDP_SPLITTING_FLUSH
+void pmdp_splitting_flush(struct vm_area_struct *vma, unsigned long address,
+			  pmd_t *pmdp);
+#endif
 #endif
 
 #define PMD_BIT_FUNC(fn,op) \
diff --git a/arch/arm/mm/flush.c b/arch/arm/mm/flush.c
index 43d54f5..265b836 100644
--- a/arch/arm/mm/flush.c
+++ b/arch/arm/mm/flush.c
@@ -400,3 +400,18 @@ void __flush_anon_page(struct vm_area_struct *vma, struct page *page, unsigned l
 	 */
 	__cpuc_flush_dcache_area(page_address(page), PAGE_SIZE);
 }
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#ifdef CONFIG_HAVE_RCU_TABLE_FREE
+void pmdp_splitting_flush(struct vm_area_struct *vma, unsigned long address,
+			  pmd_t *pmdp)
+{
+	pmd_t pmd = pmd_mksplitting(*pmdp);
+	VM_BUG_ON(address & ~PMD_MASK);
+	set_pmd_at(vma->vm_mm, address, pmdp, pmd);
+
+	/* dummy IPI to serialise against fast_gup */
+	kick_all_cpus_sync();
+}
+#endif /* CONFIG_HAVE_RCU_TABLE_FREE */
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
-- 
1.9.3

* [PATCH V4 5/6] arm64: mm: Enable HAVE_RCU_TABLE_FREE logic
@ 2014-09-26 14:03   ` Steve Capper
  0 siblings, 0 replies; 103+ messages in thread
From: Steve Capper @ 2014-09-26 14:03 UTC (permalink / raw)
  To: linux-arm-kernel, catalin.marinas, linux, linux-arch, linux-mm
  Cc: will.deacon, gary.robertson, christoffer.dall, peterz,
	anders.roxell, akpm, dann.frazier, mark.rutland, mgorman, hughd,
	Steve Capper

In order to implement fast_get_user_pages we need to ensure that the
page table walker is protected from page table pages being freed from
under it.

This patch enables HAVE_RCU_TABLE_FREE; any page table pages belonging
to address spaces with multiple users will be freed via call_rcu_sched.
This means that disabling interrupts will block the free and protect the
fast gup page walker.
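
For reference, the core consumer of the __tlb_remove_table hook looks
roughly as follows; this is condensed from mm/memory.c at this series'
base and is shown only for context, not added by this patch:

    static void tlb_remove_table_rcu(struct rcu_head *head)
    {
            struct mmu_table_batch *batch;
            int i;

            batch = container_of(head, struct mmu_table_batch, rcu);

            /* hand each batched page table page to the arch hook */
            for (i = 0; i < batch->nr; i++)
                    __tlb_remove_table(batch->tables[i]);

            free_page((unsigned long)batch);
    }

    void tlb_table_flush(struct mmu_gather *tlb)
    {
            struct mmu_table_batch **batch = &tlb->batch;

            if (*batch) {
                    /* defer the frees to an rcu_sched grace period */
                    call_rcu_sched(&(*batch)->rcu, tlb_remove_table_rcu);
                    *batch = NULL;
            }
    }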

Signed-off-by: Steve Capper <steve.capper@linaro.org>
Tested-by: Dann Frazier <dann.frazier@canonical.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
---
 arch/arm64/Kconfig           |  1 +
 arch/arm64/include/asm/tlb.h | 20 +++++++++++++++++---
 2 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index fd4e81a..ce9062b 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -54,6 +54,7 @@ config ARM64
 	select HAVE_PERF_EVENTS
 	select HAVE_PERF_REGS
 	select HAVE_PERF_USER_STACK_DUMP
+	select HAVE_RCU_TABLE_FREE
 	select HAVE_SYSCALL_TRACEPOINTS
 	select IRQ_DOMAIN
 	select MODULES_USE_ELF_RELA
diff --git a/arch/arm64/include/asm/tlb.h b/arch/arm64/include/asm/tlb.h
index 62731ef..a82c0c5 100644
--- a/arch/arm64/include/asm/tlb.h
+++ b/arch/arm64/include/asm/tlb.h
@@ -23,6 +23,20 @@
 
 #include <asm-generic/tlb.h>
 
+#include <linux/pagemap.h>
+#include <linux/swap.h>
+
+#ifdef CONFIG_HAVE_RCU_TABLE_FREE
+
+#define tlb_remove_entry(tlb, entry)	tlb_remove_table(tlb, entry)
+static inline void __tlb_remove_table(void *_table)
+{
+	free_page_and_swap_cache((struct page *)_table);
+}
+#else
+#define tlb_remove_entry(tlb, entry)	tlb_remove_page(tlb, entry)
+#endif /* CONFIG_HAVE_RCU_TABLE_FREE */
+
 /*
  * There's three ways the TLB shootdown code is used:
  *  1. Unmapping a range of vmas.  See zap_page_range(), unmap_region().
@@ -88,7 +102,7 @@ static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t pte,
 {
 	pgtable_page_dtor(pte);
 	tlb_add_flush(tlb, addr);
-	tlb_remove_page(tlb, pte);
+	tlb_remove_entry(tlb, pte);
 }
 
 #if CONFIG_ARM64_PGTABLE_LEVELS > 2
@@ -96,7 +110,7 @@ static inline void __pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmdp,
 				  unsigned long addr)
 {
 	tlb_add_flush(tlb, addr);
-	tlb_remove_page(tlb, virt_to_page(pmdp));
+	tlb_remove_entry(tlb, virt_to_page(pmdp));
 }
 #endif
 
@@ -105,7 +119,7 @@ static inline void __pud_free_tlb(struct mmu_gather *tlb, pud_t *pudp,
 				  unsigned long addr)
 {
 	tlb_add_flush(tlb, addr);
-	tlb_remove_page(tlb, virt_to_page(pudp));
+	tlb_remove_entry(tlb, virt_to_page(pudp));
 }
 #endif
 
-- 
1.9.3

* [PATCH V4 6/6] arm64: mm: Enable RCU fast_gup
@ 2014-09-26 14:03   ` Steve Capper
  0 siblings, 0 replies; 103+ messages in thread
From: Steve Capper @ 2014-09-26 14:03 UTC (permalink / raw)
  To: linux-arm-kernel, catalin.marinas, linux, linux-arch, linux-mm
  Cc: will.deacon, gary.robertson, christoffer.dall, peterz,
	anders.roxell, akpm, dann.frazier, mark.rutland, mgorman, hughd,
	Steve Capper

Activate the RCU fast_gup for ARM64. We also need to force THP splits
to broadcast an IPI so that the splits block on the fast_gup page
walker. As THP splits are comparatively rare, this should not lead to a
noticeable performance degradation.

The prerequisite functions pud_write and pud_page are also added.
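
As a usage sketch (illustrative only; uaddr and the page count are
hypothetical values, not taken from this series), a caller pins and
releases pages like so:

    struct page *pages[16];
    int nr;

    /* returns the number of pages actually pinned, possibly fewer */
    nr = get_user_pages_fast(uaddr, 16, 1 /* write */, pages);

    while (nr-- > 0)
            put_page(pages[nr]);	/* drop the pins when done */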

Signed-off-by: Steve Capper <steve.capper@linaro.org>
Tested-by: Dann Frazier <dann.frazier@canonical.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
---
 arch/arm64/Kconfig               |  3 +++
 arch/arm64/include/asm/pgtable.h | 21 ++++++++++++++++++++-
 arch/arm64/mm/flush.c            | 15 +++++++++++++++
 3 files changed, 38 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index ce9062b..435305e 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -108,6 +108,9 @@ config GENERIC_CALIBRATE_DELAY
 config ZONE_DMA
 	def_bool y
 
+config HAVE_GENERIC_RCU_GUP
+	def_bool y
+
 config ARCH_DMA_ADDR_T_64BIT
 	def_bool y
 
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index ffe1ba0..6d81471 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -239,6 +239,16 @@ static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
 
 #define __HAVE_ARCH_PTE_SPECIAL
 
+static inline pte_t pud_pte(pud_t pud)
+{
+	return __pte(pud_val(pud));
+}
+
+static inline pmd_t pud_pmd(pud_t pud)
+{
+	return __pmd(pud_val(pud));
+}
+
 static inline pte_t pmd_pte(pmd_t pmd)
 {
 	return __pte(pmd_val(pmd));
@@ -256,7 +266,13 @@ static inline pmd_t pte_pmd(pte_t pte)
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 #define pmd_trans_huge(pmd)	(pmd_val(pmd) && !(pmd_val(pmd) & PMD_TABLE_BIT))
 #define pmd_trans_splitting(pmd)	pte_special(pmd_pte(pmd))
-#endif
+#ifdef CONFIG_HAVE_RCU_TABLE_FREE
+#define __HAVE_ARCH_PMDP_SPLITTING_FLUSH
+struct vm_area_struct;
+void pmdp_splitting_flush(struct vm_area_struct *vma, unsigned long address,
+			  pmd_t *pmdp);
+#endif /* CONFIG_HAVE_RCU_TABLE_FREE */
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 #define pmd_young(pmd)		pte_young(pmd_pte(pmd))
 #define pmd_wrprotect(pmd)	pte_pmd(pte_wrprotect(pmd_pte(pmd)))
@@ -277,6 +293,7 @@ static inline pmd_t pte_pmd(pte_t pte)
 #define mk_pmd(page,prot)	pfn_pmd(page_to_pfn(page),prot)
 
 #define pmd_page(pmd)           pfn_to_page(__phys_to_pfn(pmd_val(pmd) & PHYS_MASK))
+#define pud_write(pud)		pte_write(pud_pte(pud))
 #define pud_pfn(pud)		(((pud_val(pud) & PUD_MASK) & PHYS_MASK) >> PAGE_SHIFT)
 
 #define set_pmd_at(mm, addr, pmdp, pmd)	set_pte_at(mm, addr, (pte_t *)pmdp, pmd_pte(pmd))
@@ -376,6 +393,8 @@ static inline pmd_t *pmd_offset(pud_t *pud, unsigned long addr)
 	return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(addr);
 }
 
+#define pud_page(pud)           pmd_page(pud_pmd(pud))
+
 #endif	/* CONFIG_ARM64_PGTABLE_LEVELS > 2 */
 
 #if CONFIG_ARM64_PGTABLE_LEVELS > 3
diff --git a/arch/arm64/mm/flush.c b/arch/arm64/mm/flush.c
index 0d64089..2d5fd47 100644
--- a/arch/arm64/mm/flush.c
+++ b/arch/arm64/mm/flush.c
@@ -104,3 +104,18 @@ EXPORT_SYMBOL(flush_dcache_page);
  */
 EXPORT_SYMBOL(flush_cache_all);
 EXPORT_SYMBOL(flush_icache_range);
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#ifdef CONFIG_HAVE_RCU_TABLE_FREE
+void pmdp_splitting_flush(struct vm_area_struct *vma, unsigned long address,
+			  pmd_t *pmdp)
+{
+	pmd_t pmd = pmd_mksplitting(*pmdp);
+	VM_BUG_ON(address & ~PMD_MASK);
+	set_pmd_at(vma->vm_mm, address, pmdp, pmd);
+
+	/* dummy IPI to serialise against fast_gup */
+	kick_all_cpus_sync();
+}
+#endif /* CONFIG_HAVE_RCU_TABLE_FREE */
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
-- 
1.9.3

* Re: [PATCH V4 1/6] mm: Introduce a general RCU get_user_pages_fast.
@ 2014-09-29 21:51     ` Hugh Dickins
  0 siblings, 0 replies; 103+ messages in thread
From: Hugh Dickins @ 2014-09-29 21:51 UTC (permalink / raw)
  To: Steve Capper
  Cc: Andrew Morton, linux-arm-kernel, catalin.marinas, linux,
	linux-arch, linux-mm, will.deacon, gary.robertson,
	christoffer.dall, peterz, anders.roxell, dann.frazier,
	mark.rutland, mgorman, hughd

On Fri, 26 Sep 2014, Steve Capper wrote:

> get_user_pages_fast attempts to pin user pages by walking the page
> tables directly and avoids taking locks. Thus the walker needs to be
> protected from page table pages being freed from under it, and needs
> to block any THP splits.
> 
> One way to achieve this is to have the walker disable interrupts, and
> rely on IPIs from the TLB flushing code blocking before the page table
> pages are freed.
> 
> On some platforms we have hardware broadcast of TLB invalidations; thus
> the TLB flushing code doesn't necessarily need to broadcast IPIs, and
> spuriously broadcasting IPIs can hurt system performance if done too
> often.
> 
> This problem has been solved on PowerPC and Sparc by batching up page
> table pages belonging to more than one mm_user, then scheduling an
> rcu_sched callback to free the pages. This RCU page table free logic
> has been promoted to core code and is activated when one enables
> HAVE_RCU_TABLE_FREE. Unfortunately, these architectures implement
> their own get_user_pages_fast routines.
> 
> The RCU page table free logic coupled with an IPI broadcast on THP
> split (which is a rare event) allows one to protect a page table
> walker by merely disabling interrupts during the walk.
> 
> This patch provides a general RCU implementation of get_user_pages_fast
> that can be used by architectures that perform hardware broadcast of
> TLB invalidations.
> 
> It is based heavily on the PowerPC implementation by Nick Piggin.
> 
> Signed-off-by: Steve Capper <steve.capper@linaro.org>
> Tested-by: Dann Frazier <dann.frazier@canonical.com>
> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>

Acked-by: Hugh Dickins <hughd@google.com>

Thanks for making all those clarifications, Steve: this looks very
good to me now.  I'm not sure which tree you're hoping will take this
and the arm+arm64 patches 2-6: although this one would normally go
through akpm, I expect it's easier for you to synchronize if it goes
in along with the arm+arm64 2-6 - would that be okay with you, Andrew?
I see no clash with what's currently in mmotm.

> ---
> Changed in V4:
>  * Added pte_numa and pmd_numa calls.
>  * Added comments to clarify what assumptions are being made by the
>    implementation.
>  * Cleaned up formatting for checkpatch.
> 
> Catalin, I've kept your Reviewed-by, please shout if you dislike the
> pte_numa and pmd_numa calls.
> ---
>  mm/Kconfig |   3 +
>  mm/gup.c   | 354 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 357 insertions(+)
> 
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 886db21..0ceb8a5 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -137,6 +137,9 @@ config HAVE_MEMBLOCK_NODE_MAP
>  config HAVE_MEMBLOCK_PHYS_MAP
>  	boolean
>  
> +config HAVE_GENERIC_RCU_GUP
> +	boolean
> +
>  config ARCH_DISCARD_MEMBLOCK
>  	boolean
>  
> diff --git a/mm/gup.c b/mm/gup.c
> index 91d044b..35c0160 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -10,6 +10,10 @@
>  #include <linux/swap.h>
>  #include <linux/swapops.h>
>  
> +#include <linux/sched.h>
> +#include <linux/rwsem.h>
> +#include <asm/pgtable.h>
> +
>  #include "internal.h"
>  
>  static struct page *no_page_table(struct vm_area_struct *vma,
> @@ -672,3 +676,353 @@ struct page *get_dump_page(unsigned long addr)
>  	return page;
>  }
>  #endif /* CONFIG_ELF_CORE */
> +
> +/**
> + * Generic RCU Fast GUP
> + *
> + * get_user_pages_fast attempts to pin user pages by walking the page
> + * tables directly and avoids taking locks. Thus the walker needs to be
> + * protected from page table pages being freed from under it, and should
> + * block any THP splits.
> + *
> + * One way to achieve this is to have the walker disable interrupts, and
> + * rely on IPIs from the TLB flushing code blocking before the page table
> + * pages are freed. This is unsuitable for architectures that do not need
> + * to broadcast an IPI when invalidating TLBs.
> + *
> + * Another way to achieve this is to batch up page table containing pages
> + * belonging to more than one mm_user, then rcu_sched a callback to free those
> + * pages. Disabling interrupts will allow the fast_gup walker to both block
> + * the rcu_sched callback, and an IPI that we broadcast for splitting THPs
> + * (which is a relatively rare event). The code below adopts this strategy.
> + *
> + * Before activating this code, please be aware that the following assumptions
> + * are currently made:
> + *
> + *  *) HAVE_RCU_TABLE_FREE is enabled, and tlb_remove_table is used to free
> + *      pages containing page tables.
> + *
> + *  *) THP splits will broadcast an IPI, this can be achieved by overriding
> + *      pmdp_splitting_flush.
> + *
> + *  *) ptes can be read atomically by the architecture.
> + *
> + *  *) access_ok is sufficient to validate userspace address ranges.
> + *
> + * The last two assumptions can be relaxed by the addition of helper functions.
> + *
> + * This code is based heavily on the PowerPC implementation by Nick Piggin.
> + */
> +#ifdef CONFIG_HAVE_GENERIC_RCU_GUP
> +
> +#ifdef __HAVE_ARCH_PTE_SPECIAL
> +static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
> +			 int write, struct page **pages, int *nr)
> +{
> +	pte_t *ptep, *ptem;
> +	int ret = 0;
> +
> +	ptem = ptep = pte_offset_map(&pmd, addr);
> +	do {
> +		/*
> +		 * In the line below we are assuming that the pte can be read
> +		 * atomically. If this is not the case for your architecture,
> +		 * please wrap this in a helper function!
> +		 *
> +		 * for an example see gup_get_pte in arch/x86/mm/gup.c
> +		 */
> +		pte_t pte = ACCESS_ONCE(*ptep);
> +		struct page *page;
> +
> +		/*
> +		 * Similar to the PMD case below, NUMA hinting must take slow
> +		 * path
> +		 */
> +		if (!pte_present(pte) || pte_special(pte) ||
> +			pte_numa(pte) || (write && !pte_write(pte)))
> +			goto pte_unmap;
> +
> +		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
> +		page = pte_page(pte);
> +
> +		if (!page_cache_get_speculative(page))
> +			goto pte_unmap;
> +
> +		if (unlikely(pte_val(pte) != pte_val(*ptep))) {
> +			put_page(page);
> +			goto pte_unmap;
> +		}
> +
> +		pages[*nr] = page;
> +		(*nr)++;
> +
> +	} while (ptep++, addr += PAGE_SIZE, addr != end);
> +
> +	ret = 1;
> +
> +pte_unmap:
> +	pte_unmap(ptem);
> +	return ret;
> +}
> +#else
> +
> +/*
> + * If we can't determine whether or not a pte is special, then fail immediately
> + * for ptes. Note, we can still pin HugeTLB and THP as these are guaranteed not
> + * to be special.
> + *
> + * For a futex to be placed on a THP tail page, get_futex_key requires a
> + * __get_user_pages_fast implementation that can pin pages. Thus it's still
> + * useful to have gup_huge_pmd even if we can't operate on ptes.
> + */
> +static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
> +			 int write, struct page **pages, int *nr)
> +{
> +	return 0;
> +}
> +#endif /* __HAVE_ARCH_PTE_SPECIAL */
> +
> +static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
> +		unsigned long end, int write, struct page **pages, int *nr)
> +{
> +	struct page *head, *page, *tail;
> +	int refs;
> +
> +	if (write && !pmd_write(orig))
> +		return 0;
> +
> +	refs = 0;
> +	head = pmd_page(orig);
> +	page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
> +	tail = page;
> +	do {
> +		VM_BUG_ON_PAGE(compound_head(page) != head, page);
> +		pages[*nr] = page;
> +		(*nr)++;
> +		page++;
> +		refs++;
> +	} while (addr += PAGE_SIZE, addr != end);
> +
> +	if (!page_cache_add_speculative(head, refs)) {
> +		*nr -= refs;
> +		return 0;
> +	}
> +
> +	if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
> +		*nr -= refs;
> +		while (refs--)
> +			put_page(head);
> +		return 0;
> +	}
> +
> +	/*
> +	 * Any tail pages need their mapcount reference taken before we
> +	 * return. (This allows the THP code to bump their ref count when
> +	 * they are split into base pages).
> +	 */
> +	while (refs--) {
> +		if (PageTail(tail))
> +			get_huge_page_tail(tail);
> +		tail++;
> +	}
> +
> +	return 1;
> +}
> +
> +static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
> +		unsigned long end, int write, struct page **pages, int *nr)
> +{
> +	struct page *head, *page, *tail;
> +	int refs;
> +
> +	if (write && !pud_write(orig))
> +		return 0;
> +
> +	refs = 0;
> +	head = pud_page(orig);
> +	page = head + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
> +	tail = page;
> +	do {
> +		VM_BUG_ON_PAGE(compound_head(page) != head, page);
> +		pages[*nr] = page;
> +		(*nr)++;
> +		page++;
> +		refs++;
> +	} while (addr += PAGE_SIZE, addr != end);
> +
> +	if (!page_cache_add_speculative(head, refs)) {
> +		*nr -= refs;
> +		return 0;
> +	}
> +
> +	if (unlikely(pud_val(orig) != pud_val(*pudp))) {
> +		*nr -= refs;
> +		while (refs--)
> +			put_page(head);
> +		return 0;
> +	}
> +
> +	while (refs--) {
> +		if (PageTail(tail))
> +			get_huge_page_tail(tail);
> +		tail++;
> +	}
> +
> +	return 1;
> +}
> +
> +static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
> +		int write, struct page **pages, int *nr)
> +{
> +	unsigned long next;
> +	pmd_t *pmdp;
> +
> +	pmdp = pmd_offset(&pud, addr);
> +	do {
> +		pmd_t pmd = ACCESS_ONCE(*pmdp);
> +
> +		next = pmd_addr_end(addr, end);
> +		if (pmd_none(pmd) || pmd_trans_splitting(pmd))
> +			return 0;
> +
> +		if (unlikely(pmd_trans_huge(pmd) || pmd_huge(pmd))) {
> +			/*
> +			 * NUMA hinting faults need to be handled in the GUP
> +			 * slowpath for accounting purposes and so that they
> +			 * can be serialised against THP migration.
> +			 */
> +			if (pmd_numa(pmd))
> +				return 0;
> +
> +			if (!gup_huge_pmd(pmd, pmdp, addr, next, write,
> +				pages, nr))
> +				return 0;
> +
> +		} else if (!gup_pte_range(pmd, addr, next, write, pages, nr))
> +				return 0;
> +	} while (pmdp++, addr = next, addr != end);
> +
> +	return 1;
> +}
> +
> +static int gup_pud_range(pgd_t *pgdp, unsigned long addr, unsigned long end,
> +		int write, struct page **pages, int *nr)
> +{
> +	unsigned long next;
> +	pud_t *pudp;
> +
> +	pudp = pud_offset(pgdp, addr);
> +	do {
> +		pud_t pud = ACCESS_ONCE(*pudp);
> +
> +		next = pud_addr_end(addr, end);
> +		if (pud_none(pud))
> +			return 0;
> +		if (pud_huge(pud)) {
> +			if (!gup_huge_pud(pud, pudp, addr, next, write,
> +					pages, nr))
> +				return 0;
> +		} else if (!gup_pmd_range(pud, addr, next, write, pages, nr))
> +			return 0;
> +	} while (pudp++, addr = next, addr != end);
> +
> +	return 1;
> +}
> +
> +/*
> + * Like get_user_pages_fast() except it's IRQ-safe in that it won't fall
> + * back to the regular GUP. It will only return non-negative values.
> + */
> +int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
> +			  struct page **pages)
> +{
> +	struct mm_struct *mm = current->mm;
> +	unsigned long addr, len, end;
> +	unsigned long next, flags;
> +	pgd_t *pgdp;
> +	int nr = 0;
> +
> +	start &= PAGE_MASK;
> +	addr = start;
> +	len = (unsigned long) nr_pages << PAGE_SHIFT;
> +	end = start + len;
> +
> +	if (unlikely(!access_ok(write ? VERIFY_WRITE : VERIFY_READ,
> +					start, len)))
> +		return 0;
> +
> +	/*
> +	 * Disable interrupts. We use the nested form as we can already
> +	 * have interrupts disabled by get_futex_key.
> +	 *
> +	 * With interrupts disabled, we block page table pages from being
> +	 * freed from under us. See mmu_gather_tlb in asm-generic/tlb.h
> +	 * for more details.
> +	 *
> +	 * We do not adopt an rcu_read_lock(.) here as we also want to
> +	 * block IPIs that come from THPs splitting.
> +	 */
> +
> +	local_irq_save(flags);
> +	pgdp = pgd_offset(mm, addr);
> +	do {
> +		next = pgd_addr_end(addr, end);
> +		if (pgd_none(*pgdp))
> +			break;
> +		else if (!gup_pud_range(pgdp, addr, next, write, pages, &nr))
> +			break;
> +	} while (pgdp++, addr = next, addr != end);
> +	local_irq_restore(flags);
> +
> +	return nr;
> +}
> +
> +/**
> + * get_user_pages_fast() - pin user pages in memory
> + * @start:	starting user address
> + * @nr_pages:	number of pages from start to pin
> + * @write:	whether pages will be written to
> + * @pages:	array that receives pointers to the pages pinned.
> + *		Should be at least nr_pages long.
> + *
> + * Attempt to pin user pages in memory without taking mm->mmap_sem.
> + * If not successful, it will fall back to taking the lock and
> + * calling get_user_pages().
> + *
> + * Returns number of pages pinned. This may be fewer than the number
> + * requested. If nr_pages is 0 or negative, returns 0. If no pages
> + * were pinned, returns -errno.
> + */
> +int get_user_pages_fast(unsigned long start, int nr_pages, int write,
> +			struct page **pages)
> +{
> +	struct mm_struct *mm = current->mm;
> +	int nr, ret;
> +
> +	start &= PAGE_MASK;
> +	nr = __get_user_pages_fast(start, nr_pages, write, pages);
> +	ret = nr;
> +
> +	if (nr < nr_pages) {
> +		/* Try to get the remaining pages with get_user_pages */
> +		start += nr << PAGE_SHIFT;
> +		pages += nr;
> +
> +		down_read(&mm->mmap_sem);
> +		ret = get_user_pages(current, mm, start,
> +				     nr_pages - nr, write, 0, pages, NULL);
> +		up_read(&mm->mmap_sem);
> +
> +		/* Have to be a bit careful with return values */
> +		if (nr > 0) {
> +			if (ret < 0)
> +				ret = nr;
> +			else
> +				ret += nr;
> +		}
> +	}
> +
> +	return ret;
> +}
> +
> +#endif /* CONFIG_HAVE_GENERIC_RCU_GUP */
> -- 
> 1.9.3
> 
> 
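
A note on the atomic-read assumption in the gup_pte_range comment above:
the helper it asks for can be modelled on the x86 PAE version of
gup_get_pte. A minimal sketch, assuming a two-word pte whose low word
carries the present bit (the pte_low/pte_high field names are
illustrative, not part of this patch):

	static inline pte_t gup_get_pte(pte_t *ptep)
	{
		pte_t pte;

		/*
		 * Read the two halves in order; if the low word changed
		 * while we read the high word, the pair may be torn, so
		 * retry until we observe a consistent pte.
		 */
		do {
			pte.pte_low = ptep->pte_low;
			smp_rmb();
			pte.pte_high = ptep->pte_high;
			smp_rmb();
		} while (unlikely(pte.pte_low != ptep->pte_low));

		return pte;
	}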

^ permalink raw reply	[flat|nested] 103+ messages in thread
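
The THP-tail futex case referred to in the gup_pte_range comment works
out as follows: get_futex_key pins the page, and if it turns out to be a
tail page it re-pins it with interrupts disabled, so that a THP split
(which broadcasts an IPI) cannot race with the key calculation. A
condensed sketch of that logic -- the helper name and error handling are
simplified here, not the actual kernel/futex.c code:

	static int pin_futex_page(unsigned long address, int ro,
				  struct page **pagep)
	{
		struct page *page;

	again:
		if (get_user_pages_fast(address, 1, !ro, &page) != 1)
			return -EFAULT;

		if (unlikely(PageTail(page))) {
			put_page(page);
			/*
			 * Re-pin with IRQs off: this blocks the IPI that a
			 * THP split broadcasts, so the page cannot be split
			 * from under us. A __get_user_pages_fast stub that
			 * always returns 0 would loop here forever.
			 */
			local_irq_disable();
			if (__get_user_pages_fast(address, 1, !ro, &page) != 1) {
				local_irq_enable();
				goto again;
			}
			local_irq_enable();
		}

		*pagep = page;
		return 0;
	}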

* Re: [PATCH V4 1/6] mm: Introduce a general RCU get_user_pages_fast.
@ 2014-09-29 21:51     ` Hugh Dickins
  0 siblings, 0 replies; 103+ messages in thread
From: Hugh Dickins @ 2014-09-29 21:51 UTC (permalink / raw)
  To: Steve Capper
  Cc: Andrew Morton, linux-arm-kernel, catalin.marinas, linux,
	linux-arch, linux-mm, will.deacon, gary.robertson,
	christoffer.dall, peterz, anders.roxell, dann.frazier,
	mark.rutland, mgorman, hughd

On Fri, 26 Sep 2014, Steve Capper wrote:

> get_user_pages_fast attempts to pin user pages by walking the page
> tables directly and avoids taking locks. Thus the walker needs to be
> protected from page table pages being freed from under it, and needs
> to block any THP splits.
> 
> One way to achieve this is to have the walker disable interrupts, and
> rely on IPIs from the TLB flushing code blocking before the page table
> pages are freed.
> 
> On some platforms we have hardware broadcast of TLB invalidations, thus
> the TLB flushing code doesn't necessarily need to broadcast IPIs; and
> spuriously broadcasting IPIs can hurt system performance if done too
> often.
> 
> This problem has been solved on PowerPC and Sparc by batching up page
> table pages belonging to more than one mm_user, then scheduling an
> rcu_sched callback to free the pages. This RCU page table free logic
> has been promoted to core code and is activated when one enables
> HAVE_RCU_TABLE_FREE. Unfortunately, these architectures implement
> their own get_user_pages_fast routines.
> 
> The RCU page table free logic, coupled with an IPI broadcast on THP
> split (which is a rare event), allows one to protect a page table
> walker by merely disabling the interrupts during the walk.
> 
> This patch provides a general RCU implementation of get_user_pages_fast
> that can be used by architectures that perform hardware broadcast of
> TLB invalidations.
> 
> It is based heavily on the PowerPC implementation by Nick Piggin.
> 
> Signed-off-by: Steve Capper <steve.capper@linaro.org>
> Tested-by: Dann Frazier <dann.frazier@canonical.com>
> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>

Acked-by: Hugh Dickins <hughd@google.com>

Thanks for making all those clarifications, Steve: this looks very
good to me now.  I'm not sure which tree you're hoping will take this
and the arm+arm64 patches 2-6: although this one would normally go
through akpm, I expect it's easier for you to synchronize if it goes
in along with the arm+arm64 2-6 - would that be okay with you, Andrew?
I see no clash with what's currently in mmotm.

> ---
> Changed in V4:
>  * Added pte_numa and pmd_numa calls.
>  * Added comments to clarify what assumptions are being made by the
>    implementation.
>  * Cleaned up formatting for checkpatch.
> 
> Catalin, I've kept your Reviewed-by, please shout if you dislike the
> pte_numa and pmd_numa calls.
> ---
>  mm/Kconfig |   3 +
>  mm/gup.c   | 354 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 357 insertions(+)
> 
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 886db21..0ceb8a5 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -137,6 +137,9 @@ config HAVE_MEMBLOCK_NODE_MAP
>  config HAVE_MEMBLOCK_PHYS_MAP
>  	boolean
>  
> +config HAVE_GENERIC_RCU_GUP
> +	boolean
> +
>  config ARCH_DISCARD_MEMBLOCK
>  	boolean
>  
> diff --git a/mm/gup.c b/mm/gup.c
> index 91d044b..35c0160 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -10,6 +10,10 @@
>  #include <linux/swap.h>
>  #include <linux/swapops.h>
>  
> +#include <linux/sched.h>
> +#include <linux/rwsem.h>
> +#include <asm/pgtable.h>
> +
>  #include "internal.h"
>  
>  static struct page *no_page_table(struct vm_area_struct *vma,
> @@ -672,3 +676,353 @@ struct page *get_dump_page(unsigned long addr)
>  	return page;
>  }
>  #endif /* CONFIG_ELF_CORE */
> +
> +/**
> + * Generic RCU Fast GUP
> + *
> + * get_user_pages_fast attempts to pin user pages by walking the page
> + * tables directly and avoids taking locks. Thus the walker needs to be
> + * protected from page table pages being freed from under it, and should
> + * block any THP splits.
> + *
> + * One way to achieve this is to have the walker disable interrupts, and
> + * rely on IPIs from the TLB flushing code blocking before the page table
> + * pages are freed. This is unsuitable for architectures that do not need
> + * to broadcast an IPI when invalidating TLBs.
> + *
> + * Another way to achieve this is to batch up pages containing page tables
> + * belonging to more than one mm_user, then schedule an rcu_sched callback to
> + * free those pages. Disabling interrupts will allow the fast_gup walker to
> + * block both the rcu_sched callback and the IPI that we broadcast when
> + * splitting THPs (which is a relatively rare event). The code below adopts
> + * this strategy.
> + *
> + * Before activating this code, please be aware that the following assumptions
> + * are currently made:
> + *
> + *  *) HAVE_RCU_TABLE_FREE is enabled, and tlb_remove_table is used to free
> + *      pages containing page tables.
> + *
> + *  *) THP splits will broadcast an IPI, this can be achieved by overriding
> + *      pmdp_splitting_flush.
> + *
> + *  *) ptes can be read atomically by the architecture.
> + *
> + *  *) access_ok is sufficient to validate userspace address ranges.
> + *
> + * The last two assumptions can be relaxed by the addition of helper functions.
> + *
> + * This code is based heavily on the PowerPC implementation by Nick Piggin.
> + */
> +#ifdef CONFIG_HAVE_GENERIC_RCU_GUP
> +
> +#ifdef __HAVE_ARCH_PTE_SPECIAL
> +static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
> +			 int write, struct page **pages, int *nr)
> +{
> +	pte_t *ptep, *ptem;
> +	int ret = 0;
> +
> +	ptem = ptep = pte_offset_map(&pmd, addr);
> +	do {
> +		/*
> +		 * In the line below we are assuming that the pte can be read
> +		 * atomically. If this is not the case for your architecture,
> +		 * please wrap this in a helper function!
> +		 *
> +		 * For an example, see gup_get_pte in arch/x86/mm/gup.c
> +		 */
> +		pte_t pte = ACCESS_ONCE(*ptep);
> +		struct page *page;
> +
> +		/*
> +		 * Similar to the PMD case below, NUMA hinting must take slow
> +		 * path
> +		 */
> +		if (!pte_present(pte) || pte_special(pte) ||
> +			pte_numa(pte) || (write && !pte_write(pte)))
> +			goto pte_unmap;
> +
> +		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
> +		page = pte_page(pte);
> +
> +		if (!page_cache_get_speculative(page))
> +			goto pte_unmap;
> +
> +		if (unlikely(pte_val(pte) != pte_val(*ptep))) {
> +			put_page(page);
> +			goto pte_unmap;
> +		}
> +
> +		pages[*nr] = page;
> +		(*nr)++;
> +
> +	} while (ptep++, addr += PAGE_SIZE, addr != end);
> +
> +	ret = 1;
> +
> +pte_unmap:
> +	pte_unmap(ptem);
> +	return ret;
> +}
> +#else
> +
> +/*
> + * If we can't determine whether or not a pte is special, then fail immediately
> + * for ptes. Note, we can still pin HugeTLB and THP as these are guaranteed not
> + * to be special.
> + *
> + * For a futex to be placed on a THP tail page, get_futex_key requires a
> + * __get_user_pages_fast implementation that can pin pages. Thus it's still
> + * useful to have gup_huge_pmd even if we can't operate on ptes.
> + */
> +static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
> +			 int write, struct page **pages, int *nr)
> +{
> +	return 0;
> +}
> +#endif /* __HAVE_ARCH_PTE_SPECIAL */
> +
> +static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
> +		unsigned long end, int write, struct page **pages, int *nr)
> +{
> +	struct page *head, *page, *tail;
> +	int refs;
> +
> +	if (write && !pmd_write(orig))
> +		return 0;
> +
> +	refs = 0;
> +	head = pmd_page(orig);
> +	page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
> +	tail = page;
> +	do {
> +		VM_BUG_ON_PAGE(compound_head(page) != head, page);
> +		pages[*nr] = page;
> +		(*nr)++;
> +		page++;
> +		refs++;
> +	} while (addr += PAGE_SIZE, addr != end);
> +
> +	if (!page_cache_add_speculative(head, refs)) {
> +		*nr -= refs;
> +		return 0;
> +	}
> +
> +	if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
> +		*nr -= refs;
> +		while (refs--)
> +			put_page(head);
> +		return 0;
> +	}
> +
> +	/*
> +	 * Any tail pages need their mapcount reference taken before we
> +	 * return. (This allows the THP code to bump their ref count when
> +	 * they are split into base pages).
> +	 */
> +	while (refs--) {
> +		if (PageTail(tail))
> +			get_huge_page_tail(tail);
> +		tail++;
> +	}
> +
> +	return 1;
> +}
> +
> +static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
> +		unsigned long end, int write, struct page **pages, int *nr)
> +{
> +	struct page *head, *page, *tail;
> +	int refs;
> +
> +	if (write && !pud_write(orig))
> +		return 0;
> +
> +	refs = 0;
> +	head = pud_page(orig);
> +	page = head + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
> +	tail = page;
> +	do {
> +		VM_BUG_ON_PAGE(compound_head(page) != head, page);
> +		pages[*nr] = page;
> +		(*nr)++;
> +		page++;
> +		refs++;
> +	} while (addr += PAGE_SIZE, addr != end);
> +
> +	if (!page_cache_add_speculative(head, refs)) {
> +		*nr -= refs;
> +		return 0;
> +	}
> +
> +	if (unlikely(pud_val(orig) != pud_val(*pudp))) {
> +		*nr -= refs;
> +		while (refs--)
> +			put_page(head);
> +		return 0;
> +	}
> +
> +	while (refs--) {
> +		if (PageTail(tail))
> +			get_huge_page_tail(tail);
> +		tail++;
> +	}
> +
> +	return 1;
> +}
> +
> +static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
> +		int write, struct page **pages, int *nr)
> +{
> +	unsigned long next;
> +	pmd_t *pmdp;
> +
> +	pmdp = pmd_offset(&pud, addr);
> +	do {
> +		pmd_t pmd = ACCESS_ONCE(*pmdp);
> +
> +		next = pmd_addr_end(addr, end);
> +		if (pmd_none(pmd) || pmd_trans_splitting(pmd))
> +			return 0;
> +
> +		if (unlikely(pmd_trans_huge(pmd) || pmd_huge(pmd))) {
> +			/*
> +			 * NUMA hinting faults need to be handled in the GUP
> +			 * slowpath for accounting purposes and so that they
> +			 * can be serialised against THP migration.
> +			 */
> +			if (pmd_numa(pmd))
> +				return 0;
> +
> +			if (!gup_huge_pmd(pmd, pmdp, addr, next, write,
> +				pages, nr))
> +				return 0;
> +
> +		} else if (!gup_pte_range(pmd, addr, next, write, pages, nr))
> +				return 0;
> +	} while (pmdp++, addr = next, addr != end);
> +
> +	return 1;
> +}
> +
> +static int gup_pud_range(pgd_t *pgdp, unsigned long addr, unsigned long end,
> +		int write, struct page **pages, int *nr)
> +{
> +	unsigned long next;
> +	pud_t *pudp;
> +
> +	pudp = pud_offset(pgdp, addr);
> +	do {
> +		pud_t pud = ACCESS_ONCE(*pudp);
> +
> +		next = pud_addr_end(addr, end);
> +		if (pud_none(pud))
> +			return 0;
> +		if (pud_huge(pud)) {
> +			if (!gup_huge_pud(pud, pudp, addr, next, write,
> +					pages, nr))
> +				return 0;
> +		} else if (!gup_pmd_range(pud, addr, next, write, pages, nr))
> +			return 0;
> +	} while (pudp++, addr = next, addr != end);
> +
> +	return 1;
> +}
> +
> +/*
> + * Like get_user_pages_fast() except it's IRQ-safe in that it won't fall
> + * back to the regular GUP. It will only return non-negative values.
> + */
> +int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
> +			  struct page **pages)
> +{
> +	struct mm_struct *mm = current->mm;
> +	unsigned long addr, len, end;
> +	unsigned long next, flags;
> +	pgd_t *pgdp;
> +	int nr = 0;
> +
> +	start &= PAGE_MASK;
> +	addr = start;
> +	len = (unsigned long) nr_pages << PAGE_SHIFT;
> +	end = start + len;
> +
> +	if (unlikely(!access_ok(write ? VERIFY_WRITE : VERIFY_READ,
> +					start, len)))
> +		return 0;
> +
> +	/*
> +	 * Disable interrupts. We use the nested form as we can already
> +	 * have interrupts disabled by get_futex_key.
> +	 *
> +	 * With interrupts disabled, we block page table pages from being
> +	 * freed from under us. See mmu_gather_tlb in asm-generic/tlb.h
> +	 * for more details.
> +	 *
> +	 * We do not adopt an rcu_read_lock(.) here as we also want to
> +	 * block IPIs that come from THPs splitting.
> +	 */
> +
> +	local_irq_save(flags);
> +	pgdp = pgd_offset(mm, addr);
> +	do {
> +		next = pgd_addr_end(addr, end);
> +		if (pgd_none(*pgdp))
> +			break;
> +		else if (!gup_pud_range(pgdp, addr, next, write, pages, &nr))
> +			break;
> +	} while (pgdp++, addr = next, addr != end);
> +	local_irq_restore(flags);
> +
> +	return nr;
> +}
> +
> +/**
> + * get_user_pages_fast() - pin user pages in memory
> + * @start:	starting user address
> + * @nr_pages:	number of pages from start to pin
> + * @write:	whether pages will be written to
> + * @pages:	array that receives pointers to the pages pinned.
> + *		Should be at least nr_pages long.
> + *
> + * Attempt to pin user pages in memory without taking mm->mmap_sem.
> + * If not successful, it will fall back to taking the lock and
> + * calling get_user_pages().
> + *
> + * Returns number of pages pinned. This may be fewer than the number
> + * requested. If nr_pages is 0 or negative, returns 0. If no pages
> + * were pinned, returns -errno.
> + */
> +int get_user_pages_fast(unsigned long start, int nr_pages, int write,
> +			struct page **pages)
> +{
> +	struct mm_struct *mm = current->mm;
> +	int nr, ret;
> +
> +	start &= PAGE_MASK;
> +	nr = __get_user_pages_fast(start, nr_pages, write, pages);
> +	ret = nr;
> +
> +	if (nr < nr_pages) {
> +		/* Try to get the remaining pages with get_user_pages */
> +		start += nr << PAGE_SHIFT;
> +		pages += nr;
> +
> +		down_read(&mm->mmap_sem);
> +		ret = get_user_pages(current, mm, start,
> +				     nr_pages - nr, write, 0, pages, NULL);
> +		up_read(&mm->mmap_sem);
> +
> +		/* Have to be a bit careful with return values */
> +		if (nr > 0) {
> +			if (ret < 0)
> +				ret = nr;
> +			else
> +				ret += nr;
> +		}
> +	}
> +
> +	return ret;
> +}
> +
> +#endif /* CONFIG_HAVE_GENERIC_RCU_GUP */
> -- 
> 1.9.3
> 
> 
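
On the second assumption above (THP splits broadcast an IPI): the arm64
patches later in this series satisfy it by overriding
pmdp_splitting_flush, roughly as follows -- a sketch of the shape, not
the exact patch text:

	void pmdp_splitting_flush(struct vm_area_struct *vma,
				  unsigned long address, pmd_t *pmdp)
	{
		pmd_t pmd = pmd_mksplitting(*pmdp);

		VM_BUG_ON(address & ~PMD_MASK);
		set_pmd_at(vma->vm_mm, address, pmdp, pmd);

		/* Dummy IPI to serialise against the irq-disabled walker. */
		kick_all_cpus_sync();
	}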

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH V4 1/6] mm: Introduce a general RCU get_user_pages_fast.
@ 2014-10-01 11:11       ` Catalin Marinas
  0 siblings, 0 replies; 103+ messages in thread
From: Catalin Marinas @ 2014-10-01 11:11 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Steve Capper, Andrew Morton, linux-arm-kernel, linux, linux-arch,
	linux-mm, Will Deacon, gary.robertson, christoffer.dall, peterz,
	anders.roxell, dann.frazier, Mark Rutland, mgorman

On Mon, Sep 29, 2014 at 10:51:25PM +0100, Hugh Dickins wrote:
> On Fri, 26 Sep 2014, Steve Capper wrote:
> > This patch provides a general RCU implementation of get_user_pages_fast
> > that can be used by architectures that perform hardware broadcast of
> > TLB invalidations.
> >
> > It is based heavily on the PowerPC implementation by Nick Piggin.
> >
> > Signed-off-by: Steve Capper <steve.capper@linaro.org>
> > Tested-by: Dann Frazier <dann.frazier@canonical.com>
> > Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
> 
> Acked-by: Hugh Dickins <hughd@google.com>
> 
> Thanks for making all those clarifications, Steve: this looks very
> good to me now.  I'm not sure which tree you're hoping will take this
> and the arm+arm64 patches 2-6: although this one would normally go
> through akpm, I expect it's easier for you to synchronize if it goes
> in along with the arm+arm64 2-6 - would that be okay with you, Andrew?
> I see no clash with what's currently in mmotm.

From an arm64 perspective, I'm more than happy for Andrew to pick up the
entire series. I already reviewed the patches.

Thanks.

-- 
Catalin

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH V4 1/6] mm: Introduce a general RCU get_user_pages_fast.
@ 2014-10-02 12:19     ` Andrea Arcangeli
  0 siblings, 0 replies; 103+ messages in thread
From: Andrea Arcangeli @ 2014-10-02 12:19 UTC (permalink / raw)
  To: Steve Capper
  Cc: linux-arm-kernel, catalin.marinas, linux, linux-arch, linux-mm,
	will.deacon, gary.robertson, christoffer.dall, peterz,
	anders.roxell, akpm, dann.frazier, mark.rutland, mgorman, hughd

Hi Steve,

On Fri, Sep 26, 2014 at 03:03:48PM +0100, Steve Capper wrote:
> This patch provides a general RCU implementation of get_user_pages_fast
> that can be used by architectures that perform hardware broadcast of
> TLB invalidations.
> 
> It is based heavily on the PowerPC implementation by Nick Piggin.

It'd be nice if you could also at the same time apply it to sparc and
powerpc in this same patchset to show the effectiveness of having a
generic version. Because if it's not a trivial drop-in replacement,
then this should go in arch/arm* instead of mm/gup.c...

Also I wonder if it wouldn't be better to add it to mm/util.c along
with the __weak gup_fast, but then this is ok too. I'm just saying
this because we never had signs of gup_fast code in mm/gup.c so far,
but then this isn't exactly a __weak version of it... so I don't mind
either way.
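
For context, the __weak gup_fast mentioned here is the default in
mm/util.c, which is essentially just the locked slow path. Its shape,
approximately:

	int __weak get_user_pages_fast(unsigned long start, int nr_pages,
				       int write, struct page **pages)
	{
		struct mm_struct *mm = current->mm;
		int ret;

		down_read(&mm->mmap_sem);
		ret = get_user_pages(current, mm, start, nr_pages,
				     write, 0, pages, NULL);
		up_read(&mm->mmap_sem);

		return ret;
	}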

> +		down_read(&mm->mmap_sem);
> +		ret = get_user_pages(current, mm, start,
> +				     nr_pages - nr, write, 0, pages, NULL);
> +		up_read(&mm->mmap_sem);

This has a collision with a patchset I posted, but it's trivial to
solve, the above three lines need to be replaced with:

+		ret = get_user_pages_unlocked(current, mm, start,
+				     nr_pages - nr, write, 0, pages);

And then arm gup_fast will also page fault with FAULT_FLAG_ALLOW_RETRY
the first time to release the mmap_sem before I/O.
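
Spelled out, the fallback tail of get_user_pages_fast would then read as
below, assuming get_user_pages_unlocked lands with the signature used in
the snippet above:

	if (nr < nr_pages) {
		/* Try to get the remaining pages with get_user_pages */
		start += nr << PAGE_SHIFT;
		pages += nr;

		/*
		 * get_user_pages_unlocked takes and releases mmap_sem
		 * itself, and may drop it across faulting I/O.
		 */
		ret = get_user_pages_unlocked(current, mm, start,
					      nr_pages - nr, write, 0, pages);

		/* Have to be a bit careful with return values */
		if (nr > 0) {
			if (ret < 0)
				ret = nr;
			else
				ret += nr;
		}
	}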

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH V4 1/6] mm: Introduce a general RCU get_user_pages_fast.
@ 2014-10-02 16:00       ` Steve Capper
  0 siblings, 0 replies; 103+ messages in thread
From: Steve Capper @ 2014-10-02 16:00 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, linux-arm-kernel, Catalin Marinas, linux,
	linux-arch, linux-mm, Will Deacon, Gary Robertson,
	Christoffer Dall, Peter Zijlstra, Anders Roxell, Dann Frazier,
	Mark Rutland, Mel Gorman

On 30 September 2014 04:51, Hugh Dickins <hughd@google.com> wrote:
> On Fri, 26 Sep 2014, Steve Capper wrote:
>
>> get_user_pages_fast attempts to pin user pages by walking the page
>> tables directly and avoids taking locks. Thus the walker needs to be
>> protected from page table pages being freed from under it, and needs
>> to block any THP splits.
>>
>> One way to achieve this is to have the walker disable interrupts, and
>> rely on IPIs from the TLB flushing code blocking before the page table
>> pages are freed.
>>
>> On some platforms we have hardware broadcast of TLB invalidations, thus
>> the TLB flushing code doesn't necessarily need to broadcast IPIs; and
>> spuriously broadcasting IPIs can hurt system performance if done too
>> often.
>>
>> This problem has been solved on PowerPC and Sparc by batching up page
>> table pages belonging to more than one mm_user, then scheduling an
>> rcu_sched callback to free the pages. This RCU page table free logic
>> has been promoted to core code and is activated when one enables
>> HAVE_RCU_TABLE_FREE. Unfortunately, these architectures implement
>> their own get_user_pages_fast routines.
>>
>> The RCU page table free logic, coupled with an IPI broadcast on THP
>> split (which is a rare event), allows one to protect a page table
>> walker by merely disabling the interrupts during the walk.
>>
>> This patch provides a general RCU implementation of get_user_pages_fast
>> that can be used by architectures that perform hardware broadcast of
>> TLB invalidations.
>>
>> It is based heavily on the PowerPC implementation by Nick Piggin.
>>
>> Signed-off-by: Steve Capper <steve.capper@linaro.org>
>> Tested-by: Dann Frazier <dann.frazier@canonical.com>
>> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
>
> Acked-by: Hugh Dickins <hughd@google.com>
>

Thanks Hugh!

> Thanks for making all those clarifications, Steve: this looks very
> good to me now.  I'm not sure which tree you're hoping will take this
> and the arm+arm64 patches 2-6: although this one would normally go
> through akpm, I expect it's easier for you to synchronize if it goes
> in along with the arm+arm64 2-6 - would that be okay with you, Andrew?
> I see no clash with what's currently in mmotm.

I see it's gone into mmotm.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH V4 1/6] mm: Introduce a general RCU get_user_pages_fast.
@ 2014-10-02 16:18       ` Steve Capper
  0 siblings, 0 replies; 103+ messages in thread
From: Steve Capper @ 2014-10-02 16:18 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-arm-kernel, Catalin Marinas, linux, linux-arch, linux-mm,
	Will Deacon, Gary Robertson, Christoffer Dall, Peter Zijlstra,
	Anders Roxell, akpm, Dann Frazier, Mark Rutland, Mel Gorman,
	Hugh Dickins

On 2 October 2014 19:19, Andrea Arcangeli <aarcange@redhat.com> wrote:
> Hi Steve,
>

Hi Andrea,

> On Fri, Sep 26, 2014 at 03:03:48PM +0100, Steve Capper wrote:
>> This patch provides a general RCU implementation of get_user_pages_fast
>> that can be used by architectures that perform hardware broadcast of
>> TLB invalidations.
>>
>> It is based heavily on the PowerPC implementation by Nick Piggin.
>
> It'd be nice if you could also at the same time apply it to sparc and
> powerpc in this same patchset to show the effectiveness of having a
> generic version. Because if it's not a trivial drop-in replacement,
> then this should go in arch/arm* instead of mm/gup.c...

I think it should be adapted (if need be) and adopted for sparc, power
and others, especially as it will result in a reduction in code size
and make future alterations to gup easier.
I would prefer to get this in iteratively, and have people who are
knowledgeable about those architectures and have a means of testing the
code thoroughly help out. (It will be very hard for me to implement
this on my own, but likely trivial for people who know and can test
those architectures.)
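
Schematically, the opt-in for an architecture comes down to two Kconfig
selects plus the helpers listed in the assumptions of patch 1 (the arch
entry name below is hypothetical):

	config EXAMPLE_ARCH
		def_bool y
		# take get_user_pages_fast/__get_user_pages_fast from mm/gup.c
		select HAVE_GENERIC_RCU_GUP
		# free page-table pages via rcu_sched / tlb_remove_table
		select HAVE_RCU_TABLE_FREE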

>
> Also I wonder if it wouldn't be better to add it to mm/util.c along
> with the __weak gup_fast, but then this is ok too. I'm just saying
> this because we never had signs of gup_fast code in mm/gup.c so far,
> but then this isn't exactly a __weak version of it... so I don't mind
> either way.

mm/gup.c was recently created?
It may even make sense to move the weak version there in a future patch?

>
>> +             down_read(&mm->mmap_sem);
>> +             ret = get_user_pages(current, mm, start,
>> +                                  nr_pages - nr, write, 0, pages, NULL);
>> +             up_read(&mm->mmap_sem);
>
> This has a collision with a patchset I posted, but it's trivial to
> solve, the above three lines need to be replaced with:
>
> +               ret = get_user_pages_unlocked(current, mm, start,
> +                                    nr_pages - nr, write, 0, pages);
>
> And then arm gup_fast will also page fault with FAULT_FLAG_ALLOW_RETRY
> the first time to release the mmap_sem before I/O.
>

Ahh thanks.
I'm currently on holiday and have very limited access to email, I'd
appreciate it if someone can keep an eye out for this during the merge
window if this conflict arises?

> Thanks,
> Andrea

Cheers,
--
Steve

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH V4 1/6] mm: Introduce a general RCU get_user_pages_fast.
@ 2014-10-02 16:54         ` Andrea Arcangeli
  0 siblings, 0 replies; 103+ messages in thread
From: Andrea Arcangeli @ 2014-10-02 16:54 UTC (permalink / raw)
  To: Steve Capper
  Cc: linux-arm-kernel, Catalin Marinas, linux, linux-arch, linux-mm,
	Will Deacon, Gary Robertson, Christoffer Dall, Peter Zijlstra,
	Anders Roxell, akpm, Dann Frazier, Mark Rutland, Mel Gorman,
	Hugh Dickins

On Thu, Oct 02, 2014 at 11:18:00PM +0700, Steve Capper wrote:
> mm/gup.c was recently created?
> It may even make sense to move the weak version in a future patch?

I think the __weak stuff tends to go in lib, that's probably why it's
there. I don't mind either way.

> I'm currently on holiday and have very limited access to email, I'd
> appreciate it if someone can keep an eye out for this during the merge
> window if this conflict arises?

No problem, I assume Andrew will merge your patchset first, so I can
resubmit against -mm patching the gup_fast_rcu too.

Thanks,
Andrea


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH V4 1/6] mm: Introduce a general RCU get_user_pages_fast.
  2014-10-02 12:19     ` Andrea Arcangeli
  (?)
@ 2014-10-13  5:15       ` Aneesh Kumar K.V
  -1 siblings, 0 replies; 103+ messages in thread
From: Aneesh Kumar K.V @ 2014-10-13  5:15 UTC (permalink / raw)
  To: Andrea Arcangeli, Steve Capper
  Cc: linux-arm-kernel, catalin.marinas, linux, linux-arch, linux-mm,
	will.deacon, gary.robertson, christoffer.dall, peterz,
	anders.roxell, akpm, dann.frazier, mark.rutland, mgorman, hughd

Andrea Arcangeli <aarcange@redhat.com> writes:

> Hi Steve,
>
> On Fri, Sep 26, 2014 at 03:03:48PM +0100, Steve Capper wrote:
>> This patch provides a general RCU implementation of get_user_pages_fast
>> that can be used by architectures that perform hardware broadcast of
>> TLB invalidations.
>> 
>> It is based heavily on the PowerPC implementation by Nick Piggin.
>
> It'd be nice if you could also at the same time apply it to sparc and
> powerpc in this same patchset to show the effectiveness of having a
> generic version. Because if it's not a trivial drop-in replacement,
> then this should go in arch/arm* instead of mm/gup.c...

On ppc64 we have one challenge: we need to support hugepd. At the pmd
level we can have a hugepte, a normal pmd pointer, or a pointer to a
hugepage directory (used by some sub-architectures/platforms), i.e. the
below part of the gup implementation on ppc64:

else if (is_hugepd(pmdp)) {
	if (!gup_hugepd((hugepd_t *)pmdp, PMD_SHIFT,
			addr, next, write, pages, nr))
		return 0;
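
(To illustrate the three-way dispatch described above -- a sketch only,
with the hugepd walk itself elided:)

	pmd_t pmd = ACCESS_ONCE(*pmdp);

	if (pmd_trans_huge(pmd) || pmd_huge(pmd)) {
		/* a huge pte sitting directly at the pmd level */
	} else if (is_hugepd(pmdp)) {
		/* a pointer to a hugepage directory; walk it separately */
	} else {
		/* a normal pmd: descend to the pte level */
	}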


-aneesh

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH V4 1/6] mm: Introduce a general RCU get_user_pages_fast.
@ 2014-10-13  5:21         ` David Miller
  0 siblings, 0 replies; 103+ messages in thread
From: David Miller @ 2014-10-13  5:21 UTC (permalink / raw)
  To: aneesh.kumar
  Cc: aarcange, steve.capper, linux-arm-kernel, catalin.marinas, linux,
	linux-arch, linux-mm, will.deacon, gary.robertson,
	christoffer.dall, peterz, anders.roxell, akpm, dann.frazier,
	mark.rutland, mgorman, hughd

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Date: Mon, 13 Oct 2014 10:45:24 +0530

> Andrea Arcangeli <aarcange@redhat.com> writes:
> 
>> Hi Steve,
>>
>> On Fri, Sep 26, 2014 at 03:03:48PM +0100, Steve Capper wrote:
>>> This patch provides a general RCU implementation of get_user_pages_fast
>>> that can be used by architectures that perform hardware broadcast of
>>> TLB invalidations.
>>> 
>>> It is based heavily on the PowerPC implementation by Nick Piggin.
>>
>> It'd be nice if you could also at the same time apply it to sparc and
>> powerpc in this same patchset to show the effectiveness of having a
>> generic version. Because if it's not a trivial drop-in replacement,
>> then this should go in arch/arm* instead of mm/gup.c...
> 
> on ppc64 we have one challenge, we do need to support hugepd. At the pmd
> level we can have hugepte, normal pmd pointer or a pointer to hugepage
> directory which is used in case of some sub-architectures/platforms. ie,
> the below part of gup implementation in ppc64
> 
> else if (is_hugepd(pmdp)) {
> 	if (!gup_hugepd((hugepd_t *)pmdp, PMD_SHIFT,
> 			addr, next, write, pages, nr))
> 		return 0;

Sparc has to deal with the same issue.


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH V4 1/6] mm: Introduce a general RCU get_user_pages_fast.
@ 2014-10-13  6:22     ` Aneesh Kumar K.V
  0 siblings, 0 replies; 103+ messages in thread
From: Aneesh Kumar K.V @ 2014-10-13  6:22 UTC (permalink / raw)
  To: linux-arm-kernel, catalin.marinas, linux, linux-arch, linux-mm
  Cc: will.deacon, gary.robertson, christoffer.dall, peterz,
	anders.roxell, akpm, dann.frazier, mark.rutland, mgorman, hughd,
	Steve Capper

Steve Capper <steve.capper@linaro.org> writes:

.....

> +
> +static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
> +		unsigned long end, int write, struct page **pages, int *nr)
> +{
> +	struct page *head, *page, *tail;
> +	int refs;
> +
> +	if (write && !pmd_write(orig))
> +		return 0;
> +
> +	refs = 0;
> +	head = pmd_page(orig);
> +	page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
> +	tail = page;
> +	do {
> +		VM_BUG_ON_PAGE(compound_head(page) != head, page);
> +		pages[*nr] = page;
> +		(*nr)++;
> +		page++;
> +		refs++;
> +	} while (addr += PAGE_SIZE, addr != end);
> +
> +	if (!page_cache_add_speculative(head, refs)) {
> +		*nr -= refs;
> +		return 0;
> +	}
> +
> +	if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
> +		*nr -= refs;
> +		while (refs--)
> +			put_page(head);
> +		return 0;
> +	}
> +
> +	/*
> +	 * Any tail pages need their mapcount reference taken before we
> +	 * return. (This allows the THP code to bump their ref count when
> +	 * they are split into base pages).
> +	 */
> +	while (refs--) {
> +		if (PageTail(tail))
> +			get_huge_page_tail(tail);
> +		tail++;
> +	}
> +
> +	return 1;
> +}
> +
.....

> +static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
> +		int write, struct page **pages, int *nr)
> +{
> +	unsigned long next;
> +	pmd_t *pmdp;
> +
> +	pmdp = pmd_offset(&pud, addr);
> +	do {
> +		pmd_t pmd = ACCESS_ONCE(*pmdp);
> +
> +		next = pmd_addr_end(addr, end);
> +		if (pmd_none(pmd) || pmd_trans_splitting(pmd))
> +			return 0;
> +
> +		if (unlikely(pmd_trans_huge(pmd) || pmd_huge(pmd))) {

We don't check _PAGE_PRESENT here (one possible fix is sketched after
the quoted code below).


> +			/*
> +			 * NUMA hinting faults need to be handled in the GUP
> +			 * slowpath for accounting purposes and so that they
> +			 * can be serialised against THP migration.
> +			 */
> +			if (pmd_numa(pmd))
> +				return 0;
> +
> +			if (!gup_huge_pmd(pmd, pmdp, addr, next, write,
> +				pages, nr))
> +				return 0;
> +
> +		} else if (!gup_pte_range(pmd, addr, next, write, pages, nr))
> +				return 0;
> +	} while (pmdp++, addr = next, addr != end);
> +
> +	return 1;
> +}
> +
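
(For illustration, one way to address the missing present check -- a
sketch, assuming pmd_present() is meaningful for huge pmds on the
architectures concerned:)

	if (unlikely(pmd_trans_huge(pmd) || pmd_huge(pmd))) {
		/* Bail out to the slow path for non-present huge pmds. */
		if (!pmd_present(pmd) || pmd_numa(pmd))
			return 0;

		if (!gup_huge_pmd(pmd, pmdp, addr, next, write, pages, nr))
			return 0;
	}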

-aneesh


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH V4 1/6] mm: Introduce a general RCU get_user_pages_fast.
@ 2014-10-13 11:44           ` Steve Capper
  0 siblings, 0 replies; 103+ messages in thread
From: Steve Capper @ 2014-10-13 11:44 UTC (permalink / raw)
  To: David Miller, aneesh.kumar
  Cc: aarcange, linux-arm-kernel, catalin.marinas, linux, linux-arch,
	linux-mm, will.deacon, gary.robertson, christoffer.dall, peterz,
	anders.roxell, akpm, dann.frazier, mark.rutland, mgorman, hughd

On Mon, Oct 13, 2014 at 01:21:46AM -0400, David Miller wrote:
> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
> Date: Mon, 13 Oct 2014 10:45:24 +0530
> 
> > Andrea Arcangeli <aarcange@redhat.com> writes:
> > 
> >> Hi Steve,
> >>
> >> On Fri, Sep 26, 2014 at 03:03:48PM +0100, Steve Capper wrote:
> >>> This patch provides a general RCU implementation of get_user_pages_fast
> >>> that can be used by architectures that perform hardware broadcast of
> >>> TLB invalidations.
> >>> 
> >>> It is based heavily on the PowerPC implementation by Nick Piggin.
> >>
> >> It'd be nice if you could also at the same time apply it to sparc and
> >> powerpc in this same patchset to show the effectiveness of having a
> >> generic version. Because if it's not a trivial drop-in replacement,
> >> then this should go in arch/arm* instead of mm/gup.c...
> > 
> > on ppc64 we have one challenge, we do need to support hugepd. At the pmd
> > level we can have hugepte, normal pmd pointer or a pointer to hugepage
> > directory which is used in case of some sub-architectures/platforms. ie,
> > the below part of gup implementation in ppc64
> > 
> > else if (is_hugepd(pmdp)) {
> > 	if (!gup_hugepd((hugepd_t *)pmdp, PMD_SHIFT,
> > 			addr, next, write, pages, nr))
> > 		return 0;
> 
> Sparc has to deal with the same issue.

Hi Aneesh, David,

Could we add some helpers to mm/gup.c to deal with the hugepage
directory cases? If my understanding is correct, this arises for
HugeTLB pages rather than THP? (I should have listed, under the
assumptions made, that HugeTLB and THP use the same page table entry
format.)

For Sparc, if the huge pte case were to be separated out from the
normal pte case, we could use page_cache_add_speculative rather than
make repeated calls to page_cache_get_speculative? (See the sketch
below.)
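
(A sketch of the difference -- both variants take refs references on
the head page of the huge mapping:)

	/* Per-page: one atomic operation per covered base page. */
	for (i = 0; i < refs; i++) {
		if (!page_cache_get_speculative(head)) {
			/* (the error path must drop the i refs taken) */
			return 0;
		}
	}

	/* Batched: a single atomic add covering the whole mapping. */
	if (!page_cache_add_speculative(head, refs))
		return 0;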

Also, as a heads up for Sparc: I don't see any definition of
__get_user_pages_fast. Does this mean that a futex on a THP tail page
can cause an infinite loop?
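
(The loop in question is in get_futex_key; simplified from the
kernel/futex.c of this era -- a sketch, not the verbatim code:)

again:
	err = get_user_pages_fast(address, 1, 1, &page);
	if (err < 0)
		return err;

	if (unlikely(PageTail(page))) {
		put_page(page);
		/* serialize against __split_huge_page_splitting() */
		local_irq_disable();
		if (__get_user_pages_fast(address, 1, !ro, &page) == 1) {
			page_head = compound_head(page);
		} else {
			/*
			 * A stub __get_user_pages_fast always returns 0,
			 * so a futex on a tail page loops here forever.
			 */
			local_irq_enable();
			goto again;
		}
	}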

I don't have the means to thoroughly test patches for PowerPC and Sparc
(nor do I have enough knowledge to safely write them). I was going to
ask if you could please have a go at enabling this for PowerPC and
Sparc and I could check the ARM side and help out with mm/gup.c?

Cheers,
-- 
Steve


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH V4 1/6] mm: Introduce a general RCU get_user_pages_fast.
  2014-10-13 11:44           ` Steve Capper
  (?)
@ 2014-10-13 16:06             ` David Miller
  -1 siblings, 0 replies; 103+ messages in thread
From: David Miller @ 2014-10-13 16:06 UTC (permalink / raw)
  To: steve.capper
  Cc: aneesh.kumar, aarcange, linux-arm-kernel, catalin.marinas, linux,
	linux-arch, linux-mm, will.deacon, gary.robertson,
	christoffer.dall, peterz, anders.roxell, akpm, dann.frazier,
	mark.rutland, mgorman, hughd

From: Steve Capper <steve.capper@linaro.org>
Date: Mon, 13 Oct 2014 12:44:28 +0100

> Also, as a heads up for Sparc. I don't see any definition of
> __get_user_pages_fast. Does this mean that a futex on THP tail page
> can cause an infinite loop?

I have no idea, I didn't realize this was required to be implemented.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH V4 1/6] mm: Introduce a general RCU get_user_pages_fast.
  2014-10-13 11:44           ` Steve Capper
  (?)
@ 2014-10-13 17:04             ` Aneesh Kumar K.V
  -1 siblings, 0 replies; 103+ messages in thread
From: Aneesh Kumar K.V @ 2014-10-13 17:04 UTC (permalink / raw)
  To: Steve Capper, David Miller
  Cc: aarcange, linux-arm-kernel, catalin.marinas, linux, linux-arch,
	linux-mm, will.deacon, gary.robertson, christoffer.dall, peterz,
	anders.roxell, akpm, dann.frazier, mark.rutland, mgorman, hughd

Steve Capper <steve.capper@linaro.org> writes:

> On Mon, Oct 13, 2014 at 01:21:46AM -0400, David Miller wrote:
>> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
>> Date: Mon, 13 Oct 2014 10:45:24 +0530
>> 
>> > Andrea Arcangeli <aarcange@redhat.com> writes:
>> > 
>> >> Hi Steve,
>> >>
>> >> On Fri, Sep 26, 2014 at 03:03:48PM +0100, Steve Capper wrote:
>> >>> This patch provides a general RCU implementation of get_user_pages_fast
>> >>> that can be used by architectures that perform hardware broadcast of
>> >>> TLB invalidations.
>> >>> 
>> >>> It is based heavily on the PowerPC implementation by Nick Piggin.
>> >>
>> >> It'd be nice if you could also at the same time apply it to sparc and
>> >> powerpc in this same patchset to show the effectiveness of having a
>> >> generic version. Because if it's not a trivial drop-in replacement,
>> >> then this should go in arch/arm* instead of mm/gup.c...
>> > 
>> > on ppc64 we have one challenge, we do need to support hugepd. At the pmd
>> > level we can have hugepte, normal pmd pointer or a pointer to hugepage
>> > directory which is used in case of some sub-architectures/platforms. ie,
>> > the below part of gup implementation in ppc64
>> > 
>> > else if (is_hugepd(pmdp)) {
>> > 	if (!gup_hugepd((hugepd_t *)pmdp, PMD_SHIFT,
>> > 			addr, next, write, pages, nr))
>> > 		return 0;
>> 
>> Sparc has to deal with the same issue.
>
> Hi Aneesh, David,
>
> Could we add some helpers to mm/gup.c to deal with the hugepage
> directory cases? If my understanding is correct, this arises for
> HugeTLB pages rather than THP? (I should have listed under the
> assumptions made that HugeTLB and THP have the same page table
> entries).

This is a straight lift of what we have in ppc64, on top of your patch.
I did the respective hack on the ppc64 side and did a simple boot test.
Let me know whether this works for arm too. It needs further cleanup to
get some of the typecasting fixed up.


diff --git a/mm/Kconfig b/mm/Kconfig
index 886db2158538..0ceb8a567dab 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -137,6 +137,9 @@ config HAVE_MEMBLOCK_NODE_MAP
 config HAVE_MEMBLOCK_PHYS_MAP
 	boolean
 
+config HAVE_GENERIC_RCU_GUP
+	boolean
+
 config ARCH_DISCARD_MEMBLOCK
 	boolean
 
diff --git a/mm/gup.c b/mm/gup.c
index 91d044b1600d..f9d2803f0c62 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -10,6 +10,10 @@
 #include <linux/swap.h>
 #include <linux/swapops.h>
 
+#include <linux/sched.h>
+#include <linux/rwsem.h>
+#include <asm/pgtable.h>
+
 #include "internal.h"
 
 static struct page *no_page_table(struct vm_area_struct *vma,
@@ -672,3 +676,379 @@ struct page *get_dump_page(unsigned long addr)
 	return page;
 }
 #endif /* CONFIG_ELF_CORE */
+
+/**
+ * Generic RCU Fast GUP
+ *
+ * get_user_pages_fast attempts to pin user pages by walking the page
+ * tables directly and avoids taking locks. Thus the walker needs to be
+ * protected from page table pages being freed from under it, and should
+ * block any THP splits.
+ *
+ * One way to achieve this is to have the walker disable interrupts, and
+ * rely on IPIs from the TLB flushing code blocking before the page table
+ * pages are freed. This is unsuitable for architectures that do not need
+ * to broadcast an IPI when invalidating TLBs.
+ *
+ * Another way to achieve this is to batch up page table containing pages
+ * belonging to more than one mm_user, then rcu_sched a callback to free those
+ * pages. Disabling interrupts will allow the fast_gup walker to both block
+ * the rcu_sched callback, and an IPI that we broadcast for splitting THPs
+ * (which is a relatively rare event). The code below adopts this strategy.
+ *
+ * Before activating this code, please be aware that the following assumptions
+ * are currently made:
+ *
+ *  *) HAVE_RCU_TABLE_FREE is enabled, and tlb_remove_table is used to free
+ *      pages containing page tables.
+ *
+ *  *) THP splits will broadcast an IPI, this can be achieved by overriding
+ *      pmdp_splitting_flush.
+ *
+ *  *) ptes can be read atomically by the architecture.
+ *
+ *  *) access_ok is sufficient to validate userspace address ranges.
+ *
+ * The last two assumptions can be relaxed by the addition of helper functions.
+ *
+ * This code is based heavily on the PowerPC implementation by Nick Piggin.
+ */
+#ifdef CONFIG_HAVE_GENERIC_RCU_GUP
+
+#ifdef __HAVE_ARCH_PTE_SPECIAL
+static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
+			 int write, struct page **pages, int *nr)
+{
+	pte_t *ptep, *ptem;
+	int ret = 0;
+
+	ptem = ptep = pte_offset_map(&pmd, addr);
+	do {
+		/*
+		 * In the line below we are assuming that the pte can be read
+		 * atomically. If this is not the case for your architecture,
+		 * please wrap this in a helper function!
+		 *
+		 * for an example see gup_get_pte in arch/x86/mm/gup.c
+		 */
+		pte_t pte = ACCESS_ONCE(*ptep);
+		struct page *page;
+
+		/*
+		 * Similar to the PMD case below, NUMA hinting must take slow
+		 * path
+		 */
+		if (!pte_present(pte) || pte_special(pte) ||
+			pte_numa(pte) || (write && !pte_write(pte)))
+			goto pte_unmap;
+
+		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
+		page = pte_page(pte);
+
+		if (!page_cache_get_speculative(page))
+			goto pte_unmap;
+
+		if (unlikely(pte_val(pte) != pte_val(*ptep))) {
+			put_page(page);
+			goto pte_unmap;
+		}
+
+		pages[*nr] = page;
+		(*nr)++;
+
+	} while (ptep++, addr += PAGE_SIZE, addr != end);
+
+	ret = 1;
+
+pte_unmap:
+	pte_unmap(ptem);
+	return ret;
+}
+#else
+
+/*
+ * If we can't determine whether or not a pte is special, then fail immediately
+ * for ptes. Note, we can still pin HugeTLB and THP as these are guaranteed not
+ * to be special.
+ *
+ * For a futex to be placed on a THP tail page, get_futex_key requires a
+ * __get_user_pages_fast implementation that can pin pages. Thus it's still
+ * useful to have gup_huge_pmd even if we can't operate on ptes.
+ */
+static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
+			 int write, struct page **pages, int *nr)
+{
+	return 0;
+}
+#endif /* __HAVE_ARCH_PTE_SPECIAL */
+
+int gup_huge_pte(pte_t orig, pte_t *ptep, unsigned long addr,
+		 unsigned long sz, unsigned long end, int write,
+		 struct page **pages, int *nr)
+{
+	int refs;
+	unsigned long pte_end;
+	struct page *head, *page, *tail;
+
+	if (write && !pte_write(orig))
+		return 0;
+
+	if (!pte_present(orig))
+		return 0;
+
+	pte_end = (addr + sz) & ~(sz-1);
+	if (pte_end < end)
+		end = pte_end;
+
+	/* hugepages are never "special" */
+	VM_BUG_ON(!pfn_valid(pte_pfn(orig)));
+
+	refs = 0;
+	head = pte_page(orig);
+	page = head + ((addr & (sz-1)) >> PAGE_SHIFT);
+	tail = page;
+	do {
+		VM_BUG_ON_PAGE(compound_head(page) != head, page);
+		pages[*nr] = page;
+		(*nr)++;
+		page++;
+		refs++;
+	} while (addr += PAGE_SIZE, addr != end);
+
+	if (!page_cache_add_speculative(head, refs)) {
+		*nr -= refs;
+		return 0;
+	}
+
+	if (unlikely(pte_val(orig) != pte_val(*ptep))) {
+		*nr -= refs;
+		while (refs--)
+			put_page(head);
+		return 0;
+	}
+
+	/*
+	 * Any tail pages need their mapcount reference taken before we
+	 * return. (This allows the THP code to bump their ref count when
+	 * they are split into base pages).
+	 */
+	while (refs--) {
+		if (PageTail(tail))
+			get_huge_page_tail(tail);
+		tail++;
+	}
+
+	return 1;
+}
+
+#ifndef is_hugepd
+typedef struct { signed long pd; } hugepd_t;
+
+/*
+ * Some architectures support hugepage directory format that is
+ * required to support different hugetlbfs sizes.
+ */
+#define is_hugepd(hugepd) (0)
+
+static inline hugepd_t pmd_hugepd(pmd_t pmd)
+{
+	return  (hugepd_t){ pmd_val(pmd) };
+}
+
+static inline hugepd_t pud_hugepd(pud_t pud)
+{
+	return  (hugepd_t){ pud_val(pud) };
+}
+
+static inline hugepd_t pgd_hugepd(pgd_t pgd)
+{
+	return  (hugepd_t){ pgd_val(pgd) };
+}
+
+static inline int gup_hugepd(hugepd_t hugepd, unsigned long addr,
+			     unsigned pdshift, unsigned long end,
+			     int write, struct page **pages, int *nr)
+{
+	return 0;
+}
+#endif
+
+static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
+		int write, struct page **pages, int *nr)
+{
+	unsigned long next;
+	pmd_t *pmdp;
+
+	pmdp = pmd_offset(&pud, addr);
+	do {
+		pmd_t pmd = ACCESS_ONCE(*pmdp);
+
+		next = pmd_addr_end(addr, end);
+		if (pmd_none(pmd) || pmd_trans_splitting(pmd))
+			return 0;
+
+		if (pmd_trans_huge(pmd) || pmd_huge(pmd)) {
+			/*
+			 * NUMA hinting faults need to be handled in the GUP
+			 * slowpath for accounting purposes and so that they
+			 * can be serialised against THP migration.
+			 */
+			if (pmd_numa(pmd))
+				return 0;
+
+			if (!gup_huge_pte(pmd_pte(pmd), pmdp_ptep(pmdp), addr,
+					  PMD_SIZE, next, write, pages, nr))
+				return 0;
+
+		} else if (is_hugepd(pmd_hugepd(pmd))) {
+			/*
+			 * architecture have different format for hugetlbfs
+			 * pmd format and THP pmd format
+			 */
+			if (!gup_hugepd(pmd_hugepd(pmd), addr, PMD_SHIFT, next,
+					write, pages, nr))
+				return 0;
+		} else if (!gup_pte_range(pmd, addr, next, write, pages, nr))
+				return 0;
+	} while (pmdp++, addr = next, addr != end);
+
+	return 1;
+}
+
+static int gup_pud_range(pgd_t pgd, unsigned long addr, unsigned long end,
+		int write, struct page **pages, int *nr)
+{
+	unsigned long next;
+	pud_t *pudp;
+
+	pudp = pud_offset(&pgd, addr);
+	do {
+		pud_t pud = ACCESS_ONCE(*pudp);
+
+		next = pud_addr_end(addr, end);
+		if (pud_none(pud))
+			return 0;
+		if (pud_huge(pud)) {
+			if (!gup_huge_pte(__pte(pud_val(pud)), (pte_t *)pudp,
+					  addr, PUD_SIZE, next, write,
+					  pages, nr))
+				return 0;
+		} else if (is_hugepd(pud_hugepd(pud))) {
+			if (!gup_hugepd((pud_hugepd(pud)), addr, PUD_SHIFT,
+					 next, write, pages, nr))
+				return 0;
+		} else if (!gup_pmd_range(pud, addr, next, write, pages, nr))
+			return 0;
+	} while (pudp++, addr = next, addr != end);
+
+	return 1;
+}
+
+/*
+ * Like get_user_pages_fast() except its IRQ-safe in that it won't fall
+ * back to the regular GUP. It will only return non-negative values.
+ */
+int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
+			  struct page **pages)
+{
+	struct mm_struct *mm = current->mm;
+	unsigned long addr, len, end;
+	unsigned long next, flags;
+	pgd_t *pgdp;
+	int nr = 0;
+
+	start &= PAGE_MASK;
+	addr = start;
+	len = (unsigned long) nr_pages << PAGE_SHIFT;
+	end = start + len;
+
+	if (unlikely(!access_ok(write ? VERIFY_WRITE : VERIFY_READ,
+					start, len)))
+		return 0;
+
+	/*
+	 * Disable interrupts, we use the nested form as we can already
+	 * have interrupts disabled by get_futex_key.
+	 *
+	 * With interrupts disabled, we block page table pages from being
+	 * freed from under us. See mmu_gather_tlb in asm-generic/tlb.h
+	 * for more details.
+	 *
+	 * We do not adopt an rcu_read_lock(.) here as we also want to
+	 * block IPIs that come from THPs splitting.
+	 */
+
+	local_irq_save(flags);
+	pgdp = pgd_offset(mm, addr);
+	do {
+		pgd_t pgd = ACCESS_ONCE(*pgdp);
+
+		next = pgd_addr_end(addr, end);
+		if (pgd_none(pgd))
+			break;
+		if (pgd_huge(pgd)) {
+			if (!gup_huge_pte(__pte(pgd_val(pgd)), (pte_t *)pgdp,
+					  addr, PGDIR_SIZE, next, write,
+					  pages, &nr))
+				break;
+		} else if (is_hugepd(pgd_hugepd(pgd))) {
+			if (!gup_hugepd((pgd_hugepd(pgd)), addr, PGDIR_SHIFT,
+					 next, write, pages, &nr))
+				break;
+		} else if (!gup_pud_range(pgd, addr, next, write, pages, &nr))
+			break;
+	} while (pgdp++, addr = next, addr != end);
+	local_irq_restore(flags);
+
+	return nr;
+}
+
+/**
+ * get_user_pages_fast() - pin user pages in memory
+ * @start:	starting user address
+ * @nr_pages:	number of pages from start to pin
+ * @write:	whether pages will be written to
+ * @pages:	array that receives pointers to the pages pinned.
+ *		Should be at least nr_pages long.
+ *
+ * Attempt to pin user pages in memory without taking mm->mmap_sem.
+ * If not successful, it will fall back to taking the lock and
+ * calling get_user_pages().
+ *
+ * Returns number of pages pinned. This may be fewer than the number
+ * requested. If nr_pages is 0 or negative, returns 0. If no pages
+ * were pinned, returns -errno.
+ */
+int get_user_pages_fast(unsigned long start, int nr_pages, int write,
+			struct page **pages)
+{
+	struct mm_struct *mm = current->mm;
+	int nr, ret;
+
+	start &= PAGE_MASK;
+	nr = __get_user_pages_fast(start, nr_pages, write, pages);
+	ret = nr;
+
+	if (nr < nr_pages) {
+		/* Try to get the remaining pages with get_user_pages */
+		start += nr << PAGE_SHIFT;
+		pages += nr;
+
+		down_read(&mm->mmap_sem);
+		ret = get_user_pages(current, mm, start,
+				     nr_pages - nr, write, 0, pages, NULL);
+		up_read(&mm->mmap_sem);
+
+		/* Have to be a bit careful with return values */
+		if (nr > 0) {
+			if (ret < 0)
+				ret = nr;
+			else
+				ret += nr;
+		}
+	}
+
+	return ret;
+}
+
+#endif /* CONFIG_HAVE_GENERIC_RCU_GUP */
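
(For completeness, a hypothetical caller sketch -- not part of the
patch -- pinning a small user buffer and releasing the pages when
done:)

	struct page *pages[16];
	int i, got;

	got = get_user_pages_fast(user_addr, 16, 1, pages);
	if (got < 0)
		return got;

	/* ... access pages[0..got-1], e.g. for direct I/O ... */

	for (i = 0; i < got; i++)
		put_page(pages[i]);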

^ permalink raw reply related	[flat|nested] 103+ messages in thread

* Re: [PATCH V4 1/6] mm: Introduce a general RCU get_user_pages_fast.
@ 2014-10-13 17:04             ` Aneesh Kumar K.V
  0 siblings, 0 replies; 103+ messages in thread
From: Aneesh Kumar K.V @ 2014-10-13 17:04 UTC (permalink / raw)
  To: Steve Capper, David Miller
  Cc: aarcange, linux-arm-kernel, catalin.marinas, linux, linux-arch,
	linux-mm, will.deacon, gary.robertson, christoffer.dall, peterz,
	anders.roxell, akpm, dann.frazier, mark.rutland, mgorman, hughd

Steve Capper <steve.capper@linaro.org> writes:

> On Mon, Oct 13, 2014 at 01:21:46AM -0400, David Miller wrote:
>> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
>> Date: Mon, 13 Oct 2014 10:45:24 +0530
>> 
>> > Andrea Arcangeli <aarcange@redhat.com> writes:
>> > 
>> >> Hi Steve,
>> >>
>> >> On Fri, Sep 26, 2014 at 03:03:48PM +0100, Steve Capper wrote:
>> >>> This patch provides a general RCU implementation of get_user_pages_fast
>> >>> that can be used by architectures that perform hardware broadcast of
>> >>> TLB invalidations.
>> >>> 
>> >>> It is based heavily on the PowerPC implementation by Nick Piggin.
>> >>
>> >> It'd be nice if you could also at the same time apply it to sparc and
>> >> powerpc in this same patchset to show the effectiveness of having a
>> >> generic version. Because if it's not a trivial drop-in replacement,
>> >> then this should go in arch/arm* instead of mm/gup.c...
>> > 
>> > on ppc64 we have one challenge, we do need to support hugepd. At the pmd
>> > level we can have hugepte, normal pmd pointer or a pointer to hugepage
>> > directory which is used in case of some sub-architectures/platforms. ie,
>> > the below part of gup implementation in ppc64
>> > 
>> > else if (is_hugepd(pmdp)) {
>> > 	if (!gup_hugepd((hugepd_t *)pmdp, PMD_SHIFT,
>> > 			addr, next, write, pages, nr))
>> > 		return 0;
>> 
>> Sparc has to deal with the same issue.
>
> Hi Aneesh, David,
>
> Could we add some helpers to mm/gup.c to deal with the hugepage
> directory cases? If my understanding is correct, this arises for
> HugeTLB pages rather than THP? (I should have listed under the
> assumptions made that HugeTLB and THP have the same page table
> entries).

This is a straight lift of what we have in ppc64 on top of your patch. I
did respective hack on ppc64 side and did a simple boot test. Let me
know whether this works for arm too. It needs further cleanup to get
some typecasting fixed up.


diff --git a/mm/Kconfig b/mm/Kconfig
index 886db2158538..0ceb8a567dab 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -137,6 +137,9 @@ config HAVE_MEMBLOCK_NODE_MAP
 config HAVE_MEMBLOCK_PHYS_MAP
 	boolean
 
+config HAVE_GENERIC_RCU_GUP
+	boolean
+
 config ARCH_DISCARD_MEMBLOCK
 	boolean
 
diff --git a/mm/gup.c b/mm/gup.c
index 91d044b1600d..f9d2803f0c62 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -10,6 +10,10 @@
 #include <linux/swap.h>
 #include <linux/swapops.h>
 
+#include <linux/sched.h>
+#include <linux/rwsem.h>
+#include <asm/pgtable.h>
+
 #include "internal.h"
 
 static struct page *no_page_table(struct vm_area_struct *vma,
@@ -672,3 +676,379 @@ struct page *get_dump_page(unsigned long addr)
 	return page;
 }
 #endif /* CONFIG_ELF_CORE */
+
+/**
+ * Generic RCU Fast GUP
+ *
+ * get_user_pages_fast attempts to pin user pages by walking the page
+ * tables directly and avoids taking locks. Thus the walker needs to be
+ * protected from page table pages being freed from under it, and should
+ * block any THP splits.
+ *
+ * One way to achieve this is to have the walker disable interrupts, and
+ * rely on IPIs from the TLB flushing code blocking before the page table
+ * pages are freed. This is unsuitable for architectures that do not need
+ * to broadcast an IPI when invalidating TLBs.
+ *
+ * Another way to achieve this is to batch up page table containing pages
+ * belonging to more than one mm_user, then rcu_sched a callback to free those
+ * pages. Disabling interrupts will allow the fast_gup walker to both block
+ * the rcu_sched callback, and an IPI that we broadcast for splitting THPs
+ * (which is a relatively rare event). The code below adopts this strategy.
+ *
+ * Before activating this code, please be aware that the following assumptions
+ * are currently made:
+ *
+ *  *) HAVE_RCU_TABLE_FREE is enabled, and tlb_remove_table is used to free
+ *      pages containing page tables.
+ *
+ *  *) THP splits will broadcast an IPI; this can be achieved by overriding
+ *      pmdp_splitting_flush.
+ *
+ *  *) ptes can be read atomically by the architecture.
+ *
+ *  *) access_ok is sufficient to validate userspace address ranges.
+ *
+ * The last two assumptions can be relaxed by the addition of helper functions.
+ *
+ * This code is based heavily on the PowerPC implementation by Nick Piggin.
+ */
+#ifdef CONFIG_HAVE_GENERIC_RCU_GUP
+
+#ifdef __HAVE_ARCH_PTE_SPECIAL
+static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
+			 int write, struct page **pages, int *nr)
+{
+	pte_t *ptep, *ptem;
+	int ret = 0;
+
+	ptem = ptep = pte_offset_map(&pmd, addr);
+	do {
+		/*
+		 * In the line below we are assuming that the pte can be read
+		 * atomically. If this is not the case for your architecture,
+		 * please wrap this in a helper function!
+		 *
+		 * For an example, see gup_get_pte in arch/x86/mm/gup.c.
+		 */
+		pte_t pte = ACCESS_ONCE(*ptep);
+		struct page *page;
+
+		/*
+		 * Similar to the PMD case below, NUMA hinting must take the
+		 * slow path.
+		 */
+		if (!pte_present(pte) || pte_special(pte) ||
+			pte_numa(pte) || (write && !pte_write(pte)))
+			goto pte_unmap;
+
+		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
+		page = pte_page(pte);
+
+		if (!page_cache_get_speculative(page))
+			goto pte_unmap;
+
+		if (unlikely(pte_val(pte) != pte_val(*ptep))) {
+			put_page(page);
+			goto pte_unmap;
+		}
+
+		pages[*nr] = page;
+		(*nr)++;
+
+	} while (ptep++, addr += PAGE_SIZE, addr != end);
+
+	ret = 1;
+
+pte_unmap:
+	pte_unmap(ptem);
+	return ret;
+}
+#else
+
+/*
+ * If we can't determine whether or not a pte is special, then fail immediately
+ * for ptes. Note, we can still pin HugeTLB and THP as these are guaranteed not
+ * to be special.
+ *
+ * For a futex to be placed on a THP tail page, get_futex_key requires a
+ * __get_user_pages_fast implementation that can pin pages. Thus it's still
+ * useful to have gup_huge_pte even if we can't operate on ptes.
+ */
+static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
+			 int write, struct page **pages, int *nr)
+{
+	return 0;
+}
+#endif /* __HAVE_ARCH_PTE_SPECIAL */
+
+int gup_huge_pte(pte_t orig, pte_t *ptep, unsigned long addr,
+		 unsigned long sz, unsigned long end, int write,
+		 struct page **pages, int *nr)
+{
+	int refs;
+	unsigned long pte_end;
+	struct page *head, *page, *tail;
+
+
+	if (write && !pte_write(orig))
+		return 0;
+
+	if (!pte_present(orig))
+		return 0;
+
+	pte_end = (addr + sz) & ~(sz-1);
+	if (pte_end < end)
+		end = pte_end;
+
+	/* hugepages are never "special" */
+	VM_BUG_ON(!pfn_valid(pte_pfn(orig)));
+
+	refs = 0;
+	head = pte_page(orig);
+	page = head + ((addr & (sz-1)) >> PAGE_SHIFT);
+	tail = page;
+	do {
+		VM_BUG_ON_PAGE(compound_head(page) != head, page);
+		pages[*nr] = page;
+		(*nr)++;
+		page++;
+		refs++;
+	} while (addr += PAGE_SIZE, addr != end);
+
+	if (!page_cache_add_speculative(head, refs)) {
+		*nr -= refs;
+		return 0;
+	}
+
+	if (unlikely(pte_val(orig) != pte_val(*ptep))) {
+		*nr -= refs;
+		while (refs--)
+			put_page(head);
+		return 0;
+	}
+
+	/*
+	 * Any tail pages need their mapcount reference taken before we
+	 * return. (This allows the THP code to bump their ref count when
+	 * they are split into base pages).
+	 */
+	while (refs--) {
+		if (PageTail(tail))
+			get_huge_page_tail(tail);
+		tail++;
+	}
+
+	return 1;
+}
+
+#ifndef is_hugepd
+typedef struct { signed long pd; } hugepd_t;
+
+/*
+ * Some architectures use a hugepage directory (hugepd) format that is
+ * needed to support multiple hugetlbfs page sizes.
+ */
+#define is_hugepd(hugepd) (0)
+
+static inline hugepd_t pmd_hugepd(pmd_t pmd)
+{
+	return  (hugepd_t){ pmd_val(pmd) };
+}
+
+static inline hugepd_t pud_hugepd(pud_t pud)
+{
+	return  (hugepd_t){ pud_val(pud) };
+}
+
+static inline hugepd_t pgd_hugepd(pgd_t pgd)
+{
+	return  (hugepd_t){ pgd_val(pgd) };
+}
+
+static inline int gup_hugepd(hugepd_t hugepd, unsigned long addr,
+			     unsigned pdshift, unsigned long end,
+			     int write, struct page **pages, int *nr)
+{
+	return 0;
+}
+#endif
+
+static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
+		int write, struct page **pages, int *nr)
+{
+	unsigned long next;
+	pmd_t *pmdp;
+
+	pmdp = pmd_offset(&pud, addr);
+	do {
+		pmd_t pmd = ACCESS_ONCE(*pmdp);
+
+		next = pmd_addr_end(addr, end);
+		if (pmd_none(pmd) || pmd_trans_splitting(pmd))
+			return 0;
+
+		if (pmd_trans_huge(pmd) || pmd_huge(pmd)) {
+			/*
+			 * NUMA hinting faults need to be handled in the GUP
+			 * slowpath for accounting purposes and so that they
+			 * can be serialised against THP migration.
+			 */
+			if (pmd_numa(pmd))
+				return 0;
+
+			if (!gup_huge_pte(pmd_pte(pmd), pmdp_ptep(pmdp), addr,
+					  PMD_SIZE, next, write, pages, nr))
+				return 0;
+
+		} else if (is_hugepd(pmd_hugepd(pmd))) {
+			/*
+			 * Architectures can use different formats for
+			 * hugetlbfs pmds and THP pmds.
+			 */
+			if (!gup_hugepd(pmd_hugepd(pmd), addr, PMD_SHIFT, next,
+					write, pages, nr))
+				return 0;
+		} else if (!gup_pte_range(pmd, addr, next, write, pages, nr))
+				return 0;
+	} while (pmdp++, addr = next, addr != end);
+
+	return 1;
+}
+
+static int gup_pud_range(pgd_t pgd, unsigned long addr, unsigned long end,
+		int write, struct page **pages, int *nr)
+{
+	unsigned long next;
+	pud_t *pudp;
+
+	pudp = pud_offset(&pgd, addr);
+	do {
+		pud_t pud = ACCESS_ONCE(*pudp);
+
+		next = pud_addr_end(addr, end);
+		if (pud_none(pud))
+			return 0;
+		if (pud_huge(pud)) {
+			if (!gup_huge_pte(__pte(pud_val(pud)), (pte_t *)pudp,
+					  addr, PUD_SIZE, next, write,
+					  pages, nr))
+				return 0;
+		} else if (is_hugepd(pud_hugepd(pud))) {
+			if (!gup_hugepd((pud_hugepd(pud)), addr, PUD_SHIFT,
+					 next, write, pages, nr))
+				return 0;
+		} else if (!gup_pmd_range(pud, addr, next, write, pages, nr))
+			return 0;
+	} while (pudp++, addr = next, addr != end);
+
+	return 1;
+}
+
+/*
+ * Like get_user_pages_fast() except it's IRQ-safe in that it won't fall
+ * back to the regular GUP. It will only return non-negative values.
+ */
+int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
+			  struct page **pages)
+{
+	struct mm_struct *mm = current->mm;
+	unsigned long addr, len, end;
+	unsigned long next, flags;
+	pgd_t *pgdp;
+	int nr = 0;
+
+	start &= PAGE_MASK;
+	addr = start;
+	len = (unsigned long) nr_pages << PAGE_SHIFT;
+	end = start + len;
+
+	if (unlikely(!access_ok(write ? VERIFY_WRITE : VERIFY_READ,
+					start, len)))
+		return 0;
+
+	/*
+	 * Disable interrupts. We use the nested form as we can already
+	 * have interrupts disabled by get_futex_key.
+	 *
+	 * With interrupts disabled, we block page table pages from being
+	 * freed from under us. See mmu_gather_tlb in asm-generic/tlb.h
+	 * for more details.
+	 *
+	 * We do not adopt an rcu_read_lock(.) here as we also want to
+	 * block IPIs that come from THPs splitting.
+	 */
+
+	local_irq_save(flags);
+	pgdp = pgd_offset(mm, addr);
+	do {
+		pgd_t pgd = ACCESS_ONCE(*pgdp);
+
+		next = pgd_addr_end(addr, end);
+		if (pgd_none(pgd))
+			break;
+		if (pgd_huge(pgd)) {
+			if (!gup_huge_pte(__pte(pgd_val(pgd)), (pte_t *)pgdp,
+					  addr, PGDIR_SIZE, next, write, pages, &nr))
+				break;
+		} else if (is_hugepd(pgd_hugepd(pgd))) {
+			if (!gup_hugepd((pgd_hugepd(pgd)), addr, PGDIR_SHIFT,
+					 next, write, pages, &nr))
+				break;
+		} else if (!gup_pud_range(pgd, addr, next, write, pages, &nr))
+			break;
+	} while (pgdp++, addr = next, addr != end);
+	local_irq_restore(flags);
+
+	return nr;
+}
+
+/**
+ * get_user_pages_fast() - pin user pages in memory
+ * @start:	starting user address
+ * @nr_pages:	number of pages from start to pin
+ * @write:	whether pages will be written to
+ * @pages:	array that receives pointers to the pages pinned.
+ *		Should be at least nr_pages long.
+ *
+ * Attempt to pin user pages in memory without taking mm->mmap_sem.
+ * If not successful, it will fall back to taking the lock and
+ * calling get_user_pages().
+ *
+ * Returns number of pages pinned. This may be fewer than the number
+ * requested. If nr_pages is 0 or negative, returns 0. If no pages
+ * were pinned, returns -errno.
+ */
+int get_user_pages_fast(unsigned long start, int nr_pages, int write,
+			struct page **pages)
+{
+	struct mm_struct *mm = current->mm;
+	int nr, ret;
+
+	start &= PAGE_MASK;
+	nr = __get_user_pages_fast(start, nr_pages, write, pages);
+	ret = nr;
+
+	if (nr < nr_pages) {
+		/* Try to get the remaining pages with get_user_pages */
+		start += nr << PAGE_SHIFT;
+		pages += nr;
+
+		down_read(&mm->mmap_sem);
+		ret = get_user_pages(current, mm, start,
+				     nr_pages - nr, write, 0, pages, NULL);
+		up_read(&mm->mmap_sem);
+
+		/* Have to be a bit careful with return values */
+		if (nr > 0) {
+			if (ret < 0)
+				ret = nr;
+			else
+				ret += nr;
+		}
+	}
+
+	return ret;
+}
+
+#endif /* CONFIG_HAVE_GENERIC_RCU_GUP */
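
A note on the pte-read assumption above: the comment in gup_pte_range
points at gup_get_pte in arch/x86/mm/gup.c. As a rough sketch only
(loosely modelled on the x86 32-bit PAE variant; the pte_low/pte_high
field names are x86-specific), such a helper retries until both halves
of the pte have been observed consistently:

static inline pte_t gup_get_pte(pte_t *ptep)
{
	pte_t pte;

	/*
	 * Re-read the low half until it is stable across the read of
	 * the high half; a torn read would otherwise hand the walker
	 * a bogus pfn.
	 */
	do {
		pte.pte_low = ptep->pte_low;
		smp_rmb();
		pte.pte_high = ptep->pte_high;
		smp_rmb();
	} while (unlikely(pte.pte_low != ptep->pte_low));

	return pte;
}

An architecture opting in to the generic walker also needs to select
HAVE_GENERIC_RCU_GUP from its Kconfig entry, as the arm and arm64
patches later in this series do.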


^ permalink raw reply related	[flat|nested] 103+ messages in thread

* Re: [PATCH V4 1/6] mm: Introduce a general RCU get_user_pages_fast.
@ 2014-10-14 12:38               ` Steve Capper
  0 siblings, 0 replies; 103+ messages in thread
From: Steve Capper @ 2014-10-14 12:38 UTC (permalink / raw)
  To: David Miller
  Cc: aneesh.kumar, aarcange, linux-arm-kernel, catalin.marinas, linux,
	linux-arch, linux-mm, will.deacon, gary.robertson,
	christoffer.dall, peterz, anders.roxell, akpm, dann.frazier,
	mark.rutland, mgorman, hughd

On Mon, Oct 13, 2014 at 12:06:18PM -0400, David Miller wrote:
> From: Steve Capper <steve.capper@linaro.org>
> Date: Mon, 13 Oct 2014 12:44:28 +0100
> 
> > Also, as a heads up for Sparc. I don't see any definition of
> > __get_user_pages_fast. Does this mean that a futex on THP tail page
> > can cause an infinite loop?
> 
> I have no idea, I didn't realize this was required to be implemented.

In get_futex_key, a call is made to __get_user_pages_fast to handle the
case where a THP tail page needs to be pinned for the futex. There is a
stock implementation of __get_user_pages_fast, but this is just an
empty function that returns 0. Unfortunately this will provoke a goto
to "again:" and end up in an infinite loop. The process will appear
to hang with a high system cpu usage.
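
For reference, the shape of the code in question, paraphrased and
condensed from get_futex_key in kernel/futex.c of this era (the
futex_pin_page wrapper is hypothetical, written only to illustrate the
loop; not verbatim source):

static int futex_pin_page(unsigned long address, struct page **pagep)
{
	struct page *page;
	int err;

again:
	err = get_user_pages_fast(address, 1, 1, &page);
	if (err < 0)
		return err;

	if (unlikely(PageTail(page))) {
		put_page(page);
		/* serialise against THP splits */
		local_irq_disable();
		if (__get_user_pages_fast(address, 1, 1, &page) != 1) {
			local_irq_enable();
			goto again;	/* a stub returning 0 lands here every time */
		}
		local_irq_enable();
	}

	*pagep = page;
	return 0;
}

With the stock __get_user_pages_fast always returning 0, the retry
branch is taken unconditionally for a tail page, hence the infinite
loop.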

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH V4 1/6] mm: Introduce a general RCU get_user_pages_fast.
@ 2014-10-14 16:30                 ` David Miller
  0 siblings, 0 replies; 103+ messages in thread
From: David Miller @ 2014-10-14 16:30 UTC (permalink / raw)
  To: steve.capper
  Cc: aneesh.kumar, aarcange, linux-arm-kernel, catalin.marinas, linux,
	linux-arch, linux-mm, will.deacon, gary.robertson,
	christoffer.dall, peterz, anders.roxell, akpm, dann.frazier,
	mark.rutland, mgorman, hughd

From: Steve Capper <steve.capper@linaro.org>
Date: Tue, 14 Oct 2014 13:38:34 +0100

> On Mon, Oct 13, 2014 at 12:06:18PM -0400, David Miller wrote:
>> From: Steve Capper <steve.capper@linaro.org>
>> Date: Mon, 13 Oct 2014 12:44:28 +0100
>> 
>> > Also, as a heads up for Sparc. I don't see any definition of
>> > __get_user_pages_fast. Does this mean that a futex on THP tail page
>> > can cause an infinite loop?
>> 
>> I have no idea, I didn't realize this was required to be implemented.
> 
> In get_futex_key, a call is made to __get_user_pages_fast to handle the
> case where a THP tail page needs to be pinned for the futex. There is a
> stock implementation of __get_user_pages_fast, but this is just an
> empty function that returns 0. Unfortunately this will provoke a goto
> to "again:" and end up in an infinite loop. The process will appear
> to hang with a high system cpu usage.

I'd rather the build fail and force me to implement the interface for
my architecture than have a default implementation that causes issues
like that.
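
For context, the default in question is (approximately) the weak stub
in mm/util.c; quoted from memory, so treat this as a sketch:

int __weak __get_user_pages_fast(unsigned long start, int nr_pages,
				 int write, struct page **pages)
{
	return 0;	/* pins nothing, silently */
}

Dropping the __weak default, or compiling it out on architectures that
provide their own, would turn a missing implementation into a link
failure rather than a runtime hang.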

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH V4 0/6] RCU get_user_pages_fast and __get_user_pages_fast
@ 2015-02-27 12:42   ` Jon Masters
  0 siblings, 0 replies; 103+ messages in thread
From: Jon Masters @ 2015-02-27 12:42 UTC (permalink / raw)
  To: Steve Capper, linux-arm-kernel, catalin.marinas, linux,
	linux-arch, linux-mm
  Cc: will.deacon, gary.robertson, christoffer.dall, peterz,
	anders.roxell, akpm, dann.frazier, mark.rutland, mgorman, hughd

On 09/26/2014 10:03 AM, Steve Capper wrote:

> This series implements general forms of get_user_pages_fast and
> __get_user_pages_fast in core code and activates them for arm and arm64.
> 
> These are required for Transparent HugePages to function correctly, as
> a futex on a THP tail will otherwise result in an infinite loop (due to
> the core implementation of __get_user_pages_fast always returning 0).
> 
> Unfortunately, a futex on THP tail can be quite common for certain
> workloads; thus THP is unreliable without a __get_user_pages_fast
> implementation.
> 
> This series may also be beneficial for direct-IO heavy workloads and
> certain KVM workloads.
> 
> I appreciate that the merge window is coming very soon, and am posting
> this revision on the off-chance that it gets the nod for 3.18. (The changes
> thus far have been minimal and the feedback I've got has been mainly
> positive).

Heads-up: these patches are currently implicated in a rare-to-trigger
hang that we are seeing on an internal kernel. An extensive effort is
underway to confirm whether these are the cause. Will follow up.

Jon.


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH V4 0/6] RCU get_user_pages_fast and __get_user_pages_fast
  2015-02-27 12:42   ` Jon Masters
  (?)
@ 2015-02-27 13:20     ` Mark Rutland
  -1 siblings, 0 replies; 103+ messages in thread
From: Mark Rutland @ 2015-02-27 13:20 UTC (permalink / raw)
  To: Jon Masters
  Cc: Steve Capper, linux-arm-kernel, Catalin Marinas, linux,
	linux-arch, linux-mm, Will Deacon, gary.robertson,
	christoffer.dall, peterz, anders.roxell, akpm, dann.frazier,
	mgorman, hughd

Hi Jon,

Steve is currently away, but should be back in the office next week.

On Fri, Feb 27, 2015 at 12:42:30PM +0000, Jon Masters wrote:
> On 09/26/2014 10:03 AM, Steve Capper wrote:
> 
> > This series implements general forms of get_user_pages_fast and
> > __get_user_pages_fast in core code and activates them for arm and arm64.
> > 
> > These are required for Transparent HugePages to function correctly, as
> > a futex on a THP tail will otherwise result in an infinite loop (due to
> > the core implementation of __get_user_pages_fast always returning 0).
> > 
> > Unfortunately, a futex on THP tail can be quite common for certain
> > workloads; thus THP is unreliable without a __get_user_pages_fast
> > implementation.
> > 
> > This series may also be beneficial for direct-IO heavy workloads and
> > certain KVM workloads.
> > 
> > I appreciate that the merge window is coming very soon, and am posting
> > this revision on the off-chance that it gets the nod for 3.18. (The changes
> > thus far have been minimal and the feedback I've got has been mainly
> > positive).
> 
> Head's up: these patches are currently implicated in a rare-to-trigger
> hang that we are seeing on an internal kernel. An extensive effort is
> underway to confirm whether these are the cause. Will followup.

I'm currently investigating an intermittent memory corruption issue in
v4.0-rc1 I'm able to trigger on Seattle with 4K pages and 48-bit VA,
which may or may not be related. Sometimes it results in a hang (when
the vectors get corrupted and the CPUs get caught in a recursive
exception loop).

Which architecture(s) are you hitting this on?

Which configuration(s)?

What are you using to tickle the issue?

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* PMD update corruption (sync question)
@ 2015-03-02  2:10     ` Jon Masters
  0 siblings, 0 replies; 103+ messages in thread
From: Jon Masters @ 2015-03-02  2:10 UTC (permalink / raw)
  To: Jon Masters, Steve Capper, linux-arm-kernel, catalin.marinas,
	linux, linux-arch, linux-mm
  Cc: will.deacon, gary.robertson, christoffer.dall, peterz,
	anders.roxell, akpm, dann.frazier, mark.rutland, mgorman, hughd

Hi Folks,

I've pulled a couple of all-nighters reproducing this hard-to-trigger
issue and got some data. It looks like the high half of the (note: always
userspace) PMD is all zeros or all ones, which makes me wonder if the
logic in update_mmu_cache might be missing something on AArch64.

When a kernel is built with 64K pages and 2 levels, the PMD is
effectively updated using set_pte_at, which explicitly won't perform a
DSB if the address is userspace (it expects this to happen later, in
update_mmu_cache, for example).
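
As a sketch of the arrangement described above (paraphrased from memory
of arch/arm64/include/asm/pgtable.h of this era; treat the helper names,
pte_valid_not_user in particular, as approximate):

static inline void set_pte(pte_t *ptep, pte_t pte)
{
	*ptep = pte;

	/* only valid kernel mappings get a barrier here... */
	if (pte_valid_not_user(pte)) {
		dsb(ishst);
		isb();
	}
}

static inline void update_mmu_cache(struct vm_area_struct *vma,
				    unsigned long addr, pte_t *ptep)
{
	/* ...user mappings rely on this deferred barrier instead */
	dsb(ishst);
}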

Can anyone think of an obvious reason why we might not be properly
flushing the changes prior to them being consumed by a hardware walker?

Jon.

On 02/27/2015 07:42 AM, Jon Masters wrote:
> On 09/26/2014 10:03 AM, Steve Capper wrote:
> 
>> This series implements general forms of get_user_pages_fast and
>> __get_user_pages_fast in core code and activates them for arm and arm64.
>>
>> These are required for Transparent HugePages to function correctly, as
>> a futex on a THP tail will otherwise result in an infinite loop (due to
>> the core implementation of __get_user_pages_fast always returning 0).
>>
>> Unfortunately, a futex on THP tail can be quite common for certain
>> workloads; thus THP is unreliable without a __get_user_pages_fast
>> implementation.
>>
>> This series may also be beneficial for direct-IO heavy workloads and
>> certain KVM workloads.
>>
>> I appreciate that the merge window is coming very soon, and am posting
>> this revision on the off-chance that it gets the nod for 3.18. (The changes
>> thus far have been minimal and the feedback I've got has been mainly
>> positive).
> 
> Head's up: these patches are currently implicated in a rare-to-trigger
> hang that we are seeing on an internal kernel. An extensive effort is
> underway to confirm whether these are the cause. Will followup.
> 
> Jon.
> 
^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: PMD update corruption (sync question)
@ 2015-03-02  5:58       ` Jon Masters
  0 siblings, 0 replies; 103+ messages in thread
From: Jon Masters @ 2015-03-02  5:58 UTC (permalink / raw)
  To: linux-arm-kernel, linux-arch, linux, Steve Capper, linux-mm,
	catalin.marinas
  Cc: gary.robertson, mark.rutland, hughd, akpm, christoffer.dall,
	peterz, mgorman, will.deacon, dann.frazier, anders.roxell

Test kernels with an explicit DSB in all PTE update cases are now running overnight. Just in case.

-- 
Computer Architect | Sent from my #ARM Powered Mobile Device

On Mar 1, 2015 9:10 PM, Jon Masters <jcm@redhat.com> wrote:
>
> Hi Folks, 
>
> I've pulled a couple of all nighters reproducing this hard to tr

Hi Folks,

I've pulled a couple of all-nighters reproducing this hard-to-trigger
issue and got some data. It looks like the high half of the (note: always
userspace) PMD is all zeros or all ones, which makes me wonder if the
logic in update_mmu_cache might be missing something on AArch64.

When a kernel is built with 64K pages and 2 levels, the PMD is
effectively updated using set_pte_at, which explicitly won't perform a
DSB if the address is userspace (it expects this to happen later, in
update_mmu_cache, for example).

Can anyone think of an obvious reason why we might not be properly
flushing the changes prior to them being consumed by a hardware walker?

Jon.

On 02/27/2015 07:42 AM, Jon Masters wrote:
> On 09/26/2014 10:03 AM, Steve Capper wrote:
> 
>> This series implements general forms of get_user_pages_fast and
>> __get_user_pages_fast in core code and activates them for arm and arm64.
>>
>> These are required for Transparent HugePages to function correctly, as
>> a futex on a THP tail will otherwise result in an infinite loop (due to
>> the core implementation of __get_user_pages_fast always returning 0).
>>
>> Unfortunately, a futex on THP tail can be quite common for certain
>> workloads; thus THP is unreliable without a __get_user_pages_fast
>> implementation.
>>
>> This series may also be beneficial for direct-IO heavy workloads and
>> certain KVM workloads.
>>
>> I appreciate that the merge window is coming very soon, and am posting
>> this revision on the off-chance that it gets the nod for 3.18. (The changes
>> thus far have been minimal and the feedback I've got has been mainly
>> positive).
> 
> Head's up: these patches are currently implicated in a rare-to-trigger
> hang that we are seeing on an internal kernel. An extensive effort is
> underway to confirm whether these are the cause. Will followup.
> 
> Jon.
> 
^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: PMD update corruption (sync question)
  2015-03-02  5:58       ` Jon Masters
  (?)
@ 2015-03-02 10:50         ` Catalin Marinas
  -1 siblings, 0 replies; 103+ messages in thread
From: Catalin Marinas @ 2015-03-02 10:50 UTC (permalink / raw)
  To: Jon Masters
  Cc: linux-arm-kernel, linux-arch, linux, Steve Capper, linux-mm,
	mark.rutland, anders.roxell, peterz, gary.robertson, hughd,
	will.deacon, mgorman, dann.frazier, akpm, christoffer.dall

On Mon, Mar 02, 2015 at 12:58:36AM -0500, Jon Masters wrote:
> I've pulled a couple of all nighters reproducing this hard to trigger
> issue and got some data. It looks like the high half of the (note always
> userspace) PMD is all zeros or all ones, which makes me wonder if the
> logic in update_mmu_cache might be missing something on AArch64.

That's worrying but I can tell you offline why ;).

Anyway, 64-bit writes are atomic on ARMv8, so you shouldn't see half
updates. To make sure the compiler does not generate something weird,
change set_(pte|pmd|pud) to use inline assembly with a 64-bit STR.
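
As a minimal sketch of the pte case (illustrative only; the "Q"
constraint keeps the operand a plain base-register address, and the
pmd/pud variants would look the same):

	static inline void set_pte(pte_t *ptep, pte_t pte)
	{
		/* a single 64-bit store that the compiler cannot split */
		asm volatile("str %x1, %0"
			     : "=Q" (pte_val(*ptep))
			     : "r" (pte_val(pte))
			     : "memory");
	}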

One question - is the PMD a table or a block? You mentioned set_pte_at
at some point, which leads me to think it's a (transparent) huge page,
hence block mapping.

> When a kernel is built with 64K pages and 2 levels, the PMD is
> effectively updated using set_pte_at, which explicitly won't perform a
> DSB if the address is userspace (it expects this to happen later, in
> update_mmu_cache, for example).
> 
> Can anyone think of an obvious reason why we might not be properly
> flushing the changes prior to them being consumed by a hardware walker?

Even if you don't have that barrier, the worst that can happen is that
you get another trap back in the kernel (from user; a translation
fault), but the page table read by the kernel is valid and the
instruction is normally restarted.

> Test kernels running with an explicit DSB in all PTE update cases now
> running overnight. Just in case.

It could be hiding some other problems.

-- 
Catalin

^ permalink raw reply	[flat|nested] 103+ messages in thread


* Re: PMD update corruption (sync question)
@ 2015-03-02 11:06           ` Jon Masters
  0 siblings, 0 replies; 103+ messages in thread
From: Jon Masters @ 2015-03-02 11:06 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: gary.robertson, Steve Capper, mark.rutland, hughd,
	christoffer.dall, akpm, peterz, mgorman, linux, linux-arch,
	linux-mm, linux-arm-kernel, will.deacon, dann.frazier,
	anders.roxell

64-bit writes are /usually/ atomic, but misalignment or a compiler emitting 32-bit opcodes could also do it. I agree there are a few other pieces to this; we will chat about them separately and come back to this thread. Time for some zzzz...long weekend!

-- 
Computer Architect | Sent from my #ARM Powered Mobile Device

On Mar 2, 2015 5:50 AM, Catalin Marinas <catalin.marinas@arm.com> wrote:

On Mon, Mar 02, 2015 at 12:58:36AM -0500, Jon Masters wrote:
> I've pulled a couple of all nighters reproducing this hard to trigger
> issue and got some data. It looks like the high half of the (note always
> userspace) PMD is all zeros or all ones, which makes me wonder if the
> logic in update_mmu_cache might be missing something on AArch64.

That's worrying but I can tell you offline why ;).

Anyway, 64-bit writes are atomic on ARMv8, so you shouldn't see half
updates. To make sure the compiler does not generate something weird,
change the set_(pte|pmd|pud) to use an inline assembly with a 64-bit
STR.

One question - is the PMD a table or a block? You mentioned set_pte_at
at some point, which leads me to think it's a (transparent) huge page,
hence block mapping.

> When a kernel is built with 64K pages and 2 levels, the PMD is
> effectively updated using set_pte_at, which explicitly won't perform a
> DSB if the address is userspace (it expects this to happen later, in
> update_mmu_cache, for example).
> 
> Can anyone think of an obvious reason why we might not be properly
> flushing the changes prior to them being consumed by a hardware walker?

Even if you don't have that barrier, the worst that can happen is that
you get another trap back in the kernel (from user; translation fault)
but the page table read by the kernel is valid and normally the
instruction restarted.

> Test kernels running with an explicit DSB in all PTE update cases now
> running overnight. Just in case.

It could be hiding some other problems.

-- 
Catalin


^ permalink raw reply	[flat|nested] 103+ messages in thread


* Re: PMD update corruption (sync question)
@ 2015-03-02 12:31             ` Peter Zijlstra
  0 siblings, 0 replies; 103+ messages in thread
From: Peter Zijlstra @ 2015-03-02 12:31 UTC (permalink / raw)
  To: Jon Masters
  Cc: Catalin Marinas, gary.robertson, Steve Capper, mark.rutland,
	hughd, christoffer.dall, akpm, mgorman, linux, linux-arch,
	linux-mm, linux-arm-kernel, will.deacon, dann.frazier,
	anders.roxell

On Mon, Mar 02, 2015 at 06:06:14AM -0500, Jon Masters wrote:

> 64-bit writes are /usually/ atomic but alignment or compiler emiting
> 32-bit opcodes could also do it. I agree there are a few other pieces
> to this we will chat about separately and come back to this thread.

Looking at the asm will quickly tell you if it's emitting 32-bit stores
or not. If it is, use WRITE_ONCE() (you should anyway, I suppose) and
see if that cures it; if not, file a compiler bug: volatile stores
should never be split.

As to alignment, you can simply put a BUG_ON((unsigned long)ptep & 7);
in there.
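
Concretely, something like this sketch (assuming pte_t is the usual
wrapper around a single u64 on this configuration):

	static inline void set_pte(pte_t *ptep, pte_t pte)
	{
		/* catch a misaligned entry before it can tear */
		BUG_ON((unsigned long)ptep & 7);
		/* volatile 64-bit store; never split by the compiler */
		WRITE_ONCE(*ptep, pte);
	}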

Also:

A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing in e-mail?


^ permalink raw reply	[flat|nested] 103+ messages in thread


* Re: PMD update corruption (sync question)
@ 2015-03-02 12:40               ` Geert Uytterhoeven
  0 siblings, 0 replies; 103+ messages in thread
From: Geert Uytterhoeven @ 2015-03-02 12:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jon Masters, Catalin Marinas, gary.robertson, Steve Capper,
	Mark Rutland, Hugh Dickins, christoffer.dall, Andrew Morton,
	Mel Gorman, Russell King, Linux-Arch, Linux MM, linux-arm-kernel,
	Will Deacon, dann.frazier, anders.roxell

On Mon, Mar 2, 2015 at 1:31 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> Q: What is the most annoying thing in e-mail?

"Sent from my #ARM Powered Mobile Device" ;-)

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds


^ permalink raw reply	[flat|nested] 103+ messages in thread


* Re: [PATCH V4 0/6] RCU get_user_pages_fast and __get_user_pages_fast
  2015-02-27 13:20     ` Mark Rutland
  (?)
@ 2015-03-02 14:16       ` Mark Rutland
  -1 siblings, 0 replies; 103+ messages in thread
From: Mark Rutland @ 2015-03-02 14:16 UTC (permalink / raw)
  To: jcm
  Cc: linux-arch, dann.frazier, Steve Capper, peterz, Catalin Marinas,
	anders.roxell, Will Deacon, linux-mm, hughd, christoffer.dall,
	gary.robertson, linux, akpm, linux-arm-kernel, mgorman

> > Heads-up: these patches are currently implicated in a rare-to-trigger
> > hang that we are seeing on an internal kernel. An extensive effort is
> > underway to confirm whether these are the cause. Will follow up.
> 
> I'm currently investigating an intermittent memory corruption issue in
> v4.0-rc1 I'm able to trigger on Seattle with 4K pages and 48-bit VA,
> which may or may not be related. Sometimes it results in a hang (when
> the vectors get corrupted and the CPUs get caught in a recursive
> exception loop).

FWIW my issue appears to be a bug in the old firmware I'm running.
Sorry for the noise!

Mark.

^ permalink raw reply	[flat|nested] 103+ messages in thread


* Re: PMD update corruption (sync question)
@ 2015-03-02 22:21           ` Jon Masters
  0 siblings, 0 replies; 103+ messages in thread
From: Jon Masters @ 2015-03-02 22:21 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: linux-arm-kernel, linux-arch, linux, Steve Capper, linux-mm,
	mark.rutland, anders.roxell, peterz, gary.robertson, hughd,
	will.deacon, mgorman, dann.frazier, akpm, christoffer.dall

On 03/02/2015 05:50 AM, Catalin Marinas wrote:
> On Mon, Mar 02, 2015 at 12:58:36AM -0500, Jon Masters wrote:

>> Test kernels running with an explicit DSB in all PTE update cases now
>> running overnight. Just in case.

...and they stay up after 19 hours. But that's just timing, I'm sure.

> It could be hiding some other problems.

I checked my GDB macros and they were correct BUT my debugger went out
to lunch soon after that dump so I suspect it was just garbage :)

Instead, for my immediate issue, I have a much more likely suspect. For
anyone interested in the followup, you should know that hardware page
table walkers generally do respond well when you feed them Makefiles:

0x43e81c0000: 20230a23 656b614d 656c6966 726f6620  : #.# Makefile for
0x43e81c0010: 65687420 462d4920 6563726f 69726420  :  the I-Force dri
0x43e81c0020: 0a726576 20230a23 4a207942 6e61686f  : ver.#.# By Johan
0x43e81c0030: 6544206e 7875656e 6f6a3c20 6e6e6168  : n Deneux <johann
0x43e81c0040: 6e65642e 40787565 69616d67 6f632e6c  : .deneux@gmail.co
0x43e81c0050: 230a3e6d 626f0a0a 28242d6a 464e4f43  : m>.#..obj-$(CONF
0x43e81c0060: 4a5f4749 5453594f 5f4b4349 524f4649  : IG_JOYSTICK_IFOR
0x43e81c0070: 09294543 69203d2b 63726f66 0a6f2e65  : CE).+= iforce.o.
0x43e81c0080: 6f66690a 2d656372 3d3a2079 6f666920  : .iforce-y := ifo
0x43e81c0090: 2d656372 6f2e6666 6f666920 2d656372  : rce-ff.o iforce-
0x43e81c00a0: 6e69616d 69206f2e 63726f66 61702d65  : main.o iforce-pa
0x43e81c00b0: 74656b63 0a6f2e73 726f6669 242d6563  : ckets.o.iforce-$
0x43e81c00c0: 4e4f4328 5f474946 53594f4a 4b434954  : (CONFIG_JOYSTICK
0x43e81c00d0: 4f46495f 5f454352 29323332 203d2b09  : _IFORCE_232).+=
0x43e81c00e0: 726f6669 732d6563 6f697265 690a6f2e  : iforce-serio.o.i
0x43e81c00f0: 63726f66 28242d65 464e4f43 4a5f4749  : force-$(CONFIG_J
0x43e81c0100: 5453594f 5f4b4349 524f4649 555f4543  : OYSTICK_IFORCE_U
0x43e81c0110: 09294253 69203d2b 63726f66 73752d65  : SB).+= iforce-us
0x43e81c0120: 0a6f2e62 00000000 00000000 00000000  : b.o.............

So that explains why things were falling over. It is likely indeed the
bad DMA I have been craving all along. And this time it was so gracious
as to give me the answer in plain ASCII :) I suspect there will be a
patch for a certain AHCI driver in the not too distant future.

Jon.


^ permalink raw reply	[flat|nested] 103+ messages in thread


* Re: PMD update corruption (sync question)
@ 2015-03-02 22:29             ` Jon Masters
  0 siblings, 0 replies; 103+ messages in thread
From: Jon Masters @ 2015-03-02 22:29 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: linux-arm-kernel, linux-arch, linux, Steve Capper, linux-mm,
	mark.rutland, anders.roxell, peterz, gary.robertson, hughd,
	will.deacon, mgorman, dann.frazier, akpm, christoffer.dall

On 03/02/2015 05:21 PM, Jon Masters wrote:
> On 03/02/2015 05:50 AM, Catalin Marinas wrote:
>> On Mon, Mar 02, 2015 at 12:58:36AM -0500, Jon Masters wrote:
> 
>>> Test kernels running with an explicit DSB in all PTE update cases now
>>> running overnight. Just in case.
> 
> ...and stay up after 19 hours. But that's just timing I'm sure.
> 
>> It could be hiding some other problems.
> 
> I checked my GDB macros and they were correct BUT my debugger went out
> to lunch soon after that dump so I suspect it was just garbage :)
> 
> Instead, for my immediate issue, I have a much more likely suspect. For
> anyone interested in the followup, you should know that hardware page
> table walkers generally do respond well when you feed them Makefiles:
> 
> 0x43e81c0000: 20230a23 656b614d 656c6966 726f6620  : #.# Makefile for
> 0x43e81c0010: 65687420 462d4920 6563726f 69726420  :  the I-Force dri
> 0x43e81c0020: 0a726576 20230a23 4a207942 6e61686f  : ver.#.# By Johan
> 0x43e81c0030: 6544206e 7875656e 6f6a3c20 6e6e6168  : n Deneux <johann
> 0x43e81c0040: 6e65642e 40787565 69616d67 6f632e6c  : .deneux@gmail.co
> 0x43e81c0050: 230a3e6d 626f0a0a 28242d6a 464e4f43  : m>.#..obj-$(CONF
> 0x43e81c0060: 4a5f4749 5453594f 5f4b4349 524f4649  : IG_JOYSTICK_IFOR
> 0x43e81c0070: 09294543 69203d2b 63726f66 0a6f2e65  : CE).+= iforce.o.
> 0x43e81c0080: 6f66690a 2d656372 3d3a2079 6f666920  : .iforce-y := ifo
> 0x43e81c0090: 2d656372 6f2e6666 6f666920 2d656372  : rce-ff.o iforce-
> 0x43e81c00a0: 6e69616d 69206f2e 63726f66 61702d65  : main.o iforce-pa
> 0x43e81c00b0: 74656b63 0a6f2e73 726f6669 242d6563  : ckets.o.iforce-$
> 0x43e81c00c0: 4e4f4328 5f474946 53594f4a 4b434954  : (CONFIG_JOYSTICK
> 0x43e81c00d0: 4f46495f 5f454352 29323332 203d2b09  : _IFORCE_232).+=
> 0x43e81c00e0: 726f6669 732d6563 6f697265 690a6f2e  : iforce-serio.o.i
> 0x43e81c00f0: 63726f66 28242d65 464e4f43 4a5f4749  : force-$(CONFIG_J
> 0x43e81c0100: 5453594f 5f4b4349 524f4649 555f4543  : OYSTICK_IFORCE_U
> 0x43e81c0110: 09294253 69203d2b 63726f66 73752d65  : SB).+= iforce-us
> 0x43e81c0120: 0a6f2e62 00000000 00000000 00000000  : b.o.............
> 
> So that explains why things were falling over. It is likely indeed the
> bad DMA I have been craving all along. And this time it was so gracious
> as to give me the answer in plain ASCII :) I suspect there will be a
> patch for a certain AHCI driver in the not too distant future.

Also, if I had a time machine, the ASCII for '#' never ended in 1'b1.

Jon.


^ permalink raw reply	[flat|nested] 103+ messages in thread


* Re: PMD update corruption (sync question)
  2015-03-02 22:21           ` Jon Masters
  (?)
@ 2015-03-03  9:06             ` Arnd Bergmann
  -1 siblings, 0 replies; 103+ messages in thread
From: Arnd Bergmann @ 2015-03-03  9:06 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: Jon Masters, Catalin Marinas, linux-arch, mark.rutland,
	Steve Capper, peterz, gary.robertson, anders.roxell, hughd,
	christoffer.dall, will.deacon, linux-mm, mgorman, dann.frazier,
	linux, akpm

On Monday 02 March 2015 17:21:26 Jon Masters wrote:
> On 03/02/2015 05:50 AM, Catalin Marinas wrote:
> > On Mon, Mar 02, 2015 at 12:58:36AM -0500, Jon Masters wrote:
> 
> >> Test kernels running with an explicit DSB in all PTE update cases now
> >> running overnight. Just in case.
> 
> ...and stay up after 19 hours. But that's just timing I'm sure.
> 
> > It could be hiding some other problems.
> 
> I checked my GDB macros and they were correct BUT my debugger went out
> to lunch soon after that dump so I suspect it was just garbage 
> 
> Instead, for my immediate issue, I have a much more likely suspect. For
> anyone interested in the followup, you should know that hardware page
> table walkers generally do respond well when you feed them Makefiles:
> 
> 0x43e81c0000: 20230a23 656b614d 656c6966 726f6620  : #.# Makefile for
> 0x43e81c0010: 65687420 462d4920 6563726f 69726420  :  the I-Force dri
> 0x43e81c0020: 0a726576 20230a23 4a207942 6e61686f  : ver.#.# By Johan
> 0x43e81c0030: 6544206e 7875656e 6f6a3c20 6e6e6168  : n Deneux <johann
> 0x43e81c0040: 6e65642e 40787565 69616d67 6f632e6c  : .deneux@gmail.co
> 0x43e81c0050: 230a3e6d 626f0a0a 28242d6a 464e4f43  : m>.#..obj-$(CONF
> 0x43e81c0060: 4a5f4749 5453594f 5f4b4349 524f4649  : IG_JOYSTICK_IFOR
> 0x43e81c0070: 09294543 69203d2b 63726f66 0a6f2e65  : CE).+= iforce.o.
> 0x43e81c0080: 6f66690a 2d656372 3d3a2079 6f666920  : .iforce-y := ifo
> 0x43e81c0090: 2d656372 6f2e6666 6f666920 2d656372  : rce-ff.o iforce-
> 0x43e81c00a0: 6e69616d 69206f2e 63726f66 61702d65  : main.o iforce-pa
> 0x43e81c00b0: 74656b63 0a6f2e73 726f6669 242d6563  : ckets.o.iforce-$
> 0x43e81c00c0: 4e4f4328 5f474946 53594f4a 4b434954  : (CONFIG_JOYSTICK
> 0x43e81c00d0: 4f46495f 5f454352 29323332 203d2b09  : _IFORCE_232).+=
> 0x43e81c00e0: 726f6669 732d6563 6f697265 690a6f2e  : iforce-serio.o.i
> 0x43e81c00f0: 63726f66 28242d65 464e4f43 4a5f4749  : force-$(CONFIG_J
> 0x43e81c0100: 5453594f 5f4b4349 524f4649 555f4543  : OYSTICK_IFORCE_U
> 0x43e81c0110: 09294253 69203d2b 63726f66 73752d65  : SB).+= iforce-us
> 0x43e81c0120: 0a6f2e62 00000000 00000000 00000000  : b.o.............
> 
> So that explains why things were falling over. It is likely indeed the
> bad DMA I have been craving all along. And this time it was so gracious
> as to give me the answer in plain ASCII  I suspect there will be a
> patch for a certain AHCI driver in the not too distant future.

I hope this kind of problem becomes easier to debug once we have
full iommu support working on arm64. When we had problems like this
on PowerPC, using iommu=force to ensure DMA would only be done to
pages that are currently mapped to the device was really helpful.
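
(For reference, that means booting with something like

	console=ttyAMA0 root=/dev/sda2 iommu=force

on the kernel command line; the console= and root= values here are made
up, and iommu=force is the PowerPC/x86 spelling, so an arm64 equivalent
would need the IOMMU code to honour it.)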

	Arnd

^ permalink raw reply	[flat|nested] 103+ messages in thread


* Re: PMD update corruption (sync question)
@ 2015-03-03 15:46               ` Jon Masters
  0 siblings, 0 replies; 103+ messages in thread
From: Jon Masters @ 2015-03-03 15:46 UTC (permalink / raw)
  To: Arnd Bergmann, linux-arm-kernel
  Cc: Catalin Marinas, linux-arch, mark.rutland, Steve Capper, peterz,
	gary.robertson, anders.roxell, hughd, christoffer.dall,
	will.deacon, linux-mm, mgorman, dann.frazier, linux, akpm

On 03/03/2015 04:06 AM, Arnd Bergmann wrote:
> On Monday 02 March 2015 17:21:26 Jon Masters wrote:
>> On 03/02/2015 05:50 AM, Catalin Marinas wrote:
>>> On Mon, Mar 02, 2015 at 12:58:36AM -0500, Jon Masters wrote:
>>
>>>> Test kernels running with an explicit DSB in all PTE update cases now
>>>> running overnight. Just in case.
>>
>> ...and stay up after 19 hours. But that's just timing I'm sure.
>>
>>> It could be hiding some other problems.
>>
>> I checked my GDB macros and they were correct BUT my debugger went out
>> to lunch soon after that dump so I suspect it was just garbage 
>>
>> Instead, for my immediate issue, I have a much more likely suspect. For
>> anyone interested in the followup, you should know that hardware page
>> table walkers generally do respond well when you feed them Makefiles:
>>
>> 0x43e81c0000: 20230a23 656b614d 656c6966 726f6620  : #.# Makefile for
>> 0x43e81c0010: 65687420 462d4920 6563726f 69726420  :  the I-Force dri
>> 0x43e81c0020: 0a726576 20230a23 4a207942 6e61686f  : ver.#.# By Johan
>> 0x43e81c0030: 6544206e 7875656e 6f6a3c20 6e6e6168  : n Deneux <johann
>> 0x43e81c0040: 6e65642e 40787565 69616d67 6f632e6c  : .deneux@gmail.co
>> 0x43e81c0050: 230a3e6d 626f0a0a 28242d6a 464e4f43  : m>.#..obj-$(CONF
>> 0x43e81c0060: 4a5f4749 5453594f 5f4b4349 524f4649  : IG_JOYSTICK_IFOR
>> 0x43e81c0070: 09294543 69203d2b 63726f66 0a6f2e65  : CE).+= iforce.o.
>> 0x43e81c0080: 6f66690a 2d656372 3d3a2079 6f666920  : .iforce-y := ifo
>> 0x43e81c0090: 2d656372 6f2e6666 6f666920 2d656372  : rce-ff.o iforce-
>> 0x43e81c00a0: 6e69616d 69206f2e 63726f66 61702d65  : main.o iforce-pa
>> 0x43e81c00b0: 74656b63 0a6f2e73 726f6669 242d6563  : ckets.o.iforce-$
>> 0x43e81c00c0: 4e4f4328 5f474946 53594f4a 4b434954  : (CONFIG_JOYSTICK
>> 0x43e81c00d0: 4f46495f 5f454352 29323332 203d2b09  : _IFORCE_232).+=
>> 0x43e81c00e0: 726f6669 732d6563 6f697265 690a6f2e  : iforce-serio.o.i
>> 0x43e81c00f0: 63726f66 28242d65 464e4f43 4a5f4749  : force-$(CONFIG_J
>> 0x43e81c0100: 5453594f 5f4b4349 524f4649 555f4543  : OYSTICK_IFORCE_U
>> 0x43e81c0110: 09294253 69203d2b 63726f66 73752d65  : SB).+= iforce-us
>> 0x43e81c0120: 0a6f2e62 00000000 00000000 00000000  : b.o.............
>>
>> So that explains why things were falling over. It is likely indeed the
>> bad DMA I have been craving all along. And this time it was so gracious
>> as to give me the answer in plain ASCII  I suspect there will be a
>> patch for a certain AHCI driver in the not too distant future.
> 
> I hope this kind of problem becomes easier to debug once we have
> full iommu support working on arm64. When we had problems like this
> on PowerPC, using iommu=force to ensure DMA would only be done to
> pages that are currently mapped to the device was really helpful.

Oh, you can imagine that I put my best Dr. Evil hat on this week and
have my finger on the button already. In fact if I have my way future
SBSA compliant systems will be required to use an IOMMU with no way to
avoid having one. Whether that was before or after I was reduced to
walking kernel memory one word at a time and using pen and paper to
derive the above...We've a couple of years of "robustness investment"
ahead on ARM servers to ensure that we catch all of these things. And we
will catch all of them. And it will be utterly perfect.

Jon.


^ permalink raw reply	[flat|nested] 103+ messages in thread

* PMD update corruption (sync question)
@ 2015-03-03 15:46               ` Jon Masters
  0 siblings, 0 replies; 103+ messages in thread
From: Jon Masters @ 2015-03-03 15:46 UTC (permalink / raw)
  To: linux-arm-kernel

On 03/03/2015 04:06 AM, Arnd Bergmann wrote:
> On Monday 02 March 2015 17:21:26 Jon Masters wrote:
>> On 03/02/2015 05:50 AM, Catalin Marinas wrote:
>>> On Mon, Mar 02, 2015 at 12:58:36AM -0500, Jon Masters wrote:
>>
>>>> Test kernels running with an explicit DSB in all PTE update cases now
>>>> running overnight. Just in case.
>>
>> ...and stay up after 19 hours. But that's just timing I'm sure.
>>
>>> It could be hiding some other problems.
>>
>> I checked my GDB macros and they were correct BUT my debugger went out
>> to lunch soon after that dump so I suspect it was just garbage 
>>
>> Instead, for my immediate issue, I have a much more likely suspect. For
>> anyone interested in the followup, you should know that hardware page
>> table walkers generally do respond well when you feed them Makefiles:
>>
>> 0x43e81c0000: 20230a23 656b614d 656c6966 726f6620  : #.# Makefile for
>> 0x43e81c0010: 65687420 462d4920 6563726f 69726420  :  the I-Force dri
>> 0x43e81c0020: 0a726576 20230a23 4a207942 6e61686f  : ver.#.# By Johan
>> 0x43e81c0030: 6544206e 7875656e 6f6a3c20 6e6e6168  : n Deneux <johann
>> 0x43e81c0040: 6e65642e 40787565 69616d67 6f632e6c  : .deneux at gmail.co
>> 0x43e81c0050: 230a3e6d 626f0a0a 28242d6a 464e4f43  : m>.#..obj-$(CONF
>> 0x43e81c0060: 4a5f4749 5453594f 5f4b4349 524f4649  : IG_JOYSTICK_IFOR
>> 0x43e81c0070: 09294543 69203d2b 63726f66 0a6f2e65  : CE).+= iforce.o.
>> 0x43e81c0080: 6f66690a 2d656372 3d3a2079 6f666920  : .iforce-y := ifo
>> 0x43e81c0090: 2d656372 6f2e6666 6f666920 2d656372  : rce-ff.o iforce-
>> 0x43e81c00a0: 6e69616d 69206f2e 63726f66 61702d65  : main.o iforce-pa
>> 0x43e81c00b0: 74656b63 0a6f2e73 726f6669 242d6563  : ckets.o.iforce-$
>> 0x43e81c00c0: 4e4f4328 5f474946 53594f4a 4b434954  : (CONFIG_JOYSTICK
>> 0x43e81c00d0: 4f46495f 5f454352 29323332 203d2b09  : _IFORCE_232).+=
>> 0x43e81c00e0: 726f6669 732d6563 6f697265 690a6f2e  : iforce-serio.o.i
>> 0x43e81c00f0: 63726f66 28242d65 464e4f43 4a5f4749  : force-$(CONFIG_J
>> 0x43e81c0100: 5453594f 5f4b4349 524f4649 555f4543  : OYSTICK_IFORCE_U
>> 0x43e81c0110: 09294253 69203d2b 63726f66 73752d65  : SB).+= iforce-us
>> 0x43e81c0120: 0a6f2e62 00000000 00000000 00000000  : b.o.............
>>
>> So that explains why things were falling over. It is likely indeed the
>> bad DMA I have been craving all along. And this time it was so gracious
>> as to give me the answer in plain ASCII  I suspect there will be a
>> patch for a certain AHCI driver in the not too distant future.
> 
> I hope this kind of problem becomes easier to debug once we have
> full iommu support working on arm64. When we had problems like this
> on PowerPC, using iommu=force to ensure DMA would only be done to
> pages that are currently mapped to the device was really helpful.

Oh, you can imagine that I put my best Dr. Evil hat on this week and
have my finger on the button already. In fact if I have my way future
SBSA compliant systems will be required to use an IOMMU with no way to
avoid having one. Whether that was before or after I was reduced to
walking kernel memory one word at a time and using pen and paper to
derive the above...We've a couple of years of "robustness investment"
ahead on ARM servers to ensure that we catch all of these things. And we
will catch all of them. And it will be utterly perfect.

Jon.

^ permalink raw reply	[flat|nested] 103+ messages in thread

end of thread, other threads:[~2015-03-03 17:05 UTC | newest]

Thread overview: 103+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-09-26 14:03 [PATCH V4 0/6] RCU get_user_pages_fast and __get_user_pages_fast Steve Capper
2014-09-26 14:03 ` [PATCH V4 1/6] mm: Introduce a general RCU get_user_pages_fast Steve Capper
2014-09-29 21:51   ` Hugh Dickins
2014-10-01 11:11     ` Catalin Marinas
2014-10-02 16:00     ` Steve Capper
2014-10-02 12:19   ` Andrea Arcangeli
2014-10-02 16:18     ` Steve Capper
2014-10-02 16:54       ` Andrea Arcangeli
2014-10-13  5:15     ` Aneesh Kumar K.V
2014-10-13  5:21       ` David Miller
2014-10-13 11:44         ` Steve Capper
2014-10-13 16:06           ` David Miller
2014-10-14 12:38             ` Steve Capper
2014-10-14 16:30               ` David Miller
2014-10-13 17:04           ` Aneesh Kumar K.V
2014-10-13  6:22   ` Aneesh Kumar K.V
2014-09-26 14:03 ` [PATCH V4 2/6] arm: mm: Introduce special ptes for LPAE Steve Capper
2014-09-26 14:03 ` [PATCH V4 3/6] arm: mm: Enable HAVE_RCU_TABLE_FREE logic Steve Capper
2014-09-26 14:03 ` [PATCH V4 4/6] arm: mm: Enable RCU fast_gup Steve Capper
2014-09-26 14:03 ` [PATCH V4 5/6] arm64: mm: Enable HAVE_RCU_TABLE_FREE logic Steve Capper
2014-09-26 14:03 ` [PATCH V4 6/6] arm64: mm: Enable RCU fast_gup Steve Capper
2015-02-27 12:42 ` [PATCH V4 0/6] RCU get_user_pages_fast and __get_user_pages_fast Jon Masters
2015-02-27 13:20   ` Mark Rutland
2015-03-02 14:16     ` Mark Rutland
2015-03-02  2:10   ` PMD update corruption (sync question) Jon Masters
2015-03-02  5:58     ` Jon Masters
2015-03-02 10:50       ` Catalin Marinas
2015-03-02 11:06         ` Jon Masters
2015-03-02 12:31           ` Peter Zijlstra
2015-03-02 12:40             ` Geert Uytterhoeven
2015-03-02 22:21         ` Jon Masters
2015-03-02 22:29           ` Jon Masters
2015-03-03  9:06           ` Arnd Bergmann
2015-03-03 15:46             ` Jon Masters
