linux-mm.kvack.org archive mirror
* [RFC v4 00/20] Speculative page faults
@ 2017-06-09 14:20 Laurent Dufour
  2017-06-09 14:20 ` [RFC v4 01/20] mm: Dont assume page-table invariance during faults Laurent Dufour
                   ` (22 more replies)
  0 siblings, 23 replies; 30+ messages in thread
From: Laurent Dufour @ 2017-06-09 14:20 UTC (permalink / raw)
  To: paulmck, peterz, akpm, kirill, ak, mhocko, dave, jack, Matthew Wilcox
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora

This is a port to kernel 4.12 of the work done by Peter Zijlstra to
handle page faults without holding the mm semaphore.

http://linux-kernel.2935.n7.nabble.com/RFC-PATCH-0-6-Another-go-at-speculative-page-faults-tt965642.html#none

Compared to Peter's initial work, this series introduces a spin
trylock when dealing with speculative page faults. This is required to
avoid a deadlock when handling a page fault while a TLB invalidation is
requested by another CPU holding the PTE lock. Another change was made
because of a lock dependency issue with mapping->i_mmap_rwsem.

This series also protects changes to the VMA's data which are read or
changed by the page fault handler. The protection is done through the
VMA's sequence number.
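
For reference, the protection boils down to the following pattern. This
is only a simplified sketch of what patches 04, 06 and 11 implement, it
is not additional code:

        /* writer side: wrap any change to VMA fields the fault handler reads */
        write_seqcount_begin(&vma->vm_sequence);
        vma->vm_flags = newflags;       /* or vm_policy, vm_start/vm_end, ... */
        write_seqcount_end(&vma->vm_sequence);

        /* speculative reader side: snapshot first, re-check before committing */
        seq = raw_read_seqcount(&vma->vm_sequence);
        if (seq & 1)
                goto fallback;          /* writer in progress */
        /* ... walk the page tables and prepare the new PTE ... */
        if (read_seqcount_retry(&vma->vm_sequence, seq))
                goto fallback;          /* VMA changed, return VM_FAULT_RETRY */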

This series is functional on x86 and PowerPC.

It is built on top of v4.12-rc4 and relies on the change made by
Paul McKenney to the SRCU code, allowing better performance by
maintaining per-CPU callback lists:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=da915ad5cf25b5f5d358dd3670c3378d8ae8c03e

Tests have been made using a large commercial in-memory database on a
PowerPC system with 752 CPUs. The results are very encouraging since
the loading of the 2TB database was 20% faster with speculative page
faults.

Since the tests are encouraging and running the test suite didn't
raise any issue, I'd like this request-for-comment series to move to a
patch series soon. So please feel free to comment.

Changes since V3:
 - support for the 5-level paging.
 - abort speculative path before entering userfault code
 - support for PowerPC architecture
 - reorder the patches to fix build test errors.

Laurent Dufour (14):
  mm: Introduce pte_spinlock
  mm/spf: Try spin lock in speculative path
  mm/spf: Fix fe.sequence init in __handle_mm_fault()
  mm/spf: don't set fault entry's fields if locking failed
  mm/spf; fix lock dependency against mapping->i_mmap_rwsem
  mm/spf: Protect changes to vm_flags
  mm/spf Protect vm_policy's changes against speculative pf
  mm/spf: Add check on the VMA's flags
  mm/spf: protect madvise vs speculative pf
  mm/spf: protect mremap() against speculative pf
  mm/spf: Don't call user fault callback in the speculative path
  x86/mm: Update the handle_speculative_fault's path
  powerpc/mm: Add speculative page fault
  mm/spf: Clear FAULT_FLAG_KILLABLE in the speculative path

Peter Zijlstra (6):
  mm: Dont assume page-table invariance during faults
  mm: Prepare for FAULT_FLAG_SPECULATIVE
  mm: VMA sequence count
  mm: RCU free VMAs
  mm: Provide speculative fault infrastructure
  x86/mm: Add speculative pagefault handling

 arch/powerpc/mm/fault.c  |  25 +++-
 arch/x86/mm/fault.c      |  14 +++
 fs/proc/task_mmu.c       |   2 +
 include/linux/mm.h       |   4 +
 include/linux/mm_types.h |   3 +
 kernel/fork.c            |   1 +
 mm/init-mm.c             |   1 +
 mm/internal.h            |  20 ++++
 mm/madvise.c             |   4 +
 mm/memory.c              | 291 +++++++++++++++++++++++++++++++++++++++--------
 mm/mempolicy.c           |  10 +-
 mm/mlock.c               |   9 +-
 mm/mmap.c                | 123 +++++++++++++++-----
 mm/mprotect.c            |   2 +
 mm/mremap.c              |   7 ++
 15 files changed, 435 insertions(+), 81 deletions(-)

-- 
2.7.4


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [RFC v4 01/20] mm: Dont assume page-table invariance during faults
  2017-06-09 14:20 [RFC v4 00/20] Speculative page faults Laurent Dufour
@ 2017-06-09 14:20 ` Laurent Dufour
  2017-06-09 14:20 ` [RFC v4 02/20] mm: Prepare for FAULT_FLAG_SPECULATIVE Laurent Dufour
                   ` (21 subsequent siblings)
  22 siblings, 0 replies; 30+ messages in thread
From: Laurent Dufour @ 2017-06-09 14:20 UTC (permalink / raw)
  To: paulmck, peterz, akpm, kirill, ak, mhocko, dave, jack, Matthew Wilcox
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora

From: Peter Zijlstra <peterz@infradead.org>

One of the side effects of speculating on faults (without holding
mmap_sem) is that we can race with free_pgtables() and therefore we
cannot assume the page-tables will stick around.

Remove the reliance on the pte pointer.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 mm/memory.c | 27 ---------------------------
 1 file changed, 27 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 2e65df1831d9..fd952f05e016 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2102,30 +2102,6 @@ int apply_to_page_range(struct mm_struct *mm, unsigned long addr,
 }
 EXPORT_SYMBOL_GPL(apply_to_page_range);
 
-/*
- * handle_pte_fault chooses page fault handler according to an entry which was
- * read non-atomically.  Before making any commitment, on those architectures
- * or configurations (e.g. i386 with PAE) which might give a mix of unmatched
- * parts, do_swap_page must check under lock before unmapping the pte and
- * proceeding (but do_wp_page is only called after already making such a check;
- * and do_anonymous_page can safely check later on).
- */
-static inline int pte_unmap_same(struct mm_struct *mm, pmd_t *pmd,
-				pte_t *page_table, pte_t orig_pte)
-{
-	int same = 1;
-#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT)
-	if (sizeof(pte_t) > sizeof(unsigned long)) {
-		spinlock_t *ptl = pte_lockptr(mm, pmd);
-		spin_lock(ptl);
-		same = pte_same(*page_table, orig_pte);
-		spin_unlock(ptl);
-	}
-#endif
-	pte_unmap(page_table);
-	return same;
-}
-
 static inline void cow_user_page(struct page *dst, struct page *src, unsigned long va, struct vm_area_struct *vma)
 {
 	debug_dma_assert_idle(src);
@@ -2682,9 +2658,6 @@ int do_swap_page(struct vm_fault *vmf)
 	int exclusive = 0;
 	int ret = 0;
 
-	if (!pte_unmap_same(vma->vm_mm, vmf->pmd, vmf->pte, vmf->orig_pte))
-		goto out;
-
 	entry = pte_to_swp_entry(vmf->orig_pte);
 	if (unlikely(non_swap_entry(entry))) {
 		if (is_migration_entry(entry)) {
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC v4 02/20] mm: Prepare for FAULT_FLAG_SPECULATIVE
  2017-06-09 14:20 [RFC v4 00/20] Speculative page faults Laurent Dufour
  2017-06-09 14:20 ` [RFC v4 01/20] mm: Dont assume page-table invariance during faults Laurent Dufour
@ 2017-06-09 14:20 ` Laurent Dufour
  2017-06-09 14:20 ` [RFC v4 03/20] mm: Introduce pte_spinlock Laurent Dufour
                   ` (20 subsequent siblings)
  22 siblings, 0 replies; 30+ messages in thread
From: Laurent Dufour @ 2017-06-09 14:20 UTC (permalink / raw)
  To: paulmck, peterz, akpm, kirill, ak, mhocko, dave, jack, Matthew Wilcox
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora

From: Peter Zijlstra <peterz@infradead.org>

When speculating faults (without holding mmap_sem) we need to validate
that the vma against which we loaded pages is still valid when we're
ready to install the new PTE.

Therefore, replace the pte_offset_map_lock() calls that (re)take the
PTL with pte_map_lock() which can fail in case we find the VMA changed
since we started the fault.
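
Call sites thus switch from unconditionally taking the PTL to the
following pattern (taken from the hunks below):

        if (!pte_map_lock(vmf))
                return VM_FAULT_RETRY; /* the VMA changed under us */
        /* vmf->pte is mapped and vmf->ptl is held from here on */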

Instead of passing around the endless list of function arguments,
replace the lot with a single structure so we can change context
without endless function signature changes.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
[port to 4.10 kernel]
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
---
 include/linux/mm.h |  1 +
 mm/memory.c        | 57 +++++++++++++++++++++++++++++++++++++++---------------
 2 files changed, 42 insertions(+), 16 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index b892e95d4929..6b7ec2a76953 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -286,6 +286,7 @@ extern pgprot_t protection_map[16];
 #define FAULT_FLAG_USER		0x40	/* The fault originated in userspace */
 #define FAULT_FLAG_REMOTE	0x80	/* faulting for non current tsk/mm */
 #define FAULT_FLAG_INSTRUCTION  0x100	/* The fault was during an instruction fetch */
+#define FAULT_FLAG_SPECULATIVE	0x200	/* Speculative fault, not holding mmap_sem */
 
 #define FAULT_FLAG_TRACE \
 	{ FAULT_FLAG_WRITE,		"WRITE" }, \
diff --git a/mm/memory.c b/mm/memory.c
index fd952f05e016..c61042cf8b52 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2240,6 +2240,12 @@ static inline void wp_page_reuse(struct vm_fault *vmf)
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
 }
 
+static bool pte_map_lock(struct vm_fault *vmf)
+{
+	vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl);
+	return true;
+}
+
 /*
  * Handle the case of a page which we actually need to copy to a new page.
  *
@@ -2267,6 +2273,7 @@ static int wp_page_copy(struct vm_fault *vmf)
 	const unsigned long mmun_start = vmf->address & PAGE_MASK;
 	const unsigned long mmun_end = mmun_start + PAGE_SIZE;
 	struct mem_cgroup *memcg;
+	int ret = VM_FAULT_OOM;
 
 	if (unlikely(anon_vma_prepare(vma)))
 		goto oom;
@@ -2294,7 +2301,11 @@ static int wp_page_copy(struct vm_fault *vmf)
 	/*
 	 * Re-check the pte - we dropped the lock
 	 */
-	vmf->pte = pte_offset_map_lock(mm, vmf->pmd, vmf->address, &vmf->ptl);
+	if (!pte_map_lock(vmf)) {
+		mem_cgroup_cancel_charge(new_page, memcg, false);
+		ret = VM_FAULT_RETRY;
+		goto oom_free_new;
+	}
 	if (likely(pte_same(*vmf->pte, vmf->orig_pte))) {
 		if (old_page) {
 			if (!PageAnon(old_page)) {
@@ -2382,7 +2393,7 @@ static int wp_page_copy(struct vm_fault *vmf)
 oom:
 	if (old_page)
 		put_page(old_page);
-	return VM_FAULT_OOM;
+	return ret;
 }
 
 /**
@@ -2403,8 +2414,8 @@ static int wp_page_copy(struct vm_fault *vmf)
 int finish_mkwrite_fault(struct vm_fault *vmf)
 {
 	WARN_ON_ONCE(!(vmf->vma->vm_flags & VM_SHARED));
-	vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address,
-				       &vmf->ptl);
+	if (!pte_map_lock(vmf))
+		return VM_FAULT_RETRY;
 	/*
 	 * We might have raced with another page fault while we released the
 	 * pte_offset_map_lock.
@@ -2522,8 +2533,11 @@ static int do_wp_page(struct vm_fault *vmf)
 			get_page(vmf->page);
 			pte_unmap_unlock(vmf->pte, vmf->ptl);
 			lock_page(vmf->page);
-			vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
-					vmf->address, &vmf->ptl);
+			if (!pte_map_lock(vmf)) {
+				unlock_page(vmf->page);
+				put_page(vmf->page);
+				return VM_FAULT_RETRY;
+			}
 			if (!pte_same(*vmf->pte, vmf->orig_pte)) {
 				unlock_page(vmf->page);
 				pte_unmap_unlock(vmf->pte, vmf->ptl);
@@ -2681,8 +2695,10 @@ int do_swap_page(struct vm_fault *vmf)
 			 * Back out if somebody else faulted in this pte
 			 * while we released the pte lock.
 			 */
-			vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
-					vmf->address, &vmf->ptl);
+			if (!pte_map_lock(vmf)) {
+				delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
+				return VM_FAULT_RETRY;
+			}
 			if (likely(pte_same(*vmf->pte, vmf->orig_pte)))
 				ret = VM_FAULT_OOM;
 			delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
@@ -2738,8 +2754,11 @@ int do_swap_page(struct vm_fault *vmf)
 	/*
 	 * Back out if somebody else already faulted in this pte.
 	 */
-	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
-			&vmf->ptl);
+	if (!pte_map_lock(vmf)) {
+		ret = VM_FAULT_RETRY;
+		mem_cgroup_cancel_charge(page, memcg, false);
+		goto out_page;
+	}
 	if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte)))
 		goto out_nomap;
 
@@ -2903,8 +2922,8 @@ static int do_anonymous_page(struct vm_fault *vmf)
 			!mm_forbids_zeropage(vma->vm_mm)) {
 		entry = pte_mkspecial(pfn_pte(my_zero_pfn(vmf->address),
 						vma->vm_page_prot));
-		vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
-				vmf->address, &vmf->ptl);
+		if (!pte_map_lock(vmf))
+			return VM_FAULT_RETRY;
 		if (!pte_none(*vmf->pte))
 			goto unlock;
 		/* Deliver the page fault to userland, check inside PT lock */
@@ -2936,8 +2955,12 @@ static int do_anonymous_page(struct vm_fault *vmf)
 	if (vma->vm_flags & VM_WRITE)
 		entry = pte_mkwrite(pte_mkdirty(entry));
 
-	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
-			&vmf->ptl);
+	if (!pte_map_lock(vmf)) {
+		/* XXX: should be factorized */
+		mem_cgroup_cancel_charge(page, memcg, false);
+		put_page(page);
+		return VM_FAULT_RETRY;
+	}
 	if (!pte_none(*vmf->pte))
 		goto release;
 
@@ -3057,8 +3080,9 @@ static int pte_alloc_one_map(struct vm_fault *vmf)
 	 * pte_none() under vmf->ptl protection when we return to
 	 * alloc_set_pte().
 	 */
-	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
-			&vmf->ptl);
+	if (!pte_map_lock(vmf))
+		return VM_FAULT_RETRY;
+
 	return 0;
 }
 
@@ -3378,6 +3402,7 @@ static int do_read_fault(struct vm_fault *vmf)
 	 * something).
 	 */
 	if (vma->vm_ops->map_pages && fault_around_bytes >> PAGE_SHIFT > 1) {
+		/* XXX: is a call to pte_map_lock(fe) required here ? */
 		ret = do_fault_around(vmf);
 		if (ret)
 			return ret;
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC v4 03/20] mm: Introduce pte_spinlock
  2017-06-09 14:20 [RFC v4 00/20] Speculative page faults Laurent Dufour
  2017-06-09 14:20 ` [RFC v4 01/20] mm: Dont assume page-table invariance during faults Laurent Dufour
  2017-06-09 14:20 ` [RFC v4 02/20] mm: Prepare for FAULT_FLAG_SPECULATIVE Laurent Dufour
@ 2017-06-09 14:20 ` Laurent Dufour
  2017-06-09 14:20 ` [RFC v4 04/20] mm: VMA sequence count Laurent Dufour
                   ` (19 subsequent siblings)
  22 siblings, 0 replies; 30+ messages in thread
From: Laurent Dufour @ 2017-06-09 14:20 UTC (permalink / raw)
  To: paulmck, peterz, akpm, kirill, ak, mhocko, dave, jack, Matthew Wilcox
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora

This is needed because in handle_pte_fault() pte_offset_map() is
called and then fe->ptl is fetched and spin_locked.

This was previously embedded in the call to pte_offset_map_lock().

Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
---
 mm/memory.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index c61042cf8b52..3cbce3762d17 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2240,6 +2240,13 @@ static inline void wp_page_reuse(struct vm_fault *vmf)
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
 }
 
+static bool pte_spinlock(struct vm_fault *vmf)
+{
+	vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
+	spin_lock(vmf->ptl);
+	return true;
+}
+
 static bool pte_map_lock(struct vm_fault *vmf)
 {
 	vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl);
@@ -3554,8 +3561,8 @@ static int do_numa_page(struct vm_fault *vmf)
 	 * validation through pte_unmap_same(). It's of NUMA type but
 	 * the pfn may be screwed if the read is non atomic.
 	 */
-	vmf->ptl = pte_lockptr(vma->vm_mm, vmf->pmd);
-	spin_lock(vmf->ptl);
+	if (!pte_spinlock(vmf))
+		return VM_FAULT_RETRY;
 	if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
 		goto out;
@@ -3747,8 +3754,8 @@ static int handle_pte_fault(struct vm_fault *vmf)
 	if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
 		return do_numa_page(vmf);
 
-	vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
-	spin_lock(vmf->ptl);
+	if (!pte_spinlock(vmf))
+		return VM_FAULT_RETRY;
 	entry = vmf->orig_pte;
 	if (unlikely(!pte_same(*vmf->pte, entry)))
 		goto unlock;
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC v4 04/20] mm: VMA sequence count
  2017-06-09 14:20 [RFC v4 00/20] Speculative page faults Laurent Dufour
                   ` (2 preceding siblings ...)
  2017-06-09 14:20 ` [RFC v4 03/20] mm: Introduce pte_spinlock Laurent Dufour
@ 2017-06-09 14:20 ` Laurent Dufour
  2017-06-09 14:20 ` [RFC v4 05/20] mm: RCU free VMAs Laurent Dufour
                   ` (18 subsequent siblings)
  22 siblings, 0 replies; 30+ messages in thread
From: Laurent Dufour @ 2017-06-09 14:20 UTC (permalink / raw)
  To: paulmck, peterz, akpm, kirill, ak, mhocko, dave, jack, Matthew Wilcox
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora

From: Peter Zijlstra <peterz@infradead.org>

Wrap the VMA modifications (vma_adjust/unmap_page_range) with sequence
counts such that we can easily test if a VMA is changed.

The unmap_page_range() one allows us to make assumptions about
page-tables; when we find the seqcount hasn't changed we can assume
page-tables are still valid.

The flip side is that we cannot distinguish between a vma_adjust() and
the unmap_page_range() -- where with the former we could have
re-checked the vma bounds against the address.
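
Schematically, this is nothing more than (condensed from the hunks
below):

        write_seqcount_begin(&vma->vm_sequence);
        /* __vma_adjust() or unmap_page_range() modifies the VMA / page-tables */
        write_seqcount_end(&vma->vm_sequence);

A speculative reader samples the seqcount before relying on the VMA and
re-checks it afterwards; any write side activity in between makes the
re-check fail.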

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
[port to 4.10 kernel]
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
---
 include/linux/mm_types.h |  1 +
 mm/memory.c              |  2 ++
 mm/mmap.c                | 13 +++++++++++++
 3 files changed, 16 insertions(+)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 45cdb27791a3..8945743e4609 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -342,6 +342,7 @@ struct vm_area_struct {
 	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
 #endif
 	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
+	seqcount_t vm_sequence;
 };
 
 struct core_thread {
diff --git a/mm/memory.c b/mm/memory.c
index 3cbce3762d17..8f0468d099fc 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1379,6 +1379,7 @@ void unmap_page_range(struct mmu_gather *tlb,
 	unsigned long next;
 
 	BUG_ON(addr >= end);
+	write_seqcount_begin(&vma->vm_sequence);
 	tlb_start_vma(tlb, vma);
 	pgd = pgd_offset(vma->vm_mm, addr);
 	do {
@@ -1388,6 +1389,7 @@ void unmap_page_range(struct mmu_gather *tlb,
 		next = zap_p4d_range(tlb, vma, pgd, addr, next, details);
 	} while (pgd++, addr = next, addr != end);
 	tlb_end_vma(tlb, vma);
+	write_seqcount_end(&vma->vm_sequence);
 }
 
 
diff --git a/mm/mmap.c b/mm/mmap.c
index f82741e199c0..9f86356d0012 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -543,6 +543,8 @@ void __vma_link_rb(struct mm_struct *mm, struct vm_area_struct *vma,
 	else
 		mm->highest_vm_end = vma->vm_end;
 
+	seqcount_init(&vma->vm_sequence);
+
 	/*
 	 * vma->vm_prev wasn't known when we followed the rbtree to find the
 	 * correct insertion point for that vma. As a result, we could not
@@ -677,6 +679,10 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 	long adjust_next = 0;
 	int remove_next = 0;
 
+	write_seqcount_begin(&vma->vm_sequence);
+	if (next)
+		write_seqcount_begin_nested(&next->vm_sequence, SINGLE_DEPTH_NESTING);
+
 	if (next && !insert) {
 		struct vm_area_struct *exporter = NULL, *importer = NULL;
 
@@ -888,6 +894,7 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 		mm->map_count--;
 		mpol_put(vma_policy(next));
 		kmem_cache_free(vm_area_cachep, next);
+		write_seqcount_end(&next->vm_sequence);
 		/*
 		 * In mprotect's case 6 (see comments on vma_merge),
 		 * we must remove another next too. It would clutter
@@ -901,6 +908,8 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 			 * "vma->vm_next" gap must be updated.
 			 */
 			next = vma->vm_next;
+			if (next)
+				write_seqcount_begin_nested(&next->vm_sequence, SINGLE_DEPTH_NESTING);
 		} else {
 			/*
 			 * For the scope of the comment "next" and
@@ -947,6 +956,10 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 	if (insert && file)
 		uprobe_mmap(insert);
 
+	if (next)
+		write_seqcount_end(&next->vm_sequence);
+	write_seqcount_end(&vma->vm_sequence);
+
 	validate_mm(mm);
 
 	return 0;
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC v4 05/20] mm: RCU free VMAs
  2017-06-09 14:20 [RFC v4 00/20] Speculative page faults Laurent Dufour
                   ` (3 preceding siblings ...)
  2017-06-09 14:20 ` [RFC v4 04/20] mm: VMA sequence count Laurent Dufour
@ 2017-06-09 14:20 ` Laurent Dufour
  2017-06-09 14:20 ` [RFC v4 06/20] mm: Provide speculative fault infrastructure Laurent Dufour
                   ` (17 subsequent siblings)
  22 siblings, 0 replies; 30+ messages in thread
From: Laurent Dufour @ 2017-06-09 14:20 UTC (permalink / raw)
  To: paulmck, peterz, akpm, kirill, ak, mhocko, dave, jack, Matthew Wilcox
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora

From: Peter Zijlstra <peterz@infradead.org>

Manage the VMAs with SRCU such that we can do a lockless VMA lookup.

We put the fput(vma->vm_file) in the SRCU callback; this keeps files
valid during speculative faults and is possible due to the delayed
fput work by Al Viro -- do we need srcu_barrier() in unmount
someplace?

We guard the mm_rb tree with a seqlock (XXX could be a seqcount but
we'd have to disable preemption around the write side in order to make
the retry loop in __read_seqcount_begin() work) such that we can know
if the rb tree walk was correct. We cannot trust the result of a
lockless tree walk in the face of concurrent tree rotations; although
we can rely on the termination of such walks -- tree rotations
guarantee the end result is a tree again after all.

Furthermore, we rely on the WMB implied by the
write_seqlock/count_begin() to separate the VMA initialization and the
publishing stores, analogous to the RELEASE in rcu_assign_pointer().
We also rely on the RMB from read_seqretry() to separate the vma load
from further loads like the smp_read_barrier_depends() in regular
RCU.

We must not touch the vmacache while doing SRCU lookups as that is not
properly serialized against changes. We update gap information after
publishing the VMA, but A) we don't use that and B) the seqlock
read side would fix that anyhow.

We clear vma->vm_rb for nodes removed from the vma tree such that we
can easily detect such 'dead' nodes, we rely on the WMB from
write_sequnlock() to separate the tree removal and clearing the node.

Provide find_vma_srcu() which wraps the required magic.
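
A lockless lookup then follows this pattern (a sketch of how the
speculative fault patch uses it later in the series):

        idx = srcu_read_lock(&vma_srcu);
        vma = find_vma_srcu(mm, address);
        /* ... validate vma->vm_sequence and handle the fault ... */
        srcu_read_unlock(&vma_srcu, idx);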

XXX: mmap()/munmap() heavy workloads might suffer from the global lock
in call_srcu() -- this is fixable with a 'better' SRCU implementation.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
[Rename vma_is_dead() to vma_has_changed()]
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
---
 include/linux/mm_types.h |   2 +
 kernel/fork.c            |   1 +
 mm/init-mm.c             |   1 +
 mm/internal.h            |  20 ++++++++++
 mm/mmap.c                | 102 ++++++++++++++++++++++++++++++++++-------------
 5 files changed, 99 insertions(+), 27 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 8945743e4609..7da5e565046c 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -343,6 +343,7 @@ struct vm_area_struct {
 #endif
 	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
 	seqcount_t vm_sequence;
+	struct rcu_head vm_rcu_head;
 };
 
 struct core_thread {
@@ -360,6 +361,7 @@ struct kioctx_table;
 struct mm_struct {
 	struct vm_area_struct *mmap;		/* list of VMAs */
 	struct rb_root mm_rb;
+	seqlock_t mm_seq;
 	u32 vmacache_seqnum;                   /* per-thread vmacache */
 #ifdef CONFIG_MMU
 	unsigned long (*get_unmapped_area) (struct file *filp,
diff --git a/kernel/fork.c b/kernel/fork.c
index e53770d2bf95..e4d335bee820 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -793,6 +793,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	mm->mmap = NULL;
 	mm->mm_rb = RB_ROOT;
 	mm->vmacache_seqnum = 0;
+	seqlock_init(&mm->mm_seq);
 	atomic_set(&mm->mm_users, 1);
 	atomic_set(&mm->mm_count, 1);
 	init_rwsem(&mm->mmap_sem);
diff --git a/mm/init-mm.c b/mm/init-mm.c
index 975e49f00f34..2b1fa061684f 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -16,6 +16,7 @@
 
 struct mm_struct init_mm = {
 	.mm_rb		= RB_ROOT,
+	.mm_seq		= __SEQLOCK_UNLOCKED(init_mm.mm_seq),
 	.pgd		= swapper_pg_dir,
 	.mm_users	= ATOMIC_INIT(2),
 	.mm_count	= ATOMIC_INIT(1),
diff --git a/mm/internal.h b/mm/internal.h
index 0e4f558412fb..3c6041d53310 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -40,6 +40,26 @@ void page_writeback_init(void);
 
 int do_swap_page(struct vm_fault *vmf);
 
+extern struct srcu_struct vma_srcu;
+
+extern struct vm_area_struct *find_vma_srcu(struct mm_struct *mm,
+					    unsigned long addr);
+
+static inline bool vma_has_changed(struct vm_area_struct *vma,
+				   unsigned int sequence)
+{
+	int ret = RB_EMPTY_NODE(&vma->vm_rb);
+	unsigned seq = ACCESS_ONCE(vma->vm_sequence.sequence);
+
+	/*
+	 * Matches both the wmb in write_seqlock_{begin,end}() and
+	 * the wmb in vma_rb_erase().
+	 */
+	smp_rmb();
+
+	return ret || seq != sequence;
+}
+
 void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
 		unsigned long floor, unsigned long ceiling);
 
diff --git a/mm/mmap.c b/mm/mmap.c
index 9f86356d0012..008fc35aa75e 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -159,6 +159,23 @@ void unlink_file_vma(struct vm_area_struct *vma)
 	}
 }
 
+DEFINE_SRCU(vma_srcu);
+
+static void __free_vma(struct rcu_head *head)
+{
+	struct vm_area_struct *vma =
+		container_of(head, struct vm_area_struct, vm_rcu_head);
+
+	if (vma->vm_file)
+		fput(vma->vm_file);
+	kmem_cache_free(vm_area_cachep, vma);
+}
+
+static void free_vma(struct vm_area_struct *vma)
+{
+	call_srcu(&vma_srcu, &vma->vm_rcu_head, __free_vma);
+}
+
 /*
  * Close a vm structure and free it, returning the next.
  */
@@ -169,10 +186,8 @@ static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
 	might_sleep();
 	if (vma->vm_ops && vma->vm_ops->close)
 		vma->vm_ops->close(vma);
-	if (vma->vm_file)
-		fput(vma->vm_file);
 	mpol_put(vma_policy(vma));
-	kmem_cache_free(vm_area_cachep, vma);
+	free_vma(vma);
 	return next;
 }
 
@@ -396,26 +411,37 @@ static void vma_gap_update(struct vm_area_struct *vma)
 }
 
 static inline void vma_rb_insert(struct vm_area_struct *vma,
-				 struct rb_root *root)
+				 struct mm_struct *mm)
 {
+	struct rb_root *root = &mm->mm_rb;
+
 	/* All rb_subtree_gap values must be consistent prior to insertion */
 	validate_mm_rb(root, NULL);
 
 	rb_insert_augmented(&vma->vm_rb, root, &vma_gap_callbacks);
 }
 
-static void __vma_rb_erase(struct vm_area_struct *vma, struct rb_root *root)
+static void __vma_rb_erase(struct vm_area_struct *vma, struct mm_struct *mm)
 {
+	struct rb_root *root = &mm->mm_rb;
 	/*
 	 * Note rb_erase_augmented is a fairly large inline function,
 	 * so make sure we instantiate it only once with our desired
 	 * augmented rbtree callbacks.
 	 */
+	write_seqlock(&mm->mm_seq);
 	rb_erase_augmented(&vma->vm_rb, root, &vma_gap_callbacks);
+	write_sequnlock(&mm->mm_seq); /* wmb */
+
+	/*
+	 * Ensure the removal is complete before clearing the node.
+	 * Matched by vma_has_changed()/handle_speculative_fault().
+	 */
+	RB_CLEAR_NODE(&vma->vm_rb);
 }
 
 static __always_inline void vma_rb_erase_ignore(struct vm_area_struct *vma,
-						struct rb_root *root,
+						struct mm_struct *mm,
 						struct vm_area_struct *ignore)
 {
 	/*
@@ -423,21 +449,21 @@ static __always_inline void vma_rb_erase_ignore(struct vm_area_struct *vma,
 	 * with the possible exception of the "next" vma being erased if
 	 * next->vm_start was reduced.
 	 */
-	validate_mm_rb(root, ignore);
+	validate_mm_rb(&mm->mm_rb, ignore);
 
-	__vma_rb_erase(vma, root);
+	__vma_rb_erase(vma, mm);
 }
 
 static __always_inline void vma_rb_erase(struct vm_area_struct *vma,
-					 struct rb_root *root)
+					 struct mm_struct *mm)
 {
 	/*
 	 * All rb_subtree_gap values must be consistent prior to erase,
 	 * with the possible exception of the vma being erased.
 	 */
-	validate_mm_rb(root, vma);
+	validate_mm_rb(&mm->mm_rb, vma);
 
-	__vma_rb_erase(vma, root);
+	__vma_rb_erase(vma, mm);
 }
 
 /*
@@ -554,10 +580,12 @@ void __vma_link_rb(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * immediately update the gap to the correct value. Finally we
 	 * rebalance the rbtree after all augmented values have been set.
 	 */
+	write_seqlock(&mm->mm_seq);
 	rb_link_node(&vma->vm_rb, rb_parent, rb_link);
 	vma->rb_subtree_gap = 0;
 	vma_gap_update(vma);
-	vma_rb_insert(vma, &mm->mm_rb);
+	vma_rb_insert(vma, mm);
+	write_sequnlock(&mm->mm_seq);
 }
 
 static void __vma_link_file(struct vm_area_struct *vma)
@@ -633,7 +661,7 @@ static __always_inline void __vma_unlink_common(struct mm_struct *mm,
 {
 	struct vm_area_struct *next;
 
-	vma_rb_erase_ignore(vma, &mm->mm_rb, ignore);
+	vma_rb_erase_ignore(vma, mm, ignore);
 	next = vma->vm_next;
 	if (has_prev)
 		prev->vm_next = next;
@@ -885,21 +913,19 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 	}
 
 	if (remove_next) {
-		if (file) {
+		if (file)
 			uprobe_munmap(next, next->vm_start, next->vm_end);
-			fput(file);
-		}
 		if (next->anon_vma)
 			anon_vma_merge(vma, next);
 		mm->map_count--;
 		mpol_put(vma_policy(next));
-		kmem_cache_free(vm_area_cachep, next);
-		write_seqcount_end(&next->vm_sequence);
+		free_vma(next);
 		/*
 		 * In mprotect's case 6 (see comments on vma_merge),
 		 * we must remove another next too. It would clutter
 		 * up the code too much to do both in one go.
 		 */
+		write_seqcount_end(&next->vm_sequence);
 		if (remove_next != 3) {
 			/*
 			 * If "next" was removed and vma->vm_end was
@@ -2107,15 +2133,10 @@ get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,
 EXPORT_SYMBOL(get_unmapped_area);
 
 /* Look up the first VMA which satisfies  addr < vm_end,  NULL if none. */
-struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
+static struct vm_area_struct *__find_vma(struct mm_struct *mm, unsigned long addr)
 {
 	struct rb_node *rb_node;
-	struct vm_area_struct *vma;
-
-	/* Check the cache first. */
-	vma = vmacache_find(mm, addr);
-	if (likely(vma))
-		return vma;
+	struct vm_area_struct *vma = NULL;
 
 	rb_node = mm->mm_rb.rb_node;
 
@@ -2133,13 +2154,40 @@ struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
 			rb_node = rb_node->rb_right;
 	}
 
+	return vma;
+}
+
+struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
+{
+	struct vm_area_struct *vma;
+
+	/* Check the cache first. */
+	vma = vmacache_find(mm, addr);
+	if (likely(vma))
+		return vma;
+
+	vma = __find_vma(mm, addr);
 	if (vma)
 		vmacache_update(addr, vma);
 	return vma;
 }
-
 EXPORT_SYMBOL(find_vma);
 
+struct vm_area_struct *find_vma_srcu(struct mm_struct *mm, unsigned long addr)
+{
+	struct vm_area_struct *vma;
+	unsigned int seq;
+
+	WARN_ON_ONCE(!srcu_read_lock_held(&vma_srcu));
+
+	do {
+		seq = read_seqbegin(&mm->mm_seq);
+		vma = __find_vma(mm, addr);
+	} while (read_seqretry(&mm->mm_seq, seq));
+
+	return vma;
+}
+
 /*
  * Same as find_vma, but also return a pointer to the previous VMA in *pprev.
  */
@@ -2494,7 +2542,7 @@ detach_vmas_to_be_unmapped(struct mm_struct *mm, struct vm_area_struct *vma,
 	insertion_point = (prev ? &prev->vm_next : &mm->mmap);
 	vma->vm_prev = NULL;
 	do {
-		vma_rb_erase(vma, &mm->mm_rb);
+		vma_rb_erase(vma, mm);
 		mm->map_count--;
 		tail_vma = vma;
 		vma = vma->vm_next;
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC v4 06/20] mm: Provide speculative fault infrastructure
  2017-06-09 14:20 [RFC v4 00/20] Speculative page faults Laurent Dufour
                   ` (4 preceding siblings ...)
  2017-06-09 14:20 ` [RFC v4 05/20] mm: RCU free VMAs Laurent Dufour
@ 2017-06-09 14:20 ` Laurent Dufour
  2017-06-09 14:20 ` [RFC v4 07/20] mm/spf: Try spin lock in speculative path Laurent Dufour
                   ` (16 subsequent siblings)
  22 siblings, 0 replies; 30+ messages in thread
From: Laurent Dufour @ 2017-06-09 14:20 UTC (permalink / raw)
  To: paulmck, peterz, akpm, kirill, ak, mhocko, dave, jack, Matthew Wilcox
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora

From: Peter Zijlstra <peterz@infradead.org>

Provide infrastructure to do a speculative fault (not holding
mmap_sem).

Not holding mmap_sem means we can race against VMA change/removal and
page-table destruction. We use the SRCU VMA freeing to keep the VMA
around. We use the VMA seqcount to detect changes (including unmapping
/ page-table deletion) and we use gup_fast() style page-table walking
to deal with page-table races.

Once we've obtained the page and are ready to update the PTE, we
validate that the state we started the fault with is still valid; if
not, we fail the fault with VM_FAULT_RETRY, otherwise we update the
PTE and we're done.
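
From an architecture's fault handler, the expected use is roughly the
following; this is only a sketch, the actual x86 and powerpc wiring is
done in later patches of the series and handles more corner cases:

        /* try the speculative path first, not holding mmap_sem */
        ret = handle_speculative_fault(mm, address, flags);
        if (ret & VM_FAULT_RETRY) {
                /* fall back to the classic, mmap_sem protected path */
                down_read(&mm->mmap_sem);
                vma = find_vma(mm, address);
                ret = handle_mm_fault(vma, address, flags);
                up_read(&mm->mmap_sem);
        }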

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
[Fix newly introduced pte_spinlock() for speculative page fault]
[Rename vma_is_dead() to vma_has_changed()]
[Call p4d_alloc() as it is safe since pgd is valid]
[Call pud_alloc() as it is safe since p4d is valid]
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
---
 include/linux/mm.h |   3 ++
 mm/memory.c        | 148 +++++++++++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 148 insertions(+), 3 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 6b7ec2a76953..671541e00d26 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -315,6 +315,7 @@ struct vm_fault {
 	gfp_t gfp_mask;			/* gfp mask to be used for allocations */
 	pgoff_t pgoff;			/* Logical page offset based on vma */
 	unsigned long address;		/* Faulting virtual address */
+	unsigned int sequence;
 	pmd_t *pmd;			/* Pointer to pmd entry matching
 					 * the 'address' */
 	pud_t *pud;			/* Pointer to pud entry matching
@@ -1286,6 +1287,8 @@ int invalidate_inode_page(struct page *page);
 #ifdef CONFIG_MMU
 extern int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 		unsigned int flags);
+extern int handle_speculative_fault(struct mm_struct *mm,
+				    unsigned long address, unsigned int flags);
 extern int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
 			    unsigned long address, unsigned int fault_flags,
 			    bool *unlocked);
diff --git a/mm/memory.c b/mm/memory.c
index 8f0468d099fc..8c43895e9310 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2244,15 +2244,66 @@ static inline void wp_page_reuse(struct vm_fault *vmf)
 
 static bool pte_spinlock(struct vm_fault *vmf)
 {
+	bool ret = false;
+
+	/* Check if vma is still valid */
+	if (!(vmf->flags & FAULT_FLAG_SPECULATIVE)) {
+		vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
+		spin_lock(vmf->ptl);
+		return true;
+	}
+
+	local_irq_disable();
+	if (vma_has_changed(vmf->vma, vmf->sequence))
+		goto out;
+
 	vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
 	spin_lock(vmf->ptl);
-	return true;
+
+	if (vma_has_changed(vmf->vma, vmf->sequence)) {
+		spin_unlock(vmf->ptl);
+		goto out;
+	}
+
+	ret = true;
+out:
+	local_irq_enable();
+	return ret;
 }
 
 static bool pte_map_lock(struct vm_fault *vmf)
 {
-	vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl);
-	return true;
+	bool ret = false;
+
+	if (!(vmf->flags & FAULT_FLAG_SPECULATIVE)) {
+		vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
+					       vmf->address, &vmf->ptl);
+		return true;
+	}
+
+	/*
+	 * The first vma_has_changed() guarantees the page-tables are still
+	 * valid, having IRQs disabled ensures they stay around, hence the
+	 * second vma_has_changed() to make sure they are still valid once
+	 * we've got the lock. After that a concurrent zap_pte_range() will
+	 * block on the PTL and thus we're safe.
+	 */
+	local_irq_disable();
+	if (vma_has_changed(vmf->vma, vmf->sequence))
+		goto out;
+
+	vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
+				       vmf->address, &vmf->ptl);
+
+	if (vma_has_changed(vmf->vma, vmf->sequence)) {
+		pte_unmap_unlock(vmf->pte, vmf->ptl);
+		goto out;
+	}
+
+	ret = true;
+out:
+	local_irq_enable();
+	return ret;
 }
 
 /*
@@ -2684,6 +2735,7 @@ int do_swap_page(struct vm_fault *vmf)
 	entry = pte_to_swp_entry(vmf->orig_pte);
 	if (unlikely(non_swap_entry(entry))) {
 		if (is_migration_entry(entry)) {
+			/* XXX fe->pmd might be dead */
 			migration_entry_wait(vma->vm_mm, vmf->pmd,
 					     vmf->address);
 		} else if (is_hwpoison_entry(entry)) {
@@ -3868,6 +3920,96 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 	return handle_pte_fault(&vmf);
 }
 
+int handle_speculative_fault(struct mm_struct *mm, unsigned long address,
+			     unsigned int flags)
+{
+	struct vm_fault vmf = {
+		.address = address,
+		.flags = flags | FAULT_FLAG_SPECULATIVE,
+	};
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
+	int dead, seq, idx, ret = VM_FAULT_RETRY;
+	struct vm_area_struct *vma;
+
+	idx = srcu_read_lock(&vma_srcu);
+	vma = find_vma_srcu(mm, address);
+	if (!vma)
+		goto unlock;
+
+	/*
+	 * Validate the VMA found by the lockless lookup.
+	 */
+	dead = RB_EMPTY_NODE(&vma->vm_rb);
+	seq = raw_read_seqcount(&vma->vm_sequence); /* rmb <-> seqlock,vma_rb_erase() */
+	if ((seq & 1) || dead) /* XXX wait for !&1 instead? */
+		goto unlock;
+
+	if (address < vma->vm_start || vma->vm_end <= address)
+		goto unlock;
+
+	/*
+	 * We need to re-validate the VMA after checking the bounds, otherwise
+	 * we might have a false positive on the bounds.
+	 */
+	if (read_seqcount_retry(&vma->vm_sequence, seq))
+		goto unlock;
+
+	/*
+	 * Do a speculative lookup of the PTE entry.
+	 */
+	local_irq_disable();
+	pgd = pgd_offset(mm, address);
+	if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
+		goto out_walk;
+
+	p4d = p4d_alloc(mm, pgd, address);
+	if (p4d_none(*p4d) || unlikely(p4d_bad(*p4d)))
+		goto out_walk;
+
+	pud = pud_alloc(mm, p4d, address);
+	if (pud_none(*pud) || unlikely(pud_bad(*pud)))
+		goto out_walk;
+
+	pmd = pmd_offset(pud, address);
+	if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
+		goto out_walk;
+
+	/*
+	 * The above does not allocate/instantiate page-tables because doing so
+	 * would lead to the possibility of instantiating page-tables after
+	 * free_pgtables() -- and consequently leaking them.
+	 *
+	 * The result is that we take at least one !speculative fault per PMD
+	 * in order to instantiate it.
+	 *
+	 * XXX try and fix that.. should be possible somehow.
+	 */
+
+	if (pmd_huge(*pmd)) /* XXX no huge support */
+		goto out_walk;
+
+	vmf.vma = vma;
+	vmf.pmd = pmd;
+	vmf.pgoff = linear_page_index(vma, address);
+	vmf.gfp_mask = __get_fault_gfp_mask(vma);
+	vmf.sequence = seq;
+
+	local_irq_enable();
+
+	ret = handle_pte_fault(&vmf);
+
+unlock:
+	srcu_read_unlock(&vma_srcu, idx);
+	return ret;
+
+out_walk:
+	local_irq_enable();
+	goto unlock;
+}
+
 /*
  * By the time we get here, we already hold the mm semaphore
  *
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC v4 07/20] mm/spf: Try spin lock in speculative path
  2017-06-09 14:20 [RFC v4 00/20] Speculative page faults Laurent Dufour
                   ` (5 preceding siblings ...)
  2017-06-09 14:20 ` [RFC v4 06/20] mm: Provide speculative fault infrastructure Laurent Dufour
@ 2017-06-09 14:20 ` Laurent Dufour
  2017-06-09 14:20 ` [RFC v4 08/20] mm/spf: Fix fe.sequence init in __handle_mm_fault() Laurent Dufour
                   ` (15 subsequent siblings)
  22 siblings, 0 replies; 30+ messages in thread
From: Laurent Dufour @ 2017-06-09 14:20 UTC (permalink / raw)
  To: paulmck, peterz, akpm, kirill, ak, mhocko, dave, jack, Matthew Wilcox
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora

There is a deadlock when a CPU is doing a speculative page fault and
another one is calling do_munmap().

The deadlock occurs because the speculative path tries to spin on the
pte lock while interrupts are disabled, whereas the other CPU, in the
unmap path, has locked the pte and is waiting for all the CPUs to
invalidate the TLB. As the CPU doing the speculative fault has
interrupts disabled, it can't invalidate the TLB and can't get the
lock.

Since we are in a speculative path, we can race with other mm actions.
So let's assume that the lock may not get acquired and fail the
speculative page fault in that case.

Here are the stacks captured during the deadlock:

CPU 0
native_flush_tlb_others+0x7c/0x260
flush_tlb_mm_range+0x6a/0x220
tlb_flush_mmu_tlbonly+0x63/0xc0
unmap_page_range+0x897/0x9d0
? unmap_single_vma+0x7d/0xe0
? release_pages+0x2b3/0x360
unmap_single_vma+0x7d/0xe0
unmap_vmas+0x51/0xa0
unmap_region+0xbd/0x130
do_munmap+0x279/0x460
SyS_munmap+0x53/0x70

CPU 1
do_raw_spin_lock+0x14e/0x160
_raw_spin_lock+0x5d/0x80
? pte_map_lock+0x169/0x1b0
pte_map_lock+0x169/0x1b0
handle_pte_fault+0xbf2/0xd80
? trace_hardirqs_on+0xd/0x10
handle_speculative_fault+0x272/0x280
handle_speculative_fault+0x5/0x280
__do_page_fault+0x187/0x580
trace_do_page_fault+0x52/0x260
do_async_page_fault+0x19/0x70
async_page_fault+0x28/0x30
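
The change below therefore replaces the blocking spin_lock() with a
trylock on the speculative path, along the lines of:

        vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
        if (unlikely(!spin_trylock(vmf->ptl)))
                goto out;       /* bail out, the fault will be retried */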

Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
---
 mm/memory.c | 17 ++++++++++++++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 8c43895e9310..9de741554e15 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2258,7 +2258,8 @@ static bool pte_spinlock(struct vm_fault *vmf)
 		goto out;
 
 	vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
-	spin_lock(vmf->ptl);
+	if (unlikely(!spin_trylock(vmf->ptl)))
+		goto out;
 
 	if (vma_has_changed(vmf->vma, vmf->sequence)) {
 		spin_unlock(vmf->ptl);
@@ -2292,8 +2293,18 @@ static bool pte_map_lock(struct vm_fault *vmf)
 	if (vma_has_changed(vmf->vma, vmf->sequence))
 		goto out;
 
-	vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
-				       vmf->address, &vmf->ptl);
+	/* Same as pte_offset_map_lock() except that we call
+	 * spin_trylock() in place of spin_lock() to avoid race with
+	 * unmap path which may have the lock and wait for this CPU
+	 * to invalidate TLB but this CPU has irq disabled.
+	 * Since we are in a speculative patch, accept it could fail
+	 */
+	vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
+	vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
+	if (unlikely(!spin_trylock(vmf->ptl))) {
+		pte_unmap(vmf->pte);
+		goto out;
+	}
 
 	if (vma_has_changed(vmf->vma, vmf->sequence)) {
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC v4 08/20] mm/spf: Fix fe.sequence init in __handle_mm_fault()
  2017-06-09 14:20 [RFC v4 00/20] Speculative page faults Laurent Dufour
                   ` (6 preceding siblings ...)
  2017-06-09 14:20 ` [RFC v4 07/20] mm/spf: Try spin lock in speculative path Laurent Dufour
@ 2017-06-09 14:20 ` Laurent Dufour
  2017-06-09 14:20 ` [RFC v4 09/20] mm/spf: don't set fault entry's fields if locking failed Laurent Dufour
                   ` (14 subsequent siblings)
  22 siblings, 0 replies; 30+ messages in thread
From: Laurent Dufour @ 2017-06-09 14:20 UTC (permalink / raw)
  To: paulmck, peterz, akpm, kirill, ak, mhocko, dave, jack, Matthew Wilcox
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora

__handle_mm_fault() calls handle_pte_fault which requires the sequence
field of the fault_env to be initialized.

Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
---
 mm/memory.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/memory.c b/mm/memory.c
index 9de741554e15..f05288797c60 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3904,6 +3904,7 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 	vmf.pmd = pmd_alloc(mm, vmf.pud, address);
 	if (!vmf.pmd)
 		return VM_FAULT_OOM;
+	vmf.sequence = raw_read_seqcount(&vma->vm_sequence);
 	if (pmd_none(*vmf.pmd) && transparent_hugepage_enabled(vma)) {
 		ret = create_huge_pmd(&vmf);
 		if (!(ret & VM_FAULT_FALLBACK))
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC v4 09/20] mm/spf: don't set fault entry's fields if locking failed
  2017-06-09 14:20 [RFC v4 00/20] Speculative page faults Laurent Dufour
                   ` (7 preceding siblings ...)
  2017-06-09 14:20 ` [RFC v4 08/20] mm/spf: Fix fe.sequence init in __handle_mm_fault() Laurent Dufour
@ 2017-06-09 14:20 ` Laurent Dufour
  2017-06-09 14:20 ` [RFC v4 10/20] mm/spf; fix lock dependency against mapping->i_mmap_rwsem Laurent Dufour
                   ` (13 subsequent siblings)
  22 siblings, 0 replies; 30+ messages in thread
From: Laurent Dufour @ 2017-06-09 14:20 UTC (permalink / raw)
  To: paulmck, peterz, akpm, kirill, ak, mhocko, dave, jack, Matthew Wilcox
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora

In case pte_map_lock() failed to lock the pte, or if the VMA is no
longer valid, the fault entry's fields should not be set so that the
caller won't try to unlock it.

Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
---
 mm/memory.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index f05288797c60..75d24e74c4ff 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2275,6 +2275,8 @@ static bool pte_spinlock(struct vm_fault *vmf)
 static bool pte_map_lock(struct vm_fault *vmf)
 {
 	bool ret = false;
+	pte_t *pte;
+	spinlock_t *ptl;
 
 	if (!(vmf->flags & FAULT_FLAG_SPECULATIVE)) {
 		vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
@@ -2299,18 +2301,20 @@ static bool pte_map_lock(struct vm_fault *vmf)
 	 * to invalidate TLB but this CPU has irq disabled.
 	 * Since we are in a speculative patch, accept it could fail
 	 */
-	vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
-	vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
-	if (unlikely(!spin_trylock(vmf->ptl))) {
-		pte_unmap(vmf->pte);
+	ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
+	pte = pte_offset_map(vmf->pmd, vmf->address);
+	if (unlikely(!spin_trylock(ptl))) {
+		pte_unmap(pte);
 		goto out;
 	}
 
 	if (vma_has_changed(vmf->vma, vmf->sequence)) {
-		pte_unmap_unlock(vmf->pte, vmf->ptl);
+		pte_unmap_unlock(pte, ptl);
 		goto out;
 	}
 
+	vmf->pte = pte;
+	vmf->ptl = ptl;
 	ret = true;
 out:
 	local_irq_enable();
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC v4 10/20] mm/spf; fix lock dependency against mapping->i_mmap_rwsem
  2017-06-09 14:20 [RFC v4 00/20] Speculative page faults Laurent Dufour
                   ` (8 preceding siblings ...)
  2017-06-09 14:20 ` [RFC v4 09/20] mm/spf: don't set fault entry's fields if locking failed Laurent Dufour
@ 2017-06-09 14:20 ` Laurent Dufour
  2017-06-09 14:21 ` [RFC v4 11/20] mm/spf: Protect changes to vm_flags Laurent Dufour
                   ` (12 subsequent siblings)
  22 siblings, 0 replies; 30+ messages in thread
From: Laurent Dufour @ 2017-06-09 14:20 UTC (permalink / raw)
  To: paulmck, peterz, akpm, kirill, ak, mhocko, dave, jack, Matthew Wilcox
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora

The following lockdep report was raised:

kworker/32:1/819 is trying to acquire lock:
 (&vma->vm_sequence){+.+...}, at: [<c0000000002f20e0>]
zap_page_range_single+0xd0/0x1a0

but task is already holding lock:
 (&mapping->i_mmap_rwsem){++++..}, at: [<c0000000002f229c>]
unmap_mapping_range+0x7c/0x160

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #2 (&mapping->i_mmap_rwsem){++++..}:
       down_write+0x84/0x130
       __vma_adjust+0x1f4/0xa80
       __split_vma.isra.2+0x174/0x290
       do_munmap+0x13c/0x4e0
       vm_munmap+0x64/0xb0
       elf_map+0x11c/0x130
       load_elf_binary+0x6f0/0x15f0
       search_binary_handler+0xe0/0x2a0
       do_execveat_common.isra.14+0x7fc/0xbe0
       call_usermodehelper_exec_async+0x14c/0x1d0
       ret_from_kernel_thread+0x5c/0x68

-> #1 (&vma->vm_sequence/1){+.+...}:
       __vma_adjust+0x124/0xa80
       __split_vma.isra.2+0x174/0x290
       do_munmap+0x13c/0x4e0
       vm_munmap+0x64/0xb0
       elf_map+0x11c/0x130
       load_elf_binary+0x6f0/0x15f0
       search_binary_handler+0xe0/0x2a0
       do_execveat_common.isra.14+0x7fc/0xbe0
       call_usermodehelper_exec_async+0x14c/0x1d0
       ret_from_kernel_thread+0x5c/0x68

-> #0 (&vma->vm_sequence){+.+...}:
       lock_acquire+0xf4/0x310
       unmap_page_range+0xcc/0xfa0
       zap_page_range_single+0xd0/0x1a0
       unmap_mapping_range+0x138/0x160
       truncate_pagecache+0x50/0xa0
       put_aio_ring_file+0x48/0xb0
       aio_free_ring+0x40/0x1b0
       free_ioctx+0x38/0xc0
       process_one_work+0x2cc/0x8a0
       worker_thread+0xac/0x580
       kthread+0x164/0x1b0
       ret_from_kernel_thread+0x5c/0x68

other info that might help us debug this:

Chain exists of:
  &vma->vm_sequence --> &vma->vm_sequence/1 --> &mapping->i_mmap_rwsem

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&mapping->i_mmap_rwsem);
                               lock(&vma->vm_sequence/1);
                               lock(&mapping->i_mmap_rwsem);
  lock(&vma->vm_sequence);

 *** DEADLOCK ***

To fix that, we must grab the vm_sequence lock after any mapping lock
in __vma_adjust().

Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
---
 mm/mmap.c | 22 ++++++++++++----------
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 008fc35aa75e..0cad4d9b71d8 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -707,10 +707,6 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 	long adjust_next = 0;
 	int remove_next = 0;
 
-	write_seqcount_begin(&vma->vm_sequence);
-	if (next)
-		write_seqcount_begin_nested(&next->vm_sequence, SINGLE_DEPTH_NESTING);
-
 	if (next && !insert) {
 		struct vm_area_struct *exporter = NULL, *importer = NULL;
 
@@ -818,6 +814,11 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 		}
 	}
 
+	write_seqcount_begin(&vma->vm_sequence);
+	if (next)
+		write_seqcount_begin_nested(&next->vm_sequence,
+					    SINGLE_DEPTH_NESTING);
+
 	anon_vma = vma->anon_vma;
 	if (!anon_vma && adjust_next)
 		anon_vma = next->anon_vma;
@@ -934,8 +935,6 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 			 * "vma->vm_next" gap must be updated.
 			 */
 			next = vma->vm_next;
-			if (next)
-				write_seqcount_begin_nested(&next->vm_sequence, SINGLE_DEPTH_NESTING);
 		} else {
 			/*
 			 * For the scope of the comment "next" and
@@ -952,11 +951,14 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 		if (remove_next == 2) {
 			remove_next = 1;
 			end = next->vm_end;
+			write_seqcount_end(&vma->vm_sequence);
 			goto again;
-		}
-		else if (next)
+		} else if (next) {
+			if (next != vma)
+				write_seqcount_begin_nested(&next->vm_sequence,
+							    SINGLE_DEPTH_NESTING);
 			vma_gap_update(next);
-		else {
+		} else {
 			/*
 			 * If remove_next == 2 we obviously can't
 			 * reach this path.
@@ -982,7 +984,7 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 	if (insert && file)
 		uprobe_mmap(insert);
 
-	if (next)
+	if (next && next != vma)
 		write_seqcount_end(&next->vm_sequence);
 	write_seqcount_end(&vma->vm_sequence);
 
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC v4 11/20] mm/spf: Protect changes to vm_flags
  2017-06-09 14:20 [RFC v4 00/20] Speculative page faults Laurent Dufour
                   ` (9 preceding siblings ...)
  2017-06-09 14:20 ` [RFC v4 10/20] mm/spf; fix lock dependency against mapping->i_mmap_rwsem Laurent Dufour
@ 2017-06-09 14:21 ` Laurent Dufour
  2017-06-09 14:21 ` [RFC v4 12/20] mm/spf Protect vm_policy's changes against speculative pf Laurent Dufour
                   ` (11 subsequent siblings)
  22 siblings, 0 replies; 30+ messages in thread
From: Laurent Dufour @ 2017-06-09 14:21 UTC (permalink / raw)
  To: paulmck, peterz, akpm, kirill, ak, mhocko, dave, jack, Matthew Wilcox
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora

Protect changes to the VMA's flags against the speculative page fault
handler.

Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
---
 fs/proc/task_mmu.c | 2 ++
 mm/mempolicy.c     | 2 ++
 mm/mlock.c         | 9 ++++++---
 mm/mmap.c          | 2 ++
 mm/mprotect.c      | 2 ++
 5 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index f0c8b33d99b1..9bc40620ba39 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1069,8 +1069,10 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 					goto out_mm;
 				}
 				for (vma = mm->mmap; vma; vma = vma->vm_next) {
+					write_seqcount_begin(&vma->vm_sequence);
 					vma->vm_flags &= ~VM_SOFTDIRTY;
 					vma_set_page_prot(vma);
+					write_seqcount_end(&vma->vm_sequence);
 				}
 				downgrade_write(&mm->mmap_sem);
 				break;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 37d0b334bfe9..13d32c25226c 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -606,9 +606,11 @@ unsigned long change_prot_numa(struct vm_area_struct *vma,
 {
 	int nr_updated;
 
+	write_seqcount_begin(&vma->vm_sequence);
 	nr_updated = change_protection(vma, addr, end, PAGE_NONE, 0, 1);
 	if (nr_updated)
 		count_vm_numa_events(NUMA_PTE_UPDATES, nr_updated);
+	write_seqcount_end(&vma->vm_sequence);
 
 	return nr_updated;
 }
diff --git a/mm/mlock.c b/mm/mlock.c
index b562b5523a65..30d9bfc61929 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -438,7 +438,9 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
 void munlock_vma_pages_range(struct vm_area_struct *vma,
 			     unsigned long start, unsigned long end)
 {
+	write_seqcount_begin(&vma->vm_sequence);
 	vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
+	write_seqcount_end(&vma->vm_sequence);
 
 	while (start < end) {
 		struct page *page;
@@ -563,10 +565,11 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
 	 * It's okay if try_to_unmap_one unmaps a page just after we
 	 * set VM_LOCKED, populate_vma_page_range will bring it back.
 	 */
-
-	if (lock)
+	if (lock) {
+		write_seqcount_begin(&vma->vm_sequence);
 		vma->vm_flags = newflags;
-	else
+		write_seqcount_end(&vma->vm_sequence);
+	} else
 		munlock_vma_pages_range(vma, start, end);
 
 out:
diff --git a/mm/mmap.c b/mm/mmap.c
index 0cad4d9b71d8..c4a1e1aecef3 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1746,6 +1746,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 out:
 	perf_event_mmap(vma);
 
+	write_seqcount_begin(&vma->vm_sequence);
 	vm_stat_account(mm, vm_flags, len >> PAGE_SHIFT);
 	if (vm_flags & VM_LOCKED) {
 		if (!((vm_flags & VM_SPECIAL) || is_vm_hugetlb_page(vma) ||
@@ -1768,6 +1769,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	vma->vm_flags |= VM_SOFTDIRTY;
 
 	vma_set_page_prot(vma);
+	write_seqcount_end(&vma->vm_sequence);
 
 	return addr;
 
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 8edd0d576254..1db5b0bf6952 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -345,6 +345,7 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
 	 * vm_flags and vm_page_prot are protected by the mmap_sem
 	 * held in write mode.
 	 */
+	write_seqcount_begin(&vma->vm_sequence);
 	vma->vm_flags = newflags;
 	dirty_accountable = vma_wants_writenotify(vma, vma->vm_page_prot);
 	vma_set_page_prot(vma);
@@ -360,6 +361,7 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
 			(newflags & VM_WRITE)) {
 		populate_vma_page_range(vma, start, end, NULL);
 	}
+	write_seqcount_end(&vma->vm_sequence);
 
 	vm_stat_account(mm, oldflags, -nrpages);
 	vm_stat_account(mm, newflags, nrpages);
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC v4 12/20] mm/spf Protect vm_policy's changes against speculative pf
  2017-06-09 14:20 [RFC v4 00/20] Speculative page faults Laurent Dufour
                   ` (10 preceding siblings ...)
  2017-06-09 14:21 ` [RFC v4 11/20] mm/spf: Protect changes to vm_flags Laurent Dufour
@ 2017-06-09 14:21 ` Laurent Dufour
  2017-06-09 14:21 ` [RFC v4 13/20] mm/spf: Add check on the VMA's flags Laurent Dufour
                   ` (10 subsequent siblings)
  22 siblings, 0 replies; 30+ messages in thread
From: Laurent Dufour @ 2017-06-09 14:21 UTC (permalink / raw)
  To: paulmck, peterz, akpm, kirill, ak, mhocko, dave, jack, Matthew Wilcox
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora

Mark the VMA as being modified when memory policy changes are applied to
it, so that any concurrent speculative page fault will be aborted.

Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
---
 mm/mempolicy.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 13d32c25226c..5e44b3e69a0d 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -447,8 +447,11 @@ void mpol_rebind_mm(struct mm_struct *mm, nodemask_t *new)
 	struct vm_area_struct *vma;
 
 	down_write(&mm->mmap_sem);
-	for (vma = mm->mmap; vma; vma = vma->vm_next)
+	for (vma = mm->mmap; vma; vma = vma->vm_next) {
+		write_seqcount_begin(&vma->vm_sequence);
 		mpol_rebind_policy(vma->vm_policy, new, MPOL_REBIND_ONCE);
+		write_seqcount_end(&vma->vm_sequence);
+	}
 	up_write(&mm->mmap_sem);
 }
 
@@ -711,6 +714,7 @@ static int vma_replace_policy(struct vm_area_struct *vma,
 	if (IS_ERR(new))
 		return PTR_ERR(new);
 
+	write_seqcount_begin(&vma->vm_sequence);
 	if (vma->vm_ops && vma->vm_ops->set_policy) {
 		err = vma->vm_ops->set_policy(vma, new);
 		if (err)
@@ -719,10 +723,12 @@ static int vma_replace_policy(struct vm_area_struct *vma,
 
 	old = vma->vm_policy;
 	vma->vm_policy = new; /* protected by mmap_sem */
+	write_seqcount_end(&vma->vm_sequence);
 	mpol_put(old);
 
 	return 0;
  err_out:
+	write_seqcount_end(&vma->vm_sequence);
 	mpol_put(new);
 	return err;
 }
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC v4 13/20] mm/spf: Add check on the VMA's flags
  2017-06-09 14:20 [RFC v4 00/20] Speculative page faults Laurent Dufour
                   ` (11 preceding siblings ...)
  2017-06-09 14:21 ` [RFC v4 12/20] mm/spf Protect vm_policy's changes against speculative pf Laurent Dufour
@ 2017-06-09 14:21 ` Laurent Dufour
  2017-06-09 14:21 ` [RFC v4 14/20] mm/spf: protect madvise vs speculative pf Laurent Dufour
                   ` (9 subsequent siblings)
  22 siblings, 0 replies; 30+ messages in thread
From: Laurent Dufour @ 2017-06-09 14:21 UTC (permalink / raw)
  To: paulmck, peterz, akpm, kirill, ak, mhocko, dave, jack, Matthew Wilcox
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora

When handling a speculative page fault, we should check the VMA's access
permissions as is done in handle_mm_fault() or in access_error() in x86's
fault handler.

Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
---
 mm/memory.c | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index 75d24e74c4ff..27e44ebc5440 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3966,6 +3966,30 @@ int handle_speculative_fault(struct mm_struct *mm, unsigned long address,
 	if (address < vma->vm_start || vma->vm_end <= address)
 		goto unlock;
 
+	/* XXX Could we handle huge page here ? */
+	if (unlikely(is_vm_hugetlb_page(vma)))
+		goto unlock;
+
+	/*
+	 * The three following checks are copied from access_error from
+	 * arch/x86/mm/fault.c
+	 * XXX they may not be applicable to all architectures
+	 */
+	if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
+				       flags & FAULT_FLAG_INSTRUCTION,
+				       flags & FAULT_FLAG_REMOTE))
+		goto unlock;
+
+	/* This one is required to check that the VMA has write access set */
+	if (flags & FAULT_FLAG_WRITE) {
+		if (unlikely(!(vma->vm_flags & VM_WRITE)))
+			goto unlock;
+	} else {
+		/* XXX This may not be required */
+		if (unlikely(!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE))))
+			goto unlock;
+	}
+
 	/*
 	 * We need to re-validate the VMA after checking the bounds, otherwise
 	 * we might have a false positive on the bounds.
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC v4 14/20] mm/spf: protect madvise vs speculative pf
  2017-06-09 14:20 [RFC v4 00/20] Speculative page faults Laurent Dufour
                   ` (12 preceding siblings ...)
  2017-06-09 14:21 ` [RFC v4 13/20] mm/spf: Add check on the VMA's flags Laurent Dufour
@ 2017-06-09 14:21 ` Laurent Dufour
  2017-06-09 14:21 ` [RFC v4 15/20] mm/spf: protect mremap() against " Laurent Dufour
                   ` (8 subsequent siblings)
  22 siblings, 0 replies; 30+ messages in thread
From: Laurent Dufour @ 2017-06-09 14:21 UTC (permalink / raw)
  To: paulmck, peterz, akpm, kirill, ak, mhocko, dave, jack, Matthew Wilcox
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora

This patch protects madvise()'s effects against the speculative page
fault handler.

Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
---
 mm/madvise.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/mm/madvise.c b/mm/madvise.c
index 25b78ee4fc2c..d1fa6a7ee604 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -172,7 +172,9 @@ static long madvise_behavior(struct vm_area_struct *vma,
 	/*
 	 * vm_flags is protected by the mmap_sem held in write mode.
 	 */
+	write_seqcount_begin(&vma->vm_sequence);
 	vma->vm_flags = new_flags;
+	write_seqcount_end(&vma->vm_sequence);
 out:
 	return error;
 }
@@ -439,9 +441,11 @@ static void madvise_free_page_range(struct mmu_gather *tlb,
 		.private = tlb,
 	};
 
+	write_seqcount_begin(&vma->vm_sequence);
 	tlb_start_vma(tlb, vma);
 	walk_page_range(addr, end, &free_walk);
 	tlb_end_vma(tlb, vma);
+	write_seqcount_end(&vma->vm_sequence);
 }
 
 static int madvise_free_single_vma(struct vm_area_struct *vma,
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC v4 15/20] mm/spf: protect mremap() against speculative pf
  2017-06-09 14:20 [RFC v4 00/20] Speculative page faults Laurent Dufour
                   ` (13 preceding siblings ...)
  2017-06-09 14:21 ` [RFC v4 14/20] mm/spf: protect madvise vs speculative pf Laurent Dufour
@ 2017-06-09 14:21 ` Laurent Dufour
  2017-06-09 14:21 ` [RFC v4 16/20] mm/spf: Don't call user fault callback in the speculative path Laurent Dufour
                   ` (7 subsequent siblings)
  22 siblings, 0 replies; 30+ messages in thread
From: Laurent Dufour @ 2017-06-09 14:21 UTC (permalink / raw)
  To: paulmck, peterz, akpm, kirill, ak, mhocko, dave, jack, Matthew Wilcox
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora

mremap() modifies the VMA layout and thus must be protected against the
speculative page fault handler.
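
Note that move_vma() write-protects two VMAs at once and both vm_sequence
counters belong to the same lockdep class, so the second one is entered
with the _nested variant to avoid a false positive lockdep report. A
minimal illustration of the pattern, using hypothetical vma_a/vma_b
rather than the actual variables below:

	write_seqcount_begin(&vma_a->vm_sequence);
	write_seqcount_begin_nested(&vma_b->vm_sequence, SINGLE_DEPTH_NESTING);
	/* ... modify both VMAs ... */
	write_seqcount_end(&vma_b->vm_sequence);
	write_seqcount_end(&vma_a->vm_sequence);

This mirrors the way spin_lock_nested() is used when two locks of the
same class have to be held at the same time.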

Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
---
 mm/mremap.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/mm/mremap.c b/mm/mremap.c
index cd8a1b199ef9..9c7f69c9e80f 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -300,6 +300,10 @@ static unsigned long move_vma(struct vm_area_struct *vma,
 	if (!new_vma)
 		return -ENOMEM;
 
+	write_seqcount_begin(&vma->vm_sequence);
+	write_seqcount_begin_nested(&new_vma->vm_sequence,
+				    SINGLE_DEPTH_NESTING);
+
 	moved_len = move_page_tables(vma, old_addr, new_vma, new_addr, old_len,
 				     need_rmap_locks);
 	if (moved_len < old_len) {
@@ -316,6 +320,7 @@ static unsigned long move_vma(struct vm_area_struct *vma,
 		 */
 		move_page_tables(new_vma, new_addr, vma, old_addr, moved_len,
 				 true);
+		write_seqcount_end(&vma->vm_sequence);
 		vma = new_vma;
 		old_len = new_len;
 		old_addr = new_addr;
@@ -324,7 +329,9 @@ static unsigned long move_vma(struct vm_area_struct *vma,
 		mremap_userfaultfd_prep(new_vma, uf);
 		arch_remap(mm, old_addr, old_addr + old_len,
 			   new_addr, new_addr + new_len);
+		write_seqcount_end(&vma->vm_sequence);
 	}
+	write_seqcount_end(&new_vma->vm_sequence);
 
 	/* Conceal VM_ACCOUNT so old reservation is not undone */
 	if (vm_flags & VM_ACCOUNT) {
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC v4 16/20] mm/spf: Don't call user fault callback in the speculative path
  2017-06-09 14:20 [RFC v4 00/20] Speculative page faults Laurent Dufour
                   ` (14 preceding siblings ...)
  2017-06-09 14:21 ` [RFC v4 15/20] mm/spf: protect mremap() against " Laurent Dufour
@ 2017-06-09 14:21 ` Laurent Dufour
  2017-06-09 14:21 ` [RFC v4 17/20] x86/mm: Add speculative pagefault handling Laurent Dufour
                   ` (6 subsequent siblings)
  22 siblings, 0 replies; 30+ messages in thread
From: Laurent Dufour @ 2017-06-09 14:21 UTC (permalink / raw)
  To: paulmck, peterz, akpm, kirill, ak, mhocko, dave, jack, Matthew Wilcox
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora

The handle_userfault() function assumes that the mmap_sem is held, which
is not true when handling a speculative page fault.

In that case, return VM_FAULT_RETRY so that the fault is retried through
the usual path, where handle_userfault() can safely be called.

Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
---
 mm/memory.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index 27e44ebc5440..5b158549789b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3005,6 +3005,8 @@ static int do_anonymous_page(struct vm_fault *vmf)
 		/* Deliver the page fault to userland, check inside PT lock */
 		if (userfaultfd_missing(vma)) {
 			pte_unmap_unlock(vmf->pte, vmf->ptl);
+			if (vmf->flags & FAULT_FLAG_SPECULATIVE)
+				return VM_FAULT_RETRY;
 			return handle_userfault(vmf, VM_UFFD_MISSING);
 		}
 		goto setpte;
@@ -3045,6 +3047,8 @@ static int do_anonymous_page(struct vm_fault *vmf)
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
 		mem_cgroup_cancel_charge(page, memcg, false);
 		put_page(page);
+		if (vmf->flags & FAULT_FLAG_SPECULATIVE)
+			return VM_FAULT_RETRY;
 		return handle_userfault(vmf, VM_UFFD_MISSING);
 	}
 
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC v4 17/20] x86/mm: Add speculative pagefault handling
  2017-06-09 14:20 [RFC v4 00/20] Speculative page faults Laurent Dufour
                   ` (15 preceding siblings ...)
  2017-06-09 14:21 ` [RFC v4 16/20] mm/spf: Don't call user fault callback in the speculative path Laurent Dufour
@ 2017-06-09 14:21 ` Laurent Dufour
  2017-06-09 14:21 ` [RFC v4 18/20] x86/mm: Update the handle_speculative_fault's path Laurent Dufour
                   ` (5 subsequent siblings)
  22 siblings, 0 replies; 30+ messages in thread
From: Laurent Dufour @ 2017-06-09 14:21 UTC (permalink / raw)
  To: paulmck, peterz, akpm, kirill, ak, mhocko, dave, jack, Matthew Wilcox
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora

From: Peter Zijlstra <peterz@infradead.org>

Try a speculative fault before acquiring the mmap_sem; if it returns
VM_FAULT_RETRY, continue with the mmap_sem acquisition and do the
traditional fault.
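
A condensed view of the resulting __do_page_fault() flow, with the error
handling and retry details elided (this is a sketch, not the exact code):

	if (error_code & PF_USER) {
		/* No mmap_sem taken yet: try the speculative path first. */
		fault = handle_speculative_fault(mm, address,
					flags & ~FAULT_FLAG_ALLOW_RETRY);
		if (fault & VM_FAULT_RETRY)
			goto retry;	/* fall back to the mmap_sem path */
		goto done;		/* handled (or failed) speculatively */
	}
	/* ... kernel-mode checks, mmap_sem trylock, etc. ... */
retry:
	down_read(&mm->mmap_sem);
	vma = find_vma(mm, address);
	/* ... classic VMA checks ... */
	fault = handle_mm_fault(vma, address, flags);
	up_read(&mm->mmap_sem);
done:
	/* VM_FAULT_ERROR handling and major/minor fault accounting */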

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/mm/fault.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 8ad91a01cbc8..0c0b45dfda76 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1365,6 +1365,16 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
 	if (error_code & PF_INSTR)
 		flags |= FAULT_FLAG_INSTRUCTION;
 
+	if (error_code & PF_USER) {
+		fault = handle_speculative_fault(mm, address,
+					flags & ~FAULT_FLAG_ALLOW_RETRY);
+
+		if (fault & VM_FAULT_RETRY)
+			goto retry;
+
+		goto done;
+	}
+
 	/*
 	 * When running in the kernel we expect faults to occur only to
 	 * addresses in user space.  All other faults represent errors in
@@ -1468,7 +1478,15 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
 		return;
 	}
 
+	if (unlikely(fault & VM_FAULT_RETRY)) {
+		if (fatal_signal_pending(current))
+			return;
+
+		goto done;
+	}
+
 	up_read(&mm->mmap_sem);
+done:
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		mm_fault_error(regs, error_code, address, vma, fault);
 		return;
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC v4 18/20] x86/mm: Update the handle_speculative_fault's path
  2017-06-09 14:20 [RFC v4 00/20] Speculative page faults Laurent Dufour
                   ` (16 preceding siblings ...)
  2017-06-09 14:21 ` [RFC v4 17/20] x86/mm: Add speculative pagefault handling Laurent Dufour
@ 2017-06-09 14:21 ` Laurent Dufour
  2017-06-09 14:21 ` [RFC v4 19/20] powerpc/mm: Add speculative page fault Laurent Dufour
                   ` (4 subsequent siblings)
  22 siblings, 0 replies; 30+ messages in thread
From: Laurent Dufour @ 2017-06-09 14:21 UTC (permalink / raw)
  To: paulmck, peterz, akpm, kirill, ak, mhocko, dave, jack, Matthew Wilcox
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora

If handle_speculative_fault() fails with VM_FAULT_ERROR, fall back to the
slow path so that the signal can be delivered (mm_fault_error() needs a
valid vma pointer).

Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
---
 arch/x86/mm/fault.c | 21 +++++++++------------
 1 file changed, 9 insertions(+), 12 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 0c0b45dfda76..02c0b884ca18 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1369,10 +1369,14 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
 		fault = handle_speculative_fault(mm, address,
 					flags & ~FAULT_FLAG_ALLOW_RETRY);
 
-		if (fault & VM_FAULT_RETRY)
-			goto retry;
-
-		goto done;
+		/*
+		 * We also check against VM_FAULT_ERROR because we have to
+		 * raise a signal by calling later mm_fault_error() which
+		 * requires the vma pointer to be set. So in that case,
+		 * we fall through the normal path.
+		 */
+		if (!(fault & VM_FAULT_RETRY || fault & VM_FAULT_ERROR))
+			goto done;
 	}
 
 	/*
@@ -1478,20 +1482,13 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
 		return;
 	}
 
-	if (unlikely(fault & VM_FAULT_RETRY)) {
-		if (fatal_signal_pending(current))
-			return;
-
-		goto done;
-	}
-
 	up_read(&mm->mmap_sem);
-done:
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		mm_fault_error(regs, error_code, address, vma, fault);
 		return;
 	}
 
+done:
 	/*
 	 * Major/minor page fault accounting. If any of the events
 	 * returned VM_FAULT_MAJOR, we account it as a major fault.
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC v4 19/20] powerpc/mm: Add speculative page fault
  2017-06-09 14:20 [RFC v4 00/20] Speculative page faults Laurent Dufour
                   ` (17 preceding siblings ...)
  2017-06-09 14:21 ` [RFC v4 18/20] x86/mm: Update the handle_speculative_fault's path Laurent Dufour
@ 2017-06-09 14:21 ` Laurent Dufour
  2017-06-09 14:21 ` [RFC v4 20/20] mm/spf: Clear FAULT_FLAG_KILLABLE in the speculative path Laurent Dufour
                   ` (3 subsequent siblings)
  22 siblings, 0 replies; 30+ messages in thread
From: Laurent Dufour @ 2017-06-09 14:21 UTC (permalink / raw)
  To: paulmck, peterz, akpm, kirill, ak, mhocko, dave, jack, Matthew Wilcox
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora

This patch enables the speculative page fault on the PowerPC
architecture.

It tries a speculative page fault without holding the mmap_sem; if that
returns VM_FAULT_RETRY, the mmap_sem is acquired and the traditional page
fault processing is done.

Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
---
 arch/powerpc/mm/fault.c | 26 +++++++++++++++++++++++++-
 1 file changed, 25 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 3a7d580fdc59..6dd6a50f412f 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -290,9 +290,32 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
 	if (!is_exec && user_mode(regs))
 		store_update_sp = store_updates_sp(regs);
 
-	if (user_mode(regs))
+	if (user_mode(regs)) {
 		flags |= FAULT_FLAG_USER;
 
+		/* let's try a speculative page fault without grabbing the
+		 * mmap_sem.
+		 */
+
+		/*
+		 * flags is set later based on the VMA's flags, for the common
+		 * speculative service, we need some flags to be set.
+		 */
+		if (is_write)
+			flags |= FAULT_FLAG_WRITE;
+
+		fault = handle_speculative_fault(mm, address,
+					 flags & ~FAULT_FLAG_ALLOW_RETRY);
+		if (!(fault & VM_FAULT_RETRY || fault & VM_FAULT_ERROR))
+			goto done;
+
+		/*
+		 * Resetting flags since the following code assumes
+		 * FAULT_FLAG_WRITE is not set.
+		 */
+		flags &= ~FAULT_FLAG_WRITE;
+	}
+
 	/* When running in the kernel we expect faults to occur only to
 	 * addresses in user space.  All other faults represent errors in the
 	 * kernel and should generate an OOPS.  Unfortunately, in the case of an
@@ -478,6 +501,7 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
 			rc = 0;
 	}
 
+done:
 	/*
 	 * Major/minor page fault accounting.
 	 */
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC v4 20/20] mm/spf: Clear FAULT_FLAG_KILLABLE in the speculative path
  2017-06-09 14:20 [RFC v4 00/20] Speculative page faults Laurent Dufour
                   ` (18 preceding siblings ...)
  2017-06-09 14:21 ` [RFC v4 19/20] powerpc/mm: Add speculative page fault Laurent Dufour
@ 2017-06-09 14:21 ` Laurent Dufour
  2017-06-09 15:01 ` [RFC v4 00/20] Speculative page faults Michal Hocko
                   ` (2 subsequent siblings)
  22 siblings, 0 replies; 30+ messages in thread
From: Laurent Dufour @ 2017-06-09 14:21 UTC (permalink / raw)
  To: paulmck, peterz, akpm, kirill, ak, mhocko, dave, jack, Matthew Wilcox
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora

The FAULT_FLAG_KILLABLE flag must be cleared so that the mmap_sem, which
is not held on the speculative path, cannot be released in
__lock_page_or_retry().

In this patch the clearing of FAULT_FLAG_ALLOW_RETRY is also moved into
handle_speculative_fault() since this has to be done for all
architectures.
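
For reference, the logic in __lock_page_or_retry() looks roughly like the
following in v4.12 (a simplified sketch, not the exact code); with either
flag set it may end up calling up_read() on a mmap_sem that the
speculative path never acquired:

	if (flags & FAULT_FLAG_ALLOW_RETRY) {
		if (flags & FAULT_FLAG_RETRY_NOWAIT)
			return 0;
		up_read(&mm->mmap_sem);		/* releases the mmap_sem */
		/* ... wait for the page to be unlocked ... */
		return 0;
	}
	if (flags & FAULT_FLAG_KILLABLE) {
		if (__lock_page_killable(page)) {
			/* Fatal signal: release the mmap_sem and give up. */
			up_read(&mm->mmap_sem);
			return 0;
		}
		return 1;
	}
	__lock_page(page);
	return 1;

This is why both FAULT_FLAG_ALLOW_RETRY and FAULT_FLAG_KILLABLE are
masked out as soon as handle_speculative_fault() is entered.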

Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
---
 arch/powerpc/mm/fault.c | 3 +--
 arch/x86/mm/fault.c     | 3 +--
 mm/memory.c             | 6 +++++-
 3 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 6dd6a50f412f..4b6d0ed517ca 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -304,8 +304,7 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
 		if (is_write)
 			flags |= FAULT_FLAG_WRITE;
 
-		fault = handle_speculative_fault(mm, address,
-					 flags & ~FAULT_FLAG_ALLOW_RETRY);
+		fault = handle_speculative_fault(mm, address, flags);
 		if (!(fault & VM_FAULT_RETRY || fault & VM_FAULT_ERROR))
 			goto done;
 
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 02c0b884ca18..c62a7ea5e27b 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1366,8 +1366,7 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
 		flags |= FAULT_FLAG_INSTRUCTION;
 
 	if (error_code & PF_USER) {
-		fault = handle_speculative_fault(mm, address,
-					flags & ~FAULT_FLAG_ALLOW_RETRY);
+		fault = handle_speculative_fault(mm, address, flags);
 
 		/*
 		 * We also check against VM_FAULT_ERROR because we have to
diff --git a/mm/memory.c b/mm/memory.c
index 5b158549789b..35a311b0d314 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3945,7 +3945,6 @@ int handle_speculative_fault(struct mm_struct *mm, unsigned long address,
 {
 	struct vm_fault vmf = {
 		.address = address,
-		.flags = flags | FAULT_FLAG_SPECULATIVE,
 	};
 	pgd_t *pgd;
 	p4d_t *p4d;
@@ -3954,6 +3953,10 @@ int handle_speculative_fault(struct mm_struct *mm, unsigned long address,
 	int dead, seq, idx, ret = VM_FAULT_RETRY;
 	struct vm_area_struct *vma;
 
+	/* Clear flags that may lead to release the mmap_sem to retry */
+	flags &= ~(FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_KILLABLE);
+	flags |= FAULT_FLAG_SPECULATIVE;
+
 	idx = srcu_read_lock(&vma_srcu);
 	vma = find_vma_srcu(mm, address);
 	if (!vma)
@@ -4040,6 +4043,7 @@ int handle_speculative_fault(struct mm_struct *mm, unsigned long address,
 	vmf.pgoff = linear_page_index(vma, address);
 	vmf.gfp_mask = __get_fault_gfp_mask(vma);
 	vmf.sequence = seq;
+	vmf.flags = flags;
 
 	local_irq_enable();
 
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [RFC v4 00/20] Speculative page faults
  2017-06-09 14:20 [RFC v4 00/20] Speculative page faults Laurent Dufour
                   ` (19 preceding siblings ...)
  2017-06-09 14:21 ` [RFC v4 20/20] mm/spf: Clear FAULT_FLAG_KILLABLE in the speculative path Laurent Dufour
@ 2017-06-09 15:01 ` Michal Hocko
  2017-06-09 15:25   ` Laurent Dufour
  2017-06-09 18:55 ` Paul E. McKenney
  2017-06-12 10:20 ` Jan Kara
  22 siblings, 1 reply; 30+ messages in thread
From: Michal Hocko @ 2017-06-09 15:01 UTC (permalink / raw)
  To: Laurent Dufour
  Cc: paulmck, peterz, akpm, kirill, ak, dave, jack, Matthew Wilcox,
	linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora

On Fri 09-06-17 16:20:49, Laurent Dufour wrote:
> This is a port on kernel 4.12 of the work done by Peter Zijlstra to
> handle page fault without holding the mm semaphore.
> 
> http://linux-kernel.2935.n7.nabble.com/RFC-PATCH-0-6-Another-go-at-speculative-page-faults-tt965642.html#none
> 
> Compared to the Peter initial work, this series introduce a try spin
> lock when dealing with speculative page fault. This is required to
> avoid dead lock when handling a page fault while a TLB invalidate is
> requested by an other CPU holding the PTE. Another change due to a
> lock dependency issue with mapping->i_mmap_rwsem.
> 
> This series also protect changes to VMA's data which are read or
> change by the page fault handler. The protections is done through the
> VMA's sequence number.
> 
> This series is functional on x86 and PowerPC.
> 
> It's building on top of v4.12-rc4 and relies on the change done by
> Paul McKenney to the SRCU code allowing better performance by
> maintaining per-CPU callback lists:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=da915ad5cf25b5f5d358dd3670c3378d8ae8c03e
> 
> Tests have been made using a large commercial in-memory database on a
> PowerPC system with 752 CPUs. The results are very encouraging since
> the loading of the 2TB database was faster by 20% with the speculative
> page fault.
> 
> Since tests are encouraging and running test suite didn't raise any
> issue, I'd like this request for comment series to move to a patch
> series soon. So please feel free to comment.

What other testing have you done? Other benchmarks (some numbers)? What
about some standard workloads like kbench? This is a pretty invasive
change so I would expect much more numbers.

It would also help to describe the highlevel design of the change here
in the cover letter. This would make the review of specifics much
easier.

Thanks!
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC v4 00/20] Speculative page faults
  2017-06-09 15:01 ` [RFC v4 00/20] Speculative page faults Michal Hocko
@ 2017-06-09 15:25   ` Laurent Dufour
  2017-06-09 16:35     ` Michal Hocko
  0 siblings, 1 reply; 30+ messages in thread
From: Laurent Dufour @ 2017-06-09 15:25 UTC (permalink / raw)
  To: Michal Hocko
  Cc: paulmck, peterz, akpm, kirill, ak, dave, jack, Matthew Wilcox,
	linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora

On 09/06/2017 17:01, Michal Hocko wrote:
> On Fri 09-06-17 16:20:49, Laurent Dufour wrote:
>> This is a port on kernel 4.12 of the work done by Peter Zijlstra to
>> handle page fault without holding the mm semaphore.
>>
>> http://linux-kernel.2935.n7.nabble.com/RFC-PATCH-0-6-Another-go-at-speculative-page-faults-tt965642.html#none
>>
>> Compared to the Peter initial work, this series introduce a try spin
>> lock when dealing with speculative page fault. This is required to
>> avoid dead lock when handling a page fault while a TLB invalidate is
>> requested by an other CPU holding the PTE. Another change due to a
>> lock dependency issue with mapping->i_mmap_rwsem.
>>
>> This series also protect changes to VMA's data which are read or
>> change by the page fault handler. The protections is done through the
>> VMA's sequence number.
>>
>> This series is functional on x86 and PowerPC.
>>
>> It's building on top of v4.12-rc4 and relies on the change done by
>> Paul McKenney to the SRCU code allowing better performance by
>> maintaining per-CPU callback lists:
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=da915ad5cf25b5f5d358dd3670c3378d8ae8c03e
>>
>> Tests have been made using a large commercial in-memory database on a
>> PowerPC system with 752 CPUs. The results are very encouraging since
>> the loading of the 2TB database was faster by 20% with the speculative
>> page fault.
>>
>> Since tests are encouraging and running test suite didn't raise any
>> issue, I'd like this request for comment series to move to a patch
>> series soon. So please feel free to comment.
> 
> What other testing have you done? Other benchmarks (some numbers)? What
> about some standard workloads like kbench? This is a pretty invasive
> change so I would expect much more numbers.

Thanks Michal for your feedback.

I mostly focused on this database workload since this is the one where
we hit the mmap_sem bottleneck when running on big node. On my usual
victim node, I checked for basic usage like kernel build time, but I
agree that's clearly not enough.

I try to find details about the 'kbench' you mentioned, but I didn't get
any valid entry.
Would you please point me on this or any other bench tool you think will
be useful here ?

> 
> It would also help to describe the highlevel design of the change here
> in the cover letter. This would make the review of specifics much
> easier.

You're right, I'll try to make a highlevel design.

Thanks,
Laurent.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC v4 00/20] Speculative page faults
  2017-06-09 15:25   ` Laurent Dufour
@ 2017-06-09 16:35     ` Michal Hocko
  2017-06-09 16:59       ` Tim Chen
  2017-06-13  9:58       ` Laurent Dufour
  0 siblings, 2 replies; 30+ messages in thread
From: Michal Hocko @ 2017-06-09 16:35 UTC (permalink / raw)
  To: Laurent Dufour
  Cc: paulmck, peterz, akpm, kirill, ak, dave, jack, Matthew Wilcox,
	linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora

On Fri 09-06-17 17:25:51, Laurent Dufour wrote:
[...]
> Thanks Michal for your feedback.
> 
> I mostly focused on this database workload since this is the one where
> we hit the mmap_sem bottleneck when running on big node. On my usual
> victim node, I checked for basic usage like kernel build time, but I
> agree that's clearly not enough.
> 
> I try to find details about the 'kbench' you mentioned, but I didn't get
> any valid entry.
> Would you please point me on this or any other bench tool you think will
> be useful here ?

Sorry I meant kernbench (aka parallel kernel build). Other highly threaded
workloads doing a lot of page faults and address space modification
would be good to see as well. I wish I could give you much more
comprehensive list but I am not very good at benchmarks.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC v4 00/20] Speculative page faults
  2017-06-09 16:35     ` Michal Hocko
@ 2017-06-09 16:59       ` Tim Chen
  2017-06-13 10:19         ` Laurent Dufour
  2017-06-13  9:58       ` Laurent Dufour
  1 sibling, 1 reply; 30+ messages in thread
From: Tim Chen @ 2017-06-09 16:59 UTC (permalink / raw)
  To: Michal Hocko, Laurent Dufour
  Cc: paulmck, peterz, akpm, kirill, ak, dave, jack, Matthew Wilcox,
	linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora

On 06/09/2017 09:35 AM, Michal Hocko wrote:
> On Fri 09-06-17 17:25:51, Laurent Dufour wrote:
> [...]
>> Thanks Michal for your feedback.
>>
>> I mostly focused on this database workload since this is the one where
>> we hit the mmap_sem bottleneck when running on big node. On my usual
>> victim node, I checked for basic usage like kernel build time, but I
>> agree that's clearly not enough.
>>
>> I try to find details about the 'kbench' you mentioned, but I didn't get
>> any valid entry.
>> Would you please point me on this or any other bench tool you think will
>> be useful here ?
> 
>> Sorry I meant kernbench (aka parallel kernel build). Other highly threaded
> workloads doing a lot of page faults and address space modification
> would be good to see as well. I wish I could give you much more
> comprehensive list but I am not very good at benchmarks.
> 

Laurent,

Have you tried running the multi-fault microbenchmark by Kamezawa?
It does threaded page faults in parallel.
Peter ran that when he posted his speculative page faults patches.
https://lkml.org/lkml/2010/1/6/28

Thanks.

Tim


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC v4 00/20] Speculative page faults
  2017-06-09 14:20 [RFC v4 00/20] Speculative page faults Laurent Dufour
                   ` (20 preceding siblings ...)
  2017-06-09 15:01 ` [RFC v4 00/20] Speculative page faults Michal Hocko
@ 2017-06-09 18:55 ` Paul E. McKenney
  2017-06-12 10:20 ` Jan Kara
  22 siblings, 0 replies; 30+ messages in thread
From: Paul E. McKenney @ 2017-06-09 18:55 UTC (permalink / raw)
  To: Laurent Dufour
  Cc: peterz, akpm, kirill, ak, mhocko, dave, jack, Matthew Wilcox,
	linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora

On Fri, Jun 09, 2017 at 04:20:49PM +0200, Laurent Dufour wrote:
> This is a port on kernel 4.12 of the work done by Peter Zijlstra to
> handle page fault without holding the mm semaphore.
> 
> http://linux-kernel.2935.n7.nabble.com/RFC-PATCH-0-6-Another-go-at-speculative-page-faults-tt965642.html#none
> 
> Compared to the Peter initial work, this series introduce a try spin
> lock when dealing with speculative page fault. This is required to
> avoid dead lock when handling a page fault while a TLB invalidate is
> requested by an other CPU holding the PTE. Another change due to a
> lock dependency issue with mapping->i_mmap_rwsem.
> 
> This series also protect changes to VMA's data which are read or
> change by the page fault handler. The protections is done through the
> VMA's sequence number.
> 
> This series is functional on x86 and PowerPC.
> 
> It's building on top of v4.12-rc4 and relies on the change done by
> Paul McKenney to the SRCU code allowing better performance by
> maintaining per-CPU callback lists:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=da915ad5cf25b5f5d358dd3670c3378d8ae8c03e
> 
> Tests have been made using a large commercial in-memory database on a
> PowerPC system with 752 CPUs. The results are very encouraging since
> the loading of the 2TB database was faster by 20% with the speculative
> page fault.
> 
> Since tests are encouraging and running test suite didn't raise any
> issue, I'd like this request for comment series to move to a patch
> series soon. So please feel free to comment.

For whatever it is worth, this series passes moderate rcutorture testing,
30 minutes on each of sixteen scenarios.  Not that rcutorture is set up
to find mmap_sem issues, but it does place some stress on the kernel as
a whole.

Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

							Thanx, Paul

> Changes since V3:
>  - support for the 5-level paging.
>  - abort speculative path before entering userfault code
>  - support for PowerPC architecture
>  - reorder the patch to fix build test errors.
> 
> Laurent Dufour (14):
>   mm: Introduce pte_spinlock
>   mm/spf: Try spin lock in speculative path
>   mm/spf: Fix fe.sequence init in __handle_mm_fault()
>   mm/spf: don't set fault entry's fields if locking failed
>   mm/spf; fix lock dependency against mapping->i_mmap_rwsem
>   mm/spf: Protect changes to vm_flags
>   mm/spf Protect vm_policy's changes against speculative pf
>   mm/spf: Add check on the VMA's flags
>   mm/spf: protect madvise vs speculative pf
>   mm/spf: protect mremap() against speculative pf
>   mm/spf: Don't call user fault callback in the speculative path
>   x86/mm: Update the handle_speculative_fault's path
>   powerpc/mm: Add speculative page fault
>   mm/spf: Clear FAULT_FLAG_KILLABLE in the speculative path
> 
> Peter Zijlstra (6):
>   mm: Dont assume page-table invariance during faults
>   mm: Prepare for FAULT_FLAG_SPECULATIVE
>   mm: VMA sequence count
>   mm: RCU free VMAs
>   mm: Provide speculative fault infrastructure
>   x86/mm: Add speculative pagefault handling
> 
>  arch/powerpc/mm/fault.c  |  25 +++-
>  arch/x86/mm/fault.c      |  14 +++
>  fs/proc/task_mmu.c       |   2 +
>  include/linux/mm.h       |   4 +
>  include/linux/mm_types.h |   3 +
>  kernel/fork.c            |   1 +
>  mm/init-mm.c             |   1 +
>  mm/internal.h            |  20 ++++
>  mm/madvise.c             |   4 +
>  mm/memory.c              | 291 +++++++++++++++++++++++++++++++++++++++--------
>  mm/mempolicy.c           |  10 +-
>  mm/mlock.c               |   9 +-
>  mm/mmap.c                | 123 +++++++++++++++-----
>  mm/mprotect.c            |   2 +
>  mm/mremap.c              |   7 ++
>  15 files changed, 435 insertions(+), 81 deletions(-)
> 
> -- 
> 2.7.4
> 


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC v4 00/20] Speculative page faults
  2017-06-09 14:20 [RFC v4 00/20] Speculative page faults Laurent Dufour
                   ` (21 preceding siblings ...)
  2017-06-09 18:55 ` Paul E. McKenney
@ 2017-06-12 10:20 ` Jan Kara
  2017-06-13 10:24   ` Laurent Dufour
  22 siblings, 1 reply; 30+ messages in thread
From: Jan Kara @ 2017-06-12 10:20 UTC (permalink / raw)
  To: Laurent Dufour
  Cc: paulmck, peterz, akpm, kirill, ak, mhocko, dave, jack,
	Matthew Wilcox, linux-kernel, linux-mm, haren, khandual, npiggin,
	bsingharora

Hello,

On Fri 09-06-17 16:20:49, Laurent Dufour wrote:
> This is a port on kernel 4.12 of the work done by Peter Zijlstra to
> handle page fault without holding the mm semaphore.
> 
> http://linux-kernel.2935.n7.nabble.com/RFC-PATCH-0-6-Another-go-at-speculative-page-faults-tt965642.html#none
> 
> Compared to the Peter initial work, this series introduce a try spin
> lock when dealing with speculative page fault. This is required to
> avoid dead lock when handling a page fault while a TLB invalidate is
> requested by an other CPU holding the PTE. Another change due to a
> lock dependency issue with mapping->i_mmap_rwsem.
> 
> This series also protect changes to VMA's data which are read or
> change by the page fault handler. The protections is done through the
> VMA's sequence number.
> 
> This series is functional on x86 and PowerPC.
> 
> It's building on top of v4.12-rc4 and relies on the change done by
> Paul McKenney to the SRCU code allowing better performance by
> maintaining per-CPU callback lists:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=da915ad5cf25b5f5d358dd3670c3378d8ae8c03e
> 
> Tests have been made using a large commercial in-memory database on a
> PowerPC system with 752 CPUs. The results are very encouraging since
> the loading of the 2TB database was faster by 20% with the speculative
> page fault.
> 
> Since tests are encouraging and running test suite didn't raise any
> issue, I'd like this request for comment series to move to a patch
> series soon. So please feel free to comment.

I had a look at the series and I have one comment regarding the whole
structure of the series: Instead of taking original Peter's patches and
then fixing up various problems with them, either fold the fixes into
original patches which introduced problems (this would make sense for
example for the lock inversion issue you fix) or just put these changes
to a place in the series where they logically belong - e.g. VMA is
protected by the sequence counter in patch 4 and then you add various
places that were missed later in the series. Instead of this just handle
sequence count protection in consecutive logical steps like vma_adjust()
changes, mremap() changes, munmap() changes, vma->flags protection, ...

Also the amount of 'XXX' comments seems to be a bit too high and these should be
addressed.

								Honza
> 
> Changes since V3:
>  - support for the 5-level paging.
>  - abort speculative path before entering userfault code
>  - support for PowerPC architecture
>  - reorder the patch to fix build test errors.
> 
> Laurent Dufour (14):
>   mm: Introduce pte_spinlock
>   mm/spf: Try spin lock in speculative path
>   mm/spf: Fix fe.sequence init in __handle_mm_fault()
>   mm/spf: don't set fault entry's fields if locking failed
>   mm/spf; fix lock dependency against mapping->i_mmap_rwsem
>   mm/spf: Protect changes to vm_flags
>   mm/spf Protect vm_policy's changes against speculative pf
>   mm/spf: Add check on the VMA's flags
>   mm/spf: protect madvise vs speculative pf
>   mm/spf: protect mremap() against speculative pf
>   mm/spf: Don't call user fault callback in the speculative path
>   x86/mm: Update the handle_speculative_fault's path
>   powerpc/mm: Add speculative page fault
>   mm/spf: Clear FAULT_FLAG_KILLABLE in the speculative path
> 
> Peter Zijlstra (6):
>   mm: Dont assume page-table invariance during faults
>   mm: Prepare for FAULT_FLAG_SPECULATIVE
>   mm: VMA sequence count
>   mm: RCU free VMAs
>   mm: Provide speculative fault infrastructure
>   x86/mm: Add speculative pagefault handling
> 
>  arch/powerpc/mm/fault.c  |  25 +++-
>  arch/x86/mm/fault.c      |  14 +++
>  fs/proc/task_mmu.c       |   2 +
>  include/linux/mm.h       |   4 +
>  include/linux/mm_types.h |   3 +
>  kernel/fork.c            |   1 +
>  mm/init-mm.c             |   1 +
>  mm/internal.h            |  20 ++++
>  mm/madvise.c             |   4 +
>  mm/memory.c              | 291 +++++++++++++++++++++++++++++++++++++++--------
>  mm/mempolicy.c           |  10 +-
>  mm/mlock.c               |   9 +-
>  mm/mmap.c                | 123 +++++++++++++++-----
>  mm/mprotect.c            |   2 +
>  mm/mremap.c              |   7 ++
>  15 files changed, 435 insertions(+), 81 deletions(-)
> 
> -- 
> 2.7.4
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC v4 00/20] Speculative page faults
  2017-06-09 16:35     ` Michal Hocko
  2017-06-09 16:59       ` Tim Chen
@ 2017-06-13  9:58       ` Laurent Dufour
  1 sibling, 0 replies; 30+ messages in thread
From: Laurent Dufour @ 2017-06-13  9:58 UTC (permalink / raw)
  To: Michal Hocko
  Cc: paulmck, peterz, akpm, kirill, ak, dave, jack, Matthew Wilcox,
	linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora

On 09/06/2017 18:35, Michal Hocko wrote:
> On Fri 09-06-17 17:25:51, Laurent Dufour wrote:
> [...]
>> Thanks Michal for your feedback.
>>
>> I mostly focused on this database workload since this is the one where
>> we hit the mmap_sem bottleneck when running on big node. On my usual
>> victim node, I checked for basic usage like kernel build time, but I
>> agree that's clearly not enough.
>>
>> I try to find details about the 'kbench' you mentioned, but I didn't get
>> any valid entry.
>> Would you please point me on this or any other bench tool you think will
>> be useful here ?
> 
> Sorry I meant kernbench (aka parallel kernel build). Other highly threaded
> workloads doing a lot of page faults and address space modification
> would be good to see as well. I wish I could give you much more
> comprehensive list but I am not very good at benchmarks.
> 

Thanks Michal, I found kernbench 0.5, I will give it a try.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC v4 00/20] Speculative page faults
  2017-06-09 16:59       ` Tim Chen
@ 2017-06-13 10:19         ` Laurent Dufour
  0 siblings, 0 replies; 30+ messages in thread
From: Laurent Dufour @ 2017-06-13 10:19 UTC (permalink / raw)
  To: Tim Chen, Michal Hocko
  Cc: paulmck, peterz, akpm, kirill, ak, dave, jack, Matthew Wilcox,
	linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora

On 09/06/2017 18:59, Tim Chen wrote:
> On 06/09/2017 09:35 AM, Michal Hocko wrote:
>> On Fri 09-06-17 17:25:51, Laurent Dufour wrote:
>> [...]
>>> Thanks Michal for your feedback.
>>>
>>> I mostly focused on this database workload since this is the one where
>>> we hit the mmap_sem bottleneck when running on big node. On my usual
>>> victim node, I checked for basic usage like kernel build time, but I
>>> agree that's clearly not enough.
>>>
>>> I try to find details about the 'kbench' you mentioned, but I didn't get
>>> any valid entry.
>>> Would you please point me on this or any other bench tool you think will
>>> be useful here ?
>>
>> Sorry I meant kernbench (aka parallel kernel build). Other highly threaded
>> workloads doing a lot of page faults and address space modification
>> would be good to see as well. I wish I could give you much more
>> comprehensive list but I am not very good at benchmarks.
>>
> 
> Laurent,
> 
> Have you tried running the multi-fault microbenchmark by Kamezawa?
> It does threaded page faults in parallel.
> Peter ran that when he posted his speculative page faults patches.
> https://lkml.org/lkml/2010/1/6/28


Thanks Tim for reminding me about this, I downloaded and built it a while
ago and forgot about it. I'll give it another try!


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC v4 00/20] Speculative page faults
  2017-06-12 10:20 ` Jan Kara
@ 2017-06-13 10:24   ` Laurent Dufour
  0 siblings, 0 replies; 30+ messages in thread
From: Laurent Dufour @ 2017-06-13 10:24 UTC (permalink / raw)
  To: Jan Kara
  Cc: paulmck, peterz, akpm, kirill, ak, mhocko, dave, Matthew Wilcox,
	linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora

On 12/06/2017 12:20, Jan Kara wrote:
> Hello,
> 
> On Fri 09-06-17 16:20:49, Laurent Dufour wrote:
>> This is a port on kernel 4.12 of the work done by Peter Zijlstra to
>> handle page fault without holding the mm semaphore.
>>
>> http://linux-kernel.2935.n7.nabble.com/RFC-PATCH-0-6-Another-go-at-speculative-page-faults-tt965642.html#none
>>
>> Compared to the Peter initial work, this series introduce a try spin
>> lock when dealing with speculative page fault. This is required to
>> avoid dead lock when handling a page fault while a TLB invalidate is
>> requested by an other CPU holding the PTE. Another change due to a
>> lock dependency issue with mapping->i_mmap_rwsem.
>>
>> This series also protect changes to VMA's data which are read or
>> change by the page fault handler. The protections is done through the
>> VMA's sequence number.
>>
>> This series is functional on x86 and PowerPC.
>>
>> It's building on top of v4.12-rc4 and relies on the change done by
>> Paul McKenney to the SRCU code allowing better performance by
>> maintaining per-CPU callback lists:
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=da915ad5cf25b5f5d358dd3670c3378d8ae8c03e
>>
>> Tests have been made using a large commercial in-memory database on a
>> PowerPC system with 752 CPUs. The results are very encouraging since
>> the loading of the 2TB database was faster by 20% with the speculative
>> page fault.
>>
>> Since tests are encouraging and running test suite didn't raise any
>> issue, I'd like this request for comment series to move to a patch
>> series soon. So please feel free to comment.
> 
> I had a look at the series and I have one comment regarding the whole
> structure of the series: Instead of taking original Peter's patches and
> then fixing up various problems with them, either fold the fixes into
> original patches which introduced problems (this would make sense for
> example for the lock inversion issue you fix) or just put these changes
> to a place in the series where they logically belong - e.g. VMA is
> protected by the sequence counter in patch 4 and then you add various
> places that were missed later in the series. Instead of this just handle
> sequence count protection in consecutive logical steps like vma_adjust()
> changes, mremap() changes, munmap() changes, vma->flags protection, ...

Thanks Jan for the review,

I tried to keep Peter's patches intact, but I agree that may not be a
good idea, and I tended to split this series into too many small patches.

I think I'll fold the changes I made into the original patches; this may
also fix some of the build test issues raised earlier.

> Also the amount of 'XXX' comments seems to be a bit too high and these should be
> addressed.

Right, I'll get rid of them !

Thanks,
Laurent.

> 
> 								Honza
>>
>> Changes since V3:
>>  - support for the 5-level paging.
>>  - abort speculative path before entering userfault code
>>  - support for PowerPC architecture
>>  - reorder the patch to fix build test errors.
>>
>> Laurent Dufour (14):
>>   mm: Introduce pte_spinlock
>>   mm/spf: Try spin lock in speculative path
>>   mm/spf: Fix fe.sequence init in __handle_mm_fault()
>>   mm/spf: don't set fault entry's fields if locking failed
>>   mm/spf; fix lock dependency against mapping->i_mmap_rwsem
>>   mm/spf: Protect changes to vm_flags
>>   mm/spf Protect vm_policy's changes against speculative pf
>>   mm/spf: Add check on the VMA's flags
>>   mm/spf: protect madvise vs speculative pf
>>   mm/spf: protect mremap() against speculative pf
>>   mm/spf: Don't call user fault callback in the speculative path
>>   x86/mm: Update the handle_speculative_fault's path
>>   powerpc/mm: Add speculative page fault
>>   mm/spf: Clear FAULT_FLAG_KILLABLE in the speculative path
>>
>> Peter Zijlstra (6):
>>   mm: Dont assume page-table invariance during faults
>>   mm: Prepare for FAULT_FLAG_SPECULATIVE
>>   mm: VMA sequence count
>>   mm: RCU free VMAs
>>   mm: Provide speculative fault infrastructure
>>   x86/mm: Add speculative pagefault handling
>>
>>  arch/powerpc/mm/fault.c  |  25 +++-
>>  arch/x86/mm/fault.c      |  14 +++
>>  fs/proc/task_mmu.c       |   2 +
>>  include/linux/mm.h       |   4 +
>>  include/linux/mm_types.h |   3 +
>>  kernel/fork.c            |   1 +
>>  mm/init-mm.c             |   1 +
>>  mm/internal.h            |  20 ++++
>>  mm/madvise.c             |   4 +
>>  mm/memory.c              | 291 +++++++++++++++++++++++++++++++++++++++--------
>>  mm/mempolicy.c           |  10 +-
>>  mm/mlock.c               |   9 +-
>>  mm/mmap.c                | 123 +++++++++++++++-----
>>  mm/mprotect.c            |   2 +
>>  mm/mremap.c              |   7 ++
>>  15 files changed, 435 insertions(+), 81 deletions(-)
>>
>> -- 
>> 2.7.4
>>


^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread

Thread overview: 30+ messages
2017-06-09 14:20 [RFC v4 00/20] Speculative page faults Laurent Dufour
2017-06-09 14:20 ` [RFC v4 01/20] mm: Dont assume page-table invariance during faults Laurent Dufour
2017-06-09 14:20 ` [RFC v4 02/20] mm: Prepare for FAULT_FLAG_SPECULATIVE Laurent Dufour
2017-06-09 14:20 ` [RFC v4 03/20] mm: Introduce pte_spinlock Laurent Dufour
2017-06-09 14:20 ` [RFC v4 04/20] mm: VMA sequence count Laurent Dufour
2017-06-09 14:20 ` [RFC v4 05/20] mm: RCU free VMAs Laurent Dufour
2017-06-09 14:20 ` [RFC v4 06/20] mm: Provide speculative fault infrastructure Laurent Dufour
2017-06-09 14:20 ` [RFC v4 07/20] mm/spf: Try spin lock in speculative path Laurent Dufour
2017-06-09 14:20 ` [RFC v4 08/20] mm/spf: Fix fe.sequence init in __handle_mm_fault() Laurent Dufour
2017-06-09 14:20 ` [RFC v4 09/20] mm/spf: don't set fault entry's fields if locking failed Laurent Dufour
2017-06-09 14:20 ` [RFC v4 10/20] mm/spf; fix lock dependency against mapping->i_mmap_rwsem Laurent Dufour
2017-06-09 14:21 ` [RFC v4 11/20] mm/spf: Protect changes to vm_flags Laurent Dufour
2017-06-09 14:21 ` [RFC v4 12/20] mm/spf Protect vm_policy's changes against speculative pf Laurent Dufour
2017-06-09 14:21 ` [RFC v4 13/20] mm/spf: Add check on the VMA's flags Laurent Dufour
2017-06-09 14:21 ` [RFC v4 14/20] mm/spf: protect madvise vs speculative pf Laurent Dufour
2017-06-09 14:21 ` [RFC v4 15/20] mm/spf: protect mremap() against " Laurent Dufour
2017-06-09 14:21 ` [RFC v4 16/20] mm/spf: Don't call user fault callback in the speculative path Laurent Dufour
2017-06-09 14:21 ` [RFC v4 17/20] x86/mm: Add speculative pagefault handling Laurent Dufour
2017-06-09 14:21 ` [RFC v4 18/20] x86/mm: Update the handle_speculative_fault's path Laurent Dufour
2017-06-09 14:21 ` [RFC v4 19/20] powerpc/mm: Add speculative page fault Laurent Dufour
2017-06-09 14:21 ` [RFC v4 20/20] mm/spf: Clear FAULT_FLAG_KILLABLE in the speculative path Laurent Dufour
2017-06-09 15:01 ` [RFC v4 00/20] Speculative page faults Michal Hocko
2017-06-09 15:25   ` Laurent Dufour
2017-06-09 16:35     ` Michal Hocko
2017-06-09 16:59       ` Tim Chen
2017-06-13 10:19         ` Laurent Dufour
2017-06-13  9:58       ` Laurent Dufour
2017-06-09 18:55 ` Paul E. McKenney
2017-06-12 10:20 ` Jan Kara
2017-06-13 10:24   ` Laurent Dufour
