* [RFC v3 00/17] Speculative page faults
@ 2017-04-27 15:52 ` Laurent Dufour
  0 siblings, 0 replies; 48+ messages in thread
From: Laurent Dufour @ 2017-04-27 15:52 UTC (permalink / raw)
  To: paulmck, peterz, akpm, kirill, ak, mhocko, dave, jack
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora

This is a port to kernel 4.10 of the work done by Peter Zijlstra to
handle page faults without holding the mm semaphore.

http://linux-kernel.2935.n7.nabble.com/RFC-PATCH-0-6-Another-go-at-speculative-page-faults-tt965642.html#none

This series is functional on x86, but there may be some pending
issues. It is built on top of v4.10.

Compared to Peter's initial work, this series introduces a try
spinlock when dealing with a speculative page fault. This is required
to avoid a deadlock when handling a page fault while a TLB invalidation
is requested by another CPU holding the PTE lock. Another change is due
to a lock dependency issue with mapping->i_mmap_rwsem.

This series also protects changes to the VMA's data which is read or
changed by the page fault handler. The protection is done through the
VMA's sequence number.
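
As a reading aid (and not code from the series itself), below is a
minimal sketch of how those two ideas combine: a try-lock on the PTE
lock so the speculative path never spins against a CPU that may be
waiting for our TLB flush acknowledgement, plus a re-check of the
vm_sequence count introduced later in the series. The helper name and
the 'seq' parameter are made up for illustration.

#include <linux/mm.h>		/* struct vm_fault, pte_lockptr() */

/*
 * Illustrative sketch only -- not the actual code from these patches.
 * 'seq' is a snapshot of vma->vm_sequence taken when the speculative
 * fault started.
 */
static bool spf_lock_and_check(struct vm_fault *vmf, unsigned int seq)
{
	spinlock_t *ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);

	if (!spin_trylock(ptl))		/* never spin in the speculative path */
		return false;		/* caller returns VM_FAULT_RETRY */

	if (read_seqcount_retry(&vmf->vma->vm_sequence, seq)) {
		/* the VMA changed since the fault started */
		spin_unlock(ptl);
		return false;
	}

	vmf->ptl = ptl;
	return true;
}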

Laurent Dufour (11):
  mm: Introduce pte_spinlock
  mm/spf: Try spin lock in speculative path
  mm/spf: Fix fe.sequence init in __handle_mm_fault()
  mm/spf: don't set fault entry's fields if locking failed
  mm/spf; fix lock dependency against mapping->i_mmap_rwsem
  mm/spf: Protect changes to vm_flags
  mm/spf Protect vm_policy's changes against speculative pf
  x86/mm: Update the handle_speculative_fault's path
  mm/spf: Add check on the VMA's flags
  mm: protect madvise vs speculative pf
  mm/spf: protect mremap() against speculative pf

Peter Zijlstra (6):
  mm: Dont assume page-table invariance during faults
  mm: Prepare for FAULT_FLAG_SPECULATIVE
  mm: VMA sequence count
  RCU free VMAs
  mm: Provide speculative fault infrastructure
  mm,x86: Add speculative pagefault handling

 arch/x86/mm/fault.c      |  15 +++
 fs/proc/task_mmu.c       |   2 +
 include/linux/mm.h       |   4 +
 include/linux/mm_types.h |   3 +
 kernel/fork.c            |   1 +
 mm/init-mm.c             |   1 +
 mm/internal.h            |  18 +++
 mm/madvise.c             |   5 +-
 mm/memory.c              | 284 +++++++++++++++++++++++++++++++++++++++--------
 mm/mempolicy.c           |  10 +-
 mm/mlock.c               |   9 +-
 mm/mmap.c                | 121 +++++++++++++++-----
 mm/mprotect.c            |   2 +
 mm/mremap.c              |   7 ++
 14 files changed, 402 insertions(+), 80 deletions(-)

-- 
2.7.4

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [RFC v3 01/17] mm: Dont assume page-table invariance during faults
  2017-04-27 15:52 ` Laurent Dufour
@ 2017-04-27 15:52   ` Laurent Dufour
  -1 siblings, 0 replies; 48+ messages in thread
From: Laurent Dufour @ 2017-04-27 15:52 UTC (permalink / raw)
  To: paulmck, peterz, akpm, kirill, ak, mhocko, dave, jack
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora

From: Peter Zijlstra <peterz@infradead.org>

One of the side effects of speculating on faults (without holding
mmap_sem) is that we can race with free_pgtables() and therefore we
cannot assume the page-tables will stick around.

Remove the reliance on the pte pointer.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 mm/memory.c | 27 ---------------------------
 1 file changed, 27 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 6bf2b471e30c..374b99de75a5 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1962,30 +1962,6 @@ int apply_to_page_range(struct mm_struct *mm, unsigned long addr,
 }
 EXPORT_SYMBOL_GPL(apply_to_page_range);
 
-/*
- * handle_pte_fault chooses page fault handler according to an entry which was
- * read non-atomically.  Before making any commitment, on those architectures
- * or configurations (e.g. i386 with PAE) which might give a mix of unmatched
- * parts, do_swap_page must check under lock before unmapping the pte and
- * proceeding (but do_wp_page is only called after already making such a check;
- * and do_anonymous_page can safely check later on).
- */
-static inline int pte_unmap_same(struct mm_struct *mm, pmd_t *pmd,
-				pte_t *page_table, pte_t orig_pte)
-{
-	int same = 1;
-#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT)
-	if (sizeof(pte_t) > sizeof(unsigned long)) {
-		spinlock_t *ptl = pte_lockptr(mm, pmd);
-		spin_lock(ptl);
-		same = pte_same(*page_table, orig_pte);
-		spin_unlock(ptl);
-	}
-#endif
-	pte_unmap(page_table);
-	return same;
-}
-
 static inline void cow_user_page(struct page *dst, struct page *src, unsigned long va, struct vm_area_struct *vma)
 {
 	debug_dma_assert_idle(src);
@@ -2542,9 +2518,6 @@ int do_swap_page(struct vm_fault *vmf)
 	int exclusive = 0;
 	int ret = 0;
 
-	if (!pte_unmap_same(vma->vm_mm, vmf->pmd, vmf->pte, vmf->orig_pte))
-		goto out;
-
 	entry = pte_to_swp_entry(vmf->orig_pte);
 	if (unlikely(non_swap_entry(entry))) {
 		if (is_migration_entry(entry)) {
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [RFC v3 02/17] mm: Prepare for FAULT_FLAG_SPECULATIVE
  2017-04-27 15:52 ` Laurent Dufour
@ 2017-04-27 15:52   ` Laurent Dufour
  -1 siblings, 0 replies; 48+ messages in thread
From: Laurent Dufour @ 2017-04-27 15:52 UTC (permalink / raw)
  To: paulmck, peterz, akpm, kirill, ak, mhocko, dave, jack
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora

From: Peter Zijlstra <peterz@infradead.org>

When speculating faults (without holding mmap_sem) we need to validate
that the vma against which we loaded pages is still valid when we're
ready to install the new PTE.

Therefore, replace the pte_offset_map_lock() calls that (re)take the
PTL with pte_map_lock() which can fail in case we find the VMA changed
since we started the fault.

Instead of passing around the endless list of function arguments,
replace the lot with a single structure so we can change context
without endless function signature changes.
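
For orientation only: in this patch pte_map_lock() still takes the lock
unconditionally and always returns true; the ability to fail is only
wired up later in the series for FAULT_FLAG_SPECULATIVE faults. A
hedged sketch of that eventual shape (the helper name is made up, and
vmf->sequence is only added by a later patch in the series):

/* Illustrative sketch -- not the implementation from this series. */
static bool spf_pte_map_lock_sketch(struct vm_fault *vmf)
{
	if ((vmf->flags & FAULT_FLAG_SPECULATIVE) &&
	    read_seqcount_retry(&vmf->vma->vm_sequence, vmf->sequence))
		return false;	/* VMA changed: caller returns VM_FAULT_RETRY */

	vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
				       vmf->address, &vmf->ptl);
	return true;
}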

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
[port to 4.10 kernel]
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
---
 include/linux/mm.h |  1 +
 mm/memory.c        | 57 +++++++++++++++++++++++++++++++++++++++---------------
 2 files changed, 42 insertions(+), 16 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index b84615b0f64c..555ac9ac7202 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -280,6 +280,7 @@ extern pgprot_t protection_map[16];
 #define FAULT_FLAG_USER		0x40	/* The fault originated in userspace */
 #define FAULT_FLAG_REMOTE	0x80	/* faulting for non current tsk/mm */
 #define FAULT_FLAG_INSTRUCTION  0x100	/* The fault was during an instruction fetch */
+#define FAULT_FLAG_SPECULATIVE	0x200	/* Speculative fault, not holding mmap_sem */
 
 /*
  * vm_fault is filled by the the pagefault handler and passed to the vma's
diff --git a/mm/memory.c b/mm/memory.c
index 374b99de75a5..bce32c9d73c2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2100,6 +2100,12 @@ static inline void wp_page_reuse(struct vm_fault *vmf)
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
 }
 
+static bool pte_map_lock(struct vm_fault *vmf)
+{
+	vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl);
+	return true;
+}
+
 /*
  * Handle the case of a page which we actually need to copy to a new page.
  *
@@ -2127,6 +2133,7 @@ static int wp_page_copy(struct vm_fault *vmf)
 	const unsigned long mmun_start = vmf->address & PAGE_MASK;
 	const unsigned long mmun_end = mmun_start + PAGE_SIZE;
 	struct mem_cgroup *memcg;
+	int ret = VM_FAULT_OOM;
 
 	if (unlikely(anon_vma_prepare(vma)))
 		goto oom;
@@ -2154,7 +2161,11 @@ static int wp_page_copy(struct vm_fault *vmf)
 	/*
 	 * Re-check the pte - we dropped the lock
 	 */
-	vmf->pte = pte_offset_map_lock(mm, vmf->pmd, vmf->address, &vmf->ptl);
+	if (!pte_map_lock(vmf)) {
+		mem_cgroup_cancel_charge(new_page, memcg, false);
+		ret = VM_FAULT_RETRY;
+		goto oom_free_new;
+	}
 	if (likely(pte_same(*vmf->pte, vmf->orig_pte))) {
 		if (old_page) {
 			if (!PageAnon(old_page)) {
@@ -2242,7 +2253,7 @@ static int wp_page_copy(struct vm_fault *vmf)
 oom:
 	if (old_page)
 		put_page(old_page);
-	return VM_FAULT_OOM;
+	return ret;
 }
 
 /**
@@ -2263,8 +2274,8 @@ static int wp_page_copy(struct vm_fault *vmf)
 int finish_mkwrite_fault(struct vm_fault *vmf)
 {
 	WARN_ON_ONCE(!(vmf->vma->vm_flags & VM_SHARED));
-	vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address,
-				       &vmf->ptl);
+	if (!pte_map_lock(vmf))
+		return VM_FAULT_RETRY;
 	/*
 	 * We might have raced with another page fault while we released the
 	 * pte_offset_map_lock.
@@ -2382,8 +2393,11 @@ static int do_wp_page(struct vm_fault *vmf)
 			get_page(vmf->page);
 			pte_unmap_unlock(vmf->pte, vmf->ptl);
 			lock_page(vmf->page);
-			vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
-					vmf->address, &vmf->ptl);
+			if (!pte_map_lock(vmf)) {
+				unlock_page(vmf->page);
+				put_page(vmf->page);
+				return VM_FAULT_RETRY;
+			}
 			if (!pte_same(*vmf->pte, vmf->orig_pte)) {
 				unlock_page(vmf->page);
 				pte_unmap_unlock(vmf->pte, vmf->ptl);
@@ -2541,8 +2555,10 @@ int do_swap_page(struct vm_fault *vmf)
 			 * Back out if somebody else faulted in this pte
 			 * while we released the pte lock.
 			 */
-			vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
-					vmf->address, &vmf->ptl);
+			if (!pte_map_lock(vmf)) {
+				delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
+				return VM_FAULT_RETRY;
+			}
 			if (likely(pte_same(*vmf->pte, vmf->orig_pte)))
 				ret = VM_FAULT_OOM;
 			delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
@@ -2598,8 +2614,11 @@ int do_swap_page(struct vm_fault *vmf)
 	/*
 	 * Back out if somebody else already faulted in this pte.
 	 */
-	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
-			&vmf->ptl);
+	if (!pte_map_lock(vmf)) {
+		ret = VM_FAULT_RETRY;
+		mem_cgroup_cancel_charge(page, memcg, false);
+		goto out_page;
+	}
 	if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte)))
 		goto out_nomap;
 
@@ -2763,8 +2782,8 @@ static int do_anonymous_page(struct vm_fault *vmf)
 			!mm_forbids_zeropage(vma->vm_mm)) {
 		entry = pte_mkspecial(pfn_pte(my_zero_pfn(vmf->address),
 						vma->vm_page_prot));
-		vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
-				vmf->address, &vmf->ptl);
+		if (!pte_map_lock(vmf))
+			return VM_FAULT_RETRY;
 		if (!pte_none(*vmf->pte))
 			goto unlock;
 		/* Deliver the page fault to userland, check inside PT lock */
@@ -2796,8 +2815,12 @@ static int do_anonymous_page(struct vm_fault *vmf)
 	if (vma->vm_flags & VM_WRITE)
 		entry = pte_mkwrite(pte_mkdirty(entry));
 
-	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
-			&vmf->ptl);
+	if (!pte_map_lock(vmf)) {
+		/* XXX: should be factorized */
+		mem_cgroup_cancel_charge(page, memcg, false);
+		put_page(page);
+		return VM_FAULT_RETRY;
+	}
 	if (!pte_none(*vmf->pte))
 		goto release;
 
@@ -2897,8 +2920,9 @@ static int pte_alloc_one_map(struct vm_fault *vmf)
 	if (pmd_trans_unstable(vmf->pmd) || pmd_devmap(*vmf->pmd))
 		return VM_FAULT_NOPAGE;
 
-	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
-			&vmf->ptl);
+	if (!pte_map_lock(vmf))
+		return VM_FAULT_RETRY;
+
 	return 0;
 }
 
@@ -3218,6 +3242,7 @@ static int do_read_fault(struct vm_fault *vmf)
 	 * something).
 	 */
 	if (vma->vm_ops->map_pages && fault_around_bytes >> PAGE_SHIFT > 1) {
+		/* XXX: is a call to pte_map_lock(fe) required here ? */
 		ret = do_fault_around(vmf);
 		if (ret)
 			return ret;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [RFC v3 03/17] mm: Introduce pte_spinlock
  2017-04-27 15:52 ` Laurent Dufour
@ 2017-04-27 15:52   ` Laurent Dufour
  -1 siblings, 0 replies; 48+ messages in thread
From: Laurent Dufour @ 2017-04-27 15:52 UTC (permalink / raw)
  To: paulmck, peterz, akpm, kirill, ak, mhocko, dave, jack
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora

This is needed because in handle_pte_fault(), pte_offset_map() is
called and then fe->ptl is fetched and spin_lock() is taken on it.

This was previously embedded in the call to pte_offset_map_lock().

Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
---
 mm/memory.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index bce32c9d73c2..441c0e3f3a0f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2100,6 +2100,13 @@ static inline void wp_page_reuse(struct vm_fault *vmf)
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
 }
 
+static bool pte_spinlock(struct vm_fault *vmf)
+{
+	vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
+	spin_lock(vmf->ptl);
+	return true;
+}
+
 static bool pte_map_lock(struct vm_fault *vmf)
 {
 	vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl);
@@ -3398,8 +3405,8 @@ static int do_numa_page(struct vm_fault *vmf)
 	* page table entry is not accessible, so there would be no
 	* concurrent hardware modifications to the PTE.
 	*/
-	vmf->ptl = pte_lockptr(vma->vm_mm, vmf->pmd);
-	spin_lock(vmf->ptl);
+	if (!pte_spinlock(vmf))
+		return VM_FAULT_RETRY;
 	if (unlikely(!pte_same(*vmf->pte, pte))) {
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
 		goto out;
@@ -3566,8 +3573,8 @@ static int handle_pte_fault(struct vm_fault *vmf)
 	if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
 		return do_numa_page(vmf);
 
-	vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
-	spin_lock(vmf->ptl);
+	if (!pte_spinlock(vmf))
+		return VM_FAULT_RETRY;
 	entry = vmf->orig_pte;
 	if (unlikely(!pte_same(*vmf->pte, entry)))
 		goto unlock;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [RFC v3 04/17] mm: VMA sequence count
  2017-04-27 15:52 ` Laurent Dufour
@ 2017-04-27 15:52   ` Laurent Dufour
  -1 siblings, 0 replies; 48+ messages in thread
From: Laurent Dufour @ 2017-04-27 15:52 UTC (permalink / raw)
  To: paulmck, peterz, akpm, kirill, ak, mhocko, dave, jack
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora

From: Peter Zijlstra <peterz@infradead.org>

Wrap the VMA modifications (vma_adjust/unmap_page_range) with sequence
counts such that we can easily test if a VMA has changed.

The unmap_page_range() one allows us to make assumptions about
page-tables; when we find the seqcount hasn't changed we can assume
page-tables are still valid.

The flip side is that we cannot distinguish between a vma_adjust() and
the unmap_page_range() -- where with the former we could have
re-checked the vma bounds against the address.
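
For illustration, a minimal read-side sketch of the pattern this
enables (the function name is made up; the write side is what the diff
below instruments):

/*
 * Illustrative sketch: if vm_sequence is unchanged across the walk,
 * neither vma_adjust() nor unmap_page_range() ran, so the page-tables
 * observed during the walk can still be trusted.
 */
static bool spf_walk_was_stable(struct vm_area_struct *vma)
{
	unsigned int seq = read_seqcount_begin(&vma->vm_sequence);

	/* ... gup_fast()-style page-table walk goes here ... */

	return !read_seqcount_retry(&vma->vm_sequence, seq);
}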

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
[port to 4.10 kernel]
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
---
 include/linux/mm_types.h |  1 +
 mm/memory.c              |  2 ++
 mm/mmap.c                | 13 +++++++++++++
 3 files changed, 16 insertions(+)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 808751d7b737..daa5fbba9349 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -358,6 +358,7 @@ struct vm_area_struct {
 	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
 #endif
 	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
+	seqcount_t vm_sequence;
 };
 
 struct core_thread {
diff --git a/mm/memory.c b/mm/memory.c
index 441c0e3f3a0f..0f7fbee554c4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1286,6 +1286,7 @@ void unmap_page_range(struct mmu_gather *tlb,
 	unsigned long next;
 
 	BUG_ON(addr >= end);
+	write_seqcount_begin(&vma->vm_sequence);
 	tlb_start_vma(tlb, vma);
 	pgd = pgd_offset(vma->vm_mm, addr);
 	do {
@@ -1295,6 +1296,7 @@ void unmap_page_range(struct mmu_gather *tlb,
 		next = zap_pud_range(tlb, vma, pgd, addr, next, details);
 	} while (pgd++, addr = next, addr != end);
 	tlb_end_vma(tlb, vma);
+	write_seqcount_end(&vma->vm_sequence);
 }
 
 
diff --git a/mm/mmap.c b/mm/mmap.c
index dc4291dcc99b..cb41659bc9f9 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -541,6 +541,8 @@ void __vma_link_rb(struct mm_struct *mm, struct vm_area_struct *vma,
 	else
 		mm->highest_vm_end = vma->vm_end;
 
+	seqcount_init(&vma->vm_sequence);
+
 	/*
 	 * vma->vm_prev wasn't known when we followed the rbtree to find the
 	 * correct insertion point for that vma. As a result, we could not
@@ -675,6 +677,10 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 	long adjust_next = 0;
 	int remove_next = 0;
 
+	write_seqcount_begin(&vma->vm_sequence);
+	if (next)
+		write_seqcount_begin_nested(&next->vm_sequence, SINGLE_DEPTH_NESTING);
+
 	if (next && !insert) {
 		struct vm_area_struct *exporter = NULL, *importer = NULL;
 
@@ -886,6 +892,7 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 		mm->map_count--;
 		mpol_put(vma_policy(next));
 		kmem_cache_free(vm_area_cachep, next);
+		write_seqcount_end(&next->vm_sequence);
 		/*
 		 * In mprotect's case 6 (see comments on vma_merge),
 		 * we must remove another next too. It would clutter
@@ -899,6 +906,8 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 			 * "vma->vm_next" gap must be updated.
 			 */
 			next = vma->vm_next;
+			if (next)
+				write_seqcount_begin_nested(&next->vm_sequence, SINGLE_DEPTH_NESTING);
 		} else {
 			/*
 			 * For the scope of the comment "next" and
@@ -945,6 +954,10 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 	if (insert && file)
 		uprobe_mmap(insert);
 
+	if (next)
+		write_seqcount_end(&next->vm_sequence);
+	write_seqcount_end(&vma->vm_sequence);
+
 	validate_mm(mm);
 
 	return 0;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [RFC v3 05/17] RCU free VMAs
  2017-04-27 15:52 ` Laurent Dufour
@ 2017-04-27 15:52   ` Laurent Dufour
  -1 siblings, 0 replies; 48+ messages in thread
From: Laurent Dufour @ 2017-04-27 15:52 UTC (permalink / raw)
  To: paulmck, peterz, akpm, kirill, ak, mhocko, dave, jack
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora

From: Peter Zijlstra <peterz@infradead.org>

Manage the VMAs with SRCU such that we can do a lockless VMA lookup.

We put the fput(vma->vm_file) in the SRCU callback; this keeps files
valid during speculative faults and is possible thanks to the delayed
fput work by Al Viro -- do we need srcu_barrier() in unmount
someplace?

We guard the mm_rb tree with a seqlock (XXX could be a seqcount but
we'd have to disable preemption around the write side in order to make
the retry loop in __read_seqcount_begin() work) such that we can know
if the rb tree walk was correct. We cannot trust the result of a
lockless tree walk in the face of concurrent tree rotations; although
we can rely on the termination of such walks -- tree rotations
guarantee the end result is a tree again after all.

Furthermore, we rely on the WMB implied by the
write_seqlock/count_begin() to separate the VMA initialization and the
publishing stores, analogous to the RELEASE in rcu_assign_pointer().
We also rely on the RMB from read_seqretry() to separate the vma load
from further loads like the smp_read_barrier_depends() in regular
RCU.

We must not touch the vmacache while doing SRCU lookups as that is not
properly serialized against changes. We update gap information after
publishing the VMA, but A) we don't use that and B) the seqlock
read side would fix that anyhow.

We clear vma->vm_rb for nodes removed from the vma tree such that we
can easily detect such 'dead' nodes; we rely on the WMB from
write_sequnlock() to separate the tree removal from clearing the node.

Provide find_vma_srcu() which wraps the required magic.

XXX: mmap()/munmap() heavy workloads might suffer from the global lock
in call_srcu() -- this is fixable with a 'better' SRCU implementation.
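
A short, illustrative usage sketch of the helpers added here
(find_vma_srcu(), vma_is_dead(), vma_srcu); the wrapper name is made up
and the real speculative-fault user only arrives later in the series:

/*
 * Illustrative sketch: the VMA cannot be freed while the SRCU read-side
 * critical section is held; vma_is_dead() later tells us whether it was
 * unlinked or modified since 'seq' was sampled.
 */
static struct vm_area_struct *spf_lookup_sketch(struct mm_struct *mm,
						unsigned long addr,
						unsigned int *seq)
{
	struct vm_area_struct *vma;

	/* caller holds: idx = srcu_read_lock(&vma_srcu); */
	vma = find_vma_srcu(mm, addr);
	if (!vma || addr < vma->vm_start)
		return NULL;

	*seq = raw_read_seqcount(&vma->vm_sequence);	/* snapshot */
	if (*seq & 1)		/* writer active: fall back to mmap_sem path */
		return NULL;

	/* ... speculative work ...; bail out once vma_is_dead(vma, *seq) */
	return vma;
}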

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/mm_types.h |   2 +
 kernel/fork.c            |   1 +
 mm/init-mm.c             |   1 +
 mm/internal.h            |  18 +++++++++
 mm/mmap.c                | 100 +++++++++++++++++++++++++++++++++++------------
 5 files changed, 96 insertions(+), 26 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index daa5fbba9349..f276973b0f91 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -359,6 +359,7 @@ struct vm_area_struct {
 #endif
 	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
 	seqcount_t vm_sequence;
+	struct rcu_head vm_rcu_head;
 };
 
 struct core_thread {
@@ -397,6 +398,7 @@ struct kioctx_table;
 struct mm_struct {
 	struct vm_area_struct *mmap;		/* list of VMAs */
 	struct rb_root mm_rb;
+	seqlock_t mm_seq;
 	u32 vmacache_seqnum;                   /* per-thread vmacache */
 #ifdef CONFIG_MMU
 	unsigned long (*get_unmapped_area) (struct file *filp,
diff --git a/kernel/fork.c b/kernel/fork.c
index 11c5c8ab827c..352cf3fd6c19 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -753,6 +753,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	mm->mmap = NULL;
 	mm->mm_rb = RB_ROOT;
 	mm->vmacache_seqnum = 0;
+	seqlock_init(&mm->mm_seq);
 	atomic_set(&mm->mm_users, 1);
 	atomic_set(&mm->mm_count, 1);
 	init_rwsem(&mm->mmap_sem);
diff --git a/mm/init-mm.c b/mm/init-mm.c
index 975e49f00f34..2b1fa061684f 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -16,6 +16,7 @@
 
 struct mm_struct init_mm = {
 	.mm_rb		= RB_ROOT,
+	.mm_seq		= __SEQLOCK_UNLOCKED(init_mm.mm_seq),
 	.pgd		= swapper_pg_dir,
 	.mm_users	= ATOMIC_INIT(2),
 	.mm_count	= ATOMIC_INIT(1),
diff --git a/mm/internal.h b/mm/internal.h
index 7aa2ea0a8623..69df80ebc93d 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -40,6 +40,24 @@ void page_writeback_init(void);
 
 int do_swap_page(struct vm_fault *vmf);
 
+extern struct srcu_struct vma_srcu;
+
+extern struct vm_area_struct *find_vma_srcu(struct mm_struct *mm, unsigned long addr);
+
+static inline bool vma_is_dead(struct vm_area_struct *vma, unsigned int sequence)
+{
+	int ret = RB_EMPTY_NODE(&vma->vm_rb);
+	unsigned seq = ACCESS_ONCE(vma->vm_sequence.sequence);
+
+	/*
+	 * Matches both the wmb in write_seqlock_{begin,end}() and
+	 * the wmb in vma_rb_erase().
+	 */
+	smp_rmb();
+
+	return ret || seq != sequence;
+}
+
 void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
 		unsigned long floor, unsigned long ceiling);
 
diff --git a/mm/mmap.c b/mm/mmap.c
index cb41659bc9f9..44e19aa31315 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -159,6 +159,23 @@ void unlink_file_vma(struct vm_area_struct *vma)
 	}
 }
 
+DEFINE_SRCU(vma_srcu);
+
+static void __free_vma(struct rcu_head *head)
+{
+	struct vm_area_struct *vma =
+		container_of(head, struct vm_area_struct, vm_rcu_head);
+
+	if (vma->vm_file)
+		fput(vma->vm_file);
+	kmem_cache_free(vm_area_cachep, vma);
+}
+
+static void free_vma(struct vm_area_struct *vma)
+{
+	call_srcu(&vma_srcu, &vma->vm_rcu_head, __free_vma);
+}
+
 /*
  * Close a vm structure and free it, returning the next.
  */
@@ -169,10 +186,8 @@ static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
 	might_sleep();
 	if (vma->vm_ops && vma->vm_ops->close)
 		vma->vm_ops->close(vma);
-	if (vma->vm_file)
-		fput(vma->vm_file);
 	mpol_put(vma_policy(vma));
-	kmem_cache_free(vm_area_cachep, vma);
+	free_vma(vma);
 	return next;
 }
 
@@ -394,26 +409,37 @@ static void vma_gap_update(struct vm_area_struct *vma)
 }
 
 static inline void vma_rb_insert(struct vm_area_struct *vma,
-				 struct rb_root *root)
+				 struct mm_struct *mm)
 {
+	struct rb_root *root = &mm->mm_rb;
+
 	/* All rb_subtree_gap values must be consistent prior to insertion */
 	validate_mm_rb(root, NULL);
 
 	rb_insert_augmented(&vma->vm_rb, root, &vma_gap_callbacks);
 }
 
-static void __vma_rb_erase(struct vm_area_struct *vma, struct rb_root *root)
+static void __vma_rb_erase(struct vm_area_struct *vma, struct mm_struct *mm)
 {
+	struct rb_root *root = &mm->mm_rb;
 	/*
 	 * Note rb_erase_augmented is a fairly large inline function,
 	 * so make sure we instantiate it only once with our desired
 	 * augmented rbtree callbacks.
 	 */
+	write_seqlock(&mm->mm_seq);
 	rb_erase_augmented(&vma->vm_rb, root, &vma_gap_callbacks);
+	write_sequnlock(&mm->mm_seq); /* wmb */
+
+	/*
+	 * Ensure the removal is complete before clearing the node.
+	 * Matched by vma_is_dead()/handle_speculative_fault().
+	 */
+	RB_CLEAR_NODE(&vma->vm_rb);
 }
 
 static __always_inline void vma_rb_erase_ignore(struct vm_area_struct *vma,
-						struct rb_root *root,
+						struct mm_struct *mm,
 						struct vm_area_struct *ignore)
 {
 	/*
@@ -421,21 +447,21 @@ static __always_inline void vma_rb_erase_ignore(struct vm_area_struct *vma,
 	 * with the possible exception of the "next" vma being erased if
 	 * next->vm_start was reduced.
 	 */
-	validate_mm_rb(root, ignore);
+	validate_mm_rb(&mm->mm_rb, ignore);
 
-	__vma_rb_erase(vma, root);
+	__vma_rb_erase(vma, mm);
 }
 
 static __always_inline void vma_rb_erase(struct vm_area_struct *vma,
-					 struct rb_root *root)
+					 struct mm_struct *mm)
 {
 	/*
 	 * All rb_subtree_gap values must be consistent prior to erase,
 	 * with the possible exception of the vma being erased.
 	 */
-	validate_mm_rb(root, vma);
+	validate_mm_rb(&mm->mm_rb, vma);
 
-	__vma_rb_erase(vma, root);
+	__vma_rb_erase(vma, mm);
 }
 
 /*
@@ -552,10 +578,12 @@ void __vma_link_rb(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * immediately update the gap to the correct value. Finally we
 	 * rebalance the rbtree after all augmented values have been set.
 	 */
+	write_seqlock(&mm->mm_seq);
 	rb_link_node(&vma->vm_rb, rb_parent, rb_link);
 	vma->rb_subtree_gap = 0;
 	vma_gap_update(vma);
-	vma_rb_insert(vma, &mm->mm_rb);
+	vma_rb_insert(vma, mm);
+	write_sequnlock(&mm->mm_seq);
 }
 
 static void __vma_link_file(struct vm_area_struct *vma)
@@ -631,7 +659,7 @@ static __always_inline void __vma_unlink_common(struct mm_struct *mm,
 {
 	struct vm_area_struct *next;
 
-	vma_rb_erase_ignore(vma, &mm->mm_rb, ignore);
+	vma_rb_erase_ignore(vma, mm, ignore);
 	next = vma->vm_next;
 	if (has_prev)
 		prev->vm_next = next;
@@ -883,21 +911,19 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 	}
 
 	if (remove_next) {
-		if (file) {
+		if (file)
 			uprobe_munmap(next, next->vm_start, next->vm_end);
-			fput(file);
-		}
 		if (next->anon_vma)
 			anon_vma_merge(vma, next);
 		mm->map_count--;
 		mpol_put(vma_policy(next));
-		kmem_cache_free(vm_area_cachep, next);
-		write_seqcount_end(&next->vm_sequence);
+		free_vma(next);
 		/*
 		 * In mprotect's case 6 (see comments on vma_merge),
 		 * we must remove another next too. It would clutter
 		 * up the code too much to do both in one go.
 		 */
+		write_seqcount_end(&next->vm_sequence);
 		if (remove_next != 3) {
 			/*
 			 * If "next" was removed and vma->vm_end was
@@ -2103,16 +2129,11 @@ get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,
 EXPORT_SYMBOL(get_unmapped_area);
 
 /* Look up the first VMA which satisfies  addr < vm_end,  NULL if none. */
-struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
+static struct vm_area_struct *__find_vma(struct mm_struct *mm, unsigned long addr)
 {
 	struct rb_node *rb_node;
 	struct vm_area_struct *vma;
 
-	/* Check the cache first. */
-	vma = vmacache_find(mm, addr);
-	if (likely(vma))
-		return vma;
-
 	rb_node = mm->mm_rb.rb_node;
 
 	while (rb_node) {
@@ -2129,13 +2150,40 @@ struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
 			rb_node = rb_node->rb_right;
 	}
 
+	return vma;
+}
+
+struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
+{
+	struct vm_area_struct *vma;
+
+	/* Check the cache first. */
+	vma = vmacache_find(mm, addr);
+	if (likely(vma))
+		return vma;
+
+	vma = __find_vma(mm, addr);
 	if (vma)
 		vmacache_update(addr, vma);
 	return vma;
 }
-
 EXPORT_SYMBOL(find_vma);
 
+struct vm_area_struct *find_vma_srcu(struct mm_struct *mm, unsigned long addr)
+{
+	struct vm_area_struct *vma;
+	unsigned int seq;
+
+	WARN_ON_ONCE(!srcu_read_lock_held(&vma_srcu));
+
+	do {
+		seq = read_seqbegin(&mm->mm_seq);
+		vma = __find_vma(mm, addr);
+	} while (read_seqretry(&mm->mm_seq, seq));
+
+	return vma;
+}
+
 /*
  * Same as find_vma, but also return a pointer to the previous VMA in *pprev.
  */
@@ -2490,7 +2538,7 @@ detach_vmas_to_be_unmapped(struct mm_struct *mm, struct vm_area_struct *vma,
 	insertion_point = (prev ? &prev->vm_next : &mm->mmap);
 	vma->vm_prev = NULL;
 	do {
-		vma_rb_erase(vma, &mm->mm_rb);
+		vma_rb_erase(vma, mm);
 		mm->map_count--;
 		tail_vma = vma;
 		vma = vma->vm_next;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [RFC v3 05/17] RCU free VMAs
@ 2017-04-27 15:52   ` Laurent Dufour
  0 siblings, 0 replies; 48+ messages in thread
From: Laurent Dufour @ 2017-04-27 15:52 UTC (permalink / raw)
  To: paulmck, peterz, akpm, kirill, ak, mhocko, dave, jack
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora

From: Peter Zijlstra <peterz@infradead.org>

Manage the VMAs with SRCU such that we can do a lockless VMA lookup.

We put the fput(vma->vm_file) in the SRCU callback, this keeps files
valid during speculative faults, this is possible due to the delayed
fput work by Al Viro -- do we need srcu_barrier() in unmount
someplace?

We guard the mm_rb tree with a seqlock (XXX could be a seqcount but
we'd have to disable preemption around the write side in order to make
the retry loop in __read_seqcount_begin() work) such that we can know
if the rb tree walk was correct. We cannot trust the restult of a
lockless tree walk in the face of concurrent tree rotations; although
we can trust on the termination of such walks -- tree rotations
guarantee the end result is a tree again after all.

Furthermore, we rely on the WMB implied by the
write_seqlock/count_begin() to separate the VMA initialization and the
publishing stores, analogous to the RELEASE in rcu_assign_pointer().
We also rely on the RMB from read_seqretry() to separate the vma load
from further loads like the smp_read_barrier_depends() in regular
RCU.

We must not touch the vmacache while doing SRCU lookups as that is not
properly serialized against changes. We update gap information after
publishing the VMA, but A) we don't use that and B) the seqlock
read side would fix that anyhow.

We clear vma->vm_rb for nodes removed from the vma tree such that we
can easily detect such 'dead' nodes, we rely on the WMB from
write_sequnlock() to separate the tree removal and clearing the node.

Provide find_vma_srcu() which wraps the required magic.

XXX: mmap()/munmap() heavy workloads might suffer from the global lock
in call_srcu() -- this is fixable with a 'better' SRCU implementation.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/mm_types.h |   2 +
 kernel/fork.c            |   1 +
 mm/init-mm.c             |   1 +
 mm/internal.h            |  18 +++++++++
 mm/mmap.c                | 100 +++++++++++++++++++++++++++++++++++------------
 5 files changed, 96 insertions(+), 26 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index daa5fbba9349..f276973b0f91 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -359,6 +359,7 @@ struct vm_area_struct {
 #endif
 	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
 	seqcount_t vm_sequence;
+	struct rcu_head vm_rcu_head;
 };
 
 struct core_thread {
@@ -397,6 +398,7 @@ struct kioctx_table;
 struct mm_struct {
 	struct vm_area_struct *mmap;		/* list of VMAs */
 	struct rb_root mm_rb;
+	seqlock_t mm_seq;
 	u32 vmacache_seqnum;                   /* per-thread vmacache */
 #ifdef CONFIG_MMU
 	unsigned long (*get_unmapped_area) (struct file *filp,
diff --git a/kernel/fork.c b/kernel/fork.c
index 11c5c8ab827c..352cf3fd6c19 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -753,6 +753,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	mm->mmap = NULL;
 	mm->mm_rb = RB_ROOT;
 	mm->vmacache_seqnum = 0;
+	seqlock_init(&mm->mm_seq);
 	atomic_set(&mm->mm_users, 1);
 	atomic_set(&mm->mm_count, 1);
 	init_rwsem(&mm->mmap_sem);
diff --git a/mm/init-mm.c b/mm/init-mm.c
index 975e49f00f34..2b1fa061684f 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -16,6 +16,7 @@
 
 struct mm_struct init_mm = {
 	.mm_rb		= RB_ROOT,
+	.mm_seq		= __SEQLOCK_UNLOCKED(init_mm.mm_seq),
 	.pgd		= swapper_pg_dir,
 	.mm_users	= ATOMIC_INIT(2),
 	.mm_count	= ATOMIC_INIT(1),
diff --git a/mm/internal.h b/mm/internal.h
index 7aa2ea0a8623..69df80ebc93d 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -40,6 +40,24 @@ void page_writeback_init(void);
 
 int do_swap_page(struct vm_fault *vmf);
 
+extern struct srcu_struct vma_srcu;
+
+extern struct vm_area_struct *find_vma_srcu(struct mm_struct *mm, unsigned long addr);
+
+static inline bool vma_is_dead(struct vm_area_struct *vma, unsigned int sequence)
+{
+	int ret = RB_EMPTY_NODE(&vma->vm_rb);
+	unsigned seq = ACCESS_ONCE(vma->vm_sequence.sequence);
+
+	/*
+	 * Matches both the wmb in write_seqlock_{begin,end}() and
+	 * the wmb in vma_rb_erase().
+	 */
+	smp_rmb();
+
+	return ret || seq != sequence;
+}
+
 void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
 		unsigned long floor, unsigned long ceiling);
 
diff --git a/mm/mmap.c b/mm/mmap.c
index cb41659bc9f9..44e19aa31315 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -159,6 +159,23 @@ void unlink_file_vma(struct vm_area_struct *vma)
 	}
 }
 
+DEFINE_SRCU(vma_srcu);
+
+static void __free_vma(struct rcu_head *head)
+{
+	struct vm_area_struct *vma =
+		container_of(head, struct vm_area_struct, vm_rcu_head);
+
+	if (vma->vm_file)
+		fput(vma->vm_file);
+	kmem_cache_free(vm_area_cachep, vma);
+}
+
+static void free_vma(struct vm_area_struct *vma)
+{
+	call_srcu(&vma_srcu, &vma->vm_rcu_head, __free_vma);
+}
+
 /*
  * Close a vm structure and free it, returning the next.
  */
@@ -169,10 +186,8 @@ static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
 	might_sleep();
 	if (vma->vm_ops && vma->vm_ops->close)
 		vma->vm_ops->close(vma);
-	if (vma->vm_file)
-		fput(vma->vm_file);
 	mpol_put(vma_policy(vma));
-	kmem_cache_free(vm_area_cachep, vma);
+	free_vma(vma);
 	return next;
 }
 
@@ -394,26 +409,37 @@ static void vma_gap_update(struct vm_area_struct *vma)
 }
 
 static inline void vma_rb_insert(struct vm_area_struct *vma,
-				 struct rb_root *root)
+				 struct mm_struct *mm)
 {
+	struct rb_root *root = &mm->mm_rb;
+
 	/* All rb_subtree_gap values must be consistent prior to insertion */
 	validate_mm_rb(root, NULL);
 
 	rb_insert_augmented(&vma->vm_rb, root, &vma_gap_callbacks);
 }
 
-static void __vma_rb_erase(struct vm_area_struct *vma, struct rb_root *root)
+static void __vma_rb_erase(struct vm_area_struct *vma, struct mm_struct *mm)
 {
+	struct rb_root *root = &mm->mm_rb;
 	/*
 	 * Note rb_erase_augmented is a fairly large inline function,
 	 * so make sure we instantiate it only once with our desired
 	 * augmented rbtree callbacks.
 	 */
+	write_seqlock(&mm->mm_seq);
 	rb_erase_augmented(&vma->vm_rb, root, &vma_gap_callbacks);
+	write_sequnlock(&mm->mm_seq); /* wmb */
+
+	/*
+	 * Ensure the removal is complete before clearing the node.
+	 * Matched by vma_is_dead()/handle_speculative_fault().
+	 */
+	RB_CLEAR_NODE(&vma->vm_rb);
 }
 
 static __always_inline void vma_rb_erase_ignore(struct vm_area_struct *vma,
-						struct rb_root *root,
+						struct mm_struct *mm,
 						struct vm_area_struct *ignore)
 {
 	/*
@@ -421,21 +447,21 @@ static __always_inline void vma_rb_erase_ignore(struct vm_area_struct *vma,
 	 * with the possible exception of the "next" vma being erased if
 	 * next->vm_start was reduced.
 	 */
-	validate_mm_rb(root, ignore);
+	validate_mm_rb(&mm->mm_rb, ignore);
 
-	__vma_rb_erase(vma, root);
+	__vma_rb_erase(vma, mm);
 }
 
 static __always_inline void vma_rb_erase(struct vm_area_struct *vma,
-					 struct rb_root *root)
+					 struct mm_struct *mm)
 {
 	/*
 	 * All rb_subtree_gap values must be consistent prior to erase,
 	 * with the possible exception of the vma being erased.
 	 */
-	validate_mm_rb(root, vma);
+	validate_mm_rb(&mm->mm_rb, vma);
 
-	__vma_rb_erase(vma, root);
+	__vma_rb_erase(vma, mm);
 }
 
 /*
@@ -552,10 +578,12 @@ void __vma_link_rb(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * immediately update the gap to the correct value. Finally we
 	 * rebalance the rbtree after all augmented values have been set.
 	 */
+	write_seqlock(&mm->mm_seq);
 	rb_link_node(&vma->vm_rb, rb_parent, rb_link);
 	vma->rb_subtree_gap = 0;
 	vma_gap_update(vma);
-	vma_rb_insert(vma, &mm->mm_rb);
+	vma_rb_insert(vma, mm);
+	write_sequnlock(&mm->mm_seq);
 }
 
 static void __vma_link_file(struct vm_area_struct *vma)
@@ -631,7 +659,7 @@ static __always_inline void __vma_unlink_common(struct mm_struct *mm,
 {
 	struct vm_area_struct *next;
 
-	vma_rb_erase_ignore(vma, &mm->mm_rb, ignore);
+	vma_rb_erase_ignore(vma, mm, ignore);
 	next = vma->vm_next;
 	if (has_prev)
 		prev->vm_next = next;
@@ -883,21 +911,19 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 	}
 
 	if (remove_next) {
-		if (file) {
+		if (file)
 			uprobe_munmap(next, next->vm_start, next->vm_end);
-			fput(file);
-		}
 		if (next->anon_vma)
 			anon_vma_merge(vma, next);
 		mm->map_count--;
 		mpol_put(vma_policy(next));
-		kmem_cache_free(vm_area_cachep, next);
-		write_seqcount_end(&next->vm_sequence);
+		free_vma(next);
 		/*
 		 * In mprotect's case 6 (see comments on vma_merge),
 		 * we must remove another next too. It would clutter
 		 * up the code too much to do both in one go.
 		 */
+		write_seqcount_end(&next->vm_sequence);
 		if (remove_next != 3) {
 			/*
 			 * If "next" was removed and vma->vm_end was
@@ -2103,16 +2129,11 @@ get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,
 EXPORT_SYMBOL(get_unmapped_area);
 
 /* Look up the first VMA which satisfies  addr < vm_end,  NULL if none. */
-struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
+static struct vm_area_struct *__find_vma(struct mm_struct *mm, unsigned long addr)
 {
 	struct rb_node *rb_node;
 	struct vm_area_struct *vma;
 
-	/* Check the cache first. */
-	vma = vmacache_find(mm, addr);
-	if (likely(vma))
-		return vma;
-
 	rb_node = mm->mm_rb.rb_node;
 
 	while (rb_node) {
@@ -2129,13 +2150,40 @@ struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
 			rb_node = rb_node->rb_right;
 	}
 
+	return vma;
+}
+
+struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
+{
+	struct vm_area_struct *vma;
+
+	/* Check the cache first. */
+	vma = vmacache_find(mm, addr);
+	if (likely(vma))
+		return vma;
+
+	vma = __find_vma(mm, addr);
 	if (vma)
 		vmacache_update(addr, vma);
 	return vma;
 }
-
 EXPORT_SYMBOL(find_vma);
 
+struct vm_area_struct *find_vma_srcu(struct mm_struct *mm, unsigned long addr)
+{
+	struct vm_area_struct *vma;
+	unsigned int seq;
+
+	WARN_ON_ONCE(!srcu_read_lock_held(&vma_srcu));
+
+	do {
+		seq = read_seqbegin(&mm->mm_seq);
+		vma = __find_vma(mm, addr);
+	} while (read_seqretry(&mm->mm_seq, seq));
+
+	return vma;
+}
+
 /*
  * Same as find_vma, but also return a pointer to the previous VMA in *pprev.
  */
@@ -2490,7 +2538,7 @@ detach_vmas_to_be_unmapped(struct mm_struct *mm, struct vm_area_struct *vma,
 	insertion_point = (prev ? &prev->vm_next : &mm->mmap);
 	vma->vm_prev = NULL;
 	do {
-		vma_rb_erase(vma, &mm->mm_rb);
+		vma_rb_erase(vma, mm);
 		mm->map_count--;
 		tail_vma = vma;
 		vma = vma->vm_next;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [RFC v3 06/17] mm: Provide speculative fault infrastructure
  2017-04-27 15:52 ` Laurent Dufour
@ 2017-04-27 15:52   ` Laurent Dufour
  -1 siblings, 0 replies; 48+ messages in thread
From: Laurent Dufour @ 2017-04-27 15:52 UTC (permalink / raw)
  To: paulmck, peterz, akpm, kirill, ak, mhocko, dave, jack
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora

From: Peter Zijlstra <peterz@infradead.org>

Provide infrastructure to do a speculative fault (not holding
mmap_sem).

Not holding mmap_sem means we can race against VMA change/removal and
page-table destruction. We use the SRCU VMA freeing to keep the VMA
around. We use the VMA seqcount to detect changes (including unmapping /
page-table deletion) and we use gup_fast()-style page-table walking to
deal with page-table races.

Once we've obtained the page and are ready to update the PTE, we
validate that the state we started the fault with is still valid; if
not, we fail the fault with VM_FAULT_RETRY, otherwise we update the
PTE and we're done.
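
The validation relies on a vma_is_dead() helper introduced earlier in
the series and not quoted in this message. As a rough sketch of what
that check amounts to (an illustration inferred from how it is used in
the hunks below, not the actual helper):

static inline bool vma_is_dead(struct vm_area_struct *vma,
			       unsigned int sequence)
{
	/*
	 * The VMA is gone if it has been erased from the RB tree
	 * (RB_CLEAR_NODE() in vma_rb_erase()) or if its seqcount has
	 * moved on since we sampled it at the start of the fault.
	 */
	return RB_EMPTY_NODE(&vma->vm_rb) ||
	       read_seqcount_retry(&vma->vm_sequence, sequence);
}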

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
[Fix newly introduced pte_spinlock() for speculative page fault]
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
---
 include/linux/mm.h |   3 ++
 mm/memory.c        | 149 +++++++++++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 149 insertions(+), 3 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 555ac9ac7202..4667be54ba74 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -298,6 +298,7 @@ struct vm_fault {
 	gfp_t gfp_mask;			/* gfp mask to be used for allocations */
 	pgoff_t pgoff;			/* Logical page offset based on vma */
 	unsigned long address;		/* Faulting virtual address */
+	unsigned int sequence;
 	pmd_t *pmd;			/* Pointer to pmd entry matching
 					 * the 'address' */
 	pte_t orig_pte;			/* Value of PTE at the time of fault */
@@ -1237,6 +1238,8 @@ int invalidate_inode_page(struct page *page);
 #ifdef CONFIG_MMU
 extern int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 		unsigned int flags);
+extern int handle_speculative_fault(struct mm_struct *mm,
+				    unsigned long address, unsigned int flags);
 extern int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
 			    unsigned long address, unsigned int fault_flags,
 			    bool *unlocked);
diff --git a/mm/memory.c b/mm/memory.c
index 0f7fbee554c4..fd3a0dc122c5 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2104,15 +2104,66 @@ static inline void wp_page_reuse(struct vm_fault *vmf)
 
 static bool pte_spinlock(struct vm_fault *vmf)
 {
+	bool ret = false;
+
+	/* Check if vma is still valid */
+	if (!(vmf->flags & FAULT_FLAG_SPECULATIVE)) {
+		vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
+		spin_lock(vmf->ptl);
+		return true;
+	}
+
+	local_irq_disable();
+	if (vma_is_dead(vmf->vma, vmf->sequence))
+		goto out;
+
 	vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
 	spin_lock(vmf->ptl);
-	return true;
+
+	if (vma_is_dead(vmf->vma, vmf->sequence)) {
+		spin_unlock(vmf->ptl);
+		goto out;
+	}
+
+	ret = true;
+out:
+	local_irq_enable();
+	return ret;
 }
 
 static bool pte_map_lock(struct vm_fault *vmf)
 {
-	vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl);
-	return true;
+	bool ret = false;
+
+	if (!(vmf->flags & FAULT_FLAG_SPECULATIVE)) {
+		vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
+					       vmf->address, &vmf->ptl);
+		return true;
+	}
+
+	/*
+	 * The first vma_is_dead() guarantees the page-tables are still valid,
+	 * having IRQs disabled ensures they stay around, hence the second
+	 * vma_is_dead() to make sure they are still valid once we've got the
+	 * lock. After that a concurrent zap_pte_range() will block on the PTL
+	 * and thus we're safe.
+	 */
+	local_irq_disable();
+	if (vma_is_dead(vmf->vma, vmf->sequence))
+		goto out;
+
+	vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
+				       vmf->address, &vmf->ptl);
+
+	if (vma_is_dead(vmf->vma, vmf->sequence)) {
+		pte_unmap_unlock(vmf->pte, vmf->ptl);
+		goto out;
+	}
+
+	ret = true;
+out:
+	local_irq_enable();
+	return ret;
 }
 
 /*
@@ -2544,6 +2595,7 @@ int do_swap_page(struct vm_fault *vmf)
 	entry = pte_to_swp_entry(vmf->orig_pte);
 	if (unlikely(non_swap_entry(entry))) {
 		if (is_migration_entry(entry)) {
+			/* XXX fe->pmd might be dead */
 			migration_entry_wait(vma->vm_mm, vmf->pmd,
 					     vmf->address);
 		} else if (is_hwpoison_entry(entry)) {
@@ -3659,6 +3711,97 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 	return handle_pte_fault(&vmf);
 }
 
+int handle_speculative_fault(struct mm_struct *mm, unsigned long address,
+			     unsigned int flags)
+{
+	struct vm_fault vmf = {
+		.address = address,
+		.flags = flags | FAULT_FLAG_SPECULATIVE,
+	};
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd;
+	int dead, seq, idx, ret = VM_FAULT_RETRY;
+	struct vm_area_struct *vma;
+
+	idx = srcu_read_lock(&vma_srcu);
+	vma = find_vma_srcu(mm, address);
+	if (!vma)
+		goto unlock;
+
+	/*
+	 * Validate the VMA found by the lockless lookup.
+	 */
+	dead = RB_EMPTY_NODE(&vma->vm_rb);
+	seq = raw_read_seqcount(&vma->vm_sequence); /* rmb <-> seqlock,vma_rb_erase() */
+	if ((seq & 1) || dead) /* XXX wait for !&1 instead? */
+		goto unlock;
+
+	if (address < vma->vm_start || vma->vm_end <= address)
+		goto unlock;
+
+	/*
+	 * We need to re-validate the VMA after checking the bounds, otherwise
+	 * we might have a false positive on the bounds.
+	 */
+	if (read_seqcount_retry(&vma->vm_sequence, seq))
+		goto unlock;
+
+	/*
+	 * Do a speculative lookup of the PTE entry.
+	 */
+	local_irq_disable();
+	pgd = pgd_offset(mm, address);
+	if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
+		goto out_walk;
+
+	pud = pud_offset(pgd, address);
+	if (pud_none(*pud) || unlikely(pud_bad(*pud)))
+		goto out_walk;
+
+	pmd = pmd_offset(pud, address);
+	if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
+		goto out_walk;
+
+	/*
+	 * The above does not allocate/instantiate page-tables because doing so
+	 * would lead to the possibility of instantiating page-tables after
+	 * free_pgtables() -- and consequently leaking them.
+	 *
+	 * The result is that we take at least one !speculative fault per PMD
+	 * in order to instantiate it.
+	 *
+	 * XXX try and fix that.. should be possible somehow.
+	 */
+
+	if (pmd_huge(*pmd)) /* XXX no huge support */
+		goto out_walk;
+
+	vmf.vma = vma;
+	vmf.pmd = pmd;
+	vmf.pgoff = linear_page_index(vma, address);
+	vmf.gfp_mask = __get_fault_gfp_mask(vma);
+	vmf.sequence = seq;
+
+#if 0
+#warning This is done in handle_pte_fault()...
+	pte = pte_offset_map(pmd, address);
+	fe.entry = ACCESS_ONCE(pte); /* XXX gup_get_pte() */
+	pte_unmap(pte);
+#endif
+	local_irq_enable();
+
+	ret = handle_pte_fault(&vmf);
+
+unlock:
+	srcu_read_unlock(&vma_srcu, idx);
+	return ret;
+
+out_walk:
+	local_irq_enable();
+	goto unlock;
+}
+
 /*
  * By the time we get here, we already hold the mm semaphore
  *
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [RFC v3 07/17] mm,x86: Add speculative pagefault handling
  2017-04-27 15:52 ` Laurent Dufour
@ 2017-04-27 15:52   ` Laurent Dufour
  -1 siblings, 0 replies; 48+ messages in thread
From: Laurent Dufour @ 2017-04-27 15:52 UTC (permalink / raw)
  To: paulmck, peterz, akpm, kirill, ak, mhocko, dave, jack
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora

From: Peter Zijlstra <peterz@infradead.org>

Try a speculative fault before acquiring mmap_sem; if it returns with
VM_FAULT_RETRY, continue with the mmap_sem acquisition and do the
traditional fault.
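
As a usage illustration only -- a simplified sketch of the calling
pattern, not the x86 hunk below (which additionally deals with retry
accounting and error reporting); do_page_fault_sketch() is a
hypothetical name:

static int do_page_fault_sketch(struct mm_struct *mm, unsigned long address,
				unsigned int flags)
{
	struct vm_area_struct *vma;
	int fault;

	/* Try the lockless path first. */
	fault = handle_speculative_fault(mm, address,
					 flags & ~FAULT_FLAG_ALLOW_RETRY);
	if (!(fault & VM_FAULT_RETRY))
		return fault;		/* handled without mmap_sem */

	/* Fall back to the classic path under mmap_sem. */
	down_read(&mm->mmap_sem);
	vma = find_vma(mm, address);
	if (!vma || address < vma->vm_start) {
		/* real code would consider stack expansion here */
		up_read(&mm->mmap_sem);
		return VM_FAULT_SIGSEGV;
	}
	fault = handle_mm_fault(vma, address, flags);
	up_read(&mm->mmap_sem);
	return fault;
}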

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/mm/fault.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index e3254ca0eec4..ee6d8799d958 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1316,6 +1316,16 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
 	if (error_code & PF_INSTR)
 		flags |= FAULT_FLAG_INSTRUCTION;
 
+	if (error_code & PF_USER) {
+		fault = handle_speculative_fault(mm, address,
+					flags & ~FAULT_FLAG_ALLOW_RETRY);
+
+		if (fault & VM_FAULT_RETRY)
+			goto retry;
+
+		goto done;
+	}
+
 	/*
 	 * When running in the kernel we expect faults to occur only to
 	 * addresses in user space.  All other faults represent errors in
@@ -1419,7 +1429,15 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
 		return;
 	}
 
+	if (unlikely(fault & VM_FAULT_RETRY)) {
+		if (fatal_signal_pending(current))
+			return;
+
+		goto done;
+	}
+
 	up_read(&mm->mmap_sem);
+done:
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		mm_fault_error(regs, error_code, address, vma, fault);
 		return;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [RFC v3 08/17] mm/spf: Try spin lock in speculative path
  2017-04-27 15:52 ` Laurent Dufour
@ 2017-04-27 15:52   ` Laurent Dufour
  -1 siblings, 0 replies; 48+ messages in thread
From: Laurent Dufour @ 2017-04-27 15:52 UTC (permalink / raw)
  To: paulmck, peterz, akpm, kirill, ak, mhocko, dave, jack
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora

There is a deadlock when a CPU is doing a speculative page fault while
another one is calling do_munmap().

The deadlock occurs because the speculative path tries to take the pte
spinlock while interrupts are disabled. Meanwhile the other CPU, in the
unmap path, holds the pte lock and is waiting for all CPUs to
invalidate their TLB. As the CPU doing the speculative fault has
interrupts disabled, it cannot service the TLB invalidation request and
therefore cannot get the lock.

Since we are in a speculative path, we can race with other mm actions,
so assume the lock may not get acquired and fail the speculative page
fault in that case.

Here are the stacks captured during the deadlock:

CPU 0
native_flush_tlb_others+0x7c/0x260
flush_tlb_mm_range+0x6a/0x220
tlb_flush_mmu_tlbonly+0x63/0xc0
unmap_page_range+0x897/0x9d0
? unmap_single_vma+0x7d/0xe0
? release_pages+0x2b3/0x360
unmap_single_vma+0x7d/0xe0
unmap_vmas+0x51/0xa0
unmap_region+0xbd/0x130
do_munmap+0x279/0x460
SyS_munmap+0x53/0x70

CPU 1
do_raw_spin_lock+0x14e/0x160
_raw_spin_lock+0x5d/0x80
? pte_map_lock+0x169/0x1b0
pte_map_lock+0x169/0x1b0
handle_pte_fault+0xbf2/0xd80
? trace_hardirqs_on+0xd/0x10
handle_speculative_fault+0x272/0x280
handle_speculative_fault+0x5/0x280
__do_page_fault+0x187/0x580
trace_do_page_fault+0x52/0x260
do_async_page_fault+0x19/0x70
async_page_fault+0x28/0x30
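
For illustration, the change boils down to the following pattern in the
speculative path (a condensed sketch of the hunk below, not the hunk
itself):

	/*
	 * With IRQs disabled we must not spin on the PTL: the CPU
	 * holding it may be waiting for our TLB-flush acknowledgement,
	 * which we cannot provide while spinning with IRQs off.
	 */
	vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
	if (unlikely(!spin_trylock(vmf->ptl)))
		goto out;	/* fail the speculative fault, retry classically */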

Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
---
 mm/memory.c | 17 ++++++++++++++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index fd3a0dc122c5..458f579feb6f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2118,7 +2118,8 @@ static bool pte_spinlock(struct vm_fault *vmf)
 		goto out;
 
 	vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
-	spin_lock(vmf->ptl);
+	if (unlikely(!spin_trylock(vmf->ptl)))
+		goto out;
 
 	if (vma_is_dead(vmf->vma, vmf->sequence)) {
 		spin_unlock(vmf->ptl);
@@ -2152,8 +2153,18 @@ static bool pte_map_lock(struct vm_fault *vmf)
 	if (vma_is_dead(vmf->vma, vmf->sequence))
 		goto out;
 
-	vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
-				       vmf->address, &vmf->ptl);
+	/* Same as pte_offset_map_lock() except that we call
+	 * spin_trylock() in place of spin_lock() to avoid race with
+	 * unmap path which may have the lock and wait for this CPU
+	 * to invalidate TLB but this CPU has irq disabled.
+	 * Since we are in a speculative path, accept it could fail
+	 */
+	vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
+	vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
+	if (unlikely(!spin_trylock(vmf->ptl))) {
+		pte_unmap(vmf->pte);
+		goto out;
+	}
 
 	if (vma_is_dead(vmf->vma, vmf->sequence)) {
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [RFC v3 09/17] mm/spf: Fix fe.sequence init in __handle_mm_fault()
  2017-04-27 15:52 ` Laurent Dufour
@ 2017-04-27 15:52   ` Laurent Dufour
  -1 siblings, 0 replies; 48+ messages in thread
From: Laurent Dufour @ 2017-04-27 15:52 UTC (permalink / raw)
  To: paulmck, peterz, akpm, kirill, ak, mhocko, dave, jack
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora

__handle_mm_fault() calls handle_pte_fault(), which requires the
sequence field of the vm_fault structure to be initialized.

Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
---
 mm/memory.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/memory.c b/mm/memory.c
index 458f579feb6f..f8afd52f0d34 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3694,6 +3694,7 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 	vmf.pmd = pmd_alloc(mm, pud, address);
 	if (!vmf.pmd)
 		return VM_FAULT_OOM;
+	vmf.sequence = raw_read_seqcount(&vma->vm_sequence);
 	if (pmd_none(*vmf.pmd) && transparent_hugepage_enabled(vma)) {
 		int ret = create_huge_pmd(&vmf);
 		if (!(ret & VM_FAULT_FALLBACK))
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [RFC v3 10/17] mm/spf: don't set fault entry's fields if locking failed
  2017-04-27 15:52 ` Laurent Dufour
@ 2017-04-27 15:52   ` Laurent Dufour
  -1 siblings, 0 replies; 48+ messages in thread
From: Laurent Dufour @ 2017-04-27 15:52 UTC (permalink / raw)
  To: paulmck, peterz, akpm, kirill, ak, mhocko, dave, jack
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora

In the case pte_map_lock() fails to lock the pte, or if the VMA is no
longer valid, the fault structure's fields should not be set, so that
the caller won't try to unlock it.
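
To illustrate the intended caller contract (a sketch, not actual kernel
code):

	if (!pte_map_lock(vmf))
		return VM_FAULT_RETRY;	/* vmf->pte and vmf->ptl untouched */

	/* ... handle the fault using vmf->pte ... */

	pte_unmap_unlock(vmf->pte, vmf->ptl);	/* only valid on success */

If the fields were assigned before the failure paths, a caller's error
handling could end up unmapping and unlocking a pte that was never
locked.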

Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
---
 mm/memory.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index f8afd52f0d34..3b28de5838c7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2135,6 +2135,8 @@ static bool pte_spinlock(struct vm_fault *vmf)
 static bool pte_map_lock(struct vm_fault *vmf)
 {
 	bool ret = false;
+	pte_t *pte;
+	spinlock_t *ptl;
 
 	if (!(vmf->flags & FAULT_FLAG_SPECULATIVE)) {
 		vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
@@ -2159,18 +2161,20 @@ static bool pte_map_lock(struct vm_fault *vmf)
 	 * to invalidate TLB but this CPU has irq disabled.
 	 * Since we are in a speculative path, accept it could fail
 	 */
-	vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
-	vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
-	if (unlikely(!spin_trylock(vmf->ptl))) {
-		pte_unmap(vmf->pte);
+	ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
+	pte = pte_offset_map(vmf->pmd, vmf->address);
+	if (unlikely(!spin_trylock(ptl))) {
+		pte_unmap(pte);
 		goto out;
 	}
 
 	if (vma_is_dead(vmf->vma, vmf->sequence)) {
-		pte_unmap_unlock(vmf->pte, vmf->ptl);
+		pte_unmap_unlock(pte, ptl);
 		goto out;
 	}
 
+	vmf->pte = pte;
+	vmf->ptl = ptl;
 	ret = true;
 out:
 	local_irq_enable();
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [RFC v3 11/17] mm/spf; fix lock dependency against mapping->i_mmap_rwsem
  2017-04-27 15:52 ` Laurent Dufour
@ 2017-04-27 15:52   ` Laurent Dufour
  -1 siblings, 0 replies; 48+ messages in thread
From: Laurent Dufour @ 2017-04-27 15:52 UTC (permalink / raw)
  To: paulmck, peterz, akpm, kirill, ak, mhocko, dave, jack
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora

kworker/32:1/819 is trying to acquire lock:
 (&vma->vm_sequence){+.+...}, at: [<c0000000002f20e0>]
zap_page_range_single+0xd0/0x1a0

but task is already holding lock:
 (&mapping->i_mmap_rwsem){++++..}, at: [<c0000000002f229c>]
unmap_mapping_range+0x7c/0x160

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #2 (&mapping->i_mmap_rwsem){++++..}:
       down_write+0x84/0x130
       __vma_adjust+0x1f4/0xa80
       __split_vma.isra.2+0x174/0x290
       do_munmap+0x13c/0x4e0
       vm_munmap+0x64/0xb0
       elf_map+0x11c/0x130
       load_elf_binary+0x6f0/0x15f0
       search_binary_handler+0xe0/0x2a0
       do_execveat_common.isra.14+0x7fc/0xbe0
       call_usermodehelper_exec_async+0x14c/0x1d0
       ret_from_kernel_thread+0x5c/0x68

-> #1 (&vma->vm_sequence/1){+.+...}:
       __vma_adjust+0x124/0xa80
       __split_vma.isra.2+0x174/0x290
       do_munmap+0x13c/0x4e0
       vm_munmap+0x64/0xb0
       elf_map+0x11c/0x130
       load_elf_binary+0x6f0/0x15f0
       search_binary_handler+0xe0/0x2a0
       do_execveat_common.isra.14+0x7fc/0xbe0
       call_usermodehelper_exec_async+0x14c/0x1d0
       ret_from_kernel_thread+0x5c/0x68

-> #0 (&vma->vm_sequence){+.+...}:
       lock_acquire+0xf4/0x310
       unmap_page_range+0xcc/0xfa0
       zap_page_range_single+0xd0/0x1a0
       unmap_mapping_range+0x138/0x160
       truncate_pagecache+0x50/0xa0
       put_aio_ring_file+0x48/0xb0
       aio_free_ring+0x40/0x1b0
       free_ioctx+0x38/0xc0
       process_one_work+0x2cc/0x8a0
       worker_thread+0xac/0x580
       kthread+0x164/0x1b0
       ret_from_kernel_thread+0x5c/0x68

other info that might help us debug this:

Chain exists of:
  &vma->vm_sequence --> &vma->vm_sequence/1 --> &mapping->i_mmap_rwsem

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&mapping->i_mmap_rwsem);
                               lock(&vma->vm_sequence/1);
                               lock(&mapping->i_mmap_rwsem);
  lock(&vma->vm_sequence);

 *** DEADLOCK ***

To fix that we must grab the vm_sequence lock after any mapping one in
__vma_adjust().

Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
---
 mm/mmap.c | 22 ++++++++++++----------
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 44e19aa31315..27f407d8f7d7 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -705,10 +705,6 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 	long adjust_next = 0;
 	int remove_next = 0;
 
-	write_seqcount_begin(&vma->vm_sequence);
-	if (next)
-		write_seqcount_begin_nested(&next->vm_sequence, SINGLE_DEPTH_NESTING);
-
 	if (next && !insert) {
 		struct vm_area_struct *exporter = NULL, *importer = NULL;
 
@@ -816,6 +812,11 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 		}
 	}
 
+	write_seqcount_begin(&vma->vm_sequence);
+	if (next)
+		write_seqcount_begin_nested(&next->vm_sequence,
+					    SINGLE_DEPTH_NESTING);
+
 	anon_vma = vma->anon_vma;
 	if (!anon_vma && adjust_next)
 		anon_vma = next->anon_vma;
@@ -932,8 +933,6 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 			 * "vma->vm_next" gap must be updated.
 			 */
 			next = vma->vm_next;
-			if (next)
-				write_seqcount_begin_nested(&next->vm_sequence, SINGLE_DEPTH_NESTING);
 		} else {
 			/*
 			 * For the scope of the comment "next" and
@@ -950,11 +949,14 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 		if (remove_next == 2) {
 			remove_next = 1;
 			end = next->vm_end;
+			write_seqcount_end(&vma->vm_sequence);
 			goto again;
-		}
-		else if (next)
+		} else if (next) {
+			if (next != vma)
+				write_seqcount_begin_nested(&next->vm_sequence,
+							    SINGLE_DEPTH_NESTING);
 			vma_gap_update(next);
-		else {
+		} else {
 			/*
 			 * If remove_next == 2 we obviously can't
 			 * reach this path.
@@ -980,7 +982,7 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 	if (insert && file)
 		uprobe_mmap(insert);
 
-	if (next)
+	if (next && next != vma)
 		write_seqcount_end(&next->vm_sequence);
 	write_seqcount_end(&vma->vm_sequence);
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 48+ messages in thread

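
In other words, the resulting lock ordering in __vma_adjust() is the
following (a condensed sketch of the reordering done below, for
illustration only):

	if (file)
		i_mmap_lock_write(mapping);		/* 1: mapping lock  */

	write_seqcount_begin(&vma->vm_sequence);	/* 2: vma seqcounts */
	if (next)
		write_seqcount_begin_nested(&next->vm_sequence,
					    SINGLE_DEPTH_NESTING);

	/* ... adjust vma/next; releases elided in this sketch ... */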
* [RFC v3 12/17] mm/spf: Protect changes to vm_flags
  2017-04-27 15:52 ` Laurent Dufour
@ 2017-04-27 15:52   ` Laurent Dufour
  -1 siblings, 0 replies; 48+ messages in thread
From: Laurent Dufour @ 2017-04-27 15:52 UTC (permalink / raw)
  To: paulmck, peterz, akpm, kirill, ak, mhocko, dave, jack
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora

Protect changes to the VMA's flags against the speculative page fault
handler.
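
The write side brackets the update with the VMA seqcount so that the
speculative reader can detect it. A condensed sketch of the pairing
(abort_speculative_fault is a hypothetical label used only for this
illustration):

	/* Writer (mmap_sem held for write): */
	write_seqcount_begin(&vma->vm_sequence);
	vma->vm_flags = newflags;
	vma_set_page_prot(vma);
	write_seqcount_end(&vma->vm_sequence);

	/* Reader (speculative fault, no mmap_sem): */
	seq = raw_read_seqcount(&vma->vm_sequence);
	/* ... use vma->vm_flags ... */
	if (read_seqcount_retry(&vma->vm_sequence, seq))
		goto abort_speculative_fault;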

Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
---
 fs/proc/task_mmu.c | 2 ++
 mm/mempolicy.c     | 2 ++
 mm/mlock.c         | 9 ++++++---
 mm/mmap.c          | 2 ++
 mm/mprotect.c      | 2 ++
 5 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 8f96a49178d0..54c9a87530cb 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1055,8 +1055,10 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 					goto out_mm;
 				}
 				for (vma = mm->mmap; vma; vma = vma->vm_next) {
+					write_seqcount_begin(&vma->vm_sequence);
 					vma->vm_flags &= ~VM_SOFTDIRTY;
 					vma_set_page_prot(vma);
+					write_seqcount_end(&vma->vm_sequence);
 				}
 				downgrade_write(&mm->mmap_sem);
 				break;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 1e7873e40c9a..1518b022927d 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -603,9 +603,11 @@ unsigned long change_prot_numa(struct vm_area_struct *vma,
 {
 	int nr_updated;
 
+	write_seqcount_begin(&vma->vm_sequence);
 	nr_updated = change_protection(vma, addr, end, PAGE_NONE, 0, 1);
 	if (nr_updated)
 		count_vm_numa_events(NUMA_PTE_UPDATES, nr_updated);
+	write_seqcount_end(&vma->vm_sequence);
 
 	return nr_updated;
 }
diff --git a/mm/mlock.c b/mm/mlock.c
index cdbed8aaa426..44cf70413530 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -437,7 +437,9 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
 void munlock_vma_pages_range(struct vm_area_struct *vma,
 			     unsigned long start, unsigned long end)
 {
+	write_seqcount_begin(&vma->vm_sequence);
 	vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
+	write_seqcount_end(&vma->vm_sequence);
 
 	while (start < end) {
 		struct page *page;
@@ -563,10 +565,11 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
 	 * It's okay if try_to_unmap_one unmaps a page just after we
 	 * set VM_LOCKED, populate_vma_page_range will bring it back.
 	 */
-
-	if (lock)
+	if (lock) {
+		write_seqcount_begin(&vma->vm_sequence);
 		vma->vm_flags = newflags;
-	else
+		write_seqcount_end(&vma->vm_sequence);
+	} else
 		munlock_vma_pages_range(vma, start, end);
 
 out:
diff --git a/mm/mmap.c b/mm/mmap.c
index 27f407d8f7d7..815065d740c4 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1742,6 +1742,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 out:
 	perf_event_mmap(vma);
 
+	write_seqcount_begin(&vma->vm_sequence);
 	vm_stat_account(mm, vm_flags, len >> PAGE_SHIFT);
 	if (vm_flags & VM_LOCKED) {
 		if (!((vm_flags & VM_SPECIAL) || is_vm_hugetlb_page(vma) ||
@@ -1764,6 +1765,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	vma->vm_flags |= VM_SOFTDIRTY;
 
 	vma_set_page_prot(vma);
+	write_seqcount_end(&vma->vm_sequence);
 
 	return addr;
 
diff --git a/mm/mprotect.c b/mm/mprotect.c
index f9c07f54dd62..646347faf4d5 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -341,6 +341,7 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
 	 * vm_flags and vm_page_prot are protected by the mmap_sem
 	 * held in write mode.
 	 */
+	write_seqcount_begin(&vma->vm_sequence);
 	vma->vm_flags = newflags;
 	dirty_accountable = vma_wants_writenotify(vma, vma->vm_page_prot);
 	vma_set_page_prot(vma);
@@ -356,6 +357,7 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
 			(newflags & VM_WRITE)) {
 		populate_vma_page_range(vma, start, end, NULL);
 	}
+	write_seqcount_end(&vma->vm_sequence);
 
 	vm_stat_account(mm, oldflags, -nrpages);
 	vm_stat_account(mm, newflags, nrpages);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [RFC v3 13/17] mm/spf Protect vm_policy's changes against speculative pf
  2017-04-27 15:52 ` Laurent Dufour
@ 2017-04-27 15:52   ` Laurent Dufour
  -1 siblings, 0 replies; 48+ messages in thread
From: Laurent Dufour @ 2017-04-27 15:52 UTC (permalink / raw)
  To: paulmck, peterz, akpm, kirill, ak, mhocko, dave, jack
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora

Mark the VMA as touched (via its vm_sequence count) when policy changes
are applied to it, so that a concurrent speculative page fault will be
aborted.

Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
---
 mm/mempolicy.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 1518b022927d..57ec8d0a9c95 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -444,8 +444,11 @@ void mpol_rebind_mm(struct mm_struct *mm, nodemask_t *new)
 	struct vm_area_struct *vma;
 
 	down_write(&mm->mmap_sem);
-	for (vma = mm->mmap; vma; vma = vma->vm_next)
+	for (vma = mm->mmap; vma; vma = vma->vm_next) {
+		write_seqcount_begin(&vma->vm_sequence);
 		mpol_rebind_policy(vma->vm_policy, new, MPOL_REBIND_ONCE);
+		write_seqcount_end(&vma->vm_sequence);
+	}
 	up_write(&mm->mmap_sem);
 }
 
@@ -708,6 +711,7 @@ static int vma_replace_policy(struct vm_area_struct *vma,
 	if (IS_ERR(new))
 		return PTR_ERR(new);
 
+	write_seqcount_begin(&vma->vm_sequence);
 	if (vma->vm_ops && vma->vm_ops->set_policy) {
 		err = vma->vm_ops->set_policy(vma, new);
 		if (err)
@@ -716,10 +720,12 @@ static int vma_replace_policy(struct vm_area_struct *vma,
 
 	old = vma->vm_policy;
 	vma->vm_policy = new; /* protected by mmap_sem */
+	write_seqcount_end(&vma->vm_sequence);
 	mpol_put(old);
 
 	return 0;
  err_out:
+	write_seqcount_end(&vma->vm_sequence);
 	mpol_put(new);
 	return err;
 }
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [RFC v3 14/17] x86/mm: Update the handle_speculative_fault's path
  2017-04-27 15:52 ` Laurent Dufour
@ 2017-04-27 15:52   ` Laurent Dufour
  -1 siblings, 0 replies; 48+ messages in thread
From: Laurent Dufour @ 2017-04-27 15:52 UTC (permalink / raw)
  To: paulmck, peterz, akpm, kirill, ak, mhocko, dave, jack
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora

If handle_speculative_fault() fails due to a VM_FAULT_ERROR, we fall
back to the slow path so that the signal can be delivered.

Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
---
 arch/x86/mm/fault.c | 21 +++++++++------------
 1 file changed, 9 insertions(+), 12 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index ee6d8799d958..8f3bd8a53d66 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1320,10 +1320,14 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
 		fault = handle_speculative_fault(mm, address,
 					flags & ~FAULT_FLAG_ALLOW_RETRY);
 
-		if (fault & VM_FAULT_RETRY)
-			goto retry;
-
-		goto done;
+		/*
+		 * We also check against VM_FAULT_ERROR because we have to
+		 * raise a signal by calling later mm_fault_error() which
+		 * requires the vma pointer to be set. So in that case,
+		 * we fall through the normal path.
+		 */
+		if (!(fault & VM_FAULT_RETRY || fault & VM_FAULT_ERROR))
+			goto done;
 	}
 
 	/*
@@ -1429,20 +1433,13 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
 		return;
 	}
 
-	if (unlikely(fault & VM_FAULT_RETRY)) {
-		if (fatal_signal_pending(current))
-			return;
-
-		goto done;
-	}
-
 	up_read(&mm->mmap_sem);
-done:
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		mm_fault_error(regs, error_code, address, vma, fault);
 		return;
 	}
 
+done:
 	/*
 	 * Major/minor page fault accounting. If any of the events
 	 * returned VM_FAULT_MAJOR, we account it as a major fault.
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [RFC v3 15/17] mm/spf: Add check on the VMA's flags
  2017-04-27 15:52 ` Laurent Dufour
@ 2017-04-27 15:52   ` Laurent Dufour
  -1 siblings, 0 replies; 48+ messages in thread
From: Laurent Dufour @ 2017-04-27 15:52 UTC (permalink / raw)
  To: paulmck, peterz, akpm, kirill, ak, mhocko, dave, jack
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora

When handling a speculative page fault, we should check the VMA's
access permissions as is done in handle_mm_fault() or access_error()
in x86's fault handler.

Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
---
 mm/memory.c | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index 3b28de5838c7..4d9c6331ada1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3756,6 +3756,30 @@ int handle_speculative_fault(struct mm_struct *mm, unsigned long address,
 	if (address < vma->vm_start || vma->vm_end <= address)
 		goto unlock;
 
+	/* XXX Could we handle huge pages here? */
+	if (unlikely(is_vm_hugetlb_page(vma)))
+		goto unlock;
+
+	/*
+	 * The following three checks are copied from access_error() in
+	 * arch/x86/mm/fault.c.
+	 * XXX: they may not be applicable to all architectures
+	 */
+	if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
+				       flags & FAULT_FLAG_INSTRUCTION,
+				       flags & FAULT_FLAG_REMOTE))
+		goto unlock;
+
+	/* This one is required to check that the VMA has write access set */
+	if (flags & FAULT_FLAG_WRITE) {
+		if (unlikely(!(vma->vm_flags & VM_WRITE)))
+			goto unlock;
+	} else {
+		/* XXX This may not be required */
+		if (unlikely(!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE))))
+			goto unlock;
+	}
+
 	/*
 	 * We need to re-validate the VMA after checking the bounds, otherwise
 	 * we might have a false positive on the bounds.
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [RFC v3 16/17] mm: protect madvise vs speculative pf
  2017-04-27 15:52 ` Laurent Dufour
@ 2017-04-27 15:52   ` Laurent Dufour
  -1 siblings, 0 replies; 48+ messages in thread
From: Laurent Dufour @ 2017-04-27 15:52 UTC (permalink / raw)
  To: paulmck, peterz, akpm, kirill, ak, mhocko, dave, jack
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora

This is an attempt to protect the changes made by madvise() against the
speculative page fault handler.

Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
---
 mm/madvise.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 0e3828eae9f8..f91b64564571 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -132,8 +132,9 @@ static long madvise_behavior(struct vm_area_struct *vma,
 	/*
 	 * vm_flags is protected by the mmap_sem held in write mode.
 	 */
+	write_seqcount_begin(&vma->vm_sequence);
 	vma->vm_flags = new_flags;
-
+	write_seqcount_end(&vma->vm_sequence);
 out:
 	if (error == -ENOMEM)
 		error = -EAGAIN;
@@ -403,9 +404,11 @@ static void madvise_free_page_range(struct mmu_gather *tlb,
 		.private = tlb,
 	};
 
+	write_seqcount_begin(&vma->vm_sequence);
 	tlb_start_vma(tlb, vma);
 	walk_page_range(addr, end, &free_walk);
 	tlb_end_vma(tlb, vma);
+	write_seqcount_end(&vma->vm_sequence);
 }
 
 static int madvise_free_single_vma(struct vm_area_struct *vma,
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [RFC v3 17/17] mm/spf: protect mremap() against speculative pf
  2017-04-27 15:52 ` Laurent Dufour
@ 2017-04-27 15:52   ` Laurent Dufour
  -1 siblings, 0 replies; 48+ messages in thread
From: Laurent Dufour @ 2017-04-27 15:52 UTC (permalink / raw)
  To: paulmck, peterz, akpm, kirill, ak, mhocko, dave, jack
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora

mremap() modifies the VMA layout and thus must be protected against the
speculative page fault handler.

XXX: Does the change to vma->vm_flags to set VM_ACCOUNT require this
protection?

Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
---
 mm/mremap.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/mm/mremap.c b/mm/mremap.c
index 30d7d2482eea..40c3c869dffc 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -288,6 +288,10 @@ static unsigned long move_vma(struct vm_area_struct *vma,
 	if (!new_vma)
 		return -ENOMEM;
 
+	write_seqcount_begin(&vma->vm_sequence);
+	write_seqcount_begin_nested(&new_vma->vm_sequence,
+				    SINGLE_DEPTH_NESTING);
+
 	moved_len = move_page_tables(vma, old_addr, new_vma, new_addr, old_len,
 				     need_rmap_locks);
 	if (moved_len < old_len) {
@@ -304,6 +308,7 @@ static unsigned long move_vma(struct vm_area_struct *vma,
 		 */
 		move_page_tables(new_vma, new_addr, vma, old_addr, moved_len,
 				 true);
+		write_seqcount_end(&vma->vm_sequence);
 		vma = new_vma;
 		old_len = new_len;
 		old_addr = new_addr;
@@ -311,7 +316,9 @@ static unsigned long move_vma(struct vm_area_struct *vma,
 	} else {
 		arch_remap(mm, old_addr, old_addr + old_len,
 			   new_addr, new_addr + new_len);
+		write_seqcount_end(&vma->vm_sequence);
 	}
+	write_seqcount_end(&new_vma->vm_sequence);
 
 	/* Conceal VM_ACCOUNT so old reservation is not undone */
 	if (vm_flags & VM_ACCOUNT) {
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [RFC v3 05/17] RCU free VMAs
  2017-04-27 15:52   ` Laurent Dufour
@ 2017-04-27 18:28     ` Paul E. McKenney
  -1 siblings, 0 replies; 48+ messages in thread
From: Paul E. McKenney @ 2017-04-27 18:28 UTC (permalink / raw)
  To: Laurent Dufour
  Cc: peterz, akpm, kirill, ak, mhocko, dave, jack, linux-kernel,
	linux-mm, haren, khandual, npiggin, bsingharora

On Thu, Apr 27, 2017 at 05:52:44PM +0200, Laurent Dufour wrote:
> From: Peter Zijlstra <peterz@infradead.org>
> 
> Manage the VMAs with SRCU such that we can do a lockless VMA lookup.
> 
> We put the fput(vma->vm_file) in the SRCU callback; this keeps files
> valid during speculative faults, which is possible thanks to the delayed
> fput work by Al Viro -- do we need srcu_barrier() in unmount
> someplace?
> 
> We guard the mm_rb tree with a seqlock (XXX could be a seqcount but
> we'd have to disable preemption around the write side in order to make
> the retry loop in __read_seqcount_begin() work) such that we can know
> if the rb tree walk was correct. We cannot trust the result of a
> lockless tree walk in the face of concurrent tree rotations; although
> we can rely on the termination of such walks -- tree rotations
> guarantee the end result is a tree again after all.
> 
> Furthermore, we rely on the WMB implied by the
> write_seqlock/count_begin() to separate the VMA initialization and the
> publishing stores, analogous to the RELEASE in rcu_assign_pointer().
> We also rely on the RMB from read_seqretry() to separate the vma load
> from further loads like the smp_read_barrier_depends() in regular
> RCU.
> 
> We must not touch the vmacache while doing SRCU lookups as that is not
> properly serialized against changes. We update gap information after
> publishing the VMA, but A) we don't use that and B) the seqlock
> read side would fix that anyhow.
> 
> We clear vma->vm_rb for nodes removed from the vma tree such that we
> can easily detect such 'dead' nodes, we rely on the WMB from
> write_sequnlock() to separate the tree removal and clearing the node.
> 
> Provide find_vma_srcu() which wraps the required magic.
> 
> XXX: mmap()/munmap() heavy workloads might suffer from the global lock
> in call_srcu() -- this is fixable with a 'better' SRCU implementation.

An alleged 'better' SRCU implementation is now in -tip at branch
tip/core/rcu.  This implementation maintains per-CPU callback lists,
which eliminates the previous global lock.  The goal is to get this
'better' SRCU into mainline during the upcoming merge window.

							Thanx, Paul

> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  include/linux/mm_types.h |   2 +
>  kernel/fork.c            |   1 +
>  mm/init-mm.c             |   1 +
>  mm/internal.h            |  18 +++++++++
>  mm/mmap.c                | 100 +++++++++++++++++++++++++++++++++++------------
>  5 files changed, 96 insertions(+), 26 deletions(-)
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index daa5fbba9349..f276973b0f91 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -359,6 +359,7 @@ struct vm_area_struct {
>  #endif
>  	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
>  	seqcount_t vm_sequence;
> +	struct rcu_head vm_rcu_head;
>  };
> 
>  struct core_thread {
> @@ -397,6 +398,7 @@ struct kioctx_table;
>  struct mm_struct {
>  	struct vm_area_struct *mmap;		/* list of VMAs */
>  	struct rb_root mm_rb;
> +	seqlock_t mm_seq;
>  	u32 vmacache_seqnum;                   /* per-thread vmacache */
>  #ifdef CONFIG_MMU
>  	unsigned long (*get_unmapped_area) (struct file *filp,
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 11c5c8ab827c..352cf3fd6c19 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -753,6 +753,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
>  	mm->mmap = NULL;
>  	mm->mm_rb = RB_ROOT;
>  	mm->vmacache_seqnum = 0;
> +	seqlock_init(&mm->mm_seq);
>  	atomic_set(&mm->mm_users, 1);
>  	atomic_set(&mm->mm_count, 1);
>  	init_rwsem(&mm->mmap_sem);
> diff --git a/mm/init-mm.c b/mm/init-mm.c
> index 975e49f00f34..2b1fa061684f 100644
> --- a/mm/init-mm.c
> +++ b/mm/init-mm.c
> @@ -16,6 +16,7 @@
> 
>  struct mm_struct init_mm = {
>  	.mm_rb		= RB_ROOT,
> +	.mm_seq		= __SEQLOCK_UNLOCKED(init_mm.mm_seq),
>  	.pgd		= swapper_pg_dir,
>  	.mm_users	= ATOMIC_INIT(2),
>  	.mm_count	= ATOMIC_INIT(1),
> diff --git a/mm/internal.h b/mm/internal.h
> index 7aa2ea0a8623..69df80ebc93d 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -40,6 +40,24 @@ void page_writeback_init(void);
> 
>  int do_swap_page(struct vm_fault *vmf);
> 
> +extern struct srcu_struct vma_srcu;
> +
> +extern struct vm_area_struct *find_vma_srcu(struct mm_struct *mm, unsigned long addr);
> +
> +static inline bool vma_is_dead(struct vm_area_struct *vma, unsigned int sequence)
> +{
> +	int ret = RB_EMPTY_NODE(&vma->vm_rb);
> +	unsigned seq = ACCESS_ONCE(vma->vm_sequence.sequence);
> +
> +	/*
> +	 * Matches both the wmb in write_seqlock_{begin,end}() and
> +	 * the wmb in vma_rb_erase().
> +	 */
> +	smp_rmb();
> +
> +	return ret || seq != sequence;
> +}
> +
>  void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
>  		unsigned long floor, unsigned long ceiling);
> 
> diff --git a/mm/mmap.c b/mm/mmap.c
> index cb41659bc9f9..44e19aa31315 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -159,6 +159,23 @@ void unlink_file_vma(struct vm_area_struct *vma)
>  	}
>  }
> 
> +DEFINE_SRCU(vma_srcu);
> +
> +static void __free_vma(struct rcu_head *head)
> +{
> +	struct vm_area_struct *vma =
> +		container_of(head, struct vm_area_struct, vm_rcu_head);
> +
> +	if (vma->vm_file)
> +		fput(vma->vm_file);
> +	kmem_cache_free(vm_area_cachep, vma);
> +}
> +
> +static void free_vma(struct vm_area_struct *vma)
> +{
> +	call_srcu(&vma_srcu, &vma->vm_rcu_head, __free_vma);
> +}
> +
>  /*
>   * Close a vm structure and free it, returning the next.
>   */
> @@ -169,10 +186,8 @@ static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
>  	might_sleep();
>  	if (vma->vm_ops && vma->vm_ops->close)
>  		vma->vm_ops->close(vma);
> -	if (vma->vm_file)
> -		fput(vma->vm_file);
>  	mpol_put(vma_policy(vma));
> -	kmem_cache_free(vm_area_cachep, vma);
> +	free_vma(vma);
>  	return next;
>  }
> 
> @@ -394,26 +409,37 @@ static void vma_gap_update(struct vm_area_struct *vma)
>  }
> 
>  static inline void vma_rb_insert(struct vm_area_struct *vma,
> -				 struct rb_root *root)
> +				 struct mm_struct *mm)
>  {
> +	struct rb_root *root = &mm->mm_rb;
> +
>  	/* All rb_subtree_gap values must be consistent prior to insertion */
>  	validate_mm_rb(root, NULL);
> 
>  	rb_insert_augmented(&vma->vm_rb, root, &vma_gap_callbacks);
>  }
> 
> -static void __vma_rb_erase(struct vm_area_struct *vma, struct rb_root *root)
> +static void __vma_rb_erase(struct vm_area_struct *vma, struct mm_struct *mm)
>  {
> +	struct rb_root *root = &mm->mm_rb;
>  	/*
>  	 * Note rb_erase_augmented is a fairly large inline function,
>  	 * so make sure we instantiate it only once with our desired
>  	 * augmented rbtree callbacks.
>  	 */
> +	write_seqlock(&mm->mm_seq);
>  	rb_erase_augmented(&vma->vm_rb, root, &vma_gap_callbacks);
> +	write_sequnlock(&mm->mm_seq); /* wmb */
> +
> +	/*
> +	 * Ensure the removal is complete before clearing the node.
> +	 * Matched by vma_is_dead()/handle_speculative_fault().
> +	 */
> +	RB_CLEAR_NODE(&vma->vm_rb);
>  }
> 
>  static __always_inline void vma_rb_erase_ignore(struct vm_area_struct *vma,
> -						struct rb_root *root,
> +						struct mm_struct *mm,
>  						struct vm_area_struct *ignore)
>  {
>  	/*
> @@ -421,21 +447,21 @@ static __always_inline void vma_rb_erase_ignore(struct vm_area_struct *vma,
>  	 * with the possible exception of the "next" vma being erased if
>  	 * next->vm_start was reduced.
>  	 */
> -	validate_mm_rb(root, ignore);
> +	validate_mm_rb(&mm->mm_rb, ignore);
> 
> -	__vma_rb_erase(vma, root);
> +	__vma_rb_erase(vma, mm);
>  }
> 
>  static __always_inline void vma_rb_erase(struct vm_area_struct *vma,
> -					 struct rb_root *root)
> +					 struct mm_struct *mm)
>  {
>  	/*
>  	 * All rb_subtree_gap values must be consistent prior to erase,
>  	 * with the possible exception of the vma being erased.
>  	 */
> -	validate_mm_rb(root, vma);
> +	validate_mm_rb(&mm->mm_rb, vma);
> 
> -	__vma_rb_erase(vma, root);
> +	__vma_rb_erase(vma, mm);
>  }
> 
>  /*
> @@ -552,10 +578,12 @@ void __vma_link_rb(struct mm_struct *mm, struct vm_area_struct *vma,
>  	 * immediately update the gap to the correct value. Finally we
>  	 * rebalance the rbtree after all augmented values have been set.
>  	 */
> +	write_seqlock(&mm->mm_seq);
>  	rb_link_node(&vma->vm_rb, rb_parent, rb_link);
>  	vma->rb_subtree_gap = 0;
>  	vma_gap_update(vma);
> -	vma_rb_insert(vma, &mm->mm_rb);
> +	vma_rb_insert(vma, mm);
> +	write_sequnlock(&mm->mm_seq);
>  }
> 
>  static void __vma_link_file(struct vm_area_struct *vma)
> @@ -631,7 +659,7 @@ static __always_inline void __vma_unlink_common(struct mm_struct *mm,
>  {
>  	struct vm_area_struct *next;
> 
> -	vma_rb_erase_ignore(vma, &mm->mm_rb, ignore);
> +	vma_rb_erase_ignore(vma, mm, ignore);
>  	next = vma->vm_next;
>  	if (has_prev)
>  		prev->vm_next = next;
> @@ -883,21 +911,19 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
>  	}
> 
>  	if (remove_next) {
> -		if (file) {
> +		if (file)
>  			uprobe_munmap(next, next->vm_start, next->vm_end);
> -			fput(file);
> -		}
>  		if (next->anon_vma)
>  			anon_vma_merge(vma, next);
>  		mm->map_count--;
>  		mpol_put(vma_policy(next));
> -		kmem_cache_free(vm_area_cachep, next);
> -		write_seqcount_end(&next->vm_sequence);
> +		free_vma(next);
>  		/*
>  		 * In mprotect's case 6 (see comments on vma_merge),
>  		 * we must remove another next too. It would clutter
>  		 * up the code too much to do both in one go.
>  		 */
> +		write_seqcount_end(&next->vm_sequence);
>  		if (remove_next != 3) {
>  			/*
>  			 * If "next" was removed and vma->vm_end was
> @@ -2103,16 +2129,11 @@ get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,
>  EXPORT_SYMBOL(get_unmapped_area);
> 
>  /* Look up the first VMA which satisfies  addr < vm_end,  NULL if none. */
> -struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
> +static struct vm_area_struct *__find_vma(struct mm_struct *mm, unsigned long addr)
>  {
>  	struct rb_node *rb_node;
>  	struct vm_area_struct *vma;
> 
> -	/* Check the cache first. */
> -	vma = vmacache_find(mm, addr);
> -	if (likely(vma))
> -		return vma;
> -
>  	rb_node = mm->mm_rb.rb_node;
> 
>  	while (rb_node) {
> @@ -2129,13 +2150,40 @@ struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
>  			rb_node = rb_node->rb_right;
>  	}
> 
> +	return vma;
> +}
> +
> +struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
> +{
> +	struct vm_area_struct *vma;
> +
> +	/* Check the cache first. */
> +	vma = vmacache_find(mm, addr);
> +	if (likely(vma))
> +		return vma;
> +
> +	vma = __find_vma(mm, addr);
>  	if (vma)
>  		vmacache_update(addr, vma);
>  	return vma;
>  }
> -
>  EXPORT_SYMBOL(find_vma);
> 
> +struct vm_area_struct *find_vma_srcu(struct mm_struct *mm, unsigned long addr)
> +{
> +	struct vm_area_struct *vma;
> +	unsigned int seq;
> +
> +	WARN_ON_ONCE(!srcu_read_lock_held(&vma_srcu));
> +
> +	do {
> +		seq = read_seqbegin(&mm->mm_seq);
> +		vma = __find_vma(mm, addr);
> +	} while (read_seqretry(&mm->mm_seq, seq));
> +
> +	return vma;
> +}
> +
>  /*
>   * Same as find_vma, but also return a pointer to the previous VMA in *pprev.
>   */
> @@ -2490,7 +2538,7 @@ detach_vmas_to_be_unmapped(struct mm_struct *mm, struct vm_area_struct *vma,
>  	insertion_point = (prev ? &prev->vm_next : &mm->mmap);
>  	vma->vm_prev = NULL;
>  	do {
> -		vma_rb_erase(vma, &mm->mm_rb);
> +		vma_rb_erase(vma, mm);
>  		mm->map_count--;
>  		tail_vma = vma;
>  		vma = vma->vm_next;
> -- 
> 2.7.4
> 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC v3 03/17] mm: Introduce pte_spinlock
  2017-04-27 15:52   ` Laurent Dufour
@ 2017-04-30  4:47     ` Matthew Wilcox
  -1 siblings, 0 replies; 48+ messages in thread
From: Matthew Wilcox @ 2017-04-30  4:47 UTC (permalink / raw)
  To: Laurent Dufour
  Cc: paulmck, peterz, akpm, kirill, ak, mhocko, dave, jack,
	linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora

On Thu, Apr 27, 2017 at 05:52:42PM +0200, Laurent Dufour wrote:
> +++ b/mm/memory.c
> @@ -2100,6 +2100,13 @@ static inline void wp_page_reuse(struct vm_fault *vmf)
>  	pte_unmap_unlock(vmf->pte, vmf->ptl);
>  }
>  
> +static bool pte_spinlock(struct vm_fault *vmf)
> +{
> +	vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
> +	spin_lock(vmf->ptl);
> +	return true;
> +}

To me 'pte_spinlock' is a noun, but this is really pte_spin_lock() (a verb).

Actually, it's really vmf_lock_pte().  We're locking the pte
referred to by this vmf.  And so we should probably have a matching
vmf_unlock_pte(vmf) to preserve the abstraction.
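
As an illustration only -- these helpers are not in the series -- the
renamed pair could look something like:

static bool vmf_lock_pte(struct vm_fault *vmf)
{
	/* Lock the page table covering the pte this vmf refers to. */
	vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
	spin_lock(vmf->ptl);
	return true;
}

static void vmf_unlock_pte(struct vm_fault *vmf)
{
	/* Matching unlock, so the lock/unlock pair stays symmetric. */
	spin_unlock(vmf->ptl);
}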

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC v3 05/17] RCU free VMAs
  2017-04-27 15:52   ` Laurent Dufour
@ 2017-04-30  4:57     ` Matthew Wilcox
  -1 siblings, 0 replies; 48+ messages in thread
From: Matthew Wilcox @ 2017-04-30  4:57 UTC (permalink / raw)
  To: Laurent Dufour
  Cc: paulmck, peterz, akpm, kirill, ak, mhocko, dave, jack,
	linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora

On Thu, Apr 27, 2017 at 05:52:44PM +0200, Laurent Dufour wrote:
> @@ -359,6 +359,7 @@ struct vm_area_struct {
>  #endif
>  	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
>  	seqcount_t vm_sequence;
> +	struct rcu_head vm_rcu_head;
>  };
>  
>  struct core_thread {

It doesn't look like we examine the contents of the VMA until after we've
checked that the seqlock is good, so we should be able to union virtually
any entry in the VMA with the vm_rcu_head.  vm_next, vm_prev, perhaps?
Or anon_vma_chain since a list_head is the same size as an rcu_head.
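
A rough sketch of that layout, assuming (as noted above) that none of
these fields is read before the sequence check and that nothing touches
anon_vma_chain once the VMA has been unlinked:

	union {
		/* valid while the VMA is live and reachable */
		struct list_head anon_vma_chain;
		/* used only once the VMA has been removed */
		struct rcu_head vm_rcu_head;
	};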

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC v3 05/17] RCU free VMAs
  2017-04-27 15:52   ` Laurent Dufour
@ 2017-04-30  5:05     ` Matthew Wilcox
  -1 siblings, 0 replies; 48+ messages in thread
From: Matthew Wilcox @ 2017-04-30  5:05 UTC (permalink / raw)
  To: Laurent Dufour
  Cc: paulmck, peterz, akpm, kirill, ak, mhocko, dave, jack,
	linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora

On Thu, Apr 27, 2017 at 05:52:44PM +0200, Laurent Dufour wrote:
> +static inline bool vma_is_dead(struct vm_area_struct *vma, unsigned int sequence)
> +{
> +	int ret = RB_EMPTY_NODE(&vma->vm_rb);
> +	unsigned seq = ACCESS_ONCE(vma->vm_sequence.sequence);
> +
> +	/*
> +	 * Matches both the wmb in write_seqlock_{begin,end}() and
> +	 * the wmb in vma_rb_erase().
> +	 */
> +	smp_rmb();
> +
> +	return ret || seq != sequence;
> +}

Hang on, this isn't vma_is_dead().  This is vma_has_changed() (possibly
from live to dead, but also possibly grown or shrunk; see your earlier
patch).
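
Concretely, the rename would keep the body as is -- just a sketch for
clarity:

static inline bool vma_has_changed(struct vm_area_struct *vma,
				   unsigned int sequence)
{
	int ret = RB_EMPTY_NODE(&vma->vm_rb);
	unsigned seq = ACCESS_ONCE(vma->vm_sequence.sequence);

	/* Same barrier pairing as in the original helper quoted above. */
	smp_rmb();

	return ret || seq != sequence;
}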

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC v3 05/17] RCU free VMAs
  2017-04-30  5:05     ` Matthew Wilcox
@ 2017-05-03  7:23       ` Laurent Dufour
  -1 siblings, 0 replies; 48+ messages in thread
From: Laurent Dufour @ 2017-05-03  7:23 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: paulmck, peterz, akpm, kirill, ak, mhocko, dave, jack,
	linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora

On 30/04/2017 07:05, Matthew Wilcox wrote:
> On Thu, Apr 27, 2017 at 05:52:44PM +0200, Laurent Dufour wrote:
>> +static inline bool vma_is_dead(struct vm_area_struct *vma, unsigned int sequence)
>> +{
>> +	int ret = RB_EMPTY_NODE(&vma->vm_rb);
>> +	unsigned seq = ACCESS_ONCE(vma->vm_sequence.sequence);
>> +
>> +	/*
>> +	 * Matches both the wmb in write_seqlock_{begin,end}() and
>> +	 * the wmb in vma_rb_erase().
>> +	 */
>> +	smp_rmb();
>> +
>> +	return ret || seq != sequence;
>> +}
> 
> Hang on, this isn't vma_is_dead().  This is vma_has_changed() (possibly
> from live to dead, but also possibly grown or shrunk; see your earlier
> patch).

This makes sense.

Thanks,
Laurent

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC v3 03/17] mm: Introduce pte_spinlock
  2017-04-30  4:47     ` Matthew Wilcox
@ 2017-05-03 13:01       ` Laurent Dufour
  -1 siblings, 0 replies; 48+ messages in thread
From: Laurent Dufour @ 2017-05-03 13:01 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: paulmck, peterz, akpm, kirill, ak, mhocko, dave, jack,
	linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora

On 30/04/2017 06:47, Matthew Wilcox wrote:
> On Thu, Apr 27, 2017 at 05:52:42PM +0200, Laurent Dufour wrote:
>> +++ b/mm/memory.c
>> @@ -2100,6 +2100,13 @@ static inline void wp_page_reuse(struct vm_fault *vmf)
>>  	pte_unmap_unlock(vmf->pte, vmf->ptl);
>>  }
>>  
>> +static bool pte_spinlock(struct vm_fault *vmf)
>> +{
>> +	vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
>> +	spin_lock(vmf->ptl);
>> +	return true;
>> +}
> 
> To me 'pte_spinlock' is a noun, but this is really pte_spin_lock() (a verb).

Fair enough. pte_trylock() would be even more accurate, since patch 8/17
changes this function to call spin_trylock().

> Actually, it's really vmf_lock_pte().  We're locking the pte
> referred to by this vmf.  And so we should probably have a matching
> vmf_unlock_pte(vmf) to preserve the abstraction.

I'm not sure this will make the code easier to read. In most of this
code, the PTEs are unlocked through a call to pte_unmap_unlock().
pte_trylock() has been introduced because, in a few cases, the VMA's
validity has to be checked before taking the ptl spinlock. The unlock
is then handled through pte_unmap_unlock().
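
To make that concrete, with the pte_trylock() name the speculative path
reads roughly like this (simplified sketch, not the actual diff):

	if (!pte_trylock(vmf))			/* spin_trylock() may fail */
		return VM_FAULT_RETRY;		/* fall back to the classic path */

	/* ... handle the fault under vmf->ptl ... */

	pte_unmap_unlock(vmf->pte, vmf->ptl);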

^ permalink raw reply	[flat|nested] 48+ messages in thread

End of thread (newest message: 2017-05-03 13:01 UTC)

Thread overview: 48+ messages
2017-04-27 15:52 [RFC v3 00/17] Speculative page faults Laurent Dufour
2017-04-27 15:52 ` Laurent Dufour
2017-04-27 15:52 ` [RFC v3 01/17] mm: Dont assume page-table invariance during faults Laurent Dufour
2017-04-27 15:52   ` Laurent Dufour
2017-04-27 15:52 ` [RFC v3 02/17] mm: Prepare for FAULT_FLAG_SPECULATIVE Laurent Dufour
2017-04-27 15:52   ` Laurent Dufour
2017-04-27 15:52 ` [RFC v3 03/17] mm: Introduce pte_spinlock Laurent Dufour
2017-04-27 15:52   ` Laurent Dufour
2017-04-30  4:47   ` Matthew Wilcox
2017-04-30  4:47     ` Matthew Wilcox
2017-05-03 13:01     ` Laurent Dufour
2017-05-03 13:01       ` Laurent Dufour
2017-04-27 15:52 ` [RFC v3 04/17] mm: VMA sequence count Laurent Dufour
2017-04-27 15:52   ` Laurent Dufour
2017-04-27 15:52 ` [RFC v3 05/17] RCU free VMAs Laurent Dufour
2017-04-27 15:52   ` Laurent Dufour
2017-04-27 18:28   ` Paul E. McKenney
2017-04-27 18:28     ` Paul E. McKenney
2017-04-30  4:57   ` Matthew Wilcox
2017-04-30  4:57     ` Matthew Wilcox
2017-04-30  5:05   ` Matthew Wilcox
2017-04-30  5:05     ` Matthew Wilcox
2017-05-03  7:23     ` Laurent Dufour
2017-05-03  7:23       ` Laurent Dufour
2017-04-27 15:52 ` [RFC v3 06/17] mm: Provide speculative fault infrastructure Laurent Dufour
2017-04-27 15:52   ` Laurent Dufour
2017-04-27 15:52 ` [RFC v3 07/17] mm,x86: Add speculative pagefault handling Laurent Dufour
2017-04-27 15:52   ` Laurent Dufour
2017-04-27 15:52 ` [RFC v3 08/17] mm/spf: Try spin lock in speculative path Laurent Dufour
2017-04-27 15:52   ` Laurent Dufour
2017-04-27 15:52 ` [RFC v3 09/17] mm/spf: Fix fe.sequence init in __handle_mm_fault() Laurent Dufour
2017-04-27 15:52   ` Laurent Dufour
2017-04-27 15:52 ` [RFC v3 10/17] mm/spf: don't set fault entry's fields if locking failed Laurent Dufour
2017-04-27 15:52   ` Laurent Dufour
2017-04-27 15:52 ` [RFC v3 11/17] mm/spf; fix lock dependency against mapping->i_mmap_rwsem Laurent Dufour
2017-04-27 15:52   ` Laurent Dufour
2017-04-27 15:52 ` [RFC v3 12/17] mm/spf: Protect changes to vm_flags Laurent Dufour
2017-04-27 15:52   ` Laurent Dufour
2017-04-27 15:52 ` [RFC v3 13/17] mm/spf Protect vm_policy's changes against speculative pf Laurent Dufour
2017-04-27 15:52   ` Laurent Dufour
2017-04-27 15:52 ` [RFC v3 14/17] x86/mm: Update the handle_speculative_fault's path Laurent Dufour
2017-04-27 15:52   ` Laurent Dufour
2017-04-27 15:52 ` [RFC v3 15/17] mm/spf: Add check on the VMA's flags Laurent Dufour
2017-04-27 15:52   ` Laurent Dufour
2017-04-27 15:52 ` [RFC v3 16/17] mm: protect madvise vs speculative pf Laurent Dufour
2017-04-27 15:52   ` Laurent Dufour
2017-04-27 15:52 ` [RFC v3 17/17] mm/spf: protect mremap() against " Laurent Dufour
2017-04-27 15:52   ` Laurent Dufour
