linux-mm.kvack.org archive mirror
* [RFC v5 00/11] Speculative page faults
@ 2017-06-16 17:52 Laurent Dufour
  2017-06-16 17:52 ` [RFC v5 01/11] mm: Dont assume page-table invariance during faults Laurent Dufour
                   ` (11 more replies)
  0 siblings, 12 replies; 37+ messages in thread
From: Laurent Dufour @ 2017-06-16 17:52 UTC (permalink / raw)
  To: paulmck, peterz, akpm, kirill, ak, mhocko, dave, jack, Matthew Wilcox
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora, Tim Chen

This is a port to kernel 4.12 of the work done by Peter Zijlstra to
handle page faults without holding the mm semaphore [1].

The idea is to try to handle user space page faults without holding the
mmap_sem. This should allow better concurrency for massively threaded
processes, since the page fault handler will no longer wait for other
threads' memory layout changes to complete, assuming that those changes
are done in another part of the process's memory space. This type of page
fault is named a speculative page fault. If the speculative page fault
fails because a concurrent change is detected or because the underlying
PMD or PTE tables are not yet allocated, its processing is aborted and a
classic page fault is tried instead.

The speculative page fault (SPF) has to look for the VMA matching the fault
address without holding the mmap_sem, so the VMA list is now managed using
SRCU, allowing lockless walking. The only impact is the deferred file
dereferencing in the case of a file mapping, since the file pointer is
released once the SRCU cleanup is done.  This relies on the change recently
done by Paul McKenney in SRCU, which now runs callbacks per CPU instead of
per SRCU structure [2].
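
As an illustration only, the lockless walking boils down to an SRCU
read-side section around the find_vma_srcu() helper; vma_srcu and
find_vma_srcu() are introduced by patch 07/11 and the real user is
handle_speculative_fault() in patch 08/11 -- the framing below is just a
sketch, not code from the series:

  int idx;
  struct vm_area_struct *vma;

  idx = srcu_read_lock(&vma_srcu);
  vma = find_vma_srcu(mm, address);       /* no mmap_sem held */
  if (vma && address >= vma->vm_start && address < vma->vm_end) {
          /* validate the VMA and try the fault speculatively */
  }
  srcu_read_unlock(&vma_srcu, idx);
  /* vma->vm_file is fput() only once the SRCU grace period has elapsed */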

The VMA's attributes checked during the speculative page fault processing
have to be protected against parallel changes. This is done by using a per
VMA sequence lock. This sequence lock allows the speculative page fault
handler to quickly check for parallel changes in progress and to abort the
speculative page fault in that case.
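
To make this concrete, here is a minimal sketch of the pattern; the write
side mirrors what patches 04/11 and 06/11 do, the read side mirrors the
checks done in patches 07/11 and 08/11 (illustration only, not the exact
code):

  /* Writer: any path modifying a VMA attribute checked by the SPF handler */
  write_seqcount_begin(&vma->vm_sequence);
  vma->vm_flags = newflags;
  vma_set_page_prot(vma);
  write_seqcount_end(&vma->vm_sequence);

  /* Reader: snapshot the count, work speculatively, then re-check */
  unsigned int seq = raw_read_seqcount(&vma->vm_sequence);
  if (seq & 1)
          return VM_FAULT_RETRY;          /* a writer is in progress, abort */
  /* ... use vma->vm_flags, vma->vm_page_prot, vma bounds ... */
  if (read_seqcount_retry(&vma->vm_sequence, seq))
          return VM_FAULT_RETRY;          /* the VMA changed under us, abort */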

Once the VMA is found, the speculative page fault handler checks the VMA's
attributes to verify that the page fault can be correctly handled there.
The VMA is protected through a sequence lock which allows fast detection of
concurrent VMA changes. If such a change is detected, the speculative page
fault is aborted and a *classic* page fault is tried instead.  VMA sequence
locking is added wherever a VMA attribute which is checked during the page
fault is modified.

When the PTE is fetched, the VMA is checked to see whether it has changed,
so once the page table is locked the VMA is known to be valid. Any other
change touching this PTE would have to take the page table lock first, so
no parallel change is possible at this point.
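
Expressed as code, that re-validation is roughly the following, distilled
from the pte_map_lock() helper added in patch 08/11 (the 'abort' label is
my shorthand for re-enabling IRQs and returning VM_FAULT_RETRY):

  local_irq_disable();                    /* keeps the page-tables around */
  if (vma_has_changed(vmf->vma, vmf->sequence))
          goto abort;
  pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address, &ptl);
  if (vma_has_changed(vmf->vma, vmf->sequence)) {
          pte_unmap_unlock(pte, ptl);     /* raced with a VMA change */
          goto abort;
  }
  /* the VMA and the PTE are now stable: install the new PTE under ptl */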

Compared to Peter's initial work, this series introduces a spin_trylock
when dealing with the speculative page fault. This is required to avoid a
deadlock when handling a page fault while a TLB invalidation is requested
by another CPU holding the PTE lock. Another change is due to a lock
dependency issue with mapping->i_mmap_rwsem.
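
Patch 09/11 is not quoted in this extract; as a rough sketch (my guess at
its shape, not the actual hunk), the speculative variants of pte_spinlock()
and pte_map_lock() stop spinning on the PTE lock and bail out instead, so
we never spin while the lock holder may be waiting for our TLB-invalidation
acknowledgement:

  vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
  if (vmf->flags & FAULT_FLAG_SPECULATIVE) {
          if (!spin_trylock(vmf->ptl))
                  return false;           /* fall back to a classic fault */
  } else {
          spin_lock(vmf->ptl);
  }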

This series builds on top of v4.12-rc5 and is functional on x86 and
PowerPC.

Tests have been made using a large commercial in-memory database on a
PowerPC system with 752 CPUs. The results are very encouraging since the
loading of the 2TB database was 14% faster with the speculative page
fault handling.

However, tests done using multi-fault [3] or kernbench [4] on smaller
systems didn't show performance improvements. I saw a little degradation,
but running the tests again showed that this is in the noise. So nothing
significant either way.

Since the benchmarks are encouraging and running the test suites didn't
raise any issues, I'd like this request-for-comments series to move to a
patch series soon. So please comment.

------------------
Benchmark results

Here are the results on an 8-CPU x86 guest using kernbench on a 4.12-rc5
kernel (the kernel is built 5 times):

Average Half load -j 4 Run (std deviation):
		 4.12.0-rc5		4.12.0-rc5-spf
		 Run (std deviation)
Elapsed Time     48.42 (0.334515)       48.638 (0.344848)
User Time        124.322 (0.964324)     124.478 (0.659902)
System Time      58.008 (0.300865)      58.664 (0.590999)
Percent CPU 	 376.2 (1.09545)        376.4 (1.51658)
Context Switches 7409.6 (215.18)        11022.8 (281.093)
Sleeps           15255.8 (63.0254)      15250.8 (43.4592)

Average Optimal load -j 8
 		 4.12.0-rc5		4.12.0-rc5-spf
                 Run (std deviation)
Elapsed Time     24.268 (0.151723)      24.514 (0.143805)
User Time        112.092 (12.9135)      112.04 (13.1257)
System Time      49.03 (9.46999)        49.721 (9.44455)
Percent CPU      476 (105.205)          474.3 (103.209)
Context Switches 10268.7 (3020.16)      14069.2 (3219.98)
Sleeps           15790.8 (568.885)      15829.4 (615.371)

Average Maximal load -j
 		 4.12.0-rc5		4.12.0-rc5-spf
                 Run (std deviation)
Elapsed Time     25.042 (0.237844)      25.216 (0.201941)
User Time        110.19 (10.7245)       110.312 (10.8245)
System Time      45.9113 (8.86119)      46.48 (8.93778)
Percent CPU      511.533 (99.1376)      510.133 (97.9897)
Context Switches 19521.1 (13759.8)      22354.1 (12400)
Sleeps           15514.7 (609.76)       15521.2 (670.054)

The elapsed time is of the same order, a bit larger in the case of the SPF
series, but that seems to be within the error margin.

Here are the kernbench results on a 752-CPU Power8 system:

Average Half load -j 376
 		 4.12.0-rc5		4.12.0-rc5-spf
                 Run (std deviation)
Elapsed Time     3.384 (0.0680441)      3.344 (0.0634823)
User Time        203.998 (8.41125)      193.476 (8.23406)
System Time      13.064 (0.624444)      12.028 (0.495954)
Percent CPU      6407 (285.422)         6136.2 (198.173)
Context Switches 7319.2 (517.785)       8960 (221.735)
Sleeps           24287.8 (861.132)      22902.4 (728.475)

Average Optimal load -j 752
 		 4.12.0-rc5		4.12.0-rc5-spf
                 Run (std deviation)
Elapsed Time     3.414 (0.136858)       3.432 (0.0506952)
User Time        200.985 (8.71172)      197.747 (8.9511)
System Time      12.903 (0.638262)      12.472 (0.684865)
Percent CPU      6287.9 (322.208)       6194.8 (192.116)
Context Switches 7173.5 (479.038)       9355.7 (712.3)
Sleeps           24241.6 (1003.66)      22867.5 (1242.49)

Average Maximal load -j
 		 4.12.0-rc5		4.12.0-rc5-spf
                 Run (std deviation)
Elapsed Time     3.422 (0.0791833)      3.312 (0.109864)
User Time        202.096 (7.45845)      197.541 (9.42758)
System Time      12.8733 (0.57327)      12.4567 (0.568465)
Percent CPU      6304.87 (278.195)      6234.67 (204.769)
Context Switches 7166 (412.524)         9398.73 (639.917)
Sleeps           24065.6 (1132.3)       22822.8 (1176.71)

Here the elapsed time is a bit shorter with the SPF series, but again this
stays within the error margin.

Here are the results using multi-fault:

--- x86 8 CPUs
                Page faults in 60s
4.12.0-rc5      23,014,776
4.12-0-rc5-spf  23,224,435

--- ppc64le 752 CPUs
                Page faults in 60s
4.12.0-rc5      28,087,752
4.12-0-rc5-spf  32,272,610

Results are a bit higher on ppc64le with the SPF patches, but I'm not
convinced by this test on Power8 since the page tables are managed
differently on this architecture; I'm wondering whether we are hitting the
PTE lock. I ran the test multiple times; the numbers vary a bit but remain
of the same order.

------------------
Changes since V4:
 - merge several patches to reduce the series, as requested by Jan Kara
 - check every warning comment in the code and remove each of them
 - reword some patch descriptions

Changes since V3:
 - support for 5-level paging
 - abort the speculative path before entering the userfault code
 - support for the PowerPC architecture
 - reorder the patches to fix build test errors

[1] http://linux-kernel.2935.n7.nabble.com/RFC-PATCH-0-6-Another-go-at-speculative-page-faults-tt965642.html#none
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=da915ad5cf25b5f5d358dd3670c3378d8ae8c03e
[3] https://lkml.org/lkml/2010/1/6/28
[4] http://ck.kolivas.org/apps/kernbench/kernbench-0.50/

Laurent Dufour (5):
  mm: Introduce pte_spinlock for FAULT_FLAG_SPECULATIVE
  mm: fix lock dependency against mapping->i_mmap_rwsem
  mm: Protect VMA modifications using VMA sequence count
  mm: Try spin lock in speculative path
  powerpc/mm: Add speculative page fault

Peter Zijlstra (6):
  mm: Dont assume page-table invariance during faults
  mm: Prepare for FAULT_FLAG_SPECULATIVE
  mm: VMA sequence count
  mm: RCU free VMAs
  mm: Provide speculative fault infrastructure
  x86/mm: Add speculative pagefault handling

 arch/powerpc/mm/fault.c  |  25 ++++-
 arch/x86/mm/fault.c      |  14 +++
 fs/proc/task_mmu.c       |   2 +
 include/linux/mm.h       |   4 +
 include/linux/mm_types.h |   3 +
 kernel/fork.c            |   1 +
 mm/init-mm.c             |   1 +
 mm/internal.h            |  20 ++++
 mm/madvise.c             |   4 +
 mm/memory.c              | 286 +++++++++++++++++++++++++++++++++++++++--------
 mm/mempolicy.c           |  10 +-
 mm/mlock.c               |   9 +-
 mm/mmap.c                | 123 +++++++++++++++-----
 mm/mprotect.c            |   2 +
 mm/mremap.c              |   7 ++
 15 files changed, 430 insertions(+), 81 deletions(-)

-- 
2.7.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [RFC v5 01/11] mm: Dont assume page-table invariance during faults
  2017-06-16 17:52 [RFC v5 00/11] Speculative page faults Laurent Dufour
@ 2017-06-16 17:52 ` Laurent Dufour
  2017-07-07  7:07   ` Balbir Singh
  2017-08-08  9:45   ` Anshuman Khandual
  2017-06-16 17:52 ` [RFC v5 02/11] mm: Prepare for FAULT_FLAG_SPECULATIVE Laurent Dufour
                   ` (10 subsequent siblings)
  11 siblings, 2 replies; 37+ messages in thread
From: Laurent Dufour @ 2017-06-16 17:52 UTC (permalink / raw)
  To: paulmck, peterz, akpm, kirill, ak, mhocko, dave, jack, Matthew Wilcox
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora, Tim Chen

From: Peter Zijlstra <peterz@infradead.org>

One of the side effects of speculating on faults (without holding
mmap_sem) is that we can race with free_pgtables() and therefore we
cannot assume the page-tables will stick around.

Remove the reliance on the pte pointer.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 mm/memory.c | 27 ---------------------------
 1 file changed, 27 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 2e65df1831d9..fd952f05e016 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2102,30 +2102,6 @@ int apply_to_page_range(struct mm_struct *mm, unsigned long addr,
 }
 EXPORT_SYMBOL_GPL(apply_to_page_range);
 
-/*
- * handle_pte_fault chooses page fault handler according to an entry which was
- * read non-atomically.  Before making any commitment, on those architectures
- * or configurations (e.g. i386 with PAE) which might give a mix of unmatched
- * parts, do_swap_page must check under lock before unmapping the pte and
- * proceeding (but do_wp_page is only called after already making such a check;
- * and do_anonymous_page can safely check later on).
- */
-static inline int pte_unmap_same(struct mm_struct *mm, pmd_t *pmd,
-				pte_t *page_table, pte_t orig_pte)
-{
-	int same = 1;
-#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT)
-	if (sizeof(pte_t) > sizeof(unsigned long)) {
-		spinlock_t *ptl = pte_lockptr(mm, pmd);
-		spin_lock(ptl);
-		same = pte_same(*page_table, orig_pte);
-		spin_unlock(ptl);
-	}
-#endif
-	pte_unmap(page_table);
-	return same;
-}
-
 static inline void cow_user_page(struct page *dst, struct page *src, unsigned long va, struct vm_area_struct *vma)
 {
 	debug_dma_assert_idle(src);
@@ -2682,9 +2658,6 @@ int do_swap_page(struct vm_fault *vmf)
 	int exclusive = 0;
 	int ret = 0;
 
-	if (!pte_unmap_same(vma->vm_mm, vmf->pmd, vmf->pte, vmf->orig_pte))
-		goto out;
-
 	entry = pte_to_swp_entry(vmf->orig_pte);
 	if (unlikely(non_swap_entry(entry))) {
 		if (is_migration_entry(entry)) {
-- 
2.7.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [RFC v5 02/11] mm: Prepare for FAULT_FLAG_SPECULATIVE
  2017-06-16 17:52 [RFC v5 00/11] Speculative page faults Laurent Dufour
  2017-06-16 17:52 ` [RFC v5 01/11] mm: Dont assume page-table invariance during faults Laurent Dufour
@ 2017-06-16 17:52 ` Laurent Dufour
  2017-08-08 10:24   ` Anshuman Khandual
  2017-06-16 17:52 ` [RFC v5 03/11] mm: Introduce pte_spinlock " Laurent Dufour
                   ` (9 subsequent siblings)
  11 siblings, 1 reply; 37+ messages in thread
From: Laurent Dufour @ 2017-06-16 17:52 UTC (permalink / raw)
  To: paulmck, peterz, akpm, kirill, ak, mhocko, dave, jack, Matthew Wilcox
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora, Tim Chen

From: Peter Zijlstra <peterz@infradead.org>

When speculating faults (without holding mmap_sem) we need to validate
that the vma against which we loaded pages is still valid when we're
ready to install the new PTE.

Therefore, replace the pte_offset_map_lock() calls that (re)take the
PTL with pte_map_lock() which can fail in case we find the VMA changed
since we started the fault.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

[Port to 4.12 kernel]
[Remove the comment about the fault_env structure which has been
 implemented as the vm_fault structure in the kernel]
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
---
 include/linux/mm.h |  1 +
 mm/memory.c        | 55 ++++++++++++++++++++++++++++++++++++++----------------
 2 files changed, 40 insertions(+), 16 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index b892e95d4929..6b7ec2a76953 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -286,6 +286,7 @@ extern pgprot_t protection_map[16];
 #define FAULT_FLAG_USER		0x40	/* The fault originated in userspace */
 #define FAULT_FLAG_REMOTE	0x80	/* faulting for non current tsk/mm */
 #define FAULT_FLAG_INSTRUCTION  0x100	/* The fault was during an instruction fetch */
+#define FAULT_FLAG_SPECULATIVE	0x200	/* Speculative fault, not holding mmap_sem */
 
 #define FAULT_FLAG_TRACE \
 	{ FAULT_FLAG_WRITE,		"WRITE" }, \
diff --git a/mm/memory.c b/mm/memory.c
index fd952f05e016..40834444ea0d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2240,6 +2240,12 @@ static inline void wp_page_reuse(struct vm_fault *vmf)
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
 }
 
+static bool pte_map_lock(struct vm_fault *vmf)
+{
+	vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl);
+	return true;
+}
+
 /*
  * Handle the case of a page which we actually need to copy to a new page.
  *
@@ -2267,6 +2273,7 @@ static int wp_page_copy(struct vm_fault *vmf)
 	const unsigned long mmun_start = vmf->address & PAGE_MASK;
 	const unsigned long mmun_end = mmun_start + PAGE_SIZE;
 	struct mem_cgroup *memcg;
+	int ret = VM_FAULT_OOM;
 
 	if (unlikely(anon_vma_prepare(vma)))
 		goto oom;
@@ -2294,7 +2301,11 @@ static int wp_page_copy(struct vm_fault *vmf)
 	/*
 	 * Re-check the pte - we dropped the lock
 	 */
-	vmf->pte = pte_offset_map_lock(mm, vmf->pmd, vmf->address, &vmf->ptl);
+	if (!pte_map_lock(vmf)) {
+		mem_cgroup_cancel_charge(new_page, memcg, false);
+		ret = VM_FAULT_RETRY;
+		goto oom_free_new;
+	}
 	if (likely(pte_same(*vmf->pte, vmf->orig_pte))) {
 		if (old_page) {
 			if (!PageAnon(old_page)) {
@@ -2382,7 +2393,7 @@ static int wp_page_copy(struct vm_fault *vmf)
 oom:
 	if (old_page)
 		put_page(old_page);
-	return VM_FAULT_OOM;
+	return ret;
 }
 
 /**
@@ -2403,8 +2414,8 @@ static int wp_page_copy(struct vm_fault *vmf)
 int finish_mkwrite_fault(struct vm_fault *vmf)
 {
 	WARN_ON_ONCE(!(vmf->vma->vm_flags & VM_SHARED));
-	vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address,
-				       &vmf->ptl);
+	if (!pte_map_lock(vmf))
+		return VM_FAULT_RETRY;
 	/*
 	 * We might have raced with another page fault while we released the
 	 * pte_offset_map_lock.
@@ -2522,8 +2533,11 @@ static int do_wp_page(struct vm_fault *vmf)
 			get_page(vmf->page);
 			pte_unmap_unlock(vmf->pte, vmf->ptl);
 			lock_page(vmf->page);
-			vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
-					vmf->address, &vmf->ptl);
+			if (!pte_map_lock(vmf)) {
+				unlock_page(vmf->page);
+				put_page(vmf->page);
+				return VM_FAULT_RETRY;
+			}
 			if (!pte_same(*vmf->pte, vmf->orig_pte)) {
 				unlock_page(vmf->page);
 				pte_unmap_unlock(vmf->pte, vmf->ptl);
@@ -2681,8 +2695,10 @@ int do_swap_page(struct vm_fault *vmf)
 			 * Back out if somebody else faulted in this pte
 			 * while we released the pte lock.
 			 */
-			vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
-					vmf->address, &vmf->ptl);
+			if (!pte_map_lock(vmf)) {
+				delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
+				return VM_FAULT_RETRY;
+			}
 			if (likely(pte_same(*vmf->pte, vmf->orig_pte)))
 				ret = VM_FAULT_OOM;
 			delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
@@ -2738,8 +2754,11 @@ int do_swap_page(struct vm_fault *vmf)
 	/*
 	 * Back out if somebody else already faulted in this pte.
 	 */
-	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
-			&vmf->ptl);
+	if (!pte_map_lock(vmf)) {
+		ret = VM_FAULT_RETRY;
+		mem_cgroup_cancel_charge(page, memcg, false);
+		goto out_page;
+	}
 	if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte)))
 		goto out_nomap;
 
@@ -2903,8 +2922,8 @@ static int do_anonymous_page(struct vm_fault *vmf)
 			!mm_forbids_zeropage(vma->vm_mm)) {
 		entry = pte_mkspecial(pfn_pte(my_zero_pfn(vmf->address),
 						vma->vm_page_prot));
-		vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
-				vmf->address, &vmf->ptl);
+		if (!pte_map_lock(vmf))
+			return VM_FAULT_RETRY;
 		if (!pte_none(*vmf->pte))
 			goto unlock;
 		/* Deliver the page fault to userland, check inside PT lock */
@@ -2936,8 +2955,11 @@ static int do_anonymous_page(struct vm_fault *vmf)
 	if (vma->vm_flags & VM_WRITE)
 		entry = pte_mkwrite(pte_mkdirty(entry));
 
-	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
-			&vmf->ptl);
+	if (!pte_map_lock(vmf)) {
+		mem_cgroup_cancel_charge(page, memcg, false);
+		put_page(page);
+		return VM_FAULT_RETRY;
+	}
 	if (!pte_none(*vmf->pte))
 		goto release;
 
@@ -3057,8 +3079,9 @@ static int pte_alloc_one_map(struct vm_fault *vmf)
 	 * pte_none() under vmf->ptl protection when we return to
 	 * alloc_set_pte().
 	 */
-	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
-			&vmf->ptl);
+	if (!pte_map_lock(vmf))
+		return VM_FAULT_RETRY;
+
 	return 0;
 }
 
-- 
2.7.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [RFC v5 03/11] mm: Introduce pte_spinlock for FAULT_FLAG_SPECULATIVE
  2017-06-16 17:52 [RFC v5 00/11] Speculative page faults Laurent Dufour
  2017-06-16 17:52 ` [RFC v5 01/11] mm: Dont assume page-table invariance during faults Laurent Dufour
  2017-06-16 17:52 ` [RFC v5 02/11] mm: Prepare for FAULT_FLAG_SPECULATIVE Laurent Dufour
@ 2017-06-16 17:52 ` Laurent Dufour
  2017-08-08 10:35   ` Anshuman Khandual
  2017-06-16 17:52 ` [RFC v5 04/11] mm: VMA sequence count Laurent Dufour
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 37+ messages in thread
From: Laurent Dufour @ 2017-06-16 17:52 UTC (permalink / raw)
  To: paulmck, peterz, akpm, kirill, ak, mhocko, dave, jack, Matthew Wilcox
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora, Tim Chen

When handling a page fault without holding the mmap_sem, the fetch of the
pte lock pointer and the locking itself have to be done while ensuring
that the VMA is not modified behind our back.

So move the fetch and locking operations into a dedicated function.

Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
---
 mm/memory.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 40834444ea0d..f1132f7931ef 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2240,6 +2240,13 @@ static inline void wp_page_reuse(struct vm_fault *vmf)
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
 }
 
+static bool pte_spinlock(struct vm_fault *vmf)
+{
+	vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
+	spin_lock(vmf->ptl);
+	return true;
+}
+
 static bool pte_map_lock(struct vm_fault *vmf)
 {
 	vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl);
@@ -3552,8 +3559,8 @@ static int do_numa_page(struct vm_fault *vmf)
 	 * validation through pte_unmap_same(). It's of NUMA type but
 	 * the pfn may be screwed if the read is non atomic.
 	 */
-	vmf->ptl = pte_lockptr(vma->vm_mm, vmf->pmd);
-	spin_lock(vmf->ptl);
+	if (!pte_spinlock(vmf))
+		return VM_FAULT_RETRY;
 	if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
 		goto out;
@@ -3745,8 +3752,8 @@ static int handle_pte_fault(struct vm_fault *vmf)
 	if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
 		return do_numa_page(vmf);
 
-	vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
-	spin_lock(vmf->ptl);
+	if (!pte_spinlock(vmf))
+		return VM_FAULT_RETRY;
 	entry = vmf->orig_pte;
 	if (unlikely(!pte_same(*vmf->pte, entry)))
 		goto unlock;
-- 
2.7.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [RFC v5 04/11] mm: VMA sequence count
  2017-06-16 17:52 [RFC v5 00/11] Speculative page faults Laurent Dufour
                   ` (2 preceding siblings ...)
  2017-06-16 17:52 ` [RFC v5 03/11] mm: Introduce pte_spinlock " Laurent Dufour
@ 2017-06-16 17:52 ` Laurent Dufour
  2017-08-08 10:59   ` Anshuman Khandual
  2017-06-16 17:52 ` [RFC v5 05/11] mm: fix lock dependency against mapping->i_mmap_rwsem Laurent Dufour
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 37+ messages in thread
From: Laurent Dufour @ 2017-06-16 17:52 UTC (permalink / raw)
  To: paulmck, peterz, akpm, kirill, ak, mhocko, dave, jack, Matthew Wilcox
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora, Tim Chen

From: Peter Zijlstra <peterz@infradead.org>

Wrap the VMA modifications (vma_adjust/unmap_page_range) with sequence
counts such that we can easily test if a VMA is changed.

The unmap_page_range() one allows us to make assumptions about
page-tables; when we find the seqcount hasn't changed we can assume
page-tables are still valid.

The flip side is that we cannot distinguish between a vma_adjust() and
the unmap_page_range() -- where with the former we could have
re-checked the vma bounds against the address.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

[port to 4.12 kernel]
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
---
 include/linux/mm_types.h |  1 +
 mm/memory.c              |  2 ++
 mm/mmap.c                | 13 +++++++++++++
 3 files changed, 16 insertions(+)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 45cdb27791a3..8945743e4609 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -342,6 +342,7 @@ struct vm_area_struct {
 	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
 #endif
 	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
+	seqcount_t vm_sequence;
 };
 
 struct core_thread {
diff --git a/mm/memory.c b/mm/memory.c
index f1132f7931ef..5d259cd67a83 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1379,6 +1379,7 @@ void unmap_page_range(struct mmu_gather *tlb,
 	unsigned long next;
 
 	BUG_ON(addr >= end);
+	write_seqcount_begin(&vma->vm_sequence);
 	tlb_start_vma(tlb, vma);
 	pgd = pgd_offset(vma->vm_mm, addr);
 	do {
@@ -1388,6 +1389,7 @@ void unmap_page_range(struct mmu_gather *tlb,
 		next = zap_p4d_range(tlb, vma, pgd, addr, next, details);
 	} while (pgd++, addr = next, addr != end);
 	tlb_end_vma(tlb, vma);
+	write_seqcount_end(&vma->vm_sequence);
 }
 
 
diff --git a/mm/mmap.c b/mm/mmap.c
index f82741e199c0..9f86356d0012 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -543,6 +543,8 @@ void __vma_link_rb(struct mm_struct *mm, struct vm_area_struct *vma,
 	else
 		mm->highest_vm_end = vma->vm_end;
 
+	seqcount_init(&vma->vm_sequence);
+
 	/*
 	 * vma->vm_prev wasn't known when we followed the rbtree to find the
 	 * correct insertion point for that vma. As a result, we could not
@@ -677,6 +679,10 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 	long adjust_next = 0;
 	int remove_next = 0;
 
+	write_seqcount_begin(&vma->vm_sequence);
+	if (next)
+		write_seqcount_begin_nested(&next->vm_sequence, SINGLE_DEPTH_NESTING);
+
 	if (next && !insert) {
 		struct vm_area_struct *exporter = NULL, *importer = NULL;
 
@@ -888,6 +894,7 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 		mm->map_count--;
 		mpol_put(vma_policy(next));
 		kmem_cache_free(vm_area_cachep, next);
+		write_seqcount_end(&next->vm_sequence);
 		/*
 		 * In mprotect's case 6 (see comments on vma_merge),
 		 * we must remove another next too. It would clutter
@@ -901,6 +908,8 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 			 * "vma->vm_next" gap must be updated.
 			 */
 			next = vma->vm_next;
+			if (next)
+				write_seqcount_begin_nested(&next->vm_sequence, SINGLE_DEPTH_NESTING);
 		} else {
 			/*
 			 * For the scope of the comment "next" and
@@ -947,6 +956,10 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 	if (insert && file)
 		uprobe_mmap(insert);
 
+	if (next)
+		write_seqcount_end(&next->vm_sequence);
+	write_seqcount_end(&vma->vm_sequence);
+
 	validate_mm(mm);
 
 	return 0;
-- 
2.7.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [RFC v5 05/11] mm: fix lock dependency against mapping->i_mmap_rwsem
  2017-06-16 17:52 [RFC v5 00/11] Speculative page faults Laurent Dufour
                   ` (3 preceding siblings ...)
  2017-06-16 17:52 ` [RFC v5 04/11] mm: VMA sequence count Laurent Dufour
@ 2017-06-16 17:52 ` Laurent Dufour
  2017-08-08 11:17   ` Anshuman Khandual
  2017-06-16 17:52 ` [RFC v5 06/11] mm: Protect VMA modifications using VMA sequence count Laurent Dufour
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 37+ messages in thread
From: Laurent Dufour @ 2017-06-16 17:52 UTC (permalink / raw)
  To: paulmck, peterz, akpm, kirill, ak, mhocko, dave, jack, Matthew Wilcox
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora, Tim Chen

kworker/32:1/819 is trying to acquire lock:
 (&vma->vm_sequence){+.+...}, at: [<c0000000002f20e0>]
zap_page_range_single+0xd0/0x1a0

but task is already holding lock:
 (&mapping->i_mmap_rwsem){++++..}, at: [<c0000000002f229c>]
unmap_mapping_range+0x7c/0x160

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #2 (&mapping->i_mmap_rwsem){++++..}:
       down_write+0x84/0x130
       __vma_adjust+0x1f4/0xa80
       __split_vma.isra.2+0x174/0x290
       do_munmap+0x13c/0x4e0
       vm_munmap+0x64/0xb0
       elf_map+0x11c/0x130
       load_elf_binary+0x6f0/0x15f0
       search_binary_handler+0xe0/0x2a0
       do_execveat_common.isra.14+0x7fc/0xbe0
       call_usermodehelper_exec_async+0x14c/0x1d0
       ret_from_kernel_thread+0x5c/0x68

-> #1 (&vma->vm_sequence/1){+.+...}:
       __vma_adjust+0x124/0xa80
       __split_vma.isra.2+0x174/0x290
       do_munmap+0x13c/0x4e0
       vm_munmap+0x64/0xb0
       elf_map+0x11c/0x130
       load_elf_binary+0x6f0/0x15f0
       search_binary_handler+0xe0/0x2a0
       do_execveat_common.isra.14+0x7fc/0xbe0
       call_usermodehelper_exec_async+0x14c/0x1d0
       ret_from_kernel_thread+0x5c/0x68

-> #0 (&vma->vm_sequence){+.+...}:
       lock_acquire+0xf4/0x310
       unmap_page_range+0xcc/0xfa0
       zap_page_range_single+0xd0/0x1a0
       unmap_mapping_range+0x138/0x160
       truncate_pagecache+0x50/0xa0
       put_aio_ring_file+0x48/0xb0
       aio_free_ring+0x40/0x1b0
       free_ioctx+0x38/0xc0
       process_one_work+0x2cc/0x8a0
       worker_thread+0xac/0x580
       kthread+0x164/0x1b0
       ret_from_kernel_thread+0x5c/0x68

other info that might help us debug this:

Chain exists of:
  &vma->vm_sequence --> &vma->vm_sequence/1 --> &mapping->i_mmap_rwsem

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&mapping->i_mmap_rwsem);
                               lock(&vma->vm_sequence/1);
                               lock(&mapping->i_mmap_rwsem);
  lock(&vma->vm_sequence);

 *** DEADLOCK ***

To fix that, we must take the vm_sequence lock after the mapping's
i_mmap_rwsem in __vma_adjust().

Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
---
 mm/mmap.c | 22 ++++++++++++----------
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 9f86356d0012..ad85f210a92c 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -679,10 +679,6 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 	long adjust_next = 0;
 	int remove_next = 0;
 
-	write_seqcount_begin(&vma->vm_sequence);
-	if (next)
-		write_seqcount_begin_nested(&next->vm_sequence, SINGLE_DEPTH_NESTING);
-
 	if (next && !insert) {
 		struct vm_area_struct *exporter = NULL, *importer = NULL;
 
@@ -790,6 +786,11 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 		}
 	}
 
+	write_seqcount_begin(&vma->vm_sequence);
+	if (next)
+		write_seqcount_begin_nested(&next->vm_sequence,
+					    SINGLE_DEPTH_NESTING);
+
 	anon_vma = vma->anon_vma;
 	if (!anon_vma && adjust_next)
 		anon_vma = next->anon_vma;
@@ -908,8 +909,6 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 			 * "vma->vm_next" gap must be updated.
 			 */
 			next = vma->vm_next;
-			if (next)
-				write_seqcount_begin_nested(&next->vm_sequence, SINGLE_DEPTH_NESTING);
 		} else {
 			/*
 			 * For the scope of the comment "next" and
@@ -926,11 +925,14 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 		if (remove_next == 2) {
 			remove_next = 1;
 			end = next->vm_end;
+			write_seqcount_end(&vma->vm_sequence);
 			goto again;
-		}
-		else if (next)
+		} else if (next) {
+			if (next != vma)
+				write_seqcount_begin_nested(&next->vm_sequence,
+							    SINGLE_DEPTH_NESTING);
 			vma_gap_update(next);
-		else {
+		} else {
 			/*
 			 * If remove_next == 2 we obviously can't
 			 * reach this path.
@@ -956,7 +958,7 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 	if (insert && file)
 		uprobe_mmap(insert);
 
-	if (next)
+	if (next && next != vma)
 		write_seqcount_end(&next->vm_sequence);
 	write_seqcount_end(&vma->vm_sequence);
 
-- 
2.7.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [RFC v5 06/11] mm: Protect VMA modifications using VMA sequence count
  2017-06-16 17:52 [RFC v5 00/11] Speculative page faults Laurent Dufour
                   ` (4 preceding siblings ...)
  2017-06-16 17:52 ` [RFC v5 05/11] mm: fix lock dependency against mapping->i_mmap_rwsem Laurent Dufour
@ 2017-06-16 17:52 ` Laurent Dufour
  2017-06-16 17:52 ` [RFC v5 07/11] mm: RCU free VMAs Laurent Dufour
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 37+ messages in thread
From: Laurent Dufour @ 2017-06-16 17:52 UTC (permalink / raw)
  To: paulmck, peterz, akpm, kirill, ak, mhocko, dave, jack, Matthew Wilcox
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora, Tim Chen

The VMA sequence count has been introduced to allow fast detection of
VMA modification when running a page fault handler without holding
the mmap_sem.

This patch provides protection against the VMA modifications done in:
	- madvise()
	- mremap()
	- mpol_rebind_policy()
	- vma_replace_policy()
	- change_prot_numa()
	- mlock(), munlock()
	- mprotect()
	- mmap_region()

Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
---
 fs/proc/task_mmu.c |  2 ++
 mm/madvise.c       |  4 ++++
 mm/mempolicy.c     | 10 +++++++++-
 mm/mlock.c         |  9 ++++++---
 mm/mmap.c          |  2 ++
 mm/mprotect.c      |  2 ++
 mm/mremap.c        |  7 +++++++
 7 files changed, 32 insertions(+), 4 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index f0c8b33d99b1..9bc40620ba39 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1069,8 +1069,10 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 					goto out_mm;
 				}
 				for (vma = mm->mmap; vma; vma = vma->vm_next) {
+					write_seqcount_begin(&vma->vm_sequence);
 					vma->vm_flags &= ~VM_SOFTDIRTY;
 					vma_set_page_prot(vma);
+					write_seqcount_end(&vma->vm_sequence);
 				}
 				downgrade_write(&mm->mmap_sem);
 				break;
diff --git a/mm/madvise.c b/mm/madvise.c
index 25b78ee4fc2c..d1fa6a7ee604 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -172,7 +172,9 @@ static long madvise_behavior(struct vm_area_struct *vma,
 	/*
 	 * vm_flags is protected by the mmap_sem held in write mode.
 	 */
+	write_seqcount_begin(&vma->vm_sequence);
 	vma->vm_flags = new_flags;
+	write_seqcount_end(&vma->vm_sequence);
 out:
 	return error;
 }
@@ -439,9 +441,11 @@ static void madvise_free_page_range(struct mmu_gather *tlb,
 		.private = tlb,
 	};
 
+	write_seqcount_begin(&vma->vm_sequence);
 	tlb_start_vma(tlb, vma);
 	walk_page_range(addr, end, &free_walk);
 	tlb_end_vma(tlb, vma);
+	write_seqcount_end(&vma->vm_sequence);
 }
 
 static int madvise_free_single_vma(struct vm_area_struct *vma,
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 37d0b334bfe9..5e44b3e69a0d 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -447,8 +447,11 @@ void mpol_rebind_mm(struct mm_struct *mm, nodemask_t *new)
 	struct vm_area_struct *vma;
 
 	down_write(&mm->mmap_sem);
-	for (vma = mm->mmap; vma; vma = vma->vm_next)
+	for (vma = mm->mmap; vma; vma = vma->vm_next) {
+		write_seqcount_begin(&vma->vm_sequence);
 		mpol_rebind_policy(vma->vm_policy, new, MPOL_REBIND_ONCE);
+		write_seqcount_end(&vma->vm_sequence);
+	}
 	up_write(&mm->mmap_sem);
 }
 
@@ -606,9 +609,11 @@ unsigned long change_prot_numa(struct vm_area_struct *vma,
 {
 	int nr_updated;
 
+	write_seqcount_begin(&vma->vm_sequence);
 	nr_updated = change_protection(vma, addr, end, PAGE_NONE, 0, 1);
 	if (nr_updated)
 		count_vm_numa_events(NUMA_PTE_UPDATES, nr_updated);
+	write_seqcount_end(&vma->vm_sequence);
 
 	return nr_updated;
 }
@@ -709,6 +714,7 @@ static int vma_replace_policy(struct vm_area_struct *vma,
 	if (IS_ERR(new))
 		return PTR_ERR(new);
 
+	write_seqcount_begin(&vma->vm_sequence);
 	if (vma->vm_ops && vma->vm_ops->set_policy) {
 		err = vma->vm_ops->set_policy(vma, new);
 		if (err)
@@ -717,10 +723,12 @@ static int vma_replace_policy(struct vm_area_struct *vma,
 
 	old = vma->vm_policy;
 	vma->vm_policy = new; /* protected by mmap_sem */
+	write_seqcount_end(&vma->vm_sequence);
 	mpol_put(old);
 
 	return 0;
  err_out:
+	write_seqcount_end(&vma->vm_sequence);
 	mpol_put(new);
 	return err;
 }
diff --git a/mm/mlock.c b/mm/mlock.c
index b562b5523a65..30d9bfc61929 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -438,7 +438,9 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
 void munlock_vma_pages_range(struct vm_area_struct *vma,
 			     unsigned long start, unsigned long end)
 {
+	write_seqcount_begin(&vma->vm_sequence);
 	vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
+	write_seqcount_end(&vma->vm_sequence);
 
 	while (start < end) {
 		struct page *page;
@@ -563,10 +565,11 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
 	 * It's okay if try_to_unmap_one unmaps a page just after we
 	 * set VM_LOCKED, populate_vma_page_range will bring it back.
 	 */
-
-	if (lock)
+	if (lock) {
+		write_seqcount_begin(&vma->vm_sequence);
 		vma->vm_flags = newflags;
-	else
+		write_seqcount_end(&vma->vm_sequence);
+	} else
 		munlock_vma_pages_range(vma, start, end);
 
 out:
diff --git a/mm/mmap.c b/mm/mmap.c
index ad85f210a92c..b48bbe6a49c6 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1720,6 +1720,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 out:
 	perf_event_mmap(vma);
 
+	write_seqcount_begin(&vma->vm_sequence);
 	vm_stat_account(mm, vm_flags, len >> PAGE_SHIFT);
 	if (vm_flags & VM_LOCKED) {
 		if (!((vm_flags & VM_SPECIAL) || is_vm_hugetlb_page(vma) ||
@@ -1742,6 +1743,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	vma->vm_flags |= VM_SOFTDIRTY;
 
 	vma_set_page_prot(vma);
+	write_seqcount_end(&vma->vm_sequence);
 
 	return addr;
 
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 8edd0d576254..1db5b0bf6952 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -345,6 +345,7 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
 	 * vm_flags and vm_page_prot are protected by the mmap_sem
 	 * held in write mode.
 	 */
+	write_seqcount_begin(&vma->vm_sequence);
 	vma->vm_flags = newflags;
 	dirty_accountable = vma_wants_writenotify(vma, vma->vm_page_prot);
 	vma_set_page_prot(vma);
@@ -360,6 +361,7 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
 			(newflags & VM_WRITE)) {
 		populate_vma_page_range(vma, start, end, NULL);
 	}
+	write_seqcount_end(&vma->vm_sequence);
 
 	vm_stat_account(mm, oldflags, -nrpages);
 	vm_stat_account(mm, newflags, nrpages);
diff --git a/mm/mremap.c b/mm/mremap.c
index cd8a1b199ef9..9c7f69c9e80f 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -300,6 +300,10 @@ static unsigned long move_vma(struct vm_area_struct *vma,
 	if (!new_vma)
 		return -ENOMEM;
 
+	write_seqcount_begin(&vma->vm_sequence);
+	write_seqcount_begin_nested(&new_vma->vm_sequence,
+				    SINGLE_DEPTH_NESTING);
+
 	moved_len = move_page_tables(vma, old_addr, new_vma, new_addr, old_len,
 				     need_rmap_locks);
 	if (moved_len < old_len) {
@@ -316,6 +320,7 @@ static unsigned long move_vma(struct vm_area_struct *vma,
 		 */
 		move_page_tables(new_vma, new_addr, vma, old_addr, moved_len,
 				 true);
+		write_seqcount_end(&vma->vm_sequence);
 		vma = new_vma;
 		old_len = new_len;
 		old_addr = new_addr;
@@ -324,7 +329,9 @@ static unsigned long move_vma(struct vm_area_struct *vma,
 		mremap_userfaultfd_prep(new_vma, uf);
 		arch_remap(mm, old_addr, old_addr + old_len,
 			   new_addr, new_addr + new_len);
+		write_seqcount_end(&vma->vm_sequence);
 	}
+	write_seqcount_end(&new_vma->vm_sequence);
 
 	/* Conceal VM_ACCOUNT so old reservation is not undone */
 	if (vm_flags & VM_ACCOUNT) {
-- 
2.7.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [RFC v5 07/11] mm: RCU free VMAs
  2017-06-16 17:52 [RFC v5 00/11] Speculative page faults Laurent Dufour
                   ` (5 preceding siblings ...)
  2017-06-16 17:52 ` [RFC v5 06/11] mm: Protect VMA modifications using VMA sequence count Laurent Dufour
@ 2017-06-16 17:52 ` Laurent Dufour
  2017-06-16 17:52 ` [RFC v5 08/11] mm: Provide speculative fault infrastructure Laurent Dufour
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 37+ messages in thread
From: Laurent Dufour @ 2017-06-16 17:52 UTC (permalink / raw)
  To: paulmck, peterz, akpm, kirill, ak, mhocko, dave, jack, Matthew Wilcox
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora, Tim Chen

From: Peter Zijlstra <peterz@infradead.org>

Manage the VMAs with SRCU such that we can do a lockless VMA lookup.

We put the fput(vma->vm_file) in the SRCU callback; this keeps files
valid during speculative faults and is possible thanks to the delayed
fput work by Al Viro -- do we need srcu_barrier() in unmount
someplace?

We guard the mm_rb tree with a seqlock (this could be a seqcount but
we'd have to disable preemption around the write side in order to make
the retry loop in __read_seqcount_begin() work) such that we can know
if the rb tree walk was correct. We cannot trust the result of a
lockless tree walk in the face of concurrent tree rotations; we can,
however, trust the termination of such walks -- tree rotations
guarantee the end result is a tree again after all.

Furthermore, we rely on the WMB implied by the
write_seqlock/count_begin() to separate the VMA initialization and the
publishing stores, analogous to the RELEASE in rcu_assign_pointer().
We also rely on the RMB from read_seqretry() to separate the vma load
from further loads like the smp_read_barrier_depends() in regular
RCU.

We must not touch the vmacache while doing SRCU lookups as that is not
properly serialized against changes. We update gap information after
publishing the VMA, but A) we don't use that and B) the seqlock
read side would fix that anyhow.

We clear vma->vm_rb for nodes removed from the vma tree such that we
can easily detect such 'dead' nodes; we rely on the WMB from
write_sequnlock() to separate the tree removal and the clearing of the
node.

Provide find_vma_srcu() which wraps the required magic.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

[Remove the warnings in description about the SRCU global lock which
 has been removed now]
[Rename vma_is_dead() to vma_has_changed()]
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
---
 include/linux/mm_types.h |   2 +
 kernel/fork.c            |   1 +
 mm/init-mm.c             |   1 +
 mm/internal.h            |  20 ++++++++++
 mm/mmap.c                | 102 ++++++++++++++++++++++++++++++++++-------------
 5 files changed, 99 insertions(+), 27 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 8945743e4609..7da5e565046c 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -343,6 +343,7 @@ struct vm_area_struct {
 #endif
 	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
 	seqcount_t vm_sequence;
+	struct rcu_head vm_rcu_head;
 };
 
 struct core_thread {
@@ -360,6 +361,7 @@ struct kioctx_table;
 struct mm_struct {
 	struct vm_area_struct *mmap;		/* list of VMAs */
 	struct rb_root mm_rb;
+	seqlock_t mm_seq;
 	u32 vmacache_seqnum;                   /* per-thread vmacache */
 #ifdef CONFIG_MMU
 	unsigned long (*get_unmapped_area) (struct file *filp,
diff --git a/kernel/fork.c b/kernel/fork.c
index e53770d2bf95..e4d335bee820 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -793,6 +793,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	mm->mmap = NULL;
 	mm->mm_rb = RB_ROOT;
 	mm->vmacache_seqnum = 0;
+	seqlock_init(&mm->mm_seq);
 	atomic_set(&mm->mm_users, 1);
 	atomic_set(&mm->mm_count, 1);
 	init_rwsem(&mm->mmap_sem);
diff --git a/mm/init-mm.c b/mm/init-mm.c
index 975e49f00f34..2b1fa061684f 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -16,6 +16,7 @@
 
 struct mm_struct init_mm = {
 	.mm_rb		= RB_ROOT,
+	.mm_seq		= __SEQLOCK_UNLOCKED(init_mm.mm_seq),
 	.pgd		= swapper_pg_dir,
 	.mm_users	= ATOMIC_INIT(2),
 	.mm_count	= ATOMIC_INIT(1),
diff --git a/mm/internal.h b/mm/internal.h
index 0e4f558412fb..3c6041d53310 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -40,6 +40,26 @@ void page_writeback_init(void);
 
 int do_swap_page(struct vm_fault *vmf);
 
+extern struct srcu_struct vma_srcu;
+
+extern struct vm_area_struct *find_vma_srcu(struct mm_struct *mm,
+					    unsigned long addr);
+
+static inline bool vma_has_changed(struct vm_area_struct *vma,
+				   unsigned int sequence)
+{
+	int ret = RB_EMPTY_NODE(&vma->vm_rb);
+	unsigned seq = ACCESS_ONCE(vma->vm_sequence.sequence);
+
+	/*
+	 * Matches both the wmb in write_seqlock_{begin,end}() and
+	 * the wmb in vma_rb_erase().
+	 */
+	smp_rmb();
+
+	return ret || seq != sequence;
+}
+
 void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
 		unsigned long floor, unsigned long ceiling);
 
diff --git a/mm/mmap.c b/mm/mmap.c
index b48bbe6a49c6..c4a1e1aecef3 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -159,6 +159,23 @@ void unlink_file_vma(struct vm_area_struct *vma)
 	}
 }
 
+DEFINE_SRCU(vma_srcu);
+
+static void __free_vma(struct rcu_head *head)
+{
+	struct vm_area_struct *vma =
+		container_of(head, struct vm_area_struct, vm_rcu_head);
+
+	if (vma->vm_file)
+		fput(vma->vm_file);
+	kmem_cache_free(vm_area_cachep, vma);
+}
+
+static void free_vma(struct vm_area_struct *vma)
+{
+	call_srcu(&vma_srcu, &vma->vm_rcu_head, __free_vma);
+}
+
 /*
  * Close a vm structure and free it, returning the next.
  */
@@ -169,10 +186,8 @@ static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
 	might_sleep();
 	if (vma->vm_ops && vma->vm_ops->close)
 		vma->vm_ops->close(vma);
-	if (vma->vm_file)
-		fput(vma->vm_file);
 	mpol_put(vma_policy(vma));
-	kmem_cache_free(vm_area_cachep, vma);
+	free_vma(vma);
 	return next;
 }
 
@@ -396,26 +411,37 @@ static void vma_gap_update(struct vm_area_struct *vma)
 }
 
 static inline void vma_rb_insert(struct vm_area_struct *vma,
-				 struct rb_root *root)
+				 struct mm_struct *mm)
 {
+	struct rb_root *root = &mm->mm_rb;
+
 	/* All rb_subtree_gap values must be consistent prior to insertion */
 	validate_mm_rb(root, NULL);
 
 	rb_insert_augmented(&vma->vm_rb, root, &vma_gap_callbacks);
 }
 
-static void __vma_rb_erase(struct vm_area_struct *vma, struct rb_root *root)
+static void __vma_rb_erase(struct vm_area_struct *vma, struct mm_struct *mm)
 {
+	struct rb_root *root = &mm->mm_rb;
 	/*
 	 * Note rb_erase_augmented is a fairly large inline function,
 	 * so make sure we instantiate it only once with our desired
 	 * augmented rbtree callbacks.
 	 */
+	write_seqlock(&mm->mm_seq);
 	rb_erase_augmented(&vma->vm_rb, root, &vma_gap_callbacks);
+	write_sequnlock(&mm->mm_seq); /* wmb */
+
+	/*
+	 * Ensure the removal is complete before clearing the node.
+	 * Matched by vma_has_changed()/handle_speculative_fault().
+	 */
+	RB_CLEAR_NODE(&vma->vm_rb);
 }
 
 static __always_inline void vma_rb_erase_ignore(struct vm_area_struct *vma,
-						struct rb_root *root,
+						struct mm_struct *mm,
 						struct vm_area_struct *ignore)
 {
 	/*
@@ -423,21 +449,21 @@ static __always_inline void vma_rb_erase_ignore(struct vm_area_struct *vma,
 	 * with the possible exception of the "next" vma being erased if
 	 * next->vm_start was reduced.
 	 */
-	validate_mm_rb(root, ignore);
+	validate_mm_rb(&mm->mm_rb, ignore);
 
-	__vma_rb_erase(vma, root);
+	__vma_rb_erase(vma, mm);
 }
 
 static __always_inline void vma_rb_erase(struct vm_area_struct *vma,
-					 struct rb_root *root)
+					 struct mm_struct *mm)
 {
 	/*
 	 * All rb_subtree_gap values must be consistent prior to erase,
 	 * with the possible exception of the vma being erased.
 	 */
-	validate_mm_rb(root, vma);
+	validate_mm_rb(&mm->mm_rb, vma);
 
-	__vma_rb_erase(vma, root);
+	__vma_rb_erase(vma, mm);
 }
 
 /*
@@ -554,10 +580,12 @@ void __vma_link_rb(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * immediately update the gap to the correct value. Finally we
 	 * rebalance the rbtree after all augmented values have been set.
 	 */
+	write_seqlock(&mm->mm_seq);
 	rb_link_node(&vma->vm_rb, rb_parent, rb_link);
 	vma->rb_subtree_gap = 0;
 	vma_gap_update(vma);
-	vma_rb_insert(vma, &mm->mm_rb);
+	vma_rb_insert(vma, mm);
+	write_sequnlock(&mm->mm_seq);
 }
 
 static void __vma_link_file(struct vm_area_struct *vma)
@@ -633,7 +661,7 @@ static __always_inline void __vma_unlink_common(struct mm_struct *mm,
 {
 	struct vm_area_struct *next;
 
-	vma_rb_erase_ignore(vma, &mm->mm_rb, ignore);
+	vma_rb_erase_ignore(vma, mm, ignore);
 	next = vma->vm_next;
 	if (has_prev)
 		prev->vm_next = next;
@@ -886,21 +914,19 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 	}
 
 	if (remove_next) {
-		if (file) {
+		if (file)
 			uprobe_munmap(next, next->vm_start, next->vm_end);
-			fput(file);
-		}
 		if (next->anon_vma)
 			anon_vma_merge(vma, next);
 		mm->map_count--;
 		mpol_put(vma_policy(next));
-		kmem_cache_free(vm_area_cachep, next);
-		write_seqcount_end(&next->vm_sequence);
+		free_vma(next);
 		/*
 		 * In mprotect's case 6 (see comments on vma_merge),
 		 * we must remove another next too. It would clutter
 		 * up the code too much to do both in one go.
 		 */
+		write_seqcount_end(&next->vm_sequence);
 		if (remove_next != 3) {
 			/*
 			 * If "next" was removed and vma->vm_end was
@@ -2111,15 +2137,10 @@ get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,
 EXPORT_SYMBOL(get_unmapped_area);
 
 /* Look up the first VMA which satisfies  addr < vm_end,  NULL if none. */
-struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
+static struct vm_area_struct *__find_vma(struct mm_struct *mm, unsigned long addr)
 {
 	struct rb_node *rb_node;
-	struct vm_area_struct *vma;
-
-	/* Check the cache first. */
-	vma = vmacache_find(mm, addr);
-	if (likely(vma))
-		return vma;
+	struct vm_area_struct *vma = NULL;
 
 	rb_node = mm->mm_rb.rb_node;
 
@@ -2137,13 +2158,40 @@ struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
 			rb_node = rb_node->rb_right;
 	}
 
+	return vma;
+}
+
+struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
+{
+	struct vm_area_struct *vma;
+
+	/* Check the cache first. */
+	vma = vmacache_find(mm, addr);
+	if (likely(vma))
+		return vma;
+
+	vma = __find_vma(mm, addr);
 	if (vma)
 		vmacache_update(addr, vma);
 	return vma;
 }
-
 EXPORT_SYMBOL(find_vma);
 
+struct vm_area_struct *find_vma_srcu(struct mm_struct *mm, unsigned long addr)
+{
+	struct vm_area_struct *vma;
+	unsigned int seq;
+
+	WARN_ON_ONCE(!srcu_read_lock_held(&vma_srcu));
+
+	do {
+		seq = read_seqbegin(&mm->mm_seq);
+		vma = __find_vma(mm, addr);
+	} while (read_seqretry(&mm->mm_seq, seq));
+
+	return vma;
+}
+
 /*
  * Same as find_vma, but also return a pointer to the previous VMA in *pprev.
  */
@@ -2498,7 +2546,7 @@ detach_vmas_to_be_unmapped(struct mm_struct *mm, struct vm_area_struct *vma,
 	insertion_point = (prev ? &prev->vm_next : &mm->mmap);
 	vma->vm_prev = NULL;
 	do {
-		vma_rb_erase(vma, &mm->mm_rb);
+		vma_rb_erase(vma, mm);
 		mm->map_count--;
 		tail_vma = vma;
 		vma = vma->vm_next;
-- 
2.7.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [RFC v5 08/11] mm: Provide speculative fault infrastructure
  2017-06-16 17:52 [RFC v5 00/11] Speculative page faults Laurent Dufour
                   ` (6 preceding siblings ...)
  2017-06-16 17:52 ` [RFC v5 07/11] mm: RCU free VMAs Laurent Dufour
@ 2017-06-16 17:52 ` Laurent Dufour
  2017-06-16 17:52 ` [RFC v5 09/11] mm: Try spin lock in speculative path Laurent Dufour
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 37+ messages in thread
From: Laurent Dufour @ 2017-06-16 17:52 UTC (permalink / raw)
  To: paulmck, peterz, akpm, kirill, ak, mhocko, dave, jack, Matthew Wilcox
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora, Tim Chen

From: Peter Zijlstra <peterz@infradead.org>

Provide infrastructure to do a speculative fault (not holding
mmap_sem).

Not holding the mmap_sem means we can race against VMA
change/removal and page-table destruction. We use the SRCU VMA freeing
to keep the VMA around. We use the VMA seqcount to detect changes
(including unmapping / page-table deletion) and we use gup_fast() style
page-table walking to deal with page-table races.

Once we've obtained the page and are ready to update the PTE, we
validate if the state we started the fault with is still valid, if
not, we'll fail the fault with VM_FAULT_RETRY, otherwise we update the
PTE and we're done.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

[Manage the newly introduced pte_spinlock() for speculative page
 fault to fail if the VMA is touched in our back]
[Rename vma_is_dead() to vma_has_changed()]
[Call p4d_alloc() as it is safe since pgd is valid]
[Call pud_alloc() as it is safe since p4d is valid]
[Set fe.sequence in __handle_mm_fault()]
[Abort speculative path when handle_userfault() has to be called]
[Add additional VMA's flags checks in handle_speculative_fault()]
[Clear FAULT_FLAG_ALLOW_RETRY in handle_speculative_fault()]
[Don't set vmf->pte and vmf->ptl if pte_map_lock() failed]
[Remove warning comment about waiting for !seq&1 since we don't want
 to wait]
[Remove warning about no huge page support, mention it explicitly]
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
---
 include/linux/mm.h |   3 +
 mm/memory.c        | 181 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 181 insertions(+), 3 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 6b7ec2a76953..671541e00d26 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -315,6 +315,7 @@ struct vm_fault {
 	gfp_t gfp_mask;			/* gfp mask to be used for allocations */
 	pgoff_t pgoff;			/* Logical page offset based on vma */
 	unsigned long address;		/* Faulting virtual address */
+	unsigned int sequence;
 	pmd_t *pmd;			/* Pointer to pmd entry matching
 					 * the 'address' */
 	pud_t *pud;			/* Pointer to pud entry matching
@@ -1286,6 +1287,8 @@ int invalidate_inode_page(struct page *page);
 #ifdef CONFIG_MMU
 extern int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 		unsigned int flags);
+extern int handle_speculative_fault(struct mm_struct *mm,
+				    unsigned long address, unsigned int flags);
 extern int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
 			    unsigned long address, unsigned int fault_flags,
 			    bool *unlocked);
diff --git a/mm/memory.c b/mm/memory.c
index 5d259cd67a83..0645cb21155f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2244,15 +2244,69 @@ static inline void wp_page_reuse(struct vm_fault *vmf)
 
 static bool pte_spinlock(struct vm_fault *vmf)
 {
+	bool ret = false;
+
+	/* Check if vma is still valid */
+	if (!(vmf->flags & FAULT_FLAG_SPECULATIVE)) {
+		vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
+		spin_lock(vmf->ptl);
+		return true;
+	}
+
+	local_irq_disable();
+	if (vma_has_changed(vmf->vma, vmf->sequence))
+		goto out;
+
 	vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
 	spin_lock(vmf->ptl);
-	return true;
+
+	if (vma_has_changed(vmf->vma, vmf->sequence)) {
+		spin_unlock(vmf->ptl);
+		goto out;
+	}
+
+	ret = true;
+out:
+	local_irq_enable();
+	return ret;
 }
 
 static bool pte_map_lock(struct vm_fault *vmf)
 {
-	vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl);
-	return true;
+	bool ret = false;
+	pte_t *pte;
+	spinlock_t *ptl;
+
+	if (!(vmf->flags & FAULT_FLAG_SPECULATIVE)) {
+		vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
+					       vmf->address, &vmf->ptl);
+		return true;
+	}
+
+	/*
+	 * The first vma_has_changed() guarantees the page-tables are still
+	 * valid, having IRQs disabled ensures they stay around, hence the
+	 * second vma_has_changed() to make sure they are still valid once
+	 * we've got the lock. After that a concurrent zap_pte_range() will
+	 * block on the PTL and thus we're safe.
+	 */
+	local_irq_disable();
+	if (vma_has_changed(vmf->vma, vmf->sequence))
+		goto out;
+
+	pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
+				  vmf->address, &ptl);
+	if (vma_has_changed(vmf->vma, vmf->sequence)) {
+		pte_unmap_unlock(pte, ptl);
+		goto out;
+	}
+
+	vmf->pte = pte;
+	vmf->ptl = ptl;
+	ret = true;
+out:
+	local_irq_enable();
+	return ret;
 }
 
 /*
@@ -2938,6 +2992,8 @@ static int do_anonymous_page(struct vm_fault *vmf)
 		/* Deliver the page fault to userland, check inside PT lock */
 		if (userfaultfd_missing(vma)) {
 			pte_unmap_unlock(vmf->pte, vmf->ptl);
+			if (vmf->flags & FAULT_FLAG_SPECULATIVE)
+				return VM_FAULT_RETRY;
 			return handle_userfault(vmf, VM_UFFD_MISSING);
 		}
 		goto setpte;
@@ -2977,6 +3033,8 @@ static int do_anonymous_page(struct vm_fault *vmf)
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
 		mem_cgroup_cancel_charge(page, memcg, false);
 		put_page(page);
+		if (vmf->flags & FAULT_FLAG_SPECULATIVE)
+			return VM_FAULT_RETRY;
 		return handle_userfault(vmf, VM_UFFD_MISSING);
 	}
 
@@ -3839,6 +3897,7 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 	vmf.pmd = pmd_alloc(mm, vmf.pud, address);
 	if (!vmf.pmd)
 		return VM_FAULT_OOM;
+	vmf.sequence = raw_read_seqcount(&vma->vm_sequence);
 	if (pmd_none(*vmf.pmd) && transparent_hugepage_enabled(vma)) {
 		ret = create_huge_pmd(&vmf);
 		if (!(ret & VM_FAULT_FALLBACK))
@@ -3866,6 +3925,122 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 	return handle_pte_fault(&vmf);
 }
 
+int handle_speculative_fault(struct mm_struct *mm, unsigned long address,
+			     unsigned int flags)
+{
+	struct vm_fault vmf = {
+		.address = address,
+	};
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
+	int dead, seq, idx, ret = VM_FAULT_RETRY;
+	struct vm_area_struct *vma;
+
+	/* Clear the flags that may lead to releasing the mmap_sem to retry */
+	flags &= ~(FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_KILLABLE);
+	flags |= FAULT_FLAG_SPECULATIVE;
+
+	idx = srcu_read_lock(&vma_srcu);
+	vma = find_vma_srcu(mm, address);
+	if (!vma)
+		goto unlock;
+
+	/*
+	 * Validate the VMA found by the lockless lookup.
+	 */
+	dead = RB_EMPTY_NODE(&vma->vm_rb);
+	seq = raw_read_seqcount(&vma->vm_sequence); /* rmb <-> seqlock,vma_rb_erase() */
+	if ((seq & 1) || dead)
+		goto unlock;
+
+	if (address < vma->vm_start || vma->vm_end <= address)
+		goto unlock;
+
+	/*
+	 * Huge pages are not yet supported.
+	 */
+	if (unlikely(is_vm_hugetlb_page(vma)))
+		goto unlock;
+
+	/*
+	 * The three following checks are copied from access_error from
+	 * arch/x86/mm/fault.c
+	 */
+	if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
+				       flags & FAULT_FLAG_INSTRUCTION,
+				       flags & FAULT_FLAG_REMOTE))
+		goto unlock;
+
+	/* This one is required to check that the VMA has write access set */
+	if (flags & FAULT_FLAG_WRITE) {
+		if (unlikely(!(vma->vm_flags & VM_WRITE)))
+			goto unlock;
+	} else {
+		if (unlikely(!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE))))
+			goto unlock;
+	}
+
+	/*
+	 * We need to re-validate the VMA after checking the bounds, otherwise
+	 * we might have a false positive on the bounds.
+	 */
+	if (read_seqcount_retry(&vma->vm_sequence, seq))
+		goto unlock;
+
+	/*
+	 * Do a speculative lookup of the PTE entry.
+	 */
+	local_irq_disable();
+	pgd = pgd_offset(mm, address);
+	if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
+		goto out_walk;
+
+	p4d = p4d_alloc(mm, pgd, address);
+	if (p4d_none(*p4d) || unlikely(p4d_bad(*p4d)))
+		goto out_walk;
+
+	pud = pud_alloc(mm, p4d, address);
+	if (pud_none(*pud) || unlikely(pud_bad(*pud)))
+		goto out_walk;
+
+	pmd = pmd_offset(pud, address);
+	if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
+		goto out_walk;
+
+	/*
+	 * The above does not allocate/instantiate page-tables because doing so
+	 * would lead to the possibility of instantiating page-tables after
+	 * free_pgtables() -- and consequently leaking them.
+	 *
+	 * The result is that we take at least one !speculative fault per PMD
+	 * in order to instantiate it.
+	 */
+
+	if (unlikely(pmd_huge(*pmd)))
+		goto out_walk;
+
+	vmf.vma = vma;
+	vmf.pmd = pmd;
+	vmf.pgoff = linear_page_index(vma, address);
+	vmf.gfp_mask = __get_fault_gfp_mask(vma);
+	vmf.sequence = seq;
+	vmf.flags = flags;
+
+	local_irq_enable();
+
+	ret = handle_pte_fault(&vmf);
+
+unlock:
+	srcu_read_unlock(&vma_srcu, idx);
+	return ret;
+
+out_walk:
+	local_irq_enable();
+	goto unlock;
+}
+
 /*
  * By the time we get here, we already hold the mm semaphore
  *
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [RFC v5 09/11] mm: Try spin lock in speculative path
  2017-06-16 17:52 [RFC v5 00/11] Speculative page faults Laurent Dufour
                   ` (7 preceding siblings ...)
  2017-06-16 17:52 ` [RFC v5 08/11] mm: Provide speculative fault infrastructure Laurent Dufour
@ 2017-06-16 17:52 ` Laurent Dufour
  2017-07-05 18:50   ` Peter Zijlstra
  2017-06-16 17:52 ` [RFC v5 10/11] x86/mm: Add speculative pagefault handling Laurent Dufour
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 37+ messages in thread
From: Laurent Dufour @ 2017-06-16 17:52 UTC (permalink / raw)
  To: paulmck, peterz, akpm, kirill, ak, mhocko, dave, jack, Matthew Wilcox
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora, Tim Chen

There is a deadlock when one CPU is doing a speculative page fault while
another one is calling do_munmap().

The deadlock occurs because the speculative path tries to spin-lock the
PTE while interrupts are disabled. Meanwhile the other CPU, in the
unmap path, holds the PTE lock and is waiting for all the CPUs to
invalidate the TLB. As the CPU doing the speculative fault has
interrupts disabled, it can't invalidate the TLB and can't get the lock.

Since we are in a speculative path, we can race with other mm actions.
So assume the lock may not be acquired and, in that case, fail the
speculative page fault.

Here are the stacks captured during the deadlock:

	CPU 0
	native_flush_tlb_others+0x7c/0x260
	flush_tlb_mm_range+0x6a/0x220
	tlb_flush_mmu_tlbonly+0x63/0xc0
	unmap_page_range+0x897/0x9d0
	? unmap_single_vma+0x7d/0xe0
	? release_pages+0x2b3/0x360
	unmap_single_vma+0x7d/0xe0
	unmap_vmas+0x51/0xa0
	unmap_region+0xbd/0x130
	do_munmap+0x279/0x460
	SyS_munmap+0x53/0x70

	CPU 1
	do_raw_spin_lock+0x14e/0x160
	_raw_spin_lock+0x5d/0x80
	? pte_map_lock+0x169/0x1b0
	pte_map_lock+0x169/0x1b0
	handle_pte_fault+0xbf2/0xd80
	? trace_hardirqs_on+0xd/0x10
	handle_speculative_fault+0x272/0x280
	handle_speculative_fault+0x5/0x280
	__do_page_fault+0x187/0x580
	trace_do_page_fault+0x52/0x260
	do_async_page_fault+0x19/0x70
	async_page_fault+0x28/0x30

Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
---
 mm/memory.c | 18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 0645cb21155f..a9ea3cc2d255 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2258,7 +2258,8 @@ static bool pte_spinlock(struct vm_fault *vmf)
 		goto out;
 
 	vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
-	spin_lock(vmf->ptl);
+	if (unlikely(!spin_trylock(vmf->ptl)))
+		goto out;
 
 	if (vma_has_changed(vmf->vma, vmf->sequence)) {
 		spin_unlock(vmf->ptl);
@@ -2294,8 +2295,19 @@ static bool pte_map_lock(struct vm_fault *vmf)
 	if (vma_has_changed(vmf->vma, vmf->sequence))
 		goto out;
 
-	pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
-				  vmf->address, &ptl);
+	/* Same as pte_offset_map_lock() except that we call
+	 * spin_trylock() in place of spin_lock() to avoid race with
+	 * unmap path which may have the lock and wait for this CPU
+	 * to invalidate TLB but this CPU has irq disabled.
+	 * Since we are in a speculative patch, accept it could fail
+	 */
+	ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
+	pte = pte_offset_map(vmf->pmd, vmf->address);
+	if (unlikely(!spin_trylock(ptl))) {
+		pte_unmap(pte);
+		goto out;
+	}
+
 	if (vma_has_changed(vmf->vma, vmf->sequence)) {
 		pte_unmap_unlock(pte, ptl);
 		goto out;
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [RFC v5 10/11] x86/mm: Add speculative pagefault handling
  2017-06-16 17:52 [RFC v5 00/11] Speculative page faults Laurent Dufour
                   ` (8 preceding siblings ...)
  2017-06-16 17:52 ` [RFC v5 09/11] mm: Try spin lock in speculative path Laurent Dufour
@ 2017-06-16 17:52 ` Laurent Dufour
  2017-06-16 17:52 ` [RFC v5 11/11] powerpc/mm: Add speculative page fault Laurent Dufour
  2017-07-03 17:32 ` [RFC v5 00/11] Speculative page faults Laurent Dufour
  11 siblings, 0 replies; 37+ messages in thread
From: Laurent Dufour @ 2017-06-16 17:52 UTC (permalink / raw)
  To: paulmck, peterz, akpm, kirill, ak, mhocko, dave, jack, Matthew Wilcox
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora, Tim Chen

From: Peter Zijlstra <peterz@infradead.org>

Try a speculative fault before acquiring the mmap_sem; if it returns
VM_FAULT_RETRY, continue with the mmap_sem acquisition and do the
traditional fault.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

[Clearing of FAULT_FLAG_ALLOW_RETRY is now done in
 handle_speculative_fault()]
[Retry with the usual fault path in case VM_FAULT_ERROR is returned by
 handle_speculative_fault(). This allows a signal to be delivered]
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
---
 arch/x86/mm/fault.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 8ad91a01cbc8..c62a7ea5e27b 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1365,6 +1365,19 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
 	if (error_code & PF_INSTR)
 		flags |= FAULT_FLAG_INSTRUCTION;
 
+	if (error_code & PF_USER) {
+		fault = handle_speculative_fault(mm, address, flags);
+
+		/*
+		 * We also check against VM_FAULT_ERROR because we have to
+		 * raise a signal by calling mm_fault_error() later, which
+		 * requires the vma pointer to be set. So in that case,
+		 * we fall through the normal path.
+		 */
+		if (!(fault & VM_FAULT_RETRY || fault & VM_FAULT_ERROR))
+			goto done;
+	}
+
 	/*
 	 * When running in the kernel we expect faults to occur only to
 	 * addresses in user space.  All other faults represent errors in
@@ -1474,6 +1487,7 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
 		return;
 	}
 
+done:
 	/*
 	 * Major/minor page fault accounting. If any of the events
 	 * returned VM_FAULT_MAJOR, we account it as a major fault.
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [RFC v5 11/11] powerpc/mm: Add speculative page fault
  2017-06-16 17:52 [RFC v5 00/11] Speculative page faults Laurent Dufour
                   ` (9 preceding siblings ...)
  2017-06-16 17:52 ` [RFC v5 10/11] x86/mm: Add speculative pagefault handling Laurent Dufour
@ 2017-06-16 17:52 ` Laurent Dufour
  2017-07-03 17:32 ` [RFC v5 00/11] Speculative page faults Laurent Dufour
  11 siblings, 0 replies; 37+ messages in thread
From: Laurent Dufour @ 2017-06-16 17:52 UTC (permalink / raw)
  To: paulmck, peterz, akpm, kirill, ak, mhocko, dave, jack, Matthew Wilcox
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora, Tim Chen

This patch enables speculative page fault handling on the PowerPC
architecture.

This will try a speculative page fault without holding the mmap_sem;
if it returns with VM_FAULT_RETRY, the mmap_sem is acquired and the
traditional page fault processing is done.

Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
---
 arch/powerpc/mm/fault.c | 25 ++++++++++++++++++++++++-
 1 file changed, 24 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 3a7d580fdc59..4b6d0ed517ca 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -290,9 +290,31 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
 	if (!is_exec && user_mode(regs))
 		store_update_sp = store_updates_sp(regs);
 
-	if (user_mode(regs))
+	if (user_mode(regs)) {
 		flags |= FAULT_FLAG_USER;
 
+		/* let's try a speculative page fault without grabbing the
+		 * mmap_sem.
+		 */
+
+		/*
+		 * 'flags' is adjusted later based on the VMA's flags; for the
+		 * common speculative path, some flags must already be set here.
+		 */
+		if (is_write)
+			flags |= FAULT_FLAG_WRITE;
+
+		fault = handle_speculative_fault(mm, address, flags);
+		if (!(fault & VM_FAULT_RETRY || fault & VM_FAULT_ERROR))
+			goto done;
+
+		/*
+		 * Resetting flags since the following code assumes
+		 * FAULT_FLAG_WRITE is not set.
+		 */
+		flags &= ~FAULT_FLAG_WRITE;
+	}
+
 	/* When running in the kernel we expect faults to occur only to
 	 * addresses in user space.  All other faults represent errors in the
 	 * kernel and should generate an OOPS.  Unfortunately, in the case of an
@@ -478,6 +500,7 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
 			rc = 0;
 	}
 
+done:
 	/*
 	 * Major/minor page fault accounting.
 	 */
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [RFC v5 00/11] Speculative page faults
  2017-06-16 17:52 [RFC v5 00/11] Speculative page faults Laurent Dufour
                   ` (10 preceding siblings ...)
  2017-06-16 17:52 ` [RFC v5 11/11] powerpc/mm: Add speculative page fault Laurent Dufour
@ 2017-07-03 17:32 ` Laurent Dufour
  2017-07-07  1:54   ` Balbir Singh
  11 siblings, 1 reply; 37+ messages in thread
From: Laurent Dufour @ 2017-07-03 17:32 UTC (permalink / raw)
  To: paulmck, peterz, akpm, kirill, ak, mhocko, dave, jack, Matthew Wilcox
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora, Tim Chen

On 16/06/2017 19:52, Laurent Dufour wrote:
> This is a port on kernel 4.12 of the work done by Peter Zijlstra to
> handle page fault without holding the mm semaphore [1].
> 
> The idea is to try to handle user space page faults without holding the
> mmap_sem. This should allow better concurrency for massively threaded
> process since the page fault handler will not wait for other threads memory
> layout change to be done, assuming that this change is done in another part
> of the process's memory space. This type page fault is named speculative
> page fault. If the speculative page fault fails because of a concurrency is
> detected or because underlying PMD or PTE tables are not yet allocating, it
> is failing its processing and a classic page fault is then tried.
> 
> The speculative page fault (SPF) has to look for the VMA matching the fault
> address without holding the mmap_sem, so the VMA list is now managed using
> SRCU allowing lockless walking. The only impact would be the deferred file
> derefencing in the case of a file mapping, since the file pointer is
> released once the SRCU cleaning is done.  This patch relies on the change
> done recently by Paul McKenney in SRCU which now runs a callback per CPU
> instead of per SRCU structure [1].
> 
> The VMA's attributes checked during the speculative page fault processing
> have to be protected against parallel changes. This is done by using a per
> VMA sequence lock. This sequence lock allows the speculative page fault
> handler to fast check for parallel changes in progress and to abort the
> speculative page fault in that case.
> 
> Once the VMA is found, the speculative page fault handler would check for
> the VMA's attributes to verify that the page fault has to be handled
> correctly or not. Thus the VMA is protected through a sequence lock which
> allows fast detection of concurrent VMA changes. If such a change is
> detected, the speculative page fault is aborted and a *classic* page fault
> is tried.  VMA sequence lockings are added when VMA attributes which are
> checked during the page fault are modified.
> 
> When the PTE is fetched, the VMA is checked to see if it has been changed,
> so once the page table is locked, the VMA is valid, so any other changes
> leading to touching this PTE will need to lock the page table, so no
> parallel change is possible at this time.
> 
> Compared to the Peter's initial work, this series introduces a spin_trylock
> when dealing with speculative page fault. This is required to avoid dead
> lock when handling a page fault while a TLB invalidate is requested by an
> other CPU holding the PTE. Another change due to a lock dependency issue
> with mapping->i_mmap_rwsem.
> 
> This series builds on top of v4.12-rc5 and is functional on x86 and
> PowerPC.
> 
> Tests have been made using a large commercial in-memory database on a
> PowerPC system with 752 CPUs. The results are very encouraging since the
> loading of the 2TB database was faster by 14% with the speculative page
> fault.
> 
> However tests done using multi-fault [3] or kernbench [4], on smaller
> systems didn't show performance improvements, I saw a little degradation but
> running the tests again shows that this is in the noise. So nothing
> significant enough on the both sides.
> 
> Since benchmarks are encouraging and running test suites didn't raise any
> issue, I'd like this request for comment series to move to a patch series
> soon. So please comment.

Hi all,

I haven't received any feedback on this series so far...

In addition to the previous benchmark results, here are some additional
metrics I captured using the ebizzy benchmark [A], which stresses the
mmap/munmap calls in an interesting way.

The test counts the number of records per second it can manage; the
higher the better. I run it as 'ebizzy -mTRp'. To get consistent
results I repeat the test 100 times and report the average, mean
deviation and max. I ran the test on top of 4.12 on 2 nodes, one with 80
CPUs and the other one with 1024 CPUs:

* 80 CPUs Power 8 node:
Records/s	4.12		4.12-SPF
Average		38941.62	64235.82
Mean deviation	620.93		1718.95
Max		41988		69623

* 1024 CPUs Power 8 node:
Records/s	4.12		4.12-SPF
Average		39516.64	80689.27
Mean deviation	1387.66		1319.98
Max		43281		90441

So with this patch series we get about a 2x increase, and the gap grows
when the number of CPUs gets higher.

Please comment, as I'm about to post the final series rebased on 4.12.

Thanks,
Laurent.
[A] http://ebizzy.sourceforge.net/

> ------------------
> Benchmarks results
> 
> Here are the results on an 8 CPU x86 guest using kernbench on a 4.12-rc5
> kernel (the kernel is built 5 times):
> 
> Average Half load -j 4 Run (std deviation):
> 		 4.12.0-rc5		4.12.0-rc5-spf
> 		 Run (std deviation)
> Elapsed Time     48.42 (0.334515)       48.638 (0.344848)
> User Time        124.322 (0.964324)     124.478 (0.659902)
> System Time      58.008 (0.300865)      58.664 (0.590999)
> Percent CPU 	 376.2 (1.09545)        376.4 (1.51658)
> Context Switches 7409.6 (215.18)        11022.8 (281.093)
> Sleeps           15255.8 (63.0254)      15250.8 (43.4592)
> 
> Average Optimal load -j 8
>  		 4.12.0-rc5		4.12.0-rc5-spf
>                  Run (std deviation)
> Elapsed Time     24.268 (0.151723)      24.514 (0.143805)
> User Time        112.092 (12.9135)      112.04 (13.1257)
> System Time      49.03 (9.46999)        49.721 (9.44455)
> Percent CPU      476 (105.205)          474.3 (103.209)
> Context Switches 10268.7 (3020.16)      14069.2 (3219.98)
> Sleeps           15790.8 (568.885)      15829.4 (615.371)
> 
> Average Maximal load -j
>  		 4.12.0-rc5		4.12.0-rc5-spf
>                  Run (std deviation)
> Elapsed Time     25.042 (0.237844)      25.216 (0.201941)
> User Time        110.19 (10.7245)       110.312 (10.8245)
> System Time      45.9113 (8.86119)      46.48 (8.93778)
> Percent CPU      511.533 (99.1376)      510.133 (97.9897)
> Context Switches 19521.1 (13759.8)      22354.1 (12400)
> Sleeps           15514.7 (609.76)       15521.2 (670.054)
> 
> The elapsed time is in the same order, a bit larger in the case of the spf
> release, but that seems to be in the error margin.
> 
> Here are the kernbench results on a 572 CPUs Power8 system:
> 
> Average Half load -j 376
>  		 4.12.0-rc5		4.12.0-rc5-spf
>                  Run (std deviation)
> Elapsed Time     3.384 (0.0680441)      3.344 (0.0634823)
> User Time        203.998 (8.41125)      193.476 (8.23406)
> System Time      13.064 (0.624444)      12.028 (0.495954)
> Percent CPU      6407 (285.422)         6136.2 (198.173)
> Context Switches 7319.2 (517.785)       8960 (221.735)
> Sleeps           24287.8 (861.132)      22902.4 (728.475)
> 
> Average Optimal load -j 752
>  		 4.12.0-rc5		4.12.0-rc5-spf
>                  Run (std deviation)
> Elapsed Time     3.414 (0.136858)       3.432 (0.0506952)
> User Time        200.985 (8.71172)      197.747 (8.9511)
> System Time      12.903 (0.638262)      12.472 (0.684865)
> Percent CPU      6287.9 (322.208)       6194.8 (192.116)
> Context Switches 7173.5 (479.038)       9355.7 (712.3)
> Sleeps           24241.6 (1003.66)      22867.5 (1242.49)
> 
> Average Maximal load -j
>  		 4.12.0-rc5		4.12.0-rc5-spf
>                  Run (std deviation)
> Elapsed Time     3.422 (0.0791833)      3.312 (0.109864)
> User Time        202.096 (7.45845)      197.541 (9.42758)
> System Time      12.8733 (0.57327)      12.4567 (0.568465)
> Percent CPU      6304.87 (278.195)      6234.67 (204.769)
> Context Switches 7166 (412.524)         9398.73 (639.917)
> Sleeps           24065.6 (1132.3)       22822.8 (1176.71)
> 
> Here the elapsed time is a bit shorter using the spf release, but again we
> stay in the error margin.
> 
> Here are results using multi-fault :
> 
> --- x86 8 CPUs
>                 Page faults in 60s
> 4.12.0-rc5      23,014,776
> 4.12-0-rc5-spf  23,224,435
> 
> --- ppc64le 752 CPUs
>                 Page faults in 60s
> 4.12.0-rc5      28,087,752
> 4.12-0-rc5-spf  32,272,610
> 
> Results are a bit higher on ppc64le with the SPF patch, but I'm not convinced
> about this test on Power8 since the page tables are managed differently on
> this architecture; I'm wondering if we are not hitting the PTE lock.
> I ran the test multiple times; the numbers vary a bit but remain in
> the same order.
> 
> ------------------
> Changes since V4:
>  - merge several patches to reduce the series as requested by Jan Kara
>  - check any comment warning in the code and remove each of them
>  - reword some patch description
> 
> Changes since V3:
>  - support for the 5-level paging.
>  - abort speculative path before entering userfault code
>  - support for PowerPC architecture
>  - reorder the patch to fix build test errors.
> 
> [1] http://linux-kernel.2935.n7.nabble.com/RFC-PATCH-0-6-Another-go-at-speculative-page-faults-tt965642.html#none
> [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=da915ad5cf25b5f5d358dd3670c3378d8ae8c03e
> [3] https://lkml.org/lkml/2010/1/6/28
> [4] http://ck.kolivas.org/apps/kernbench/kernbench-0.50/
> 
> Laurent Dufour (5):
>   mm: Introduce pte_spinlock for FAULT_FLAG_SPECULATIVE
>   mm: fix lock dependency against mapping->i_mmap_rwsem
>   mm: Protect VMA modifications using VMA sequence count
>   mm: Try spin lock in speculative path
>   powerpc/mm: Add speculative page fault
> 
> Peter Zijlstra (6):
>   mm: Dont assume page-table invariance during faults
>   mm: Prepare for FAULT_FLAG_SPECULATIVE
>   mm: VMA sequence count
>   mm: RCU free VMAs
>   mm: Provide speculative fault infrastructure
>   x86/mm: Add speculative pagefault handling
> 
>  arch/powerpc/mm/fault.c  |  25 ++++-
>  arch/x86/mm/fault.c      |  14 +++
>  fs/proc/task_mmu.c       |   2 +
>  include/linux/mm.h       |   4 +
>  include/linux/mm_types.h |   3 +
>  kernel/fork.c            |   1 +
>  mm/init-mm.c             |   1 +
>  mm/internal.h            |  20 ++++
>  mm/madvise.c             |   4 +
>  mm/memory.c              | 286 +++++++++++++++++++++++++++++++++++++++--------
>  mm/mempolicy.c           |  10 +-
>  mm/mlock.c               |   9 +-
>  mm/mmap.c                | 123 +++++++++++++++-----
>  mm/mprotect.c            |   2 +
>  mm/mremap.c              |   7 ++
>  15 files changed, 430 insertions(+), 81 deletions(-)
> 


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC v5 09/11] mm: Try spin lock in speculative path
  2017-06-16 17:52 ` [RFC v5 09/11] mm: Try spin lock in speculative path Laurent Dufour
@ 2017-07-05 18:50   ` Peter Zijlstra
  2017-07-06 13:46     ` Laurent Dufour
  0 siblings, 1 reply; 37+ messages in thread
From: Peter Zijlstra @ 2017-07-05 18:50 UTC (permalink / raw)
  To: Laurent Dufour
  Cc: paulmck, akpm, kirill, ak, mhocko, dave, jack, Matthew Wilcox,
	linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora,
	Tim Chen

On Fri, Jun 16, 2017 at 07:52:33PM +0200, Laurent Dufour wrote:
> @@ -2294,8 +2295,19 @@ static bool pte_map_lock(struct vm_fault *vmf)
>  	if (vma_has_changed(vmf->vma, vmf->sequence))
>  		goto out;
>  
> -	pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
> -				  vmf->address, &ptl);
> +	/* Same as pte_offset_map_lock() except that we call

comment style..

> +	 * spin_trylock() in place of spin_lock() to avoid race with
> +	 * unmap path which may have the lock and wait for this CPU
> +	 * to invalidate TLB but this CPU has irq disabled.
> +	 * Since we are in a speculative patch, accept it could fail
> +	 */
> +	ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
> +	pte = pte_offset_map(vmf->pmd, vmf->address);
> +	if (unlikely(!spin_trylock(ptl))) {
> +		pte_unmap(pte);
> +		goto out;
> +	}
> +
>  	if (vma_has_changed(vmf->vma, vmf->sequence)) {
>  		pte_unmap_unlock(pte, ptl);
>  		goto out;

Right, so if you look at my earlier patches you'll see I did something
quite disgusting here.

Not sure that wants repeating, but I cannot remember why I thought this
deadlock didn't exist anymore.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC v5 09/11] mm: Try spin lock in speculative path
  2017-07-05 18:50   ` Peter Zijlstra
@ 2017-07-06 13:46     ` Laurent Dufour
  2017-07-06 14:48       ` Peter Zijlstra
  0 siblings, 1 reply; 37+ messages in thread
From: Laurent Dufour @ 2017-07-06 13:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: paulmck, akpm, kirill, ak, mhocko, dave, jack, Matthew Wilcox,
	linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora,
	Tim Chen

On 05/07/2017 20:50, Peter Zijlstra wrote:
> On Fri, Jun 16, 2017 at 07:52:33PM +0200, Laurent Dufour wrote:
>> @@ -2294,8 +2295,19 @@ static bool pte_map_lock(struct vm_fault *vmf)
>>  	if (vma_has_changed(vmf->vma, vmf->sequence))
>>  		goto out;
>>  
>> -	pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
>> -				  vmf->address, &ptl);
>> +	/* Same as pte_offset_map_lock() except that we call
> 
> comment style..

Hi Peter and thanks for your work and review.

I'll fix this comment style.

> 
>> +	 * spin_trylock() in place of spin_lock() to avoid race with
>> +	 * unmap path which may have the lock and wait for this CPU
>> +	 * to invalidate TLB but this CPU has irq disabled.
>> +	 * Since we are in a speculative patch, accept it could fail
>> +	 */
>> +	ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
>> +	pte = pte_offset_map(vmf->pmd, vmf->address);
>> +	if (unlikely(!spin_trylock(ptl))) {
>> +		pte_unmap(pte);
>> +		goto out;
>> +	}
>> +
>>  	if (vma_has_changed(vmf->vma, vmf->sequence)) {
>>  		pte_unmap_unlock(pte, ptl);
>>  		goto out;
> 
> Right, so if you look at my earlier patches you'll see I did something
> quite disgusting here.
> 
> Not sure that wants repeating, but I cannot remember why I thought this
> deadlock didn't exist anymore.

Regarding the deadlock, I did face it on my Power victim node, so I guess it
is still there, and the stack traces are quite explicit.
Am I missing something here?

Thanks,
Laurent.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC v5 09/11] mm: Try spin lock in speculative path
  2017-07-06 13:46     ` Laurent Dufour
@ 2017-07-06 14:48       ` Peter Zijlstra
  2017-07-06 15:29         ` Laurent Dufour
  0 siblings, 1 reply; 37+ messages in thread
From: Peter Zijlstra @ 2017-07-06 14:48 UTC (permalink / raw)
  To: Laurent Dufour
  Cc: paulmck, akpm, kirill, ak, mhocko, dave, jack, Matthew Wilcox,
	linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora,
	Tim Chen

On Thu, Jul 06, 2017 at 03:46:59PM +0200, Laurent Dufour wrote:
> On 05/07/2017 20:50, Peter Zijlstra wrote:
> > On Fri, Jun 16, 2017 at 07:52:33PM +0200, Laurent Dufour wrote:
> >> @@ -2294,8 +2295,19 @@ static bool pte_map_lock(struct vm_fault *vmf)
> >>  	if (vma_has_changed(vmf->vma, vmf->sequence))
> >>  		goto out;
> >>  
> >> -	pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
> >> -				  vmf->address, &ptl);

> >> +	ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
> >> +	pte = pte_offset_map(vmf->pmd, vmf->address);
> >> +	if (unlikely(!spin_trylock(ptl))) {
> >> +		pte_unmap(pte);
> >> +		goto out;
> >> +	}
> >> +
> >>  	if (vma_has_changed(vmf->vma, vmf->sequence)) {
> >>  		pte_unmap_unlock(pte, ptl);
> >>  		goto out;
> > 
> > Right, so if you look at my earlier patches you'll see I did something
> > quite disgusting here.
> > 
> > Not sure that wants repeating, but I cannot remember why I thought this
> > deadlock didn't exist anymore.
> 
> Regarding the deadlock, I did face it on my Power victim node, so I guess it
> is still there, and the stack traces are quite explicit.
> Am I missing something here?

No, you are right in that the deadlock is quite real. What I cannot
remember is what made me think to remove the really 'wonderful' code I
had in place to deal with it.

That said, you might want to look at how often you terminate the
speculation because of your trylock failing. If that shows up at all we
might need to do something about it.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC v5 09/11] mm: Try spin lock in speculative path
  2017-07-06 14:48       ` Peter Zijlstra
@ 2017-07-06 15:29         ` Laurent Dufour
  2017-07-06 16:13           ` Peter Zijlstra
  0 siblings, 1 reply; 37+ messages in thread
From: Laurent Dufour @ 2017-07-06 15:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: paulmck, akpm, kirill, ak, mhocko, dave, jack, Matthew Wilcox,
	linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora,
	Tim Chen

On 06/07/2017 16:48, Peter Zijlstra wrote:
> On Thu, Jul 06, 2017 at 03:46:59PM +0200, Laurent Dufour wrote:
>> On 05/07/2017 20:50, Peter Zijlstra wrote:
>>> On Fri, Jun 16, 2017 at 07:52:33PM +0200, Laurent Dufour wrote:
>>>> @@ -2294,8 +2295,19 @@ static bool pte_map_lock(struct vm_fault *vmf)
>>>>  	if (vma_has_changed(vmf->vma, vmf->sequence))
>>>>  		goto out;
>>>>  
>>>> -	pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
>>>> -				  vmf->address, &ptl);
> 
>>>> +	ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
>>>> +	pte = pte_offset_map(vmf->pmd, vmf->address);
>>>> +	if (unlikely(!spin_trylock(ptl))) {
>>>> +		pte_unmap(pte);
>>>> +		goto out;
>>>> +	}
>>>> +
>>>>  	if (vma_has_changed(vmf->vma, vmf->sequence)) {
>>>>  		pte_unmap_unlock(pte, ptl);
>>>>  		goto out;
>>>
>>> Right, so if you look at my earlier patches you'll see I did something
>>> quite disgusting here.
>>>
>>> Not sure that wants repeating, but I cannot remember why I thought this
>>> deadlock didn't exist anymore.
>>
>> Regarding the deadlock, I did face it on my Power victim node, so I guess it
>> is still there, and the stack traces are quite explicit.
>> Am I missing something here?
> 
> No, you are right in that the deadlock is quite real. What I cannot
> remember is what made me think to remove the really 'wonderful' code I
> had in place to deal with it.
> 
> That said, you might want to look at how often you terminate the
> speculation because of your trylock failing. If that shows up at all we
> might need to do something about it.

Based on the benchmarks I ran, it doesn't fail that often, but I was
thinking about adding some counters here. The system already accounts for
major page faults and minor ones, respectively in current->maj_flt and
current->min_flt. I was wondering if an additional type like async_flt would
be welcome or if there is another, smarter way to get that metric.
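
To make the idea concrete, here is a rough sketch of what I have in mind,
mirroring the x86 hunk from patch 10 (the spf_flt field is purely
hypothetical, nothing in this series adds it):

	/*
	 * Hypothetical sketch only: spf_flt would be a new field in
	 * struct task_struct, next to min_flt/maj_flt; it is not part
	 * of this series.
	 */
	fault = handle_speculative_fault(mm, address, flags);
	if (!(fault & (VM_FAULT_RETRY | VM_FAULT_ERROR))) {
		current->spf_flt++;	/* speculative fault served */
		goto done;
	}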

Feel free to advise.

Thanks
Laurent.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC v5 09/11] mm: Try spin lock in speculative path
  2017-07-06 15:29         ` Laurent Dufour
@ 2017-07-06 16:13           ` Peter Zijlstra
  0 siblings, 0 replies; 37+ messages in thread
From: Peter Zijlstra @ 2017-07-06 16:13 UTC (permalink / raw)
  To: Laurent Dufour
  Cc: paulmck, akpm, kirill, ak, mhocko, dave, jack, Matthew Wilcox,
	linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora,
	Tim Chen

On Thu, Jul 06, 2017 at 05:29:26PM +0200, Laurent Dufour wrote:
> Based on the benchmarks I ran, it doesn't fail that often, but I was
> thinking about adding some counters here. The system already accounts for
> major page faults and minor ones, respectively in current->maj_flt and
> current->min_flt. I was wondering if an additional type like async_flt would
> be welcome or if there is another, smarter way to get that metric.
> 
> Feel free to advise.

You could stick a tracepoint in, or extend PERF_COUNT_SW_PAGE_FAULTS*.
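
For instance, a rough sketch of what extending the software events could
look like (PERF_COUNT_SW_SPF below is a made-up event name; only
PERF_COUNT_SW_PAGE_FAULTS{,_MAJ,_MIN} exist today):

	/*
	 * Illustration only: assumes a new id, PERF_COUNT_SW_SPF, appended
	 * to enum perf_sw_ids. The arch fault handler would bump it once a
	 * speculative fault has been served, next to the existing
	 * PERF_COUNT_SW_PAGE_FAULTS accounting.
	 */
	perf_sw_event(PERF_COUNT_SW_SPF, 1, regs, address);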


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC v5 00/11] Speculative page faults
  2017-07-03 17:32 ` [RFC v5 00/11] Speculative page faults Laurent Dufour
@ 2017-07-07  1:54   ` Balbir Singh
  0 siblings, 0 replies; 37+ messages in thread
From: Balbir Singh @ 2017-07-07  1:54 UTC (permalink / raw)
  To: Laurent Dufour, paulmck, peterz, akpm, kirill, ak, mhocko, dave,
	jack, Matthew Wilcox
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, Tim Chen

On Mon, 2017-07-03 at 19:32 +0200, Laurent Dufour wrote:
> The test counts the number of records per second it can manage; the
> higher the better. I run it as 'ebizzy -mTRp'. To get consistent
> results I repeat the test 100 times and report the average, mean
> deviation and max. I ran the test on top of 4.12 on 2 nodes, one with 80
> CPUs and the other one with 1024 CPUs:
> 
> * 80 CPUs Power 8 node:
> Records/s	4.12		4.12-SPF
> Average		38941.62	64235.82
> Mean deviation	620.93		1718.95
> Max		41988		69623
> 
> * 1024 CPUs Power 8 node:
> Records/s	4.12		4.12-SPF
> Average		39516.64	80689.27
> Mean deviation	1387.66		1319.98
> Max		43281		90441
>

This seems like a very interesting result

Balbir Singh. 


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC v5 01/11] mm: Dont assume page-table invariance during faults
  2017-06-16 17:52 ` [RFC v5 01/11] mm: Dont assume page-table invariance during faults Laurent Dufour
@ 2017-07-07  7:07   ` Balbir Singh
  2017-07-10 17:48     ` Laurent Dufour
  2017-08-08  9:45   ` Anshuman Khandual
  1 sibling, 1 reply; 37+ messages in thread
From: Balbir Singh @ 2017-07-07  7:07 UTC (permalink / raw)
  To: Laurent Dufour, paulmck, peterz, akpm, kirill, ak, mhocko, dave,
	jack, Matthew Wilcox
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, Tim Chen

On Fri, 2017-06-16 at 19:52 +0200, Laurent Dufour wrote:
> From: Peter Zijlstra <peterz@infradead.org>
> 
> One of the side effects of speculating on faults (without holding
> mmap_sem) is that we can race with free_pgtables() and therefore we
> cannot assume the page-tables will stick around.
> 
> Remove the relyance on the pte pointer.
             ^^ reliance

Looking at the changelog and the code the impact is not clear.
It looks like after this patch we always assume the pte is not
the same. What is the impact of this patch?

Balbir Singh.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC v5 01/11] mm: Dont assume page-table invariance during faults
  2017-07-07  7:07   ` Balbir Singh
@ 2017-07-10 17:48     ` Laurent Dufour
  2017-07-11  4:26       ` Balbir Singh
  2017-08-08 10:04       ` Anshuman Khandual
  0 siblings, 2 replies; 37+ messages in thread
From: Laurent Dufour @ 2017-07-10 17:48 UTC (permalink / raw)
  To: Balbir Singh, paulmck, peterz, akpm, kirill, ak, mhocko, dave,
	jack, Matthew Wilcox
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, Tim Chen

On 07/07/2017 09:07, Balbir Singh wrote:
> On Fri, 2017-06-16 at 19:52 +0200, Laurent Dufour wrote:
>> From: Peter Zijlstra <peterz@infradead.org>
>>
>> One of the side effects of speculating on faults (without holding
>> mmap_sem) is that we can race with free_pgtables() and therefore we
>> cannot assume the page-tables will stick around.
>>
>> Remove the relyance on the pte pointer.
>              ^^ reliance
> 
> Looking at the changelog and the code the impact is not clear.
> It looks like after this patch we always assume the pte is not
> the same. What is the impact of this patch?

Hi Balbir,

In most cases pte_unmap_same() was returning 1, meaning that
do_swap_page() should do its processing.

So in most cases there will be no impact.

Now regarding the case where pte_unmap_same() was returning 0, and thus
do_swap_page() returned 0 too: this happens when the page has already been
swapped back in. This may happen before do_swap_page() gets called or while
in the call to do_swap_page(). In the latter case, the check done when
swapin_readahead() returns will detect that case.

The worst case would be a page fault occurring on 2 threads at the
same time on the same swapped-out page. In that case one thread will spend
a lot of time looping in __read_swap_cache_async(). But in the regular page
fault path this is even worse, since the thread would wait for the semaphore
to be released before starting anything.

Cheers,
Laurent.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC v5 01/11] mm: Dont assume page-table invariance during faults
  2017-07-10 17:48     ` Laurent Dufour
@ 2017-07-11  4:26       ` Balbir Singh
  2017-08-08 10:04       ` Anshuman Khandual
  1 sibling, 0 replies; 37+ messages in thread
From: Balbir Singh @ 2017-07-11  4:26 UTC (permalink / raw)
  To: Laurent Dufour
  Cc: paulmck, peterz, akpm, kirill, ak, mhocko, dave, jack,
	Matthew Wilcox, linux-kernel, linux-mm, haren, khandual, npiggin,
	Tim Chen

On Mon, 10 Jul 2017 19:48:43 +0200
Laurent Dufour <ldufour@linux.vnet.ibm.com> wrote:

> On 07/07/2017 09:07, Balbir Singh wrote:
> > On Fri, 2017-06-16 at 19:52 +0200, Laurent Dufour wrote:  
> >> From: Peter Zijlstra <peterz@infradead.org>
> >>
> >> One of the side effects of speculating on faults (without holding
> >> mmap_sem) is that we can race with free_pgtables() and therefore we
> >> cannot assume the page-tables will stick around.
> >>
> >> Remove the relyance on the pte pointer.  
> >              ^^ reliance
> > 
> > Looking at the changelog and the code the impact is not clear.
> > It looks like after this patch we always assume the pte is not
> > the same. What is the impact of this patch?  
> 
> Hi Balbir,
> 
> In most cases pte_unmap_same() was returning 1, meaning that
> do_swap_page() should do its processing.
> 
> So in most cases there will be no impact.
> 
> Now regarding the case where pte_unmap_same() was returning 0, and thus
> do_swap_page() returned 0 too: this happens when the page has already been
> swapped back in. This may happen before do_swap_page() gets called or while
> in the call to do_swap_page(). In the latter case, the check done when
> swapin_readahead() returns will detect that case.
> 
> The worst case would be a page fault occurring on 2 threads at the
> same time on the same swapped-out page. In that case one thread will spend
> a lot of time looping in __read_swap_cache_async(). But in the regular page
> fault path this is even worse, since the thread would wait for the semaphore
> to be released before starting anything.
> 
>

Sounds good!

Thanks,
Balbir Singh 


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC v5 01/11] mm: Dont assume page-table invariance during faults
  2017-06-16 17:52 ` [RFC v5 01/11] mm: Dont assume page-table invariance during faults Laurent Dufour
  2017-07-07  7:07   ` Balbir Singh
@ 2017-08-08  9:45   ` Anshuman Khandual
  2017-08-08 12:11     ` Laurent Dufour
  1 sibling, 1 reply; 37+ messages in thread
From: Anshuman Khandual @ 2017-08-08  9:45 UTC (permalink / raw)
  To: Laurent Dufour, paulmck, peterz, akpm, kirill, ak, mhocko, dave,
	jack, Matthew Wilcox
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora, Tim Chen

On 06/16/2017 11:22 PM, Laurent Dufour wrote:
> From: Peter Zijlstra <peterz@infradead.org>
> 
> One of the side effects of speculating on faults (without holding
> mmap_sem) is that we can race with free_pgtables() and therefore we
> cannot assume the page-tables will stick around.
> 
> Remove the relyance on the pte pointer.

Looking into other parts of the series, it seems like we now have a
sequence lock at both the MM and VMA level, but after that we still
need to take the page table lock before handling page faults (in turn
manipulating the PTE, which includes the swap-in paths as well). Is that
not true?


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC v5 01/11] mm: Dont assume page-table invariance during faults
  2017-07-10 17:48     ` Laurent Dufour
  2017-07-11  4:26       ` Balbir Singh
@ 2017-08-08 10:04       ` Anshuman Khandual
  1 sibling, 0 replies; 37+ messages in thread
From: Anshuman Khandual @ 2017-08-08 10:04 UTC (permalink / raw)
  To: Laurent Dufour, Balbir Singh, paulmck, peterz, akpm, kirill, ak,
	mhocko, dave, jack, Matthew Wilcox
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, Tim Chen

On 07/10/2017 11:18 PM, Laurent Dufour wrote:
> On 07/07/2017 09:07, Balbir Singh wrote:
>> On Fri, 2017-06-16 at 19:52 +0200, Laurent Dufour wrote:
>>> From: Peter Zijlstra <peterz@infradead.org>
>>>
>>> One of the side effects of speculating on faults (without holding
>>> mmap_sem) is that we can race with free_pgtables() and therefore we
>>> cannot assume the page-tables will stick around.
>>>
>>> Remove the relyance on the pte pointer.
>>              ^^ reliance
>>
>> Looking at the changelog and the code the impact is not clear.
>> It looks like after this patch we always assume the pte is not
>> the same. What is the impact of this patch?
> 
> Hi Balbir,
> 
> In most cases pte_unmap_same() was returning 1, meaning that
> do_swap_page() should do its processing.
> 
> So in most cases there will be no impact.
> 
> Now regarding the case where pte_unmap_same() was returning 0, and thus
> do_swap_page() returned 0 too: this happens when the page has already been
> swapped back in. This may happen before do_swap_page() gets called or while
> in the call to do_swap_page(). In the latter case, the check done when
> swapin_readahead() returns will detect that case.
> 
> The worst case would be a page fault occurring on 2 threads at the
> same time on the same swapped-out page. In that case one thread will spend
> a lot of time looping in __read_swap_cache_async(). But in the regular page
> fault path this is even worse, since the thread would wait for the semaphore
> to be released before starting anything.

Can we move the detection of the same struct page being swapped back into
the page table a bit earlier, ideally where pte_unmap_same() was, to
speed up detection of the bail-out case?


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC v5 02/11] mm: Prepare for FAULT_FLAG_SPECULATIVE
  2017-06-16 17:52 ` [RFC v5 02/11] mm: Prepare for FAULT_FLAG_SPECULATIVE Laurent Dufour
@ 2017-08-08 10:24   ` Anshuman Khandual
  2017-08-08 10:42     ` Peter Zijlstra
  0 siblings, 1 reply; 37+ messages in thread
From: Anshuman Khandual @ 2017-08-08 10:24 UTC (permalink / raw)
  To: Laurent Dufour, paulmck, peterz, akpm, kirill, ak, mhocko, dave,
	jack, Matthew Wilcox
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora, Tim Chen

On 06/16/2017 11:22 PM, Laurent Dufour wrote:
> From: Peter Zijlstra <peterz@infradead.org>
> 
> When speculating faults (without holding mmap_sem) we need to validate
> that the vma against which we loaded pages is still valid when we're
> ready to install the new PTE.
> 
> Therefore, replace the pte_offset_map_lock() calls that (re)take the
> PTL with pte_map_lock() which can fail in case we find the VMA changed
> since we started the fault.

Where are we checking if the VMA has changed or not since the fault?

> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> 
> [Port to 4.12 kernel]
> [Remove the comment about the fault_env structure which has been
>  implemented as the vm_fault structure in the kernel]
> Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
> ---
>  include/linux/mm.h |  1 +
>  mm/memory.c        | 55 ++++++++++++++++++++++++++++++++++++++----------------
>  2 files changed, 40 insertions(+), 16 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index b892e95d4929..6b7ec2a76953 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -286,6 +286,7 @@ extern pgprot_t protection_map[16];
>  #define FAULT_FLAG_USER		0x40	/* The fault originated in userspace */
>  #define FAULT_FLAG_REMOTE	0x80	/* faulting for non current tsk/mm */
>  #define FAULT_FLAG_INSTRUCTION  0x100	/* The fault was during an instruction fetch */
> +#define FAULT_FLAG_SPECULATIVE	0x200	/* Speculative fault, not holding mmap_sem */

We are not using this yet; maybe this can wait until later in the series.

>  
>  #define FAULT_FLAG_TRACE \
>  	{ FAULT_FLAG_WRITE,		"WRITE" }, \
> diff --git a/mm/memory.c b/mm/memory.c
> index fd952f05e016..40834444ea0d 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2240,6 +2240,12 @@ static inline void wp_page_reuse(struct vm_fault *vmf)
>  	pte_unmap_unlock(vmf->pte, vmf->ptl);
>  }
>  
> +static bool pte_map_lock(struct vm_fault *vmf)
> +{
> +	vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl);
> +	return true;
> +}

This is always true ? Then we should not have all these if (!pte_map_lock(vmf))
check blocks down below.

> +
>  /*
>   * Handle the case of a page which we actually need to copy to a new page.
>   *
> @@ -2267,6 +2273,7 @@ static int wp_page_copy(struct vm_fault *vmf)
>  	const unsigned long mmun_start = vmf->address & PAGE_MASK;
>  	const unsigned long mmun_end = mmun_start + PAGE_SIZE;
>  	struct mem_cgroup *memcg;
> +	int ret = VM_FAULT_OOM;
> 

If we remove the check block over pte_map_lock(), adding VM_FAULT_OOM
becomes redundant here.

>  	if (unlikely(anon_vma_prepare(vma)))
>  		goto oom;
> @@ -2294,7 +2301,11 @@ static int wp_page_copy(struct vm_fault *vmf)
>  	/*
>  	 * Re-check the pte - we dropped the lock
>  	 */
> -	vmf->pte = pte_offset_map_lock(mm, vmf->pmd, vmf->address, &vmf->ptl);
> +	if (!pte_map_lock(vmf)) {
> +		mem_cgroup_cancel_charge(new_page, memcg, false);
> +		ret = VM_FAULT_RETRY;
> +		goto oom_free_new;
> +	}
>  	if (likely(pte_same(*vmf->pte, vmf->orig_pte))) {
>  		if (old_page) {
>  			if (!PageAnon(old_page)) {
> @@ -2382,7 +2393,7 @@ static int wp_page_copy(struct vm_fault *vmf)
>  oom:
>  	if (old_page)
>  		put_page(old_page);
> -	return VM_FAULT_OOM;
> +	return ret;
>  }
>  
>  /**
> @@ -2403,8 +2414,8 @@ static int wp_page_copy(struct vm_fault *vmf)
>  int finish_mkwrite_fault(struct vm_fault *vmf)
>  {
>  	WARN_ON_ONCE(!(vmf->vma->vm_flags & VM_SHARED));
> -	vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address,
> -				       &vmf->ptl);
> +	if (!pte_map_lock(vmf))
> +		return VM_FAULT_RETRY;

Can't fail.

>  	/*
>  	 * We might have raced with another page fault while we released the
>  	 * pte_offset_map_lock.
> @@ -2522,8 +2533,11 @@ static int do_wp_page(struct vm_fault *vmf)
>  			get_page(vmf->page);
>  			pte_unmap_unlock(vmf->pte, vmf->ptl);
>  			lock_page(vmf->page);
> -			vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
> -					vmf->address, &vmf->ptl);
> +			if (!pte_map_lock(vmf)) {
> +				unlock_page(vmf->page);
> +				put_page(vmf->page);
> +				return VM_FAULT_RETRY;
> +			}

Same here.

>  			if (!pte_same(*vmf->pte, vmf->orig_pte)) {
>  				unlock_page(vmf->page);
>  				pte_unmap_unlock(vmf->pte, vmf->ptl);
> @@ -2681,8 +2695,10 @@ int do_swap_page(struct vm_fault *vmf)
>  			 * Back out if somebody else faulted in this pte
>  			 * while we released the pte lock.
>  			 */
> -			vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
> -					vmf->address, &vmf->ptl);
> +			if (!pte_map_lock(vmf)) {
> +				delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
> +				return VM_FAULT_RETRY;
> +			}

>  			if (likely(pte_same(*vmf->pte, vmf->orig_pte)))
>  				ret = VM_FAULT_OOM;
>  			delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
> @@ -2738,8 +2754,11 @@ int do_swap_page(struct vm_fault *vmf)
>  	/*
>  	 * Back out if somebody else already faulted in this pte.
>  	 */
> -	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
> -			&vmf->ptl);
> +	if (!pte_map_lock(vmf)) {
> +		ret = VM_FAULT_RETRY;
> +		mem_cgroup_cancel_charge(page, memcg, false);
> +		goto out_page;
> +	}
>  	if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte)))
>  		goto out_nomap;
>  
> @@ -2903,8 +2922,8 @@ static int do_anonymous_page(struct vm_fault *vmf)
>  			!mm_forbids_zeropage(vma->vm_mm)) {
>  		entry = pte_mkspecial(pfn_pte(my_zero_pfn(vmf->address),
>  						vma->vm_page_prot));
> -		vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
> -				vmf->address, &vmf->ptl);
> +		if (!pte_map_lock(vmf))
> +			return VM_FAULT_RETRY;
>  		if (!pte_none(*vmf->pte))
>  			goto unlock;
>  		/* Deliver the page fault to userland, check inside PT lock */
> @@ -2936,8 +2955,11 @@ static int do_anonymous_page(struct vm_fault *vmf)
>  	if (vma->vm_flags & VM_WRITE)
>  		entry = pte_mkwrite(pte_mkdirty(entry));
>  
> -	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
> -			&vmf->ptl);
> +	if (!pte_map_lock(vmf)) {
> +		mem_cgroup_cancel_charge(page, memcg, false);
> +		put_page(page);
> +		return VM_FAULT_RETRY;
> +	}
>  	if (!pte_none(*vmf->pte))
>  		goto release;
>  
> @@ -3057,8 +3079,9 @@ static int pte_alloc_one_map(struct vm_fault *vmf)
>  	 * pte_none() under vmf->ptl protection when we return to
>  	 * alloc_set_pte().
>  	 */
> -	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
> -			&vmf->ptl);
> +	if (!pte_map_lock(vmf))
> +		return VM_FAULT_RETRY;
> +
>  	return 0;

All these 'if' blocks seem redundant, unless I am missing something.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC v5 03/11] mm: Introduce pte_spinlock for FAULT_FLAG_SPECULATIVE
  2017-06-16 17:52 ` [RFC v5 03/11] mm: Introduce pte_spinlock " Laurent Dufour
@ 2017-08-08 10:35   ` Anshuman Khandual
  2017-08-08 12:16     ` Laurent Dufour
  0 siblings, 1 reply; 37+ messages in thread
From: Anshuman Khandual @ 2017-08-08 10:35 UTC (permalink / raw)
  To: Laurent Dufour, paulmck, peterz, akpm, kirill, ak, mhocko, dave,
	jack, Matthew Wilcox
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora, Tim Chen

On 06/16/2017 11:22 PM, Laurent Dufour wrote:
> When handling page fault without holding the mmap_sem the fetch of the
> pte lock pointer and the locking will have to be done while ensuring
> that the VMA is not touched in our back.

It does not change things from what's happening right now; where do we
check that the VMA has not changed by now?

> 
> So move the fetch and locking operations in a dedicated function.
> 
> Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
> ---
>  mm/memory.c | 15 +++++++++++----
>  1 file changed, 11 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 40834444ea0d..f1132f7931ef 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2240,6 +2240,13 @@ static inline void wp_page_reuse(struct vm_fault *vmf)
>  	pte_unmap_unlock(vmf->pte, vmf->ptl);
>  }
>  
> +static bool pte_spinlock(struct vm_fault *vmf)
> +{
> +	vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
> +	spin_lock(vmf->ptl);
> +	return true;
> +}
> +

Moving them together makes sense, but again the 'if' blocks are redundant when
it returns true all the time.

>  static bool pte_map_lock(struct vm_fault *vmf)
>  {
>  	vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl);
> @@ -3552,8 +3559,8 @@ static int do_numa_page(struct vm_fault *vmf)
>  	 * validation through pte_unmap_same(). It's of NUMA type but
>  	 * the pfn may be screwed if the read is non atomic.
>  	 */
> -	vmf->ptl = pte_lockptr(vma->vm_mm, vmf->pmd);
> -	spin_lock(vmf->ptl);
> +	if (!pte_spinlock(vmf))
> +		return VM_FAULT_RETRY;
>  	if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
>  		pte_unmap_unlock(vmf->pte, vmf->ptl);
>  		goto out;
> @@ -3745,8 +3752,8 @@ static int handle_pte_fault(struct vm_fault *vmf)
>  	if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
>  		return do_numa_page(vmf);
>  
> -	vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
> -	spin_lock(vmf->ptl);
> +	if (!pte_spinlock(vmf))
> +		return VM_FAULT_RETRY;
>  	entry = vmf->orig_pte;
>  	if (unlikely(!pte_same(*vmf->pte, entry)))
>  		goto unlock;
> 


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC v5 02/11] mm: Prepare for FAULT_FLAG_SPECULATIVE
  2017-08-08 10:24   ` Anshuman Khandual
@ 2017-08-08 10:42     ` Peter Zijlstra
  0 siblings, 0 replies; 37+ messages in thread
From: Peter Zijlstra @ 2017-08-08 10:42 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: Laurent Dufour, paulmck, akpm, kirill, ak, mhocko, dave, jack,
	Matthew Wilcox, linux-kernel, linux-mm, haren, npiggin,
	bsingharora, Tim Chen

On Tue, Aug 08, 2017 at 03:54:01PM +0530, Anshuman Khandual wrote:
> On 06/16/2017 11:22 PM, Laurent Dufour wrote:
> > From: Peter Zijlstra <peterz@infradead.org>
> > 
> > When speculating faults (without holding mmap_sem) we need to validate
> > that the vma against which we loaded pages is still valid when we're
> > ready to install the new PTE.
> > 
> > Therefore, replace the pte_offset_map_lock() calls that (re)take the
> > PTL with pte_map_lock() which can fail in case we find the VMA changed
> > since we started the fault.
> 
> Where are we checking whether the VMA has changed since the fault?

Not there yet; this is what you call a preparatory patch. They help
review in that you can consider smaller steps.

> > diff --git a/mm/memory.c b/mm/memory.c
> > index fd952f05e016..40834444ea0d 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -2240,6 +2240,12 @@ static inline void wp_page_reuse(struct vm_fault *vmf)
> >  	pte_unmap_unlock(vmf->pte, vmf->ptl);
> >  }
> >  
> > +static bool pte_map_lock(struct vm_fault *vmf)
> > +{
> > +	vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl);
> > +	return true;
> > +}
> 
> This is always true? Then we should not have all these if (!pte_map_lock(vmf))
> check blocks down below.

Later patches will make it possible to return false. This patch is about
placing this call. Having it in a separate patch makes it easier
to review all those new error conditions.
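
For readers following along, here is a minimal sketch of what such a later,
failure-capable pte_map_lock() could look like. This is only an illustration,
not the code of patch 08/11; the vmf->sequence field (a snapshot of
vma->vm_sequence taken when the VMA was looked up without the mmap_sem) is
assumed purely for the example.

static bool pte_map_lock(struct vm_fault *vmf)
{
	/* Classic path: unchanged behaviour, the lock is always taken. */
	if (!(vmf->flags & FAULT_FLAG_SPECULATIVE)) {
		vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
					       vmf->address, &vmf->ptl);
		return true;
	}

	/* Speculative path: give up if the VMA changed since the snapshot. */
	if (read_seqcount_retry(&vmf->vma->vm_sequence, vmf->sequence))
		return false;

	vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
				       vmf->address, &vmf->ptl);

	/* Re-check under the PTE lock; a concurrent change means bailing out. */
	if (read_seqcount_retry(&vmf->vma->vm_sequence, vmf->sequence)) {
		pte_unmap_unlock(vmf->pte, vmf->ptl);
		return false;
	}
	return true;
}

The call sites added by the present patch already turn a false return into
VM_FAULT_RETRY, which is why the seemingly redundant 'if' blocks are put in
place now.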


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC v5 04/11] mm: VMA sequence count
  2017-06-16 17:52 ` [RFC v5 04/11] mm: VMA sequence count Laurent Dufour
@ 2017-08-08 10:59   ` Anshuman Khandual
  2017-08-08 11:04     ` Peter Zijlstra
  0 siblings, 1 reply; 37+ messages in thread
From: Anshuman Khandual @ 2017-08-08 10:59 UTC (permalink / raw)
  To: Laurent Dufour, paulmck, peterz, akpm, kirill, ak, mhocko, dave,
	jack, Matthew Wilcox
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora, Tim Chen

On 06/16/2017 11:22 PM, Laurent Dufour wrote:
> From: Peter Zijlstra <peterz@infradead.org>
> 

First of all, please do mention that it's adding a new element into the
vm_area_struct which will act as a sequential lock element and help
in handling page faults without the mmap_sem lock.

> Wrap the VMA modifications (vma_adjust/unmap_page_range) with sequence
> counts such that we can easily test if a VMA is changed

Yeah true.

> 
> The unmap_page_range() one allows us to make assumptions about
> page-tables; when we find the seqcount hasn't changed we can assume
> page-tables are still valid.

Because unmap_page_range() is the only function which can tear it down?
Or is there any other reason for this assumption?

> 
> The flip side is that we cannot distinguish between a vma_adjust() and
> the unmap_page_range() -- where with the former we could have
> re-checked the vma bounds against the address.

Distinguish them for what purpose?

> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> 
> [port to 4.12 kernel]
> Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
> ---
>  include/linux/mm_types.h |  1 +
>  mm/memory.c              |  2 ++
>  mm/mmap.c                | 13 +++++++++++++
>  3 files changed, 16 insertions(+)
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 45cdb27791a3..8945743e4609 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -342,6 +342,7 @@ struct vm_area_struct {
>  	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
>  #endif
>  	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
> +	seqcount_t vm_sequence;
>  };
>  
>  struct core_thread {
> diff --git a/mm/memory.c b/mm/memory.c
> index f1132f7931ef..5d259cd67a83 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1379,6 +1379,7 @@ void unmap_page_range(struct mmu_gather *tlb,
>  	unsigned long next;
>  
>  	BUG_ON(addr >= end);
> +	write_seqcount_begin(&vma->vm_sequence);
>  	tlb_start_vma(tlb, vma);
>  	pgd = pgd_offset(vma->vm_mm, addr);
>  	do {
> @@ -1388,6 +1389,7 @@ void unmap_page_range(struct mmu_gather *tlb,
>  		next = zap_p4d_range(tlb, vma, pgd, addr, next, details);
>  	} while (pgd++, addr = next, addr != end);
>  	tlb_end_vma(tlb, vma);
> +	write_seqcount_end(&vma->vm_sequence);
>  }
>  
>  
> diff --git a/mm/mmap.c b/mm/mmap.c
> index f82741e199c0..9f86356d0012 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -543,6 +543,8 @@ void __vma_link_rb(struct mm_struct *mm, struct vm_area_struct *vma,
>  	else
>  		mm->highest_vm_end = vma->vm_end;
>  
> +	seqcount_init(&vma->vm_sequence);
> +
>  	/*
>  	 * vma->vm_prev wasn't known when we followed the rbtree to find the
>  	 * correct insertion point for that vma. As a result, we could not
> @@ -677,6 +679,10 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
>  	long adjust_next = 0;
>  	int remove_next = 0;
>  
> +	write_seqcount_begin(&vma->vm_sequence);
> +	if (next)
> +		write_seqcount_begin_nested(&next->vm_sequence, SINGLE_DEPTH_NESTING);
> +
>  	if (next && !insert) {
>  		struct vm_area_struct *exporter = NULL, *importer = NULL;
>  
> @@ -888,6 +894,7 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
>  		mm->map_count--;
>  		mpol_put(vma_policy(next));
>  		kmem_cache_free(vm_area_cachep, next);
> +		write_seqcount_end(&next->vm_sequence);
>  		/*
>  		 * In mprotect's case 6 (see comments on vma_merge),
>  		 * we must remove another next too. It would clutter
> @@ -901,6 +908,8 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
>  			 * "vma->vm_next" gap must be updated.
>  			 */
>  			next = vma->vm_next;
> +			if (next)
> +				write_seqcount_begin_nested(&next->vm_sequence, SINGLE_DEPTH_NESTING);
>  		} else {
>  			/*
>  			 * For the scope of the comment "next" and
> @@ -947,6 +956,10 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
>  	if (insert && file)
>  		uprobe_mmap(insert);
>  
> +	if (next)
> +		write_seqcount_end(&next->vm_sequence);
> +	write_seqcount_end(&vma->vm_sequence);
> +
>  	validate_mm(mm);
>  
>  	return 0;

Why are we changing the sequence count for the 'next' element here as well?
Is this because the next VMA may be modified during the __vma_adjust()
process? Just out of curiosity.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC v5 04/11] mm: VMA sequence count
  2017-08-08 10:59   ` Anshuman Khandual
@ 2017-08-08 11:04     ` Peter Zijlstra
  0 siblings, 0 replies; 37+ messages in thread
From: Peter Zijlstra @ 2017-08-08 11:04 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: Laurent Dufour, paulmck, akpm, kirill, ak, mhocko, dave, jack,
	Matthew Wilcox, linux-kernel, linux-mm, haren, npiggin,
	bsingharora, Tim Chen

On Tue, Aug 08, 2017 at 04:29:32PM +0530, Anshuman Khandual wrote:
> On 06/16/2017 11:22 PM, Laurent Dufour wrote:
> > From: Peter Zijlstra <peterz@infradead.org>
> > 
> 
> First of all, please do mention that it's adding a new element into the
> vm_area_struct which will act as a sequential lock element and help
> in handling page faults without the mmap_sem lock.

You're not making sense; there is no lock, and the lines below clearly
state we're adding a sequence count.

> 
> > Wrap the VMA modifications (vma_adjust/unmap_page_range) with sequence
> > counts such that we can easily test if a VMA is changed
> 
> Yeah true.
> 
> > 
> > The unmap_page_range() one allows us to make assumptions about
> > page-tables; when we find the seqcount hasn't changed we can assume
> > page-tables are still valid.
> 
> Because unmap_page_range() is the only function which can tear it down?
> Or is there any other reason for this assumption?

Yep.

> > 
> > The flip side is that we cannot distinguish between a vma_adjust() and
> > the unmap_page_range() -- where with the former we could have
> > re-checked the vma bounds against the address.
> 
> Distinguish them for what purpose?

It states. If you know it's a vma_adjust() we could just check if we're
inside the new boundaries and continue. But since we cannot, we have to
assume the worst and bail.
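
To make the "assume the worst and bail" point concrete, here is a small
read-side sketch; the helper name vma_snapshot_valid() is invented for
illustration and does not appear in the series.

/*
 * Snapshot the VMA state under a vm_sequence read section.  Any writer
 * (vma_adjust() or unmap_page_range()) makes the retry check fail, and
 * because the two cases cannot be told apart the caller has to fall back
 * to the classic, mmap_sem-protected page fault path.
 */
static bool vma_snapshot_valid(struct vm_area_struct *vma,
			       unsigned long address, unsigned int *seq)
{
	*seq = read_seqcount_begin(&vma->vm_sequence);

	/* The bounds read below are only meaningful while *seq stays valid. */
	if (address < vma->vm_start || address >= vma->vm_end)
		return false;

	return !read_seqcount_retry(&vma->vm_sequence, *seq);
}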


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC v5 05/11] mm: fix lock dependency against mapping->i_mmap_rwsem
  2017-06-16 17:52 ` [RFC v5 05/11] mm: fix lock dependency against mapping->i_mmap_rwsem Laurent Dufour
@ 2017-08-08 11:17   ` Anshuman Khandual
  2017-08-08 12:20     ` Laurent Dufour
  0 siblings, 1 reply; 37+ messages in thread
From: Anshuman Khandual @ 2017-08-08 11:17 UTC (permalink / raw)
  To: Laurent Dufour, paulmck, peterz, akpm, kirill, ak, mhocko, dave,
	jack, Matthew Wilcox
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora, Tim Chen

On 06/16/2017 11:22 PM, Laurent Dufour wrote:
> kworker/32:1/819 is trying to acquire lock:
>  (&vma->vm_sequence){+.+...}, at: [<c0000000002f20e0>]
> zap_page_range_single+0xd0/0x1a0
> 
> but task is already holding lock:
>  (&mapping->i_mmap_rwsem){++++..}, at: [<c0000000002f229c>]
> unmap_mapping_range+0x7c/0x160
> 
> which lock already depends on the new lock.
> 
> the existing dependency chain (in reverse order) is:
> 
> -> #2 (&mapping->i_mmap_rwsem){++++..}:
>        down_write+0x84/0x130
>        __vma_adjust+0x1f4/0xa80
>        __split_vma.isra.2+0x174/0x290
>        do_munmap+0x13c/0x4e0
>        vm_munmap+0x64/0xb0
>        elf_map+0x11c/0x130
>        load_elf_binary+0x6f0/0x15f0
>        search_binary_handler+0xe0/0x2a0
>        do_execveat_common.isra.14+0x7fc/0xbe0
>        call_usermodehelper_exec_async+0x14c/0x1d0
>        ret_from_kernel_thread+0x5c/0x68
> 
> -> #1 (&vma->vm_sequence/1){+.+...}:
>        __vma_adjust+0x124/0xa80
>        __split_vma.isra.2+0x174/0x290
>        do_munmap+0x13c/0x4e0
>        vm_munmap+0x64/0xb0
>        elf_map+0x11c/0x130
>        load_elf_binary+0x6f0/0x15f0
>        search_binary_handler+0xe0/0x2a0
>        do_execveat_common.isra.14+0x7fc/0xbe0
>        call_usermodehelper_exec_async+0x14c/0x1d0
>        ret_from_kernel_thread+0x5c/0x68
> 
> -> #0 (&vma->vm_sequence){+.+...}:
>        lock_acquire+0xf4/0x310
>        unmap_page_range+0xcc/0xfa0
>        zap_page_range_single+0xd0/0x1a0
>        unmap_mapping_range+0x138/0x160
>        truncate_pagecache+0x50/0xa0
>        put_aio_ring_file+0x48/0xb0
>        aio_free_ring+0x40/0x1b0
>        free_ioctx+0x38/0xc0
>        process_one_work+0x2cc/0x8a0
>        worker_thread+0xac/0x580
>        kthread+0x164/0x1b0
>        ret_from_kernel_thread+0x5c/0x68
> 
> other info that might help us debug this:
> 
> Chain exists of:
>   &vma->vm_sequence --> &vma->vm_sequence/1 --> &mapping->i_mmap_rwsem
> 
>  Possible unsafe locking scenario:
> 
>        CPU0                    CPU1
>        ----                    ----
>   lock(&mapping->i_mmap_rwsem);
>                                lock(&vma->vm_sequence/1);
>                                lock(&mapping->i_mmap_rwsem);
>   lock(&vma->vm_sequence);
> 
>  *** DEADLOCK ***
> 
> To fix that we must grab the vm_sequence lock after any mapping one in
> __vma_adjust().
> 
> Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>

Shouldn't this be folded back into the previous patch? It fixes an
issue introduced by the previous one.
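
Whatever way the patches end up being split, the ordering the changelog above
describes can be sketched as follows; the helper name vma_adjust_begin_write()
is invented for illustration and its body is not the actual diff of this patch.

static void vma_adjust_begin_write(struct vm_area_struct *vma,
				   struct vm_area_struct *next,
				   struct address_space *mapping)
{
	/* Take the file mapping lock first, as __vma_adjust() already does. */
	if (mapping)
		i_mmap_lock_write(mapping);

	/*
	 * Only then enter the vm_sequence write sections, so the dependency
	 * seen by lockdep is always i_mmap_rwsem -> vm_sequence and never
	 * the reverse chain reported above.
	 */
	write_seqcount_begin(&vma->vm_sequence);
	if (next)
		write_seqcount_begin_nested(&next->vm_sequence,
					    SINGLE_DEPTH_NESTING);
}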


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC v5 01/11] mm: Dont assume page-table invariance during faults
  2017-08-08  9:45   ` Anshuman Khandual
@ 2017-08-08 12:11     ` Laurent Dufour
  0 siblings, 0 replies; 37+ messages in thread
From: Laurent Dufour @ 2017-08-08 12:11 UTC (permalink / raw)
  To: Anshuman Khandual, paulmck, peterz, akpm, kirill, ak, mhocko,
	dave, jack, Matthew Wilcox
  Cc: linux-kernel, linux-mm, haren, npiggin, bsingharora, Tim Chen

On 08/08/2017 11:45, Anshuman Khandual wrote:
> On 06/16/2017 11:22 PM, Laurent Dufour wrote:
>> From: Peter Zijlstra <peterz@infradead.org>
>>
>> One of the side effects of speculating on faults (without holding
>> mmap_sem) is that we can race with free_pgtables() and therefore we
>> cannot assume the page-tables will stick around.
>>
>> Remove the reliance on the pte pointer.
> 
> Looking into other parts of the series, it seems that we now have a
> sequence lock at both the MM and the VMA level, but after that we still
> need to take the page table lock before handling page faults (which in
> turn manipulate the PTE, including the swap-in paths). Is that not
> true?

Page table locking is still required as several VMAs can reference the same
page table.
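
A tiny sketch of why that is, not taken from the series: with split
page-table locks the PTE lock is keyed by the page-table page reached through
the pmd, not by the VMA, so a VMA-level sequence count cannot stand in for it.
The helper below is purely illustrative.

static bool faults_share_pte_lock(struct mm_struct *mm,
				  pmd_t *pmd_a, pmd_t *pmd_b)
{
	/*
	 * Faults hitting addresses in two different VMAs that are covered
	 * by the same PTE page resolve to the same spinlock and still
	 * serialize on it, whatever the per-VMA state says.
	 */
	return pte_lockptr(mm, pmd_a) == pte_lockptr(mm, pmd_b);
}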


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC v5 03/11] mm: Introduce pte_spinlock for FAULT_FLAG_SPECULATIVE
  2017-08-08 10:35   ` Anshuman Khandual
@ 2017-08-08 12:16     ` Laurent Dufour
  0 siblings, 0 replies; 37+ messages in thread
From: Laurent Dufour @ 2017-08-08 12:16 UTC (permalink / raw)
  To: Anshuman Khandual, paulmck, peterz, akpm, kirill, ak, mhocko,
	dave, jack, Matthew Wilcox
  Cc: linux-kernel, linux-mm, haren, npiggin, bsingharora, Tim Chen

On 08/08/2017 12:35, Anshuman Khandual wrote:
> On 06/16/2017 11:22 PM, Laurent Dufour wrote:
>> When handling a page fault without holding the mmap_sem, the fetch of the
>> pte lock pointer and the locking will have to be done while ensuring
>> that the VMA is not modified behind our back.
> 
> It does not change things from what's happening right now; where do we
> check that the VMA has not changed in the meantime?

This patch prepares for the use made later in this series; the goal is to
introduce the helper and the error checks where they are relevant.
Later, when the VMA check is added, this helper is changed.
The goal is to ease the review.
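
As a rough illustration of that later change (not the actual code of the
speculative fault patch; the vmf->sequence snapshot field is assumed purely
for the example), pte_spinlock() could end up looking something like this:

static bool pte_spinlock(struct vm_fault *vmf)
{
	vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
	spin_lock(vmf->ptl);

	/* On the speculative path, bail out if the VMA changed meanwhile. */
	if ((vmf->flags & FAULT_FLAG_SPECULATIVE) &&
	    read_seqcount_retry(&vmf->vma->vm_sequence, vmf->sequence)) {
		spin_unlock(vmf->ptl);
		return false;		/* the caller returns VM_FAULT_RETRY */
	}
	return true;
}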

> 
>>
>> So move the fetch and locking operations in a dedicated function.
>>
>> Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
>> ---
>>  mm/memory.c | 15 +++++++++++----
>>  1 file changed, 11 insertions(+), 4 deletions(-)
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 40834444ea0d..f1132f7931ef 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -2240,6 +2240,13 @@ static inline void wp_page_reuse(struct vm_fault *vmf)
>>  	pte_unmap_unlock(vmf->pte, vmf->ptl);
>>  }
>>  
>> +static bool pte_spinlock(struct vm_fault *vmf)
>> +{
>> +	vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
>> +	spin_lock(vmf->ptl);
>> +	return true;
>> +}
>> +
> 
> Moving them together makes sense, but again the 'if' blocks are redundant
> when it returns true all the time.
> 
>>  static bool pte_map_lock(struct vm_fault *vmf)
>>  {
>>  	vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl);
>> @@ -3552,8 +3559,8 @@ static int do_numa_page(struct vm_fault *vmf)
>>  	 * validation through pte_unmap_same(). It's of NUMA type but
>>  	 * the pfn may be screwed if the read is non atomic.
>>  	 */
>> -	vmf->ptl = pte_lockptr(vma->vm_mm, vmf->pmd);
>> -	spin_lock(vmf->ptl);
>> +	if (!pte_spinlock(vmf))
>> +		return VM_FAULT_RETRY;
>>  	if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
>>  		pte_unmap_unlock(vmf->pte, vmf->ptl);
>>  		goto out;
>> @@ -3745,8 +3752,8 @@ static int handle_pte_fault(struct vm_fault *vmf)
>>  	if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
>>  		return do_numa_page(vmf);
>>  
>> -	vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
>> -	spin_lock(vmf->ptl);
>> +	if (!pte_spinlock(vmf))
>> +		return VM_FAULT_RETRY;
>>  	entry = vmf->orig_pte;
>>  	if (unlikely(!pte_same(*vmf->pte, entry)))
>>  		goto unlock;
>>
> 


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC v5 05/11] mm: fix lock dependency against mapping->i_mmap_rwsem
  2017-08-08 11:17   ` Anshuman Khandual
@ 2017-08-08 12:20     ` Laurent Dufour
  2017-08-08 12:49       ` Jan Kara
  2017-08-08 13:15       ` Peter Zijlstra
  0 siblings, 2 replies; 37+ messages in thread
From: Laurent Dufour @ 2017-08-08 12:20 UTC (permalink / raw)
  To: Anshuman Khandual, paulmck, peterz, akpm, kirill, ak, mhocko,
	dave, jack, Matthew Wilcox
  Cc: linux-kernel, linux-mm, haren, npiggin, bsingharora, Tim Chen

On 08/08/2017 13:17, Anshuman Khandual wrote:
> On 06/16/2017 11:22 PM, Laurent Dufour wrote:
>> kworker/32:1/819 is trying to acquire lock:
>>  (&vma->vm_sequence){+.+...}, at: [<c0000000002f20e0>]
>> zap_page_range_single+0xd0/0x1a0
>>
>> but task is already holding lock:
>>  (&mapping->i_mmap_rwsem){++++..}, at: [<c0000000002f229c>]
>> unmap_mapping_range+0x7c/0x160
>>
>> which lock already depends on the new lock.
>>
>> the existing dependency chain (in reverse order) is:
>>
>> -> #2 (&mapping->i_mmap_rwsem){++++..}:
>>        down_write+0x84/0x130
>>        __vma_adjust+0x1f4/0xa80
>>        __split_vma.isra.2+0x174/0x290
>>        do_munmap+0x13c/0x4e0
>>        vm_munmap+0x64/0xb0
>>        elf_map+0x11c/0x130
>>        load_elf_binary+0x6f0/0x15f0
>>        search_binary_handler+0xe0/0x2a0
>>        do_execveat_common.isra.14+0x7fc/0xbe0
>>        call_usermodehelper_exec_async+0x14c/0x1d0
>>        ret_from_kernel_thread+0x5c/0x68
>>
>> -> #1 (&vma->vm_sequence/1){+.+...}:
>>        __vma_adjust+0x124/0xa80
>>        __split_vma.isra.2+0x174/0x290
>>        do_munmap+0x13c/0x4e0
>>        vm_munmap+0x64/0xb0
>>        elf_map+0x11c/0x130
>>        load_elf_binary+0x6f0/0x15f0
>>        search_binary_handler+0xe0/0x2a0
>>        do_execveat_common.isra.14+0x7fc/0xbe0
>>        call_usermodehelper_exec_async+0x14c/0x1d0
>>        ret_from_kernel_thread+0x5c/0x68
>>
>> -> #0 (&vma->vm_sequence){+.+...}:
>>        lock_acquire+0xf4/0x310
>>        unmap_page_range+0xcc/0xfa0
>>        zap_page_range_single+0xd0/0x1a0
>>        unmap_mapping_range+0x138/0x160
>>        truncate_pagecache+0x50/0xa0
>>        put_aio_ring_file+0x48/0xb0
>>        aio_free_ring+0x40/0x1b0
>>        free_ioctx+0x38/0xc0
>>        process_one_work+0x2cc/0x8a0
>>        worker_thread+0xac/0x580
>>        kthread+0x164/0x1b0
>>        ret_from_kernel_thread+0x5c/0x68
>>
>> other info that might help us debug this:
>>
>> Chain exists of:
>>   &vma->vm_sequence --> &vma->vm_sequence/1 --> &mapping->i_mmap_rwsem
>>
>>  Possible unsafe locking scenario:
>>
>>        CPU0                    CPU1
>>        ----                    ----
>>   lock(&mapping->i_mmap_rwsem);
>>                                lock(&vma->vm_sequence/1);
>>                                lock(&mapping->i_mmap_rwsem);
>>   lock(&vma->vm_sequence);
>>
>>  *** DEADLOCK ***
>>
>> To fix that we must grab the vm_sequence lock after any mapping one in
>> __vma_adjust().
>>
>> Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
> 
> Shouldn't this be folded back into the previous patch? It fixes an
> issue introduced by the previous one.

This is an option, but the previous one was signed by Peter, and I'd prefer
to keep his unchanged and add this new one to fix that.
Again this is to ease the review.




^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC v5 05/11] mm: fix lock dependency against mapping->i_mmap_rwsem
  2017-08-08 12:20     ` Laurent Dufour
@ 2017-08-08 12:49       ` Jan Kara
  2017-08-08 13:08         ` Laurent Dufour
  2017-08-08 13:15       ` Peter Zijlstra
  1 sibling, 1 reply; 37+ messages in thread
From: Jan Kara @ 2017-08-08 12:49 UTC (permalink / raw)
  To: Laurent Dufour
  Cc: Anshuman Khandual, paulmck, peterz, akpm, kirill, ak, mhocko,
	dave, jack, Matthew Wilcox, linux-kernel, linux-mm, haren,
	npiggin, bsingharora, Tim Chen

On Tue 08-08-17 14:20:23, Laurent Dufour wrote:
> On 08/08/2017 13:17, Anshuman Khandual wrote:
> > On 06/16/2017 11:22 PM, Laurent Dufour wrote:
> >> kworker/32:1/819 is trying to acquire lock:
> >>  (&vma->vm_sequence){+.+...}, at: [<c0000000002f20e0>]
> >> zap_page_range_single+0xd0/0x1a0
> >>
> >> but task is already holding lock:
> >>  (&mapping->i_mmap_rwsem){++++..}, at: [<c0000000002f229c>]
> >> unmap_mapping_range+0x7c/0x160
> >>
> >> which lock already depends on the new lock.
> >>
> >> the existing dependency chain (in reverse order) is:
> >>
> >> -> #2 (&mapping->i_mmap_rwsem){++++..}:
> >>        down_write+0x84/0x130
> >>        __vma_adjust+0x1f4/0xa80
> >>        __split_vma.isra.2+0x174/0x290
> >>        do_munmap+0x13c/0x4e0
> >>        vm_munmap+0x64/0xb0
> >>        elf_map+0x11c/0x130
> >>        load_elf_binary+0x6f0/0x15f0
> >>        search_binary_handler+0xe0/0x2a0
> >>        do_execveat_common.isra.14+0x7fc/0xbe0
> >>        call_usermodehelper_exec_async+0x14c/0x1d0
> >>        ret_from_kernel_thread+0x5c/0x68
> >>
> >> -> #1 (&vma->vm_sequence/1){+.+...}:
> >>        __vma_adjust+0x124/0xa80
> >>        __split_vma.isra.2+0x174/0x290
> >>        do_munmap+0x13c/0x4e0
> >>        vm_munmap+0x64/0xb0
> >>        elf_map+0x11c/0x130
> >>        load_elf_binary+0x6f0/0x15f0
> >>        search_binary_handler+0xe0/0x2a0
> >>        do_execveat_common.isra.14+0x7fc/0xbe0
> >>        call_usermodehelper_exec_async+0x14c/0x1d0
> >>        ret_from_kernel_thread+0x5c/0x68
> >>
> >> -> #0 (&vma->vm_sequence){+.+...}:
> >>        lock_acquire+0xf4/0x310
> >>        unmap_page_range+0xcc/0xfa0
> >>        zap_page_range_single+0xd0/0x1a0
> >>        unmap_mapping_range+0x138/0x160
> >>        truncate_pagecache+0x50/0xa0
> >>        put_aio_ring_file+0x48/0xb0
> >>        aio_free_ring+0x40/0x1b0
> >>        free_ioctx+0x38/0xc0
> >>        process_one_work+0x2cc/0x8a0
> >>        worker_thread+0xac/0x580
> >>        kthread+0x164/0x1b0
> >>        ret_from_kernel_thread+0x5c/0x68
> >>
> >> other info that might help us debug this:
> >>
> >> Chain exists of:
> >>   &vma->vm_sequence --> &vma->vm_sequence/1 --> &mapping->i_mmap_rwsem
> >>
> >>  Possible unsafe locking scenario:
> >>
> >>        CPU0                    CPU1
> >>        ----                    ----
> >>   lock(&mapping->i_mmap_rwsem);
> >>                                lock(&vma->vm_sequence/1);
> >>                                lock(&mapping->i_mmap_rwsem);
> >>   lock(&vma->vm_sequence);
> >>
> >>  *** DEADLOCK ***
> >>
> >> To fix that we must grab the vm_sequence lock after any mapping one in
> >> __vma_adjust().
> >>
> >> Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
> > 
> > Shouldn't this be folded back into the previous patch? It fixes an
> > issue introduced by the previous one.
> 
> This is an option, but the previous one was signed by Peter, and I'd prefer
> to keep his unchanged and add this new one to fix that.
> Again this is to ease the review.

In this particular case I disagree. We should not have buggy patches in the
series. It breaks bisectability, and the ease of review is IMO very
questionable because the previous patch is simply buggy and thus hard to
validate on its own. If the resulting combo would be too complex, you could
think of a different way to split it up so that intermediate steps are
not buggy...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC v5 05/11] mm: fix lock dependency against mapping->i_mmap_rwsem
  2017-08-08 12:49       ` Jan Kara
@ 2017-08-08 13:08         ` Laurent Dufour
  0 siblings, 0 replies; 37+ messages in thread
From: Laurent Dufour @ 2017-08-08 13:08 UTC (permalink / raw)
  To: Jan Kara
  Cc: Anshuman Khandual, paulmck, peterz, akpm, kirill, ak, mhocko,
	dave, Matthew Wilcox, linux-kernel, linux-mm, haren, npiggin,
	bsingharora, Tim Chen

On 08/08/2017 14:49, Jan Kara wrote:
> On Tue 08-08-17 14:20:23, Laurent Dufour wrote:
>> On 08/08/2017 13:17, Anshuman Khandual wrote:
>>> On 06/16/2017 11:22 PM, Laurent Dufour wrote:
>>>> kworker/32:1/819 is trying to acquire lock:
>>>>  (&vma->vm_sequence){+.+...}, at: [<c0000000002f20e0>]
>>>> zap_page_range_single+0xd0/0x1a0
>>>>
>>>> but task is already holding lock:
>>>>  (&mapping->i_mmap_rwsem){++++..}, at: [<c0000000002f229c>]
>>>> unmap_mapping_range+0x7c/0x160
>>>>
>>>> which lock already depends on the new lock.
>>>>
>>>> the existing dependency chain (in reverse order) is:
>>>>
>>>> -> #2 (&mapping->i_mmap_rwsem){++++..}:
>>>>        down_write+0x84/0x130
>>>>        __vma_adjust+0x1f4/0xa80
>>>>        __split_vma.isra.2+0x174/0x290
>>>>        do_munmap+0x13c/0x4e0
>>>>        vm_munmap+0x64/0xb0
>>>>        elf_map+0x11c/0x130
>>>>        load_elf_binary+0x6f0/0x15f0
>>>>        search_binary_handler+0xe0/0x2a0
>>>>        do_execveat_common.isra.14+0x7fc/0xbe0
>>>>        call_usermodehelper_exec_async+0x14c/0x1d0
>>>>        ret_from_kernel_thread+0x5c/0x68
>>>>
>>>> -> #1 (&vma->vm_sequence/1){+.+...}:
>>>>        __vma_adjust+0x124/0xa80
>>>>        __split_vma.isra.2+0x174/0x290
>>>>        do_munmap+0x13c/0x4e0
>>>>        vm_munmap+0x64/0xb0
>>>>        elf_map+0x11c/0x130
>>>>        load_elf_binary+0x6f0/0x15f0
>>>>        search_binary_handler+0xe0/0x2a0
>>>>        do_execveat_common.isra.14+0x7fc/0xbe0
>>>>        call_usermodehelper_exec_async+0x14c/0x1d0
>>>>        ret_from_kernel_thread+0x5c/0x68
>>>>
>>>> -> #0 (&vma->vm_sequence){+.+...}:
>>>>        lock_acquire+0xf4/0x310
>>>>        unmap_page_range+0xcc/0xfa0
>>>>        zap_page_range_single+0xd0/0x1a0
>>>>        unmap_mapping_range+0x138/0x160
>>>>        truncate_pagecache+0x50/0xa0
>>>>        put_aio_ring_file+0x48/0xb0
>>>>        aio_free_ring+0x40/0x1b0
>>>>        free_ioctx+0x38/0xc0
>>>>        process_one_work+0x2cc/0x8a0
>>>>        worker_thread+0xac/0x580
>>>>        kthread+0x164/0x1b0
>>>>        ret_from_kernel_thread+0x5c/0x68
>>>>
>>>> other info that might help us debug this:
>>>>
>>>> Chain exists of:
>>>>   &vma->vm_sequence --> &vma->vm_sequence/1 --> &mapping->i_mmap_rwsem
>>>>
>>>>  Possible unsafe locking scenario:
>>>>
>>>>        CPU0                    CPU1
>>>>        ----                    ----
>>>>   lock(&mapping->i_mmap_rwsem);
>>>>                                lock(&vma->vm_sequence/1);
>>>>                                lock(&mapping->i_mmap_rwsem);
>>>>   lock(&vma->vm_sequence);
>>>>
>>>>  *** DEADLOCK ***
>>>>
>>>> To fix that we must grab the vm_sequence lock after any mapping one in
>>>> __vma_adjust().
>>>>
>>>> Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
>>>
>>> Shouldn't this be folded back into the previous patch? It fixes an
>>> issue introduced by the previous one.
>>
>> This is an option, but the previous one was signed by Peter, and I'd prefer
>> to keep his unchanged and add this new one to fix that.
>> Again this is to ease the review.
> 
> In this particular case I disagree. We should not have buggy patches in the
> series. It breaks bisectability, and the ease of review is IMO very
> questionable because the previous patch is simply buggy and thus hard to
> validate on its own. If the resulting combo would be too complex, you could
> think of a different way to split it up so that intermediate steps are
> not buggy...

I don't think the combo will become too large; it's just moving some calls
around. So, as bisectability seems to be more important than readability,
I'll merge it into Peter's original patch.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC v5 05/11] mm: fix lock dependency against mapping->i_mmap_rwsem
  2017-08-08 12:20     ` Laurent Dufour
  2017-08-08 12:49       ` Jan Kara
@ 2017-08-08 13:15       ` Peter Zijlstra
  2017-08-08 13:34         ` Laurent Dufour
  1 sibling, 1 reply; 37+ messages in thread
From: Peter Zijlstra @ 2017-08-08 13:15 UTC (permalink / raw)
  To: Laurent Dufour
  Cc: Anshuman Khandual, paulmck, akpm, kirill, ak, mhocko, dave, jack,
	Matthew Wilcox, linux-kernel, linux-mm, haren, npiggin,
	bsingharora, Tim Chen

On Tue, Aug 08, 2017 at 02:20:23PM +0200, Laurent Dufour wrote:
> This is an option, but the previous one was signed by Peter, and I'd prefer
> to keep his unchanged and add this new one to fix that.
> Again this is to ease the review.

You can always add something like:

[ldufour: fixed lockdep complaint]

Before your SoB.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC v5 05/11] mm: fix lock dependency against mapping->i_mmap_rwsem
  2017-08-08 13:15       ` Peter Zijlstra
@ 2017-08-08 13:34         ` Laurent Dufour
  0 siblings, 0 replies; 37+ messages in thread
From: Laurent Dufour @ 2017-08-08 13:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Anshuman Khandual, paulmck, akpm, kirill, ak, mhocko, dave, jack,
	Matthew Wilcox, linux-kernel, linux-mm, haren, npiggin,
	bsingharora, Tim Chen

On 08/08/2017 15:15, Peter Zijlstra wrote:
> On Tue, Aug 08, 2017 at 02:20:23PM +0200, Laurent Dufour wrote:
>> This is an option, but the previous one was signed by Peter, and I'd prefer
>> to keep his unchanged and add this new one to fix that.
>> Again this is to ease the review.
> 
> You can always add something like:
> 
> [ldufour: fixed lockdep complaint]
> 
> Before your SoB.

Yes, that's what I'm doing right now, and I'll push a new series based on
4.13-rc4 asap.

Thanks.


^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2017-08-08 13:34 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-06-16 17:52 [RFC v5 00/11] Speculative page faults Laurent Dufour
2017-06-16 17:52 ` [RFC v5 01/11] mm: Dont assume page-table invariance during faults Laurent Dufour
2017-07-07  7:07   ` Balbir Singh
2017-07-10 17:48     ` Laurent Dufour
2017-07-11  4:26       ` Balbir Singh
2017-08-08 10:04       ` Anshuman Khandual
2017-08-08  9:45   ` Anshuman Khandual
2017-08-08 12:11     ` Laurent Dufour
2017-06-16 17:52 ` [RFC v5 02/11] mm: Prepare for FAULT_FLAG_SPECULATIVE Laurent Dufour
2017-08-08 10:24   ` Anshuman Khandual
2017-08-08 10:42     ` Peter Zijlstra
2017-06-16 17:52 ` [RFC v5 03/11] mm: Introduce pte_spinlock " Laurent Dufour
2017-08-08 10:35   ` Anshuman Khandual
2017-08-08 12:16     ` Laurent Dufour
2017-06-16 17:52 ` [RFC v5 04/11] mm: VMA sequence count Laurent Dufour
2017-08-08 10:59   ` Anshuman Khandual
2017-08-08 11:04     ` Peter Zijlstra
2017-06-16 17:52 ` [RFC v5 05/11] mm: fix lock dependency against mapping->i_mmap_rwsem Laurent Dufour
2017-08-08 11:17   ` Anshuman Khandual
2017-08-08 12:20     ` Laurent Dufour
2017-08-08 12:49       ` Jan Kara
2017-08-08 13:08         ` Laurent Dufour
2017-08-08 13:15       ` Peter Zijlstra
2017-08-08 13:34         ` Laurent Dufour
2017-06-16 17:52 ` [RFC v5 06/11] mm: Protect VMA modifications using VMA sequence count Laurent Dufour
2017-06-16 17:52 ` [RFC v5 07/11] mm: RCU free VMAs Laurent Dufour
2017-06-16 17:52 ` [RFC v5 08/11] mm: Provide speculative fault infrastructure Laurent Dufour
2017-06-16 17:52 ` [RFC v5 09/11] mm: Try spin lock in speculative path Laurent Dufour
2017-07-05 18:50   ` Peter Zijlstra
2017-07-06 13:46     ` Laurent Dufour
2017-07-06 14:48       ` Peter Zijlstra
2017-07-06 15:29         ` Laurent Dufour
2017-07-06 16:13           ` Peter Zijlstra
2017-06-16 17:52 ` [RFC v5 10/11] x86/mm: Add speculative pagefault handling Laurent Dufour
2017-06-16 17:52 ` [RFC v5 11/11] powerpc/mm: Add speculative page fault Laurent Dufour
2017-07-03 17:32 ` [RFC v5 00/11] Speculative page faults Laurent Dufour
2017-07-07  1:54   ` Balbir Singh

This is a public inbox; see mirroring instructions
for how to clone and mirror all data and code used for this inbox,
as well as URLs for NNTP newsgroup(s).