linux-mm.kvack.org archive mirror
* [PATCH 00/16] Speculative page faults
@ 2017-08-08 14:35 Laurent Dufour
  2017-08-08 14:35 ` [PATCH 01/16] mm: Dont assume page-table invariance during faults Laurent Dufour
                   ` (15 more replies)
  0 siblings, 16 replies; 28+ messages in thread
From: Laurent Dufour @ 2017-08-08 14:35 UTC (permalink / raw)
  To: paulmck, peterz, akpm, kirill, ak, mhocko, dave, jack,
	Matthew Wilcox, benh, mpe, paulus, Thomas Gleixner, Ingo Molnar,
	hpa, Will Deacon
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora,
	Tim Chen, linuxppc-dev, x86

This is a port to kernel 4.13 of the work done by Peter Zijlstra to
handle page faults without holding the mm semaphore [1].

The idea is to try to handle user space page faults without holding the
mmap_sem. This should allow better concurrency for massively threaded
processes, since the page fault handler will no longer wait for other
threads' memory layout changes to complete, assuming those changes affect
another part of the process's memory space. This type of page fault is
named a speculative page fault. If the speculative page fault fails,
because concurrency is detected or because the underlying PMD or PTE
tables are not yet allocated, its processing is aborted and a classic page
fault is tried instead.

The speculative page fault (SPF) has to look for the VMA matching the fault
address without holding the mmap_sem, so the VMA list is now managed using
SRCU, allowing lockless walking. The only impact is the deferred file
dereferencing in the case of a file mapping, since the file pointer is
released once the SRCU cleanup is done.  This series relies on the change
done recently by Paul McKenney in SRCU which now runs a callback per CPU
instead of per SRCU structure [2].
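
For reference, here is a minimal sketch of the deferred VMA freeing as
implemented by patch 6 below (vma_srcu and vm_rcu_head are introduced
there):

	static void __free_vma(struct rcu_head *head)
	{
		struct vm_area_struct *vma =
			container_of(head, struct vm_area_struct, vm_rcu_head);

		/* the deferred file dereferencing mentioned above */
		if (vma->vm_file)
			fput(vma->vm_file);
		kmem_cache_free(vm_area_cachep, vma);
	}

	static void free_vma(struct vm_area_struct *vma)
	{
		/* VMA stays valid for SRCU readers until the grace period ends */
		call_srcu(&vma_srcu, &vma->vm_rcu_head, __free_vma);
	}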

The VMA's attributes checked during the speculative page fault processing
have to be protected against parallel changes. This is done by using a
per-VMA sequence lock. This sequence lock allows the speculative page
fault handler to quickly check for parallel changes in progress and to
abort the speculative page fault in that case.

Once the VMA is found, the speculative page fault handler checks the
VMA's attributes to verify that the page fault can be handled this way.
The VMA is protected through a sequence lock which allows fast detection
of concurrent VMA changes. If such a change is detected, the speculative
page fault is aborted and a *classic* page fault is tried instead.  VMA
sequence lock writes are added wherever VMA attributes which are checked
during the page fault are modified.
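
As a sketch, the write side (from patch 5 below) and the speculative read
side (from patch 7 below) pair up like this:

	/* writer, under mmap_sem held for writing */
	write_seqcount_begin(&vma->vm_sequence);
	vma->vm_flags = newflags;	/* an attribute checked by SPF */
	write_seqcount_end(&vma->vm_sequence);

	/* speculative reader */
	seq = raw_read_seqcount(&vma->vm_sequence);
	if (seq & 1)
		goto fallback;		/* a write is in progress */
	/* ... check the VMA's bounds and access flags ... */
	if (read_seqcount_retry(&vma->vm_sequence, seq))
		goto fallback;		/* the VMA changed behind our back */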

When the PTE is fetched, the VMA is checked again for changes, so once the
page table lock is taken the VMA is known to be valid. Any other change
touching this PTE would first need to take the page table lock, so no
parallel change is possible at this point.
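
This is the ordering implemented by pte_map_lock() in patch 7 below
(patch 8 later turns the lock acquisition into a trylock); roughly:

	local_irq_disable();	/* keeps page-tables around, gup_fast() style */
	if (vma_has_changed(vmf))
		goto fail;
	pte = pte_offset_map_lock(mm, vmf->pmd, vmf->address, &ptl);
	if (vma_has_changed(vmf)) {
		/* re-check now that the PTL is held */
		pte_unmap_unlock(pte, ptl);
		goto fail;
	}
	local_irq_enable();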

Compared to Peter's initial work, this series introduces a spin_trylock
when dealing with the speculative page fault. This is required to avoid a
deadlock when handling a page fault while a TLB invalidation is requested
by another CPU holding the PTE lock. Another change is due to a lock
dependency issue with mapping->i_mmap_rwsem.
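
Concretely, the speculative path now only tries the PTL (this is the hunk
patch 8 below applies to pte_spinlock()):

	vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
	if (unlikely(!spin_trylock(vmf->ptl)))
		goto out;	/* abort SPF, a classic page fault is tried */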

This series builds on top of v4.13-rc4 and is functional on x86 and
PowerPC.

Tests have been made using a large commercial in-memory database on a
PowerPC system with 752 CPUs. The results are very encouraging: the
loading of the 2TB database was 14% faster with the speculative page
fault handler.

Using the ebizzy test [3], which spawns a lot of threads, the results are
good when running on both large and small systems. When using kernbench
[4], the results are quite similar, which is expected as not many
multithreaded processes are involved. But there is no performance
degradation either, which is good.

------------------
Benchmark results

Note these tests have been run on top of 4.13-rc3 with the following patch
from Paul McKenney applied:
 "srcu: Provide ordering for CPU not involved in grace period" [5]

Ebizzy:
-------
The test counts the number of records per second it can manage; the
higher, the better. I ran it as 'ebizzy -mTRp'. To get consistent results
I repeated the test 100 times and measured the average, the mean deviation
and the max.

- 16 CPUs x86 VM
Records/s	4.13-rc3	4.13-rc3-spf
Average		11455.92	45803.64
Mean deviation	509.34		848.19
Max		13997		49824

- 80 CPUs Power 8 node:
Records/s	4.13-rc3	4.13-rc3-spf
Average		33848.76	63427.62
Mean deviation	684.48		1618.84
Max		36235		70401

Kernbench:
----------
This test builds a 4.12 kernel using the platform's default config. The
build has been run 5 times at each load level.

- 16 CPUs x86 VM
Average Half load -j 7 Run (std deviation)
 		 4.13.0-rc3		4.13.0-rc3-spf
Elapsed Time     166.668 (0.462299)	167.55 (0.432724)
User Time        1083.11 (2.89018)	1083.76 (2.17015)
System Time      202.982 (0.984058)	210.364 (0.890382)
Percent CPU 	 771.2 (0.83666)	771.8 (1.09545)
Context Switches 46789 (519.558)	67602.4 (365.929)
Sleeps           83870.8 (836.392)	84269.4 (457.962)

Average Optimal load -j 16 Run (std deviation)
 		 4.13.0-rc3		4.13.0-rc3-spf
Elapsed Time     85.002 (0.298111)	85.406 (0.506784)
User Time        1033.25 (52.6037)	1034.63 (51.8167)
System Time      185.46 (18.4826)	191.75 (19.6379)
Percent CPU 	 1062.6 (307.181)	1063.9 (307.948)
Context Switches 67423.3 (21762.7)	91316.1 (25004.4)
Sleeps           89393.6 (5860.2)	89489.9 (5563.54)

The elapsed time is of the same order, a bit larger in the case of the spf
kernel, but that seems to be within the error margin.

- 80 CPUs Power 8 node:
Average Half load -j 40 Run (std deviation)
 		 4.13.0-rc3		4.13.0-rc3-spf
Elapsed Time     116.422 (0.604707)	116.898 (1.00981)
User Time        4410.13 (23.4272)	4393.49 (22.6739)
System Time      130.128 (0.567468)	132.16 (0.840238)
Percent CPU      3899.2 (13.9535)	3871 (17.6777)
Context Switches 72699.8 (585.077)	73281.4 (516.003)
Sleeps           160396 (1248.34)	161801 (522.71)

Average Optimal load -j 80 Run (std deviation)
 		 4.13.0-rc3		4.13.0-rc3-spf
Elapsed Time     111.216 (0.826698)	110.442 (0.846505)
User Time        5911.85 (1583.04)	5932.14 (1622.02)
System Time      164.799 (36.5712)	168.29 (38.0891)
Percent CPU      5371.9 (1552.74)	5410.2 (1623.17)
Context Switches 117770 (47512.1)	130131 (59927.8)
Sleeps           161619 (2210.47)	163442 (2349.71)

Here the elapsed time is a bit shorter using the spf kernel, but again we
stay within the error margin. It has to be noted that this system is not
well balanced from the NUMA point of view, as all the available memory is
attached to a single node.

------------------------
Changes since RFC V5 [6]
 - Port to 4.13 kernel
 - Merging patch fixing lock dependency into the original patch
 - Replace the 2 parameters of vma_has_changed() with the vmf pointer
 - In patch 7, don't call __do_fault() in the speculative path as it may
 want to unlock the mmap_sem.
 - In patch 11-12, don't check for vma boundaries when
 page_add_new_anon_rmap() is called during the spf path and protect against
 anon_vma pointer's update.
 - In patch 13-16, add performance events to report the number of
 successful and failed speculative events.

[1] http://linux-kernel.2935.n7.nabble.com/RFC-PATCH-0-6-Another-go-at-speculative-page-faults-tt965642.html#none
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=da915ad5cf25b5f5d358dd3670c3378d8ae8c03e
[3] http://ebizzy.sourceforge.net/
[4] http://ck.kolivas.org/apps/kernbench/kernbench-0.50/
[5] https://lkml.org/lkml/2017/7/24/829
[6] https://lwn.net/Articles/725607/

Laurent Dufour (10):
  mm: Introduce pte_spinlock for FAULT_FLAG_SPECULATIVE
  mm: Protect VMA modifications using VMA sequence count
  mm: Try spin lock in speculative path
  powerpc/mm: Add speculative page fault
  mm: Introduce __page_add_new_anon_rmap()
  mm: Protect SPF handler against anon_vma changes
  perf: Add a speculative page fault sw events
  x86/mm: Add support for SPF events
  powerpc/mm: Add support for SPF events
  perf tools: Add support for SPF events

Peter Zijlstra (6):
  mm: Dont assume page-table invariance during faults
  mm: Prepare for FAULT_FLAG_SPECULATIVE
  mm: VMA sequence count
  mm: RCU free VMAs
  mm: Provide speculative fault infrastructure
  x86/mm: Add speculative pagefault handling

 arch/powerpc/mm/fault.c               |  30 +++-
 arch/x86/mm/fault.c                   |  18 ++
 fs/proc/task_mmu.c                    |   2 +
 include/linux/mm.h                    |   4 +
 include/linux/mm_types.h              |   3 +
 include/linux/rmap.h                  |  12 +-
 include/uapi/linux/perf_event.h       |   2 +
 kernel/fork.c                         |   1 +
 mm/init-mm.c                          |   1 +
 mm/internal.h                         |  19 +++
 mm/khugepaged.c                       |   3 +
 mm/madvise.c                          |   4 +
 mm/memory.c                           | 302 ++++++++++++++++++++++++++++------
 mm/mempolicy.c                        |  10 +-
 mm/mlock.c                            |   9 +-
 mm/mmap.c                             | 123 ++++++++++----
 mm/mprotect.c                         |   2 +
 mm/mremap.c                           |   7 +
 mm/rmap.c                             |   5 +-
 tools/include/uapi/linux/perf_event.h |   2 +
 tools/perf/util/evsel.c               |   2 +
 tools/perf/util/parse-events.c        |   8 +
 tools/perf/util/parse-events.l        |   2 +
 tools/perf/util/python.c              |   2 +
 24 files changed, 484 insertions(+), 89 deletions(-)

-- 
2.7.4


* [PATCH 01/16] mm: Dont assume page-table invariance during faults
  2017-08-08 14:35 [PATCH 00/16] Speculative page faults Laurent Dufour
@ 2017-08-08 14:35 ` Laurent Dufour
  2017-08-08 14:35 ` [PATCH 02/16] mm: Prepare for FAULT_FLAG_SPECULATIVE Laurent Dufour
                   ` (14 subsequent siblings)
  15 siblings, 0 replies; 28+ messages in thread
From: Laurent Dufour @ 2017-08-08 14:35 UTC (permalink / raw)
  To: paulmck, peterz, akpm, kirill, ak, mhocko, dave, jack,
	Matthew Wilcox, benh, mpe, paulus, Thomas Gleixner, Ingo Molnar,
	hpa, Will Deacon
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora,
	Tim Chen, linuxppc-dev, x86

From: Peter Zijlstra <peterz@infradead.org>

One of the side effects of speculating on faults (without holding
mmap_sem) is that we can race with free_pgtables() and therefore we
cannot assume the page-tables will stick around.

Remove the reliance on the pte pointer.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 mm/memory.c | 27 ---------------------------
 1 file changed, 27 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index f65beaad319b..d08f494f1b37 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2103,30 +2103,6 @@ int apply_to_page_range(struct mm_struct *mm, unsigned long addr,
 }
 EXPORT_SYMBOL_GPL(apply_to_page_range);
 
-/*
- * handle_pte_fault chooses page fault handler according to an entry which was
- * read non-atomically.  Before making any commitment, on those architectures
- * or configurations (e.g. i386 with PAE) which might give a mix of unmatched
- * parts, do_swap_page must check under lock before unmapping the pte and
- * proceeding (but do_wp_page is only called after already making such a check;
- * and do_anonymous_page can safely check later on).
- */
-static inline int pte_unmap_same(struct mm_struct *mm, pmd_t *pmd,
-				pte_t *page_table, pte_t orig_pte)
-{
-	int same = 1;
-#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT)
-	if (sizeof(pte_t) > sizeof(unsigned long)) {
-		spinlock_t *ptl = pte_lockptr(mm, pmd);
-		spin_lock(ptl);
-		same = pte_same(*page_table, orig_pte);
-		spin_unlock(ptl);
-	}
-#endif
-	pte_unmap(page_table);
-	return same;
-}
-
 static inline void cow_user_page(struct page *dst, struct page *src, unsigned long va, struct vm_area_struct *vma)
 {
 	debug_dma_assert_idle(src);
@@ -2683,9 +2659,6 @@ int do_swap_page(struct vm_fault *vmf)
 	int exclusive = 0;
 	int ret = 0;
 
-	if (!pte_unmap_same(vma->vm_mm, vmf->pmd, vmf->pte, vmf->orig_pte))
-		goto out;
-
 	entry = pte_to_swp_entry(vmf->orig_pte);
 	if (unlikely(non_swap_entry(entry))) {
 		if (is_migration_entry(entry)) {
-- 
2.7.4


* [PATCH 02/16] mm: Prepare for FAULT_FLAG_SPECULATIVE
  2017-08-08 14:35 [PATCH 00/16] Speculative page faults Laurent Dufour
  2017-08-08 14:35 ` [PATCH 01/16] mm: Dont assume page-table invariance during faults Laurent Dufour
@ 2017-08-08 14:35 ` Laurent Dufour
  2017-08-09 10:08   ` Kirill A. Shutemov
  2017-08-08 14:35 ` [PATCH 03/16] mm: Introduce pte_spinlock " Laurent Dufour
                   ` (13 subsequent siblings)
  15 siblings, 1 reply; 28+ messages in thread
From: Laurent Dufour @ 2017-08-08 14:35 UTC (permalink / raw)
  To: paulmck, peterz, akpm, kirill, ak, mhocko, dave, jack,
	Matthew Wilcox, benh, mpe, paulus, Thomas Gleixner, Ingo Molnar,
	hpa, Will Deacon
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora,
	Tim Chen, linuxppc-dev, x86

From: Peter Zijlstra <peterz@infradead.org>

When speculating faults (without holding mmap_sem) we need to validate
that the vma against which we loaded pages is still valid when we're
ready to install the new PTE.

Therefore, replace the pte_offset_map_lock() calls that (re)take the
PTL with pte_map_lock() which can fail in case we find the VMA changed
since we started the fault.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

[Port to 4.12 kernel]
[Remove the comment about the fault_env structure which has been
 implemented as the vm_fault structure in the kernel]
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
---
 include/linux/mm.h |  1 +
 mm/memory.c        | 55 ++++++++++++++++++++++++++++++++++++++----------------
 2 files changed, 40 insertions(+), 16 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 46b9ac5e8569..8763ec96dc78 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -286,6 +286,7 @@ extern pgprot_t protection_map[16];
 #define FAULT_FLAG_USER		0x40	/* The fault originated in userspace */
 #define FAULT_FLAG_REMOTE	0x80	/* faulting for non current tsk/mm */
 #define FAULT_FLAG_INSTRUCTION  0x100	/* The fault was during an instruction fetch */
+#define FAULT_FLAG_SPECULATIVE	0x200	/* Speculative fault, not holding mmap_sem */
 
 #define FAULT_FLAG_TRACE \
 	{ FAULT_FLAG_WRITE,		"WRITE" }, \
diff --git a/mm/memory.c b/mm/memory.c
index d08f494f1b37..b93916c0b086 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2241,6 +2241,12 @@ static inline void wp_page_reuse(struct vm_fault *vmf)
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
 }
 
+static bool pte_map_lock(struct vm_fault *vmf)
+{
+	vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl);
+	return true;
+}
+
 /*
  * Handle the case of a page which we actually need to copy to a new page.
  *
@@ -2268,6 +2274,7 @@ static int wp_page_copy(struct vm_fault *vmf)
 	const unsigned long mmun_start = vmf->address & PAGE_MASK;
 	const unsigned long mmun_end = mmun_start + PAGE_SIZE;
 	struct mem_cgroup *memcg;
+	int ret = VM_FAULT_OOM;
 
 	if (unlikely(anon_vma_prepare(vma)))
 		goto oom;
@@ -2295,7 +2302,11 @@ static int wp_page_copy(struct vm_fault *vmf)
 	/*
 	 * Re-check the pte - we dropped the lock
 	 */
-	vmf->pte = pte_offset_map_lock(mm, vmf->pmd, vmf->address, &vmf->ptl);
+	if (!pte_map_lock(vmf)) {
+		mem_cgroup_cancel_charge(new_page, memcg, false);
+		ret = VM_FAULT_RETRY;
+		goto oom_free_new;
+	}
 	if (likely(pte_same(*vmf->pte, vmf->orig_pte))) {
 		if (old_page) {
 			if (!PageAnon(old_page)) {
@@ -2383,7 +2394,7 @@ static int wp_page_copy(struct vm_fault *vmf)
 oom:
 	if (old_page)
 		put_page(old_page);
-	return VM_FAULT_OOM;
+	return ret;
 }
 
 /**
@@ -2404,8 +2415,8 @@ static int wp_page_copy(struct vm_fault *vmf)
 int finish_mkwrite_fault(struct vm_fault *vmf)
 {
 	WARN_ON_ONCE(!(vmf->vma->vm_flags & VM_SHARED));
-	vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address,
-				       &vmf->ptl);
+	if (!pte_map_lock(vmf))
+		return VM_FAULT_RETRY;
 	/*
 	 * We might have raced with another page fault while we released the
 	 * pte_offset_map_lock.
@@ -2523,8 +2534,11 @@ static int do_wp_page(struct vm_fault *vmf)
 			get_page(vmf->page);
 			pte_unmap_unlock(vmf->pte, vmf->ptl);
 			lock_page(vmf->page);
-			vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
-					vmf->address, &vmf->ptl);
+			if (!pte_map_lock(vmf)) {
+				unlock_page(vmf->page);
+				put_page(vmf->page);
+				return VM_FAULT_RETRY;
+			}
 			if (!pte_same(*vmf->pte, vmf->orig_pte)) {
 				unlock_page(vmf->page);
 				pte_unmap_unlock(vmf->pte, vmf->ptl);
@@ -2682,8 +2696,10 @@ int do_swap_page(struct vm_fault *vmf)
 			 * Back out if somebody else faulted in this pte
 			 * while we released the pte lock.
 			 */
-			vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
-					vmf->address, &vmf->ptl);
+			if (!pte_map_lock(vmf)) {
+				delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
+				return VM_FAULT_RETRY;
+			}
 			if (likely(pte_same(*vmf->pte, vmf->orig_pte)))
 				ret = VM_FAULT_OOM;
 			delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
@@ -2739,8 +2755,11 @@ int do_swap_page(struct vm_fault *vmf)
 	/*
 	 * Back out if somebody else already faulted in this pte.
 	 */
-	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
-			&vmf->ptl);
+	if (!pte_map_lock(vmf)) {
+		ret = VM_FAULT_RETRY;
+		mem_cgroup_cancel_charge(page, memcg, false);
+		goto out_page;
+	}
 	if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte)))
 		goto out_nomap;
 
@@ -2866,8 +2885,8 @@ static int do_anonymous_page(struct vm_fault *vmf)
 			!mm_forbids_zeropage(vma->vm_mm)) {
 		entry = pte_mkspecial(pfn_pte(my_zero_pfn(vmf->address),
 						vma->vm_page_prot));
-		vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
-				vmf->address, &vmf->ptl);
+		if (!pte_map_lock(vmf))
+			return VM_FAULT_RETRY;
 		if (!pte_none(*vmf->pte))
 			goto unlock;
 		/* Deliver the page fault to userland, check inside PT lock */
@@ -2899,8 +2918,11 @@ static int do_anonymous_page(struct vm_fault *vmf)
 	if (vma->vm_flags & VM_WRITE)
 		entry = pte_mkwrite(pte_mkdirty(entry));
 
-	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
-			&vmf->ptl);
+	if (!pte_map_lock(vmf)) {
+		mem_cgroup_cancel_charge(page, memcg, false);
+		put_page(page);
+		return VM_FAULT_RETRY;
+	}
 	if (!pte_none(*vmf->pte))
 		goto release;
 
@@ -3020,8 +3042,9 @@ static int pte_alloc_one_map(struct vm_fault *vmf)
 	 * pte_none() under vmf->ptl protection when we return to
 	 * alloc_set_pte().
 	 */
-	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
-			&vmf->ptl);
+	if (!pte_map_lock(vmf))
+		return VM_FAULT_RETRY;
+
 	return 0;
 }
 
-- 
2.7.4


* [PATCH 03/16] mm: Introduce pte_spinlock for FAULT_FLAG_SPECULATIVE
  2017-08-08 14:35 [PATCH 00/16] Speculative page faults Laurent Dufour
  2017-08-08 14:35 ` [PATCH 01/16] mm: Dont assume page-table invariance during faults Laurent Dufour
  2017-08-08 14:35 ` [PATCH 02/16] mm: Prepare for FAULT_FLAG_SPECULATIVE Laurent Dufour
@ 2017-08-08 14:35 ` Laurent Dufour
  2017-08-08 14:35 ` [PATCH 04/16] mm: VMA sequence count Laurent Dufour
                   ` (12 subsequent siblings)
  15 siblings, 0 replies; 28+ messages in thread
From: Laurent Dufour @ 2017-08-08 14:35 UTC (permalink / raw)
  To: paulmck, peterz, akpm, kirill, ak, mhocko, dave, jack,
	Matthew Wilcox, benh, mpe, paulus, Thomas Gleixner, Ingo Molnar,
	hpa, Will Deacon
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora,
	Tim Chen, linuxppc-dev, x86

When handling a page fault without holding the mmap_sem, fetching the pte
lock pointer and taking the lock must be done while ensuring that the VMA
is not modified behind our back.

So move the fetch and locking operations into a dedicated function.

Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
---
 mm/memory.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index b93916c0b086..11c5fe5f62bb 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2241,6 +2241,13 @@ static inline void wp_page_reuse(struct vm_fault *vmf)
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
 }
 
+static bool pte_spinlock(struct vm_fault *vmf)
+{
+	vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
+	spin_lock(vmf->ptl);
+	return true;
+}
+
 static bool pte_map_lock(struct vm_fault *vmf)
 {
 	vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl);
@@ -3515,8 +3522,8 @@ static int do_numa_page(struct vm_fault *vmf)
 	 * validation through pte_unmap_same(). It's of NUMA type but
 	 * the pfn may be screwed if the read is non atomic.
 	 */
-	vmf->ptl = pte_lockptr(vma->vm_mm, vmf->pmd);
-	spin_lock(vmf->ptl);
+	if (!pte_spinlock(vmf))
+		return VM_FAULT_RETRY;
 	if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
 		goto out;
@@ -3708,8 +3715,8 @@ static int handle_pte_fault(struct vm_fault *vmf)
 	if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
 		return do_numa_page(vmf);
 
-	vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
-	spin_lock(vmf->ptl);
+	if (!pte_spinlock(vmf))
+		return VM_FAULT_RETRY;
 	entry = vmf->orig_pte;
 	if (unlikely(!pte_same(*vmf->pte, entry)))
 		goto unlock;
-- 
2.7.4


* [PATCH 04/16] mm: VMA sequence count
  2017-08-08 14:35 [PATCH 00/16] Speculative page faults Laurent Dufour
                   ` (2 preceding siblings ...)
  2017-08-08 14:35 ` [PATCH 03/16] mm: Introduce pte_spinlock " Laurent Dufour
@ 2017-08-08 14:35 ` Laurent Dufour
  2017-08-08 14:35 ` [PATCH 05/16] mm: Protect VMA modifications using " Laurent Dufour
                   ` (11 subsequent siblings)
  15 siblings, 0 replies; 28+ messages in thread
From: Laurent Dufour @ 2017-08-08 14:35 UTC (permalink / raw)
  To: paulmck, peterz, akpm, kirill, ak, mhocko, dave, jack,
	Matthew Wilcox, benh, mpe, paulus, Thomas Gleixner, Ingo Molnar,
	hpa, Will Deacon
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora,
	Tim Chen, linuxppc-dev, x86

From: Peter Zijlstra <peterz@infradead.org>

Wrap the VMA modifications (vma_adjust/unmap_page_range) with sequence
counts such that we can easily test if a VMA is changed.

The unmap_page_range() one allows us to make assumptions about
page-tables; when we find the seqcount hasn't changed we can assume
page-tables are still valid.

The flip side is that we cannot distinguish between a vma_adjust() and
the unmap_page_range() -- where with the former we could have
re-checked the vma bounds against the address.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

[Port to 4.12 kernel]
[Fix lock dependency between mapping->i_mmap_rwsem and vma->vm_sequence]
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
---
 include/linux/mm_types.h |  1 +
 mm/memory.c              |  2 ++
 mm/mmap.c                | 21 ++++++++++++++++++---
 3 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 7f384bb62d8e..d7d6dae4c009 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -342,6 +342,7 @@ struct vm_area_struct {
 	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
 #endif
 	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
+	seqcount_t vm_sequence;
 } __randomize_layout;
 
 struct core_thread {
diff --git a/mm/memory.c b/mm/memory.c
index 11c5fe5f62bb..7d61f64916a2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1380,6 +1380,7 @@ void unmap_page_range(struct mmu_gather *tlb,
 	unsigned long next;
 
 	BUG_ON(addr >= end);
+	write_seqcount_begin(&vma->vm_sequence);
 	tlb_start_vma(tlb, vma);
 	pgd = pgd_offset(vma->vm_mm, addr);
 	do {
@@ -1389,6 +1390,7 @@ void unmap_page_range(struct mmu_gather *tlb,
 		next = zap_p4d_range(tlb, vma, pgd, addr, next, details);
 	} while (pgd++, addr = next, addr != end);
 	tlb_end_vma(tlb, vma);
+	write_seqcount_end(&vma->vm_sequence);
 }
 
 
diff --git a/mm/mmap.c b/mm/mmap.c
index f19efcf75418..140b22136cb7 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -557,6 +557,8 @@ void __vma_link_rb(struct mm_struct *mm, struct vm_area_struct *vma,
 	else
 		mm->highest_vm_end = vm_end_gap(vma);
 
+	seqcount_init(&vma->vm_sequence);
+
 	/*
 	 * vma->vm_prev wasn't known when we followed the rbtree to find the
 	 * correct insertion point for that vma. As a result, we could not
@@ -798,6 +800,11 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 		}
 	}
 
+	write_seqcount_begin(&vma->vm_sequence);
+	if (next && next != vma)
+		write_seqcount_begin_nested(&next->vm_sequence,
+					    SINGLE_DEPTH_NESTING);
+
 	anon_vma = vma->anon_vma;
 	if (!anon_vma && adjust_next)
 		anon_vma = next->anon_vma;
@@ -902,6 +909,7 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 		mm->map_count--;
 		mpol_put(vma_policy(next));
 		kmem_cache_free(vm_area_cachep, next);
+		write_seqcount_end(&next->vm_sequence);
 		/*
 		 * In mprotect's case 6 (see comments on vma_merge),
 		 * we must remove another next too. It would clutter
@@ -931,11 +939,14 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 		if (remove_next == 2) {
 			remove_next = 1;
 			end = next->vm_end;
+			write_seqcount_end(&vma->vm_sequence);
 			goto again;
-		}
-		else if (next)
+		} else if (next) {
+			if (next != vma)
+				write_seqcount_begin_nested(&next->vm_sequence,
+							    SINGLE_DEPTH_NESTING);
 			vma_gap_update(next);
-		else {
+		} else {
 			/*
 			 * If remove_next == 2 we obviously can't
 			 * reach this path.
@@ -961,6 +972,10 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 	if (insert && file)
 		uprobe_mmap(insert);
 
+	if (next && next != vma)
+		write_seqcount_end(&next->vm_sequence);
+	write_seqcount_end(&vma->vm_sequence);
+
 	validate_mm(mm);
 
 	return 0;
-- 
2.7.4


* [PATCH 05/16] mm: Protect VMA modifications using VMA sequence count
  2017-08-08 14:35 [PATCH 00/16] Speculative page faults Laurent Dufour
                   ` (3 preceding siblings ...)
  2017-08-08 14:35 ` [PATCH 04/16] mm: VMA sequence count Laurent Dufour
@ 2017-08-08 14:35 ` Laurent Dufour
  2017-08-09 10:12   ` Kirill A. Shutemov
  2017-08-08 14:35 ` [PATCH 06/16] mm: RCU free VMAs Laurent Dufour
                   ` (10 subsequent siblings)
  15 siblings, 1 reply; 28+ messages in thread
From: Laurent Dufour @ 2017-08-08 14:35 UTC (permalink / raw)
  To: paulmck, peterz, akpm, kirill, ak, mhocko, dave, jack,
	Matthew Wilcox, benh, mpe, paulus, Thomas Gleixner, Ingo Molnar,
	hpa, Will Deacon
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora,
	Tim Chen, linuxppc-dev, x86

The VMA sequence count has been introduced to allow fast detection of
VMA modification when running a page fault handler without holding
the mmap_sem.

This patch provides protection against the VMA modifications done in:
	- madvise()
	- mremap()
	- mpol_rebind_policy()
	- vma_replace_policy()
	- change_prot_numa()
	- mlock(), munlock()
	- mprotect()
	- mmap_region()
	- collapse_huge_page()

Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
---
 fs/proc/task_mmu.c |  2 ++
 mm/khugepaged.c    |  3 +++
 mm/madvise.c       |  4 ++++
 mm/mempolicy.c     | 10 +++++++++-
 mm/mlock.c         |  9 ++++++---
 mm/mmap.c          |  2 ++
 mm/mprotect.c      |  2 ++
 mm/mremap.c        |  7 +++++++
 8 files changed, 35 insertions(+), 4 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index b836fd61ed87..5c0c3ab10f3c 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1064,8 +1064,10 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 					goto out_mm;
 				}
 				for (vma = mm->mmap; vma; vma = vma->vm_next) {
+					write_seqcount_begin(&vma->vm_sequence);
 					vma->vm_flags &= ~VM_SOFTDIRTY;
 					vma_set_page_prot(vma);
+					write_seqcount_end(&vma->vm_sequence);
 				}
 				downgrade_write(&mm->mmap_sem);
 				break;
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index c01f177a1120..56dd994c05d0 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1005,6 +1005,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 	if (mm_find_pmd(mm, address) != pmd)
 		goto out;
 
+	write_seqcount_begin(&vma->vm_sequence);
 	anon_vma_lock_write(vma->anon_vma);
 
 	pte = pte_offset_map(pmd, address);
@@ -1040,6 +1041,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 		pmd_populate(mm, pmd, pmd_pgtable(_pmd));
 		spin_unlock(pmd_ptl);
 		anon_vma_unlock_write(vma->anon_vma);
+		write_seqcount_end(&vma->vm_sequence);
 		result = SCAN_FAIL;
 		goto out;
 	}
@@ -1074,6 +1076,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 	set_pmd_at(mm, address, pmd, _pmd);
 	update_mmu_cache_pmd(vma, address, pmd);
 	spin_unlock(pmd_ptl);
+	write_seqcount_end(&vma->vm_sequence);
 
 	*hpage = NULL;
 
diff --git a/mm/madvise.c b/mm/madvise.c
index 47d8d8a25eae..4f73ecaa0961 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -172,7 +172,9 @@ static long madvise_behavior(struct vm_area_struct *vma,
 	/*
 	 * vm_flags is protected by the mmap_sem held in write mode.
 	 */
+	write_seqcount_begin(&vma->vm_sequence);
 	vma->vm_flags = new_flags;
+	write_seqcount_end(&vma->vm_sequence);
 out:
 	return error;
 }
@@ -440,9 +442,11 @@ static void madvise_free_page_range(struct mmu_gather *tlb,
 		.private = tlb,
 	};
 
+	write_seqcount_begin(&vma->vm_sequence);
 	tlb_start_vma(tlb, vma);
 	walk_page_range(addr, end, &free_walk);
 	tlb_end_vma(tlb, vma);
+	write_seqcount_end(&vma->vm_sequence);
 }
 
 static int madvise_free_single_vma(struct vm_area_struct *vma,
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d911fa5cb2a7..32ed50c0d4b2 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -378,8 +378,11 @@ void mpol_rebind_mm(struct mm_struct *mm, nodemask_t *new)
 	struct vm_area_struct *vma;
 
 	down_write(&mm->mmap_sem);
-	for (vma = mm->mmap; vma; vma = vma->vm_next)
+	for (vma = mm->mmap; vma; vma = vma->vm_next) {
+		write_seqcount_begin(&vma->vm_sequence);
 		mpol_rebind_policy(vma->vm_policy, new);
+		write_seqcount_end(&vma->vm_sequence);
+	}
 	up_write(&mm->mmap_sem);
 }
 
@@ -537,9 +540,11 @@ unsigned long change_prot_numa(struct vm_area_struct *vma,
 {
 	int nr_updated;
 
+	write_seqcount_begin(&vma->vm_sequence);
 	nr_updated = change_protection(vma, addr, end, PAGE_NONE, 0, 1);
 	if (nr_updated)
 		count_vm_numa_events(NUMA_PTE_UPDATES, nr_updated);
+	write_seqcount_end(&vma->vm_sequence);
 
 	return nr_updated;
 }
@@ -640,6 +645,7 @@ static int vma_replace_policy(struct vm_area_struct *vma,
 	if (IS_ERR(new))
 		return PTR_ERR(new);
 
+	write_seqcount_begin(&vma->vm_sequence);
 	if (vma->vm_ops && vma->vm_ops->set_policy) {
 		err = vma->vm_ops->set_policy(vma, new);
 		if (err)
@@ -648,10 +654,12 @@ static int vma_replace_policy(struct vm_area_struct *vma,
 
 	old = vma->vm_policy;
 	vma->vm_policy = new; /* protected by mmap_sem */
+	write_seqcount_end(&vma->vm_sequence);
 	mpol_put(old);
 
 	return 0;
  err_out:
+	write_seqcount_end(&vma->vm_sequence);
 	mpol_put(new);
 	return err;
 }
diff --git a/mm/mlock.c b/mm/mlock.c
index b562b5523a65..30d9bfc61929 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -438,7 +438,9 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
 void munlock_vma_pages_range(struct vm_area_struct *vma,
 			     unsigned long start, unsigned long end)
 {
+	write_seqcount_begin(&vma->vm_sequence);
 	vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
+	write_seqcount_end(&vma->vm_sequence);
 
 	while (start < end) {
 		struct page *page;
@@ -563,10 +565,11 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
 	 * It's okay if try_to_unmap_one unmaps a page just after we
 	 * set VM_LOCKED, populate_vma_page_range will bring it back.
 	 */
-
-	if (lock)
+	if (lock) {
+		write_seqcount_begin(&vma->vm_sequence);
 		vma->vm_flags = newflags;
-	else
+		write_seqcount_end(&vma->vm_sequence);
+	} else
 		munlock_vma_pages_range(vma, start, end);
 
 out:
diff --git a/mm/mmap.c b/mm/mmap.c
index 140b22136cb7..221b1f3e966a 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1734,6 +1734,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 out:
 	perf_event_mmap(vma);
 
+	write_seqcount_begin(&vma->vm_sequence);
 	vm_stat_account(mm, vm_flags, len >> PAGE_SHIFT);
 	if (vm_flags & VM_LOCKED) {
 		if (!((vm_flags & VM_SPECIAL) || is_vm_hugetlb_page(vma) ||
@@ -1756,6 +1757,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	vma->vm_flags |= VM_SOFTDIRTY;
 
 	vma_set_page_prot(vma);
+	write_seqcount_end(&vma->vm_sequence);
 
 	return addr;
 
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 4180ad8cc9c5..297f0f1e7560 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -344,6 +344,7 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
 	 * vm_flags and vm_page_prot are protected by the mmap_sem
 	 * held in write mode.
 	 */
+	write_seqcount_begin(&vma->vm_sequence);
 	vma->vm_flags = newflags;
 	dirty_accountable = vma_wants_writenotify(vma, vma->vm_page_prot);
 	vma_set_page_prot(vma);
@@ -359,6 +360,7 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
 			(newflags & VM_WRITE)) {
 		populate_vma_page_range(vma, start, end, NULL);
 	}
+	write_seqcount_end(&vma->vm_sequence);
 
 	vm_stat_account(mm, oldflags, -nrpages);
 	vm_stat_account(mm, newflags, nrpages);
diff --git a/mm/mremap.c b/mm/mremap.c
index 3f23715d3c69..1abadea8ab84 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -301,6 +301,10 @@ static unsigned long move_vma(struct vm_area_struct *vma,
 	if (!new_vma)
 		return -ENOMEM;
 
+	write_seqcount_begin(&vma->vm_sequence);
+	write_seqcount_begin_nested(&new_vma->vm_sequence,
+				    SINGLE_DEPTH_NESTING);
+
 	moved_len = move_page_tables(vma, old_addr, new_vma, new_addr, old_len,
 				     need_rmap_locks);
 	if (moved_len < old_len) {
@@ -317,6 +321,7 @@ static unsigned long move_vma(struct vm_area_struct *vma,
 		 */
 		move_page_tables(new_vma, new_addr, vma, old_addr, moved_len,
 				 true);
+		write_seqcount_end(&vma->vm_sequence);
 		vma = new_vma;
 		old_len = new_len;
 		old_addr = new_addr;
@@ -325,7 +330,9 @@ static unsigned long move_vma(struct vm_area_struct *vma,
 		mremap_userfaultfd_prep(new_vma, uf);
 		arch_remap(mm, old_addr, old_addr + old_len,
 			   new_addr, new_addr + new_len);
+		write_seqcount_end(&vma->vm_sequence);
 	}
+	write_seqcount_end(&new_vma->vm_sequence);
 
 	/* Conceal VM_ACCOUNT so old reservation is not undone */
 	if (vm_flags & VM_ACCOUNT) {
-- 
2.7.4


* [PATCH 06/16] mm: RCU free VMAs
  2017-08-08 14:35 [PATCH 00/16] Speculative page faults Laurent Dufour
                   ` (4 preceding siblings ...)
  2017-08-08 14:35 ` [PATCH 05/16] mm: Protect VMA modifications using " Laurent Dufour
@ 2017-08-08 14:35 ` Laurent Dufour
  2017-08-08 14:35 ` [PATCH 07/16] mm: Provide speculative fault infrastructure Laurent Dufour
                   ` (9 subsequent siblings)
  15 siblings, 0 replies; 28+ messages in thread
From: Laurent Dufour @ 2017-08-08 14:35 UTC (permalink / raw)
  To: paulmck, peterz, akpm, kirill, ak, mhocko, dave, jack,
	Matthew Wilcox, benh, mpe, paulus, Thomas Gleixner, Ingo Molnar,
	hpa, Will Deacon
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora,
	Tim Chen, linuxppc-dev, x86

From: Peter Zijlstra <peterz@infradead.org>

Manage the VMAs with SRCU such that we can do a lockless VMA lookup.

We put the fput(vma->vm_file) in the SRCU callback; this keeps files
valid during speculative faults and is possible thanks to the delayed
fput work by Al Viro -- do we need srcu_barrier() in unmount
someplace?

We guard the mm_rb tree with a seqlock (this could be a seqcount but
we'd have to disable preemption around the write side in order to make
the retry loop in __read_seqcount_begin() work) such that we can know
if the rb tree walk was correct. We cannot trust the result of a
lockless tree walk in the face of concurrent tree rotations; although
we can rely on the termination of such walks -- tree rotations
guarantee the end result is a tree again after all.

Furthermore, we rely on the WMB implied by the
write_seqlock/count_begin() to separate the VMA initialization and the
publishing stores, analogous to the RELEASE in rcu_assign_pointer().
We also rely on the RMB from read_seqretry() to separate the vma load
from further loads like the smp_read_barrier_depends() in regular
RCU.

We must not touch the vmacache while doing SRCU lookups as that is not
properly serialized against changes. We update gap information after
publishing the VMA, but A) we don't use that and B) the seqlock
read side would fix that anyhow.

We clear vma->vm_rb for nodes removed from the vma tree such that we
can easily detect such 'dead' nodes, we rely on the WMB from
write_sequnlock() to separate the tree removal and clearing the node.

Provide find_vma_srcu() which wraps the required magic.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

[Remove the warnings in description about the SRCU global lock which
 has been removed now]
[Rename vma_is_dead() to vma_has_changed()]
[Pass vm_fault structure pointer instead of 2 arguments to
 vma_has_changed()]
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
---
 include/linux/mm_types.h |   2 +
 kernel/fork.c            |   1 +
 mm/init-mm.c             |   1 +
 mm/internal.h            |  19 +++++++++
 mm/mmap.c                | 100 +++++++++++++++++++++++++++++++++++------------
 5 files changed, 97 insertions(+), 26 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index d7d6dae4c009..30c127b8a4d8 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -343,6 +343,7 @@ struct vm_area_struct {
 #endif
 	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
 	seqcount_t vm_sequence;
+	struct rcu_head vm_rcu_head;
 } __randomize_layout;
 
 struct core_thread {
@@ -360,6 +361,7 @@ struct kioctx_table;
 struct mm_struct {
 	struct vm_area_struct *mmap;		/* list of VMAs */
 	struct rb_root mm_rb;
+	seqlock_t mm_seq;
 	u32 vmacache_seqnum;                   /* per-thread vmacache */
 #ifdef CONFIG_MMU
 	unsigned long (*get_unmapped_area) (struct file *filp,
diff --git a/kernel/fork.c b/kernel/fork.c
index 17921b0390b4..8d1223270fea 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -791,6 +791,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	mm->mmap = NULL;
 	mm->mm_rb = RB_ROOT;
 	mm->vmacache_seqnum = 0;
+	seqlock_init(&mm->mm_seq);
 	atomic_set(&mm->mm_users, 1);
 	atomic_set(&mm->mm_count, 1);
 	init_rwsem(&mm->mmap_sem);
diff --git a/mm/init-mm.c b/mm/init-mm.c
index 975e49f00f34..2b1fa061684f 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -16,6 +16,7 @@
 
 struct mm_struct init_mm = {
 	.mm_rb		= RB_ROOT,
+	.mm_seq		= __SEQLOCK_UNLOCKED(init_mm.mm_seq),
 	.pgd		= swapper_pg_dir,
 	.mm_users	= ATOMIC_INIT(2),
 	.mm_count	= ATOMIC_INIT(1),
diff --git a/mm/internal.h b/mm/internal.h
index 4ef49fc55e58..9d6347e35747 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -40,6 +40,25 @@ void page_writeback_init(void);
 
 int do_swap_page(struct vm_fault *vmf);
 
+extern struct srcu_struct vma_srcu;
+
+extern struct vm_area_struct *find_vma_srcu(struct mm_struct *mm,
+					    unsigned long addr);
+
+static inline bool vma_has_changed(struct vm_fault *vmf)
+{
+	int ret = RB_EMPTY_NODE(&vmf->vma->vm_rb);
+	unsigned seq = ACCESS_ONCE(vmf->vma->vm_sequence.sequence);
+
+	/*
+	 * Matches both the wmb in write_seqlock_{begin,end}() and
+	 * the wmb in vma_rb_erase().
+	 */
+	smp_rmb();
+
+	return ret || seq != vmf->sequence;
+}
+
 void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
 		unsigned long floor, unsigned long ceiling);
 
diff --git a/mm/mmap.c b/mm/mmap.c
index 221b1f3e966a..73f5ffc03155 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -159,6 +159,23 @@ void unlink_file_vma(struct vm_area_struct *vma)
 	}
 }
 
+DEFINE_SRCU(vma_srcu);
+
+static void __free_vma(struct rcu_head *head)
+{
+	struct vm_area_struct *vma =
+		container_of(head, struct vm_area_struct, vm_rcu_head);
+
+	if (vma->vm_file)
+		fput(vma->vm_file);
+	kmem_cache_free(vm_area_cachep, vma);
+}
+
+static void free_vma(struct vm_area_struct *vma)
+{
+	call_srcu(&vma_srcu, &vma->vm_rcu_head, __free_vma);
+}
+
 /*
  * Close a vm structure and free it, returning the next.
  */
@@ -169,10 +186,8 @@ static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
 	might_sleep();
 	if (vma->vm_ops && vma->vm_ops->close)
 		vma->vm_ops->close(vma);
-	if (vma->vm_file)
-		fput(vma->vm_file);
 	mpol_put(vma_policy(vma));
-	kmem_cache_free(vm_area_cachep, vma);
+	free_vma(vma);
 	return next;
 }
 
@@ -410,26 +425,37 @@ static void vma_gap_update(struct vm_area_struct *vma)
 }
 
 static inline void vma_rb_insert(struct vm_area_struct *vma,
-				 struct rb_root *root)
+				 struct mm_struct *mm)
 {
+	struct rb_root *root = &mm->mm_rb;
+
 	/* All rb_subtree_gap values must be consistent prior to insertion */
 	validate_mm_rb(root, NULL);
 
 	rb_insert_augmented(&vma->vm_rb, root, &vma_gap_callbacks);
 }
 
-static void __vma_rb_erase(struct vm_area_struct *vma, struct rb_root *root)
+static void __vma_rb_erase(struct vm_area_struct *vma, struct mm_struct *mm)
 {
+	struct rb_root *root = &mm->mm_rb;
 	/*
 	 * Note rb_erase_augmented is a fairly large inline function,
 	 * so make sure we instantiate it only once with our desired
 	 * augmented rbtree callbacks.
 	 */
+	write_seqlock(&mm->mm_seq);
 	rb_erase_augmented(&vma->vm_rb, root, &vma_gap_callbacks);
+	write_sequnlock(&mm->mm_seq); /* wmb */
+
+	/*
+	 * Ensure the removal is complete before clearing the node.
+	 * Matched by vma_has_changed()/handle_speculative_fault().
+	 */
+	RB_CLEAR_NODE(&vma->vm_rb);
 }
 
 static __always_inline void vma_rb_erase_ignore(struct vm_area_struct *vma,
-						struct rb_root *root,
+						struct mm_struct *mm,
 						struct vm_area_struct *ignore)
 {
 	/*
@@ -437,21 +463,21 @@ static __always_inline void vma_rb_erase_ignore(struct vm_area_struct *vma,
 	 * with the possible exception of the "next" vma being erased if
 	 * next->vm_start was reduced.
 	 */
-	validate_mm_rb(root, ignore);
+	validate_mm_rb(&mm->mm_rb, ignore);
 
-	__vma_rb_erase(vma, root);
+	__vma_rb_erase(vma, mm);
 }
 
 static __always_inline void vma_rb_erase(struct vm_area_struct *vma,
-					 struct rb_root *root)
+					 struct mm_struct *mm)
 {
 	/*
 	 * All rb_subtree_gap values must be consistent prior to erase,
 	 * with the possible exception of the vma being erased.
 	 */
-	validate_mm_rb(root, vma);
+	validate_mm_rb(&mm->mm_rb, vma);
 
-	__vma_rb_erase(vma, root);
+	__vma_rb_erase(vma, mm);
 }
 
 /*
@@ -568,10 +594,12 @@ void __vma_link_rb(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * immediately update the gap to the correct value. Finally we
 	 * rebalance the rbtree after all augmented values have been set.
 	 */
+	write_seqlock(&mm->mm_seq);
 	rb_link_node(&vma->vm_rb, rb_parent, rb_link);
 	vma->rb_subtree_gap = 0;
 	vma_gap_update(vma);
-	vma_rb_insert(vma, &mm->mm_rb);
+	vma_rb_insert(vma, mm);
+	write_sequnlock(&mm->mm_seq);
 }
 
 static void __vma_link_file(struct vm_area_struct *vma)
@@ -647,7 +675,7 @@ static __always_inline void __vma_unlink_common(struct mm_struct *mm,
 {
 	struct vm_area_struct *next;
 
-	vma_rb_erase_ignore(vma, &mm->mm_rb, ignore);
+	vma_rb_erase_ignore(vma, mm, ignore);
 	next = vma->vm_next;
 	if (has_prev)
 		prev->vm_next = next;
@@ -900,15 +928,13 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 	}
 
 	if (remove_next) {
-		if (file) {
+		if (file)
 			uprobe_munmap(next, next->vm_start, next->vm_end);
-			fput(file);
-		}
 		if (next->anon_vma)
 			anon_vma_merge(vma, next);
 		mm->map_count--;
 		mpol_put(vma_policy(next));
-		kmem_cache_free(vm_area_cachep, next);
+		free_vma(next);
 		write_seqcount_end(&next->vm_sequence);
 		/*
 		 * In mprotect's case 6 (see comments on vma_merge),
@@ -2129,15 +2155,10 @@ get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,
 EXPORT_SYMBOL(get_unmapped_area);
 
 /* Look up the first VMA which satisfies  addr < vm_end,  NULL if none. */
-struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
+static struct vm_area_struct *__find_vma(struct mm_struct *mm, unsigned long addr)
 {
 	struct rb_node *rb_node;
-	struct vm_area_struct *vma;
-
-	/* Check the cache first. */
-	vma = vmacache_find(mm, addr);
-	if (likely(vma))
-		return vma;
+	struct vm_area_struct *vma = NULL;
 
 	rb_node = mm->mm_rb.rb_node;
 
@@ -2155,13 +2176,40 @@ struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
 			rb_node = rb_node->rb_right;
 	}
 
+	return vma;
+}
+
+struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
+{
+	struct vm_area_struct *vma;
+
+	/* Check the cache first. */
+	vma = vmacache_find(mm, addr);
+	if (likely(vma))
+		return vma;
+
+	vma = __find_vma(mm, addr);
 	if (vma)
 		vmacache_update(addr, vma);
 	return vma;
 }
-
 EXPORT_SYMBOL(find_vma);
 
+struct vm_area_struct *find_vma_srcu(struct mm_struct *mm, unsigned long addr)
+{
+	struct vm_area_struct *vma;
+	unsigned int seq;
+
+	WARN_ON_ONCE(!srcu_read_lock_held(&vma_srcu));
+
+	do {
+		seq = read_seqbegin(&mm->mm_seq);
+		vma = __find_vma(mm, addr);
+	} while (read_seqretry(&mm->mm_seq, seq));
+
+	return vma;
+}
+
 /*
  * Same as find_vma, but also return a pointer to the previous VMA in *pprev.
  */
@@ -2529,7 +2577,7 @@ detach_vmas_to_be_unmapped(struct mm_struct *mm, struct vm_area_struct *vma,
 	insertion_point = (prev ? &prev->vm_next : &mm->mmap);
 	vma->vm_prev = NULL;
 	do {
-		vma_rb_erase(vma, &mm->mm_rb);
+		vma_rb_erase(vma, mm);
 		mm->map_count--;
 		tail_vma = vma;
 		vma = vma->vm_next;
-- 
2.7.4


* [PATCH 07/16] mm: Provide speculative fault infrastructure
  2017-08-08 14:35 [PATCH 00/16] Speculative page faults Laurent Dufour
                   ` (5 preceding siblings ...)
  2017-08-08 14:35 ` [PATCH 06/16] mm: RCU free VMAs Laurent Dufour
@ 2017-08-08 14:35 ` Laurent Dufour
  2017-08-08 14:35 ` [PATCH 08/16] mm: Try spin lock in speculative path Laurent Dufour
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 28+ messages in thread
From: Laurent Dufour @ 2017-08-08 14:35 UTC (permalink / raw)
  To: paulmck, peterz, akpm, kirill, ak, mhocko, dave, jack,
	Matthew Wilcox, benh, mpe, paulus, Thomas Gleixner, Ingo Molnar,
	hpa, Will Deacon
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora,
	Tim Chen, linuxppc-dev, x86

From: Peter Zijlstra <peterz@infradead.org>

Provide infrastructure to do a speculative fault (not holding
mmap_sem).

The not holding of mmap_sem means we can race against VMA
change/removal and page-table destruction. We use the SRCU VMA freeing
to keep the VMA around. We use the VMA seqcount to detect change
(including unmapping / page-table deletion) and we use gup_fast() style
page-table walking to deal with page-table races.

Once we've obtained the page and are ready to update the PTE, we
validate that the state we started the fault with is still valid; if
not, we fail the fault with VM_FAULT_RETRY, otherwise we update the
PTE and we're done.
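
A hypothetical caller-side sketch (the actual architecture wiring is done
by the x86 patch later in this series; names and flow here are only
assumptions based on this patch's interface):

	/*
	 * Sketch only: VM_FAULT_RETRY from handle_speculative_fault()
	 * means "fall back to the classic, mmap_sem-protected path".
	 */
	fault = handle_speculative_fault(mm, address, flags);
	if (!(fault & VM_FAULT_RETRY))
		return fault;		/* the speculative path handled it */

	down_read(&mm->mmap_sem);	/* classic slow path */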

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

[Manage the newly introduced pte_spinlock() for speculative page
 fault to fail if the VMA is touched in our back]
[Rename vma_is_dead() to vma_has_changed()]
[Call p4d_alloc() as it is safe since pgd is valid]
[Call pud_alloc() as it is safe since p4d is valid]
[Set fe.sequence in __handle_mm_fault()]
[Abort speculative path when handle_userfault() has to be called]
[Add additional VMA's flags checks in handle_speculative_fault()]
[Clear FAULT_FLAG_ALLOW_RETRY in handle_speculative_fault()]
[Don't set vmf->pte and vmf->ptl if pte_map_lock() failed]
[Remove warning comment about waiting for !seq&1 since we don't want
 to wait]
[Remove warning about no huge page support, mention it explicitly]
[Don't call do_fault() in the speculative path as __do_fault() calls
 vma->vm_ops->fault() which may want to release mmap_sem]
[Only vm_fault pointer argument for vma_has_changed()]
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
---
 include/linux/mm.h |   3 +
 mm/memory.c        | 183 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 183 insertions(+), 3 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8763ec96dc78..863a13af680a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -315,6 +315,7 @@ struct vm_fault {
 	gfp_t gfp_mask;			/* gfp mask to be used for allocations */
 	pgoff_t pgoff;			/* Logical page offset based on vma */
 	unsigned long address;		/* Faulting virtual address */
+	unsigned int sequence;
 	pmd_t *pmd;			/* Pointer to pmd entry matching
 					 * the 'address' */
 	pud_t *pud;			/* Pointer to pud entry matching
@@ -1286,6 +1287,8 @@ int invalidate_inode_page(struct page *page);
 #ifdef CONFIG_MMU
 extern int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 		unsigned int flags);
+extern int handle_speculative_fault(struct mm_struct *mm,
+				    unsigned long address, unsigned int flags);
 extern int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
 			    unsigned long address, unsigned int fault_flags,
 			    bool *unlocked);
diff --git a/mm/memory.c b/mm/memory.c
index 7d61f64916a2..14236d98a5c5 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2245,15 +2245,69 @@ static inline void wp_page_reuse(struct vm_fault *vmf)
 
 static bool pte_spinlock(struct vm_fault *vmf)
 {
+	bool ret = false;
+
+	/* Check if vma is still valid */
+	if (!(vmf->flags & FAULT_FLAG_SPECULATIVE)) {
+		vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
+		spin_lock(vmf->ptl);
+		return true;
+	}
+
+	local_irq_disable();
+	if (vma_has_changed(vmf))
+		goto out;
+
 	vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
 	spin_lock(vmf->ptl);
-	return true;
+
+	if (vma_has_changed(vmf)) {
+		spin_unlock(vmf->ptl);
+		goto out;
+	}
+
+	ret = true;
+out:
+	local_irq_enable();
+	return ret;
 }
 
 static bool pte_map_lock(struct vm_fault *vmf)
 {
-	vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl);
-	return true;
+	bool ret = false;
+	pte_t *pte;
+	spinlock_t *ptl;
+
+	if (!(vmf->flags & FAULT_FLAG_SPECULATIVE)) {
+		vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
+					       vmf->address, &vmf->ptl);
+		return true;
+	}
+
+	/*
+	 * The first vma_has_changed() guarantees the page-tables are still
+	 * valid, having IRQs disabled ensures they stay around, hence the
+	 * second vma_has_changed() to make sure they are still valid once
+	 * we've got the lock. After that a concurrent zap_pte_range() will
+	 * block on the PTL and thus we're safe.
+	 */
+	local_irq_disable();
+	if (vma_has_changed(vmf))
+		goto out;
+
+	pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
+				  vmf->address, &ptl);
+	if (vma_has_changed(vmf)) {
+		pte_unmap_unlock(pte, ptl);
+		goto out;
+	}
+
+	vmf->pte = pte;
+	vmf->ptl = ptl;
+	ret = true;
+out:
+	local_irq_enable();
+	return ret;
 }
 
 /*
@@ -2872,6 +2926,10 @@ static int do_anonymous_page(struct vm_fault *vmf)
 	if (vma->vm_flags & VM_SHARED)
 		return VM_FAULT_SIGBUS;
 
+	/* Can't call userland page fault handler in the speculative path */
+	if (vmf->flags & FAULT_FLAG_SPECULATIVE && userfaultfd_missing(vma))
+		return VM_FAULT_RETRY;
+
 	/*
 	 * Use pte_alloc() instead of pte_alloc_map().  We can't run
 	 * pte_offset_map() on pmds where a huge pmd might be created
@@ -3707,6 +3765,8 @@ static int handle_pte_fault(struct vm_fault *vmf)
 	if (!vmf->pte) {
 		if (vma_is_anonymous(vmf->vma))
 			return do_anonymous_page(vmf);
+		else if (vmf->flags & FAULT_FLAG_SPECULATIVE)
+			return VM_FAULT_RETRY;
 		else
 			return do_fault(vmf);
 	}
@@ -3802,6 +3862,7 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 	vmf.pmd = pmd_alloc(mm, vmf.pud, address);
 	if (!vmf.pmd)
 		return VM_FAULT_OOM;
+	vmf.sequence = raw_read_seqcount(&vma->vm_sequence);
 	if (pmd_none(*vmf.pmd) && transparent_hugepage_enabled(vma)) {
 		ret = create_huge_pmd(&vmf);
 		if (!(ret & VM_FAULT_FALLBACK))
@@ -3829,6 +3890,122 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 	return handle_pte_fault(&vmf);
 }
 
+int handle_speculative_fault(struct mm_struct *mm, unsigned long address,
+			     unsigned int flags)
+{
+	struct vm_fault vmf = {
+		.address = address,
+	};
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
+	int dead, seq, idx, ret = VM_FAULT_RETRY;
+	struct vm_area_struct *vma;
+
+	/* Clear flags that may lead to release the mmap_sem to retry */
+	flags &= ~(FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_KILLABLE);
+	flags |= FAULT_FLAG_SPECULATIVE;
+
+	idx = srcu_read_lock(&vma_srcu);
+	vma = find_vma_srcu(mm, address);
+	if (!vma)
+		goto unlock;
+
+	/*
+	 * Validate the VMA found by the lockless lookup.
+	 */
+	dead = RB_EMPTY_NODE(&vma->vm_rb);
+	seq = raw_read_seqcount(&vma->vm_sequence); /* rmb <-> seqlock,vma_rb_erase() */
+	if ((seq & 1) || dead)
+		goto unlock;
+
+	/*
+	 * We need to re-validate the VMA after checking the bounds, otherwise
+	 * we might have a false positive on the bounds.
+	 */
+	if (address < vma->vm_start || vma->vm_end <= address)
+		goto unlock;
+
+	/*
+	 * Huge pages are not yet supported.
+	 */
+	if (unlikely(is_vm_hugetlb_page(vma)))
+		goto unlock;
+
+	/*
+	 * The three following checks are copied from access_error from
+	 * arch/x86/mm/fault.c
+	 */
+	if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
+				       flags & FAULT_FLAG_INSTRUCTION,
+				       flags & FAULT_FLAG_REMOTE))
+		goto unlock;
+
+	/* This one is required to check that the VMA has write access set */
+	if (flags & FAULT_FLAG_WRITE) {
+		if (unlikely(!(vma->vm_flags & VM_WRITE)))
+			goto unlock;
+	} else {
+		if (unlikely(!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE))))
+			goto unlock;
+	}
+
+	if (read_seqcount_retry(&vma->vm_sequence, seq))
+		goto unlock;
+
+	/*
+	 * Do a speculative lookup of the PTE entry.
+	 */
+	local_irq_disable();
+	pgd = pgd_offset(mm, address);
+	if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
+		goto out_walk;
+
+	p4d = p4d_alloc(mm, pgd, address);
+	if (p4d_none(*p4d) || unlikely(p4d_bad(*p4d)))
+		goto out_walk;
+
+	pud = pud_alloc(mm, p4d, address);
+	if (pud_none(*pud) || unlikely(pud_bad(*pud)))
+		goto out_walk;
+
+	pmd = pmd_offset(pud, address);
+	if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
+		goto out_walk;
+
+	/*
+	 * The above does not allocate/instantiate page-tables because doing so
+	 * would lead to the possibility of instantiating page-tables after
+	 * free_pgtables() -- and consequently leaking them.
+	 *
+	 * The result is that we take at least one !speculative fault per PMD
+	 * in order to instantiate it.
+	 */
+
+	if (unlikely(pmd_huge(*pmd)))
+		goto out_walk;
+
+	vmf.vma = vma;
+	vmf.pmd = pmd;
+	vmf.pgoff = linear_page_index(vma, address);
+	vmf.gfp_mask = __get_fault_gfp_mask(vma);
+	vmf.sequence = seq;
+	vmf.flags = flags;
+
+	local_irq_enable();
+
+	ret = handle_pte_fault(&vmf);
+
+unlock:
+	srcu_read_unlock(&vma_srcu, idx);
+	return ret;
+
+out_walk:
+	local_irq_enable();
+	goto unlock;
+}
+
 /*
  * By the time we get here, we already hold the mm semaphore
  *
-- 
2.7.4


* [PATCH 08/16] mm: Try spin lock in speculative path
  2017-08-08 14:35 [PATCH 00/16] Speculative page faults Laurent Dufour
                   ` (6 preceding siblings ...)
  2017-08-08 14:35 ` [PATCH 07/16] mm: Provide speculative fault infrastructure Laurent Dufour
@ 2017-08-08 14:35 ` Laurent Dufour
  2017-08-08 14:35 ` [PATCH 09/16] x86/mm: Add speculative pagefault handling Laurent Dufour
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 28+ messages in thread
From: Laurent Dufour @ 2017-08-08 14:35 UTC (permalink / raw)
  To: paulmck, peterz, akpm, kirill, ak, mhocko, dave, jack,
	Matthew Wilcox, benh, mpe, paulus, Thomas Gleixner, Ingo Molnar,
	hpa, Will Deacon
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora,
	Tim Chen, linuxppc-dev, x86

There is a deadlock when one CPU is doing a speculative page fault while
another one is calling do_munmap().

The deadlock occurs because the speculative path tries to take the pte
spinlock with interrupts disabled. Meanwhile the other CPU, in the unmap
path, holds the pte lock and waits for all CPUs to invalidate their TLB.
As the CPU doing the speculative fault has interrupts disabled, it can
neither invalidate the TLB nor acquire the lock.

Since we are on a speculative path, we may race with other mm actions.
So assume that the lock may not be acquired, and fail the speculative
page fault in that case.

Here are the stacks captured during the deadlock:

	CPU 0
	native_flush_tlb_others+0x7c/0x260
	flush_tlb_mm_range+0x6a/0x220
	tlb_flush_mmu_tlbonly+0x63/0xc0
	unmap_page_range+0x897/0x9d0
	? unmap_single_vma+0x7d/0xe0
	? release_pages+0x2b3/0x360
	unmap_single_vma+0x7d/0xe0
	unmap_vmas+0x51/0xa0
	unmap_region+0xbd/0x130
	do_munmap+0x279/0x460
	SyS_munmap+0x53/0x70

	CPU 1
	do_raw_spin_lock+0x14e/0x160
	_raw_spin_lock+0x5d/0x80
	? pte_map_lock+0x169/0x1b0
	pte_map_lock+0x169/0x1b0
	handle_pte_fault+0xbf2/0xd80
	? trace_hardirqs_on+0xd/0x10
	handle_speculative_fault+0x272/0x280
	handle_speculative_fault+0x5/0x280
	__do_page_fault+0x187/0x580
	trace_do_page_fault+0x52/0x260
	do_async_page_fault+0x19/0x70
	async_page_fault+0x28/0x30
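
With the trylock in place, a failed pte_spinlock()/pte_map_lock() simply
aborts the speculation. A minimal sketch of the expected caller pattern
(illustrative only, mirroring how the fault-path callers in this series
behave):

	/* Speculative-path caller falling back on trylock failure */
	if (!pte_map_lock(vmf)) {
		/* Lock contended or VMA changed: abort the speculation. */
		return VM_FAULT_RETRY;	/* retried with mmap_sem held */
	}
	/* ... vmf->pte is mapped and locked here; handle the fault ... */
	pte_unmap_unlock(vmf->pte, vmf->ptl);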

Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
---
 mm/memory.c | 19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 14236d98a5c5..519c28507a93 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2259,7 +2259,8 @@ static bool pte_spinlock(struct vm_fault *vmf)
 		goto out;
 
 	vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
-	spin_lock(vmf->ptl);
+	if (unlikely(!spin_trylock(vmf->ptl)))
+		goto out;
 
 	if (vma_has_changed(vmf)) {
 		spin_unlock(vmf->ptl);
@@ -2295,8 +2296,20 @@ static bool pte_map_lock(struct vm_fault *vmf)
 	if (vma_has_changed(vmf))
 		goto out;
 
-	pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
-				  vmf->address, &ptl);
+	/*
+	 * Same as pte_offset_map_lock() except that we call
+	 * spin_trylock() in place of spin_lock() to avoid a deadlock
+	 * with the unmap path, which may hold the lock and wait for
+	 * this CPU to invalidate the TLB while this CPU has irqs
+	 * disabled.
+	 * Since we are on a speculative path, accept that it may fail.
+	 */
+	ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
+	pte = pte_offset_map(vmf->pmd, vmf->address);
+	if (unlikely(!spin_trylock(ptl))) {
+		pte_unmap(pte);
+		goto out;
+	}
+
 	if (vma_has_changed(vmf)) {
 		pte_unmap_unlock(pte, ptl);
 		goto out;
-- 
2.7.4


* [PATCH 09/16] x86/mm: Add speculative pagefault handling
  2017-08-08 14:35 [PATCH 00/16] Speculative page faults Laurent Dufour
                   ` (7 preceding siblings ...)
  2017-08-08 14:35 ` [PATCH 08/16] mm: Try spin lock in speculative path Laurent Dufour
@ 2017-08-08 14:35 ` Laurent Dufour
  2017-08-08 14:35 ` [PATCH 10/16] powerpc/mm: Add speculative page fault Laurent Dufour
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 28+ messages in thread
From: Laurent Dufour @ 2017-08-08 14:35 UTC (permalink / raw)
  To: paulmck, peterz, akpm, kirill, ak, mhocko, dave, jack,
	Matthew Wilcox, benh, mpe, paulus, Thomas Gleixner, Ingo Molnar,
	hpa, Will Deacon
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora,
	Tim Chen, linuxppc-dev, x86

From: Peter Zijlstra <peterz@infradead.org>

Try a speculative fault before acquiring the mmap_sem; if it returns
VM_FAULT_RETRY, continue with the mmap_sem acquisition and do the
traditional fault.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

[Clearing of FAULT_FLAG_ALLOW_RETRY is now done in
 handle_speculative_fault()]
[Retry with the usual fault path in the case VM_FAULT_ERROR is returned
 by handle_speculative_fault(). This allows signals to be delivered]
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
---
 arch/x86/mm/fault.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 2a1fa10c6a98..46fb9c2a832d 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1365,6 +1365,19 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
 	if (error_code & PF_INSTR)
 		flags |= FAULT_FLAG_INSTRUCTION;
 
+	if (error_code & PF_USER) {
+		fault = handle_speculative_fault(mm, address, flags);
+
+		/*
+		 * We also check against VM_FAULT_ERROR because we have to
+		 * raise a signal by later calling mm_fault_error() which
+		 * requires the vma pointer to be set. So in that case,
+		 * we fall through the normal path.
+		 */
+		if (!(fault & VM_FAULT_RETRY || fault & VM_FAULT_ERROR))
+			goto done;
+	}
+
 	/*
 	 * When running in the kernel we expect faults to occur only to
 	 * addresses in user space.  All other faults represent errors in
@@ -1474,6 +1487,7 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
 		return;
 	}
 
+done:
 	/*
 	 * Major/minor page fault accounting. If any of the events
 	 * returned VM_FAULT_MAJOR, we account it as a major fault.
-- 
2.7.4


* [PATCH 10/16] powerpc/mm: Add speculative page fault
  2017-08-08 14:35 [PATCH 00/16] Speculative page faults Laurent Dufour
                   ` (8 preceding siblings ...)
  2017-08-08 14:35 ` [PATCH 09/16] x86/mm: Add speculative pagefault handling Laurent Dufour
@ 2017-08-08 14:35 ` Laurent Dufour
  2017-08-08 14:35 ` [PATCH 11/16] mm: Introduce __page_add_new_anon_rmap() Laurent Dufour
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 28+ messages in thread
From: Laurent Dufour @ 2017-08-08 14:35 UTC (permalink / raw)
  To: paulmck, peterz, akpm, kirill, ak, mhocko, dave, jack,
	Matthew Wilcox, benh, mpe, paulus, Thomas Gleixner, Ingo Molnar,
	hpa, Will Deacon
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora,
	Tim Chen, linuxppc-dev, x86

This patch enables the speculative page fault on the PowerPC
architecture.

It first tries a speculative page fault without holding the mmap_sem;
if that returns VM_FAULT_RETRY, the mmap_sem is acquired and the
traditional page fault processing is done.

Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
---
 arch/powerpc/mm/fault.c | 25 ++++++++++++++++++++++++-
 1 file changed, 24 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 4c422632047b..c6cd40901dd0 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -291,9 +291,31 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
 	if (is_write && is_user)
 		store_update_sp = store_updates_sp(regs);
 
-	if (is_user)
+	if (is_user) {
 		flags |= FAULT_FLAG_USER;
 
+		/*
+		 * Let's try a speculative page fault without grabbing the
+		 * mmap_sem.
+		 */
+
+		/*
+		 * flags is fully set later based on the VMA's flags, but the
+		 * common speculative service needs some flags set up front.
+		 */
+		if (is_write)
+			flags |= FAULT_FLAG_WRITE;
+
+		fault = handle_speculative_fault(mm, address, flags);
+		if (!(fault & VM_FAULT_RETRY || fault & VM_FAULT_ERROR))
+			goto done;
+
+		/*
+		 * Resetting flags since the following code assumes
+		 * FAULT_FLAG_WRITE is not set.
+		 */
+		flags &= ~FAULT_FLAG_WRITE;
+	}
+
 	/* When running in the kernel we expect faults to occur only to
 	 * addresses in user space.  All other faults represent errors in the
 	 * kernel and should generate an OOPS.  Unfortunately, in the case of an
@@ -479,6 +501,7 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
 			rc = 0;
 	}
 
+done:
 	/*
 	 * Major/minor page fault accounting.
 	 */
-- 
2.7.4


* [PATCH 11/16] mm: Introduce __page_add_new_anon_rmap()
  2017-08-08 14:35 [PATCH 00/16] Speculative page faults Laurent Dufour
                   ` (9 preceding siblings ...)
  2017-08-08 14:35 ` [PATCH 10/16] powerpc/mm: Add speculative page fault Laurent Dufour
@ 2017-08-08 14:35 ` Laurent Dufour
  2017-08-08 14:35 ` [PATCH 12/16] mm: Protect SPF handler against anon_vma changes Laurent Dufour
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 28+ messages in thread
From: Laurent Dufour @ 2017-08-08 14:35 UTC (permalink / raw)
  To: paulmck, peterz, akpm, kirill, ak, mhocko, dave, jack,
	Matthew Wilcox, benh, mpe, paulus, Thomas Gleixner, Ingo Molnar,
	hpa, Will Deacon
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora,
	Tim Chen, linuxppc-dev, x86

When dealing with the speculative page fault handler, we may race with a
VMA being split or merged. In that case the vma->vm_start and
vma->vm_end fields may not match the address at which the page fault is
occurring.

This can only happen when the VMA is split, but in that case the
anon_vma pointer of the new VMA will be the same as the original one,
because in __split_vma() the new->anon_vma is set to src->anon_vma when
*new = *vma.

So even if the VMA boundaries are not correct, the anon_vma pointer is
still valid.

If the VMA has been merged, then the VMA into which it has been merged
must have the same anon_vma pointer, otherwise the merge could not be
done.

So in all cases we know that the anon_vma is valid: we checked before
starting the speculative page fault that the anon_vma pointer is valid
for this VMA, and the presence of an anon_vma means that at some point a
page has been backed by it. Before the VMA is cleaned, the page table
lock has to be grabbed to clear the PTE, and the anon_vma field is
checked once the PTE is locked.

This patch introduces a new __page_add_new_anon_rmap() service which
doesn't check the VMA boundaries, and turns page_add_new_anon_rmap()
into an inline wrapper which does the check. Currently
__page_add_new_anon_rmap() is only called from the speculative page
fault path.

Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
---
 include/linux/rmap.h | 12 ++++++++++--
 mm/rmap.c            |  5 ++---
 2 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 43ef2c30cb0f..f5cd4dbc78b0 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -170,8 +170,16 @@ void page_add_anon_rmap(struct page *, struct vm_area_struct *,
 		unsigned long, bool);
 void do_page_add_anon_rmap(struct page *, struct vm_area_struct *,
 			   unsigned long, int);
-void page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
-		unsigned long, bool);
+void __page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
+			      unsigned long, bool);
+static inline void page_add_new_anon_rmap(struct page *page,
+					  struct vm_area_struct *vma,
+					  unsigned long address, bool compound)
+{
+	VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
+	__page_add_new_anon_rmap(page, vma, address, compound);
+}
+
 void page_add_file_rmap(struct page *, bool);
 void page_remove_rmap(struct page *, bool);
 
diff --git a/mm/rmap.c b/mm/rmap.c
index c8993c63eb25..e99f9cd7b399 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1117,7 +1117,7 @@ void do_page_add_anon_rmap(struct page *page,
 }
 
 /**
- * page_add_new_anon_rmap - add pte mapping to a new anonymous page
+ * __page_add_new_anon_rmap - add pte mapping to a new anonymous page
  * @page:	the page to add the mapping to
  * @vma:	the vm area in which the mapping is added
  * @address:	the user virtual address mapped
@@ -1127,12 +1127,11 @@ void do_page_add_anon_rmap(struct page *page,
  * This means the inc-and-test can be bypassed.
  * Page does not have to be locked.
  */
-void page_add_new_anon_rmap(struct page *page,
+void __page_add_new_anon_rmap(struct page *page,
 	struct vm_area_struct *vma, unsigned long address, bool compound)
 {
 	int nr = compound ? hpage_nr_pages(page) : 1;
 
-	VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
 	__SetPageSwapBacked(page);
 	if (compound) {
 		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
-- 
2.7.4


* [PATCH 12/16] mm: Protect SPF handler against anon_vma changes
  2017-08-08 14:35 [PATCH 00/16] Speculative page faults Laurent Dufour
                   ` (10 preceding siblings ...)
  2017-08-08 14:35 ` [PATCH 11/16] mm: Introduce __page_add_new_anon_rmap() Laurent Dufour
@ 2017-08-08 14:35 ` Laurent Dufour
  2017-08-08 14:35 ` [PATCH 13/16] perf: Add a speculative page fault sw events Laurent Dufour
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 28+ messages in thread
From: Laurent Dufour @ 2017-08-08 14:35 UTC (permalink / raw)
  To: paulmck, peterz, akpm, kirill, ak, mhocko, dave, jack,
	Matthew Wilcox, benh, mpe, paulus, Thomas Gleixner, Ingo Molnar,
	hpa, Will Deacon
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora,
	Tim Chen, linuxppc-dev, x86

The speculative page fault handler must be protected against anon_vma
changes, because page_add_new_anon_rmap() is called during the
speculative path.

In addition, don't try a speculative page fault if the VMA doesn't have
an anon_vma structure allocated, because its allocation should be
protected by the mmap_sem.

In __vma_adjust(), when importer->anon_vma is set, there is no need to
protect against speculative page faults since a speculative page fault
is aborted if vma->anon_vma is not set.

When calling page_add_new_anon_rmap(), vma->anon_vma is necessarily
valid since we checked for it when locking the pte, and the anon_vma is
only removed once the pte is unlocked. This holds even if the
speculative page fault handler runs concurrently with do_munmap(): the
pte is locked in unmap_region() - through unmap_vmas() - and the
anon_vma is unlinked later, while the vma sequence counter is updated in
unmap_page_range() before the pte is locked, and again in
free_pgtables(), so the change is detected when the pte is locked.

Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
---
 mm/memory.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 519c28507a93..cb6906435ff5 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -587,7 +587,9 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		 * Hide vma from rmap and truncate_pagecache before freeing
 		 * pgtables
 		 */
+		write_seqcount_begin(&vma->vm_sequence);
 		unlink_anon_vmas(vma);
+		write_seqcount_end(&vma->vm_sequence);
 		unlink_file_vma(vma);
 
 		if (is_vm_hugetlb_page(vma)) {
@@ -601,7 +603,9 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
 			       && !is_vm_hugetlb_page(next)) {
 				vma = next;
 				next = vma->vm_next;
+				write_seqcount_begin(&vma->vm_sequence);
 				unlink_anon_vmas(vma);
+				write_seqcount_end(&vma->vm_sequence);
 				unlink_file_vma(vma);
 			}
 			free_pgd_range(tlb, addr, vma->vm_end,
@@ -2403,7 +2407,7 @@ static int wp_page_copy(struct vm_fault *vmf)
 		 * thread doing COW.
 		 */
 		ptep_clear_flush_notify(vma, vmf->address, vmf->pte);
-		page_add_new_anon_rmap(new_page, vma, vmf->address, false);
+		__page_add_new_anon_rmap(new_page, vma, vmf->address, false);
 		mem_cgroup_commit_charge(new_page, memcg, false, false);
 		lru_cache_add_active_or_unevictable(new_page, vma);
 		/*
@@ -2873,7 +2877,7 @@ int do_swap_page(struct vm_fault *vmf)
 		mem_cgroup_commit_charge(page, memcg, true, false);
 		activate_page(page);
 	} else { /* ksm created a completely new copy */
-		page_add_new_anon_rmap(page, vma, vmf->address, false);
+		__page_add_new_anon_rmap(page, vma, vmf->address, false);
 		mem_cgroup_commit_charge(page, memcg, false, false);
 		lru_cache_add_active_or_unevictable(page, vma);
 	}
@@ -3015,7 +3019,7 @@ static int do_anonymous_page(struct vm_fault *vmf)
 	}
 
 	inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
-	page_add_new_anon_rmap(page, vma, vmf->address, false);
+	__page_add_new_anon_rmap(page, vma, vmf->address, false);
 	mem_cgroup_commit_charge(page, memcg, false, false);
 	lru_cache_add_active_or_unevictable(page, vma);
 setpte:
@@ -3940,6 +3944,9 @@ int handle_speculative_fault(struct mm_struct *mm, unsigned long address,
 	if (address < vma->vm_start || vma->vm_end <= address)
 		goto unlock;
 
+	if (unlikely(!vma->anon_vma))
+		goto unlock;
+
 	/*
 	 * Huge pages are not yet supported.
 	 */
-- 
2.7.4


* [PATCH 13/16] perf: Add a speculative page fault sw events
  2017-08-08 14:35 [PATCH 00/16] Speculative page faults Laurent Dufour
                   ` (11 preceding siblings ...)
  2017-08-08 14:35 ` [PATCH 12/16] mm: Protect SPF handler against anon_vma changes Laurent Dufour
@ 2017-08-08 14:35 ` Laurent Dufour
  2017-08-09 13:18   ` Michael Ellerman
  2017-08-08 14:35 ` [PATCH 14/16] x86/mm: Add support for SPF events Laurent Dufour
                   ` (2 subsequent siblings)
  15 siblings, 1 reply; 28+ messages in thread
From: Laurent Dufour @ 2017-08-08 14:35 UTC (permalink / raw)
  To: paulmck, peterz, akpm, kirill, ak, mhocko, dave, jack,
	Matthew Wilcox, benh, mpe, paulus, Thomas Gleixner, Ingo Molnar,
	hpa, Will Deacon
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora,
	Tim Chen, linuxppc-dev, x86

Add new software events to count successful and failed speculative page
faults.

Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
---
 include/uapi/linux/perf_event.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index b1c0b187acfe..fbfb03dff334 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -111,6 +111,8 @@ enum perf_sw_ids {
 	PERF_COUNT_SW_EMULATION_FAULTS		= 8,
 	PERF_COUNT_SW_DUMMY			= 9,
 	PERF_COUNT_SW_BPF_OUTPUT		= 10,
+	PERF_COUNT_SW_SPF_DONE			= 11,
+	PERF_COUNT_SW_SPF_FAILED		= 12,
 
 	PERF_COUNT_SW_MAX,			/* non-ABI */
 };
-- 
2.7.4


* [PATCH 14/16] x86/mm: Add support for SPF events
  2017-08-08 14:35 [PATCH 00/16] Speculative page faults Laurent Dufour
                   ` (12 preceding siblings ...)
  2017-08-08 14:35 ` [PATCH 13/16] perf: Add a speculative page fault sw events Laurent Dufour
@ 2017-08-08 14:35 ` Laurent Dufour
  2017-08-08 14:35 ` [PATCH 15/16] powerpc/mm: " Laurent Dufour
  2017-08-08 14:35 ` [PATCH 16/16] perf tools: " Laurent Dufour
  15 siblings, 0 replies; 28+ messages in thread
From: Laurent Dufour @ 2017-08-08 14:35 UTC (permalink / raw)
  To: paulmck, peterz, akpm, kirill, ak, mhocko, dave, jack,
	Matthew Wilcox, benh, mpe, paulus, Thomas Gleixner, Ingo Molnar,
	hpa, Will Deacon
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora,
	Tim Chen, linuxppc-dev, x86

Add support for the new speculative page fault software events.

Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
---
 arch/x86/mm/fault.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 46fb9c2a832d..17985f11b9da 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1374,8 +1374,12 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
 		 * requires the vma pointer to be set. So in that case,
 		 * we fall through the normal path.
 		 */
-		if (!(fault & VM_FAULT_RETRY || fault & VM_FAULT_ERROR))
+		if (!(fault & VM_FAULT_RETRY || fault & VM_FAULT_ERROR)) {
+			perf_sw_event(PERF_COUNT_SW_SPF_DONE, 1,
+				      regs, address);
 			goto done;
+		}
+		perf_sw_event(PERF_COUNT_SW_SPF_FAILED, 1, regs, address);
 	}
 
 	/*
-- 
2.7.4


* [PATCH 15/16] powerpc/mm: Add support for SPF events
  2017-08-08 14:35 [PATCH 00/16] Speculative page faults Laurent Dufour
                   ` (13 preceding siblings ...)
  2017-08-08 14:35 ` [PATCH 14/16] x86/mm: Add support for SPF events Laurent Dufour
@ 2017-08-08 14:35 ` Laurent Dufour
  2017-08-08 14:35 ` [PATCH 16/16] perf tools: " Laurent Dufour
  15 siblings, 0 replies; 28+ messages in thread
From: Laurent Dufour @ 2017-08-08 14:35 UTC (permalink / raw)
  To: paulmck, peterz, akpm, kirill, ak, mhocko, dave, jack,
	Matthew Wilcox, benh, mpe, paulus, Thomas Gleixner, Ingo Molnar,
	hpa, Will Deacon
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora,
	Tim Chen, linuxppc-dev, x86

Add support for the new speculative page fault software events.

Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
---
 arch/powerpc/mm/fault.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index c6cd40901dd0..112c4bc9da70 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -306,8 +306,13 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
 			flags |= FAULT_FLAG_WRITE;
 
 		fault = handle_speculative_fault(mm, address, flags);
-		if (!(fault & VM_FAULT_RETRY || fault & VM_FAULT_ERROR))
+		if (!(fault & VM_FAULT_RETRY || fault & VM_FAULT_ERROR)) {
+			perf_sw_event(PERF_COUNT_SW_SPF_DONE, 1,
+				      regs, address);
 			goto done;
+		}
+
+		perf_sw_event(PERF_COUNT_SW_SPF_FAILED, 1, regs, address);
 
 		/*
 		 * Resetting flags since the following code assumes
-- 
2.7.4


* [PATCH 16/16] perf tools: Add support for SPF events
  2017-08-08 14:35 [PATCH 00/16] Speculative page faults Laurent Dufour
                   ` (14 preceding siblings ...)
  2017-08-08 14:35 ` [PATCH 15/16] powerpc/mm: " Laurent Dufour
@ 2017-08-08 14:35 ` Laurent Dufour
  2017-08-09  1:43   ` Anshuman Khandual
  15 siblings, 1 reply; 28+ messages in thread
From: Laurent Dufour @ 2017-08-08 14:35 UTC (permalink / raw)
  To: paulmck, peterz, akpm, kirill, ak, mhocko, dave, jack,
	Matthew Wilcox, benh, mpe, paulus, Thomas Gleixner, Ingo Molnar,
	hpa, Will Deacon
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora,
	Tim Chen, linuxppc-dev, x86

Add support for the new speculative fault events.

Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
---
 tools/include/uapi/linux/perf_event.h | 2 ++
 tools/perf/util/evsel.c               | 2 ++
 tools/perf/util/parse-events.c        | 8 ++++++++
 tools/perf/util/parse-events.l        | 2 ++
 tools/perf/util/python.c              | 2 ++
 5 files changed, 16 insertions(+)

diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
index b1c0b187acfe..fbfb03dff334 100644
--- a/tools/include/uapi/linux/perf_event.h
+++ b/tools/include/uapi/linux/perf_event.h
@@ -111,6 +111,8 @@ enum perf_sw_ids {
 	PERF_COUNT_SW_EMULATION_FAULTS		= 8,
 	PERF_COUNT_SW_DUMMY			= 9,
 	PERF_COUNT_SW_BPF_OUTPUT		= 10,
+	PERF_COUNT_SW_SPF_DONE			= 11,
+	PERF_COUNT_SW_SPF_FAILED		= 12,
 
 	PERF_COUNT_SW_MAX,			/* non-ABI */
 };
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 413f74df08de..37d55ffd98b1 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -426,6 +426,8 @@ const char *perf_evsel__sw_names[PERF_COUNT_SW_MAX] = {
 	"alignment-faults",
 	"emulation-faults",
 	"dummy",
+	"speculative-faults",
+	"speculative-faults-failed",
 };
 
 static const char *__perf_evsel__sw_name(u64 config)
diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
index 01e779b91c8e..da1f87859366 100644
--- a/tools/perf/util/parse-events.c
+++ b/tools/perf/util/parse-events.c
@@ -135,6 +135,14 @@ struct event_symbol event_symbols_sw[PERF_COUNT_SW_MAX] = {
 		.symbol = "bpf-output",
 		.alias  = "",
 	},
+	[PERF_COUNT_SW_SPF_DONE] = {
+		.symbol = "speculative-faults",
+		.alias	= "spf",
+	},
+	[PERF_COUNT_SW_SPF_FAILED] = {
+		.symbol = "speculative-faults-failed",
+		.alias	= "spf-failed",
+	},
 };
 
 #define __PERF_EVENT_FIELD(config, name) \
diff --git a/tools/perf/util/parse-events.l b/tools/perf/util/parse-events.l
index 660fca05bc93..ca0adbc97683 100644
--- a/tools/perf/util/parse-events.l
+++ b/tools/perf/util/parse-events.l
@@ -274,6 +274,8 @@ alignment-faults				{ return sym(yyscanner, PERF_TYPE_SOFTWARE, PERF_COUNT_SW_AL
 emulation-faults				{ return sym(yyscanner, PERF_TYPE_SOFTWARE, PERF_COUNT_SW_EMULATION_FAULTS); }
 dummy						{ return sym(yyscanner, PERF_TYPE_SOFTWARE, PERF_COUNT_SW_DUMMY); }
 bpf-output					{ return sym(yyscanner, PERF_TYPE_SOFTWARE, PERF_COUNT_SW_BPF_OUTPUT); }
+speculative-faults|spf				{ return sym(yyscanner, PERF_TYPE_SOFTWARE, PERF_COUNT_SW_SPF_DONE); }
+speculative-faults-failed|spf-failed		{ return sym(yyscanner, PERF_TYPE_SOFTWARE, PERF_COUNT_SW_SPF_FAILED); }
 
 	/*
 	 * We have to handle the kernel PMU event cycles-ct/cycles-t/mem-loads/mem-stores separately.
diff --git a/tools/perf/util/python.c b/tools/perf/util/python.c
index c129e99114ae..b85e70e0da06 100644
--- a/tools/perf/util/python.c
+++ b/tools/perf/util/python.c
@@ -1141,6 +1141,8 @@ static struct {
 	PERF_CONST(COUNT_SW_ALIGNMENT_FAULTS),
 	PERF_CONST(COUNT_SW_EMULATION_FAULTS),
 	PERF_CONST(COUNT_SW_DUMMY),
+	PERF_CONST(COUNT_SW_SPF_DONE),
+	PERF_CONST(COUNT_SW_SPF_FAILED),
 
 	PERF_CONST(SAMPLE_IP),
 	PERF_CONST(SAMPLE_TID),
-- 
2.7.4


* Re: [PATCH 16/16] perf tools: Add support for SPF events
  2017-08-08 14:35 ` [PATCH 16/16] perf tools: " Laurent Dufour
@ 2017-08-09  1:43   ` Anshuman Khandual
  0 siblings, 0 replies; 28+ messages in thread
From: Anshuman Khandual @ 2017-08-09  1:43 UTC (permalink / raw)
  To: Laurent Dufour, paulmck, peterz, akpm, kirill, ak, mhocko, dave,
	jack, Matthew Wilcox, benh, mpe, paulus, Thomas Gleixner,
	Ingo Molnar, hpa, Will Deacon
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora,
	Tim Chen, linuxppc-dev, x86

On 08/08/2017 08:05 PM, Laurent Dufour wrote:
> Add support for the new speculative fault events.
> 
> Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
> ---
>  tools/include/uapi/linux/perf_event.h | 2 ++
>  tools/perf/util/evsel.c               | 2 ++
>  tools/perf/util/parse-events.c        | 8 ++++++++
>  tools/perf/util/parse-events.l        | 2 ++
>  tools/perf/util/python.c              | 2 ++
>  5 files changed, 16 insertions(+)
> 
> diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
> index b1c0b187acfe..fbfb03dff334 100644
> --- a/tools/include/uapi/linux/perf_event.h
> +++ b/tools/include/uapi/linux/perf_event.h
> @@ -111,6 +111,8 @@ enum perf_sw_ids {
>  	PERF_COUNT_SW_EMULATION_FAULTS		= 8,
>  	PERF_COUNT_SW_DUMMY			= 9,
>  	PERF_COUNT_SW_BPF_OUTPUT		= 10,
> +	PERF_COUNT_SW_SPF_DONE			= 11,
> +	PERF_COUNT_SW_SPF_FAILED		= 12,
>  

PERF_COUNT_SW_SPF_FAULTS makes sense but not the FAILED one. IIRC,
there is no error path counting in the perf SW events at the moment.
SPF_FAULTS and SPF_FAILS are VM internal events, like THP collapse
etc. IMHO they should be added as VM statistics counters or as
tracepoint events instead.
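
For reference, the vmstat-style alternative would look roughly like the
sketch below; the SPF_FAULT_DONE/SPF_FAULT_FAILED names are hypothetical
and not part of this series:

	/* include/linux/vm_event_item.h (sketch): new entries in
	 * enum vm_event_item, before NR_VM_EVENT_ITEMS */
	SPF_FAULT_DONE,		/* hypothetical: successful speculative faults */
	SPF_FAULT_FAILED,	/* hypothetical: aborted speculative faults */

	/* fault path (sketch): bump the counter where the perf
	 * software event is raised today */
	count_vm_event(SPF_FAULT_DONE);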


* Re: [PATCH 02/16] mm: Prepare for FAULT_FLAG_SPECULATIVE
  2017-08-08 14:35 ` [PATCH 02/16] mm: Prepare for FAULT_FLAG_SPECULATIVE Laurent Dufour
@ 2017-08-09 10:08   ` Kirill A. Shutemov
  2017-08-09 10:54     ` Laurent Dufour
  0 siblings, 1 reply; 28+ messages in thread
From: Kirill A. Shutemov @ 2017-08-09 10:08 UTC (permalink / raw)
  To: Laurent Dufour
  Cc: paulmck, peterz, akpm, ak, mhocko, dave, jack, Matthew Wilcox,
	benh, mpe, paulus, Thomas Gleixner, Ingo Molnar, hpa,
	Will Deacon, linux-kernel, linux-mm, haren, khandual, npiggin,
	bsingharora, Tim Chen, linuxppc-dev, x86

On Tue, Aug 08, 2017 at 04:35:35PM +0200, Laurent Dufour wrote:
> @@ -2295,7 +2302,11 @@ static int wp_page_copy(struct vm_fault *vmf)
>  	/*
>  	 * Re-check the pte - we dropped the lock
>  	 */
> -	vmf->pte = pte_offset_map_lock(mm, vmf->pmd, vmf->address, &vmf->ptl);
> +	if (!pte_map_lock(vmf)) {
> +		mem_cgroup_cancel_charge(new_page, memcg, false);
> +		ret = VM_FAULT_RETRY;
> +		goto oom_free_new;

With the change, the label is misleading.

> +	}
>  	if (likely(pte_same(*vmf->pte, vmf->orig_pte))) {
>  		if (old_page) {
>  			if (!PageAnon(old_page)) {

-- 
 Kirill A. Shutemov


* Re: [PATCH 05/16] mm: Protect VMA modifications using VMA sequence count
  2017-08-08 14:35 ` [PATCH 05/16] mm: Protect VMA modifications using " Laurent Dufour
@ 2017-08-09 10:12   ` Kirill A. Shutemov
  2017-08-09 10:43     ` Laurent Dufour
  0 siblings, 1 reply; 28+ messages in thread
From: Kirill A. Shutemov @ 2017-08-09 10:12 UTC (permalink / raw)
  To: Laurent Dufour
  Cc: paulmck, peterz, akpm, ak, mhocko, dave, jack, Matthew Wilcox,
	benh, mpe, paulus, Thomas Gleixner, Ingo Molnar, hpa,
	Will Deacon, linux-kernel, linux-mm, haren, khandual, npiggin,
	bsingharora, Tim Chen, linuxppc-dev, x86

On Tue, Aug 08, 2017 at 04:35:38PM +0200, Laurent Dufour wrote:
> The VMA sequence count has been introduced to allow fast detection of
> VMA modification when running a page fault handler without holding
> the mmap_sem.
> 
> This patch provides protection against the VMA modifications done in:
> 	- madvise()
> 	- mremap()
> 	- mpol_rebind_policy()
> 	- vma_replace_policy()
> 	- change_prot_numa()
> 	- mlock(), munlock()
> 	- mprotect()
> 	- mmap_region()
> 	- collapse_huge_page()

I don't think it's anywhere near a complete list of the places where we
touch vm_flags. What is your plan for the rest?
-- 
 Kirill A. Shutemov


* Re: [PATCH 05/16] mm: Protect VMA modifications using VMA sequence count
  2017-08-09 10:12   ` Kirill A. Shutemov
@ 2017-08-09 10:43     ` Laurent Dufour
  2017-08-10  0:58       ` Kirill A. Shutemov
  0 siblings, 1 reply; 28+ messages in thread
From: Laurent Dufour @ 2017-08-09 10:43 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: paulmck, peterz, akpm, ak, mhocko, dave, jack, Matthew Wilcox,
	benh, mpe, paulus, Thomas Gleixner, Ingo Molnar, hpa,
	Will Deacon, linux-kernel, linux-mm, haren, khandual, npiggin,
	bsingharora, Tim Chen, linuxppc-dev, x86

On 09/08/2017 12:12, Kirill A. Shutemov wrote:
> On Tue, Aug 08, 2017 at 04:35:38PM +0200, Laurent Dufour wrote:
>> The VMA sequence count has been introduced to allow fast detection of
>> VMA modification when running a page fault handler without holding
>> the mmap_sem.
>>
>> This patch provides protection against the VMA modifications done in:
>> 	- madvise()
>> 	- mremap()
>> 	- mpol_rebind_policy()
>> 	- vma_replace_policy()
>> 	- change_prot_numa()
>> 	- mlock(), munlock()
>> 	- mprotect()
>> 	- mmap_region()
>> 	- collapse_huge_page()
> 
> I don't think it's anywhere near a complete list of the places where we
> touch vm_flags. What is your plan for the rest?

The goal is only to protect the places where a change to the VMA impacts
the page fault handling. If you think I missed one, please advise.

Thanks,
Laurent.


* Re: [PATCH 02/16] mm: Prepare for FAULT_FLAG_SPECULATIVE
  2017-08-09 10:08   ` Kirill A. Shutemov
@ 2017-08-09 10:54     ` Laurent Dufour
  0 siblings, 0 replies; 28+ messages in thread
From: Laurent Dufour @ 2017-08-09 10:54 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: paulmck, peterz, akpm, ak, mhocko, dave, jack, Matthew Wilcox,
	benh, mpe, paulus, Thomas Gleixner, Ingo Molnar, hpa,
	Will Deacon, linux-kernel, linux-mm, haren, khandual, npiggin,
	bsingharora, Tim Chen, linuxppc-dev, x86

On 09/08/2017 12:08, Kirill A. Shutemov wrote:
> On Tue, Aug 08, 2017 at 04:35:35PM +0200, Laurent Dufour wrote:
>> @@ -2295,7 +2302,11 @@ static int wp_page_copy(struct vm_fault *vmf)
>>  	/*
>>  	 * Re-check the pte - we dropped the lock
>>  	 */
>> -	vmf->pte = pte_offset_map_lock(mm, vmf->pmd, vmf->address, &vmf->ptl);
>> +	if (!pte_map_lock(vmf)) {
>> +		mem_cgroup_cancel_charge(new_page, memcg, false);
>> +		ret = VM_FAULT_RETRY;
>> +		goto oom_free_new;
> 
> With the change, label is misleading.

That's right.
But I'm wondering whether renaming it to 'out_free_new' and replacing all
the matching 'goto' statements where the old label made sense would help
readability? Do you have a better idea?


* Re: [PATCH 13/16] perf: Add a speculative page fault sw events
  2017-08-08 14:35 ` [PATCH 13/16] perf: Add a speculative page fault sw events Laurent Dufour
@ 2017-08-09 13:18   ` Michael Ellerman
  2017-08-09 13:25     ` Laurent Dufour
  0 siblings, 1 reply; 28+ messages in thread
From: Michael Ellerman @ 2017-08-09 13:18 UTC (permalink / raw)
  To: Laurent Dufour, paulmck, peterz, akpm, kirill, ak, mhocko, dave,
	jack, Matthew Wilcox, benh, paulus, Thomas Gleixner, Ingo Molnar,
	hpa, Will Deacon
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora,
	Tim Chen, linuxppc-dev, x86

Laurent Dufour <ldufour@linux.vnet.ibm.com> writes:

> Add new software events to count successful and failed speculative page
> faults.
>
> Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
> ---
>  include/uapi/linux/perf_event.h | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> index b1c0b187acfe..fbfb03dff334 100644
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -111,6 +111,8 @@ enum perf_sw_ids {
>  	PERF_COUNT_SW_EMULATION_FAULTS		= 8,
>  	PERF_COUNT_SW_DUMMY			= 9,
>  	PERF_COUNT_SW_BPF_OUTPUT		= 10,
> +	PERF_COUNT_SW_SPF_DONE			= 11,
> +	PERF_COUNT_SW_SPF_FAILED		= 12,

Can't you calculate:

  PERF_COUNT_SW_SPF_FAILED = PERF_COUNT_SW_PAGE_FAULTS - PERF_COUNT_SW_SPF_DONE

i.e. do you need a separate event for it?

cheers


* Re: [PATCH 13/16] perf: Add a speculative page fault sw events
  2017-08-09 13:18   ` Michael Ellerman
@ 2017-08-09 13:25     ` Laurent Dufour
  0 siblings, 0 replies; 28+ messages in thread
From: Laurent Dufour @ 2017-08-09 13:25 UTC (permalink / raw)
  To: Michael Ellerman, paulmck, peterz, akpm, kirill, ak, mhocko,
	dave, jack, Matthew Wilcox, benh, paulus, Thomas Gleixner,
	Ingo Molnar, hpa, Will Deacon
  Cc: linux-kernel, linux-mm, haren, khandual, npiggin, bsingharora,
	Tim Chen, linuxppc-dev, x86

On 09/08/2017 15:18, Michael Ellerman wrote:
> Laurent Dufour <ldufour@linux.vnet.ibm.com> writes:
> 
>> Add new software events to count successful and failed speculative page
>> faults.
>>
>> Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
>> ---
>>  include/uapi/linux/perf_event.h | 2 ++
>>  1 file changed, 2 insertions(+)
>>
>> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
>> index b1c0b187acfe..fbfb03dff334 100644
>> --- a/include/uapi/linux/perf_event.h
>> +++ b/include/uapi/linux/perf_event.h
>> @@ -111,6 +111,8 @@ enum perf_sw_ids {
>>  	PERF_COUNT_SW_EMULATION_FAULTS		= 8,
>>  	PERF_COUNT_SW_DUMMY			= 9,
>>  	PERF_COUNT_SW_BPF_OUTPUT		= 10,
>> +	PERF_COUNT_SW_SPF_DONE			= 11,
>> +	PERF_COUNT_SW_SPF_FAILED		= 12,
> 
> Can't you calculate:
> 
>   PERF_COUNT_SW_SPF_FAILED = PERF_COUNT_SW_PAGE_FAULTS - PERF_COUNT_SW_SPF_DONE
> 
> ie. do you need a separate event for it?

Unfortunately not, because PERF_COUNT_SW_PAGE_FAULTS also counts page
faults from kernel space, while SPF only concerns user space page
faults.
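
For what it's worth, keeping separate events still lets userspace compare
all three counters directly. Assuming the aliases added in patch 16 of
this series, a session could look like:

	# process page faults vs. speculative outcomes (illustrative)
	perf stat -e faults,spf,spf-failed ./workload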

Cheers,
Laurent.


* Re: [PATCH 05/16] mm: Protect VMA modifications using VMA sequence count
  2017-08-09 10:43     ` Laurent Dufour
@ 2017-08-10  0:58       ` Kirill A. Shutemov
  2017-08-10  8:27         ` Laurent Dufour
  0 siblings, 1 reply; 28+ messages in thread
From: Kirill A. Shutemov @ 2017-08-10  0:58 UTC (permalink / raw)
  To: Laurent Dufour
  Cc: paulmck, peterz, akpm, ak, mhocko, dave, jack, Matthew Wilcox,
	benh, mpe, paulus, Thomas Gleixner, Ingo Molnar, hpa,
	Will Deacon, linux-kernel, linux-mm, haren, khandual, npiggin,
	bsingharora, Tim Chen, linuxppc-dev, x86

On Wed, Aug 09, 2017 at 12:43:33PM +0200, Laurent Dufour wrote:
> On 09/08/2017 12:12, Kirill A. Shutemov wrote:
> > On Tue, Aug 08, 2017 at 04:35:38PM +0200, Laurent Dufour wrote:
> >> The VMA sequence count has been introduced to allow fast detection of
> >> VMA modification when running a page fault handler without holding
> >> the mmap_sem.
> >>
> >> This patch provides protection against the VMA modifications done in:
> >> 	- madvise()
> >> 	- mremap()
> >> 	- mpol_rebind_policy()
> >> 	- vma_replace_policy()
> >> 	- change_prot_numa()
> >> 	- mlock(), munlock()
> >> 	- mprotect()
> >> 	- mmap_region()
> >> 	- collapse_huge_page()
> > 
> > I don't think it's anywhere near a complete list of the places where we
> > touch vm_flags. What is your plan for the rest?
> 
> The goal is only to protect the places where a change to the VMA impacts
> the page fault handling. If you think I missed one, please advise.

That's a very fragile approach. We rely too much on specific compiler
behaviour.

Any write access to vm_flags can, in theory, be translated into several
write accesses, for instance with vm_flags transiently set to 0 in the
middle, which would result in a segfault on a page fault to the vma.

Nothing (apart from common sense) prevents the compiler from generating
this kind of pattern.
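
To make the tearing concern concrete: a plain store may, in theory, be
split by the compiler, while the existing READ_ONCE()/WRITE_ONCE()
primitives force single accesses. A sketch (not part of this series):

	/* A plain store may be torn into several smaller stores,
	 * transiently exposing a bogus value to a lockless reader: */
	vma->vm_flags = flags;

	/* Tear-free variants force a single access each way: */
	WRITE_ONCE(vma->vm_flags, flags);
	flags = READ_ONCE(vma->vm_flags);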

-- 
 Kirill A. Shutemov


* Re: [PATCH 05/16] mm: Protect VMA modifications using VMA sequence count
  2017-08-10  0:58       ` Kirill A. Shutemov
@ 2017-08-10  8:27         ` Laurent Dufour
  2017-08-10 13:43           ` Kirill A. Shutemov
  0 siblings, 1 reply; 28+ messages in thread
From: Laurent Dufour @ 2017-08-10  8:27 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: paulmck, peterz, akpm, ak, mhocko, dave, jack, Matthew Wilcox,
	benh, mpe, paulus, Thomas Gleixner, Ingo Molnar, hpa,
	Will Deacon, linux-kernel, linux-mm, haren, khandual, npiggin,
	bsingharora, Tim Chen, linuxppc-dev, x86

On 10/08/2017 02:58, Kirill A. Shutemov wrote:
> On Wed, Aug 09, 2017 at 12:43:33PM +0200, Laurent Dufour wrote:
>> On 09/08/2017 12:12, Kirill A. Shutemov wrote:
>>> On Tue, Aug 08, 2017 at 04:35:38PM +0200, Laurent Dufour wrote:
>>>> The VMA sequence count has been introduced to allow fast detection of
>>>> VMA modification when running a page fault handler without holding
>>>> the mmap_sem.
>>>>
>>>> This patch provides protection against the VMA modifications done in:
>>>> 	- madvise()
>>>> 	- mremap()
>>>> 	- mpol_rebind_policy()
>>>> 	- vma_replace_policy()
>>>> 	- change_prot_numa()
>>>> 	- mlock(), munlock()
>>>> 	- mprotect()
>>>> 	- mmap_region()
>>>> 	- collapse_huge_page()
>>>
>>> I don't think it's anywhere near a complete list of the places where we
>>> touch vm_flags. What is your plan for the rest?
>>
>> The goal is only to protect the places where a change to the VMA impacts
>> the page fault handling. If you think I missed one, please advise.
> 
> That's a very fragile approach. We rely too much on specific compiler behaviour.
> 
> Any write access to vm_flags can, in theory, be translated into several
> write accesses, for instance with vm_flags transiently set to 0 in the
> middle, which would result in a segfault on a page fault to the vma.

Indeed, just setting vm_flags to 0 will not result in a segfault; the
real job is done when the ptes are updated and the bits allowing access
are cleared. Access to the pte is controlled by the pte lock.
The page fault handler is triggered based on the pte bits, not on the
content of vm_flags, and the speculative page fault checks the vma again
once the pte lock is held. So there is no concurrency when dealing with
the pte bits.

Regarding the compiler behaviour, there are memory barriers and locking
which should prevent that.

Thanks,
Laurent.


* Re: [PATCH 05/16] mm: Protect VMA modifications using VMA sequence count
  2017-08-10  8:27         ` Laurent Dufour
@ 2017-08-10 13:43           ` Kirill A. Shutemov
  2017-08-10 18:16             ` Laurent Dufour
  0 siblings, 1 reply; 28+ messages in thread
From: Kirill A. Shutemov @ 2017-08-10 13:43 UTC (permalink / raw)
  To: Laurent Dufour
  Cc: paulmck, peterz, akpm, ak, mhocko, dave, jack, Matthew Wilcox,
	benh, mpe, paulus, Thomas Gleixner, Ingo Molnar, hpa,
	Will Deacon, linux-kernel, linux-mm, haren, khandual, npiggin,
	bsingharora, Tim Chen, linuxppc-dev, x86

On Thu, Aug 10, 2017 at 10:27:50AM +0200, Laurent Dufour wrote:
> On 10/08/2017 02:58, Kirill A. Shutemov wrote:
> > On Wed, Aug 09, 2017 at 12:43:33PM +0200, Laurent Dufour wrote:
> >> On 09/08/2017 12:12, Kirill A. Shutemov wrote:
> >>> On Tue, Aug 08, 2017 at 04:35:38PM +0200, Laurent Dufour wrote:
> >>>> The VMA sequence count has been introduced to allow fast detection of
> >>>> VMA modification when running a page fault handler without holding
> >>>> the mmap_sem.
> >>>>
> >>>> This patch provides protection against the VMA modifications done in:
> >>>> 	- madvise()
> >>>> 	- mremap()
> >>>> 	- mpol_rebind_policy()
> >>>> 	- vma_replace_policy()
> >>>> 	- change_prot_numa()
> >>>> 	- mlock(), munlock()
> >>>> 	- mprotect()
> >>>> 	- mmap_region()
> >>>> 	- collapse_huge_page()
> >>>
> >>> I don't think it's anywhere near a complete list of the places where we
> >>> touch vm_flags. What is your plan for the rest?
> >>
> >> The goal is only to protect the places where a change to the VMA impacts
> >> the page fault handling. If you think I missed one, please advise.
> > 
> > That's a very fragile approach. We rely too much on specific compiler behaviour.
> > 
> > Any write access to vm_flags can, in theory, be translated into several
> > write accesses, for instance with vm_flags transiently set to 0 in the
> > middle, which would result in a segfault on a page fault to the vma.
> 
> Indeed, just setting vm_flags to 0 will not result in a segfault; the
> real job is done when the ptes are updated and the bits allowing access
> are cleared. Access to the pte is controlled by the pte lock.
> The page fault handler is triggered based on the pte bits, not on the
> content of vm_flags, and the speculative page fault checks the vma again
> once the pte lock is held. So there is no concurrency when dealing with
> the pte bits.

Suppose we get a page fault on a readable VMA and the pte is clear at the
time of the fault. In this case we need to consult vm_flags to check
whether the vma is read-accessible.

If by the time of the check vm_flags happened to be '0', we would get a
SIGSEGV as the vma would appear to be non-readable.

Where is my logic faulty?

> Regarding the compiler behaviour, there are memory barriers and locking
> which should prevent that.

Which locks and barriers are you talking about?

We need at least READ_ONCE/WRITE_ONCE to access vm_flags everywhere.

-- 
 Kirill A. Shutemov


* Re: [PATCH 05/16] mm: Protect VMA modifications using VMA sequence count
  2017-08-10 13:43           ` Kirill A. Shutemov
@ 2017-08-10 18:16             ` Laurent Dufour
  0 siblings, 0 replies; 28+ messages in thread
From: Laurent Dufour @ 2017-08-10 18:16 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: paulmck, peterz, akpm, ak, mhocko, dave, jack, Matthew Wilcox,
	benh, mpe, paulus, Thomas Gleixner, Ingo Molnar, hpa,
	Will Deacon, linux-kernel, linux-mm, haren, khandual, npiggin,
	bsingharora, Tim Chen, linuxppc-dev, x86

On 10/08/2017 15:43, Kirill A. Shutemov wrote:
> On Thu, Aug 10, 2017 at 10:27:50AM +0200, Laurent Dufour wrote:
>> On 10/08/2017 02:58, Kirill A. Shutemov wrote:
>>> On Wed, Aug 09, 2017 at 12:43:33PM +0200, Laurent Dufour wrote:
>>>> On 09/08/2017 12:12, Kirill A. Shutemov wrote:
>>>>> On Tue, Aug 08, 2017 at 04:35:38PM +0200, Laurent Dufour wrote:
>>>>>> The VMA sequence count has been introduced to allow fast detection of
>>>>>> VMA modification when running a page fault handler without holding
>>>>>> the mmap_sem.
>>>>>>
>>>>>> This patch provides protection against the VMA modifications done in:
>>>>>> 	- madvise()
>>>>>> 	- mremap()
>>>>>> 	- mpol_rebind_policy()
>>>>>> 	- vma_replace_policy()
>>>>>> 	- change_prot_numa()
>>>>>> 	- mlock(), munlock()
>>>>>> 	- mprotect()
>>>>>> 	- mmap_region()
>>>>>> 	- collapse_huge_page()
>>>>>
>>>>> I don't think it's anywhere near a complete list of the places where we
>>>>> touch vm_flags. What is your plan for the rest?
>>>>
>>>> The goal is only to protect the places where a change to the VMA impacts
>>>> the page fault handling. If you think I missed one, please advise.
>>>
>>> That's a very fragile approach. We rely too much on specific compiler behaviour.
>>>
>>> Any write access to vm_flags can, in theory, be translated into several
>>> write accesses, for instance with vm_flags transiently set to 0 in the
>>> middle, which would result in a segfault on a page fault to the vma.
>>
>> Indeed, just setting vm_flags to 0 will not result in a segfault; the
>> real job is done when the ptes are updated and the bits allowing access
>> are cleared. Access to the pte is controlled by the pte lock.
>> The page fault handler is triggered based on the pte bits, not on the
>> content of vm_flags, and the speculative page fault checks the vma again
>> once the pte lock is held. So there is no concurrency when dealing with
>> the pte bits.
> 
> Suppose we get a page fault on a readable VMA and the pte is clear at the
> time of the fault. In this case we need to consult vm_flags to check
> whether the vma is read-accessible.
> 
> If by the time of the check vm_flags happened to be '0', we would get a
> SIGSEGV as the vma would appear to be non-readable.
> 
> Where is my logic faulty?

The speculative page fault handler will not deliver the signal. If the
page fault can't be handled in the speculative path, for instance because
the vm_flags don't match the required ones, the speculative page fault is
aborted and the *classic* page fault handler is run, which will do the
job again while grabbing the mmap_sem.

>> Regarding the compiler behaviour, there are memory barriers and locking
>> which should prevent that.
> 
> Which locks and barriers are you talking about?

When the VMA is modified in a way that impacts the speculative page fault
handler, the sequence count is touched using write_seqcount_begin() and
write_seqcount_end(). These two services contain calls to smp_wmb().
On the speculative path side, the *_read_seqcount() calls also contain
memory barriers.
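
Spelled out, this is the usual seqcount pattern (a sketch mirroring the
vm_sequence usage in this series):

	/* writer side: any VMA change visible to the fault path */
	write_seqcount_begin(&vma->vm_sequence);
	/* ... update vm_flags, vm_start, vm_end, ... */
	write_seqcount_end(&vma->vm_sequence);

	/* speculative reader side */
	seq = raw_read_seqcount(&vma->vm_sequence);
	/* ... read the VMA fields needed by the fault ... */
	if (read_seqcount_retry(&vma->vm_sequence, seq))
		goto abort;	/* a writer ran concurrently */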

> We need at least READ_ONCE/WRITE_ONCE to access vm_flags everywhere.

I don't think READ_ONCE/WRITE_ONCE would help here, as they would not
prevent reading a transient state like in the vm_flags example you mentioned.

That said, not many VMA fields are used in the SPF path, and caching them
in the vmf structure under the control of the VMA's sequence count would
solve this.
I'll try to move in that direction unless anyone has a better idea.


Cheers,
Laurent.


End of thread (newest message: 2017-08-10 18:16 UTC)

Thread overview: 28+ messages
2017-08-08 14:35 [PATCH 00/16] Speculative page faults Laurent Dufour
2017-08-08 14:35 ` [PATCH 01/16] mm: Dont assume page-table invariance during faults Laurent Dufour
2017-08-08 14:35 ` [PATCH 02/16] mm: Prepare for FAULT_FLAG_SPECULATIVE Laurent Dufour
2017-08-09 10:08   ` Kirill A. Shutemov
2017-08-09 10:54     ` Laurent Dufour
2017-08-08 14:35 ` [PATCH 03/16] mm: Introduce pte_spinlock " Laurent Dufour
2017-08-08 14:35 ` [PATCH 04/16] mm: VMA sequence count Laurent Dufour
2017-08-08 14:35 ` [PATCH 05/16] mm: Protect VMA modifications using " Laurent Dufour
2017-08-09 10:12   ` Kirill A. Shutemov
2017-08-09 10:43     ` Laurent Dufour
2017-08-10  0:58       ` Kirill A. Shutemov
2017-08-10  8:27         ` Laurent Dufour
2017-08-10 13:43           ` Kirill A. Shutemov
2017-08-10 18:16             ` Laurent Dufour
2017-08-08 14:35 ` [PATCH 06/16] mm: RCU free VMAs Laurent Dufour
2017-08-08 14:35 ` [PATCH 07/16] mm: Provide speculative fault infrastructure Laurent Dufour
2017-08-08 14:35 ` [PATCH 08/16] mm: Try spin lock in speculative path Laurent Dufour
2017-08-08 14:35 ` [PATCH 09/16] x86/mm: Add speculative pagefault handling Laurent Dufour
2017-08-08 14:35 ` [PATCH 10/16] powerpc/mm: Add speculative page fault Laurent Dufour
2017-08-08 14:35 ` [PATCH 11/16] mm: Introduce __page_add_new_anon_rmap() Laurent Dufour
2017-08-08 14:35 ` [PATCH 12/16] mm: Protect SPF handler against anon_vma changes Laurent Dufour
2017-08-08 14:35 ` [PATCH 13/16] perf: Add a speculative page fault sw events Laurent Dufour
2017-08-09 13:18   ` Michael Ellerman
2017-08-09 13:25     ` Laurent Dufour
2017-08-08 14:35 ` [PATCH 14/16] x86/mm: Add support for SPF events Laurent Dufour
2017-08-08 14:35 ` [PATCH 15/16] powerpc/mm: " Laurent Dufour
2017-08-08 14:35 ` [PATCH 16/16] perf tools: " Laurent Dufour
2017-08-09  1:43   ` Anshuman Khandual
