linuxppc-dev.lists.ozlabs.org archive mirror
* [PATCH v2 0/7] Various TLB and PTE improvements
@ 2018-05-20  0:43 Nicholas Piggin
  2018-05-20  0:43 ` [PATCH v2 1/7] powerpc/64s/radix: do not flush TLB on spurious fault Nicholas Piggin
                   ` (6 more replies)
  0 siblings, 7 replies; 10+ messages in thread
From: Nicholas Piggin @ 2018-05-20  0:43 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Nicholas Piggin

I've posted most of these separately at one time or another, but I'm
now sending them as a series. There have been some bug fixes, changelog
and comment improvements, and I've gathered some more numbers.

Most of the patches are logically independent (except 2 and 3 AFAIKS).

Thanks,
Nick

Nicholas Piggin (7):
  powerpc/64s/radix: do not flush TLB on spurious fault
  powerpc/64s/radix: reset mm_cpumask for single thread process when
    possible
  powerpc/64s/radix: make single threaded mms always flush all translations
    from non-local CPUs
  powerpc/64s/radix: make ptep_get_and_clear_full non-atomic for the
    full case
  powerpc/64s/radix: optimise pte_update
  powerpc/64s/radix: prefetch user address in update_mmu_cache
  powerpc/64s/radix: avoid ptesync after set_pte and
    ptep_set_access_flags

 arch/powerpc/include/asm/book3s/64/radix.h    | 37 ++++----
 arch/powerpc/include/asm/book3s/64/tlbflush.h | 12 ++-
 arch/powerpc/include/asm/cacheflush.h         | 13 +++
 arch/powerpc/include/asm/mmu_context.h        | 19 ++++
 arch/powerpc/include/asm/tlb.h                |  8 ++
 arch/powerpc/mm/mem.c                         |  4 +-
 arch/powerpc/mm/mmu_context.c                 |  6 +-
 arch/powerpc/mm/pgtable-book3s64.c            |  3 +-
 arch/powerpc/mm/tlb-radix.c                   | 89 ++++++++++++++-----
 9 files changed, 143 insertions(+), 48 deletions(-)

-- 
2.17.0

* [PATCH v2 1/7] powerpc/64s/radix: do not flush TLB on spurious fault
  2018-05-20  0:43 [PATCH v2 0/7] Various TLB and PTE improvements Nicholas Piggin
@ 2018-05-20  0:43 ` Nicholas Piggin
  2018-05-21  6:06   ` Aneesh Kumar K.V
  2018-05-20  0:43 ` [PATCH v2 2/7] powerpc/64s/radix: reset mm_cpumask for single thread process when possible Nicholas Piggin
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 10+ messages in thread
From: Nicholas Piggin @ 2018-05-20  0:43 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Nicholas Piggin

In the case of a spurious fault (which can happen due to a race with
another thread that changes the page table), the default Linux mm code
calls flush_tlb_page for that address. This is not required because
the pte will be re-fetched. Hash does not wire this up to a hardware
TLB flush for this reason. This patch avoids the flush for radix.
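For context, the generic path this hooks into is the access/dirty fixup
at the end of the pte fault handler. A paraphrased sketch of mm/memory.c
(simplified names, not an exact copy of the upstream code):

    entry = pte_mkyoung(entry);
    if (ptep_set_access_flags(vma, address, ptep, entry, write)) {
            /* The PTE really changed, tell the arch about it */
            update_mmu_cache(vma, address, ptep);
    } else {
            /*
             * Nothing needed changing, so another thread must have
             * already fixed the PTE up: the fault was spurious.
             */
            if (write)
                    flush_tlb_fix_spurious_fault(vma, address);
    }

The default flush_tlb_fix_spurious_fault() falls back to
flush_tlb_page(), which is the flush this patch avoids on radix unless
a coprocessor is attached.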

From Power ISA v3.0B, p.1090:

    Setting a Reference or Change Bit or Upgrading Access Authority
    (PTE Subject to Atomic Hardware Updates)

    If the only change being made to a valid PTE that is subject to
    atomic hardware updates is to set the Reference or Change bit to
    1 or to add access authorities, a simpler sequence suffices
    because the translation hardware will refetch the PTE if an access
    is attempted for which the only problems were reference and/or
    change bits needing to be set or insufficient access authority.

The nest MMU on POWER9 does not re-fetch the PTE after such an access
attempt before faulting, so address spaces with a coprocessor
attached will continue to flush in these cases.

This reduces tlbies for a kernel compile workload from 0.95M to 0.90M.

fork --fork --exec benchmark improved 0.5% (12300->12400).

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
Since v1:
- Added NMMU handling

 arch/powerpc/include/asm/book3s/64/tlbflush.h | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush.h b/arch/powerpc/include/asm/book3s/64/tlbflush.h
index 0cac17253513..ebf572ea621e 100644
--- a/arch/powerpc/include/asm/book3s/64/tlbflush.h
+++ b/arch/powerpc/include/asm/book3s/64/tlbflush.h
@@ -4,7 +4,7 @@
 
 #define MMU_NO_CONTEXT	~0UL
 
-
+#include <linux/mm_types.h>
 #include <asm/book3s/64/tlbflush-hash.h>
 #include <asm/book3s/64/tlbflush-radix.h>
 
@@ -137,6 +137,16 @@ static inline void flush_all_mm(struct mm_struct *mm)
 #define flush_tlb_page(vma, addr)	local_flush_tlb_page(vma, addr)
 #define flush_all_mm(mm)		local_flush_all_mm(mm)
 #endif /* CONFIG_SMP */
+
+#define flush_tlb_fix_spurious_fault flush_tlb_fix_spurious_fault
+static inline void flush_tlb_fix_spurious_fault(struct vm_area_struct *vma,
+						unsigned long address)
+{
+	/* See ptep_set_access_flags comment */
+	if (atomic_read(&vma->vm_mm->context.copros) > 0)
+		flush_tlb_page(vma, address);
+}
+
 /*
  * flush the page walk cache for the address
  */
-- 
2.17.0

* [PATCH v2 2/7] powerpc/64s/radix: reset mm_cpumask for single thread process when possible
  2018-05-20  0:43 [PATCH v2 0/7] Various TLB and PTE improvements Nicholas Piggin
  2018-05-20  0:43 ` [PATCH v2 1/7] powerpc/64s/radix: do not flush TLB on spurious fault Nicholas Piggin
@ 2018-05-20  0:43 ` Nicholas Piggin
  2018-05-20  0:43 ` [PATCH v2 3/7] powerpc/64s/radix: make single threaded mms always flush all translations from non-local CPUs Nicholas Piggin
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 10+ messages in thread
From: Nicholas Piggin @ 2018-05-20  0:43 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Nicholas Piggin

When a single-threaded process has a non-local mm_cpumask and requires
a full PID tlbie invalidation, use that as an opportunity to reset the
cpumask back to the current CPU we're running on.

No other thread can concurrently switch to this mm, because it must
have been given a reference to mm_users by the current thread before it
can call use_mm(). mm_users can be incremented asynchronously (by
mmget_not_zero), but those users must not call use_mm(), because
existing convention already disallows it (sparc64 has reset mm_cpumask
on this condition since the start of history; see
arch/sparc/kernel/smp_64.c).

This reduces tlbies for a kernel compile workload from 0.90M to 0.40M;
tlbiels increase from 20M to 22.5M because a local PID flush takes
128 tlbiels vs 1 for a global PID flush.

This slows down a context-switch benchmark that ping-pongs between
different CPUs by 7.5%, because switching to the idle thread and back
now requires a PID switch. There are ways this could be avoided, but
for now it's simpler not to allow lazy PID switching.

Reviewed-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
Since v1:
- Fix switch_mm to be irq-safe (thanks Balbir)

 arch/powerpc/include/asm/mmu_context.h | 19 +++++++++
 arch/powerpc/include/asm/tlb.h         |  8 ++++
 arch/powerpc/mm/tlb-radix.c            | 57 +++++++++++++++++++-------
 3 files changed, 70 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
index 1835ca1505d6..f26704371fdc 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -6,6 +6,7 @@
 #include <linux/kernel.h>
 #include <linux/mm.h>
 #include <linux/sched.h>
+#include <linux/sched/mm.h>
 #include <linux/spinlock.h>
 #include <asm/mmu.h>	
 #include <asm/cputable.h>
@@ -201,6 +202,24 @@ static inline void activate_mm(struct mm_struct *prev, struct mm_struct *next)
 static inline void enter_lazy_tlb(struct mm_struct *mm,
 				  struct task_struct *tsk)
 {
+#ifdef CONFIG_PPC_BOOK3S_64
+	/*
+	 * Under radix, we do not want to keep lazy PIDs around because
+	 * even if the CPU does not access userspace, it can still bring
+	 * in translations through speculation and prefetching.
+	 *
+	 * Switching away here allows us to trim back the mm_cpumask in
+	 * cases where we know the process is not running on some CPUs
+	 * (see mm/tlb-radix.c).
+	 */
+	if (radix_enabled() && mm != &init_mm) {
+		mmgrab(&init_mm);
+		tsk->active_mm = &init_mm;
+		switch_mm(mm, tsk->active_mm, tsk);
+		mmdrop(mm);
+	}
+#endif
+
 	/* 64-bit Book3E keeps track of current PGD in the PACA */
 #ifdef CONFIG_PPC_BOOK3E_64
 	get_paca()->pgd = NULL;
diff --git a/arch/powerpc/include/asm/tlb.h b/arch/powerpc/include/asm/tlb.h
index a7eabff27a0f..006fce98c403 100644
--- a/arch/powerpc/include/asm/tlb.h
+++ b/arch/powerpc/include/asm/tlb.h
@@ -76,6 +76,14 @@ static inline int mm_is_thread_local(struct mm_struct *mm)
 		return false;
 	return cpumask_test_cpu(smp_processor_id(), mm_cpumask(mm));
 }
+static inline void mm_reset_thread_local(struct mm_struct *mm)
+{
+	WARN_ON(atomic_read(&mm->context.copros) > 0);
+	WARN_ON(!(atomic_read(&mm->mm_users) == 1 && current->mm == mm));
+	atomic_set(&mm->context.active_cpus, 1);
+	cpumask_clear(mm_cpumask(mm));
+	cpumask_set_cpu(smp_processor_id(), mm_cpumask(mm));
+}
 #else /* CONFIG_PPC_BOOK3S_64 */
 static inline int mm_is_thread_local(struct mm_struct *mm)
 {
diff --git a/arch/powerpc/mm/tlb-radix.c b/arch/powerpc/mm/tlb-radix.c
index 5ac3206c51cc..d5593a78702a 100644
--- a/arch/powerpc/mm/tlb-radix.c
+++ b/arch/powerpc/mm/tlb-radix.c
@@ -504,6 +504,15 @@ void radix__local_flush_tlb_page(struct vm_area_struct *vma, unsigned long vmadd
 }
 EXPORT_SYMBOL(radix__local_flush_tlb_page);
 
+static bool mm_is_singlethreaded(struct mm_struct *mm)
+{
+	if (atomic_read(&mm->context.copros) > 0)
+		return false;
+	if (atomic_read(&mm->mm_users) == 1 && current->mm == mm)
+		return true;
+	return false;
+}
+
 static bool mm_needs_flush_escalation(struct mm_struct *mm)
 {
 	/*
@@ -511,7 +520,9 @@ static bool mm_needs_flush_escalation(struct mm_struct *mm)
 	 * caching PTEs and not flushing them properly when
 	 * RIC = 0 for a PID/LPID invalidate
 	 */
-	return atomic_read(&mm->context.copros) != 0;
+	if (atomic_read(&mm->context.copros) > 0)
+		return true;
+	return false;
 }
 
 #ifdef CONFIG_SMP
@@ -525,12 +536,17 @@ void radix__flush_tlb_mm(struct mm_struct *mm)
 
 	preempt_disable();
 	if (!mm_is_thread_local(mm)) {
-		if (mm_needs_flush_escalation(mm))
+		if (mm_is_singlethreaded(mm)) {
 			_tlbie_pid(pid, RIC_FLUSH_ALL);
-		else
+			mm_reset_thread_local(mm);
+		} else if (mm_needs_flush_escalation(mm)) {
+			_tlbie_pid(pid, RIC_FLUSH_ALL);
+		} else {
 			_tlbie_pid(pid, RIC_FLUSH_TLB);
-	} else
+		}
+	} else {
 		_tlbiel_pid(pid, RIC_FLUSH_TLB);
+	}
 	preempt_enable();
 }
 EXPORT_SYMBOL(radix__flush_tlb_mm);
@@ -544,10 +560,13 @@ void radix__flush_all_mm(struct mm_struct *mm)
 		return;
 
 	preempt_disable();
-	if (!mm_is_thread_local(mm))
+	if (!mm_is_thread_local(mm)) {
 		_tlbie_pid(pid, RIC_FLUSH_ALL);
-	else
+		if (mm_is_singlethreaded(mm))
+			mm_reset_thread_local(mm);
+	} else {
 		_tlbiel_pid(pid, RIC_FLUSH_ALL);
+	}
 	preempt_enable();
 }
 EXPORT_SYMBOL(radix__flush_all_mm);
@@ -644,10 +663,14 @@ void radix__flush_tlb_range(struct vm_area_struct *vma, unsigned long start,
 		if (local) {
 			_tlbiel_pid(pid, RIC_FLUSH_TLB);
 		} else {
-			if (mm_needs_flush_escalation(mm))
+			if (mm_is_singlethreaded(mm)) {
+				_tlbie_pid(pid, RIC_FLUSH_ALL);
+				mm_reset_thread_local(mm);
+			} else if (mm_needs_flush_escalation(mm)) {
 				_tlbie_pid(pid, RIC_FLUSH_ALL);
-			else
+			} else {
 				_tlbie_pid(pid, RIC_FLUSH_TLB);
+			}
 		}
 	} else {
 		bool hflush = false;
@@ -802,13 +825,19 @@ static inline void __radix__flush_tlb_range_psize(struct mm_struct *mm,
 	}
 
 	if (full) {
-		if (!local && mm_needs_flush_escalation(mm))
-			also_pwc = true;
-
-		if (local)
+		if (local) {
 			_tlbiel_pid(pid, also_pwc ? RIC_FLUSH_ALL : RIC_FLUSH_TLB);
-		else
-			_tlbie_pid(pid, also_pwc ? RIC_FLUSH_ALL: RIC_FLUSH_TLB);
+		} else {
+			if (mm_is_singlethreaded(mm)) {
+				_tlbie_pid(pid, RIC_FLUSH_ALL);
+				mm_reset_thread_local(mm);
+			} else {
+				if (mm_needs_flush_escalation(mm))
+					also_pwc = true;
+
+				_tlbie_pid(pid, also_pwc ? RIC_FLUSH_ALL : RIC_FLUSH_TLB);
+			}
+		}
 	} else {
 		if (local)
 			_tlbiel_va_range(start, end, pid, page_size, psize, also_pwc);
-- 
2.17.0

* [PATCH v2 3/7] powerpc/64s/radix: make single threaded mms always flush all translations from non-local CPUs
  2018-05-20  0:43 [PATCH v2 0/7] Various TLB and PTE improvements Nicholas Piggin
  2018-05-20  0:43 ` [PATCH v2 1/7] powerpc/64s/radix: do not flush TLB on spurious fault Nicholas Piggin
  2018-05-20  0:43 ` [PATCH v2 2/7] powerpc/64s/radix: reset mm_cpumask for single thread process when possible Nicholas Piggin
@ 2018-05-20  0:43 ` Nicholas Piggin
  2018-05-20  0:43 ` [PATCH v2 4/7] powerpc/64s/radix: make ptep_get_and_clear_full non-atomic for the full case Nicholas Piggin
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 10+ messages in thread
From: Nicholas Piggin @ 2018-05-20  0:43 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Nicholas Piggin

Go one step further: if we're going to put a tlbie on the bus at all,
make it count. Make any global invalidation from a single-threaded mm
do a full PID flush so the mm_cpumask can be reset.

The tradeoff is that it will over-flush the local CPU's TLB once, in
cases where only a small number of pages needed flushing and could have
been done with address-specific tlbies.

If the workload is invalidate-heavy enough for this to be a concern,
this should be outweighed by the benefit that it can subsequently
avoid the global flush.

This reduces tlbies for a kernel compile workload from 0.40M to 0.18M;
tlbiels increase from 22.5M to 23.8M because a local PID flush takes
128 tlbiels vs 1 for a global PID flush.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/mm/tlb-radix.c | 45 ++++++++++++++++++++++---------------
 1 file changed, 27 insertions(+), 18 deletions(-)

diff --git a/arch/powerpc/mm/tlb-radix.c b/arch/powerpc/mm/tlb-radix.c
index d5593a78702a..55f93d66c8d2 100644
--- a/arch/powerpc/mm/tlb-radix.c
+++ b/arch/powerpc/mm/tlb-radix.c
@@ -587,10 +587,16 @@ void radix__flush_tlb_page_psize(struct mm_struct *mm, unsigned long vmaddr,
 		return;
 
 	preempt_disable();
-	if (!mm_is_thread_local(mm))
-		_tlbie_va(vmaddr, pid, psize, RIC_FLUSH_TLB);
-	else
+	if (mm_is_thread_local(mm)) {
 		_tlbiel_va(vmaddr, pid, psize, RIC_FLUSH_TLB);
+	} else {
+		if (mm_is_singlethreaded(mm)) {
+			_tlbie_pid(pid, RIC_FLUSH_ALL);
+			mm_reset_thread_local(mm);
+		} else {
+			_tlbie_va(vmaddr, pid, psize, RIC_FLUSH_TLB);
+		}
+	}
 	preempt_enable();
 }
 
@@ -659,14 +665,14 @@ void radix__flush_tlb_range(struct vm_area_struct *vma, unsigned long start,
 				nr_pages > tlb_single_page_flush_ceiling);
 	}
 
-	if (full) {
+	if (!local && mm_is_singlethreaded(mm)) {
+		_tlbie_pid(pid, RIC_FLUSH_ALL);
+		mm_reset_thread_local(mm);
+	} else if (full) {
 		if (local) {
 			_tlbiel_pid(pid, RIC_FLUSH_TLB);
 		} else {
-			if (mm_is_singlethreaded(mm)) {
-				_tlbie_pid(pid, RIC_FLUSH_ALL);
-				mm_reset_thread_local(mm);
-			} else if (mm_needs_flush_escalation(mm)) {
+			if (mm_needs_flush_escalation(mm)) {
 				_tlbie_pid(pid, RIC_FLUSH_ALL);
 			} else {
 				_tlbie_pid(pid, RIC_FLUSH_TLB);
@@ -824,19 +830,17 @@ static inline void __radix__flush_tlb_range_psize(struct mm_struct *mm,
 				nr_pages > tlb_single_page_flush_ceiling);
 	}
 
-	if (full) {
+	if (!local && mm_is_singlethreaded(mm)) {
+		_tlbie_pid(pid, RIC_FLUSH_ALL);
+		mm_reset_thread_local(mm);
+	} else if (full) {
 		if (local) {
 			_tlbiel_pid(pid, also_pwc ? RIC_FLUSH_ALL : RIC_FLUSH_TLB);
 		} else {
-			if (mm_is_singlethreaded(mm)) {
-				_tlbie_pid(pid, RIC_FLUSH_ALL);
-				mm_reset_thread_local(mm);
-			} else {
-				if (mm_needs_flush_escalation(mm))
-					also_pwc = true;
+			if (mm_needs_flush_escalation(mm))
+				also_pwc = true;
 
-				_tlbie_pid(pid, also_pwc ? RIC_FLUSH_ALL : RIC_FLUSH_TLB);
-			}
+			_tlbie_pid(pid, also_pwc ? RIC_FLUSH_ALL : RIC_FLUSH_TLB);
 		}
 	} else {
 		if (local)
@@ -882,7 +886,12 @@ void radix__flush_tlb_collapsed_pmd(struct mm_struct *mm, unsigned long addr)
 	if (mm_is_thread_local(mm)) {
 		_tlbiel_va_range(addr, end, pid, PAGE_SIZE, mmu_virtual_psize, true);
 	} else {
-		_tlbie_va_range(addr, end, pid, PAGE_SIZE, mmu_virtual_psize, true);
+		if (mm_is_singlethreaded(mm)) {
+			_tlbie_pid(pid, RIC_FLUSH_ALL);
+			mm_reset_thread_local(mm);
+		} else {
+			_tlbie_va_range(addr, end, pid, PAGE_SIZE, mmu_virtual_psize, true);
+		}
 	}
 
 	preempt_enable();
-- 
2.17.0

* [PATCH v2 4/7] powerpc/64s/radix: make ptep_get_and_clear_full non-atomic for the full case
  2018-05-20  0:43 [PATCH v2 0/7] Various TLB and PTE improvements Nicholas Piggin
                   ` (2 preceding siblings ...)
  2018-05-20  0:43 ` [PATCH v2 3/7] powerpc/64s/radix: make single threaded mms always flush all translations from non-local CPUs Nicholas Piggin
@ 2018-05-20  0:43 ` Nicholas Piggin
  2018-05-20  0:43 ` [PATCH v2 5/7] powerpc/64s/radix: optimise pte_update Nicholas Piggin
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 10+ messages in thread
From: Nicholas Piggin @ 2018-05-20  0:43 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Nicholas Piggin

This matches other architectures: when we know there will be no
further accesses to the address (e.g., for teardown), page table
entries can be cleared non-atomically.
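The "full" case is driven by the generic unmap path. A paraphrased
sketch of zap_pte_range() in mm/memory.c (not an exact copy):

    ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
    tlb_remove_tlb_entry(tlb, pte, addr);
    /*
     * tlb->fullmm is only set when the whole address space is being
     * torn down, so nothing can legitimately access this mapping
     * again before the gathered TLB flush.
     */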

The comments about NMMU are bogus: all MMU notifiers (including NMMU)
are released at this point, with their TLBs flushed. An NMMU access at
this point would be a bug.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/include/asm/book3s/64/radix.h | 10 ++--------
 1 file changed, 2 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/radix.h b/arch/powerpc/include/asm/book3s/64/radix.h
index 705193e7192f..fcd92f9b6ec0 100644
--- a/arch/powerpc/include/asm/book3s/64/radix.h
+++ b/arch/powerpc/include/asm/book3s/64/radix.h
@@ -176,14 +176,8 @@ static inline pte_t radix__ptep_get_and_clear_full(struct mm_struct *mm,
 	unsigned long old_pte;
 
 	if (full) {
-		/*
-		 * If we are trying to clear the pte, we can skip
-		 * the DD1 pte update sequence and batch the tlb flush. The
-		 * tlb flush batching is done by mmu gather code. We
-		 * still keep the cmp_xchg update to make sure we get
-		 * correct R/C bit which might be updated via Nest MMU.
-		 */
-		old_pte = __radix_pte_update(ptep, ~0ul, 0);
+		old_pte = pte_val(*ptep);
+		*ptep = __pte(0);
 	} else
 		old_pte = radix__pte_update(mm, addr, ptep, ~0ul, 0, 0);
 
-- 
2.17.0

* [PATCH v2 5/7] powerpc/64s/radix: optimise pte_update
  2018-05-20  0:43 [PATCH v2 0/7] Various TLB and PTE improvements Nicholas Piggin
                   ` (3 preceding siblings ...)
  2018-05-20  0:43 ` [PATCH v2 4/7] powerpc/64s/radix: make ptep_get_and_clear_full non-atomic for the full case Nicholas Piggin
@ 2018-05-20  0:43 ` Nicholas Piggin
  2018-05-20  0:43 ` [PATCH v2 6/7] powerpc/64s/radix: prefetch user address in update_mmu_cache Nicholas Piggin
  2018-05-20  0:43 ` [PATCH v2 7/7] powerpc/64s/radix: avoid ptesync after set_pte and ptep_set_access_flags Nicholas Piggin
  6 siblings, 0 replies; 10+ messages in thread
From: Nicholas Piggin @ 2018-05-20  0:43 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Nicholas Piggin

Implementing pte_update with pte_xchg (which uses cmpxchg) is
inefficient. A single larx/stcx. works fine; there is no need for the
less efficient cmpxchg sequence.

Then remove the memory barriers from the operation. There is a
requirement for TLB flushing to load mm_cpumask after the store that
reduces pte permissions; that ordering is now provided by a barrier in
the TLB flush code.
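To spell the requirement out (my summary, not wording from the patch),
the ordering is the usual two-sided pattern:

    CPU clearing/downgrading a PTE       CPU switching to the mm
    ------------------------------       -----------------------
    store: update the PTE                store: set self in mm_cpumask
    barrier (radix__flush_tlb_*)         barrier (switch_mm_irqs_off)
    load:  mm_cpumask(mm)                load:  translations via the PID

Either the flusher sees the new CPU in mm_cpumask and sends it a tlbie,
or that CPU only starts loading translations after the PTE store is
visible, so it cannot cache the stale value.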

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/include/asm/book3s/64/radix.h | 25 +++++++++++-----------
 arch/powerpc/mm/mmu_context.c              |  6 ++++--
 arch/powerpc/mm/tlb-radix.c                | 11 +++++++++-
 3 files changed, 27 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/radix.h b/arch/powerpc/include/asm/book3s/64/radix.h
index fcd92f9b6ec0..ab4ed61a74fa 100644
--- a/arch/powerpc/include/asm/book3s/64/radix.h
+++ b/arch/powerpc/include/asm/book3s/64/radix.h
@@ -127,20 +127,21 @@ extern void radix__mark_initmem_nx(void);
 static inline unsigned long __radix_pte_update(pte_t *ptep, unsigned long clr,
 					       unsigned long set)
 {
-	pte_t pte;
-	unsigned long old_pte, new_pte;
-
-	do {
-		pte = READ_ONCE(*ptep);
-		old_pte = pte_val(pte);
-		new_pte = (old_pte | set) & ~clr;
-
-	} while (!pte_xchg(ptep, __pte(old_pte), __pte(new_pte)));
-
-	return old_pte;
+	__be64 old_be, tmp_be;
+
+	__asm__ __volatile__(
+	"1:	ldarx	%0,0,%3		# pte_update\n"
+	"	andc	%1,%0,%5	\n"
+	"	or	%1,%1,%4	\n"
+	"	stdcx.	%1,0,%3		\n"
+	"	bne-	1b"
+	: "=&r" (old_be), "=&r" (tmp_be), "=m" (*ptep)
+	: "r" (ptep), "r" (cpu_to_be64(set)), "r" (cpu_to_be64(clr))
+	: "cc" );
+
+	return be64_to_cpu(old_be);
 }
 
-
 static inline unsigned long radix__pte_update(struct mm_struct *mm,
 					unsigned long addr,
 					pte_t *ptep, unsigned long clr,
diff --git a/arch/powerpc/mm/mmu_context.c b/arch/powerpc/mm/mmu_context.c
index 0ab297c4cfad..f84e14f23e50 100644
--- a/arch/powerpc/mm/mmu_context.c
+++ b/arch/powerpc/mm/mmu_context.c
@@ -57,8 +57,10 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 		 * in switch_slb(), and/or the store of paca->mm_ctx_id in
 		 * copy_mm_to_paca().
 		 *
-		 * On the read side the barrier is in pte_xchg(), which orders
-		 * the store to the PTE vs the load of mm_cpumask.
+		 * On the other side, the barrier is in mm/tlb-radix.c for
+		 * radix which orders earlier stores to clear the PTEs vs
+		 * the load of mm_cpumask. And pte_xchg which does the same
+		 * thing for hash.
 		 *
 		 * This full barrier is needed by membarrier when switching
 		 * between processes after store to rq->curr, before user-space
diff --git a/arch/powerpc/mm/tlb-radix.c b/arch/powerpc/mm/tlb-radix.c
index 55f93d66c8d2..b419702b1ba6 100644
--- a/arch/powerpc/mm/tlb-radix.c
+++ b/arch/powerpc/mm/tlb-radix.c
@@ -535,6 +535,11 @@ void radix__flush_tlb_mm(struct mm_struct *mm)
 		return;
 
 	preempt_disable();
+	/*
+	 * Order loads of mm_cpumask vs previous stores to clear ptes before
+	 * the invalidate. See barrier in switch_mm_irqs_off
+	 */
+	smp_mb();
 	if (!mm_is_thread_local(mm)) {
 		if (mm_is_singlethreaded(mm)) {
 			_tlbie_pid(pid, RIC_FLUSH_ALL);
@@ -560,6 +565,7 @@ void radix__flush_all_mm(struct mm_struct *mm)
 		return;
 
 	preempt_disable();
+	smp_mb(); /* see radix__flush_tlb_mm */
 	if (!mm_is_thread_local(mm)) {
 		_tlbie_pid(pid, RIC_FLUSH_ALL);
 		if (mm_is_singlethreaded(mm))
@@ -587,6 +593,7 @@ void radix__flush_tlb_page_psize(struct mm_struct *mm, unsigned long vmaddr,
 		return;
 
 	preempt_disable();
+	smp_mb(); /* see radix__flush_tlb_mm */
 	if (mm_is_thread_local(mm)) {
 		_tlbiel_va(vmaddr, pid, psize, RIC_FLUSH_TLB);
 	} else {
@@ -655,6 +662,7 @@ void radix__flush_tlb_range(struct vm_area_struct *vma, unsigned long start,
 		return;
 
 	preempt_disable();
+	smp_mb(); /* see radix__flush_tlb_mm */
 	if (mm_is_thread_local(mm)) {
 		local = true;
 		full = (end == TLB_FLUSH_ALL ||
@@ -820,6 +828,7 @@ static inline void __radix__flush_tlb_range_psize(struct mm_struct *mm,
 		return;
 
 	preempt_disable();
+	smp_mb(); /* see radix__flush_tlb_mm */
 	if (mm_is_thread_local(mm)) {
 		local = true;
 		full = (end == TLB_FLUSH_ALL ||
@@ -882,7 +891,7 @@ void radix__flush_tlb_collapsed_pmd(struct mm_struct *mm, unsigned long addr)
 
 	/* Otherwise first do the PWC, then iterate the pages. */
 	preempt_disable();
-
+	smp_mb(); /* see radix__flush_tlb_mm */
 	if (mm_is_thread_local(mm)) {
 		_tlbiel_va_range(addr, end, pid, PAGE_SIZE, mmu_virtual_psize, true);
 	} else {
-- 
2.17.0

* [PATCH v2 6/7] powerpc/64s/radix: prefetch user address in update_mmu_cache
  2018-05-20  0:43 [PATCH v2 0/7] Various TLB and PTE improvements Nicholas Piggin
                   ` (4 preceding siblings ...)
  2018-05-20  0:43 ` [PATCH v2 5/7] powerpc/64s/radix: optimise pte_update Nicholas Piggin
@ 2018-05-20  0:43 ` Nicholas Piggin
  2018-05-20  0:43 ` [PATCH v2 7/7] powerpc/64s/radix: avoid ptesync after set_pte and ptep_set_access_flags Nicholas Piggin
  6 siblings, 0 replies; 10+ messages in thread
From: Nicholas Piggin @ 2018-05-20  0:43 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Nicholas Piggin

Prefetch the faulting address in update_mmu_cache to give the page
table walker perhaps a 100 cycle head start as locks are dropped and
the interrupt is completed.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/mm/mem.c              | 4 +++-
 arch/powerpc/mm/pgtable-book3s64.c | 3 ++-
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index c3c39b02b2ba..8cecda4bd66a 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -509,8 +509,10 @@ void update_mmu_cache(struct vm_area_struct *vma, unsigned long address,
 	 */
 	unsigned long access, trap;
 
-	if (radix_enabled())
+	if (radix_enabled()) {
+		prefetch((void *)address);
 		return;
+	}
 
 	/* We only want HPTEs for linux PTEs that have _PAGE_ACCESSED set */
 	if (!pte_young(*ptep) || address >= TASK_SIZE)
diff --git a/arch/powerpc/mm/pgtable-book3s64.c b/arch/powerpc/mm/pgtable-book3s64.c
index 994492453f0e..7ce889a7e5ce 100644
--- a/arch/powerpc/mm/pgtable-book3s64.c
+++ b/arch/powerpc/mm/pgtable-book3s64.c
@@ -145,7 +145,8 @@ pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
 void update_mmu_cache_pmd(struct vm_area_struct *vma, unsigned long addr,
 			  pmd_t *pmd)
 {
-	return;
+	if (radix_enabled())
+		prefetch((void *)addr);
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
-- 
2.17.0

* [PATCH v2 7/7] powerpc/64s/radix: avoid ptesync after set_pte and ptep_set_access_flags
  2018-05-20  0:43 [PATCH v2 0/7] Various TLB and PTE improvements Nicholas Piggin
                   ` (5 preceding siblings ...)
  2018-05-20  0:43 ` [PATCH v2 6/7] powerpc/64s/radix: prefetch user address in update_mmu_cache Nicholas Piggin
@ 2018-05-20  0:43 ` Nicholas Piggin
  6 siblings, 0 replies; 10+ messages in thread
From: Nicholas Piggin @ 2018-05-20  0:43 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Nicholas Piggin

The ISA suggests a ptesync after setting a pte, to prevent a table walk
initiated by a subsequent access from missing that store and causing a
spurious fault. This is an architectural allowance that permits an
implementation's page table walker to be incoherent with the store
queue.

However there is no correctness problem in taking a spurious fault in
userspace -- the kernel copes with these at any time, so the updated
pte will be found eventually. Spurious kernel faults on vmap memory
must be avoided, so a ptesync is put into flush_cache_vmap.
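For reference, flush_cache_vmap() sits right after the kernel page
tables are populated on the vmalloc/ioremap path. A paraphrased sketch
of mm/vmalloc.c (not an exact copy):

    static int vmap_page_range(unsigned long start, unsigned long end,
                               pgprot_t prot, struct page **pages)
    {
            int ret;

            ret = vmap_page_range_noflush(start, end, prot, pages);
            /* on Book3S-64 this now issues the ptesync added below */
            flush_cache_vmap(start, end);
            return ret;
    }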

On POWER9 so far I have not found a measurable window where this can
result in more minor faults, so as an optimisation, remove the costly
ptesync from pte updates. If an implementation benefits from ptesync,
it would be better to add it back in update_mmu_cache, so it's not
done for things like fork(2).

fork --fork --exec benchmark improved 5.2% (12400->13100).

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/include/asm/book3s/64/radix.h |  2 --
 arch/powerpc/include/asm/cacheflush.h      | 13 +++++++++++++
 2 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/radix.h b/arch/powerpc/include/asm/book3s/64/radix.h
index ab4ed61a74fa..cc9437a542cc 100644
--- a/arch/powerpc/include/asm/book3s/64/radix.h
+++ b/arch/powerpc/include/asm/book3s/64/radix.h
@@ -210,7 +210,6 @@ static inline void radix__ptep_set_access_flags(struct mm_struct *mm,
 		__radix_pte_update(ptep, 0, new_pte);
 	} else
 		__radix_pte_update(ptep, 0, set);
-	asm volatile("ptesync" : : : "memory");
 }
 
 static inline int radix__pte_same(pte_t pte_a, pte_t pte_b)
@@ -227,7 +226,6 @@ static inline void radix__set_pte_at(struct mm_struct *mm, unsigned long addr,
 				 pte_t *ptep, pte_t pte, int percpu)
 {
 	*ptep = pte;
-	asm volatile("ptesync" : : : "memory");
 }
 
 static inline int radix__pmd_bad(pmd_t pmd)
diff --git a/arch/powerpc/include/asm/cacheflush.h b/arch/powerpc/include/asm/cacheflush.h
index 11843e37d9cf..e9662648e72d 100644
--- a/arch/powerpc/include/asm/cacheflush.h
+++ b/arch/powerpc/include/asm/cacheflush.h
@@ -26,6 +26,19 @@
 #define flush_cache_vmap(start, end)		do { } while (0)
 #define flush_cache_vunmap(start, end)		do { } while (0)
 
+#ifdef CONFIG_PPC_BOOK3S_64
+/*
+ * Book3s has no ptesync after setting a pte, so without this ptesync it's
+ * possible for a kernel virtual mapping access to return a spurious fault
+ * if it's accessed right after the pte is set. The page fault handler does
+ * not expect this type of fault. flush_cache_vmap is not exactly the right
+ * place to put this, but it seems to work well enough.
+ */
+#define flush_cache_vmap(start, end)		do { asm volatile("ptesync" ::: "memory"); } while (0)
+#else
+#define flush_cache_vmap(start, end)		do { } while (0)
+#endif
+
 #define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 1
 extern void flush_dcache_page(struct page *page);
 #define flush_dcache_mmap_lock(mapping)		do { } while (0)
-- 
2.17.0

* Re: [PATCH v2 1/7] powerpc/64s/radix: do not flush TLB on spurious fault
  2018-05-20  0:43 ` [PATCH v2 1/7] powerpc/64s/radix: do not flush TLB on spurious fault Nicholas Piggin
@ 2018-05-21  6:06   ` Aneesh Kumar K.V
  2018-05-24 10:37     ` Nicholas Piggin
  0 siblings, 1 reply; 10+ messages in thread
From: Aneesh Kumar K.V @ 2018-05-21  6:06 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev; +Cc: Nicholas Piggin

Nicholas Piggin <npiggin@gmail.com> writes:

> In the case of a spurious fault (which can happen due to a race with
> another thread that changes the page table), the default Linux mm code
> calls flush_tlb_page for that address. This is not required because
> the pte will be re-fetched. Hash does not wire this up to a hardware
> TLB flush for this reason. This patch avoids the flush for radix.
>
> From Power ISA v3.0B, p.1090:
>
>     Setting a Reference or Change Bit or Upgrading Access Authority
>     (PTE Subject to Atomic Hardware Updates)
>
>     If the only change being made to a valid PTE that is subject to
>     atomic hardware updates is to set the Reference or Change bit to
>     1 or to add access authorities, a simpler sequence suffices
>     because the translation hardware will refetch the PTE if an access
>     is attempted for which the only problems were reference and/or
>     change bits needing to be set or insufficient access authority.
>
> The nest MMU on POWER9 does not re-fetch the PTE after such an access
> attempt before faulting, so address spaces with a coprocessor
> attached will continue to flush in these cases.
>
> This reduces tlbies for a kernel compile workload from 0.95M to 0.90M.
>
> fork --fork --exec benchmark improved 0.5% (12300->12400).
>


Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

Do you want to use flush_tlb_fix_spurious_fault in
ptep_set_access_flags() also? That would bring it closer to the generic
version?

> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> ---
> Since v1:
> - Added NMMU handling
>
>  arch/powerpc/include/asm/book3s/64/tlbflush.h | 12 +++++++++++-
>  1 file changed, 11 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush.h b/arch/powerpc/include/asm/book3s/64/tlbflush.h
> index 0cac17253513..ebf572ea621e 100644
> --- a/arch/powerpc/include/asm/book3s/64/tlbflush.h
> +++ b/arch/powerpc/include/asm/book3s/64/tlbflush.h
> @@ -4,7 +4,7 @@
>  
>  #define MMU_NO_CONTEXT	~0UL
>  
> -
> +#include <linux/mm_types.h>
>  #include <asm/book3s/64/tlbflush-hash.h>
>  #include <asm/book3s/64/tlbflush-radix.h>
>  
> @@ -137,6 +137,16 @@ static inline void flush_all_mm(struct mm_struct *mm)
>  #define flush_tlb_page(vma, addr)	local_flush_tlb_page(vma, addr)
>  #define flush_all_mm(mm)		local_flush_all_mm(mm)
>  #endif /* CONFIG_SMP */
> +
> +#define flush_tlb_fix_spurious_fault flush_tlb_fix_spurious_fault
> +static inline void flush_tlb_fix_spurious_fault(struct vm_area_struct *vma,
> +						unsigned long address)
> +{
> +	/* See ptep_set_access_flags comment */
> +	if (atomic_read(&vma->vm_mm->context.copros) > 0)
> +		flush_tlb_page(vma, address);
> +}
> +
>  /*
>   * flush the page walk cache for the address
>   */
> -- 
> 2.17.0

* Re: [PATCH v2 1/7] powerpc/64s/radix: do not flush TLB on spurious fault
  2018-05-21  6:06   ` Aneesh Kumar K.V
@ 2018-05-24 10:37     ` Nicholas Piggin
  0 siblings, 0 replies; 10+ messages in thread
From: Nicholas Piggin @ 2018-05-24 10:37 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: linuxppc-dev

On Mon, 21 May 2018 11:36:12 +0530
"Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> wrote:

> Nicholas Piggin <npiggin@gmail.com> writes:
> 
> > In the case of a spurious fault (which can happen due to a race with
> > another thread that changes the page table), the default Linux mm code
> > calls flush_tlb_page for that address. This is not required because
> > the pte will be re-fetched. Hash does not wire this up to a hardware
> > TLB flush for this reason. This patch avoids the flush for radix.
> >
> > From Power ISA v3.0B, p.1090:
> >
> >     Setting a Reference or Change Bit or Upgrading Access Authority
> >     (PTE Subject to Atomic Hardware Updates)
> >
> >     If the only change being made to a valid PTE that is subject to
> >     atomic hardware updates is to set the Reference or Change bit to
> >     1 or to add access authorities, a simpler sequence suffices
> >     because the translation hardware will refetch the PTE if an access
> >     is attempted for which the only problems were reference and/or
> >     change bits needing to be set or insufficient access authority.
> >
> > The nest MMU on POWER9 does not re-fetch the PTE after such an access
> > attempt before faulting, so address spaces with a coprocessor
> > attached will continue to flush in these cases.
> >
> > This reduces tlbies for a kernel compile workload from 0.95M to 0.90M.
> >
> > fork --fork --exec benchmark improved 0.5% (12300->12400).
> >  
> 
> 
> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> 
> Do you want to use flush_tlb_fix_spurious_fault in
> ptep_set_access_flags() also? That would bring it closer to the generic
> version?

I'm not sure it adds much. It does bring it closer to the generic
version, and they do happen to do the same thing, but it's not really
fixing a spurious fault in ptep_set_access_flags(). I think adding that
would just mean you have to follow another indirection to work out what
it does.

Thanks,
Nick
