* [PATCH 0/3] powerpc/64s/radix pte manipulation optimisations
From: Nicholas Piggin @ 2018-05-13  4:21 UTC
  To: linuxppc-dev; +Cc: Nicholas Piggin

Here are a few patches which I'm sure will cause a lot of concern, but
I think now is the time to have it out and really start optimising
these things as far as we can. The radix MMU has been stable for quite
some time, and distros have made releases with the more conservative
flushes, barriers, and pte updates.

If we decide not to do any of these things, we can document why not
so it becomes easier to revisit.

With these patches, plus the earlier TLB flush reduction patches, plus
a few generic mm patches that I haven't posted yet, the fork/exec
benchmark from selftests improves by 11%.

A test which mprotects 16GB of memory to read-only, then reads a byte
from each page, then mprotects it read/write and writes a byte to each
page, then repeats, is more than 2x faster, mostly due to the reduced
TLB flushing, barriers, and atomics from these two patch sets.
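
For concreteness, a minimal user-space sketch of that test
(illustrative only; the iteration count and 4K page stride are
assumptions, and the actual selftests code differs):

  #include <sys/mman.h>

  #define SIZE   (16UL << 30)   /* 16GB */
  #define STRIDE 4096UL         /* assumed page size */

  int main(void)
  {
          volatile char *mem = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
          unsigned long off;
          int i;

          if (mem == MAP_FAILED)
                  return 1;

          for (i = 0; i < 10; i++) {
                  /* drop write permission, then read-fault each page */
                  mprotect((void *)mem, SIZE, PROT_READ);
                  for (off = 0; off < SIZE; off += STRIDE)
                          (void)mem[off];

                  /* restore write permission, then write-fault each page */
                  mprotect((void *)mem, SIZE, PROT_READ | PROT_WRITE);
                  for (off = 0; off < SIZE; off += STRIDE)
                          mem[off] = 1;
          }
          return 0;
  }

Each permission change touches every pte in the range (with 4K pages,
around 4 million of them), so the per-pte atomics, barriers, and TLB
flushing dominate the runtime.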

Nicholas Piggin (3):
  powerpc/64s/radix: make ptep_get_and_clear_full non-atomic for the
    full case
  powerpc/64s/radix: avoid ptesync after set_pte and
    ptep_set_access_flags
  powerpc/64s/radix: optimise pte_update

 arch/powerpc/include/asm/book3s/64/radix.h | 37 +++++++++-------------
 arch/powerpc/mm/mmu_context.c              |  6 ++--
 arch/powerpc/mm/tlb-radix.c                | 11 ++++++-
 3 files changed, 29 insertions(+), 25 deletions(-)

-- 
2.17.0

* [PATCH 1/3] powerpc/64s/radix: make ptep_get_and_clear_full non-atomic for the full case
From: Nicholas Piggin @ 2018-05-13  4:21 UTC
  To: linuxppc-dev; +Cc: Nicholas Piggin

This matches other architectures: when we know there will be no
further accesses to the address (e.g., for teardown), page table
entries can be cleared non-atomically.

The comments about the NMMU are bogus: all MMU notifiers (including
the NMMU) have been released at this point, with their TLBs flushed.
An NMMU access at this point would be a bug.
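
For context, this is the caller pattern that sets "full" (a simplified
sketch of the generic teardown path in mm/memory.c, not part of this
patch):

  /*
   * zap_pte_range() passes tlb->fullmm as "full", so the non-atomic
   * clear is only used when the whole address space is going away
   * and nothing can access the pte concurrently.
   */
  ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);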

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/include/asm/book3s/64/radix.h | 10 ++--------
 1 file changed, 2 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/radix.h b/arch/powerpc/include/asm/book3s/64/radix.h
index 705193e7192f..fcd92f9b6ec0 100644
--- a/arch/powerpc/include/asm/book3s/64/radix.h
+++ b/arch/powerpc/include/asm/book3s/64/radix.h
@@ -176,14 +176,8 @@ static inline pte_t radix__ptep_get_and_clear_full(struct mm_struct *mm,
 	unsigned long old_pte;
 
 	if (full) {
-		/*
-		 * If we are trying to clear the pte, we can skip
-		 * the DD1 pte update sequence and batch the tlb flush. The
-		 * tlb flush batching is done by mmu gather code. We
-		 * still keep the cmp_xchg update to make sure we get
-		 * correct R/C bit which might be updated via Nest MMU.
-		 */
-		old_pte = __radix_pte_update(ptep, ~0ul, 0);
+		old_pte = pte_val(*ptep);
+		*ptep = __pte(0);
 	} else
 		old_pte = radix__pte_update(mm, addr, ptep, ~0ul, 0, 0);
 
-- 
2.17.0

* [PATCH 2/3] powerpc/64s/radix: avoid ptesync after set_pte and ptep_set_access_flags
From: Nicholas Piggin @ 2018-05-13  4:21 UTC
  To: linuxppc-dev; +Cc: Nicholas Piggin

The ISA suggests a ptesync after setting a pte, to prevent a table
walk initiated by a subsequent access from causing a spurious fault.
This may be an allowance for implementations to have page table walk
loads incoherent with their store queues.

However, there is no correctness problem with spurious faults -- the
kernel copes with these at any time, and the architecture requires the
pte to be re-loaded on fault, which would eventually find the updated
pte.

On POWER9 there does not appear to be a large window where this is a
problem, so as an optimisation, remove the costly ptesync from pte
updates. If an implementation benefits from the ptesync, it would
likely be better placed in update_mmu_cache, rather than in set_pte
etc., which is called for things like fork and mprotect.
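
For illustration, if some implementation did turn out to want the
ptesync, a hypothetical sketch of that placement (not part of this
series; update_mmu_cache() and radix_enabled() are existing kernel
interfaces, but this body is an assumption):

  /*
   * Hypothetical only -- run the ptesync on the fault path, where a
   * subsequent access is imminent, rather than in set_pte() or
   * ptep_set_access_flags(), which also run for fork and mprotect.
   */
  void update_mmu_cache(struct vm_area_struct *vma, unsigned long address,
                        pte_t *ptep)
  {
          if (radix_enabled())
                  asm volatile("ptesync" : : : "memory");
  }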

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/include/asm/book3s/64/radix.h | 2 --
 1 file changed, 2 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/radix.h b/arch/powerpc/include/asm/book3s/64/radix.h
index fcd92f9b6ec0..45bf1e1b1d33 100644
--- a/arch/powerpc/include/asm/book3s/64/radix.h
+++ b/arch/powerpc/include/asm/book3s/64/radix.h
@@ -209,7 +209,6 @@ static inline void radix__ptep_set_access_flags(struct mm_struct *mm,
 		__radix_pte_update(ptep, 0, new_pte);
 	} else
 		__radix_pte_update(ptep, 0, set);
-	asm volatile("ptesync" : : : "memory");
 }
 
 static inline int radix__pte_same(pte_t pte_a, pte_t pte_b)
@@ -226,7 +225,6 @@ static inline void radix__set_pte_at(struct mm_struct *mm, unsigned long addr,
 				 pte_t *ptep, pte_t pte, int percpu)
 {
 	*ptep = pte;
-	asm volatile("ptesync" : : : "memory");
 }
 
 static inline int radix__pmd_bad(pmd_t pmd)
-- 
2.17.0

* [PATCH 3/3] powerpc/64s/radix: optimise pte_update
From: Nicholas Piggin @ 2018-05-13  4:21 UTC
  To: linuxppc-dev; +Cc: Nicholas Piggin

Implementing pte_update with pte_xchg (which uses cmpxchg) is
inefficient. A single larx/stcx. works fine, with no need for the less
efficient cmpxchg sequence.

Then remove the memory barriers from the operation. There is a
requirement that TLB flushing load mm_cpumask after the store that
reduces pte permissions; the barrier providing this ordering is moved
into the TLB flush code. The ordering pattern is sketched below.
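
The required ordering is the usual Dekker-style pattern; a sketch of
the two sides (pseudocode, summarising the change below):

  /*
   * CPU A: flush side                 CPU B: switching to the mm
   * ------------------                --------------------------
   * store: clear/weaken pte           store: set CPU B in mm_cpumask
   * smp_mb()   <- added here          smp_mb()  (switch_mm_irqs_off)
   * load:  mm_cpumask(mm)             load:  pte (hardware table walk)
   *
   * Either CPU A sees CPU B in mm_cpumask and sends it a tlbie, or
   * CPU B's page table walk sees the updated pte.
   */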

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/include/asm/book3s/64/radix.h | 25 +++++++++++-----------
 arch/powerpc/mm/mmu_context.c              |  6 ++++--
 arch/powerpc/mm/tlb-radix.c                | 11 +++++++++-
 3 files changed, 27 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/radix.h b/arch/powerpc/include/asm/book3s/64/radix.h
index 45bf1e1b1d33..cc9437a542cc 100644
--- a/arch/powerpc/include/asm/book3s/64/radix.h
+++ b/arch/powerpc/include/asm/book3s/64/radix.h
@@ -127,20 +127,21 @@ extern void radix__mark_initmem_nx(void);
 static inline unsigned long __radix_pte_update(pte_t *ptep, unsigned long clr,
 					       unsigned long set)
 {
-	pte_t pte;
-	unsigned long old_pte, new_pte;
-
-	do {
-		pte = READ_ONCE(*ptep);
-		old_pte = pte_val(pte);
-		new_pte = (old_pte | set) & ~clr;
-
-	} while (!pte_xchg(ptep, __pte(old_pte), __pte(new_pte)));
-
-	return old_pte;
+	__be64 old_be, tmp_be;
+
+	__asm__ __volatile__(
+	"1:	ldarx	%0,0,%3		# pte_update\n"
+	"	andc	%1,%0,%5	\n"
+	"	or	%1,%1,%4	\n"
+	"	stdcx.	%1,0,%3		\n"
+	"	bne-	1b"
+	: "=&r" (old_be), "=&r" (tmp_be), "=m" (*ptep)
+	: "r" (ptep), "r" (cpu_to_be64(set)), "r" (cpu_to_be64(clr))
+	: "cc" );
+
+	return be64_to_cpu(old_be);
 }
 
-
 static inline unsigned long radix__pte_update(struct mm_struct *mm,
 					unsigned long addr,
 					pte_t *ptep, unsigned long clr,
diff --git a/arch/powerpc/mm/mmu_context.c b/arch/powerpc/mm/mmu_context.c
index 0ab297c4cfad..f84e14f23e50 100644
--- a/arch/powerpc/mm/mmu_context.c
+++ b/arch/powerpc/mm/mmu_context.c
@@ -57,8 +57,10 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 		 * in switch_slb(), and/or the store of paca->mm_ctx_id in
 		 * copy_mm_to_paca().
 		 *
-		 * On the read side the barrier is in pte_xchg(), which orders
-		 * the store to the PTE vs the load of mm_cpumask.
+		 * On the other side, the barrier is in mm/tlb-radix.c for
+		 * radix which orders earlier stores to clear the PTEs vs
+		 * the load of mm_cpumask. And pte_xchg which does the same
+		 * thing for hash.
 		 *
 		 * This full barrier is needed by membarrier when switching
 		 * between processes after store to rq->curr, before user-space
diff --git a/arch/powerpc/mm/tlb-radix.c b/arch/powerpc/mm/tlb-radix.c
index 55f93d66c8d2..b419702b1ba6 100644
--- a/arch/powerpc/mm/tlb-radix.c
+++ b/arch/powerpc/mm/tlb-radix.c
@@ -535,6 +535,11 @@ void radix__flush_tlb_mm(struct mm_struct *mm)
 		return;
 
 	preempt_disable();
+	/*
+	 * Order loads of mm_cpumask vs previous stores to clear ptes before
+	 * the invalidate. See barrier in switch_mm_irqs_off
+	 */
+	smp_mb();
 	if (!mm_is_thread_local(mm)) {
 		if (mm_is_singlethreaded(mm)) {
 			_tlbie_pid(pid, RIC_FLUSH_ALL);
@@ -560,6 +565,7 @@ void radix__flush_all_mm(struct mm_struct *mm)
 		return;
 
 	preempt_disable();
+	smp_mb(); /* see radix__flush_tlb_mm */
 	if (!mm_is_thread_local(mm)) {
 		_tlbie_pid(pid, RIC_FLUSH_ALL);
 		if (mm_is_singlethreaded(mm))
@@ -587,6 +593,7 @@ void radix__flush_tlb_page_psize(struct mm_struct *mm, unsigned long vmaddr,
 		return;
 
 	preempt_disable();
+	smp_mb(); /* see radix__flush_tlb_mm */
 	if (mm_is_thread_local(mm)) {
 		_tlbiel_va(vmaddr, pid, psize, RIC_FLUSH_TLB);
 	} else {
@@ -655,6 +662,7 @@ void radix__flush_tlb_range(struct vm_area_struct *vma, unsigned long start,
 		return;
 
 	preempt_disable();
+	smp_mb(); /* see radix__flush_tlb_mm */
 	if (mm_is_thread_local(mm)) {
 		local = true;
 		full = (end == TLB_FLUSH_ALL ||
@@ -820,6 +828,7 @@ static inline void __radix__flush_tlb_range_psize(struct mm_struct *mm,
 		return;
 
 	preempt_disable();
+	smp_mb(); /* see radix__flush_tlb_mm */
 	if (mm_is_thread_local(mm)) {
 		local = true;
 		full = (end == TLB_FLUSH_ALL ||
@@ -882,7 +891,7 @@ void radix__flush_tlb_collapsed_pmd(struct mm_struct *mm, unsigned long addr)
 
 	/* Otherwise first do the PWC, then iterate the pages. */
 	preempt_disable();
-
+	smp_mb(); /* see radix__flush_tlb_mm */
 	if (mm_is_thread_local(mm)) {
 		_tlbiel_va_range(addr, end, pid, PAGE_SIZE, mmu_virtual_psize, true);
 	} else {
-- 
2.17.0
