* [PATCH 00/10] Make use of v7 barrier variants in Linux
@ 2013-06-06 14:28 Will Deacon
  2013-06-06 14:28 ` [PATCH 01/10] ARM: mm: remove redundant dsb() prior to range TLB invalidation Will Deacon
                   ` (9 more replies)
  0 siblings, 10 replies; 14+ messages in thread
From: Will Deacon @ 2013-06-06 14:28 UTC (permalink / raw)
  To: linux-arm-kernel

Hello,

This patch series updates our barrier macros to make use of the
different variants introduced by the v7 architecture. This includes
both access type (store vs load/store) and shareability domain. There
is a dependency on my TLB patches, which I have included in the series
and which were most recently posted here:

  http://lists.infradead.org/pipermail/linux-arm-kernel/2013-May/169124.html
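
For reference, the option field names a shareability domain and can
optionally restrict the barrier to store accesses. A rough summary,
written using the macro form that patch 05 introduces (this is only an
illustration based on the ARM ARM, not code from the series):

	dsb();		/* same as dsb(sy): full system, loads and stores */
	dsb(st);	/* full system, but only waits for stores */
	dsb(ish);	/* inner-shareable domain, loads and stores */
	dsb(ishst);	/* inner-shareable domain, stores only */
	dsb(nsh);	/* non-shareable domain, i.e. the executing CPU */
	dsb(nshst);	/* non-shareable domain, stores only */
	dmb(ish);	/* as dsb(ish), but ordering rather than completion */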

With these patches applied, I see around 5% improvement on hackbench
scores running on my TC2 with both clusters enabled.

Since these changes have subtle memory-ordering implications, I've
avoided touching any cache-flushing operations or barrier code that is
used during things like CPU suspend/resume, where the CPU coming up/down
might have bits like actlr.smp clear. Maybe this is overkill, but it
reaches the point of diminishing returns if we start having
implementation-specific barrier options, so I've tried to keep it
general.

All feedback welcome,

Will


Will Deacon (10):
  ARM: mm: remove redundant dsb() prior to range TLB invalidation
  ARM: tlb: don't perform inner-shareable invalidation for local TLB ops
  ARM: tlb: don't bother with barriers for branch predictor maintenance
  ARM: tlb: don't perform inner-shareable invalidation for local BP ops
  ARM: barrier: allow options to be passed to memory barrier
    instructions
  ARM: spinlock: use inner-shareable dsb variant prior to sev
    instruction
  ARM: mm: use inner-shareable barriers for TLB and user cache
    operations
  ARM: tlb: reduce scope of barrier domains for TLB invalidation
  ARM: kvm: use inner-shareable barriers after TLB flushing
  ARM: mcpm: use -st dsb option prior to sev instructions

 arch/arm/common/mcpm_head.S      |   2 +-
 arch/arm/common/vlock.S          |   4 +-
 arch/arm/include/asm/assembler.h |   4 +-
 arch/arm/include/asm/barrier.h   |  32 ++++++------
 arch/arm/include/asm/spinlock.h  |   2 +-
 arch/arm/include/asm/switch_to.h |  10 ++++
 arch/arm/include/asm/tlbflush.h  | 105 ++++++++++++++++++++++++++++++++-------
 arch/arm/kernel/smp_tlb.c        |  10 ++--
 arch/arm/kvm/init.S              |   2 +-
 arch/arm/kvm/interrupts.S        |   4 +-
 arch/arm/mm/cache-v7.S           |   4 +-
 arch/arm/mm/context.c            |   6 +--
 arch/arm/mm/dma-mapping.c        |   1 -
 arch/arm/mm/proc-v7.S            |   2 +-
 arch/arm/mm/tlb-v7.S             |   8 +--
 15 files changed, 134 insertions(+), 62 deletions(-)

-- 
1.8.2.2


* [PATCH 01/10] ARM: mm: remove redundant dsb() prior to range TLB invalidation
  2013-06-06 14:28 [PATCH 00/10] Make use of v7 barrier variants in Linux Will Deacon
@ 2013-06-06 14:28 ` Will Deacon
  2013-06-06 14:28 ` [PATCH 02/10] ARM: tlb: don't perform inner-shareable invalidation for local TLB ops Will Deacon
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 14+ messages in thread
From: Will Deacon @ 2013-06-06 14:28 UTC (permalink / raw)
  To: linux-arm-kernel

The kernel TLB range invalidation functions already contain dsb
instructions before and after the maintenance, so there is no need to
introduce additional barriers.

Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 arch/arm/mm/dma-mapping.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
index ef3e0f3..5baabf7 100644
--- a/arch/arm/mm/dma-mapping.c
+++ b/arch/arm/mm/dma-mapping.c
@@ -455,7 +455,6 @@ static void __dma_remap(struct page *page, size_t size, pgprot_t prot)
 	unsigned end = start + size;
 
 	apply_to_page_range(&init_mm, start, size, __dma_update_pte, &prot);
-	dsb();
 	flush_tlb_kernel_range(start, end);
 }
 
-- 
1.8.2.2


* [PATCH 02/10] ARM: tlb: don't perform inner-shareable invalidation for local TLB ops
  2013-06-06 14:28 [PATCH 00/10] Make use of v7 barrier variants in Linux Will Deacon
  2013-06-06 14:28 ` [PATCH 01/10] ARM: mm: remove redundant dsb() prior to range TLB invalidation Will Deacon
@ 2013-06-06 14:28 ` Will Deacon
  2013-06-13 17:50   ` Jonathan Austin
  2013-06-06 14:28 ` [PATCH 03/10] ARM: tlb: don't bother with barriers for branch predictor maintenance Will Deacon
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 14+ messages in thread
From: Will Deacon @ 2013-06-06 14:28 UTC (permalink / raw)
  To: linux-arm-kernel

Inner-shareable TLB invalidation is typically more expensive than local
(non-shareable) invalidation, so performing the broadcasting for
local_flush_tlb_* operations is a waste of cycles and needlessly
clobbers entries in the TLBs of other CPUs.

This patch introduces __flush_tlb_* versions for many of the TLB
invalidation functions, which only respect inner-shareable variants of
the invalidation instructions. This allows us to modify the v7 SMP TLB
flags to include *both* inner-shareable and non-shareable operations and
then check the relevant flags depending on whether the operation is
local or not.

This gains us around 0.5% in hackbench scores for a dual-core A15, but I
would expect this to improve as more cores (and clusters) are added to
the equation.

Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reported-by: Albin Tonnerre <Albin.Tonnerre@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 arch/arm/include/asm/tlbflush.h | 67 ++++++++++++++++++++++++++++++++++++++---
 arch/arm/kernel/smp_tlb.c       |  8 ++---
 arch/arm/mm/context.c           |  6 +---
 3 files changed, 68 insertions(+), 13 deletions(-)

diff --git a/arch/arm/include/asm/tlbflush.h b/arch/arm/include/asm/tlbflush.h
index a3625d1..55b5e18 100644
--- a/arch/arm/include/asm/tlbflush.h
+++ b/arch/arm/include/asm/tlbflush.h
@@ -167,6 +167,8 @@
 #endif
 
 #define v7wbi_tlb_flags_smp	(TLB_WB | TLB_BARRIER | \
+				 TLB_V6_U_FULL | TLB_V6_U_PAGE | \
+				 TLB_V6_U_ASID | \
 				 TLB_V7_UIS_FULL | TLB_V7_UIS_PAGE | \
 				 TLB_V7_UIS_ASID | TLB_V7_UIS_BP)
 #define v7wbi_tlb_flags_up	(TLB_WB | TLB_DCLEAN | TLB_BARRIER | \
@@ -330,6 +332,21 @@ static inline void local_flush_tlb_all(void)
 	tlb_op(TLB_V4_U_FULL | TLB_V6_U_FULL, "c8, c7, 0", zero);
 	tlb_op(TLB_V4_D_FULL | TLB_V6_D_FULL, "c8, c6, 0", zero);
 	tlb_op(TLB_V4_I_FULL | TLB_V6_I_FULL, "c8, c5, 0", zero);
+
+	if (tlb_flag(TLB_BARRIER)) {
+		dsb();
+		isb();
+	}
+}
+
+static inline void __flush_tlb_all(void)
+{
+	const int zero = 0;
+	const unsigned int __tlb_flag = __cpu_tlb_flags;
+
+	if (tlb_flag(TLB_WB))
+		dsb();
+
 	tlb_op(TLB_V7_UIS_FULL, "c8, c3, 0", zero);
 
 	if (tlb_flag(TLB_BARRIER)) {
@@ -348,21 +365,32 @@ static inline void local_flush_tlb_mm(struct mm_struct *mm)
 		dsb();
 
 	if (possible_tlb_flags & (TLB_V4_U_FULL|TLB_V4_D_FULL|TLB_V4_I_FULL)) {
-		if (cpumask_test_cpu(get_cpu(), mm_cpumask(mm))) {
+		if (cpumask_test_cpu(smp_processor_id(), mm_cpumask(mm))) {
 			tlb_op(TLB_V4_U_FULL, "c8, c7, 0", zero);
 			tlb_op(TLB_V4_D_FULL, "c8, c6, 0", zero);
 			tlb_op(TLB_V4_I_FULL, "c8, c5, 0", zero);
 		}
-		put_cpu();
 	}
 
 	tlb_op(TLB_V6_U_ASID, "c8, c7, 2", asid);
 	tlb_op(TLB_V6_D_ASID, "c8, c6, 2", asid);
 	tlb_op(TLB_V6_I_ASID, "c8, c5, 2", asid);
+
+	if (tlb_flag(TLB_BARRIER))
+		dsb();
+}
+
+static inline void __flush_tlb_mm(struct mm_struct *mm)
+{
+	const unsigned int __tlb_flag = __cpu_tlb_flags;
+
+	if (tlb_flag(TLB_WB))
+		dsb();
+
 #ifdef CONFIG_ARM_ERRATA_720789
-	tlb_op(TLB_V7_UIS_ASID, "c8, c3, 0", zero);
+	tlb_op(TLB_V7_UIS_ASID, "c8, c3, 0", 0);
 #else
-	tlb_op(TLB_V7_UIS_ASID, "c8, c3, 2", asid);
+	tlb_op(TLB_V7_UIS_ASID, "c8, c3, 2", ASID(mm));
 #endif
 
 	if (tlb_flag(TLB_BARRIER))
@@ -392,6 +420,21 @@ local_flush_tlb_page(struct vm_area_struct *vma, unsigned long uaddr)
 	tlb_op(TLB_V6_U_PAGE, "c8, c7, 1", uaddr);
 	tlb_op(TLB_V6_D_PAGE, "c8, c6, 1", uaddr);
 	tlb_op(TLB_V6_I_PAGE, "c8, c5, 1", uaddr);
+
+	if (tlb_flag(TLB_BARRIER))
+		dsb();
+}
+
+static inline void
+__flush_tlb_page(struct vm_area_struct *vma, unsigned long uaddr)
+{
+	const unsigned int __tlb_flag = __cpu_tlb_flags;
+
+	uaddr = (uaddr & PAGE_MASK) | ASID(vma->vm_mm);
+
+	if (tlb_flag(TLB_WB))
+		dsb();
+
 #ifdef CONFIG_ARM_ERRATA_720789
 	tlb_op(TLB_V7_UIS_PAGE, "c8, c3, 3", uaddr & PAGE_MASK);
 #else
@@ -421,6 +464,22 @@ static inline void local_flush_tlb_kernel_page(unsigned long kaddr)
 	tlb_op(TLB_V6_U_PAGE, "c8, c7, 1", kaddr);
 	tlb_op(TLB_V6_D_PAGE, "c8, c6, 1", kaddr);
 	tlb_op(TLB_V6_I_PAGE, "c8, c5, 1", kaddr);
+
+	if (tlb_flag(TLB_BARRIER)) {
+		dsb();
+		isb();
+	}
+}
+
+static inline void __flush_tlb_kernel_page(unsigned long kaddr)
+{
+	const unsigned int __tlb_flag = __cpu_tlb_flags;
+
+	kaddr &= PAGE_MASK;
+
+	if (tlb_flag(TLB_WB))
+		dsb();
+
 	tlb_op(TLB_V7_UIS_PAGE, "c8, c3, 1", kaddr);
 
 	if (tlb_flag(TLB_BARRIER)) {
diff --git a/arch/arm/kernel/smp_tlb.c b/arch/arm/kernel/smp_tlb.c
index 9a52a07..cc299b5 100644
--- a/arch/arm/kernel/smp_tlb.c
+++ b/arch/arm/kernel/smp_tlb.c
@@ -135,7 +135,7 @@ void flush_tlb_all(void)
 	if (tlb_ops_need_broadcast())
 		on_each_cpu(ipi_flush_tlb_all, NULL, 1);
 	else
-		local_flush_tlb_all();
+		__flush_tlb_all();
 	broadcast_tlb_a15_erratum();
 }
 
@@ -144,7 +144,7 @@ void flush_tlb_mm(struct mm_struct *mm)
 	if (tlb_ops_need_broadcast())
 		on_each_cpu_mask(mm_cpumask(mm), ipi_flush_tlb_mm, mm, 1);
 	else
-		local_flush_tlb_mm(mm);
+		__flush_tlb_mm(mm);
 	broadcast_tlb_mm_a15_erratum(mm);
 }
 
@@ -157,7 +157,7 @@ void flush_tlb_page(struct vm_area_struct *vma, unsigned long uaddr)
 		on_each_cpu_mask(mm_cpumask(vma->vm_mm), ipi_flush_tlb_page,
 					&ta, 1);
 	} else
-		local_flush_tlb_page(vma, uaddr);
+		__flush_tlb_page(vma, uaddr);
 	broadcast_tlb_mm_a15_erratum(vma->vm_mm);
 }
 
@@ -168,7 +168,7 @@ void flush_tlb_kernel_page(unsigned long kaddr)
 		ta.ta_start = kaddr;
 		on_each_cpu(ipi_flush_tlb_kernel_page, &ta, 1);
 	} else
-		local_flush_tlb_kernel_page(kaddr);
+		__flush_tlb_kernel_page(kaddr);
 	broadcast_tlb_a15_erratum();
 }
 
diff --git a/arch/arm/mm/context.c b/arch/arm/mm/context.c
index 2ac3737..62c1ec5 100644
--- a/arch/arm/mm/context.c
+++ b/arch/arm/mm/context.c
@@ -134,10 +134,7 @@ static void flush_context(unsigned int cpu)
 	}
 
 	/* Queue a TLB invalidate and flush the I-cache if necessary. */
-	if (!tlb_ops_need_broadcast())
-		cpumask_set_cpu(cpu, &tlb_flush_pending);
-	else
-		cpumask_setall(&tlb_flush_pending);
+	cpumask_setall(&tlb_flush_pending);
 
 	if (icache_is_vivt_asid_tagged())
 		__flush_icache_all();
@@ -215,7 +212,6 @@ void check_and_switch_context(struct mm_struct *mm, struct task_struct *tsk)
 	if (cpumask_test_and_clear_cpu(cpu, &tlb_flush_pending)) {
 		local_flush_bp_all();
 		local_flush_tlb_all();
-		dummy_flush_tlb_a15_erratum();
 	}
 
 	atomic64_set(&per_cpu(active_asids, cpu), asid);
-- 
1.8.2.2


* [PATCH 03/10] ARM: tlb: don't bother with barriers for branch predictor maintenance
  2013-06-06 14:28 [PATCH 00/10] Make use of v7 barrier variants in Linux Will Deacon
  2013-06-06 14:28 ` [PATCH 01/10] ARM: mm: remove redundant dsb() prior to range TLB invalidation Will Deacon
  2013-06-06 14:28 ` [PATCH 02/10] ARM: tlb: don't perform inner-shareable invalidation for local TLB ops Will Deacon
@ 2013-06-06 14:28 ` Will Deacon
  2013-06-06 14:28 ` [PATCH 04/10] ARM: tlb: don't perform inner-shareable invalidation for local BP ops Will Deacon
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 14+ messages in thread
From: Will Deacon @ 2013-06-06 14:28 UTC (permalink / raw)
  To: linux-arm-kernel

Branch predictor maintenance is only required when we are either
changing the kernel's view of memory (switching tables completely) or
dealing with ASID rollover.

Both of these use-cases require subsequent TLB invalidation, which has
the relevant barrier instructions to ensure completion and visibility
of the maintenance, so this patch removes the instruction barrier from
[local_]flush_bp_all.
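
For example, on the ASID rollover path the branch predictor flush is
immediately followed by a full TLB flush, which already ends in dsb; isb
(sketch of check_and_switch_context() as it looks after the previous
patch, for illustration only):

	if (cpumask_test_and_clear_cpu(cpu, &tlb_flush_pending)) {
		local_flush_bp_all();	/* no barrier of its own... */
		local_flush_tlb_all();	/* ...this finishes with dsb; isb */
	}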

Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 arch/arm/include/asm/tlbflush.h | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/arm/include/asm/tlbflush.h b/arch/arm/include/asm/tlbflush.h
index 55b5e18..e111027 100644
--- a/arch/arm/include/asm/tlbflush.h
+++ b/arch/arm/include/asm/tlbflush.h
@@ -488,6 +488,10 @@ static inline void __flush_tlb_kernel_page(unsigned long kaddr)
 	}
 }
 
+/*
+ * Branch predictor maintenance is paired with full TLB invalidation, so
+ * there is no need for any barriers here.
+ */
 static inline void local_flush_bp_all(void)
 {
 	const int zero = 0;
@@ -497,9 +501,6 @@ static inline void local_flush_bp_all(void)
 		asm("mcr p15, 0, %0, c7, c1, 6" : : "r" (zero));
 	else if (tlb_flag(TLB_V6_BP))
 		asm("mcr p15, 0, %0, c7, c5, 6" : : "r" (zero));
-
-	if (tlb_flag(TLB_BARRIER))
-		isb();
 }
 
 #ifdef CONFIG_ARM_ERRATA_798181
-- 
1.8.2.2


* [PATCH 04/10] ARM: tlb: don't perform inner-shareable invalidation for local BP ops
  2013-06-06 14:28 [PATCH 00/10] Make use of v7 barrier variants in Linux Will Deacon
                   ` (2 preceding siblings ...)
  2013-06-06 14:28 ` [PATCH 03/10] ARM: tlb: don't bother with barriers for branch predictor maintenance Will Deacon
@ 2013-06-06 14:28 ` Will Deacon
  2013-06-06 14:28 ` [PATCH 05/10] ARM: barrier: allow options to be passed to memory barrier instructions Will Deacon
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 14+ messages in thread
From: Will Deacon @ 2013-06-06 14:28 UTC (permalink / raw)
  To: linux-arm-kernel

Now that the ASID allocator doesn't require inner-shareable maintenance,
we can convert the local_flush_bp_all function to perform only
non-shareable flushing, in a similar manner to the TLB invalidation
routines.

Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 arch/arm/include/asm/tlbflush.h | 13 ++++++++++---
 arch/arm/kernel/smp_tlb.c       |  2 +-
 2 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/arch/arm/include/asm/tlbflush.h b/arch/arm/include/asm/tlbflush.h
index e111027..0bdd5d2d 100644
--- a/arch/arm/include/asm/tlbflush.h
+++ b/arch/arm/include/asm/tlbflush.h
@@ -168,7 +168,7 @@
 
 #define v7wbi_tlb_flags_smp	(TLB_WB | TLB_BARRIER | \
 				 TLB_V6_U_FULL | TLB_V6_U_PAGE | \
-				 TLB_V6_U_ASID | \
+				 TLB_V6_U_ASID | TLB_V6_BP | \
 				 TLB_V7_UIS_FULL | TLB_V7_UIS_PAGE | \
 				 TLB_V7_UIS_ASID | TLB_V7_UIS_BP)
 #define v7wbi_tlb_flags_up	(TLB_WB | TLB_DCLEAN | TLB_BARRIER | \
@@ -492,14 +492,21 @@ static inline void __flush_tlb_kernel_page(unsigned long kaddr)
  * Branch predictor maintenance is paired with full TLB invalidation, so
  * there is no need for any barriers here.
  */
-static inline void local_flush_bp_all(void)
+static inline void __flush_bp_all(void)
 {
 	const int zero = 0;
 	const unsigned int __tlb_flag = __cpu_tlb_flags;
 
 	if (tlb_flag(TLB_V7_UIS_BP))
 		asm("mcr p15, 0, %0, c7, c1, 6" : : "r" (zero));
-	else if (tlb_flag(TLB_V6_BP))
+}
+
+static inline void local_flush_bp_all(void)
+{
+	const int zero = 0;
+	const unsigned int __tlb_flag = __cpu_tlb_flags;
+
+	if (tlb_flag(TLB_V6_BP))
 		asm("mcr p15, 0, %0, c7, c5, 6" : : "r" (zero));
 }
 
diff --git a/arch/arm/kernel/smp_tlb.c b/arch/arm/kernel/smp_tlb.c
index cc299b5..5cb5500 100644
--- a/arch/arm/kernel/smp_tlb.c
+++ b/arch/arm/kernel/smp_tlb.c
@@ -204,5 +204,5 @@ void flush_bp_all(void)
 	if (tlb_ops_need_broadcast())
 		on_each_cpu(ipi_flush_bp_all, NULL, 1);
 	else
-		local_flush_bp_all();
+		__flush_bp_all();
 }
-- 
1.8.2.2


* [PATCH 05/10] ARM: barrier: allow options to be passed to memory barrier instructions
  2013-06-06 14:28 [PATCH 00/10] Make use of v7 barrier variants in Linux Will Deacon
                   ` (3 preceding siblings ...)
  2013-06-06 14:28 ` [PATCH 04/10] ARM: tlb: don't perform inner-shareable invalidation for local BP ops Will Deacon
@ 2013-06-06 14:28 ` Will Deacon
  2013-06-06 14:28 ` [PATCH 06/10] ARM: spinlock: use inner-shareable dsb variant prior to sev instruction Will Deacon
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 14+ messages in thread
From: Will Deacon @ 2013-06-06 14:28 UTC (permalink / raw)
  To: linux-arm-kernel

On ARMv7, the memory barrier instructions take an optional `option'
field which can be used to constrain the effects of a memory barrier
based on shareability and access type.

This patch allows the caller to pass these options if required, and
updates the smp_*() barriers to request inner-shareable barriers,
affecting only stores for the _wmb variant. wmb() is also changed to
use the -st version of dsb.
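
As a contrived illustration of why the weaker variants are sufficient
for the smp_*() barriers, consider message passing between two CPUs in
the same inner-shareable domain. This sketch (made-up types and function
names) is for illustration only and is not part of the patch:

	struct message {
		int data;
		int ready;
	};

	void publish(struct message *m)		/* CPU0 */
	{
		m->data = 42;
		smp_wmb();		/* now dmb(ishst): order the two stores */
		m->ready = 1;
	}

	int consume(struct message *m)		/* CPU1 */
	{
		if (!m->ready)
			return -1;
		smp_rmb();		/* now dmb(ish), via smp_mb() */
		return m->data;
	}

Neither side needs the barrier to reach devices or the outer-shareable
domain, and the writer only needs its stores ordered, which is why
smp_wmb() can be relaxed to dmb(ishst). wmb(), by contrast, still has to
cover device memory, so it keeps the full-system scope and only drops
down to the -st variant.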

Reported-by: Albin Tonnerre <albin.tonnerre@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 arch/arm/include/asm/assembler.h |  4 ++--
 arch/arm/include/asm/barrier.h   | 32 ++++++++++++++++----------------
 2 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/arch/arm/include/asm/assembler.h b/arch/arm/include/asm/assembler.h
index 05ee9ee..863b280 100644
--- a/arch/arm/include/asm/assembler.h
+++ b/arch/arm/include/asm/assembler.h
@@ -212,9 +212,9 @@
 #ifdef CONFIG_SMP
 #if __LINUX_ARM_ARCH__ >= 7
 	.ifeqs "\mode","arm"
-	ALT_SMP(dmb)
+	ALT_SMP(dmb	ish)
 	.else
-	ALT_SMP(W(dmb))
+	ALT_SMP(W(dmb)	ish)
 	.endif
 #elif __LINUX_ARM_ARCH__ == 6
 	ALT_SMP(mcr	p15, 0, r0, c7, c10, 5)	@ dmb
diff --git a/arch/arm/include/asm/barrier.h b/arch/arm/include/asm/barrier.h
index 8dcd9c7..60f15e2 100644
--- a/arch/arm/include/asm/barrier.h
+++ b/arch/arm/include/asm/barrier.h
@@ -14,27 +14,27 @@
 #endif
 
 #if __LINUX_ARM_ARCH__ >= 7
-#define isb() __asm__ __volatile__ ("isb" : : : "memory")
-#define dsb() __asm__ __volatile__ ("dsb" : : : "memory")
-#define dmb() __asm__ __volatile__ ("dmb" : : : "memory")
+#define isb(option) __asm__ __volatile__ ("isb " #option : : : "memory")
+#define dsb(option) __asm__ __volatile__ ("dsb " #option : : : "memory")
+#define dmb(option) __asm__ __volatile__ ("dmb " #option : : : "memory")
 #elif defined(CONFIG_CPU_XSC3) || __LINUX_ARM_ARCH__ == 6
-#define isb() __asm__ __volatile__ ("mcr p15, 0, %0, c7, c5, 4" \
+#define isb(x) __asm__ __volatile__ ("mcr p15, 0, %0, c7, c5, 4" \
 				    : : "r" (0) : "memory")
-#define dsb() __asm__ __volatile__ ("mcr p15, 0, %0, c7, c10, 4" \
+#define dsb(x) __asm__ __volatile__ ("mcr p15, 0, %0, c7, c10, 4" \
 				    : : "r" (0) : "memory")
-#define dmb() __asm__ __volatile__ ("mcr p15, 0, %0, c7, c10, 5" \
+#define dmb(x) __asm__ __volatile__ ("mcr p15, 0, %0, c7, c10, 5" \
 				    : : "r" (0) : "memory")
 #elif defined(CONFIG_CPU_FA526)
-#define isb() __asm__ __volatile__ ("mcr p15, 0, %0, c7, c5, 4" \
+#define isb(x) __asm__ __volatile__ ("mcr p15, 0, %0, c7, c5, 4" \
 				    : : "r" (0) : "memory")
-#define dsb() __asm__ __volatile__ ("mcr p15, 0, %0, c7, c10, 4" \
+#define dsb(x) __asm__ __volatile__ ("mcr p15, 0, %0, c7, c10, 4" \
 				    : : "r" (0) : "memory")
-#define dmb() __asm__ __volatile__ ("" : : : "memory")
+#define dmb(x) __asm__ __volatile__ ("" : : : "memory")
 #else
-#define isb() __asm__ __volatile__ ("" : : : "memory")
-#define dsb() __asm__ __volatile__ ("mcr p15, 0, %0, c7, c10, 4" \
+#define isb(x) __asm__ __volatile__ ("" : : : "memory")
+#define dsb(x) __asm__ __volatile__ ("mcr p15, 0, %0, c7, c10, 4" \
 				    : : "r" (0) : "memory")
-#define dmb() __asm__ __volatile__ ("" : : : "memory")
+#define dmb(x) __asm__ __volatile__ ("" : : : "memory")
 #endif
 
 #ifdef CONFIG_ARCH_HAS_BARRIERS
@@ -42,7 +42,7 @@
 #elif defined(CONFIG_ARM_DMA_MEM_BUFFERABLE) || defined(CONFIG_SMP)
 #define mb()		do { dsb(); outer_sync(); } while (0)
 #define rmb()		dsb()
-#define wmb()		mb()
+#define wmb()		do { dsb(st); outer_sync(); } while (0)
 #else
 #define mb()		barrier()
 #define rmb()		barrier()
@@ -54,9 +54,9 @@
 #define smp_rmb()	barrier()
 #define smp_wmb()	barrier()
 #else
-#define smp_mb()	dmb()
-#define smp_rmb()	dmb()
-#define smp_wmb()	dmb()
+#define smp_mb()	dmb(ish)
+#define smp_rmb()	smp_mb()
+#define smp_wmb()	dmb(ishst)
 #endif
 
 #define read_barrier_depends()		do { } while(0)
-- 
1.8.2.2


* [PATCH 06/10] ARM: spinlock: use inner-shareable dsb variant prior to sev instruction
  2013-06-06 14:28 [PATCH 00/10] Make use of v7 barrier variants in Linux Will Deacon
                   ` (4 preceding siblings ...)
  2013-06-06 14:28 ` [PATCH 05/10] ARM: barrier: allow options to be passed to memory barrier instructions Will Deacon
@ 2013-06-06 14:28 ` Will Deacon
  2013-06-06 14:28 ` [PATCH 07/10] ARM: mm: use inner-shareable barriers for TLB and user cache operations Will Deacon
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 14+ messages in thread
From: Will Deacon @ 2013-06-06 14:28 UTC (permalink / raw)
  To: linux-arm-kernel

When unlocking a spinlock, we use the sev instruction to signal other
CPUs waiting on the lock. Since sev is not a memory access instruction,
we require a dsb in order to ensure that the sev is not issued ahead
of the store placing the lock in an unlocked state.

However, as sev is only concerned with other processors in a
multiprocessor system, we can restrict the scope of the preceding dsb
to the inner-shareable domain. Furthermore, we can restrict the scope to
consider only stores, since there are no independent loads on the unlock
path.

A side-effect of this change is that a spin_unlock operation no longer
forces completion of pending TLB invalidation, something which we rely
on when unlocking runqueues to ensure that CPU migration during TLB
maintenance routines doesn't cause us to continue before the operation
has completed.

This patch adds the -ishst suffix to the ARMv7 definition of dsb_sev()
and adds an inner-shareable dsb to the context-switch path when running
a preemptible, SMP, v7 kernel.
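
To make the subtlety concrete, the unlock path ends up looking roughly
like this (simplified sketch of the ticket-lock code, for illustration
only):

	static inline void arch_spin_unlock(arch_spinlock_t *lock)
	{
		smp_mb();
		lock->tickets.owner++;	/* the store that releases the lock */
		dsb_sev();		/* now "dsb ishst; sev" */
	}

The dsb ishst guarantees that the owner update is visible to the other
CPUs before the sev wakes them, but it no longer waits for any
outstanding TLB maintenance, hence the dsb(ish) added to
finish_arch_switch() below.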

Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 arch/arm/include/asm/spinlock.h  |  2 +-
 arch/arm/include/asm/switch_to.h | 10 ++++++++++
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/arch/arm/include/asm/spinlock.h b/arch/arm/include/asm/spinlock.h
index 6220e9f..5a0261e 100644
--- a/arch/arm/include/asm/spinlock.h
+++ b/arch/arm/include/asm/spinlock.h
@@ -46,7 +46,7 @@ static inline void dsb_sev(void)
 {
 #if __LINUX_ARM_ARCH__ >= 7
 	__asm__ __volatile__ (
-		"dsb\n"
+		"dsb ishst\n"
 		SEV
 	);
 #else
diff --git a/arch/arm/include/asm/switch_to.h b/arch/arm/include/asm/switch_to.h
index fa09e6b..c99e259 100644
--- a/arch/arm/include/asm/switch_to.h
+++ b/arch/arm/include/asm/switch_to.h
@@ -4,6 +4,16 @@
 #include <linux/thread_info.h>
 
 /*
+ * For v7 SMP cores running a preemptible kernel we may be pre-empted
+ * during a TLB maintenance operation, so execute an inner-shareable dsb
+ * to ensure that the maintenance completes in case we migrate to another
+ * CPU.
+ */
+#if defined(CONFIG_PREEMPT) && defined(CONFIG_SMP) && defined(CONFIG_CPU_V7)
+#define finish_arch_switch(prev)	dsb(ish)
+#endif
+
+/*
  * switch_to(prev, next) should switch from task `prev' to `next'
  * `prev' will never be the same as `next'.  schedule() itself
  * contains the memory barrier to tell GCC not to cache `current'.
-- 
1.8.2.2


* [PATCH 07/10] ARM: mm: use inner-shareable barriers for TLB and user cache operations
  2013-06-06 14:28 [PATCH 00/10] Make use of v7 barrier variants in Linux Will Deacon
                   ` (5 preceding siblings ...)
  2013-06-06 14:28 ` [PATCH 06/10] ARM: spinlock: use inner-shareable dsb variant prior to sev instruction Will Deacon
@ 2013-06-06 14:28 ` Will Deacon
  2013-06-06 14:28 ` [PATCH 08/10] ARM: tlb: reduce scope of barrier domains for TLB invalidation Will Deacon
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 14+ messages in thread
From: Will Deacon @ 2013-06-06 14:28 UTC (permalink / raw)
  To: linux-arm-kernel

System-wide barriers aren't required for situations where we only need
to make visibility and ordering guarantees in the inner-shareable domain
(i.e. we are not dealing with devices or potentially incoherent CPUs).

This patch changes the v7 TLB operations, coherent_user_range and
dcache_clean_area functions to use inner-shareable barriers. For cache
maintenance, only the store access type is required to ensure completion.

Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 arch/arm/mm/cache-v7.S | 4 ++--
 arch/arm/mm/proc-v7.S  | 2 +-
 arch/arm/mm/tlb-v7.S   | 8 ++++----
 3 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/arch/arm/mm/cache-v7.S b/arch/arm/mm/cache-v7.S
index 15451ee..44f5a6a 100644
--- a/arch/arm/mm/cache-v7.S
+++ b/arch/arm/mm/cache-v7.S
@@ -274,7 +274,7 @@ ENTRY(v7_coherent_user_range)
 	add	r12, r12, r2
 	cmp	r12, r1
 	blo	1b
-	dsb
+	dsb	ishst
 	icache_line_size r2, r3
 	sub	r3, r2, #1
 	bic	r12, r0, r3
@@ -286,7 +286,7 @@ ENTRY(v7_coherent_user_range)
 	mov	r0, #0
 	ALT_SMP(mcr	p15, 0, r0, c7, c1, 6)	@ invalidate BTB Inner Shareable
 	ALT_UP(mcr	p15, 0, r0, c7, c5, 6)	@ invalidate BTB
-	dsb
+	dsb	ishst
 	isb
 	mov	pc, lr
 
diff --git a/arch/arm/mm/proc-v7.S b/arch/arm/mm/proc-v7.S
index 2c73a73..d19ddc0 100644
--- a/arch/arm/mm/proc-v7.S
+++ b/arch/arm/mm/proc-v7.S
@@ -82,7 +82,7 @@ ENTRY(cpu_v7_dcache_clean_area)
 	add	r0, r0, r2
 	subs	r1, r1, r2
 	bhi	1b
-	dsb
+	dsb	ishst
 	mov	pc, lr
 ENDPROC(cpu_v7_dcache_clean_area)
 
diff --git a/arch/arm/mm/tlb-v7.S b/arch/arm/mm/tlb-v7.S
index ea94765..3553087 100644
--- a/arch/arm/mm/tlb-v7.S
+++ b/arch/arm/mm/tlb-v7.S
@@ -35,7 +35,7 @@
 ENTRY(v7wbi_flush_user_tlb_range)
 	vma_vm_mm r3, r2			@ get vma->vm_mm
 	mmid	r3, r3				@ get vm_mm->context.id
-	dsb
+	dsb	ish
 	mov	r0, r0, lsr #PAGE_SHIFT		@ align address
 	mov	r1, r1, lsr #PAGE_SHIFT
 	asid	r3, r3				@ mask ASID
@@ -56,7 +56,7 @@ ENTRY(v7wbi_flush_user_tlb_range)
 	add	r0, r0, #PAGE_SZ
 	cmp	r0, r1
 	blo	1b
-	dsb
+	dsb	ish
 	mov	pc, lr
 ENDPROC(v7wbi_flush_user_tlb_range)
 
@@ -69,7 +69,7 @@ ENDPROC(v7wbi_flush_user_tlb_range)
  *	- end   - end address (exclusive, may not be aligned)
  */
 ENTRY(v7wbi_flush_kern_tlb_range)
-	dsb
+	dsb	ish
 	mov	r0, r0, lsr #PAGE_SHIFT		@ align address
 	mov	r1, r1, lsr #PAGE_SHIFT
 	mov	r0, r0, lsl #PAGE_SHIFT
@@ -84,7 +84,7 @@ ENTRY(v7wbi_flush_kern_tlb_range)
 	add	r0, r0, #PAGE_SZ
 	cmp	r0, r1
 	blo	1b
-	dsb
+	dsb	ish
 	isb
 	mov	pc, lr
 ENDPROC(v7wbi_flush_kern_tlb_range)
-- 
1.8.2.2


* [PATCH 08/10] ARM: tlb: reduce scope of barrier domains for TLB invalidation
  2013-06-06 14:28 [PATCH 00/10] Make use of v7 barrier variants in Linux Will Deacon
                   ` (6 preceding siblings ...)
  2013-06-06 14:28 ` [PATCH 07/10] ARM: mm: use inner-shareable barriers for TLB and user cache operations Will Deacon
@ 2013-06-06 14:28 ` Will Deacon
  2013-06-06 14:28 ` [PATCH 09/10] ARM: kvm: use inner-shareable barriers after TLB flushing Will Deacon
  2013-06-06 14:28 ` [PATCH 10/10] ARM: mcpm: use -st dsb option prior to sev instructions Will Deacon
  9 siblings, 0 replies; 14+ messages in thread
From: Will Deacon @ 2013-06-06 14:28 UTC (permalink / raw)
  To: linux-arm-kernel

Our TLB invalidation routines may require a barrier before the
maintenance (in order to ensure pending page table writes are visible to
the hardware walker) and barriers afterwards (in order to ensure
completion of the maintenance and visibility in the instruction stream).

Whilst this is expensive, the cost can be reduced somewhat by reducing
the scope of the barrier instructions:

  - The barrier before only needs to apply to stores (pte writes)
  - Local ops are required only to affect the non-shareable domain
  - Global ops are required only to affect the inner-shareable domain

This patch makes these changes for the TLB flushing code.
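
In summary, the barrier sequences around the invalidation end up looking
like this (a sketch of the resulting pattern, not a literal excerpt):

	/* local ops, e.g. local_flush_tlb_*(): this CPU only */
	dsb(nshst);	/* page table stores visible to the local walker */
	/* ... non-shareable TLBI ... */
	dsb(nsh);	/* maintenance complete on this CPU */
	isb();		/* where needed, e.g. for kernel mappings */

	/* broadcast ops, e.g. flush_tlb_*() via the __flush_tlb_*() helpers */
	dsb(ishst);	/* page table stores visible to all walkers in the domain */
	/* ... inner-shareable TLBI ... */
	dsb(ish);	/* maintenance complete across the inner-shareable domain */
	isb();		/* where needed, e.g. for kernel mappings */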

Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 arch/arm/include/asm/tlbflush.h | 36 ++++++++++++++++++------------------
 1 file changed, 18 insertions(+), 18 deletions(-)

diff --git a/arch/arm/include/asm/tlbflush.h b/arch/arm/include/asm/tlbflush.h
index 0bdd5d2d..77350a2 100644
--- a/arch/arm/include/asm/tlbflush.h
+++ b/arch/arm/include/asm/tlbflush.h
@@ -327,14 +327,14 @@ static inline void local_flush_tlb_all(void)
 	const unsigned int __tlb_flag = __cpu_tlb_flags;
 
 	if (tlb_flag(TLB_WB))
-		dsb();
+		dsb(nshst);
 
 	tlb_op(TLB_V4_U_FULL | TLB_V6_U_FULL, "c8, c7, 0", zero);
 	tlb_op(TLB_V4_D_FULL | TLB_V6_D_FULL, "c8, c6, 0", zero);
 	tlb_op(TLB_V4_I_FULL | TLB_V6_I_FULL, "c8, c5, 0", zero);
 
 	if (tlb_flag(TLB_BARRIER)) {
-		dsb();
+		dsb(nsh);
 		isb();
 	}
 }
@@ -345,12 +345,12 @@ static inline void __flush_tlb_all(void)
 	const unsigned int __tlb_flag = __cpu_tlb_flags;
 
 	if (tlb_flag(TLB_WB))
-		dsb();
+		dsb(ishst);
 
 	tlb_op(TLB_V7_UIS_FULL, "c8, c3, 0", zero);
 
 	if (tlb_flag(TLB_BARRIER)) {
-		dsb();
+		dsb(ish);
 		isb();
 	}
 }
@@ -362,7 +362,7 @@ static inline void local_flush_tlb_mm(struct mm_struct *mm)
 	const unsigned int __tlb_flag = __cpu_tlb_flags;
 
 	if (tlb_flag(TLB_WB))
-		dsb();
+		dsb(nshst);
 
 	if (possible_tlb_flags & (TLB_V4_U_FULL|TLB_V4_D_FULL|TLB_V4_I_FULL)) {
 		if (cpumask_test_cpu(smp_processor_id(), mm_cpumask(mm))) {
@@ -377,7 +377,7 @@ static inline void local_flush_tlb_mm(struct mm_struct *mm)
 	tlb_op(TLB_V6_I_ASID, "c8, c5, 2", asid);
 
 	if (tlb_flag(TLB_BARRIER))
-		dsb();
+		dsb(nsh);
 }
 
 static inline void __flush_tlb_mm(struct mm_struct *mm)
@@ -385,7 +385,7 @@ static inline void __flush_tlb_mm(struct mm_struct *mm)
 	const unsigned int __tlb_flag = __cpu_tlb_flags;
 
 	if (tlb_flag(TLB_WB))
-		dsb();
+		dsb(ishst);
 
 #ifdef CONFIG_ARM_ERRATA_720789
 	tlb_op(TLB_V7_UIS_ASID, "c8, c3, 0", 0);
@@ -394,7 +394,7 @@ static inline void __flush_tlb_mm(struct mm_struct *mm)
 #endif
 
 	if (tlb_flag(TLB_BARRIER))
-		dsb();
+		dsb(ish);
 }
 
 static inline void
@@ -406,7 +406,7 @@ local_flush_tlb_page(struct vm_area_struct *vma, unsigned long uaddr)
 	uaddr = (uaddr & PAGE_MASK) | ASID(vma->vm_mm);
 
 	if (tlb_flag(TLB_WB))
-		dsb();
+		dsb(nshst);
 
 	if (possible_tlb_flags & (TLB_V4_U_PAGE|TLB_V4_D_PAGE|TLB_V4_I_PAGE|TLB_V4_I_FULL) &&
 	    cpumask_test_cpu(smp_processor_id(), mm_cpumask(vma->vm_mm))) {
@@ -422,7 +422,7 @@ local_flush_tlb_page(struct vm_area_struct *vma, unsigned long uaddr)
 	tlb_op(TLB_V6_I_PAGE, "c8, c5, 1", uaddr);
 
 	if (tlb_flag(TLB_BARRIER))
-		dsb();
+		dsb(nsh);
 }
 
 static inline void
@@ -433,7 +433,7 @@ __flush_tlb_page(struct vm_area_struct *vma, unsigned long uaddr)
 	uaddr = (uaddr & PAGE_MASK) | ASID(vma->vm_mm);
 
 	if (tlb_flag(TLB_WB))
-		dsb();
+		dsb(ishst);
 
 #ifdef CONFIG_ARM_ERRATA_720789
 	tlb_op(TLB_V7_UIS_PAGE, "c8, c3, 3", uaddr & PAGE_MASK);
@@ -442,7 +442,7 @@ __flush_tlb_page(struct vm_area_struct *vma, unsigned long uaddr)
 #endif
 
 	if (tlb_flag(TLB_BARRIER))
-		dsb();
+		dsb(ish);
 }
 
 static inline void local_flush_tlb_kernel_page(unsigned long kaddr)
@@ -453,7 +453,7 @@ static inline void local_flush_tlb_kernel_page(unsigned long kaddr)
 	kaddr &= PAGE_MASK;
 
 	if (tlb_flag(TLB_WB))
-		dsb();
+		dsb(nshst);
 
 	tlb_op(TLB_V4_U_PAGE, "c8, c7, 1", kaddr);
 	tlb_op(TLB_V4_D_PAGE, "c8, c6, 1", kaddr);
@@ -466,7 +466,7 @@ static inline void local_flush_tlb_kernel_page(unsigned long kaddr)
 	tlb_op(TLB_V6_I_PAGE, "c8, c5, 1", kaddr);
 
 	if (tlb_flag(TLB_BARRIER)) {
-		dsb();
+		dsb(nsh);
 		isb();
 	}
 }
@@ -478,12 +478,12 @@ static inline void __flush_tlb_kernel_page(unsigned long kaddr)
 	kaddr &= PAGE_MASK;
 
 	if (tlb_flag(TLB_WB))
-		dsb();
+		dsb(ishst);
 
 	tlb_op(TLB_V7_UIS_PAGE, "c8, c3, 1", kaddr);
 
 	if (tlb_flag(TLB_BARRIER)) {
-		dsb();
+		dsb(ish);
 		isb();
 	}
 }
@@ -517,7 +517,7 @@ static inline void dummy_flush_tlb_a15_erratum(void)
 	 * Dummy TLBIMVAIS. Using the unmapped address 0 and ASID 0.
 	 */
 	asm("mcr p15, 0, %0, c8, c3, 1" : : "r" (0));
-	dsb();
+	dsb(ish);
 }
 #else
 static inline void dummy_flush_tlb_a15_erratum(void)
@@ -546,7 +546,7 @@ static inline void flush_pmd_entry(void *pmd)
 	tlb_l2_op(TLB_L2CLEAN_FR, "c15, c9, 1  @ L2 flush_pmd", pmd);
 
 	if (tlb_flag(TLB_WB))
-		dsb();
+		dsb(ishst);
 }
 
 static inline void clean_pmd_entry(void *pmd)
-- 
1.8.2.2


* [PATCH 09/10] ARM: kvm: use inner-shareable barriers after TLB flushing
  2013-06-06 14:28 [PATCH 00/10] Make use of v7 barrier variants in Linux Will Deacon
                   ` (7 preceding siblings ...)
  2013-06-06 14:28 ` [PATCH 08/10] ARM: tlb: reduce scope of barrier domains for TLB invalidation Will Deacon
@ 2013-06-06 14:28 ` Will Deacon
  2013-06-06 14:28 ` [PATCH 10/10] ARM: mcpm: use -st dsb option prior to sev instructions Will Deacon
  9 siblings, 0 replies; 14+ messages in thread
From: Will Deacon @ 2013-06-06 14:28 UTC (permalink / raw)
  To: linux-arm-kernel

When flushing the TLB at PL2 in response to remapping at stage-2 or VMID
rollover, we have a dsb instruction to ensure completion of the command
before continuing.

Since we only care about other processors for TLB invalidation, use the
inner-shareable variant of the dsb instruction instead.

Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 arch/arm/kvm/init.S       | 2 +-
 arch/arm/kvm/interrupts.S | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/arm/kvm/init.S b/arch/arm/kvm/init.S
index f048338..1b9844d 100644
--- a/arch/arm/kvm/init.S
+++ b/arch/arm/kvm/init.S
@@ -142,7 +142,7 @@ target:	@ We're now in the trampoline code, switch page tables
 
 	@ Invalidate the old TLBs
 	mcr	p15, 4, r0, c8, c7, 0	@ TLBIALLH
-	dsb
+	dsb	ish
 
 	eret
 
diff --git a/arch/arm/kvm/interrupts.S b/arch/arm/kvm/interrupts.S
index f7793df..dfb5dcc 100644
--- a/arch/arm/kvm/interrupts.S
+++ b/arch/arm/kvm/interrupts.S
@@ -54,7 +54,7 @@ ENTRY(__kvm_tlb_flush_vmid_ipa)
 	mcrr	p15, 6, r2, r3, c2	@ Write VTTBR
 	isb
 	mcr     p15, 0, r0, c8, c3, 0	@ TLBIALLIS (rt ignored)
-	dsb
+	dsb	ish
 	isb
 	mov	r2, #0
 	mov	r3, #0
@@ -78,7 +78,7 @@ ENTRY(__kvm_flush_vm_context)
 	mcr     p15, 4, r0, c8, c3, 4
 	/* Invalidate instruction caches Inner Shareable (ICIALLUIS) */
 	mcr     p15, 0, r0, c7, c1, 0
-	dsb
+	dsb	ish
 	isb				@ Not necessary if followed by eret
 
 	bx	lr
-- 
1.8.2.2


* [PATCH 10/10] ARM: mcpm: use -st dsb option prior to sev instructions
  2013-06-06 14:28 [PATCH 00/10] Make use of v7 barrier variants in Linux Will Deacon
                   ` (8 preceding siblings ...)
  2013-06-06 14:28 ` [PATCH 09/10] ARM: kvm: use inner-shareable barriers after TLB flushing Will Deacon
@ 2013-06-06 14:28 ` Will Deacon
  2013-06-07  4:15   ` Nicolas Pitre
  9 siblings, 1 reply; 14+ messages in thread
From: Will Deacon @ 2013-06-06 14:28 UTC (permalink / raw)
  To: linux-arm-kernel

In a similar manner to our spinlock implementation, mcpm uses sev to
wake up cores waiting on a lock when the lock is unlocked. In order to
ensure that the final write unlocking the lock is visible, a dsb
instruction is executed immediately prior to the sev.

This patch changes these dsbs to use the -st option, since we only
require that the store unlocking the lock is made visible.

Reviewed-by: Dave Martin <dave.martin@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 arch/arm/common/mcpm_head.S | 2 +-
 arch/arm/common/vlock.S     | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/arm/common/mcpm_head.S b/arch/arm/common/mcpm_head.S
index 8178705..5cdf619 100644
--- a/arch/arm/common/mcpm_head.S
+++ b/arch/arm/common/mcpm_head.S
@@ -151,7 +151,7 @@ mcpm_setup_leave:
 
 	mov	r0, #INBOUND_NOT_COMING_UP
 	strb	r0, [r8, #MCPM_SYNC_CLUSTER_INBOUND]
-	dsb
+	dsb	st
 	sev
 
 	mov	r0, r11
diff --git a/arch/arm/common/vlock.S b/arch/arm/common/vlock.S
index ff19858..8b7df28 100644
--- a/arch/arm/common/vlock.S
+++ b/arch/arm/common/vlock.S
@@ -42,7 +42,7 @@
 	dmb
 	mov	\rscratch, #0
 	strb	\rscratch, [\rbase, \rcpu]
-	dsb
+	dsb	st
 	sev
 .endm
 
@@ -102,7 +102,7 @@ ENTRY(vlock_unlock)
 	dmb
 	mov	r1, #VLOCK_OWNER_NONE
 	strb	r1, [r0, #VLOCK_OWNER_OFFSET]
-	dsb
+	dsb	st
 	sev
 	bx	lr
 ENDPROC(vlock_unlock)
-- 
1.8.2.2


* [PATCH 10/10] ARM: mcpm: use -st dsb option prior to sev instructions
  2013-06-06 14:28 ` [PATCH 10/10] ARM: mcpm: use -st dsb option prior to sev instructions Will Deacon
@ 2013-06-07  4:15   ` Nicolas Pitre
  0 siblings, 0 replies; 14+ messages in thread
From: Nicolas Pitre @ 2013-06-07  4:15 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 6 Jun 2013, Will Deacon wrote:

> In a similar manner to our spinlock implementation, mcpm uses sev to
> wake up cores waiting on a lock when the lock is unlocked. In order to
> ensure that the final write unlocking the lock is visible, a dsb
> instruction is executed immediately prior to the sev.
> 
> This patch changes these dsbs to use the -st option, since we only
> require that the store unlocking the lock is made visible.
> 
> Reviewed-by: Dave Martin <dave.martin@arm.com>
> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
> Signed-off-by: Will Deacon <will.deacon@arm.com>

Acked-by: Nicolas Pitre <nico@linaro.org>

> ---
>  arch/arm/common/mcpm_head.S | 2 +-
>  arch/arm/common/vlock.S     | 4 ++--
>  2 files changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/arm/common/mcpm_head.S b/arch/arm/common/mcpm_head.S
> index 8178705..5cdf619 100644
> --- a/arch/arm/common/mcpm_head.S
> +++ b/arch/arm/common/mcpm_head.S
> @@ -151,7 +151,7 @@ mcpm_setup_leave:
>  
>  	mov	r0, #INBOUND_NOT_COMING_UP
>  	strb	r0, [r8, #MCPM_SYNC_CLUSTER_INBOUND]
> -	dsb
> +	dsb	st
>  	sev
>  
>  	mov	r0, r11
> diff --git a/arch/arm/common/vlock.S b/arch/arm/common/vlock.S
> index ff19858..8b7df28 100644
> --- a/arch/arm/common/vlock.S
> +++ b/arch/arm/common/vlock.S
> @@ -42,7 +42,7 @@
>  	dmb
>  	mov	\rscratch, #0
>  	strb	\rscratch, [\rbase, \rcpu]
> -	dsb
> +	dsb	st
>  	sev
>  .endm
>  
> @@ -102,7 +102,7 @@ ENTRY(vlock_unlock)
>  	dmb
>  	mov	r1, #VLOCK_OWNER_NONE
>  	strb	r1, [r0, #VLOCK_OWNER_OFFSET]
> -	dsb
> +	dsb	st
>  	sev
>  	bx	lr
>  ENDPROC(vlock_unlock)
> -- 
> 1.8.2.2
> 
> 


* [PATCH 02/10] ARM: tlb: don't perform inner-shareable invalidation for local TLB ops
  2013-06-06 14:28 ` [PATCH 02/10] ARM: tlb: don't perform inner-shareable invalidation for local TLB ops Will Deacon
@ 2013-06-13 17:50   ` Jonathan Austin
  2013-06-18 11:32     ` Will Deacon
  0 siblings, 1 reply; 14+ messages in thread
From: Jonathan Austin @ 2013-06-13 17:50 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Will,

On 06/06/13 15:28, Will Deacon wrote:
> Inner-shareable TLB invalidation is typically more expensive than local
> (non-shareable) invalidation, so performing the broadcasting for
> local_flush_tlb_* operations is a waste of cycles and needlessly
> clobbers entries in the TLBs of other CPUs.
>
> This patch introduces __flush_tlb_* versions for many of the TLB
> invalidation functions, which only respect inner-shareable variants of
> the invalidation instructions. This allows us to modify the v7 SMP TLB
> flags to include *both* inner-shareable and non-shareable operations and
> then check the relevant flags depending on whether the operation is
> local or not.

I think this approach leaves us in trouble for some SMP_ON_UP cores as 
the IS versions of the instructions don't exist for them.

Is there something that should be ensuring your new __flush_tlb* 
functions don't get called for SMP_ON_UP? If not it looks like we might 
need to do some runtime patching with the ALT_SMP/ALT_UP macros...

I've commented on one of the examples inline below...
>
> This gains us around 0.5% in hackbench scores for a dual-core A15, but I
> would expect this to improve as more cores (and clusters) are added to
> the equation.
>
> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
> Reported-by: Albin Tonnerre <Albin.Tonnerre@arm.com>
> Signed-off-by: Will Deacon <will.deacon@arm.com>
> ---
>   arch/arm/include/asm/tlbflush.h | 67 ++++++++++++++++++++++++++++++++++++++---
>   arch/arm/kernel/smp_tlb.c       |  8 ++---
>   arch/arm/mm/context.c           |  6 +---
>   3 files changed, 68 insertions(+), 13 deletions(-)
>
> diff --git a/arch/arm/include/asm/tlbflush.h b/arch/arm/include/asm/tlbflush.h
> index a3625d1..55b5e18 100644
> --- a/arch/arm/include/asm/tlbflush.h
> +++ b/arch/arm/include/asm/tlbflush.h
> @@ -167,6 +167,8 @@
>   #endif
>
>   #define v7wbi_tlb_flags_smp	(TLB_WB | TLB_BARRIER | \
> +				 TLB_V6_U_FULL | TLB_V6_U_PAGE | \
> +				 TLB_V6_U_ASID | \
>   				 TLB_V7_UIS_FULL | TLB_V7_UIS_PAGE | \
>   				 TLB_V7_UIS_ASID | TLB_V7_UIS_BP)
>   #define v7wbi_tlb_flags_up	(TLB_WB | TLB_DCLEAN | TLB_BARRIER | \
> @@ -330,6 +332,21 @@ static inline void local_flush_tlb_all(void)
>   	tlb_op(TLB_V4_U_FULL | TLB_V6_U_FULL, "c8, c7, 0", zero);
>   	tlb_op(TLB_V4_D_FULL | TLB_V6_D_FULL, "c8, c6, 0", zero);
>   	tlb_op(TLB_V4_I_FULL | TLB_V6_I_FULL, "c8, c5, 0", zero);
> +
> +	if (tlb_flag(TLB_BARRIER)) {
> +		dsb();
> +		isb();
> +	}
> +}
> +
> +static inline void __flush_tlb_all(void)
> +{
> +	const int zero = 0;
> +	const unsigned int __tlb_flag = __cpu_tlb_flags;
> +
> +	if (tlb_flag(TLB_WB))
> +		dsb();
> +
>   	tlb_op(TLB_V7_UIS_FULL, "c8, c3, 0", zero);

I think we can get away with something similar to what we do in the 
cache maintenance case here, using ALT_SMP and ALT_UP to do runtime code 
patching and use TLB_V6_U_* for the UP case...

A follow-on question is whether we still need to keep the *non*-unified 
TLB maintenance operations (e.g. DTLBIALL, ITLBIALL). As far as I can see 
from looking into old TRMs, the last ARM CPU that didn't automatically 
treat those I/D ops as unified ones was ARM10, so not relevant here...

But - do some of the non-ARM cores exploit the (now deprecated) option 
to maintain these separately? Also did I miss some more obscure ARM variant?

Jonny

>
>   	if (tlb_flag(TLB_BARRIER)) {
> @@ -348,21 +365,32 @@ static inline void local_flush_tlb_mm(struct mm_struct *mm)
>   		dsb();
>
>   	if (possible_tlb_flags & (TLB_V4_U_FULL|TLB_V4_D_FULL|TLB_V4_I_FULL)) {
> -		if (cpumask_test_cpu(get_cpu(), mm_cpumask(mm))) {
> +		if (cpumask_test_cpu(smp_processor_id(), mm_cpumask(mm))) {
>   			tlb_op(TLB_V4_U_FULL, "c8, c7, 0", zero);
>   			tlb_op(TLB_V4_D_FULL, "c8, c6, 0", zero);
>   			tlb_op(TLB_V4_I_FULL, "c8, c5, 0", zero);
>   		}
> -		put_cpu();
>   	}
>
>   	tlb_op(TLB_V6_U_ASID, "c8, c7, 2", asid);
>   	tlb_op(TLB_V6_D_ASID, "c8, c6, 2", asid);
>   	tlb_op(TLB_V6_I_ASID, "c8, c5, 2", asid);
> +
> +	if (tlb_flag(TLB_BARRIER))
> +		dsb();
> +}
> +
> +static inline void __flush_tlb_mm(struct mm_struct *mm)
> +{
> +	const unsigned int __tlb_flag = __cpu_tlb_flags;
> +
> +	if (tlb_flag(TLB_WB))
> +		dsb();
> +
>   #ifdef CONFIG_ARM_ERRATA_720789
> -	tlb_op(TLB_V7_UIS_ASID, "c8, c3, 0", zero);
> +	tlb_op(TLB_V7_UIS_ASID, "c8, c3, 0", 0);
>   #else
> -	tlb_op(TLB_V7_UIS_ASID, "c8, c3, 2", asid);
> +	tlb_op(TLB_V7_UIS_ASID, "c8, c3, 2", ASID(mm));
>   #endif
>
>   	if (tlb_flag(TLB_BARRIER))
> @@ -392,6 +420,21 @@ local_flush_tlb_page(struct vm_area_struct *vma, unsigned long uaddr)
>   	tlb_op(TLB_V6_U_PAGE, "c8, c7, 1", uaddr);
>   	tlb_op(TLB_V6_D_PAGE, "c8, c6, 1", uaddr);
>   	tlb_op(TLB_V6_I_PAGE, "c8, c5, 1", uaddr);
> +
> +	if (tlb_flag(TLB_BARRIER))
> +		dsb();
> +}
> +
> +static inline void
> +__flush_tlb_page(struct vm_area_struct *vma, unsigned long uaddr)
> +{
> +	const unsigned int __tlb_flag = __cpu_tlb_flags;
> +
> +	uaddr = (uaddr & PAGE_MASK) | ASID(vma->vm_mm);
> +
> +	if (tlb_flag(TLB_WB))
> +		dsb();
> +
>   #ifdef CONFIG_ARM_ERRATA_720789
>   	tlb_op(TLB_V7_UIS_PAGE, "c8, c3, 3", uaddr & PAGE_MASK);
>   #else
> @@ -421,6 +464,22 @@ static inline void local_flush_tlb_kernel_page(unsigned long kaddr)
>   	tlb_op(TLB_V6_U_PAGE, "c8, c7, 1", kaddr);
>   	tlb_op(TLB_V6_D_PAGE, "c8, c6, 1", kaddr);
>   	tlb_op(TLB_V6_I_PAGE, "c8, c5, 1", kaddr);
> +
> +	if (tlb_flag(TLB_BARRIER)) {
> +		dsb();
> +		isb();
> +	}
> +}
> +
> +static inline void __flush_tlb_kernel_page(unsigned long kaddr)
> +{
> +	const unsigned int __tlb_flag = __cpu_tlb_flags;
> +
> +	kaddr &= PAGE_MASK;
> +
> +	if (tlb_flag(TLB_WB))
> +		dsb();
> +
>   	tlb_op(TLB_V7_UIS_PAGE, "c8, c3, 1", kaddr);
>
>   	if (tlb_flag(TLB_BARRIER)) {
> diff --git a/arch/arm/kernel/smp_tlb.c b/arch/arm/kernel/smp_tlb.c
> index 9a52a07..cc299b5 100644
> --- a/arch/arm/kernel/smp_tlb.c
> +++ b/arch/arm/kernel/smp_tlb.c
> @@ -135,7 +135,7 @@ void flush_tlb_all(void)
>   	if (tlb_ops_need_broadcast())
>   		on_each_cpu(ipi_flush_tlb_all, NULL, 1);
>   	else
> -		local_flush_tlb_all();
> +		__flush_tlb_all();
>   	broadcast_tlb_a15_erratum();
>   }
>
> @@ -144,7 +144,7 @@ void flush_tlb_mm(struct mm_struct *mm)
>   	if (tlb_ops_need_broadcast())
>   		on_each_cpu_mask(mm_cpumask(mm), ipi_flush_tlb_mm, mm, 1);
>   	else
> -		local_flush_tlb_mm(mm);
> +		__flush_tlb_mm(mm);
>   	broadcast_tlb_mm_a15_erratum(mm);
>   }
>
> @@ -157,7 +157,7 @@ void flush_tlb_page(struct vm_area_struct *vma, unsigned long uaddr)
>   		on_each_cpu_mask(mm_cpumask(vma->vm_mm), ipi_flush_tlb_page,
>   					&ta, 1);
>   	} else
> -		local_flush_tlb_page(vma, uaddr);
> +		__flush_tlb_page(vma, uaddr);
>   	broadcast_tlb_mm_a15_erratum(vma->vm_mm);
>   }
>
> @@ -168,7 +168,7 @@ void flush_tlb_kernel_page(unsigned long kaddr)
>   		ta.ta_start = kaddr;
>   		on_each_cpu(ipi_flush_tlb_kernel_page, &ta, 1);
>   	} else
> -		local_flush_tlb_kernel_page(kaddr);
> +		__flush_tlb_kernel_page(kaddr);
>   	broadcast_tlb_a15_erratum();
>   }
>
> diff --git a/arch/arm/mm/context.c b/arch/arm/mm/context.c
> index 2ac3737..62c1ec5 100644
> --- a/arch/arm/mm/context.c
> +++ b/arch/arm/mm/context.c
> @@ -134,10 +134,7 @@ static void flush_context(unsigned int cpu)
>   	}
>
>   	/* Queue a TLB invalidate and flush the I-cache if necessary. */
> -	if (!tlb_ops_need_broadcast())
> -		cpumask_set_cpu(cpu, &tlb_flush_pending);
> -	else
> -		cpumask_setall(&tlb_flush_pending);
> +	cpumask_setall(&tlb_flush_pending);
>
>   	if (icache_is_vivt_asid_tagged())
>   		__flush_icache_all();
> @@ -215,7 +212,6 @@ void check_and_switch_context(struct mm_struct *mm, struct task_struct *tsk)
>   	if (cpumask_test_and_clear_cpu(cpu, &tlb_flush_pending)) {
>   		local_flush_bp_all();
>   		local_flush_tlb_all();
> -		dummy_flush_tlb_a15_erratum();
>   	}
>
>   	atomic64_set(&per_cpu(active_asids, cpu), asid);
>


* [PATCH 02/10] ARM: tlb: don't perform inner-shareable invalidation for local TLB ops
  2013-06-13 17:50   ` Jonathan Austin
@ 2013-06-18 11:32     ` Will Deacon
  0 siblings, 0 replies; 14+ messages in thread
From: Will Deacon @ 2013-06-18 11:32 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Jun 13, 2013 at 06:50:03PM +0100, Jonathan Austin wrote:
> Hi Will,

Hi Jonny,

> On 06/06/13 15:28, Will Deacon wrote:
> > Inner-shareable TLB invalidation is typically more expensive than local
> > (non-shareable) invalidation, so performing the broadcasting for
> > local_flush_tlb_* operations is a waste of cycles and needlessly
> > clobbers entries in the TLBs of other CPUs.
> >
> > This patch introduces __flush_tlb_* versions for many of the TLB
> > invalidation functions, which only respect inner-shareable variants of
> > the invalidation instructions. This allows us to modify the v7 SMP TLB
> > flags to include *both* inner-shareable and non-shareable operations and
> > then check the relevant flags depending on whether the operation is
> > local or not.
> 
> I think this approach leaves us in trouble for some SMP_ON_UP cores as 
> the IS versions of the instructions don't exist for them.
> 
> Is there something that should be ensuring your new __flush_tlb* 
> functions don't get called for SMP_ON_UP? If not it looks like we might 
> need to do some runtime patching with the ALT_SMP/ALT_UP macros...

Well spotted. Actually, the best fix here is to honour the tlb flags we end
up with, since they get patched by the SMP_ON_UP code (which forces
indirection via MULTI_TLB). We should extract the `meat' of the local_ ops
into __local_ops, which can be inlined directly into the non-local variants
without introducing additional barriers around the invalidation operations.
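
Something along the lines of the following, completely untested and only
intended to sketch the shape; note that the local variant can still test
the TLB_V7_UIS_* flags but emit the non-shareable encoding:

	static inline void __local_flush_tlb_all(void)
	{
		const int zero = 0;
		const unsigned int __tlb_flag = __cpu_tlb_flags;

		tlb_op(TLB_V4_U_FULL | TLB_V6_U_FULL, "c8, c7, 0", zero);
		tlb_op(TLB_V4_D_FULL | TLB_V6_D_FULL, "c8, c6, 0", zero);
		tlb_op(TLB_V4_I_FULL | TLB_V6_I_FULL, "c8, c5, 0", zero);
	}

	static inline void local_flush_tlb_all(void)
	{
		const int zero = 0;
		const unsigned int __tlb_flag = __cpu_tlb_flags;

		if (tlb_flag(TLB_WB))
			dsb(nshst);

		__local_flush_tlb_all();
		tlb_op(TLB_V7_UIS_FULL, "c8, c7, 0", zero);	/* TLBIALL */

		if (tlb_flag(TLB_BARRIER)) {
			dsb(nsh);
			isb();
		}
	}

	static inline void __flush_tlb_all(void)
	{
		const int zero = 0;
		const unsigned int __tlb_flag = __cpu_tlb_flags;

		if (tlb_flag(TLB_WB))
			dsb(ishst);

		__local_flush_tlb_all();
		tlb_op(TLB_V7_UIS_FULL, "c8, c3, 0", zero);	/* TLBIALLIS */

		if (tlb_flag(TLB_BARRIER)) {
			dsb(ish);
			isb();
		}
	}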

> A follow on question is whether we still need to keep the *non* unified 
> TLB maintenance operations (eg DTLBIALL, ITLBIALL). As far as I can see 
> looking in to old TRMs, the last ARM CPU that didn't automatically treat 
> those I/D ops to unified ones was ARM10, so not relevant here...

My reading of the 1136 TRM is that there are separate micro-tlbs and a
unified main tlb. I don't see any implication that an operation on the
unified tlb automatically applies to both micro-tlbs, but I've not checked
the rtl (the implication holds the other way around).

Will


