[PATCH 0/2] arm64: Introduce boot parameter to disable TLB flush instruction within the same inner shareable domain

* [PATCH 0/2] arm64: Introduce boot parameter to disable TLB flush instruction within the same inner shareable domain
@ 2019-06-17 14:32 Takao Indoh
  2019-06-17 14:32 ` [PATCH 1/2] arm64: mm: Restore mm_cpumask (revert commit 38d96287504a ("arm64: mm: kill mm_cpumask usage")) Takao Indoh
                   ` (3 more replies)
  0 siblings, 4 replies; 18+ messages in thread
From: Takao Indoh @ 2019-06-17 14:32 UTC (permalink / raw)
  To: Jonathan Corbet, Catalin Marinas, Will Deacon
  Cc: linux-doc, linux-kernel, linux-arm-kernel, QI Fuli, Takao Indoh

From: Takao Indoh <indou.takao@fujitsu.com>

I found a performance issue related to the implementation of Linux's TLB
flush for arm64.

When I run a single-threaded test program in a moderate environment, it
usually takes 39ms to finish its work. However, when I put a small
application, which just calls mprotect() continuously, on one of the sibling
cores and run it simultaneously, the test program slows down significantly.
It takes 49ms (125%) on ThunderX2. I also detected the same problem on
ThunderX1 and Fujitsu A64FX.
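
A minimal sketch of the kind of noise generator described above, assuming a
Linux userspace build. The function name mprotect_loop() and the choice of a
single anonymous page are illustrative assumptions; the actual test program
was not posted in this thread.

```c
#define _DEFAULT_SOURCE		/* for MAP_ANONYMOUS under strict -std= modes */
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical reproducer: flip one anonymous page between read-only and
 * read-write `iterations` times. Each permission change on a populated page
 * is expected to make the kernel invalidate the stale TLB entry.
 * Returns 0 on success, -1 on any failure.
 */
int mprotect_loop(unsigned long iterations)
{
	long page = sysconf(_SC_PAGESIZE);
	unsigned long i;
	char *buf;

	if (page <= 0)
		return -1;

	buf = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return -1;

	buf[0] = 1;	/* fault the page in so a translation exists */

	for (i = 0; i < iterations; i++) {
		if (mprotect(buf, (size_t)page, PROT_READ) ||
		    mprotect(buf, (size_t)page, PROT_READ | PROT_WRITE)) {
			munmap(buf, (size_t)page);
			return -1;
		}
		buf[0] = 1;	/* touch again so the next pass has an entry to flush */
	}

	return munmap(buf, (size_t)page);
}
```

To approximate the setup described above, one would call this from a small
program pinned to a sibling core (for example with taskset) while the
measured program runs on another core.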

I suppose the root cause of this issue is the implementation of Linux's TLB
flush for arm64, especially the use of the TLBI-is instruction, which is
broadcast to all processor cores in the system. In the above situation,
TLBI-is is issued by mprotect().

This is not a problem for a small environment, but it causes significant
performance noise in a large-scale HPC environment, which has more than a
thousand nodes with a low-latency interconnect.

To fix this problem, this patch set adds a new boot parameter,
'disable_tlbflush_is'. In the case of flush_tlb_mm() *without* this
parameter, the TLB entry is invalidated by __tlbi(aside1is, asid). With this
instruction, all CPUs within the same inner shareable domain check whether
they hold TLB entries with this ASID, which causes performance noise. OTOH,
when this new parameter is specified, the TLB entry is invalidated by
__tlbi(aside1, asid) only on the CPUs specified by mm_cpumask(mm).
Therefore the TLB flush is done on a minimal set of CPUs and the performance
problem does not occur. I have confirmed that this patch set fixes the
performance problem.
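
The difference between the two paths can be illustrated with a small userspace
model. This is only a sketch under stated assumptions: struct mm_model, the
model_*() helpers and MODEL_NR_CPUS are hypothetical names, the mask is a
plain 64-bit bitmap rather than a kernel cpumask, and no real TLB operation
is performed. It counts how many CPUs would have to process an invalidation
in each mode, following the description above.

```c
#include <stdint.h>

#define MODEL_NR_CPUS 48u	/* hypothetical core count, must be <= 64 */

/* Stand-in for the per-mm CPU mask that patch 1/2 restores. */
struct mm_model {
	uint64_t cpumask;
};

/* A CPU starts running this mm: record it in the mask. */
void model_switch_in(struct mm_model *mm, unsigned int cpu)
{
	mm->cpumask |= UINT64_C(1) << cpu;
}

/*
 * A CPU switches away from this mm: drop it from the mask. The posted patch
 * also flushes the local TLB at this point; the model only tracks the mask.
 */
void model_switch_out(struct mm_model *mm, unsigned int cpu)
{
	mm->cpumask &= ~(UINT64_C(1) << cpu);
}

/*
 * Number of CPUs that must process a flush of this mm.
 * Broadcast mode: every CPU in the modelled inner shareable domain.
 * disable_tlbflush_is mode: only the CPUs set in the mask.
 */
unsigned int model_flush_targets(const struct mm_model *mm,
				 int disable_tlbflush_is)
{
	uint64_t mask = mm->cpumask;
	unsigned int n = 0;

	if (!disable_tlbflush_is)
		return MODEL_NR_CPUS;

	while (mask) {
		mask &= mask - 1;	/* clear the lowest set bit */
		n++;
	}
	return n;
}
```

In this model, an mm that has only run on two CPUs disturbs two CPUs per
flush with the parameter set, and all MODEL_NR_CPUS CPUs without it.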

Takao Indoh (2):
  arm64: mm: Restore mm_cpumask (revert commit 38d96287504a ("arm64: mm:
    kill mm_cpumask usage"))
  arm64: tlb: Add boot parameter to disable TLB flush within the same
    inner shareable domain

 .../admin-guide/kernel-parameters.txt         |   4 +
 arch/arm64/include/asm/mmu_context.h          |   7 +-
 arch/arm64/include/asm/tlbflush.h             |  61 ++-----
 arch/arm64/kernel/Makefile                    |   2 +-
 arch/arm64/kernel/smp.c                       |   6 +
 arch/arm64/kernel/tlbflush.c                  | 155 ++++++++++++++++++
 arch/arm64/mm/context.c                       |   2 +
 7 files changed, 186 insertions(+), 51 deletions(-)
 create mode 100644 arch/arm64/kernel/tlbflush.c

-- 
2.20.1

* [PATCH 1/2] arm64: mm: Restore mm_cpumask (revert commit 38d96287504a ("arm64: mm: kill mm_cpumask usage"))
  2019-06-17 14:32 [PATCH 0/2] arm64: Introduce boot parameter to disable TLB flush instruction within the same inner shareable domain Takao Indoh
@ 2019-06-17 14:32 ` Takao Indoh
  2019-07-23 11:55   ` Catalin Marinas
  2019-06-17 14:32 ` [PATCH 2/2] arm64: tlb: Add boot parameter to disable TLB flush within the same inner shareable domain Takao Indoh
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 18+ messages in thread
From: Takao Indoh @ 2019-06-17 14:32 UTC (permalink / raw)
  To: Jonathan Corbet, Catalin Marinas, Will Deacon
  Cc: linux-doc, linux-kernel, linux-arm-kernel, QI Fuli, Takao Indoh

From: Takao Indoh <indou.takao@fujitsu.com>

mm_cpumask was deleted by the commit 38d96287504a ("arm64: mm: kill
mm_cpumask usage") because it was not used at that time. Now this is needed
to find appropriate CPUs for TLB flush, so this patch reverts this commit.

Signed-off-by: QI Fuli <qi.fuli@fujitsu.com>
Signed-off-by: Takao Indoh <indou.takao@fujitsu.com>
---
 arch/arm64/include/asm/mmu_context.h | 7 ++++++-
 arch/arm64/kernel/smp.c              | 6 ++++++
 arch/arm64/mm/context.c              | 2 ++
 3 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/mmu_context.h b/arch/arm64/include/asm/mmu_context.h
index 2da3e478fd8f..21ef11590bcb 100644
--- a/arch/arm64/include/asm/mmu_context.h
+++ b/arch/arm64/include/asm/mmu_context.h
@@ -241,8 +241,13 @@ static inline void
 switch_mm(struct mm_struct *prev, struct mm_struct *next,
 	  struct task_struct *tsk)
 {
-	if (prev != next)
+	unsigned int cpu = smp_processor_id();
+
+	if (prev != next) {
 		__switch_mm(next);
+		cpumask_clear_cpu(cpu, mm_cpumask(prev));
+		local_flush_tlb_mm(prev);
+	}
 
 	/*
 	 * Update the saved TTBR0_EL1 of the scheduled-in task as the previous
diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index bb4b3f07761a..12a922d1cdd7 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -218,6 +218,7 @@ asmlinkage notrace void secondary_start_kernel(void)
 	 */
 	mmgrab(mm);
 	current->active_mm = mm;
+	cpumask_set_cpu(cpu, mm_cpumask(mm));
 
 	/*
 	 * TTBR0 is only used for the identity mapping at this stage. Make it
@@ -320,6 +321,11 @@ int __cpu_disable(void)
 	 */
 	irq_migrate_all_off_this_cpu();
 
+	/*
+	 * Remove this CPU from the vm mask set of all processes.
+	 */
+	clear_tasks_mm_cpumask(cpu);
+
 	return 0;
 }
 
diff --git a/arch/arm64/mm/context.c b/arch/arm64/mm/context.c
index 1f0ea2facf24..ff3ab2924074 100644
--- a/arch/arm64/mm/context.c
+++ b/arch/arm64/mm/context.c
@@ -188,6 +188,7 @@ static u64 new_context(struct mm_struct *mm)
 set_asid:
 	__set_bit(asid, asid_map);
 	cur_idx = asid;
+	cpumask_clear(mm_cpumask(mm));
 	return idx2asid(asid) | generation;
 }
 
@@ -239,6 +240,7 @@ void check_and_switch_context(struct mm_struct *mm, unsigned int cpu)
 switch_mm_fastpath:
 
 	arm64_apply_bp_hardening();
+	cpumask_set_cpu(cpu, mm_cpumask(mm));
 
 	/*
 	 * Defer TTBR0_EL1 setting for user threads to uaccess_enable() when
-- 
2.20.1


* [PATCH 2/2] arm64: tlb: Add boot parameter to disable TLB flush within the same inner shareable domain
  2019-06-17 14:32 [PATCH 0/2] arm64: Introduce boot parameter to disable TLB flush instruction within the same inner shareable domain Takao Indoh
  2019-06-17 14:32 ` [PATCH 1/2] arm64: mm: Restore mm_cpumask (revert commit 38d96287504a ("arm64: mm: kill mm_cpumask usage")) Takao Indoh
@ 2019-06-17 14:32 ` Takao Indoh
  2019-07-23 12:11   ` Catalin Marinas
  2019-06-17 17:03 ` [PATCH 0/2] arm64: Introduce boot parameter to disable TLB flush instruction " Will Deacon
  2019-11-01  9:56 ` qi.fuli
  3 siblings, 1 reply; 18+ messages in thread
From: Takao Indoh @ 2019-06-17 14:32 UTC (permalink / raw)
  To: Jonathan Corbet, Catalin Marinas, Will Deacon
  Cc: linux-doc, linux-kernel, linux-arm-kernel, QI Fuli, Takao Indoh

From: Takao Indoh <indou.takao@fujitsu.com>

This patch adds a new boot parameter, 'disable_tlbflush_is', to disable TLB
flush within the same inner shareable domain for performance tuning.

In the case of flush_tlb_mm() *without* this parameter, the TLB entry is
invalidated by __tlbi(aside1is, asid). With this instruction, all CPUs within
the same inner shareable domain check whether they hold TLB entries with
this ASID, which causes performance noise, especially in a large-scale HPC
environment, which has more than a thousand nodes with a low-latency
interconnect.

When this new parameter is specified, the TLB entry is invalidated by
__tlbi(aside1, asid) only on the CPUs specified by mm_cpumask(mm).
Therefore the TLB flush is done on a minimal set of CPUs and the performance
problem does not occur.

Signed-off-by: QI Fuli <qi.fuli@fujitsu.com>
Signed-off-by: Takao Indoh <indou.takao@fujitsu.com>
---
 .../admin-guide/kernel-parameters.txt         |   4 +
 arch/arm64/include/asm/tlbflush.h             |  61 ++-----
 arch/arm64/kernel/Makefile                    |   2 +-
 arch/arm64/kernel/tlbflush.c                  | 155 ++++++++++++++++++
 4 files changed, 172 insertions(+), 50 deletions(-)
 create mode 100644 arch/arm64/kernel/tlbflush.c

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 138f6664b2e2..a693eea34e48 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -848,6 +848,10 @@
 	disable=	[IPV6]
 			See Documentation/networking/ipv6.txt.
 
+	disable_tlbflush_is
+			[ARM64] Disable using TLB instruction to flush
+			all PE within the same inner shareable domain.
+
 	hardened_usercopy=
                         [KNL] Under CONFIG_HARDENED_USERCOPY, whether
                         hardening is enabled for this boot. Hardened
diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index dff8f9ea5754..ba2b3fd0b63c 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -139,6 +139,13 @@
  *	on top of these routines, since that is our interface to the mmu_gather
  *	API as used by munmap() and friends.
  */
+
+void flush_tlb_mm(struct mm_struct *mm);
+void flush_tlb_page_nosync(struct vm_area_struct *vma,
+				unsigned long uaddr);
+void __flush_tlb_range(struct vm_area_struct *vma, unsigned long start,
+		unsigned long end, unsigned long stride, bool last_level);
+
 static inline void local_flush_tlb_all(void)
 {
 	dsb(nshst);
@@ -155,24 +162,14 @@ static inline void flush_tlb_all(void)
 	isb();
 }
 
-static inline void flush_tlb_mm(struct mm_struct *mm)
+static inline void local_flush_tlb_mm(struct mm_struct *mm)
 {
 	unsigned long asid = __TLBI_VADDR(0, ASID(mm));
 
-	dsb(ishst);
-	__tlbi(aside1is, asid);
-	__tlbi_user(aside1is, asid);
-	dsb(ish);
-}
-
-static inline void flush_tlb_page_nosync(struct vm_area_struct *vma,
-					 unsigned long uaddr)
-{
-	unsigned long addr = __TLBI_VADDR(uaddr, ASID(vma->vm_mm));
-
-	dsb(ishst);
-	__tlbi(vale1is, addr);
-	__tlbi_user(vale1is, addr);
+	dsb(nshst);
+	__tlbi(aside1, asid);
+	__tlbi_user(aside1, asid);
+	dsb(nsh);
 }
 
 static inline void flush_tlb_page(struct vm_area_struct *vma,
@@ -188,40 +185,6 @@ static inline void flush_tlb_page(struct vm_area_struct *vma,
  */
 #define MAX_TLBI_OPS	PTRS_PER_PTE
 
-static inline void __flush_tlb_range(struct vm_area_struct *vma,
-				     unsigned long start, unsigned long end,
-				     unsigned long stride, bool last_level)
-{
-	unsigned long asid = ASID(vma->vm_mm);
-	unsigned long addr;
-
-	start = round_down(start, stride);
-	end = round_up(end, stride);
-
-	if ((end - start) >= (MAX_TLBI_OPS * stride)) {
-		flush_tlb_mm(vma->vm_mm);
-		return;
-	}
-
-	/* Convert the stride into units of 4k */
-	stride >>= 12;
-
-	start = __TLBI_VADDR(start, asid);
-	end = __TLBI_VADDR(end, asid);
-
-	dsb(ishst);
-	for (addr = start; addr < end; addr += stride) {
-		if (last_level) {
-			__tlbi(vale1is, addr);
-			__tlbi_user(vale1is, addr);
-		} else {
-			__tlbi(vae1is, addr);
-			__tlbi_user(vae1is, addr);
-		}
-	}
-	dsb(ish);
-}
-
 static inline void flush_tlb_range(struct vm_area_struct *vma,
 				   unsigned long start, unsigned long end)
 {
diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile
index 9e7dcb2c31c7..266c9a57b081 100644
--- a/arch/arm64/kernel/Makefile
+++ b/arch/arm64/kernel/Makefile
@@ -19,7 +19,7 @@ obj-y			:= debug-monitors.o entry.o irq.o fpsimd.o		\
 			   return_address.o cpuinfo.o cpu_errata.o		\
 			   cpufeature.o alternative.o cacheinfo.o		\
 			   smp.o smp_spin_table.o topology.o smccc-call.o	\
-			   syscall.o
+			   syscall.o tlbflush.o
 
 extra-$(CONFIG_EFI)			:= efi-entry.o
 
diff --git a/arch/arm64/kernel/tlbflush.c b/arch/arm64/kernel/tlbflush.c
new file mode 100644
index 000000000000..52c9a237759a
--- /dev/null
+++ b/arch/arm64/kernel/tlbflush.c
@@ -0,0 +1,155 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright (C) 2019 FUJITSU LIMITED
+
+#include <linux/smp.h>
+#include <asm/tlbflush.h>
+
+struct tlb_args {
+	struct vm_area_struct *ta_vma;
+	unsigned long ta_start;
+	unsigned long ta_end;
+	unsigned long ta_stride;
+	bool ta_last_level;
+};
+
+int disable_tlbflush_is;
+
+static int __init disable_tlbflush_is_setup(char *str)
+{
+	disable_tlbflush_is = 1;
+
+	return 0;
+}
+__setup("disable_tlbflush_is", disable_tlbflush_is_setup);
+
+static inline void __flush_tlb_mm(struct mm_struct *mm)
+{
+	unsigned long asid = __TLBI_VADDR(0, ASID(mm));
+
+	dsb(ishst);
+	__tlbi(aside1is, asid);
+	__tlbi_user(aside1is, asid);
+	dsb(ish);
+}
+
+static inline void ipi_flush_tlb_mm(void *arg)
+{
+	struct mm_struct *mm = arg;
+
+	local_flush_tlb_mm(mm);
+}
+
+void flush_tlb_mm(struct mm_struct *mm)
+{
+	if (disable_tlbflush_is)
+		on_each_cpu_mask(mm_cpumask(mm), ipi_flush_tlb_mm,
+				 (void *)mm, true);
+	else
+		__flush_tlb_mm(mm);
+}
+
+static inline void __flush_tlb_page_nosync(unsigned long addr)
+{
+	dsb(ishst);
+	__tlbi(vale1is, addr);
+	__tlbi_user(vale1is, addr);
+}
+
+static inline void __local_flush_tlb_page_nosync(unsigned long addr)
+{
+	dsb(nshst);
+	__tlbi(vale1, addr);
+	__tlbi_user(vale1, addr);
+	dsb(nsh);
+}
+
+static inline void ipi_flush_tlb_page_nosync(void *arg)
+{
+	unsigned long addr = *(unsigned long *)arg;
+
+	__local_flush_tlb_page_nosync(addr);
+}
+
+void flush_tlb_page_nosync(struct vm_area_struct *vma, unsigned long uaddr)
+{
+	unsigned long addr = __TLBI_VADDR(uaddr, ASID(vma->vm_mm));
+
+	if (disable_tlbflush_is)
+		on_each_cpu_mask(mm_cpumask(vma->vm_mm),
+				ipi_flush_tlb_page_nosync, &addr, true);
+	else
+		__flush_tlb_page_nosync(addr);
+}
+
+static inline void ___flush_tlb_range(unsigned long start, unsigned long end,
+				     unsigned long stride, bool last_level)
+{
+	unsigned long addr;
+
+	dsb(ishst);
+	for (addr = start; addr < end; addr += stride) {
+		if (last_level) {
+			__tlbi(vale1is, addr);
+			__tlbi_user(vale1is, addr);
+		} else {
+			__tlbi(vae1is, addr);
+			__tlbi_user(vae1is, addr);
+		}
+	}
+	dsb(ish);
+}
+
+static inline void __local_flush_tlb_range(unsigned long addr, bool last_level)
+{
+	dsb(nshst);
+	if (last_level) {
+		__tlbi(vale1, addr);
+		__tlbi_user(vale1, addr);
+	} else {
+		__tlbi(vae1, addr);
+		__tlbi_user(vae1, addr);
+	}
+	dsb(nsh);
+}
+
+static inline void ipi_flush_tlb_range(void *arg)
+{
+	struct tlb_args *ta = (struct tlb_args *)arg;
+	unsigned long addr;
+
+	for (addr = ta->ta_start; addr < ta->ta_end; addr += ta->ta_stride)
+		__local_flush_tlb_range(addr, ta->ta_last_level);
+}
+
+void __flush_tlb_range(struct vm_area_struct *vma, unsigned long start,
+		unsigned long end, unsigned long stride, bool last_level)
+{
+	unsigned long asid = ASID(vma->vm_mm);
+
+	start = round_down(start, stride);
+	end = round_up(end, stride);
+
+	if ((end - start) >= (MAX_TLBI_OPS * stride)) {
+		flush_tlb_mm(vma->vm_mm);
+		return;
+	}
+
+	/* Convert the stride into units of 4k */
+	stride >>= 12;
+
+	start = __TLBI_VADDR(start, asid);
+	end = __TLBI_VADDR(end, asid);
+
+	if (disable_tlbflush_is) {
+		struct tlb_args ta = {
+			.ta_start	= start,
+			.ta_end		= end,
+			.ta_stride	= stride,
+			.ta_last_level	= last_level,
+		};
+
+		on_each_cpu_mask(mm_cpumask(vma->vm_mm), ipi_flush_tlb_range,
+					    &ta, true);
+	} else
+		___flush_tlb_range(start, end, stride, last_level);
+}
-- 
2.20.1


* Re: [PATCH 0/2] arm64: Introduce boot parameter to disable TLB flush instruction within the same inner shareable domain
  2019-06-17 14:32 [PATCH 0/2] arm64: Introduce boot parameter to disable TLB flush instruction within the same inner shareable domain Takao Indoh
  2019-06-17 14:32 ` [PATCH 1/2] arm64: mm: Restore mm_cpumask (revert commit 38d96287504a ("arm64: mm: kill mm_cpumask usage")) Takao Indoh
  2019-06-17 14:32 ` [PATCH 2/2] arm64: tlb: Add boot parameter to disable TLB flush within the same inner shareable domain Takao Indoh
@ 2019-06-17 17:03 ` Will Deacon
  2019-06-24 10:34   ` qi.fuli
  2019-11-01  9:56 ` qi.fuli
  3 siblings, 1 reply; 18+ messages in thread
From: Will Deacon @ 2019-06-17 17:03 UTC (permalink / raw)
  To: Takao Indoh
  Cc: Jonathan Corbet, Catalin Marinas, linux-doc, linux-kernel,
	linux-arm-kernel, QI Fuli, Takao Indoh, peterz

Hi Takao,

[+Peter Z]

On Mon, Jun 17, 2019 at 11:32:53PM +0900, Takao Indoh wrote:
> From: Takao Indoh <indou.takao@fujitsu.com>
> 
> I found a performance issue related to the implementation of Linux's TLB
> flush for arm64.
> 
> When I run a single-threaded test program on moderate environment, it
> usually takes 39ms to finish its work. However, when I put a small
> application, which just calls mprotect() continuously, on one of the sibling
> cores and run it simultaneously, the test program slows down significantly.
> It becomes 49ms(125%) on ThunderX2. I also detected the same problem on
> ThunderX1 and Fujitsu A64FX.

This is a problem for any applications that share hardware resources with
each other, so I don't think it's something we should be too concerned about
addressing unless there is a practical DoS scenario, which there doesn't
appear to be in this case. It may be that the real answer is "don't call
mprotect() in a loop".

> I suppose the root cause of this issue is the implementation of Linux's TLB
> flush for arm64, especially use of TLBI-is instruction which is a broadcast
> to all processor core on the system. In case of the above situation,
> TLBI-is is called by mprotect().

On the flip side, Linux is providing the hardware with enough information
not to broadcast to cores for which the remote TLBs don't have entries
allocated for the ASID being invalidated. I would say that the root cause
of the issue is that this filtering is not taking place.

> This is not a problem for small environment, but this causes a significant
> performance noise for large-scale HPC environment, which has more than
> thousand nodes with low latency interconnect.

If you have a system with over a thousand nodes, without snoop filtering
for DVM messages and you expect performance to scale in the face of tight
mprotect() loops then I think you have a problem irrespective of this patch.
What happens if somebody runs I-cache invalidation in a loop?

> To fix this problem, this patch adds new boot parameter
> 'disable_tlbflush_is'.  In the case of flush_tlb_mm() *without* this
> parameter, TLB entry is invalidated by __tlbi(aside1is, asid). By this
> instruction, all CPUs within the same inner shareable domain check if there
> are TLB entries which have this ASID, this causes performance noise. OTOH,
> when this new parameter is specified, TLB entry is invalidated by
> __tlbi(aside1, asid) only on the CPUs specified by mm_cpumask(mm).
> Therefore TLB flush is done on minimal CPUs and performance problem does
> not occur. Actually I confirm the performance problem is fixed by this
> patch.

Other than my comments above, my overall concern with this patch is that
it introduces divergent behaviour for our TLB invalidation flow, which is
undesirable from both maintainability and usability perspectives. If you
wish to change the code, please don't put it behind a command-line option,
but instead improve the code that is already there. However, I suspect that
blowing away the local TLB on every context-switch may have hidden costs
which are only apparent with workloads different from the contrived case
that you're seeking to improve. You also haven't taken into account the
effects of virtualisation, where it's likely that the hypervisor will
upgrade non-shareable operations to inner-shareable ones anyway.

Thanks,

Will

* Re: [PATCH 0/2] arm64: Introduce boot parameter to disable TLB flush instruction within the same inner shareable domain
  2019-06-17 17:03 ` [PATCH 0/2] arm64: Introduce boot parameter to disable TLB flush instruction " Will Deacon
@ 2019-06-24 10:34   ` qi.fuli
  2019-06-27 10:27     ` Will Deacon
  0 siblings, 1 reply; 18+ messages in thread
From: qi.fuli @ 2019-06-24 10:34 UTC (permalink / raw)
  To: Will Deacon, indou.takao
  Cc: Jonathan Corbet, Catalin Marinas, linux-doc, linux-kernel,
	linux-arm-kernel, qi.fuli, indou.takao, peterz

Hi Will,

I am Takao's colleague, thank you very much for your reply.

On 6/18/19 2:03 AM, Will Deacon wrote:
> Hi Takao,
>
> [+Peter Z]
>
> On Mon, Jun 17, 2019 at 11:32:53PM +0900, Takao Indoh wrote:
>> From: Takao Indoh <indou.takao@fujitsu.com>
>>
>> I found a performance issue related to the implementation of Linux's TLB
>> flush for arm64.
>>
>> When I run a single-threaded test program on moderate environment, it
>> usually takes 39ms to finish its work. However, when I put a small
>> application, which just calls mprotect() continuously, on one of the sibling
>> cores and run it simultaneously, the test program slows down significantly.
>> It becomes 49ms(125%) on ThunderX2. I also detected the same problem on
>> ThunderX1 and Fujitsu A64FX.
> This is a problem for any applications that share hardware resources with
> each other, so I don't think it's something we should be too concerned about
> addressing unless there is a practical DoS scenario, which there doesn't
> appear to be in this case. It may be that the real answer is "don't call
> mprotect() in a loop".

I think there has been a misunderstanding; please let me explain.
This application is just an example used to reproduce the performance
issue we found.
Our original purpose with this series is to reduce OS jitter.
OS jitter on massively parallel processing systems has been known and
studied for many years.
An OS jitter of 2.5% can result in over a factor of 20 slowdown for the
same application [1].
Though it may be an extreme example, reducing OS jitter has long been an
issue in HPC environments.

[1] Ferreira, Kurt B., Patrick Bridges, and Ron Brightwell. 
"Characterizing application sensitivity to OS interference using 
kernel-level noise injection." Proceedings of the 2008 ACM/IEEE 
conference on Supercomputing. IEEE Press, 2008.
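
One common way to reason about why small per-node jitter matters at scale is
a simple independence model for a bulk-synchronous application: a step is
delayed if any one node is delayed. The sketch below is an illustrative
assumption, not the model or the data from [1]; the function name and the
per-step delay probability p are hypothetical.

```c
/*
 * Probability that at least one of `nodes` independent nodes is delayed in a
 * given synchronised step, if each node is delayed with probability `p`:
 *
 *     1 - (1 - p)^nodes
 */
double prob_step_delayed(double p, unsigned int nodes)
{
	double none_delayed = 1.0;
	unsigned int i;

	for (i = 0; i < nodes; i++)
		none_delayed *= 1.0 - p;

	return 1.0 - none_delayed;
}
```

Under these assumptions, a per-node delay probability of 1% per step already
makes a delayed step almost certain at 1000 nodes (the result is above 0.99),
which is why noise that looks negligible on one node can dominate at scale.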

>> I suppose the root cause of this issue is the implementation of Linux's TLB
>> flush for arm64, especially use of TLBI-is instruction which is a broadcast
>> to all processor core on the system. In case of the above situation,
>> TLBI-is is called by mprotect().
> On the flip side, Linux is providing the hardware with enough information
> not to broadcast to cores for which the remote TLBs don't have entries
> allocated for the ASID being invalidated. I would say that the root cause
> of the issue is that this filtering is not taking place.

Do you mean that the filter should be implemented in hardware?

Thanks,
Qi Fuli

>> This is not a problem for small environment, but this causes a significant
>> performance noise for large-scale HPC environment, which has more than
>> thousand nodes with low latency interconnect.
> If you have a system with over a thousand nodes, without snoop filtering
> for DVM messages and you expect performance to scale in the face of tight
> mprotect() loops then I think you have a problem irrespective of this patch.
> What happens if somebody runs I-cache invalidation in a loop?
>
>> To fix this problem, this patch adds new boot parameter
>> 'disable_tlbflush_is'.  In the case of flush_tlb_mm() *without* this
>> parameter, TLB entry is invalidated by __tlbi(aside1is, asid). By this
>> instruction, all CPUs within the same inner shareable domain check if there
>> are TLB entries which have this ASID, this causes performance noise. OTOH,
>> when this new parameter is specified, TLB entry is invalidated by
>> __tlbi(aside1, asid) only on the CPUs specified by mm_cpumask(mm).
>> Therefore TLB flush is done on minimal CPUs and performance problem does
>> not occur. Actually I confirm the performance problem is fixed by this
>> patch.
> Other than my comments above, my overall concern with this patch is that
> it introduces divergent behaviour for our TLB invalidation flow, which is
> undesirable from both maintainability and usability perspectives. If you
> wish to change the code, please don't put it behind a command-line option,
> but instead improve the code that is already there. However, I suspect that
> blowing away the local TLB on every context-switch may have hidden costs
> which are only apparent with workloads different from the contrived case
> that you're seeking to improve. You also haven't taken into account the
> effects of virtualisation, where it's likely that the hypervisor will
> upgrade non-shareable operations to inner-shareable ones anyway.
>
> Thanks,
>
> Will

* Re: [PATCH 0/2] arm64: Introduce boot parameter to disable TLB flush instruction within the same inner shareable domain
  2019-06-24 10:34   ` qi.fuli
@ 2019-06-27 10:27     ` Will Deacon
  2019-07-03  2:45       ` qi.fuli
  0 siblings, 1 reply; 18+ messages in thread
From: Will Deacon @ 2019-06-27 10:27 UTC (permalink / raw)
  To: qi.fuli
  Cc: Will Deacon, indou.takao, linux-doc, peterz, Catalin Marinas,
	Jonathan Corbet, linux-kernel, linux-arm-kernel

On Mon, Jun 24, 2019 at 10:34:02AM +0000, qi.fuli@fujitsu.com wrote:
> On 6/18/19 2:03 AM, Will Deacon wrote:
> > On Mon, Jun 17, 2019 at 11:32:53PM +0900, Takao Indoh wrote:
> >> From: Takao Indoh <indou.takao@fujitsu.com>
> >>
> >> I found a performance issue related to the implementation of Linux's TLB
> >> flush for arm64.
> >>
> >> When I run a single-threaded test program on moderate environment, it
> >> usually takes 39ms to finish its work. However, when I put a small
> >> application, which just calls mprotect() continuously, on one of the sibling
> >> cores and run it simultaneously, the test program slows down significantly.
> >> It becomes 49ms(125%) on ThunderX2. I also detected the same problem on
> >> ThunderX1 and Fujitsu A64FX.
> > This is a problem for any applications that share hardware resources with
> > each other, so I don't think it's something we should be too concerned about
> > addressing unless there is a practical DoS scenario, which there doesn't
> > appear to be in this case. It may be that the real answer is "don't call
> > mprotect() in a loop".
> I think there has been a misunderstanding, please let me explain.
> This application is just an example using for reproducing the 
> performance issue we found.
> Our original purpose is reducing OS jitter by this series.
> The OS jitter on massively parallel processing systems have been known 
> and studied for many years.
> The 2.5% OS jitter can result in over a factor of 20 slowdown for the 
> same application [1].

I think it's worth pointing out that the system in question was neither
ARM-based nor running Linux, so I'd be cautious in applying the conclusions
of that paper directly to our TLB invalidation code. Furthermore, the noise
being generated in their experiments uses a timer interrupt, which has a
/vastly/ different profile to a DVM message in terms of both system impact
and frequency.

> Though it may be an extreme example, reducing the OS jitter has been an 
> issue in HPC environment.
> 
> [1] Ferreira, Kurt B., Patrick Bridges, and Ron Brightwell. 
> "Characterizing application sensitivity to OS interference using 
> kernel-level noise injection." Proceedings of the 2008 ACM/IEEE 
> conference on Supercomputing. IEEE Press, 2008.
> 
> >> I suppose the root cause of this issue is the implementation of Linux's TLB
> >> flush for arm64, especially use of TLBI-is instruction which is a broadcast
> >> to all processor core on the system. In case of the above situation,
> >> TLBI-is is called by mprotect().
> > On the flip side, Linux is providing the hardware with enough information
> > not to broadcast to cores for which the remote TLBs don't have entries
> > allocated for the ASID being invalidated. I would say that the root cause
> > of the issue is that this filtering is not taking place.
> 
> Do you mean that the filter should be implemented in hardware?

Yes. If you're building a large system and you care about "jitter", then
you either need to partition it in such a way that sources of noise are
contained, or you need to introduce filters to limit their scope. Rewriting
the low-level memory-management parts of the operating system is a red
herring and imposes a needless burden on everybody else without solving
the real problem, which is that contended use of shared resources doesn't
scale.

Will

* Re: [PATCH 0/2] arm64: Introduce boot parameter to disable TLB flush instruction within the same inner shareable domain
  2019-06-27 10:27     ` Will Deacon
@ 2019-07-03  2:45       ` qi.fuli
  2019-07-09  0:25         ` Jon Masters
  2019-07-09  8:07         ` Will Deacon
  0 siblings, 2 replies; 18+ messages in thread
From: qi.fuli @ 2019-07-03  2:45 UTC (permalink / raw)
  To: Will Deacon, qi.fuli
  Cc: Will Deacon, indou.takao, linux-doc, peterz, Catalin Marinas,
	Jonathan Corbet, linux-kernel, linux-arm-kernel

Hi Will,

Thanks for your comments.

On 6/27/19 7:27 PM, Will Deacon wrote:
> On Mon, Jun 24, 2019 at 10:34:02AM +0000, qi.fuli@fujitsu.com wrote:
>> On 6/18/19 2:03 AM, Will Deacon wrote:
>>> On Mon, Jun 17, 2019 at 11:32:53PM +0900, Takao Indoh wrote:
>>>> From: Takao Indoh <indou.takao@fujitsu.com>
>>>>
>>>> I found a performance issue related to the implementation of Linux's TLB
>>>> flush for arm64.
>>>>
>>>> When I run a single-threaded test program on moderate environment, it
>>>> usually takes 39ms to finish its work. However, when I put a small
>>>> application, which just calls mprotect() continuously, on one of the sibling
>>>> cores and run it simultaneously, the test program slows down significantly.
>>>> It becomes 49ms(125%) on ThunderX2. I also detected the same problem on
>>>> ThunderX1 and Fujitsu A64FX.
>>> This is a problem for any applications that share hardware resources with
>>> each other, so I don't think it's something we should be too concerned about
>>> addressing unless there is a practical DoS scenario, which there doesn't
>>> appear to be in this case. It may be that the real answer is "don't call
>>> mprotect() in a loop".
>> I think there has been a misunderstanding, please let me explain.
>> This application is just an example using for reproducing the
>> performance issue we found.
>> Our original purpose is reducing OS jitter by this series.
>> The OS jitter on massively parallel processing systems have been known
>> and studied for many years.
>> The 2.5% OS jitter can result in over a factor of 20 slowdown for the
>> same application [1].
> I think it's worth pointing out that the system in question was neither
> ARM-based nor running Linux, so I'd be cautious in applying the conclusions
> of that paper directly to our TLB invalidation code. Furthermore, the noise
> being generated in their experiments uses a timer interrupt, which has a
> /vastly/ different profile to a DVM message in terms of both system impact
> and frequency.

My original purpose in referencing this paper was to explain that OS jitter
is a vital issue for large-scale HPC environments.
Please allow me to describe the issue that occurred in our HPC environment.
We used FWQ [1] to run an experiment on 1 node of our HPC environment. We
expected a maximum OS jitter of tens of microseconds, but it was hundreds
of microseconds, which didn't meet our requirement. We tried to find the
cause by using ftrace, but we could not find any process that would cause
noise; we only saw that the processing time was extended. Then we checked
the CPU instruction count through the CPU PMU, and we didn't find any
change there either. However, we found that the noise increased as the
number of TLB flush calls increased. From this we understood that the
cause of this issue is the implementation of Linux's TLB flush for arm64,
especially the use of the TLBI-is instruction, which is broadcast to all
processor cores in the system. Therefore, we made this patch set to fix
this issue. After testing several times, the noise was reduced and our
original goal was achieved, so we do think this patch set makes sense.

As I mentioned, the OS jitter is a vital issue for large-scale HPC 
environment.
We tried a lot of things to reduce the OS jitter. One of them is task 
separation
between the CPUs which are used for computing and the CPUs which are 
used for
maintenance. All of the daemon processes and I/O interrupts are bounden 
to the
maintenance CPUs. Further more, we used nohz_full to avoid the noise 
caused by
computing CPU interruption, but all of the CPUs were affected by TLBI-is
instruction, the task separation of CPUs didn't work. Therefore, we 
would like
to implement that TLB flush is done on minimal CPUs to reducing the OS 
jitter
by using this patch set.

[1] https://asc.llnl.gov/sequoia/benchmarks/FTQ_summary_v1.1.pdf

Thanks,
QI Fuli

>> Though it may be an extreme example, reducing the OS jitter has been an
>> issue in HPC environment.
>>
>> [1] Ferreira, Kurt B., Patrick Bridges, and Ron Brightwell.
>> "Characterizing application sensitivity to OS interference using
>> kernel-level noise injection." Proceedings of the 2008 ACM/IEEE
>> conference on Supercomputing. IEEE Press, 2008.
>>
>>>> I suppose the root cause of this issue is the implementation of Linux's TLB
>>>> flush for arm64, especially use of TLBI-is instruction which is a broadcast
>>>> to all processor core on the system. In case of the above situation,
>>>> TLBI-is is called by mprotect().
>>> On the flip side, Linux is providing the hardware with enough information
>>> not to broadcast to cores for which the remote TLBs don't have entries
>>> allocated for the ASID being invalidated. I would say that the root cause
>>> of the issue is that this filtering is not taking place.
>> Do you mean that the filter should be implemented in hardware?
> Yes. If you're building a large system and you care about "jitter", then
> you either need to partition it in such a way that sources of noise are
> contained, or you need to introduce filters to limit their scope. Rewriting
> the low-level memory-management parts of the operating system is a red
> herring and imposes a needless burden on everybody else without solving
> the real problem, which is that contended use of shared resources doesn't
> scale.
>
> Will


* Re: [PATCH 0/2] arm64: Introduce boot parameter to disable TLB flush instruction within the same inner shareable domain
  2019-07-03  2:45       ` qi.fuli
@ 2019-07-09  0:25         ` Jon Masters
  2019-07-09  0:29           ` Jon Masters
  2019-07-09  8:07         ` Will Deacon
  1 sibling, 1 reply; 18+ messages in thread
From: Jon Masters @ 2019-07-09  0:25 UTC (permalink / raw)
  To: qi.fuli, Will Deacon
  Cc: Will Deacon, indou.takao, linux-doc, peterz, Catalin Marinas,
	Jonathan Corbet, linux-kernel, linux-arm-kernel

On 7/2/19 10:45 PM, qi.fuli@fujitsu.com wrote:

> However, we found that as the number of TLB flush calls increased,
> the noise was also increasing. Here we understood that the cause of this 
> issue is the implementation of Linux's TLB flush for arm64, especially use of 
> TLBI-is instruction which is a broadcast to all processor core on the system. 

Are you saying that for a microbenchmark in which very large numbers of
threads are created and destroyed rapidly there are a large number of
associated tlb range flushes which always use broadcast TLBIs?

If that's the case, and the hardware doesn't do any ASID filtering and
each TLBI results in a DVM to every PE, would it make sense to look at
whether there are ways to improve batching/switch to an IPI approach
rather than relying on broadcasts, as a more generic solution?

Jon.


* Re: [PATCH 0/2] arm64: Introduce boot parameter to disable TLB flush instruction within the same inner shareable domain
  2019-07-09  0:25         ` Jon Masters
@ 2019-07-09  0:29           ` Jon Masters
  2019-07-09  8:03             ` Will Deacon
  0 siblings, 1 reply; 18+ messages in thread
From: Jon Masters @ 2019-07-09  0:29 UTC (permalink / raw)
  To: qi.fuli, Will Deacon
  Cc: Will Deacon, indou.takao, linux-doc, peterz, Catalin Marinas,
	Jonathan Corbet, linux-kernel, linux-arm-kernel

On 7/8/19 8:25 PM, Jon Masters wrote:
> On 7/2/19 10:45 PM, qi.fuli@fujitsu.com wrote:
> 
>> However, we found that as the number of TLB flush calls increased,
>> the noise was also increasing. Here we understood that the cause of this 
>> issue is the implementation of Linux's TLB flush for arm64, especially use of 
>> TLBI-is instruction which is a broadcast to all processor core on the system. 
> 
> Are you saying that for a microbenchmark in which very large numbers of
> threads are created and destroyed rapidly there are a large number of
> associated tlb range flushes which always use broadcast TLBIs?
> 
> If that's the case, and the hardware doesn't do any ASID filtering and
> each TLBI results in a DVM to every PE, would it make sense to look at
> whether there are ways to improve batching/switch to an IPI approach
> rather than relying on broadcasts, as a more generic solution?

What I meant was a heuristic to do this automatically, rather than via a
command line.

Jon.


* Re: [PATCH 0/2] arm64: Introduce boot parameter to disable TLB flush instruction within the same inner shareable domain
  2019-07-09  0:29           ` Jon Masters
@ 2019-07-09  8:03             ` Will Deacon
  0 siblings, 0 replies; 18+ messages in thread
From: Will Deacon @ 2019-07-09  8:03 UTC (permalink / raw)
  To: Jon Masters
  Cc: qi.fuli, Will Deacon, indou.takao, linux-doc, peterz,
	Catalin Marinas, Jonathan Corbet, linux-kernel, linux-arm-kernel

On Mon, Jul 08, 2019 at 08:29:26PM -0400, Jon Masters wrote:
> On 7/8/19 8:25 PM, Jon Masters wrote:
> > On 7/2/19 10:45 PM, qi.fuli@fujitsu.com wrote:
> > 
> >> However, we found that as the number of TLB flush calls increased,
> >> the noise was also increasing. Here we understood that the cause of this 
> >> issue is the implementation of Linux's TLB flush for arm64, especially use of 
> >> TLBI-is instruction which is a broadcast to all processor core on the system. 
> > 
> > Are you saying that for a microbenchmark in which very large numbers of
> > threads are created and destroyed rapidly there are a large number of
> > associated tlb range flushes which always use broadcast TLBIs?
> > 
> > If that's the case, and the hardware doesn't do any ASID filtering and
> > each TLBI results in a DVM to every PE, would it make sense to look at
> > whether there are ways to improve batching/switch to an IPI approach
> > rather than relying on broadcasts, as a more generic solution?
> 
> What I meant was a heuristic to do this automatically, rather than via a
> command line.

One of my main initial objections to this patch [1] still applies to that
approach, though, which is that I don't want the maintenance headache of
maintaining two very different TLB invalidation schemes in the kernel.
Dynamically switching between them is arguably worse. If "jitter" is such a
big deal, then I don't think changing our TLBI mechanism even helps on a
system that has broadcast cache maintenance (including for the I-side) as
well as shared levels of cache further from the CPUs -- it just happens to
solve the case of a spinning mprotect(), well yeah, maybe don't do that if
your hardware can't handle it gracefully.

What I would be interested in seeing is an evaluation of a real workload
that suffers due to our mmu_gather/tlb_flush implementation on arm64 so that
we can understand where the problem lies and whether or not we can do
something to address it. But "jitter is bad, use IPIs" isn't helpful at all.

Will

[1] https://lkml.kernel.org/r/20190617170328.GJ30800@fuggles.cambridge.arm.com


* Re: [PATCH 0/2] arm64: Introduce boot parameter to disable TLB flush instruction within the same inner shareable domain
  2019-07-03  2:45       ` qi.fuli
  2019-07-09  0:25         ` Jon Masters
@ 2019-07-09  8:07         ` Will Deacon
  1 sibling, 0 replies; 18+ messages in thread
From: Will Deacon @ 2019-07-09  8:07 UTC (permalink / raw)
  To: qi.fuli
  Cc: Will Deacon, indou.takao, linux-doc, peterz, Catalin Marinas,
	Jonathan Corbet, linux-kernel, linux-arm-kernel

On Wed, Jul 03, 2019 at 02:45:43AM +0000, qi.fuli@fujitsu.com wrote:
> We used FWQ [1] to run an experiment on one node of our HPC environment.
> We expected a maximum OS jitter of tens of microseconds, but it was
> hundreds of microseconds, which did not meet our requirement. We tried to
> find the cause using ftrace, but we could not find any process that would
> cause noise; we could only see that the processing time was extended. We
> then checked the CPU instruction count through the CPU PMU and did not
> find any change there either. However, we found that as the number of TLB
> flush calls increased, the noise also increased. From this we understood
> that the cause of this issue is the implementation of Linux's TLB flush
> for arm64, especially the use of the TLBI-is instruction, which is
> broadcast to all processor cores in the system. Therefore, we made this
> patch set to fix the issue. After testing several times, the noise was
> reduced and our original goal was achieved, so we do think this patch
> makes sense.
> 
> As I mentioned, OS jitter is a vital issue for large-scale HPC
> environments. We tried a lot of things to reduce it. One of them is task
> separation between the CPUs used for computing and the CPUs used for
> maintenance. All of the daemon processes and I/O interrupts are bound to
> the maintenance CPUs. Furthermore, we used nohz_full to avoid noise caused
> by interrupting the computing CPUs. However, all of the CPUs were still
> affected by the TLBI-is instruction, so the task separation did not work.
> Therefore, we would like TLB flushes to be done on the minimal set of
> CPUs, using this patch set, to reduce OS jitter.

So have you confirmed that this is due to TLBI traffic and not, for example,
stores sitting in remote store buffers that get flushed by the IPI or
something else like that? It feels like you're inferring things about the
underlying behaviour, whereas you should be in a position to simulate this
if nothing else.

If it *is* because of TLBI, then where are they coming from? Is FWQ calling
munmap/mprotect all the time? Why?
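
For reference, the noise generator described in the cover letter (a small
application that calls mprotect() continuously on a sibling core) could
look roughly like the sketch below. This is an assumption about its shape,
not the program that was actually run: the function name mprotect_noise(),
the single anonymous page and the iteration count are all illustrative.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1	/* for MAP_ANONYMOUS under strict -std= modes */
#endif
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Toggle the protection of one anonymous page `iterations` times.
 * Returns the number of successful mprotect() calls, or -1 if the
 * page size is unknown or the mapping could not be created.
 */
long mprotect_noise(long iterations)
{
	long page = sysconf(_SC_PAGESIZE);
	long ok = 0;
	char *p;

	if (page <= 0)
		return -1;

	p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	p[0] = 1;	/* touch the page so that a translation exists */

	for (long i = 0; i < iterations; i++) {
		/* Dropping write permission is expected to need a TLB flush. */
		if (mprotect(p, (size_t)page, PROT_READ) == 0)
			ok++;
		if (mprotect(p, (size_t)page, PROT_READ | PROT_WRITE) == 0)
			ok++;
		p[0] = 1;	/* write again so the page is used once more */
	}

	munmap(p, (size_t)page);
	return ok;
}
```

Pinning a caller of mprotect_noise() to one core while the measured
program runs on a sibling would approximate the setup from the cover
letter; whether FWQ itself behaves like this is exactly the open question.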

Will


* Re: [PATCH 1/2] arm64: mm: Restore mm_cpumask (revert commit 38d96287504a ("arm64: mm: kill mm_cpumask usage"))
  2019-06-17 14:32 ` [PATCH 1/2] arm64: mm: Restore mm_cpumask (revert commit 38d96287504a ("arm64: mm: kill mm_cpumask usage")) Takao Indoh
@ 2019-07-23 11:55   ` Catalin Marinas
  0 siblings, 0 replies; 18+ messages in thread
From: Catalin Marinas @ 2019-07-23 11:55 UTC (permalink / raw)
  To: Takao Indoh
  Cc: Jonathan Corbet, Will Deacon, linux-doc, linux-kernel,
	linux-arm-kernel, QI Fuli, Takao Indoh

Hi,

I know Will is on the case but just expressing some thoughts of my own.

On Mon, Jun 17, 2019 at 11:32:54PM +0900, Takao Indoh wrote:
> From: Takao Indoh <indou.takao@fujitsu.com>
> 
> mm_cpumask was deleted by the commit 38d96287504a ("arm64: mm: kill
> mm_cpumask usage") because it was not used at that time. Now this is needed
> to find appropriate CPUs for TLB flush, so this patch reverts this commit.
> 
> Signed-off-by: QI Fuli <qi.fuli@fujitsu.com>
> Signed-off-by: Takao Indoh <indou.takao@fujitsu.com>
> ---
>  arch/arm64/include/asm/mmu_context.h | 7 ++++++-
>  arch/arm64/kernel/smp.c              | 6 ++++++
>  arch/arm64/mm/context.c              | 2 ++
>  3 files changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/include/asm/mmu_context.h b/arch/arm64/include/asm/mmu_context.h
> index 2da3e478fd8f..21ef11590bcb 100644
> --- a/arch/arm64/include/asm/mmu_context.h
> +++ b/arch/arm64/include/asm/mmu_context.h
> @@ -241,8 +241,13 @@ static inline void
>  switch_mm(struct mm_struct *prev, struct mm_struct *next,
>  	  struct task_struct *tsk)
>  {
> -	if (prev != next)
> +	unsigned int cpu = smp_processor_id();
> +
> +	if (prev != next) {
>  		__switch_mm(next);
> +		cpumask_clear_cpu(cpu, mm_cpumask(prev));
> +		local_flush_tlb_mm(prev);
> +	}

That's not actually a revert as we've never flushed the TLBs on the
switch_mm() path. Also, this flush is not sufficient on a CnP capable
CPU since another thread of the same CPU could have the prev TTBR0_EL1
value set and load the entries back into the TLB.

-- 
Catalin


* Re: [PATCH 2/2] arm64: tlb: Add boot parameter to disable TLB flush within the same inner shareable domain
  2019-06-17 14:32 ` [PATCH 2/2] arm64: tlb: Add boot parameter to disable TLB flush within the same inner shareable domain Takao Indoh
@ 2019-07-23 12:11   ` Catalin Marinas
  0 siblings, 0 replies; 18+ messages in thread
From: Catalin Marinas @ 2019-07-23 12:11 UTC (permalink / raw)
  To: Takao Indoh
  Cc: Jonathan Corbet, Will Deacon, linux-doc, linux-kernel,
	linux-arm-kernel, QI Fuli, Takao Indoh

On Mon, Jun 17, 2019 at 11:32:55PM +0900, Takao Indoh wrote:
> From: Takao Indoh <indou.takao@fujitsu.com>
> 
> This patch adds new boot parameter 'disable_tlbflush_is' to disable TLB
> flush within the same inner shareable domain for performance tuning.
> 
> In the case of flush_tlb_mm() *without* this parameter, TLB entry is
> invalidated by __tlbi(aside1is, asid). By this instruction, all CPUs within
> the same inner shareable domain check if there are TLB entries which have
> this ASID, this causes performance noise, especially at large-scale HPC
> environment, which has more than thousand nodes with low latency
> interconnect.
> 
> When this new parameter is specified, TLB entry is invalidated by
> __tlbi(aside1, asid) only on the CPUs specified by mm_cpumask(mm).
> Therefore TLB flush is done on minimal CPUs and performance problem does
> not occur.
> 
> Signed-off-by: QI Fuli <qi.fuli@fujitsu.com>
> Signed-off-by: Takao Indoh <indou.takao@fujitsu.com>
[...]
> +void flush_tlb_mm(struct mm_struct *mm)
> +{
> +	if (disable_tlbflush_is)
> +		on_each_cpu_mask(mm_cpumask(mm), ipi_flush_tlb_mm,
> +				 (void *)mm, true);
> +	else
> +		__flush_tlb_mm(mm);
> +}

Could we try instead to call a _nosync variant here when
cpumask_weight() is 1 or the *IS if greater than 1 and avoid the IPI?

Will tried this in the past but because of the task placement after
fork()+execve(), I think we always ended up with a weight of 2 in the
child process. Your first patch "solves" this by flushing the TLBs on
context switch (bar the CnP case). Can you give it a try to see if it
improves things? At least it's a starting point for further
investigation.
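
The selection suggested above can be modelled in user space as a rough
sketch. None of this is kernel code: pick_flush(), enum flush_kind and the
64-bit mask standing in for mm_cpumask(mm) are hypothetical names, and the
extra check that the single user is the calling CPU is an added assumption.

```c
#include <stdint.h>

enum flush_kind {
	FLUSH_LOCAL,		/* non-broadcast TLBI on the only user */
	FLUSH_BROADCAST,	/* inner-shareable TLBI (the *IS form) */
};

/* Count set bits; stands in for the kernel's cpumask_weight(). */
static unsigned int mask_weight(uint64_t mask)
{
	unsigned int n = 0;

	for (; mask; mask &= mask - 1)
		n++;
	return n;
}

/*
 * Pick the invalidation flavour for an address space whose users are
 * recorded in mm_mask (bit n set: CPU n has run it). No IPI is sent
 * in either case.
 */
enum flush_kind pick_flush(uint64_t mm_mask, unsigned int this_cpu)
{
	/* Only this CPU has used the mm: a local flush is enough. */
	if (mask_weight(mm_mask) == 1 &&
	    (mm_mask & (UINT64_C(1) << this_cpu)))
		return FLUSH_LOCAL;

	/* Otherwise fall back to the broadcast instruction. */
	return FLUSH_BROADCAST;
}
```

With a weight that always ends up at 2 after fork()+execve(), this model
would pick FLUSH_BROADCAST every time, which is the limitation noted above.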

I fully agree with Will that we don't want two different TLB handling
implementations in the arm64 kernel and even less desirable to have a
command line option.

Thanks.

-- 
Catalin


* Re: [PATCH 0/2] arm64: Introduce boot parameter to disable TLB flush instruction within the same inner shareable domain
  2019-06-17 14:32 [PATCH 0/2] arm64: Introduce boot parameter to disable TLB flush instruction within the same inner shareable domain Takao Indoh
                   ` (2 preceding siblings ...)
  2019-06-17 17:03 ` [PATCH 0/2] arm64: Introduce boot parameter to disable TLB flush instruction " Will Deacon
@ 2019-11-01  9:56 ` qi.fuli
  2019-11-01 17:28   ` Will Deacon
  3 siblings, 1 reply; 18+ messages in thread
From: qi.fuli @ 2019-11-01  9:56 UTC (permalink / raw)
  To: Jonathan Corbet, Catalin Marinas, Will Deacon, Itaru Kitayama,
	peterz, Jon Masters
  Cc: linux-doc, linux-kernel, linux-arm-kernel, qi.fuli, indou.takao,
	maeda.naoaki, misono.tomohiro, tokamoto

Hi,

First of all, thanks for the comments on the patch.

I'm still struggling to find a solution to this problem.
After further investigation, I think it is necessary to improve the
kernel's TLB flush mechanism to fix this problem completely.

So, I'd like to restart the discussion. First I will summarize the
problem to recall what it was, and then I want to discuss how to fix it.

Summary of the problem:
A few months ago I proposed patches to solve a performance problem due
to TLB flush.[1]

The problem is that a TLB flush on one core affects all other cores even
if they do not need an actual flush, and this causes performance
degradation.

In this thread, I explained that:
* I found a performance problem which is caused by the TLBI-is instruction.
* The problem occurs like this:
  1) On a core, the OS tries to flush the TLB using the TLBI-is instruction
  2) The TLBI-is instruction causes a broadcast to all other cores, and
     each core receives a hard-wired signal
  3) Each core checks whether there are TLB entries which have the
     specified ASID/VA
  4) This check causes performance degradation
* We ran FWQ[2] and detected OS jitter due to this problem; this noise
  is serious for HPC usage.

Here, noise means the difference between the maximum and the minimum time
that the same work takes.
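
As a small illustration of this definition, the noise can be computed from
a set of per-run durations as sketched below. The function name
fwq_noise_ns() and the use of nanoseconds are assumptions made for the
example; this is not code taken from FWQ.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Noise as defined above: the difference between the longest and the
 * shortest time taken by the same fixed amount of work.
 * `samples` holds one duration per run, in nanoseconds.
 * Returns 0 for an empty sample set.
 */
uint64_t fwq_noise_ns(const uint64_t *samples, size_t n)
{
	uint64_t min, max;

	if (n == 0)
		return 0;

	min = max = samples[0];
	for (size_t i = 1; i < n; i++) {
		if (samples[i] < min)
			min = samples[i];
		if (samples[i] > max)
			max = samples[i];
	}
	return max - min;
}
```

For the numbers in the cover letter (39ms undisturbed, 49ms with the
mprotect() loop running), this definition gives a noise of 10ms.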

How to fix:
I think the cause is TLB flush by TLBI-is because the instruction 
affects cores that are not related to its flush.

So the previous patch I posted is
* Use mm_cpumask in mm_struct to find appropriate CPUs for TLB flush
* Exec TLBI instead of TLBI-is only to CPUs specified by mm_cpumask
  (This is the same behavior as arm32 and x86)
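
To make the difference concrete, here is a toy user-space model of the two
approaches. It is only a sketch under stated assumptions: struct toy_mm,
the toy_*() helpers and NR_CPUS = 64 are hypothetical, the mask stands in
for mm_cpumask, and the broadcast case assumes that the hardware does not
filter TLBI-is at all.

```c
#include <stdint.h>

#define NR_CPUS 64	/* size of the modelled system (assumption) */

/* Toy stand-in for mm_struct: only the CPU mask is modelled. */
struct toy_mm {
	uint64_t cpumask;	/* bit n set: CPU n may hold TLB entries */
};

/* Called when a CPU (0..NR_CPUS-1) starts running the address space. */
void toy_switch_in(struct toy_mm *mm, unsigned int cpu)
{
	mm->cpumask |= UINT64_C(1) << cpu;
}

/* Called when a CPU stops running it and has flushed its own TLB. */
void toy_switch_out(struct toy_mm *mm, unsigned int cpu)
{
	mm->cpumask &= ~(UINT64_C(1) << cpu);
}

/* Number of CPUs disturbed by one flush of this address space. */
unsigned int toy_cpus_disturbed(const struct toy_mm *mm, int broadcast)
{
	unsigned int n = 0;

	if (broadcast)
		return NR_CPUS;	/* unfiltered TLBI-is reaches every CPU */

	for (uint64_t m = mm->cpumask; m; m &= m - 1)
		n++;		/* one IPI plus local TLBI per set bit */
	return n;
}
```

In this model a process that has only run on two cores disturbs 2 CPUs
with the mask-based flush and all 64 with the broadcast; the model says
nothing about the cost of the IPIs themselves.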

After the discussion about this patch, I got the following comments.
1) This patch switches the behavior (original flush by TLBI-is or new
flush by TLBI) with a boot parameter; this implementation is not
acceptable due to poor maintainability.
2) Even if this patch fixes this problem, it may cause another
performance problem.

I'd like to start the implementation over with these points in mind.
For the second comment above, I will run a benchmark test to analyze the
impact on performance.
Please let me know if there are other points I should take into
consideration.

[1] https://lkml.org/lkml/2019/6/17/703
[2] https://asc.llnl.gov/sequoia/benchmarks/FTQ_summary_v1.1.pdf

Thanks,
QI Fuli

On 6/17/19 11:32 PM, Takao Indoh wrote:
> From: Takao Indoh <indou.takao@fujitsu.com>
> 
> I found a performance issue related to the implementation of Linux's TLB
> flush for arm64.
> 
> When I run a single-threaded test program on a moderate environment, it
> usually takes 39ms to finish its work. However, when I put a small
> application, which just calls mprotect() continuously, on one of the sibling
> cores and run it simultaneously, the test program slows down significantly.
> It becomes 49ms (125%) on ThunderX2. I also detected the same problem on
> ThunderX1 and Fujitsu A64FX.
> 
> I suppose the root cause of this issue is the implementation of Linux's TLB
> flush for arm64, especially use of TLBI-is instruction which is a broadcast
> to all processor core on the system. In case of the above situation,
> TLBI-is is called by mprotect().
> 
> This is not a problem for small environment, but this causes a significant
> performance noise for large-scale HPC environment, which has more than
> thousand nodes with low latency interconnect.
> 
> To fix this problem, this patch adds new boot parameter
> 'disable_tlbflush_is'.  In the case of flush_tlb_mm() *without* this
> parameter, TLB entry is invalidated by __tlbi(aside1is, asid). By this
> instruction, all CPUs within the same inner shareable domain check if there
> are TLB entries which have this ASID, this causes performance noise. OTOH,
> when this new parameter is specified, TLB entry is invalidated by
> __tlbi(aside1, asid) only on the CPUs specified by mm_cpumask(mm).
> Therefore TLB flush is done on minimal CPUs and performance problem does
> not occur. Actually I confirm the performance problem is fixed by this
> patch.
> 
> Takao Indoh (2):
>    arm64: mm: Restore mm_cpumask (revert commit 38d96287504a ("arm64: mm:
>      kill mm_cpumask usage"))
>    arm64: tlb: Add boot parameter to disable TLB flush within the same
>      inner shareable domain
> 
>   .../admin-guide/kernel-parameters.txt         |   4 +
>   arch/arm64/include/asm/mmu_context.h          |   7 +-
>   arch/arm64/include/asm/tlbflush.h             |  61 ++-----
>   arch/arm64/kernel/Makefile                    |   2 +-
>   arch/arm64/kernel/smp.c                       |   6 +
>   arch/arm64/kernel/tlbflush.c                  | 155 ++++++++++++++++++
>   arch/arm64/mm/context.c                       |   2 +
>   7 files changed, 186 insertions(+), 51 deletions(-)
>   create mode 100644 arch/arm64/kernel/tlbflush.c
> 


* Re: [PATCH 0/2] arm64: Introduce boot parameter to disable TLB flush instruction within the same inner shareable domain
  2019-11-01  9:56 ` qi.fuli
@ 2019-11-01 17:28   ` Will Deacon
  2019-11-26 14:26     ` Matthias Brugger
  2019-12-01 16:02     ` Jon Masters
  0 siblings, 2 replies; 18+ messages in thread
From: Will Deacon @ 2019-11-01 17:28 UTC (permalink / raw)
  To: qi.fuli
  Cc: Jonathan Corbet, Catalin Marinas, Will Deacon, Itaru Kitayama,
	peterz, Jon Masters, linux-doc, linux-kernel, linux-arm-kernel,
	indou.takao, maeda.naoaki, misono.tomohiro, tokamoto

Hi,

[please note that my email address has changed and the old one doesn't work
 any more]

On Fri, Nov 01, 2019 at 09:56:05AM +0000, qi.fuli@fujitsu.com wrote:
> First of all thanks for the comments for the patch.
> 
> I'm still struggling with this problem to find out the solution.
> As a result of an investigation on this problem, after all, I think it 
> is necessary to improve TLB flush mechanism of the kernel to fix this 
> problem completely.
> 
> So, I'd like to restart a discussion. At first, I summarize this problem 
> to recall what was the problem and then I want to discuss how to fix it.
> 
> Summary of the problem:
> A few months ago I proposed patches to solve a performance problem due 
> to TLB flush.[1]
> 
> A problem is that TLB flush on a core affects all other cores even if 
> all other cores do not need actual flush, and it causes performance 
> degradation.
> 
> In this thread, I explained that:
> * I found a performance problem which is caused by TLBI-is instruction.
> * The problem occurs like this:
>   1) On a core, OS tries to flush TLB using TLBI-is instruction
>   2) TLBI-is instruction causes a broadcast to all other cores, and
>   each core received hard-wired signal
>   3) Each core check if there are TLB entries which have the specified 
> ASID/VA

For those following along at home, my understanding is that this "check"
effectively stalls the pipeline as though it is being performed in software.

Some questions:

Does this mean a malicious virtual machine can effectively DoS the system?
What about a malicious application calling mprotect()?

Do all broadcast TLBI instructions cause this expensive check, or are
some significantly slower than others?

>   4) This check causes performance degradation
> * We ran FWQ[2] and detected OS jitter due to this problem, this noise
>   is serious for HPC usage.
> 
> The noise means here a difference between maximum time and minimum time 
> which the same work takes.
> 
> How to fix:
> I think the cause is TLB flush by TLBI-is because the instruction 
> affects cores that are not related to its flush.

Does broadcast I-cache maintenance cause the same problem?

> So the previous patch I posted is
> * Use mm_cpumask in mm_struct to find appropriate CPUs for TLB flush
> * Exec TLBI instead of TLBI-is only to CPUs specified by mm_cpumask
>   (This is the same behavior as arm32 and x86)
> 
> And after the discussion about this patch, I got the following comments.
> 1) This patch switches the behavior (original flush by TLBI-is and new 
> flush by TLBI) by boot parameter, this implementation is not acceptable 
> due to bad maintainability.
> 2) Even if this patch fixes this problem, it may cause another 
> performance problem.
> 
> I'd like to start over the implementation by considering these points.
> For the second comment above, I will run a benchmark test to analyze the 
> impact on performance.
> Please let me know if there are other points I should take into 
> consideration.

I think it's worth bearing in mind that I have little sympathy for the
problem that you are seeing. As far as I can tell, you've done the
following:

  1. You designed a CPU micro-architecture that stalls whenever it receives
     a TLB invalidation request.

  2. You integrated said CPU design into a system where broadcast TLB
     invalidation is not filtered and therefore stalls every CPU every
     time that /any/ TLB invalidation is broadcast.

  3. You deployed a mixture of Linux and jitter-sensitive software on
     this system, and now you're failing to meet your performance
     requirements.

Have I got that right?

If so, given that your CPU design isn't widely available, nobody else
appears to have made this mistake and jitter hasn't been reported as an
issue for any other systems, it's very unlikely that we're going to make
invasive upstream kernel changes to support you. I'm sorry, but all I can
suggest is that you check that your micro-architecture and performance
requirements are aligned with the design of Linux *before* building another
machine like this in future.

I hate to be blunt, but I also don't want to waste your time.

Thanks,

Will


* Re: [PATCH 0/2] arm64: Introduce boot parameter to disable TLB flush instruction within the same inner shareable domain
  2019-11-01 17:28   ` Will Deacon
@ 2019-11-26 14:26     ` Matthias Brugger
  2019-11-26 14:36       ` Will Deacon
  2019-12-01 16:02     ` Jon Masters
  1 sibling, 1 reply; 18+ messages in thread
From: Matthias Brugger @ 2019-11-26 14:26 UTC (permalink / raw)
  To: Will Deacon, qi.fuli
  Cc: tokamoto, Jon Masters, Jonathan Corbet, peterz, Catalin Marinas,
	linux-doc, Will Deacon, linux-kernel, maeda.naoaki,
	misono.tomohiro, Itaru Kitayama, linux-arm-kernel, indou.takao,
	Robert Richter



On 01/11/2019 18:28, Will Deacon wrote:
> Hi,
> 
> [please note that my email address has changed and the old one doesn't work
>  any more]
> 
> On Fri, Nov 01, 2019 at 09:56:05AM +0000, qi.fuli@fujitsu.com wrote:
>> First of all thanks for the comments for the patch.
>>
>> I'm still struggling with this problem to find out the solution.
>> As a result of an investigation on this problem, after all, I think it 
>> is necessary to improve TLB flush mechanism of the kernel to fix this 
>> problem completely.
>>
>> So, I'd like to restart a discussion. At first, I summarize this problem 
>> to recall what was the problem and then I want to discuss how to fix it.
>>
>> Summary of the problem:
>> A few months ago I proposed patches to solve a performance problem due 
>> to TLB flush.[1]
>>
>> A problem is that TLB flush on a core affects all other cores even if 
>> all other cores do not need actual flush, and it causes performance 
>> degradation.
>>
>> In this thread, I explained that:
>> * I found a performance problem which is caused by TLBI-is instruction.
>> * The problem occurs like this:
>>   1) On a core, OS tries to flush TLB using TLBI-is instruction
>>   2) TLBI-is instruction causes a broadcast to all other cores, and
>>   each core received hard-wired signal
>>   3) Each core check if there are TLB entries which have the specified 
>> ASID/VA
> 
> For those following along at home, my understanding is that this "check"
> effectively stalls the pipeline as though it is being performed in software.
> 
> Some questions:
> 
> Does this mean a malicious virtual machine can effectively DoS the system?
> What about a malicious application calling mprotect()?
> 
> Do all broadcast TLBI instructions cause this expensive check, or are
> some significantly slower than others?
> 
>>   4) This check causes performance degradation
>> * We ran FWQ[2] and detected OS jitter due to this problem, this noise
>>   is serious for HPC usage.
>>
>> The noise means here a difference between maximum time and minimum time 
>> which the same work takes.
>>
>> How to fix:
>> I think the cause is TLB flush by TLBI-is because the instruction 
>> affects cores that are not related to its flush.
> 
> Does broadcast I-cache maintenance cause the same problem?
> 
>> So the previous patch I posted is
>> * Use mm_cpumask in mm_struct to find appropriate CPUs for TLB flush
>> * Exec TLBI instead of TLBI-is only to CPUs specified by mm_cpumask
>>   (This is the same behavior as arm32 and x86)
>>
>> And after the discussion about this patch, I got the following comments.
>> 1) This patch switches the behavior (original flush by TLBI-is and new 
>> flush by TLBI) by boot parameter, this implementation is not acceptable 
>> due to bad maintainability.
>> 2) Even if this patch fixes this problem, it may cause another 
>> performance problem.
>>
>> I'd like to start over the implementation by considering these points.
>> For the second comment above, I will run a benchmark test to analyze the 
>> impact on performance.
>> Please let me know if there are other points I should take into 
>> consideration.
> 
> I think it's worth bearing in mind that I have little sympathy for the
> problem that you are seeing. As far as I can tell, you've done the
> following:
> 
>   1. You designed a CPU micro-architecture that stalls whenever it receives
>      a TLB invalidation request.
> 
>   2. You integrated said CPU design into a system where broadcast TLB
>      invalidation is not filtered and therefore stalls every CPU every
>      time that /any/ TLB invalidation is broadcast.
> 
>   3. You deployed a mixture of Linux and jitter-sensitive software on
>      this system, and now you're failing to meet your performance
>      requirements.
> 
> Have I got that right?
> 
> If so, given that your CPU design isn't widely available, nobody else
> appears to have made this mistake and jitter hasn't been reported as an
> issue for any other systems, it's very unlikely that we're going to make
> invasive upstream kernel changes to support you. I'm sorry, but all I can
> suggest is that you check that your micro-architecture and performance
> requirements are aligned with the design of Linux *before* building another
> machine like this in future.
> 

I just wanted to note that the cover letter states that they have also seen
this on ThunderX1 and ThunderX2.

Not sure about other machines, like the Huawei TaiShan 200 series.

What I want to say is that this does not seem to affect only Fujitsu, but
also other vendors. So maybe we should consider adding an erratum like the
one for the repeated TLBI on Qualcomm SoCs.

Regards,
Matthias

> I hate to be blunt, but I also don't want to waste your time.
> 
> Thanks,
> 
> Will


* Re: [PATCH 0/2] arm64: Introduce boot parameter to disable TLB flush instruction within the same inner shareable domain
  2019-11-26 14:26     ` Matthias Brugger
@ 2019-11-26 14:36       ` Will Deacon
  0 siblings, 0 replies; 18+ messages in thread
From: Will Deacon @ 2019-11-26 14:36 UTC (permalink / raw)
  To: Matthias Brugger
  Cc: qi.fuli, tokamoto, Jon Masters, Jonathan Corbet, peterz,
	Catalin Marinas, linux-doc, Will Deacon, linux-kernel,
	maeda.naoaki, misono.tomohiro, Itaru Kitayama, linux-arm-kernel,
	indou.takao, Robert Richter

On Tue, Nov 26, 2019 at 03:26:48PM +0100, Matthias Brugger wrote:
> On 01/11/2019 18:28, Will Deacon wrote:
> > On Fri, Nov 01, 2019 at 09:56:05AM +0000, qi.fuli@fujitsu.com wrote:
> >> First of all, thank you for the comments on the patch.
> >>
> >> I am still trying to find a solution to this problem.
> >> After investigating it, I think the kernel's TLB flush mechanism needs
> >> to be improved to fix the problem completely.
> >>
> >> So I'd like to restart the discussion. First I will summarize the problem
> >> to recall what it was, and then I want to discuss how to fix it.
> >>
> >> Summary of the problem:
> >> A few months ago I proposed patches to solve a performance problem caused
> >> by TLB flushes.[1]
> >>
> >> The problem is that a TLB flush on one core affects all other cores, even
> >> if those cores do not actually need the flush, and this degrades
> >> performance.
> >>
> >> In this thread, I explained that:
> >> * I found a performance problem which is caused by the TLBI-is instruction.
> >> * The problem occurs like this:
> >>   1) On a core, the OS tries to flush the TLB using the TLBI-is instruction
> >>   2) The TLBI-is instruction causes a broadcast to all other cores, and
> >>   each core receives a hard-wired signal
> >>   3) Each core checks whether it has TLB entries with the specified
> >> ASID/VA
> > 
> > For those following along at home, my understanding is that this "check"
> > effectively stalls the pipeline as though it is being performed in software.
> > 
> > Some questions:
> > 
> > Does this mean a malicious virtual machine can effectively DoS the system?
> > What about a malicious application calling mprotect()?
> > 
> > Do all broadcast TLBI instructions cause this expensive check, or are
> > some significantly slower than others?
> > 
> >>   4) This check causes performance degradation
> >> * We ran FWQ[2] and detected OS jitter due to this problem; this noise
> >>   is serious for HPC usage.
> >>
> >> Here, "noise" means the difference between the maximum and minimum time
> >> that the same work takes.
> >>
> >> How to fix:
> >> I think the cause is the TLB flush by TLBI-is, because the instruction
> >> affects cores that are unrelated to the flush.
> > 
> > Does broadcast I-cache maintenance cause the same problem?
> > 
> >> So the previous patch I posted does the following:
> >> * Use mm_cpumask in mm_struct to find the appropriate CPUs for the TLB flush
> >> * Execute TLBI instead of TLBI-is, only on the CPUs specified by mm_cpumask
> >>   (This is the same behavior as arm32 and x86)
> >>
> >> After the discussion about this patch, I received the following comments:
> >> 1) The patch switches the behavior (original flush by TLBI-is or new
> >> flush by TLBI) with a boot parameter; this implementation is not acceptable
> >> because it is hard to maintain.
> >> 2) Even if the patch fixes this problem, it may cause another
> >> performance problem.
> >>
> >> I'd like to start the implementation over with these points in mind.
> >> For the second comment above, I will run a benchmark to analyze the
> >> impact on performance.
> >> Please let me know if there are other points I should take into
> >> consideration.
> > 
> > I think it's worth bearing in mind that I have little sympathy for the
> > problem that you are seeing. As far as I can tell, you've done the
> > following:
> > 
> >   1. You designed a CPU micro-architecture that stalls whenever it receives
> >      a TLB invalidation request.
> > 
> >   2. You integrated said CPU design into a system where broadcast TLB
> >      invalidation is not filtered and therefore stalls every CPU every
> >      time that /any/ TLB invalidation is broadcast.
> > 
> >   3. You deployed a mixture of Linux and jitter-sensitive software on
> >      this system, and now you're failing to meet your performance
> >      requirements.
> > 
> > Have I got that right?
> > 
> > If so, given that your CPU design isn't widely available, nobody else
> > appears to have made this mistake and jitter hasn't been reported as an
> > issue for any other systems, it's very unlikely that we're going to make
> > invasive upstream kernel changes to support you. I'm sorry, but all I can
> > suggest is that you check that your micro-architecture and performance
> > requirements are aligned with the design of Linux *before* building another
> > machine like this in future.
> > 
> 
> I just wanted to note that the cover letter states that they have also seen this
> on ThunderX1 and ThunderX2.
> 
> Not sure about other machines, like the Huawei TaiShan 200 series.
> 
> What I want to say is that this does not seem to affect only Fujitsu but
> other vendors as well. So maybe we should consider adding an erratum like the one
> for the repeated TLBI on Qualcomm SoCs.

Careful here -- we're talking about a reported performance issue, not a
correctness one. The "repeated TLBI" sequence is very much a workaround for
the latter.

In the case of TX1/TX2, I can imagine the "let's sit in a loop of mprotect()
calls" scaling poorly, which is what the cover letter is referring to, but
that's not really a workload that we need to optimise for. However, the case
that Fujitsu are reporting seems to go beyond that because of the design of
their CPU micro-architecture, where even just a single TLB invalidation
message stalls all of the other CPUs in the system. I don't have any reason
to believe that particular problem affects other CPU designs.

Thanks,

Will
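
For readers who want to reproduce the scenario discussed above, the following
is a minimal sketch in C, not code from this thread. mprotect_loop() imitates
the small application from the cover letter that calls mprotect() continuously,
and measure_noise_ns() applies the definition of noise quoted above (maximum
time minus minimum time for the same work). Both function names, the
single-page buffer and the busy-loop workload are assumptions made for
illustration; FWQ itself is a separate benchmark and is not reproduced here.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical interfering program: toggle the protection of one anonymous
 * page `iterations` times. Returns the number of successful mprotect()
 * calls, or -1 if the mapping could not be created.
 */
long mprotect_loop(long iterations)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);

    void *buf = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;

    /* Touch the page so that a populated mapping exists to be changed. */
    *(volatile char *)buf = 1;

    long ok = 0;
    for (long i = 0; i < iterations; i++) {
        /* Alternate between read-only and read-write. */
        int prot = (i & 1) ? (PROT_READ | PROT_WRITE) : PROT_READ;
        if (mprotect(buf, (size_t)page, prot) == 0)
            ok++;
    }

    munmap(buf, (size_t)page);
    return ok;
}

/*
 * Hypothetical measurement: time `samples` runs of the same fixed amount of
 * work and return (max - min) in nanoseconds, or -1 if samples <= 0.
 */
int64_t measure_noise_ns(int samples, long work)
{
    if (samples <= 0)
        return -1;

    int64_t min = INT64_MAX, max = 0;
    volatile uint64_t sink = 0;   /* keeps the loop from being optimised out */

    for (int s = 0; s < samples; s++) {
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < work; i++)
            sink += (uint64_t)i;
        clock_gettime(CLOCK_MONOTONIC, &t1);

        int64_t ns = (int64_t)(t1.tv_sec - t0.tv_sec) * 1000000000LL
                     + (int64_t)(t1.tv_nsec - t0.tv_nsec);
        if (ns < min)
            min = ns;
        if (ns > max)
            max = ns;
    }
    return max - min;
}
```

Running mprotect_loop() on one core while measure_noise_ns() runs pinned to
another (for example with taskset) would be one way to compare the noise with
and without the interfering loop; the absolute numbers depend entirely on the
machine.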

* Re: [PATCH 0/2] arm64: Introduce boot parameter to disable TLB flush instruction within the same inner shareable domain
  2019-11-01 17:28   ` Will Deacon
  2019-11-26 14:26     ` Matthias Brugger
@ 2019-12-01 16:02     ` Jon Masters
  1 sibling, 0 replies; 18+ messages in thread
From: Jon Masters @ 2019-12-01 16:02 UTC (permalink / raw)
  To: Will Deacon, qi.fuli
  Cc: Jonathan Corbet, Catalin Marinas, Will Deacon, Itaru Kitayama,
	peterz, linux-doc, linux-kernel, linux-arm-kernel, indou.takao,
	maeda.naoaki, misono.tomohiro, tokamoto

On 11/1/19 1:28 PM, Will Deacon wrote:

> On Fri, Nov 01, 2019 at 09:56:05AM +0000, qi.fuli@fujitsu.com wrote:

>> In this thread, I explained that:
>> * I found a performance problem which is caused by the TLBI-is instruction.
>> * The problem occurs like this:
>>    1) On a core, the OS tries to flush the TLB using the TLBI-is instruction
>>    2) The TLBI-is instruction causes a broadcast to all other cores, and
>>    each core receives a hard-wired signal
>>    3) Each core checks whether it has TLB entries with the specified
>> ASID/VA

(the above confuses implementation with architecture)

<snip>

> I think it's worth bearing in mind that I have little sympathy for the
> problem that you are seeing. As far as I can tell, you've done the
> following:
> 
>    1. You designed a CPU micro-architecture that stalls whenever it receives
>       a TLB invalidation request.

s/SPARC/Arm/ && wire in DVM

>    2. You integrated said CPU design into a system where broadcast TLB
>       invalidation is not filtered and therefore stalls every CPU every
>       time that /any/ TLB invalidation is broadcast.
> 
>    3. You deployed a mixture of Linux and jitter-sensitive software on
>       this system, and now you're failing to meet your performance
>       requirements.
> 
> Have I got that right?
> 
> If so, given that your CPU design isn't widely available, nobody else
> appears to have made this mistake and jitter hasn't been reported as an
> issue for any other systems, it's very unlikely that we're going to make
> invasive upstream kernel changes to support you. I'm sorry, but all I can
> suggest is that you check that your micro-architecture and performance
> requirements are aligned with the design of Linux *before* building another
> machine like this in future.
> 
> I hate to be blunt, but I also don't want to waste your time.

I always tried to ask nicely for the above to be heeded. There's a 
difference between "hi, our implementation doesn't scale, and here's 
why" vs "there's a problem with all TLBIs...". There isn't. The problem 
is the implementation and that should have been called out first thing.

Jon.

-- 
Computer Architect

end of thread, other threads:[~2019-12-01 19:17 UTC | newest]

Thread overview: 18+ messages
2019-06-17 14:32 [PATCH 0/2] arm64: Introduce boot parameter to disable TLB flush instruction within the same inner shareable domain Takao Indoh
2019-06-17 14:32 ` [PATCH 1/2] arm64: mm: Restore mm_cpumask (revert commit 38d96287504a ("arm64: mm: kill mm_cpumask usage")) Takao Indoh
2019-07-23 11:55   ` Catalin Marinas
2019-06-17 14:32 ` [PATCH 2/2] arm64: tlb: Add boot parameter to disable TLB flush within the same inner shareable domain Takao Indoh
2019-07-23 12:11   ` Catalin Marinas
2019-06-17 17:03 ` [PATCH 0/2] arm64: Introduce boot parameter to disable TLB flush instruction " Will Deacon
2019-06-24 10:34   ` qi.fuli
2019-06-27 10:27     ` Will Deacon
2019-07-03  2:45       ` qi.fuli
2019-07-09  0:25         ` Jon Masters
2019-07-09  0:29           ` Jon Masters
2019-07-09  8:03             ` Will Deacon
2019-07-09  8:07         ` Will Deacon
2019-11-01  9:56 ` qi.fuli
2019-11-01 17:28   ` Will Deacon
2019-11-26 14:26     ` Matthias Brugger
2019-11-26 14:36       ` Will Deacon
2019-12-01 16:02     ` Jon Masters
