* [PATCH 0/3] arm64: tlb: skip tlbi broadcast v2
@ 2020-02-23 19:25 Andrea Arcangeli
  2020-02-23 19:25 ` [PATCH 1/3] mm: use_mm: fix for arches checking mm_users to optimize TLB flushes Andrea Arcangeli
                   ` (4 more replies)
  0 siblings, 5 replies; 15+ messages in thread
From: Andrea Arcangeli @ 2020-02-23 19:25 UTC (permalink / raw)
  To: Will Deacon, Catalin Marinas, Rafael Aquini, Mark Salter
  Cc: Jon Masters, linux-kernel, linux-mm, linux-arm-kernel,
	Michal Hocko, QI Fuli

Hello,

This introduces an nr_active_mm counter that allows the tlbi
broadcast to be optimized away also for multithreaded processes; it
no longer relies on mm_users <= 1.

This also optimizes away all TLB flushes (including local ones) when
the process is not running on any CPU (including during exit_mmap
with lazy tlb state).

This optimization is generally only observable when there are
parallel TLB flushes from different processes on multiple CPUs. One
possible use case is a userland malloc library freeing small objects
with MADV_DONTNEED, causing frequent tiny TLB flushes as demonstrated
by the tcmalloc testsuite.

Memory intensive apps dealing with a multitude of frequently freed
small objects tend to opt out of glibc malloc in favor of jemalloc or
tcmalloc, so this should improve the SMP/NUMA scalability of long
lived apps with small objects running in different containers, if
they issue frequent MADV_DONTNEED TLB flushes while the other threads
of the process are not running.
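
As a concrete illustration of that workload (a minimal userland
example, not code taken from tcmalloc), an allocator returning a
freed span to the kernel does something like the following, and every
such call ends up in the arm64 TLB flush paths touched by this
series:

/* Minimal example of the MADV_DONTNEED pattern used by malloc libraries. */
#include <stdlib.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 2 * 1024 * 1024;
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;
	p[0] = 1;	/* fault in at least one page */

	/*
	 * Drop the backing pages but keep the VMA reserved: this zaps
	 * the page range and issues the TLB flush discussed above.
	 */
	if (madvise(p, len, MADV_DONTNEED))
		return 1;
	return 0;
}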

It was suggested that I implement the mm_cpumask the standard way in
order to optimize multithreaded apps too and to avoid restricting the
optimization to mm_users <= 1. So initially I had two bitmasks
allocated as shown at the bottom of this cover letter, by setting
ARCH_NR_MM_CPUMASK to 2 with the patch below applied... however I
figured a single atomic counter per mm achieves the exact same
runtime behavior as the extra bitmap, so I dropped the extra bitmap
and replaced it with nr_active_mm as an optimization.
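
For reference, the accounting boils down to roughly the following (a
simplified sketch of the switch_mm()/enter_lazy_tlb() changes in
patch 3/3 below, not the literal code):

/*
 * Simplified sketch: nr_active_mm counts how many CPUs currently have
 * the mm loaded outside lazy tlb mode (see patch 3/3 for the real code).
 */
static inline void sketch_switch_mm(struct mm_struct *prev,
				    struct mm_struct *next,
				    unsigned int cpu)
{
	if (!per_cpu(cpu_not_lazy_tlb, cpu)) {
		/* this CPU leaves lazy tlb mode and starts using "next" */
		per_cpu(cpu_not_lazy_tlb, cpu) = true;
		atomic_inc(&next->context.nr_active_mm);
	} else if (prev != next) {
		/* regular switch: move the active count from prev to next */
		atomic_inc(&next->context.nr_active_mm);
		atomic_dec(&prev->context.nr_active_mm);
	}
}

A TLB flush can then skip the tlbi broadcast when the flushing CPU is
the only one with the mm loaded (local flush only) or when no CPU has
it loaded at all (no flush needed), with mm_cpumask() remembering
which CPUs still hold a stale ASID and must do a local ASID flush the
next time they load the mm.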

If the atomic ops in the switch_mm fast path are a concern (they're
still faster than cpumask_set_cpu/cpumask_clear_cpu with fewer than
256-512 CPUs), it's worth mentioning that it would be possible to
remove all atomic ops from the switch_mm fast path by restricting
this optimization to single threaded processes: checking mm_users
<= 1 and < 1 instead of nr_active_mm <= 1 and < 1, similarly to what
the earlier version of this patchset did.

Thanks,
Andrea

Andrea Arcangeli (3):
  mm: use_mm: fix for arches checking mm_users to optimize TLB flushes
  arm64: select CPUMASK_OFFSTACK if NUMA
  arm64: tlb: skip tlbi broadcast

 arch/arm64/Kconfig                   |  1 +
 arch/arm64/include/asm/efi.h         |  2 +-
 arch/arm64/include/asm/mmu.h         |  4 +-
 arch/arm64/include/asm/mmu_context.h | 33 ++++++++--
 arch/arm64/include/asm/tlbflush.h    | 95 +++++++++++++++++++++++++++-
 arch/arm64/mm/context.c              | 54 ++++++++++++++++
 mm/mmu_context.c                     |  2 +
 7 files changed, 180 insertions(+), 11 deletions(-)

Early attempt with the standard mm_cpumask follows:

From: Andrea Arcangeli <aarcange@redhat.com>
Subject: mm: allow per-arch mm_cpumasks based on ARCH_NR_MM_CPUMASK

Allow archs to allocate multiple mm_cpumasks in the mm_struct
per-arch by defining an ARCH_NR_MM_CPUMASK > 1 (to be included before
"linux/mm_types.h").

Those extra per-mm cpumasks can be referenced with
__mm_cpumask(N, mm), where N == 0 points to the mm_cpumask()
known by the common code and N > 0 points to the per-arch private
ones.
---
 drivers/firmware/efi/efi.c |  3 ++-
 include/linux/mm_types.h   | 17 +++++++++++++++--
 kernel/fork.c              |  3 ++-
 mm/init-mm.c               |  2 +-
 4 files changed, 20 insertions(+), 5 deletions(-)

diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index 5da0232ae33f..608c9bf181e5 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -86,7 +86,8 @@ struct mm_struct efi_mm = {
 	.mmap_sem		= __RWSEM_INITIALIZER(efi_mm.mmap_sem),
 	.page_table_lock	= __SPIN_LOCK_UNLOCKED(efi_mm.page_table_lock),
 	.mmlist			= LIST_HEAD_INIT(efi_mm.mmlist),
-	.cpu_bitmap		= { [BITS_TO_LONGS(NR_CPUS)] = 0},
+	.cpu_bitmap		= { [BITS_TO_LONGS(NR_CPUS) *
+				     ARCH_NR_MM_CPUMASK] = 0},
 };
 
 struct workqueue_struct *efi_rts_wq;
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index f29bba20bba1..b53d5622b3b2 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -531,6 +531,9 @@ struct mm_struct {
 	RH_KABI_RESERVE(7)
 	RH_KABI_RESERVE(8)
 
+#ifndef ARCH_NR_MM_CPUMASK
+#define ARCH_NR_MM_CPUMASK 1
+#endif
 	/*
 	 * The mm_cpumask needs to be at the end of mm_struct, because it
 	 * is dynamically sized based on nr_cpu_ids.
@@ -544,15 +547,25 @@ extern struct mm_struct init_mm;
 static inline void mm_init_cpumask(struct mm_struct *mm)
 {
 	unsigned long cpu_bitmap = (unsigned long)mm;
+	int i;
 
 	cpu_bitmap += offsetof(struct mm_struct, cpu_bitmap);
-	cpumask_clear((struct cpumask *)cpu_bitmap);
+	for (i = 0; i < ARCH_NR_MM_CPUMASK; i++) {
+		cpumask_clear((struct cpumask *)cpu_bitmap);
+		cpu_bitmap += cpumask_size();
+	}
 }
 
 /* Future-safe accessor for struct mm_struct's cpu_vm_mask. */
+static inline cpumask_t *__mm_cpumask(int index, struct mm_struct *mm)
+{
+	return (struct cpumask *)((unsigned long)&mm->cpu_bitmap +
+				  cpumask_size() * index);
+}
+
 static inline cpumask_t *mm_cpumask(struct mm_struct *mm)
 {
-	return (struct cpumask *)&mm->cpu_bitmap;
+	return __mm_cpumask(0, mm);
 }
 
 struct mmu_gather;
diff --git a/kernel/fork.c b/kernel/fork.c
index 1dad2f91fac3..a6cbbc1b6008 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2418,7 +2418,8 @@ void __init proc_caches_init(void)
 	 * dynamically sized based on the maximum CPU number this system
 	 * can have, taking hotplug into account (nr_cpu_ids).
 	 */
-	mm_size = sizeof(struct mm_struct) + cpumask_size();
+	mm_size = sizeof(struct mm_struct) + cpumask_size() * \
+		ARCH_NR_MM_CPUMASK;
 
 	mm_cachep = kmem_cache_create_usercopy("mm_struct",
 			mm_size, ARCH_MIN_MMSTRUCT_ALIGN,
diff --git a/mm/init-mm.c b/mm/init-mm.c
index a787a319211e..d975f8ce270e 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -35,6 +35,6 @@ struct mm_struct init_mm = {
 	.arg_lock	=  __SPIN_LOCK_UNLOCKED(init_mm.arg_lock),
 	.mmlist		= LIST_HEAD_INIT(init_mm.mmlist),
 	.user_ns	= &init_user_ns,
-	.cpu_bitmap	= { [BITS_TO_LONGS(NR_CPUS)] = 0},
+	.cpu_bitmap	= { [BITS_TO_LONGS(NR_CPUS) * ARCH_NR_MM_CPUMASK] = 0},
 	INIT_MM_CONTEXT(init_mm)
 };


[bitmap version depending on the above follows]

@@ -248,6 +260,42 @@ void check_and_switch_context(struct mm_struct *mm, unsigned int cpu)
 		cpu_switch_mm(mm->pgd, mm);
 }
 
+enum tlb_flush_types tlb_flush_check(struct mm_struct *mm, unsigned int cpu)
+{
+	if (cpumask_any_but(mm_cpumask(mm), cpu) >= nr_cpu_ids) {
+		bool is_local = cpumask_test_cpu(cpu, mm_cpumask(mm));
+		cpumask_t *stale_cpumask = __mm_cpumask(1, mm);
+		int next_zero = cpumask_next_zero(-1, stale_cpumask);
+		bool local_is_clear = false;
+		if (next_zero < nr_cpu_ids &&
+		    (is_local && next_zero == cpu)) {
+			next_zero = cpumask_next_zero(next_zero, stale_cpumask);
+			local_is_clear = true;
+		}
+		if (next_zero < nr_cpu_ids) {
+			cpumask_setall(stale_cpumask);
+			local_is_clear = false;
+		}
+
+		/*
+		 * Enforce CPU ordering between the
+		 * cpumask_setall() and cpumask_any_but().
+		 */
+		smp_mb();
+
+		if (likely(cpumask_any_but(mm_cpumask(mm),
+					   cpu) >= nr_cpu_ids)) {
+			if (is_local) {
+				if (!local_is_clear)
+					cpumask_clear_cpu(cpu, stale_cpumask);
+				return TLB_FLUSH_LOCAL;
+			}
+			return TLB_FLUSH_NO;
+		}
+	}
+	return TLB_FLUSH_BROADCAST;
+}
+
 /* Errata workaround post TTBRx_EL1 update. */
 asmlinkage void post_ttbr_update_workaround(void)
 {




* [PATCH 1/3] mm: use_mm: fix for arches checking mm_users to optimize TLB flushes
  2020-02-23 19:25 [PATCH 0/3] arm64: tlb: skip tlbi broadcast v2 Andrea Arcangeli
@ 2020-02-23 19:25 ` Andrea Arcangeli
  2020-03-03  4:28   ` Rafael Aquini
  2020-02-23 19:25 ` [PATCH 2/3] arm64: select CPUMASK_OFFSTACK if NUMA Andrea Arcangeli
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 15+ messages in thread
From: Andrea Arcangeli @ 2020-02-23 19:25 UTC (permalink / raw)
  To: Will Deacon, Catalin Marinas, Rafael Aquini, Mark Salter
  Cc: Jon Masters, linux-kernel, linux-mm, linux-arm-kernel,
	Michal Hocko, QI Fuli

alpha, ia64, mips, powerpc, sh and sparc rely on a check of
mm->mm_users to know whether they can skip some remote TLB flushes
for single threaded processes.

Most callers of use_mm() tend to invoke mmget_not_zero() or
get_task_mm() before use_mm() to ensure the mm will remain alive in
between use_mm() and unuse_mm().

Some callers however don't increase mm_users, and instead rely on
serialization in __mmput() to ensure the mm will remain alive between
use_mm() and unuse_mm(). Not increasing mm_users during use_mm() is
however unsafe for the aforementioned arch TLB flush optimizations.
So either mmget()/mmput() should be added to the problematic callers
of use_mm()/unuse_mm(), or we can embed them in use_mm()/unuse_mm()
itself, which is more robust.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/mmu_context.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/mmu_context.c b/mm/mmu_context.c
index 3e612ae748e9..ced0e1218c0f 100644
--- a/mm/mmu_context.c
+++ b/mm/mmu_context.c
@@ -30,6 +30,7 @@ void use_mm(struct mm_struct *mm)
 		mmgrab(mm);
 		tsk->active_mm = mm;
 	}
+	mmget(mm);
 	tsk->mm = mm;
 	switch_mm(active_mm, mm, tsk);
 	task_unlock(tsk);
@@ -57,6 +58,7 @@ void unuse_mm(struct mm_struct *mm)
 	task_lock(tsk);
 	sync_mm_rss(mm);
 	tsk->mm = NULL;
+	mmput(mm);
 	/* active_mm is still 'mm' */
 	enter_lazy_tlb(mm, tsk);
 	task_unlock(tsk);
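
For reference, the already-safe caller pattern looks roughly like
this (an illustrative sketch of a hypothetical kthread worker, not
code taken from any particular driver); the patch above simply folds
an equivalent mmget()/mmput() pair into use_mm()/unuse_mm() so the
remaining callers are covered as well:

/* Hypothetical kthread worker temporarily adopting a user mm. */
static void sketch_worker(struct task_struct *owner)
{
	struct mm_struct *mm = get_task_mm(owner);	/* takes an mm_users reference */

	if (!mm)
		return;		/* the owner task already exited */

	use_mm(mm);		/* adopt "mm" for user memory accesses */
	/* ... copy_from_user()/copy_to_user() against "mm" ... */
	unuse_mm(mm);

	mmput(mm);		/* drop the mm_users reference */
}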




* [PATCH 2/3] arm64: select CPUMASK_OFFSTACK if NUMA
  2020-02-23 19:25 [PATCH 0/3] arm64: tlb: skip tlbi broadcast v2 Andrea Arcangeli
  2020-02-23 19:25 ` [PATCH 1/3] mm: use_mm: fix for arches checking mm_users to optimize TLB flushes Andrea Arcangeli
@ 2020-02-23 19:25 ` Andrea Arcangeli
  2020-03-03  4:31   ` Rafael Aquini
  2020-02-23 19:25 ` [PATCH 3/3] arm64: tlb: skip tlbi broadcast Andrea Arcangeli
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 15+ messages in thread
From: Andrea Arcangeli @ 2020-02-23 19:25 UTC (permalink / raw)
  To: Will Deacon, Catalin Marinas, Rafael Aquini, Mark Salter
  Cc: Jon Masters, linux-kernel, linux-mm, linux-arm-kernel,
	Michal Hocko, QI Fuli

It's unclear why CPUMASK_OFFSTACK can normally only be manually
configured "if DEBUG_PER_CPU_MAPS", which is not an option meant to
be enabled on enterprise arm64 kernels.

The default enterprise kernel NR_CPUS is 4096, which is fairly large
(a cpumask on the stack is then 512 bytes). So selecting
CPUMASK_OFFSTACK at least when NUMA is selected, and a large NR_CPUS
is to be expected, will save some RAM and increase reliability.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 arch/arm64/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 0b30e884e088..882887e65394 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -955,6 +955,7 @@ config NUMA
 	bool "Numa Memory Allocation and Scheduler Support"
 	select ACPI_NUMA if ACPI
 	select OF_NUMA
+	select CPUMASK_OFFSTACK
 	help
 	  Enable NUMA (Non Uniform Memory Access) support.
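
For context, with CPUMASK_OFFSTACK a cpumask_var_t becomes a pointer
allocated separately and sized by nr_cpu_ids, instead of a full
NR_CPUS-bit array on the stack. A generic illustration of the API it
enables (not arm64-specific code, just a sketch):

/*
 * With NR_CPUS=4096 an on-stack struct cpumask is 512 bytes; with
 * CPUMASK_OFFSTACK a cpumask_var_t is allocated separately and sized
 * by nr_cpu_ids.
 */
static int sketch_offstack_usage(void)
{
	cpumask_var_t mask;

	if (!zalloc_cpumask_var(&mask, GFP_KERNEL))
		return -ENOMEM;

	cpumask_set_cpu(raw_smp_processor_id(), mask);
	/* ... use the mask ... */
	free_cpumask_var(mask);
	return 0;
}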
 




* [PATCH 3/3] arm64: tlb: skip tlbi broadcast
  2020-02-23 19:25 [PATCH 0/3] arm64: tlb: skip tlbi broadcast v2 Andrea Arcangeli
  2020-02-23 19:25 ` [PATCH 1/3] mm: use_mm: fix for arches checking mm_users to optimize TLB flushes Andrea Arcangeli
  2020-02-23 19:25 ` [PATCH 2/3] arm64: select CPUMASK_OFFSTACK if NUMA Andrea Arcangeli
@ 2020-02-23 19:25 ` Andrea Arcangeli
  2020-03-02 15:24   ` Rafael Aquini
  2020-03-09 11:22   ` Catalin Marinas
  2020-03-03 13:01 ` [PATCH 0/3] arm64: tlb: skip tlbi broadcast v2 Larry Woodman
  2020-03-18  8:53 ` qi.fuli
  4 siblings, 2 replies; 15+ messages in thread
From: Andrea Arcangeli @ 2020-02-23 19:25 UTC (permalink / raw)
  To: Will Deacon, Catalin Marinas, Rafael Aquini, Mark Salter
  Cc: Jon Masters, linux-kernel, linux-mm, linux-arm-kernel,
	Michal Hocko, QI Fuli

With multiple NUMA nodes and multiple sockets, the tlbi broadcast
must be delivered through the interconnects, in turn increasing the
CPU interconnect traffic and the latency of the tlbi broadcast
instruction. To avoid the synchronous delivery of the tlbi broadcast
before the tlbi instruction can be retired, the hardware would need
to implement a replicated mm_cpumask bitflag for each ASID, and every
CPU would need to tell every other CPU which ASID it is loading:
exactly what x86 does with mm_cpumask in software.

Even within a single NUMA node the latency of the tlbi broadcast
instruction increases almost linearly with the number of CPUs trying
to send tlbi broadcasts at the same time.

If a single thread of the process is running and it is also running
on the CPU issuing the TLB flush, or if no thread of the process is
running, we can achieve full SMP scalability in the arm64 TLB
flushing by skipping the tlbi broadcast.

After the local TLB flush this means the ASID context goes out of
sync on all CPUs except the local one. This is tracked in the per-mm
cpumask: if the bit is set, the ASID context is stale for that CPU.
This results in an extra local ASID TLB flush only when a thread runs
on a new CPU after a TLB flush.

Skipping the tlbi instruction broadcast is already implemented in
local_flush_tlb_all(); this patch only extends it to flush_tlb_mm(),
flush_tlb_range() and flush_tlb_page().
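
In pseudo-code, the decision each of these flush helpers now makes is
roughly the following (a condensed restatement of tlb_flush_check()
from the patch below, not the literal implementation, which also
avoids redundant cpumask_setall() calls):

/* Condensed sketch of tlb_flush_check() (see the patch below). */
static enum tlb_flush_types sketch_flush_check(struct mm_struct *mm,
					       unsigned int cpu)
{
	bool is_local = current->active_mm == mm &&
			per_cpu(cpu_not_lazy_tlb, cpu);

	if (atomic_read(&mm->context.nr_active_mm) > 1)
		return TLB_FLUSH_BROADCAST;	/* mm loaded on other CPUs */

	/*
	 * Mark the ASID stale for the other CPUs before skipping the
	 * broadcast, so they flush locally when they next load the mm.
	 */
	cpumask_setall(mm_cpumask(mm));

	smp_mb();	/* order the setall against the recheck below */

	if (atomic_read(&mm->context.nr_active_mm) > 1)
		return TLB_FLUSH_BROADCAST;	/* raced with switch_mm() */
	if (is_local) {
		cpumask_clear_cpu(cpu, mm_cpumask(mm));
		return TLB_FLUSH_LOCAL;		/* flush this CPU only */
	}
	if (atomic_read(&mm->context.nr_active_mm) == 0)
		return TLB_FLUSH_NO;		/* mm not loaded anywhere */
	return TLB_FLUSH_BROADCAST;		/* loaded on one other CPU */
}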

The benchmarks below are measured on a non-NUMA 32 CPU system (ARMv8
Ampere), so they should be far from a worst case scenario: the
enterprise kernel config allows multiple NUMA nodes with NR_CPUS set
to 4096 by default.

=== stock ===

 # cat for-each-cpu.sh
 #!/bin/bash

 for i in $(seq `nproc`); do
         "$@" &>/dev/null &
 done
 wait
 # perf stat -r 10 -e dummy ./for-each-cpu.sh ./mprotect-threaded 10000
[..]
            2.1696 +- 0.0122 seconds time elapsed  ( +-  0.56% )

 # perf stat -r 10 -e dummy ./for-each-cpu.sh ./gperftools/tcmalloc_large_heap_fragmentation_unittest
[..]
            0.99018 +- 0.00360 seconds time elapsed  ( +-  0.36% )

 # cat sort-compute
 #!/bin/bash

 for x in `seq 256`; do
         for i in `seq 32`; do /usr/bin/sort </usr/bin/sort >/dev/null; done &
 done
 wait
 # perf stat -r 10 -e dummy ./sort-compute
[..]
            1.8094 +- 0.0139 seconds time elapsed  ( +-  0.77% )

=== patch applied ===

 # perf stat -r 10 -e dummy ./for-each-cpu.sh ./mprotect-threaded 10000
[..]
            0.13941 +- 0.00449 seconds time elapsed  ( +-  3.22% )

 # perf stat -r 10 -e dummy ./for-each-cpu.sh ./gperftools/tcmalloc_large_heap_fragmentation_unittest
[..]
            0.90510 +- 0.00262 seconds time elapsed  ( +-  0.29% )

 # perf stat -r 10 -e dummy ./sort-compute
[..]
            1.64025 +- 0.00618 seconds time elapsed  ( +-  0.38% )

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 arch/arm64/include/asm/efi.h         |  2 +-
 arch/arm64/include/asm/mmu.h         |  4 +-
 arch/arm64/include/asm/mmu_context.h | 33 ++++++++--
 arch/arm64/include/asm/tlbflush.h    | 95 +++++++++++++++++++++++++++-
 arch/arm64/mm/context.c              | 54 ++++++++++++++++
 5 files changed, 177 insertions(+), 11 deletions(-)

diff --git a/arch/arm64/include/asm/efi.h b/arch/arm64/include/asm/efi.h
index 44531a69d32b..5d9a1433d918 100644
--- a/arch/arm64/include/asm/efi.h
+++ b/arch/arm64/include/asm/efi.h
@@ -131,7 +131,7 @@ static inline void efifb_setup_from_dmi(struct screen_info *si, const char *opt)
 
 static inline void efi_set_pgd(struct mm_struct *mm)
 {
-	__switch_mm(mm);
+	__switch_mm(mm, smp_processor_id());
 
 	if (system_uses_ttbr0_pan()) {
 		if (mm != current->active_mm) {
diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h
index e4d862420bb4..9072fd7bc5f8 100644
--- a/arch/arm64/include/asm/mmu.h
+++ b/arch/arm64/include/asm/mmu.h
@@ -20,6 +20,7 @@ typedef struct {
 	atomic64_t	id;
 	void		*vdso;
 	unsigned long	flags;
+	atomic_t	nr_active_mm;
 } mm_context_t;
 
 /*
@@ -27,7 +28,8 @@ typedef struct {
  * ASID change and therefore doesn't need to reload the counter using
  * atomic64_read.
  */
-#define ASID(mm)	((mm)->context.id.counter & 0xffff)
+#define __ASID(asid)	((asid) & 0xffff)
+#define ASID(mm)	__ASID((mm)->context.id.counter)
 
 extern bool arm64_use_ng_mappings;
 
diff --git a/arch/arm64/include/asm/mmu_context.h b/arch/arm64/include/asm/mmu_context.h
index 3827ff4040a3..9c66fe317e2f 100644
--- a/arch/arm64/include/asm/mmu_context.h
+++ b/arch/arm64/include/asm/mmu_context.h
@@ -175,7 +175,10 @@ static inline void cpu_replace_ttbr1(pgd_t *pgdp)
 #define destroy_context(mm)		do { } while(0)
 void check_and_switch_context(struct mm_struct *mm, unsigned int cpu);
 
-#define init_new_context(tsk,mm)	({ atomic64_set(&(mm)->context.id, 0); 0; })
+#define init_new_context(tsk,mm)			\
+	({ atomic64_set(&(mm)->context.id, 0);		\
+	   atomic_set(&(mm)->context.nr_active_mm, 0);	\
+	   0; })
 
 #ifdef CONFIG_ARM64_SW_TTBR0_PAN
 static inline void update_saved_ttbr0(struct task_struct *tsk,
@@ -203,6 +206,15 @@ static inline void update_saved_ttbr0(struct task_struct *tsk,
 static inline void
 enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
 {
+	unsigned int cpu = smp_processor_id();
+	if (per_cpu(cpu_not_lazy_tlb, cpu) &&
+	    is_idle_task(tsk)) {
+		per_cpu(cpu_not_lazy_tlb, cpu) = false;
+		if (!system_uses_ttbr0_pan())
+			cpu_set_reserved_ttbr0();
+		atomic_dec(&mm->context.nr_active_mm);
+	}
+	VM_WARN_ON(atomic_read(&mm->context.nr_active_mm) < 0);
 	/*
 	 * We don't actually care about the ttbr0 mapping, so point it at the
 	 * zero page.
@@ -210,10 +222,8 @@ enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
 	update_saved_ttbr0(tsk, &init_mm);
 }
 
-static inline void __switch_mm(struct mm_struct *next)
+static inline void __switch_mm(struct mm_struct *next, unsigned int cpu)
 {
-	unsigned int cpu = smp_processor_id();
-
 	/*
 	 * init_mm.pgd does not contain any user mappings and it is always
 	 * active for kernel addresses in TTBR1. Just set the reserved TTBR0.
@@ -230,8 +240,19 @@ static inline void
 switch_mm(struct mm_struct *prev, struct mm_struct *next,
 	  struct task_struct *tsk)
 {
-	if (prev != next)
-		__switch_mm(next);
+	unsigned int cpu = smp_processor_id();
+
+	if (!per_cpu(cpu_not_lazy_tlb, cpu)) {
+		per_cpu(cpu_not_lazy_tlb, cpu) = true;
+		atomic_inc(&next->context.nr_active_mm);
+		__switch_mm(next, cpu);
+	} else if (prev != next) {
+		atomic_inc(&next->context.nr_active_mm);
+		__switch_mm(next, cpu);
+		atomic_dec(&prev->context.nr_active_mm);
+	}
+	VM_WARN_ON(!atomic_read(&next->context.nr_active_mm));
+	VM_WARN_ON(atomic_read(&prev->context.nr_active_mm) < 0);
 
 	/*
 	 * Update the saved TTBR0_EL1 of the scheduled-in task as the previous
diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index bc3949064725..0bd987ff9cbd 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -136,6 +136,15 @@ static inline void local_flush_tlb_all(void)
 	isb();
 }
 
+static inline void local_flush_tlb_asid(unsigned long asid)
+{
+	asid = __TLBI_VADDR(0, __ASID(asid));
+	dsb(nshst);
+	__tlbi(aside1, asid);
+	__tlbi_user(aside1, asid);
+	dsb(nsh);
+}
+
 static inline void flush_tlb_all(void)
 {
 	dsb(ishst);
@@ -144,9 +153,38 @@ static inline void flush_tlb_all(void)
 	isb();
 }
 
+DECLARE_PER_CPU(bool, cpu_not_lazy_tlb);
+
+enum tlb_flush_types {
+	TLB_FLUSH_NO,
+	TLB_FLUSH_LOCAL,
+	TLB_FLUSH_BROADCAST,
+};
+extern enum tlb_flush_types tlb_flush_check(struct mm_struct *mm,
+					    unsigned int cpu);
+
 static inline void flush_tlb_mm(struct mm_struct *mm)
 {
 	unsigned long asid = __TLBI_VADDR(0, ASID(mm));
+	enum tlb_flush_types flush;
+
+	flush = tlb_flush_check(mm, get_cpu());
+	switch (flush) {
+	case TLB_FLUSH_LOCAL:
+
+		dsb(nshst);
+		__tlbi(aside1, asid);
+		__tlbi_user(aside1, asid);
+		dsb(nsh);
+
+		/* fall through */
+	case TLB_FLUSH_NO:
+		put_cpu();
+		return;
+	case TLB_FLUSH_BROADCAST:
+		break;
+	}
+	put_cpu();
 
 	dsb(ishst);
 	__tlbi(aside1is, asid);
@@ -167,7 +205,31 @@ static inline void flush_tlb_page_nosync(struct vm_area_struct *vma,
 static inline void flush_tlb_page(struct vm_area_struct *vma,
 				  unsigned long uaddr)
 {
-	flush_tlb_page_nosync(vma, uaddr);
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long addr = __TLBI_VADDR(uaddr, ASID(mm));
+	enum tlb_flush_types flush;
+
+	flush = tlb_flush_check(mm, get_cpu());
+	switch (flush) {
+	case TLB_FLUSH_LOCAL:
+
+		dsb(nshst);
+		__tlbi(vale1, addr);
+		__tlbi_user(vale1, addr);
+		dsb(nsh);
+
+		/* fall through */
+	case TLB_FLUSH_NO:
+		put_cpu();
+		return;
+	case TLB_FLUSH_BROADCAST:
+		break;
+	}
+	put_cpu();
+
+	dsb(ishst);
+	__tlbi(vale1is, addr);
+	__tlbi_user(vale1is, addr);
 	dsb(ish);
 }
 
@@ -181,14 +243,16 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma,
 				     unsigned long start, unsigned long end,
 				     unsigned long stride, bool last_level)
 {
-	unsigned long asid = ASID(vma->vm_mm);
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long asid = ASID(mm);
 	unsigned long addr;
+	enum tlb_flush_types flush;
 
 	start = round_down(start, stride);
 	end = round_up(end, stride);
 
 	if ((end - start) >= (MAX_TLBI_OPS * stride)) {
-		flush_tlb_mm(vma->vm_mm);
+		flush_tlb_mm(mm);
 		return;
 	}
 
@@ -198,6 +262,31 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma,
 	start = __TLBI_VADDR(start, asid);
 	end = __TLBI_VADDR(end, asid);
 
+	flush = tlb_flush_check(mm, get_cpu());
+	switch (flush) {
+	case TLB_FLUSH_LOCAL:
+
+		dsb(nshst);
+		for (addr = start; addr < end; addr += stride) {
+			if (last_level) {
+				__tlbi(vale1, addr);
+				__tlbi_user(vale1, addr);
+			} else {
+				__tlbi(vae1, addr);
+				__tlbi_user(vae1, addr);
+			}
+		}
+		dsb(nsh);
+
+		/* fall through */
+	case TLB_FLUSH_NO:
+		put_cpu();
+		return;
+	case TLB_FLUSH_BROADCAST:
+		break;
+	}
+	put_cpu();
+
 	dsb(ishst);
 	for (addr = start; addr < end; addr += stride) {
 		if (last_level) {
diff --git a/arch/arm64/mm/context.c b/arch/arm64/mm/context.c
index 8ef73e89d514..3152b7f7da12 100644
--- a/arch/arm64/mm/context.c
+++ b/arch/arm64/mm/context.c
@@ -25,6 +25,7 @@ static unsigned long *asid_map;
 static DEFINE_PER_CPU(atomic64_t, active_asids);
 static DEFINE_PER_CPU(u64, reserved_asids);
 static cpumask_t tlb_flush_pending;
+DEFINE_PER_CPU(bool, cpu_not_lazy_tlb);
 
 #define ASID_MASK		(~GENMASK(asid_bits - 1, 0))
 #define ASID_FIRST_VERSION	(1UL << asid_bits)
@@ -191,6 +192,12 @@ static u64 new_context(struct mm_struct *mm)
 set_asid:
 	__set_bit(asid, asid_map);
 	cur_idx = asid;
+	/*
+	  * check_and_switch_context() will change the ASID of this mm
+	  * so no need of extra ASID local TLB flushes: the new ASID
+	  * isn't stale anymore after the tlb_flush_pending was set.
+	  */
+	cpumask_clear(mm_cpumask(mm));
 	return idx2asid(asid) | generation;
 }
 
@@ -240,6 +247,15 @@ void check_and_switch_context(struct mm_struct *mm, unsigned int cpu)
 	raw_spin_unlock_irqrestore(&cpu_asid_lock, flags);
 
 switch_mm_fastpath:
+	/*
+	 * Enforce CPU ordering between the atomic_inc(nr_active_mm)
+	 * in switch_mm() and the below cpumask_test_cpu(mm_cpumask).
+	 */
+	smp_mb();
+	if (cpumask_test_cpu(cpu, mm_cpumask(mm))) {
+		cpumask_clear_cpu(cpu, mm_cpumask(mm));
+		local_flush_tlb_asid(asid);
+	}
 
 	arm64_apply_bp_hardening();
 
@@ -251,6 +267,44 @@ void check_and_switch_context(struct mm_struct *mm, unsigned int cpu)
 		cpu_switch_mm(mm->pgd, mm);
 }
 
+enum tlb_flush_types tlb_flush_check(struct mm_struct *mm, unsigned int cpu)
+{
+	if (atomic_read(&mm->context.nr_active_mm) <= 1) {
+		bool is_local = current->active_mm == mm &&
+			per_cpu(cpu_not_lazy_tlb, cpu);
+		cpumask_t *stale_cpumask = mm_cpumask(mm);
+		unsigned int next_zero = cpumask_next_zero(-1, stale_cpumask);
+		bool local_is_clear = false;
+		if (next_zero < nr_cpu_ids &&
+		    (is_local && next_zero == cpu)) {
+			next_zero = cpumask_next_zero(next_zero, stale_cpumask);
+			local_is_clear = true;
+		}
+		if (next_zero < nr_cpu_ids) {
+			cpumask_setall(stale_cpumask);
+			local_is_clear = false;
+		}
+
+		/*
+		 * Enforce CPU ordering between the above
+		 * cpumask_setall(mm_cpumask) and the below
+		 * atomic_read(nr_active_mm).
+		 */
+		smp_mb();
+
+		if (likely(atomic_read(&mm->context.nr_active_mm) <= 1)) {
+			if (is_local) {
+				if (!local_is_clear)
+					cpumask_clear_cpu(cpu, stale_cpumask);
+				return TLB_FLUSH_LOCAL;
+			}
+			if (atomic_read(&mm->context.nr_active_mm) == 0)
+				return TLB_FLUSH_NO;
+		}
+	}
+	return TLB_FLUSH_BROADCAST;
+}
+
 /* Errata workaround post TTBRx_EL1 update. */
 asmlinkage void post_ttbr_update_workaround(void)
 {




* Re: [PATCH 3/3] arm64: tlb: skip tlbi broadcast
  2020-02-23 19:25 ` [PATCH 3/3] arm64: tlb: skip tlbi broadcast Andrea Arcangeli
@ 2020-03-02 15:24   ` Rafael Aquini
  2020-03-04  4:19     ` Rafael Aquini
  2020-03-09 11:22   ` Catalin Marinas
  1 sibling, 1 reply; 15+ messages in thread
From: Rafael Aquini @ 2020-03-02 15:24 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Will Deacon, Catalin Marinas, Mark Salter, Jon Masters,
	linux-kernel, linux-mm, linux-arm-kernel, Michal Hocko, QI Fuli

On Sun, Feb 23, 2020 at 02:25:20PM -0500, Andrea Arcangeli wrote:
[...]

May I suggest the following (cosmetic) changes to this
patch?

I'm testing these changes against RHEL integration + regression
tests, and I'll also run them against a specially crafted test
to measure the impact on task-switching, if any. (I'll report back,
soon)

Cheers!
 Rafael
--


diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index bb73f02bef28..14eceba302bc 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -182,25 +182,21 @@ static inline void flush_tlb_mm(struct mm_struct *mm)
 	flush = tlb_flush_check(mm, get_cpu());
 	switch (flush) {
 	case TLB_FLUSH_LOCAL:
-
 		dsb(nshst);
 		__tlbi(aside1, asid);
 		__tlbi_user(aside1, asid);
 		dsb(nsh);
-
 		/* fall through */
 	case TLB_FLUSH_NO:
-		put_cpu();
-		return;
+		break;
 	case TLB_FLUSH_BROADCAST:
+		dsb(ishst);
+		__tlbi(aside1is, asid);
+		__tlbi_user(aside1is, asid);
+		dsb(ish);
 		break;
 	}
 	put_cpu();
-
-	dsb(ishst);
-	__tlbi(aside1is, asid);
-	__tlbi_user(aside1is, asid);
-	dsb(ish);
 }
 
 static inline void flush_tlb_page_nosync(struct vm_area_struct *vma,
@@ -223,25 +219,21 @@ static inline void flush_tlb_page(struct vm_area_struct *vma,
 	flush = tlb_flush_check(mm, get_cpu());
 	switch (flush) {
 	case TLB_FLUSH_LOCAL:
-
 		dsb(nshst);
 		__tlbi(vale1, addr);
 		__tlbi_user(vale1, addr);
 		dsb(nsh);
-
 		/* fall through */
 	case TLB_FLUSH_NO:
-		put_cpu();
-		return;
+		break;
 	case TLB_FLUSH_BROADCAST:
+		dsb(ishst);
+		__tlbi(vale1is, addr);
+		__tlbi_user(vale1is, addr);
+		dsb(ish);
 		break;
 	}
 	put_cpu();
-
-	dsb(ishst);
-	__tlbi(vale1is, addr);
-	__tlbi_user(vale1is, addr);
-	dsb(ish);
 }
 
 /*
@@ -276,7 +268,6 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma,
 	flush = tlb_flush_check(mm, get_cpu());
 	switch (flush) {
 	case TLB_FLUSH_LOCAL:
-
 		dsb(nshst);
 		for (addr = start; addr < end; addr += stride) {
 			if (last_level) {
@@ -288,27 +279,24 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma,
 			}
 		}
 		dsb(nsh);
-
 		/* fall through */
 	case TLB_FLUSH_NO:
-		put_cpu();
-		return;
+		break;
 	case TLB_FLUSH_BROADCAST:
+		dsb(ishst);
+		for (addr = start; addr < end; addr += stride) {
+			if (last_level) {
+				__tlbi(vale1is, addr);
+				__tlbi_user(vale1is, addr);
+			} else {
+				__tlbi(vae1is, addr);
+				__tlbi_user(vae1is, addr);
+			}
+		}
+		dsb(ish);
 		break;
 	}
 	put_cpu();
-
-	dsb(ishst);
-	for (addr = start; addr < end; addr += stride) {
-		if (last_level) {
-			__tlbi(vale1is, addr);
-			__tlbi_user(vale1is, addr);
-		} else {
-			__tlbi(vae1is, addr);
-			__tlbi_user(vae1is, addr);
-		}
-	}
-	dsb(ish);
 }
 
 static inline void flush_tlb_range(struct vm_area_struct *vma,




* Re: [PATCH 1/3] mm: use_mm: fix for arches checking mm_users to optimize TLB flushes
  2020-02-23 19:25 ` [PATCH 1/3] mm: use_mm: fix for arches checking mm_users to optimize TLB flushes Andrea Arcangeli
@ 2020-03-03  4:28   ` Rafael Aquini
  0 siblings, 0 replies; 15+ messages in thread
From: Rafael Aquini @ 2020-03-03  4:28 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Will Deacon, Catalin Marinas, Mark Salter, Jon Masters,
	linux-kernel, linux-mm, linux-arm-kernel, Michal Hocko, QI Fuli

On Sun, Feb 23, 2020 at 02:25:18PM -0500, Andrea Arcangeli wrote:
[...]

Acked-by: Rafael Aquini <aquini@redhat.com>




* Re: [PATCH 2/3] arm64: select CPUMASK_OFFSTACK if NUMA
  2020-02-23 19:25 ` [PATCH 2/3] arm64: select CPUMASK_OFFSTACK if NUMA Andrea Arcangeli
@ 2020-03-03  4:31   ` Rafael Aquini
  0 siblings, 0 replies; 15+ messages in thread
From: Rafael Aquini @ 2020-03-03  4:31 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Will Deacon, Catalin Marinas, Mark Salter, Jon Masters,
	linux-kernel, linux-mm, linux-arm-kernel, Michal Hocko, QI Fuli

On Sun, Feb 23, 2020 at 02:25:19PM -0500, Andrea Arcangeli wrote:
[...]

Acked-by: Rafael Aquini <aquini@redhat.com>




* Re: [PATCH 0/3] arm64: tlb: skip tlbi broadcast v2
  2020-02-23 19:25 [PATCH 0/3] arm64: tlb: skip tlbi broadcast v2 Andrea Arcangeli
                   ` (2 preceding siblings ...)
  2020-02-23 19:25 ` [PATCH 3/3] arm64: tlb: skip tlbi broadcast Andrea Arcangeli
@ 2020-03-03 13:01 ` Larry Woodman
  2020-03-18  8:53 ` qi.fuli
  4 siblings, 0 replies; 15+ messages in thread
From: Larry Woodman @ 2020-03-03 13:01 UTC (permalink / raw)
  To: Andrea Arcangeli, Will Deacon, Catalin Marinas, Rafael Aquini,
	Mark Salter
  Cc: Jon Masters, linux-kernel, linux-mm, linux-arm-kernel,
	Michal Hocko, QI Fuli


On 02/23/2020 02:25 PM, Andrea Arcangeli wrote:
[...]
For the 3-part series:


Acked-by: Larry Woodman <lwoodman@redhat.com>




* Re: [PATCH 3/3] arm64: tlb: skip tlbi broadcast
  2020-03-02 15:24   ` Rafael Aquini
@ 2020-03-04  4:19     ` Rafael Aquini
  0 siblings, 0 replies; 15+ messages in thread
From: Rafael Aquini @ 2020-03-04  4:19 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Will Deacon, Catalin Marinas, Mark Salter, Jon Masters,
	linux-kernel, linux-mm, linux-arm-kernel, Michal Hocko, QI Fuli


On Mon, Mar 02, 2020 at 10:24:51AM -0500, Rafael Aquini wrote:
[...]
> I'm testing these changes against RHEL integration + regression
> tests, and I'll also run them against a specially crafted test
> to measure the impact on task-switching, if any. (I'll report back,
> soon)
>

As promised, I ran the patches against a full round of kernel
integration/regression tests (the same ones we run for RHEL kernels).
Unfortunately, there is no easy way to share these internal results,
but apart from a couple of warnings -- which were not related to the
test build -- everything went very smoothly with the patches on top
of a RHEL-8 test kernel.


I grabbed some perf numbers to serve as a kernel build benchmark.
The test system is a 1-socket, 32-core 3.3GHz ARMv8 Ampere eMAG with
256GB RAM. rpmbuild spawns the build with make -j32 and, besides the
stock kernel RPM, it also builds the -debug flavor RPM and all the
sub-RPMs for tools and extra modules.

* stock RHEL-8 build:

 Performance counter stats for 'rpmbuild --rebuild kernel-4.18.0-184.el8.aatlb.src.rpm':

     27,817,487.14 msec task-clock                #   15.015 CPUs utilized          
         1,318,586      context-switches          #    0.047 K/sec                  
           515,342      cpu-migrations            #    0.019 K/sec                  
        68,513,346      page-faults               #    0.002 M/sec                  
91,713,983,302,759      cycles                    #    3.297 GHz                    
49,871,167,452,081      instructions              #    0.54  insn per cycle         
23,801,939,026,338      cache-references          #  855.647 M/sec                  
   568,847,557,178      cache-misses              #    2.390 % of all cache refs    
   145,305,286,469      dTLB-loads                #    5.224 M/sec                  
       752,451,698      dTLB-load-misses          #    0.52% of all dTLB cache hits 

    1852.656905157 seconds time elapsed

   26866.849105000 seconds user
     965.756120000 seconds sys


* RHEL8 kernel + Andrea's patches:

 Performance counter stats for 'rpmbuild --rebuild kernel-4.18.0-184.el8.aatlb.src.rpm':

     27,713,883.25 msec task-clock                #   15.137 CPUs utilized          
         1,310,196      context-switches          #    0.047 K/sec                  
           511,909      cpu-migrations            #    0.018 K/sec                  
        68,535,178      page-faults               #    0.002 M/sec                  
91,412,320,904,990      cycles                    #    3.298 GHz                    
49,844,016,063,738      instructions              #    0.55  insn per cycle         
23,795,774,331,203      cache-references          #  858.623 M/sec                  
   568,445,037,308      cache-misses              #    2.389 % of all cache refs    
   135,868,301,669      dTLB-loads                #    4.903 M/sec                  
       746,267,581      dTLB-load-misses          #    0.55% of all dTLB cache hits 

    1830.813507976 seconds time elapsed

   26785.529337000 seconds user
     943.251641000 seconds sys




I also wanted to measure the impact of the increased amount of code in
the task-switching path (in order to decide which TLB invalidation
scheme to pick), to figure out the worst-case scenario for single
threads of execution constrained to one core and yielding the CPU to
each other. I wrote the small test (attached) and grabbed some numbers
on the same box, for the sake of comparison:

* stock RHEL-8 build:

 Performance counter stats for './tlb-test' (1000 runs):

            122.67 msec task-clock                #    0.997 CPUs utilized            ( +-  0.04% )
            32,297      context-switches          #    0.263 M/sec                    ( +-  0.00% )
                 0      cpu-migrations            #    0.000 K/sec                  
               325      page-faults               #    0.003 M/sec                    ( +-  0.01% )
       404,648,928      cycles                    #    3.299 GHz                      ( +-  0.04% )
       239,856,265      instructions              #    0.59  insn per cycle           ( +-  0.00% )
       121,189,080      cache-references          #  987.964 M/sec                    ( +-  0.00% )
         3,414,396      cache-misses              #    2.817 % of all cache refs      ( +-  0.05% )
         2,820,754      dTLB-loads                #   22.996 M/sec                    ( +-  0.04% )
            34,545      dTLB-load-misses          #    1.22% of all dTLB cache hits   ( +-  6.16% )

         0.1230361 +- 0.0000435 seconds time elapsed  ( +-  0.04% )


* RHEL8 kernel + Andrea's patches:

 Performance counter stats for './tlb-test' (1000 runs):

            125.57 msec task-clock                #    0.997 CPUs utilized            ( +-  0.05% )
            32,244      context-switches          #    0.257 M/sec                    ( +-  0.01% )
                 0      cpu-migrations            #    0.000 K/sec                  
               325      page-faults               #    0.003 M/sec                    ( +-  0.01% )
       413,692,492      cycles                    #    3.295 GHz                      ( +-  0.04% )
       241,017,764      instructions              #    0.58  insn per cycle           ( +-  0.00% )
       121,155,050      cache-references          #  964.856 M/sec                    ( +-  0.00% )
         3,512,035      cache-misses              #    2.899 % of all cache refs      ( +-  0.04% )
         2,912,475      dTLB-loads                #   23.194 M/sec                    ( +-  0.02% )
            45,340      dTLB-load-misses          #    1.56% of all dTLB cache hits   ( +-  5.07% )

         0.1259462 +- 0.0000634 seconds time elapsed  ( +-  0.05% )



AFAICS, the benchmark numbers above suggest the changes impose virtually
no impact -- or at most a very minimal, non-detrimental one -- on
ordinary workloads, while Andrea's benchmarks suggest that a broad range
of particular workloads will benefit considerably from the changes.

With the numbers above, added to what I've seen in our (internal)
integration tests, I'm confident in the stability of the changes.

-- Rafael

[-- Attachment #2: tlb-test.c --]
[-- Type: text/plain, Size: 5592 bytes --]

// SPDX-License-Identifier: BSD-2-Clause
/*
 * Copyright (c) 2020, Rafael Aquini <aquini@redhat.com>
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions are met:
 *
 *  1. Redistributions of source code must retain the above copyright notice,
 *     this list of conditions and the following disclaimer.
 *
 *  2. Redistributions in binary form must reproduce the above copyright notice,
 *     this list of conditions and the following disclaimer in the documentation
 *     and/or other materials provided with the distribution.
 *
 *  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
 *  AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 *  IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
 *  ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS
 *  BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
 *  CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
 *  SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
 *  BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
 *  WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR
 *  OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
 *  ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 *
 * compile with:  gcc -o tlb-test -D_GNU_SOURCE -lpthread tlb-test.c
 * dependencies:
 *  - _GNU_SOURCE required for asprintf(3), sched_getcpu(3) && sched_setaffinity(2)
 *  - libpthreads required for POSIX semaphores
 */
#include <stdio.h>
#include <stdarg.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <fcntl.h>
#include <errno.h>
#include <sched.h>
#include <time.h>
#include <semaphore.h>
#include <sys/types.h>
#include <sys/times.h>
#include <sys/time.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <sys/mman.h>

#ifndef NCHILDS
#define NCHILDS		4
#endif

#ifndef NPAGES
#define NPAGES		32
#endif

#ifndef NRUNS
#define NRUNS		8192
#endif

#ifdef DEBUG
#define DPRINTF(...)	fprintf(stderr, __VA_ARGS__)
#else
#define DPRINTF(...)
#endif

#define ERROR_EXIT(msg)							\
	do {								\
		char *estr = NULL;					\
		asprintf(&estr, "[%s:%d] %s", __FILE__, __LINE__, msg);	\
		perror(estr);						\
		exit(EXIT_FAILURE);					\
	} while (0)

static const char *prg_name = "tlb-test";
static long system_hz;
static long page_size;
static sem_t *sem;

/*
 * Fisher-Yates shuffler algorithm [Statistical Tables (London, 1938), Ex.12],
 * adapted to computer language by R. Durstenfeld [CACM 7 (1964), 420], and
 * presented by Donald E. Knuth at:
 *  "The Art of Computer Programming, Volume 2: Seminumerical Algorithms"
 *  [Algorithm P (shuffling) under Section 3.4 OTHER TYPES OF RANDOM QUANTITIES]
 */
void fy_shuffler(unsigned long *buf, unsigned long len)
{
	unsigned long j, u, tmp;

	for (j = len - 1; j > 0; j--) {
		u = rand() % (j + 1);
		tmp = *(buf + u);
		*(buf + u) = *(buf + j);
		*(buf + j) = tmp;
	}
}

unsigned long usec_diff(struct timeval *a, struct timeval *b)
{
	unsigned long usec;

	usec = (b->tv_sec - a->tv_sec) * 1000000;
	usec += b->tv_usec - a->tv_usec;
	return usec;
}

unsigned long workload(void *addr, size_t len, unsigned long *fault_order, int child)
{
	struct timeval start, end;
	unsigned long i;

	gettimeofday(&start, NULL);
	for (i = 0; i < len; i++) {
		unsigned long p = *(fault_order + i);
		*((unsigned char *)(addr + (p * page_size))) = ((i * p) % 0xff);
	}
	gettimeofday(&end, NULL);

	DPRINTF("[%s: child-%d (CPU=%d PID=%ld)] RUNNING! \n",
		prg_name, child, sched_getcpu(), (long) getpid());

	return usec_diff(&start, &end);
}

int child(int n, FILE *stream)
{
	unsigned long pages[NPAGES];
	size_t map_sz;
	int i, runs;
	void *addr;
	double elapsed = 0;

	for (i = 0; i < NPAGES; i++)
		pages[i] = i;

	map_sz = page_size * NPAGES;
	addr = mmap(NULL, map_sz, PROT_READ | PROT_WRITE,
				MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
	if (addr == MAP_FAILED)
		ERROR_EXIT("mmap");

	if (madvise(addr, map_sz, MADV_NOHUGEPAGE) == -1)
		ERROR_EXIT("madvise");

	srand(time(NULL));

	for (runs = 0; runs < NRUNS; runs++) {
		sem_wait(sem);
		elapsed += workload(addr, NPAGES, pages, n);
		fy_shuffler(pages, NPAGES);
		sem_post(sem);
		/*
		 * relinquish the CPU to provide a small backoff, so other tasks
		 * get a fair chance on acquiring the semaphore.
		 */
		sched_yield();
	}

	fprintf(stream, "[%s: child-%d (CPU=%d PID=%ld)]  %lf msecs\n",
		prg_name, n, sched_getcpu(), (long) getpid(), (double )(elapsed / 1000));

	return 0;
}

int main(int argc, char *argv[])
{
	pid_t pid[NCHILDS];
	int i, ret, status;
	cpu_set_t set;

	CPU_ZERO(&set);		/* clear the set */
	CPU_SET(1, &set);
	if (sched_setaffinity(0, sizeof(cpu_set_t), &set) == -1)
		ERROR_EXIT("sched_setaffinity");

	if ((system_hz = sysconf(_SC_CLK_TCK)) == -1)
		ERROR_EXIT("sysconf");

	if ((page_size = sysconf(_SC_PAGESIZE)) == -1)
		ERROR_EXIT("sysconf");

	sem = sem_open(prg_name, O_CREAT, S_IRUSR | S_IWUSR, 0);
	if (sem == SEM_FAILED)
		ERROR_EXIT("sem_open");

	for (i = 0; i < NCHILDS; i++) {
		pid[i] = fork();
		switch (pid[i]) {
		case -1:	/* fork() has failed */
			ERROR_EXIT("fork");
			break;
		case 0:		/* child of a successful fork() */
			ret = child(i+1, stdout);
			exit(ret);
			break;
		}
	}

	sem_post(sem);

	for (;;) {
		if (wait(&status) == -1) {
			if (errno == ECHILD) {
				goto out;
			} else {
				ERROR_EXIT("wait");
			}
		}
	}
out:
	sem_close(sem);
	sem_unlink(prg_name);
	exit(EXIT_SUCCESS);
}


* Re: [PATCH 3/3] arm64: tlb: skip tlbi broadcast
  2020-02-23 19:25 ` [PATCH 3/3] arm64: tlb: skip tlbi broadcast Andrea Arcangeli
  2020-03-02 15:24   ` Rafael Aquini
@ 2020-03-09 11:22   ` Catalin Marinas
  2020-03-14  3:16     ` Andrea Arcangeli
  1 sibling, 1 reply; 15+ messages in thread
From: Catalin Marinas @ 2020-03-09 11:22 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Will Deacon, Rafael Aquini, Mark Salter, Jon Masters,
	linux-kernel, linux-mm, linux-arm-kernel, Michal Hocko, QI Fuli

Hi Andrea,

On Sun, Feb 23, 2020 at 02:25:20PM -0500, Andrea Arcangeli wrote:
>  switch_mm(struct mm_struct *prev, struct mm_struct *next,
>  	  struct task_struct *tsk)
>  {
> -	if (prev != next)
> -		__switch_mm(next);
> +	unsigned int cpu = smp_processor_id();
> +
> +	if (!per_cpu(cpu_not_lazy_tlb, cpu)) {
> +		per_cpu(cpu_not_lazy_tlb, cpu) = true;
> +		atomic_inc(&next->context.nr_active_mm);
> +		__switch_mm(next, cpu);
> +	} else if (prev != next) {
> +		atomic_inc(&next->context.nr_active_mm);
> +		__switch_mm(next, cpu);
> +		atomic_dec(&prev->context.nr_active_mm);
> +	}

IIUC, nr_active_mm keeps track of how many instances of the current pgd
(TTBR0_EL1) are active.

> +enum tlb_flush_types tlb_flush_check(struct mm_struct *mm, unsigned int cpu)
> +{
> +	if (atomic_read(&mm->context.nr_active_mm) <= 1) {
> +		bool is_local = current->active_mm == mm &&
> +			per_cpu(cpu_not_lazy_tlb, cpu);
> +		cpumask_t *stale_cpumask = mm_cpumask(mm);
> +		unsigned int next_zero = cpumask_next_zero(-1, stale_cpumask);
> +		bool local_is_clear = false;
> +		if (next_zero < nr_cpu_ids &&
> +		    (is_local && next_zero == cpu)) {
> +			next_zero = cpumask_next_zero(next_zero, stale_cpumask);
> +			local_is_clear = true;
> +		}
> +		if (next_zero < nr_cpu_ids) {
> +			cpumask_setall(stale_cpumask);
> +			local_is_clear = false;
> +		}
> +
> +		/*
> +		 * Enforce CPU ordering between the above
> +		 * cpumask_setall(mm_cpumask) and the below
> +		 * atomic_read(nr_active_mm).
> +		 */
> +		smp_mb();
> +
> +		if (likely(atomic_read(&mm->context.nr_active_mm)) <= 1) {
> +			if (is_local) {
> +				if (!local_is_clear)
> +					cpumask_clear_cpu(cpu, stale_cpumask);
> +				return TLB_FLUSH_LOCAL;
> +			}
> +			if (atomic_read(&mm->context.nr_active_mm) == 0)
> +				return TLB_FLUSH_NO;
> +		}
> +	}
> +	return TLB_FLUSH_BROADCAST;

And this code here can assume that if nr_active_mm <= 1, no broadcast is
necessary.

One concern I have is the ordering between TTBR0_EL1 update in
cpu_do_switch_mm() and the nr_active_mm, both on a different CPU. We
only have an ISB for context synchronisation on that CPU but I don't
think the architecture guarantees any relation between sysreg access and
the memory update. We have a DSB but that's further down in switch_to().

However, what worries me more is that you can now potentially do a TLB
shootdown without clearing the intermediate (e.g. VA to pte) walk caches
from the TLB. Even if the corresponding pgd and ASID are no longer
active on other CPUs, I'm not sure it's entirely safe to free (and
re-allocate) pages belonging to a pgtable without first flushing the
TLB. All the architecture spec states is that the software must first
clear the entry followed by TLBI (the break-before-make rules).
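
For reference, the conventional break-before-make sequence for changing
a live translation table entry looks roughly like the sketch below
(illustrative pseudo-kernel C only; the tlbi helper name is a
placeholder, not the exact arm64 API):

/*
 * Break-before-make, illustrative only: helper names are placeholders,
 * not the exact arm64 kernel API.
 */
static void break_before_make(pte_t *ptep, pte_t new_pte, unsigned long va)
{
	set_pte(ptep, __pte(0));	/* 1. break: install an invalid entry */
	dsb(ishst);			/* 2. make the store visible to the table walker */
	flush_tlb_page_va(va);		/* 3. placeholder for the TLBI (broadcast) for va */
	dsb(ish);			/* 4. wait for the invalidation to complete */
	set_pte(ptep, new_pte);		/* 5. make: install the new entry */
}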

That said, the benchmark numbers are not very encouraging. Around 1%
improvement in a single run could just as well be noise. Something
like hackbench may also show a slight impact on the context switch path.
Maybe with a true NUMA machine with hundreds of CPUs we may see a
difference, but it depends on how well the TLBI is implemented.

Thanks.

-- 
Catalin



* Re: [PATCH 3/3] arm64: tlb: skip tlbi broadcast
  2020-03-09 11:22   ` Catalin Marinas
@ 2020-03-14  3:16     ` Andrea Arcangeli
  2020-03-16 14:09       ` Mark Rutland
  0 siblings, 1 reply; 15+ messages in thread
From: Andrea Arcangeli @ 2020-03-14  3:16 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Will Deacon, Rafael Aquini, Mark Salter, Jon Masters,
	linux-kernel, linux-mm, linux-arm-kernel, Michal Hocko, QI Fuli

Hi Catalin,

On Mon, Mar 09, 2020 at 11:22:42AM +0000, Catalin Marinas wrote:
> IIUC, nr_active_mm keeps track of how many instances of the current pgd
> (TTBR0_EL1) are active.

Correct.

> And this code here can assume that if nr_active_mm <= 1, no broadcast is
> necessary.

Yes.

> One concern I have is the ordering between TTBR0_EL1 update in
> cpu_do_switch_mm() and the nr_active_mm, both on a different CPU. We
> only have an ISB for context synchronisation on that CPU but I don't
> think the architecture guarantees any relation between sysreg access and
> the memory update. We have a DSB but that's further down in switch_to().

There are several cpu_do_switch_mm() TTBR0_EL1 updates and nr_active_mm
updates that can happen on different CPUs simultaneously, so it's hard
to see exactly which one you refer to.

Overall the idea here is that even if a speculative TLB lookup happens
in between those updates, while the "mm" is going away and
atomic_dec(&mm->nr_active_mm) is being called on the mm, it doesn't
matter, because no userland software can use such a stale TLB entry
before local_flush_tlb_asid() gets rid of it.

The v1 patch (before I posted the incremental mm_count check) had
issues with speculatively loaded stale TLB entries only because they
weren't guaranteed to be flushed when the kernel thread switched back
to a userland process. So it relied on the CPU not speculatively
loading random pagetables while the kernel thread was running in lazy
TLB mode; here the flush is guaranteed, and in turn the CPU can always
load any TLB entry it wants at any given time.

> However, what worries me more is that you can now potentially do a TLB
> shootdown without clearing the intermediate (e.g. VA to pte) walk caches
> from the TLB. Even if the corresponding pgd and ASID are no longer
> active on other CPUs, I'm not sure it's entirely safe to free (and
> re-allocate) pages belonging to a pgtable without first flushing the

With regard to not doing a tlbi broadcast, nothing fundamentally
changed between v1 (single threaded, using mm_users) and the latest
version (multithreaded, introducing nr_active_mm).

The v1 only skipped the tlbi broadcast on remote CPUs that had run the
ASID of a single threaded process before a CPU migration, but the
pages could already be reallocated from the point of view of the
remote CPUs.

In the current version the page can be reallocated even from the point
of view of the current CPU. However, the fact that the window has been
enlarged significantly should be a good thing: if something were wrong
with it, it would have been far easier to reproduce.

This concern is still a corollary of the previous paragraph: it's
still about stale TLB entries being left behind for an ASID that can't
ever be reached through the currently loaded ASID.

> TLB. All the architecture spec states is that the software must first
> clear the entry followed by TLBI (the break-before-make rules).

The "break" in "break-before-make" is still guaranteed or it wouldn't
boot: however it's not implemented with the tlbi broadcast anymore.

The break is implemented by enforcing that no stale TLB entry of such
an ASID exists in the TLB while any userland code runs.

The x86 specs assumed an OS would allocate a TSS per process and that
you would do a context switch by using a task gate. I recall the first
Linux version I used had a TSS per process as envisioned by the
specs. Later the TSS became per-CPU, the esp0 pointer was updated
instead (native_load_sp0), and the stack was switched by hand.

I guess reading the specs may be confusing after such a software
change, but that doesn't mean the software shouldn't optimize things
beyond the specs if it's safe to do so and it's not explicitly forbidden.

The page being reused under another virtual address on another CPU
isn't necessarily an invalid scenario from the point of view of the
CPU. It looks invalid if you assume the page is freed; if you instead
think of it as a MAP_SHARED page that gets one more mapping associated
with it (the reuse of the page) from another CPU, it looks more
legitimate. The fact that there's an old mapping left in the MAP_SHARED
pagecache indefinitely doesn't mean the CPU with such an old mapping
left in the TLB is allowed to change the content of the page, as long
as the software never writes to that virtual address through the old
mapping. The important thing is that the content of the page must not
change unless the software running on the CPU explicitly writes through
the virtual address that corresponds to the stale TLB entry (and it's
guaranteed the software won't write to it). The stale TLB entries of
such an ASID are eventually flushed, either through a bump of the ASID
generation or through a local ASID flush.

> That said, the benchmark numbers are not very encouraging. Around 1%
> improvement in a single run, it can as well be noise. Also something
> like hackbench may also show a slight impact on the context switch path.

I recall I tested hackbench and it appeared faster with processes,
otherwise it was within measurement error.

hackbench with processes is fork heavy, so it gets some benefit because
all those post-fork copy-on-write faults would otherwise trigger tlbi
broadcasts on all CPUs to flush the write-protected TLB entry.
Specifically, the flush converted to a local TLB flush is
ptep_clear_flush_notify in wp_page_copy(), and there's one for each
page modified by parent or child.
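
As a rough, self-contained sketch of that pattern (illustrative only,
not hackbench itself; the file name and sizes are arbitrary): every
write below faults on a write-protected PTE, and the first writer of
each page goes through the wp_page_copy() path mentioned above:

/* cow-demo.c -- compile with: gcc -O2 -o cow-demo cow-demo.c */
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define NPAGES	256

int main(void)
{
	long psz = sysconf(_SC_PAGESIZE);
	char *buf = mmap(NULL, NPAGES * psz, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return 1;
	memset(buf, 1, NPAGES * psz);	/* populate the pages before fork() */

	if (fork() == 0) {
		/* child: each write faults on a now write-protected PTE */
		for (long i = 0; i < NPAGES; i++)
			buf[i * psz] = 2;
		_exit(0);
	}
	/* parent: the first writer of each page triggers the COW copy + flush */
	for (long i = 0; i < NPAGES; i++)
		buf[i * psz] = 3;
	wait(NULL);
	return 0;
}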

> Maybe with a true NUMA machine with hundreds of CPUs we may see a
> difference, but it depends on how well the TLBI is implemented.

The numbers in the commit header were not from a single run. perf stat
-r 10 -e dummy runs it 10 times and then shows the stdev along with the
number, so you can see what the noise was. It wasn't only a 1%
improvement either. Overall there's no noise in the measurement.

tcmalloc_large_heap_fragmentation_unittest, which simulates many
different containers dealing with small objects at the same time, was
9.4% faster (%stdev 0.36% 0.29%), with 32 CPUs and no NUMA.

Running 32 `sort` commands in parallel, repeated 256 times in a row,
was 10.3% faster (%stdev 0.77% 0.38%), with 32 CPUs and no NUMA.

The multithreaded microbenchmark runs 16x faster. That's not meaningful
by itself, but it's still a good hint that real-life workloads
(especially those with frequent MADV_DONTNEED) will run faster, and
they did (and a verification that multithreaded apps can now be
optimized too).
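
As a rough, self-contained sketch of that MADV_DONTNEED pattern
(illustrative only, not the actual microbenchmark; the file name, span
size and loop count are arbitrary): a span is repeatedly faulted in and
zapped while another thread of the process sits idle, which is the case
this series aims to serve with a local flush instead of a tlbi
broadcast:

/* madv-demo.c -- compile with: gcc -O2 -o madv-demo madv-demo.c -pthread */
#include <pthread.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define SPAN	(64UL * 1024)
#define LOOPS	100000

/* a second thread that exists but never runs, like an idle worker */
static void *idle_thread(void *arg)
{
	(void)arg;
	pause();
	return NULL;
}

int main(void)
{
	pthread_t t;
	char *span = mmap(NULL, SPAN, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (span == MAP_FAILED)
		return 1;
	if (pthread_create(&t, NULL, idle_thread, NULL))
		return 1;

	for (long i = 0; i < LOOPS; i++) {
		memset(span, 1, SPAN);			/* fault the span in */
		madvise(span, SPAN, MADV_DONTNEED);	/* zap it: triggers a TLB flush */
	}
	return 0;
}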

Rafael already posted a benchmark specifically stressing the context
switch.

It's reasonable to expect any multi-socket NUMA system to show more
benefit from the optimization than the 32-CPU non-NUMA system used for
the current benchmarks.

Thanks,
Andrea




* Re: [PATCH 3/3] arm64: tlb: skip tlbi broadcast
  2020-03-14  3:16     ` Andrea Arcangeli
@ 2020-03-16 14:09       ` Mark Rutland
  2020-03-31  9:45         ` Mark Rutland
  0 siblings, 1 reply; 15+ messages in thread
From: Mark Rutland @ 2020-03-16 14:09 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Catalin Marinas, Will Deacon, Rafael Aquini, Mark Salter,
	Jon Masters, linux-kernel, linux-mm, linux-arm-kernel,
	Michal Hocko, QI Fuli

Hi Andrea,

On Fri, Mar 13, 2020 at 11:16:09PM -0400, Andrea Arcangeli wrote:
> On Mon, Mar 09, 2020 at 11:22:42AM +0000, Catalin Marinas wrote:
> > One concern I have is the ordering between TTBR0_EL1 update in
> > cpu_do_switch_mm() and the nr_active_mm, both on a different CPU. We
> > only have an ISB for context synchronisation on that CPU but I don't
> > think the architecture guarantees any relation between sysreg access and
> > the memory update. We have a DSB but that's further down in switch_to().
> 
> There are several cpu_do_switch_mm updating TTBR0_EL1 and nr_active_mm
> updates that can happen on different CPUs simultaneously. It's hard to
> see exactly which one you refer to.
> 
> Overall the idea here is that even if a speculative tlb lookup happens
> in between those updates while the "mm" is going away and
> atomic_dec(&mm->nr_active_mm) is being called on the mm, it doesn't
> matter because no userland software can use such stale tlb anymore
> until local_flush_tlb_asid() gets rid of it.

I think you're focussing on the use of the end-to-end translations by
architecturally executed instructions, while Catalin is describing
problems resulting from the use of intermediate translations by
speculative walks.

The concern here is that speculation can result in intermediate walk
entries in the TLB being used to continue page table walks, and these
walks can have side-effects regardless of whether the resulting entries
are consumed by architecturally executed instructions.

For example, say we free a PMD page for which there is a stale
intermediate entry in a TLB, and that page gets reallocated for
arbitrary data. A speculative walk can consume that data as if it were a
PMD, and depending on the value it sees, a number of things can happen.
If the value happens to point at MMIO, the MMU might read from that MMIO
with side-effects. If the value happens to contain an architecturally
invalid configuration the result can be CONSTRAINED UNPREDICTABLE and
not limited to the corresponding ASID.

Even if the resulting end-to-end translation is never architecturally
consumed, there are potential problems here, and I think that this series
is assuming stronger-than-architected behaviour from the MMU and page
table walks.

> This concern is still a corollary of the previous paragraph: it's
> still about stale tlb entries being left in an asid that can't ever be
> used through the current asid.

I'm not certain if the architecture generally allows or forbids walks to
be continued for an ASID that is not live in a TTBR, but there are cases
where that is possible, so I don't think the above accurately captures
the situation. Stale intermediate entries can certainly be consumed in
some cases.

> > TLB. All the architecture spec states is that the software must first
> > clear the entry followed by TLBI (the break-before-make rules).
> 
> The "break" in "break-before-make" is still guaranteed or it wouldn't
> boot: however it's not implemented with the tlbi broadcast anymore.
>
> The break is implemented by enforcing no stale tlb entry of such asid
> exists in the TLB while any userland code runs.

I understand the point you're making, but please don't overload the
terminology. Break-Before-Make is a well-defined term which refers to a
specific sequence which includes TLB invalidation, and the above is not
a break per that definition.

> X86 specs supposed an OS would allocate a TSS per-process and you
> would do a context switch by using a task gate. I recall the first
> Linux version I used had a TSS per process as envisioned by the
> specs. Later the TSS become per-CPU and the esp0 pointer was updated
> instead (native_load_sp0) and the stack was switched by hand.
> 
> I guess reading the specs may result confusing after such a software
> change, that doesn't mean the software shouldn't optimize things
> behind the specs if it's safe to do and it's not explicitly forbidden.

I think the important thing is that this is not necessarily safe. The
architecture currently states that a Break-Before-Make sequence is
necessary, and this series does not follow that.

AFAICT, this series relies on:

* An ISB completing prior page table walks when updating TTBR. I don't
  believe this is necessarily the case, given how things work for an
  EL1->EL2 transition where there can be ongoing EL1 walks.

* Walks never being initiated for `inactive` contexts within the current
  translation regime. e.g. while ASID x is installed, never starting a
  walk for ASID y. I can imagine that the architecture may permit a form
  of this starting with intermediate walk entries in the TLBs.

Before we can say whether this series is safe, we'd need to have a
better description of how a PE may use `inactive` values for an
in-context translation regime. 

I've asked for some clarification on these points.

Thanks,
Mark.



* RE: [PATCH 0/3] arm64: tlb: skip tlbi broadcast v2
  2020-02-23 19:25 [PATCH 0/3] arm64: tlb: skip tlbi broadcast v2 Andrea Arcangeli
                   ` (3 preceding siblings ...)
  2020-03-03 13:01 ` [PATCH 0/3] arm64: tlb: skip tlbi broadcast v2 Larry Woodman
@ 2020-03-18  8:53 ` qi.fuli
  4 siblings, 0 replies; 15+ messages in thread
From: qi.fuli @ 2020-03-18  8:53 UTC (permalink / raw)
  To: 'Andrea Arcangeli',
	Will Deacon, Catalin Marinas, Rafael Aquini, Mark Salter
  Cc: Jon Masters, linux-kernel, linux-mm, linux-arm-kernel,
	Michal Hocko, qi.fuli

> 
> Hello,
> 
> This is introducing a nr_active_mm that allows to optimize away the tlbi broadcast
> also for multi threaded processes, it doesn't rely anymore on mm_users <= 1.
> 
> This also optimizes away all TLB flushes (including local ones) when the process is not
> running in any cpu (including during exit_mmap with lazy tlb state).
> 
> This optimization is generally only observable when there are parallel TLB flushes
> from different processes in multiple CPUs. One possible use case is an userland malloc
> libs freeing small objects with MADV_DONTNEED and causing a frequent tiny tlb
> flushes as demonstrated by the tcmalloc testsuite.
> 
> All memory intensive apps dealing a multitude of frequently freed small objects tend
> to opt-out of glibc and they opt-in jemalloc or tcmalloc, so this should facilitate the
> SMP/NUMA scalability of long lived apps with small objects running in different
> containers if they're issuing frequent MADV_DONTNEED tlb flushes while the other
> threads of the process are not running.
> 
> I was suggested to implement the mm_cpumask the standard way in order to
> optimize multithreaded apps too and to avoid restricting the optimization to
> mm_users <= 1. So initially I had two bitmasks allocated as shown at the bottom of
> this cover letter, by setting ARCH_NR_MM_CPUMASK to 2 with the below patch
> applied... however I figured a single atomic per-mm achieves the exact same runtime
> behavior of the extra bitmap, so I just dropped the extra bitmap and I replaced it with
> nr_active_mm as an optimization.
> 
> If the switch_mm atomic ops in the switch_mm fast path would be a concern (they're
> still faster than the cpumask_set_cpu/clear_cpu, with less than 256-512 CPUs), it's
> worth mentioning it'd be possible to remove all atomic ops from the switch_mm fast
> path by restricting this optimization to single threaded processes by checking
> mm_users <= 1 and < 1 instead of nr_active_mm <= 1 and < 1 similarly to what the
> earlier version of this patchset was doing.
> 
> Thanks,
> Andrea
> 
> Andrea Arcangeli (3):
>   mm: use_mm: fix for arches checking mm_users to optimize TLB flushes
>   arm64: select CPUMASK_OFFSTACK if NUMA
>   arm64: tlb: skip tlbi broadcast
> 
>  arch/arm64/Kconfig                   |  1 +
>  arch/arm64/include/asm/efi.h         |  2 +-
>  arch/arm64/include/asm/mmu.h         |  4 +-
>  arch/arm64/include/asm/mmu_context.h | 33 ++++++++--
>  arch/arm64/include/asm/tlbflush.h    | 95
> +++++++++++++++++++++++++++-
>  arch/arm64/mm/context.c              | 54 ++++++++++++++++
>  mm/mmu_context.c                     |  2 +
>  7 files changed, 180 insertions(+), 11 deletions(-)
> 
> Early attempt with the standard mm_cpumask follows:
> 
> From: Andrea Arcangeli <aarcange@redhat.com>
> Subject: mm: allow per-arch mm_cpumasks based on ARCH_NR_MM_CPUMASK
> 
> Allow archs to allocate multiple mm_cpumasks in the mm_struct per-arch by
> definining a ARCH_NR_MM_CPUMASK > 1 (to be included before
> "linux/mm_types.h").
> 
> Those extra per-mm cpumasks can be referenced with __mm_cpumask(N, mm),
> where N == 0 points to the mm_cpumask() known by the common code and N > 0
> points to the per-arch private ones.
> ---
>  drivers/firmware/efi/efi.c |  3 ++-
>  include/linux/mm_types.h   | 17 +++++++++++++++--
>  kernel/fork.c              |  3 ++-
>  mm/init-mm.c               |  2 +-
>  4 files changed, 20 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c index
> 5da0232ae33f..608c9bf181e5 100644
> --- a/drivers/firmware/efi/efi.c
> +++ b/drivers/firmware/efi/efi.c
> @@ -86,7 +86,8 @@ struct mm_struct efi_mm = {
>  	.mmap_sem		= __RWSEM_INITIALIZER(efi_mm.mmap_sem),
>  	.page_table_lock	= __SPIN_LOCK_UNLOCKED(efi_mm.page_table_lock),
>  	.mmlist			= LIST_HEAD_INIT(efi_mm.mmlist),
> -	.cpu_bitmap		= { [BITS_TO_LONGS(NR_CPUS)] = 0},
> +	.cpu_bitmap		= { [BITS_TO_LONGS(NR_CPUS) *
> +				     ARCH_NR_MM_CPUMASK] = 0},
>  };
> 
>  struct workqueue_struct *efi_rts_wq;
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index
> f29bba20bba1..b53d5622b3b2 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -531,6 +531,9 @@ struct mm_struct {
>  	RH_KABI_RESERVE(7)
>  	RH_KABI_RESERVE(8)
> 
> +#ifndef ARCH_NR_MM_CPUMASK
> +#define ARCH_NR_MM_CPUMASK 1
> +#endif
>  	/*
>  	 * The mm_cpumask needs to be at the end of mm_struct, because it
>  	 * is dynamically sized based on nr_cpu_ids.
> @@ -544,15 +547,25 @@ extern struct mm_struct init_mm;  static inline void
> mm_init_cpumask(struct mm_struct *mm)  {
>  	unsigned long cpu_bitmap = (unsigned long)mm;
> +	int i;
> 
>  	cpu_bitmap += offsetof(struct mm_struct, cpu_bitmap);
> -	cpumask_clear((struct cpumask *)cpu_bitmap);
> +	for (i = 0; i < ARCH_NR_MM_CPUMASK; i++) {
> +		cpumask_clear((struct cpumask *)cpu_bitmap);
> +		cpu_bitmap += cpumask_size();
> +	}
>  }
> 
>  /* Future-safe accessor for struct mm_struct's cpu_vm_mask. */
> +static inline cpumask_t *__mm_cpumask(int index, struct mm_struct *mm)
> +{
> +	return (struct cpumask *)((unsigned long)&mm->cpu_bitmap +
> +				  cpumask_size() * index);
> +}
> +
>  static inline cpumask_t *mm_cpumask(struct mm_struct *mm)  {
> -	return (struct cpumask *)&mm->cpu_bitmap;
> +	return __mm_cpumask(0, mm);
>  }
> 
>  struct mmu_gather;
> diff --git a/kernel/fork.c b/kernel/fork.c index 1dad2f91fac3..a6cbbc1b6008 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -2418,7 +2418,8 @@ void __init proc_caches_init(void)
>  	 * dynamically sized based on the maximum CPU number this system
>  	 * can have, taking hotplug into account (nr_cpu_ids).
>  	 */
> -	mm_size = sizeof(struct mm_struct) + cpumask_size();
> +	mm_size = sizeof(struct mm_struct) + cpumask_size() * \
> +		ARCH_NR_MM_CPUMASK;
> 
>  	mm_cachep = kmem_cache_create_usercopy("mm_struct",
>  			mm_size, ARCH_MIN_MMSTRUCT_ALIGN,
> diff --git a/mm/init-mm.c b/mm/init-mm.c index a787a319211e..d975f8ce270e
> 100644
> --- a/mm/init-mm.c
> +++ b/mm/init-mm.c
> @@ -35,6 +35,6 @@ struct mm_struct init_mm = {
>  	.arg_lock	=  __SPIN_LOCK_UNLOCKED(init_mm.arg_lock),
>  	.mmlist		= LIST_HEAD_INIT(init_mm.mmlist),
>  	.user_ns	= &init_user_ns,
> -	.cpu_bitmap	= { [BITS_TO_LONGS(NR_CPUS)] = 0},
> +	.cpu_bitmap	= { [BITS_TO_LONGS(NR_CPUS) *
> ARCH_NR_MM_CPUMASK] = 0},
>  	INIT_MM_CONTEXT(init_mm)
>  };
> 
> 
> [bitmap version depending on the above follows]
> 
> @@ -248,6 +260,42 @@ void check_and_switch_context(struct mm_struct *mm,
> unsigned int cpu)
>  		cpu_switch_mm(mm->pgd, mm);
>  }
> 
> +enum tlb_flush_types tlb_flush_check(struct mm_struct *mm, unsigned int
> +cpu) {
> +	if (cpumask_any_but(mm_cpumask(mm), cpu) >= nr_cpu_ids) {
> +		bool is_local = cpumask_test_cpu(cpu, mm_cpumask(mm));
> +		cpumask_t *stale_cpumask = __mm_cpumask(1, mm);
> +		int next_zero = cpumask_next_zero(-1, stale_cpumask);
> +		bool local_is_clear = false;
> +		if (next_zero < nr_cpu_ids &&
> +		    (is_local && next_zero == cpu)) {
> +			next_zero = cpumask_next_zero(next_zero,
> stale_cpumask);
> +			local_is_clear = true;
> +		}
> +		if (next_zero < nr_cpu_ids) {
> +			cpumask_setall(stale_cpumask);
> +			local_is_clear = false;
> +		}
> +
> +		/*
> +		 * Enforce CPU ordering between the
> +		 * cpumask_setall() and cpumask_any_but().
> +		 */
> +		smp_mb();
> +
> +		if (likely(cpumask_any_but(mm_cpumask(mm),
> +					   cpu) >= nr_cpu_ids)) {
> +			if (is_local) {
> +				if (!local_is_clear)
> +					cpumask_clear_cpu(cpu,
> stale_cpumask);
> +				return TLB_FLUSH_LOCAL;
> +			}
> +			return TLB_FLUSH_NO;
> +		}
> +	}
> +	return TLB_FLUSH_BROADCAST;
> +}
> +
>  /* Errata workaround post TTBRx_EL1 update. */  asmlinkage void
> post_ttbr_update_workaround(void)  {

Hi Andrea,

Thank you very much for your patch.
I also tested this v2 patch with the Himeno benchmark[1] on ThunderX2 with a v5.5.7 kernel.
In order to confirm the effect of the patch, I used the mprotect-attacker-threaded.c[2] program
to issue tlbi broadcast instructions as noise, and made it run on a single core or on multiple
cores via the taskset command. The following is the result.
[w/o patch, w/o noise program]
MFLOPS :  1614.438709 +- 2.640061
[w/o patch, w/ noise program running on multiple cores]
MFLOPS :  1152.862612 +- 0.7933615
[w/o patch, w/ noise program running on a single core]
MFLOPS :  1152.996692 +- 1.6517165
[w/ patch, w/o noise program]
MFLOPS :  1613.904908 +- 0.606810
[w/ patch, w/ noise program running on multiple cores]
MFLOPS :  1614.586227 +- 0.3268545
[w/ patch, w/ noise program running on a single core]
MFLOPS :  1614.910829 +- 0.694644

[1] http://accc.riken.jp/en/supercom/documents/himenobmt/
[2]
$ cat mprotect-attacker-threaded.c
/* compile with: gcc -o mprotect-attacker-threaded mprotect-attacker-threaded.c -pthread */
#include <stdio.h>
#include <sys/mman.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <pthread.h>

/* idle thread: keeps the process multithreaded without ever running */
static void *sleeper(void *arg)
{
        (void)arg;
        pause();
        return NULL;
}

int main(int argc, char *argv[]){
        int i;
        char *buf;
        long size = 4 * 1024 * 1024;
        long loop = 10;

        pthread_t pthread;
        if (pthread_create(&pthread, NULL, sleeper, NULL))
                perror("pthread_create"), exit(1);

        if(argc == 2){
                loop = atoi(argv[1]);
        }

        buf = mmap(NULL, size, PROT_WRITE, MAP_PRIVATE|MAP_ANON, -1, 0);
        if(buf == MAP_FAILED){
                perror("mmap");
                exit(1);
        }
        memset(buf, 1, size);
        for(i = 0; i < loop; i++){
                mprotect(buf, size, PROT_READ);
                mprotect(buf, size, PROT_WRITE);
        }
        munmap(buf, size);

        return 0;
}

Sincerely,
QI Fuli



* Re: [PATCH 3/3] arm64: tlb: skip tlbi broadcast
  2020-03-16 14:09       ` Mark Rutland
@ 2020-03-31  9:45         ` Mark Rutland
  2020-04-01  0:32           ` Andrea Arcangeli
  0 siblings, 1 reply; 15+ messages in thread
From: Mark Rutland @ 2020-03-31  9:45 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Catalin Marinas, Will Deacon, Rafael Aquini, Mark Salter,
	Jon Masters, linux-kernel, linux-mm, linux-arm-kernel,
	Michal Hocko, QI Fuli

Hi Andrea,

On Mon, Mar 16, 2020 at 02:09:07PM +0000, Mark Rutland wrote:
> AFAICT, this series relies on:
> 
> * An ISB completing prior page table walks when updating TTBR. I don't
>   believe this is necessarily the case, given how things work for an
>   EL1->EL2 transition where there can be ongoing EL1 walks.

I've had confirmation that a DSB is necessary (after the MSR and ISB) to
complete any ongoing translation table walks for the stale context.

Without a DSB, those walks can observe subsequent stores and encounter
the usual set of CONSTRAINED UNPREDICTABLE behaviours (e.g. walking into
MMIO with side-effects, continuing from amalgamated entries, etc). Those
issues are purely to do with the walk, and apply regardless of whether
the resulting translations are architecturally consumed.

> * Walks never being initiated for `inactive` contexts within the current
>   translation regime. e.g. while ASID x is installed, never starting a
>   walk for ASID y. I can imagine that the architecture may permit a form
>   of this starting with intermediate walk entries in the TLBs.

I'm still chasing this point.

Thanks,
Mark.



* Re: [PATCH 3/3] arm64: tlb: skip tlbi broadcast
  2020-03-31  9:45         ` Mark Rutland
@ 2020-04-01  0:32           ` Andrea Arcangeli
  0 siblings, 0 replies; 15+ messages in thread
From: Andrea Arcangeli @ 2020-04-01  0:32 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Catalin Marinas, Will Deacon, Rafael Aquini, Mark Salter,
	Jon Masters, linux-kernel, linux-mm, linux-arm-kernel,
	Michal Hocko, QI Fuli

Hello Mark,

On Tue, Mar 31, 2020 at 10:45:11AM +0100, Mark Rutland wrote:
> Hi Andrea,
> 
> On Mon, Mar 16, 2020 at 02:09:07PM +0000, Mark Rutland wrote:
> > AFAICT, this series relies on:
> > 
> > * An ISB completing prior page table walks when updating TTBR. I don't
> >   believe this is necessarily the case, given how things work for an
> >   EL1->EL2 transition where there can be ongoing EL1 walks.
> 
> I've had confirmation that a DSB is necessary (after the MSR and ISB) to
> complete any ongoing translation table walks for the stale context.
> 
> Without a DSB, those walks can observe subsequent stores and encounter
> the usual set of CONSTRAINED UNPREDICTABLE behaviours (e.g. walking into
> MMIO with side-effects, continuing from amalgamted entries, etc). Those
> issues are purely to do with the walk, and apply regardless of whether
> the resulting translations are architecturally consumed.

Ok, sorry I didn't get it earlier... I attempted a quick fix below.

From ab30d8082be62fe24a97eceec5dbfeea8e278511 Mon Sep 17 00:00:00 2001
From: Andrea Arcangeli <aarcange@redhat.com>
Date: Tue, 31 Mar 2020 20:03:43 -0400
Subject: [PATCH 1/1] arm64: tlb: skip tlbi broadcast, fix speculative tlb
 lookups

Without a DSB in between "MSR; ISB" and "atomic_dec(&nr_active_mm)"
there's a risk that a speculative pagetable walk may still be walking
the pagetables of the unloaded ASID after nr_active_mm has been
decreased. In that case a remote CPU could free the pagetables and
reuse the memory without first issuing a tlbi broadcast, while the
speculative walk still runs on the unloaded ASID. For this reason the
speculative pagetable walks need to be flushed before decreasing
nr_active_mm.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 arch/arm64/include/asm/mmu_context.h | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/mmu_context.h b/arch/arm64/include/asm/mmu_context.h
index 9c66fe317e2f..d821ea3ce839 100644
--- a/arch/arm64/include/asm/mmu_context.h
+++ b/arch/arm64/include/asm/mmu_context.h
@@ -210,8 +210,18 @@ enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
 	if (per_cpu(cpu_not_lazy_tlb, cpu) &&
 	    is_idle_task(tsk)) {
 		per_cpu(cpu_not_lazy_tlb, cpu) = false;
-		if (!system_uses_ttbr0_pan())
+		if (!system_uses_ttbr0_pan()) {
 			cpu_set_reserved_ttbr0();
+			/*
+			 * DSB will flush the speculative pagetable
+			 * walks on the old asid. It's required before
+			 * decreasing nr_active_mm because after
+			 * decreasing nr_active_mm the tlbi broadcast
+			 * may not happen on the unloaded asid before
+			 * the pagetables are freed.
+			 */
+			dsb(ish);
+		}
 		atomic_dec(&mm->context.nr_active_mm);
 	}
 	VM_WARN_ON(atomic_read(&mm->context.nr_active_mm) < 0);
@@ -249,6 +259,14 @@ switch_mm(struct mm_struct *prev, struct mm_struct *next,
 	} else if (prev != next) {
 		atomic_inc(&next->context.nr_active_mm);
 		__switch_mm(next, cpu);
+		/*
+		 * DSB will flush the speculative pagetable walks on the old
+		 * asid. It's required before decreasing nr_active_mm because
+		 * after decreasing nr_active_mm the tlbi broadcast may not
+		 * happen on the unloaded asid before the pagetables are
+		 * freed.
+		 */
+		dsb(ish);
 		atomic_dec(&prev->context.nr_active_mm);
 	}
 	VM_WARN_ON(!atomic_read(&next->context.nr_active_mm));


I haven't tested it yet; since this is a theoretical issue, it is
better reviewed in the source.

> > * Walks never being initiated for `inactive` contexts within the current
> >   translation regime. e.g. while ASID x is installed, never starting a
> >   walk for ASID y. I can imagine that the architecture may permit a form
> >   of this starting with intermediate walk entries in the TLBs.
> 
> I'm still chasing this point.

Appreciated! I'll keep my fingers crossed that you don't find that
speculative lookups can randomly start on an unloaded ASID. That would
also imply it would be impossible on arm64 to use a different ASID on
different CPUs, as is normally done on other arches.

Thanks,
Andrea




end of thread, other threads:[~2020-04-01 10:37 UTC | newest]

Thread overview: 15+ messages
2020-02-23 19:25 [PATCH 0/3] arm64: tlb: skip tlbi broadcast v2 Andrea Arcangeli
2020-02-23 19:25 ` [PATCH 1/3] mm: use_mm: fix for arches checking mm_users to optimize TLB flushes Andrea Arcangeli
2020-03-03  4:28   ` Rafael Aquini
2020-02-23 19:25 ` [PATCH 2/3] arm64: select CPUMASK_OFFSTACK if NUMA Andrea Arcangeli
2020-03-03  4:31   ` Rafael Aquini
2020-02-23 19:25 ` [PATCH 3/3] arm64: tlb: skip tlbi broadcast Andrea Arcangeli
2020-03-02 15:24   ` Rafael Aquini
2020-03-04  4:19     ` Rafael Aquini
2020-03-09 11:22   ` Catalin Marinas
2020-03-14  3:16     ` Andrea Arcangeli
2020-03-16 14:09       ` Mark Rutland
2020-03-31  9:45         ` Mark Rutland
2020-04-01  0:32           ` Andrea Arcangeli
2020-03-03 13:01 ` [PATCH 0/3] arm64: tlb: skip tlbi broadcast v2 Larry Woodman
2020-03-18  8:53 ` qi.fuli
