linux-mips.vger.kernel.org archive mirror
* [PATCH v4 0/2] mm: arm64: bring up BATCHED_UNMAP_TLB_FLUSH
@ 2022-09-21  8:43 Yicong Yang
  2022-09-21  8:43 ` [PATCH v4 1/2] mm/tlbbatch: Introduce arch_tlbbatch_should_defer() Yicong Yang
  2022-09-21  8:43 ` [PATCH v4 2/2] arm64: support batched/deferred tlb shootdown during page reclamation Yicong Yang
  0 siblings, 2 replies; 18+ messages in thread
From: Yicong Yang @ 2022-09-21  8:43 UTC (permalink / raw)
  To: akpm, linux-mm, linux-arm-kernel, x86, catalin.marinas, will, linux-doc
  Cc: corbet, peterz, arnd, linux-kernel, darren, yangyicong,
	huzhanyuan, lipeifeng, zhangshiming, guojian, realmz6,
	linux-mips, openrisc, linuxppc-dev, linux-riscv, linux-s390,
	Barry Song, wangkefeng.wang, xhao, prime.zeng, anshuman.khandual

From: Yicong Yang <yangyicong@hisilicon.com>

Though ARM64 has the hardware to do tlb shootdown, the hardware
broadcasting is not free.
A simple micro-benchmark shows that even on a Snapdragon 888 with only
8 cores, the overhead of ptep_clear_flush is huge even when paging
out one page mapped by only one process:
5.36%  a.out    [kernel.kallsyms]  [k] ptep_clear_flush

When pages are mapped by multiple processes, or the HW has more CPUs,
the cost should become even higher due to the poor scalability of
tlb shootdown.

The same benchmark can result in 16.99% CPU consumption on an ARM64
server with around 100 cores, according to Yicong's test on patch
4/4.

This patchset leverages the existing BATCHED_UNMAP_TLB_FLUSH by
1. only sending tlbi instructions in the first stage -
	arch_tlbbatch_add_mm()
2. waiting for the completion of tlbi by dsb while doing tlbbatch
	sync in arch_tlbbatch_flush()
Testing on Snapdragon shows the overhead of ptep_clear_flush
is removed by the patchset. The micro-benchmark becomes 5% faster
even for one page mapped by a single process on Snapdragon 888.

-v4:
1. Add tags from Kefeng and Anshuman, Thanks.
2. Limit the TLB batch/defer on systems with >4 CPUs, per Anshuman
3. Merge previous Patch 1,2-3 into one, per Anshuman
Link: https://lore.kernel.org/linux-mm/20220822082120.8347-1-yangyicong@huawei.com/

-v3:
1. Declare arch's tlbbatch defer support by arch_tlbbatch_should_defer() instead
   of ARCH_HAS_MM_CPUMASK, per Barry and Kefeng
2. Add Tested-by from Xin Hao
Link: https://lore.kernel.org/linux-mm/20220711034615.482895-1-21cnbao@gmail.com/

-v2:
1. Collected Yicong's test result on kunpeng920 ARM64 server;
2. Removed the redundant vma parameter in arch_tlbbatch_add_mm()
   according to the comments of Peter Zijlstra and Dave Hansen
3. Added ARCH_HAS_MM_CPUMASK rather than checking if mm_cpumask
   is empty according to the comments of Nadav Amit

Thanks, Peter, Dave and Nadav, for your testing, reviewing,
and comments.

-v1:
https://lore.kernel.org/lkml/20220707125242.425242-1-21cnbao@gmail.com/


Anshuman Khandual (1):
  mm/tlbbatch: Introduce arch_tlbbatch_should_defer()

Barry Song (1):
  arm64: support batched/deferred tlb shootdown during page reclamation

 .../features/vm/TLB/arch-support.txt          |  2 +-
 arch/arm64/Kconfig                            |  1 +
 arch/arm64/include/asm/tlbbatch.h             | 12 ++++++
 arch/arm64/include/asm/tlbflush.h             | 37 ++++++++++++++++++-
 arch/x86/include/asm/tlbflush.h               | 15 +++++++-
 mm/rmap.c                                     | 19 ++++------
 6 files changed, 70 insertions(+), 16 deletions(-)
 create mode 100644 arch/arm64/include/asm/tlbbatch.h

-- 
2.24.0



* [PATCH v4 1/2] mm/tlbbatch: Introduce arch_tlbbatch_should_defer()
  2022-09-21  8:43 [PATCH v4 0/2] mm: arm64: bring up BATCHED_UNMAP_TLB_FLUSH Yicong Yang
@ 2022-09-21  8:43 ` Yicong Yang
  2022-09-21  8:54   ` Barry Song
  2022-09-21  8:43 ` [PATCH v4 2/2] arm64: support batched/deferred tlb shootdown during page reclamation Yicong Yang
  1 sibling, 1 reply; 18+ messages in thread
From: Yicong Yang @ 2022-09-21  8:43 UTC (permalink / raw)
  To: akpm, linux-mm, linux-arm-kernel, x86, catalin.marinas, will, linux-doc
  Cc: corbet, peterz, arnd, linux-kernel, darren, yangyicong,
	huzhanyuan, lipeifeng, zhangshiming, guojian, realmz6,
	linux-mips, openrisc, linuxppc-dev, linux-riscv, linux-s390,
	Barry Song, wangkefeng.wang, xhao, prime.zeng, anshuman.khandual,
	Anshuman Khandual

From: Anshuman Khandual <khandual@linux.vnet.ibm.com>

The entire scheme of deferred TLB flush in reclaim path rests on the
fact that the cost to refill TLB entries is less than flushing out
individual entries by sending IPI to remote CPUs. But architecture
can have different ways to evaluate that. Hence apart from checking
TTU_BATCH_FLUSH in the TTU flags, rest of the decision should be
architecture specific.
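
To make the split between the generic check and the arch decision concrete, here is a minimal userspace sketch, not kernel code. It assumes mm_cpumask() can be modelled as a 64-bit bitmask; other_cpu_in_mask(), the model_* functions and MODEL_TTU_BATCH_FLUSH are hypothetical stand-ins for cpumask_any_but(...) < nr_cpu_ids, the kernel functions in the diff below, and TTU_BATCH_FLUSH respectively.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for TTU_BATCH_FLUSH; the value is arbitrary here. */
#define MODEL_TTU_BATCH_FLUSH 0x1u

/*
 * True if any CPU other than this_cpu is set in mask. Models the test
 * cpumask_any_but(mask, cpu) < nr_cpu_ids with a 64-bit bitmask.
 */
static bool other_cpu_in_mask(uint64_t mask, unsigned int this_cpu)
{
	assert(this_cpu < 64);
	return (mask & ~(UINT64_C(1) << this_cpu)) != 0;
}

/* Models the x86 arch hook: defer only if remote CPUs need a flush. */
static bool model_arch_should_defer(uint64_t mm_cpumask, unsigned int this_cpu)
{
	return other_cpu_in_mask(mm_cpumask, this_cpu);
}

/* Models the generic check: flag test first, then the arch decision. */
static bool model_should_defer_flush(uint64_t mm_cpumask,
				     unsigned int this_cpu,
				     unsigned int flags)
{
	if (!(flags & MODEL_TTU_BATCH_FLUSH))
		return false;

	return model_arch_should_defer(mm_cpumask, this_cpu);
}
```

In this model, an mm that has only run on the current CPU (mask 0x1, CPU 0) is flushed immediately, while an mm that has also run on CPU 2 (mask 0x5) is deferred, provided the batch flag is set.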

Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
[https://lore.kernel.org/linuxppc-dev/20171101101735.2318-2-khandual@linux.vnet.ibm.com/]
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
[Rebase and fix incorrect return value type]
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
---
 arch/x86/include/asm/tlbflush.h | 12 ++++++++++++
 mm/rmap.c                       |  9 +--------
 2 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index cda3118f3b27..8a497d902c16 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -240,6 +240,18 @@ static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
 	flush_tlb_mm_range(vma->vm_mm, a, a + PAGE_SIZE, PAGE_SHIFT, false);
 }
 
+static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
+{
+	bool should_defer = false;
+
+	/* If remote CPUs need to be flushed then defer batch the flush */
+	if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids)
+		should_defer = true;
+	put_cpu();
+
+	return should_defer;
+}
+
 static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
 {
 	/*
diff --git a/mm/rmap.c b/mm/rmap.c
index 93d5a6f793d2..cd8cf5cb0b01 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -690,17 +690,10 @@ static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
  */
 static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
 {
-	bool should_defer = false;
-
 	if (!(flags & TTU_BATCH_FLUSH))
 		return false;
 
-	/* If remote CPUs need to be flushed then defer batch the flush */
-	if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids)
-		should_defer = true;
-	put_cpu();
-
-	return should_defer;
+	return arch_tlbbatch_should_defer(mm);
 }
 
 /*
-- 
2.24.0



* [PATCH v4 2/2] arm64: support batched/deferred tlb shootdown during page reclamation
  2022-09-21  8:43 [PATCH v4 0/2] mm: arm64: bring up BATCHED_UNMAP_TLB_FLUSH Yicong Yang
  2022-09-21  8:43 ` [PATCH v4 1/2] mm/tlbbatch: Introduce arch_tlbbatch_should_defer() Yicong Yang
@ 2022-09-21  8:43 ` Yicong Yang
  2022-09-27  6:16   ` Anshuman Khandual
  1 sibling, 1 reply; 18+ messages in thread
From: Yicong Yang @ 2022-09-21  8:43 UTC (permalink / raw)
  To: akpm, linux-mm, linux-arm-kernel, x86, catalin.marinas, will, linux-doc
  Cc: corbet, peterz, arnd, linux-kernel, darren, yangyicong,
	huzhanyuan, lipeifeng, zhangshiming, guojian, realmz6,
	linux-mips, openrisc, linuxppc-dev, linux-riscv, linux-s390,
	Barry Song, wangkefeng.wang, xhao, prime.zeng, anshuman.khandual,
	Barry Song, Nadav Amit, Mel Gorman

From: Barry Song <v-songbaohua@oppo.com>

On x86, batched and deferred tlb shootdown has led to a 90%
performance increase on tlb shootdown. On arm64, HW can do
tlb shootdown without software IPI, but sync tlbi is still
quite expensive.

Even a very simple program which requires swapout can
prove this is true:
 #include <sys/types.h>
 #include <unistd.h>
 #include <sys/mman.h>
 #include <string.h>

 int main()
 {
 #define SIZE (1 * 1024 * 1024)
         volatile unsigned char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                                          MAP_SHARED | MAP_ANONYMOUS, -1, 0);

         memset(p, 0x88, SIZE);

         for (int k = 0; k < 10000; k++) {
                 /* swap in */
                 for (int i = 0; i < SIZE; i += 4096) {
                         (void)p[i];
                 }

                 /* swap out */
                 madvise(p, SIZE, MADV_PAGEOUT);
         }
 }

Perf result on Snapdragon 888 with 8 cores, using zRAM
as the swap block device:

 ~ # perf record taskset -c 4 ./a.out
 [ perf record: Woken up 10 times to write data ]
 [ perf record: Captured and wrote 2.297 MB perf.data (60084 samples) ]
 ~ # perf report
 # To display the perf.data header info, please use --header/--header-only options.
 #
 #
 # Total Lost Samples: 0
 #
 # Samples: 60K of event 'cycles'
 # Event count (approx.): 35706225414
 #
 # Overhead  Command  Shared Object      Symbol
 # ........  .......  .................  .............................................................................
 #
    21.07%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock_irq
     8.23%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
     6.67%  a.out    [kernel.kallsyms]  [k] filemap_map_pages
     6.16%  a.out    [kernel.kallsyms]  [k] __zram_bvec_write
     5.36%  a.out    [kernel.kallsyms]  [k] ptep_clear_flush
     3.71%  a.out    [kernel.kallsyms]  [k] _raw_spin_lock
     3.49%  a.out    [kernel.kallsyms]  [k] memset64
     1.63%  a.out    [kernel.kallsyms]  [k] clear_page
     1.42%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock
     1.26%  a.out    [kernel.kallsyms]  [k] mod_zone_state.llvm.8525150236079521930
     1.23%  a.out    [kernel.kallsyms]  [k] xas_load
     1.15%  a.out    [kernel.kallsyms]  [k] zram_slot_lock

ptep_clear_flush() takes 5.36% CPU in the micro-benchmark
swapping in/out a page mapped by only one process. If the
page is mapped by multiple processes, typically more than
100 on a phone, the overhead would be much higher, as we
have to run the tlb flush 100 times for a single page.
In addition, tlb flush overhead increases with the number
of CPU cores due to the poor scalability of tlb shootdown
in HW, so ARM64 servers should expect much higher
overhead.

Further perf annotate shows that 95% of the cpu time of ptep_clear_flush
is actually spent in the final dsb() waiting for the completion
of the tlb flush. This gives us a very good chance to leverage
the existing batched tlb flush in the kernel. The minimum modification
is that we only send async tlbi in the first stage and we issue
dsb when we have to sync in the second stage.
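
The two stages can be sketched with a toy userspace model, not kernel code. It only counts operations and is a simplification: the real __flush_tlb_page_nosync() issues more than one tlbi per page and a dsb(ishst) beforehand. The struct and all model_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Toy counters standing in for hardware operations. */
struct model_tlb_stats {
	unsigned long tlbi_issued;	/* invalidations broadcast */
	unsigned long dsb_waits;	/* synchronous waits for completion */
};

/* Unbatched path: every page pays for an invalidate and a wait. */
static void model_flush_each_page(struct model_tlb_stats *s, size_t npages)
{
	assert(s != NULL);
	for (size_t i = 0; i < npages; i++) {
		s->tlbi_issued++;
		s->dsb_waits++;
	}
}

/* Stage 1 of the batched path: issue the invalidate, do not wait. */
static void model_batch_add(struct model_tlb_stats *s)
{
	assert(s != NULL);
	s->tlbi_issued++;
}

/* Stage 2: a single wait covers all previously issued invalidates. */
static void model_batch_flush(struct model_tlb_stats *s)
{
	assert(s != NULL);
	s->dsb_waits++;
}

/* Batched path: npages stage-1 calls followed by one stage-2 call. */
static void model_flush_batched(struct model_tlb_stats *s, size_t npages)
{
	for (size_t i = 0; i < npages; i++)
		model_batch_add(s);
	model_batch_flush(s);
}
```

For the 256 pages of the 1 MiB test mapping, the unbatched model waits 256 times while the batched model waits once; the number of invalidations is the same in both.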

With the above simple micro-benchmark, the elapsed time to
finish the program decreases by around 5%.

Typical elapsed time w/o patch:
 ~ # time taskset -c 4 ./a.out
 0.21user 14.34system 0:14.69elapsed
w/ patch:
 ~ # time taskset -c 4 ./a.out
 0.22user 13.45system 0:13.80elapsed

Also, Yicong Yang added the following observation.
	Tested with benchmark in the commit on Kunpeng920 arm64 server,
	observed an improvement around 12.5% with command
	`time ./swap_bench`.
		w/o		w/
	real	0m13.460s	0m11.771s
	user	0m0.248s	0m0.279s
	sys	0m12.039s	0m11.458s

	Originally a 16.99% overhead of ptep_clear_flush() was noticed,
	which has been eliminated by this patch:

	[root@localhost yang]# perf record -- ./swap_bench && perf report
	[...]
	16.99%  swap_bench  [kernel.kallsyms]  [k] ptep_clear_flush
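
As a cross-check of the timings quoted above, the relative reduction in elapsed time can be computed as below. This is only an arithmetic sketch; percent_faster() is a hypothetical helper, not part of the patch.

```c
#include <assert.h>

/* Relative reduction in elapsed time, as a percentage of the baseline. */
static double percent_faster(double before_s, double after_s)
{
	assert(before_s > 0.0);
	return (before_s - after_s) / before_s * 100.0;
}
```

With the numbers above, percent_faster(14.69, 13.80) is roughly 6% for the Snapdragon 888 run shown, and percent_faster(13.460, 11.771) is roughly 12.5% for the Kunpeng920 run.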

Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Nadav Amit <namit@vmware.com>
Cc: Mel Gorman <mgorman@suse.de>
Tested-by: Yicong Yang <yangyicong@hisilicon.com>
Tested-by: Xin Hao <xhao@linux.alibaba.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
---
 .../features/vm/TLB/arch-support.txt          |  2 +-
 arch/arm64/Kconfig                            |  1 +
 arch/arm64/include/asm/tlbbatch.h             | 12 ++++++
 arch/arm64/include/asm/tlbflush.h             | 37 ++++++++++++++++++-
 arch/x86/include/asm/tlbflush.h               |  3 +-
 mm/rmap.c                                     | 10 +++--
 6 files changed, 57 insertions(+), 8 deletions(-)
 create mode 100644 arch/arm64/include/asm/tlbbatch.h

diff --git a/Documentation/features/vm/TLB/arch-support.txt b/Documentation/features/vm/TLB/arch-support.txt
index 039e4e91ada3..2caf815d7c6c 100644
--- a/Documentation/features/vm/TLB/arch-support.txt
+++ b/Documentation/features/vm/TLB/arch-support.txt
@@ -9,7 +9,7 @@
     |       alpha: | TODO |
     |         arc: | TODO |
     |         arm: | TODO |
-    |       arm64: | N/A  |
+    |       arm64: |  ok  |
     |        csky: | TODO |
     |     hexagon: | TODO |
     |        ia64: | TODO |
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 1ce7685ad5de..40da6984f303 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -93,6 +93,7 @@ config ARM64
 	select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
 	select ARCH_SUPPORTS_NUMA_BALANCING
 	select ARCH_SUPPORTS_PAGE_TABLE_CHECK
+	select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
 	select ARCH_WANT_COMPAT_IPC_PARSE_VERSION if COMPAT
 	select ARCH_WANT_DEFAULT_BPF_JIT
 	select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
diff --git a/arch/arm64/include/asm/tlbbatch.h b/arch/arm64/include/asm/tlbbatch.h
new file mode 100644
index 000000000000..fedb0b87b8db
--- /dev/null
+++ b/arch/arm64/include/asm/tlbbatch.h
@@ -0,0 +1,12 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ARCH_ARM64_TLBBATCH_H
+#define _ARCH_ARM64_TLBBATCH_H
+
+struct arch_tlbflush_unmap_batch {
+	/*
+	 * For arm64, HW can do tlb shootdown, so we don't
+	 * need to record cpumask for sending IPI
+	 */
+};
+
+#endif /* _ARCH_ARM64_TLBBATCH_H */
diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index 412a3b9a3c25..1b4df0352960 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -254,17 +254,24 @@ static inline void flush_tlb_mm(struct mm_struct *mm)
 	dsb(ish);
 }
 
-static inline void flush_tlb_page_nosync(struct vm_area_struct *vma,
+
+static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
 					 unsigned long uaddr)
 {
 	unsigned long addr;
 
 	dsb(ishst);
-	addr = __TLBI_VADDR(uaddr, ASID(vma->vm_mm));
+	addr = __TLBI_VADDR(uaddr, ASID(mm));
 	__tlbi(vale1is, addr);
 	__tlbi_user(vale1is, addr);
 }
 
+static inline void flush_tlb_page_nosync(struct vm_area_struct *vma,
+					 unsigned long uaddr)
+{
+	return __flush_tlb_page_nosync(vma->vm_mm, uaddr);
+}
+
 static inline void flush_tlb_page(struct vm_area_struct *vma,
 				  unsigned long uaddr)
 {
@@ -272,6 +279,32 @@ static inline void flush_tlb_page(struct vm_area_struct *vma,
 	dsb(ish);
 }
 
+static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
+{
+	/* for small systems with small number of CPUs, TLB shootdown is cheap */
+	if (num_online_cpus() <= 4)
+		return false;
+
+#ifdef CONFIG_ARM64_WORKAROUND_REPEAT_TLBI
+	if (unlikely(this_cpu_has_cap(ARM64_WORKAROUND_REPEAT_TLBI)))
+		return false;
+#endif
+
+	return true;
+}
+
+static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
+					struct mm_struct *mm,
+					unsigned long uaddr)
+{
+	__flush_tlb_page_nosync(mm, uaddr);
+}
+
+static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
+{
+	dsb(ish);
+}
+
 /*
  * This is meant to avoid soft lock-ups on large TLB flushing ranges and not
  * necessarily a performance improvement.
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 8a497d902c16..5bd78ae55cd4 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -264,7 +264,8 @@ static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
 }
 
 static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
-					struct mm_struct *mm)
+					struct mm_struct *mm,
+					unsigned long uaddr)
 {
 	inc_mm_tlb_gen(mm);
 	cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
diff --git a/mm/rmap.c b/mm/rmap.c
index cd8cf5cb0b01..e060cc0187cd 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -645,12 +645,13 @@ void try_to_unmap_flush_dirty(void)
 #define TLB_FLUSH_BATCH_PENDING_LARGE			\
 	(TLB_FLUSH_BATCH_PENDING_MASK / 2)
 
-static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
+static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable,
+				      unsigned long uaddr)
 {
 	struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
 	int batch, nbatch;
 
-	arch_tlbbatch_add_mm(&tlb_ubc->arch, mm);
+	arch_tlbbatch_add_mm(&tlb_ubc->arch, mm, uaddr);
 	tlb_ubc->flush_required = true;
 
 	/*
@@ -728,7 +729,8 @@ void flush_tlb_batched_pending(struct mm_struct *mm)
 	}
 }
 #else
-static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
+static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable,
+				      unsigned long uaddr)
 {
 }
 
@@ -1590,7 +1592,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 				 */
 				pteval = ptep_get_and_clear(mm, address, pvmw.pte);
 
-				set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
+				set_tlb_ubc_flush_pending(mm, pte_dirty(pteval), address);
 			} else {
 				pteval = ptep_clear_flush(vma, address, pvmw.pte);
 			}
-- 
2.24.0



* Re: [PATCH v4 1/2] mm/tlbbatch: Introduce arch_tlbbatch_should_defer()
  2022-09-21  8:43 ` [PATCH v4 1/2] mm/tlbbatch: Introduce arch_tlbbatch_should_defer() Yicong Yang
@ 2022-09-21  8:54   ` Barry Song
  0 siblings, 0 replies; 18+ messages in thread
From: Barry Song @ 2022-09-21  8:54 UTC (permalink / raw)
  To: Yicong Yang
  Cc: akpm, linux-mm, linux-arm-kernel, x86, catalin.marinas, will,
	linux-doc, corbet, peterz, arnd, linux-kernel, darren,
	yangyicong, huzhanyuan, lipeifeng, zhangshiming, guojian,
	realmz6, linux-mips, openrisc, linuxppc-dev, linux-riscv,
	linux-s390, wangkefeng.wang, xhao, prime.zeng, anshuman.khandual,
	Anshuman Khandual

On Wed, Sep 21, 2022 at 8:45 PM Yicong Yang <yangyicong@huawei.com> wrote:
>
> From: Anshuman Khandual <khandual@linux.vnet.ibm.com>
>
> The entire scheme of deferred TLB flush in reclaim path rests on the
> fact that the cost to refill TLB entries is less than flushing out
> individual entries by sending IPI to remote CPUs. But architecture
> can have different ways to evaluate that. Hence apart from checking
> TTU_BATCH_FLUSH in the TTU flags, rest of the decision should be
> architecture specific.
>
> Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
> [https://lore.kernel.org/linuxppc-dev/20171101101735.2318-2-khandual@linux.vnet.ibm.com/]
> Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
> [Rebase and fix incorrect return value type]
> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
> ---

Reviewed-by: Barry Song <baohua@kernel.org>

>  arch/x86/include/asm/tlbflush.h | 12 ++++++++++++
>  mm/rmap.c                       |  9 +--------
>  2 files changed, 13 insertions(+), 8 deletions(-)
>
> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> index cda3118f3b27..8a497d902c16 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -240,6 +240,18 @@ static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
>         flush_tlb_mm_range(vma->vm_mm, a, a + PAGE_SIZE, PAGE_SHIFT, false);
>  }
>
> +static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
> +{
> +       bool should_defer = false;
> +
> +       /* If remote CPUs need to be flushed then defer batch the flush */
> +       if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids)
> +               should_defer = true;
> +       put_cpu();
> +
> +       return should_defer;
> +}
> +
>  static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
>  {
>         /*
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 93d5a6f793d2..cd8cf5cb0b01 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -690,17 +690,10 @@ static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
>   */
>  static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
>  {
> -       bool should_defer = false;
> -
>         if (!(flags & TTU_BATCH_FLUSH))
>                 return false;
>
> -       /* If remote CPUs need to be flushed then defer batch the flush */
> -       if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids)
> -               should_defer = true;
> -       put_cpu();
> -
> -       return should_defer;
> +       return arch_tlbbatch_should_defer(mm);
>  }
>
>  /*
> --
> 2.24.0
>


* Re: [PATCH v4 2/2] arm64: support batched/deferred tlb shootdown during page reclamation
  2022-09-21  8:43 ` [PATCH v4 2/2] arm64: support batched/deferred tlb shootdown during page reclamation Yicong Yang
@ 2022-09-27  6:16   ` Anshuman Khandual
  2022-09-27  9:15     ` Yicong Yang
  0 siblings, 1 reply; 18+ messages in thread
From: Anshuman Khandual @ 2022-09-27  6:16 UTC (permalink / raw)
  To: Yicong Yang, akpm, linux-mm, linux-arm-kernel, x86,
	catalin.marinas, will, linux-doc
  Cc: corbet, peterz, arnd, linux-kernel, darren, yangyicong,
	huzhanyuan, lipeifeng, zhangshiming, guojian, realmz6,
	linux-mips, openrisc, linuxppc-dev, linux-riscv, linux-s390,
	Barry Song, wangkefeng.wang, xhao, prime.zeng, Barry Song,
	Nadav Amit, Mel Gorman

[...]

On 9/21/22 14:13, Yicong Yang wrote:
> +static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
> +{
> +	/* for small systems with small number of CPUs, TLB shootdown is cheap */
> +	if (num_online_cpus() <= 4)

It would be great to have some more input from others on whether 4 (which should
be codified into a macro, e.g. ARM64_NR_CPU_DEFERRED_TLB, or something similar)
is optimal for a wide range of arm64 platforms.

> +		return false;
> +
> +#ifdef CONFIG_ARM64_WORKAROUND_REPEAT_TLBI
> +	if (unlikely(this_cpu_has_cap(ARM64_WORKAROUND_REPEAT_TLBI)))
> +		return false;
> +#endif
> +
> +	return true;
> +}
> +

[...]


* Re: [PATCH v4 2/2] arm64: support batched/deferred tlb shootdown during page reclamation
  2022-09-27  6:16   ` Anshuman Khandual
@ 2022-09-27  9:15     ` Yicong Yang
  2022-09-28  0:23       ` Barry Song
  0 siblings, 1 reply; 18+ messages in thread
From: Yicong Yang @ 2022-09-27  9:15 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: yangyicong, corbet, peterz, arnd, linux-kernel, darren,
	huzhanyuan, lipeifeng, zhangshiming, guojian, realmz6,
	linux-mips, openrisc, linux-mm, x86, linux-arm-kernel,
	linuxppc-dev, akpm, linux-riscv, linux-s390, Barry Song,
	wangkefeng.wang, xhao, prime.zeng, Barry Song, Nadav Amit,
	Mel Gorman, catalin.marinas, will, linux-doc

On 2022/9/27 14:16, Anshuman Khandual wrote:
> [...]
> 
> On 9/21/22 14:13, Yicong Yang wrote:
>> +static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
>> +{
>> +	/* for small systems with small number of CPUs, TLB shootdown is cheap */
>> +	if (num_online_cpus() <= 4)
> 
> It would be great to have some more inputs from others, whether 4 (which should
> to be codified into a macro e.g ARM64_NR_CPU_DEFERRED_TLB, or something similar)
> is optimal for an wide range of arm64 platforms.
> 

Do you prefer this macro to be static, or should it be configurable through kconfig so
that different platforms can choose based on their own situations? It may be hard to
test on all the arm64 platforms.

Thanks.

>> +		return false;
>> +
>> +#ifdef CONFIG_ARM64_WORKAROUND_REPEAT_TLBI
>> +	if (unlikely(this_cpu_has_cap(ARM64_WORKAROUND_REPEAT_TLBI)))
>> +		return false;
>> +#endif
>> +
>> +	return true;
>> +}
>> +
> 
> [...]
> 
> .
> 


* Re: [PATCH v4 2/2] arm64: support batched/deferred tlb shootdown during page reclamation
  2022-09-27  9:15     ` Yicong Yang
@ 2022-09-28  0:23       ` Barry Song
  2022-10-27 10:41         ` Anshuman Khandual
  0 siblings, 1 reply; 18+ messages in thread
From: Barry Song @ 2022-09-28  0:23 UTC (permalink / raw)
  To: Yicong Yang
  Cc: Anshuman Khandual, yangyicong, corbet, peterz, arnd,
	linux-kernel, darren, huzhanyuan, lipeifeng, zhangshiming,
	guojian, realmz6, linux-mips, openrisc, linux-mm, x86,
	linux-arm-kernel, linuxppc-dev, akpm, linux-riscv, linux-s390,
	wangkefeng.wang, xhao, prime.zeng, Barry Song, Nadav Amit,
	Mel Gorman, catalin.marinas, will, linux-doc

On Tue, Sep 27, 2022 at 10:15 PM Yicong Yang <yangyicong@huawei.com> wrote:
>
> On 2022/9/27 14:16, Anshuman Khandual wrote:
> > [...]
> >
> > On 9/21/22 14:13, Yicong Yang wrote:
> >> +static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
> >> +{
> >> +    /* for small systems with small number of CPUs, TLB shootdown is cheap */
> >> +    if (num_online_cpus() <= 4)
> >
> > It would be great to have some more inputs from others, whether 4 (which should
> > to be codified into a macro e.g ARM64_NR_CPU_DEFERRED_TLB, or something similar)
> > is optimal for an wide range of arm64 platforms.
> >

I have tested it on a 4-cpu machine and an 8-cpu machine, but I have no machine
with 5, 6 or 7 cores.
I saw an improvement on 8-cpu machines and found that 4-cpu machines don't need
this patch.

So it seems safe to have
if (num_online_cpus()  < 8)

>
> Do you prefer this macro to be static or make it configurable through kconfig then
> different platforms can make choice based on their own situations? It maybe hard to
> test on all the arm64 platforms.

Maybe we can have this enabled by default on machines with 8 or more cpus and
provide a tlbflush_batched = on or off option to allow users to enable or
disable it according to their hardware and products. A similar example:
rodata=on or off.
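
A rough userspace sketch of how such a policy could look, under stated assumptions: the threshold value 8, the tri-state override and the model_* names are hypothetical, and letting the REPEAT_TLBI workaround win over an explicit "on" is just one possible design choice. Only the macro name ARM64_NR_CPU_DEFERRED_TLB comes from the suggestion earlier in this thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical threshold; the name follows the suggestion in this thread. */
#define ARM64_NR_CPU_DEFERRED_TLB 8

/* Hypothetical tri-state for a "tlbflush_batched=" style switch. */
enum model_batch_override {
	MODEL_BATCH_AUTO,	/* no user preference: use the CPU count */
	MODEL_BATCH_ON,
	MODEL_BATCH_OFF,
};

static bool model_should_defer(unsigned int online_cpus,
			       enum model_batch_override override,
			       bool has_repeat_tlbi_workaround)
{
	assert(online_cpus > 0);

	/* Assumption: the erratum workaround always disables deferral. */
	if (has_repeat_tlbi_workaround)
		return false;

	if (override == MODEL_BATCH_ON)
		return true;
	if (override == MODEL_BATCH_OFF)
		return false;

	/* Default: defer only on machines at or above the threshold. */
	return online_cpus >= ARM64_NR_CPU_DEFERRED_TLB;
}
```

With these assumptions, a 4-cpu machine defers only when the user switches it on, and an 8-cpu machine defers unless the user switches it off.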

Hi Anshuman, Will,  Catalin, Andrew,
what do you think about this approach?

BTW, haoxin mentioned another important user scenario for tlb batch on arm64:
https://lore.kernel.org/lkml/393d6318-aa38-01ed-6ad8-f9eac89bf0fc@linux.alibaba.com/

I do believe we need it, given the expensive cost of tlb shootdown on arm64
even with hardware broadcast.

>
> Thanks.
>
> >> +            return false;
> >> +
> >> +#ifdef CONFIG_ARM64_WORKAROUND_REPEAT_TLBI
> >> +    if (unlikely(this_cpu_has_cap(ARM64_WORKAROUND_REPEAT_TLBI)))
> >> +            return false;
> >> +#endif
> >> +
> >> +    return true;
> >> +}
> >> +
> >
> > [...]
> >
> > .
> >

Thanks
Barry


* Re: [PATCH v4 2/2] arm64: support batched/deferred tlb shootdown during page reclamation
  2022-09-28  0:23       ` Barry Song
@ 2022-10-27 10:41         ` Anshuman Khandual
  2022-10-27 14:19           ` Punit Agrawal
  2022-10-27 22:07           ` Barry Song
  0 siblings, 2 replies; 18+ messages in thread
From: Anshuman Khandual @ 2022-10-27 10:41 UTC (permalink / raw)
  To: Barry Song, Yicong Yang
  Cc: yangyicong, corbet, peterz, arnd, linux-kernel, darren,
	huzhanyuan, lipeifeng, zhangshiming, guojian, realmz6,
	linux-mips, openrisc, linux-mm, x86, linux-arm-kernel,
	linuxppc-dev, akpm, linux-riscv, linux-s390, wangkefeng.wang,
	xhao, prime.zeng, Barry Song, Nadav Amit, Mel Gorman,
	catalin.marinas, will, linux-doc



On 9/28/22 05:53, Barry Song wrote:
> On Tue, Sep 27, 2022 at 10:15 PM Yicong Yang <yangyicong@huawei.com> wrote:
>>
>> On 2022/9/27 14:16, Anshuman Khandual wrote:
>>> [...]
>>>
>>> On 9/21/22 14:13, Yicong Yang wrote:
>>>> +static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
>>>> +{
>>>> +    /* for small systems with small number of CPUs, TLB shootdown is cheap */
>>>> +    if (num_online_cpus() <= 4)
>>>
>>> It would be great to have some more inputs from others, whether 4 (which should
>>> to be codified into a macro e.g ARM64_NR_CPU_DEFERRED_TLB, or something similar)
>>> is optimal for an wide range of arm64 platforms.
>>>
> 
> I have tested it on a 4-cpus and 8-cpus machine. but i have no machine
> with 5,6,7
> cores.
> I saw improvement on 8-cpus machines and I found 4-cpus machines don't need
> this patch.
> 
> so it seems safe to have
> if (num_online_cpus()  < 8)
> 
>>
>> Do you prefer this macro to be static or make it configurable through kconfig then
>> different platforms can make choice based on their own situations? It maybe hard to
>> test on all the arm64 platforms.
> 
> Maybe we can have this default enabled on machines with 8 and more cpus and
> provide a tlbflush_batched = on or off to allow users enable or
> disable it according
> to their hardware and products. Similar example: rodata=on or off.

No, that sounds a bit excessive. Kernel command line options should not be added
for every possible run time switch.

> 
> Hi Anshuman, Will,  Catalin, Andrew,
> what do you think about this approach?
> 
> BTW, haoxin mentioned another important user scenarios for tlb bach on arm64:
> https://lore.kernel.org/lkml/393d6318-aa38-01ed-6ad8-f9eac89bf0fc@linux.alibaba.com/
> 
> I do believe we need it based on the expensive cost of tlb shootdown in arm64
> even by hardware broadcast.

Alright, for now could we enable ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH selectively
with CONFIG_EXPERT and for num_online_cpus()  > 8 ?


* Re: [PATCH v4 2/2] arm64: support batched/deferred tlb shootdown during page reclamation
  2022-10-27 10:41         ` Anshuman Khandual
@ 2022-10-27 14:19           ` Punit Agrawal
  2022-10-27 21:55             ` Barry Song
  2022-10-28  1:20             ` Yicong Yang
  2022-10-27 22:07           ` Barry Song
  1 sibling, 2 replies; 18+ messages in thread
From: Punit Agrawal @ 2022-10-27 14:19 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: Barry Song, Yicong Yang, yangyicong, corbet, peterz, arnd,
	linux-kernel, darren, huzhanyuan, lipeifeng, zhangshiming,
	guojian, realmz6, linux-mips, openrisc, linux-mm, x86,
	linux-arm-kernel, linuxppc-dev, akpm, linux-riscv, linux-s390,
	wangkefeng.wang, xhao, prime.zeng, Barry Song, Nadav Amit,
	Mel Gorman, catalin.marinas, will, linux-doc


[ Apologies for chiming in late in the conversation ]

Anshuman Khandual <anshuman.khandual@arm.com> writes:

> On 9/28/22 05:53, Barry Song wrote:
>> On Tue, Sep 27, 2022 at 10:15 PM Yicong Yang <yangyicong@huawei.com> wrote:
>>>
>>> On 2022/9/27 14:16, Anshuman Khandual wrote:
>>>> [...]
>>>>
>>>> On 9/21/22 14:13, Yicong Yang wrote:
>>>>> +static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
>>>>> +{
>>>>> +    /* for small systems with small number of CPUs, TLB shootdown is cheap */
>>>>> +    if (num_online_cpus() <= 4)
>>>>
>>>> It would be great to have some more inputs from others, whether 4 (which should
>>>> to be codified into a macro e.g ARM64_NR_CPU_DEFERRED_TLB, or something similar)
>>>> is optimal for an wide range of arm64 platforms.
>>>>
>> 
>> I have tested it on a 4-cpus and 8-cpus machine. but i have no machine
>> with 5,6,7
>> cores.
>> I saw improvement on 8-cpus machines and I found 4-cpus machines don't need
>> this patch.
>> 
>> so it seems safe to have
>> if (num_online_cpus()  < 8)
>> 
>>>
>>> Do you prefer this macro to be static or make it configurable through kconfig then
>>> different platforms can make choice based on their own situations? It maybe hard to
>>> test on all the arm64 platforms.
>> 
>> Maybe we can have this default enabled on machines with 8 and more cpus and
>> provide a tlbflush_batched = on or off to allow users enable or
>> disable it according
>> to their hardware and products. Similar example: rodata=on or off.
>
> No, sounds bit excessive. Kernel command line options should not be added
> for every possible run time switch options.
>
>> 
>> Hi Anshuman, Will,  Catalin, Andrew,
>> what do you think about this approach?
>> 
>> BTW, haoxin mentioned another important user scenarios for tlb bach on arm64:
>> https://lore.kernel.org/lkml/393d6318-aa38-01ed-6ad8-f9eac89bf0fc@linux.alibaba.com/
>> 
>> I do believe we need it based on the expensive cost of tlb shootdown in arm64
>> even by hardware broadcast.
>
> Alright, for now could we enable ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH selectively
> with CONFIG_EXPERT and for num_online_cpus()  > 8 ?

When running the test program from the commit in a VM, I saw benefits from
the patches at every size I tried: 2, 4, 8 and 32 vcpus. On the test machine,
ptep_clear_flush() went from ~1% in the unpatched version to not showing
up.

Yicong mentioned that he didn't see any benefit for <= 4 CPUs, but is
there any overhead? I am wondering what the downsides of enabling
the config by default are.

Thanks,
Punit


* Re: [PATCH v4 2/2] arm64: support batched/deferred tlb shootdown during page reclamation
  2022-10-27 14:19           ` Punit Agrawal
@ 2022-10-27 21:55             ` Barry Song
  2022-10-28  2:14               ` Anshuman Khandual
  2022-10-28  1:20             ` Yicong Yang
  1 sibling, 1 reply; 18+ messages in thread
From: Barry Song @ 2022-10-27 21:55 UTC (permalink / raw)
  To: Punit Agrawal
  Cc: Anshuman Khandual, Yicong Yang, yangyicong, corbet, peterz, arnd,
	linux-kernel, darren, huzhanyuan, lipeifeng, zhangshiming,
	guojian, realmz6, linux-mips, openrisc, linux-mm, x86,
	linux-arm-kernel, linuxppc-dev, akpm, linux-riscv, linux-s390,
	wangkefeng.wang, xhao, prime.zeng, Barry Song, Nadav Amit,
	Mel Gorman, catalin.marinas, will, linux-doc

On Fri, Oct 28, 2022 at 3:19 AM Punit Agrawal
<punit.agrawal@bytedance.com> wrote:
>
>
> [ Apologies for chiming in late in the conversation ]
>
> Anshuman Khandual <anshuman.khandual@arm.com> writes:
>
> > On 9/28/22 05:53, Barry Song wrote:
> >> On Tue, Sep 27, 2022 at 10:15 PM Yicong Yang <yangyicong@huawei.com> wrote:
> >>>
> >>> On 2022/9/27 14:16, Anshuman Khandual wrote:
> >>>> [...]
> >>>>
> >>>> On 9/21/22 14:13, Yicong Yang wrote:
> >>>>> +static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
> >>>>> +{
> >>>>> +    /* for small systems with small number of CPUs, TLB shootdown is cheap */
> >>>>> +    if (num_online_cpus() <= 4)
> >>>>
> >>>> It would be great to have some more inputs from others, whether 4 (which should
> >>>> to be codified into a macro e.g ARM64_NR_CPU_DEFERRED_TLB, or something similar)
> >>>> is optimal for an wide range of arm64 platforms.
> >>>>
> >>
> >> I have tested it on a 4-cpus and 8-cpus machine. but i have no machine
> >> with 5,6,7
> >> cores.
> >> I saw improvement on 8-cpus machines and I found 4-cpus machines don't need
> >> this patch.
> >>
> >> so it seems safe to have
> >> if (num_online_cpus()  < 8)
> >>
> >>>
> >>> Do you prefer this macro to be static or make it configurable through kconfig then
> >>> different platforms can make choice based on their own situations? It maybe hard to
> >>> test on all the arm64 platforms.
> >>
> >> Maybe we can have this default enabled on machines with 8 and more cpus and
> >> provide a tlbflush_batched = on or off to allow users enable or
> >> disable it according
> >> to their hardware and products. Similar example: rodata=on or off.
> >
> > No, sounds bit excessive. Kernel command line options should not be added
> > for every possible run time switch options.
> >
> >>
> >> Hi Anshuman, Will,  Catalin, Andrew,
> >> what do you think about this approach?
> >>
> >> BTW, haoxin mentioned another important user scenarios for tlb bach on arm64:
> >> https://lore.kernel.org/lkml/393d6318-aa38-01ed-6ad8-f9eac89bf0fc@linux.alibaba.com/
> >>
> >> I do believe we need it based on the expensive cost of tlb shootdown in arm64
> >> even by hardware broadcast.
> >
> > Alright, for now could we enable ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH selectively
> > with CONFIG_EXPERT and for num_online_cpus()  > 8 ?
>
> When running the test program in the commit in a VM, I saw benefits from
> the patches at all sizes from 2, 4, 8, 32 vcpus. On the test machine,
> ptep_clear_flush() went from ~1% in the unpatched version to not showing
> up.
>
> Yicong mentioned that he didn't see any benefit for <= 4 CPUs but is
> there any overhead? I am wondering what are the downsides of enabling
> the config by default.

The tlb flush is deferred, so if a vma is modified while it still has a
flush pending, we need to do a sync via flush_tlb_batched_pending() in
mprotect(), madvise() etc. to make sure they see the flushed result. If nobody
is doing mprotect(), madvise() etc. in the deferred period, the overhead is zero.
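
The bookkeeping described above can be illustrated with a small userspace
sketch. This is a toy model, not kernel code: struct toy_mm,
toy_unmap_deferred(), toy_flush_batched_pending() and
toy_change_protection() are hypothetical names made up for illustration, and
only the role of flush_tlb_batched_pending() is taken from the discussion.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for per-mm state; NOT the kernel's struct mm_struct. */
struct toy_mm {
	bool flush_pending;       /* an unmap has deferred its TLB flush */
	unsigned long sync_count; /* how many catch-up flushes were issued */
};

/* Reclaim path: unmap a page and leave the TLB flush for later. */
static void toy_unmap_deferred(struct toy_mm *mm)
{
	assert(mm);
	mm->flush_pending = true;
}

/* Plays the role flush_tlb_batched_pending() has in the discussion. */
static void toy_flush_batched_pending(struct toy_mm *mm)
{
	assert(mm);
	if (!mm->flush_pending)
		return;           /* nothing was deferred: no extra cost */
	mm->sync_count++;         /* catch up before touching the mapping */
	mm->flush_pending = false;
}

/* mprotect()/madvise()-style path: must not act on stale TLB state. */
static void toy_change_protection(struct toy_mm *mm)
{
	toy_flush_batched_pending(mm);
	/* ... the page table update itself would go here ... */
}
```

In this model the extra flush is only paid when toy_change_protection()
runs while a deferred flush is outstanding, which is the "zero overhead
otherwise" case mentioned above.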

>
> Thanks,
> Punit

Thanks
Barry


* Re: [PATCH v4 2/2] arm64: support batched/deferred tlb shootdown during page reclamation
  2022-10-27 10:41         ` Anshuman Khandual
  2022-10-27 14:19           ` Punit Agrawal
@ 2022-10-27 22:07           ` Barry Song
  2022-10-28  1:56             ` Anshuman Khandual
  1 sibling, 1 reply; 18+ messages in thread
From: Barry Song @ 2022-10-27 22:07 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: Yicong Yang, yangyicong, corbet, peterz, arnd, linux-kernel,
	darren, huzhanyuan, lipeifeng, zhangshiming, guojian, realmz6,
	linux-mips, openrisc, linux-mm, x86, linux-arm-kernel,
	linuxppc-dev, akpm, linux-riscv, linux-s390, wangkefeng.wang,
	xhao, prime.zeng, Barry Song, Nadav Amit, Mel Gorman,
	catalin.marinas, will, linux-doc

On Thu, Oct 27, 2022 at 11:42 PM Anshuman Khandual
<anshuman.khandual@arm.com> wrote:
>
>
>
> On 9/28/22 05:53, Barry Song wrote:
> > On Tue, Sep 27, 2022 at 10:15 PM Yicong Yang <yangyicong@huawei.com> wrote:
> >>
> >> On 2022/9/27 14:16, Anshuman Khandual wrote:
> >>> [...]
> >>>
> >>> On 9/21/22 14:13, Yicong Yang wrote:
> >>>> +static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
> >>>> +{
> >>>> +    /* for small systems with small number of CPUs, TLB shootdown is cheap */
> >>>> +    if (num_online_cpus() <= 4)
> >>>
> >>> It would be great to have some more inputs from others, whether 4 (which should
> >>> to be codified into a macro e.g ARM64_NR_CPU_DEFERRED_TLB, or something similar)
> >>> is optimal for an wide range of arm64 platforms.
> >>>
> >
> > I have tested it on a 4-cpus and 8-cpus machine. but i have no machine
> > with 5,6,7
> > cores.
> > I saw improvement on 8-cpus machines and I found 4-cpus machines don't need
> > this patch.
> >
> > so it seems safe to have
> > if (num_online_cpus()  < 8)
> >
> >>
> >> Do you prefer this macro to be static or make it configurable through kconfig then
> >> different platforms can make choice based on their own situations? It maybe hard to
> >> test on all the arm64 platforms.
> >
> > Maybe we can have this default enabled on machines with 8 and more cpus and
> > provide a tlbflush_batched = on or off to allow users enable or
> > disable it according
> > to their hardware and products. Similar example: rodata=on or off.
>
> No, sounds bit excessive. Kernel command line options should not be added
> for every possible run time switch options.
>
> >
> > Hi Anshuman, Will,  Catalin, Andrew,
> > what do you think about this approach?
> >
> > BTW, haoxin mentioned another important user scenarios for tlb bach on arm64:
> > https://lore.kernel.org/lkml/393d6318-aa38-01ed-6ad8-f9eac89bf0fc@linux.alibaba.com/
> >
> > I do believe we need it based on the expensive cost of tlb shootdown in arm64
> > even by hardware broadcast.
>
> Alright, for now could we enable ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH selectively
> with CONFIG_EXPERT and for num_online_cpus()  > 8 ?

Sounds good to me. It is a good start to bring up tlb batched flush in
ARM64. Later on, we might want to see it in both memory reclamation and
migration.

Thanks
Barry


* Re: [PATCH v4 2/2] arm64: support batched/deferred tlb shootdown during page reclamation
  2022-10-27 14:19           ` Punit Agrawal
  2022-10-27 21:55             ` Barry Song
@ 2022-10-28  1:20             ` Yicong Yang
  2022-10-28 13:11               ` Punit Agrawal
  1 sibling, 1 reply; 18+ messages in thread
From: Yicong Yang @ 2022-10-28  1:20 UTC (permalink / raw)
  To: Punit Agrawal, Barry Song
  Cc: yangyicong, corbet, peterz, arnd, linux-kernel, darren,
	huzhanyuan, lipeifeng, zhangshiming, guojian, realmz6,
	linux-mips, openrisc, linux-mm, x86, linux-arm-kernel,
	linuxppc-dev, akpm, linux-riscv, linux-s390, wangkefeng.wang,
	xhao, prime.zeng, Barry Song, Nadav Amit, Mel Gorman,
	catalin.marinas, will, linux-doc, Anshuman Khandual

On 2022/10/27 22:19, Punit Agrawal wrote:
> 
> [ Apologies for chiming in late in the conversation ]
> 
> Anshuman Khandual <anshuman.khandual@arm.com> writes:
> 
>> On 9/28/22 05:53, Barry Song wrote:
>>> On Tue, Sep 27, 2022 at 10:15 PM Yicong Yang <yangyicong@huawei.com> wrote:
>>>>
>>>> On 2022/9/27 14:16, Anshuman Khandual wrote:
>>>>> [...]
>>>>>
>>>>> On 9/21/22 14:13, Yicong Yang wrote:
>>>>>> +static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
>>>>>> +{
>>>>>> +    /* for small systems with small number of CPUs, TLB shootdown is cheap */
>>>>>> +    if (num_online_cpus() <= 4)
>>>>>
>>>>> It would be great to have some more inputs from others, whether 4 (which should
>>>>> to be codified into a macro e.g ARM64_NR_CPU_DEFERRED_TLB, or something similar)
>>>>> is optimal for an wide range of arm64 platforms.
>>>>>
>>>
>>> I have tested it on a 4-cpus and 8-cpus machine. but i have no machine
>>> with 5,6,7
>>> cores.
>>> I saw improvement on 8-cpus machines and I found 4-cpus machines don't need
>>> this patch.
>>>
>>> so it seems safe to have
>>> if (num_online_cpus()  < 8)
>>>
>>>>
>>>> Do you prefer this macro to be static or make it configurable through kconfig then
>>>> different platforms can make choice based on their own situations? It maybe hard to
>>>> test on all the arm64 platforms.
>>>
>>> Maybe we can have this default enabled on machines with 8 and more cpus and
>>> provide a tlbflush_batched = on or off to allow users enable or
>>> disable it according
>>> to their hardware and products. Similar example: rodata=on or off.
>>
>> No, sounds bit excessive. Kernel command line options should not be added
>> for every possible run time switch options.
>>
>>>
>>> Hi Anshuman, Will,  Catalin, Andrew,
>>> what do you think about this approach?
>>>
>>> BTW, haoxin mentioned another important user scenarios for tlb bach on arm64:
>>> https://lore.kernel.org/lkml/393d6318-aa38-01ed-6ad8-f9eac89bf0fc@linux.alibaba.com/
>>>
>>> I do believe we need it based on the expensive cost of tlb shootdown in arm64
>>> even by hardware broadcast.
>>
>> Alright, for now could we enable ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH selectively
>> with CONFIG_EXPERT and for num_online_cpus()  > 8 ?
> 
> When running the test program in the commit in a VM, I saw benefits from
> the patches at all sizes from 2, 4, 8, 32 vcpus. On the test machine,
> ptep_clear_flush() went from ~1% in the unpatched version to not showing
> up.
> 

Maybe you're booting the VM on a server with more than 32 cores, while Barry
tested on his 4-CPU embedded platform. I guess a 4-CPU VM is not fully
equivalent to a 4-CPU real machine, as the tlbi and dsb in the VM may
influence the host as well.

> Yicong mentioned that he didn't see any benefit for <= 4 CPUs but is
> there any overhead? I am wondering what are the downsides of enabling
> the config by default.
> 
> Thanks,
> Punit
> .
> 


* Re: [PATCH v4 2/2] arm64: support batched/deferred tlb shootdown during page reclamation
  2022-10-27 22:07           ` Barry Song
@ 2022-10-28  1:56             ` Anshuman Khandual
  0 siblings, 0 replies; 18+ messages in thread
From: Anshuman Khandual @ 2022-10-28  1:56 UTC (permalink / raw)
  To: Barry Song
  Cc: Yicong Yang, yangyicong, corbet, peterz, arnd, linux-kernel,
	darren, huzhanyuan, lipeifeng, zhangshiming, guojian, realmz6,
	linux-mips, openrisc, linux-mm, x86, linux-arm-kernel,
	linuxppc-dev, akpm, linux-riscv, linux-s390, wangkefeng.wang,
	xhao, prime.zeng, Barry Song, Nadav Amit, Mel Gorman,
	catalin.marinas, will, linux-doc



On 10/28/22 03:37, Barry Song wrote:
> On Thu, Oct 27, 2022 at 11:42 PM Anshuman Khandual
> <anshuman.khandual@arm.com> wrote:
>>
>>
>>
>> On 9/28/22 05:53, Barry Song wrote:
>>> On Tue, Sep 27, 2022 at 10:15 PM Yicong Yang <yangyicong@huawei.com> wrote:
>>>>
>>>> On 2022/9/27 14:16, Anshuman Khandual wrote:
>>>>> [...]
>>>>>
>>>>> On 9/21/22 14:13, Yicong Yang wrote:
>>>>>> +static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
>>>>>> +{
>>>>>> +    /* for small systems with small number of CPUs, TLB shootdown is cheap */
>>>>>> +    if (num_online_cpus() <= 4)
>>>>>
>>>>> It would be great to have some more inputs from others, whether 4 (which should
>>>>> to be codified into a macro e.g ARM64_NR_CPU_DEFERRED_TLB, or something similar)
>>>>> is optimal for an wide range of arm64 platforms.
>>>>>
>>>
>>> I have tested it on a 4-cpus and 8-cpus machine. but i have no machine
>>> with 5,6,7
>>> cores.
>>> I saw improvement on 8-cpus machines and I found 4-cpus machines don't need
>>> this patch.
>>>
>>> so it seems safe to have
>>> if (num_online_cpus()  < 8)
>>>
>>>>
>>>> Do you prefer this macro to be static or make it configurable through kconfig then
>>>> different platforms can make choice based on their own situations? It maybe hard to
>>>> test on all the arm64 platforms.
>>>
>>> Maybe we can have this default enabled on machines with 8 and more cpus and
>>> provide a tlbflush_batched = on or off to allow users enable or
>>> disable it according
>>> to their hardware and products. Similar example: rodata=on or off.
>>
>> No, sounds bit excessive. Kernel command line options should not be added
>> for every possible run time switch options.
>>
>>>
>>> Hi Anshuman, Will,  Catalin, Andrew,
>>> what do you think about this approach?
>>>
>>> BTW, haoxin mentioned another important user scenarios for tlb bach on arm64:
>>> https://lore.kernel.org/lkml/393d6318-aa38-01ed-6ad8-f9eac89bf0fc@linux.alibaba.com/
>>>
>>> I do believe we need it based on the expensive cost of tlb shootdown in arm64
>>> even by hardware broadcast.
>>
>> Alright, for now could we enable ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH selectively
>> with CONFIG_EXPERT and for num_online_cpus()  > 8 ?
> 
> Sounds good to me. It is a good start to bring up tlb batched flush in
> ARM64. Later on, we
> might want to see it in both memory reclamation and migration.

Right, that is the idea. CONFIG_EXPERT gives a way to test this out for some
time on various platforms, and it can be dropped later. The num_online_cpus()
threshold of 8, at which batched TLB flushing could potentially give a benefit,
should be defined as a macro, e.g. NR_CPUS_FOR_BATCHED_TLB, or as an internal
(non user selectable) config, with a proper in-code comment explaining the
rationale.
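
A minimal sketch of what such a macro and comment might look like, written
as a userspace function so it can be compiled on its own. The name
toy_tlbbatch_should_defer() and its online_cpus parameter are hypothetical
(the real hook takes a struct mm_struct * and would call num_online_cpus()).
Whether 8 itself is included is not settled in this thread: the sketch
assumes ">= 8", following the "< 8" check suggested earlier, whereas the
"> 8" wording would exclude 8-CPU systems.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Assumed threshold: below this many online CPUs a TLB shootdown is
 * treated as cheap enough that deferring it is not worth the cost of
 * the later sync. The value 8 comes from the tests reported in this
 * thread (improvement on 8 CPUs, none on 4).
 */
#define NR_CPUS_FOR_BATCHED_TLB	8

static_assert(NR_CPUS_FOR_BATCHED_TLB > 4,
	      "4-CPU systems showed no benefit in the reported tests");

/* Hypothetical userspace stand-in for arch_tlbbatch_should_defer(). */
static bool toy_tlbbatch_should_defer(unsigned int online_cpus)
{
	return online_cpus >= NR_CPUS_FOR_BATCHED_TLB;
}
```

Keeping the number behind one named constant means a later change of the
threshold touches a single line and the rationale stays next to it.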


* Re: [PATCH v4 2/2] arm64: support batched/deferred tlb shootdown during page reclamation
  2022-10-27 21:55             ` Barry Song
@ 2022-10-28  2:14               ` Anshuman Khandual
  2022-10-28 13:12                 ` Punit Agrawal
  0 siblings, 1 reply; 18+ messages in thread
From: Anshuman Khandual @ 2022-10-28  2:14 UTC (permalink / raw)
  To: Barry Song, Punit Agrawal
  Cc: Yicong Yang, yangyicong, corbet, peterz, arnd, linux-kernel,
	darren, huzhanyuan, lipeifeng, zhangshiming, guojian, realmz6,
	linux-mips, openrisc, linux-mm, x86, linux-arm-kernel,
	linuxppc-dev, akpm, linux-riscv, linux-s390, wangkefeng.wang,
	xhao, prime.zeng, Barry Song, Nadav Amit, Mel Gorman,
	catalin.marinas, will, linux-doc



On 10/28/22 03:25, Barry Song wrote:
> On Fri, Oct 28, 2022 at 3:19 AM Punit Agrawal
> <punit.agrawal@bytedance.com> wrote:
>>
>> [ Apologies for chiming in late in the conversation ]
>>
>> Anshuman Khandual <anshuman.khandual@arm.com> writes:
>>
>>> On 9/28/22 05:53, Barry Song wrote:
>>>> On Tue, Sep 27, 2022 at 10:15 PM Yicong Yang <yangyicong@huawei.com> wrote:
>>>>> On 2022/9/27 14:16, Anshuman Khandual wrote:
>>>>>> [...]
>>>>>>
>>>>>> On 9/21/22 14:13, Yicong Yang wrote:
>>>>>>> +static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
>>>>>>> +{
>>>>>>> +    /* for small systems with small number of CPUs, TLB shootdown is cheap */
>>>>>>> +    if (num_online_cpus() <= 4)
>>>>>> It would be great to have some more inputs from others, whether 4 (which should
>>>>>> to be codified into a macro e.g ARM64_NR_CPU_DEFERRED_TLB, or something similar)
>>>>>> is optimal for an wide range of arm64 platforms.
>>>>>>
>>>> I have tested it on a 4-cpus and 8-cpus machine. but i have no machine
>>>> with 5,6,7
>>>> cores.
>>>> I saw improvement on 8-cpus machines and I found 4-cpus machines don't need
>>>> this patch.
>>>>
>>>> so it seems safe to have
>>>> if (num_online_cpus()  < 8)
>>>>
>>>>> Do you prefer this macro to be static or make it configurable through kconfig then
>>>>> different platforms can make choice based on their own situations? It maybe hard to
>>>>> test on all the arm64 platforms.
>>>> Maybe we can have this default enabled on machines with 8 and more cpus and
>>>> provide a tlbflush_batched = on or off to allow users enable or
>>>> disable it according
>>>> to their hardware and products. Similar example: rodata=on or off.
>>> No, sounds bit excessive. Kernel command line options should not be added
>>> for every possible run time switch options.
>>>
>>>> Hi Anshuman, Will,  Catalin, Andrew,
>>>> what do you think about this approach?
>>>>
>>>> BTW, haoxin mentioned another important user scenarios for tlb bach on arm64:
>>>> https://lore.kernel.org/lkml/393d6318-aa38-01ed-6ad8-f9eac89bf0fc@linux.alibaba.com/
>>>>
>>>> I do believe we need it based on the expensive cost of tlb shootdown in arm64
>>>> even by hardware broadcast.
>>> Alright, for now could we enable ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH selectively
>>> with CONFIG_EXPERT and for num_online_cpus()  > 8 ?
>> When running the test program in the commit in a VM, I saw benefits from
>> the patches at all sizes from 2, 4, 8, 32 vcpus. On the test machine,
>> ptep_clear_flush() went from ~1% in the unpatched version to not showing
>> up.
>>
>> Yicong mentioned that he didn't see any benefit for <= 4 CPUs but is
>> there any overhead? I am wondering what are the downsides of enabling
>> the config by default.
> As we are deferring tlb flush, but sometimes while we are modifying the vma
> which are deferred, we need to do a sync by flush_tlb_batched_pending() in
> mprotect() , madvise() to make sure they can see the flushed result. if nobody
> is doing mprotect(), madvise() etc in the deferred period, the overhead is zero.

Right, it is difficult to justify this overhead for smaller systems,
which for sure would not benefit from this batched TLB framework.


* Re: [PATCH v4 2/2] arm64: support batched/deferred tlb shootdown during page reclamation
  2022-10-28  1:20             ` Yicong Yang
@ 2022-10-28 13:11               ` Punit Agrawal
  2022-10-28 21:40                 ` Barry Song
  0 siblings, 1 reply; 18+ messages in thread
From: Punit Agrawal @ 2022-10-28 13:11 UTC (permalink / raw)
  To: Yicong Yang
  Cc: Punit Agrawal, Barry Song, yangyicong, corbet, peterz, arnd,
	linux-kernel, darren, huzhanyuan, lipeifeng, zhangshiming,
	guojian, realmz6, linux-mips, openrisc, linux-mm, x86,
	linux-arm-kernel, linuxppc-dev, akpm, linux-riscv, linux-s390,
	wangkefeng.wang, xhao, prime.zeng, Barry Song, Nadav Amit,
	Mel Gorman, catalin.marinas, will, linux-doc, Anshuman Khandual

Yicong Yang <yangyicong@huawei.com> writes:

> On 2022/10/27 22:19, Punit Agrawal wrote:
>> 
>> [ Apologies for chiming in late in the conversation ]
>> 
>> Anshuman Khandual <anshuman.khandual@arm.com> writes:
>> 
>>> On 9/28/22 05:53, Barry Song wrote:
>>>> On Tue, Sep 27, 2022 at 10:15 PM Yicong Yang <yangyicong@huawei.com> wrote:
>>>>>
>>>>> On 2022/9/27 14:16, Anshuman Khandual wrote:
>>>>>> [...]
>>>>>>
>>>>>> On 9/21/22 14:13, Yicong Yang wrote:
>>>>>>> +static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
>>>>>>> +{
>>>>>>> +    /* for small systems with small number of CPUs, TLB shootdown is cheap */
>>>>>>> +    if (num_online_cpus() <= 4)
>>>>>>
>>>>>> It would be great to have some more inputs from others, whether 4 (which should
>>>>>> to be codified into a macro e.g ARM64_NR_CPU_DEFERRED_TLB, or something similar)
>>>>>> is optimal for an wide range of arm64 platforms.
>>>>>>
>>>>
>>>> I have tested it on a 4-cpus and 8-cpus machine. but i have no machine
>>>> with 5,6,7
>>>> cores.
>>>> I saw improvement on 8-cpus machines and I found 4-cpus machines don't need
>>>> this patch.
>>>>
>>>> so it seems safe to have
>>>> if (num_online_cpus()  < 8)
>>>>
>>>>>
>>>>> Do you prefer this macro to be static or make it configurable through kconfig then
>>>>> different platforms can make choice based on their own situations? It maybe hard to
>>>>> test on all the arm64 platforms.
>>>>
>>>> Maybe we can have this default enabled on machines with 8 and more cpus and
>>>> provide a tlbflush_batched = on or off to allow users enable or
>>>> disable it according
>>>> to their hardware and products. Similar example: rodata=on or off.
>>>
>>> No, sounds bit excessive. Kernel command line options should not be added
>>> for every possible run time switch options.
>>>
>>>>
>>>> Hi Anshuman, Will,  Catalin, Andrew,
>>>> what do you think about this approach?
>>>>
>>>> BTW, haoxin mentioned another important user scenarios for tlb bach on arm64:
>>>> https://lore.kernel.org/lkml/393d6318-aa38-01ed-6ad8-f9eac89bf0fc@linux.alibaba.com/
>>>>
>>>> I do believe we need it based on the expensive cost of tlb shootdown in arm64
>>>> even by hardware broadcast.
>>>
>>> Alright, for now could we enable ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH selectively
>>> with CONFIG_EXPERT and for num_online_cpus()  > 8 ?
>> 
>> When running the test program in the commit in a VM, I saw benefits from
>> the patches at all sizes from 2, 4, 8, 32 vcpus. On the test machine,
>> ptep_clear_flush() went from ~1% in the unpatched version to not showing
>> up.
>> 
>
> Maybe you're booting VM on a server with more than 32 cores and Barry tested
> on his 4 CPUs embedded platform. I guess a 4 CPU VM is not fully equivalent to
> a 4 CPU real machine as the tlbi and dsb in the VM may influence the host
> as well.

Yeah, I also wondered about this.

I was able to test on a 6-core RK3399 based system - there the
ptep_clear_flush() was only 0.10% of the overall execution time. The
hardware seems to do a pretty good job of keeping the TLB flushing
overhead low.

[...]



* Re: [PATCH v4 2/2] arm64: support batched/deferred tlb shootdown during page reclamation
  2022-10-28  2:14               ` Anshuman Khandual
@ 2022-10-28 13:12                 ` Punit Agrawal
  0 siblings, 0 replies; 18+ messages in thread
From: Punit Agrawal @ 2022-10-28 13:12 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: Barry Song, Punit Agrawal, Yicong Yang, yangyicong, corbet,
	peterz, arnd, linux-kernel, darren, huzhanyuan, lipeifeng,
	zhangshiming, guojian, realmz6, linux-mips, openrisc, linux-mm,
	x86, linux-arm-kernel, linuxppc-dev, akpm, linux-riscv,
	linux-s390, wangkefeng.wang, xhao, prime.zeng, Barry Song,
	Nadav Amit, Mel Gorman, catalin.marinas, will, linux-doc

Anshuman Khandual <anshuman.khandual@arm.com> writes:

> On 10/28/22 03:25, Barry Song wrote:
>> On Fri, Oct 28, 2022 at 3:19 AM Punit Agrawal
>> <punit.agrawal@bytedance.com> wrote:
>>>
>>> [ Apologies for chiming in late in the conversation ]
>>>
>>> Anshuman Khandual <anshuman.khandual@arm.com> writes:
>>>
>>>> On 9/28/22 05:53, Barry Song wrote:
>>>>> On Tue, Sep 27, 2022 at 10:15 PM Yicong Yang <yangyicong@huawei.com> wrote:
>>>>>> On 2022/9/27 14:16, Anshuman Khandual wrote:
>>>>>>> [...]
>>>>>>>
>>>>>>> On 9/21/22 14:13, Yicong Yang wrote:
>>>>>>>> +static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
>>>>>>>> +{
>>>>>>>> +    /* for small systems with small number of CPUs, TLB shootdown is cheap */
>>>>>>>> +    if (num_online_cpus() <= 4)
>>>>>>> It would be great to have some more inputs from others, whether 4 (which should
>>>>>>> to be codified into a macro e.g ARM64_NR_CPU_DEFERRED_TLB, or something similar)
>>>>>>> is optimal for an wide range of arm64 platforms.
>>>>>>>
>>>>> I have tested it on a 4-cpus and 8-cpus machine. but i have no machine
>>>>> with 5,6,7
>>>>> cores.
>>>>> I saw improvement on 8-cpus machines and I found 4-cpus machines don't need
>>>>> this patch.
>>>>>
>>>>> so it seems safe to have
>>>>> if (num_online_cpus()  < 8)
>>>>>
>>>>>> Do you prefer this macro to be static or make it configurable through kconfig then
>>>>>> different platforms can make choice based on their own situations? It maybe hard to
>>>>>> test on all the arm64 platforms.
>>>>> Maybe we can have this default enabled on machines with 8 and more cpus and
>>>>> provide a tlbflush_batched = on or off to allow users enable or
>>>>> disable it according
>>>>> to their hardware and products. Similar example: rodata=on or off.
>>>> No, sounds bit excessive. Kernel command line options should not be added
>>>> for every possible run time switch options.
>>>>
>>>>> Hi Anshuman, Will,  Catalin, Andrew,
>>>>> what do you think about this approach?
>>>>>
>>>>> BTW, haoxin mentioned another important user scenarios for tlb bach on arm64:
>>>>> https://lore.kernel.org/lkml/393d6318-aa38-01ed-6ad8-f9eac89bf0fc@linux.alibaba.com/
>>>>>
>>>>> I do believe we need it based on the expensive cost of tlb shootdown in arm64
>>>>> even by hardware broadcast.
>>>> Alright, for now could we enable ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH selectively
>>>> with CONFIG_EXPERT and for num_online_cpus()  > 8 ?
>>> When running the test program in the commit in a VM, I saw benefits from
>>> the patches at all sizes from 2, 4, 8, 32 vcpus. On the test machine,
>>> ptep_clear_flush() went from ~1% in the unpatched version to not showing
>>> up.
>>>
>>> Yicong mentioned that he didn't see any benefit for <= 4 CPUs but is
>>> there any overhead? I am wondering what are the downsides of enabling
>>> the config by default.
>> As we are deferring tlb flush, but sometimes while we are modifying the vma
>> which are deferred, we need to do a sync by flush_tlb_batched_pending() in
>> mprotect() , madvise() to make sure they can see the flushed result. if nobody
>> is doing mprotect(), madvise() etc in the deferred period, the overhead is zero.
>
> Right, it is difficult to justify this overhead for smaller systems,
> which for sure would not benefit from this batched TLB framework.

Thank you for the pointers to the overhead.

Having looked at this more closely, I also see that
flush_tlb_batched_pending() flushes the TLB for the entire mm rather than
just the page being unmapped (as is done with ptep_clear_flush()).
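
The difference in scope can be shown with a toy TLB model. Everything here
is an illustrative assumption (an 8-entry array tagged by ASID, and the
names toy_flush_page() and toy_flush_mm()); it only counts how many cached
translations each kind of flush throws away and says nothing about real
hardware cost.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_TLB_ENTRIES 8

/* One cached translation in the toy TLB. */
struct toy_tlb_entry {
	int asid;          /* which address space (mm) it belongs to */
	unsigned long vpn; /* virtual page number */
	bool valid;
};

/* Per-page flush: drop only the entry for (asid, vpn). */
static size_t toy_flush_page(struct toy_tlb_entry *tlb, int asid,
			     unsigned long vpn)
{
	size_t dropped = 0;

	assert(tlb);
	for (size_t i = 0; i < TOY_TLB_ENTRIES; i++) {
		if (tlb[i].valid && tlb[i].asid == asid && tlb[i].vpn == vpn) {
			tlb[i].valid = false;
			dropped++;
		}
	}
	return dropped;
}

/* Whole-mm flush: drop every entry belonging to asid. */
static size_t toy_flush_mm(struct toy_tlb_entry *tlb, int asid)
{
	size_t dropped = 0;

	assert(tlb);
	for (size_t i = 0; i < TOY_TLB_ENTRIES; i++) {
		if (tlb[i].valid && tlb[i].asid == asid) {
			tlb[i].valid = false;
			dropped++;
		}
	}
	return dropped;
}
```

With five entries cached for one address space, the per-page flush removes
one of them and leaves four to keep hitting, while the whole-mm flush
removes all that remain, so the process has to refill them afterwards.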


* Re: [PATCH v4 2/2] arm64: support batched/deferred tlb shootdown during page reclamation
  2022-10-28 13:11               ` Punit Agrawal
@ 2022-10-28 21:40                 ` Barry Song
  2022-10-31 18:36                   ` Punit Agrawal
  0 siblings, 1 reply; 18+ messages in thread
From: Barry Song @ 2022-10-28 21:40 UTC (permalink / raw)
  To: Punit Agrawal
  Cc: Yicong Yang, yangyicong, corbet, peterz, arnd, linux-kernel,
	darren, huzhanyuan, lipeifeng, zhangshiming, guojian, realmz6,
	linux-mips, openrisc, linux-mm, x86, linux-arm-kernel,
	linuxppc-dev, akpm, linux-riscv, linux-s390, wangkefeng.wang,
	xhao, prime.zeng, Barry Song, Nadav Amit, Mel Gorman,
	catalin.marinas, will, linux-doc, Anshuman Khandual

On Sat, Oct 29, 2022 at 2:11 AM Punit Agrawal
<punit.agrawal@bytedance.com> wrote:
>
> Yicong Yang <yangyicong@huawei.com> writes:
>
> > On 2022/10/27 22:19, Punit Agrawal wrote:
> >>
> >> [ Apologies for chiming in late in the conversation ]
> >>
> >> Anshuman Khandual <anshuman.khandual@arm.com> writes:
> >>
> >>> On 9/28/22 05:53, Barry Song wrote:
> >>>> On Tue, Sep 27, 2022 at 10:15 PM Yicong Yang <yangyicong@huawei.com> wrote:
> >>>>>
> >>>>> On 2022/9/27 14:16, Anshuman Khandual wrote:
> >>>>>> [...]
> >>>>>>
> >>>>>> On 9/21/22 14:13, Yicong Yang wrote:
> >>>>>>> +static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
> >>>>>>> +{
> >>>>>>> +    /* for small systems with small number of CPUs, TLB shootdown is cheap */
> >>>>>>> +    if (num_online_cpus() <= 4)
> >>>>>>
> >>>>>> It would be great to have some more inputs from others, whether 4 (which should
> >>>>>> to be codified into a macro e.g ARM64_NR_CPU_DEFERRED_TLB, or something similar)
> >>>>>> is optimal for an wide range of arm64 platforms.
> >>>>>>
> >>>>
> >>>> I have tested it on a 4-cpus and 8-cpus machine. but i have no machine
> >>>> with 5,6,7
> >>>> cores.
> >>>> I saw improvement on 8-cpus machines and I found 4-cpus machines don't need
> >>>> this patch.
> >>>>
> >>>> so it seems safe to have
> >>>> if (num_online_cpus()  < 8)
> >>>>
> >>>>>
> >>>>> Do you prefer this macro to be static or make it configurable through kconfig then
> >>>>> different platforms can make choice based on their own situations? It maybe hard to
> >>>>> test on all the arm64 platforms.
> >>>>
> >>>> Maybe we can have this default enabled on machines with 8 and more cpus and
> >>>> provide a tlbflush_batched = on or off to allow users enable or
> >>>> disable it according
> >>>> to their hardware and products. Similar example: rodata=on or off.
> >>>
> >>> No, sounds bit excessive. Kernel command line options should not be added
> >>> for every possible run time switch options.
> >>>
> >>>>
> >>>> Hi Anshuman, Will,  Catalin, Andrew,
> >>>> what do you think about this approach?
> >>>>
> >>>> BTW, haoxin mentioned another important user scenarios for tlb bach on arm64:
> >>>> https://lore.kernel.org/lkml/393d6318-aa38-01ed-6ad8-f9eac89bf0fc@linux.alibaba.com/
> >>>>
> >>>> I do believe we need it based on the expensive cost of tlb shootdown in arm64
> >>>> even by hardware broadcast.
> >>>
> >>> Alright, for now could we enable ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH selectively
> >>> with CONFIG_EXPERT and for num_online_cpus()  > 8 ?
> >>
> >> When running the test program in the commit in a VM, I saw benefits from
> >> the patches at all sizes from 2, 4, 8, 32 vcpus. On the test machine,
> >> ptep_clear_flush() went from ~1% in the unpatched version to not showing
> >> up.
> >>
> >
> > Maybe you're booting VM on a server with more than 32 cores and Barry tested
> > on his 4 CPUs embedded platform. I guess a 4 CPU VM is not fully equivalent to
> > a 4 CPU real machine as the tlbi and dsb in the VM may influence the host
> > as well.
>
> Yeah, I also wondered about this.
>
> I was able to test on a 6-core RK3399 based system - there the
> ptep_clear_flush() was only 0.10% of the overall execution time. The
> hardware seems to do a pretty good job of keeping the TLB flushing
> overhead low.

RK3399 has a dual-core ARM Cortex-A72 MPCore processor and a
quad-core ARM Cortex-A53 MPCore processor. You are probably
going to see different ptep_clear_flush() overhead when you
bind the micro-benchmark to different cores.
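
One way to do that binding from inside the benchmark is
sched_setaffinity(2). The helper below is a minimal sketch; pin_to_cpu() is
a hypothetical name, and which CPU numbers belong to which cluster has to
be checked on the board in question.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

/*
 * Pin the calling thread to a single CPU.
 * Returns 0 on success, -1 on a bad CPU number or if the kernel
 * rejects the mask (e.g. the CPU is offline or not permitted).
 */
static int pin_to_cpu(int cpu)
{
	cpu_set_t set;

	if (cpu < 0 || cpu >= CPU_SETSIZE)
		return -1;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);

	/* pid 0 means "the calling thread" */
	return sched_setaffinity(0, sizeof(set), &set);
}
```

Calling pin_to_cpu() once before the measured loop, first with a CPU from
one cluster and then with a CPU from the other, gives two runs that can be
compared. On RK3399 boards the Cortex-A53 cores are commonly CPUs 0-3 and
the Cortex-A72 cores CPUs 4-5, but treat that numbering as an assumption
and confirm it (for example in /proc/cpuinfo) before comparing results.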

>
> [...]
>

Thanks
Barry


* Re: [PATCH v4 2/2] arm64: support batched/deferred tlb shootdown during page reclamation
  2022-10-28 21:40                 ` Barry Song
@ 2022-10-31 18:36                   ` Punit Agrawal
  0 siblings, 0 replies; 18+ messages in thread
From: Punit Agrawal @ 2022-10-31 18:36 UTC (permalink / raw)
  To: Barry Song
  Cc: Punit Agrawal, Yicong Yang, yangyicong, corbet, peterz, arnd,
	linux-kernel, darren, huzhanyuan, lipeifeng, zhangshiming,
	guojian, realmz6, linux-mips, openrisc, linux-mm, x86,
	linux-arm-kernel, linuxppc-dev, akpm, linux-riscv, linux-s390,
	wangkefeng.wang, xhao, prime.zeng, Barry Song, Nadav Amit,
	Mel Gorman, catalin.marinas, will, linux-doc, Anshuman Khandual

Barry Song <21cnbao@gmail.com> writes:

> On Sat, Oct 29, 2022 at 2:11 AM Punit Agrawal
> <punit.agrawal@bytedance.com> wrote:
>>
>> Yicong Yang <yangyicong@huawei.com> writes:
>>
>> > On 2022/10/27 22:19, Punit Agrawal wrote:
>> >>
>> >> [ Apologies for chiming in late in the conversation ]
>> >>
>> >> Anshuman Khandual <anshuman.khandual@arm.com> writes:
>> >>
>> >>> On 9/28/22 05:53, Barry Song wrote:
>> >>>> On Tue, Sep 27, 2022 at 10:15 PM Yicong Yang <yangyicong@huawei.com> wrote:
>> >>>>>
>> >>>>> On 2022/9/27 14:16, Anshuman Khandual wrote:
>> >>>>>> [...]
>> >>>>>>
>> >>>>>> On 9/21/22 14:13, Yicong Yang wrote:
>> >>>>>>> +static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
>> >>>>>>> +{
>> >>>>>>> +    /* for small systems with small number of CPUs, TLB shootdown is cheap */
>> >>>>>>> +    if (num_online_cpus() <= 4)
>> >>>>>>
>> >>>>>> It would be great to have some more inputs from others, whether 4 (which should
>> >>>>>> be codified into a macro, e.g. ARM64_NR_CPU_DEFERRED_TLB, or something similar)
>> >>>>>> is optimal for a wide range of arm64 platforms.
>> >>>>>>
>> >>>>
>> >>>> I have tested it on a 4-cpus and 8-cpus machine. but i have no machine
>> >>>> with 5,6,7
>> >>>> cores.
>> >>>> I saw improvement on 8-cpus machines and I found 4-cpus machines don't need
>> >>>> this patch.
>> >>>>
>> >>>> so it seems safe to have
>> >>>> if (num_online_cpus()  < 8)
>> >>>>
>> >>>>>
>> >>>>> Do you prefer this macro to be static or make it configurable through kconfig then
>> >>>>> different platforms can make choice based on their own situations? It maybe hard to
>> >>>>> test on all the arm64 platforms.
>> >>>>
>> >>>> Maybe we can have this default enabled on machines with 8 and more cpus and
>> >>>> provide a tlbflush_batched = on or off to allow users enable or
>> >>>> disable it according
>> >>>> to their hardware and products. Similar example: rodata=on or off.
>> >>>
>> >>> No, sounds bit excessive. Kernel command line options should not be added
>> >>> for every possible run time switch options.
>> >>>
>> >>>>
>> >>>> Hi Anshuman, Will,  Catalin, Andrew,
>> >>>> what do you think about this approach?
>> >>>>
>> >>>> BTW, haoxin mentioned another important user scenario for tlb batch on arm64:
>> >>>> https://lore.kernel.org/lkml/393d6318-aa38-01ed-6ad8-f9eac89bf0fc@linux.alibaba.com/
>> >>>>
>> >>>> I do believe we need it based on the expensive cost of tlb shootdown in arm64
>> >>>> even by hardware broadcast.
>> >>>
>> >>> Alright, for now could we enable ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH selectively
>> >>> with CONFIG_EXPERT and for num_online_cpus()  > 8 ?
>> >>
>> >> When running the test program in the commit in a VM, I saw benefits from
>> >> the patches at all sizes from 2, 4, 8, 32 vcpus. On the test machine,
>> >> ptep_clear_flush() went from ~1% in the unpatched version to not showing
>> >> up.
>> >>
>> >
>> > Maybe you're booting VM on a server with more than 32 cores and Barry tested
>> > on his 4 CPUs embedded platform. I guess a 4 CPU VM is not fully equivalent to
>> > a 4 CPU real machine as the tlbi and dsb in the VM may influence the host
>> > as well.
>>
>> Yeah, I also wondered about this.
>>
>> I was able to test on a 6-core RK3399 based system - there the
>> ptep_clear_flush() was only 0.10% of the overall execution time. The
>> hardware seems to do a pretty good job of keeping the TLB flushing
>> overhead low.

I found a problem with my measurements (missing volatile). Correcting
that increased the overhead somewhat - more below.

> RK3399 has Dual-core ARM Cortex-A72 MPCore processor and
> Quad-core ARM Cortex-A53 MPCore processor. you are probably
> going to see different overhead of ptep_clear_flush() when you
> bind the micro-benchmark on different cores.

Indeed - binding the benchmark to the A53 cores shows roughly half the
ptep_clear_flush() overhead seen on the A72 cores.

On the A53 -

    $ perf report --stdio -i perf.vanilla.a53.data | grep ptep_clear_flush
         0.63%  pageout  [kernel.kallsyms]  [k] ptep_clear_flush

On the A72 -

    $ perf report --stdio -i perf.vanilla.a72.data | grep ptep_clear_flush
         1.34%  pageout  [kernel.kallsyms]      [k] ptep_clear_flush
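
For reference, the "missing volatile" pitfall mentioned above can be
sketched as follows. This is only an illustration under assumptions: the
helper name touch_pages() and its parameters are hypothetical and are not
taken from the benchmark in the commit. The point is that loads whose
result is unused may be dropped by the compiler unless they go through a
volatile-qualified pointer, which would leave the swap-in path (and so
ptep_clear_flush()) under-exercised.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: read one byte from each page of a buffer.
 * The volatile qualifier forces the compiler to emit every load,
 * so each page really is touched (and faulted back in if it was
 * paged out). Returns the sum of the bytes read.
 */
unsigned long touch_pages(const void *buf, size_t len, size_t page_size)
{
    const volatile unsigned char *p = buf;
    unsigned long sum = 0;

    /* A zero page size would make the loop below spin forever. */
    assert(page_size > 0);

    for (size_t off = 0; off < len; off += page_size)
        sum += p[off];

    return sum;
}
```

Binding such a benchmark to a particular core can be done from the shell
with taskset(1), or from the program itself with sched_setaffinity(2).
Which CPU numbers map to the A53 and A72 clusters is board specific and
should be checked (e.g. in /proc/cpuinfo) rather than assumed.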


[...]


^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2022-10-31 18:36 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-09-21  8:43 [PATCH v4 0/2] mm: arm64: bring up BATCHED_UNMAP_TLB_FLUSH Yicong Yang
2022-09-21  8:43 ` [PATCH v4 1/2] mm/tlbbatch: Introduce arch_tlbbatch_should_defer() Yicong Yang
2022-09-21  8:54   ` Barry Song
2022-09-21  8:43 ` [PATCH v4 2/2] arm64: support batched/deferred tlb shootdown during page reclamation Yicong Yang
2022-09-27  6:16   ` Anshuman Khandual
2022-09-27  9:15     ` Yicong Yang
2022-09-28  0:23       ` Barry Song
2022-10-27 10:41         ` Anshuman Khandual
2022-10-27 14:19           ` Punit Agrawal
2022-10-27 21:55             ` Barry Song
2022-10-28  2:14               ` Anshuman Khandual
2022-10-28 13:12                 ` Punit Agrawal
2022-10-28  1:20             ` Yicong Yang
2022-10-28 13:11               ` Punit Agrawal
2022-10-28 21:40                 ` Barry Song
2022-10-31 18:36                   ` Punit Agrawal
2022-10-27 22:07           ` Barry Song
2022-10-28  1:56             ` Anshuman Khandual
