linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/4] mm: arm64: bring up BATCHED_UNMAP_TLB_FLUSH
@ 2022-07-07 12:52 Barry Song
  2022-07-07 12:52 ` [PATCH 1/4] Revert "Documentation/features: mark BATCHED_UNMAP_TLB_FLUSH doesn't apply to ARM64" Barry Song
                   ` (3 more replies)
  0 siblings, 4 replies; 16+ messages in thread
From: Barry Song @ 2022-07-07 12:52 UTC (permalink / raw)
  To: akpm, linux-mm, linux-arm-kernel, x86, catalin.marinas, will, linux-doc
  Cc: corbet, arnd, linux-kernel, darren, yangyicong, huzhanyuan,
	lipeifeng, zhangshiming, guojian, realmz6, Barry Song

Though ARM64 has the hardware to do TLB shootdown, it is not free.
A simple micro-benchmark shows that even on a Snapdragon 888 with only
8 cores, the overhead of ptep_clear_flush is huge even when paging
out one page mapped by only one process:
5.36%  a.out    [kernel.kallsyms]  [k] ptep_clear_flush

When pages are mapped by multiple processes, or when the hardware has
more CPUs, the cost becomes even higher due to the poor scalability of
TLB shootdown.

This patchset leverages the existing BATCHED_UNMAP_TLB_FLUSH by
1. only sending tlbi instructions in the first stage -
	arch_tlbbatch_add_mm()
2. waiting for the completion of the tlbi by dsb while doing the
	tlbbatch sync in arch_tlbbatch_flush()
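A rough sketch of that two-stage flow, simplified from the existing
mm/rmap.c code (this is not the exact kernel source; the pending-batch
counters and dirty/writable tracking are trimmed):

 /* Stage 1: called for every pte cleared by try_to_unmap_one(). */
 static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
 {
 	struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;

 	/* arm64 (this series): queue an async tlbi; x86: accumulate mm_cpumask */
 	arch_tlbbatch_add_mm(&tlb_ubc->arch, mm);
 	tlb_ubc->flush_required = true;
 }

 /* Stage 2: called once by reclaim before the unmapped pages are freed. */
 void try_to_unmap_flush(void)
 {
 	struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;

 	if (!tlb_ubc->flush_required)
 		return;

 	/* arm64 (this series): one dsb(ish); x86: IPI the accumulated cpumask */
 	arch_tlbbatch_flush(&tlb_ubc->arch);
 	tlb_ubc->flush_required = false;
 }
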
My testing on Snapdragon shows the overhead of ptep_clear_flush
is removed by the patchset. The micro-benchmark becomes about 5% faster
even for one page mapped by a single process on Snapdragon 888.

While I believe the micro-benchmark in 4/4 will perform even better
on arm64 servers, I don't have the hardware to test on. Thus:
Hi Yicong,
would you like to run the same test in 4/4 on Kunpeng920?
Hi Darren,
would you like to run the same test in 4/4 on Ampere's ARM64 server?
Remember to enable a zRAM swap device so that pageout can actually
happen in the micro-benchmark.
Thanks!

Barry Song (4):
  Revert "Documentation/features: mark BATCHED_UNMAP_TLB_FLUSH doesn't
    apply to ARM64"
  mm: rmap: Allow platforms without mm_cpumask to defer TLB flush
  mm: rmap: Extend tlbbatch APIs to fit new platforms
  arm64: support batched/deferred tlb shootdown during page reclamation

 Documentation/features/arch-support.txt       |  1 -
 .../features/vm/TLB/arch-support.txt          |  2 +-
 arch/arm64/Kconfig                            |  1 +
 arch/arm64/include/asm/tlbbatch.h             | 12 +++++++++++
 arch/arm64/include/asm/tlbflush.h             | 13 ++++++++++++
 arch/x86/include/asm/tlbflush.h               |  4 +++-
 mm/rmap.c                                     | 21 +++++++++++++------
 7 files changed, 45 insertions(+), 9 deletions(-)
 create mode 100644 arch/arm64/include/asm/tlbbatch.h

-- 
2.25.1



* [PATCH 1/4] Revert "Documentation/features: mark BATCHED_UNMAP_TLB_FLUSH doesn't apply to ARM64"
  2022-07-07 12:52 [PATCH 0/4] mm: arm64: bring up BATCHED_UNMAP_TLB_FLUSH Barry Song
@ 2022-07-07 12:52 ` Barry Song
  2022-07-07 12:52 ` [PATCH 2/4] mm: rmap: Allow platforms without mm_cpumask to defer TLB flush Barry Song
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 16+ messages in thread
From: Barry Song @ 2022-07-07 12:52 UTC (permalink / raw)
  To: akpm, linux-mm, linux-arm-kernel, x86, catalin.marinas, will, linux-doc
  Cc: corbet, arnd, linux-kernel, darren, yangyicong, huzhanyuan,
	lipeifeng, zhangshiming, guojian, realmz6, Barry Song,
	Nadav Amit, Mel Gorman

From: Barry Song <v-songbaohua@oppo.com>

This reverts commit 6bfef171d0d74cb050112e0e49feb20bfddf7f42.

I was wrong to say batched TLB flush didn't apply to ARM64. The fact
is that although ARM64 has hardware TLB flush, it is not free and can
be quite expensive.

We still have a good chance to enable batched and deferred TLB flush
on ARM64 for memory reclamation. A possible way is to only queue
tlbi instructions in the hardware's queue at unmap time, and wait for
their completion with a dsb when we finally have to synchronize. We
just need to adapt to the existing BATCHED_UNMAP_TLB_FLUSH framework.

Cc: Will Deacon <will@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Nadav Amit <namit@vmware.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 Documentation/features/arch-support.txt        | 1 -
 Documentation/features/vm/TLB/arch-support.txt | 2 +-
 2 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/Documentation/features/arch-support.txt b/Documentation/features/arch-support.txt
index 118ae031840b..d22a1095e661 100644
--- a/Documentation/features/arch-support.txt
+++ b/Documentation/features/arch-support.txt
@@ -8,5 +8,4 @@ The meaning of entries in the tables is:
     | ok |  # feature supported by the architecture
     |TODO|  # feature not yet supported by the architecture
     | .. |  # feature cannot be supported by the hardware
-    | N/A|  # feature doesn't apply to the architecture
 
diff --git a/Documentation/features/vm/TLB/arch-support.txt b/Documentation/features/vm/TLB/arch-support.txt
index 039e4e91ada3..1c009312b9c1 100644
--- a/Documentation/features/vm/TLB/arch-support.txt
+++ b/Documentation/features/vm/TLB/arch-support.txt
@@ -9,7 +9,7 @@
     |       alpha: | TODO |
     |         arc: | TODO |
     |         arm: | TODO |
-    |       arm64: | N/A  |
+    |       arm64: | TODO |
     |        csky: | TODO |
     |     hexagon: | TODO |
     |        ia64: | TODO |
-- 
2.25.1



* [PATCH 2/4] mm: rmap: Allow platforms without mm_cpumask to defer TLB flush
  2022-07-07 12:52 [PATCH 0/4] mm: arm64: bring up BATCHED_UNMAP_TLB_FLUSH Barry Song
  2022-07-07 12:52 ` [PATCH 1/4] Revert "Documentation/features: mark BATCHED_UNMAP_TLB_FLUSH doesn't apply to ARM64" Barry Song
@ 2022-07-07 12:52 ` Barry Song
  2022-07-08  6:36   ` Nadav Amit
  2022-07-07 12:52 ` [PATCH 3/4] mm: rmap: Extend tlbbatch APIs to fit new platforms Barry Song
  2022-07-07 12:52 ` [PATCH 4/4] arm64: support batched/deferred tlb shootdown during page reclamation Barry Song
  3 siblings, 1 reply; 16+ messages in thread
From: Barry Song @ 2022-07-07 12:52 UTC (permalink / raw)
  To: akpm, linux-mm, linux-arm-kernel, x86, catalin.marinas, will, linux-doc
  Cc: corbet, arnd, linux-kernel, darren, yangyicong, huzhanyuan,
	lipeifeng, zhangshiming, guojian, realmz6, Barry Song,
	Nadav Amit, Mel Gorman

From: Barry Song <v-songbaohua@oppo.com>

Platforms like ARM64 have hardware TLB shootdown broadcast. They
don't maintain mm_cpumask and just issue tlbi and the related
sync instructions for TLB flush.
So if mm_cpumask is empty, we also allow the TLB flush to be deferred.

Cc: Nadav Amit <namit@vmware.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 mm/rmap.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 5bcb334cd6f2..d320c29a4ad8 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -692,8 +692,13 @@ static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
 	if (!(flags & TTU_BATCH_FLUSH))
 		return false;
 
-	/* If remote CPUs need to be flushed then defer batch the flush */
-	if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids)
+	/*
+	 * If remote CPUs need to be flushed then defer batch the flush;
+	 * If ARCHs like ARM64 have hardware TLB flush broadcast and thus
+	 * don't maintain mm_cpumask() at all, defer the batch as well.
+	 */
+	if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids ||
+	    cpumask_empty(mm_cpumask(mm)))
 		should_defer = true;
 	put_cpu();
 
-- 
2.25.1



* [PATCH 3/4] mm: rmap: Extend tlbbatch APIs to fit new platforms
  2022-07-07 12:52 [PATCH 0/4] mm: arm64: bring up BATCHED_UNMAP_TLB_FLUSH Barry Song
  2022-07-07 12:52 ` [PATCH 1/4] Revert "Documentation/features: mark BATCHED_UNMAP_TLB_FLUSH doesn't apply to ARM64" Barry Song
  2022-07-07 12:52 ` [PATCH 2/4] mm: rmap: Allow platforms without mm_cpumask to defer TLB flush Barry Song
@ 2022-07-07 12:52 ` Barry Song
  2022-07-07 16:43   ` Dave Hansen
  2022-07-07 12:52 ` [PATCH 4/4] arm64: support batched/deferred tlb shootdown during page reclamation Barry Song
  3 siblings, 1 reply; 16+ messages in thread
From: Barry Song @ 2022-07-07 12:52 UTC (permalink / raw)
  To: akpm, linux-mm, linux-arm-kernel, x86, catalin.marinas, will, linux-doc
  Cc: corbet, arnd, linux-kernel, darren, yangyicong, huzhanyuan,
	lipeifeng, zhangshiming, guojian, realmz6, Barry Song,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Nadav Amit, Mel Gorman

From: Barry Song <v-songbaohua@oppo.com>

Add vma and uaddr to the tlbbatch APIs so that platforms like ARM64
are able to apply these two parameters to their specific hardware
features. For ARM64, this could be sending the tlbi into the hardware
queue without waiting for its completion.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 arch/x86/include/asm/tlbflush.h |  4 +++-
 mm/rmap.c                       | 12 ++++++++----
 2 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 4af5579c7ef7..9fc48c103b31 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -251,7 +251,9 @@ static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
 }
 
 static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
-					struct mm_struct *mm)
+					struct mm_struct *mm,
+					struct vm_area_struct *vma,
+					unsigned long uaddr)
 {
 	inc_mm_tlb_gen(mm);
 	cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
diff --git a/mm/rmap.c b/mm/rmap.c
index d320c29a4ad8..2b5b740d0001 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -642,12 +642,14 @@ void try_to_unmap_flush_dirty(void)
 #define TLB_FLUSH_BATCH_PENDING_LARGE			\
 	(TLB_FLUSH_BATCH_PENDING_MASK / 2)
 
-static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
+static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable,
+				      struct vm_area_struct *vma,
+				      unsigned long uaddr)
 {
 	struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
 	int batch, nbatch;
 
-	arch_tlbbatch_add_mm(&tlb_ubc->arch, mm);
+	arch_tlbbatch_add_mm(&tlb_ubc->arch, mm, vma, uaddr);
 	tlb_ubc->flush_required = true;
 
 	/*
@@ -737,7 +739,9 @@ void flush_tlb_batched_pending(struct mm_struct *mm)
 	}
 }
 #else
-static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
+static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable,
+				      struct vm_area_struct *vma,
+				      unsigned long uaddr)
 {
 }
 
@@ -1600,7 +1604,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 				 */
 				pteval = ptep_get_and_clear(mm, address, pvmw.pte);
 
-				set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
+				set_tlb_ubc_flush_pending(mm, pte_dirty(pteval), vma, address);
 			} else {
 				pteval = ptep_clear_flush(vma, address, pvmw.pte);
 			}
-- 
2.25.1



* [PATCH 4/4] arm64: support batched/deferred tlb shootdown during page reclamation
  2022-07-07 12:52 [PATCH 0/4] mm: arm64: bring up BATCHED_UNMAP_TLB_FLUSH Barry Song
                   ` (2 preceding siblings ...)
  2022-07-07 12:52 ` [PATCH 3/4] mm: rmap: Extend tlbbatch APIs to fit new platforms Barry Song
@ 2022-07-07 12:52 ` Barry Song
  2022-07-07 13:49   ` Peter Zijlstra
  2022-07-08  6:21   ` Yicong Yang
  3 siblings, 2 replies; 16+ messages in thread
From: Barry Song @ 2022-07-07 12:52 UTC (permalink / raw)
  To: akpm, linux-mm, linux-arm-kernel, x86, catalin.marinas, will, linux-doc
  Cc: corbet, arnd, linux-kernel, darren, yangyicong, huzhanyuan,
	lipeifeng, zhangshiming, guojian, realmz6, Barry Song,
	Nadav Amit, Mel Gorman

From: Barry Song <v-songbaohua@oppo.com>

On x86, batched and deferred tlb shootdown has led to a 90%
performance increase of the tlb shootdown path. On arm64, HW can do
tlb shootdown without software IPIs, but a synchronous tlbi is still
quite expensive.

Even running a simple program which requires swapout can
prove this is true:
 #include <sys/types.h>
 #include <unistd.h>
 #include <sys/mman.h>
 #include <string.h>

 int main()
 {
 #define SIZE (1 * 1024 * 1024)
         volatile unsigned char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                                          MAP_SHARED | MAP_ANONYMOUS, -1, 0);

         memset(p, 0x88, SIZE);

         for (int k = 0; k < 10000; k++) {
                 /* swap in */
                 for (int i = 0; i < SIZE; i += 4096) {
                         (void)p[i];
                 }

                 /* swap out */
                 madvise(p, SIZE, MADV_PAGEOUT);
         }
 }

Perf result on snapdragon 888 with 8 cores by using zRAM
as the swap block device.

 ~ # perf record taskset -c 4 ./a.out
 [ perf record: Woken up 10 times to write data ]
 [ perf record: Captured and wrote 2.297 MB perf.data (60084 samples) ]
 ~ # perf report
 # To display the perf.data header info, please use --header/--header-only options.
 # To display the perf.data header info, please use --header/--header-only options.
 #
 #
 # Total Lost Samples: 0
 #
 # Samples: 60K of event 'cycles'
 # Event count (approx.): 35706225414
 #
 # Overhead  Command  Shared Object      Symbol
 # ........  .......  .................  .............................................................................
 #
    21.07%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock_irq
     8.23%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
     6.67%  a.out    [kernel.kallsyms]  [k] filemap_map_pages
     6.16%  a.out    [kernel.kallsyms]  [k] __zram_bvec_write
     5.36%  a.out    [kernel.kallsyms]  [k] ptep_clear_flush
     3.71%  a.out    [kernel.kallsyms]  [k] _raw_spin_lock
     3.49%  a.out    [kernel.kallsyms]  [k] memset64
     1.63%  a.out    [kernel.kallsyms]  [k] clear_page
     1.42%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock
     1.26%  a.out    [kernel.kallsyms]  [k] mod_zone_state.llvm.8525150236079521930
     1.23%  a.out    [kernel.kallsyms]  [k] xas_load
     1.15%  a.out    [kernel.kallsyms]  [k] zram_slot_lock

ptep_clear_flush() takes 5.36% of CPU time in the micro-benchmark
swapping in/out a page mapped by only one process. If the
page is mapped by multiple processes, typically more than
100 on a phone, the overhead would be much higher, as
we have to run the tlb flush 100 times for one single page.
Plus, the tlb flush overhead will increase with the number
of CPU cores due to the poor scalability of tlb shootdown
in HW, so ARM64 servers should expect much higher
overhead.

Further perf annotate shows 95% of the cpu time of ptep_clear_flush
is actually spent in the final dsb() waiting for the completion
of the tlb flush. This gives us a very good chance to leverage
the existing batched tlb support in the kernel. The minimal
modification is that we only send an async tlbi in the first stage
and issue the dsb when we have to sync in the second stage.
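
For reference, the arm64 per-page flush path is roughly the following
(simplified view based on asm/tlbflush.h); the cheap invalidate and the
expensive wait are already separable:

 static inline void flush_tlb_page(struct vm_area_struct *vma,
 				  unsigned long uaddr)
 {
 	/* cheap: broadcast invalidate (tlbi vale1is), no waiting */
 	flush_tlb_page_nosync(vma, uaddr);
 	/* expensive: wait for completion on all CPUs, ~95% of the cost */
 	dsb(ish);
 }

So batching simply means calling the nosync part per page at unmap time
and issuing one dsb(ish) for the whole batch.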

With the above simple micro-benchmark, the elapsed time to
finish the program decreases by around 5%.

Typical elapsed time w/o the patch:
 ~ # time taskset -c 4 ./a.out
 0.21user 14.34system 0:14.69elapsed
w/ patch:
 ~ # time taskset -c 4 ./a.out
 0.22user 13.45system 0:13.80elapsed

Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Nadav Amit <namit@vmware.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 Documentation/features/vm/TLB/arch-support.txt |  2 +-
 arch/arm64/Kconfig                             |  1 +
 arch/arm64/include/asm/tlbbatch.h              | 12 ++++++++++++
 arch/arm64/include/asm/tlbflush.h              | 13 +++++++++++++
 4 files changed, 27 insertions(+), 1 deletion(-)
 create mode 100644 arch/arm64/include/asm/tlbbatch.h

diff --git a/Documentation/features/vm/TLB/arch-support.txt b/Documentation/features/vm/TLB/arch-support.txt
index 1c009312b9c1..2caf815d7c6c 100644
--- a/Documentation/features/vm/TLB/arch-support.txt
+++ b/Documentation/features/vm/TLB/arch-support.txt
@@ -9,7 +9,7 @@
     |       alpha: | TODO |
     |         arc: | TODO |
     |         arm: | TODO |
-    |       arm64: | TODO |
+    |       arm64: |  ok  |
     |        csky: | TODO |
     |     hexagon: | TODO |
     |        ia64: | TODO |
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 1652a9800ebe..e94913a0b040 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -93,6 +93,7 @@ config ARM64
 	select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
 	select ARCH_SUPPORTS_NUMA_BALANCING
 	select ARCH_SUPPORTS_PAGE_TABLE_CHECK
+	select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
 	select ARCH_WANT_COMPAT_IPC_PARSE_VERSION if COMPAT
 	select ARCH_WANT_DEFAULT_BPF_JIT
 	select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
diff --git a/arch/arm64/include/asm/tlbbatch.h b/arch/arm64/include/asm/tlbbatch.h
new file mode 100644
index 000000000000..fedb0b87b8db
--- /dev/null
+++ b/arch/arm64/include/asm/tlbbatch.h
@@ -0,0 +1,12 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ARCH_ARM64_TLBBATCH_H
+#define _ARCH_ARM64_TLBBATCH_H
+
+struct arch_tlbflush_unmap_batch {
+	/*
+	 * For arm64, HW can do tlb shootdown, so we don't
+	 * need to record cpumask for sending IPI
+	 */
+};
+
+#endif /* _ARCH_ARM64_TLBBATCH_H */
diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index 412a3b9a3c25..b3ed163267ca 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -272,6 +272,19 @@ static inline void flush_tlb_page(struct vm_area_struct *vma,
 	dsb(ish);
 }
 
+static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
+					struct mm_struct *mm,
+					struct vm_area_struct *vma,
+					unsigned long uaddr)
+{
+	flush_tlb_page_nosync(vma, uaddr);
+}
+
+static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
+{
+	dsb(ish);
+}
+
 /*
  * This is meant to avoid soft lock-ups on large TLB flushing ranges and not
  * necessarily a performance improvement.
-- 
2.25.1



* Re: [PATCH 4/4] arm64: support batched/deferred tlb shootdown during page reclamation
  2022-07-07 12:52 ` [PATCH 4/4] arm64: support batched/deferred tlb shootdown during page reclamation Barry Song
@ 2022-07-07 13:49   ` Peter Zijlstra
  2022-07-07 21:10     ` Barry Song
  2022-07-08  6:21   ` Yicong Yang
  1 sibling, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2022-07-07 13:49 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, linux-mm, linux-arm-kernel, x86, catalin.marinas, will,
	linux-doc, corbet, arnd, linux-kernel, darren, yangyicong,
	huzhanyuan, lipeifeng, zhangshiming, guojian, realmz6,
	Barry Song, Nadav Amit, Mel Gorman

On Fri, Jul 08, 2022 at 12:52:42AM +1200, Barry Song wrote:

> diff --git a/arch/arm64/include/asm/tlbbatch.h b/arch/arm64/include/asm/tlbbatch.h
> new file mode 100644
> index 000000000000..fedb0b87b8db
> --- /dev/null
> +++ b/arch/arm64/include/asm/tlbbatch.h
> @@ -0,0 +1,12 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _ARCH_ARM64_TLBBATCH_H
> +#define _ARCH_ARM64_TLBBATCH_H
> +
> +struct arch_tlbflush_unmap_batch {
> +	/*
> +	 * For arm64, HW can do tlb shootdown, so we don't
> +	 * need to record cpumask for sending IPI
> +	 */
> +};
> +
> +#endif /* _ARCH_ARM64_TLBBATCH_H */
> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
> index 412a3b9a3c25..b3ed163267ca 100644
> --- a/arch/arm64/include/asm/tlbflush.h
> +++ b/arch/arm64/include/asm/tlbflush.h
> @@ -272,6 +272,19 @@ static inline void flush_tlb_page(struct vm_area_struct *vma,
>  	dsb(ish);
>  }
>  
> +static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
> +					struct mm_struct *mm,
> +					struct vm_area_struct *vma,
> +					unsigned long uaddr)
> +{
> +	flush_tlb_page_nosync(vma, uaddr);
> +}

You're passing that vma along just to get the mm, that's quite silly and
trivially fixed.


diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index 412a3b9a3c25..87505ecce1f0 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -254,17 +254,23 @@ static inline void flush_tlb_mm(struct mm_struct *mm)
 	dsb(ish);
 }
 
-static inline void flush_tlb_page_nosync(struct vm_area_struct *vma,
-					 unsigned long uaddr)
+static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
+					   unsigned long uaddr)
 {
 	unsigned long addr;
 
 	dsb(ishst);
-	addr = __TLBI_VADDR(uaddr, ASID(vma->vm_mm));
+	addr = __TLBI_VADDR(uaddr, ASID(mm));
 	__tlbi(vale1is, addr);
 	__tlbi_user(vale1is, addr);
 }
 
+static inline void flush_tlb_page_nosync(struct vm_area_struct *vma,
+					 unsigned long uaddr)
+{
+	return __flush_tlb_page_nosync(vma->vm_mm, uaddr);
+}
+
 static inline void flush_tlb_page(struct vm_area_struct *vma,
 				  unsigned long uaddr)
 {


* Re: [PATCH 3/4] mm: rmap: Extend tlbbatch APIs to fit new platforms
  2022-07-07 12:52 ` [PATCH 3/4] mm: rmap: Extend tlbbatch APIs to fit new platforms Barry Song
@ 2022-07-07 16:43   ` Dave Hansen
  2022-07-07 21:12     ` Barry Song
  0 siblings, 1 reply; 16+ messages in thread
From: Dave Hansen @ 2022-07-07 16:43 UTC (permalink / raw)
  To: Barry Song, akpm, linux-mm, linux-arm-kernel, x86,
	catalin.marinas, will, linux-doc
  Cc: corbet, arnd, linux-kernel, darren, yangyicong, huzhanyuan,
	lipeifeng, zhangshiming, guojian, realmz6, Barry Song,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Nadav Amit, Mel Gorman

On 7/7/22 05:52, Barry Song wrote:
>  static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
> -					struct mm_struct *mm)
> +					struct mm_struct *mm,
> +					struct vm_area_struct *vma,
> +					unsigned long uaddr)
>  {

It's not a huge deal, but could we pass 'vma' _instead_ of 'mm'?  The
implementations could then just use vma->vm_mm instead of the passed-in mm.
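
Something like this, maybe (untested sketch, just reusing the x86
implementation from this patch with the mm derived from the vma):

 static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
 					struct vm_area_struct *vma,
 					unsigned long uaddr)
 {
 	struct mm_struct *mm = vma->vm_mm;

 	inc_mm_tlb_gen(mm);
 	cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
 }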


* Re: [PATCH 4/4] arm64: support batched/deferred tlb shootdown during page reclamation
  2022-07-07 13:49   ` Peter Zijlstra
@ 2022-07-07 21:10     ` Barry Song
  0 siblings, 0 replies; 16+ messages in thread
From: Barry Song @ 2022-07-07 21:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Linux-MM, LAK, x86, Catalin Marinas, Will Deacon,
	Linux Doc Mailing List, Jonathan Corbet, Arnd Bergmann, LKML,
	Darren Hart, Yicong Yang, huzhanyuan,
	李培锋(wink),
	张诗明(Simon Zhang), 郭健,
	real mz, Barry Song, Nadav Amit, Mel Gorman

On Fri, Jul 8, 2022 at 1:50 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Jul 08, 2022 at 12:52:42AM +1200, Barry Song wrote:
>
> > diff --git a/arch/arm64/include/asm/tlbbatch.h b/arch/arm64/include/asm/tlbbatch.h
> > new file mode 100644
> > index 000000000000..fedb0b87b8db
> > --- /dev/null
> > +++ b/arch/arm64/include/asm/tlbbatch.h
> > @@ -0,0 +1,12 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +#ifndef _ARCH_ARM64_TLBBATCH_H
> > +#define _ARCH_ARM64_TLBBATCH_H
> > +
> > +struct arch_tlbflush_unmap_batch {
> > +     /*
> > +      * For arm64, HW can do tlb shootdown, so we don't
> > +      * need to record cpumask for sending IPI
> > +      */
> > +};
> > +
> > +#endif /* _ARCH_ARM64_TLBBATCH_H */
> > diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
> > index 412a3b9a3c25..b3ed163267ca 100644
> > --- a/arch/arm64/include/asm/tlbflush.h
> > +++ b/arch/arm64/include/asm/tlbflush.h
> > @@ -272,6 +272,19 @@ static inline void flush_tlb_page(struct vm_area_struct *vma,
> >       dsb(ish);
> >  }
> >
> > +static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
> > +                                     struct mm_struct *mm,
> > +                                     struct vm_area_struct *vma,
> > +                                     unsigned long uaddr)
> > +{
> > +     flush_tlb_page_nosync(vma, uaddr);
> > +}
>
> You're passing that vma along just to get the mm, that's quite silly and
> trivially fixed.

Yes, this was silly. Will include your fix in v2.

>
>
> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
> index 412a3b9a3c25..87505ecce1f0 100644
> --- a/arch/arm64/include/asm/tlbflush.h
> +++ b/arch/arm64/include/asm/tlbflush.h
> @@ -254,17 +254,23 @@ static inline void flush_tlb_mm(struct mm_struct *mm)
>         dsb(ish);
>  }
>
> -static inline void flush_tlb_page_nosync(struct vm_area_struct *vma,
> -                                        unsigned long uaddr)
> +static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
> +                                          unsigned long uaddr)
>  {
>         unsigned long addr;
>
>         dsb(ishst);
> -       addr = __TLBI_VADDR(uaddr, ASID(vma->vm_mm));
> +       addr = __TLBI_VADDR(uaddr, ASID(mm));
>         __tlbi(vale1is, addr);
>         __tlbi_user(vale1is, addr);
>  }
>
> +static inline void flush_tlb_page_nosync(struct vm_area_struct *vma,
> +                                        unsigned long uaddr)
> +{
> +       return __flush_tlb_page_nosync(vma->vm_mm, uaddr);
> +}
> +
>  static inline void flush_tlb_page(struct vm_area_struct *vma,
>                                   unsigned long uaddr)
>  {

Thanks
Barry


* Re: [PATCH 3/4] mm: rmap: Extend tlbbatch APIs to fit new platforms
  2022-07-07 16:43   ` Dave Hansen
@ 2022-07-07 21:12     ` Barry Song
  0 siblings, 0 replies; 16+ messages in thread
From: Barry Song @ 2022-07-07 21:12 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andrew Morton, Linux-MM, LAK, x86, Catalin Marinas, Will Deacon,
	Linux Doc Mailing List, Jonathan Corbet, Arnd Bergmann, LKML,
	Darren Hart, Yicong Yang, huzhanyuan,
	李培锋(wink),
	张诗明(Simon Zhang), 郭健,
	real mz, Barry Song, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Nadav Amit,
	Mel Gorman

On Fri, Jul 8, 2022 at 4:43 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 7/7/22 05:52, Barry Song wrote:
> >  static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
> > -                                     struct mm_struct *mm)
> > +                                     struct mm_struct *mm,
> > +                                     struct vm_area_struct *vma,
> > +                                     unsigned long uaddr)
> >  {
>
> It's not a huge deal, but could we pass 'vma' _instead_ of 'mm'?  The
> implementations could then just use vma->vm_mm instead of the passed-in mm.

Yes, Dave. Peter made the same suggestion in 4/4.
Will get this fixed in v2.

Thanks
Barry


* Re: [PATCH 4/4] arm64: support batched/deferred tlb shootdown during page reclamation
  2022-07-07 12:52 ` [PATCH 4/4] arm64: support batched/deferred tlb shootdown during page reclamation Barry Song
  2022-07-07 13:49   ` Peter Zijlstra
@ 2022-07-08  6:21   ` Yicong Yang
  1 sibling, 0 replies; 16+ messages in thread
From: Yicong Yang @ 2022-07-08  6:21 UTC (permalink / raw)
  To: Barry Song, akpm, linux-mm, linux-arm-kernel, x86,
	catalin.marinas, will, linux-doc
  Cc: yangyicong, corbet, arnd, linux-kernel, darren, huzhanyuan,
	lipeifeng, zhangshiming, guojian, realmz6, Barry Song,
	Nadav Amit, Mel Gorman

Hi Barry,

On 2022/7/7 20:52, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
> 
> on x86, batched and deferred tlb shootdown has led to 90%
> performance increase on tlb shootdown. on arm64, HW can do
> tlb shootdown without software IPI. But sync tlbi is still
> quite expensive.
> 
> Even running a simple program which requires swapout can
> prove this is true,
>  #include <sys/types.h>
>  #include <unistd.h>
>  #include <sys/mman.h>
>  #include <string.h>
> 
>  int main()
>  {
>  #define SIZE (1 * 1024 * 1024)
>          volatile unsigned char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
>                                           MAP_SHARED | MAP_ANONYMOUS, -1, 0);
> 
>          memset(p, 0x88, SIZE);
> 
>          for (int k = 0; k < 10000; k++) {
>                  /* swap in */
>                  for (int i = 0; i < SIZE; i += 4096) {
>                          (void)p[i];
>                  }
> 
>                  /* swap out */
>                  madvise(p, SIZE, MADV_PAGEOUT);
>          }
>  }
> 
> Perf result on snapdragon 888 with 8 cores by using zRAM
> as the swap block device.
> 
>  ~ # perf record taskset -c 4 ./a.out
>  [ perf record: Woken up 10 times to write data ]
>  [ perf record: Captured and wrote 2.297 MB perf.data (60084 samples) ]
>  ~ # perf report
>  # To display the perf.data header info, please use --header/--header-only options.
>  # To display the perf.data header info, please use --header/--header-only options.
>  #
>  #
>  # Total Lost Samples: 0
>  #
>  # Samples: 60K of event 'cycles'
>  # Event count (approx.): 35706225414
>  #
>  # Overhead  Command  Shared Object      Symbol
>  # ........  .......  .................  .............................................................................
>  #
>     21.07%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock_irq
>      8.23%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
>      6.67%  a.out    [kernel.kallsyms]  [k] filemap_map_pages
>      6.16%  a.out    [kernel.kallsyms]  [k] __zram_bvec_write
>      5.36%  a.out    [kernel.kallsyms]  [k] ptep_clear_flush
>      3.71%  a.out    [kernel.kallsyms]  [k] _raw_spin_lock
>      3.49%  a.out    [kernel.kallsyms]  [k] memset64
>      1.63%  a.out    [kernel.kallsyms]  [k] clear_page
>      1.42%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock
>      1.26%  a.out    [kernel.kallsyms]  [k] mod_zone_state.llvm.8525150236079521930
>      1.23%  a.out    [kernel.kallsyms]  [k] xas_load
>      1.15%  a.out    [kernel.kallsyms]  [k] zram_slot_lock
> 
> ptep_clear_flush() takes 5.36% CPU in the micro-benchmark
> swapping in/out a page mapped by only one process. If the
> page is mapped by multiple processes, typically, like more
> than 100 on a phone, the overhead would be much higher as
> we have to run tlb flush 100 times for one single page.
> Plus, tlb flush overhead will increase with the number
> of CPU cores due to the bad scalability of tlb shootdown
> in HW, so those ARM64 servers should expect much higher
> overhead.
> 
> Further perf annotate shows 95% cpu time of ptep_clear_flush
> is actually used by the final dsb() to wait for the completion
> of tlb flush. This provides us a very good chance to leverage
> the existing batched tlb in kernel. The minimum modification
> is that we only send async tlbi in the first stage and we send
> dsb while we have to sync in the second stage.
> 
> With the above simple micro benchmark, elapsed time to
> finish the program decreases around 5%.
> 
> Typical elapsed time w/o patch:
>  ~ # time taskset -c 4 ./a.out
>  0.21user 14.34system 0:14.69elapsed
> w/ patch:
>  ~ # time taskset -c 4 ./a.out
>  0.22user 13.45system 0:13.80elapsed
> 

Tested with the benchmark in the commit message on a Kunpeng920 arm64 server; observed an improvement
of around 12.5% with the command `time ./swap_bench`.
	w/o		w/
real	0m13.460s	0m11.771s
user	0m0.248s	0m0.279s
sys	0m12.039s	0m11.458s

Originally a 16.99% overhead of ptep_clear_flush() was observed, which has been eliminated
by this patch:

[root@localhost yang]# perf record -- ./swap_bench && perf report
[...]
    16.99%  swap_bench  [kernel.kallsyms]  [k] ptep_clear_flush

Feel free to add:
Tested-by: Yicong Yang <yangyicong@hisilicon.com>


> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: Nadav Amit <namit@vmware.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
>  Documentation/features/vm/TLB/arch-support.txt |  2 +-
>  arch/arm64/Kconfig                             |  1 +
>  arch/arm64/include/asm/tlbbatch.h              | 12 ++++++++++++
>  arch/arm64/include/asm/tlbflush.h              | 13 +++++++++++++
>  4 files changed, 27 insertions(+), 1 deletion(-)
>  create mode 100644 arch/arm64/include/asm/tlbbatch.h
> 
> diff --git a/Documentation/features/vm/TLB/arch-support.txt b/Documentation/features/vm/TLB/arch-support.txt
> index 1c009312b9c1..2caf815d7c6c 100644
> --- a/Documentation/features/vm/TLB/arch-support.txt
> +++ b/Documentation/features/vm/TLB/arch-support.txt
> @@ -9,7 +9,7 @@
>      |       alpha: | TODO |
>      |         arc: | TODO |
>      |         arm: | TODO |
> -    |       arm64: | TODO |
> +    |       arm64: |  ok  |
>      |        csky: | TODO |
>      |     hexagon: | TODO |
>      |        ia64: | TODO |
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 1652a9800ebe..e94913a0b040 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -93,6 +93,7 @@ config ARM64
>  	select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
>  	select ARCH_SUPPORTS_NUMA_BALANCING
>  	select ARCH_SUPPORTS_PAGE_TABLE_CHECK
> +	select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
>  	select ARCH_WANT_COMPAT_IPC_PARSE_VERSION if COMPAT
>  	select ARCH_WANT_DEFAULT_BPF_JIT
>  	select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
> diff --git a/arch/arm64/include/asm/tlbbatch.h b/arch/arm64/include/asm/tlbbatch.h
> new file mode 100644
> index 000000000000..fedb0b87b8db
> --- /dev/null
> +++ b/arch/arm64/include/asm/tlbbatch.h
> @@ -0,0 +1,12 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _ARCH_ARM64_TLBBATCH_H
> +#define _ARCH_ARM64_TLBBATCH_H
> +
> +struct arch_tlbflush_unmap_batch {
> +	/*
> +	 * For arm64, HW can do tlb shootdown, so we don't
> +	 * need to record cpumask for sending IPI
> +	 */
> +};
> +
> +#endif /* _ARCH_ARM64_TLBBATCH_H */
> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
> index 412a3b9a3c25..b3ed163267ca 100644
> --- a/arch/arm64/include/asm/tlbflush.h
> +++ b/arch/arm64/include/asm/tlbflush.h
> @@ -272,6 +272,19 @@ static inline void flush_tlb_page(struct vm_area_struct *vma,
>  	dsb(ish);
>  }
>  
> +static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
> +					struct mm_struct *mm,
> +					struct vm_area_struct *vma,
> +					unsigned long uaddr)
> +{
> +	flush_tlb_page_nosync(vma, uaddr);
> +}
> +
> +static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
> +{
> +	dsb(ish);
> +}
> +
>  /*
>   * This is meant to avoid soft lock-ups on large TLB flushing ranges and not
>   * necessarily a performance improvement.
> 


* Re: [PATCH 2/4] mm: rmap: Allow platforms without mm_cpumask to defer TLB flush
  2022-07-07 12:52 ` [PATCH 2/4] mm: rmap: Allow platforms without mm_cpumask to defer TLB flush Barry Song
@ 2022-07-08  6:36   ` Nadav Amit
  2022-07-08  6:59     ` Barry Song
  0 siblings, 1 reply; 16+ messages in thread
From: Nadav Amit @ 2022-07-08  6:36 UTC (permalink / raw)
  To: Barry Song
  Cc: Andrew Morton, Linux MM, linux-arm-kernel, X86 ML,
	Catalin Marinas, Will Deacon, linux-doc, Jonathan Corbet,
	Arnd Bergmann, LKML, darren, yangyicong, huzhanyuan, lipeifeng,
	zhangshiming, guojian, realmz6, Barry Song, Mel Gorman

On Jul 7, 2022, at 5:52 AM, Barry Song <21cnbao@gmail.com> wrote:

> From: Barry Song <v-songbaohua@oppo.com>
> 
> Platforms like ARM64 have hardware TLB shootdown broadcast. They
> don't maintain mm_cpumask and they just send tlbi and related
> sync instructions for TLB flush.
> So if mm_cpumask is empty, we also allow deferred TLB flush
> 
> Cc: Nadav Amit <namit@vmware.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
> mm/rmap.c | 9 +++++++--
> 1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 5bcb334cd6f2..d320c29a4ad8 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -692,8 +692,13 @@ static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
> 	if (!(flags & TTU_BATCH_FLUSH))
> 		return false;
> 
> -	/* If remote CPUs need to be flushed then defer batch the flush */
> -	if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids)
> +	/*
> +	 * If remote CPUs need to be flushed then defer batch the flush;
> +	 * If ARCHs like ARM64 have hardware TLB flush broadcast and thus
> +	 * don't maintain mm_cpumask() at all, defer the batch as well.
> +	 */
> +	if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids ||
> +	    cpumask_empty(mm_cpumask(mm)))

The cpumask_empty() is indeed just another memory access, which is most
likely ok. But wouldn’t adding something like CONFIG_ARCH_HAS_MM_CPUMASK
make the code simpler and (slightly, certainly slightly) more performant?



* Re: [PATCH 2/4] mm: rmap: Allow platforms without mm_cpumask to defer TLB flush
  2022-07-08  6:36   ` Nadav Amit
@ 2022-07-08  6:59     ` Barry Song
  2022-07-08  8:08       ` Nadav Amit
  0 siblings, 1 reply; 16+ messages in thread
From: Barry Song @ 2022-07-08  6:59 UTC (permalink / raw)
  To: namit
  Cc: 21cnbao, akpm, arnd, catalin.marinas, corbet, darren, guojian,
	huzhanyuan, linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	lipeifeng, mgorman, realmz6, v-songbaohua, will, x86, yangyicong,
	zhangshiming

> The cpumask_empty() is indeed just another memory access, which is most
> likely ok. But wouldn’t adding something like CONFIG_ARCH_HAS_MM_CPUMASK
> make the code simpler and (slightly, certainly slightly) more performant?

Yep, good suggestion, Nadav. So the code will be as below, right?

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index be0b95e51df6..a91d73866238 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -81,6 +81,7 @@ config X86
 	select ARCH_HAS_KCOV			if X86_64
 	select ARCH_HAS_MEM_ENCRYPT
 	select ARCH_HAS_MEMBARRIER_SYNC_CORE
+	select ARCH_HAS_MM_CPUMASK
 	select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
 	select ARCH_HAS_PMEM_API		if X86_64
 	select ARCH_HAS_PTE_DEVMAP		if X86_64
diff --git a/mm/Kconfig b/mm/Kconfig
index 169e64192e48..7bf54f57ca01 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -951,6 +951,9 @@ config ARCH_HAS_CURRENT_STACK_POINTER
 	  register alias named "current_stack_pointer", this config can be
 	  selected.

+config ARCH_HAS_MM_CPUMASK
+	bool
+
 config ARCH_HAS_VM_GET_PAGE_PROT
 	bool

diff --git a/mm/rmap.c b/mm/rmap.c
index 5bcb334cd6f2..13d4f9a1d4f1 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -692,6 +692,10 @@ static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
 	if (!(flags & TTU_BATCH_FLUSH))
 		return false;

+#ifndef CONFIG_ARCH_HAS_MM_CPUMASK
+	return true;
+#endif
+
 	/* If remote CPUs need to be flushed then defer batch the flush */
 	if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids)
 		should_defer = true;

Thanks
Barry



* Re: [PATCH 2/4] mm: rmap: Allow platforms without mm_cpumask to defer TLB flush
  2022-07-08  6:59     ` Barry Song
@ 2022-07-08  8:08       ` Nadav Amit
  2022-07-08  8:17         ` Barry Song
  2022-07-08  8:27         ` Peter Zijlstra
  0 siblings, 2 replies; 16+ messages in thread
From: Nadav Amit @ 2022-07-08  8:08 UTC (permalink / raw)
  To: Barry Song
  Cc: Andrew Morton, Arnd Bergmann, Catalin Marinas, Jonathan Corbet,
	darren, guojian, huzhanyuan, linux-arm-kernel, linux-doc,
	linux-kernel, linux-mm, lipeifeng, mgorman, realmz6,
	v-songbaohua, will, x86, yangyicong, zhangshiming

On Jul 7, 2022, at 11:59 PM, Barry Song <21cnbao@gmail.com> wrote:

>> The cpumask_empty() is indeed just another memory access, which is most
>> likely ok. But wouldn’t adding something like CONFIG_ARCH_HAS_MM_CPUMASK
>> make the code simpler and (slightly, certainly slightly) more performant?
> 
> Yep. good suggestion, Nadav. So the code will be as below, right?

Hmmm… Although it is likely to work (because only x86 and arm would use this
batch flushing), I think that for consistency ARCH_HAS_MM_CPUMASK should be
correct for all architectures.

Is it really only x86 that has mm_cpumask()?



* Re: [PATCH 2/4] mm: rmap: Allow platforms without mm_cpumask to defer TLB flush
  2022-07-08  8:08       ` Nadav Amit
@ 2022-07-08  8:17         ` Barry Song
  2022-07-08  8:27         ` Peter Zijlstra
  1 sibling, 0 replies; 16+ messages in thread
From: Barry Song @ 2022-07-08  8:17 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andrew Morton, Arnd Bergmann, Catalin Marinas, Jonathan Corbet,
	darren, guojian, huzhanyuan, linux-arm-kernel, linux-doc,
	linux-kernel, linux-mm, lipeifeng, mgorman, realmz6,
	v-songbaohua, will, x86, yangyicong, zhangshiming

On Fri, Jul 8, 2022 at 8:08 PM Nadav Amit <namit@vmware.com> wrote:
>
> On Jul 7, 2022, at 11:59 PM, Barry Song <21cnbao@gmail.com> wrote:
>
> >> The cpumask_empty() is indeed just another memory access, which is most
> >> likely ok. But wouldn’t adding something like CONFIG_ARCH_HAS_MM_CPUMASK
> >> make the code simpler and (slightly, certainly slightly) more performant?
> >
> > Yep. good suggestion, Nadav. So the code will be as below, right?
>
> Hmmm… Although it is likely to work (because only x86 and arm would use this
> batch flushing), I think that for consistency ARCH_HAS_MM_CPUMASK should be
> correct for all architectures.
>
> Is it really only x86 that has mm_cpumask()?

I am quite sure there are some other platforms having mm_cpumask(),
for example arm (not arm64),
but I am not exactly sure of the situation of each individual arch, thus
I don't want to risk changing their source code.
But arm64 is only the second platform looking for tlbbatch, and
ARCH_HAS_MM_CPUMASK only affects tlbbatch, so I would
expect those platforms to select ARCH_HAS_MM_CPUMASK
when they start to bring up their own tlbbatch. For the moment,
we only need to make certain we don't break x86.
Does that make sense?

Thanks
Barry


>


* Re: [PATCH 2/4] mm: rmap: Allow platforms without mm_cpumask to defer TLB flush
  2022-07-08  8:08       ` Nadav Amit
  2022-07-08  8:17         ` Barry Song
@ 2022-07-08  8:27         ` Peter Zijlstra
  2022-07-11  0:30           ` Barry Song
  1 sibling, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2022-07-08  8:27 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Barry Song, Andrew Morton, Arnd Bergmann, Catalin Marinas,
	Jonathan Corbet, darren, guojian, huzhanyuan, linux-arm-kernel,
	linux-doc, linux-kernel, linux-mm, lipeifeng, mgorman, realmz6,
	v-songbaohua, will, x86, yangyicong, zhangshiming

On Fri, Jul 08, 2022 at 08:08:45AM +0000, Nadav Amit wrote:

> Is it really only x86 that has mm_cpumask()?

Unlikely; everybody who needs to IPI (e.g. doesn't have broadcast
invalidate) benefits from tracking this mask more accurately.

The below greps for clearing CPUs in the mask and ought to be a fair
indicator:

$ git grep -l "cpumask_clear_cpu.*mm_cpumask" arch/
arch/arm/include/asm/mmu_context.h
arch/loongarch/include/asm/mmu_context.h
arch/loongarch/mm/tlb.c
arch/mips/include/asm/mmu_context.h
arch/openrisc/mm/tlb.c
arch/powerpc/include/asm/book3s/64/mmu.h
arch/powerpc/mm/book3s64/radix_tlb.c
arch/riscv/mm/context.c
arch/s390/kernel/smp.c
arch/um/include/asm/mmu_context.h
arch/x86/mm/tlb.c


* Re: [PATCH 2/4] mm: rmap: Allow platforms without mm_cpumask to defer TLB flush
  2022-07-08  8:27         ` Peter Zijlstra
@ 2022-07-11  0:30           ` Barry Song
  0 siblings, 0 replies; 16+ messages in thread
From: Barry Song @ 2022-07-11  0:30 UTC (permalink / raw)
  To: peterz
  Cc: 21cnbao, akpm, arnd, catalin.marinas, corbet, darren, guojian,
	huzhanyuan, linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	lipeifeng, mgorman, namit, realmz6, v-songbaohua, will, x86,
	yangyicong, zhangshiming

On Fri, Jul 8, 2022 at 8:28 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Jul 08, 2022 at 08:08:45AM +0000, Nadav Amit wrote:
>
> > Is it really only x86 that has mm_cpumask()?
>
> Unlikely, everybody who needs to IPI (eg. doesn't have broadcast
> invalidate) has benefit to track this mask more accurately.
>
> The below greps for clearing CPUs in the mask and ought to be a fair
> indicator:
>
> $ git grep -l "cpumask_clear_cpu.*mm_cpumask" arch/
> arch/arm/include/asm/mmu_context.h
> arch/loongarch/include/asm/mmu_context.h
> arch/loongarch/mm/tlb.c
> arch/mips/include/asm/mmu_context.h
> arch/openrisc/mm/tlb.c
> arch/powerpc/include/asm/book3s/64/mmu.h
> arch/powerpc/mm/book3s64/radix_tlb.c
> arch/riscv/mm/context.c
> arch/s390/kernel/smp.c
> arch/um/include/asm/mmu_context.h
> arch/x86/mm/tlb.c

So I suppose we need the below at this moment. I am not able to
test all of them, but since only x86 has already got tlbbatch
and arm64 is the second one to have tlbbatch now, I suppose the
changes below won't break those archs without tlbbatch. I would
expect people bringing up tlbbatch on those platforms to test
them later:

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 7630ba9cb6cc..25c42747f488 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -13,6 +13,7 @@ config ARM
 	select ARCH_HAS_KEEPINITRD
 	select ARCH_HAS_KCOV
 	select ARCH_HAS_MEMBARRIER_SYNC_CORE
+	select ARCH_HAS_MM_CPUMASK
 	select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
 	select ARCH_HAS_PTE_SPECIAL if ARM_LPAE
 	select ARCH_HAS_PHYS_TO_DMA
diff --git a/arch/loongarch/Kconfig b/arch/loongarch/Kconfig
index 1920d52653b4..4b737c0d17a2 100644
--- a/arch/loongarch/Kconfig
+++ b/arch/loongarch/Kconfig
@@ -7,6 +7,7 @@ config LOONGARCH
 	select ARCH_ENABLE_MEMORY_HOTPLUG
 	select ARCH_ENABLE_MEMORY_HOTREMOVE
 	select ARCH_HAS_ACPI_TABLE_UPGRADE	if ACPI
+	select ARCH_HAS_MM_CPUMASK
 	select ARCH_HAS_PHYS_TO_DMA
 	select ARCH_HAS_PTE_SPECIAL
 	select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST
diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
index db09d45d59ec..1b196acdeca3 100644
--- a/arch/mips/Kconfig
+++ b/arch/mips/Kconfig
@@ -9,6 +9,7 @@ config MIPS
 	select ARCH_HAS_FORTIFY_SOURCE
 	select ARCH_HAS_KCOV
 	select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE if !EVA
+	select ARCH_HAS_MM_CPUMASK
 	select ARCH_HAS_PTE_SPECIAL if !(32BIT && CPU_HAS_RIXI)
 	select ARCH_HAS_STRNCPY_FROM_USER
 	select ARCH_HAS_STRNLEN_USER
diff --git a/arch/openrisc/Kconfig b/arch/openrisc/Kconfig
index e814df4c483c..82483b192f4a 100644
--- a/arch/openrisc/Kconfig
+++ b/arch/openrisc/Kconfig
@@ -9,6 +9,7 @@ config OPENRISC
 	select ARCH_32BIT_OFF_T
 	select ARCH_HAS_DMA_SET_UNCACHED
 	select ARCH_HAS_DMA_CLEAR_UNCACHED
+	select ARCH_HAS_MM_CPUMASK
 	select ARCH_HAS_SYNC_DMA_FOR_DEVICE
 	select COMMON_CLK
 	select OF
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index c2ce2e60c8f0..19061ffe73a0 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -127,6 +127,7 @@ config PPC
 	select ARCH_HAS_MEMBARRIER_SYNC_CORE
 	select ARCH_HAS_MEMREMAP_COMPAT_ALIGN	if PPC_64S_HASH_MMU
 	select ARCH_HAS_MMIOWB			if PPC64
+	select ARCH_HAS_MM_CPUMASK
 	select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
 	select ARCH_HAS_PHYS_TO_DMA
 	select ARCH_HAS_PMEM_API
diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index c22f58155948..7570c95a9cc8 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -25,6 +25,7 @@ config RISCV
 	select ARCH_HAS_GIGANTIC_PAGE
 	select ARCH_HAS_KCOV
 	select ARCH_HAS_MMIOWB
+	select ARCH_HAS_MM_CPUMASK
 	select ARCH_HAS_PTE_SPECIAL
 	select ARCH_HAS_SET_DIRECT_MAP if MMU
 	select ARCH_HAS_SET_MEMORY if MMU
diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index 91c0b80a8bf0..48d91fa05bab 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -73,6 +73,7 @@ config S390
 	select ARCH_HAS_GIGANTIC_PAGE
 	select ARCH_HAS_KCOV
 	select ARCH_HAS_MEM_ENCRYPT
+	select ARCH_HAS_MM_CPUMASK
 	select ARCH_HAS_PTE_SPECIAL
 	select ARCH_HAS_SCALED_CPUTIME
 	select ARCH_HAS_SET_MEMORY
diff --git a/arch/um/Kconfig b/arch/um/Kconfig
index 4ec22e156a2e..df29c729267b 100644
--- a/arch/um/Kconfig
+++ b/arch/um/Kconfig
@@ -8,6 +8,7 @@ config UML
 	select ARCH_EPHEMERAL_INODES
 	select ARCH_HAS_GCOV_PROFILE_ALL
 	select ARCH_HAS_KCOV
+	select ARCH_HAS_MM_CPUMASK
 	select ARCH_HAS_STRNCPY_FROM_USER
 	select ARCH_HAS_STRNLEN_USER
 	select ARCH_NO_PREEMPT
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index be0b95e51df6..a91d73866238 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -81,6 +81,7 @@ config X86
 	select ARCH_HAS_KCOV			if X86_64
 	select ARCH_HAS_MEM_ENCRYPT
 	select ARCH_HAS_MEMBARRIER_SYNC_CORE
+	select ARCH_HAS_MM_CPUMASK
 	select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
 	select ARCH_HAS_PMEM_API		if X86_64
 	select ARCH_HAS_PTE_DEVMAP		if X86_64
diff --git a/mm/Kconfig b/mm/Kconfig
index 169e64192e48..7bf54f57ca01 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -951,6 +951,9 @@ config ARCH_HAS_CURRENT_STACK_POINTER
 	  register alias named "current_stack_pointer", this config can be
 	  selected.

+config ARCH_HAS_MM_CPUMASK
+	bool
+
 config ARCH_HAS_VM_GET_PAGE_PROT
 	bool

diff --git a/mm/rmap.c b/mm/rmap.c
index 5bcb334cd6f2..13d4f9a1d4f1 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -692,6 +692,10 @@ static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
 	if (!(flags & TTU_BATCH_FLUSH))
 		return false;

+#ifndef CONFIG_ARCH_HAS_MM_CPUMASK
+	return true;
+#endif
+
 	/* If remote CPUs need to be flushed then defer batch the flush */
 	if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids)
 		should_defer = true;

Thanks
Barry


