* [PATCH v2 0/4] mm: arm64: bring up BATCHED_UNMAP_TLB_FLUSH
@ 2022-07-11  3:46 ` Barry Song
  0 siblings, 0 replies; 56+ messages in thread
From: Barry Song @ 2022-07-11  3:46 UTC (permalink / raw)
  To: akpm, linux-mm, linux-arm-kernel, x86, catalin.marinas, will, linux-doc
  Cc: corbet, arnd, linux-kernel, darren, yangyicong, huzhanyuan,
	lipeifeng, zhangshiming, guojian, realmz6, linux-mips, openrisc,
	linuxppc-dev, linux-riscv, linux-s390, Barry Song

Though ARM64 has the hardware to do TLB shootdown, the hardware
broadcast is not free.
A simple micro benchmark shows that even on Snapdragon 888 with only
8 cores, the overhead of ptep_clear_flush() is huge, even when paging
out one page mapped by only one process:
5.36%  a.out    [kernel.kallsyms]  [k] ptep_clear_flush
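
(For reference, a reproducer of roughly this shape can show the cost;
this is an illustrative sketch assuming MADV_PAGEOUT is available
(Linux 5.4+), not the exact program behind the numbers above. Profile
it with `perf record`:)

#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 4096;
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;

	for (int i = 0; i < 1000000; i++) {
		memset(p, i, len);		/* fault the page back in */
		madvise(p, len, MADV_PAGEOUT);	/* reclaim it again */
	}
	return 0;
}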

When pages are mapped by multiple processes, or the hardware has more
CPUs, the cost becomes even higher due to the poor scalability of TLB
shootdown.

The same benchmark can result in 16.99% CPU consumption on an ARM64
server with around 100 cores, according to Yicong's test on patch
4/4.

This patchset leverages the existing BATCHED_UNMAP_TLB_FLUSH by:
1. only sending TLBI instructions in the first stage -
	arch_tlbbatch_add_mm()
2. waiting for the completion of the TLBIs with a DSB while doing
	the tlbbatch sync in arch_tlbbatch_flush()
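
As a rough illustration, the two stages could look like this on
ARM64 (a sketch only, not the final patch 4/4 code;
__flush_tlb_page_nosync() stands in for a helper that issues a
broadcast TLBI without the trailing DSB):

static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
					struct mm_struct *mm,
					unsigned long uaddr)
{
	/* Stage 1: broadcast a per-page TLBI, but do not wait for it. */
	__flush_tlb_page_nosync(mm, uaddr);
}

static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
{
	/* Stage 2: a single DSB waits for all previously issued TLBIs. */
	dsb(ish);
}

This way, one DSB covers a whole batch of unmaps instead of one DSB
per page.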
My testing on Snapdragon shows that the patchset removes the overhead
of ptep_clear_flush(). The micro benchmark becomes 5% faster even for
one page mapped by a single process on Snapdragon 888.


-v2:
1. Collected Yicong's test result on kunpeng920 ARM64 server;
2. Removed the redundant vma parameter in arch_tlbbatch_add_mm()
   according to the comments of Peter Zijlstra and Dave Hansen;
3. Added ARCH_HAS_MM_CPUMASK rather than checking whether mm_cpumask
   is empty, according to the comments of Nadav Amit.

Thanks to Yicong, Peter, Dave and Nadav for your testing, reviews
and comments.

-v1:
https://lore.kernel.org/lkml/20220707125242.425242-1-21cnbao@gmail.com/

Barry Song (4):
  Revert "Documentation/features: mark BATCHED_UNMAP_TLB_FLUSH doesn't
    apply to ARM64"
  mm: rmap: Allow platforms without mm_cpumask to defer TLB flush
  mm: rmap: Extend tlbbatch APIs to fit new platforms
  arm64: support batched/deferred tlb shootdown during page reclamation

 Documentation/features/arch-support.txt       |  1 -
 .../features/vm/TLB/arch-support.txt          |  2 +-
 arch/arm/Kconfig                              |  1 +
 arch/arm64/Kconfig                            |  1 +
 arch/arm64/include/asm/tlbbatch.h             | 12 ++++++++++
 arch/arm64/include/asm/tlbflush.h             | 23 +++++++++++++++++--
 arch/loongarch/Kconfig                        |  1 +
 arch/mips/Kconfig                             |  1 +
 arch/openrisc/Kconfig                         |  1 +
 arch/powerpc/Kconfig                          |  1 +
 arch/riscv/Kconfig                            |  1 +
 arch/s390/Kconfig                             |  1 +
 arch/um/Kconfig                               |  1 +
 arch/x86/Kconfig                              |  1 +
 arch/x86/include/asm/tlbflush.h               |  3 ++-
 mm/Kconfig                                    |  3 +++
 mm/rmap.c                                     | 14 +++++++----
 17 files changed, 59 insertions(+), 9 deletions(-)
 create mode 100644 arch/arm64/include/asm/tlbbatch.h

-- 
2.25.1


* [PATCH v2 1/4] Revert "Documentation/features: mark BATCHED_UNMAP_TLB_FLUSH doesn't apply to ARM64"
@ 2022-07-11  3:46   ` Barry Song
  0 siblings, 0 replies; 56+ messages in thread
From: Barry Song @ 2022-07-11  3:46 UTC (permalink / raw)
  To: akpm, linux-mm, linux-arm-kernel, x86, catalin.marinas, will, linux-doc
  Cc: corbet, arnd, linux-kernel, darren, yangyicong, huzhanyuan,
	lipeifeng, zhangshiming, guojian, realmz6, linux-mips, openrisc,
	linuxppc-dev, linux-riscv, linux-s390, Barry Song

From: Barry Song <v-songbaohua@oppo.com>

This reverts commit 6bfef171d0d74cb050112e0e49feb20bfddf7f42.

I was wrong. Though ARM64 has a hardware TLB flush broadcast, the
broadcast is not free and is still expensive.
We still have a good chance to enable batched and deferred TLB
flushes on ARM64 for memory reclamation. A possible way is to only
queue TLBI instructions in the hardware's queue, and to wait for
their completion with a DSB only when the flushes must become
visible. We just need to adapt the existing BATCHED_UNMAP_TLB_FLUSH.

Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 Documentation/features/arch-support.txt        | 1 -
 Documentation/features/vm/TLB/arch-support.txt | 2 +-
 2 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/Documentation/features/arch-support.txt b/Documentation/features/arch-support.txt
index 118ae031840b..d22a1095e661 100644
--- a/Documentation/features/arch-support.txt
+++ b/Documentation/features/arch-support.txt
@@ -8,5 +8,4 @@ The meaning of entries in the tables is:
     | ok |  # feature supported by the architecture
     |TODO|  # feature not yet supported by the architecture
     | .. |  # feature cannot be supported by the hardware
-    | N/A|  # feature doesn't apply to the architecture
 
diff --git a/Documentation/features/vm/TLB/arch-support.txt b/Documentation/features/vm/TLB/arch-support.txt
index 039e4e91ada3..1c009312b9c1 100644
--- a/Documentation/features/vm/TLB/arch-support.txt
+++ b/Documentation/features/vm/TLB/arch-support.txt
@@ -9,7 +9,7 @@
     |       alpha: | TODO |
     |         arc: | TODO |
     |         arm: | TODO |
-    |       arm64: | N/A  |
+    |       arm64: | TODO |
     |        csky: | TODO |
     |     hexagon: | TODO |
     |        ia64: | TODO |
-- 
2.25.1


* [PATCH v2 2/4] mm: rmap: Allow platforms without mm_cpumask to defer TLB flush
@ 2022-07-11  3:46   ` Barry Song
  0 siblings, 0 replies; 56+ messages in thread
From: Barry Song @ 2022-07-11  3:46 UTC (permalink / raw)
  To: akpm, linux-mm, linux-arm-kernel, x86, catalin.marinas, will, linux-doc
  Cc: corbet, arnd, linux-kernel, darren, yangyicong, huzhanyuan,
	lipeifeng, zhangshiming, guojian, realmz6, linux-mips, openrisc,
	linuxppc-dev, linux-riscv, linux-s390, Barry Song

From: Barry Song <v-songbaohua@oppo.com>

Platforms like ARM64 have hardware TLB shootdown broadcast. They
don't maintain mm_cpumask but just send TLBI and the related sync
instructions for TLB flushes. A task's mm_cpumask is normally empty
in this case. Allow deferred TLB flushes on this kind of platform
as well.

Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
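For readability, the resulting should_defer_flush() looks roughly
like this (a sketch of the function after this patch, not part of
the diff below):

static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
{
	bool should_defer = false;

	if (!(flags & TTU_BATCH_FLUSH))
		return false;

#ifndef CONFIG_ARCH_HAS_MM_CPUMASK
	/* No mm_cpumask bookkeeping to consult: always defer and batch. */
	return true;
#endif

	/* If remote CPUs need to be flushed then defer batch the flush */
	if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids)
		should_defer = true;
	put_cpu();

	return should_defer;
}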
 arch/arm/Kconfig       | 1 +
 arch/loongarch/Kconfig | 1 +
 arch/mips/Kconfig      | 1 +
 arch/openrisc/Kconfig  | 1 +
 arch/powerpc/Kconfig   | 1 +
 arch/riscv/Kconfig     | 1 +
 arch/s390/Kconfig      | 1 +
 arch/um/Kconfig        | 1 +
 arch/x86/Kconfig       | 1 +
 mm/Kconfig             | 3 +++
 mm/rmap.c              | 4 ++++
 11 files changed, 16 insertions(+)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 7630ba9cb6cc..25c42747f488 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -13,6 +13,7 @@ config ARM
 	select ARCH_HAS_KEEPINITRD
 	select ARCH_HAS_KCOV
 	select ARCH_HAS_MEMBARRIER_SYNC_CORE
+	select ARCH_HAS_MM_CPUMASK
 	select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
 	select ARCH_HAS_PTE_SPECIAL if ARM_LPAE
 	select ARCH_HAS_PHYS_TO_DMA
diff --git a/arch/loongarch/Kconfig b/arch/loongarch/Kconfig
index 1920d52653b4..4b737c0d17a2 100644
--- a/arch/loongarch/Kconfig
+++ b/arch/loongarch/Kconfig
@@ -7,6 +7,7 @@ config LOONGARCH
 	select ARCH_ENABLE_MEMORY_HOTPLUG
 	select ARCH_ENABLE_MEMORY_HOTREMOVE
 	select ARCH_HAS_ACPI_TABLE_UPGRADE	if ACPI
+	select ARCH_HAS_MM_CPUMASK
 	select ARCH_HAS_PHYS_TO_DMA
 	select ARCH_HAS_PTE_SPECIAL
 	select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST
diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
index db09d45d59ec..1b196acdeca3 100644
--- a/arch/mips/Kconfig
+++ b/arch/mips/Kconfig
@@ -9,6 +9,7 @@ config MIPS
 	select ARCH_HAS_FORTIFY_SOURCE
 	select ARCH_HAS_KCOV
 	select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE if !EVA
+	select ARCH_HAS_MM_CPUMASK
 	select ARCH_HAS_PTE_SPECIAL if !(32BIT && CPU_HAS_RIXI)
 	select ARCH_HAS_STRNCPY_FROM_USER
 	select ARCH_HAS_STRNLEN_USER
diff --git a/arch/openrisc/Kconfig b/arch/openrisc/Kconfig
index e814df4c483c..82483b192f4a 100644
--- a/arch/openrisc/Kconfig
+++ b/arch/openrisc/Kconfig
@@ -9,6 +9,7 @@ config OPENRISC
 	select ARCH_32BIT_OFF_T
 	select ARCH_HAS_DMA_SET_UNCACHED
 	select ARCH_HAS_DMA_CLEAR_UNCACHED
+	select ARCH_HAS_MM_CPUMASK
 	select ARCH_HAS_SYNC_DMA_FOR_DEVICE
 	select COMMON_CLK
 	select OF
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index c2ce2e60c8f0..19061ffe73a0 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -127,6 +127,7 @@ config PPC
 	select ARCH_HAS_MEMBARRIER_SYNC_CORE
 	select ARCH_HAS_MEMREMAP_COMPAT_ALIGN	if PPC_64S_HASH_MMU
 	select ARCH_HAS_MMIOWB			if PPC64
+	select ARCH_HAS_MM_CPUMASK
 	select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
 	select ARCH_HAS_PHYS_TO_DMA
 	select ARCH_HAS_PMEM_API
diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index c22f58155948..7570c95a9cc8 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -25,6 +25,7 @@ config RISCV
 	select ARCH_HAS_GIGANTIC_PAGE
 	select ARCH_HAS_KCOV
 	select ARCH_HAS_MMIOWB
+	select ARCH_HAS_MM_CPUMASK
 	select ARCH_HAS_PTE_SPECIAL
 	select ARCH_HAS_SET_DIRECT_MAP if MMU
 	select ARCH_HAS_SET_MEMORY if MMU
diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index 91c0b80a8bf0..48d91fa05bab 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -73,6 +73,7 @@ config S390
 	select ARCH_HAS_GIGANTIC_PAGE
 	select ARCH_HAS_KCOV
 	select ARCH_HAS_MEM_ENCRYPT
+	select ARCH_HAS_MM_CPUMASK
 	select ARCH_HAS_PTE_SPECIAL
 	select ARCH_HAS_SCALED_CPUTIME
 	select ARCH_HAS_SET_MEMORY
diff --git a/arch/um/Kconfig b/arch/um/Kconfig
index 4ec22e156a2e..df29c729267b 100644
--- a/arch/um/Kconfig
+++ b/arch/um/Kconfig
@@ -8,6 +8,7 @@ config UML
 	select ARCH_EPHEMERAL_INODES
 	select ARCH_HAS_GCOV_PROFILE_ALL
 	select ARCH_HAS_KCOV
+	select ARCH_HAS_MM_CPUMASK
 	select ARCH_HAS_STRNCPY_FROM_USER
 	select ARCH_HAS_STRNLEN_USER
 	select ARCH_NO_PREEMPT
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index be0b95e51df6..a91d73866238 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -81,6 +81,7 @@ config X86
 	select ARCH_HAS_KCOV			if X86_64
 	select ARCH_HAS_MEM_ENCRYPT
 	select ARCH_HAS_MEMBARRIER_SYNC_CORE
+	select ARCH_HAS_MM_CPUMASK
 	select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
 	select ARCH_HAS_PMEM_API		if X86_64
 	select ARCH_HAS_PTE_DEVMAP		if X86_64
diff --git a/mm/Kconfig b/mm/Kconfig
index 169e64192e48..7bf54f57ca01 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -951,6 +951,9 @@ config ARCH_HAS_CURRENT_STACK_POINTER
 	  register alias named "current_stack_pointer", this config can be
 	  selected.
 
+config ARCH_HAS_MM_CPUMASK
+	bool
+
 config ARCH_HAS_VM_GET_PAGE_PROT
 	bool
 
diff --git a/mm/rmap.c b/mm/rmap.c
index 5bcb334cd6f2..13d4f9a1d4f1 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -692,6 +692,10 @@ static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
 	if (!(flags & TTU_BATCH_FLUSH))
 		return false;
 
+#ifndef CONFIG_ARCH_HAS_MM_CPUMASK
+	return true;
+#endif
+
 	/* If remote CPUs need to be flushed then defer batch the flush */
 	if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids)
 		should_defer = true;
-- 
2.25.1


* [PATCH v2 3/4] mm: rmap: Extend tlbbatch APIs to fit new platforms
@ 2022-07-11  3:46   ` Barry Song
  0 siblings, 0 replies; 56+ messages in thread
From: Barry Song @ 2022-07-11  3:46 UTC (permalink / raw)
  To: akpm, linux-mm, linux-arm-kernel, x86, catalin.marinas, will, linux-doc
  Cc: corbet, arnd, linux-kernel, darren, yangyicong, huzhanyuan,
	lipeifeng, zhangshiming, guojian, realmz6, linux-mips, openrisc,
	linuxppc-dev, linux-riscv, linux-s390, Barry Song,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Nadav Amit, Mel Gorman

From: Barry Song <v-songbaohua@oppo.com>

Add uaddr to the tlbbatch APIs so that platforms like ARM64 can
apply them to their specific hardware features. For ARM64, this
could mean sending a TLBI for the page at this particular uaddr
into the hardware queue.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
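How the new uaddr flows through the batching path (a sketch based on
this diff; the per-address consumer on ARM64 arrives in patch 4/4,
while x86 simply ignores the new parameter):

  try_to_unmap_one(folio, vma, ...)
    -> set_tlb_ubc_flush_pending(mm, pte_dirty(pteval), address)
       -> arch_tlbbatch_add_mm(&tlb_ubc->arch, mm, uaddr)
  ... later, when reclaim must observe the unmaps:
  try_to_unmap_flush()
    -> arch_tlbbatch_flush(&tlb_ubc->arch)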
 arch/x86/include/asm/tlbflush.h |  3 ++-
 mm/rmap.c                       | 10 ++++++----
 2 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 4af5579c7ef7..1b32f4b999c7 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -251,7 +251,8 @@ static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
 }
 
 static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
-					struct mm_struct *mm)
+					struct mm_struct *mm,
+					unsigned long uaddr)
 {
 	inc_mm_tlb_gen(mm);
 	cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
diff --git a/mm/rmap.c b/mm/rmap.c
index 13d4f9a1d4f1..a52381a680db 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -642,12 +642,13 @@ void try_to_unmap_flush_dirty(void)
 #define TLB_FLUSH_BATCH_PENDING_LARGE			\
 	(TLB_FLUSH_BATCH_PENDING_MASK / 2)
 
-static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
+static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable,
+				      unsigned long uaddr)
 {
 	struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
 	int batch, nbatch;
 
-	arch_tlbbatch_add_mm(&tlb_ubc->arch, mm);
+	arch_tlbbatch_add_mm(&tlb_ubc->arch, mm, uaddr);
 	tlb_ubc->flush_required = true;
 
 	/*
@@ -736,7 +737,8 @@ void flush_tlb_batched_pending(struct mm_struct *mm)
 	}
 }
 #else
-static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
+static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable,
+				      unsigned long uaddr)
 {
 }
 
@@ -1599,7 +1601,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 				 */
 				pteval = ptep_get_and_clear(mm, address, pvmw.pte);
 
-				set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
+				set_tlb_ubc_flush_pending(mm, pte_dirty(pteval), address);
 			} else {
 				pteval = ptep_clear_flush(vma, address, pvmw.pte);
 			}
-- 
2.25.1


* [PATCH v2 4/4] arm64: support batched/deferred tlb shootdown during page reclamation
  2022-07-11  3:46 ` Barry Song
@ 2022-07-11  3:46   ` Barry Song
  -1 siblings, 0 replies; 56+ messages in thread
From: Barry Song @ 2022-07-11  3:46 UTC (permalink / raw)
  To: akpm, linux-mm, linux-arm-kernel, x86, catalin.marinas, will, linux-doc
  Cc: corbet, arnd, linux-kernel, darren, yangyicong, huzhanyuan,
	lipeifeng, zhangshiming, guojian, realmz6, linux-mips, openrisc,
	linuxppc-dev, linux-riscv, linux-s390, Barry Song, Nadav Amit,
	Mel Gorman

From: Barry Song <v-songbaohua@oppo.com>

On x86, batched and deferred tlb shootdown has led to a 90%
performance increase on tlb shootdown. On arm64, HW can do
tlb shootdown without software IPI, but the synchronous tlbi
is still quite expensive.

Even running a trivial program which triggers swapout can
prove this is true:
 #include <sys/types.h>
 #include <unistd.h>
 #include <sys/mman.h>
 #include <string.h>

 int main()
 {
 #define SIZE (1 * 1024 * 1024)
         volatile unsigned char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                                          MAP_SHARED | MAP_ANONYMOUS, -1, 0);

         memset(p, 0x88, SIZE);

         for (int k = 0; k < 10000; k++) {
                 /* swap in */
                 for (int i = 0; i < SIZE; i += 4096) {
                         (void)p[i];
                 }

                 /* swap out */
                 madvise(p, SIZE, MADV_PAGEOUT);
         }
 }

Perf result on snapdragon 888 with 8 cores by using zRAM
as the swap block device.

 ~ # perf record taskset -c 4 ./a.out
 [ perf record: Woken up 10 times to write data ]
 [ perf record: Captured and wrote 2.297 MB perf.data (60084 samples) ]
 ~ # perf report
 # To display the perf.data header info, please use --header/--header-only options.
 #
 #
 # Total Lost Samples: 0
 #
 # Samples: 60K of event 'cycles'
 # Event count (approx.): 35706225414
 #
 # Overhead  Command  Shared Object      Symbol
 # ........  .......  .................  .............................................................................
 #
    21.07%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock_irq
     8.23%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
     6.67%  a.out    [kernel.kallsyms]  [k] filemap_map_pages
     6.16%  a.out    [kernel.kallsyms]  [k] __zram_bvec_write
     5.36%  a.out    [kernel.kallsyms]  [k] ptep_clear_flush
     3.71%  a.out    [kernel.kallsyms]  [k] _raw_spin_lock
     3.49%  a.out    [kernel.kallsyms]  [k] memset64
     1.63%  a.out    [kernel.kallsyms]  [k] clear_page
     1.42%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock
     1.26%  a.out    [kernel.kallsyms]  [k] mod_zone_state.llvm.8525150236079521930
     1.23%  a.out    [kernel.kallsyms]  [k] xas_load
     1.15%  a.out    [kernel.kallsyms]  [k] zram_slot_lock

ptep_clear_flush() takes 5.36% of CPU time in this micro-benchmark,
which swaps in/out a page mapped by only one process. If the
page is mapped by multiple processes (more than 100 mappings
per shared page is typical on a phone), the overhead would be
much higher, as we have to run the tlb flush 100 times for one
single page. Plus, tlb flush overhead grows with the number
of CPU cores due to the bad scalability of tlb shootdown
in HW, so ARM64 servers should expect much higher
overhead.

Further perf annotate shows that 95% of the CPU time of
ptep_clear_flush() is actually spent in the final dsb() waiting
for the completion of the tlb flush. This gives us a very good
chance to leverage the existing batched tlb machinery in the
kernel: the minimal modification is to send only asynchronous
tlbi in the first stage, and to issue the dsb in the second
stage when we have to synchronize.
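
To make the two stages concrete, here is a simplified sketch of
the flow, condensed from mm/rmap.c and the arm64 hunks below; the
function names are illustrative, and locking, statistics and error
paths are omitted, so read it as an outline rather than the
literal kernel code:

 /* stage 1: called for each pte unmapped during reclaim */
 static void stage1_queue_flush(struct mm_struct *mm, unsigned long uaddr)
 {
 	struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;

 	/* async tlbi broadcast for this address, no waiting */
 	arch_tlbbatch_add_mm(&tlb_ubc->arch, mm, uaddr);
 	tlb_ubc->flush_required = true;
 }

 /* stage 2: called once per reclaim batch, before freeing the pages */
 static void stage2_sync_flush(void)
 {
 	struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;

 	if (tlb_ubc->flush_required) {
 		arch_tlbbatch_flush(&tlb_ubc->arch);	/* one dsb(ish) */
 		tlb_ubc->flush_required = false;
 	}
 }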

With the above micro benchmark, the elapsed time to finish the
program decreases by around 5%.

Typical elapsed time w/o patch:
 ~ # time taskset -c 4 ./a.out
 0.21user 14.34system 0:14.69elapsed
w/ patch:
 ~ # time taskset -c 4 ./a.out
 0.22user 13.45system 0:13.80elapsed

Also, Yicong Yang added the following observation.
	Tested with the benchmark in the commit on a Kunpeng920 arm64
	server, and observed an improvement of around 12.5% with the
	command `time ./swap_bench`.
		w/o		w/
	real	0m13.460s	0m11.771s
	user	0m0.248s	0m0.279s
	sys	0m12.039s	0m11.458s

	Originally a 16.99% overhead of ptep_clear_flush() was noticed,
	which has been eliminated by this patch:

	[root@localhost yang]# perf record -- ./swap_bench && perf report
	[...]
	16.99%  swap_bench  [kernel.kallsyms]  [k] ptep_clear_flush

Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Nadav Amit <namit@vmware.com>
Cc: Mel Gorman <mgorman@suse.de>
Tested-by: Yicong Yang <yangyicong@hisilicon.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 .../features/vm/TLB/arch-support.txt          |  2 +-
 arch/arm64/Kconfig                            |  1 +
 arch/arm64/include/asm/tlbbatch.h             | 12 ++++++++++
 arch/arm64/include/asm/tlbflush.h             | 23 +++++++++++++++++--
 4 files changed, 35 insertions(+), 3 deletions(-)
 create mode 100644 arch/arm64/include/asm/tlbbatch.h

diff --git a/Documentation/features/vm/TLB/arch-support.txt b/Documentation/features/vm/TLB/arch-support.txt
index 1c009312b9c1..2caf815d7c6c 100644
--- a/Documentation/features/vm/TLB/arch-support.txt
+++ b/Documentation/features/vm/TLB/arch-support.txt
@@ -9,7 +9,7 @@
     |       alpha: | TODO |
     |         arc: | TODO |
     |         arm: | TODO |
-    |       arm64: | TODO |
+    |       arm64: |  ok  |
     |        csky: | TODO |
     |     hexagon: | TODO |
     |        ia64: | TODO |
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 1652a9800ebe..e94913a0b040 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -93,6 +93,7 @@ config ARM64
 	select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
 	select ARCH_SUPPORTS_NUMA_BALANCING
 	select ARCH_SUPPORTS_PAGE_TABLE_CHECK
+	select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
 	select ARCH_WANT_COMPAT_IPC_PARSE_VERSION if COMPAT
 	select ARCH_WANT_DEFAULT_BPF_JIT
 	select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
diff --git a/arch/arm64/include/asm/tlbbatch.h b/arch/arm64/include/asm/tlbbatch.h
new file mode 100644
index 000000000000..fedb0b87b8db
--- /dev/null
+++ b/arch/arm64/include/asm/tlbbatch.h
@@ -0,0 +1,12 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ARCH_ARM64_TLBBATCH_H
+#define _ARCH_ARM64_TLBBATCH_H
+
+struct arch_tlbflush_unmap_batch {
+	/*
+	 * For arm64, HW can do tlb shootdown, so we don't
+	 * need to record cpumask for sending IPI
+	 */
+};
+
+#endif /* _ARCH_ARM64_TLBBATCH_H */
diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index 412a3b9a3c25..10364cf8451d 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -254,17 +254,24 @@ static inline void flush_tlb_mm(struct mm_struct *mm)
 	dsb(ish);
 }
 
-static inline void flush_tlb_page_nosync(struct vm_area_struct *vma,
+
+static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
 					 unsigned long uaddr)
 {
 	unsigned long addr;
 
 	dsb(ishst);
-	addr = __TLBI_VADDR(uaddr, ASID(vma->vm_mm));
+	addr = __TLBI_VADDR(uaddr, ASID(mm));
 	__tlbi(vale1is, addr);
 	__tlbi_user(vale1is, addr);
 }
 
+static inline void flush_tlb_page_nosync(struct vm_area_struct *vma,
+					 unsigned long uaddr)
+{
+	return __flush_tlb_page_nosync(vma->vm_mm, uaddr);
+}
+
 static inline void flush_tlb_page(struct vm_area_struct *vma,
 				  unsigned long uaddr)
 {
@@ -272,6 +279,18 @@ static inline void flush_tlb_page(struct vm_area_struct *vma,
 	dsb(ish);
 }
 
+static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
+					struct mm_struct *mm,
+					unsigned long uaddr)
+{
+	__flush_tlb_page_nosync(mm, uaddr);
+}
+
+static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
+{
+	dsb(ish);
+}
+
 /*
  * This is meant to avoid soft lock-ups on large TLB flushing ranges and not
  * necessarily a performance improvement.
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 2/4] mm: rmap: Allow platforms without mm_cpumask to defer TLB flush
  2022-07-11  3:46   ` Barry Song
@ 2022-07-11 13:35     ` Kefeng Wang
  -1 siblings, 0 replies; 56+ messages in thread
From: Kefeng Wang @ 2022-07-11 13:35 UTC (permalink / raw)
  To: Barry Song, akpm, linux-mm, linux-arm-kernel, x86,
	catalin.marinas, will, linux-doc
  Cc: corbet, arnd, linux-kernel, darren, yangyicong, huzhanyuan,
	lipeifeng, zhangshiming, guojian, realmz6, linux-mips, openrisc,
	linuxppc-dev, linux-riscv, linux-s390, Barry Song

Hi Barry,

On 2022/7/11 11:46, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
>
> Platforms like ARM64 have hardware TLB shootdown broadcast. They
> don't maintain mm_cpumask but just send tlbi and related sync
> instructions for TLB flush. A task's mm_cpumask is normally empty
> in this case. We also allow deferred TLB flush on this kind of
> platform.
>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
...
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 169e64192e48..7bf54f57ca01 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -951,6 +951,9 @@ config ARCH_HAS_CURRENT_STACK_POINTER
>   	  register alias named "current_stack_pointer", this config can be
>   	  selected.
>   
> +config ARCH_HAS_MM_CPUMASK
> +	bool
> +
>   config ARCH_HAS_VM_GET_PAGE_PROT
>   	bool
>   
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 5bcb334cd6f2..13d4f9a1d4f1 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -692,6 +692,10 @@ static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
>   	if (!(flags & TTU_BATCH_FLUSH))
>   		return false;
>   
> +#ifndef CONFIG_ARCH_HAS_MM_CPUMASK
> +	return true;
> +#endif
> +

Here is another option to enable an arch-specific tlbbatch defer:

[1] 
https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20171101101735.2318-2-khandual@linux.vnet.ibm.com/

>   	/* If remote CPUs need to be flushed then defer batch the flush */
>   	if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids)
>   		should_defer = true;

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 2/4] mm: rmap: Allow platforms without mm_cpumask to defer TLB flush
  2022-07-11 13:35     ` Kefeng Wang
@ 2022-07-11 22:52       ` Barry Song
  -1 siblings, 0 replies; 56+ messages in thread
From: Barry Song @ 2022-07-11 22:52 UTC (permalink / raw)
  To: Kefeng Wang
  Cc: Andrew Morton, Linux-MM, LAK, x86, Catalin Marinas, Will Deacon,
	Linux Doc Mailing List, Jonathan Corbet, Arnd Bergmann, LKML,
	Darren Hart, Yicong Yang, huzhanyuan,
	李培锋(wink),
	张诗明(Simon Zhang), 郭健,
	real mz, linux-mips, openrisc, linuxppc-dev, linux-riscv,
	linux-s390, Barry Song

On Tue, Jul 12, 2022 at 1:35 AM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>
> Hi Barry,
>
> On 2022/7/11 11:46, Barry Song wrote:
> > From: Barry Song <v-songbaohua@oppo.com>
> >
> > Platforms like ARM64 have hareware TLB shootdown broadcast. They
> > don't maintain mm_cpumask but just send tlbi and related sync
> > instructions for TLB flush. task's mm_cpumask is normally empty
> > in this case. We also allow deferred TLB flush on this kind of
> > platforms.
> >
> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>>
> > ---
> ...
> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index 169e64192e48..7bf54f57ca01 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -951,6 +951,9 @@ config ARCH_HAS_CURRENT_STACK_POINTER
> >         register alias named "current_stack_pointer", this config can be
> >         selected.
> >
> > +config ARCH_HAS_MM_CPUMASK
> > +     bool
> > +
> >   config ARCH_HAS_VM_GET_PAGE_PROT
> >       bool
> >
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index 5bcb334cd6f2..13d4f9a1d4f1 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -692,6 +692,10 @@ static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
> >       if (!(flags & TTU_BATCH_FLUSH))
> >               return false;
> >
> > +#ifndef CONFIG_ARCH_HAS_MM_CPUMASK
> > +     return true;
> > +#endif
> > +
>
> Here is another option to enable arch's tlbbatch defer
>

This option is even better than simply having ARCH_HAS_MM_CPUMASK,
since the arch might make decisions based on specific hardware
characteristics. For example,
https://lists.ozlabs.org/pipermail/linuxppc-dev/2017-November/165468.html

+bool arch_tlbbatch_should_defer(struct mm_struct *mm)
+{
+     if (!radix_enabled() || cpu_has_feature(CPU_FTR_POWER9_DD1))
+         return false;
+
+     if (!mm_is_thread_local(mm))
+         return true;
+
+     return false;
+}

In this case, having MM_CPUMASK doesn't necessarily mean tlbbatch is needed.
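
For the generic side, should_defer_flush() could then fall back to
the existing cpumask check when an arch doesn't provide the hook.
A minimal sketch, assuming the naming of the linked patch (this is
hypothetical, not code from this series):

 #ifndef arch_tlbbatch_should_defer
 static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
 {
 	bool should_defer;

 	/* default: defer only if remote CPUs would need to be flushed */
 	should_defer = cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids;
 	put_cpu();

 	return should_defer;
 }
 #endif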

> [1]
> https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20171101101735.2318-2-khandual@linux.vnet.ibm.com/
>
> >       /* If remote CPUs need to be flushed then defer batch the flush */
> >       if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids)
> >               should_defer = true;

Thanks
Barry

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 0/4] mm: arm64: bring up BATCHED_UNMAP_TLB_FLUSH
  2022-07-11  3:46 ` Barry Song
@ 2022-07-14  3:28   ` Xin Hao
  -1 siblings, 0 replies; 56+ messages in thread
From: Xin Hao @ 2022-07-14  3:28 UTC (permalink / raw)
  To: Barry Song, akpm, linux-mm, linux-arm-kernel, x86,
	catalin.marinas, will, linux-doc
  Cc: corbet, arnd, linux-kernel, darren, yangyicong, huzhanyuan,
	lipeifeng, zhangshiming, guojian, realmz6, linux-mips, openrisc,
	linuxppc-dev, linux-riscv, linux-s390

Hi Barry,

I did some tests on a Kunpeng arm64 machine using UnixBench.

The test results are as below.

With one core, we can see a performance improvement above +30%.
./Run -c 1 -i 1 shell1
w/o
System Benchmarks Partial Index              BASELINE RESULT INDEX
Shell Scripts (1 concurrent)                     42.4 5481.0 1292.7
========
System Benchmarks Index Score (Partial Only)                         1292.7

w/
System Benchmarks Partial Index              BASELINE RESULT INDEX
Shell Scripts (1 concurrent)                     42.4 6974.6 1645.0
========
System Benchmarks Index Score (Partial Only)                         1645.0


But with all cores, there is a small performance degradation of around -5%.

./Run -c 96 -i 1 shell1
w/o
Shell Scripts (1 concurrent)                  80765.5 lpm   (60.0 s, 1 samples)
System Benchmarks Partial Index              BASELINE RESULT INDEX
Shell Scripts (1 concurrent)                     42.4 80765.5 19048.5
========
System Benchmarks Index Score (Partial Only)                        19048.5

w
Shell Scripts (1 concurrent)                  76333.6 lpm   (60.0 s, 1 samples)
System Benchmarks Partial Index              BASELINE RESULT INDEX
Shell Scripts (1 concurrent)                     42.4 76333.6 18003.2
========
System Benchmarks Index Score (Partial Only)                        18003.2

---------------------------------------------------------------------------------------------- 


After discussing with you, I made the following change to the patch:

index a52381a680db..1ecba81f1277 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -727,7 +727,11 @@ void flush_tlb_batched_pending(struct mm_struct *mm)
         int flushed = batch >> TLB_FLUSH_BATCH_FLUSHED_SHIFT;

         if (pending != flushed) {
+#ifdef CONFIG_ARCH_HAS_MM_CPUMASK
                 flush_tlb_mm(mm);
+#else
+               dsb(ish);
+#endif
                 /*
                  * If the new TLB flushing is pending during flushing, leave
                  * mm->tlb_flush_batched as is, to avoid losing flushing.

There is a performance improvement with all cores, above +30%.

./Run -c 96 -i 1 shell1
96 CPUs in system; running 96 parallel copies of tests

Shell Scripts (1 concurrent)                 109229.0 lpm   (60.0 s, 1 samples)
System Benchmarks Partial Index              BASELINE       RESULT    INDEX
Shell Scripts (1 concurrent)                     42.4     109229.0  25761.6
                                                                    ========
System Benchmarks Index Score (Partial Only)                        25761.6


Tested-by: Xin Hao <xhao@linux.alibaba.com>

Looking forward to your next version of the patch.

On 7/11/22 11:46 AM, Barry Song wrote:
> Though ARM64 has the hardware to do tlb shootdown, the hardware
> broadcasting is not free.
> A simplest micro benchmark shows even on snapdragon 888 with only
> 8 cores, the overhead for ptep_clear_flush is huge even for paging
> out one page mapped by only one process:
> 5.36%  a.out    [kernel.kallsyms]  [k] ptep_clear_flush
>
> While pages are mapped by multiple processes or HW has more CPUs,
> the cost should become even higher due to the bad scalability of
> tlb shootdown.
>
> The same benchmark can result in 16.99% CPU consumption on ARM64
> server with around 100 cores according to Yicong's test on patch
> 4/4.
>
> This patchset leverages the existing BATCHED_UNMAP_TLB_FLUSH by
> 1. only send tlbi instructions in the first stage -
> 	arch_tlbbatch_add_mm()
> 2. wait for the completion of tlbi by dsb while doing tlbbatch
> 	sync in arch_tlbbatch_flush()
> My testing on snapdragon shows the overhead of ptep_clear_flush
> is removed by the patchset. The micro benchmark becomes 5% faster
> even for one page mapped by single process on snapdragon 888.
>
>
> -v2:
> 1. Collected Yicong's test result on kunpeng920 ARM64 server;
> 2. Removed the redundant vma parameter in arch_tlbbatch_add_mm()
>     according to the comments of Peter Zijlstra and Dave Hansen
> 3. Added ARCH_HAS_MM_CPUMASK rather than checking if mm_cpumask
>     is empty according to the comments of Nadav Amit
>
> Thanks, Yicong, Peter, Dave and Nadav for your testing or reviewing
> , and comments.
>
> -v1:
> https://lore.kernel.org/lkml/20220707125242.425242-1-21cnbao@gmail.com/
>
> Barry Song (4):
>    Revert "Documentation/features: mark BATCHED_UNMAP_TLB_FLUSH doesn't
>      apply to ARM64"
>    mm: rmap: Allow platforms without mm_cpumask to defer TLB flush
>    mm: rmap: Extend tlbbatch APIs to fit new platforms
>    arm64: support batched/deferred tlb shootdown during page reclamation
>
>   Documentation/features/arch-support.txt       |  1 -
>   .../features/vm/TLB/arch-support.txt          |  2 +-
>   arch/arm/Kconfig                              |  1 +
>   arch/arm64/Kconfig                            |  1 +
>   arch/arm64/include/asm/tlbbatch.h             | 12 ++++++++++
>   arch/arm64/include/asm/tlbflush.h             | 23 +++++++++++++++++--
>   arch/loongarch/Kconfig                        |  1 +
>   arch/mips/Kconfig                             |  1 +
>   arch/openrisc/Kconfig                         |  1 +
>   arch/powerpc/Kconfig                          |  1 +
>   arch/riscv/Kconfig                            |  1 +
>   arch/s390/Kconfig                             |  1 +
>   arch/um/Kconfig                               |  1 +
>   arch/x86/Kconfig                              |  1 +
>   arch/x86/include/asm/tlbflush.h               |  3 ++-
>   mm/Kconfig                                    |  3 +++
>   mm/rmap.c                                     | 14 +++++++----
>   17 files changed, 59 insertions(+), 9 deletions(-)
>   create mode 100644 arch/arm64/include/asm/tlbbatch.h
>
-- 
Best Regards!
Xin Hao


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 0/4] mm: arm64: bring up BATCHED_UNMAP_TLB_FLUSH
  2022-07-14  3:28   ` Xin Hao
@ 2022-07-14  4:51     ` Barry Song
  -1 siblings, 0 replies; 56+ messages in thread
From: Barry Song @ 2022-07-14  4:51 UTC (permalink / raw)
  To: xhao
  Cc: Andrew Morton, Linux-MM, LAK, x86, Catalin Marinas, Will Deacon,
	Linux Doc Mailing List, Jonathan Corbet, Arnd Bergmann, LKML,
	Darren Hart, Yicong Yang, huzhanyuan,
	李培锋(wink),
	张诗明(Simon Zhang), 郭健,
	real mz, linux-mips, openrisc, linuxppc-dev, linux-riscv,
	linux-s390

On Thu, Jul 14, 2022 at 3:29 PM Xin Hao <xhao@linux.alibaba.com> wrote:
>
> Hi Barry,
>
> I did some tests on a Kunpeng arm64 machine using UnixBench.
>
> The test results are as below.
>
> With one core, we can see a performance improvement above +30%.

I am really pleased to see the 30%+ improvement on UnixBench on a single core.

> ./Run -c 1 -i 1 shell1
> w/o
> System Benchmarks Partial Index              BASELINE RESULT INDEX
> Shell Scripts (1 concurrent)                     42.4 5481.0 1292.7
> ========
> System Benchmarks Index Score (Partial Only)                         1292.7
>
> w/
> System Benchmarks Partial Index              BASELINE RESULT INDEX
> Shell Scripts (1 concurrent)                     42.4 6974.6 1645.0
> ========
> System Benchmarks Index Score (Partial Only)                         1645.0
>
>
> But with all cores, there is a small performance degradation of around -5%.

That is sad, as we might get more concurrency between mprotect(),
madvise(), mremap(), zap_pte_range() and the deferred tlbi.
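
Those paths interact with the deferred tlbi through
flush_tlb_batched_pending() (declared in mm/internal.h). Roughly,
each of them does something like the sketch below before touching
PTEs, which is where the extra serialisation comes from (a simplified
illustration from my reading of the tree, not exact upstream code):

/*
 * Any path that is about to read or rewrite PTEs must first force a
 * flush that reclaim (try_to_unmap_one()) has queued but not yet
 * completed for this mm, otherwise a stale TLB entry could outlive
 * the PTE update.
 */
static void change_ptes_sketch(struct vm_area_struct *vma)
{
        flush_tlb_batched_pending(vma->vm_mm);

        /* ... now safe to read and rewrite PTEs in this range ... */
}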

>
> ./Run -c 96 -i 1 shell1
> w/o
> Shell Scripts (1 concurrent)                  80765.5 lpm   (60.0 s, 1 samples)
> System Benchmarks Partial Index              BASELINE RESULT INDEX
> Shell Scripts (1 concurrent)                     42.4 80765.5 19048.5
> ========
> System Benchmarks Index Score (Partial Only)                        19048.5
>
> w
> Shell Scripts (1 concurrent)                  76333.6 lpm   (60.0 s, 1 samples)
> System Benchmarks Partial Index              BASELINE RESULT INDEX
> Shell Scripts (1 concurrent)                     42.4 76333.6 18003.2
> ========
> System Benchmarks Index Score (Partial Only)                        18003.2
>
> ----------------------------------------------------------------------------------------------
>
>
> After discussing with you, I made the following change to the patch:
>
> index a52381a680db..1ecba81f1277 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -727,7 +727,11 @@ void flush_tlb_batched_pending(struct mm_struct *mm)
>          int flushed = batch >> TLB_FLUSH_BATCH_FLUSHED_SHIFT;
>
>          if (pending != flushed) {
> +#ifdef CONFIG_ARCH_HAS_MM_CPUMASK
>                  flush_tlb_mm(mm);
> +#else
> +               dsb(ish);
> +#endif
>

I was guessing the problem might be flush_tlb_batched_pending(),
so I asked you to change this to verify my guess.

>                  /*
>                   * If the new TLB flushing is pending during flushing, leave
>                   * mm->tlb_flush_batched as is, to avoid losing flushing.
>
> There is a performance improvement with all cores, above +30%.

But I don't think it is a proper patch. There is no guarantee that the
CPU calling flush_tlb_batched_pending() is exactly the CPU that sent
the deferred tlbi, so the solution is unsafe. But since this temporary
code brings the 30%+ performance improvement back for high concurrency,
we have huge potential to finally make it work.
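
If someone wants to pursue this, the direction I'd suggest is an arch
hook rather than an #ifdef in generic code. A rough, untested sketch
(the hook name and the conservative arm64 fallback are placeholders,
not part of this series):

/* mm/rmap.c */
void flush_tlb_batched_pending(struct mm_struct *mm)
{
        int batch = atomic_read(&mm->tlb_flush_batched);
        int pending = batch & TLB_FLUSH_BATCH_PENDING_MASK;
        int flushed = batch >> TLB_FLUSH_BATCH_FLUSHED_SHIFT;

        if (pending != flushed) {
                arch_flush_tlb_batched_pending(mm);
                /*
                 * If new TLB flushing became pending during flushing,
                 * leave mm->tlb_flush_batched as is to avoid losing it.
                 */
                atomic_cmpxchg(&mm->tlb_flush_batched, batch,
                               pending | (pending << TLB_FLUSH_BATCH_FLUSHED_SHIFT));
        }
}

/*
 * arch/arm64: a broadcast per-ASID flush is correct no matter which
 * CPU queued the deferred tlbi; whether something cheaper (e.g. a
 * bare barrier) can be made safe is exactly what needs auditing.
 */
static inline void arch_flush_tlb_batched_pending(struct mm_struct *mm)
{
        flush_tlb_mm(mm);
}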

Unfortunately I don't have an arm64 server to debug this on. I only have
8 cores, which are unlikely to reproduce the regression that happens
under high concurrency with 96 parallel tasks.

So I'd ask whether @yicong or someone else working on Kunpeng or other
arm64 servers is able to actually debug this and figure out a proper
patch, which could then be added as 5/5 to this series.

>
> ./Run -c 96 -i 1 shell1
> 96 CPUs in system; running 96 parallel copies of tests
>
> Shell Scripts (1 concurrent)                 109229.0 lpm   (60.0 s, 1 samples)
> System Benchmarks Partial Index              BASELINE       RESULT    INDEX
> Shell Scripts (1 concurrent)                     42.4     109229.0  25761.6
>                                                                     ========
> System Benchmarks Index Score (Partial Only)                        25761.6
>
>
> Tested-by: Xin Hao <xhao@linux.alibaba.com>

Thanks for your testing!

>
> Looking forward to your next version of the patch.
>
> On 7/11/22 11:46 AM, Barry Song wrote:
> > Though ARM64 has the hardware to do tlb shootdown, the hardware
> > broadcasting is not free.
> > A simplest micro benchmark shows even on snapdragon 888 with only
> > 8 cores, the overhead for ptep_clear_flush is huge even for paging
> > out one page mapped by only one process:
> > 5.36%  a.out    [kernel.kallsyms]  [k] ptep_clear_flush
> >
> > While pages are mapped by multiple processes or HW has more CPUs,
> > the cost should become even higher due to the bad scalability of
> > tlb shootdown.
> >
> > The same benchmark can result in 16.99% CPU consumption on ARM64
> > server with around 100 cores according to Yicong's test on patch
> > 4/4.
> >
> > This patchset leverages the existing BATCHED_UNMAP_TLB_FLUSH by
> > 1. only send tlbi instructions in the first stage -
> >       arch_tlbbatch_add_mm()
> > 2. wait for the completion of tlbi by dsb while doing tlbbatch
> >       sync in arch_tlbbatch_flush()
> > My testing on snapdragon shows the overhead of ptep_clear_flush
> > is removed by the patchset. The micro benchmark becomes 5% faster
> > even for one page mapped by single process on snapdragon 888.
> >
> >
> > -v2:
> > 1. Collected Yicong's test result on kunpeng920 ARM64 server;
> > 2. Removed the redundant vma parameter in arch_tlbbatch_add_mm()
> >     according to the comments of Peter Zijlstra and Dave Hansen
> > 3. Added ARCH_HAS_MM_CPUMASK rather than checking if mm_cpumask
> >     is empty according to the comments of Nadav Amit
> >
> > Thanks, Yicong, Peter, Dave and Nadav for your testing or reviewing
> > , and comments.
> >
> > -v1:
> > https://lore.kernel.org/lkml/20220707125242.425242-1-21cnbao@gmail.com/
> >
> > Barry Song (4):
> >    Revert "Documentation/features: mark BATCHED_UNMAP_TLB_FLUSH doesn't
> >      apply to ARM64"
> >    mm: rmap: Allow platforms without mm_cpumask to defer TLB flush
> >    mm: rmap: Extend tlbbatch APIs to fit new platforms
> >    arm64: support batched/deferred tlb shootdown during page reclamation
> >
> >   Documentation/features/arch-support.txt       |  1 -
> >   .../features/vm/TLB/arch-support.txt          |  2 +-
> >   arch/arm/Kconfig                              |  1 +
> >   arch/arm64/Kconfig                            |  1 +
> >   arch/arm64/include/asm/tlbbatch.h             | 12 ++++++++++
> >   arch/arm64/include/asm/tlbflush.h             | 23 +++++++++++++++++--
> >   arch/loongarch/Kconfig                        |  1 +
> >   arch/mips/Kconfig                             |  1 +
> >   arch/openrisc/Kconfig                         |  1 +
> >   arch/powerpc/Kconfig                          |  1 +
> >   arch/riscv/Kconfig                            |  1 +
> >   arch/s390/Kconfig                             |  1 +
> >   arch/um/Kconfig                               |  1 +
> >   arch/x86/Kconfig                              |  1 +
> >   arch/x86/include/asm/tlbflush.h               |  3 ++-
> >   mm/Kconfig                                    |  3 +++
> >   mm/rmap.c                                     | 14 +++++++----
> >   17 files changed, 59 insertions(+), 9 deletions(-)
> >   create mode 100644 arch/arm64/include/asm/tlbbatch.h
> >
> --
> Best Regards!
> Xin Hao
>

Thanks
Barry

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 0/4] mm: arm64: bring up BATCHED_UNMAP_TLB_FLUSH
  2022-07-14  4:51     ` Barry Song
@ 2022-07-15  2:47       ` Yicong Yang
  -1 siblings, 0 replies; 56+ messages in thread
From: Yicong Yang @ 2022-07-15  2:47 UTC (permalink / raw)
  To: Barry Song, xhao
  Cc: yangyicong, Andrew Morton, Linux-MM, LAK, x86, Catalin Marinas,
	Will Deacon, Linux Doc Mailing List, Jonathan Corbet,
	Arnd Bergmann, LKML, Darren Hart, huzhanyuan,
	李培锋(wink),
	张诗明(Simon Zhang), 郭健,
	real mz, linux-mips, openrisc, linuxppc-dev, linux-riscv,
	linux-s390, tiantao (H)

On 2022/7/14 12:51, Barry Song wrote:
> On Thu, Jul 14, 2022 at 3:29 PM Xin Hao <xhao@linux.alibaba.com> wrote:
>>
>> Hi Barry,
>>
>> I did some tests on a Kunpeng arm64 machine using UnixBench.
>>
>> The test results are as below.
>>
>> One core, we can see the performance improvement above +30%.
> 
> I am really pleased to see the 30%+ improvement in unixbench on a single core.
> 
>> ./Run -c 1 -i 1 shell1
>> w/o
>> System Benchmarks Partial Index              BASELINE RESULT INDEX
>> Shell Scripts (1 concurrent)                     42.4 5481.0 1292.7
>> ========
>> System Benchmarks Index Score (Partial Only)                         1292.7
>>
>> w/
>> System Benchmarks Partial Index              BASELINE RESULT INDEX
>> Shell Scripts (1 concurrent)                     42.4 6974.6 1645.0
>> ========
>> System Benchmarks Index Score (Partial Only)                         1645.0
>>
>>
>> But with all cores, there is a slight performance degradation of about -5%.
> 
> That is sad, as we might get more concurrency between mprotect(), madvise(),
> mremap(), zap_pte_range() and the deferred tlbi.
> 
>>
>> ./Run -c 96 -i 1 shell1
>> w/o
>> Shell Scripts (1 concurrent)                  80765.5 lpm   (60.0 s, 1
>> samples)
>> System Benchmarks Partial Index              BASELINE RESULT INDEX
>> Shell Scripts (1 concurrent)                     42.4 80765.5 19048.5
>> ========
>> System Benchmarks Index Score (Partial Only)                        19048.5
>>
>> w
>> Shell Scripts (1 concurrent)                  76333.6 lpm   (60.0 s, 1
>> samples)
>> System Benchmarks Partial Index              BASELINE RESULT INDEX
>> Shell Scripts (1 concurrent)                     42.4 76333.6 18003.2
>> ========
>> System Benchmarks Index Score (Partial Only)                        18003.2
>>
>> ----------------------------------------------------------------------------------------------
>>
>>
>> After discussing with you, I made some changes to the patch.
>>
>> index a52381a680db..1ecba81f1277 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -727,7 +727,11 @@ void flush_tlb_batched_pending(struct mm_struct *mm)
>>          int flushed = batch >> TLB_FLUSH_BATCH_FLUSHED_SHIFT;
>>
>>          if (pending != flushed) {
>> +#ifdef CONFIG_ARCH_HAS_MM_CPUMASK
>>                  flush_tlb_mm(mm);
>> +#else
>> +               dsb(ish);
>> +#endif
>>
> 
> I was guessing the problem might be flush_tlb_batched_pending(),
> so I asked you to change this to verify my guess.
> 
>>                  /*
>>                   * If the new TLB flushing is pending during flushing, leave
>>                   * mm->tlb_flush_batched as is, to avoid losing flushing.
>>
>> there is a performance improvement with all cores, above +30%.
> 
> But I don't think it is a proper patch. There is no guarantee that the CPU calling
> flush_tlb_batched_pending() is exactly the CPU sending the deferred
> tlbi, so the solution is unsafe. But since this temporary code can bring the
> 30%+ performance improvement back for high concurrency, we have huge
> potential to finally make it.
> 
> Unfortunately I don't have an arm64 server to debug this on. I only have
> 8 cores, which are unlikely to reproduce a regression that happens under
> high concurrency with 96 parallel tasks.
> 
> So I'd ask whether @yicong or someone else working on Kunpeng or other
> arm64 servers is able to actually debug and figure out a proper
> patch for this, and then add that patch as 5/5 in this series?
> 

Sure, Tiantao and I will look into this on Kunpeng 920.
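
For reference, a minimal sketch of the hazard described above, assuming the
two-stage scheme from the cover letter. The helper names follow the patchset's
description, but the declarations and bodies below are illustrative
assumptions, not code from the series:

#include <asm/barrier.h>	/* dsb() */

struct mm_struct;
struct arch_tlbflush_unmap_batch;

/* Stage 1, reclaim path: broadcast the invalidation, do not wait for it. */
static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
					struct mm_struct *mm)
{
	/* hypothetical body: issue a broadcast "tlbi aside1is" for mm's ASID */
}

/* Stage 2, batch sync: one barrier completes the TLBIs issued by this CPU. */
static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
{
	dsb(ish);
}

The catch is that a DSB completes only the TLB maintenance issued by the CPU
executing it, so a bare dsb(ish) in flush_tlb_batched_pending() is sufficient
only when it happens to run on the CPU that issued the deferred TLBIs, which,
as noted above, nothing guarantees.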

>>
>> ./Run -c 96 -i 1 shell1
>> 96 CPUs in system; running 96 parallel copies of tests
>>
>> Shell Scripts (1 concurrent)                 109229.0 lpm   (60.0 s, 1 samples)
>> System Benchmarks Partial Index              BASELINE       RESULT    INDEX
>> Shell Scripts (1 concurrent)                     42.4     109229.0  25761.6
>>                                                                     ========
>> System Benchmarks Index Score (Partial Only)                        25761.6
>>
>>
>> Tested-by: Xin Hao <xhao@linux.alibaba.com>
> 
> Thanks for your testing!
> 
>>
>> Looking forward to the next version of your patch.
>>
>> On 7/11/22 11:46 AM, Barry Song wrote:
>>> Though ARM64 has the hardware to do tlb shootdown, the hardware
>>> broadcasting is not free.
>>> A simplest micro benchmark shows even on snapdragon 888 with only
>>> 8 cores, the overhead for ptep_clear_flush is huge even for paging
>>> out one page mapped by only one process:
>>> 5.36%  a.out    [kernel.kallsyms]  [k] ptep_clear_flush
>>>
>>> While pages are mapped by multiple processes or HW has more CPUs,
>>> the cost should become even higher due to the bad scalability of
>>> tlb shootdown.
>>>
>>> The same benchmark can result in 16.99% CPU consumption on ARM64
>>> server with around 100 cores according to Yicong's test on patch
>>> 4/4.
>>>
>>> This patchset leverages the existing BATCHED_UNMAP_TLB_FLUSH by
>>> 1. only send tlbi instructions in the first stage -
>>>       arch_tlbbatch_add_mm()
>>> 2. wait for the completion of tlbi by dsb while doing tlbbatch
>>>       sync in arch_tlbbatch_flush()
>>> My testing on snapdragon shows the overhead of ptep_clear_flush
>>> is removed by the patchset. The micro benchmark becomes 5% faster
>>> even for one page mapped by single process on snapdragon 888.
>>>
>>>
>>> -v2:
>>> 1. Collected Yicong's test result on kunpeng920 ARM64 server;
>>> 2. Removed the redundant vma parameter in arch_tlbbatch_add_mm()
>>>     according to the comments of Peter Zijlstra and Dave Hansen
>>> 3. Added ARCH_HAS_MM_CPUMASK rather than checking if mm_cpumask
>>>     is empty according to the comments of Nadav Amit
>>>
>>> Thanks, Yicong, Peter, Dave and Nadav for your testing or reviewing
>>> , and comments.
>>>
>>> -v1:
>>> https://lore.kernel.org/lkml/20220707125242.425242-1-21cnbao@gmail.com/
>>>
>>> Barry Song (4):
>>>    Revert "Documentation/features: mark BATCHED_UNMAP_TLB_FLUSH doesn't
>>>      apply to ARM64"
>>>    mm: rmap: Allow platforms without mm_cpumask to defer TLB flush
>>>    mm: rmap: Extend tlbbatch APIs to fit new platforms
>>>    arm64: support batched/deferred tlb shootdown during page reclamation
>>>
>>>   Documentation/features/arch-support.txt       |  1 -
>>>   .../features/vm/TLB/arch-support.txt          |  2 +-
>>>   arch/arm/Kconfig                              |  1 +
>>>   arch/arm64/Kconfig                            |  1 +
>>>   arch/arm64/include/asm/tlbbatch.h             | 12 ++++++++++
>>>   arch/arm64/include/asm/tlbflush.h             | 23 +++++++++++++++++--
>>>   arch/loongarch/Kconfig                        |  1 +
>>>   arch/mips/Kconfig                             |  1 +
>>>   arch/openrisc/Kconfig                         |  1 +
>>>   arch/powerpc/Kconfig                          |  1 +
>>>   arch/riscv/Kconfig                            |  1 +
>>>   arch/s390/Kconfig                             |  1 +
>>>   arch/um/Kconfig                               |  1 +
>>>   arch/x86/Kconfig                              |  1 +
>>>   arch/x86/include/asm/tlbflush.h               |  3 ++-
>>>   mm/Kconfig                                    |  3 +++
>>>   mm/rmap.c                                     | 14 +++++++----
>>>   17 files changed, 59 insertions(+), 9 deletions(-)
>>>   create mode 100644 arch/arm64/include/asm/tlbbatch.h
>>>
>> --
>> Best Regards!
>> Xin Hao
>>
> 
> Thanks
> Barry
> .
> 

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 0/4] mm: arm64: bring up BATCHED_UNMAP_TLB_FLUSH
  2022-07-14  4:51     ` Barry Song
@ 2022-07-18 13:28       ` Yicong Yang
  -1 siblings, 0 replies; 56+ messages in thread
From: Yicong Yang @ 2022-07-18 13:28 UTC (permalink / raw)
  To: Barry Song, xhao
  Cc: Andrew Morton, Linux-MM, LAK, x86, Catalin Marinas, Will Deacon,
	Linux Doc Mailing List, Jonathan Corbet, Arnd Bergmann, LKML,
	Darren Hart, huzhanyuan, 李培锋(wink),
	张诗明(Simon Zhang), 郭健,
	real mz, linux-mips, openrisc, linuxppc-dev, linux-riscv,
	linux-s390, yangyicong, tiantao (H)

On 2022/7/14 12:51, Barry Song wrote:
> On Thu, Jul 14, 2022 at 3:29 PM Xin Hao <xhao@linux.alibaba.com> wrote:
>>
>> Hi Barry.
>>
>> I did some tests on a Kunpeng arm64 machine using Unixbench.
>>
>> The test results are as below.
>>
>> With one core, we can see a performance improvement above +30%.
> 
> I am really pleased to see the 30%+ improvement in unixbench on a single core.
> 
>> ./Run -c 1 -i 1 shell1
>> w/o
>> System Benchmarks Partial Index              BASELINE RESULT INDEX
>> Shell Scripts (1 concurrent)                     42.4 5481.0 1292.7
>> ========
>> System Benchmarks Index Score (Partial Only)                         1292.7
>>
>> w/
>> System Benchmarks Partial Index              BASELINE RESULT INDEX
>> Shell Scripts (1 concurrent)                     42.4 6974.6 1645.0
>> ========
>> System Benchmarks Index Score (Partial Only)                         1645.0
>>
>>
>> But with all cores, there is a slight performance degradation of about -5%.
> 
> That is sad, as we might get more concurrency between mprotect(), madvise(),
> mremap(), zap_pte_range() and the deferred tlbi.
> 
>>
>> ./Run -c 96 -i 1 shell1
>> w/o
>> Shell Scripts (1 concurrent)                  80765.5 lpm   (60.0 s, 1
>> samples)
>> System Benchmarks Partial Index              BASELINE RESULT INDEX
>> Shell Scripts (1 concurrent)                     42.4 80765.5 19048.5
>> ========
>> System Benchmarks Index Score (Partial Only)                        19048.5
>>
>> w
>> Shell Scripts (1 concurrent)                  76333.6 lpm   (60.0 s, 1
>> samples)
>> System Benchmarks Partial Index              BASELINE RESULT INDEX
>> Shell Scripts (1 concurrent)                     42.4 76333.6 18003.2
>> ========
>> System Benchmarks Index Score (Partial Only)                        18003.2
>>
>> ----------------------------------------------------------------------------------------------
>>
>>
>> After discussing with you, I made some changes to the patch.
>>
>> index a52381a680db..1ecba81f1277 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -727,7 +727,11 @@ void flush_tlb_batched_pending(struct mm_struct *mm)
>>          int flushed = batch >> TLB_FLUSH_BATCH_FLUSHED_SHIFT;
>>
>>          if (pending != flushed) {
>> +#ifdef CONFIG_ARCH_HAS_MM_CPUMASK
>>                  flush_tlb_mm(mm);
>> +#else
>> +               dsb(ish);
>> +#endif
>>
> 
> I was guessing the problem might be flush_tlb_batched_pending(),
> so I asked you to change this to verify my guess.
> 

flush_tlb_batched_pending() looks like the critical path for this issue, and the code
above can mitigate it.
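
For context, flush_tlb_batched_pending() with the experimental change applied
looks roughly as below. The #ifdef comes from the hunk quoted above; the
surrounding lines are recalled from mm/rmap.c of this era and may differ in
detail:

void flush_tlb_batched_pending(struct mm_struct *mm)
{
	int batch = atomic_read(&mm->tlb_flush_batched);
	int pending = batch & TLB_FLUSH_BATCH_PENDING_MASK;
	int flushed = batch >> TLB_FLUSH_BATCH_FLUSHED_SHIFT;

	if (pending != flushed) {
#ifdef CONFIG_ARCH_HAS_MM_CPUMASK
		flush_tlb_mm(mm);
#else
		/* experimental: completes only the TLBIs this CPU issued */
		dsb(ish);
#endif
		/*
		 * If the new TLB flushing is pending during flushing, leave
		 * mm->tlb_flush_batched as is, to avoid losing flushing.
		 */
		atomic_cmpxchg(&mm->tlb_flush_batched, batch,
			       pending | (pending << TLB_FLUSH_BATCH_FLUSHED_SHIFT));
	}
}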

I cannot reproduce this on a 2P 128C Kunpeng920 server. The kernel is based on
v5.19-rc6 and unixbench is version 5.1.3. The result of `./Run -c 128 -i 1 shell1` is:
      iter-1      iter-2     iter-3
w/o  17708.1     17637.1    17630.1
w    17766.0     17752.3    17861.7

And flush_tlb_batched_pending() isn't the hot spot with the patch:
   7.00%  sh        [kernel.kallsyms]      [k] ptep_clear_flush
   4.17%  sh        [kernel.kallsyms]      [k] ptep_set_access_flags
   2.43%  multi.sh  [kernel.kallsyms]      [k] ptep_clear_flush
   1.98%  sh        [kernel.kallsyms]      [k] _raw_spin_unlock_irqrestore
   1.69%  sh        [kernel.kallsyms]      [k] next_uptodate_page
   1.66%  sort      [kernel.kallsyms]      [k] ptep_clear_flush
   1.56%  multi.sh  [kernel.kallsyms]      [k] ptep_set_access_flags
   1.27%  sh        [kernel.kallsyms]      [k] page_counter_cancel
   1.11%  sh        [kernel.kallsyms]      [k] page_remove_rmap
   1.06%  sh        [kernel.kallsyms]      [k] perf_event_alloc

Hi Xin Hao,

I'm not sure whether the test setup and the config are the same as yours (96C vs 128C
should not be the reason, I think). Did you check whether the 5% is just a fluctuation?
It would be helpful if more information were provided for reproducing this issue.

Thanks.

>>                  /*
>>                   * If the new TLB flushing is pending during flushing, leave
>>                   * mm->tlb_flush_batched as is, to avoid losing flushing.
>>
>> there is a performance improvement with all cores, above +30%.
> 
> But I don't think it is a proper patch. There is no guarantee that the CPU calling
> flush_tlb_batched_pending() is exactly the CPU sending the deferred
> tlbi, so the solution is unsafe. But since this temporary code can bring the
> 30%+ performance improvement back for high concurrency, we have huge
> potential to finally make it.
> 
> Unfortunately I don't have an arm64 server to debug this on. I only have
> 8 cores, which are unlikely to reproduce a regression that happens under
> high concurrency with 96 parallel tasks.
> 
> So I'd ask whether @yicong or someone else working on Kunpeng or other
> arm64 servers is able to actually debug and figure out a proper
> patch for this, and then add that patch as 5/5 in this series?
> 
>>
>> ./Run -c 96 -i 1 shell1
>> 96 CPUs in system; running 96 parallel copies of tests
>>
>> Shell Scripts (1 concurrent)                 109229.0 lpm   (60.0 s, 1 samples)
>> System Benchmarks Partial Index              BASELINE       RESULT    INDEX
>> Shell Scripts (1 concurrent)                     42.4     109229.0  25761.6
>>                                                                     ========
>> System Benchmarks Index Score (Partial Only)                        25761.6
>>
>>
>> Tested-by: Xin Hao <xhao@linux.alibaba.com>
> 
> Thanks for your testing!
> 
>>
>> Looking forward to the next version of your patch.
>>
>> On 7/11/22 11:46 AM, Barry Song wrote:
>>> Though ARM64 has the hardware to do tlb shootdown, the hardware
>>> broadcasting is not free.
>>> A simplest micro benchmark shows even on snapdragon 888 with only
>>> 8 cores, the overhead for ptep_clear_flush is huge even for paging
>>> out one page mapped by only one process:
>>> 5.36%  a.out    [kernel.kallsyms]  [k] ptep_clear_flush
>>>
>>> While pages are mapped by multiple processes or HW has more CPUs,
>>> the cost should become even higher due to the bad scalability of
>>> tlb shootdown.
>>>
>>> The same benchmark can result in 16.99% CPU consumption on ARM64
>>> server with around 100 cores according to Yicong's test on patch
>>> 4/4.
>>>
>>> This patchset leverages the existing BATCHED_UNMAP_TLB_FLUSH by
>>> 1. only send tlbi instructions in the first stage -
>>>       arch_tlbbatch_add_mm()
>>> 2. wait for the completion of tlbi by dsb while doing tlbbatch
>>>       sync in arch_tlbbatch_flush()
>>> My testing on snapdragon shows the overhead of ptep_clear_flush
>>> is removed by the patchset. The micro benchmark becomes 5% faster
>>> even for one page mapped by single process on snapdragon 888.
>>>
>>>
>>> -v2:
>>> 1. Collected Yicong's test result on kunpeng920 ARM64 server;
>>> 2. Removed the redundant vma parameter in arch_tlbbatch_add_mm()
>>>     according to the comments of Peter Zijlstra and Dave Hansen
>>> 3. Added ARCH_HAS_MM_CPUMASK rather than checking if mm_cpumask
>>>     is empty according to the comments of Nadav Amit
>>>
>>> Thanks, Yicong, Peter, Dave and Nadav for your testing or reviewing
>>> , and comments.
>>>
>>> -v1:
>>> https://lore.kernel.org/lkml/20220707125242.425242-1-21cnbao@gmail.com/
>>>
>>> Barry Song (4):
>>>    Revert "Documentation/features: mark BATCHED_UNMAP_TLB_FLUSH doesn't
>>>      apply to ARM64"
>>>    mm: rmap: Allow platforms without mm_cpumask to defer TLB flush
>>>    mm: rmap: Extend tlbbatch APIs to fit new platforms
>>>    arm64: support batched/deferred tlb shootdown during page reclamation
>>>
>>>   Documentation/features/arch-support.txt       |  1 -
>>>   .../features/vm/TLB/arch-support.txt          |  2 +-
>>>   arch/arm/Kconfig                              |  1 +
>>>   arch/arm64/Kconfig                            |  1 +
>>>   arch/arm64/include/asm/tlbbatch.h             | 12 ++++++++++
>>>   arch/arm64/include/asm/tlbflush.h             | 23 +++++++++++++++++--
>>>   arch/loongarch/Kconfig                        |  1 +
>>>   arch/mips/Kconfig                             |  1 +
>>>   arch/openrisc/Kconfig                         |  1 +
>>>   arch/powerpc/Kconfig                          |  1 +
>>>   arch/riscv/Kconfig                            |  1 +
>>>   arch/s390/Kconfig                             |  1 +
>>>   arch/um/Kconfig                               |  1 +
>>>   arch/x86/Kconfig                              |  1 +
>>>   arch/x86/include/asm/tlbflush.h               |  3 ++-
>>>   mm/Kconfig                                    |  3 +++
>>>   mm/rmap.c                                     | 14 +++++++----
>>>   17 files changed, 59 insertions(+), 9 deletions(-)
>>>   create mode 100644 arch/arm64/include/asm/tlbbatch.h
>>>
>> --
>> Best Regards!
>> Xin Hao
>>
> 
> Thanks
> Barry
> .
> 

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 0/4] mm: arm64: bring up BATCHED_UNMAP_TLB_FLUSH
@ 2022-07-18 13:28       ` Yicong Yang
  0 siblings, 0 replies; 56+ messages in thread
From: Yicong Yang @ 2022-07-18 13:28 UTC (permalink / raw)
  To: Barry Song, xhao
  Cc: Andrew Morton, Linux-MM, LAK, x86, Catalin Marinas, Will Deacon,
	Linux Doc Mailing List, Jonathan Corbet, Arnd Bergmann, LKML,
	Darren Hart, huzhanyuan, 李培锋(wink),
	张诗明(Simon Zhang), 郭健,
	real mz, linux-mips, openrisc, linuxppc-dev, linux-riscv,
	linux-s390, yangyicong, tiantao (H)

On 2022/7/14 12:51, Barry Song wrote:
> On Thu, Jul 14, 2022 at 3:29 PM Xin Hao <xhao@linux.alibaba.com> wrote:
>>
>> Hi barry.
>>
>> I do some test on Kunpeng arm64 machine use Unixbench.
>>
>> The test  result as below.
>>
>> One core, we can see the performance improvement above +30%.
> 
> I am really pleased to see the 30%+ improvement on unixbench on single core.
> 
>> ./Run -c 1 -i 1 shell1
>> w/o
>> System Benchmarks Partial Index              BASELINE RESULT INDEX
>> Shell Scripts (1 concurrent)                     42.4 5481.0 1292.7
>> ========
>> System Benchmarks Index Score (Partial Only)                         1292.7
>>
>> w/
>> System Benchmarks Partial Index              BASELINE RESULT INDEX
>> Shell Scripts (1 concurrent)                     42.4 6974.6 1645.0
>> ========
>> System Benchmarks Index Score (Partial Only)                         1645.0
>>
>>
>> But with whole cores, there have little performance degradation above -5%
> 
> That is sad as we might get more concurrency between mprotect(), madvise(),
> mremap(), zap_pte_range() and the deferred tlbi.
> 
>>
>> ./Run -c 96 -i 1 shell1
>> w/o
>> Shell Scripts (1 concurrent)                  80765.5 lpm   (60.0 s, 1
>> samples)
>> System Benchmarks Partial Index              BASELINE RESULT INDEX
>> Shell Scripts (1 concurrent)                     42.4 80765.5 19048.5
>> ========
>> System Benchmarks Index Score (Partial Only)                        19048.5
>>
>> w
>> Shell Scripts (1 concurrent)                  76333.6 lpm   (60.0 s, 1
>> samples)
>> System Benchmarks Partial Index              BASELINE RESULT INDEX
>> Shell Scripts (1 concurrent)                     42.4 76333.6 18003.2
>> ========
>> System Benchmarks Index Score (Partial Only)                        18003.2
>>
>> ----------------------------------------------------------------------------------------------
>>
>>
>> After discuss with you, and do some changes in the patch.
>>
>> ndex a52381a680db..1ecba81f1277 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -727,7 +727,11 @@ void flush_tlb_batched_pending(struct mm_struct *mm)
>>          int flushed = batch >> TLB_FLUSH_BATCH_FLUSHED_SHIFT;
>>
>>          if (pending != flushed) {
>> +#ifdef CONFIG_ARCH_HAS_MM_CPUMASK
>>                  flush_tlb_mm(mm);
>> +#else
>> +               dsb(ish);
>> +#endif
>>
> 
> i was guessing the problem might be flush_tlb_batched_pending()
> so i asked you to change this to verify my guess.
> 

flush_tlb_batched_pending() looks like the critical path for this issue then the code
above can mitigate this.

I cannot reproduce this on a 2P 128C Kunpeng920 server. The kernel is based on the
v5.19-rc6 and unixbench of version 5.1.3. The result of `./Run -c 128 -i 1 shell1` is:
      iter-1      iter-2     iter-3
w/o  17708.1     17637.1    17630.1
w    17766.0     17752.3    17861.7

And flush_tlb_batched_pending()isn't the hot spot with the patch:
   7.00%  sh        [kernel.kallsyms]      [k] ptep_clear_flush
   4.17%  sh        [kernel.kallsyms]      [k] ptep_set_access_flags
   2.43%  multi.sh  [kernel.kallsyms]      [k] ptep_clear_flush
   1.98%  sh        [kernel.kallsyms]      [k] _raw_spin_unlock_irqrestore
   1.69%  sh        [kernel.kallsyms]      [k] next_uptodate_page
   1.66%  sort      [kernel.kallsyms]      [k] ptep_clear_flush
   1.56%  multi.sh  [kernel.kallsyms]      [k] ptep_set_access_flags
   1.27%  sh        [kernel.kallsyms]      [k] page_counter_cancel
   1.11%  sh        [kernel.kallsyms]      [k] page_remove_rmap
   1.06%  sh        [kernel.kallsyms]      [k] perf_event_alloc

Hi Xin Hao,

I'm not sure the test setup as well as the config is same with yours. (96C vs 128C
should not be the reason I think). Did you check that the 5% is a fluctuation or
not? It'll be helpful if more information provided for reproducing this issue.

Thanks.

>      /*
>>                   * If the new TLB flushing is pending during flushing, leave
>>                   * mm->tlb_flush_batched as is, to avoid losing flushing.
>>
>> there have a performance improvement with whole cores, above +30%
> 
> But I don't think it is a proper patch. There is no guarantee the cpu calling
> flush_tlb_batched_pending is exactly the cpu sending the deferred
> tlbi. so the solution is unsafe. But since this temporary code can bring the
> 30%+ performance improvement back for high concurrency, we have huge
> potential to finally make it.
> 
> Unfortunately I don't have an arm64 server to debug on this. I only have
> 8 cores which are unlikely to reproduce regression which happens in
> high concurrency with 96 parallel tasks.
> 
> So I'd ask if @yicong or someone else working on kunpeng or other
> arm64 servers  is able to actually debug and figure out a proper
> patch for this, then add the patch as 5/5 into this series?
> 
>>
>> ./Run -c 96 -i 1 shell1
>> 96 CPUs in system; running 96 parallel copies of tests
>>
>> Shell Scripts (1 concurrent)                 109229.0 lpm   (60.0 s, 1 samples)
>> System Benchmarks Partial Index              BASELINE       RESULT    INDEX
>> Shell Scripts (1 concurrent)                     42.4     109229.0  25761.6
>>                                                                     ========
>> System Benchmarks Index Score (Partial Only)                        25761.6
>>
>>
>> Tested-by: Xin Hao<xhao@linux.alibaba.com>
> 
> Thanks for your testing!
> 
>>
>> Looking forward to your next version patch.
>>
>> On 7/11/22 11:46 AM, Barry Song wrote:
>>> Though ARM64 has the hardware to do tlb shootdown, the hardware
>>> broadcasting is not free.
>>> A simplest micro benchmark shows even on snapdragon 888 with only
>>> 8 cores, the overhead for ptep_clear_flush is huge even for paging
>>> out one page mapped by only one process:
>>> 5.36%  a.out    [kernel.kallsyms]  [k] ptep_clear_flush
>>>
>>> While pages are mapped by multiple processes or HW has more CPUs,
>>> the cost should become even higher due to the bad scalability of
>>> tlb shootdown.
>>>
>>> The same benchmark can result in 16.99% CPU consumption on ARM64
>>> server with around 100 cores according to Yicong's test on patch
>>> 4/4.
>>>
>>> This patchset leverages the existing BATCHED_UNMAP_TLB_FLUSH by
>>> 1. only send tlbi instructions in the first stage -
>>>       arch_tlbbatch_add_mm()
>>> 2. wait for the completion of tlbi by dsb while doing tlbbatch
>>>       sync in arch_tlbbatch_flush()
>>> My testing on snapdragon shows the overhead of ptep_clear_flush
>>> is removed by the patchset. The micro benchmark becomes 5% faster
>>> even for one page mapped by single process on snapdragon 888.
>>>
>>>
>>> -v2:
>>> 1. Collected Yicong's test result on kunpeng920 ARM64 server;
>>> 2. Removed the redundant vma parameter in arch_tlbbatch_add_mm()
>>>     according to the comments of Peter Zijlstra and Dave Hansen
>>> 3. Added ARCH_HAS_MM_CPUMASK rather than checking if mm_cpumask
>>>     is empty according to the comments of Nadav Amit
>>>
>>> Thanks, Yicong, Peter, Dave and Nadav for your testing or reviewing
>>> , and comments.
>>>
>>> -v1:
>>> https://lore.kernel.org/lkml/20220707125242.425242-1-21cnbao@gmail.com/
>>>
>>> Barry Song (4):
>>>    Revert "Documentation/features: mark BATCHED_UNMAP_TLB_FLUSH doesn't
>>>      apply to ARM64"
>>>    mm: rmap: Allow platforms without mm_cpumask to defer TLB flush
>>>    mm: rmap: Extend tlbbatch APIs to fit new platforms
>>>    arm64: support batched/deferred tlb shootdown during page reclamation
>>>
>>>   Documentation/features/arch-support.txt       |  1 -
>>>   .../features/vm/TLB/arch-support.txt          |  2 +-
>>>   arch/arm/Kconfig                              |  1 +
>>>   arch/arm64/Kconfig                            |  1 +
>>>   arch/arm64/include/asm/tlbbatch.h             | 12 ++++++++++
>>>   arch/arm64/include/asm/tlbflush.h             | 23 +++++++++++++++++--
>>>   arch/loongarch/Kconfig                        |  1 +
>>>   arch/mips/Kconfig                             |  1 +
>>>   arch/openrisc/Kconfig                         |  1 +
>>>   arch/powerpc/Kconfig                          |  1 +
>>>   arch/riscv/Kconfig                            |  1 +
>>>   arch/s390/Kconfig                             |  1 +
>>>   arch/um/Kconfig                               |  1 +
>>>   arch/x86/Kconfig                              |  1 +
>>>   arch/x86/include/asm/tlbflush.h               |  3 ++-
>>>   mm/Kconfig                                    |  3 +++
>>>   mm/rmap.c                                     | 14 +++++++----
>>>   17 files changed, 59 insertions(+), 9 deletions(-)
>>>   create mode 100644 arch/arm64/include/asm/tlbbatch.h
>>>
>> --
>> Best Regards!
>> Xin Hao
>>
> 
> Thanks
> Barry
> .
> 

_______________________________________________
linux-riscv mailing list
linux-riscv@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-riscv

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 0/4] mm: arm64: bring up BATCHED_UNMAP_TLB_FLUSH
@ 2022-07-18 13:28       ` Yicong Yang
  0 siblings, 0 replies; 56+ messages in thread
From: Yicong Yang @ 2022-07-18 13:28 UTC (permalink / raw)
  To: Barry Song, xhao
  Cc: Linux Doc Mailing List, Catalin Marinas, yangyicong, Linux-MM,
	郭健,
	linux-riscv, Will Deacon, linux-s390,
	张诗明(Simon Zhang),
	李培锋(wink),
	Jonathan Corbet, x86, linux-mips, Arnd Bergmann, real mz,
	openrisc, Darren Hart, LAK, LKML, huzhanyuan, tiantao (H),
	Andrew Morton, linuxppc-dev

On 2022/7/14 12:51, Barry Song wrote:
> On Thu, Jul 14, 2022 at 3:29 PM Xin Hao <xhao@linux.alibaba.com> wrote:
>>
>> Hi barry.
>>
>> I do some test on Kunpeng arm64 machine use Unixbench.
>>
>> The test  result as below.
>>
>> One core, we can see the performance improvement above +30%.
> 
> I am really pleased to see the 30%+ improvement on unixbench on single core.
> 
>> ./Run -c 1 -i 1 shell1
>> w/o
>> System Benchmarks Partial Index              BASELINE RESULT INDEX
>> Shell Scripts (1 concurrent)                     42.4 5481.0 1292.7
>> ========
>> System Benchmarks Index Score (Partial Only)                         1292.7
>>
>> w/
>> System Benchmarks Partial Index              BASELINE RESULT INDEX
>> Shell Scripts (1 concurrent)                     42.4 6974.6 1645.0
>> ========
>> System Benchmarks Index Score (Partial Only)                         1645.0
>>
>>
>> But with whole cores, there have little performance degradation above -5%
> 
> That is sad as we might get more concurrency between mprotect(), madvise(),
> mremap(), zap_pte_range() and the deferred tlbi.
> 
>>
>> ./Run -c 96 -i 1 shell1
>> w/o
>> Shell Scripts (1 concurrent)                  80765.5 lpm   (60.0 s, 1
>> samples)
>> System Benchmarks Partial Index              BASELINE RESULT INDEX
>> Shell Scripts (1 concurrent)                     42.4 80765.5 19048.5
>> ========
>> System Benchmarks Index Score (Partial Only)                        19048.5
>>
>> w
>> Shell Scripts (1 concurrent)                  76333.6 lpm   (60.0 s, 1
>> samples)
>> System Benchmarks Partial Index              BASELINE RESULT INDEX
>> Shell Scripts (1 concurrent)                     42.4 76333.6 18003.2
>> ========
>> System Benchmarks Index Score (Partial Only)                        18003.2
>>
>> ----------------------------------------------------------------------------------------------
>>
>>
>> After discuss with you, and do some changes in the patch.
>>
>> ndex a52381a680db..1ecba81f1277 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -727,7 +727,11 @@ void flush_tlb_batched_pending(struct mm_struct *mm)
>>          int flushed = batch >> TLB_FLUSH_BATCH_FLUSHED_SHIFT;
>>
>>          if (pending != flushed) {
>> +#ifdef CONFIG_ARCH_HAS_MM_CPUMASK
>>                  flush_tlb_mm(mm);
>> +#else
>> +               dsb(ish);
>> +#endif
>>
> 
> i was guessing the problem might be flush_tlb_batched_pending()
> so i asked you to change this to verify my guess.
> 

flush_tlb_batched_pending() looks like the critical path for this issue then the code
above can mitigate this.

I cannot reproduce this on a 2P 128C Kunpeng920 server. The kernel is based on the
v5.19-rc6 and unixbench of version 5.1.3. The result of `./Run -c 128 -i 1 shell1` is:
      iter-1      iter-2     iter-3
w/o  17708.1     17637.1    17630.1
w    17766.0     17752.3    17861.7

And flush_tlb_batched_pending()isn't the hot spot with the patch:
   7.00%  sh        [kernel.kallsyms]      [k] ptep_clear_flush
   4.17%  sh        [kernel.kallsyms]      [k] ptep_set_access_flags
   2.43%  multi.sh  [kernel.kallsyms]      [k] ptep_clear_flush
   1.98%  sh        [kernel.kallsyms]      [k] _raw_spin_unlock_irqrestore
   1.69%  sh        [kernel.kallsyms]      [k] next_uptodate_page
   1.66%  sort      [kernel.kallsyms]      [k] ptep_clear_flush
   1.56%  multi.sh  [kernel.kallsyms]      [k] ptep_set_access_flags
   1.27%  sh        [kernel.kallsyms]      [k] page_counter_cancel
   1.11%  sh        [kernel.kallsyms]      [k] page_remove_rmap
   1.06%  sh        [kernel.kallsyms]      [k] perf_event_alloc

Hi Xin Hao,

I'm not sure the test setup as well as the config is same with yours. (96C vs 128C
should not be the reason I think). Did you check that the 5% is a fluctuation or
not? It'll be helpful if more information provided for reproducing this issue.

Thanks.

>      /*
>>                   * If the new TLB flushing is pending during flushing, leave
>>                   * mm->tlb_flush_batched as is, to avoid losing flushing.
>>
>> there have a performance improvement with whole cores, above +30%
> 
> But I don't think it is a proper patch. There is no guarantee the cpu calling
> flush_tlb_batched_pending is exactly the cpu sending the deferred
> tlbi. so the solution is unsafe. But since this temporary code can bring the
> 30%+ performance improvement back for high concurrency, we have huge
> potential to finally make it.
> 
> Unfortunately I don't have an arm64 server to debug on this. I only have
> 8 cores which are unlikely to reproduce regression which happens in
> high concurrency with 96 parallel tasks.
> 
> So I'd ask if @yicong or someone else working on kunpeng or other
> arm64 servers  is able to actually debug and figure out a proper
> patch for this, then add the patch as 5/5 into this series?
> 
>>
>> ./Run -c 96 -i 1 shell1
>> 96 CPUs in system; running 96 parallel copies of tests
>>
>> Shell Scripts (1 concurrent)                 109229.0 lpm   (60.0 s, 1 samples)
>> System Benchmarks Partial Index              BASELINE       RESULT    INDEX
>> Shell Scripts (1 concurrent)                     42.4     109229.0  25761.6
>>                                                                     ========
>> System Benchmarks Index Score (Partial Only)                        25761.6
>>
>>
>> Tested-by: Xin Hao<xhao@linux.alibaba.com>
> 
> Thanks for your testing!
> 
>>
>> Looking forward to your next version patch.
>>
>> On 7/11/22 11:46 AM, Barry Song wrote:
>>> Though ARM64 has the hardware to do tlb shootdown, the hardware
>>> broadcasting is not free.
>>> A simplest micro benchmark shows even on snapdragon 888 with only
>>> 8 cores, the overhead for ptep_clear_flush is huge even for paging
>>> out one page mapped by only one process:
>>> 5.36%  a.out    [kernel.kallsyms]  [k] ptep_clear_flush
>>>
>>> While pages are mapped by multiple processes or HW has more CPUs,
>>> the cost should become even higher due to the bad scalability of
>>> tlb shootdown.
>>>
>>> The same benchmark can result in 16.99% CPU consumption on ARM64
>>> server with around 100 cores according to Yicong's test on patch
>>> 4/4.
>>>
>>> This patchset leverages the existing BATCHED_UNMAP_TLB_FLUSH by
>>> 1. only send tlbi instructions in the first stage -
>>>       arch_tlbbatch_add_mm()
>>> 2. wait for the completion of tlbi by dsb while doing tlbbatch
>>>       sync in arch_tlbbatch_flush()
>>> My testing on snapdragon shows the overhead of ptep_clear_flush
>>> is removed by the patchset. The micro benchmark becomes 5% faster
>>> even for one page mapped by single process on snapdragon 888.
>>>
>>>
>>> -v2:
>>> 1. Collected Yicong's test result on kunpeng920 ARM64 server;
>>> 2. Removed the redundant vma parameter in arch_tlbbatch_add_mm()
>>>     according to the comments of Peter Zijlstra and Dave Hansen
>>> 3. Added ARCH_HAS_MM_CPUMASK rather than checking if mm_cpumask
>>>     is empty according to the comments of Nadav Amit
>>>
>>> Thanks, Yicong, Peter, Dave and Nadav for your testing or reviewing
>>> , and comments.
>>>
>>> -v1:
>>> https://lore.kernel.org/lkml/20220707125242.425242-1-21cnbao@gmail.com/
>>>
>>> Barry Song (4):
>>>    Revert "Documentation/features: mark BATCHED_UNMAP_TLB_FLUSH doesn't
>>>      apply to ARM64"
>>>    mm: rmap: Allow platforms without mm_cpumask to defer TLB flush
>>>    mm: rmap: Extend tlbbatch APIs to fit new platforms
>>>    arm64: support batched/deferred tlb shootdown during page reclamation
>>>
>>>   Documentation/features/arch-support.txt       |  1 -
>>>   .../features/vm/TLB/arch-support.txt          |  2 +-
>>>   arch/arm/Kconfig                              |  1 +
>>>   arch/arm64/Kconfig                            |  1 +
>>>   arch/arm64/include/asm/tlbbatch.h             | 12 ++++++++++
>>>   arch/arm64/include/asm/tlbflush.h             | 23 +++++++++++++++++--
>>>   arch/loongarch/Kconfig                        |  1 +
>>>   arch/mips/Kconfig                             |  1 +
>>>   arch/openrisc/Kconfig                         |  1 +
>>>   arch/powerpc/Kconfig                          |  1 +
>>>   arch/riscv/Kconfig                            |  1 +
>>>   arch/s390/Kconfig                             |  1 +
>>>   arch/um/Kconfig                               |  1 +
>>>   arch/x86/Kconfig                              |  1 +
>>>   arch/x86/include/asm/tlbflush.h               |  3 ++-
>>>   mm/Kconfig                                    |  3 +++
>>>   mm/rmap.c                                     | 14 +++++++----
>>>   17 files changed, 59 insertions(+), 9 deletions(-)
>>>   create mode 100644 arch/arm64/include/asm/tlbbatch.h
>>>
>> --
>> Best Regards!
>> Xin Hao
>>
> 
> Thanks
> Barry
> .
> 

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 0/4] mm: arm64: bring up BATCHED_UNMAP_TLB_FLUSH
  2022-07-18 13:28       ` Yicong Yang
@ 2022-07-20 11:18         ` Barry Song
  0 siblings, 0 replies; 56+ messages in thread
From: Barry Song @ 2022-07-20 11:18 UTC (permalink / raw)
  To: Yicong Yang, xhao
  Cc: Andrew Morton, Linux-MM, LAK, x86, Catalin Marinas, Will Deacon,
	Linux Doc Mailing List, Jonathan Corbet, Arnd Bergmann, LKML,
	Darren Hart, huzhanyuan, 李培锋(wink),
	张诗明(Simon Zhang), 郭健,
	real mz, linux-mips, openrisc, linuxppc-dev, linux-riscv,
	linux-s390, Yicong Yang, tiantao (H)

On Tue, Jul 19, 2022 at 1:28 AM Yicong Yang <yangyicong@huawei.com> wrote:
>
> On 2022/7/14 12:51, Barry Song wrote:
> > On Thu, Jul 14, 2022 at 3:29 PM Xin Hao <xhao@linux.alibaba.com> wrote:
> >>
> >> Hi Barry.
> >>
> >> I did some tests on a Kunpeng arm64 machine using Unixbench.
> >>
> >> The test results are as below.
> >>
> >> With one core, we can see a performance improvement of over 30%.
> >
> > I am really pleased to see the 30%+ improvement on unixbench on single core.
> >
> >> ./Run -c 1 -i 1 shell1
> >> w/o
> >> System Benchmarks Partial Index              BASELINE RESULT INDEX
> >> Shell Scripts (1 concurrent)                     42.4 5481.0 1292.7
> >> ========
> >> System Benchmarks Index Score (Partial Only)                         1292.7
> >>
> >> w/
> >> System Benchmarks Partial Index              BASELINE RESULT INDEX
> >> Shell Scripts (1 concurrent)                     42.4 6974.6 1645.0
> >> ========
> >> System Benchmarks Index Score (Partial Only)                         1645.0
> >>
> >>
> >> But with all cores, there is a small performance degradation of around 5%.
> >
> > That is sad as we might get more concurrency between mprotect(), madvise(),
> > mremap(), zap_pte_range() and the deferred tlbi.
> >
> >>
> >> ./Run -c 96 -i 1 shell1
> >> w/o
> >> Shell Scripts (1 concurrent)                  80765.5 lpm   (60.0 s, 1
> >> samples)
> >> System Benchmarks Partial Index              BASELINE RESULT INDEX
> >> Shell Scripts (1 concurrent)                     42.4 80765.5 19048.5
> >> ========
> >> System Benchmarks Index Score (Partial Only)                        19048.5
> >>
> >> w
> >> Shell Scripts (1 concurrent)                  76333.6 lpm   (60.0 s, 1
> >> samples)
> >> System Benchmarks Partial Index              BASELINE RESULT INDEX
> >> Shell Scripts (1 concurrent)                     42.4 76333.6 18003.2
> >> ========
> >> System Benchmarks Index Score (Partial Only)                        18003.2
> >>
> >> ----------------------------------------------------------------------------------------------
> >>
> >>
> >> After discussing with you, I did some changes in the patch.
> >>
> >> index a52381a680db..1ecba81f1277 100644
> >> --- a/mm/rmap.c
> >> +++ b/mm/rmap.c
> >> @@ -727,7 +727,11 @@ void flush_tlb_batched_pending(struct mm_struct *mm)
> >>          int flushed = batch >> TLB_FLUSH_BATCH_FLUSHED_SHIFT;
> >>
> >>          if (pending != flushed) {
> >> +#ifdef CONFIG_ARCH_HAS_MM_CPUMASK
> >>                  flush_tlb_mm(mm);
> >> +#else
> >> +               dsb(ish);
> >> +#endif
> >>
> >
> > I was guessing the problem might be flush_tlb_batched_pending()
> > so I asked you to change this to verify my guess.
> >
>
> flush_tlb_batched_pending() looks like the critical path for this issue, so
> the code above can mitigate it.
>
> I cannot reproduce this on a 2P 128C Kunpeng920 server. The kernel is based on
> v5.19-rc6 and the unixbench version is 5.1.3. The result of `./Run -c 128 -i 1 shell1` is:
>       iter-1      iter-2     iter-3
> w/o  17708.1     17637.1    17630.1
> w    17766.0     17752.3    17861.7
>
> And flush_tlb_batched_pending() isn't the hot spot with the patch:
>    7.00%  sh        [kernel.kallsyms]      [k] ptep_clear_flush
>    4.17%  sh        [kernel.kallsyms]      [k] ptep_set_access_flags
>    2.43%  multi.sh  [kernel.kallsyms]      [k] ptep_clear_flush
>    1.98%  sh        [kernel.kallsyms]      [k] _raw_spin_unlock_irqrestore
>    1.69%  sh        [kernel.kallsyms]      [k] next_uptodate_page
>    1.66%  sort      [kernel.kallsyms]      [k] ptep_clear_flush
>    1.56%  multi.sh  [kernel.kallsyms]      [k] ptep_set_access_flags
>    1.27%  sh        [kernel.kallsyms]      [k] page_counter_cancel
>    1.11%  sh        [kernel.kallsyms]      [k] page_remove_rmap
>    1.06%  sh        [kernel.kallsyms]      [k] perf_event_alloc
>
> Hi Xin Hao,
>
> I'm not sure whether the test setup and the config are the same as yours (96C
> vs 128C should not be the reason, I think). Did you check whether the 5% is
> just fluctuation? It would be helpful if more information were provided for
> reproducing this issue.
>
> Thanks.

I guess that is because "./Run -c 1 -i 1 shell1" isn't an application that
stresses memory. Hi Xin, in what kind of configuration can we reproduce your
test result?

I suppose tlbbatch will mainly affect the performance of user scenarios that
require memory page-out/page-in, like reclaiming file/anon pages.
"./Run -c 1 -i 1 shell1" on a system with sufficient free memory won't be
affected by tlbbatch at all, I believe.
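
To make that concrete, the two stages of the series look roughly like this on
arm64 (a sketch following the cover letter's description and the shape of
flush_tlb_mm() in v5.19; patch 4/4, which also adds asm/tlbbatch.h, is the
authoritative implementation):

static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
                                        struct mm_struct *mm)
{
        unsigned long asid = __TLBI_VADDR(0, ASID(mm));

        /* stage 1: broadcast the invalidate for this ASID, don't wait yet */
        dsb(ishst);
        __tlbi(aside1is, asid);
        __tlbi_user(aside1is, asid);
}

static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
{
        /* stage 2: a single dsb(ish) waits for all TLBIs issued in stage 1 */
        dsb(ish);
}

So the expensive wait is paid once per reclaim batch instead of once per page,
which is why ptep_clear_flush() disappears from the profile.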

Thanks
Barry

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 0/4] mm: arm64: bring up BATCHED_UNMAP_TLB_FLUSH
  2022-07-18 13:28       ` Yicong Yang
@ 2022-07-23  9:17         ` xhao
  0 siblings, 0 replies; 56+ messages in thread
From: xhao @ 2022-07-23  9:17 UTC (permalink / raw)
  To: Yicong Yang, Barry Song
  Cc: Andrew Morton, Linux-MM, LAK, x86, Catalin Marinas, Will Deacon,
	Linux Doc Mailing List, Jonathan Corbet, Arnd Bergmann, LKML,
	Darren Hart, huzhanyuan, 李培锋(wink),
	张诗明(Simon Zhang), 郭健,
	real mz, linux-mips, openrisc, linuxppc-dev, linux-riscv,
	linux-s390, yangyicong, tiantao (H)


On 7/18/22 9:28 PM, Yicong Yang wrote:
> On 2022/7/14 12:51, Barry Song wrote:
>> On Thu, Jul 14, 2022 at 3:29 PM Xin Hao <xhao@linux.alibaba.com> wrote:
>>> Hi Barry.
>>>
>>> I did some tests on a Kunpeng arm64 machine using Unixbench.
>>>
>>> The test results are as below.
>>>
>>> With one core, we can see a performance improvement of over 30%.
>> I am really pleased to see the 30%+ improvement on unixbench on single core.
>>
>>> ./Run -c 1 -i 1 shell1
>>> w/o
>>> System Benchmarks Partial Index              BASELINE RESULT INDEX
>>> Shell Scripts (1 concurrent)                     42.4 5481.0 1292.7
>>> ========
>>> System Benchmarks Index Score (Partial Only)                         1292.7
>>>
>>> w/
>>> System Benchmarks Partial Index              BASELINE RESULT INDEX
>>> Shell Scripts (1 concurrent)                     42.4 6974.6 1645.0
>>> ========
>>> System Benchmarks Index Score (Partial Only)                         1645.0
>>>
>>>
>>> But with all cores, there is a small performance degradation of around 5%.
>> That is sad as we might get more concurrency between mprotect(), madvise(),
>> mremap(), zap_pte_range() and the deferred tlbi.
>>
>>> ./Run -c 96 -i 1 shell1
>>> w/o
>>> Shell Scripts (1 concurrent)                  80765.5 lpm   (60.0 s, 1
>>> samples)
>>> System Benchmarks Partial Index              BASELINE RESULT INDEX
>>> Shell Scripts (1 concurrent)                     42.4 80765.5 19048.5
>>> ========
>>> System Benchmarks Index Score (Partial Only)                        19048.5
>>>
>>> w
>>> Shell Scripts (1 concurrent)                  76333.6 lpm   (60.0 s, 1
>>> samples)
>>> System Benchmarks Partial Index              BASELINE RESULT INDEX
>>> Shell Scripts (1 concurrent)                     42.4 76333.6 18003.2
>>> ========
>>> System Benchmarks Index Score (Partial Only)                        18003.2
>>>
>>> ----------------------------------------------------------------------------------------------
>>>
>>>
>>> After discussing with you, I did some changes in the patch.
>>>
>>> index a52381a680db..1ecba81f1277 100644
>>> --- a/mm/rmap.c
>>> +++ b/mm/rmap.c
>>> @@ -727,7 +727,11 @@ void flush_tlb_batched_pending(struct mm_struct *mm)
>>>           int flushed = batch >> TLB_FLUSH_BATCH_FLUSHED_SHIFT;
>>>
>>>           if (pending != flushed) {
>>> +#ifdef CONFIG_ARCH_HAS_MM_CPUMASK
>>>                   flush_tlb_mm(mm);
>>> +#else
>>> +               dsb(ish);
>>> +#endif
>>>
>> I was guessing the problem might be flush_tlb_batched_pending()
>> so I asked you to change this to verify my guess.
>>
> flush_tlb_batched_pending() looks like the critical path for this issue, so
> the code above can mitigate it.
>
> I cannot reproduce this on a 2P 128C Kunpeng920 server. The kernel is based on
> v5.19-rc6 and the unixbench version is 5.1.3. The result of `./Run -c 128 -i 1 shell1` is:
>        iter-1      iter-2     iter-3
> w/o  17708.1     17637.1    17630.1
> w    17766.0     17752.3    17861.7
>
> And flush_tlb_batched_pending() isn't the hot spot with the patch:
>     7.00%  sh        [kernel.kallsyms]      [k] ptep_clear_flush
>     4.17%  sh        [kernel.kallsyms]      [k] ptep_set_access_flags
>     2.43%  multi.sh  [kernel.kallsyms]      [k] ptep_clear_flush
>     1.98%  sh        [kernel.kallsyms]      [k] _raw_spin_unlock_irqrestore
>     1.69%  sh        [kernel.kallsyms]      [k] next_uptodate_page
>     1.66%  sort      [kernel.kallsyms]      [k] ptep_clear_flush
>     1.56%  multi.sh  [kernel.kallsyms]      [k] ptep_set_access_flags
>     1.27%  sh        [kernel.kallsyms]      [k] page_counter_cancel
>     1.11%  sh        [kernel.kallsyms]      [k] page_remove_rmap
>     1.06%  sh        [kernel.kallsyms]      [k] perf_event_alloc
>
> Hi Xin Hao,
>
> I'm not sure whether the test setup and the config are the same as yours (96C
> vs 128C should not be the reason, I think). Did you check whether the 5% is
> just fluctuation? It would be helpful if more information were provided for
> reproducing this issue.
Yes, it is not always a 5% reduction; there is some fluctuation.
>
> Thanks.
>
>>>                   /*
>>>                    * If the new TLB flushing is pending during flushing, leave
>>>                    * mm->tlb_flush_batched as is, to avoid losing flushing.
>>>
>>> there is a performance improvement with all cores, of over 30%
>> But I don't think it is a proper patch. There is no guarantee that the CPU
>> calling flush_tlb_batched_pending() is exactly the CPU that sent the deferred
>> tlbi, so the solution is unsafe. But since this temporary code can bring the
>> 30%+ performance improvement back for high concurrency, we have huge
>> potential to finally make it.
>>
>> Unfortunately I don't have an arm64 server to debug this on. I only have
>> 8 cores, which are unlikely to reproduce a regression that happens under
>> high concurrency with 96 parallel tasks.
>>
>> So I'd ask if @yicong or someone else working on Kunpeng or other
>> arm64 servers is able to actually debug this and figure out a proper
>> patch, which could then be added as 5/5 into this series?
>>
>>> ./Run -c 96 -i 1 shell1
>>> 96 CPUs in system; running 96 parallel copies of tests
>>>
>>> Shell Scripts (1 concurrent)                 109229.0 lpm   (60.0 s, 1 samples)
>>> System Benchmarks Partial Index              BASELINE       RESULT    INDEX
>>> Shell Scripts (1 concurrent)                     42.4     109229.0  25761.6
>>>                                                                      ========
>>> System Benchmarks Index Score (Partial Only)                        25761.6
>>>
>>>
>>> Tested-by: Xin Hao <xhao@linux.alibaba.com>
>> Thanks for your testing!
>>
>>> Looking forward to the next version of your patch.
>>>
>>> On 7/11/22 11:46 AM, Barry Song wrote:
>>>> Though ARM64 has the hardware to do tlb shootdown, the hardware
>>>> broadcasting is not free.
>>>> A simplest micro benchmark shows even on snapdragon 888 with only
>>>> 8 cores, the overhead for ptep_clear_flush is huge even for paging
>>>> out one page mapped by only one process:
>>>> 5.36%  a.out    [kernel.kallsyms]  [k] ptep_clear_flush
>>>>
>>>> While pages are mapped by multiple processes or HW has more CPUs,
>>>> the cost should become even higher due to the bad scalability of
>>>> tlb shootdown.
>>>>
>>>> The same benchmark can result in 16.99% CPU consumption on ARM64
>>>> server with around 100 cores according to Yicong's test on patch
>>>> 4/4.
>>>>
>>>> This patchset leverages the existing BATCHED_UNMAP_TLB_FLUSH by
>>>> 1. only send tlbi instructions in the first stage -
>>>>        arch_tlbbatch_add_mm()
>>>> 2. wait for the completion of tlbi by dsb while doing tlbbatch
>>>>        sync in arch_tlbbatch_flush()
>>>> My testing on snapdragon shows the overhead of ptep_clear_flush
>>>> is removed by the patchset. The micro benchmark becomes 5% faster
>>>> even for one page mapped by single process on snapdragon 888.
>>>>
>>>>
>>>> -v2:
>>>> 1. Collected Yicong's test result on kunpeng920 ARM64 server;
>>>> 2. Removed the redundant vma parameter in arch_tlbbatch_add_mm()
>>>>      according to the comments of Peter Zijlstra and Dave Hansen
>>>> 3. Added ARCH_HAS_MM_CPUMASK rather than checking if mm_cpumask
>>>>      is empty according to the comments of Nadav Amit
>>>>
>>>> Thanks, Yicong, Peter, Dave and Nadav for your testing or reviewing
>>>> , and comments.
>>>>
>>>> -v1:
>>>> https://lore.kernel.org/lkml/20220707125242.425242-1-21cnbao@gmail.com/
>>>>
>>>> Barry Song (4):
>>>>     Revert "Documentation/features: mark BATCHED_UNMAP_TLB_FLUSH doesn't
>>>>       apply to ARM64"
>>>>     mm: rmap: Allow platforms without mm_cpumask to defer TLB flush
>>>>     mm: rmap: Extend tlbbatch APIs to fit new platforms
>>>>     arm64: support batched/deferred tlb shootdown during page reclamation
>>>>
>>>>    Documentation/features/arch-support.txt       |  1 -
>>>>    .../features/vm/TLB/arch-support.txt          |  2 +-
>>>>    arch/arm/Kconfig                              |  1 +
>>>>    arch/arm64/Kconfig                            |  1 +
>>>>    arch/arm64/include/asm/tlbbatch.h             | 12 ++++++++++
>>>>    arch/arm64/include/asm/tlbflush.h             | 23 +++++++++++++++++--
>>>>    arch/loongarch/Kconfig                        |  1 +
>>>>    arch/mips/Kconfig                             |  1 +
>>>>    arch/openrisc/Kconfig                         |  1 +
>>>>    arch/powerpc/Kconfig                          |  1 +
>>>>    arch/riscv/Kconfig                            |  1 +
>>>>    arch/s390/Kconfig                             |  1 +
>>>>    arch/um/Kconfig                               |  1 +
>>>>    arch/x86/Kconfig                              |  1 +
>>>>    arch/x86/include/asm/tlbflush.h               |  3 ++-
>>>>    mm/Kconfig                                    |  3 +++
>>>>    mm/rmap.c                                     | 14 +++++++----
>>>>    17 files changed, 59 insertions(+), 9 deletions(-)
>>>>    create mode 100644 arch/arm64/include/asm/tlbbatch.h
>>>>
>>> --
>>> Best Regards!
>>> Xin Hao
>>>
>> Thanks
>> Barry
>> .
>>
-- 
Best Regards!
Xin Hao


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 0/4] mm: arm64: bring up BATCHED_UNMAP_TLB_FLUSH
@ 2022-07-23  9:17         ` xhao
  0 siblings, 0 replies; 56+ messages in thread
From: xhao @ 2022-07-23  9:17 UTC (permalink / raw)
  To: Yicong Yang, Barry Song
  Cc: Linux Doc Mailing List, Catalin Marinas, yangyicong, Linux-MM,
	郭健,
	linux-riscv, Will Deacon, linux-s390,
	张诗明(Simon Zhang),
	李培锋(wink),
	Jonathan Corbet, x86, linux-mips, Arnd Bergmann, real mz,
	openrisc, Darren Hart, LAK, LKML, huzhanyuan, tiantao (H),
	Andrew Morton, linuxppc-dev


On 7/18/22 9:28 PM, Yicong Yang wrote:
> On 2022/7/14 12:51, Barry Song wrote:
>> On Thu, Jul 14, 2022 at 3:29 PM Xin Hao <xhao@linux.alibaba.com> wrote:
>>> Hi barry.
>>>
>>> I do some test on Kunpeng arm64 machine use Unixbench.
>>>
>>> The test  result as below.
>>>
>>> One core, we can see the performance improvement above +30%.
>> I am really pleased to see the 30%+ improvement on unixbench on single core.
>>
>>> ./Run -c 1 -i 1 shell1
>>> w/o
>>> System Benchmarks Partial Index              BASELINE RESULT INDEX
>>> Shell Scripts (1 concurrent)                     42.4 5481.0 1292.7
>>> ========
>>> System Benchmarks Index Score (Partial Only)                         1292.7
>>>
>>> w/
>>> System Benchmarks Partial Index              BASELINE RESULT INDEX
>>> Shell Scripts (1 concurrent)                     42.4 6974.6 1645.0
>>> ========
>>> System Benchmarks Index Score (Partial Only)                         1645.0
>>>
>>>
>>> But with whole cores, there have little performance degradation above -5%
>> That is sad as we might get more concurrency between mprotect(), madvise(),
>> mremap(), zap_pte_range() and the deferred tlbi.
>>
>>> ./Run -c 96 -i 1 shell1
>>> w/o
>>> Shell Scripts (1 concurrent)                  80765.5 lpm   (60.0 s, 1
>>> samples)
>>> System Benchmarks Partial Index              BASELINE RESULT INDEX
>>> Shell Scripts (1 concurrent)                     42.4 80765.5 19048.5
>>> ========
>>> System Benchmarks Index Score (Partial Only)                        19048.5
>>>
>>> w
>>> Shell Scripts (1 concurrent)                  76333.6 lpm   (60.0 s, 1
>>> samples)
>>> System Benchmarks Partial Index              BASELINE RESULT INDEX
>>> Shell Scripts (1 concurrent)                     42.4 76333.6 18003.2
>>> ========
>>> System Benchmarks Index Score (Partial Only)                        18003.2
>>>
>>> ----------------------------------------------------------------------------------------------
>>>
>>>
>>> After discuss with you, and do some changes in the patch.
>>>
>>> ndex a52381a680db..1ecba81f1277 100644
>>> --- a/mm/rmap.c
>>> +++ b/mm/rmap.c
>>> @@ -727,7 +727,11 @@ void flush_tlb_batched_pending(struct mm_struct *mm)
>>>           int flushed = batch >> TLB_FLUSH_BATCH_FLUSHED_SHIFT;
>>>
>>>           if (pending != flushed) {
>>> +#ifdef CONFIG_ARCH_HAS_MM_CPUMASK
>>>                   flush_tlb_mm(mm);
>>> +#else
>>> +               dsb(ish);
>>> +#endif
>>>
>> i was guessing the problem might be flush_tlb_batched_pending()
>> so i asked you to change this to verify my guess.
>>
> flush_tlb_batched_pending() looks like the critical path for this issue then the code
> above can mitigate this.
>
> I cannot reproduce this on a 2P 128C Kunpeng920 server. The kernel is based on the
> v5.19-rc6 and unixbench of version 5.1.3. The result of `./Run -c 128 -i 1 shell1` is:
>        iter-1      iter-2     iter-3
> w/o  17708.1     17637.1    17630.1
> w    17766.0     17752.3    17861.7
>
> And flush_tlb_batched_pending()isn't the hot spot with the patch:
>     7.00%  sh        [kernel.kallsyms]      [k] ptep_clear_flush
>     4.17%  sh        [kernel.kallsyms]      [k] ptep_set_access_flags
>     2.43%  multi.sh  [kernel.kallsyms]      [k] ptep_clear_flush
>     1.98%  sh        [kernel.kallsyms]      [k] _raw_spin_unlock_irqrestore
>     1.69%  sh        [kernel.kallsyms]      [k] next_uptodate_page
>     1.66%  sort      [kernel.kallsyms]      [k] ptep_clear_flush
>     1.56%  multi.sh  [kernel.kallsyms]      [k] ptep_set_access_flags
>     1.27%  sh        [kernel.kallsyms]      [k] page_counter_cancel
>     1.11%  sh        [kernel.kallsyms]      [k] page_remove_rmap
>     1.06%  sh        [kernel.kallsyms]      [k] perf_event_alloc
>
> Hi Xin Hao,
>
> I'm not sure whether the test setup and the config are the same as yours (96C
> vs 128C should not be the reason, I think). Did you check whether the 5% is a
> fluctuation or not? It would be helpful if more information were provided for
> reproducing this issue.
Yes, it is not always a 5% reduction; there is some fluctuation.
>
> Thanks.
>
>>       /*
>>>                    * If the new TLB flushing is pending during flushing, leave
>>>                    * mm->tlb_flush_batched as is, to avoid losing flushing.
>>>
>>> there is a performance improvement with all cores, above +30%.
>> But I don't think it is a proper patch. There is no guarantee that the CPU
>> calling flush_tlb_batched_pending() is the CPU that sent the deferred tlbi,
>> so the solution is unsafe. But since this temporary code can bring the 30%+
>> performance improvement back for high concurrency, we have huge potential to
>> finally make it work.
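
A sketch of that race (the CPU roles and call sites here are illustrative
assumptions, not code from this series):

	/*
	 * CPU0 (reclaim path)              CPU1 (page-table updater)
	 * ------------------------------   ------------------------------
	 * arch_tlbbatch_add_mm()
	 *   issues broadcast TLBI, no DSB
	 *                                   flush_tlb_batched_pending()
	 *                                     dsb(ish)
	 *
	 * A DSB only guarantees completion of TLB maintenance issued by the
	 * PE that executes it, so CPU1's dsb(ish) cannot wait for CPU0's
	 * still-pending TLBI: stale TLB entries may survive the "flush".
	 */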
>>
>> Unfortunately I don't have an arm64 server to debug this on. I only have
>> 8 cores, which are unlikely to reproduce a regression that happens under
>> high concurrency with 96 parallel tasks.
>>
>> So may I ask whether @yicong or someone else working on Kunpeng or other
>> arm64 servers is able to actually debug this and figure out a proper
>> patch, which we could then add as 5/5 to this series?
>>
>>> ./Run -c 96 -i 1 shell1
>>> 96 CPUs in system; running 96 parallel copies of tests
>>>
>>> Shell Scripts (1 concurrent)                 109229.0 lpm   (60.0 s, 1 samples)
>>> System Benchmarks Partial Index              BASELINE       RESULT    INDEX
>>> Shell Scripts (1 concurrent)                     42.4     109229.0  25761.6
>>>                                                                      ========
>>> System Benchmarks Index Score (Partial Only)                        25761.6
>>>
>>>
>>> Tested-by: Xin Hao <xhao@linux.alibaba.com>
>> Thanks for your testing!
>>
>>> Looking forward to the next version of your patch.
>>>
>>> On 7/11/22 11:46 AM, Barry Song wrote:
>>>> Though ARM64 has the hardware to do tlb shootdown, the hardware
>>>> broadcasting is not free.
>>>> A simplest micro benchmark shows even on snapdragon 888 with only
>>>> 8 cores, the overhead for ptep_clear_flush is huge even for paging
>>>> out one page mapped by only one process:
>>>> 5.36%  a.out    [kernel.kallsyms]  [k] ptep_clear_flush
>>>>
>>>> While pages are mapped by multiple processes or HW has more CPUs,
>>>> the cost should become even higher due to the bad scalability of
>>>> tlb shootdown.
>>>>
>>>> The same benchmark can result in 16.99% CPU consumption on ARM64
>>>> server with around 100 cores according to Yicong's test on patch
>>>> 4/4.
>>>>
>>>> This patchset leverages the existing BATCHED_UNMAP_TLB_FLUSH by
>>>> 1. only send tlbi instructions in the first stage -
>>>>        arch_tlbbatch_add_mm()
>>>> 2. wait for the completion of tlbi by dsb while doing tlbbatch
>>>>        sync in arch_tlbbatch_flush()
>>>> My testing on snapdragon shows the overhead of ptep_clear_flush
>>>> is removed by the patchset. The micro benchmark becomes 5% faster
>>>> even for one page mapped by single process on snapdragon 888.
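
A rough sketch of those two stages on arm64 (the function bodies below are
an assumption based on this description, not the actual patch 4/4):

	/* stage 1: per-mm while unmapping, only issue the tlbi */
	static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
						struct mm_struct *mm)
	{
		unsigned long asid = __TLBI_VADDR(0, ASID(mm));

		dsb(ishst);		/* make the PTE change visible first */
		__tlbi(aside1is, asid);	/* broadcast by-ASID tlbi, don't wait */
	}

	/* stage 2: once per batch, wait for all issued tlbi to complete */
	static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
	{
		dsb(ish);
	}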
>>>>
>>>>
>>>> -v2:
>>>> 1. Collected Yicong's test result on kunpeng920 ARM64 server;
>>>> 2. Removed the redundant vma parameter in arch_tlbbatch_add_mm()
>>>>      according to the comments of Peter Zijlstra and Dave Hansen
>>>> 3. Added ARCH_HAS_MM_CPUMASK rather than checking if mm_cpumask
>>>>      is empty according to the comments of Nadav Amit
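
Judging from the diffstat below, the new Kconfig symbol presumably looks
something like this (a sketch, not the actual hunks):

	# mm/Kconfig
	config ARCH_HAS_MM_CPUMASK
		bool

	# arch/x86/Kconfig (and the other architectures that maintain a
	# real mm_cpumask) then select it:
	select ARCH_HAS_MM_CPUMASK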
>>>>
>>>> Thanks, Yicong, Peter, Dave and Nadav for your testing or reviewing
>>>> , and comments.
>>>>
>>>> -v1:
>>>> https://lore.kernel.org/lkml/20220707125242.425242-1-21cnbao@gmail.com/
>>>>
>>>> Barry Song (4):
>>>>     Revert "Documentation/features: mark BATCHED_UNMAP_TLB_FLUSH doesn't
>>>>       apply to ARM64"
>>>>     mm: rmap: Allow platforms without mm_cpumask to defer TLB flush
>>>>     mm: rmap: Extend tlbbatch APIs to fit new platforms
>>>>     arm64: support batched/deferred tlb shootdown during page reclamation
>>>>
>>>>    Documentation/features/arch-support.txt       |  1 -
>>>>    .../features/vm/TLB/arch-support.txt          |  2 +-
>>>>    arch/arm/Kconfig                              |  1 +
>>>>    arch/arm64/Kconfig                            |  1 +
>>>>    arch/arm64/include/asm/tlbbatch.h             | 12 ++++++++++
>>>>    arch/arm64/include/asm/tlbflush.h             | 23 +++++++++++++++++--
>>>>    arch/loongarch/Kconfig                        |  1 +
>>>>    arch/mips/Kconfig                             |  1 +
>>>>    arch/openrisc/Kconfig                         |  1 +
>>>>    arch/powerpc/Kconfig                          |  1 +
>>>>    arch/riscv/Kconfig                            |  1 +
>>>>    arch/s390/Kconfig                             |  1 +
>>>>    arch/um/Kconfig                               |  1 +
>>>>    arch/x86/Kconfig                              |  1 +
>>>>    arch/x86/include/asm/tlbflush.h               |  3 ++-
>>>>    mm/Kconfig                                    |  3 +++
>>>>    mm/rmap.c                                     | 14 +++++++----
>>>>    17 files changed, 59 insertions(+), 9 deletions(-)
>>>>    create mode 100644 arch/arm64/include/asm/tlbbatch.h
>>>>
>>> --
>>> Best Regards!
>>> Xin Hao
>>>
>> Thanks
>> Barry
>> .
>>
-- 
Best Regards!
Xin Hao


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 0/4] mm: arm64: bring up BATCHED_UNMAP_TLB_FLUSH
  2022-07-20 11:18         ` Barry Song
@ 2022-07-23  9:22           ` xhao
  -1 siblings, 0 replies; 56+ messages in thread
From: xhao @ 2022-07-23  9:22 UTC (permalink / raw)
  To: Barry Song, Yicong Yang
  Cc: Andrew Morton, Linux-MM, LAK, x86, Catalin Marinas, Will Deacon,
	Linux Doc Mailing List, Jonathan Corbet, Arnd Bergmann, LKML,
	Darren Hart, huzhanyuan, 李培锋(wink),
	张诗明(Simon Zhang), 郭健,
	real mz, linux-mips, openrisc, linuxppc-dev, linux-riscv,
	linux-s390, Yicong Yang, tiantao (H)


On 7/20/22 7:18 PM, Barry Song wrote:
> On Tue, Jul 19, 2022 at 1:28 AM Yicong Yang <yangyicong@huawei.com> wrote:
>> On 2022/7/14 12:51, Barry Song wrote:
>>> On Thu, Jul 14, 2022 at 3:29 PM Xin Hao <xhao@linux.alibaba.com> wrote:
>>>> Hi Barry.
>>>>
>>>> I did some tests on a Kunpeng arm64 machine using Unixbench.
>>>>
>>>> The test results are as below.
>>>>
>>>> With one core, we can see a performance improvement above +30%.
>>> I am really pleased to see the 30%+ improvement on unixbench on single core.
>>>
>>>> ./Run -c 1 -i 1 shell1
>>>> w/o
>>>> System Benchmarks Partial Index              BASELINE RESULT INDEX
>>>> Shell Scripts (1 concurrent)                     42.4 5481.0 1292.7
>>>> ========
>>>> System Benchmarks Index Score (Partial Only)                         1292.7
>>>>
>>>> w/
>>>> System Benchmarks Partial Index              BASELINE RESULT INDEX
>>>> Shell Scripts (1 concurrent)                     42.4 6974.6 1645.0
>>>> ========
>>>> System Benchmarks Index Score (Partial Only)                         1645.0
>>>>
>>>>
>>>> But with all cores, there is a small performance degradation of about -5%.
>>> That is sad as we might get more concurrency between mprotect(), madvise(),
>>> mremap(), zap_pte_range() and the deferred tlbi.
>>>
>>>> ./Run -c 96 -i 1 shell1
>>>> w/o
>>>> Shell Scripts (1 concurrent)                  80765.5 lpm   (60.0 s, 1
>>>> samples)
>>>> System Benchmarks Partial Index              BASELINE RESULT INDEX
>>>> Shell Scripts (1 concurrent)                     42.4 80765.5 19048.5
>>>> ========
>>>> System Benchmarks Index Score (Partial Only)                        19048.5
>>>>
>>>> w
>>>> Shell Scripts (1 concurrent)                  76333.6 lpm   (60.0 s, 1
>>>> samples)
>>>> System Benchmarks Partial Index              BASELINE RESULT INDEX
>>>> Shell Scripts (1 concurrent)                     42.4 76333.6 18003.2
>>>> ========
>>>> System Benchmarks Index Score (Partial Only)                        18003.2
>>>>
>>>> ----------------------------------------------------------------------------------------------
>>>>
>>>>
>>>> After discussing with you, I made some changes in the patch:
>>>>
>>>> index a52381a680db..1ecba81f1277 100644
>>>> --- a/mm/rmap.c
>>>> +++ b/mm/rmap.c
>>>> @@ -727,7 +727,11 @@ void flush_tlb_batched_pending(struct mm_struct *mm)
>>>>           int flushed = batch >> TLB_FLUSH_BATCH_FLUSHED_SHIFT;
>>>>
>>>>           if (pending != flushed) {
>>>> +#ifdef CONFIG_ARCH_HAS_MM_CPUMASK
>>>>                   flush_tlb_mm(mm);
>>>> +#else
>>>> +               dsb(ish);
>>>> +#endif
>>>>
>>> I was guessing the problem might be flush_tlb_batched_pending(),
>>> so I asked you to change this to verify my guess.
>>>
>> flush_tlb_batched_pending() looks like the critical path for this issue, and
>> the code above can mitigate it.
>>
>> I cannot reproduce this on a 2P 128C Kunpeng920 server. The kernel is based on the
>> v5.19-rc6 and unixbench of version 5.1.3. The result of `./Run -c 128 -i 1 shell1` is:
>>        iter-1      iter-2     iter-3
>> w/o  17708.1     17637.1    17630.1
>> w    17766.0     17752.3    17861.7
>>
>> And flush_tlb_batched_pending() isn't the hot spot with the patch:
>>     7.00%  sh        [kernel.kallsyms]      [k] ptep_clear_flush
>>     4.17%  sh        [kernel.kallsyms]      [k] ptep_set_access_flags
>>     2.43%  multi.sh  [kernel.kallsyms]      [k] ptep_clear_flush
>>     1.98%  sh        [kernel.kallsyms]      [k] _raw_spin_unlock_irqrestore
>>     1.69%  sh        [kernel.kallsyms]      [k] next_uptodate_page
>>     1.66%  sort      [kernel.kallsyms]      [k] ptep_clear_flush
>>     1.56%  multi.sh  [kernel.kallsyms]      [k] ptep_set_access_flags
>>     1.27%  sh        [kernel.kallsyms]      [k] page_counter_cancel
>>     1.11%  sh        [kernel.kallsyms]      [k] page_remove_rmap
>>     1.06%  sh        [kernel.kallsyms]      [k] perf_event_alloc
>>
>> Hi Xin Hao,
>>
>> I'm not sure whether the test setup and the config are the same as yours (96C
>> vs 128C should not be the reason, I think). Did you check whether the 5% is a
>> fluctuation or not? It would be helpful if more information were provided for
>> reproducing this issue.
>>
>> Thanks.
> I guess that is because "./Run -c 1 -i 1 shell1" isn't an application that
> stresses memory. Hi Xin, with what kind of configuration can we reproduce
> your test result?

Oh, my fault: my test was not based on the latest upstream kernel, which may
have some impact here. I will run a new test on the latest kernel.

> I suppose tlbbatch will mainly affect the performance of user scenarios
> that require memory page-out/page-in, like reclaiming file/anon pages.
> "./Run -c 1 -i 1 shell1" on a system with sufficient free memory won't be
> affected by tlbbatch at all, I believe.
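
A minimal reclaim-stressing micro benchmark of that kind (one anonymous page
repeatedly paged out and faulted back in) could look like the sketch below;
MADV_PAGEOUT needs Linux 5.4+ headers, and this is only a guess at the shape
of such a test, not the benchmark used for this series:

	#include <string.h>
	#include <sys/mman.h>

	int main(void)
	{
		size_t len = 4096;
		char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (p == MAP_FAILED)
			return 1;
		for (int i = 0; i < 1000000; i++) {
			memset(p, 'x', len);           /* fault the page back in */
			madvise(p, len, MADV_PAGEOUT); /* page it out: unmap + TLB flush */
		}
		return 0;
	}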
>
> Thanks
> Barry

-- 
Best Regards!
Xin Hao


^ permalink raw reply	[flat|nested] 56+ messages in thread

end of thread, other threads:[~2022-07-23  9:23 UTC | newest]

Thread overview: 56+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-07-11  3:46 [PATCH v2 0/4] mm: arm64: bring up BATCHED_UNMAP_TLB_FLUSH Barry Song
2022-07-11  3:46 ` [PATCH v2 1/4] Revert "Documentation/features: mark BATCHED_UNMAP_TLB_FLUSH doesn't apply to ARM64" Barry Song
2022-07-11  3:46 ` [PATCH v2 2/4] mm: rmap: Allow platforms without mm_cpumask to defer TLB flush Barry Song
2022-07-11 13:35   ` Kefeng Wang
2022-07-11 22:52     ` Barry Song
2022-07-11  3:46 ` [PATCH v2 3/4] mm: rmap: Extend tlbbatch APIs to fit new platforms Barry Song
2022-07-11  3:46 ` [PATCH v2 4/4] arm64: support batched/deferred tlb shootdown during page reclamation Barry Song
2022-07-14  3:28 ` [PATCH v2 0/4] mm: arm64: bring up BATCHED_UNMAP_TLB_FLUSH Xin Hao
2022-07-14  4:51   ` Barry Song
2022-07-15  2:47     ` Yicong Yang
2022-07-18 13:28     ` Yicong Yang
2022-07-20 11:18       ` Barry Song
2022-07-23  9:22         ` xhao
2022-07-23  9:17       ` xhao
