* [PATCH v5 0/2] arm64: support batched/deferred tlb shootdown during page reclamation
@ 2022-10-28  8:12 ` Yicong Yang
  0 siblings, 0 replies; 36+ messages in thread
From: Yicong Yang @ 2022-10-28  8:12 UTC (permalink / raw)
  To: akpm, linux-mm, linux-arm-kernel, x86, catalin.marinas, will,
	anshuman.khandual, linux-doc
  Cc: corbet, peterz, arnd, punit.agrawal, linux-kernel, darren,
	yangyicong, huzhanyuan, lipeifeng, zhangshiming, guojian,
	realmz6, linux-mips, openrisc, linuxppc-dev, linux-riscv,
	linux-s390, Barry Song, wangkefeng.wang, xhao, prime.zeng

From: Yicong Yang <yangyicong@hisilicon.com>

Though ARM64 has hardware support for TLB shootdown, the hardware
broadcasting is not free.
A simple micro-benchmark shows that even on a Snapdragon 888 with only
8 cores, the overhead of ptep_clear_flush is substantial even when paging
out one page mapped by only one process:
5.36%  a.out    [kernel.kallsyms]  [k] ptep_clear_flush

When pages are mapped by multiple processes, or the hardware has more
CPUs, the cost should become even higher due to the poor scalability of
TLB shootdown.

The same benchmark shows 16.99% CPU consumption in ptep_clear_flush on
an ARM64 server with around 100 cores, according to Yicong's test
reported in patch 2/2.

This patchset leverages the existing BATCHED_UNMAP_TLB_FLUSH by
1. only sending tlbi instructions in the first stage -
	arch_tlbbatch_add_mm()
2. waiting for the completion of tlbi with a dsb while doing the
	tlbbatch sync in arch_tlbbatch_flush()
Testing on Snapdragon shows that the overhead of ptep_clear_flush
is removed by the patchset. The micro-benchmark becomes about 5% faster
even for one page mapped by a single process on Snapdragon 888.
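
As a rough illustration of the two stages described above, here is a
minimal userspace sketch. It only counts operations; it does not issue
real tlbi or dsb instructions. The struct and all model_* names are
hypothetical and are not part of the patchset.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: counts operations instead of touching any TLB. */
struct tlb_model {
	unsigned long tlbi_issued;	/* invalidations sent */
	unsigned long dsb_waits;	/* completion barriers waited on */
};

/* Unbatched path: every page pays for an invalidate plus its own barrier. */
static void model_flush_each(struct tlb_model *m, size_t nr_pages)
{
	assert(m != NULL);
	for (size_t i = 0; i < nr_pages; i++) {
		m->tlbi_issued++;
		m->dsb_waits++;
	}
}

/* Stage 1, standing in for arch_tlbbatch_add_mm(): send the invalidate only. */
static void model_batch_add(struct tlb_model *m)
{
	assert(m != NULL);
	m->tlbi_issued++;
}

/* Stage 2, standing in for arch_tlbbatch_flush(): one barrier for the batch. */
static void model_batch_flush(struct tlb_model *m)
{
	assert(m != NULL);
	m->dsb_waits++;
}

/* Batched path: queue every invalidate first, then wait once at the end. */
static void model_flush_batched(struct tlb_model *m, size_t nr_pages)
{
	for (size_t i = 0; i < nr_pages; i++)
		model_batch_add(m);
	if (nr_pages)
		model_batch_flush(m);
}
```

In this model, 256 pages cost 256 barriers on the unbatched path and a
single barrier on the batched path, while the number of invalidations
stays the same.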

With this support it becomes possible to do more optimization of memory
reclamation and migration[*].

[*] https://lore.kernel.org/lkml/393d6318-aa38-01ed-6ad8-f9eac89bf0fc@linux.alibaba.com/

-v5:
1. Make ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH depend on EXPERT for this stage on arm64.
2. Add a threshold on the number of CPUs for enabling batched TLB flush on arm64
Link: https://lore.kernel.org/linux-arm-kernel/20220921084302.43631-1-yangyicong@huawei.com/T/

-v4:
1. Add tags from Kefeng and Anshuman, Thanks.
2. Limit the TLB batch/defer on systems with >4 CPUs, per Anshuman
3. Merge previous Patch 1,2-3 into one, per Anshuman
Link: https://lore.kernel.org/linux-mm/20220822082120.8347-1-yangyicong@huawei.com/

-v3:
1. Declare arch's tlbbatch defer support by arch_tlbbatch_should_defer() instead
   of ARCH_HAS_MM_CPUMASK, per Barry and Kefeng
2. Add Tested-by from Xin Hao
Link: https://lore.kernel.org/linux-mm/20220711034615.482895-1-21cnbao@gmail.com/

-v2:
1. Collected Yicong's test result on kunpeng920 ARM64 server;
2. Removed the redundant vma parameter in arch_tlbbatch_add_mm()
   according to the comments of Peter Zijlstra and Dave Hansen
3. Added ARCH_HAS_MM_CPUMASK rather than checking if mm_cpumask
   is empty according to the comments of Nadav Amit

Thanks to Peter, Dave and Nadav for your testing, reviewing
and comments.

-v1:
https://lore.kernel.org/lkml/20220707125242.425242-1-21cnbao@gmail.com/

Anshuman Khandual (1):
  mm/tlbbatch: Introduce arch_tlbbatch_should_defer()

Barry Song (1):
  arm64: support batched/deferred tlb shootdown during page reclamation

 .../features/vm/TLB/arch-support.txt          |  2 +-
 arch/arm64/Kconfig                            |  6 +++
 arch/arm64/include/asm/tlbbatch.h             | 12 +++++
 arch/arm64/include/asm/tlbflush.h             | 46 ++++++++++++++++++-
 arch/x86/include/asm/tlbflush.h               | 15 +++++-
 mm/rmap.c                                     | 19 +++-----
 6 files changed, 84 insertions(+), 16 deletions(-)
 create mode 100644 arch/arm64/include/asm/tlbbatch.h

-- 
2.24.0



* [PATCH v5 1/2] mm/tlbbatch: Introduce arch_tlbbatch_should_defer()
  2022-10-28  8:12 ` Yicong Yang
@ 2022-10-28  8:12   ` Yicong Yang
  -1 siblings, 0 replies; 36+ messages in thread
From: Yicong Yang @ 2022-10-28  8:12 UTC (permalink / raw)
  To: akpm, linux-mm, linux-arm-kernel, x86, catalin.marinas, will,
	anshuman.khandual, linux-doc
  Cc: corbet, peterz, arnd, punit.agrawal, linux-kernel, darren,
	yangyicong, huzhanyuan, lipeifeng, zhangshiming, guojian,
	realmz6, linux-mips, openrisc, linuxppc-dev, linux-riscv,
	linux-s390, Barry Song, wangkefeng.wang, xhao, prime.zeng,
	Anshuman Khandual, Barry Song

From: Anshuman Khandual <khandual@linux.vnet.ibm.com>

The entire scheme of deferred TLB flush in the reclaim path rests on the
fact that the cost to refill TLB entries is less than flushing out
individual entries by sending IPIs to remote CPUs. But architectures
can have different ways to evaluate that. Hence, apart from checking
TTU_BATCH_FLUSH in the TTU flags, the rest of the decision should be
architecture specific.
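
The split described above can be sketched in userspace as follows. This
is only a model under stated assumptions: a 64-bit integer stands in for
mm_cpumask(), the flag value is made up, and every model_* / MODEL_*
name is hypothetical. The arch part mirrors the x86 check, which defers
only when some CPU other than the current one may hold entries for the mm.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_NR_CPUS		64	/* assumption: one bit per CPU in a u64 */
#define MODEL_TTU_BATCH_FLUSH	0x1u	/* hypothetical flag value */

/* True if any CPU other than this_cpu is set in mask. */
static bool model_mask_any_but(uint64_t mask, unsigned int this_cpu)
{
	assert(this_cpu < MODEL_NR_CPUS);
	return (mask & ~(UINT64_C(1) << this_cpu)) != 0;
}

/* Arch-specific part: x86-like policy, defer only if remote CPUs are involved. */
static bool model_arch_should_defer(uint64_t mm_cpumask, unsigned int this_cpu)
{
	return model_mask_any_but(mm_cpumask, this_cpu);
}

/* Generic part: check the TTU flag, then leave the rest to the arch hook. */
static bool model_should_defer_flush(uint64_t mm_cpumask, unsigned int this_cpu,
				     unsigned int flags)
{
	if (!(flags & MODEL_TTU_BATCH_FLUSH))
		return false;

	return model_arch_should_defer(mm_cpumask, this_cpu);
}
```

Another architecture could plug a different policy into the arch part
without touching the generic flag check.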

Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
[https://lore.kernel.org/linuxppc-dev/20171101101735.2318-2-khandual@linux.vnet.ibm.com/]
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
[Rebase and fix incorrect return value type]
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Barry Song <baohua@kernel.org>
---
 arch/x86/include/asm/tlbflush.h | 12 ++++++++++++
 mm/rmap.c                       |  9 +--------
 2 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index cda3118f3b27..8a497d902c16 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -240,6 +240,18 @@ static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
 	flush_tlb_mm_range(vma->vm_mm, a, a + PAGE_SIZE, PAGE_SHIFT, false);
 }
 
+static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
+{
+	bool should_defer = false;
+
+	/* If remote CPUs need to be flushed then defer batch the flush */
+	if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids)
+		should_defer = true;
+	put_cpu();
+
+	return should_defer;
+}
+
 static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
 {
 	/*
diff --git a/mm/rmap.c b/mm/rmap.c
index 2ec925e5fa6a..a9ab10bc0144 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -685,17 +685,10 @@ static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
  */
 static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
 {
-	bool should_defer = false;
-
 	if (!(flags & TTU_BATCH_FLUSH))
 		return false;
 
-	/* If remote CPUs need to be flushed then defer batch the flush */
-	if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids)
-		should_defer = true;
-	put_cpu();
-
-	return should_defer;
+	return arch_tlbbatch_should_defer(mm);
 }
 
 /*
-- 
2.24.0



* [PATCH v5 2/2] arm64: support batched/deferred tlb shootdown during page reclamation
  2022-10-28  8:12 ` Yicong Yang
@ 2022-10-28  8:12   ` Yicong Yang
  -1 siblings, 0 replies; 36+ messages in thread
From: Yicong Yang @ 2022-10-28  8:12 UTC (permalink / raw)
  To: akpm, linux-mm, linux-arm-kernel, x86, catalin.marinas, will,
	anshuman.khandual, linux-doc
  Cc: corbet, peterz, arnd, punit.agrawal, linux-kernel, darren,
	yangyicong, huzhanyuan, lipeifeng, zhangshiming, guojian,
	realmz6, linux-mips, openrisc, linuxppc-dev, linux-riscv,
	linux-s390, Barry Song, wangkefeng.wang, xhao, prime.zeng,
	Barry Song, Nadav Amit, Mel Gorman

From: Barry Song <v-songbaohua@oppo.com>

On x86, batched and deferred TLB shootdown has led to a 90%
performance increase on TLB shootdown. On arm64, HW can do
TLB shootdown without software IPI, but a synchronous tlbi is
still quite expensive.

Even running a simple program which requires swapout can
demonstrate this:
 #include <sys/types.h>
 #include <unistd.h>
 #include <sys/mman.h>
 #include <string.h>

 int main(void)
 {
 #define SIZE (1 * 1024 * 1024)
         volatile unsigned char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                                          MAP_SHARED | MAP_ANONYMOUS, -1, 0);

         /* bail out if the mapping could not be created */
         if (p == MAP_FAILED)
                 return 1;

         /* populate every page so there is something to swap out */
         memset((void *)p, 0x88, SIZE);

         for (int k = 0; k < 10000; k++) {
                 /* swap in */
                 for (int i = 0; i < SIZE; i += 4096) {
                         (void)p[i];
                 }

                 /* swap out */
                 madvise((void *)p, SIZE, MADV_PAGEOUT);
         }

         return 0;
 }

Perf result on snapdragon 888 with 8 cores by using zRAM
as the swap block device.

 ~ # perf record taskset -c 4 ./a.out
 [ perf record: Woken up 10 times to write data ]
 [ perf record: Captured and wrote 2.297 MB perf.data (60084 samples) ]
 ~ # perf report
 # To display the perf.data header info, please use --header/--header-only options.
 # To display the perf.data header info, please use --header/--header-only options.
 #
 #
 # Total Lost Samples: 0
 #
 # Samples: 60K of event 'cycles'
 # Event count (approx.): 35706225414
 #
 # Overhead  Command  Shared Object      Symbol
 # ........  .......  .................  .............................................................................
 #
    21.07%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock_irq
     8.23%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
     6.67%  a.out    [kernel.kallsyms]  [k] filemap_map_pages
     6.16%  a.out    [kernel.kallsyms]  [k] __zram_bvec_write
     5.36%  a.out    [kernel.kallsyms]  [k] ptep_clear_flush
     3.71%  a.out    [kernel.kallsyms]  [k] _raw_spin_lock
     3.49%  a.out    [kernel.kallsyms]  [k] memset64
     1.63%  a.out    [kernel.kallsyms]  [k] clear_page
     1.42%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock
     1.26%  a.out    [kernel.kallsyms]  [k] mod_zone_state.llvm.8525150236079521930
     1.23%  a.out    [kernel.kallsyms]  [k] xas_load
     1.15%  a.out    [kernel.kallsyms]  [k] zram_slot_lock

ptep_clear_flush() takes 5.36% CPU in the micro-benchmark
swapping in/out a page mapped by only one process. If the
page is mapped by multiple processes, typically more than
100 on a phone, the overhead would be much higher as we
have to run the TLB flush 100 times for one single page.
In addition, TLB flush overhead increases with the number
of CPU cores due to the poor scalability of TLB shootdown
in HW, so ARM64 servers should expect much higher
overhead.

Further perf annotate shows that 95% of the CPU time of ptep_clear_flush
is actually spent in the final dsb() waiting for the completion
of the TLB flush. This gives us a very good opportunity to leverage
the existing batched TLB flush in the kernel. The minimum modification
is that we only send the async tlbi in the first stage and issue the
dsb when we have to sync in the second stage.

With the above simple micro-benchmark, the elapsed time to
finish the program decreases by around 5%.

Typical elapsed time w/o patch:
 ~ # time taskset -c 4 ./a.out
 0.21user 14.34system 0:14.69elapsed
w/ patch:
 ~ # time taskset -c 4 ./a.out
 0.22user 13.45system 0:13.80elapsed

Also, Yicong Yang added the following observation.
	Tested with benchmark in the commit on Kunpeng920 arm64 server,
	observed an improvement around 12.5% with command
	`time ./swap_bench`.
		w/o		w/
	real	0m13.460s	0m11.771s
	user	0m0.248s	0m0.279s
	sys	0m12.039s	0m11.458s

	Originally a 16.99% overhead of ptep_clear_flush() was observed,
	which has been eliminated by this patch:

	[root@localhost yang]# perf record -- ./swap_bench && perf report
	[...]
	16.99%  swap_bench  [kernel.kallsyms]  [k] ptep_clear_flush

This was tested on 4, 8 and 128 CPU platforms. It is beneficial on
large systems but may bring no improvement on small systems such as
a 4 CPU platform. So make ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH depend
on CONFIG_EXPERT for this stage and only enable batching on systems
with more than 8 CPUs. Users can adjust this threshold for their
own platforms through CONFIG_ARM64_NR_CPUS_FOR_BATCHED_TLB.

Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Nadav Amit <namit@vmware.com>
Cc: Mel Gorman <mgorman@suse.de>
Tested-by: Yicong Yang <yangyicong@hisilicon.com>
Tested-by: Xin Hao <xhao@linux.alibaba.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
---
 .../features/vm/TLB/arch-support.txt          |  2 +-
 arch/arm64/Kconfig                            |  6 +++
 arch/arm64/include/asm/tlbbatch.h             | 12 +++++
 arch/arm64/include/asm/tlbflush.h             | 46 ++++++++++++++++++-
 arch/x86/include/asm/tlbflush.h               |  3 +-
 mm/rmap.c                                     | 10 ++--
 6 files changed, 71 insertions(+), 8 deletions(-)
 create mode 100644 arch/arm64/include/asm/tlbbatch.h

diff --git a/Documentation/features/vm/TLB/arch-support.txt b/Documentation/features/vm/TLB/arch-support.txt
index 039e4e91ada3..2caf815d7c6c 100644
--- a/Documentation/features/vm/TLB/arch-support.txt
+++ b/Documentation/features/vm/TLB/arch-support.txt
@@ -9,7 +9,7 @@
     |       alpha: | TODO |
     |         arc: | TODO |
     |         arm: | TODO |
-    |       arm64: | N/A  |
+    |       arm64: |  ok  |
     |        csky: | TODO |
     |     hexagon: | TODO |
     |        ia64: | TODO |
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 505c8a1ccbe0..72975e82c7d7 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -93,6 +93,7 @@ config ARM64
 	select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
 	select ARCH_SUPPORTS_NUMA_BALANCING
 	select ARCH_SUPPORTS_PAGE_TABLE_CHECK
+	select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH if EXPERT
 	select ARCH_WANT_COMPAT_IPC_PARSE_VERSION if COMPAT
 	select ARCH_WANT_DEFAULT_BPF_JIT
 	select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
@@ -268,6 +269,11 @@ config ARM64_CONT_PMD_SHIFT
 	default 5 if ARM64_16K_PAGES
 	default 4
 
+config ARM64_NR_CPUS_FOR_BATCHED_TLB
+	int "Threshold to enable batched TLB flush"
+	default 8
+	depends on ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+
 config ARCH_MMAP_RND_BITS_MIN
 	default 14 if ARM64_64K_PAGES
 	default 16 if ARM64_16K_PAGES
diff --git a/arch/arm64/include/asm/tlbbatch.h b/arch/arm64/include/asm/tlbbatch.h
new file mode 100644
index 000000000000..fedb0b87b8db
--- /dev/null
+++ b/arch/arm64/include/asm/tlbbatch.h
@@ -0,0 +1,12 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ARCH_ARM64_TLBBATCH_H
+#define _ARCH_ARM64_TLBBATCH_H
+
+struct arch_tlbflush_unmap_batch {
+	/*
+	 * For arm64, HW can do tlb shootdown, so we don't
+	 * need to record cpumask for sending IPI
+	 */
+};
+
+#endif /* _ARCH_ARM64_TLBBATCH_H */
diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index 412a3b9a3c25..b21cdeb57a18 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -254,17 +254,23 @@ static inline void flush_tlb_mm(struct mm_struct *mm)
 	dsb(ish);
 }
 
-static inline void flush_tlb_page_nosync(struct vm_area_struct *vma,
+static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
 					 unsigned long uaddr)
 {
 	unsigned long addr;
 
 	dsb(ishst);
-	addr = __TLBI_VADDR(uaddr, ASID(vma->vm_mm));
+	addr = __TLBI_VADDR(uaddr, ASID(mm));
 	__tlbi(vale1is, addr);
 	__tlbi_user(vale1is, addr);
 }
 
+static inline void flush_tlb_page_nosync(struct vm_area_struct *vma,
+					 unsigned long uaddr)
+{
+	return __flush_tlb_page_nosync(vma->vm_mm, uaddr);
+}
+
 static inline void flush_tlb_page(struct vm_area_struct *vma,
 				  unsigned long uaddr)
 {
@@ -272,6 +278,42 @@ static inline void flush_tlb_page(struct vm_area_struct *vma,
 	dsb(ish);
 }
 
+#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+
+static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
+{
+	/*
+	 * Batched TLB flush has proved beneficial on systems with a large
+	 * number of CPUs, especially those with more than 8. TLB shootdown
+	 * is cheap on small systems, which may not need this feature. So use
+	 * a threshold for enabling this to avoid potential side effects on
+	 * these platforms.
+	 */
+	if (num_online_cpus() <= CONFIG_ARM64_NR_CPUS_FOR_BATCHED_TLB)
+		return false;
+
+#ifdef CONFIG_ARM64_WORKAROUND_REPEAT_TLBI
+	if (unlikely(this_cpu_has_cap(ARM64_WORKAROUND_REPEAT_TLBI)))
+		return false;
+#endif
+
+	return true;
+}
+
+static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
+					struct mm_struct *mm,
+					unsigned long uaddr)
+{
+	__flush_tlb_page_nosync(mm, uaddr);
+}
+
+static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
+{
+	dsb(ish);
+}
+
+#endif /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */
+
 /*
  * This is meant to avoid soft lock-ups on large TLB flushing ranges and not
  * necessarily a performance improvement.
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 8a497d902c16..5bd78ae55cd4 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -264,7 +264,8 @@ static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
 }
 
 static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
-					struct mm_struct *mm)
+					struct mm_struct *mm,
+					unsigned long uaddr)
 {
 	inc_mm_tlb_gen(mm);
 	cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
diff --git a/mm/rmap.c b/mm/rmap.c
index a9ab10bc0144..a1b408ff44e5 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -640,12 +640,13 @@ void try_to_unmap_flush_dirty(void)
 #define TLB_FLUSH_BATCH_PENDING_LARGE			\
 	(TLB_FLUSH_BATCH_PENDING_MASK / 2)
 
-static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
+static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable,
+				      unsigned long uaddr)
 {
 	struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
 	int batch, nbatch;
 
-	arch_tlbbatch_add_mm(&tlb_ubc->arch, mm);
+	arch_tlbbatch_add_mm(&tlb_ubc->arch, mm, uaddr);
 	tlb_ubc->flush_required = true;
 
 	/*
@@ -723,7 +724,8 @@ void flush_tlb_batched_pending(struct mm_struct *mm)
 	}
 }
 #else
-static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
+static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable,
+				      unsigned long uaddr)
 {
 }
 
@@ -1596,7 +1598,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 				 */
 				pteval = ptep_get_and_clear(mm, address, pvmw.pte);
 
-				set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
+				set_tlb_ubc_flush_pending(mm, pte_dirty(pteval), address);
 			} else {
 				pteval = ptep_clear_flush(vma, address, pvmw.pte);
 			}
-- 
2.24.0


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v5 2/2] arm64: support batched/deferred tlb shootdown during page reclamation
@ 2022-10-28  8:12   ` Yicong Yang
  0 siblings, 0 replies; 36+ messages in thread
From: Yicong Yang @ 2022-10-28  8:12 UTC (permalink / raw)
  To: akpm, linux-mm, linux-arm-kernel, x86, catalin.marinas, will,
	anshuman.khandual, linux-doc
  Cc: corbet, peterz, arnd, punit.agrawal, linux-kernel, darren,
	yangyicong, huzhanyuan, lipeifeng, zhangshiming, guojian,
	realmz6, linux-mips, openrisc, linuxppc-dev, linux-riscv,
	linux-s390, Barry Song, wangkefeng.wang, xhao, prime.zeng,
	Barry Song, Nadav Amit, Mel Gorman

From: Barry Song <v-songbaohua@oppo.com>

On x86, batched and deferred tlb shootdown has led to a 90%
performance increase on tlb shootdown. On arm64, HW can do
tlb shootdown without software IPI, but sync tlbi is still
quite expensive.

Even the simplest program that requires swapout can
demonstrate this:
 #include <sys/types.h>
 #include <unistd.h>
 #include <sys/mman.h>
 #include <string.h>

 int main(void)
 {
 #define SIZE (1 * 1024 * 1024)
         volatile unsigned char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                                          MAP_SHARED | MAP_ANONYMOUS, -1, 0);

         if (p == MAP_FAILED)
                 return 1;

         /* touch every page so the whole range is populated */
         memset((void *)p, 0x88, SIZE);

         for (int k = 0; k < 10000; k++) {
                 /* swap in */
                 for (int i = 0; i < SIZE; i += 4096) {
                         (void)p[i];
                 }

                 /* swap out */
                 madvise((void *)p, SIZE, MADV_PAGEOUT);
         }

         return 0;
 }

Perf results on a Snapdragon 888 with 8 cores, using zRAM
as the swap block device:

 ~ # perf record taskset -c 4 ./a.out
 [ perf record: Woken up 10 times to write data ]
 [ perf record: Captured and wrote 2.297 MB perf.data (60084 samples) ]
 ~ # perf report
 # To display the perf.data header info, please use --header/--header-only options.
 # To display the perf.data header info, please use --header/--header-only options.
 #
 #
 # Total Lost Samples: 0
 #
 # Samples: 60K of event 'cycles'
 # Event count (approx.): 35706225414
 #
 # Overhead  Command  Shared Object      Symbol
 # ........  .......  .................  .............................................................................
 #
    21.07%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock_irq
     8.23%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
     6.67%  a.out    [kernel.kallsyms]  [k] filemap_map_pages
     6.16%  a.out    [kernel.kallsyms]  [k] __zram_bvec_write
     5.36%  a.out    [kernel.kallsyms]  [k] ptep_clear_flush
     3.71%  a.out    [kernel.kallsyms]  [k] _raw_spin_lock
     3.49%  a.out    [kernel.kallsyms]  [k] memset64
     1.63%  a.out    [kernel.kallsyms]  [k] clear_page
     1.42%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock
     1.26%  a.out    [kernel.kallsyms]  [k] mod_zone_state.llvm.8525150236079521930
     1.23%  a.out    [kernel.kallsyms]  [k] xas_load
     1.15%  a.out    [kernel.kallsyms]  [k] zram_slot_lock

ptep_clear_flush() takes 5.36% of CPU time in this micro-benchmark,
which swaps a page mapped by only one process in and out. If the
page is mapped by multiple processes (typically more than 100 on
a phone), the overhead is much higher because the TLB flush has
to run once per mapping, e.g. 100 times for a single page.
In addition, TLB flush overhead grows with the number of CPU
cores because TLB shootdown scales poorly in hardware, so ARM64
servers should expect much higher overhead.

Further perf annotate shows that 95% of the CPU time in
ptep_clear_flush() is spent in the final dsb(), waiting for the
TLB flush to complete. This gives us a very good opportunity to
leverage the existing batched TLB flush in the kernel. The minimum
modification is to send only the async tlbi in the first stage and
to issue the dsb in the second stage, when we have to sync.
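
The two-stage idea can be illustrated with a toy userspace cost model.
This is only a sketch under simplifying assumptions: it counts one
invalidation and at most one completion barrier per page, and the names
flush_stats, flush_sync() and flush_batched() are hypothetical, not
kernel interfaces. The real __flush_tlb_page_nosync() also issues a
leading dsb(ishst) and possibly a second tlbi, which the model ignores.

```c
#include <assert.h>
#include <stddef.h>

/* Toy cost model: counts the operations issued, not real instructions. */
struct flush_stats {
	size_t tlbi_issued;	/* broadcast invalidations sent */
	size_t dsb_waits;	/* barriers that wait for completion */
};

/* Synchronous path: every page pays for its own completion barrier. */
static struct flush_stats flush_sync(size_t nr_pages)
{
	struct flush_stats s = { 0, 0 };

	for (size_t i = 0; i < nr_pages; i++) {
		s.tlbi_issued++;	/* models the tlbi for one page */
		s.dsb_waits++;		/* models the trailing dsb(ish) */
	}
	return s;
}

/* Batched path: stage 1 only issues invalidations, stage 2 waits once. */
static struct flush_stats flush_batched(size_t nr_pages)
{
	struct flush_stats s = { 0, 0 };

	for (size_t i = 0; i < nr_pages; i++)
		s.tlbi_issued++;	/* stage 1: models arch_tlbbatch_add_mm() */
	if (nr_pages)
		s.dsb_waits = 1;	/* stage 2: models arch_tlbbatch_flush() */
	return s;
}
```

For the 1 MiB region of the benchmark (256 pages of 4 KiB), the model
issues the same 256 invalidations on both paths but waits 256 times on
the synchronous path and once on the batched path, assuming the whole
region is flushed in a single batch.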

With the above micro-benchmark, the elapsed time to finish the
program decreases by around 5%.

Typical elapsed time w/o patch:
 ~ # time taskset -c 4 ./a.out
 0.21user 14.34system 0:14.69elapsed
w/ patch:
 ~ # time taskset -c 4 ./a.out
 0.22user 13.45system 0:13.80elapsed

Also, Yicong Yang added the following observation:
	Tested with the benchmark in this commit on a Kunpeng920 arm64
	server; observed an improvement of around 12.5% with the command
	`time ./swap_bench`.
		w/o		w/
	real	0m13.460s	0m11.771s
	user	0m0.248s	0m0.279s
	sys	0m12.039s	0m11.458s

	Originally a 16.99% overhead of ptep_clear_flush() was observed,
	which has been eliminated by this patch:

	[root@localhost yang]# perf record -- ./swap_bench && perf report
	[...]
	16.99%  swap_bench  [kernel.kallsyms]  [k] ptep_clear_flush
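
As a cross-check of the quoted percentages, the relative reduction in
elapsed time can be computed from the timings above. This is a minimal
sketch; improvement_pct() is a hypothetical helper introduced only for
illustration.

```c
#include <assert.h>

/* Relative reduction in elapsed time, in percent. */
static double improvement_pct(double before, double after)
{
	assert(before > 0.0);
	return (before - after) / before * 100.0;
}
```

With the Snapdragon 888 timings (14.69s before, 13.80s after) this gives
roughly 6%, and with the Kunpeng920 "real" timings (13.460s before,
11.771s after) roughly 12.5%, in line with the figures quoted above.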

It has been tested on 4, 8 and 128 CPU platforms and is shown to be
beneficial on large systems, but may bring no improvement on small
systems such as a 4 CPU platform. So make
ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH depend on CONFIG_EXPERT for this
stage and only enable it on systems with more than 8 CPUs. Users can
change this threshold for their own platforms via
CONFIG_ARM64_NR_CPUS_FOR_BATCHED_TLB.
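
The enabling policy can be sketched in plain C as follows. This is an
illustrative userspace model, not the kernel code: should_defer() and
the NR_CPUS_FOR_BATCHED_TLB macro are hypothetical stand-ins for
arch_tlbbatch_should_defer() and the Kconfig threshold, and the
repeat-TLBI erratum check is reduced to a boolean argument.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for CONFIG_ARM64_NR_CPUS_FOR_BATCHED_TLB (default 8). */
#define NR_CPUS_FOR_BATCHED_TLB 8

/* Decide whether a TLB flush should be deferred and batched. */
static bool should_defer(unsigned int online_cpus, bool repeat_tlbi_workaround)
{
	/* Small systems: shootdown is cheap, so keep the synchronous path. */
	if (online_cpus <= NR_CPUS_FOR_BATCHED_TLB)
		return false;

	/* The erratum workaround already adds a barrier per TLBI, so skip batching. */
	if (repeat_tlbi_workaround)
		return false;

	return true;
}
```

With the default threshold, a 4 CPU or 8 CPU system keeps the synchronous
path, while a 128 CPU system without the erratum workaround defers.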

Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Nadav Amit <namit@vmware.com>
Cc: Mel Gorman <mgorman@suse.de>
Tested-by: Yicong Yang <yangyicong@hisilicon.com>
Tested-by: Xin Hao <xhao@linux.alibaba.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
---
 .../features/vm/TLB/arch-support.txt          |  2 +-
 arch/arm64/Kconfig                            |  6 +++
 arch/arm64/include/asm/tlbbatch.h             | 12 +++++
 arch/arm64/include/asm/tlbflush.h             | 46 ++++++++++++++++++-
 arch/x86/include/asm/tlbflush.h               |  3 +-
 mm/rmap.c                                     | 10 ++--
 6 files changed, 71 insertions(+), 8 deletions(-)
 create mode 100644 arch/arm64/include/asm/tlbbatch.h

diff --git a/Documentation/features/vm/TLB/arch-support.txt b/Documentation/features/vm/TLB/arch-support.txt
index 039e4e91ada3..2caf815d7c6c 100644
--- a/Documentation/features/vm/TLB/arch-support.txt
+++ b/Documentation/features/vm/TLB/arch-support.txt
@@ -9,7 +9,7 @@
     |       alpha: | TODO |
     |         arc: | TODO |
     |         arm: | TODO |
-    |       arm64: | N/A  |
+    |       arm64: |  ok  |
     |        csky: | TODO |
     |     hexagon: | TODO |
     |        ia64: | TODO |
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 505c8a1ccbe0..72975e82c7d7 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -93,6 +93,7 @@ config ARM64
 	select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
 	select ARCH_SUPPORTS_NUMA_BALANCING
 	select ARCH_SUPPORTS_PAGE_TABLE_CHECK
+	select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH if EXPERT
 	select ARCH_WANT_COMPAT_IPC_PARSE_VERSION if COMPAT
 	select ARCH_WANT_DEFAULT_BPF_JIT
 	select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
@@ -268,6 +269,11 @@ config ARM64_CONT_PMD_SHIFT
 	default 5 if ARM64_16K_PAGES
 	default 4
 
+config ARM64_NR_CPUS_FOR_BATCHED_TLB
+	int "Threshold to enable batched TLB flush"
+	default 8
+	depends on ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+
 config ARCH_MMAP_RND_BITS_MIN
 	default 14 if ARM64_64K_PAGES
 	default 16 if ARM64_16K_PAGES
diff --git a/arch/arm64/include/asm/tlbbatch.h b/arch/arm64/include/asm/tlbbatch.h
new file mode 100644
index 000000000000..fedb0b87b8db
--- /dev/null
+++ b/arch/arm64/include/asm/tlbbatch.h
@@ -0,0 +1,12 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ARCH_ARM64_TLBBATCH_H
+#define _ARCH_ARM64_TLBBATCH_H
+
+struct arch_tlbflush_unmap_batch {
+	/*
+	 * For arm64, HW can do tlb shootdown, so we don't
+	 * need to record cpumask for sending IPI
+	 */
+};
+
+#endif /* _ARCH_ARM64_TLBBATCH_H */
diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index 412a3b9a3c25..b21cdeb57a18 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -254,17 +254,23 @@ static inline void flush_tlb_mm(struct mm_struct *mm)
 	dsb(ish);
 }
 
-static inline void flush_tlb_page_nosync(struct vm_area_struct *vma,
+static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
 					 unsigned long uaddr)
 {
 	unsigned long addr;
 
 	dsb(ishst);
-	addr = __TLBI_VADDR(uaddr, ASID(vma->vm_mm));
+	addr = __TLBI_VADDR(uaddr, ASID(mm));
 	__tlbi(vale1is, addr);
 	__tlbi_user(vale1is, addr);
 }
 
+static inline void flush_tlb_page_nosync(struct vm_area_struct *vma,
+					 unsigned long uaddr)
+{
+	return __flush_tlb_page_nosync(vma->vm_mm, uaddr);
+}
+
 static inline void flush_tlb_page(struct vm_area_struct *vma,
 				  unsigned long uaddr)
 {
@@ -272,6 +278,42 @@ static inline void flush_tlb_page(struct vm_area_struct *vma,
 	dsb(ish);
 }
 
+#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+
+static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
+{
+	/*
+	 * TLB batched flush has proved beneficial for systems with a large
+	 * number of CPUs, especially those with more than 8 CPUs. TLB
+	 * shootdown is cheap on small systems, which may not need this
+	 * feature. So use a threshold for enabling this to avoid potential
+	 * side effects on these platforms.
+	 */
+	if (num_online_cpus() <= CONFIG_ARM64_NR_CPUS_FOR_BATCHED_TLB)
+		return false;
+
+#ifdef CONFIG_ARM64_WORKAROUND_REPEAT_TLBI
+	if (unlikely(this_cpu_has_cap(ARM64_WORKAROUND_REPEAT_TLBI)))
+		return false;
+#endif
+
+	return true;
+}
+
+static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
+					struct mm_struct *mm,
+					unsigned long uaddr)
+{
+	__flush_tlb_page_nosync(mm, uaddr);
+}
+
+static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
+{
+	dsb(ish);
+}
+
+#endif /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */
+
 /*
  * This is meant to avoid soft lock-ups on large TLB flushing ranges and not
  * necessarily a performance improvement.
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 8a497d902c16..5bd78ae55cd4 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -264,7 +264,8 @@ static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
 }
 
 static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
-					struct mm_struct *mm)
+					struct mm_struct *mm,
+					unsigned long uaddr)
 {
 	inc_mm_tlb_gen(mm);
 	cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
diff --git a/mm/rmap.c b/mm/rmap.c
index a9ab10bc0144..a1b408ff44e5 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -640,12 +640,13 @@ void try_to_unmap_flush_dirty(void)
 #define TLB_FLUSH_BATCH_PENDING_LARGE			\
 	(TLB_FLUSH_BATCH_PENDING_MASK / 2)
 
-static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
+static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable,
+				      unsigned long uaddr)
 {
 	struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
 	int batch, nbatch;
 
-	arch_tlbbatch_add_mm(&tlb_ubc->arch, mm);
+	arch_tlbbatch_add_mm(&tlb_ubc->arch, mm, uaddr);
 	tlb_ubc->flush_required = true;
 
 	/*
@@ -723,7 +724,8 @@ void flush_tlb_batched_pending(struct mm_struct *mm)
 	}
 }
 #else
-static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
+static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable,
+				      unsigned long uaddr)
 {
 }
 
@@ -1596,7 +1598,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 				 */
 				pteval = ptep_get_and_clear(mm, address, pvmw.pte);
 
-				set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
+				set_tlb_ubc_flush_pending(mm, pte_dirty(pteval), address);
 			} else {
 				pteval = ptep_clear_flush(vma, address, pvmw.pte);
 			}
-- 
2.24.0


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [External] [PATCH v5 0/2] arm64: support batched/deferred tlb shootdown during page reclamation
  2022-10-28  8:12 ` Yicong Yang
  (?)
  (?)
@ 2022-11-11 10:17   ` Punit Agrawal
  -1 siblings, 0 replies; 36+ messages in thread
From: Punit Agrawal @ 2022-11-11 10:17 UTC (permalink / raw)
  To: Yicong Yang
  Cc: akpm, linux-mm, linux-arm-kernel, x86, catalin.marinas, will,
	anshuman.khandual, linux-doc, corbet, peterz, arnd,
	punit.agrawal, linux-kernel, darren, yangyicong, huzhanyuan,
	lipeifeng, zhangshiming, guojian, realmz6, linux-mips, openrisc,
	linuxppc-dev, linux-riscv, linux-s390, Barry Song,
	wangkefeng.wang, xhao, prime.zeng

Yicong Yang <yangyicong@huawei.com> writes:

> From: Yicong Yang <yangyicong@hisilicon.com>
>
> Though ARM64 has the hardware to do tlb shootdown, the hardware
> broadcasting is not free.
> A simplest micro benchmark shows even on snapdragon 888 with only
> 8 cores, the overhead for ptep_clear_flush is huge even for paging
> out one page mapped by only one process:
> 5.36%  a.out    [kernel.kallsyms]  [k] ptep_clear_flush
>
> While pages are mapped by multiple processes or HW has more CPUs,
> the cost should become even higher due to the bad scalability of
> tlb shootdown.
>
> The same benchmark can result in 16.99% CPU consumption on ARM64
> server with around 100 cores according to Yicong's test on patch
> 4/4.
>
> This patchset leverages the existing BATCHED_UNMAP_TLB_FLUSH by
> 1. only send tlbi instructions in the first stage -
> 	arch_tlbbatch_add_mm()
> 2. wait for the completion of tlbi by dsb while doing tlbbatch
> 	sync in arch_tlbbatch_flush()
> Testing on snapdragon shows the overhead of ptep_clear_flush
> is removed by the patchset. The micro benchmark becomes 5% faster
> even for one page mapped by single process on snapdragon 888.
>
> With this support we're possible to do more optimization for memory
> reclamation and migration[*].

I applied the patches on v6.1-rc4 and was able to see the drop in
ptep_clear_flush() in the perf report when running the test program from
Patch 2. The tests were done on a rk3399 based system with benefits
visible when running the tests on either of the clusters. 

So, for the series,

Tested-by: Punit Agrawal <punit.agrawal@bytedance.com>

Thanks,
Punit

[...]


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v5 2/2] arm64: support batched/deferred tlb shootdown during page reclamation
  2022-10-28  8:12   ` Yicong Yang
  (?)
  (?)
@ 2022-11-14  3:29     ` Anshuman Khandual
  -1 siblings, 0 replies; 36+ messages in thread
From: Anshuman Khandual @ 2022-11-14  3:29 UTC (permalink / raw)
  To: Yicong Yang, akpm, linux-mm, linux-arm-kernel, x86,
	catalin.marinas, will, linux-doc
  Cc: corbet, peterz, arnd, punit.agrawal, linux-kernel, darren,
	yangyicong, huzhanyuan, lipeifeng, zhangshiming, guojian,
	realmz6, linux-mips, openrisc, linuxppc-dev, linux-riscv,
	linux-s390, Barry Song, wangkefeng.wang, xhao, prime.zeng,
	Barry Song, Nadav Amit, Mel Gorman



On 10/28/22 13:42, Yicong Yang wrote:
> +static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
> +{
> +	/*
> +	 * TLB batched flush is proved to be beneficial for systems with large
> +	 * number of CPUs, especially system with more than 8 CPUs. TLB shutdown
> +	 * is cheap on small systems which may not need this feature. So use
> +	 * a threshold for enabling this to avoid potential side effects on
> +	 * these platforms.
> +	 */
> +	if (num_online_cpus() <= CONFIG_ARM64_NR_CPUS_FOR_BATCHED_TLB)
> +		return false;
> +
> +#ifdef CONFIG_ARM64_WORKAROUND_REPEAT_TLBI
> +	if (unlikely(this_cpu_has_cap(ARM64_WORKAROUND_REPEAT_TLBI)))
> +		return false;
> +#endif

should_defer_flush() is immediately followed by set_tlb_ubc_flush_pending(), which calls
arch_tlbbatch_add_mm(), triggering the actual TLBI flush via __flush_tlb_page_nosync().
It should be okay to check the capability with this_cpu_has_cap(), as the entire call chain
here is executed on the same CPU. But I am just wondering whether cpus_have_const_cap() would
be simpler, more consistent, and also cheaper?

Regardless, a comment is needed before the #ifdef block explaining why it does not make
sense to defer/batch when the __tlbi()/__tlbi_user() implementation will execute 'dsb(ish)'
between two TLBI instructions to work around the errata.

> +
> +	return true;
> +}
> +
> +static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
> +					struct mm_struct *mm,
> +					unsigned long uaddr)
> +{
> +	__flush_tlb_page_nosync(mm, uaddr);
> +}
> +
> +static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
> +{
> +	dsb(ish);
> +}

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v5 2/2] arm64: support batched/deferred tlb shootdown during page reclamation
  2022-10-28  8:12   ` Yicong Yang
  (?)
  (?)
@ 2022-11-14  8:00     ` haoxin
  -1 siblings, 0 replies; 36+ messages in thread
From: haoxin @ 2022-11-14  8:00 UTC (permalink / raw)
  To: Yicong Yang, akpm, linux-mm, linux-arm-kernel, x86,
	catalin.marinas, will, anshuman.khandual, linux-doc
  Cc: corbet, peterz, arnd, punit.agrawal, linux-kernel, darren,
	yangyicong, huzhanyuan, lipeifeng, zhangshiming, guojian,
	realmz6, linux-mips, openrisc, linuxppc-dev, linux-riscv,
	linux-s390, Barry Song, wangkefeng.wang, prime.zeng, Barry Song,
	Nadav Amit, Mel Gorman


On 2022/10/28 4:12 PM, Yicong Yang wrote:
> From: Barry Song <v-songbaohua@oppo.com>
>
> on x86, batched and deferred tlb shootdown has lead to 90%
> performance increase on tlb shootdown. on arm64, HW can do
> tlb shootdown without software IPI. But sync tlbi is still
> quite expensive.
>
> Even running a simplest program which requires swapout can
> prove this is true,
>   #include <sys/types.h>
>   #include <unistd.h>
>   #include <sys/mman.h>
>   #include <string.h>
>
>   int main()
>   {
>   #define SIZE (1 * 1024 * 1024)
>           volatile unsigned char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
>                                            MAP_SHARED | MAP_ANONYMOUS, -1, 0);
>
>           memset(p, 0x88, SIZE);
>
>           for (int k = 0; k < 10000; k++) {
>                   /* swap in */
>                   for (int i = 0; i < SIZE; i += 4096) {
>                           (void)p[i];
>                   }
>
>                   /* swap out */
>                   madvise(p, SIZE, MADV_PAGEOUT);
>           }
>   }
>
> Perf result on a Snapdragon 888 with 8 cores, using zRAM
> as the swap block device:
>
>   ~ # perf record taskset -c 4 ./a.out
>   [ perf record: Woken up 10 times to write data ]
>   [ perf record: Captured and wrote 2.297 MB perf.data (60084 samples) ]
>   ~ # perf report
>   # To display the perf.data header info, please use --header/--header-only options.
>   # To display the perf.data header info, please use --header/--header-only options.
>   #
>   #
>   # Total Lost Samples: 0
>   #
>   # Samples: 60K of event 'cycles'
>   # Event count (approx.): 35706225414
>   #
>   # Overhead  Command  Shared Object      Symbol
>   # ........  .......  .................  .............................................................................
>   #
>      21.07%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock_irq
>       8.23%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
>       6.67%  a.out    [kernel.kallsyms]  [k] filemap_map_pages
>       6.16%  a.out    [kernel.kallsyms]  [k] __zram_bvec_write
>       5.36%  a.out    [kernel.kallsyms]  [k] ptep_clear_flush
>       3.71%  a.out    [kernel.kallsyms]  [k] _raw_spin_lock
>       3.49%  a.out    [kernel.kallsyms]  [k] memset64
>       1.63%  a.out    [kernel.kallsyms]  [k] clear_page
>       1.42%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock
>       1.26%  a.out    [kernel.kallsyms]  [k] mod_zone_state.llvm.8525150236079521930
>       1.23%  a.out    [kernel.kallsyms]  [k] xas_load
>       1.15%  a.out    [kernel.kallsyms]  [k] zram_slot_lock
>
> ptep_clear_flush() takes 5.36% CPU in the micro-benchmark
> swapping in/out a page mapped by only one process. If the
> page is mapped by multiple processes, typically more than
> 100 on a phone, the overhead would be much higher, as the
> TLB flush has to run 100 times for a single page.
> In addition, TLB flush overhead increases with the number
> of CPU cores due to the poor scalability of TLB shootdown
> in HW, so ARM64 servers should expect much higher
> overhead.
>
> Further perf annotate output shows that 95% of the CPU time of
> ptep_clear_flush() is spent in the final dsb(), waiting for the
> TLB flush to complete. This gives us a good opportunity to leverage
> the existing batched TLB flush in the kernel. The minimal change
> is to send only the asynchronous tlbi in the first stage and to
> issue the dsb in the second stage, when we have to sync.
>
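A minimal userspace sketch of the two-stage idea described above. It is only a counting model: model_tlbi() and model_dsb() are hypothetical stand-ins for the tlbi and dsb(ish) instructions, and the extra dsb(ishst) and the second user-ASID tlbi in __flush_tlb_page_nosync() are deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

/* Counters standing in for hardware events; purely a userspace model. */
struct tlb_model {
	unsigned long tlbi_issued;	/* broadcast invalidations sent */
	unsigned long dsb_waits;	/* times the CPU waited for completion */
};

/* Stage 1: send an invalidation without waiting (models tlbi). */
void model_tlbi(struct tlb_model *m)
{
	assert(m != NULL);
	m->tlbi_issued++;
}

/* Stage 2: wait for everything sent so far (models dsb(ish)). */
void model_dsb(struct tlb_model *m)
{
	assert(m != NULL);
	m->dsb_waits++;
}

/* Unbatched path: each page pays for its own wait, like ptep_clear_flush(). */
void unmap_pages_sync(struct tlb_model *m, size_t npages)
{
	for (size_t i = 0; i < npages; i++) {
		model_tlbi(m);
		model_dsb(m);
	}
}

/* Batched path: send all invalidations first, then wait once. */
void unmap_pages_batched(struct tlb_model *m, size_t npages)
{
	for (size_t i = 0; i < npages; i++)
		model_tlbi(m);		/* arch_tlbbatch_add_mm() analogue */
	if (npages)
		model_dsb(m);		/* arch_tlbbatch_flush() analogue */
}
```

With 256 pages (1 MiB of 4 KiB pages, as in the benchmark), the unbatched path records 256 modelled waits and the batched path records one, while both send the same 256 invalidations. How many pages the kernel really accumulates before each flush depends on the reclaim path and is not modelled here.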
> With the simple micro-benchmark above, the elapsed time to
> finish the program decreases by around 5%.
>
> Typical elapsed time w/o patch:
>   ~ # time taskset -c 4 ./a.out
>   0.21user 14.34system 0:14.69elapsed
> w/ patch:
>   ~ # time taskset -c 4 ./a.out
>   0.22user 13.45system 0:13.80elapsed
>
> Also, Yicong Yang added the following observation.
> 	Tested with the benchmark from this commit on a Kunpeng920
> 	arm64 server; observed an improvement of around 12.5% with
> 	the command `time ./swap_bench`.
> 		w/o		w/
> 	real	0m13.460s	0m11.771s
> 	user	0m0.248s	0m0.279s
> 	sys	0m12.039s	0m11.458s
>
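For reference, the percentages quoted above can be rechecked from the timings with a small helper. This is only a sketch of the arithmetic; reduction_percent() is a hypothetical name and not part of the patch.

```c
#include <assert.h>

/*
 * Relative reduction in run time, in percent:
 * (before - after) / before * 100.
 */
double reduction_percent(double before_s, double after_s)
{
	assert(before_s > 0.0);
	return (before_s - after_s) / before_s * 100.0;
}
```

reduction_percent(13.460, 11.771) comes out at about 12.5, matching the Kunpeng920 "real" figures. reduction_percent(14.69, 13.80) comes out at about 6.1 for the Snapdragon 888 elapsed times, which is in the same range as the "around 5%" stated in the commit message.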
> 	Originally, a 16.99% overhead from ptep_clear_flush() was
> 	observed, which this patch eliminates:
>
> 	[root@localhost yang]# perf record -- ./swap_bench && perf report
> 	[...]
> 	16.99%  swap_bench  [kernel.kallsyms]  [k] ptep_clear_flush
>
> It has been tested on 4-, 8- and 128-CPU platforms and is beneficial
> on large systems, but may bring no improvement on small systems such
> as a 4-CPU platform. So make ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH depend
> on CONFIG_EXPERT at this stage and enable it only on systems
> with more than 8 CPUs. Users can adjust this threshold for
> their own platforms via CONFIG_ARM64_NR_CPUS_FOR_BATCHED_TLB.
>
> Cc: Anshuman Khandual <anshuman.khandual@arm.com>
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: Nadav Amit <namit@vmware.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Tested-by: Yicong Yang <yangyicong@hisilicon.com>
> Tested-by: Xin Hao <xhao@linux.alibaba.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
> ---
>   .../features/vm/TLB/arch-support.txt          |  2 +-
>   arch/arm64/Kconfig                            |  6 +++
>   arch/arm64/include/asm/tlbbatch.h             | 12 +++++
>   arch/arm64/include/asm/tlbflush.h             | 46 ++++++++++++++++++-
>   arch/x86/include/asm/tlbflush.h               |  3 +-
>   mm/rmap.c                                     | 10 ++--
>   6 files changed, 71 insertions(+), 8 deletions(-)
>   create mode 100644 arch/arm64/include/asm/tlbbatch.h
>
> diff --git a/Documentation/features/vm/TLB/arch-support.txt b/Documentation/features/vm/TLB/arch-support.txt
> index 039e4e91ada3..2caf815d7c6c 100644
> --- a/Documentation/features/vm/TLB/arch-support.txt
> +++ b/Documentation/features/vm/TLB/arch-support.txt
> @@ -9,7 +9,7 @@
>       |       alpha: | TODO |
>       |         arc: | TODO |
>       |         arm: | TODO |
> -    |       arm64: | N/A  |
> +    |       arm64: |  ok  |
>       |        csky: | TODO |
>       |     hexagon: | TODO |
>       |        ia64: | TODO |
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 505c8a1ccbe0..72975e82c7d7 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -93,6 +93,7 @@ config ARM64
>   	select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
>   	select ARCH_SUPPORTS_NUMA_BALANCING
>   	select ARCH_SUPPORTS_PAGE_TABLE_CHECK
> +	select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH if EXPERT
>   	select ARCH_WANT_COMPAT_IPC_PARSE_VERSION if COMPAT
>   	select ARCH_WANT_DEFAULT_BPF_JIT
>   	select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
> @@ -268,6 +269,11 @@ config ARM64_CONT_PMD_SHIFT
>   	default 5 if ARM64_16K_PAGES
>   	default 4
>   
> +config ARM64_NR_CPUS_FOR_BATCHED_TLB
> +	int "Threshold to enable batched TLB flush"
> +	default 8
> +	depends on ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
> +
>   config ARCH_MMAP_RND_BITS_MIN
>   	default 14 if ARM64_64K_PAGES
>   	default 16 if ARM64_16K_PAGES
> diff --git a/arch/arm64/include/asm/tlbbatch.h b/arch/arm64/include/asm/tlbbatch.h
> new file mode 100644
> index 000000000000..fedb0b87b8db
> --- /dev/null
> +++ b/arch/arm64/include/asm/tlbbatch.h
> @@ -0,0 +1,12 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _ARCH_ARM64_TLBBATCH_H
> +#define _ARCH_ARM64_TLBBATCH_H
> +
> +struct arch_tlbflush_unmap_batch {
> +	/*
> +	 * For arm64, HW can do tlb shootdown, so we don't
> +	 * need to record cpumask for sending IPI
> +	 */
> +};
> +
> +#endif /* _ARCH_ARM64_TLBBATCH_H */
> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
> index 412a3b9a3c25..b21cdeb57a18 100644
> --- a/arch/arm64/include/asm/tlbflush.h
> +++ b/arch/arm64/include/asm/tlbflush.h
> @@ -254,17 +254,23 @@ static inline void flush_tlb_mm(struct mm_struct *mm)
>   	dsb(ish);
>   }
>   
> -static inline void flush_tlb_page_nosync(struct vm_area_struct *vma,
> +static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
>   					 unsigned long uaddr)
>   {
>   	unsigned long addr;
>   
>   	dsb(ishst);
> -	addr = __TLBI_VADDR(uaddr, ASID(vma->vm_mm));
> +	addr = __TLBI_VADDR(uaddr, ASID(mm));
>   	__tlbi(vale1is, addr);
>   	__tlbi_user(vale1is, addr);
>   }
>   
> +static inline void flush_tlb_page_nosync(struct vm_area_struct *vma,
> +					 unsigned long uaddr)
> +{
> +	return __flush_tlb_page_nosync(vma->vm_mm, uaddr);
> +}
> +
>   static inline void flush_tlb_page(struct vm_area_struct *vma,
>   				  unsigned long uaddr)
>   {
> @@ -272,6 +278,42 @@ static inline void flush_tlb_page(struct vm_area_struct *vma,
>   	dsb(ish);
>   }
>   
> +#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
> +
> +static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
> +{
> +	/*
> +	 * Batched TLB flush has proved beneficial on systems with a large
> +	 * number of CPUs, especially those with more than 8. TLB shootdown
> +	 * is cheap on small systems, which may not need this feature. So use
> +	 * a threshold for enabling it, to avoid potential side effects on
> +	 * those platforms.
> +	 */
> +	if (num_online_cpus() <= CONFIG_ARM64_NR_CPUS_FOR_BATCHED_TLB)
> +		return false;
> +
> +#ifdef CONFIG_ARM64_WORKAROUND_REPEAT_TLBI
> +	if (unlikely(this_cpu_has_cap(ARM64_WORKAROUND_REPEAT_TLBI)))
Maybe this needs a comment explaining why it should return false when
ARM64_WORKAROUND_REPEAT_TLBI is set?
> +		return false;
> +#endif
> +
> +	return true;
> +}
> +
> +static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
> +					struct mm_struct *mm,
> +					unsigned long uaddr)
> +{
> +	__flush_tlb_page_nosync(mm, uaddr);
> +}
> +
> +static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
> +{
> +	dsb(ish);
> +}
> +
> +#endif /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */
> +
>   /*
>    * This is meant to avoid soft lock-ups on large TLB flushing ranges and not
>    * necessarily a performance improvement.
> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> index 8a497d902c16..5bd78ae55cd4 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -264,7 +264,8 @@ static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
>   }
>   
>   static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
> -					struct mm_struct *mm)
> +					struct mm_struct *mm,
> +					unsigned long uaddr)
>   {
>   	inc_mm_tlb_gen(mm);
>   	cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
> diff --git a/mm/rmap.c b/mm/rmap.c
> index a9ab10bc0144..a1b408ff44e5 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -640,12 +640,13 @@ void try_to_unmap_flush_dirty(void)
>   #define TLB_FLUSH_BATCH_PENDING_LARGE			\
>   	(TLB_FLUSH_BATCH_PENDING_MASK / 2)
>   
> -static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
> +static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable,
> +				      unsigned long uaddr)
>   {
>   	struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
>   	int batch, nbatch;
>   
> -	arch_tlbbatch_add_mm(&tlb_ubc->arch, mm);
> +	arch_tlbbatch_add_mm(&tlb_ubc->arch, mm, uaddr);
>   	tlb_ubc->flush_required = true;
>   
>   	/*
> @@ -723,7 +724,8 @@ void flush_tlb_batched_pending(struct mm_struct *mm)
>   	}
>   }
>   #else
> -static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
> +static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable,
> +				      unsigned long uaddr)
>   {
>   }
>   
> @@ -1596,7 +1598,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>   				 */
>   				pteval = ptep_get_and_clear(mm, address, pvmw.pte);
>   
> -				set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
> +				set_tlb_ubc_flush_pending(mm, pte_dirty(pteval), address);
>   			} else {
>   				pteval = ptep_clear_flush(vma, address, pvmw.pte);
>   			}
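
To make the decision in arch_tlbbatch_should_defer() easier to discuss, here is a small userspace sketch of the same predicate. NR_CPUS_FOR_BATCHED_TLB is a hypothetical stand-in for CONFIG_ARM64_NR_CPUS_FOR_BATCHED_TLB, and the two parameters replace num_online_cpus() and this_cpu_has_cap(). The reason given in the comment for the workaround case is my assumption and is not stated in the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for CONFIG_ARM64_NR_CPUS_FOR_BATCHED_TLB (default 8). */
#define NR_CPUS_FOR_BATCHED_TLB 8

/*
 * Mirror of the checks in the hunk: defer only above the CPU threshold
 * and only when the repeat-TLBI workaround is not active.
 */
bool should_defer(unsigned int online_cpus, bool repeat_tlbi_workaround)
{
	assert(online_cpus > 0);

	/* "<=" means a system with exactly the threshold count does not defer. */
	if (online_cpus <= NR_CPUS_FOR_BATCHED_TLB)
		return false;

	/*
	 * Assumption: with the workaround each tlbi is already followed by
	 * a dsb and a repeated tlbi, so deferring the final dsb saves little.
	 */
	if (repeat_tlbi_workaround)
		return false;

	return true;
}
```

One consequence of the "<=" comparison: with the default of 8, an 8-core system such as the Snapdragon 888 used for the benchmark would not take the batched path unless the threshold is lowered.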

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v5 2/2] arm64: support batched/deferred tlb shootdown during page reclamation
@ 2022-11-14  8:00     ` haoxin
  0 siblings, 0 replies; 36+ messages in thread
From: haoxin @ 2022-11-14  8:00 UTC (permalink / raw)
  To: Yicong Yang, akpm, linux-mm, linux-arm-kernel, x86,
	catalin.marinas, will, anshuman.khandual, linux-doc
  Cc: wangkefeng.wang, darren, peterz, yangyicong, punit.agrawal,
	Nadav Amit, guojian, linux-riscv, linux-s390, zhangshiming,
	lipeifeng, corbet, Barry Song, Mel Gorman, linux-mips, arnd,
	realmz6, Barry Song, openrisc, prime.zeng, linux-kernel,
	huzhanyuan, linuxppc-dev


On 2022/10/28 at 4:12 PM, Yicong Yang wrote:
> From: Barry Song <v-songbaohua@oppo.com>
>
> On x86, batched and deferred TLB shootdown has led to a 90%
> performance increase on TLB shootdown. On arm64, HW can do
> TLB shootdown without software IPIs, but a synchronous tlbi is
> still quite expensive.
>
> Even the simplest program that requires swap-out demonstrates
> this:
>   #include <sys/types.h>
>   #include <unistd.h>
>   #include <sys/mman.h>
>   #include <string.h>
>
>   int main()
>   {
>   #define SIZE (1 * 1024 * 1024)
>           volatile unsigned char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
>                                            MAP_SHARED | MAP_ANONYMOUS, -1, 0);
>
>           memset(p, 0x88, SIZE);
>
>           for (int k = 0; k < 10000; k++) {
>                   /* swap in */
>                   for (int i = 0; i < SIZE; i += 4096) {
>                           (void)p[i];
>                   }
>
>                   /* swap out */
>                   madvise(p, SIZE, MADV_PAGEOUT);
>           }
>   }
>
> Perf result on a Snapdragon 888 with 8 cores, using zRAM
> as the swap block device:
>
>   ~ # perf record taskset -c 4 ./a.out
>   [ perf record: Woken up 10 times to write data ]
>   [ perf record: Captured and wrote 2.297 MB perf.data (60084 samples) ]
>   ~ # perf report
>   # To display the perf.data header info, please use --header/--header-only options.
>   # To display the perf.data header info, please use --header/--header-only options.
>   #
>   #
>   # Total Lost Samples: 0
>   #
>   # Samples: 60K of event 'cycles'
>   # Event count (approx.): 35706225414
>   #
>   # Overhead  Command  Shared Object      Symbol
>   # ........  .......  .................  .............................................................................
>   #
>      21.07%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock_irq
>       8.23%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
>       6.67%  a.out    [kernel.kallsyms]  [k] filemap_map_pages
>       6.16%  a.out    [kernel.kallsyms]  [k] __zram_bvec_write
>       5.36%  a.out    [kernel.kallsyms]  [k] ptep_clear_flush
>       3.71%  a.out    [kernel.kallsyms]  [k] _raw_spin_lock
>       3.49%  a.out    [kernel.kallsyms]  [k] memset64
>       1.63%  a.out    [kernel.kallsyms]  [k] clear_page
>       1.42%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock
>       1.26%  a.out    [kernel.kallsyms]  [k] mod_zone_state.llvm.8525150236079521930
>       1.23%  a.out    [kernel.kallsyms]  [k] xas_load
>       1.15%  a.out    [kernel.kallsyms]  [k] zram_slot_lock
>
> ptep_clear_flush() takes 5.36% CPU in the micro-benchmark
> swapping in/out a page mapped by only one process. If the
> page is mapped by multiple processes, typically more than
> 100 on a phone, the overhead would be much higher, as the
> TLB flush has to run 100 times for a single page.
> In addition, TLB flush overhead increases with the number
> of CPU cores due to the poor scalability of TLB shootdown
> in HW, so ARM64 servers should expect much higher
> overhead.
>
> Further perf annotate output shows that 95% of the CPU time of
> ptep_clear_flush() is spent in the final dsb(), waiting for the
> TLB flush to complete. This gives us a good opportunity to leverage
> the existing batched TLB flush in the kernel. The minimal change
> is to send only the asynchronous tlbi in the first stage and to
> issue the dsb in the second stage, when we have to sync.
>
> With the simple micro-benchmark above, the elapsed time to
> finish the program decreases by around 5%.
>
> Typical elapsed time w/o patch:
>   ~ # time taskset -c 4 ./a.out
>   0.21user 14.34system 0:14.69elapsed
> w/ patch:
>   ~ # time taskset -c 4 ./a.out
>   0.22user 13.45system 0:13.80elapsed
>
> Also, Yicong Yang added the following observation.
> 	Tested with the benchmark from this commit on a Kunpeng920
> 	arm64 server; observed an improvement of around 12.5% with
> 	the command `time ./swap_bench`.
> 		w/o		w/
> 	real	0m13.460s	0m11.771s
> 	user	0m0.248s	0m0.279s
> 	sys	0m12.039s	0m11.458s
>
> 	Originally, a 16.99% overhead from ptep_clear_flush() was
> 	observed, which this patch eliminates:
>
> 	[root@localhost yang]# perf record -- ./swap_bench && perf report
> 	[...]
> 	16.99%  swap_bench  [kernel.kallsyms]  [k] ptep_clear_flush
>
> It has been tested on 4-, 8- and 128-CPU platforms and is beneficial
> on large systems, but may bring no improvement on small systems such
> as a 4-CPU platform. So make ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH depend
> on CONFIG_EXPERT at this stage and enable it only on systems
> with more than 8 CPUs. Users can adjust this threshold for
> their own platforms via CONFIG_ARM64_NR_CPUS_FOR_BATCHED_TLB.
>
> Cc: Anshuman Khandual <anshuman.khandual@arm.com>
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: Nadav Amit <namit@vmware.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Tested-by: Yicong Yang <yangyicong@hisilicon.com>
> Tested-by: Xin Hao <xhao@linux.alibaba.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
> ---
>   .../features/vm/TLB/arch-support.txt          |  2 +-
>   arch/arm64/Kconfig                            |  6 +++
>   arch/arm64/include/asm/tlbbatch.h             | 12 +++++
>   arch/arm64/include/asm/tlbflush.h             | 46 ++++++++++++++++++-
>   arch/x86/include/asm/tlbflush.h               |  3 +-
>   mm/rmap.c                                     | 10 ++--
>   6 files changed, 71 insertions(+), 8 deletions(-)
>   create mode 100644 arch/arm64/include/asm/tlbbatch.h
>
> diff --git a/Documentation/features/vm/TLB/arch-support.txt b/Documentation/features/vm/TLB/arch-support.txt
> index 039e4e91ada3..2caf815d7c6c 100644
> --- a/Documentation/features/vm/TLB/arch-support.txt
> +++ b/Documentation/features/vm/TLB/arch-support.txt
> @@ -9,7 +9,7 @@
>       |       alpha: | TODO |
>       |         arc: | TODO |
>       |         arm: | TODO |
> -    |       arm64: | N/A  |
> +    |       arm64: |  ok  |
>       |        csky: | TODO |
>       |     hexagon: | TODO |
>       |        ia64: | TODO |
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 505c8a1ccbe0..72975e82c7d7 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -93,6 +93,7 @@ config ARM64
>   	select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
>   	select ARCH_SUPPORTS_NUMA_BALANCING
>   	select ARCH_SUPPORTS_PAGE_TABLE_CHECK
> +	select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH if EXPERT
>   	select ARCH_WANT_COMPAT_IPC_PARSE_VERSION if COMPAT
>   	select ARCH_WANT_DEFAULT_BPF_JIT
>   	select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
> @@ -268,6 +269,11 @@ config ARM64_CONT_PMD_SHIFT
>   	default 5 if ARM64_16K_PAGES
>   	default 4
>   
> +config ARM64_NR_CPUS_FOR_BATCHED_TLB
> +	int "Threshold to enable batched TLB flush"
> +	default 8
> +	depends on ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
> +
>   config ARCH_MMAP_RND_BITS_MIN
>   	default 14 if ARM64_64K_PAGES
>   	default 16 if ARM64_16K_PAGES
> diff --git a/arch/arm64/include/asm/tlbbatch.h b/arch/arm64/include/asm/tlbbatch.h
> new file mode 100644
> index 000000000000..fedb0b87b8db
> --- /dev/null
> +++ b/arch/arm64/include/asm/tlbbatch.h
> @@ -0,0 +1,12 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _ARCH_ARM64_TLBBATCH_H
> +#define _ARCH_ARM64_TLBBATCH_H
> +
> +struct arch_tlbflush_unmap_batch {
> +	/*
> +	 * For arm64, HW can do tlb shootdown, so we don't
> +	 * need to record cpumask for sending IPI
> +	 */
> +};
> +
> +#endif /* _ARCH_ARM64_TLBBATCH_H */
> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
> index 412a3b9a3c25..b21cdeb57a18 100644
> --- a/arch/arm64/include/asm/tlbflush.h
> +++ b/arch/arm64/include/asm/tlbflush.h
> @@ -254,17 +254,23 @@ static inline void flush_tlb_mm(struct mm_struct *mm)
>   	dsb(ish);
>   }
>   
> -static inline void flush_tlb_page_nosync(struct vm_area_struct *vma,
> +static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
>   					 unsigned long uaddr)
>   {
>   	unsigned long addr;
>   
>   	dsb(ishst);
> -	addr = __TLBI_VADDR(uaddr, ASID(vma->vm_mm));
> +	addr = __TLBI_VADDR(uaddr, ASID(mm));
>   	__tlbi(vale1is, addr);
>   	__tlbi_user(vale1is, addr);
>   }
>   
> +static inline void flush_tlb_page_nosync(struct vm_area_struct *vma,
> +					 unsigned long uaddr)
> +{
> +	return __flush_tlb_page_nosync(vma->vm_mm, uaddr);
> +}
> +
>   static inline void flush_tlb_page(struct vm_area_struct *vma,
>   				  unsigned long uaddr)
>   {
> @@ -272,6 +278,42 @@ static inline void flush_tlb_page(struct vm_area_struct *vma,
>   	dsb(ish);
>   }
>   
> +#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
> +
> +static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
> +{
> +	/*
> +	 * TLB batched flush is proved to be beneficial for systems with large
> +	 * number of CPUs, especially system with more than 8 CPUs. TLB shutdown
> +	 * is cheap on small systems which may not need this feature. So use
> +	 * a threshold for enabling this to avoid potential side effects on
> +	 * these platforms.
> +	 */
> +	if (num_online_cpus() <= CONFIG_ARM64_NR_CPUS_FOR_BATCHED_TLB)
> +		return false;
> +
> +#ifdef CONFIG_ARM64_WORKAROUND_REPEAT_TLBI
> +	if (unlikely(this_cpu_has_cap(ARM64_WORKAROUND_REPEAT_TLBI)))
Maybe a comment is needed here to explain why this should return false
when ARM64_WORKAROUND_REPEAT_TLBI is set?
> +		return false;
> +#endif
> +
> +	return true;
> +}
> +
> +static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
> +					struct mm_struct *mm,
> +					unsigned long uaddr)
> +{
> +	__flush_tlb_page_nosync(mm, uaddr);
> +}
> +
> +static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
> +{
> +	dsb(ish);
> +}
> +
> +#endif /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */
> +
>   /*
>    * This is meant to avoid soft lock-ups on large TLB flushing ranges and not
>    * necessarily a performance improvement.
> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> index 8a497d902c16..5bd78ae55cd4 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -264,7 +264,8 @@ static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
>   }
>   
>   static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
> -					struct mm_struct *mm)
> +					struct mm_struct *mm,
> +					unsigned long uaddr)
>   {
>   	inc_mm_tlb_gen(mm);
>   	cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
> diff --git a/mm/rmap.c b/mm/rmap.c
> index a9ab10bc0144..a1b408ff44e5 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -640,12 +640,13 @@ void try_to_unmap_flush_dirty(void)
>   #define TLB_FLUSH_BATCH_PENDING_LARGE			\
>   	(TLB_FLUSH_BATCH_PENDING_MASK / 2)
>   
> -static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
> +static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable,
> +				      unsigned long uaddr)
>   {
>   	struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
>   	int batch, nbatch;
>   
> -	arch_tlbbatch_add_mm(&tlb_ubc->arch, mm);
> +	arch_tlbbatch_add_mm(&tlb_ubc->arch, mm, uaddr);
>   	tlb_ubc->flush_required = true;
>   
>   	/*
> @@ -723,7 +724,8 @@ void flush_tlb_batched_pending(struct mm_struct *mm)
>   	}
>   }
>   #else
> -static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
> +static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable,
> +				      unsigned long uaddr)
>   {
>   }
>   
> @@ -1596,7 +1598,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>   				 */
>   				pteval = ptep_get_and_clear(mm, address, pvmw.pte);
>   
> -				set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
> +				set_tlb_ubc_flush_pending(mm, pte_dirty(pteval), address);
>   			} else {
>   				pteval = ptep_clear_flush(vma, address, pvmw.pte);
>   			}

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v5 2/2] arm64: support batched/deferred tlb shootdown during page reclamation
@ 2022-11-14  8:00     ` haoxin
  0 siblings, 0 replies; 36+ messages in thread
From: haoxin @ 2022-11-14  8:00 UTC (permalink / raw)
  To: Yicong Yang, akpm, linux-mm, linux-arm-kernel, x86,
	catalin.marinas, will, anshuman.khandual, linux-doc
  Cc: corbet, peterz, arnd, punit.agrawal, linux-kernel, darren,
	yangyicong, huzhanyuan, lipeifeng, zhangshiming, guojian,
	realmz6, linux-mips, openrisc, linuxppc-dev, linux-riscv,
	linux-s390, Barry Song, wangkefeng.wang, prime.zeng, Barry Song,
	Nadav Amit, Mel Gorman


On 2022/10/28 at 4:12 PM, Yicong Yang wrote:
> From: Barry Song <v-songbaohua@oppo.com>
>
> On x86, batched and deferred tlb shootdown has led to a 90%
> performance increase on tlb shootdown. On arm64, HW can do
> tlb shootdown without software IPI, but sync tlbi is still
> quite expensive.
>
> Even running the simplest program that requires swapout can
> prove this:
>   #include <sys/types.h>
>   #include <unistd.h>
>   #include <sys/mman.h>
>   #include <string.h>
>
>   int main()
>   {
>   #define SIZE (1 * 1024 * 1024)
>           volatile unsigned char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
>                                            MAP_SHARED | MAP_ANONYMOUS, -1, 0);
>
>           memset(p, 0x88, SIZE);
>
>           for (int k = 0; k < 10000; k++) {
>                   /* swap in */
>                   for (int i = 0; i < SIZE; i += 4096) {
>                           (void)p[i];
>                   }
>
>                   /* swap out */
>                   madvise(p, SIZE, MADV_PAGEOUT);
>           }
>   }
>
> Perf result on snapdragon 888 with 8 cores by using zRAM
> as the swap block device.
>
>   ~ # perf record taskset -c 4 ./a.out
>   [ perf record: Woken up 10 times to write data ]
>   [ perf record: Captured and wrote 2.297 MB perf.data (60084 samples) ]
>   ~ # perf report
>   # To display the perf.data header info, please use --header/--header-only options.
>   # To display the perf.data header info, please use --header/--header-only options.
>   #
>   #
>   # Total Lost Samples: 0
>   #
>   # Samples: 60K of event 'cycles'
>   # Event count (approx.): 35706225414
>   #
>   # Overhead  Command  Shared Object      Symbol
>   # ........  .......  .................  .............................................................................
>   #
>      21.07%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock_irq
>       8.23%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
>       6.67%  a.out    [kernel.kallsyms]  [k] filemap_map_pages
>       6.16%  a.out    [kernel.kallsyms]  [k] __zram_bvec_write
>       5.36%  a.out    [kernel.kallsyms]  [k] ptep_clear_flush
>       3.71%  a.out    [kernel.kallsyms]  [k] _raw_spin_lock
>       3.49%  a.out    [kernel.kallsyms]  [k] memset64
>       1.63%  a.out    [kernel.kallsyms]  [k] clear_page
>       1.42%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock
>       1.26%  a.out    [kernel.kallsyms]  [k] mod_zone_state.llvm.8525150236079521930
>       1.23%  a.out    [kernel.kallsyms]  [k] xas_load
>       1.15%  a.out    [kernel.kallsyms]  [k] zram_slot_lock
>
> ptep_clear_flush() takes 5.36% CPU in the micro-benchmark
> swapping in/out a page mapped by only one process. If the
> page is mapped by multiple processes, typically, like more
> than 100 on a phone, the overhead would be much higher as
> we have to run tlb flush 100 times for one single page.
> Plus, tlb flush overhead will increase with the number
> of CPU cores due to the bad scalability of tlb shootdown
> in HW, so those ARM64 servers should expect much higher
> overhead.
>
> Further perf annotate shows that 95% of the CPU time of ptep_clear_flush
> is actually spent in the final dsb() waiting for the completion
> of the tlb flush. This gives us a very good chance to leverage
> the existing batched tlb flush in the kernel. The minimum modification
> is that we only send async tlbi in the first stage and issue the
> dsb when we have to sync in the second stage.
>
> With the above simplest micro benchmark, elapsed time to
> finish the program decreases around 5%.
>
> Typical elapsed time w/o patch:
>   ~ # time taskset -c 4 ./a.out
>   0.21user 14.34system 0:14.69elapsed
> w/ patch:
>   ~ # time taskset -c 4 ./a.out
>   0.22user 13.45system 0:13.80elapsed
>
> Also, Yicong Yang added the following observation.
> 	Tested with benchmark in the commit on Kunpeng920 arm64 server,
> 	observed an improvement around 12.5% with command
> 	`time ./swap_bench`.
> 		w/o		w/
> 	real	0m13.460s	0m11.771s
> 	user	0m0.248s	0m0.279s
> 	sys	0m12.039s	0m11.458s
>
> 	Originally it's noticed a 16.99% overhead of ptep_clear_flush()
> 	which has been eliminated by this patch:
>
> 	[root@localhost yang]# perf record -- ./swap_bench && perf report
> 	[...]
> 	16.99%  swap_bench  [kernel.kallsyms]  [k] ptep_clear_flush
>
> It is tested on 4, 8 and 128 CPU platforms and is shown to be beneficial on
> large systems, but may not bring an improvement on small systems such as
> a 4 CPU platform. So make ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH depend
> on CONFIG_EXPERT for this stage and only enable it on systems
> with more than 8 CPUs. Users can modify this threshold according to
> their own platforms via CONFIG_ARM64_NR_CPUS_FOR_BATCHED_TLB.
>
> Cc: Anshuman Khandual <anshuman.khandual@arm.com>
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: Nadav Amit <namit@vmware.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Tested-by: Yicong Yang <yangyicong@hisilicon.com>
> Tested-by: Xin Hao <xhao@linux.alibaba.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
> ---
>   .../features/vm/TLB/arch-support.txt          |  2 +-
>   arch/arm64/Kconfig                            |  6 +++
>   arch/arm64/include/asm/tlbbatch.h             | 12 +++++
>   arch/arm64/include/asm/tlbflush.h             | 46 ++++++++++++++++++-
>   arch/x86/include/asm/tlbflush.h               |  3 +-
>   mm/rmap.c                                     | 10 ++--
>   6 files changed, 71 insertions(+), 8 deletions(-)
>   create mode 100644 arch/arm64/include/asm/tlbbatch.h
>
> diff --git a/Documentation/features/vm/TLB/arch-support.txt b/Documentation/features/vm/TLB/arch-support.txt
> index 039e4e91ada3..2caf815d7c6c 100644
> --- a/Documentation/features/vm/TLB/arch-support.txt
> +++ b/Documentation/features/vm/TLB/arch-support.txt
> @@ -9,7 +9,7 @@
>       |       alpha: | TODO |
>       |         arc: | TODO |
>       |         arm: | TODO |
> -    |       arm64: | N/A  |
> +    |       arm64: |  ok  |
>       |        csky: | TODO |
>       |     hexagon: | TODO |
>       |        ia64: | TODO |
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 505c8a1ccbe0..72975e82c7d7 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -93,6 +93,7 @@ config ARM64
>   	select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
>   	select ARCH_SUPPORTS_NUMA_BALANCING
>   	select ARCH_SUPPORTS_PAGE_TABLE_CHECK
> +	select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH if EXPERT
>   	select ARCH_WANT_COMPAT_IPC_PARSE_VERSION if COMPAT
>   	select ARCH_WANT_DEFAULT_BPF_JIT
>   	select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
> @@ -268,6 +269,11 @@ config ARM64_CONT_PMD_SHIFT
>   	default 5 if ARM64_16K_PAGES
>   	default 4
>   
> +config ARM64_NR_CPUS_FOR_BATCHED_TLB
> +	int "Threshold to enable batched TLB flush"
> +	default 8
> +	depends on ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
> +
>   config ARCH_MMAP_RND_BITS_MIN
>   	default 14 if ARM64_64K_PAGES
>   	default 16 if ARM64_16K_PAGES
> diff --git a/arch/arm64/include/asm/tlbbatch.h b/arch/arm64/include/asm/tlbbatch.h
> new file mode 100644
> index 000000000000..fedb0b87b8db
> --- /dev/null
> +++ b/arch/arm64/include/asm/tlbbatch.h
> @@ -0,0 +1,12 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _ARCH_ARM64_TLBBATCH_H
> +#define _ARCH_ARM64_TLBBATCH_H
> +
> +struct arch_tlbflush_unmap_batch {
> +	/*
> +	 * For arm64, HW can do tlb shootdown, so we don't
> +	 * need to record cpumask for sending IPI
> +	 */
> +};
> +
> +#endif /* _ARCH_ARM64_TLBBATCH_H */
> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
> index 412a3b9a3c25..b21cdeb57a18 100644
> --- a/arch/arm64/include/asm/tlbflush.h
> +++ b/arch/arm64/include/asm/tlbflush.h
> @@ -254,17 +254,23 @@ static inline void flush_tlb_mm(struct mm_struct *mm)
>   	dsb(ish);
>   }
>   
> -static inline void flush_tlb_page_nosync(struct vm_area_struct *vma,
> +static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
>   					 unsigned long uaddr)
>   {
>   	unsigned long addr;
>   
>   	dsb(ishst);
> -	addr = __TLBI_VADDR(uaddr, ASID(vma->vm_mm));
> +	addr = __TLBI_VADDR(uaddr, ASID(mm));
>   	__tlbi(vale1is, addr);
>   	__tlbi_user(vale1is, addr);
>   }
>   
> +static inline void flush_tlb_page_nosync(struct vm_area_struct *vma,
> +					 unsigned long uaddr)
> +{
> +	return __flush_tlb_page_nosync(vma->vm_mm, uaddr);
> +}
> +
>   static inline void flush_tlb_page(struct vm_area_struct *vma,
>   				  unsigned long uaddr)
>   {
> @@ -272,6 +278,42 @@ static inline void flush_tlb_page(struct vm_area_struct *vma,
>   	dsb(ish);
>   }
>   
> +#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
> +
> +static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
> +{
> +	/*
> +	 * TLB batched flush is proved to be beneficial for systems with large
> +	 * number of CPUs, especially system with more than 8 CPUs. TLB shutdown
> +	 * is cheap on small systems which may not need this feature. So use
> +	 * a threshold for enabling this to avoid potential side effects on
> +	 * these platforms.
> +	 */
> +	if (num_online_cpus() <= CONFIG_ARM64_NR_CPUS_FOR_BATCHED_TLB)
> +		return false;
> +
> +#ifdef CONFIG_ARM64_WORKAROUND_REPEAT_TLBI
> +	if (unlikely(this_cpu_has_cap(ARM64_WORKAROUND_REPEAT_TLBI)))
Maybe a comment is needed here to explain why this should return false
when ARM64_WORKAROUND_REPEAT_TLBI is set?
> +		return false;
> +#endif
> +
> +	return true;
> +}
> +
> +static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
> +					struct mm_struct *mm,
> +					unsigned long uaddr)
> +{
> +	__flush_tlb_page_nosync(mm, uaddr);
> +}
> +
> +static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
> +{
> +	dsb(ish);
> +}
> +
> +#endif /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */
> +
>   /*
>    * This is meant to avoid soft lock-ups on large TLB flushing ranges and not
>    * necessarily a performance improvement.
> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> index 8a497d902c16..5bd78ae55cd4 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -264,7 +264,8 @@ static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
>   }
>   
>   static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
> -					struct mm_struct *mm)
> +					struct mm_struct *mm,
> +					unsigned long uaddr)
>   {
>   	inc_mm_tlb_gen(mm);
>   	cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
> diff --git a/mm/rmap.c b/mm/rmap.c
> index a9ab10bc0144..a1b408ff44e5 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -640,12 +640,13 @@ void try_to_unmap_flush_dirty(void)
>   #define TLB_FLUSH_BATCH_PENDING_LARGE			\
>   	(TLB_FLUSH_BATCH_PENDING_MASK / 2)
>   
> -static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
> +static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable,
> +				      unsigned long uaddr)
>   {
>   	struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
>   	int batch, nbatch;
>   
> -	arch_tlbbatch_add_mm(&tlb_ubc->arch, mm);
> +	arch_tlbbatch_add_mm(&tlb_ubc->arch, mm, uaddr);
>   	tlb_ubc->flush_required = true;
>   
>   	/*
> @@ -723,7 +724,8 @@ void flush_tlb_batched_pending(struct mm_struct *mm)
>   	}
>   }
>   #else
> -static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
> +static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable,
> +				      unsigned long uaddr)
>   {
>   }
>   
> @@ -1596,7 +1598,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>   				 */
>   				pteval = ptep_get_and_clear(mm, address, pvmw.pte);
>   
> -				set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
> +				set_tlb_ubc_flush_pending(mm, pte_dirty(pteval), address);
>   			} else {
>   				pteval = ptep_clear_flush(vma, address, pvmw.pte);
>   			}
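
For readers following the two-stage scheme in the quoted patch, the flow can
be sketched in user space roughly as below. This is only a model under stated
assumptions: struct flush_stats, struct batch_model and the unmap_*() helpers
are hypothetical names that count events instead of executing instructions,
and the dsb(ishst) issued before each tlbi is left out because both paths pay
for it equally.

```c
#include <stdbool.h>

/* Hypothetical counters standing in for the real instructions. */
struct flush_stats {
	unsigned long tlbi_issued;	/* broadcast invalidations sent */
	unsigned long dsb_waits;	/* dsb(ish) completion waits */
};

/* Hypothetical stand-in for the per-task batch state kept by mm/rmap.c. */
struct batch_model {
	bool flush_required;
};

/* Stage 1, modelled on arch_tlbbatch_add_mm(): invalidate, do not wait. */
static void batch_add_page(struct batch_model *batch, struct flush_stats *st)
{
	st->tlbi_issued++;
	batch->flush_required = true;
}

/*
 * Stage 2, modelled on try_to_unmap_flush() calling arch_tlbbatch_flush():
 * a single wait covers every invalidation queued so far.
 */
static void batch_flush(struct batch_model *batch, struct flush_stats *st)
{
	if (!batch->flush_required)
		return;

	st->dsb_waits++;
	batch->flush_required = false;
}

/* Unmap nr_pages the synchronous way: invalidate and wait per page. */
static struct flush_stats unmap_sync(unsigned long nr_pages)
{
	struct flush_stats st = { 0, 0 };

	for (unsigned long i = 0; i < nr_pages; i++) {
		st.tlbi_issued++;
		st.dsb_waits++;
	}
	return st;
}

/* Unmap nr_pages through the batch: invalidate per page, wait once. */
static struct flush_stats unmap_batched(unsigned long nr_pages)
{
	struct flush_stats st = { 0, 0 };
	struct batch_model batch = { false };

	for (unsigned long i = 0; i < nr_pages; i++)
		batch_add_page(&batch, &st);
	batch_flush(&batch, &st);
	return st;
}
```

For the 1 MiB benchmark region with 4 KiB pages, unmap_sync(256) and
unmap_batched(256) issue the same number of invalidations and differ only in
how many times the CPU stops to wait for them, which is where the commit
message says most of the ptep_clear_flush() time goes.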

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v5 2/2] arm64: support batched/deferred tlb shootdown during page reclamation
  2022-11-14  3:29     ` Anshuman Khandual
  (?)
  (?)
@ 2022-11-14  8:46       ` Yicong Yang
  -1 siblings, 0 replies; 36+ messages in thread
From: Yicong Yang @ 2022-11-14  8:46 UTC (permalink / raw)
  To: Anshuman Khandual, akpm, linux-mm, linux-arm-kernel, x86,
	catalin.marinas, will, linux-doc
  Cc: yangyicong, corbet, peterz, arnd, punit.agrawal, linux-kernel,
	darren, huzhanyuan, lipeifeng, zhangshiming, guojian, realmz6,
	linux-mips, openrisc, linuxppc-dev, linux-riscv, linux-s390,
	Barry Song, wangkefeng.wang, xhao, prime.zeng, Barry Song,
	Nadav Amit, Mel Gorman

On 2022/11/14 11:29, Anshuman Khandual wrote:
> 
> 
> On 10/28/22 13:42, Yicong Yang wrote:
>> +static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
>> +{
>> +	/*
>> +	 * TLB batched flush is proved to be beneficial for systems with large
>> +	 * number of CPUs, especially system with more than 8 CPUs. TLB shutdown
>> +	 * is cheap on small systems which may not need this feature. So use
>> +	 * a threshold for enabling this to avoid potential side effects on
>> +	 * these platforms.
>> +	 */
>> +	if (num_online_cpus() <= CONFIG_ARM64_NR_CPUS_FOR_BATCHED_TLB)
>> +		return false;
>> +
>> +#ifdef CONFIG_ARM64_WORKAROUND_REPEAT_TLBI
>> +	if (unlikely(this_cpu_has_cap(ARM64_WORKAROUND_REPEAT_TLBI)))
>> +		return false;
>> +#endif
> 
> should_defer_flush() is immediately followed by set_tlb_ubc_flush_pending() which calls
> arch_tlbbatch_add_mm(), triggering the actual TLBI flush via __flush_tlb_page_nosync().
> It should be okay to check capability with this_cpu_has_cap() as the entire call chain
> here is executed on the same cpu. But just wondering if cpus_have_const_cap() would be
> simpler, consistent, and also cost effective ?
> 

OK. I checked cpus_have_const_cap() and I think it matches your description.

> Regardless, a comment is needed before the #ifdef block explaining why it does not make
> sense to defer/batch when __tlbi()/__tlbi_user() implementation will execute 'dsb(ish)'
> between two TLBI instructions to workaround the errata.
> 

The workaround for the erratum says the affected platforms need the tlbi+dsb sequence to be
done twice. I'm not sure whether deferring the final dsb would cause any problem, so the check
here is for safety. I have no such platform to test whether it's OK to defer the last dsb.
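
A rough way to see what is at stake is to count dsb(ish) barriers. The sketch
below is a simplified user-space model, not kernel code: the function names
are hypothetical, it assumes one TLBI per page (ignoring the second
invalidation done by __tlbi_user()), and it takes the "tlbi; dsb ish; tlbi"
sequence for affected parts from the review comment quoted above.

```c
#include <stdbool.h>

/*
 * Number of dsb(ish) barriers executed inside one modelled __tlbi().
 * With the REPEAT_TLBI workaround the sequence is: tlbi; dsb ish; tlbi.
 */
static unsigned long dsbs_inside_tlbi(bool repeat_tlbi)
{
	return repeat_tlbi ? 1 : 0;
}

/* Synchronous flush: every page pays for its own trailing dsb(ish). */
static unsigned long dsbs_sync(unsigned long nr_pages, bool repeat_tlbi)
{
	return nr_pages * (dsbs_inside_tlbi(repeat_tlbi) + 1);
}

/* Batched flush: one trailing dsb(ish) for the whole batch. */
static unsigned long dsbs_batched(unsigned long nr_pages, bool repeat_tlbi)
{
	unsigned long dsbs = nr_pages * dsbs_inside_tlbi(repeat_tlbi);

	if (nr_pages)
		dsbs++;		/* the single dsb in arch_tlbbatch_flush() */
	return dsbs;
}
```

In this model an unaffected CPU goes from one barrier per page to one per
batch, while an affected CPU still executes about one barrier per page even
when batching, so the cost stays linear in the number of pages.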

>> +
>> +	return true;
>> +}
>> +
>> +static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
>> +					struct mm_struct *mm,
>> +					unsigned long uaddr)
>> +{
>> +	__flush_tlb_page_nosync(mm, uaddr);
>> +}
>> +
>> +static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
>> +{
>> +	dsb(ish);
>> +}
> .
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v5 2/2] arm64: support batched/deferred tlb shootdown during page reclamation
  2022-11-14  8:46       ` Yicong Yang
  (?)
  (?)
@ 2022-11-14 14:19         ` Anshuman Khandual
  -1 siblings, 0 replies; 36+ messages in thread
From: Anshuman Khandual @ 2022-11-14 14:19 UTC (permalink / raw)
  To: Yicong Yang, akpm, linux-mm, linux-arm-kernel, x86,
	catalin.marinas, will, linux-doc
  Cc: yangyicong, corbet, peterz, arnd, punit.agrawal, linux-kernel,
	darren, huzhanyuan, lipeifeng, zhangshiming, guojian, realmz6,
	linux-mips, openrisc, linuxppc-dev, linux-riscv, linux-s390,
	Barry Song, wangkefeng.wang, xhao, prime.zeng, Barry Song,
	Nadav Amit, Mel Gorman



On 11/14/22 14:16, Yicong Yang wrote:
> On 2022/11/14 11:29, Anshuman Khandual wrote:
>>
>> On 10/28/22 13:42, Yicong Yang wrote:
>>> +static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
>>> +{
>>> +	/*
>>> +	 * TLB batched flush is proved to be beneficial for systems with large
>>> +	 * number of CPUs, especially system with more than 8 CPUs. TLB shutdown
>>> +	 * is cheap on small systems which may not need this feature. So use
>>> +	 * a threshold for enabling this to avoid potential side effects on
>>> +	 * these platforms.
>>> +	 */
>>> +	if (num_online_cpus() <= CONFIG_ARM64_NR_CPUS_FOR_BATCHED_TLB)
>>> +		return false;
>>> +
>>> +#ifdef CONFIG_ARM64_WORKAROUND_REPEAT_TLBI
>>> +	if (unlikely(this_cpu_has_cap(ARM64_WORKAROUND_REPEAT_TLBI)))
>>> +		return false;
>>> +#endif
>> should_defer_flush() is immediately followed by set_tlb_ubc_flush_pending() which calls
>> arch_tlbbatch_add_mm(), triggering the actual TLBI flush via __flush_tlb_page_nosync().
>> It should be okay to check capability with this_cpu_has_cap() as the entire call chain
>> here is executed on the same cpu. But just wondering if cpus_have_const_cap() would be
>> simpler, consistent, and also cost effective ?
>>
> ok. Checked cpus_have_const_cap() I think it matches your words.
> 
>> Regardless, a comment is needed before the #ifdef block explaining why it does not make
>> sense to defer/batch when __tlbi()/__tlbi_user() implementation will execute 'dsb(ish)'
>> between two TLBI instructions to workaround the errata.
>>
> The workaround for the errata mentioned the affected platforms need the tlbi+dsb to be done
> twice, so I'm not sure if we defer the final dsb will cause any problem so I think the judgement
> here is used for safety. I have no such platform to test if it's ok to defer the last dsb.

We should not defer TLB flush on such systems, as ensured by the above test and 'false'
return afterwards. The only question is whether this decision should be taken at a CPU
level (which is affected by the errata) or the whole system level.

What is required now

- Replace this_cpu_has_cap() with cpus_have_const_cap ?
- Add the following comment before the #ifdef check

/*
 * TLB flush deferral is not required on systems which are affected by
 * ARM64_WORKAROUND_REPEAT_TLBI, as the __tlbi()/__tlbi_user() implementation
 * will have two consecutive TLBI instructions with a dsb(ish) in between,
 * defeating the purpose (i.e. saving the overall 'dsb ish' cost).
 */
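
Put together, the decision discussed in this thread could look roughly like
the user-space sketch below. It is an illustration under assumptions, not the
final patch: struct system_model and should_defer_model() are hypothetical
names, the struct fields stand in for num_online_cpus() and for a
system-wide capability check such as cpus_have_const_cap(), and the macro
mirrors the default of CONFIG_ARM64_NR_CPUS_FOR_BATCHED_TLB.

```c
#include <stdbool.h>

/* Mirrors the default of CONFIG_ARM64_NR_CPUS_FOR_BATCHED_TLB (8). */
#define NR_CPUS_FOR_BATCHED_TLB 8

/* Hypothetical description of the running system. */
struct system_model {
	unsigned int online_cpus;		/* num_online_cpus() stand-in */
	bool has_repeat_tlbi_workaround;	/* capability check stand-in */
};

static bool should_defer_model(const struct system_model *sys)
{
	/* Small systems are at or below the threshold and do not batch. */
	if (sys->online_cpus <= NR_CPUS_FOR_BATCHED_TLB)
		return false;

	/*
	 * Affected systems already execute a dsb(ish) between the two
	 * TLBI instructions, so deferring the final dsb saves little.
	 */
	if (sys->has_repeat_tlbi_workaround)
		return false;

	return true;
}
```

Note that the comparison is "<=", so a system with exactly 8 online CPUs
does not batch with the default threshold; batching starts at 9.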

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v5 2/2] arm64: support batched/deferred tlb shootdown during page reclamation
  2022-11-14 14:19         ` Anshuman Khandual
  (?)
  (?)
@ 2022-11-15  3:34           ` Yicong Yang
  -1 siblings, 0 replies; 36+ messages in thread
From: Yicong Yang @ 2022-11-15  3:34 UTC (permalink / raw)
  To: Anshuman Khandual, akpm, linux-mm, linux-arm-kernel, x86,
	catalin.marinas, will, linux-doc
  Cc: yangyicong, corbet, peterz, arnd, punit.agrawal, linux-kernel,
	darren, huzhanyuan, lipeifeng, zhangshiming, guojian, realmz6,
	linux-mips, openrisc, linuxppc-dev, linux-riscv, linux-s390,
	Barry Song, wangkefeng.wang, xhao, prime.zeng, Barry Song,
	Nadav Amit, Mel Gorman

On 2022/11/14 22:19, Anshuman Khandual wrote:
> 
> 
> On 11/14/22 14:16, Yicong Yang wrote:
>> On 2022/11/14 11:29, Anshuman Khandual wrote:
>>>
>>> On 10/28/22 13:42, Yicong Yang wrote:
>>>> +static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
>>>> +{
>>>> +	/*
>>>> +	 * TLB batched flush is proved to be beneficial for systems with large
>>>> +	 * number of CPUs, especially system with more than 8 CPUs. TLB shutdown
>>>> +	 * is cheap on small systems which may not need this feature. So use
>>>> +	 * a threshold for enabling this to avoid potential side effects on
>>>> +	 * these platforms.
>>>> +	 */
>>>> +	if (num_online_cpus() <= CONFIG_ARM64_NR_CPUS_FOR_BATCHED_TLB)
>>>> +		return false;
>>>> +
>>>> +#ifdef CONFIG_ARM64_WORKAROUND_REPEAT_TLBI
>>>> +	if (unlikely(this_cpu_has_cap(ARM64_WORKAROUND_REPEAT_TLBI)))
>>>> +		return false;
>>>> +#endif
>>> should_defer_flush() is immediately followed by set_tlb_ubc_flush_pending() which calls
>>> arch_tlbbatch_add_mm(), triggering the actual TLBI flush via __flush_tlb_page_nosync().
>>> It should be okay to check capability with this_cpu_has_cap() as the entire call chain
>>> here is executed on the same cpu. But just wondering if cpus_have_const_cap() would be
>>> simpler, consistent, and also cost effective ?
>>>
>> ok. Checked cpus_have_const_cap() I think it matches your words.
>>
>>> Regardless, a comment is needed before the #ifdef block explaining why it does not make
>>> sense to defer/batch when __tlbi()/__tlbi_user() implementation will execute 'dsb(ish)'
>>> between two TLBI instructions to workaround the errata.
>>>
>> The workaround for the errata mentioned the affected platforms need the tlbi+dsb to be done
>> twice, so I'm not sure if we defer the final dsb will cause any problem so I think the judgement
>> here is used for safety. I have no such platform to test if it's ok to defer the last dsb.
> 
> We should not defer TLB flush on such systems, as ensured by the above test and 'false'
> return afterwards. The only question is whether this decision should be taken at a CPU
> level (which is affected by the errata) or the whole system level.
> 
> What is required now
> 
> - Replace this_cpu_has_cap() with cpus_have_const_cap ?
> - Add the following comment before the #ifdef check
> 

I have respun the series according to the above comments:
https://lore.kernel.org/lkml/20221115031425.44640-3-yangyicong@huawei.com/

Thanks.

> /*
>  * TLB flush deferral is not required on systems, which are affected with
>  * ARM64_WORKAROUND_REPEAT_TLBI, as __tlbi()/__tlbi_user() implementation
>  * will have two consecutive TLBI instructions with a dsb(ish) in between
>  * defeating the purpose (i.e save overall 'dsb ish' cost).
>  */
> 
> .
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2022-11-15  3:35 UTC | newest]

Thread overview: 36+ messages
2022-10-28  8:12 [PATCH v5 0/2] arm64: support batched/deferred tlb shootdown during page reclamation Yicong Yang
2022-10-28  8:12 ` [PATCH v5 1/2] mm/tlbbatch: Introduce arch_tlbbatch_should_defer() Yicong Yang
2022-10-28  8:12 ` [PATCH v5 2/2] arm64: support batched/deferred tlb shootdown during page reclamation Yicong Yang
2022-11-14  3:29   ` Anshuman Khandual
2022-11-14  8:46     ` Yicong Yang
2022-11-14 14:19       ` Anshuman Khandual
2022-11-15  3:34         ` Yicong Yang
2022-11-14  8:00   ` haoxin
2022-11-11 10:17 ` [External] [PATCH v5 0/2] " Punit Agrawal
