* [PATCH v3 0/4] mm: arm64: bring up BATCHED_UNMAP_TLB_FLUSH
From: Yicong Yang @ 2022-08-22  8:21 UTC
To: akpm, linux-mm, linux-arm-kernel, x86, catalin.marinas, will, linux-doc
Cc: corbet, peterz, arnd, linux-kernel, darren, yangyicong, huzhanyuan,
    lipeifeng, zhangshiming, guojian, realmz6, linux-mips, openrisc,
    linuxppc-dev, linux-riscv, linux-s390, Barry Song, wangkefeng.wang,
    xhao, prime.zeng, anshuman.khandual

From: Yicong Yang <yangyicong@hisilicon.com>

Though ARM64 has the hardware to do tlb shootdown, the hardware
broadcasting is not free.
A simple micro benchmark shows that even on snapdragon 888 with only
8 cores, the overhead of ptep_clear_flush is huge even when paging
out one page mapped by only one process:
5.36%  a.out    [kernel.kallsyms]  [k] ptep_clear_flush

When pages are mapped by multiple processes or the HW has more CPUs,
the cost should become even higher due to the bad scalability of
tlb shootdown.

The same benchmark can result in 16.99% CPU consumption on an ARM64
server with around 100 cores according to Yicong's test on patch
4/4.

This patchset leverages the existing BATCHED_UNMAP_TLB_FLUSH by
1. only sending tlbi instructions in the first stage -
   arch_tlbbatch_add_mm()
2. waiting for the completion of tlbi by dsb while doing tlbbatch
   sync in arch_tlbbatch_flush()
My testing on snapdragon shows the overhead of ptep_clear_flush
is removed by the patchset. The micro benchmark becomes 5% faster
even for one page mapped by a single process on snapdragon 888.

-v3:
1. Declare arch's tlbbatch defer support by arch_tlbbatch_should_defer()
   instead of ARCH_HAS_MM_CPUMASK, per Barry and Kefeng
2. Add Tested-by from Xin Hao
Link: https://lore.kernel.org/linux-mm/20220711034615.482895-1-21cnbao@gmail.com/

-v2:
1. Collected Yicong's test result on kunpeng920 ARM64 server;
2. Removed the redundant vma parameter in arch_tlbbatch_add_mm()
   according to the comments of Peter Zijlstra and Dave Hansen
3. Added ARCH_HAS_MM_CPUMASK rather than checking if mm_cpumask
   is empty according to the comments of Nadav Amit

Thanks, Peter, Dave and Nadav, for your testing, reviewing and comments.

-v1:
https://lore.kernel.org/lkml/20220707125242.425242-1-21cnbao@gmail.com/

Anshuman Khandual (1):
  mm/tlbbatch: Introduce arch_tlbbatch_should_defer()

Barry Song (3):
  Revert "Documentation/features: mark BATCHED_UNMAP_TLB_FLUSH doesn't
    apply to ARM64"
  mm: rmap: Extend tlbbatch APIs to fit new platforms
  arm64: support batched/deferred tlb shootdown during page reclamation

 Documentation/features/arch-support.txt       |  1 -
 .../features/vm/TLB/arch-support.txt          |  2 +-
 arch/arm64/Kconfig                            |  1 +
 arch/arm64/include/asm/tlbbatch.h             | 12 ++++++++
 arch/arm64/include/asm/tlbflush.h             | 28 +++++++++++++++++--
 arch/x86/include/asm/tlbflush.h               | 15 +++++++++-
 mm/rmap.c                                     | 19 +++++--------
 7 files changed, 61 insertions(+), 17 deletions(-)
 create mode 100644 arch/arm64/include/asm/tlbbatch.h

-- 
2.24.0

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

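The two stages listed in the cover letter can be modelled in plain userspace C. What follows is only a sketch under stated assumptions, not the code from patch 4/4: struct tlb_model and every model_* name are hypothetical, and counters stand in for the tlbi and dsb instructions. It compares the per-page "invalidate, then wait" pattern with "invalidate many, wait once".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: counters stand in for real tlbi/dsb instructions. */
struct tlb_model {
	unsigned long tlbi_issued;	/* invalidate broadcasts sent */
	unsigned long dsb_waits;	/* barriers that waited for completion */
	unsigned long pending;		/* invalidates sent but not yet waited on */
};

/* Stage 1 (models arch_tlbbatch_add_mm): send the invalidate, do not wait. */
static void model_batch_add(struct tlb_model *m, unsigned long uaddr)
{
	(void)uaddr;	/* real hardware would encode the address in the tlbi */
	m->tlbi_issued++;
	m->pending++;
}

/* Stage 2 (models arch_tlbbatch_flush): one barrier covers all pending work. */
static void model_batch_flush(struct tlb_model *m)
{
	if (m->pending == 0)
		return;
	m->dsb_waits++;
	m->pending = 0;
}

/* Unbatched path (models ptep_clear_flush): invalidate and wait per page. */
static void model_flush_page_sync(struct tlb_model *m, unsigned long uaddr)
{
	model_batch_add(m, uaddr);
	model_batch_flush(m);
}

/* Unmap npages pages, either batched or one synchronous flush per page. */
static struct tlb_model model_reclaim(size_t npages, int batched)
{
	struct tlb_model m = { 0, 0, 0 };
	size_t i;

	for (i = 0; i < npages; i++) {
		unsigned long uaddr = 0x100000UL + i * 4096UL;

		if (batched)
			model_batch_add(&m, uaddr);
		else
			model_flush_page_sync(&m, uaddr);
	}
	if (batched)
		model_batch_flush(&m);

	assert(m.pending == 0);	/* nothing may stay unsynchronised */
	return m;
}
```

In this model both paths send the same number of invalidates; only the number of waits differs, which is the cost the series aims to remove from the reclaim path.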
* [PATCH v3 1/4] Revert "Documentation/features: mark BATCHED_UNMAP_TLB_FLUSH doesn't apply to ARM64"
From: Yicong Yang @ 2022-08-22  8:21 UTC
To: akpm, linux-mm, linux-arm-kernel, x86, catalin.marinas, will, linux-doc
Cc: corbet, peterz, arnd, linux-kernel, darren, yangyicong, huzhanyuan,
    lipeifeng, zhangshiming, guojian, realmz6, linux-mips, openrisc,
    linuxppc-dev, linux-riscv, linux-s390, Barry Song, wangkefeng.wang,
    xhao, prime.zeng, anshuman.khandual, Barry Song

From: Barry Song <v-songbaohua@oppo.com>

This reverts commit 6bfef171d0d74cb050112e0e49feb20bfddf7f42.

I was wrong. Though ARM64 has hardware TLB flush, but it is not free
and it is still expensive.
We still have a good chance to enable batched and deferred TLB flush
on ARM64 for memory reclamation. A possible way is that we only queue
tlbi instructions in hardware's queue. When we have to broadcast TLB,
we broadcast it by dsb. We just need to get adapted the existing
BATCHED_UNMAP_TLB_FLUSH.

Tested-by: Xin Hao <xhao@linux.alibaba.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
---
 Documentation/features/arch-support.txt        | 1 -
 Documentation/features/vm/TLB/arch-support.txt | 2 +-
 2 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/Documentation/features/arch-support.txt b/Documentation/features/arch-support.txt
index 118ae031840b..d22a1095e661 100644
--- a/Documentation/features/arch-support.txt
+++ b/Documentation/features/arch-support.txt
@@ -8,5 +8,4 @@ The meaning of entries in the tables is:
     | ok |  # feature supported by the architecture
     |TODO|  # feature not yet supported by the architecture
     | .. |  # feature cannot be supported by the hardware
-    | N/A|  # feature doesn't apply to the architecture
 
diff --git a/Documentation/features/vm/TLB/arch-support.txt b/Documentation/features/vm/TLB/arch-support.txt
index 039e4e91ada3..1c009312b9c1 100644
--- a/Documentation/features/vm/TLB/arch-support.txt
+++ b/Documentation/features/vm/TLB/arch-support.txt
@@ -9,7 +9,7 @@
     |       alpha: | TODO |
     |         arc: | TODO |
     |         arm: | TODO |
-    |       arm64: | N/A  |
+    |       arm64: | TODO |
     |        csky: | TODO |
     |     hexagon: | TODO |
     |        ia64: | TODO |
-- 
2.24.0

* Re: [PATCH v3 1/4] Revert "Documentation/features: mark BATCHED_UNMAP_TLB_FLUSH doesn't apply to ARM64"
From: Anshuman Khandual @ 2022-09-09  4:26 UTC
To: Yicong Yang, akpm, linux-mm, linux-arm-kernel, x86, catalin.marinas,
    will, linux-doc
Cc: corbet, peterz, arnd, linux-kernel, darren, yangyicong, huzhanyuan,
    lipeifeng, zhangshiming, guojian, realmz6, linux-mips, openrisc,
    linuxppc-dev, linux-riscv, linux-s390, Barry Song, wangkefeng.wang,
    xhao, prime.zeng, Barry Song

On 8/22/22 13:51, Yicong Yang wrote:
> From: Barry Song <v-songbaohua@oppo.com>
>
> This reverts commit 6bfef171d0d74cb050112e0e49feb20bfddf7f42.
>
> I was wrong. Though ARM64 has hardware TLB flush, but it is not free
> and it is still expensive.
> We still have a good chance to enable batched and deferred TLB flush
> on ARM64 for memory reclamation. A possible way is that we only queue
> tlbi instructions in hardware's queue. When we have to broadcast TLB,
> we broadcast it by dsb. We just need to get adapted the existing
> BATCHED_UNMAP_TLB_FLUSH.
>
> Tested-by: Xin Hao <xhao@linux.alibaba.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
> ---
>  Documentation/features/arch-support.txt        | 1 -
>  Documentation/features/vm/TLB/arch-support.txt | 2 +-
>  2 files changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/Documentation/features/arch-support.txt b/Documentation/features/arch-support.txt
> index 118ae031840b..d22a1095e661 100644
> --- a/Documentation/features/arch-support.txt
> +++ b/Documentation/features/arch-support.txt
> @@ -8,5 +8,4 @@ The meaning of entries in the tables is:
>      | ok |  # feature supported by the architecture
>      |TODO|  # feature not yet supported by the architecture
>      | .. |  # feature cannot be supported by the hardware
> -    | N/A|  # feature doesn't apply to the architecture
>
> diff --git a/Documentation/features/vm/TLB/arch-support.txt b/Documentation/features/vm/TLB/arch-support.txt
> index 039e4e91ada3..1c009312b9c1 100644
> --- a/Documentation/features/vm/TLB/arch-support.txt
> +++ b/Documentation/features/vm/TLB/arch-support.txt
> @@ -9,7 +9,7 @@
>      |       alpha: | TODO |
>      |         arc: | TODO |
>      |         arm: | TODO |
> -    |       arm64: | N/A  |
> +    |       arm64: | TODO |
>      |        csky: | TODO |
>      |     hexagon: | TODO |
>      |        ia64: | TODO |

I believe this patch is not needed, which explicitly reverts an
older commit. Instead when ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
framework gets enabled on arm64, the same patch could just drop
'NA' as possible values for arch support for a given feature in
file Documentation/features/arch-support.txt.

* Re: [PATCH v3 1/4] Revert "Documentation/features: mark BATCHED_UNMAP_TLB_FLUSH doesn't apply to ARM64"
From: Barry Song @ 2022-09-09  4:40 UTC
To: Anshuman Khandual
Cc: Yicong Yang, akpm, linux-mm, linux-arm-kernel, x86, catalin.marinas,
    will, linux-doc, corbet, peterz, arnd, linux-kernel, darren,
    yangyicong, huzhanyuan, lipeifeng, zhangshiming, guojian, realmz6,
    linux-mips, openrisc, linuxppc-dev, linux-riscv, linux-s390,
    wangkefeng.wang, xhao, prime.zeng, Barry Song

On Fri, Sep 9, 2022 at 4:26 PM Anshuman Khandual
<anshuman.khandual@arm.com> wrote:
>
>
>
> On 8/22/22 13:51, Yicong Yang wrote:
> > From: Barry Song <v-songbaohua@oppo.com>
> >
> > This reverts commit 6bfef171d0d74cb050112e0e49feb20bfddf7f42.
> >
> > I was wrong. Though ARM64 has hardware TLB flush, but it is not free
> > and it is still expensive.
> > We still have a good chance to enable batched and deferred TLB flush
> > on ARM64 for memory reclamation. A possible way is that we only queue
> > tlbi instructions in hardware's queue. When we have to broadcast TLB,
> > we broadcast it by dsb. We just need to get adapted the existing
> > BATCHED_UNMAP_TLB_FLUSH.
> >
> > Tested-by: Xin Hao <xhao@linux.alibaba.com>
> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
> > ---
> >  Documentation/features/arch-support.txt        | 1 -
> >  Documentation/features/vm/TLB/arch-support.txt | 2 +-
> >  2 files changed, 1 insertion(+), 2 deletions(-)
> >
> > diff --git a/Documentation/features/arch-support.txt b/Documentation/features/arch-support.txt
> > index 118ae031840b..d22a1095e661 100644
> > --- a/Documentation/features/arch-support.txt
> > +++ b/Documentation/features/arch-support.txt
> > @@ -8,5 +8,4 @@ The meaning of entries in the tables is:
> >      | ok |  # feature supported by the architecture
> >      |TODO|  # feature not yet supported by the architecture
> >      | .. |  # feature cannot be supported by the hardware
> > -    | N/A|  # feature doesn't apply to the architecture
> >
> > diff --git a/Documentation/features/vm/TLB/arch-support.txt b/Documentation/features/vm/TLB/arch-support.txt
> > index 039e4e91ada3..1c009312b9c1 100644
> > --- a/Documentation/features/vm/TLB/arch-support.txt
> > +++ b/Documentation/features/vm/TLB/arch-support.txt
> > @@ -9,7 +9,7 @@
> >      |       alpha: | TODO |
> >      |         arc: | TODO |
> >      |         arm: | TODO |
> > -    |       arm64: | N/A  |
> > +    |       arm64: | TODO |
> >      |        csky: | TODO |
> >      |     hexagon: | TODO |
> >      |        ia64: | TODO |
>
> I believe this patch is not needed, which explicitly reverts an
> older commit. Instead when ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
> framework gets enabled on arm64, the same patch could just drop
> 'NA' as possible values for arch support for a given feature in
> file Documentation/features/arch-support.txt.

Sure, it is certainly OK to fix this in
"arm64: support batched/deferred tlb shootdown during page reclamation".

With a separate patch, I was trying to highlight that my previous patch
was wrong, but yes, it is not fundamentally necessary.

Thanks
Barry

* [PATCH v3 2/4] mm/tlbbatch: Introduce arch_tlbbatch_should_defer()
From: Yicong Yang @ 2022-08-22  8:21 UTC
To: akpm, linux-mm, linux-arm-kernel, x86, catalin.marinas, will, linux-doc
Cc: corbet, peterz, arnd, linux-kernel, darren, yangyicong, huzhanyuan,
    lipeifeng, zhangshiming, guojian, realmz6, linux-mips, openrisc,
    linuxppc-dev, linux-riscv, linux-s390, Barry Song, wangkefeng.wang,
    xhao, prime.zeng, anshuman.khandual, Anshuman Khandual

From: Anshuman Khandual <khandual@linux.vnet.ibm.com>

The entire scheme of deferred TLB flush in reclaim path rests on the
fact that the cost to refill TLB entries is less than flushing out
individual entries by sending IPI to remote CPUs. But architecture
can have different ways to evaluate that. Hence apart from checking
TTU_BATCH_FLUSH in the TTU flags, rest of the decision should be
architecture specific.

Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
[https://lore.kernel.org/linuxppc-dev/20171101101735.2318-2-khandual@linux.vnet.ibm.com/]
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
[Rebase and fix incorrect return value type]
---
 arch/x86/include/asm/tlbflush.h | 12 ++++++++++++
 mm/rmap.c                       |  9 +--------
 2 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index cda3118f3b27..8a497d902c16 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -240,6 +240,18 @@ static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
 	flush_tlb_mm_range(vma->vm_mm, a, a + PAGE_SIZE, PAGE_SHIFT, false);
 }
 
+static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
+{
+	bool should_defer = false;
+
+	/* If remote CPUs need to be flushed then defer batch the flush */
+	if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids)
+		should_defer = true;
+	put_cpu();
+
+	return should_defer;
+}
+
 static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
 {
 	/*
diff --git a/mm/rmap.c b/mm/rmap.c
index edc06c52bc82..a17a004550c6 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -687,17 +687,10 @@ static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
  */
 static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
 {
-	bool should_defer = false;
-
 	if (!(flags & TTU_BATCH_FLUSH))
 		return false;
 
-	/* If remote CPUs need to be flushed then defer batch the flush */
-	if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids)
-		should_defer = true;
-	put_cpu();
-
-	return should_defer;
+	return arch_tlbbatch_should_defer(mm);
 }
 
 /*
-- 
2.24.0

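The split that this patch makes (generic code checks the TTU flag, the architecture decides the rest) can be illustrated with a small userspace model. Treat it as a sketch under assumptions: every model_* name and MODEL_TTU_BATCH_FLUSH are hypothetical, a 64-bit word stands in for mm_cpumask(), a function pointer stands in for the compile-time per-arch header the kernel actually uses, and the "always defer" policy is only a guess at what an architecture with hardware broadcast might choose.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_TTU_BATCH_FLUSH 0x1u	/* hypothetical stand-in for TTU_BATCH_FLUSH */

/* Bit n set means CPU n may hold cached translations for this mm. */
struct model_mm {
	uint64_t cpumask;
};

typedef bool (*model_arch_policy)(const struct model_mm *mm,
				  unsigned int this_cpu);

/* IPI-style policy, as in the x86 hunk: defer only if a remote CPU is involved. */
static bool model_arch_should_defer_ipi(const struct model_mm *mm,
					unsigned int this_cpu)
{
	assert(this_cpu < 64);
	return (mm->cpumask & ~(UINT64_C(1) << this_cpu)) != 0;
}

/* Assumed policy for hardware-broadcast invalidation: deferring always pays off. */
static bool model_arch_should_defer_always(const struct model_mm *mm,
					   unsigned int this_cpu)
{
	(void)mm;
	(void)this_cpu;
	return true;
}

/* Generic side: only the flag check stays here, the rest is delegated. */
static bool model_should_defer_flush(const struct model_mm *mm,
				     unsigned int flags,
				     unsigned int this_cpu,
				     model_arch_policy arch_should_defer)
{
	if (!(flags & MODEL_TTU_BATCH_FLUSH))
		return false;

	return arch_should_defer(mm, this_cpu);
}
```

With the IPI-style policy an mm that only ever ran on the current CPU is flushed immediately, while the assumed always-defer policy batches it anyway; the generic caller is identical in both cases.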
* Re: [PATCH v3 2/4] mm/tlbbatch: Introduce arch_tlbbatch_should_defer()
From: Kefeng Wang @ 2022-08-24  9:40 UTC
To: Yicong Yang, akpm, linux-mm, linux-arm-kernel, x86, catalin.marinas,
    will, linux-doc
Cc: corbet, peterz, arnd, linux-kernel, darren, yangyicong, huzhanyuan,
    lipeifeng, zhangshiming, guojian, realmz6, linux-mips, openrisc,
    linuxppc-dev, linux-riscv, linux-s390, Barry Song, xhao, prime.zeng,
    anshuman.khandual, Anshuman Khandual

On 2022/8/22 16:21, Yicong Yang wrote:
> From: Anshuman Khandual <khandual@linux.vnet.ibm.com>
>
> The entire scheme of deferred TLB flush in reclaim path rests on the
> fact that the cost to refill TLB entries is less than flushing out
> individual entries by sending IPI to remote CPUs. But architecture
> can have different ways to evaluate that. Hence apart from checking
> TTU_BATCH_FLUSH in the TTU flags, rest of the decision should be
> architecture specific.
>
> Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
> [https://lore.kernel.org/linuxppc-dev/20171101101735.2318-2-khandual@linux.vnet.ibm.com/]
> Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
> [Rebase and fix incorrect return value type]

Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>

> ---
>  arch/x86/include/asm/tlbflush.h | 12 ++++++++++++
>  mm/rmap.c                       |  9 +--------
>  2 files changed, 13 insertions(+), 8 deletions(-)
>
> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> index cda3118f3b27..8a497d902c16 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -240,6 +240,18 @@ static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
>  	flush_tlb_mm_range(vma->vm_mm, a, a + PAGE_SIZE, PAGE_SHIFT, false);
>  }
>
> +static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
> +{
> +	bool should_defer = false;
> +
> +	/* If remote CPUs need to be flushed then defer batch the flush */
> +	if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids)
> +		should_defer = true;
> +	put_cpu();
> +
> +	return should_defer;
> +}
> +
>  static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
>  {
>  	/*
> diff --git a/mm/rmap.c b/mm/rmap.c
> index edc06c52bc82..a17a004550c6 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -687,17 +687,10 @@ static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
>   */
>  static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
>  {
> -	bool should_defer = false;
> -
>  	if (!(flags & TTU_BATCH_FLUSH))
>  		return false;
>
> -	/* If remote CPUs need to be flushed then defer batch the flush */
> -	if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids)
> -		should_defer = true;
> -	put_cpu();
> -
> -	return should_defer;
> +	return arch_tlbbatch_should_defer(mm);
>  }
>
>  /*

* Re: [PATCH v3 2/4] mm/tlbbatch: Introduce arch_tlbbatch_should_defer()
From: Anshuman Khandual @ 2022-09-09  4:16 UTC
To: Yicong Yang, akpm, linux-mm, linux-arm-kernel, x86, catalin.marinas,
    will, linux-doc
Cc: corbet, peterz, arnd, linux-kernel, darren, yangyicong, huzhanyuan,
    lipeifeng, zhangshiming, guojian, realmz6, linux-mips, openrisc,
    linuxppc-dev, linux-riscv, linux-s390, Barry Song, wangkefeng.wang,
    xhao, prime.zeng, Anshuman Khandual

On 8/22/22 13:51, Yicong Yang wrote:
> From: Anshuman Khandual <khandual@linux.vnet.ibm.com>
>
> The entire scheme of deferred TLB flush in reclaim path rests on the
> fact that the cost to refill TLB entries is less than flushing out
> individual entries by sending IPI to remote CPUs. But architecture
> can have different ways to evaluate that. Hence apart from checking
> TTU_BATCH_FLUSH in the TTU flags, rest of the decision should be
> architecture specific.
>
> Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
> [https://lore.kernel.org/linuxppc-dev/20171101101735.2318-2-khandual@linux.vnet.ibm.com/]
> Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
> [Rebase and fix incorrect return value type]

From a semantics perspective, this patch still makes sense, even on its own.

Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>

> ---
>  arch/x86/include/asm/tlbflush.h | 12 ++++++++++++
>  mm/rmap.c                       |  9 +--------
>  2 files changed, 13 insertions(+), 8 deletions(-)
>
> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> index cda3118f3b27..8a497d902c16 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -240,6 +240,18 @@ static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
>  	flush_tlb_mm_range(vma->vm_mm, a, a + PAGE_SIZE, PAGE_SHIFT, false);
>  }
>
> +static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
> +{
> +	bool should_defer = false;
> +
> +	/* If remote CPUs need to be flushed then defer batch the flush */
> +	if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids)
> +		should_defer = true;
> +	put_cpu();
> +
> +	return should_defer;
> +}
> +
>  static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
>  {
>  	/*
> diff --git a/mm/rmap.c b/mm/rmap.c
> index edc06c52bc82..a17a004550c6 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -687,17 +687,10 @@ static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
>   */
>  static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
>  {
> -	bool should_defer = false;
> -
>  	if (!(flags & TTU_BATCH_FLUSH))
>  		return false;
>
> -	/* If remote CPUs need to be flushed then defer batch the flush */
> -	if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids)
> -		should_defer = true;
> -	put_cpu();
> -
> -	return should_defer;
> +	return arch_tlbbatch_should_defer(mm);
>  }
>
>  /*

* [PATCH v3 3/4] mm: rmap: Extend tlbbatch APIs to fit new platforms
From: Yicong Yang @ 2022-08-22  8:21 UTC
To: akpm, linux-mm, linux-arm-kernel, x86, catalin.marinas, will, linux-doc
Cc: corbet, peterz, arnd, linux-kernel, darren, yangyicong, huzhanyuan,
    lipeifeng, zhangshiming, guojian, realmz6, linux-mips, openrisc,
    linuxppc-dev, linux-riscv, linux-s390, Barry Song, wangkefeng.wang,
    xhao, prime.zeng, anshuman.khandual, Barry Song, Thomas Gleixner,
    Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
    Nadav Amit, Mel Gorman

From: Barry Song <v-songbaohua@oppo.com>

Add uaddr to tlbbatch APIs so that platforms like ARM64 are
able to apply this on their specific hardware features. For
ARM64, this could be sending tlbi into hardware queues for
the page with this particular uaddr.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Mel Gorman <mgorman@suse.de>
Tested-by: Xin Hao <xhao@linux.alibaba.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
---
 arch/x86/include/asm/tlbflush.h |  3 ++-
 mm/rmap.c                       | 10 ++++++----
 2 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 8a497d902c16..5bd78ae55cd4 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -264,7 +264,8 @@ static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
 }
 
 static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
-					struct mm_struct *mm)
+					struct mm_struct *mm,
+					unsigned long uaddr)
 {
 	inc_mm_tlb_gen(mm);
 	cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
diff --git a/mm/rmap.c b/mm/rmap.c
index a17a004550c6..7187a72b63b1 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -642,12 +642,13 @@ void try_to_unmap_flush_dirty(void)
 #define TLB_FLUSH_BATCH_PENDING_LARGE			\
 	(TLB_FLUSH_BATCH_PENDING_MASK / 2)
 
-static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
+static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable,
+				      unsigned long uaddr)
 {
 	struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
 	int batch, nbatch;
 
-	arch_tlbbatch_add_mm(&tlb_ubc->arch, mm);
+	arch_tlbbatch_add_mm(&tlb_ubc->arch, mm, uaddr);
 	tlb_ubc->flush_required = true;
 
 	/*
@@ -725,7 +726,8 @@ void flush_tlb_batched_pending(struct mm_struct *mm)
 	}
 }
 #else
-static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
+static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable,
+				      unsigned long uaddr)
 {
 }
 
@@ -1587,7 +1589,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 				 */
 				pteval = ptep_get_and_clear(mm, address, pvmw.pte);
 
-				set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
+				set_tlb_ubc_flush_pending(mm, pte_dirty(pteval), address);
 			} else {
 				pteval = ptep_clear_flush(vma, address, pvmw.pte);
 			}
-- 
2.24.0

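Why the extra uaddr argument matters can be shown with two toy backends behind the same extended signature. This is a sketch under assumptions and not the kernel's code: struct model_batch, MODEL_PAGE_SHIFT, MODEL_MAX_PAGES and the model_* functions are hypothetical, and the address-based backend records page numbers only to make its effect observable, whereas a real arm64 backend would presumably issue the invalidate immediately.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12	/* assumes 4 KiB pages */
#define MODEL_MAX_PAGES  8	/* arbitrary capacity for the sketch */

struct model_batch {
	uint64_t cpumask;			/* used by the cpumask backend */
	unsigned long pages[MODEL_MAX_PAGES];	/* used by the address backend */
	size_t npages;
};

/* cpumask backend (x86-like): accumulates target CPUs and ignores uaddr. */
static void model_add_mm_cpumask(struct model_batch *b, uint64_t mm_cpumask,
				 unsigned long uaddr)
{
	(void)uaddr;
	b->cpumask |= mm_cpumask;
}

/* address backend: needs uaddr to name the page that is being invalidated. */
static void model_add_mm_by_addr(struct model_batch *b, uint64_t mm_cpumask,
				 unsigned long uaddr)
{
	(void)mm_cpumask;
	assert(b->npages < MODEL_MAX_PAGES);
	b->pages[b->npages++] = uaddr >> MODEL_PAGE_SHIFT;
}
```

The cpumask backend only remembers which CPUs to interrupt later, so the old two-argument hook was sufficient for it; a per-page invalidate cannot be formed without the address, which is what the new parameter supplies.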
* Re: [PATCH v3 3/4] mm: rmap: Extend tlbbatch APIs to fit new platforms
From: Kefeng Wang @ 2022-08-24  9:43 UTC
To: Yicong Yang, akpm, linux-mm, linux-arm-kernel, x86, catalin.marinas,
    will, linux-doc
Cc: corbet, peterz, arnd, linux-kernel, darren, yangyicong, huzhanyuan,
    lipeifeng, zhangshiming, guojian, realmz6, linux-mips, openrisc,
    linuxppc-dev, linux-riscv, linux-s390, Barry Song, xhao, prime.zeng,
    anshuman.khandual, Barry Song, Thomas Gleixner, Ingo Molnar,
    Borislav Petkov, Dave Hansen, H. Peter Anvin, Nadav Amit, Mel Gorman

On 2022/8/22 16:21, Yicong Yang wrote:
> From: Barry Song <v-songbaohua@oppo.com>
>
> Add uaddr to tlbbatch APIs so that platforms like ARM64 are
> able to apply this on their specific hardware features. For
> ARM64, this could be sending tlbi into hardware queues for
> the page with this particular uaddr.
>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: Nadav Amit <namit@vmware.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Tested-by: Xin Hao <xhao@linux.alibaba.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>

Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>

> ---
>  arch/x86/include/asm/tlbflush.h |  3 ++-
>  mm/rmap.c                       | 10 ++++++----
>  2 files changed, 8 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> index 8a497d902c16..5bd78ae55cd4 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -264,7 +264,8 @@ static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
>  }
>
>  static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
> -					struct mm_struct *mm)
> +					struct mm_struct *mm,
> +					unsigned long uaddr)
>  {
>  	inc_mm_tlb_gen(mm);
>  	cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
> diff --git a/mm/rmap.c b/mm/rmap.c
> index a17a004550c6..7187a72b63b1 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -642,12 +642,13 @@ void try_to_unmap_flush_dirty(void)
>  #define TLB_FLUSH_BATCH_PENDING_LARGE			\
>  	(TLB_FLUSH_BATCH_PENDING_MASK / 2)
>
> -static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
> +static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable,
> +				      unsigned long uaddr)
>  {
>  	struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
>  	int batch, nbatch;
>
> -	arch_tlbbatch_add_mm(&tlb_ubc->arch, mm);
> +	arch_tlbbatch_add_mm(&tlb_ubc->arch, mm, uaddr);
>  	tlb_ubc->flush_required = true;
>
>  	/*
> @@ -725,7 +726,8 @@ void flush_tlb_batched_pending(struct mm_struct *mm)
>  	}
>  }
>  #else
> -static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
> +static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable,
> +				      unsigned long uaddr)
>  {
>  }
>
> @@ -1587,7 +1589,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>  				 */
>  				pteval = ptep_get_and_clear(mm, address, pvmw.pte);
>
> -				set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
> +				set_tlb_ubc_flush_pending(mm, pte_dirty(pteval), address);
>  			} else {
>  				pteval = ptep_clear_flush(vma, address, pvmw.pte);
>  			}

* Re: [PATCH v3 3/4] mm: rmap: Extend tlbbatch APIs to fit new platforms
  2022-08-22  8:21 ` [PATCH v3 3/4] mm: rmap: Extend tlbbatch APIs to fit new platforms Yicong Yang
  2022-08-24  9:43   ` Kefeng Wang
@ 2022-09-09  4:51   ` Anshuman Khandual
  2022-09-09  5:25     ` Barry Song
  1 sibling, 1 reply; 34+ messages in thread
From: Anshuman Khandual @ 2022-09-09  4:51 UTC (permalink / raw)
  To: Yicong Yang, akpm, linux-mm, linux-arm-kernel, x86, catalin.marinas, will, linux-doc
  Cc: corbet, peterz, arnd, linux-kernel, darren, yangyicong, huzhanyuan, lipeifeng, zhangshiming, guojian, realmz6, linux-mips, openrisc, linuxppc-dev, linux-riscv, linux-s390, Barry Song, wangkefeng.wang, xhao, prime.zeng, Barry Song, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin, Nadav Amit, Mel Gorman

On 8/22/22 13:51, Yicong Yang wrote:
> From: Barry Song <v-songbaohua@oppo.com>
>
> Add uaddr to tlbbatch APIs so that platforms like ARM64 are

I guess 'uaddr' refers to a virtual address from the process address
space itself? Please be more specific.

> able to apply this on their specific hardware features. For
> ARM64, this could be sending tlbi into hardware queues for
> the page with this particular uaddr.

This subject line and commit message here are misleading. The patch
adds an address argument to arch callback arch_tlbbatch_add_mm() as
arm64 platform could use that to perform the TLB flush batching?

This patch can be folded into the next one, so that the requirement
for an additional argument 'uaddr' in the arch callback will be self
evident. OR if this is going to be a preparatory patch, then it must
explain how 'uaddr' argument is helpful on platforms like arm64 while
performing TLB flush batching. But TBH, just folding it to next patch
explains the context better.

>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: Nadav Amit <namit@vmware.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Tested-by: Xin Hao <xhao@linux.alibaba.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
> ---
>  arch/x86/include/asm/tlbflush.h |  3 ++-
>  mm/rmap.c                       | 10 ++++++----
>  2 files changed, 8 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> index 8a497d902c16..5bd78ae55cd4 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -264,7 +264,8 @@ static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
>  }
>
>  static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
> -					struct mm_struct *mm)
> +					struct mm_struct *mm,
> +					unsigned long uaddr)
>  {
>  	inc_mm_tlb_gen(mm);
>  	cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
> diff --git a/mm/rmap.c b/mm/rmap.c
> index a17a004550c6..7187a72b63b1 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -642,12 +642,13 @@ void try_to_unmap_flush_dirty(void)
>  #define TLB_FLUSH_BATCH_PENDING_LARGE			\
>  	(TLB_FLUSH_BATCH_PENDING_MASK / 2)
>
> -static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
> +static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable,
> +				      unsigned long uaddr)
>  {
>  	struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
>  	int batch, nbatch;
>
> -	arch_tlbbatch_add_mm(&tlb_ubc->arch, mm);
> +	arch_tlbbatch_add_mm(&tlb_ubc->arch, mm, uaddr);
>  	tlb_ubc->flush_required = true;
>
>  	/*
> @@ -725,7 +726,8 @@ void flush_tlb_batched_pending(struct mm_struct *mm)
>  	}
>  }
>  #else
> -static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
> +static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable,
> +				      unsigned long uaddr)
>  {
>  }
>
> @@ -1587,7 +1589,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>  			 */
>  			pteval = ptep_get_and_clear(mm, address, pvmw.pte);
>
> -			set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
> +			set_tlb_ubc_flush_pending(mm, pte_dirty(pteval), address);
>  		} else {
>  			pteval = ptep_clear_flush(vma, address, pvmw.pte);
>  		}
* Re: [PATCH v3 3/4] mm: rmap: Extend tlbbatch APIs to fit new platforms
  2022-09-09  4:51   ` Anshuman Khandual
@ 2022-09-09  5:25     ` Barry Song
  0 siblings, 0 replies; 34+ messages in thread
From: Barry Song @ 2022-09-09  5:25 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: Yicong Yang, akpm, linux-mm, linux-arm-kernel, x86, catalin.marinas, will, linux-doc, corbet, peterz, arnd, linux-kernel, darren, yangyicong, huzhanyuan, lipeifeng, zhangshiming, guojian, realmz6, linux-mips, openrisc, linuxppc-dev, linux-riscv, linux-s390, wangkefeng.wang, xhao, prime.zeng, Barry Song, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin, Nadav Amit, Mel Gorman

On Fri, Sep 9, 2022 at 4:51 PM Anshuman Khandual
<anshuman.khandual@arm.com> wrote:
>
> On 8/22/22 13:51, Yicong Yang wrote:
> > From: Barry Song <v-songbaohua@oppo.com>
> >
> > Add uaddr to tlbbatch APIs so that platforms like ARM64 are
>
> I guess 'uaddr' refers to a virtual address from the process address
> space itself? Please be more specific.
>
> > able to apply this on their specific hardware features. For
> > ARM64, this could be sending tlbi into hardware queues for
> > the page with this particular uaddr.
>
> This subject line and commit message here are misleading. The patch
> adds an address argument to arch callback arch_tlbbatch_add_mm() as
> arm64 platform could use that to perform the TLB flush batching?
>
> This patch can be folded into the next one, so that the requirement
> for an additional argument 'uaddr' in the arch callback will be self
> evident. OR if this is going to be a preparatory patch, then it must
> explain how 'uaddr' argument is helpful on platforms like arm64 while
> performing TLB flush batching. But TBH, just folding it to next patch
> explains the context better.

The intention was to keep each change small while still functionally
independent, so that it would be easier to review. But yes, I agree
that in this particular case, if we fold this one into the last one,
the modification becomes self-evident while the new patch still stays
small.

> >
> > Cc: Thomas Gleixner <tglx@linutronix.de>
> > Cc: Ingo Molnar <mingo@redhat.com>
> > Cc: Borislav Petkov <bp@alien8.de>
> > Cc: Dave Hansen <dave.hansen@linux.intel.com>
> > Cc: "H. Peter Anvin" <hpa@zytor.com>
> > Cc: Nadav Amit <namit@vmware.com>
> > Cc: Mel Gorman <mgorman@suse.de>
> > Tested-by: Xin Hao <xhao@linux.alibaba.com>
> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
> > ---
> >  arch/x86/include/asm/tlbflush.h |  3 ++-
> >  mm/rmap.c                       | 10 ++++++----
> >  2 files changed, 8 insertions(+), 5 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> > index 8a497d902c16..5bd78ae55cd4 100644
> > --- a/arch/x86/include/asm/tlbflush.h
> > +++ b/arch/x86/include/asm/tlbflush.h
> > @@ -264,7 +264,8 @@ static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
> >  }
> >
> >  static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
> > -					struct mm_struct *mm)
> > +					struct mm_struct *mm,
> > +					unsigned long uaddr)
> >  {
> >  	inc_mm_tlb_gen(mm);
> >  	cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index a17a004550c6..7187a72b63b1 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -642,12 +642,13 @@ void try_to_unmap_flush_dirty(void)
> >  #define TLB_FLUSH_BATCH_PENDING_LARGE			\
> >  	(TLB_FLUSH_BATCH_PENDING_MASK / 2)
> >
> > -static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
> > +static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable,
> > +				      unsigned long uaddr)
> >  {
> >  	struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
> >  	int batch, nbatch;
> >
> > -	arch_tlbbatch_add_mm(&tlb_ubc->arch, mm);
> > +	arch_tlbbatch_add_mm(&tlb_ubc->arch, mm, uaddr);
> >  	tlb_ubc->flush_required = true;
> >
> >  	/*
> > @@ -725,7 +726,8 @@ void flush_tlb_batched_pending(struct mm_struct *mm)
> >  	}
> >  }
> >  #else
> > -static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
> > +static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable,
> > +				      unsigned long uaddr)
> >  {
> >  }
> >
> > @@ -1587,7 +1589,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> >  			 */
> >  			pteval = ptep_get_and_clear(mm, address, pvmw.pte);
> >
> > -			set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
> > +			set_tlb_ubc_flush_pending(mm, pte_dirty(pteval), address);
> >  		} else {
> >  			pteval = ptep_clear_flush(vma, address, pvmw.pte);
> >  		}

Thanks
Barry
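To make the discussion about the extra 'uaddr' argument more concrete, here is a minimal user-space sketch, not kernel code. The types struct mm_model and struct batch_model and both add_mm_* functions are hypothetical names invented for this illustration. The IPI-style variant mirrors what the quoted x86 hunk does (accumulate a CPU mask, ignore the address); the broadcast-style variant assumes an architecture that acts on each address as soon as it is added to the batch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration; none of these are kernel types. */
struct mm_model {
	uint64_t cpumask;		/* CPUs that may cache this mm's translations */
};

struct batch_model {
	uint64_t cpumask;		/* used by the IPI-style variant only */
	unsigned long nr_issued;	/* used by the broadcast-style variant only */
	unsigned long last_uaddr;	/* last address handed to the callback */
};

/* IPI style: remember which CPUs to interrupt later; uaddr is unused. */
static void add_mm_ipi_style(struct batch_model *batch,
			     const struct mm_model *mm, unsigned long uaddr)
{
	assert(batch && mm);
	(void)uaddr;
	batch->cpumask |= mm->cpumask;
}

/* Broadcast style: act on the address right away; no CPU mask is kept. */
static void add_mm_broadcast_style(struct batch_model *batch,
				   const struct mm_model *mm, unsigned long uaddr)
{
	assert(batch && mm);
	batch->last_uaddr = uaddr;	/* stands in for a per-page invalidate */
	batch->nr_issued++;
}
```

With a single callback signature, the first variant simply discards the address while the second cannot work without it, which is the reason the generic caller has to pass it down.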
* [PATCH v3 4/4] arm64: support batched/deferred tlb shootdown during page reclamation
  2022-08-22  8:21 [PATCH v3 0/4] mm: arm64: bring up BATCHED_UNMAP_TLB_FLUSH Yicong Yang
                   ` (2 preceding siblings ...)
  2022-08-22  8:21 ` [PATCH v3 3/4] mm: rmap: Extend tlbbatch APIs to fit new platforms Yicong Yang
@ 2022-08-22  8:21 ` Yicong Yang
  2022-08-24  9:46   ` Kefeng Wang
                     ` (3 more replies)
  2022-09-06  8:53 ` [PATCH v3 0/4] mm: arm64: bring up BATCHED_UNMAP_TLB_FLUSH Yicong Yang
  4 siblings, 4 replies; 34+ messages in thread
From: Yicong Yang @ 2022-08-22  8:21 UTC (permalink / raw)
  To: akpm, linux-mm, linux-arm-kernel, x86, catalin.marinas, will, linux-doc
  Cc: corbet, peterz, arnd, linux-kernel, darren, yangyicong, huzhanyuan, lipeifeng, zhangshiming, guojian, realmz6, linux-mips, openrisc, linuxppc-dev, linux-riscv, linux-s390, Barry Song, wangkefeng.wang, xhao, prime.zeng, anshuman.khandual, Barry Song, Nadav Amit, Mel Gorman

From: Barry Song <v-songbaohua@oppo.com>

On x86, batched and deferred tlb shootdown has led to 90%
performance increase on tlb shootdown. On arm64, HW can do
tlb shootdown without software IPI. But sync tlbi is still
quite expensive.

Even running the simplest program which requires swapout can
prove this is true:
 #include <sys/types.h>
 #include <unistd.h>
 #include <sys/mman.h>
 #include <string.h>

 int main()
 {
 #define SIZE (1 * 1024 * 1024)
         volatile unsigned char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                                          MAP_SHARED | MAP_ANONYMOUS, -1, 0);

         memset(p, 0x88, SIZE);

         for (int k = 0; k < 10000; k++) {
                 /* swap in */
                 for (int i = 0; i < SIZE; i += 4096) {
                         (void)p[i];
                 }

                 /* swap out */
                 madvise(p, SIZE, MADV_PAGEOUT);
         }
 }

Perf result on snapdragon 888 with 8 cores by using zRAM
as the swap block device.

 ~ # perf record taskset -c 4 ./a.out
 [ perf record: Woken up 10 times to write data ]
 [ perf record: Captured and wrote 2.297 MB perf.data (60084 samples) ]
 ~ # perf report
 # To display the perf.data header info, please use --header/--header-only options.
 #
 #
 # Total Lost Samples: 0
 #
 # Samples: 60K of event 'cycles'
 # Event count (approx.): 35706225414
 #
 # Overhead  Command  Shared Object      Symbol
 # ........  .......  .................  .............................................................................
 #
    21.07%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock_irq
     8.23%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
     6.67%  a.out    [kernel.kallsyms]  [k] filemap_map_pages
     6.16%  a.out    [kernel.kallsyms]  [k] __zram_bvec_write
     5.36%  a.out    [kernel.kallsyms]  [k] ptep_clear_flush
     3.71%  a.out    [kernel.kallsyms]  [k] _raw_spin_lock
     3.49%  a.out    [kernel.kallsyms]  [k] memset64
     1.63%  a.out    [kernel.kallsyms]  [k] clear_page
     1.42%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock
     1.26%  a.out    [kernel.kallsyms]  [k] mod_zone_state.llvm.8525150236079521930
     1.23%  a.out    [kernel.kallsyms]  [k] xas_load
     1.15%  a.out    [kernel.kallsyms]  [k] zram_slot_lock

ptep_clear_flush() takes 5.36% CPU in the micro-benchmark
swapping in/out a page mapped by only one process. If the
page is mapped by multiple processes, typically, like more
than 100 on a phone, the overhead would be much higher as
we have to run tlb flush 100 times for one single page.
Plus, tlb flush overhead will increase with the number
of CPU cores due to the bad scalability of tlb shootdown
in HW, so those ARM64 servers should expect much higher
overhead.

Further perf annotate shows 95% cpu time of ptep_clear_flush
is actually used by the final dsb() to wait for the completion
of tlb flush. This provides us a very good chance to leverage
the existing batched tlb in kernel. The minimum modification
is that we only send async tlbi in the first stage and we send
dsb while we have to sync in the second stage.

With the simple micro benchmark above, the elapsed time to
finish the program decreases by around 5%.

Typical elapsed time w/o patch:
 ~ # time taskset -c 4 ./a.out
 0.21user 14.34system 0:14.69elapsed
w/ patch:
 ~ # time taskset -c 4 ./a.out
 0.22user 13.45system 0:13.80elapsed

Also, Yicong Yang added the following observation.
	Tested with benchmark in the commit on Kunpeng920 arm64 server,
	observed an improvement around 12.5% with command
	`time ./swap_bench`.
		w/o		w/
	real	0m13.460s	0m11.771s
	user	0m0.248s	0m0.279s
	sys	0m12.039s	0m11.458s

	Originally a 16.99% overhead of ptep_clear_flush() was noticed,
	which has been eliminated by this patch:

	[root@localhost yang]# perf record -- ./swap_bench && perf report
	[...]
	16.99%  swap_bench  [kernel.kallsyms]  [k] ptep_clear_flush

Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Nadav Amit <namit@vmware.com>
Cc: Mel Gorman <mgorman@suse.de>
Tested-by: Yicong Yang <yangyicong@hisilicon.com>
Tested-by: Xin Hao <xhao@linux.alibaba.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
---
 .../features/vm/TLB/arch-support.txt          |  2 +-
 arch/arm64/Kconfig                            |  1 +
 arch/arm64/include/asm/tlbbatch.h             | 12 ++++++++
 arch/arm64/include/asm/tlbflush.h             | 28 +++++++++++++++++--
 4 files changed, 40 insertions(+), 3 deletions(-)
 create mode 100644 arch/arm64/include/asm/tlbbatch.h

diff --git a/Documentation/features/vm/TLB/arch-support.txt b/Documentation/features/vm/TLB/arch-support.txt
index 1c009312b9c1..2caf815d7c6c 100644
--- a/Documentation/features/vm/TLB/arch-support.txt
+++ b/Documentation/features/vm/TLB/arch-support.txt
@@ -9,7 +9,7 @@
     |       alpha: | TODO |
     |         arc: | TODO |
     |         arm: | TODO |
-    |       arm64: | TODO |
+    |       arm64: |  ok  |
     |        csky: | TODO |
     |     hexagon: | TODO |
     |        ia64: | TODO |
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 571cc234d0b3..09d45cd6d665 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -93,6 +93,7 @@ config ARM64
 	select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
 	select ARCH_SUPPORTS_NUMA_BALANCING
 	select ARCH_SUPPORTS_PAGE_TABLE_CHECK
+	select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
 	select ARCH_WANT_COMPAT_IPC_PARSE_VERSION if COMPAT
 	select ARCH_WANT_DEFAULT_BPF_JIT
 	select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
diff --git a/arch/arm64/include/asm/tlbbatch.h b/arch/arm64/include/asm/tlbbatch.h
new file mode 100644
index 000000000000..fedb0b87b8db
--- /dev/null
+++ b/arch/arm64/include/asm/tlbbatch.h
@@ -0,0 +1,12 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ARCH_ARM64_TLBBATCH_H
+#define _ARCH_ARM64_TLBBATCH_H
+
+struct arch_tlbflush_unmap_batch {
+	/*
+	 * For arm64, HW can do tlb shootdown, so we don't
+	 * need to record cpumask for sending IPI
+	 */
+};
+
+#endif /* _ARCH_ARM64_TLBBATCH_H */
diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index 412a3b9a3c25..23cbc987321a 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -254,17 +254,24 @@ static inline void flush_tlb_mm(struct mm_struct *mm)
 	dsb(ish);
 }

-static inline void flush_tlb_page_nosync(struct vm_area_struct *vma,
+
+static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
 					 unsigned long uaddr)
 {
 	unsigned long addr;

 	dsb(ishst);
-	addr = __TLBI_VADDR(uaddr, ASID(vma->vm_mm));
+	addr = __TLBI_VADDR(uaddr, ASID(mm));
 	__tlbi(vale1is, addr);
 	__tlbi_user(vale1is, addr);
 }

+static inline void flush_tlb_page_nosync(struct vm_area_struct *vma,
+					 unsigned long uaddr)
+{
+	return __flush_tlb_page_nosync(vma->vm_mm, uaddr);
+}
+
 static inline void flush_tlb_page(struct vm_area_struct *vma,
 				  unsigned long uaddr)
 {
@@ -272,6 +279,23 @@ static inline void flush_tlb_page(struct vm_area_struct *vma,
 	dsb(ish);
 }

+static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
+{
+	return true;
+}
+
+static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
+					struct mm_struct *mm,
+					unsigned long uaddr)
+{
+	__flush_tlb_page_nosync(mm, uaddr);
+}
+
+static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
+{
+	dsb(ish);
+}
+
 /*
  * This is meant to avoid soft lock-ups on large TLB flushing ranges and not
  * necessarily a performance improvement.
--
2.24.0
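The two-stage idea in the commit message (issue the invalidations first, wait for completion once at the end) can be pictured with a small user-space counting model. This is only a sketch under stated assumptions: struct tlb_model, model_issue(), model_wait() and the two unmap_* functions are hypothetical names, and the model counts operations rather than measuring any hardware cost.

```c
#include <assert.h>
#include <stddef.h>

/* Toy counters; the names are illustrative and are not kernel APIs. */
struct tlb_model {
	unsigned long issued;	/* invalidations sent (stands in for tlbi) */
	unsigned long pending;	/* sent but not yet known to be complete */
	unsigned long barriers;	/* completion waits (stands in for dsb(ish)) */
};

static void model_issue(struct tlb_model *m)
{
	assert(m != NULL);
	m->issued++;
	m->pending++;
}

static void model_wait(struct tlb_model *m)
{
	assert(m != NULL);
	m->pending = 0;		/* everything issued so far is now complete */
	m->barriers++;
}

/* Synchronous path: every page pays for its own completion wait. */
static struct tlb_model unmap_sync_each(size_t npages)
{
	struct tlb_model m = {0};

	for (size_t i = 0; i < npages; i++) {
		model_issue(&m);
		model_wait(&m);
	}
	return m;
}

/* Batched path: stage one only issues, stage two waits exactly once. */
static struct tlb_model unmap_batched(size_t npages)
{
	struct tlb_model m = {0};

	for (size_t i = 0; i < npages; i++)
		model_issue(&m);
	model_wait(&m);
	return m;
}
```

Both paths end with nothing pending; the difference is that the batched path performs one wait for the whole batch instead of one per page, which is where the reported dsb() time would be saved.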
* Re: [PATCH v3 4/4] arm64: support batched/deferred tlb shootdown during page reclamation
  2022-08-22  8:21 ` [PATCH v3 4/4] arm64: support batched/deferred tlb shootdown during page reclamation Yicong Yang
@ 2022-08-24  9:46   ` Kefeng Wang
  2022-09-09  5:24   ` Anshuman Khandual
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 34+ messages in thread
From: Kefeng Wang @ 2022-08-24  9:46 UTC (permalink / raw)
  To: Yicong Yang, akpm, linux-mm, linux-arm-kernel, x86, catalin.marinas, will, linux-doc
  Cc: corbet, peterz, arnd, linux-kernel, darren, yangyicong, huzhanyuan, lipeifeng, zhangshiming, guojian, realmz6, linux-mips, openrisc, linuxppc-dev, linux-riscv, linux-s390, Barry Song, xhao, prime.zeng, anshuman.khandual, Barry Song, Nadav Amit, Mel Gorman

On 2022/8/22 16:21, Yicong Yang wrote:
> From: Barry Song <v-songbaohua@oppo.com>
>
> On x86, batched and deferred tlb shootdown has led to 90%
> performance increase on tlb shootdown. On arm64, HW can do
> tlb shootdown without software IPI. But sync tlbi is still
> quite expensive.
>
> Even running the simplest program which requires swapout can
> prove this is true:
>  #include <sys/types.h>
>  #include <unistd.h>
>  #include <sys/mman.h>
>  #include <string.h>
>
>  int main()
>  {
>  #define SIZE (1 * 1024 * 1024)
>          volatile unsigned char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
>                                           MAP_SHARED | MAP_ANONYMOUS, -1, 0);
>
>          memset(p, 0x88, SIZE);
>
>          for (int k = 0; k < 10000; k++) {
>                  /* swap in */
>                  for (int i = 0; i < SIZE; i += 4096) {
>                          (void)p[i];
>                  }
>
>                  /* swap out */
>                  madvise(p, SIZE, MADV_PAGEOUT);
>          }
>  }
>
> Perf result on snapdragon 888 with 8 cores by using zRAM
> as the swap block device.
>
>  ~ # perf record taskset -c 4 ./a.out
>  [ perf record: Woken up 10 times to write data ]
>  [ perf record: Captured and wrote 2.297 MB perf.data (60084 samples) ]
>  ~ # perf report
>  # To display the perf.data header info, please use --header/--header-only options.
>  #
>  #
>  # Total Lost Samples: 0
>  #
>  # Samples: 60K of event 'cycles'
>  # Event count (approx.): 35706225414
>  #
>  # Overhead  Command  Shared Object      Symbol
>  # ........  .......  .................  .............................................................................
>  #
>     21.07%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock_irq
>      8.23%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
>      6.67%  a.out    [kernel.kallsyms]  [k] filemap_map_pages
>      6.16%  a.out    [kernel.kallsyms]  [k] __zram_bvec_write
>      5.36%  a.out    [kernel.kallsyms]  [k] ptep_clear_flush
>      3.71%  a.out    [kernel.kallsyms]  [k] _raw_spin_lock
>      3.49%  a.out    [kernel.kallsyms]  [k] memset64
>      1.63%  a.out    [kernel.kallsyms]  [k] clear_page
>      1.42%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock
>      1.26%  a.out    [kernel.kallsyms]  [k] mod_zone_state.llvm.8525150236079521930
>      1.23%  a.out    [kernel.kallsyms]  [k] xas_load
>      1.15%  a.out    [kernel.kallsyms]  [k] zram_slot_lock
>
> ptep_clear_flush() takes 5.36% CPU in the micro-benchmark
> swapping in/out a page mapped by only one process. If the
> page is mapped by multiple processes, typically, like more
> than 100 on a phone, the overhead would be much higher as
> we have to run tlb flush 100 times for one single page.
> Plus, tlb flush overhead will increase with the number
> of CPU cores due to the bad scalability of tlb shootdown
> in HW, so those ARM64 servers should expect much higher
> overhead.
>
> Further perf annotate shows 95% cpu time of ptep_clear_flush
> is actually used by the final dsb() to wait for the completion
> of tlb flush. This provides us a very good chance to leverage
> the existing batched tlb in kernel. The minimum modification
> is that we only send async tlbi in the first stage and we send
> dsb while we have to sync in the second stage.
>
> With the simple micro benchmark above, the elapsed time to
> finish the program decreases by around 5%.
>
> Typical elapsed time w/o patch:
>  ~ # time taskset -c 4 ./a.out
>  0.21user 14.34system 0:14.69elapsed
> w/ patch:
>  ~ # time taskset -c 4 ./a.out
>  0.22user 13.45system 0:13.80elapsed
>
> Also, Yicong Yang added the following observation.
> 	Tested with benchmark in the commit on Kunpeng920 arm64 server,
> 	observed an improvement around 12.5% with command
> 	`time ./swap_bench`.
> 		w/o		w/
> 	real	0m13.460s	0m11.771s
> 	user	0m0.248s	0m0.279s
> 	sys	0m12.039s	0m11.458s
>
> 	Originally a 16.99% overhead of ptep_clear_flush() was noticed,
> 	which has been eliminated by this patch:
>
> 	[root@localhost yang]# perf record -- ./swap_bench && perf report
> 	[...]
> 	16.99%  swap_bench  [kernel.kallsyms]  [k] ptep_clear_flush
>
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: Nadav Amit <namit@vmware.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Tested-by: Yicong Yang <yangyicong@hisilicon.com>
> Tested-by: Xin Hao <xhao@linux.alibaba.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>

I tested on my kunpeng board too, looks good for now.

Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>

> ---
>  .../features/vm/TLB/arch-support.txt          |  2 +-
>  arch/arm64/Kconfig                            |  1 +
>  arch/arm64/include/asm/tlbbatch.h             | 12 ++++++++
>  arch/arm64/include/asm/tlbflush.h             | 28 +++++++++++++++++--
>  4 files changed, 40 insertions(+), 3 deletions(-)
>  create mode 100644 arch/arm64/include/asm/tlbbatch.h
>
> diff --git a/Documentation/features/vm/TLB/arch-support.txt b/Documentation/features/vm/TLB/arch-support.txt
> index 1c009312b9c1..2caf815d7c6c 100644
> --- a/Documentation/features/vm/TLB/arch-support.txt
> +++ b/Documentation/features/vm/TLB/arch-support.txt
> @@ -9,7 +9,7 @@
>      |       alpha: | TODO |
>      |         arc: | TODO |
>      |         arm: | TODO |
> -    |       arm64: | TODO |
> +    |       arm64: |  ok  |
>      |        csky: | TODO |
>      |     hexagon: | TODO |
>      |        ia64: | TODO |
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 571cc234d0b3..09d45cd6d665 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -93,6 +93,7 @@ config ARM64
>  	select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
>  	select ARCH_SUPPORTS_NUMA_BALANCING
>  	select ARCH_SUPPORTS_PAGE_TABLE_CHECK
> +	select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
>  	select ARCH_WANT_COMPAT_IPC_PARSE_VERSION if COMPAT
>  	select ARCH_WANT_DEFAULT_BPF_JIT
>  	select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
> diff --git a/arch/arm64/include/asm/tlbbatch.h b/arch/arm64/include/asm/tlbbatch.h
> new file mode 100644
> index 000000000000..fedb0b87b8db
> --- /dev/null
> +++ b/arch/arm64/include/asm/tlbbatch.h
> @@ -0,0 +1,12 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _ARCH_ARM64_TLBBATCH_H
> +#define _ARCH_ARM64_TLBBATCH_H
> +
> +struct arch_tlbflush_unmap_batch {
> +	/*
> +	 * For arm64, HW can do tlb shootdown, so we don't
> +	 * need to record cpumask for sending IPI
> +	 */
> +};
> +
> +#endif /* _ARCH_ARM64_TLBBATCH_H */
> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
> index 412a3b9a3c25..23cbc987321a 100644
> --- a/arch/arm64/include/asm/tlbflush.h
> +++ b/arch/arm64/include/asm/tlbflush.h
> @@ -254,17 +254,24 @@ static inline void flush_tlb_mm(struct mm_struct *mm)
>  	dsb(ish);
>  }
>
> -static inline void flush_tlb_page_nosync(struct vm_area_struct *vma,
> +
> +static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
>  					 unsigned long uaddr)
>  {
>  	unsigned long addr;
>
>  	dsb(ishst);
> -	addr = __TLBI_VADDR(uaddr, ASID(vma->vm_mm));
> +	addr = __TLBI_VADDR(uaddr, ASID(mm));
>  	__tlbi(vale1is, addr);
>  	__tlbi_user(vale1is, addr);
>  }
>
> +static inline void flush_tlb_page_nosync(struct vm_area_struct *vma,
> +					 unsigned long uaddr)
> +{
> +	return __flush_tlb_page_nosync(vma->vm_mm, uaddr);
> +}
> +
>  static inline void flush_tlb_page(struct vm_area_struct *vma,
>  				  unsigned long uaddr)
>  {
> @@ -272,6 +279,23 @@ static inline void flush_tlb_page(struct vm_area_struct *vma,
>  	dsb(ish);
>  }
>
> +static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
> +{
> +	return true;
> +}
> +
> +static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
> +					struct mm_struct *mm,
> +					unsigned long uaddr)
> +{
> +	__flush_tlb_page_nosync(mm, uaddr);
> +}
> +
> +static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
> +{
> +	dsb(ish);
> +}
> +
>  /*
>   * This is meant to avoid soft lock-ups on large TLB flushing ranges and not
>   * necessarily a performance improvement.
* Re: [PATCH v3 4/4] arm64: support batched/deferred tlb shootdown during page reclamation
  2022-08-22  8:21 ` [PATCH v3 4/4] arm64: support batched/deferred tlb shootdown during page reclamation Yicong Yang
  2022-08-24  9:46   ` Kefeng Wang
@ 2022-09-09  5:24   ` Anshuman Khandual
  2022-09-09  5:35     ` Barry Song
  2022-09-20  3:00   ` Anshuman Khandual
  2022-09-21  6:53   ` Anshuman Khandual
  3 siblings, 1 reply; 34+ messages in thread
From: Anshuman Khandual @ 2022-09-09  5:24 UTC (permalink / raw)
  To: Yicong Yang, akpm, linux-mm, linux-arm-kernel, x86, catalin.marinas, will, linux-doc
  Cc: corbet, peterz, arnd, linux-kernel, darren, yangyicong, huzhanyuan, lipeifeng, zhangshiming, guojian, realmz6, linux-mips, openrisc, linuxppc-dev, linux-riscv, linux-s390, Barry Song, wangkefeng.wang, xhao, prime.zeng, Barry Song, Nadav Amit, Mel Gorman

On 8/22/22 13:51, Yicong Yang wrote:
> From: Barry Song <v-songbaohua@oppo.com>
>
> On x86, batched and deferred tlb shootdown has led to 90%
> performance increase on tlb shootdown. On arm64, HW can do
> tlb shootdown without software IPI. But sync tlbi is still
> quite expensive.
>
> Even running the simplest program which requires swapout can
> prove this is true:
>  #include <sys/types.h>
>  #include <unistd.h>
>  #include <sys/mman.h>
>  #include <string.h>
>
>  int main()
>  {
>  #define SIZE (1 * 1024 * 1024)
>          volatile unsigned char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
>                                           MAP_SHARED | MAP_ANONYMOUS, -1, 0);
>
>          memset(p, 0x88, SIZE);
>
>          for (int k = 0; k < 10000; k++) {
>                  /* swap in */
>                  for (int i = 0; i < SIZE; i += 4096) {
>                          (void)p[i];
>                  }
>
>                  /* swap out */
>                  madvise(p, SIZE, MADV_PAGEOUT);
>          }
>  }
>
> Perf result on snapdragon 888 with 8 cores by using zRAM
> as the swap block device.
>
>  ~ # perf record taskset -c 4 ./a.out
>  [ perf record: Woken up 10 times to write data ]
>  [ perf record: Captured and wrote 2.297 MB perf.data (60084 samples) ]
>  ~ # perf report
>  # To display the perf.data header info, please use --header/--header-only options.
>  #
>  #
>  # Total Lost Samples: 0
>  #
>  # Samples: 60K of event 'cycles'
>  # Event count (approx.): 35706225414
>  #
>  # Overhead  Command  Shared Object      Symbol
>  # ........  .......  .................  .............................................................................
>  #
>     21.07%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock_irq
>      8.23%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
>      6.67%  a.out    [kernel.kallsyms]  [k] filemap_map_pages
>      6.16%  a.out    [kernel.kallsyms]  [k] __zram_bvec_write
>      5.36%  a.out    [kernel.kallsyms]  [k] ptep_clear_flush
>      3.71%  a.out    [kernel.kallsyms]  [k] _raw_spin_lock
>      3.49%  a.out    [kernel.kallsyms]  [k] memset64
>      1.63%  a.out    [kernel.kallsyms]  [k] clear_page
>      1.42%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock
>      1.26%  a.out    [kernel.kallsyms]  [k] mod_zone_state.llvm.8525150236079521930
>      1.23%  a.out    [kernel.kallsyms]  [k] xas_load
>      1.15%  a.out    [kernel.kallsyms]  [k] zram_slot_lock
>
> ptep_clear_flush() takes 5.36% CPU in the micro-benchmark
> swapping in/out a page mapped by only one process. If the
> page is mapped by multiple processes, typically, like more
> than 100 on a phone, the overhead would be much higher as
> we have to run tlb flush 100 times for one single page.
> Plus, tlb flush overhead will increase with the number
> of CPU cores due to the bad scalability of tlb shootdown
> in HW, so those ARM64 servers should expect much higher
> overhead.
>
> Further perf annotate shows 95% cpu time of ptep_clear_flush
> is actually used by the final dsb() to wait for the completion
> of tlb flush. This provides us a very good chance to leverage
> the existing batched tlb in kernel. The minimum modification
> is that we only send async tlbi in the first stage and we send
> dsb while we have to sync in the second stage.
>
> With the simple micro benchmark above, the elapsed time to
> finish the program decreases by around 5%.
>
> Typical elapsed time w/o patch:
>  ~ # time taskset -c 4 ./a.out
>  0.21user 14.34system 0:14.69elapsed
> w/ patch:
>  ~ # time taskset -c 4 ./a.out
>  0.22user 13.45system 0:13.80elapsed
>
> Also, Yicong Yang added the following observation.
> 	Tested with benchmark in the commit on Kunpeng920 arm64 server,
> 	observed an improvement around 12.5% with command
> 	`time ./swap_bench`.
> 		w/o		w/
> 	real	0m13.460s	0m11.771s
> 	user	0m0.248s	0m0.279s
> 	sys	0m12.039s	0m11.458s
>
> 	Originally a 16.99% overhead of ptep_clear_flush() was noticed,
> 	which has been eliminated by this patch:
>
> 	[root@localhost yang]# perf record -- ./swap_bench && perf report
> 	[...]
> 	16.99%  swap_bench  [kernel.kallsyms]  [k] ptep_clear_flush
>
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: Nadav Amit <namit@vmware.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Tested-by: Yicong Yang <yangyicong@hisilicon.com>
> Tested-by: Xin Hao <xhao@linux.alibaba.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
> ---
>  .../features/vm/TLB/arch-support.txt          |  2 +-
>  arch/arm64/Kconfig                            |  1 +
>  arch/arm64/include/asm/tlbbatch.h             | 12 ++++++++
>  arch/arm64/include/asm/tlbflush.h             | 28 +++++++++++++++++--
>  4 files changed, 40 insertions(+), 3 deletions(-)
>  create mode 100644 arch/arm64/include/asm/tlbbatch.h
>
> diff --git a/Documentation/features/vm/TLB/arch-support.txt b/Documentation/features/vm/TLB/arch-support.txt
> index 1c009312b9c1..2caf815d7c6c 100644
> --- a/Documentation/features/vm/TLB/arch-support.txt
> +++ b/Documentation/features/vm/TLB/arch-support.txt
> @@ -9,7 +9,7 @@
>      |       alpha: | TODO |
>      |         arc: | TODO |
>      |         arm: | TODO |
> -    |       arm64: | TODO |
> +    |       arm64: |  ok  |
>      |        csky: | TODO |
>      |     hexagon: | TODO |
>      |        ia64: | TODO |
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 571cc234d0b3..09d45cd6d665 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -93,6 +93,7 @@ config ARM64
>  	select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
>  	select ARCH_SUPPORTS_NUMA_BALANCING
>  	select ARCH_SUPPORTS_PAGE_TABLE_CHECK
> +	select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
>  	select ARCH_WANT_COMPAT_IPC_PARSE_VERSION if COMPAT
>  	select ARCH_WANT_DEFAULT_BPF_JIT
>  	select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
> diff --git a/arch/arm64/include/asm/tlbbatch.h b/arch/arm64/include/asm/tlbbatch.h
> new file mode 100644
> index 000000000000..fedb0b87b8db
> --- /dev/null
> +++ b/arch/arm64/include/asm/tlbbatch.h
> @@ -0,0 +1,12 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _ARCH_ARM64_TLBBATCH_H
> +#define _ARCH_ARM64_TLBBATCH_H
> +
> +struct arch_tlbflush_unmap_batch {
> +	/*
> +	 * For arm64, HW can do tlb shootdown, so we don't
> +	 * need to record cpumask for sending IPI
> +	 */
> +};
> +
> +#endif /* _ARCH_ARM64_TLBBATCH_H */
> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
> index 412a3b9a3c25..23cbc987321a 100644
> --- a/arch/arm64/include/asm/tlbflush.h
> +++ b/arch/arm64/include/asm/tlbflush.h
> @@ -254,17 +254,24 @@ static inline void flush_tlb_mm(struct mm_struct *mm)
>  	dsb(ish);
>  }
>
> -static inline void flush_tlb_page_nosync(struct vm_area_struct *vma,
> +
> +static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
>  					 unsigned long uaddr)
>  {
>  	unsigned long addr;
>
>  	dsb(ishst);
> -	addr = __TLBI_VADDR(uaddr, ASID(vma->vm_mm));
> +	addr = __TLBI_VADDR(uaddr, ASID(mm));
>  	__tlbi(vale1is, addr);
>  	__tlbi_user(vale1is, addr);
>  }
>
> +static inline void flush_tlb_page_nosync(struct vm_area_struct *vma,
> +					 unsigned long uaddr)
> +{
> +	return __flush_tlb_page_nosync(vma->vm_mm, uaddr);
> +}
> +
>  static inline void flush_tlb_page(struct vm_area_struct *vma,
>  				  unsigned long uaddr)
>  {
> @@ -272,6 +279,23 @@ static inline void flush_tlb_page(struct vm_area_struct *vma,
>  	dsb(ish);
>  }
>
> +static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
> +{
> +	return true;
> +}

Always defer and batch up TLB flush, unconditionally?

> +
> +static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
> +					struct mm_struct *mm,
> +					unsigned long uaddr)
> +{
> +	__flush_tlb_page_nosync(mm, uaddr);
> +}
> +
> +static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
> +{
> +	dsb(ish);
> +}

Adding up __flush_tlb_page_nosync() without a corresponding dsb(ish) and
then doing it once via arch_tlbbatch_flush() will have the same effect
from an architecture perspective?

> +
>  /*
>   * This is meant to avoid soft lock-ups on large TLB flushing ranges and not
>   * necessarily a performance improvement.
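The first question above contrasts an unconditional defer policy with a conditional one. As a hedged sketch only, the two can be compared in user space as below. The conditional variant assumes a policy of deferring only when some CPU other than the current one may hold stale entries; that policy is an assumption made for this illustration and is not stated anywhere in this message. Both function names and the 64-bit mask representation are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Unconditional policy, matching the "return true" in the quoted hunk. */
static bool should_defer_always(uint64_t mm_cpumask, unsigned int this_cpu)
{
	(void)mm_cpumask;
	(void)this_cpu;
	return true;
}

/*
 * Conditional policy (assumed for illustration): defer only if a CPU
 * other than the current one is present in the mask.
 */
static bool should_defer_if_remote(uint64_t mm_cpumask, unsigned int this_cpu)
{
	assert(this_cpu < 64);
	return (mm_cpumask & ~(UINT64_C(1) << this_cpu)) != 0;
}
```

When no per-mm CPU mask is tracked, as the empty arch_tlbflush_unmap_batch in the patch suggests, only the first form is directly expressible; any condition would need some other input.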
* Re: [PATCH v3 4/4] arm64: support batched/deferred tlb shootdown during page reclamation 2022-09-09 5:24 ` Anshuman Khandual @ 2022-09-09 5:35 ` Barry Song 2022-09-09 6:32 ` Yicong Yang 2022-09-15 6:07 ` Anshuman Khandual 0 siblings, 2 replies; 34+ messages in thread From: Barry Song @ 2022-09-09 5:35 UTC (permalink / raw) To: Anshuman Khandual Cc: Yicong Yang, akpm, linux-mm, linux-arm-kernel, x86, catalin.marinas, will, linux-doc, corbet, peterz, arnd, linux-kernel, darren, yangyicong, huzhanyuan, lipeifeng, zhangshiming, guojian, realmz6, linux-mips, openrisc, linuxppc-dev, linux-riscv, linux-s390, wangkefeng.wang, xhao, prime.zeng, Barry Song, Nadav Amit, Mel Gorman On Fri, Sep 9, 2022 at 5:24 PM Anshuman Khandual <anshuman.khandual@arm.com> wrote: > > > > On 8/22/22 13:51, Yicong Yang wrote: > > From: Barry Song <v-songbaohua@oppo.com> > > > > on x86, batched and deferred tlb shootdown has lead to 90% > > performance increase on tlb shootdown. on arm64, HW can do > > tlb shootdown without software IPI. But sync tlbi is still > > quite expensive. > > > > Even running a simplest program which requires swapout can > > prove this is true, > > #include <sys/types.h> > > #include <unistd.h> > > #include <sys/mman.h> > > #include <string.h> > > > > int main() > > { > > #define SIZE (1 * 1024 * 1024) > > volatile unsigned char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, > > MAP_SHARED | MAP_ANONYMOUS, -1, 0); > > > > memset(p, 0x88, SIZE); > > > > for (int k = 0; k < 10000; k++) { > > /* swap in */ > > for (int i = 0; i < SIZE; i += 4096) { > > (void)p[i]; > > } > > > > /* swap out */ > > madvise(p, SIZE, MADV_PAGEOUT); > > } > > } > > > > Perf result on snapdragon 888 with 8 cores by using zRAM > > as the swap block device. 
> > > > ~ # perf record taskset -c 4 ./a.out > > [ perf record: Woken up 10 times to write data ] > > [ perf record: Captured and wrote 2.297 MB perf.data (60084 samples) ] > > ~ # perf report > > # To display the perf.data header info, please use --header/--header-only options. > > # To display the perf.data header info, please use --header/--header-only options. > > # > > # > > # Total Lost Samples: 0 > > # > > # Samples: 60K of event 'cycles' > > # Event count (approx.): 35706225414 > > # > > # Overhead Command Shared Object Symbol > > # ........ ....... ................. ............................................................................. > > # > > 21.07% a.out [kernel.kallsyms] [k] _raw_spin_unlock_irq > > 8.23% a.out [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore > > 6.67% a.out [kernel.kallsyms] [k] filemap_map_pages > > 6.16% a.out [kernel.kallsyms] [k] __zram_bvec_write > > 5.36% a.out [kernel.kallsyms] [k] ptep_clear_flush > > 3.71% a.out [kernel.kallsyms] [k] _raw_spin_lock > > 3.49% a.out [kernel.kallsyms] [k] memset64 > > 1.63% a.out [kernel.kallsyms] [k] clear_page > > 1.42% a.out [kernel.kallsyms] [k] _raw_spin_unlock > > 1.26% a.out [kernel.kallsyms] [k] mod_zone_state.llvm.8525150236079521930 > > 1.23% a.out [kernel.kallsyms] [k] xas_load > > 1.15% a.out [kernel.kallsyms] [k] zram_slot_lock > > > > ptep_clear_flush() takes 5.36% CPU in the micro-benchmark > > swapping in/out a page mapped by only one process. If the > > page is mapped by multiple processes, typically, like more > > than 100 on a phone, the overhead would be much higher as > > we have to run tlb flush 100 times for one single page. > > Plus, tlb flush overhead will increase with the number > > of CPU cores due to the bad scalability of tlb shootdown > > in HW, so those ARM64 servers should expect much higher > > overhead. 
> > > > Further perf annonate shows 95% cpu time of ptep_clear_flush > > is actually used by the final dsb() to wait for the completion > > of tlb flush. This provides us a very good chance to leverage > > the existing batched tlb in kernel. The minimum modification > > is that we only send async tlbi in the first stage and we send > > dsb while we have to sync in the second stage. > > > > With the above simplest micro benchmark, collapsed time to > > finish the program decreases around 5%. > > > > Typical collapsed time w/o patch: > > ~ # time taskset -c 4 ./a.out > > 0.21user 14.34system 0:14.69elapsed > > w/ patch: > > ~ # time taskset -c 4 ./a.out > > 0.22user 13.45system 0:13.80elapsed > > > > Also, Yicong Yang added the following observation. > > Tested with benchmark in the commit on Kunpeng920 arm64 server, > > observed an improvement around 12.5% with command > > `time ./swap_bench`. > > w/o w/ > > real 0m13.460s 0m11.771s > > user 0m0.248s 0m0.279s > > sys 0m12.039s 0m11.458s > > > > Originally it's noticed a 16.99% overhead of ptep_clear_flush() > > which has been eliminated by this patch: > > > > [root@localhost yang]# perf record -- ./swap_bench && perf report > > [...] 
> > 16.99% swap_bench [kernel.kallsyms] [k] ptep_clear_flush > > > > Cc: Jonathan Corbet <corbet@lwn.net> > > Cc: Nadav Amit <namit@vmware.com> > > Cc: Mel Gorman <mgorman@suse.de> > > Tested-by: Yicong Yang <yangyicong@hisilicon.com> > > Tested-by: Xin Hao <xhao@linux.alibaba.com> > > Signed-off-by: Barry Song <v-songbaohua@oppo.com> > > Signed-off-by: Yicong Yang <yangyicong@hisilicon.com> > > --- > > .../features/vm/TLB/arch-support.txt | 2 +- > > arch/arm64/Kconfig | 1 + > > arch/arm64/include/asm/tlbbatch.h | 12 ++++++++ > > arch/arm64/include/asm/tlbflush.h | 28 +++++++++++++++++-- > > 4 files changed, 40 insertions(+), 3 deletions(-) > > create mode 100644 arch/arm64/include/asm/tlbbatch.h > > > > diff --git a/Documentation/features/vm/TLB/arch-support.txt b/Documentation/features/vm/TLB/arch-support.txt > > index 1c009312b9c1..2caf815d7c6c 100644 > > --- a/Documentation/features/vm/TLB/arch-support.txt > > +++ b/Documentation/features/vm/TLB/arch-support.txt > > @@ -9,7 +9,7 @@ > > | alpha: | TODO | > > | arc: | TODO | > > | arm: | TODO | > > - | arm64: | TODO | > > + | arm64: | ok | > > | csky: | TODO | > > | hexagon: | TODO | > > | ia64: | TODO | > > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig > > index 571cc234d0b3..09d45cd6d665 100644 > > --- a/arch/arm64/Kconfig > > +++ b/arch/arm64/Kconfig > > @@ -93,6 +93,7 @@ config ARM64 > > select ARCH_SUPPORTS_INT128 if CC_HAS_INT128 > > select ARCH_SUPPORTS_NUMA_BALANCING > > select ARCH_SUPPORTS_PAGE_TABLE_CHECK > > + select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH > > select ARCH_WANT_COMPAT_IPC_PARSE_VERSION if COMPAT > > select ARCH_WANT_DEFAULT_BPF_JIT > > select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT > > diff --git a/arch/arm64/include/asm/tlbbatch.h b/arch/arm64/include/asm/tlbbatch.h > > new file mode 100644 > > index 000000000000..fedb0b87b8db > > --- /dev/null > > +++ b/arch/arm64/include/asm/tlbbatch.h > > @@ -0,0 +1,12 @@ > > +/* SPDX-License-Identifier: GPL-2.0 */ > > +#ifndef 
_ARCH_ARM64_TLBBATCH_H
> > +#define _ARCH_ARM64_TLBBATCH_H
> > +
> > +struct arch_tlbflush_unmap_batch {
> > +        /*
> > +         * For arm64, HW can do tlb shootdown, so we don't
> > +         * need to record cpumask for sending IPI
> > +         */
> > +};
> > +
> > +#endif /* _ARCH_ARM64_TLBBATCH_H */
> > diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
> > index 412a3b9a3c25..23cbc987321a 100644
> > --- a/arch/arm64/include/asm/tlbflush.h
> > +++ b/arch/arm64/include/asm/tlbflush.h
> > @@ -254,17 +254,24 @@ static inline void flush_tlb_mm(struct mm_struct *mm)
> >          dsb(ish);
> >  }
> >
> > -static inline void flush_tlb_page_nosync(struct vm_area_struct *vma,
> > +
> > +static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
> >                                           unsigned long uaddr)
> >  {
> >          unsigned long addr;
> >
> >          dsb(ishst);
> > -        addr = __TLBI_VADDR(uaddr, ASID(vma->vm_mm));
> > +        addr = __TLBI_VADDR(uaddr, ASID(mm));
> >          __tlbi(vale1is, addr);
> >          __tlbi_user(vale1is, addr);
> >  }
> >
> > +static inline void flush_tlb_page_nosync(struct vm_area_struct *vma,
> > +                                         unsigned long uaddr)
> > +{
> > +        return __flush_tlb_page_nosync(vma->vm_mm, uaddr);
> > +}
> > +
> >  static inline void flush_tlb_page(struct vm_area_struct *vma,
> >                                    unsigned long uaddr)
> >  {
> > @@ -272,6 +279,23 @@ static inline void flush_tlb_page(struct vm_area_struct *vma,
> >          dsb(ish);
> >  }
> >
> > +static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
> > +{
> > +        return true;
> > +}
>
> Always defer and batch up TLB flush, unconditionally ?

My understanding is that we don't actually need tlbbatch on a machine with one
or two cores, as the TLB flush is not expensive there. Even on a system with
four Cortex-A55 cores, I didn't see an obvious cost; it was less than 1%.
With 8 cores, the cost of the TLB flush becomes obvious. For a server with
100 cores, the cost is huge.
But we can hardly write source code that differentiates machines by how many
cores they have, especially when cores can be hot-plugged.

>
> > +
> > +static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
> > +                                        struct mm_struct *mm,
> > +                                        unsigned long uaddr)
> > +{
> > +        __flush_tlb_page_nosync(mm, uaddr);
> > +}
> > +
> > +static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
> > +{
> > +        dsb(ish);
> > +}
>
> Adding up __flush_tlb_page_nosync() without a corresponding dsb(ish) and
> then doing once via arch_tlbbatch_flush() will have the same effect from
> an architecture perspective ?

The difference is that we drop the cost of many individual synchronous TLB
flushes. We only sync when we have to. dsb(ish) guarantees the completion of
all the TLB flush instructions issued before it.

>
> > +
> >  /*
> >   * This is meant to avoid soft lock-ups on large TLB flushing ranges and not
> >   * necessarily a performance improvement.

Thanks
Barry
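For readers following the exchange above, here is a minimal user-space sketch of the barrier accounting being discussed. It is not kernel code: `struct tlb_model` and the `model_*` helpers are hypothetical names that only count how often each instruction from the quoted diff would be issued. It assumes, as the quoted diff shows, that `__flush_tlb_page_nosync()` issues `dsb(ishst)` plus the invalidate and that `flush_tlb_page()` adds a trailing `dsb(ish)`.

```c
#include <stddef.h>

/* Hypothetical counters standing in for real instructions; not kernel code. */
struct tlb_model {
    unsigned long tlbi;      /* per-page invalidates issued */
    unsigned long dsb_ishst; /* store barriers issued before each invalidate */
    unsigned long dsb_ish;   /* completion barriers, i.e. the wait */
};

/* Models __flush_tlb_page_nosync(): issue the invalidate, do not wait. */
static void model_flush_page_nosync(struct tlb_model *m)
{
    m->dsb_ishst++;
    m->tlbi++;
}

/* Models flush_tlb_page(): issue the invalidate, then wait for completion. */
static void model_flush_page(struct tlb_model *m)
{
    model_flush_page_nosync(m);
    m->dsb_ish++;
}

/* Unbatched reclaim: every page pays for its own completion barrier. */
static void model_unmap_sync(struct tlb_model *m, size_t npages)
{
    for (size_t i = 0; i < npages; i++)
        model_flush_page(m);
}

/*
 * Batched reclaim: one arch_tlbbatch_add_mm() equivalent per page,
 * then a single arch_tlbbatch_flush() equivalent at the end.
 */
static void model_unmap_batched(struct tlb_model *m, size_t npages)
{
    for (size_t i = 0; i < npages; i++)
        model_flush_page_nosync(m);
    m->dsb_ish++;
}
```

In this model both paths issue the same number of invalidates and `dsb(ishst)` barriers; only the number of `dsb(ish)` waits differs, which is the saving described in the reply. The model says nothing about how expensive each instruction is on real hardware.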
* Re: [PATCH v3 4/4] arm64: support batched/deferred tlb shootdown during page reclamation 2022-09-09 5:35 ` Barry Song @ 2022-09-09 6:32 ` Yicong Yang 2022-09-15 6:07 ` Anshuman Khandual 1 sibling, 0 replies; 34+ messages in thread From: Yicong Yang @ 2022-09-09 6:32 UTC (permalink / raw) To: Barry Song, Anshuman Khandual Cc: yangyicong, akpm, linux-mm, linux-arm-kernel, x86, catalin.marinas, will, linux-doc, corbet, peterz, arnd, linux-kernel, darren, huzhanyuan, lipeifeng, zhangshiming, guojian, realmz6, linux-mips, openrisc, linuxppc-dev, linux-riscv, linux-s390, wangkefeng.wang, xhao, prime.zeng, Barry Song, Nadav Amit, Mel Gorman On 2022/9/9 13:35, Barry Song wrote: > On Fri, Sep 9, 2022 at 5:24 PM Anshuman Khandual > <anshuman.khandual@arm.com> wrote: >> >> >> >> On 8/22/22 13:51, Yicong Yang wrote: >>> From: Barry Song <v-songbaohua@oppo.com> >>> >>> on x86, batched and deferred tlb shootdown has lead to 90% >>> performance increase on tlb shootdown. on arm64, HW can do >>> tlb shootdown without software IPI. But sync tlbi is still >>> quite expensive. >>> >>> Even running a simplest program which requires swapout can >>> prove this is true, >>> #include <sys/types.h> >>> #include <unistd.h> >>> #include <sys/mman.h> >>> #include <string.h> >>> >>> int main() >>> { >>> #define SIZE (1 * 1024 * 1024) >>> volatile unsigned char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, >>> MAP_SHARED | MAP_ANONYMOUS, -1, 0); >>> >>> memset(p, 0x88, SIZE); >>> >>> for (int k = 0; k < 10000; k++) { >>> /* swap in */ >>> for (int i = 0; i < SIZE; i += 4096) { >>> (void)p[i]; >>> } >>> >>> /* swap out */ >>> madvise(p, SIZE, MADV_PAGEOUT); >>> } >>> } >>> >>> Perf result on snapdragon 888 with 8 cores by using zRAM >>> as the swap block device. 
>>> >>> ~ # perf record taskset -c 4 ./a.out >>> [ perf record: Woken up 10 times to write data ] >>> [ perf record: Captured and wrote 2.297 MB perf.data (60084 samples) ] >>> ~ # perf report >>> # To display the perf.data header info, please use --header/--header-only options. >>> # To display the perf.data header info, please use --header/--header-only options. >>> # >>> # >>> # Total Lost Samples: 0 >>> # >>> # Samples: 60K of event 'cycles' >>> # Event count (approx.): 35706225414 >>> # >>> # Overhead Command Shared Object Symbol >>> # ........ ....... ................. ............................................................................. >>> # >>> 21.07% a.out [kernel.kallsyms] [k] _raw_spin_unlock_irq >>> 8.23% a.out [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore >>> 6.67% a.out [kernel.kallsyms] [k] filemap_map_pages >>> 6.16% a.out [kernel.kallsyms] [k] __zram_bvec_write >>> 5.36% a.out [kernel.kallsyms] [k] ptep_clear_flush >>> 3.71% a.out [kernel.kallsyms] [k] _raw_spin_lock >>> 3.49% a.out [kernel.kallsyms] [k] memset64 >>> 1.63% a.out [kernel.kallsyms] [k] clear_page >>> 1.42% a.out [kernel.kallsyms] [k] _raw_spin_unlock >>> 1.26% a.out [kernel.kallsyms] [k] mod_zone_state.llvm.8525150236079521930 >>> 1.23% a.out [kernel.kallsyms] [k] xas_load >>> 1.15% a.out [kernel.kallsyms] [k] zram_slot_lock >>> >>> ptep_clear_flush() takes 5.36% CPU in the micro-benchmark >>> swapping in/out a page mapped by only one process. If the >>> page is mapped by multiple processes, typically, like more >>> than 100 on a phone, the overhead would be much higher as >>> we have to run tlb flush 100 times for one single page. >>> Plus, tlb flush overhead will increase with the number >>> of CPU cores due to the bad scalability of tlb shootdown >>> in HW, so those ARM64 servers should expect much higher >>> overhead. 
>>> >>> Further perf annonate shows 95% cpu time of ptep_clear_flush >>> is actually used by the final dsb() to wait for the completion >>> of tlb flush. This provides us a very good chance to leverage >>> the existing batched tlb in kernel. The minimum modification >>> is that we only send async tlbi in the first stage and we send >>> dsb while we have to sync in the second stage. >>> >>> With the above simplest micro benchmark, collapsed time to >>> finish the program decreases around 5%. >>> >>> Typical collapsed time w/o patch: >>> ~ # time taskset -c 4 ./a.out >>> 0.21user 14.34system 0:14.69elapsed >>> w/ patch: >>> ~ # time taskset -c 4 ./a.out >>> 0.22user 13.45system 0:13.80elapsed >>> >>> Also, Yicong Yang added the following observation. >>> Tested with benchmark in the commit on Kunpeng920 arm64 server, >>> observed an improvement around 12.5% with command >>> `time ./swap_bench`. >>> w/o w/ >>> real 0m13.460s 0m11.771s >>> user 0m0.248s 0m0.279s >>> sys 0m12.039s 0m11.458s >>> >>> Originally it's noticed a 16.99% overhead of ptep_clear_flush() >>> which has been eliminated by this patch: >>> >>> [root@localhost yang]# perf record -- ./swap_bench && perf report >>> [...] 
>>> 16.99% swap_bench [kernel.kallsyms] [k] ptep_clear_flush >>> >>> Cc: Jonathan Corbet <corbet@lwn.net> >>> Cc: Nadav Amit <namit@vmware.com> >>> Cc: Mel Gorman <mgorman@suse.de> >>> Tested-by: Yicong Yang <yangyicong@hisilicon.com> >>> Tested-by: Xin Hao <xhao@linux.alibaba.com> >>> Signed-off-by: Barry Song <v-songbaohua@oppo.com> >>> Signed-off-by: Yicong Yang <yangyicong@hisilicon.com> >>> --- >>> .../features/vm/TLB/arch-support.txt | 2 +- >>> arch/arm64/Kconfig | 1 + >>> arch/arm64/include/asm/tlbbatch.h | 12 ++++++++ >>> arch/arm64/include/asm/tlbflush.h | 28 +++++++++++++++++-- >>> 4 files changed, 40 insertions(+), 3 deletions(-) >>> create mode 100644 arch/arm64/include/asm/tlbbatch.h >>> >>> diff --git a/Documentation/features/vm/TLB/arch-support.txt b/Documentation/features/vm/TLB/arch-support.txt >>> index 1c009312b9c1..2caf815d7c6c 100644 >>> --- a/Documentation/features/vm/TLB/arch-support.txt >>> +++ b/Documentation/features/vm/TLB/arch-support.txt >>> @@ -9,7 +9,7 @@ >>> | alpha: | TODO | >>> | arc: | TODO | >>> | arm: | TODO | >>> - | arm64: | TODO | >>> + | arm64: | ok | >>> | csky: | TODO | >>> | hexagon: | TODO | >>> | ia64: | TODO | >>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig >>> index 571cc234d0b3..09d45cd6d665 100644 >>> --- a/arch/arm64/Kconfig >>> +++ b/arch/arm64/Kconfig >>> @@ -93,6 +93,7 @@ config ARM64 >>> select ARCH_SUPPORTS_INT128 if CC_HAS_INT128 >>> select ARCH_SUPPORTS_NUMA_BALANCING >>> select ARCH_SUPPORTS_PAGE_TABLE_CHECK >>> + select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH >>> select ARCH_WANT_COMPAT_IPC_PARSE_VERSION if COMPAT >>> select ARCH_WANT_DEFAULT_BPF_JIT >>> select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT >>> diff --git a/arch/arm64/include/asm/tlbbatch.h b/arch/arm64/include/asm/tlbbatch.h >>> new file mode 100644 >>> index 000000000000..fedb0b87b8db >>> --- /dev/null >>> +++ b/arch/arm64/include/asm/tlbbatch.h >>> @@ -0,0 +1,12 @@ >>> +/* SPDX-License-Identifier: GPL-2.0 */ >>> +#ifndef 
_ARCH_ARM64_TLBBATCH_H >>> +#define _ARCH_ARM64_TLBBATCH_H >>> + >>> +struct arch_tlbflush_unmap_batch { >>> + /* >>> + * For arm64, HW can do tlb shootdown, so we don't >>> + * need to record cpumask for sending IPI >>> + */ >>> +}; >>> + >>> +#endif /* _ARCH_ARM64_TLBBATCH_H */ >>> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h >>> index 412a3b9a3c25..23cbc987321a 100644 >>> --- a/arch/arm64/include/asm/tlbflush.h >>> +++ b/arch/arm64/include/asm/tlbflush.h >>> @@ -254,17 +254,24 @@ static inline void flush_tlb_mm(struct mm_struct *mm) >>> dsb(ish); >>> } >>> >>> -static inline void flush_tlb_page_nosync(struct vm_area_struct *vma, >>> + >>> +static inline void __flush_tlb_page_nosync(struct mm_struct *mm, >>> unsigned long uaddr) >>> { >>> unsigned long addr; >>> >>> dsb(ishst); >>> - addr = __TLBI_VADDR(uaddr, ASID(vma->vm_mm)); >>> + addr = __TLBI_VADDR(uaddr, ASID(mm)); >>> __tlbi(vale1is, addr); >>> __tlbi_user(vale1is, addr); >>> } >>> >>> +static inline void flush_tlb_page_nosync(struct vm_area_struct *vma, >>> + unsigned long uaddr) >>> +{ >>> + return __flush_tlb_page_nosync(vma->vm_mm, uaddr); >>> +} >>> + >>> static inline void flush_tlb_page(struct vm_area_struct *vma, >>> unsigned long uaddr) >>> { >>> @@ -272,6 +279,23 @@ static inline void flush_tlb_page(struct vm_area_struct *vma, >>> dsb(ish); >>> } >>> >>> +static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm) >>> +{ >>> + return true; >>> +} >> >> Always defer and batch up TLB flush, unconditionally ? > > My understanding is we actually don't need tlbbatch for a machine with one > or two cores as the tlb flush is not expensive. even for a system with four > cortex-a55 cores, i didn't see obvious cost. it was less than 1%. > when we have 8 cores, we see the obvious cost of tlb flush. for a server with > 100 crores, the cost is incredibly huge. 
>
> But, we can hardly write source code to differentiate machines according to
> how many cores a machine has, especially when cores can be hot-plugged.
>

Another thing is that we're not recording mm_cpumask() on arm64, so for now we
cannot do the check the way x86 and others do.

>>
>>> +
>>> +static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
>>> +                                        struct mm_struct *mm,
>>> +                                        unsigned long uaddr)
>>> +{
>>> +        __flush_tlb_page_nosync(mm, uaddr);
>>> +}
>>> +
>>> +static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
>>> +{
>>> +        dsb(ish);
>>> +}
>>
>> Adding up __flush_tlb_page_nosync() without a corresponding dsb(ish) and
>> then doing once via arch_tlbbatch_flush() will have the same effect from
>> an architecture perspective ?
>
> The difference is we drop the cost of lots of single tlb flush. we only need
> to sync when we have to sync. dsb(ish) guarantees the completion of previous
> multiple tlb flush instructions.
>
>>
>>> +
>>>  /*
>>>   * This is meant to avoid soft lock-ups on large TLB flushing ranges and not
>>>   * necessarily a performance improvement.
>
> Thanks
> Barry
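To illustrate the point about mm_cpumask(), the following is a small user-space sketch contrasting a cpumask-based deferral check with the unconditional one in this patch. It is an assumption-laden model, not the kernel implementation: `cpumask_model_t` and both `model_should_defer_*` helpers are hypothetical, the mask is limited to 64 CPUs, and the cpumask policy shown (defer only when a CPU other than the current one may hold translations for the mm) is my reading of what the x86-style check amounts to.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for mm_cpumask(): bit N set means CPU N may hold
 * TLB entries for this address space. Limited to 64 CPUs for illustration.
 */
typedef uint64_t cpumask_model_t;

/*
 * cpumask-based policy (assumed): defer only if some CPU other than the
 * current one may cache translations for the mm.
 */
static bool model_should_defer_cpumask(cpumask_model_t mm_mask,
                                       unsigned int this_cpu)
{
    cpumask_model_t self = (this_cpu < 64) ? (UINT64_C(1) << this_cpu) : 0;

    return (mm_mask & ~self) != 0;
}

/* Policy in this patch: no mask is recorded on arm64, so always defer. */
static bool model_should_defer_always(void)
{
    return true;
}
```

With a mask available, a process that has only ever run on the current CPU would skip deferral; without one, as noted above, the arm64 hook has nothing to test and returns true for every mm.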
* Re: [PATCH v3 4/4] arm64: support batched/deferred tlb shootdown during page reclamation 2022-09-09 5:35 ` Barry Song 2022-09-09 6:32 ` Yicong Yang @ 2022-09-15 6:07 ` Anshuman Khandual 2022-09-15 6:42 ` Barry Song 1 sibling, 1 reply; 34+ messages in thread From: Anshuman Khandual @ 2022-09-15 6:07 UTC (permalink / raw) To: Barry Song Cc: Yicong Yang, akpm, linux-mm, linux-arm-kernel, x86, catalin.marinas, will, linux-doc, corbet, peterz, arnd, linux-kernel, darren, yangyicong, huzhanyuan, lipeifeng, zhangshiming, guojian, realmz6, linux-mips, openrisc, linuxppc-dev, linux-riscv, linux-s390, wangkefeng.wang, xhao, prime.zeng, Barry Song, Nadav Amit, Mel Gorman On 9/9/22 11:05, Barry Song wrote: > On Fri, Sep 9, 2022 at 5:24 PM Anshuman Khandual > <anshuman.khandual@arm.com> wrote: >> >> >> >> On 8/22/22 13:51, Yicong Yang wrote: >>> From: Barry Song <v-songbaohua@oppo.com> >>> >>> on x86, batched and deferred tlb shootdown has lead to 90% >>> performance increase on tlb shootdown. on arm64, HW can do >>> tlb shootdown without software IPI. But sync tlbi is still >>> quite expensive. >>> >>> Even running a simplest program which requires swapout can >>> prove this is true, >>> #include <sys/types.h> >>> #include <unistd.h> >>> #include <sys/mman.h> >>> #include <string.h> >>> >>> int main() >>> { >>> #define SIZE (1 * 1024 * 1024) >>> volatile unsigned char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, >>> MAP_SHARED | MAP_ANONYMOUS, -1, 0); >>> >>> memset(p, 0x88, SIZE); >>> >>> for (int k = 0; k < 10000; k++) { >>> /* swap in */ >>> for (int i = 0; i < SIZE; i += 4096) { >>> (void)p[i]; >>> } >>> >>> /* swap out */ >>> madvise(p, SIZE, MADV_PAGEOUT); >>> } >>> } >>> >>> Perf result on snapdragon 888 with 8 cores by using zRAM >>> as the swap block device. 
>>> >>> ~ # perf record taskset -c 4 ./a.out >>> [ perf record: Woken up 10 times to write data ] >>> [ perf record: Captured and wrote 2.297 MB perf.data (60084 samples) ] >>> ~ # perf report >>> # To display the perf.data header info, please use --header/--header-only options. >>> # To display the perf.data header info, please use --header/--header-only options. >>> # >>> # >>> # Total Lost Samples: 0 >>> # >>> # Samples: 60K of event 'cycles' >>> # Event count (approx.): 35706225414 >>> # >>> # Overhead Command Shared Object Symbol >>> # ........ ....... ................. ............................................................................. >>> # >>> 21.07% a.out [kernel.kallsyms] [k] _raw_spin_unlock_irq >>> 8.23% a.out [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore >>> 6.67% a.out [kernel.kallsyms] [k] filemap_map_pages >>> 6.16% a.out [kernel.kallsyms] [k] __zram_bvec_write >>> 5.36% a.out [kernel.kallsyms] [k] ptep_clear_flush >>> 3.71% a.out [kernel.kallsyms] [k] _raw_spin_lock >>> 3.49% a.out [kernel.kallsyms] [k] memset64 >>> 1.63% a.out [kernel.kallsyms] [k] clear_page >>> 1.42% a.out [kernel.kallsyms] [k] _raw_spin_unlock >>> 1.26% a.out [kernel.kallsyms] [k] mod_zone_state.llvm.8525150236079521930 >>> 1.23% a.out [kernel.kallsyms] [k] xas_load >>> 1.15% a.out [kernel.kallsyms] [k] zram_slot_lock >>> >>> ptep_clear_flush() takes 5.36% CPU in the micro-benchmark >>> swapping in/out a page mapped by only one process. If the >>> page is mapped by multiple processes, typically, like more >>> than 100 on a phone, the overhead would be much higher as >>> we have to run tlb flush 100 times for one single page. >>> Plus, tlb flush overhead will increase with the number >>> of CPU cores due to the bad scalability of tlb shootdown >>> in HW, so those ARM64 servers should expect much higher >>> overhead. 
>>> >>> Further perf annonate shows 95% cpu time of ptep_clear_flush >>> is actually used by the final dsb() to wait for the completion >>> of tlb flush. This provides us a very good chance to leverage >>> the existing batched tlb in kernel. The minimum modification >>> is that we only send async tlbi in the first stage and we send >>> dsb while we have to sync in the second stage. >>> >>> With the above simplest micro benchmark, collapsed time to >>> finish the program decreases around 5%. >>> >>> Typical collapsed time w/o patch: >>> ~ # time taskset -c 4 ./a.out >>> 0.21user 14.34system 0:14.69elapsed >>> w/ patch: >>> ~ # time taskset -c 4 ./a.out >>> 0.22user 13.45system 0:13.80elapsed >>> >>> Also, Yicong Yang added the following observation. >>> Tested with benchmark in the commit on Kunpeng920 arm64 server, >>> observed an improvement around 12.5% with command >>> `time ./swap_bench`. >>> w/o w/ >>> real 0m13.460s 0m11.771s >>> user 0m0.248s 0m0.279s >>> sys 0m12.039s 0m11.458s >>> >>> Originally it's noticed a 16.99% overhead of ptep_clear_flush() >>> which has been eliminated by this patch: >>> >>> [root@localhost yang]# perf record -- ./swap_bench && perf report >>> [...] 
>>> 16.99% swap_bench [kernel.kallsyms] [k] ptep_clear_flush >>> >>> Cc: Jonathan Corbet <corbet@lwn.net> >>> Cc: Nadav Amit <namit@vmware.com> >>> Cc: Mel Gorman <mgorman@suse.de> >>> Tested-by: Yicong Yang <yangyicong@hisilicon.com> >>> Tested-by: Xin Hao <xhao@linux.alibaba.com> >>> Signed-off-by: Barry Song <v-songbaohua@oppo.com> >>> Signed-off-by: Yicong Yang <yangyicong@hisilicon.com> >>> --- >>> .../features/vm/TLB/arch-support.txt | 2 +- >>> arch/arm64/Kconfig | 1 + >>> arch/arm64/include/asm/tlbbatch.h | 12 ++++++++ >>> arch/arm64/include/asm/tlbflush.h | 28 +++++++++++++++++-- >>> 4 files changed, 40 insertions(+), 3 deletions(-) >>> create mode 100644 arch/arm64/include/asm/tlbbatch.h >>> >>> diff --git a/Documentation/features/vm/TLB/arch-support.txt b/Documentation/features/vm/TLB/arch-support.txt >>> index 1c009312b9c1..2caf815d7c6c 100644 >>> --- a/Documentation/features/vm/TLB/arch-support.txt >>> +++ b/Documentation/features/vm/TLB/arch-support.txt >>> @@ -9,7 +9,7 @@ >>> | alpha: | TODO | >>> | arc: | TODO | >>> | arm: | TODO | >>> - | arm64: | TODO | >>> + | arm64: | ok | >>> | csky: | TODO | >>> | hexagon: | TODO | >>> | ia64: | TODO | >>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig >>> index 571cc234d0b3..09d45cd6d665 100644 >>> --- a/arch/arm64/Kconfig >>> +++ b/arch/arm64/Kconfig >>> @@ -93,6 +93,7 @@ config ARM64 >>> select ARCH_SUPPORTS_INT128 if CC_HAS_INT128 >>> select ARCH_SUPPORTS_NUMA_BALANCING >>> select ARCH_SUPPORTS_PAGE_TABLE_CHECK >>> + select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH >>> select ARCH_WANT_COMPAT_IPC_PARSE_VERSION if COMPAT >>> select ARCH_WANT_DEFAULT_BPF_JIT >>> select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT >>> diff --git a/arch/arm64/include/asm/tlbbatch.h b/arch/arm64/include/asm/tlbbatch.h >>> new file mode 100644 >>> index 000000000000..fedb0b87b8db >>> --- /dev/null >>> +++ b/arch/arm64/include/asm/tlbbatch.h >>> @@ -0,0 +1,12 @@ >>> +/* SPDX-License-Identifier: GPL-2.0 */ >>> +#ifndef 
_ARCH_ARM64_TLBBATCH_H >>> +#define _ARCH_ARM64_TLBBATCH_H >>> + >>> +struct arch_tlbflush_unmap_batch { >>> + /* >>> + * For arm64, HW can do tlb shootdown, so we don't >>> + * need to record cpumask for sending IPI >>> + */ >>> +}; >>> + >>> +#endif /* _ARCH_ARM64_TLBBATCH_H */ >>> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h >>> index 412a3b9a3c25..23cbc987321a 100644 >>> --- a/arch/arm64/include/asm/tlbflush.h >>> +++ b/arch/arm64/include/asm/tlbflush.h >>> @@ -254,17 +254,24 @@ static inline void flush_tlb_mm(struct mm_struct *mm) >>> dsb(ish); >>> } >>> >>> -static inline void flush_tlb_page_nosync(struct vm_area_struct *vma, >>> + >>> +static inline void __flush_tlb_page_nosync(struct mm_struct *mm, >>> unsigned long uaddr) >>> { >>> unsigned long addr; >>> >>> dsb(ishst); >>> - addr = __TLBI_VADDR(uaddr, ASID(vma->vm_mm)); >>> + addr = __TLBI_VADDR(uaddr, ASID(mm)); >>> __tlbi(vale1is, addr); >>> __tlbi_user(vale1is, addr); >>> } >>> >>> +static inline void flush_tlb_page_nosync(struct vm_area_struct *vma, >>> + unsigned long uaddr) >>> +{ >>> + return __flush_tlb_page_nosync(vma->vm_mm, uaddr); >>> +} >>> + >>> static inline void flush_tlb_page(struct vm_area_struct *vma, >>> unsigned long uaddr) >>> { >>> @@ -272,6 +279,23 @@ static inline void flush_tlb_page(struct vm_area_struct *vma, >>> dsb(ish); >>> } >>> >>> +static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm) >>> +{ >>> + return true; >>> +} >> >> Always defer and batch up TLB flush, unconditionally ? > > My understanding is we actually don't need tlbbatch for a machine with one > or two cores as the tlb flush is not expensive. even for a system with four > cortex-a55 cores, i didn't see obvious cost. it was less than 1%. > when we have 8 cores, we see the obvious cost of tlb flush. for a server with > 100 crores, the cost is incredibly huge. 
Although dsb(ish) is deferred via arch_tlbbatch_flush(), there is still one
dsb(ishst) instruction left in __flush_tlb_page_nosync(). Isn't that expensive
as well while queuing up individual TLB flushes?

The very idea behind TLB deferral is the opportunity it (might) provide to
accumulate address ranges and CPU masks, so that individual TLB flushes can be
replaced with a more cost-effective range-based TLB flush. Hence I guess that
unless an address-range or cpumask-based cost-effective TLB flush is available,
deferral does not improve unmap performance as much.
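The idea of accumulating address ranges raised above can be sketched in user space as follows. Everything here is hypothetical and is not part of the patch, whose `struct arch_tlbflush_unmap_batch` is empty: `struct range_batch`, `MODEL_PAGE_SIZE` (4096, matching the stride in the benchmark) and the helpers only show how a batch might record a covering range and tell whether one range operation would cover exactly the queued pages. The density check assumes each page is queued at most once.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096UL

/* Hypothetical batch that records a covering range; not the kernel's struct. */
struct range_batch {
    unsigned long start; /* inclusive, page aligned */
    unsigned long end;   /* exclusive, page aligned */
    size_t npages;       /* pages queued so far */
};

static void range_batch_init(struct range_batch *b)
{
    b->start = 0;
    b->end = 0;
    b->npages = 0;
}

/* Queue one page and grow the covering range to include it. */
static void range_batch_add(struct range_batch *b, unsigned long uaddr)
{
    unsigned long page = uaddr & ~(MODEL_PAGE_SIZE - 1);

    if (b->npages == 0) {
        b->start = page;
        b->end = page + MODEL_PAGE_SIZE;
    } else {
        if (page < b->start)
            b->start = page;
        if (page + MODEL_PAGE_SIZE > b->end)
            b->end = page + MODEL_PAGE_SIZE;
    }
    b->npages++;
}

/*
 * True when the queued pages exactly fill the covering range, so a single
 * range invalidate would touch nothing extra. Assumes no page is queued twice.
 */
static bool range_batch_is_dense(const struct range_batch *b)
{
    return b->npages > 0 &&
           (b->end - b->start) / MODEL_PAGE_SIZE == b->npages;
}
```

Pages reclaimed from unrelated processes or scattered addresses would produce a sparse covering range, which is one reason a range-based flush may not always be the cheaper choice; the sketch only makes that trade-off visible and does not measure it.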
* Re: [PATCH v3 4/4] arm64: support batched/deferred tlb shootdown during page reclamation 2022-09-15 6:07 ` Anshuman Khandual @ 2022-09-15 6:42 ` Barry Song 2022-09-15 14:31 ` Nadav Amit 2022-09-19 4:24 ` Anshuman Khandual 0 siblings, 2 replies; 34+ messages in thread From: Barry Song @ 2022-09-15 6:42 UTC (permalink / raw) To: Anshuman Khandual Cc: Yicong Yang, akpm, linux-mm, linux-arm-kernel, x86, catalin.marinas, will, linux-doc, corbet, peterz, arnd, linux-kernel, darren, yangyicong, huzhanyuan, lipeifeng, zhangshiming, guojian, realmz6, linux-mips, openrisc, linuxppc-dev, linux-riscv, linux-s390, wangkefeng.wang, xhao, prime.zeng, Barry Song, Nadav Amit, Mel Gorman On Thu, Sep 15, 2022 at 6:07 PM Anshuman Khandual <anshuman.khandual@arm.com> wrote: > > > > On 9/9/22 11:05, Barry Song wrote: > > On Fri, Sep 9, 2022 at 5:24 PM Anshuman Khandual > > <anshuman.khandual@arm.com> wrote: > >> > >> > >> > >> On 8/22/22 13:51, Yicong Yang wrote: > >>> From: Barry Song <v-songbaohua@oppo.com> > >>> > >>> on x86, batched and deferred tlb shootdown has lead to 90% > >>> performance increase on tlb shootdown. on arm64, HW can do > >>> tlb shootdown without software IPI. But sync tlbi is still > >>> quite expensive. > >>> > >>> Even running a simplest program which requires swapout can > >>> prove this is true, > >>> #include <sys/types.h> > >>> #include <unistd.h> > >>> #include <sys/mman.h> > >>> #include <string.h> > >>> > >>> int main() > >>> { > >>> #define SIZE (1 * 1024 * 1024) > >>> volatile unsigned char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, > >>> MAP_SHARED | MAP_ANONYMOUS, -1, 0); > >>> > >>> memset(p, 0x88, SIZE); > >>> > >>> for (int k = 0; k < 10000; k++) { > >>> /* swap in */ > >>> for (int i = 0; i < SIZE; i += 4096) { > >>> (void)p[i]; > >>> } > >>> > >>> /* swap out */ > >>> madvise(p, SIZE, MADV_PAGEOUT); > >>> } > >>> } > >>> > >>> Perf result on snapdragon 888 with 8 cores by using zRAM > >>> as the swap block device. 
> >>> > >>> ~ # perf record taskset -c 4 ./a.out > >>> [ perf record: Woken up 10 times to write data ] > >>> [ perf record: Captured and wrote 2.297 MB perf.data (60084 samples) ] > >>> ~ # perf report > >>> # To display the perf.data header info, please use --header/--header-only options. > >>> # To display the perf.data header info, please use --header/--header-only options. > >>> # > >>> # > >>> # Total Lost Samples: 0 > >>> # > >>> # Samples: 60K of event 'cycles' > >>> # Event count (approx.): 35706225414 > >>> # > >>> # Overhead Command Shared Object Symbol > >>> # ........ ....... ................. ............................................................................. > >>> # > >>> 21.07% a.out [kernel.kallsyms] [k] _raw_spin_unlock_irq > >>> 8.23% a.out [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore > >>> 6.67% a.out [kernel.kallsyms] [k] filemap_map_pages > >>> 6.16% a.out [kernel.kallsyms] [k] __zram_bvec_write > >>> 5.36% a.out [kernel.kallsyms] [k] ptep_clear_flush > >>> 3.71% a.out [kernel.kallsyms] [k] _raw_spin_lock > >>> 3.49% a.out [kernel.kallsyms] [k] memset64 > >>> 1.63% a.out [kernel.kallsyms] [k] clear_page > >>> 1.42% a.out [kernel.kallsyms] [k] _raw_spin_unlock > >>> 1.26% a.out [kernel.kallsyms] [k] mod_zone_state.llvm.8525150236079521930 > >>> 1.23% a.out [kernel.kallsyms] [k] xas_load > >>> 1.15% a.out [kernel.kallsyms] [k] zram_slot_lock > >>> > >>> ptep_clear_flush() takes 5.36% CPU in the micro-benchmark > >>> swapping in/out a page mapped by only one process. If the > >>> page is mapped by multiple processes, typically, like more > >>> than 100 on a phone, the overhead would be much higher as > >>> we have to run tlb flush 100 times for one single page. > >>> Plus, tlb flush overhead will increase with the number > >>> of CPU cores due to the bad scalability of tlb shootdown > >>> in HW, so those ARM64 servers should expect much higher > >>> overhead. 
> >>> > >>> Further perf annonate shows 95% cpu time of ptep_clear_flush > >>> is actually used by the final dsb() to wait for the completion > >>> of tlb flush. This provides us a very good chance to leverage > >>> the existing batched tlb in kernel. The minimum modification > >>> is that we only send async tlbi in the first stage and we send > >>> dsb while we have to sync in the second stage. > >>> > >>> With the above simplest micro benchmark, collapsed time to > >>> finish the program decreases around 5%. > >>> > >>> Typical collapsed time w/o patch: > >>> ~ # time taskset -c 4 ./a.out > >>> 0.21user 14.34system 0:14.69elapsed > >>> w/ patch: > >>> ~ # time taskset -c 4 ./a.out > >>> 0.22user 13.45system 0:13.80elapsed > >>> > >>> Also, Yicong Yang added the following observation. > >>> Tested with benchmark in the commit on Kunpeng920 arm64 server, > >>> observed an improvement around 12.5% with command > >>> `time ./swap_bench`. > >>> w/o w/ > >>> real 0m13.460s 0m11.771s > >>> user 0m0.248s 0m0.279s > >>> sys 0m12.039s 0m11.458s > >>> > >>> Originally it's noticed a 16.99% overhead of ptep_clear_flush() > >>> which has been eliminated by this patch: > >>> > >>> [root@localhost yang]# perf record -- ./swap_bench && perf report > >>> [...] 
> >>> 16.99% swap_bench [kernel.kallsyms] [k] ptep_clear_flush > >>> > >>> Cc: Jonathan Corbet <corbet@lwn.net> > >>> Cc: Nadav Amit <namit@vmware.com> > >>> Cc: Mel Gorman <mgorman@suse.de> > >>> Tested-by: Yicong Yang <yangyicong@hisilicon.com> > >>> Tested-by: Xin Hao <xhao@linux.alibaba.com> > >>> Signed-off-by: Barry Song <v-songbaohua@oppo.com> > >>> Signed-off-by: Yicong Yang <yangyicong@hisilicon.com> > >>> --- > >>> .../features/vm/TLB/arch-support.txt | 2 +- > >>> arch/arm64/Kconfig | 1 + > >>> arch/arm64/include/asm/tlbbatch.h | 12 ++++++++ > >>> arch/arm64/include/asm/tlbflush.h | 28 +++++++++++++++++-- > >>> 4 files changed, 40 insertions(+), 3 deletions(-) > >>> create mode 100644 arch/arm64/include/asm/tlbbatch.h > >>> > >>> diff --git a/Documentation/features/vm/TLB/arch-support.txt b/Documentation/features/vm/TLB/arch-support.txt > >>> index 1c009312b9c1..2caf815d7c6c 100644 > >>> --- a/Documentation/features/vm/TLB/arch-support.txt > >>> +++ b/Documentation/features/vm/TLB/arch-support.txt > >>> @@ -9,7 +9,7 @@ > >>> | alpha: | TODO | > >>> | arc: | TODO | > >>> | arm: | TODO | > >>> - | arm64: | TODO | > >>> + | arm64: | ok | > >>> | csky: | TODO | > >>> | hexagon: | TODO | > >>> | ia64: | TODO | > >>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig > >>> index 571cc234d0b3..09d45cd6d665 100644 > >>> --- a/arch/arm64/Kconfig > >>> +++ b/arch/arm64/Kconfig > >>> @@ -93,6 +93,7 @@ config ARM64 > >>> select ARCH_SUPPORTS_INT128 if CC_HAS_INT128 > >>> select ARCH_SUPPORTS_NUMA_BALANCING > >>> select ARCH_SUPPORTS_PAGE_TABLE_CHECK > >>> + select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH > >>> select ARCH_WANT_COMPAT_IPC_PARSE_VERSION if COMPAT > >>> select ARCH_WANT_DEFAULT_BPF_JIT > >>> select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT > >>> diff --git a/arch/arm64/include/asm/tlbbatch.h b/arch/arm64/include/asm/tlbbatch.h > >>> new file mode 100644 > >>> index 000000000000..fedb0b87b8db > >>> --- /dev/null > >>> +++ b/arch/arm64/include/asm/tlbbatch.h > >>> 
@@ -0,0 +1,12 @@ > >>> +/* SPDX-License-Identifier: GPL-2.0 */ > >>> +#ifndef _ARCH_ARM64_TLBBATCH_H > >>> +#define _ARCH_ARM64_TLBBATCH_H > >>> + > >>> +struct arch_tlbflush_unmap_batch { > >>> + /* > >>> + * For arm64, HW can do tlb shootdown, so we don't > >>> + * need to record cpumask for sending IPI > >>> + */ > >>> +}; > >>> + > >>> +#endif /* _ARCH_ARM64_TLBBATCH_H */ > >>> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h > >>> index 412a3b9a3c25..23cbc987321a 100644 > >>> --- a/arch/arm64/include/asm/tlbflush.h > >>> +++ b/arch/arm64/include/asm/tlbflush.h > >>> @@ -254,17 +254,24 @@ static inline void flush_tlb_mm(struct mm_struct *mm) > >>> dsb(ish); > >>> } > >>> > >>> -static inline void flush_tlb_page_nosync(struct vm_area_struct *vma, > >>> + > >>> +static inline void __flush_tlb_page_nosync(struct mm_struct *mm, > >>> unsigned long uaddr) > >>> { > >>> unsigned long addr; > >>> > >>> dsb(ishst); > >>> - addr = __TLBI_VADDR(uaddr, ASID(vma->vm_mm)); > >>> + addr = __TLBI_VADDR(uaddr, ASID(mm)); > >>> __tlbi(vale1is, addr); > >>> __tlbi_user(vale1is, addr); > >>> } > >>> > >>> +static inline void flush_tlb_page_nosync(struct vm_area_struct *vma, > >>> + unsigned long uaddr) > >>> +{ > >>> + return __flush_tlb_page_nosync(vma->vm_mm, uaddr); > >>> +} > >>> + > >>> static inline void flush_tlb_page(struct vm_area_struct *vma, > >>> unsigned long uaddr) > >>> { > >>> @@ -272,6 +279,23 @@ static inline void flush_tlb_page(struct vm_area_struct *vma, > >>> dsb(ish); > >>> } > >>> > >>> +static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm) > >>> +{ > >>> + return true; > >>> +} > >> > >> Always defer and batch up TLB flush, unconditionally ? > > > > My understanding is we actually don't need tlbbatch for a machine with one > > or two cores as the tlb flush is not expensive. even for a system with four > > cortex-a55 cores, i didn't see obvious cost. it was less than 1%. 
> > when we have 8 cores, we see the obvious cost of tlb flush. for a server with
> > 100 cores, the cost is incredibly huge.
>
> Although dsb(ish) is deferred via arch_tlbbatch_flush(), there is still
> one dsb(ishst) instruction left in __flush_tlb_page_nosync(). Is not that
> expensive as well, while queuing up individual TLB flushes ?

This one is much cheaper, as it does not wait for the completion of tlbi.
Waiting for the completion of tlbi is a big deal on arm64; a similar
optimization can be seen in commit
3403e56b41c1 ("arm64: mm: Don't wait for completion of TLB invalidation when page aging").

>
> The very idea behind TLB deferral is the opportunity it (might) provide
> to accumulate address ranges and cpu masks so that individual TLB flush
> can be replaced with a more cost effective range based TLB flush. Hence
> I guess unless address range or cpumask based cost effective TLB flush
> is available, deferral does not improve the unmap performance as much.

If we wait for completion after sending each tlbi, we have to get an ack
from all CPUs in the system, which is why tlbi does not scale. The point
here is that we avoid waiting for each individual tlbi; instead, the waits
are batched. The benchmark in the commit log shows a clear drop in the
cost of swapping out a page.

Thanks
Barry
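The batching argument in the reply above can be sketched as a small userspace cost model. This is only an illustration under stated assumptions, not kernel code: `struct flush_cost`, `model_sync_unmap()` and `model_batched_unmap()` are hypothetical names, and the batch size is a free parameter rather than anything taken from the patch. It counts how many completion waits (the expensive dsb(ish)) each scheme would pay for the same number of tlbi instructions.

```c
#include <assert.h>

/* Hypothetical counters for a userspace cost model; not kernel code. */
struct flush_cost {
	unsigned long tlbi_issued;	/* broadcast invalidations sent */
	unsigned long dsb_waits;	/* times the CPU stalls for completion */
};

/* Synchronous scheme: every page pays for its own completion wait. */
static struct flush_cost model_sync_unmap(unsigned long npages)
{
	struct flush_cost c = { 0, 0 };

	for (unsigned long i = 0; i < npages; i++) {
		c.tlbi_issued++;	/* tlbi for this page */
		c.dsb_waits++;		/* dsb(ish) immediately after it */
	}
	return c;
}

/*
 * Deferred scheme: one tlbi per page, one completion wait per batch.
 * 'batch' is an assumed parameter of the model and must be non-zero.
 */
static struct flush_cost model_batched_unmap(unsigned long npages,
					     unsigned long batch)
{
	struct flush_cost c = { 0, 0 };
	unsigned long pending = 0;

	assert(batch > 0);
	for (unsigned long i = 0; i < npages; i++) {
		c.tlbi_issued++;	/* tlbi is still sent per page */
		if (++pending == batch) {
			c.dsb_waits++;	/* single wait covers the whole batch */
			pending = 0;
		}
	}
	if (pending)
		c.dsb_waits++;		/* flush the final partial batch */
	return c;
}
```

For the 256 pages touched per iteration by the benchmark (1 MiB in 4 KiB steps), the model issues the same 256 tlbi in both schemes; only the number of waits differs, and how much that saves in practice depends on the hardware, as the measurements in the commit log suggest.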
* Re: [PATCH v3 4/4] arm64: support batched/deferred tlb shootdown during page reclamation 2022-09-15 6:42 ` Barry Song @ 2022-09-15 14:31 ` Nadav Amit 2022-09-19 2:46 ` Anshuman Khandual 2022-09-19 4:24 ` Anshuman Khandual 1 sibling, 1 reply; 34+ messages in thread From: Nadav Amit @ 2022-09-15 14:31 UTC (permalink / raw) To: Barry Song Cc: Anshuman Khandual, Yicong Yang, Andrew Morton, Linux MM, linux-arm-kernel, x86, catalin.marinas, Will Deacon, linux-doc, corbet, peterz, arnd, linux-kernel, darren, yangyicong, huzhanyuan, lipeifeng, zhangshiming, guojian, realmz6, linux-mips, openrisc, linuxppc-dev, linux-riscv, linux-s390, wangkefeng.wang, xhao, prime.zeng, Barry Song, Mel Gorman

> On Sep 14, 2022, at 11:42 PM, Barry Song <21cnbao@gmail.com> wrote:
>
>>
>> The very idea behind TLB deferral is the opportunity it (might) provide
>> to accumulate address ranges and cpu masks so that individual TLB flush
>> can be replaced with a more cost effective range based TLB flush. Hence
>> I guess unless address range or cpumask based cost effective TLB flush
>> is available, deferral does not improve the unmap performance as much.
>
>
> After sending tlbi, if we wait for the completion of tlbi, we have to get Ack
> from all cpus in the system, tlbi is not scalable. The point here is that we
> avoid waiting for each individual TLBi. Alternatively, they are batched. If
> you read the benchmark in the commit log, you can find the great decline
> in the cost to swap out a page.

Just a minor correction: arch_tlbbatch_flush() does not collect ranges.
On x86 it only accumulates a CPU mask.
* Re: [PATCH v3 4/4] arm64: support batched/deferred tlb shootdown during page reclamation 2022-09-15 14:31 ` Nadav Amit @ 2022-09-19 2:46 ` Anshuman Khandual 0 siblings, 0 replies; 34+ messages in thread From: Anshuman Khandual @ 2022-09-19 2:46 UTC (permalink / raw) To: Nadav Amit, Barry Song Cc: Yicong Yang, Andrew Morton, Linux MM, linux-arm-kernel, x86, catalin.marinas, Will Deacon, linux-doc, corbet, peterz, arnd, linux-kernel, darren, yangyicong, huzhanyuan, lipeifeng, zhangshiming, guojian, realmz6, linux-mips, openrisc, linuxppc-dev, linux-riscv, linux-s390, wangkefeng.wang, xhao, prime.zeng, Barry Song, Mel Gorman On 9/15/22 20:01, Nadav Amit wrote: > > >> On Sep 14, 2022, at 11:42 PM, Barry Song <21cnbao@gmail.com> wrote: >> >>> >>> The very idea behind TLB deferral is the opportunity it (might) provide >>> to accumulate address ranges and cpu masks so that individual TLB flush >>> can be replaced with a more cost effective range based TLB flush. Hence >>> I guess unless address range or cpumask based cost effective TLB flush >>> is available, deferral does not improve the unmap performance as much. >> >> >> After sending tlbi, if we wait for the completion of tlbi, we have to get Ack >> from all cpus in the system, tlbi is not scalable. The point here is that we >> avoid waiting for each individual TLBi. Alternatively, they are batched. If >> you read the benchmark in the commit log, you can find the great decline >> in the cost to swap out a page. > > Just a minor correction: arch_tlbbatch_flush() does not collect ranges. > On x86 it only accumulate CPU mask. Thanks Nadav for the clarification.
* Re: [PATCH v3 4/4] arm64: support batched/deferred tlb shootdown during page reclamation 2022-09-15 6:42 ` Barry Song 2022-09-15 14:31 ` Nadav Amit @ 2022-09-19 4:24 ` Anshuman Khandual 2022-09-19 4:53 ` Barry Song 1 sibling, 1 reply; 34+ messages in thread From: Anshuman Khandual @ 2022-09-19 4:24 UTC (permalink / raw) To: Barry Song Cc: Yicong Yang, akpm, linux-mm, linux-arm-kernel, x86, catalin.marinas, will, linux-doc, corbet, peterz, arnd, linux-kernel, darren, yangyicong, huzhanyuan, lipeifeng, zhangshiming, guojian, realmz6, linux-mips, openrisc, linuxppc-dev, linux-riscv, linux-s390, wangkefeng.wang, xhao, prime.zeng, Barry Song, Nadav Amit, Mel Gorman On 9/15/22 12:12, Barry Song wrote: > On Thu, Sep 15, 2022 at 6:07 PM Anshuman Khandual > <anshuman.khandual@arm.com> wrote: >> >> >> >> On 9/9/22 11:05, Barry Song wrote: >>> On Fri, Sep 9, 2022 at 5:24 PM Anshuman Khandual >>> <anshuman.khandual@arm.com> wrote: >>>> >>>> >>>> >>>> On 8/22/22 13:51, Yicong Yang wrote: >>>>> From: Barry Song <v-songbaohua@oppo.com> >>>>> >>>>> on x86, batched and deferred tlb shootdown has lead to 90% >>>>> performance increase on tlb shootdown. on arm64, HW can do >>>>> tlb shootdown without software IPI. But sync tlbi is still >>>>> quite expensive. 
>>>>> >>>>> Even running a simplest program which requires swapout can >>>>> prove this is true, >>>>> #include <sys/types.h> >>>>> #include <unistd.h> >>>>> #include <sys/mman.h> >>>>> #include <string.h> >>>>> >>>>> int main() >>>>> { >>>>> #define SIZE (1 * 1024 * 1024) >>>>> volatile unsigned char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, >>>>> MAP_SHARED | MAP_ANONYMOUS, -1, 0); >>>>> >>>>> memset(p, 0x88, SIZE); >>>>> >>>>> for (int k = 0; k < 10000; k++) { >>>>> /* swap in */ >>>>> for (int i = 0; i < SIZE; i += 4096) { >>>>> (void)p[i]; >>>>> } >>>>> >>>>> /* swap out */ >>>>> madvise(p, SIZE, MADV_PAGEOUT); >>>>> } >>>>> } >>>>> >>>>> Perf result on snapdragon 888 with 8 cores by using zRAM >>>>> as the swap block device. >>>>> >>>>> ~ # perf record taskset -c 4 ./a.out >>>>> [ perf record: Woken up 10 times to write data ] >>>>> [ perf record: Captured and wrote 2.297 MB perf.data (60084 samples) ] >>>>> ~ # perf report >>>>> # To display the perf.data header info, please use --header/--header-only options. >>>>> # To display the perf.data header info, please use --header/--header-only options. >>>>> # >>>>> # >>>>> # Total Lost Samples: 0 >>>>> # >>>>> # Samples: 60K of event 'cycles' >>>>> # Event count (approx.): 35706225414 >>>>> # >>>>> # Overhead Command Shared Object Symbol >>>>> # ........ ....... ................. ............................................................................. 
>>>>> # >>>>> 21.07% a.out [kernel.kallsyms] [k] _raw_spin_unlock_irq >>>>> 8.23% a.out [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore >>>>> 6.67% a.out [kernel.kallsyms] [k] filemap_map_pages >>>>> 6.16% a.out [kernel.kallsyms] [k] __zram_bvec_write >>>>> 5.36% a.out [kernel.kallsyms] [k] ptep_clear_flush >>>>> 3.71% a.out [kernel.kallsyms] [k] _raw_spin_lock >>>>> 3.49% a.out [kernel.kallsyms] [k] memset64 >>>>> 1.63% a.out [kernel.kallsyms] [k] clear_page >>>>> 1.42% a.out [kernel.kallsyms] [k] _raw_spin_unlock >>>>> 1.26% a.out [kernel.kallsyms] [k] mod_zone_state.llvm.8525150236079521930 >>>>> 1.23% a.out [kernel.kallsyms] [k] xas_load >>>>> 1.15% a.out [kernel.kallsyms] [k] zram_slot_lock >>>>> >>>>> ptep_clear_flush() takes 5.36% CPU in the micro-benchmark >>>>> swapping in/out a page mapped by only one process. If the >>>>> page is mapped by multiple processes, typically, like more >>>>> than 100 on a phone, the overhead would be much higher as >>>>> we have to run tlb flush 100 times for one single page. >>>>> Plus, tlb flush overhead will increase with the number >>>>> of CPU cores due to the bad scalability of tlb shootdown >>>>> in HW, so those ARM64 servers should expect much higher >>>>> overhead. >>>>> >>>>> Further perf annonate shows 95% cpu time of ptep_clear_flush >>>>> is actually used by the final dsb() to wait for the completion >>>>> of tlb flush. This provides us a very good chance to leverage >>>>> the existing batched tlb in kernel. The minimum modification >>>>> is that we only send async tlbi in the first stage and we send >>>>> dsb while we have to sync in the second stage. >>>>> >>>>> With the above simplest micro benchmark, collapsed time to >>>>> finish the program decreases around 5%. 
>>>>> >>>>> Typical collapsed time w/o patch: >>>>> ~ # time taskset -c 4 ./a.out >>>>> 0.21user 14.34system 0:14.69elapsed >>>>> w/ patch: >>>>> ~ # time taskset -c 4 ./a.out >>>>> 0.22user 13.45system 0:13.80elapsed >>>>> >>>>> Also, Yicong Yang added the following observation. >>>>> Tested with benchmark in the commit on Kunpeng920 arm64 server, >>>>> observed an improvement around 12.5% with command >>>>> `time ./swap_bench`. >>>>> w/o w/ >>>>> real 0m13.460s 0m11.771s >>>>> user 0m0.248s 0m0.279s >>>>> sys 0m12.039s 0m11.458s >>>>> >>>>> Originally it's noticed a 16.99% overhead of ptep_clear_flush() >>>>> which has been eliminated by this patch: >>>>> >>>>> [root@localhost yang]# perf record -- ./swap_bench && perf report >>>>> [...] >>>>> 16.99% swap_bench [kernel.kallsyms] [k] ptep_clear_flush >>>>> >>>>> Cc: Jonathan Corbet <corbet@lwn.net> >>>>> Cc: Nadav Amit <namit@vmware.com> >>>>> Cc: Mel Gorman <mgorman@suse.de> >>>>> Tested-by: Yicong Yang <yangyicong@hisilicon.com> >>>>> Tested-by: Xin Hao <xhao@linux.alibaba.com> >>>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com> >>>>> Signed-off-by: Yicong Yang <yangyicong@hisilicon.com> >>>>> --- >>>>> .../features/vm/TLB/arch-support.txt | 2 +- >>>>> arch/arm64/Kconfig | 1 + >>>>> arch/arm64/include/asm/tlbbatch.h | 12 ++++++++ >>>>> arch/arm64/include/asm/tlbflush.h | 28 +++++++++++++++++-- >>>>> 4 files changed, 40 insertions(+), 3 deletions(-) >>>>> create mode 100644 arch/arm64/include/asm/tlbbatch.h >>>>> >>>>> diff --git a/Documentation/features/vm/TLB/arch-support.txt b/Documentation/features/vm/TLB/arch-support.txt >>>>> index 1c009312b9c1..2caf815d7c6c 100644 >>>>> --- a/Documentation/features/vm/TLB/arch-support.txt >>>>> +++ b/Documentation/features/vm/TLB/arch-support.txt >>>>> @@ -9,7 +9,7 @@ >>>>> | alpha: | TODO | >>>>> | arc: | TODO | >>>>> | arm: | TODO | >>>>> - | arm64: | TODO | >>>>> + | arm64: | ok | >>>>> | csky: | TODO | >>>>> | hexagon: | TODO | >>>>> | ia64: | TODO | >>>>> diff 
--git a/arch/arm64/Kconfig b/arch/arm64/Kconfig >>>>> index 571cc234d0b3..09d45cd6d665 100644 >>>>> --- a/arch/arm64/Kconfig >>>>> +++ b/arch/arm64/Kconfig >>>>> @@ -93,6 +93,7 @@ config ARM64 >>>>> select ARCH_SUPPORTS_INT128 if CC_HAS_INT128 >>>>> select ARCH_SUPPORTS_NUMA_BALANCING >>>>> select ARCH_SUPPORTS_PAGE_TABLE_CHECK >>>>> + select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH >>>>> select ARCH_WANT_COMPAT_IPC_PARSE_VERSION if COMPAT >>>>> select ARCH_WANT_DEFAULT_BPF_JIT >>>>> select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT >>>>> diff --git a/arch/arm64/include/asm/tlbbatch.h b/arch/arm64/include/asm/tlbbatch.h >>>>> new file mode 100644 >>>>> index 000000000000..fedb0b87b8db >>>>> --- /dev/null >>>>> +++ b/arch/arm64/include/asm/tlbbatch.h >>>>> @@ -0,0 +1,12 @@ >>>>> +/* SPDX-License-Identifier: GPL-2.0 */ >>>>> +#ifndef _ARCH_ARM64_TLBBATCH_H >>>>> +#define _ARCH_ARM64_TLBBATCH_H >>>>> + >>>>> +struct arch_tlbflush_unmap_batch { >>>>> + /* >>>>> + * For arm64, HW can do tlb shootdown, so we don't >>>>> + * need to record cpumask for sending IPI >>>>> + */ >>>>> +}; >>>>> + >>>>> +#endif /* _ARCH_ARM64_TLBBATCH_H */ >>>>> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h >>>>> index 412a3b9a3c25..23cbc987321a 100644 >>>>> --- a/arch/arm64/include/asm/tlbflush.h >>>>> +++ b/arch/arm64/include/asm/tlbflush.h >>>>> @@ -254,17 +254,24 @@ static inline void flush_tlb_mm(struct mm_struct *mm) >>>>> dsb(ish); >>>>> } >>>>> >>>>> -static inline void flush_tlb_page_nosync(struct vm_area_struct *vma, >>>>> + >>>>> +static inline void __flush_tlb_page_nosync(struct mm_struct *mm, >>>>> unsigned long uaddr) >>>>> { >>>>> unsigned long addr; >>>>> >>>>> dsb(ishst); >>>>> - addr = __TLBI_VADDR(uaddr, ASID(vma->vm_mm)); >>>>> + addr = __TLBI_VADDR(uaddr, ASID(mm)); >>>>> __tlbi(vale1is, addr); >>>>> __tlbi_user(vale1is, addr); >>>>> } >>>>> >>>>> +static inline void flush_tlb_page_nosync(struct vm_area_struct *vma, >>>>> + unsigned long uaddr) 
>>>>> +{ >>>>> + return __flush_tlb_page_nosync(vma->vm_mm, uaddr); >>>>> +} >>>>> + >>>>> static inline void flush_tlb_page(struct vm_area_struct *vma, >>>>> unsigned long uaddr) >>>>> { >>>>> @@ -272,6 +279,23 @@ static inline void flush_tlb_page(struct vm_area_struct *vma, >>>>> dsb(ish); >>>>> } >>>>> >>>>> +static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm) >>>>> +{ >>>>> + return true; >>>>> +} >>>> >>>> Always defer and batch up TLB flush, unconditionally ? >>> >>> My understanding is we actually don't need tlbbatch for a machine with one >>> or two cores as the tlb flush is not expensive. even for a system with four >>> cortex-a55 cores, i didn't see obvious cost. it was less than 1%. >>> when we have 8 cores, we see the obvious cost of tlb flush. for a server with >>> 100 crores, the cost is incredibly huge. >> >> Although dsb(ish) is deferred via arch_tlbbatch_flush(), there is still >> one dsb(isht) instruction left in __flush_tlb_page_nosync(). Is not that >> expensive as well, while queuing up individual TLB flushes ? > > This one is much much cheaper as it is not waiting for the > completion of tlbi. waiting for the completion of tlbi is a big > deal in arm64, thus, similar optimization can be seen here > > 3403e56b41c1("arm64: mm: Don't wait for completion of TLB invalidation > when page aging"). > > >> >> The very idea behind TLB deferral is the opportunity it (might) provide >> to accumulate address ranges and cpu masks so that individual TLB flush >> can be replaced with a more cost effective range based TLB flush. Hence >> I guess unless address range or cpumask based cost effective TLB flush >> is available, deferral does not improve the unmap performance as much. > > > After sending tlbi, if we wait for the completion of tlbi, we have to get Ack > from all cpus in the system, tlbi is not scalable. The point here is that we > avoid waiting for each individual TLBi. Alternatively, they are batched. 
> If you read the benchmark in the commit log, you can find the great decline
> in the cost to swap out a page.

Alright. Collecting and deferring 'dsb(ish)' to the very end does not feel
like a direct fit for ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH, but I guess it can
be used to improve unmap performance on arm64.

But is this 'dsb(ish)' deferral architecturally valid?

Let's examine the single page unmap path via try_to_unmap_one().

	should_defer_flush() {
		ptep_get_and_clear()
		set_tlb_ubc_flush_pending()
			arch_tlbbatch_add_mm()
				__flush_tlb_page_nosync()
	} else {
		ptep_clear_flush()
			ptep_get_and_clear()
			flush_tlb_page()
				flush_tlb_page_nosync()
					__flush_tlb_page_nosync()
				dsb(ish)
	}

	__flush_tlb_page_nosync()
	{
		dsb(ishst);
		addr = __TLBI_VADDR(uaddr, ASID(mm));
		__tlbi(vale1is, addr);
		__tlbi_user(vale1is, addr);
	}

Currently, without TLB deferral, 'dsb(ish)' gets executed just after __tlbi()
and __tlbi_user(), because __flush_tlb_page_nosync() is an inline function.

	#define __TLBI_0(op, arg) asm (ARM64_ASM_PREAMBLE                    \
	                               "tlbi " #op "\n"                      \
	                   ALTERNATIVE("nop\n nop",                          \
	                               "dsb ish\n tlbi " #op,                \
	                               ARM64_WORKAROUND_REPEAT_TLBI,         \
	                               CONFIG_ARM64_WORKAROUND_REPEAT_TLBI)  \
	                            : : )

	#define __TLBI_1(op, arg) asm (ARM64_ASM_PREAMBLE                    \
	                               "tlbi " #op ", %0\n"                  \
	                   ALTERNATIVE("nop\n nop",                          \
	                               "dsb ish\n tlbi " #op ", %0",         \
	                               ARM64_WORKAROUND_REPEAT_TLBI,         \
	                               CONFIG_ARM64_WORKAROUND_REPEAT_TLBI)  \
	                            : : "r" (arg))

	#define __TLBI_N(op, arg, n, ...) __TLBI_##n(op, arg)

	#define __tlbi(op, ...) __TLBI_N(op, ##__VA_ARGS__, 1, 0)

	#define __tlbi_user(op, arg) do {                                    \
	        if (arm64_kernel_unmapped_at_el0())                          \
	                __tlbi(op, (arg) | USER_ASID_FLAG);                  \
	} while (0)

There is already a 'dsb(ish)' in between two subsequent TLB operations when
ARM64_WORKAROUND_REPEAT_TLBI is detected on the system. Hence I guess
deferral should not be enabled on such systems?

But with deferral enabled, 'dsb(ish)' will be executed in arch_tlbbatch_flush()
via try_to_unmap_flush[_dirty](). There might be an arbitrary number of
instructions in between __tlbi()/__tlbi_user(), i.e. the 'tlbi' instructions,
and the final 'dsb(ish)'. Just wondering whether a 'tlbi' and a 'dsb(ish)'
that are detached in time, with other instructions in between, are
architecturally valid?

There is a comment in 'struct tlbflush_unmap_batch'.

	/*
	 * The arch code makes the following promise: generic code can modify a
	 * PTE, then call arch_tlbbatch_add_mm() (which internally provides all
	 * needed barriers), then call arch_tlbbatch_flush(), and the entries
	 * will be flushed on all CPUs by the time that arch_tlbbatch_flush()
	 * returns.
	 */

It expects arch_tlbbatch_add_mm() to provide all needed barriers, so I wonder
whether that covers just the first 'dsb(ishst)' and not the subsequent
'dsb(ish)'?
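To make the two paths discussed above easier to compare, here is a minimal userspace sketch that records the order of barrier and invalidation steps as a string. It is a model under stated assumptions, not the kernel implementation: all `model_*` and `trace_*` names are hypothetical, only one tlbi per page is modelled (as if arm64_kernel_unmapped_at_el0() were false, so __tlbi_user() does nothing), and the `repeat_tlbi` flag merely mimics the "tlbi; dsb ish; tlbi" expansion of the ARM64_WORKAROUND_REPEAT_TLBI alternative.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TRACE_MAX 512

/* Append one step name to a space-separated trace string. */
static void trace_add(char *trace, const char *step)
{
	/* +2: one separating space and the terminating NUL */
	assert(strlen(trace) + strlen(step) + 2 <= TRACE_MAX);
	if (trace[0] != '\0')
		strcat(trace, " ");
	strcat(trace, step);
}

/* Models one __tlbi(); the erratum workaround repeats it after a dsb. */
static void model_tlbi(char *trace, bool repeat_tlbi)
{
	trace_add(trace, "tlbi");
	if (repeat_tlbi) {
		trace_add(trace, "dsb(ish)");
		trace_add(trace, "tlbi");
	}
}

/* Models __flush_tlb_page_nosync(): order the PTE store, then invalidate. */
static void model_flush_page_nosync(char *trace, bool repeat_tlbi)
{
	trace_add(trace, "dsb(ishst)");
	model_tlbi(trace, repeat_tlbi);
}

/* Models one page going through the unmap path, deferred or not. */
static void model_unmap_one(char *trace, bool defer, bool repeat_tlbi)
{
	trace_add(trace, "clear_pte");
	model_flush_page_nosync(trace, repeat_tlbi);
	if (!defer)
		trace_add(trace, "dsb(ish)");	/* flush_tlb_page() waits here */
}

/* Builds the trace for npages pages; a deferred run ends with one dsb(ish). */
static const char *model_trace(char *trace, int npages, bool defer,
			       bool repeat_tlbi)
{
	trace[0] = '\0';
	for (int i = 0; i < npages; i++)
		model_unmap_one(trace, defer, repeat_tlbi);
	if (defer)
		trace_add(trace, "dsb(ish)");	/* stands in for arch_tlbbatch_flush() */
	return trace;
}
```

Calling model_trace() with a `char buf[TRACE_MAX]` and `defer` set shows each page contributing only "dsb(ishst) tlbi", with the single "dsb(ish)" at the very end, which is the "detached in time" ordering the question is about. Whether that ordering is architecturally sufficient is exactly what the thread goes on to discuss; the sketch does not answer it.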
* Re: [PATCH v3 4/4] arm64: support batched/deferred tlb shootdown during page reclamation 2022-09-19 4:24 ` Anshuman Khandual @ 2022-09-19 4:53 ` Barry Song 2022-09-19 5:08 ` Barry Song 0 siblings, 1 reply; 34+ messages in thread From: Barry Song @ 2022-09-19 4:53 UTC (permalink / raw) To: Anshuman Khandual Cc: Yicong Yang, akpm, linux-mm, linux-arm-kernel, x86, catalin.marinas, will, linux-doc, corbet, peterz, arnd, linux-kernel, darren, yangyicong, huzhanyuan, lipeifeng, zhangshiming, guojian, realmz6, linux-mips, openrisc, linuxppc-dev, linux-riscv, linux-s390, wangkefeng.wang, xhao, prime.zeng, Barry Song, Nadav Amit, Mel Gorman On Mon, Sep 19, 2022 at 4:24 PM Anshuman Khandual <anshuman.khandual@arm.com> wrote: > > > > On 9/15/22 12:12, Barry Song wrote: > > On Thu, Sep 15, 2022 at 6:07 PM Anshuman Khandual > > <anshuman.khandual@arm.com> wrote: > >> > >> > >> > >> On 9/9/22 11:05, Barry Song wrote: > >>> On Fri, Sep 9, 2022 at 5:24 PM Anshuman Khandual > >>> <anshuman.khandual@arm.com> wrote: > >>>> > >>>> > >>>> > >>>> On 8/22/22 13:51, Yicong Yang wrote: > >>>>> From: Barry Song <v-songbaohua@oppo.com> > >>>>> > >>>>> on x86, batched and deferred tlb shootdown has lead to 90% > >>>>> performance increase on tlb shootdown. on arm64, HW can do > >>>>> tlb shootdown without software IPI. But sync tlbi is still > >>>>> quite expensive. 
> >>>>> > >>>>> Even running a simplest program which requires swapout can > >>>>> prove this is true, > >>>>> #include <sys/types.h> > >>>>> #include <unistd.h> > >>>>> #include <sys/mman.h> > >>>>> #include <string.h> > >>>>> > >>>>> int main() > >>>>> { > >>>>> #define SIZE (1 * 1024 * 1024) > >>>>> volatile unsigned char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, > >>>>> MAP_SHARED | MAP_ANONYMOUS, -1, 0); > >>>>> > >>>>> memset(p, 0x88, SIZE); > >>>>> > >>>>> for (int k = 0; k < 10000; k++) { > >>>>> /* swap in */ > >>>>> for (int i = 0; i < SIZE; i += 4096) { > >>>>> (void)p[i]; > >>>>> } > >>>>> > >>>>> /* swap out */ > >>>>> madvise(p, SIZE, MADV_PAGEOUT); > >>>>> } > >>>>> } > >>>>> > >>>>> Perf result on snapdragon 888 with 8 cores by using zRAM > >>>>> as the swap block device. > >>>>> > >>>>> ~ # perf record taskset -c 4 ./a.out > >>>>> [ perf record: Woken up 10 times to write data ] > >>>>> [ perf record: Captured and wrote 2.297 MB perf.data (60084 samples) ] > >>>>> ~ # perf report > >>>>> # To display the perf.data header info, please use --header/--header-only options. > >>>>> # To display the perf.data header info, please use --header/--header-only options. > >>>>> # > >>>>> # > >>>>> # Total Lost Samples: 0 > >>>>> # > >>>>> # Samples: 60K of event 'cycles' > >>>>> # Event count (approx.): 35706225414 > >>>>> # > >>>>> # Overhead Command Shared Object Symbol > >>>>> # ........ ....... ................. ............................................................................. 
> >>>>> # > >>>>> 21.07% a.out [kernel.kallsyms] [k] _raw_spin_unlock_irq > >>>>> 8.23% a.out [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore > >>>>> 6.67% a.out [kernel.kallsyms] [k] filemap_map_pages > >>>>> 6.16% a.out [kernel.kallsyms] [k] __zram_bvec_write > >>>>> 5.36% a.out [kernel.kallsyms] [k] ptep_clear_flush > >>>>> 3.71% a.out [kernel.kallsyms] [k] _raw_spin_lock > >>>>> 3.49% a.out [kernel.kallsyms] [k] memset64 > >>>>> 1.63% a.out [kernel.kallsyms] [k] clear_page > >>>>> 1.42% a.out [kernel.kallsyms] [k] _raw_spin_unlock > >>>>> 1.26% a.out [kernel.kallsyms] [k] mod_zone_state.llvm.8525150236079521930 > >>>>> 1.23% a.out [kernel.kallsyms] [k] xas_load > >>>>> 1.15% a.out [kernel.kallsyms] [k] zram_slot_lock > >>>>> > >>>>> ptep_clear_flush() takes 5.36% CPU in the micro-benchmark > >>>>> swapping in/out a page mapped by only one process. If the > >>>>> page is mapped by multiple processes, typically, like more > >>>>> than 100 on a phone, the overhead would be much higher as > >>>>> we have to run tlb flush 100 times for one single page. > >>>>> Plus, tlb flush overhead will increase with the number > >>>>> of CPU cores due to the bad scalability of tlb shootdown > >>>>> in HW, so those ARM64 servers should expect much higher > >>>>> overhead. > >>>>> > >>>>> Further perf annonate shows 95% cpu time of ptep_clear_flush > >>>>> is actually used by the final dsb() to wait for the completion > >>>>> of tlb flush. This provides us a very good chance to leverage > >>>>> the existing batched tlb in kernel. The minimum modification > >>>>> is that we only send async tlbi in the first stage and we send > >>>>> dsb while we have to sync in the second stage. > >>>>> > >>>>> With the above simplest micro benchmark, collapsed time to > >>>>> finish the program decreases around 5%. 
> >>>>> > >>>>> Typical collapsed time w/o patch: > >>>>> ~ # time taskset -c 4 ./a.out > >>>>> 0.21user 14.34system 0:14.69elapsed > >>>>> w/ patch: > >>>>> ~ # time taskset -c 4 ./a.out > >>>>> 0.22user 13.45system 0:13.80elapsed > >>>>> > >>>>> Also, Yicong Yang added the following observation. > >>>>> Tested with benchmark in the commit on Kunpeng920 arm64 server, > >>>>> observed an improvement around 12.5% with command > >>>>> `time ./swap_bench`. > >>>>> w/o w/ > >>>>> real 0m13.460s 0m11.771s > >>>>> user 0m0.248s 0m0.279s > >>>>> sys 0m12.039s 0m11.458s > >>>>> > >>>>> Originally it's noticed a 16.99% overhead of ptep_clear_flush() > >>>>> which has been eliminated by this patch: > >>>>> > >>>>> [root@localhost yang]# perf record -- ./swap_bench && perf report > >>>>> [...] > >>>>> 16.99% swap_bench [kernel.kallsyms] [k] ptep_clear_flush > >>>>> > >>>>> Cc: Jonathan Corbet <corbet@lwn.net> > >>>>> Cc: Nadav Amit <namit@vmware.com> > >>>>> Cc: Mel Gorman <mgorman@suse.de> > >>>>> Tested-by: Yicong Yang <yangyicong@hisilicon.com> > >>>>> Tested-by: Xin Hao <xhao@linux.alibaba.com> > >>>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com> > >>>>> Signed-off-by: Yicong Yang <yangyicong@hisilicon.com> > >>>>> --- > >>>>> .../features/vm/TLB/arch-support.txt | 2 +- > >>>>> arch/arm64/Kconfig | 1 + > >>>>> arch/arm64/include/asm/tlbbatch.h | 12 ++++++++ > >>>>> arch/arm64/include/asm/tlbflush.h | 28 +++++++++++++++++-- > >>>>> 4 files changed, 40 insertions(+), 3 deletions(-) > >>>>> create mode 100644 arch/arm64/include/asm/tlbbatch.h > >>>>> > >>>>> diff --git a/Documentation/features/vm/TLB/arch-support.txt b/Documentation/features/vm/TLB/arch-support.txt > >>>>> index 1c009312b9c1..2caf815d7c6c 100644 > >>>>> --- a/Documentation/features/vm/TLB/arch-support.txt > >>>>> +++ b/Documentation/features/vm/TLB/arch-support.txt > >>>>> @@ -9,7 +9,7 @@ > >>>>> | alpha: | TODO | > >>>>> | arc: | TODO | > >>>>> | arm: | TODO | > >>>>> - | arm64: | TODO | > >>>>> + | 
arm64: | ok | > >>>>> | csky: | TODO | > >>>>> | hexagon: | TODO | > >>>>> | ia64: | TODO | > >>>>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig > >>>>> index 571cc234d0b3..09d45cd6d665 100644 > >>>>> --- a/arch/arm64/Kconfig > >>>>> +++ b/arch/arm64/Kconfig > >>>>> @@ -93,6 +93,7 @@ config ARM64 > >>>>> select ARCH_SUPPORTS_INT128 if CC_HAS_INT128 > >>>>> select ARCH_SUPPORTS_NUMA_BALANCING > >>>>> select ARCH_SUPPORTS_PAGE_TABLE_CHECK > >>>>> + select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH > >>>>> select ARCH_WANT_COMPAT_IPC_PARSE_VERSION if COMPAT > >>>>> select ARCH_WANT_DEFAULT_BPF_JIT > >>>>> select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT > >>>>> diff --git a/arch/arm64/include/asm/tlbbatch.h b/arch/arm64/include/asm/tlbbatch.h > >>>>> new file mode 100644 > >>>>> index 000000000000..fedb0b87b8db > >>>>> --- /dev/null > >>>>> +++ b/arch/arm64/include/asm/tlbbatch.h > >>>>> @@ -0,0 +1,12 @@ > >>>>> +/* SPDX-License-Identifier: GPL-2.0 */ > >>>>> +#ifndef _ARCH_ARM64_TLBBATCH_H > >>>>> +#define _ARCH_ARM64_TLBBATCH_H > >>>>> + > >>>>> +struct arch_tlbflush_unmap_batch { > >>>>> + /* > >>>>> + * For arm64, HW can do tlb shootdown, so we don't > >>>>> + * need to record cpumask for sending IPI > >>>>> + */ > >>>>> +}; > >>>>> + > >>>>> +#endif /* _ARCH_ARM64_TLBBATCH_H */ > >>>>> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h > >>>>> index 412a3b9a3c25..23cbc987321a 100644 > >>>>> --- a/arch/arm64/include/asm/tlbflush.h > >>>>> +++ b/arch/arm64/include/asm/tlbflush.h > >>>>> @@ -254,17 +254,24 @@ static inline void flush_tlb_mm(struct mm_struct *mm) > >>>>> dsb(ish); > >>>>> } > >>>>> > >>>>> -static inline void flush_tlb_page_nosync(struct vm_area_struct *vma, > >>>>> + > >>>>> +static inline void __flush_tlb_page_nosync(struct mm_struct *mm, > >>>>> unsigned long uaddr) > >>>>> { > >>>>> unsigned long addr; > >>>>> > >>>>> dsb(ishst); > >>>>> - addr = __TLBI_VADDR(uaddr, ASID(vma->vm_mm)); > >>>>> + addr = 
__TLBI_VADDR(uaddr, ASID(mm)); > >>>>> __tlbi(vale1is, addr); > >>>>> __tlbi_user(vale1is, addr); > >>>>> } > >>>>> > >>>>> +static inline void flush_tlb_page_nosync(struct vm_area_struct *vma, > >>>>> + unsigned long uaddr) > >>>>> +{ > >>>>> + return __flush_tlb_page_nosync(vma->vm_mm, uaddr); > >>>>> +} > >>>>> + > >>>>> static inline void flush_tlb_page(struct vm_area_struct *vma, > >>>>> unsigned long uaddr) > >>>>> { > >>>>> @@ -272,6 +279,23 @@ static inline void flush_tlb_page(struct vm_area_struct *vma, > >>>>> dsb(ish); > >>>>> } > >>>>> > >>>>> +static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm) > >>>>> +{ > >>>>> + return true; > >>>>> +} > >>>> > >>>> Always defer and batch up TLB flush, unconditionally ? > >>> > >>> My understanding is we actually don't need tlbbatch for a machine with one > >>> or two cores as the tlb flush is not expensive. even for a system with four > >>> cortex-a55 cores, i didn't see obvious cost. it was less than 1%. > >>> when we have 8 cores, we see the obvious cost of tlb flush. for a server with > >>> 100 crores, the cost is incredibly huge. > >> > >> Although dsb(ish) is deferred via arch_tlbbatch_flush(), there is still > >> one dsb(isht) instruction left in __flush_tlb_page_nosync(). Is not that > >> expensive as well, while queuing up individual TLB flushes ? > > > > This one is much much cheaper as it is not waiting for the > > completion of tlbi. waiting for the completion of tlbi is a big > > deal in arm64, thus, similar optimization can be seen here > > > > 3403e56b41c1("arm64: mm: Don't wait for completion of TLB invalidation > > when page aging"). > > > > > >> > >> The very idea behind TLB deferral is the opportunity it (might) provide > >> to accumulate address ranges and cpu masks so that individual TLB flush > >> can be replaced with a more cost effective range based TLB flush. 
Hence > >> I guess unless address range or cpumask based cost effective TLB flush > >> is available, deferral does not improve the unmap performance as much. > > > > > > After sending tlbi, if we wait for the completion of tlbi, we have to get Ack > > from all cpus in the system, tlbi is not scalable. The point here is that we > > avoid waiting for each individual TLBi. Alternatively, they are batched. If > > you read the benchmark in the commit log, you can find the great decline > > in the cost to swap out a page. > > Alright, although collecting and deferring 'dsb(ish)' to the very end, does > not feel like a direct fit case for ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH but I > guess it can be used to improve unmap performance on arm64. > > But is this 'dsb(ish)' deferral architecturally valid ? yes as dsb(ish) ensures the completion of tlbi. https://developer.arm.com/documentation/den0024/a/Memory-Ordering/Barriers We are even depending on the dsb(ish) during context switch in commit: 3403e56b41c1("arm64: mm: Don't wait for completion of TLB invalidation when page aging"). Before the context switch, lots of tlbi could have been sent. > > Let's examine single page unmap path via try_to_unmap_one(). > > should_defer_flush() { > ptep_get_and_clear() > set_tlb_ubc_flush_pending() > arch_tlbbatch_add_mm() > __flush_tlb_page_nosync() > } else { > ptep_clear_flush() > ptep_get_and_clear() > flush_tlb_page() > flush_tlb_page_nosync() > __flush_tlb_page_nosync() > dsb(ish) > } > > __flush_tlb_page_nosync() > { > dsb(ishst); > addr = __TLBI_VADDR(uaddr, ASID(mm)); > __tlbi(vale1is, addr); > __tlbi_user(vale1is, addr); > } > > Currently without TLB deferral, 'dsb(ish)' gets executed just after __tlbi() > and __tlbi_user(), because __flush_tlb_page_nosync() is an inline function. 
> > #define __TLBI_0(op, arg) asm (ARM64_ASM_PREAMBLE \ > "tlbi " #op "\n" \ > ALTERNATIVE("nop\n nop", \ > "dsb ish\n tlbi " #op, \ > ARM64_WORKAROUND_REPEAT_TLBI, \ > CONFIG_ARM64_WORKAROUND_REPEAT_TLBI) \ > : : ) > > #define __TLBI_1(op, arg) asm (ARM64_ASM_PREAMBLE \ > "tlbi " #op ", %0\n" \ > ALTERNATIVE("nop\n nop", \ > "dsb ish\n tlbi " #op ", %0", \ > ARM64_WORKAROUND_REPEAT_TLBI, \ > CONFIG_ARM64_WORKAROUND_REPEAT_TLBI) \ > : : "r" (arg)) > > #define __TLBI_N(op, arg, n, ...) __TLBI_##n(op, arg) > > #define __tlbi(op, ...) __TLBI_N(op, ##__VA_ARGS__, 1, 0) > > #define __tlbi_user(op, arg) do { \ > if (arm64_kernel_unmapped_at_el0()) \ > __tlbi(op, (arg) | USER_ASID_FLAG); \ > } while (0) > > There is already a 'dsb(ish)' in between two subsequent TLB operations in > case ARM64_WORKAROUND_REPEAT_TLBI is detected on the system. Hence I guess > deferral should not enabled on such systems ? > > But with deferral enabled, 'dsb(ish)' will be executed in arch_tlbbatch_flush() > via try_to_unmap_flush[_dirty](). There might be random number of instructions > in between __tlbi()/__tlbi_user() i.e 'tlbi' instructions and final 'dsb(ish)'. > Just wondering, if such 'detached in time with other instructions in between' > 'tlbi' and 'dsb(ish)', is architecturally valid ? yes. I think so, arm64 even depends on the implicit dsb in context switch. > > There is a comment in 'struct tlbflush_unmap_batch'. > > /* > * The arch code makes the following promise: generic code can modify a > * PTE, then call arch_tlbbatch_add_mm() (which internally provides all > * needed barriers), then call arch_tlbbatch_flush(), and the entries > * will be flushed on all CPUs by the time that arch_tlbbatch_flush() > * returns. > */ > > > It expects arch_tlbbatch_add_mm() to provide all barriers, hence wondering if > that would include just the first 'dsb(isht)' not the subsequent 'dsb(ish)' ? yes. 
include/asm/tlbflush.h, we see the below comments:

 * TLB Invalidation
 * ================
 *
 * This header file implements the low-level TLB invalidation routines
 * (sometimes referred to as "flushing" in the kernel) for arm64.
 *
 * Every invalidation operation uses the following template:
 *
 *	DSB ISHST	// Ensure prior page-table updates have completed
 *	TLBI ...	// Invalidate the TLB
 *	DSB ISH		// Ensure the TLB invalidation has completed
 *	if (invalidated kernel mappings)
 *		ISB	// Discard any instructions fetched from the old mapping
 *

Clearly dsb(ishst) has ensured page-table updates are visible to all CPUs.

Thanks
Barry
* Re: [PATCH v3 4/4] arm64: support batched/deferred tlb shootdown during page reclamation 2022-09-19 4:53 ` Barry Song @ 2022-09-19 5:08 ` Barry Song 0 siblings, 0 replies; 34+ messages in thread From: Barry Song @ 2022-09-19 5:08 UTC (permalink / raw) To: Anshuman Khandual Cc: Yicong Yang, akpm, linux-mm, linux-arm-kernel, x86, catalin.marinas, will, linux-doc, corbet, peterz, arnd, linux-kernel, darren, yangyicong, huzhanyuan, lipeifeng, zhangshiming, guojian, realmz6, linux-mips, openrisc, linuxppc-dev, linux-riscv, linux-s390, wangkefeng.wang, xhao, prime.zeng, Barry Song, Nadav Amit, Mel Gorman On Mon, Sep 19, 2022 at 4:53 PM Barry Song <21cnbao@gmail.com> wrote: > > On Mon, Sep 19, 2022 at 4:24 PM Anshuman Khandual > <anshuman.khandual@arm.com> wrote: > > > > > > > > On 9/15/22 12:12, Barry Song wrote: > > > On Thu, Sep 15, 2022 at 6:07 PM Anshuman Khandual > > > <anshuman.khandual@arm.com> wrote: > > >> > > >> > > >> > > >> On 9/9/22 11:05, Barry Song wrote: > > >>> On Fri, Sep 9, 2022 at 5:24 PM Anshuman Khandual > > >>> <anshuman.khandual@arm.com> wrote: > > >>>> > > >>>> > > >>>> > > >>>> On 8/22/22 13:51, Yicong Yang wrote: > > >>>>> From: Barry Song <v-songbaohua@oppo.com> > > >>>>> > > >>>>> on x86, batched and deferred tlb shootdown has lead to 90% > > >>>>> performance increase on tlb shootdown. on arm64, HW can do > > >>>>> tlb shootdown without software IPI. But sync tlbi is still > > >>>>> quite expensive. 
> > >>>>> > > >>>>> Even running a simplest program which requires swapout can > > >>>>> prove this is true, > > >>>>> #include <sys/types.h> > > >>>>> #include <unistd.h> > > >>>>> #include <sys/mman.h> > > >>>>> #include <string.h> > > >>>>> > > >>>>> int main() > > >>>>> { > > >>>>> #define SIZE (1 * 1024 * 1024) > > >>>>> volatile unsigned char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, > > >>>>> MAP_SHARED | MAP_ANONYMOUS, -1, 0); > > >>>>> > > >>>>> memset(p, 0x88, SIZE); > > >>>>> > > >>>>> for (int k = 0; k < 10000; k++) { > > >>>>> /* swap in */ > > >>>>> for (int i = 0; i < SIZE; i += 4096) { > > >>>>> (void)p[i]; > > >>>>> } > > >>>>> > > >>>>> /* swap out */ > > >>>>> madvise(p, SIZE, MADV_PAGEOUT); > > >>>>> } > > >>>>> } > > >>>>> > > >>>>> Perf result on snapdragon 888 with 8 cores by using zRAM > > >>>>> as the swap block device. > > >>>>> > > >>>>> ~ # perf record taskset -c 4 ./a.out > > >>>>> [ perf record: Woken up 10 times to write data ] > > >>>>> [ perf record: Captured and wrote 2.297 MB perf.data (60084 samples) ] > > >>>>> ~ # perf report > > >>>>> # To display the perf.data header info, please use --header/--header-only options. > > >>>>> # To display the perf.data header info, please use --header/--header-only options. > > >>>>> # > > >>>>> # > > >>>>> # Total Lost Samples: 0 > > >>>>> # > > >>>>> # Samples: 60K of event 'cycles' > > >>>>> # Event count (approx.): 35706225414 > > >>>>> # > > >>>>> # Overhead Command Shared Object Symbol > > >>>>> # ........ ....... ................. ............................................................................. 
> > >>>>> # > > >>>>> 21.07% a.out [kernel.kallsyms] [k] _raw_spin_unlock_irq > > >>>>> 8.23% a.out [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore > > >>>>> 6.67% a.out [kernel.kallsyms] [k] filemap_map_pages > > >>>>> 6.16% a.out [kernel.kallsyms] [k] __zram_bvec_write > > >>>>> 5.36% a.out [kernel.kallsyms] [k] ptep_clear_flush > > >>>>> 3.71% a.out [kernel.kallsyms] [k] _raw_spin_lock > > >>>>> 3.49% a.out [kernel.kallsyms] [k] memset64 > > >>>>> 1.63% a.out [kernel.kallsyms] [k] clear_page > > >>>>> 1.42% a.out [kernel.kallsyms] [k] _raw_spin_unlock > > >>>>> 1.26% a.out [kernel.kallsyms] [k] mod_zone_state.llvm.8525150236079521930 > > >>>>> 1.23% a.out [kernel.kallsyms] [k] xas_load > > >>>>> 1.15% a.out [kernel.kallsyms] [k] zram_slot_lock > > >>>>> > > >>>>> ptep_clear_flush() takes 5.36% CPU in the micro-benchmark > > >>>>> swapping in/out a page mapped by only one process. If the > > >>>>> page is mapped by multiple processes, typically, like more > > >>>>> than 100 on a phone, the overhead would be much higher as > > >>>>> we have to run tlb flush 100 times for one single page. > > >>>>> Plus, tlb flush overhead will increase with the number > > >>>>> of CPU cores due to the bad scalability of tlb shootdown > > >>>>> in HW, so those ARM64 servers should expect much higher > > >>>>> overhead. > > >>>>> > > >>>>> Further perf annonate shows 95% cpu time of ptep_clear_flush > > >>>>> is actually used by the final dsb() to wait for the completion > > >>>>> of tlb flush. This provides us a very good chance to leverage > > >>>>> the existing batched tlb in kernel. The minimum modification > > >>>>> is that we only send async tlbi in the first stage and we send > > >>>>> dsb while we have to sync in the second stage. > > >>>>> > > >>>>> With the above simplest micro benchmark, collapsed time to > > >>>>> finish the program decreases around 5%. 
> > >>>>> > > >>>>> Typical collapsed time w/o patch: > > >>>>> ~ # time taskset -c 4 ./a.out > > >>>>> 0.21user 14.34system 0:14.69elapsed > > >>>>> w/ patch: > > >>>>> ~ # time taskset -c 4 ./a.out > > >>>>> 0.22user 13.45system 0:13.80elapsed > > >>>>> > > >>>>> Also, Yicong Yang added the following observation. > > >>>>> Tested with benchmark in the commit on Kunpeng920 arm64 server, > > >>>>> observed an improvement around 12.5% with command > > >>>>> `time ./swap_bench`. > > >>>>> w/o w/ > > >>>>> real 0m13.460s 0m11.771s > > >>>>> user 0m0.248s 0m0.279s > > >>>>> sys 0m12.039s 0m11.458s > > >>>>> > > >>>>> Originally it's noticed a 16.99% overhead of ptep_clear_flush() > > >>>>> which has been eliminated by this patch: > > >>>>> > > >>>>> [root@localhost yang]# perf record -- ./swap_bench && perf report > > >>>>> [...] > > >>>>> 16.99% swap_bench [kernel.kallsyms] [k] ptep_clear_flush > > >>>>> > > >>>>> Cc: Jonathan Corbet <corbet@lwn.net> > > >>>>> Cc: Nadav Amit <namit@vmware.com> > > >>>>> Cc: Mel Gorman <mgorman@suse.de> > > >>>>> Tested-by: Yicong Yang <yangyicong@hisilicon.com> > > >>>>> Tested-by: Xin Hao <xhao@linux.alibaba.com> > > >>>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com> > > >>>>> Signed-off-by: Yicong Yang <yangyicong@hisilicon.com> > > >>>>> --- > > >>>>> .../features/vm/TLB/arch-support.txt | 2 +- > > >>>>> arch/arm64/Kconfig | 1 + > > >>>>> arch/arm64/include/asm/tlbbatch.h | 12 ++++++++ > > >>>>> arch/arm64/include/asm/tlbflush.h | 28 +++++++++++++++++-- > > >>>>> 4 files changed, 40 insertions(+), 3 deletions(-) > > >>>>> create mode 100644 arch/arm64/include/asm/tlbbatch.h > > >>>>> > > >>>>> diff --git a/Documentation/features/vm/TLB/arch-support.txt b/Documentation/features/vm/TLB/arch-support.txt > > >>>>> index 1c009312b9c1..2caf815d7c6c 100644 > > >>>>> --- a/Documentation/features/vm/TLB/arch-support.txt > > >>>>> +++ b/Documentation/features/vm/TLB/arch-support.txt > > >>>>> @@ -9,7 +9,7 @@ > > >>>>> | alpha: | TODO | 
> > >>>>> | arc: | TODO | > > >>>>> | arm: | TODO | > > >>>>> - | arm64: | TODO | > > >>>>> + | arm64: | ok | > > >>>>> | csky: | TODO | > > >>>>> | hexagon: | TODO | > > >>>>> | ia64: | TODO | > > >>>>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig > > >>>>> index 571cc234d0b3..09d45cd6d665 100644 > > >>>>> --- a/arch/arm64/Kconfig > > >>>>> +++ b/arch/arm64/Kconfig > > >>>>> @@ -93,6 +93,7 @@ config ARM64 > > >>>>> select ARCH_SUPPORTS_INT128 if CC_HAS_INT128 > > >>>>> select ARCH_SUPPORTS_NUMA_BALANCING > > >>>>> select ARCH_SUPPORTS_PAGE_TABLE_CHECK > > >>>>> + select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH > > >>>>> select ARCH_WANT_COMPAT_IPC_PARSE_VERSION if COMPAT > > >>>>> select ARCH_WANT_DEFAULT_BPF_JIT > > >>>>> select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT > > >>>>> diff --git a/arch/arm64/include/asm/tlbbatch.h b/arch/arm64/include/asm/tlbbatch.h > > >>>>> new file mode 100644 > > >>>>> index 000000000000..fedb0b87b8db > > >>>>> --- /dev/null > > >>>>> +++ b/arch/arm64/include/asm/tlbbatch.h > > >>>>> @@ -0,0 +1,12 @@ > > >>>>> +/* SPDX-License-Identifier: GPL-2.0 */ > > >>>>> +#ifndef _ARCH_ARM64_TLBBATCH_H > > >>>>> +#define _ARCH_ARM64_TLBBATCH_H > > >>>>> + > > >>>>> +struct arch_tlbflush_unmap_batch { > > >>>>> + /* > > >>>>> + * For arm64, HW can do tlb shootdown, so we don't > > >>>>> + * need to record cpumask for sending IPI > > >>>>> + */ > > >>>>> +}; > > >>>>> + > > >>>>> +#endif /* _ARCH_ARM64_TLBBATCH_H */ > > >>>>> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h > > >>>>> index 412a3b9a3c25..23cbc987321a 100644 > > >>>>> --- a/arch/arm64/include/asm/tlbflush.h > > >>>>> +++ b/arch/arm64/include/asm/tlbflush.h > > >>>>> @@ -254,17 +254,24 @@ static inline void flush_tlb_mm(struct mm_struct *mm) > > >>>>> dsb(ish); > > >>>>> } > > >>>>> > > >>>>> -static inline void flush_tlb_page_nosync(struct vm_area_struct *vma, > > >>>>> + > > >>>>> +static inline void __flush_tlb_page_nosync(struct mm_struct *mm, > 
> >>>>> unsigned long uaddr) > > >>>>> { > > >>>>> unsigned long addr; > > >>>>> > > >>>>> dsb(ishst); > > >>>>> - addr = __TLBI_VADDR(uaddr, ASID(vma->vm_mm)); > > >>>>> + addr = __TLBI_VADDR(uaddr, ASID(mm)); > > >>>>> __tlbi(vale1is, addr); > > >>>>> __tlbi_user(vale1is, addr); > > >>>>> } > > >>>>> > > >>>>> +static inline void flush_tlb_page_nosync(struct vm_area_struct *vma, > > >>>>> + unsigned long uaddr) > > >>>>> +{ > > >>>>> + return __flush_tlb_page_nosync(vma->vm_mm, uaddr); > > >>>>> +} > > >>>>> + > > >>>>> static inline void flush_tlb_page(struct vm_area_struct *vma, > > >>>>> unsigned long uaddr) > > >>>>> { > > >>>>> @@ -272,6 +279,23 @@ static inline void flush_tlb_page(struct vm_area_struct *vma, > > >>>>> dsb(ish); > > >>>>> } > > >>>>> > > >>>>> +static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm) > > >>>>> +{ > > >>>>> + return true; > > >>>>> +} > > >>>> > > >>>> Always defer and batch up TLB flush, unconditionally ? > > >>> > > >>> My understanding is we actually don't need tlbbatch for a machine with one > > >>> or two cores as the tlb flush is not expensive. even for a system with four > > >>> cortex-a55 cores, i didn't see obvious cost. it was less than 1%. > > >>> when we have 8 cores, we see the obvious cost of tlb flush. for a server with > > >>> 100 crores, the cost is incredibly huge. > > >> > > >> Although dsb(ish) is deferred via arch_tlbbatch_flush(), there is still > > >> one dsb(isht) instruction left in __flush_tlb_page_nosync(). Is not that > > >> expensive as well, while queuing up individual TLB flushes ? > > > > > > This one is much much cheaper as it is not waiting for the > > > completion of tlbi. waiting for the completion of tlbi is a big > > > deal in arm64, thus, similar optimization can be seen here > > > > > > 3403e56b41c1("arm64: mm: Don't wait for completion of TLB invalidation > > > when page aging"). 
> > > > > > > > >> > > >> The very idea behind TLB deferral is the opportunity it (might) provide > > >> to accumulate address ranges and cpu masks so that individual TLB flush > > >> can be replaced with a more cost effective range based TLB flush. Hence > > >> I guess unless address range or cpumask based cost effective TLB flush > > >> is available, deferral does not improve the unmap performance as much. > > > > > > > > > After sending tlbi, if we wait for the completion of tlbi, we have to get Ack > > > from all cpus in the system, tlbi is not scalable. The point here is that we > > > avoid waiting for each individual TLBi. Alternatively, they are batched. If > > > you read the benchmark in the commit log, you can find the great decline > > > in the cost to swap out a page. > > > > Alright, although collecting and deferring 'dsb(ish)' to the very end, does > > not feel like a direct fit case for ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH but I > > guess it can be used to improve unmap performance on arm64. > > > > But is this 'dsb(ish)' deferral architecturally valid ? > > yes as dsb(ish) ensures the completion of tlbi. > https://developer.arm.com/documentation/den0024/a/Memory-Ordering/Barriers > > We are even depending on the dsb(ish) during context switch in commit: > 3403e56b41c1("arm64: mm: Don't wait for completion of TLB invalidation > when page aging"). > > Before the context switch, lots of tlbi could have been sent. > > > > > Let's examine single page unmap path via try_to_unmap_one(). 
> > > > should_defer_flush() { > > ptep_get_and_clear() > > set_tlb_ubc_flush_pending() > > arch_tlbbatch_add_mm() > > __flush_tlb_page_nosync() > > } else { > > ptep_clear_flush() > > ptep_get_and_clear() > > flush_tlb_page() > > flush_tlb_page_nosync() > > __flush_tlb_page_nosync() > > dsb(ish) > > } > > > > __flush_tlb_page_nosync() > > { > > dsb(ishst); > > addr = __TLBI_VADDR(uaddr, ASID(mm)); > > __tlbi(vale1is, addr); > > __tlbi_user(vale1is, addr); > > } > > > > Currently without TLB deferral, 'dsb(ish)' gets executed just after __tlbi() > > and __tlbi_user(), because __flush_tlb_page_nosync() is an inline function. > > > > #define __TLBI_0(op, arg) asm (ARM64_ASM_PREAMBLE \ > > "tlbi " #op "\n" \ > > ALTERNATIVE("nop\n nop", \ > > "dsb ish\n tlbi " #op, \ > > ARM64_WORKAROUND_REPEAT_TLBI, \ > > CONFIG_ARM64_WORKAROUND_REPEAT_TLBI) \ > > : : ) > > > > #define __TLBI_1(op, arg) asm (ARM64_ASM_PREAMBLE \ > > "tlbi " #op ", %0\n" \ > > ALTERNATIVE("nop\n nop", \ > > "dsb ish\n tlbi " #op ", %0", \ > > ARM64_WORKAROUND_REPEAT_TLBI, \ > > CONFIG_ARM64_WORKAROUND_REPEAT_TLBI) \ > > : : "r" (arg)) > > > > #define __TLBI_N(op, arg, n, ...) __TLBI_##n(op, arg) > > > > #define __tlbi(op, ...) __TLBI_N(op, ##__VA_ARGS__, 1, 0) > > > > #define __tlbi_user(op, arg) do { \ > > if (arm64_kernel_unmapped_at_el0()) \ > > __tlbi(op, (arg) | USER_ASID_FLAG); \ > > } while (0) > > > > There is already a 'dsb(ish)' in between two subsequent TLB operations in > > case ARM64_WORKAROUND_REPEAT_TLBI is detected on the system. Hence I guess > > deferral should not enabled on such systems ? > > > > But with deferral enabled, 'dsb(ish)' will be executed in arch_tlbbatch_flush() > > via try_to_unmap_flush[_dirty](). There might be random number of instructions > > in between __tlbi()/__tlbi_user() i.e 'tlbi' instructions and final 'dsb(ish)'. > > Just wondering, if such 'detached in time with other instructions in between' > > 'tlbi' and 'dsb(ish)', is architecturally valid ? > > yes. 
I think so, arm64 even depends on the implicit dsb in context switch.

Please note we are not leveraging the time window between tlbi and
dsb(ish) to improve performance. We are actually reducing the number
of dsb(ish) barriers. In memory reclamation, we usually unmap 32 or more
pages, then call try_to_unmap_flush(); that is why we are batching
dsb(ish). So we save 31 or more dsb(ish) for each memory reclamation;
that is the point.

Thanks
Barry
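To make the arithmetic above concrete, here is a minimal userspace sketch that only counts barriers; it issues no real tlbi or dsb. The names barrier_counts, unmap_sync(), unmap_batched() and dsb_ish_saved() are hypothetical and do not exist in the kernel, and the model assumes one tlbi per page (it ignores the extra __tlbi_user() and the REPEAT_TLBI workaround).

```c
#include <assert.h>

/* Hypothetical counters: a model of what would be issued, nothing more. */
struct barrier_counts {
	unsigned long dsb_ishst;	/* cheap barrier before each tlbi */
	unsigned long tlbi;		/* invalidations sent */
	unsigned long dsb_ish;		/* expensive wait for tlbi completion */
};

/* Unbatched path, ptep_clear_flush() style: every page waits for completion. */
static struct barrier_counts unmap_sync(unsigned long pages)
{
	struct barrier_counts c = { 0, 0, 0 };

	for (unsigned long i = 0; i < pages; i++) {
		c.dsb_ishst++;
		c.tlbi++;
		c.dsb_ish++;
	}
	return c;
}

/* Batched path: send a tlbi per page, wait once when the batch is flushed. */
static struct barrier_counts unmap_batched(unsigned long pages)
{
	struct barrier_counts c = { 0, 0, 0 };

	for (unsigned long i = 0; i < pages; i++) {
		c.dsb_ishst++;
		c.tlbi++;
	}
	if (pages > 0)
		c.dsb_ish = 1;	/* the single deferred dsb(ish) */
	return c;
}

/* Number of completion barriers the batched path avoids. */
static unsigned long dsb_ish_saved(unsigned long pages)
{
	struct barrier_counts s = unmap_sync(pages);
	struct barrier_counts b = unmap_batched(pages);

	assert(s.dsb_ish >= b.dsb_ish);	/* batching never adds a dsb(ish) */
	return s.dsb_ish - b.dsb_ish;
}
```

For a batch of 32 pages this model counts 32 dsb(ish) on the unbatched path and 1 on the batched path, matching the "31 or more" figure above. The number of tlbi and dsb(ishst) is the same on both paths.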
* Re: [PATCH v3 4/4] arm64: support batched/deferred tlb shootdown during page reclamation
  2022-08-22  8:21 ` [PATCH v3 4/4] arm64: support batched/deferred tlb shootdown during page reclamation Yicong Yang
  2022-08-24  9:46   ` Kefeng Wang
  2022-09-09  5:24   ` Anshuman Khandual
@ 2022-09-20  3:00   ` Anshuman Khandual
  2022-09-20  3:39     ` Barry Song
  2022-09-21  6:53   ` Anshuman Khandual
  3 siblings, 1 reply; 34+ messages in thread
From: Anshuman Khandual @ 2022-09-20  3:00 UTC (permalink / raw)
  To: Yicong Yang, akpm, linux-mm, linux-arm-kernel, x86, catalin.marinas, will, linux-doc
  Cc: corbet, peterz, arnd, linux-kernel, darren, yangyicong, huzhanyuan, lipeifeng, zhangshiming, guojian, realmz6, linux-mips, openrisc, linuxppc-dev, linux-riscv, linux-s390, Barry Song, wangkefeng.wang, xhao, prime.zeng, Barry Song, Nadav Amit, Mel Gorman

On 8/22/22 13:51, Yicong Yang wrote:
> +static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
> +{
> +        return true;
> +}

This needs to be conditional on systems, where there will be performance
improvements, and should not just be enabled all the time on all systems.
num_online_cpus() > X, which does not hold any cpu hotplug lock would be
a good metric ?
* Re: [PATCH v3 4/4] arm64: support batched/deferred tlb shootdown during page reclamation
  2022-09-20  3:00   ` Anshuman Khandual
@ 2022-09-20  3:39     ` Barry Song
  2022-09-20  8:45       ` Anshuman Khandual
  0 siblings, 1 reply; 34+ messages in thread
From: Barry Song @ 2022-09-20  3:39 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: Yicong Yang, akpm, linux-mm, linux-arm-kernel, x86, catalin.marinas, will, linux-doc, corbet, peterz, arnd, linux-kernel, darren, yangyicong, huzhanyuan, lipeifeng, zhangshiming, guojian, realmz6, linux-mips, openrisc, linuxppc-dev, linux-riscv, linux-s390, wangkefeng.wang, xhao, prime.zeng, Barry Song, Nadav Amit, Mel Gorman

On Tue, Sep 20, 2022 at 3:00 PM Anshuman Khandual
<anshuman.khandual@arm.com> wrote:
>
>
> On 8/22/22 13:51, Yicong Yang wrote:
> > +static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
> > +{
> > +        return true;
> > +}
>
> This needs to be conditional on systems, where there will be performance
> improvements, and should not just be enabled all the time on all systems.
> num_online_cpus() > X, which does not hold any cpu hotplug lock would be
> a good metric ?

For a small system, e.g. cpus <= 4, I don't see how this patch will help,
so we can actually disable tlb-batch on small systems.
We just need to check whether we will have any race condition, since
hotplug will make the condition true and false dynamically.

Thanks
Barry
* Re: [PATCH v3 4/4] arm64: support batched/deferred tlb shootdown during page reclamation
  2022-09-20  3:39     ` Barry Song
@ 2022-09-20  8:45       ` Anshuman Khandual
  2022-09-21  1:50         ` Barry Song
  0 siblings, 1 reply; 34+ messages in thread
From: Anshuman Khandual @ 2022-09-20  8:45 UTC (permalink / raw)
  To: Barry Song
  Cc: Yicong Yang, akpm, linux-mm, linux-arm-kernel, x86, catalin.marinas, will, linux-doc, corbet, peterz, arnd, linux-kernel, darren, yangyicong, huzhanyuan, lipeifeng, zhangshiming, guojian, realmz6, linux-mips, openrisc, linuxppc-dev, linux-riscv, linux-s390, wangkefeng.wang, xhao, prime.zeng, Barry Song, Nadav Amit, Mel Gorman

On 9/20/22 09:09, Barry Song wrote:
> On Tue, Sep 20, 2022 at 3:00 PM Anshuman Khandual
> <anshuman.khandual@arm.com> wrote:
>>
>>
>> On 8/22/22 13:51, Yicong Yang wrote:
>>> +static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
>>> +{
>>> +        return true;
>>> +}
>>
>> This needs to be conditional on systems, where there will be performance
>> improvements, and should not just be enabled all the time on all systems.
>> num_online_cpus() > X, which does not hold any cpu hotplug lock would be
>> a good metric ?
>
> for a small system, i don't see how this patch will help, e.g. cpus <= 4;
> so we can actually disable tlb-batch on small systems.

Do not subscribe ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH based on NR_CPUS ?
That might not help much as the default value is 256 for NR_CPUS.

OR

arch_tlbbatch_should_defer() checks on

1. online cpus (dont enable batched TLB if <= X)
2. ARM64_WORKAROUND_REPEAT_TLBI (dont enable batched TLB)

> just need to check if we will have any race condition since hotplug will
> make the condition true and false dynamically.

If should_defer_flush() evaluate to be false, then ptep_clear_flush()
clears and flushes the entry right away. This should not race with other
queued up TLBI requests, which will be flushed separately. Wondering how
there can be a race here !
* Re: [PATCH v3 4/4] arm64: support batched/deferred tlb shootdown during page reclamation 2022-09-20 8:45 ` Anshuman Khandual @ 2022-09-21 1:50 ` Barry Song 2022-09-21 1:51 ` Barry Song 0 siblings, 1 reply; 34+ messages in thread From: Barry Song @ 2022-09-21 1:50 UTC (permalink / raw) To: Anshuman Khandual Cc: Yicong Yang, akpm, linux-mm, linux-arm-kernel, x86, catalin.marinas, will, linux-doc, corbet, peterz, arnd, linux-kernel, darren, yangyicong, huzhanyuan, lipeifeng, zhangshiming, guojian, realmz6, linux-mips, openrisc, linuxppc-dev, linux-riscv, linux-s390, wangkefeng.wang, xhao, prime.zeng, Barry Song, Nadav Amit, Mel Gorman On Tue, Sep 20, 2022 at 8:45 PM Anshuman Khandual <anshuman.khandual@arm.com> wrote: > > > > On 9/20/22 09:09, Barry Song wrote: > > On Tue, Sep 20, 2022 at 3:00 PM Anshuman Khandual > > <anshuman.khandual@arm.com> wrote: > >> > >> > >> On 8/22/22 13:51, Yicong Yang wrote: > >>> +static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm) > >>> +{ > >>> + return true; > >>> +} > >> > >> This needs to be conditional on systems, where there will be performance > >> improvements, and should not just be enabled all the time on all systems. > >> num_online_cpus() > X, which does not hold any cpu hotplug lock would be > >> a good metric ? > > > > for a small system, i don't see how this patch will help, e.g. cpus <= 4; > > so we can actually disable tlb-batch on small systems. > > Do not subscribe ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH based on NR_CPUS ? > That might not help much as the default value is 256 for NR_CPUS. > > OR > > arch_tlbbatch_should_defer() checks on > > 1. online cpus (dont enable batched TLB if <= X) > 2. ARM64_WORKAROUND_REPEAT_TLBI (dont enable batched TLB) > > > just need to check if we will have any race condition since hotplug will > > make the condition true and false dynamically. > > If should_defer_flush() evaluate to be false, then ptep_clear_flush() > clears and flushes the entry right away. 
This should not race with other
> queued up TLBI requests, which will be flushed separately. Wondering how
> there can be a race here !

Right. How about we make something as below?

static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
{
        /* for a small system very small number of CPUs, TLB shootdown is cheap */
        if (num_online_cpus() <= 4 ||
            unlikely(this_cpu_has_cap(ARM64_WORKAROUND_REPEAT_TLBI)))
                return false;

#ifdef CONFIG_ARM64_WORKAROUND_REPEAT_TLBI
        if (unlikely(this_cpu_has_cap(ARM64_WORKAROUND_REPEAT_TLBI)))
                return false;
#endif

        return true;
}

Thanks
Barry
* Re: [PATCH v3 4/4] arm64: support batched/deferred tlb shootdown during page reclamation 2022-09-21 1:50 ` Barry Song @ 2022-09-21 1:51 ` Barry Song 2022-09-21 3:33 ` Anshuman Khandual 0 siblings, 1 reply; 34+ messages in thread From: Barry Song @ 2022-09-21 1:51 UTC (permalink / raw) To: Anshuman Khandual Cc: Yicong Yang, akpm, linux-mm, linux-arm-kernel, x86, catalin.marinas, will, linux-doc, corbet, peterz, arnd, linux-kernel, darren, yangyicong, huzhanyuan, lipeifeng, zhangshiming, guojian, realmz6, linux-mips, openrisc, linuxppc-dev, linux-riscv, linux-s390, wangkefeng.wang, xhao, prime.zeng, Barry Song, Nadav Amit, Mel Gorman On Wed, Sep 21, 2022 at 1:50 PM Barry Song <21cnbao@gmail.com> wrote: > > On Tue, Sep 20, 2022 at 8:45 PM Anshuman Khandual > <anshuman.khandual@arm.com> wrote: > > > > > > > > On 9/20/22 09:09, Barry Song wrote: > > > On Tue, Sep 20, 2022 at 3:00 PM Anshuman Khandual > > > <anshuman.khandual@arm.com> wrote: > > >> > > >> > > >> On 8/22/22 13:51, Yicong Yang wrote: > > >>> +static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm) > > >>> +{ > > >>> + return true; > > >>> +} > > >> > > >> This needs to be conditional on systems, where there will be performance > > >> improvements, and should not just be enabled all the time on all systems. > > >> num_online_cpus() > X, which does not hold any cpu hotplug lock would be > > >> a good metric ? > > > > > > for a small system, i don't see how this patch will help, e.g. cpus <= 4; > > > so we can actually disable tlb-batch on small systems. > > > > Do not subscribe ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH based on NR_CPUS ? > > That might not help much as the default value is 256 for NR_CPUS. > > > > OR > > > > arch_tlbbatch_should_defer() checks on > > > > 1. online cpus (dont enable batched TLB if <= X) > > 2. 
ARM64_WORKAROUND_REPEAT_TLBI (dont enable batched TLB)
> >
> > > just need to check if we will have any race condition since hotplug will
> > > make the condition true and false dynamically.
> >
> > If should_defer_flush() evaluate to be false, then ptep_clear_flush()
> > clears and flushes the entry right away. This should not race with other
> > queued up TLBI requests, which will be flushed separately. Wondering how
> > there can be a race here !
>
> Right. How about we make something as below?
>
> static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
> {
>         /* for a small system very small number of CPUs, TLB shootdown is cheap */
>         if (num_online_cpus() <= 4 ||
>             unlikely(this_cpu_has_cap(ARM64_WORKAROUND_REPEAT_TLBI)))
>                 return false;
>
> #ifdef CONFIG_ARM64_WORKAROUND_REPEAT_TLBI
>         if (unlikely(this_cpu_has_cap(ARM64_WORKAROUND_REPEAT_TLBI)))
>                 return false;
> #endif
>
>         return true;
> }

sorry, i mean

static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
{
        /* for a small system very small number of CPUs, TLB shootdown is cheap */
        if (num_online_cpus() <= 4)
                return false;

#ifdef CONFIG_ARM64_WORKAROUND_REPEAT_TLBI
        if (unlikely(this_cpu_has_cap(ARM64_WORKAROUND_REPEAT_TLBI)))
                return false;
#endif

        return true;
}

>
> Thanks
> Barry
* Re: [PATCH v3 4/4] arm64: support batched/deferred tlb shootdown during page reclamation
  2022-09-21 1:51 ` Barry Song
@ 2022-09-21 3:33 ` Anshuman Khandual
  0 siblings, 0 replies; 34+ messages in thread
From: Anshuman Khandual @ 2022-09-21 3:33 UTC
  To: Barry Song
  Cc: Yicong Yang, akpm, linux-mm, linux-arm-kernel, x86, catalin.marinas,
      will, linux-doc, corbet, peterz, arnd, linux-kernel, darren,
      yangyicong, huzhanyuan, lipeifeng, zhangshiming, guojian, realmz6,
      linux-mips, openrisc, linuxppc-dev, linux-riscv, linux-s390,
      wangkefeng.wang, xhao, prime.zeng, Barry Song, Nadav Amit, Mel Gorman

On 9/21/22 07:21, Barry Song wrote:
> On Wed, Sep 21, 2022 at 1:50 PM Barry Song <21cnbao@gmail.com> wrote:
>>
>> On Tue, Sep 20, 2022 at 8:45 PM Anshuman Khandual
>> <anshuman.khandual@arm.com> wrote:
>>>
>>>
>>>
>>> On 9/20/22 09:09, Barry Song wrote:
>>>> On Tue, Sep 20, 2022 at 3:00 PM Anshuman Khandual
>>>> <anshuman.khandual@arm.com> wrote:
>>>>>
>>>>>
>>>>> On 8/22/22 13:51, Yicong Yang wrote:
>>>>>> +static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
>>>>>> +{
>>>>>> +	return true;
>>>>>> +}
>>>>>
>>>>> This needs to be conditional on systems, where there will be performance
>>>>> improvements, and should not just be enabled all the time on all systems.
>>>>> num_online_cpus() > X, which does not hold any cpu hotplug lock would be
>>>>> a good metric ?
>>>>
>>>> for a small system, i don't see how this patch will help, e.g. cpus <= 4;
>>>> so we can actually disable tlb-batch on small systems.
>>>
>>> Do not subscribe ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH based on NR_CPUS ?
>>> That might not help much as the default value is 256 for NR_CPUS.
>>>
>>> OR
>>>
>>> arch_tlbbatch_should_defer() checks on
>>>
>>> 1. online cpus (dont enable batched TLB if <= X)
>>> 2. ARM64_WORKAROUND_REPEAT_TLBI (dont enable batched TLB)
>>>
>>>> just need to check if we will have any race condition since hotplug will
>>>> make the condition true and false dynamically.
>>>
>>> If should_defer_flush() evaluate to be false, then ptep_clear_flush()
>>> clears and flushes the entry right away. This should not race with other
>>> queued up TLBI requests, which will be flushed separately. Wondering how
>>> there can be a race here !
>>
>> Right. How about we make something as below?
>>
>> static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
>> {
>> 	/* for a small system very small number of CPUs, TLB shootdown is cheap */
>> 	if (num_online_cpus() <= 4 ||
>> 	    unlikely(this_cpu_has_cap(ARM64_WORKAROUND_REPEAT_TLBI)))
>> 		return false;
>>
>> #ifdef CONFIG_ARM64_WORKAROUND_REPEAT_TLBI
>> 	if (unlikely(this_cpu_has_cap(ARM64_WORKAROUND_REPEAT_TLBI)))
>> 		return false;
>> #endif
>>
>> 	return true;
>> }
>
> sorry, i mean
>
> static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
> {
> 	/* on a small system with very few CPUs, TLB shootdown is cheap */
> 	if (num_online_cpus() <= 4)
> 		return false;
>
> #ifdef CONFIG_ARM64_WORKAROUND_REPEAT_TLBI
> 	if (unlikely(this_cpu_has_cap(ARM64_WORKAROUND_REPEAT_TLBI)))
> 		return false;
> #endif
>
> 	return true;
> }

This is a good starting point.
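[Editor's sketch] The policy discussed in this sub-thread (no deferral on small systems, no deferral when the REPEAT_TLBI workaround applies) can be modelled in plain userspace C to make the decision table explicit. This is only an illustration under stated assumptions: model_should_defer() and MODEL_SMALL_SYSTEM_MAX_CPUS are hypothetical names, the two parameters stand in for num_online_cpus() and this_cpu_has_cap(ARM64_WORKAROUND_REPEAT_TLBI), and the threshold of 4 is simply the value floated above, not a measured cutoff.

```c
#include <assert.h>
#include <stdbool.h>

/* Threshold floated in the thread; hypothetical name, not a kernel symbol. */
#define MODEL_SMALL_SYSTEM_MAX_CPUS 4u

/*
 * Userspace model of the proposed arch_tlbbatch_should_defer() policy.
 * online_cpus stands in for num_online_cpus(); repeat_tlbi_erratum stands
 * in for this_cpu_has_cap(ARM64_WORKAROUND_REPEAT_TLBI).
 */
static bool model_should_defer(unsigned int online_cpus, bool repeat_tlbi_erratum)
{
	assert(online_cpus > 0);	/* at least the calling CPU is online */

	/* Few CPUs: the broadcast is cheap, so flush synchronously. */
	if (online_cpus <= MODEL_SMALL_SYSTEM_MAX_CPUS)
		return false;

	/* Erratum workaround active: do not batch. */
	if (repeat_tlbi_erratum)
		return false;

	return true;
}
```

With these assumptions, model_should_defer(4, false) and model_should_defer(96, true) both decline to batch, while model_should_defer(8, false) defers. Because the online CPU count can change under hotplug, the answer may differ between calls; as noted above, a false result only means the entry is flushed immediately by ptep_clear_flush().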
* Re: [PATCH v3 4/4] arm64: support batched/deferred tlb shootdown during page reclamation
  2022-08-22 8:21 ` [PATCH v3 4/4] arm64: support batched/deferred tlb shootdown during page reclamation Yicong Yang
  ` (2 preceding siblings ...)
  2022-09-20 3:00 ` Anshuman Khandual
@ 2022-09-21 6:53 ` Anshuman Khandual
  2022-09-21 7:15 ` Barry Song
  2022-09-21 7:17 ` Nadav Amit
  3 siblings, 2 replies; 34+ messages in thread
From: Anshuman Khandual @ 2022-09-21 6:53 UTC
  To: Yicong Yang, akpm, linux-mm, linux-arm-kernel, x86, catalin.marinas,
      will, linux-doc
  Cc: corbet, peterz, arnd, linux-kernel, darren, yangyicong, huzhanyuan,
      lipeifeng, zhangshiming, guojian, realmz6, linux-mips, openrisc,
      linuxppc-dev, linux-riscv, linux-s390, Barry Song, wangkefeng.wang,
      xhao, prime.zeng, Barry Song, Nadav Amit, Mel Gorman

On 8/22/22 13:51, Yicong Yang wrote:
> +static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
> +					struct mm_struct *mm,
> +					unsigned long uaddr)
> +{
> +	__flush_tlb_page_nosync(mm, uaddr);
> +}
> +
> +static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
> +{
> +	dsb(ish);
> +}

Just wondering if arch_tlbbatch_add_mm() could also detect continuous mapping
TLB invalidation requests on a given mm and try to generate a range based TLB
invalidation such as flush_tlb_range().

struct arch_tlbflush_unmap_batch via task->tlb_ubc->arch can track continuous
ranges while being queued up via arch_tlbbatch_add_mm(), any range formed can
later be flushed in subsequent arch_tlbbatch_flush() ?

OR

It might not be worth the effort and complexity, in comparison to performance
improvement, TLB range flush brings in ?
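[Editor's sketch] The range-tracking idea raised here could look roughly like the userspace model below: the batch remembers one pending contiguous range, extends it when the next address is adjacent, and otherwise issues the pending range and starts a new one. Everything in it is a hypothetical stand-in (struct model_batch, model_batch_add(), model_batch_flush(), a fixed 4 KiB page size, a single mm); it is not the posted patch, which issues one per-page invalidation in arch_tlbbatch_add_mm() and keeps no state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096UL	/* assumed page size for the model */

/* Hypothetical per-task batch state tracking one contiguous range. */
struct model_batch {
	bool pending;		/* is a range queued and not yet issued? */
	unsigned long start;	/* inclusive */
	unsigned long end;	/* exclusive */
	unsigned int range_ops;	/* range invalidations issued so far */
};

/* Issue the queued range, if any (models a range-based invalidation). */
static void model_issue_pending(struct model_batch *b)
{
	if (b->pending) {
		b->range_ops++;
		b->pending = false;
	}
}

/* Queue one page; merge it when it touches or lies in the pending range. */
static void model_batch_add(struct model_batch *b, unsigned long uaddr)
{
	assert(b != NULL);
	uaddr &= ~(MODEL_PAGE_SIZE - 1);

	if (b->pending) {
		if (uaddr >= b->start && uaddr < b->end)
			return;				/* already covered */
		if (uaddr == b->end) {
			b->end += MODEL_PAGE_SIZE;	/* grow upward */
			return;
		}
		if (uaddr + MODEL_PAGE_SIZE == b->start) {
			b->start = uaddr;		/* grow downward */
			return;
		}
	}

	/* Not adjacent: issue what we have and start a new range. */
	model_issue_pending(b);
	b->pending = true;
	b->start = uaddr;
	b->end = uaddr + MODEL_PAGE_SIZE;
}

/* Final sync point: issue whatever range is still queued. */
static void model_batch_flush(struct model_batch *b)
{
	assert(b != NULL);
	model_issue_pending(b);
}
```

Queuing 0x1000, 0x2000 and 0x3000 and then 0x10000 issues one range operation for the first three pages, and the final flush issues a second one. Whether reclaim actually produces adjacent addresses often enough for this bookkeeping to pay off is exactly the open question in the message above.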
* Re: [PATCH v3 4/4] arm64: support batched/deferred tlb shootdown during page reclamation
  2022-09-21 6:53 ` Anshuman Khandual
@ 2022-09-21 7:15 ` Barry Song
  2022-09-21 7:17 ` Nadav Amit
  1 sibling, 0 replies; 34+ messages in thread
From: Barry Song @ 2022-09-21 7:15 UTC
  To: Anshuman Khandual
  Cc: Yicong Yang, akpm, linux-mm, linux-arm-kernel, x86, catalin.marinas,
      will, linux-doc, corbet, peterz, arnd, linux-kernel, darren,
      yangyicong, huzhanyuan, lipeifeng, zhangshiming, guojian, realmz6,
      linux-mips, openrisc, linuxppc-dev, linux-riscv, linux-s390,
      wangkefeng.wang, xhao, prime.zeng, Barry Song, Nadav Amit, Mel Gorman

On Wed, Sep 21, 2022 at 6:53 PM Anshuman Khandual
<anshuman.khandual@arm.com> wrote:
>
>
> On 8/22/22 13:51, Yicong Yang wrote:
> > +static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
> > +					struct mm_struct *mm,
> > +					unsigned long uaddr)
> > +{
> > +	__flush_tlb_page_nosync(mm, uaddr);
> > +}
> > +
> > +static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
> > +{
> > +	dsb(ish);
> > +}
>
> Just wondering if arch_tlbbatch_add_mm() could also detect continuous mapping
> TLB invalidation requests on a given mm and try to generate a range based TLB
> invalidation such as flush_tlb_range().
>
> struct arch_tlbflush_unmap_batch via task->tlb_ubc->arch can track continuous
> ranges while being queued up via arch_tlbbatch_add_mm(), any range formed can
> later be flushed in subsequent arch_tlbbatch_flush() ?
>
> OR
>
> It might not be worth the effort and complexity, in comparison to performance
> improvement, TLB range flush brings in ?

Probably it is not worth the complexity, as perf annotate shows:

"Further perf annotate shows 95% cpu time of ptep_clear_flush is actually
used by the final dsb() to wait for the completion of tlb flush."

So any further optimization before dsb(ish) might bring some improvement,
but it seems minor.

Thanks
Barry
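[Editor's sketch] Barry's point is that the wait for completion dominates, so the gain comes from waiting once per batch rather than once per page. The userspace model below only counts operations to show that difference; it does not execute any TLBI or DSB instruction. All names (struct model_counters, model_reclaim() and the helpers) are hypothetical, and the synchronous path is simplified to "one invalidation plus one wait per page".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Operation counts for one reclaim pass in the model. */
struct model_counters {
	unsigned int tlbi_issued;	/* per-page invalidations sent */
	unsigned int dsb_waits;		/* completion waits performed */
};

/* Models the synchronous path: invalidate one page, then wait for it. */
static void model_flush_page_sync(struct model_counters *c)
{
	assert(c != NULL);
	c->tlbi_issued++;
	c->dsb_waits++;
}

/* Models arch_tlbbatch_add_mm(): send the invalidation, do not wait. */
static void model_batch_add_page(struct model_counters *c)
{
	assert(c != NULL);
	c->tlbi_issued++;
}

/* Models arch_tlbbatch_flush(): one wait covers everything queued. */
static void model_batch_sync(struct model_counters *c)
{
	assert(c != NULL);
	c->dsb_waits++;
}

/* Unmap npages pages, either synchronously or with a deferred sync. */
static struct model_counters model_reclaim(unsigned int npages, bool batched)
{
	struct model_counters c = { 0, 0 };
	unsigned int i;

	for (i = 0; i < npages; i++) {
		if (batched)
			model_batch_add_page(&c);
		else
			model_flush_page_sync(&c);
	}

	/* The deferred path needs a single sync, and only if work was queued. */
	if (batched && npages > 0)
		model_batch_sync(&c);

	return c;
}
```

For 32 pages the model issues 32 invalidations either way, but performs 32 waits on the synchronous path and 1 on the batched path. Coalescing the invalidations themselves would shrink the first number, which per the perf annotate figure quoted above is the smaller share of the cost.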
* Re: [PATCH v3 4/4] arm64: support batched/deferred tlb shootdown during page reclamation
  2022-09-21 6:53 ` Anshuman Khandual
  2022-09-21 7:15 ` Barry Song
@ 2022-09-21 7:17 ` Nadav Amit
  2022-09-22 3:15 ` Anshuman Khandual
  1 sibling, 1 reply; 34+ messages in thread
From: Nadav Amit @ 2022-09-21 7:17 UTC
  To: Anshuman Khandual
  Cc: Yicong Yang, Andrew Morton, Linux MM, linux-arm-kernel, x86,
      catalin.marinas, Will Deacon, linux-doc, corbet, peterz, arnd,
      linux-kernel, darren, yangyicong, huzhanyuan, lipeifeng,
      zhangshiming, guojian, realmz6, linux-mips, openrisc, linuxppc-dev,
      linux-riscv, linux-s390, Barry Song, wangkefeng.wang, xhao,
      prime.zeng, Barry Song, Mel Gorman

On Sep 20, 2022, at 11:53 PM, Anshuman Khandual <anshuman.khandual@arm.com> wrote:

> On 8/22/22 13:51, Yicong Yang wrote:
>> +static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
>> +					struct mm_struct *mm,
>> +					unsigned long uaddr)
>> +{
>> +	__flush_tlb_page_nosync(mm, uaddr);
>> +}
>> +
>> +static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
>> +{
>> +	dsb(ish);
>> +}
>
> Just wondering if arch_tlbbatch_add_mm() could also detect continuous mapping
> TLB invalidation requests on a given mm and try to generate a range based TLB
> invalidation such as flush_tlb_range().
>
> struct arch_tlbflush_unmap_batch via task->tlb_ubc->arch can track continuous
> ranges while being queued up via arch_tlbbatch_add_mm(), any range formed can
> later be flushed in subsequent arch_tlbbatch_flush() ?
>
> OR
>
> It might not be worth the effort and complexity, in comparison to performance
> improvement, TLB range flush brings in ?

So here are my 2 cents, based on my experience with Intel-x86. It is likely
different on arm64, but perhaps it can provide you some insight into what
parameters you should measure and consider.

In general there is a tradeoff between full TLB flushes and entry-specific
ones. Flushing specific entries takes more time than flushing the entire
TLB, but saves TLB refills.

Dave Hansen made some calculations in the past and came up with 33 as a
magic cutoff number, i.e., if you need to flush more than 33 entries, just
flush the entire TLB. I am not sure that this exact number is very
meaningful, since one might argue that it should’ve taken PTI into account
(which might require twice as many TLB invalidations).

Anyhow, back to arch_tlbbatch_add_mm(). It may be possible to track ranges,
but the question is whether you would actually succeed in forming continuous
ranges that are eventually (on x86) smaller than the full TLB flush cutoff
(=33). Questionable (perhaps better with MGLRU?).

Then, you should remember that tracking should be very efficient, since even
a few cache misses might have a greater cost than what you save by
selective-flushing. Finally, on x86 you would need to invoke the smp/IPI
layer multiple times to send different cores the relevant range they need to
flush.

IOW: It is somewhat complicated to implement efficiently. On x86, and
probably on other IPI-based TLB shootdown systems, it does not have a clear
performance benefit (IMHO).
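[Editor's sketch] The cutoff rule Nadav describes (flush individual entries up to a ceiling of 33, otherwise flush everything) can be written as a small decision function. The model below is a userspace illustration with hypothetical names; the pti_enabled doubling only encodes the "might require twice as many TLB invalidations" remark above as an assumption, and is not a claim about how any kernel actually computes the decision.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Cutoff quoted in the thread; hypothetical macro name. */
#define MODEL_FLUSH_CEILING 33UL

/*
 * Number of invalidations the model charges for nr_entries pages.
 * Assumption for illustration: with PTI each entry costs two invalidations.
 */
static unsigned long model_effective_invalidations(unsigned long nr_entries,
						   bool pti_enabled)
{
	assert(nr_entries <= ULONG_MAX / 2);	/* doubling must not overflow */
	return pti_enabled ? nr_entries * 2 : nr_entries;
}

/* True when the model prefers one full flush over per-entry flushes. */
static bool model_use_full_flush(unsigned long nr_entries, bool pti_enabled)
{
	return model_effective_invalidations(nr_entries, pti_enabled) >
	       MODEL_FLUSH_CEILING;
}
```

Under these assumptions 33 entries are still flushed one by one and 34 trigger a full flush; with the PTI doubling the switch happens between 16 and 17 entries. The point of the model is only that a tracked range has to stay below such a ceiling for range tracking to beat a full flush.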
* Re: [PATCH v3 4/4] arm64: support batched/deferred tlb shootdown during page reclamation
  2022-09-21 7:17 ` Nadav Amit
@ 2022-09-22 3:15 ` Anshuman Khandual
  0 siblings, 0 replies; 34+ messages in thread
From: Anshuman Khandual @ 2022-09-22 3:15 UTC
  To: Nadav Amit
  Cc: Yicong Yang, Andrew Morton, Linux MM, linux-arm-kernel, x86,
      catalin.marinas, Will Deacon, linux-doc, corbet, peterz, arnd,
      linux-kernel, darren, yangyicong, huzhanyuan, lipeifeng,
      zhangshiming, guojian, realmz6, linux-mips, openrisc, linuxppc-dev,
      linux-riscv, linux-s390, Barry Song, wangkefeng.wang, xhao,
      prime.zeng, Barry Song, Mel Gorman

On 9/21/22 12:47, Nadav Amit wrote:
> On Sep 20, 2022, at 11:53 PM, Anshuman Khandual <anshuman.khandual@arm.com> wrote:
>
>> On 8/22/22 13:51, Yicong Yang wrote:
>>> +static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
>>> +					struct mm_struct *mm,
>>> +					unsigned long uaddr)
>>> +{
>>> +	__flush_tlb_page_nosync(mm, uaddr);
>>> +}
>>> +
>>> +static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
>>> +{
>>> +	dsb(ish);
>>> +}
>>
>> Just wondering if arch_tlbbatch_add_mm() could also detect continuous mapping
>> TLB invalidation requests on a given mm and try to generate a range based TLB
>> invalidation such as flush_tlb_range().
>>
>> struct arch_tlbflush_unmap_batch via task->tlb_ubc->arch can track continuous
>> ranges while being queued up via arch_tlbbatch_add_mm(), any range formed can
>> later be flushed in subsequent arch_tlbbatch_flush() ?
>>
>> OR
>>
>> It might not be worth the effort and complexity, in comparison to performance
>> improvement, TLB range flush brings in ?
>
> So here are my 2 cents, based on my experience with Intel-x86. It is likely
> different on arm64, but perhaps it can provide you some insight into what
> parameters you should measure and consider.
>
> In general there is a tradeoff between full TLB flushes and entry-specific
> ones. Flushing specific entries takes more time than flushing the entire
> TLB, but saves TLB refills.

Right.

>
> Dave Hansen made some calculations in the past and came up with 33 as a
> magic cutoff number, i.e., if you need to flush more than 33 entries, just
> flush the entire TLB. I am not sure that this exact number is very
> meaningful, since one might argue that it should’ve taken PTI into account
> (which might require twice as many TLB invalidations).

Okay.

>
> Anyhow, back to arch_tlbbatch_add_mm(). It may be possible to track ranges,
> but the question is whether you would actually succeed in forming continuous
> ranges that are eventually (on x86) smaller than the full TLB flush cutoff
> (=33). Questionable (perhaps better with MGLRU?).

This proposal here for arm64 does not cause a full TLB flush ever. It creates
individual TLB flushes all the time. Hence the choice here is not between a
full TLB flush and possible range flushes. The choice is actually between
individual TLB flushes and range/full TLB flushes.

>
> Then, you should remember that tracking should be very efficient, since even
> few cache misses might have greater cost than what you save by
> selective-flushing. Finally, on x86 you would need to invoke the smp/IPI
> layer multiple times to send different cores the relevant range they need to
> flush.

Agreed, these reasons make it much more difficult to gain any more performance.

>
> IOW: It is somewhat complicated to implement efficiently. On x86, and
> probably other IPI-based TLB shootdown systems, does not have clear
> performance benefit (IMHO).

Agreed, thanks for such a detailed explanation, appreciate it.
* Re: [PATCH v3 0/4] mm: arm64: bring up BATCHED_UNMAP_TLB_FLUSH
  2022-08-22 8:21 [PATCH v3 0/4] mm: arm64: bring up BATCHED_UNMAP_TLB_FLUSH Yicong Yang
  ` (3 preceding siblings ...)
  2022-08-22 8:21 ` [PATCH v3 4/4] arm64: support batched/deferred tlb shootdown during page reclamation Yicong Yang
@ 2022-09-06 8:53 ` Yicong Yang
  4 siblings, 0 replies; 34+ messages in thread
From: Yicong Yang @ 2022-09-06 8:53 UTC
  To: akpm, linux-mm, linux-arm-kernel, x86, catalin.marinas, will, linux-doc
  Cc: corbet, peterz, arnd, linux-kernel, darren, yangyicong, huzhanyuan,
      lipeifeng, zhangshiming, guojian, realmz6, linux-mips, openrisc,
      linuxppc-dev, linux-riscv, linux-s390, Barry Song, wangkefeng.wang,
      xhao, prime.zeng, anshuman.khandual

Hi mm and arm64 maintainers,

a gentle ping for this..

Thanks.

On 2022/8/22 16:21, Yicong Yang wrote:
> From: Yicong Yang <yangyicong@hisilicon.com>
>
> Though ARM64 has the hardware to do tlb shootdown, the hardware
> broadcasting is not free.
> A simplest micro benchmark shows even on snapdragon 888 with only
> 8 cores, the overhead for ptep_clear_flush is huge even for paging
> out one page mapped by only one process:
> 5.36% a.out [kernel.kallsyms] [k] ptep_clear_flush
>
> While pages are mapped by multiple processes or HW has more CPUs,
> the cost should become even higher due to the bad scalability of
> tlb shootdown.
>
> The same benchmark can result in 16.99% CPU consumption on ARM64
> server with around 100 cores according to Yicong's test on patch
> 4/4.
>
> This patchset leverages the existing BATCHED_UNMAP_TLB_FLUSH by
> 1. only send tlbi instructions in the first stage -
> arch_tlbbatch_add_mm()
> 2. wait for the completion of tlbi by dsb while doing tlbbatch
> sync in arch_tlbbatch_flush()
> My testing on snapdragon shows the overhead of ptep_clear_flush
> is removed by the patchset. The micro benchmark becomes 5% faster
> even for one page mapped by single process on snapdragon 888.
>
> -v3:
> 1. Declare arch's tlbbatch defer support by arch_tlbbatch_should_defer() instead
> of ARCH_HAS_MM_CPUMASK, per Barry and Kefeng
> 2. Add Tested-by from Xin Hao
> Link: https://lore.kernel.org/linux-mm/20220711034615.482895-1-21cnbao@gmail.com/
>
> -v2:
> 1. Collected Yicong's test result on kunpeng920 ARM64 server;
> 2. Removed the redundant vma parameter in arch_tlbbatch_add_mm()
> according to the comments of Peter Zijlstra and Dave Hansen
> 3. Added ARCH_HAS_MM_CPUMASK rather than checking if mm_cpumask
> is empty according to the comments of Nadav Amit
>
> Thanks, Peter, Dave and Nadav for your testing or reviewing
> , and comments.
>
> -v1:
> https://lore.kernel.org/lkml/20220707125242.425242-1-21cnbao@gmail.com/
>
> Anshuman Khandual (1):
>   mm/tlbbatch: Introduce arch_tlbbatch_should_defer()
>
> Barry Song (3):
>   Revert "Documentation/features: mark BATCHED_UNMAP_TLB_FLUSH doesn't
>     apply to ARM64"
>   mm: rmap: Extend tlbbatch APIs to fit new platforms
>   arm64: support batched/deferred tlb shootdown during page reclamation
>
>  Documentation/features/arch-support.txt      |  1 -
>  .../features/vm/TLB/arch-support.txt         |  2 +-
>  arch/arm64/Kconfig                           |  1 +
>  arch/arm64/include/asm/tlbbatch.h            | 12 ++++++++
>  arch/arm64/include/asm/tlbflush.h            | 28 +++++++++++++++++--
>  arch/x86/include/asm/tlbflush.h              | 15 +++++++++-
>  mm/rmap.c                                    | 19 +++++--------
>  7 files changed, 61 insertions(+), 17 deletions(-)
>  create mode 100644 arch/arm64/include/asm/tlbbatch.h
>