* [RFC PATCH v5 0/2] arm64: tlb: add support for TLBI RANGE instructions
From: Zhenyu Ye @ 2020-07-08 12:40 UTC
To: catalin.marinas, will, suzuki.poulose, maz, steven.price, guohanjun, olof
Cc: yezhenyu2, linux-arm-kernel, linux-kernel, linux-arch, linux-mm, arm, xiexiangyou, prime.zeng, zhangshaokun, kuhn.chenqun

ARMv8.4-TLBI provides TLBI invalidation instructions that apply to a
range of input addresses. This series adds support for this feature.

I tested this feature on an FPGA machine whose CPUs support the TLBI
range instructions. As the page num increases, the performance is
improved significantly. When page num = 256, the performance is
improved by about 10 times.

Below is the test data when the stride = PTE:

	[page num]	[classic]	[tlbi range]
	1		16051		13524
	2		11366		11146
	3		11582		12171
	4		11694		11101
	5		12138		12267
	6		12290		11105
	7		12400		12002
	8		12837		11097
	9		14791		12140
	10		15461		11087
	16		18233		11094
	32		26983		11079
	64		43840		11092
	128		77754		11098
	256		145514		11089
	512		280932		11111

See more details in:

https://lore.kernel.org/linux-arm-kernel/504c7588-97e5-e014-fca0-c5511ae0d256@huawei.com/

--
ChangeList:
v5:
- rebase this series on Linux 5.8-rc4.
- remove the __TG macro.
- move the odd range_pages check into loop.

v4:
combine the __flush_tlb_range() and the __directly into the same
function with a single loop for both.

v3:
rebase this series on Linux 5.7-rc1.

v2:
Link: https://lkml.org/lkml/2019/11/11/348

Zhenyu Ye (2):
  arm64: tlb: Detect the ARMv8.4 TLBI RANGE feature
  arm64: tlb: Use the TLBI RANGE feature in arm64

 arch/arm64/include/asm/cpucaps.h  |   3 +-
 arch/arm64/include/asm/sysreg.h   |   3 +
 arch/arm64/include/asm/tlbflush.h | 101 +++++++++++++++++++++++++-----
 arch/arm64/kernel/cpufeature.c    |  10 +++
 4 files changed, 102 insertions(+), 15 deletions(-)

-- 
2.19.1

* [RFC PATCH v5 1/2] arm64: tlb: Detect the ARMv8.4 TLBI RANGE feature
From: Zhenyu Ye @ 2020-07-08 12:40 UTC
To: catalin.marinas, will, suzuki.poulose, maz, steven.price, guohanjun, olof
Cc: yezhenyu2, linux-arm-kernel, linux-kernel, linux-arch, linux-mm, arm, xiexiangyou, prime.zeng, zhangshaokun, kuhn.chenqun

ARMv8.4-TLBI provides TLBI invalidation instructions that apply to a
range of input addresses. This patch detects this feature.

Signed-off-by: Zhenyu Ye <yezhenyu2@huawei.com>
---
 arch/arm64/include/asm/cpucaps.h |  3 ++-
 arch/arm64/include/asm/sysreg.h  |  3 +++
 arch/arm64/kernel/cpufeature.c   | 10 ++++++++++
 3 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/cpucaps.h b/arch/arm64/include/asm/cpucaps.h
index d7b3bb0cb180..96fe898bfb5f 100644
--- a/arch/arm64/include/asm/cpucaps.h
+++ b/arch/arm64/include/asm/cpucaps.h
@@ -62,7 +62,8 @@
 #define ARM64_HAS_GENERIC_AUTH			52
 #define ARM64_HAS_32BIT_EL1			53
 #define ARM64_BTI				54
+#define ARM64_HAS_TLBI_RANGE			55
 
-#define ARM64_NCAPS				55
+#define ARM64_NCAPS				56
 
 #endif /* __ASM_CPUCAPS_H */
diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
index 463175f80341..b4eb2e5601f2 100644
--- a/arch/arm64/include/asm/sysreg.h
+++ b/arch/arm64/include/asm/sysreg.h
@@ -617,6 +617,9 @@
 #define ID_AA64ISAR0_SHA1_SHIFT		8
 #define ID_AA64ISAR0_AES_SHIFT		4
 
+#define ID_AA64ISAR0_TLBI_RANGE_NI	0x0
+#define ID_AA64ISAR0_TLBI_RANGE		0x2
+
 /* id_aa64isar1 */
 #define ID_AA64ISAR1_I8MM_SHIFT		52
 #define ID_AA64ISAR1_DGH_SHIFT		48
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index 9fae0efc80c1..5491bf47e62c 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -2058,6 +2058,16 @@ static const struct arm64_cpu_capabilities arm64_features[] = {
 		.sign = FTR_UNSIGNED,
 	},
 #endif
+	{
+		.desc = "TLB range maintenance instruction",
+		.capability = ARM64_HAS_TLBI_RANGE,
+		.type = ARM64_CPUCAP_SYSTEM_FEATURE,
+		.matches = has_cpuid_feature,
+		.sys_reg = SYS_ID_AA64ISAR0_EL1,
+		.field_pos = ID_AA64ISAR0_TLB_SHIFT,
+		.sign = FTR_UNSIGNED,
+		.min_field_value = ID_AA64ISAR0_TLBI_RANGE,
+	},
 	{},
 };
 
-- 
2.19.1

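[Editorial illustration, not part of the patch] The capability entry above asks the kernel to compare an unsigned 4-bit field of ID_AA64ISAR0_EL1 against a minimum value. The comparison can be sketched in user-space C roughly as follows. The helper name isar0_has_tlbi_range() is hypothetical, and the value 56 for ID_AA64ISAR0_TLB_SHIFT is an assumption, since the patch references that macro without showing its definition. The in-kernel has_cpuid_feature() path works on the kernel's own view of the register and is not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: the TLB field starts at bit 56; the patch does not show this. */
#define ID_AA64ISAR0_TLB_SHIFT		56
/* Values taken from the sysreg.h hunk of the patch. */
#define ID_AA64ISAR0_TLBI_RANGE_NI	0x0
#define ID_AA64ISAR0_TLBI_RANGE		0x2

/* An unsigned ">=" match only works if "supported" sorts above "not implemented". */
static_assert(ID_AA64ISAR0_TLBI_RANGE > ID_AA64ISAR0_TLBI_RANGE_NI,
	      "feature value must exceed the not-implemented value");

/*
 * Hypothetical helper: extract the 4-bit unsigned field and compare it
 * with the minimum value, mirroring .sign = FTR_UNSIGNED and
 * .min_field_value = ID_AA64ISAR0_TLBI_RANGE in the capability entry.
 */
static bool isar0_has_tlbi_range(uint64_t isar0)
{
	unsigned int field = (isar0 >> ID_AA64ISAR0_TLB_SHIFT) & 0xf;

	return field >= ID_AA64ISAR0_TLBI_RANGE;
}
```

With these assumptions, a register value whose field reads 0x1 would not match, while 0x2 or anything larger would.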
* [RFC PATCH v5 2/2] arm64: tlb: Use the TLBI RANGE feature in arm64
From: Zhenyu Ye @ 2020-07-08 12:40 UTC
To: catalin.marinas, will, suzuki.poulose, maz, steven.price, guohanjun, olof
Cc: yezhenyu2, linux-arm-kernel, linux-kernel, linux-arch, linux-mm, arm, xiexiangyou, prime.zeng, zhangshaokun, kuhn.chenqun

Add __TLBI_VADDR_RANGE macro and rewrite __flush_tlb_range().

In this patch, we only use the TLBI RANGE feature if the stride == PAGE_SIZE,
because when stride > PAGE_SIZE, usually only a small number of pages need
to be flushed and classic tlbi instructions are more effective.

We can also use 'end - start < threshold number' to decide which way
to go, however, different hardware may have different thresholds, so
I'm not sure if this is feasible.

Signed-off-by: Zhenyu Ye <yezhenyu2@huawei.com>
---
 arch/arm64/include/asm/tlbflush.h | 104 ++++++++++++++++++++++++++----
 1 file changed, 90 insertions(+), 14 deletions(-)

diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index bc3949064725..30975ddb8f06 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -50,6 +50,16 @@
 		__tlbi(op, (arg) | USER_ASID_FLAG);			\
 } while (0)
 
+#define __tlbi_last_level(op1, op2, arg, last_level) do {		\
+	if (last_level) {						\
+		__tlbi(op1, arg);					\
+		__tlbi_user(op1, arg);					\
+	} else {							\
+		__tlbi(op2, arg);					\
+		__tlbi_user(op2, arg);					\
+	}								\
+} while (0)
+
 /* This macro creates a properly formatted VA operand for the TLBI */
 #define __TLBI_VADDR(addr, asid)				\
 	({							\
@@ -59,6 +69,60 @@
 		__ta;						\
 	})
 
+/*
+ * Get translation granule of the system, which is decided by
+ * PAGE_SIZE.  Used by TTL.
+ *  - 4KB	: 1
+ *  - 16KB	: 2
+ *  - 64KB	: 3
+ */
+static inline unsigned long get_trans_granule(void)
+{
+	switch (PAGE_SIZE) {
+	case SZ_4K:
+		return 1;
+	case SZ_16K:
+		return 2;
+	case SZ_64K:
+		return 3;
+	default:
+		return 0;
+	}
+}
+
+/*
+ * This macro creates a properly formatted VA operand for the TLBI RANGE.
+ * The value bit assignments are:
+ *
+ * +----------+------+-------+-------+-------+----------------------+
+ * |   ASID   |  TG  | SCALE |  NUM  |  TTL  |        BADDR         |
+ * +-----------------+-------+-------+-------+----------------------+
+ * |63      48|47  46|45   44|43   39|38   37|36                   0|
+ *
+ * The address range is determined by below formula:
+ * [BADDR, BADDR + (NUM + 1) * 2^(5*SCALE + 1) * PAGESIZE)
+ *
+ */
+#define __TLBI_VADDR_RANGE(addr, asid, scale, num, ttl)		\
+	({							\
+		unsigned long __ta = (addr) >> PAGE_SHIFT;	\
+		__ta &= GENMASK_ULL(36, 0);			\
+		__ta |= (unsigned long)(ttl) << 37;		\
+		__ta |= (unsigned long)(num) << 39;		\
+		__ta |= (unsigned long)(scale) << 44;		\
+		__ta |= get_trans_granule() << 46;		\
+		__ta |= (unsigned long)(asid) << 48;		\
+		__ta;						\
+	})
+
+/* These macros are used by the TLBI RANGE feature. */
+#define __TLBI_RANGE_PAGES(num, scale)	(((num) + 1) << (5 * (scale) + 1))
+#define MAX_TLBI_RANGE_PAGES		__TLBI_RANGE_PAGES(31, 3)
+
+#define TLBI_RANGE_MASK			GENMASK_ULL(4, 0)
+#define __TLBI_RANGE_NUM(range, scale)	\
+	(((range) >> (5 * (scale) + 1)) & TLBI_RANGE_MASK)
+
 /*
  *	TLB Invalidation
  *	================
@@ -181,32 +245,44 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma,
 				     unsigned long start, unsigned long end,
 				     unsigned long stride, bool last_level)
 {
+	int num = 0;
+	int scale = 0;
 	unsigned long asid = ASID(vma->vm_mm);
 	unsigned long addr;
+	unsigned long range_pages;
 
 	start = round_down(start, stride);
 	end = round_up(end, stride);
+	range_pages = (end - start) >> PAGE_SHIFT;
 
-	if ((end - start) >= (MAX_TLBI_OPS * stride)) {
+	if ((!cpus_have_const_cap(ARM64_HAS_TLBI_RANGE) &&
+	    (end - start) >= (MAX_TLBI_OPS * stride)) ||
+	    range_pages >= MAX_TLBI_RANGE_PAGES) {
 		flush_tlb_mm(vma->vm_mm);
 		return;
 	}
 
-	/* Convert the stride into units of 4k */
-	stride >>= 12;
-
-	start = __TLBI_VADDR(start, asid);
-	end = __TLBI_VADDR(end, asid);
-
 	dsb(ishst);
-	for (addr = start; addr < end; addr += stride) {
-		if (last_level) {
-			__tlbi(vale1is, addr);
-			__tlbi_user(vale1is, addr);
-		} else {
-			__tlbi(vae1is, addr);
-			__tlbi_user(vae1is, addr);
+	while (range_pages > 0) {
+		if (cpus_have_const_cap(ARM64_HAS_TLBI_RANGE) &&
+		    stride == PAGE_SIZE && range_pages % 2 == 0) {
+			num = __TLBI_RANGE_NUM(range_pages, scale) - 1;
+			if (num >= 0) {
+				addr = __TLBI_VADDR_RANGE(start, asid, scale,
+							  num, 0);
+				__tlbi_last_level(rvale1is, rvae1is, addr,
+						  last_level);
+				start += __TLBI_RANGE_PAGES(num, scale) << PAGE_SHIFT;
+				range_pages -= __TLBI_RANGE_PAGES(num, scale);
+			}
+			scale++;
+			continue;
 		}
+
+		addr = __TLBI_VADDR(start, asid);
+		__tlbi_last_level(vale1is, vae1is, addr, last_level);
+		start += stride;
+		range_pages -= stride >> PAGE_SHIFT;
 	}
 	dsb(ish);
 }
-- 
2.19.1

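[Editorial illustration, not part of the patch] The bit layout documented for __TLBI_VADDR_RANGE can be checked outside the kernel with a small user-space mirror of the macro. This is only a sketch under stated assumptions: 4KB pages (so PAGE_SHIFT is 12 and the TG field is 1, as get_trans_granule() would return), and a hypothetical function name tlbi_vaddr_range(). The field positions are copied from the comment in the patch.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT	12	/* assumption: 4KB pages */
#define TG_4K		1ULL	/* value get_trans_granule() returns for SZ_4K */

/*
 * Hypothetical user-space mirror of __TLBI_VADDR_RANGE():
 *   ASID [63:48] | TG [47:46] | SCALE [45:44] | NUM [43:39] |
 *   TTL [38:37]  | BADDR [36:0]
 */
static uint64_t tlbi_vaddr_range(uint64_t addr, uint64_t asid,
				 unsigned int scale, unsigned int num,
				 unsigned int ttl)
{
	uint64_t ta;

	/* Each field must fit the width given in the layout above. */
	assert(scale <= 3 && num <= 31 && ttl <= 3 && asid <= 0xffff);

	ta = (addr >> PAGE_SHIFT) & ((1ULL << 37) - 1);	/* BADDR */
	ta |= (uint64_t)ttl << 37;
	ta |= (uint64_t)num << 39;
	ta |= (uint64_t)scale << 44;
	ta |= TG_4K << 46;
	ta |= asid << 48;
	return ta;
}
```

For instance, encoding addr = 0x200000 with asid = 1, scale = 0, num = 3 and ttl = 0 places 0x200 in BADDR, 3 in NUM, 1 in TG and 1 in ASID; by the formula in the patch comment that operand would describe (3 + 1) * 2 = 8 pages starting at the base address.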
* Re: [RFC PATCH v5 2/2] arm64: tlb: Use the TLBI RANGE feature in arm64
From: Catalin Marinas @ 2020-07-08 18:24 UTC
To: Zhenyu Ye
Cc: will, suzuki.poulose, maz, steven.price, guohanjun, olof, linux-arm-kernel, linux-kernel, linux-arch, linux-mm, arm, xiexiangyou, prime.zeng, zhangshaokun, kuhn.chenqun

On Wed, Jul 08, 2020 at 08:40:31PM +0800, Zhenyu Ye wrote:
> Add __TLBI_VADDR_RANGE macro and rewrite __flush_tlb_range().
>
> In this patch, we only use the TLBI RANGE feature if the stride == PAGE_SIZE,
> because when stride > PAGE_SIZE, usually only a small number of pages need
> to be flushed and classic tlbi instructions are more effective.

Why are they more effective? I guess a range op would work on this as
well, say unmapping a large THP range. If we ignore this stride ==
PAGE_SIZE, it could make the code easier to read.

> We can also use 'end - start < threshold number' to decide which way
> to go, however, different hardware may have different thresholds, so
> I'm not sure if this is feasible.
>
> Signed-off-by: Zhenyu Ye <yezhenyu2@huawei.com>
> ---
>  arch/arm64/include/asm/tlbflush.h | 104 ++++++++++++++++++++++++++----
>  1 file changed, 90 insertions(+), 14 deletions(-)

Could you please rebase these patches on top of the arm64 for-next/tlbi
branch:

git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git for-next/tlbi

> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
> index bc3949064725..30975ddb8f06 100644
> --- a/arch/arm64/include/asm/tlbflush.h
> +++ b/arch/arm64/include/asm/tlbflush.h
> @@ -50,6 +50,16 @@
>  		__tlbi(op, (arg) | USER_ASID_FLAG);			\
>  } while (0)
>
> +#define __tlbi_last_level(op1, op2, arg, last_level) do {		\
> +	if (last_level) {						\
> +		__tlbi(op1, arg);					\
> +		__tlbi_user(op1, arg);					\
> +	} else {							\
> +		__tlbi(op2, arg);					\
> +		__tlbi_user(op2, arg);					\
> +	}								\
> +} while (0)
> +
>  /* This macro creates a properly formatted VA operand for the TLBI */
>  #define __TLBI_VADDR(addr, asid)				\
>  	({							\
> @@ -59,6 +69,60 @@
>  		__ta;						\
>  	})
>
> +/*
> + * Get translation granule of the system, which is decided by
> + * PAGE_SIZE.  Used by TTL.
> + *  - 4KB	: 1
> + *  - 16KB	: 2
> + *  - 64KB	: 3
> + */
> +static inline unsigned long get_trans_granule(void)
> +{
> +	switch (PAGE_SIZE) {
> +	case SZ_4K:
> +		return 1;
> +	case SZ_16K:
> +		return 2;
> +	case SZ_64K:
> +		return 3;
> +	default:
> +		return 0;
> +	}
> +}

Maybe you can factor out this switch statement in the for-next/tlbi
branch to be shared with TTL.

> +/*
> + * This macro creates a properly formatted VA operand for the TLBI RANGE.
> + * The value bit assignments are:
> + *
> + * +----------+------+-------+-------+-------+----------------------+
> + * |   ASID   |  TG  | SCALE |  NUM  |  TTL  |        BADDR         |
> + * +-----------------+-------+-------+-------+----------------------+
> + * |63      48|47  46|45   44|43   39|38   37|36                   0|
> + *
> + * The address range is determined by below formula:
> + * [BADDR, BADDR + (NUM + 1) * 2^(5*SCALE + 1) * PAGESIZE)
> + *
> + */
> +#define __TLBI_VADDR_RANGE(addr, asid, scale, num, ttl)		\

I don't see a non-zero ttl passed to this macro but I suspect this would
change if based on top of the TTL patches.

> +	({							\
> +		unsigned long __ta = (addr) >> PAGE_SHIFT;	\
> +		__ta &= GENMASK_ULL(36, 0);			\
> +		__ta |= (unsigned long)(ttl) << 37;		\
> +		__ta |= (unsigned long)(num) << 39;		\
> +		__ta |= (unsigned long)(scale) << 44;		\
> +		__ta |= get_trans_granule() << 46;		\
> +		__ta |= (unsigned long)(asid) << 48;		\
> +		__ta;						\
> +	})
> +
> +/* These macros are used by the TLBI RANGE feature. */
> +#define __TLBI_RANGE_PAGES(num, scale)	(((num) + 1) << (5 * (scale) + 1))
> +#define MAX_TLBI_RANGE_PAGES		__TLBI_RANGE_PAGES(31, 3)
> +
> +#define TLBI_RANGE_MASK			GENMASK_ULL(4, 0)
> +#define __TLBI_RANGE_NUM(range, scale)	\
> +	(((range) >> (5 * (scale) + 1)) & TLBI_RANGE_MASK)
> +
>  /*
>   *	TLB Invalidation
>   *	================
> @@ -181,32 +245,44 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma,
>  				     unsigned long start, unsigned long end,
>  				     unsigned long stride, bool last_level)
>  {
> +	int num = 0;
> +	int scale = 0;
>  	unsigned long asid = ASID(vma->vm_mm);
>  	unsigned long addr;
> +	unsigned long range_pages;
>
>  	start = round_down(start, stride);
>  	end = round_up(end, stride);
> +	range_pages = (end - start) >> PAGE_SHIFT;
>
> -	if ((end - start) >= (MAX_TLBI_OPS * stride)) {
> +	if ((!cpus_have_const_cap(ARM64_HAS_TLBI_RANGE) &&
> +	    (end - start) >= (MAX_TLBI_OPS * stride)) ||
> +	    range_pages >= MAX_TLBI_RANGE_PAGES) {
>  		flush_tlb_mm(vma->vm_mm);
>  		return;
>  	}

Is there any value in this range_pages check here? What's the value of
MAX_TLBI_RANGE_PAGES? If we have TLBI range ops, we make a decision here
but without including the stride. Further down we use the stride to skip
the TLBI range ops.

>
> -	/* Convert the stride into units of 4k */
> -	stride >>= 12;
> -
> -	start = __TLBI_VADDR(start, asid);
> -	end = __TLBI_VADDR(end, asid);
> -
>  	dsb(ishst);
> -	for (addr = start; addr < end; addr += stride) {
> -		if (last_level) {
> -			__tlbi(vale1is, addr);
> -			__tlbi_user(vale1is, addr);
> -		} else {
> -			__tlbi(vae1is, addr);
> -			__tlbi_user(vae1is, addr);
> +	while (range_pages > 0) {

BTW, I think we can even drop the "range_" from range_pages, it's just
the number of pages.

> +		if (cpus_have_const_cap(ARM64_HAS_TLBI_RANGE) &&
> +		    stride == PAGE_SIZE && range_pages % 2 == 0) {
> +			num = __TLBI_RANGE_NUM(range_pages, scale) - 1;
> +			if (num >= 0) {
> +				addr = __TLBI_VADDR_RANGE(start, asid, scale,
> +							  num, 0);
> +				__tlbi_last_level(rvale1is, rvae1is, addr,
> +						  last_level);
> +				start += __TLBI_RANGE_PAGES(num, scale) << PAGE_SHIFT;
> +				range_pages -= __TLBI_RANGE_PAGES(num, scale);
> +			}
> +			scale++;
> +			continue;
>  		}
> +
> +		addr = __TLBI_VADDR(start, asid);
> +		__tlbi_last_level(vale1is, vae1is, addr, last_level);
> +		start += stride;
> +		range_pages -= stride >> PAGE_SHIFT;
>  	}
>  	dsb(ish);
>  }

I think the algorithm is correct, though I need to work it out on a
piece of paper.

The code could benefit from some comments (above the loop) on how the
range is built and the right scale found.

-- 
Catalin

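[Editorial illustration, not part of the thread] One way to work the loop out without paper is to replay its bookkeeping in user space. The sketch below assumes stride == PAGE_SIZE and that range operations are available, and it only counts operations instead of issuing TLBI instructions. The helper name count_tlbi_ops() and the shortened macro names are hypothetical; the arithmetic is copied from __TLBI_RANGE_PAGES and __TLBI_RANGE_NUM in the patch.

```c
#include <assert.h>

#define TLBI_RANGE_MASK		0x1fUL
#define RANGE_PAGES(num, scale)	(((unsigned long)(num) + 1) << (5 * (scale) + 1))
#define RANGE_NUM(pages, scale)	(((pages) >> (5 * (scale) + 1)) & TLBI_RANGE_MASK)
#define MAX_RANGE_PAGES		RANGE_PAGES(31, 3)

/*
 * Hypothetical helper: number of TLBI operations the v5 loop would issue
 * for a given page count.
 */
static unsigned int count_tlbi_ops(unsigned long pages)
{
	unsigned int ops = 0;
	int scale = 0;

	/* Larger ranges are handled by flush_tlb_mm() before the loop. */
	assert(pages < MAX_RANGE_PAGES);

	while (pages > 0) {
		if (pages % 2 == 0) {
			/* 5-bit slice of the page count for this scale, minus one. */
			int num = (int)RANGE_NUM(pages, scale) - 1;

			if (num >= 0) {
				/* One range op covers (num + 1) * 2^(5*scale + 1) pages. */
				pages -= RANGE_PAGES(num, scale);
				ops++;
			}
			scale++;
			continue;
		}
		/* Odd page count: one classic single-page op makes it even. */
		pages -= 1;
		ops++;
	}
	return ops;
}
```

Under these assumptions 512 pages need a single range operation (scale 1, num 7), and 67 pages need three operations: one classic op for the odd page, then a range op of 2 pages at scale 0 and one of 64 pages at scale 1. Each scale consumes one 5-bit slice of the page count, so a range below MAX_TLBI_RANGE_PAGES would take at most one classic op plus four range ops.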
* Re: [RFC PATCH v5 2/2] arm64: tlb: Use the TLBI RANGE feature in arm64
From: Zhenyu Ye @ 2020-07-09 6:51 UTC
To: Catalin Marinas
Cc: will, suzuki.poulose, maz, steven.price, guohanjun, olof, linux-arm-kernel, linux-kernel, linux-arch, linux-mm, arm, xiexiangyou, prime.zeng, zhangshaokun, kuhn.chenqun

On 2020/7/9 2:24, Catalin Marinas wrote:
> On Wed, Jul 08, 2020 at 08:40:31PM +0800, Zhenyu Ye wrote:
>> Add __TLBI_VADDR_RANGE macro and rewrite __flush_tlb_range().
>>
>> In this patch, we only use the TLBI RANGE feature if the stride == PAGE_SIZE,
>> because when stride > PAGE_SIZE, usually only a small number of pages need
>> to be flushed and classic tlbi instructions are more effective.
>
> Why are they more effective? I guess a range op would work on this as
> well, say unmapping a large THP range. If we ignore this stride ==
> PAGE_SIZE, it could make the code easier to read.
>

OK, I will remove the stride == PAGE_SIZE check here.

>> We can also use 'end - start < threshold number' to decide which way
>> to go, however, different hardware may have different thresholds, so
>> I'm not sure if this is feasible.
>>
>> Signed-off-by: Zhenyu Ye <yezhenyu2@huawei.com>
>> ---
>>  arch/arm64/include/asm/tlbflush.h | 104 ++++++++++++++++++++++++++----
>>  1 file changed, 90 insertions(+), 14 deletions(-)
>
> Could you please rebase these patches on top of the arm64 for-next/tlbi
> branch:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git for-next/tlbi
>

OK, I will send a formal version of this series soon.

>>
>> -	if ((end - start) >= (MAX_TLBI_OPS * stride)) {
>> +	if ((!cpus_have_const_cap(ARM64_HAS_TLBI_RANGE) &&
>> +	    (end - start) >= (MAX_TLBI_OPS * stride)) ||
>> +	    range_pages >= MAX_TLBI_RANGE_PAGES) {
>>  		flush_tlb_mm(vma->vm_mm);
>>  		return;
>>  	}
>
> Is there any value in this range_pages check here? What's the value of
> MAX_TLBI_RANGE_PAGES? If we have TLBI range ops, we make a decision here
> but without including the stride. Further down we use the stride to skip
> the TLBI range ops.
>

MAX_TLBI_RANGE_PAGES is defined as __TLBI_RANGE_PAGES(31, 3), which is
determined by the ARMv8.4 spec. The address range is determined by the
formula below:

	[BADDR, BADDR + (NUM + 1) * 2^(5*SCALE + 1) * PAGESIZE)

This has nothing to do with the stride. After removing the stride ==
PAGE_SIZE check below, this will be clearer.

>>  }
>
> I think the algorithm is correct, though I need to work it out on a
> piece of paper.
>
> The code could benefit from some comments (above the loop) on how the
> range is built and the right scale found.
>

OK.

Thanks,
Zhenyu

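[Editorial illustration, not part of the thread] The numeric value asked about above follows directly from the macro: (31 + 1) << (5 * 3 + 1) = 32 << 16 = 2097152 pages. A minimal sketch of that arithmetic, with the macro renamed to avoid reserved identifiers in user space and a hypothetical helper max_range_bytes() that converts the page count to bytes for an assumed page shift:

```c
#include <assert.h>
#include <stdint.h>

/* Same arithmetic as __TLBI_RANGE_PAGES() in the patch. */
#define TLBI_RANGE_PAGES(num, scale)	(((num) + 1) << (5 * (scale) + 1))
#define MAX_TLBI_RANGE_PAGES		TLBI_RANGE_PAGES(31, 3)

/* Largest NUM (31) and SCALE (3) give 32 * 2^16 = 2^21 pages. */
static_assert(MAX_TLBI_RANGE_PAGES == 2097152, "expected 2^21 pages");

/*
 * Hypothetical helper: bytes spanned by MAX_TLBI_RANGE_PAGES for a given
 * page shift (12 for 4KB, 14 for 16KB, 16 for 64KB pages).
 */
static uint64_t max_range_bytes(unsigned int page_shift)
{
	return (uint64_t)MAX_TLBI_RANGE_PAGES << page_shift;
}
```

With 4KB pages this works out to 8 GiB, with 16KB pages to 32 GiB, and with 64KB pages to 128 GiB, so the range_pages >= MAX_TLBI_RANGE_PAGES check would only send very large ranges to flush_tlb_mm().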