* [PATCH v1 0/2] arm64: tlb: add support for TLBI RANGE instructions
@ 2020-07-09 9:10 Zhenyu Ye
  2020-07-09 9:10 ` [PATCH v1 1/2] arm64: tlb: Detect the ARMv8.4 TLBI RANGE feature Zhenyu Ye
  2020-07-09 9:10 ` [PATCH v1 2/2] arm64: tlb: Use the TLBI RANGE feature in arm64 Zhenyu Ye
  0 siblings, 2 replies; 6+ messages in thread
From: Zhenyu Ye @ 2020-07-09 9:10 UTC
To: catalin.marinas, will, suzuki.poulose, maz, steven.price, guohanjun, olof
Cc: linux-arch, yezhenyu2, linux-kernel, xiexiangyou, zhangshaokun, linux-mm, arm, prime.zeng, kuhn.chenqun, linux-arm-kernel

NOTICE: this series is based on the arm64 for-next/tlbi branch:
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git for-next/tlbi

--
ARMv8.4-TLBI provides TLBI invalidation instructions that apply to a
range of input addresses. This series adds support for this feature.

I tested this feature on an FPGA machine whose CPUs support the TLBI
range instructions. As the page count increases, the performance
improves significantly. When page num = 256, the performance is
improved by about 10 times.
Below is the test data when the stride = PTE:

    [page num]    [classic]    [tlbi range]
         1          16051         13524
         2          11366         11146
         3          11582         12171
         4          11694         11101
         5          12138         12267
         6          12290         11105
         7          12400         12002
         8          12837         11097
         9          14791         12140
        10          15461         11087
        16          18233         11094
        32          26983         11079
        64          43840         11092
       128          77754         11098
       256         145514         11089
       512         280932         11111

See more details in:

https://lore.kernel.org/linux-arm-kernel/504c7588-97e5-e014-fca0-c5511ae0d256@huawei.com/

--
RFC patches:
- Link: https://lore.kernel.org/linux-arm-kernel/20200708124031.1414-1-yezhenyu2@huawei.com/

Zhenyu Ye (2):
  arm64: tlb: Detect the ARMv8.4 TLBI RANGE feature
  arm64: tlb: Use the TLBI RANGE feature in arm64

 arch/arm64/include/asm/cpucaps.h  |   3 +-
 arch/arm64/include/asm/sysreg.h   |   3 +
 arch/arm64/include/asm/tlbflush.h | 156 ++++++++++++++++++++++++------
 arch/arm64/kernel/cpufeature.c    |  10 ++
 4 files changed, 141 insertions(+), 31 deletions(-)

-- 
2.19.1

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
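The ratio between the two columns can be checked with a few lines of C. This is only a sketch over a handful of rows from the table above: the struct and function names are hypothetical, and since the thread does not state the units of the measurements, only the classic/range ratio is computed.

```c
#include <assert.h>
#include <stddef.h>

/* A few rows of the cover letter's table (stride = PTE). */
struct tlbi_sample {
	unsigned int pages;	/* [page num] */
	unsigned long classic;	/* [classic] */
	unsigned long range;	/* [tlbi range] */
};

static const struct tlbi_sample samples[] = {
	{ 1, 16051, 13524 },
	{ 16, 18233, 11094 },
	{ 64, 43840, 11092 },
	{ 256, 145514, 11089 },
	{ 512, 280932, 11111 },
};

/* Return classic/range for the given page count, or 0.0 if not listed. */
static double tlbi_speedup(unsigned int pages)
{
	for (size_t i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
		if (samples[i].pages == pages) {
			assert(samples[i].range != 0);
			return (double)samples[i].classic /
			       (double)samples[i].range;
		}
	}
	return 0.0;
}
```

With these rows, tlbi_speedup(256) comes out a little above 13 and tlbi_speedup(512) a little above 25, while the range column stays nearly flat across all page counts.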
* [PATCH v1 1/2] arm64: tlb: Detect the ARMv8.4 TLBI RANGE feature 2020-07-09 9:10 [PATCH v1 0/2] arm64: tlb: add support for TLBI RANGE instructions Zhenyu Ye @ 2020-07-09 9:10 ` Zhenyu Ye 2020-07-09 9:10 ` [PATCH v1 2/2] arm64: tlb: Use the TLBI RANGE feature in arm64 Zhenyu Ye 1 sibling, 0 replies; 6+ messages in thread From: Zhenyu Ye @ 2020-07-09 9:10 UTC (permalink / raw) To: catalin.marinas, will, suzuki.poulose, maz, steven.price, guohanjun, olof Cc: linux-arch, yezhenyu2, linux-kernel, xiexiangyou, zhangshaokun, linux-mm, arm, prime.zeng, kuhn.chenqun, linux-arm-kernel ARMv8.4-TLBI provides TLBI invalidation instruction that apply to a range of input addresses. This patch detect this feature. Signed-off-by: Zhenyu Ye <yezhenyu2@huawei.com> --- arch/arm64/include/asm/cpucaps.h | 3 ++- arch/arm64/include/asm/sysreg.h | 3 +++ arch/arm64/kernel/cpufeature.c | 10 ++++++++++ 3 files changed, 15 insertions(+), 1 deletion(-) diff --git a/arch/arm64/include/asm/cpucaps.h b/arch/arm64/include/asm/cpucaps.h index d44ba903d11d..8fe4aa1d372b 100644 --- a/arch/arm64/include/asm/cpucaps.h +++ b/arch/arm64/include/asm/cpucaps.h @@ -63,7 +63,8 @@ #define ARM64_HAS_32BIT_EL1 53 #define ARM64_BTI 54 #define ARM64_HAS_ARMv8_4_TTL 55 +#define ARM64_HAS_TLBI_RANGE 56 -#define ARM64_NCAPS 56 +#define ARM64_NCAPS 57 #endif /* __ASM_CPUCAPS_H */ diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h index 8c209aa17273..a5f24a26d86a 100644 --- a/arch/arm64/include/asm/sysreg.h +++ b/arch/arm64/include/asm/sysreg.h @@ -617,6 +617,9 @@ #define ID_AA64ISAR0_SHA1_SHIFT 8 #define ID_AA64ISAR0_AES_SHIFT 4 +#define ID_AA64ISAR0_TLBI_RANGE_NI 0x0 +#define ID_AA64ISAR0_TLBI_RANGE 0x2 + /* id_aa64isar1 */ #define ID_AA64ISAR1_I8MM_SHIFT 52 #define ID_AA64ISAR1_DGH_SHIFT 48 diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c index e877f56ff1ab..ba0f0ce06fee 100644 --- a/arch/arm64/kernel/cpufeature.c +++ b/arch/arm64/kernel/cpufeature.c @@ 
-2067,6 +2067,16 @@ static const struct arm64_cpu_capabilities arm64_features[] = { .sign = FTR_UNSIGNED, }, #endif + { + .desc = "TLB range maintenance instruction", + .capability = ARM64_HAS_TLBI_RANGE, + .type = ARM64_CPUCAP_SYSTEM_FEATURE, + .matches = has_cpuid_feature, + .sys_reg = SYS_ID_AA64ISAR0_EL1, + .field_pos = ID_AA64ISAR0_TLB_SHIFT, + .sign = FTR_UNSIGNED, + .min_field_value = ID_AA64ISAR0_TLBI_RANGE, + }, {}, }; -- 2.19.1 _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply related [flat|nested] 6+ messages in thread
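The capability entry in this patch matches when the unsigned TLB field of ID_AA64ISAR0_EL1 is at least ID_AA64ISAR0_TLBI_RANGE (0x2). A minimal user-space sketch of that comparison follows. It is not the kernel's has_cpuid_feature(); the function names are hypothetical, and the field position (bits [59:56], 4 bits wide) is an assumption taken from the Arm architecture's register layout, since the patch does not show the value of ID_AA64ISAR0_TLB_SHIFT.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: the TLB field occupies ID_AA64ISAR0_EL1[59:56]. */
#define SKETCH_ISAR0_TLB_SHIFT	56
/* Values as defined in the patch. */
#define SKETCH_TLBI_RANGE_NI	0x0
#define SKETCH_TLBI_RANGE	0x2

/* Extract an unsigned 4-bit ID register field starting at bit 'shift'. */
static inline unsigned int sketch_id_field(uint64_t reg, unsigned int shift)
{
	assert(shift <= 60);
	return (unsigned int)((reg >> shift) & 0xf);
}

/* True when the TLB field meets the minimum value used by the capability. */
static inline bool sketch_has_tlbi_range(uint64_t isar0)
{
	return sketch_id_field(isar0, SKETCH_ISAR0_TLB_SHIFT) >=
	       SKETCH_TLBI_RANGE;
}
```

A field value of 0x1 is below the minimum, so only a value of 0x2 or higher would enable the capability in this sketch.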
* [PATCH v1 2/2] arm64: tlb: Use the TLBI RANGE feature in arm64 2020-07-09 9:10 [PATCH v1 0/2] arm64: tlb: add support for TLBI RANGE instructions Zhenyu Ye 2020-07-09 9:10 ` [PATCH v1 1/2] arm64: tlb: Detect the ARMv8.4 TLBI RANGE feature Zhenyu Ye @ 2020-07-09 9:10 ` Zhenyu Ye 2020-07-09 9:14 ` Zhenyu Ye 2020-07-09 17:36 ` Catalin Marinas 1 sibling, 2 replies; 6+ messages in thread From: Zhenyu Ye @ 2020-07-09 9:10 UTC (permalink / raw) To: catalin.marinas, will, suzuki.poulose, maz, steven.price, guohanjun, olof Cc: linux-arch, yezhenyu2, linux-kernel, xiexiangyou, zhangshaokun, linux-mm, arm, prime.zeng, kuhn.chenqun, linux-arm-kernel Add __TLBI_VADDR_RANGE macro and rewrite __flush_tlb_range(). Signed-off-by: Zhenyu Ye <yezhenyu2@huawei.com> --- arch/arm64/include/asm/tlbflush.h | 156 ++++++++++++++++++++++++------ 1 file changed, 126 insertions(+), 30 deletions(-) diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h index 39aed2efd21b..30e52eae973b 100644 --- a/arch/arm64/include/asm/tlbflush.h +++ b/arch/arm64/include/asm/tlbflush.h @@ -60,6 +60,31 @@ __ta; \ }) +/* + * Get translation granule of the system, which is decided by + * PAGE_SIZE. Used by TTL. + * - 4KB : 1 + * - 16KB : 2 + * - 64KB : 3 + */ +#define TLBI_TTL_TG_4K 1 +#define TLBI_TTL_TG_16K 2 +#define TLBI_TTL_TG_64K 3 + +static inline unsigned long get_trans_granule(void) +{ + switch (PAGE_SIZE) { + case SZ_4K: + return TLBI_TTL_TG_4K; + case SZ_16K: + return TLBI_TTL_TG_16K; + case SZ_64K: + return TLBI_TTL_TG_64K; + default: + return 0; + } +} + /* * Level-based TLBI operations. * @@ -73,29 +98,15 @@ * in asm/stage2_pgtable.h. 
*/ #define TLBI_TTL_MASK GENMASK_ULL(47, 44) -#define TLBI_TTL_TG_4K 1 -#define TLBI_TTL_TG_16K 2 -#define TLBI_TTL_TG_64K 3 #define __tlbi_level(op, addr, level) do { \ u64 arg = addr; \ \ if (cpus_have_const_cap(ARM64_HAS_ARMv8_4_TTL) && \ + !cpus_have_const_cap(ARM64_HAS_TLBI_RANGE) && \ level) { \ u64 ttl = level & 3; \ - \ - switch (PAGE_SIZE) { \ - case SZ_4K: \ - ttl |= TLBI_TTL_TG_4K << 2; \ - break; \ - case SZ_16K: \ - ttl |= TLBI_TTL_TG_16K << 2; \ - break; \ - case SZ_64K: \ - ttl |= TLBI_TTL_TG_64K << 2; \ - break; \ - } \ - \ + ttl |= get_trans_granule() << 2; \ arg &= ~TLBI_TTL_MASK; \ arg |= FIELD_PREP(TLBI_TTL_MASK, ttl); \ } \ @@ -108,6 +119,49 @@ __tlbi_level(op, (arg | USER_ASID_FLAG), level); \ } while (0) +#define __tlbi_last_level(op1, op2, arg, last_level, tlb_level) do { \ + if (last_level) { \ + __tlbi_level(op1, arg, tlb_level); \ + __tlbi_user_level(op1, arg, tlb_level); \ + } else { \ + __tlbi_level(op2, arg, tlb_level); \ + __tlbi_user_level(op2, arg, tlb_level); \ + } \ +} while (0) + +/* + * This macro creates a properly formatted VA operand for the TLBI RANGE. + * The value bit assignments are: + * + * +----------+------+-------+-------+-------+----------------------+ + * | ASID | TG | SCALE | NUM | TTL | BADDR | + * +-----------------+-------+-------+-------+----------------------+ + * |63 48|47 46|45 44|43 39|38 37|36 0| + * + * The address range is determined by below formula: + * [BADDR, BADDR + (NUM + 1) * 2^(5*SCALE + 1) * PAGESIZE) + * + */ +#define __TLBI_VADDR_RANGE(addr, asid, scale, num, ttl) \ + ({ \ + unsigned long __ta = (addr) >> PAGE_SHIFT; \ + __ta &= GENMASK_ULL(36, 0); \ + __ta |= (unsigned long)(ttl) << 37; \ + __ta |= (unsigned long)(num) << 39; \ + __ta |= (unsigned long)(scale) << 44; \ + __ta |= get_trans_granule() << 46; \ + __ta |= (unsigned long)(asid) << 48; \ + __ta; \ + }) + +/* These macros are used by the TLBI RANGE feature. 
*/ +#define __TLBI_RANGE_PAGES(num, scale) (((num) + 1) << (5 * (scale) + 1)) +#define MAX_TLBI_RANGE_PAGES __TLBI_RANGE_PAGES(31, 3) + +#define TLBI_RANGE_MASK GENMASK_ULL(4, 0) +#define __TLBI_RANGE_NUM(range, scale) \ + (((range) >> (5 * (scale) + 1)) & TLBI_RANGE_MASK) + /* * TLB Invalidation * ================ @@ -232,32 +286,74 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma, unsigned long stride, bool last_level, int tlb_level) { + int num = 0; + int scale = 0; unsigned long asid = ASID(vma->vm_mm); unsigned long addr; + unsigned long pages; start = round_down(start, stride); end = round_up(end, stride); + pages = (end - start) >> PAGE_SHIFT; - if ((end - start) >= (MAX_TLBI_OPS * stride)) { + if ((!cpus_have_const_cap(ARM64_HAS_TLBI_RANGE) && + (end - start) >= (MAX_TLBI_OPS * stride)) || + pages >= MAX_TLBI_RANGE_PAGES) { flush_tlb_mm(vma->vm_mm); return; } - /* Convert the stride into units of 4k */ - stride >>= 12; - - start = __TLBI_VADDR(start, asid); - end = __TLBI_VADDR(end, asid); - dsb(ishst); - for (addr = start; addr < end; addr += stride) { - if (last_level) { - __tlbi_level(vale1is, addr, tlb_level); - __tlbi_user_level(vale1is, addr, tlb_level); - } else { - __tlbi_level(vae1is, addr, tlb_level); - __tlbi_user_level(vae1is, addr, tlb_level); + + /* + * When cpu does not support TLBI RANGE feature, we flush the tlb + * entries one by one at the granularity of 'stride'. + * When cpu supports the TLBI RANGE feature, then: + * 1. If pages is odd, flush the first page through non-RANGE + * instruction; + * 2. For remaining pages: The minimum range granularity is decided + * by 'scale', so we can not flush all pages by one instruction + * in some cases. + * + * For example, when the pages = 0xe81a, let's start 'scale' from + * maximum, and find right 'num' for each 'scale': + * + * When scale = 3, we can flush no pages because the minumum + * range is 2^(5*3 + 1) = 0x10000. 
+ * When scale = 2, the minimum range is 2^(5*2 + 1) = 0x800, we can + * flush 0xe800 pages this time, the num = 0xe800/0x800 - 1 = 0x1c. + * Remain pages is 0x1a; + * When scale = 1, the minimum range is 2^(5*1 + 1) = 0x40, no page + * can be flushed. + * When scale = 0, we flush the remaining 0x1a pages, the num = + * 0x1a/0x2 - 1 = 0xd. + * + * However, in most scenarios, the pages = 1 when flush_tlb_range() is + * called. Start from scale = 3 or other proper value (such as scale = + * ilog2(pages)), will incur extra overhead. + * So increase 'scale' from 0 to maximum, the flush order is exactly + * opposite to the example. + */ + while (pages > 0) { + if (cpus_have_const_cap(ARM64_HAS_TLBI_RANGE) && + pages % 2 == 0) { + num = __TLBI_RANGE_NUM(pages, scale) - 1; + if (num >= 0) { + addr = __TLBI_VADDR_RANGE(start, asid, scale, + num, tlb_level); + __tlbi_last_level(rvale1is, rvae1is, addr, + last_level, tlb_level); + start += __TLBI_RANGE_PAGES(num, scale) << PAGE_SHIFT; + pages -= __TLBI_RANGE_PAGES(num, scale); + } + scale++; + continue; } + + addr = __TLBI_VADDR(start, asid); + __tlbi_last_level(vale1is, vae1is, addr, last_level, tlb_level); + start += stride; + pages -= stride >> PAGE_SHIFT; } dsb(ish); } -- 2.19.1 _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply related [flat|nested] 6+ messages in thread
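The loop in __flush_tlb_range() can be exercised outside the kernel to see which operations it would issue. The following is a sketch under stated assumptions: stride equals PAGE_SIZE, the CPU is assumed to support TLBI RANGE, and the SKETCH_ macros are adapted from __TLBI_RANGE_PAGES and __TLBI_RANGE_NUM in the patch. The struct and function names are hypothetical, and nothing here touches a TLB; it only records the (scale, num) pairs.

```c
#include <assert.h>
#include <stddef.h>

/* Adapted from __TLBI_RANGE_PAGES / __TLBI_RANGE_NUM in the patch. */
#define SKETCH_RANGE_MASK		0x1fUL
#define SKETCH_RANGE_PAGES(num, scale)	\
	(((unsigned long)(num) + 1) << (5 * (scale) + 1))
#define SKETCH_RANGE_NUM(range, scale)	\
	(((range) >> (5 * (scale) + 1)) & SKETCH_RANGE_MASK)
#define SKETCH_MAX_RANGE_PAGES		SKETCH_RANGE_PAGES(31, 3)

struct sketch_op {
	int is_range;		/* 1: range op, 0: single-page classic op */
	int scale;
	int num;
	unsigned long pages;	/* pages covered by this op */
};

/*
 * Record the ops the loop would issue for 'pages' pages.
 * Returns the number of ops, or -1 if 'ops' is too small.
 */
static int sketch_plan_flush(unsigned long pages, struct sketch_op *ops,
			     size_t max_ops)
{
	size_t n = 0;
	int scale = 0;

	/* Larger ranges fall back to flush_tlb_mm() in the patch. */
	assert(pages < SKETCH_MAX_RANGE_PAGES);

	while (pages > 0) {
		struct sketch_op op = { 0, 0, 0, 1 };

		if (n == max_ops)
			return -1;

		if (pages % 2 == 1) {
			/* Odd count: one classic (non-range) invalidation. */
			pages -= 1;
		} else {
			int num = (int)SKETCH_RANGE_NUM(pages, scale) - 1;

			if (num < 0) {
				/* Nothing to flush at this scale. */
				scale++;
				continue;
			}
			op.is_range = 1;
			op.scale = scale;
			op.num = num;
			op.pages = SKETCH_RANGE_PAGES(num, scale);
			pages -= op.pages;
			scale++;
		}
		ops[n++] = op;
	}
	return (int)n;
}
```

For pages = 0xe81a this yields two range operations: scale = 0 with num = 0xc, covering (0xc + 1) * 2 = 0x1a pages, then scale = 2 with num = 0x1c, covering (0x1c + 1) * 0x800 = 0xe800 pages. Nothing is issued at scale = 1, and the order runs from the smallest scale upward.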
* Re: [PATCH v1 2/2] arm64: tlb: Use the TLBI RANGE feature in arm64 2020-07-09 9:10 ` [PATCH v1 2/2] arm64: tlb: Use the TLBI RANGE feature in arm64 Zhenyu Ye @ 2020-07-09 9:14 ` Zhenyu Ye 2020-07-09 17:36 ` Catalin Marinas 1 sibling, 0 replies; 6+ messages in thread From: Zhenyu Ye @ 2020-07-09 9:14 UTC (permalink / raw) To: catalin.marinas, will, suzuki.poulose, maz, steven.price, guohanjun, olof Cc: linux-arch, linux-kernel, xiexiangyou, zhangshaokun, linux-mm, arm, prime.zeng, kuhn.chenqun, linux-arm-kernel On 2020/7/9 17:10, Zhenyu Ye wrote: > + /* > + * When cpu does not support TLBI RANGE feature, we flush the tlb > + * entries one by one at the granularity of 'stride'. > + * When cpu supports the TLBI RANGE feature, then: > + * 1. If pages is odd, flush the first page through non-RANGE > + * instruction; > + * 2. For remaining pages: The minimum range granularity is decided > + * by 'scale', so we can not flush all pages by one instruction > + * in some cases. > + * > + * For example, when the pages = 0xe81a, let's start 'scale' from > + * maximum, and find right 'num' for each 'scale': > + * > + * When scale = 3, we can flush no pages because the minumum > + * range is 2^(5*3 + 1) = 0x10000. > + * When scale = 2, the minimum range is 2^(5*2 + 1) = 0x800, we can > + * flush 0xe800 pages this time, the num = 0xe800/0x800 - 1 = 0x1c. > + * Remain pages is 0x1a; > + * When scale = 1, the minimum range is 2^(5*1 + 1) = 0x40, no page > + * can be flushed. > + * When scale = 0, we flush the remaining 0x1a pages, the num = > + * 0x1a/0x2 - 1 = 0xd. > + * > + * However, in most scenarios, the pages = 1 when flush_tlb_range() is > + * called. Start from scale = 3 or other proper value (such as scale = > + * ilog2(pages)), will incur extra overhead. > + * So increase 'scale' from 0 to maximum, the flush order is exactly > + * opposite to the example. > + */ The comments may be too long, probably should be moved to commit messages. 
Thanks, Zhenyu
* Re: [PATCH v1 2/2] arm64: tlb: Use the TLBI RANGE feature in arm64 2020-07-09 9:10 ` [PATCH v1 2/2] arm64: tlb: Use the TLBI RANGE feature in arm64 Zhenyu Ye 2020-07-09 9:14 ` Zhenyu Ye @ 2020-07-09 17:36 ` Catalin Marinas 2020-07-10 6:07 ` Zhenyu Ye 1 sibling, 1 reply; 6+ messages in thread From: Catalin Marinas @ 2020-07-09 17:36 UTC (permalink / raw) To: Zhenyu Ye Cc: linux-arch, suzuki.poulose, maz, linux-kernel, xiexiangyou, steven.price, zhangshaokun, linux-mm, arm, prime.zeng, guohanjun, olof, kuhn.chenqun, will, linux-arm-kernel On Thu, Jul 09, 2020 at 05:10:54PM +0800, Zhenyu Ye wrote: > Add __TLBI_VADDR_RANGE macro and rewrite __flush_tlb_range(). > > Signed-off-by: Zhenyu Ye <yezhenyu2@huawei.com> > --- > arch/arm64/include/asm/tlbflush.h | 156 ++++++++++++++++++++++++------ > 1 file changed, 126 insertions(+), 30 deletions(-) > > diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h > index 39aed2efd21b..30e52eae973b 100644 > --- a/arch/arm64/include/asm/tlbflush.h > +++ b/arch/arm64/include/asm/tlbflush.h > @@ -60,6 +60,31 @@ > __ta; \ > }) > > +/* > + * Get translation granule of the system, which is decided by > + * PAGE_SIZE. Used by TTL. > + * - 4KB : 1 > + * - 16KB : 2 > + * - 64KB : 3 > + */ > +#define TLBI_TTL_TG_4K 1 > +#define TLBI_TTL_TG_16K 2 > +#define TLBI_TTL_TG_64K 3 > + > +static inline unsigned long get_trans_granule(void) > +{ > + switch (PAGE_SIZE) { > + case SZ_4K: > + return TLBI_TTL_TG_4K; > + case SZ_16K: > + return TLBI_TTL_TG_16K; > + case SZ_64K: > + return TLBI_TTL_TG_64K; > + default: > + return 0; > + } > +} > + > /* > * Level-based TLBI operations. > * > @@ -73,29 +98,15 @@ > * in asm/stage2_pgtable.h. 
> */ > #define TLBI_TTL_MASK GENMASK_ULL(47, 44) > -#define TLBI_TTL_TG_4K 1 > -#define TLBI_TTL_TG_16K 2 > -#define TLBI_TTL_TG_64K 3 > > #define __tlbi_level(op, addr, level) do { \ > u64 arg = addr; \ > \ > if (cpus_have_const_cap(ARM64_HAS_ARMv8_4_TTL) && \ > + !cpus_have_const_cap(ARM64_HAS_TLBI_RANGE) && \ > level) { \ > u64 ttl = level & 3; \ > - \ > - switch (PAGE_SIZE) { \ > - case SZ_4K: \ > - ttl |= TLBI_TTL_TG_4K << 2; \ > - break; \ > - case SZ_16K: \ > - ttl |= TLBI_TTL_TG_16K << 2; \ > - break; \ > - case SZ_64K: \ > - ttl |= TLBI_TTL_TG_64K << 2; \ > - break; \ > - } \ > - \ > + ttl |= get_trans_granule() << 2; \ > arg &= ~TLBI_TTL_MASK; \ > arg |= FIELD_PREP(TLBI_TTL_MASK, ttl); \ > } \ I think checking for !ARM64_HAS_TLBI_RANGE here is incorrect. I can see why you attempted this since the range and classic ops have a different position for the level but now you are not passing the TTL at all for the classic TLBI. It's also inconsistent to have the range ops get the level in the addr argument while the classic ops added in the __tlbi_level macro. I'd rather have two sets of macros, __tlbi_level and __tlbi_range_level, called depending on whether you use classic or range ops. > @@ -108,6 +119,49 @@ > __tlbi_level(op, (arg | USER_ASID_FLAG), level); \ > } while (0) > > +#define __tlbi_last_level(op1, op2, arg, last_level, tlb_level) do { \ > + if (last_level) { \ > + __tlbi_level(op1, arg, tlb_level); \ > + __tlbi_user_level(op1, arg, tlb_level); \ > + } else { \ > + __tlbi_level(op2, arg, tlb_level); \ > + __tlbi_user_level(op2, arg, tlb_level); \ > + } \ > +} while (0) And you could drop this altogether. I know it's slightly more lines of code but keeping it expanded in __flush_tlb_range() would be clearer. > + > +/* > + * This macro creates a properly formatted VA operand for the TLBI RANGE. 
> + * The value bit assignments are: > + * > + * +----------+------+-------+-------+-------+----------------------+ > + * | ASID | TG | SCALE | NUM | TTL | BADDR | > + * +-----------------+-------+-------+-------+----------------------+ > + * |63 48|47 46|45 44|43 39|38 37|36 0| > + * > + * The address range is determined by below formula: > + * [BADDR, BADDR + (NUM + 1) * 2^(5*SCALE + 1) * PAGESIZE) > + * > + */ > +#define __TLBI_VADDR_RANGE(addr, asid, scale, num, ttl) \ > + ({ \ > + unsigned long __ta = (addr) >> PAGE_SHIFT; \ > + __ta &= GENMASK_ULL(36, 0); \ > + __ta |= (unsigned long)(ttl) << 37; \ > + __ta |= (unsigned long)(num) << 39; \ > + __ta |= (unsigned long)(scale) << 44; \ > + __ta |= get_trans_granule() << 46; \ > + __ta |= (unsigned long)(asid) << 48; \ > + __ta; \ > + }) As per above, I'd remove the ttl here and just add it in the __tlbi_level_range(). For consistency, you could do the same with num and scale, just leave the asid and addr, similar to __TLBI_VADDR (the only difference is the shift by PAGE_SHIFT rather than 12). > + > +/* These macros are used by the TLBI RANGE feature. 
*/ > +#define __TLBI_RANGE_PAGES(num, scale) (((num) + 1) << (5 * (scale) + 1)) > +#define MAX_TLBI_RANGE_PAGES __TLBI_RANGE_PAGES(31, 3) > + > +#define TLBI_RANGE_MASK GENMASK_ULL(4, 0) > +#define __TLBI_RANGE_NUM(range, scale) \ > + (((range) >> (5 * (scale) + 1)) & TLBI_RANGE_MASK) > + > /* > * TLB Invalidation > * ================ > @@ -232,32 +286,74 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma, > unsigned long stride, bool last_level, > int tlb_level) > { > + int num = 0; > + int scale = 0; > unsigned long asid = ASID(vma->vm_mm); > unsigned long addr; > + unsigned long pages; > > start = round_down(start, stride); > end = round_up(end, stride); > + pages = (end - start) >> PAGE_SHIFT; > > - if ((end - start) >= (MAX_TLBI_OPS * stride)) { > + if ((!cpus_have_const_cap(ARM64_HAS_TLBI_RANGE) && > + (end - start) >= (MAX_TLBI_OPS * stride)) || > + pages >= MAX_TLBI_RANGE_PAGES) { > flush_tlb_mm(vma->vm_mm); > return; > } > > - /* Convert the stride into units of 4k */ > - stride >>= 12; > - > - start = __TLBI_VADDR(start, asid); > - end = __TLBI_VADDR(end, asid); > - > dsb(ishst); > - for (addr = start; addr < end; addr += stride) { > - if (last_level) { > - __tlbi_level(vale1is, addr, tlb_level); > - __tlbi_user_level(vale1is, addr, tlb_level); > - } else { > - __tlbi_level(vae1is, addr, tlb_level); > - __tlbi_user_level(vae1is, addr, tlb_level); > + > + /* > + * When cpu does not support TLBI RANGE feature, we flush the tlb > + * entries one by one at the granularity of 'stride'. > + * When cpu supports the TLBI RANGE feature, then: > + * 1. If pages is odd, flush the first page through non-RANGE > + * instruction; > + * 2. For remaining pages: The minimum range granularity is decided > + * by 'scale', so we can not flush all pages by one instruction > + * in some cases. This part can stay in the code. 
In addition, you could mention something along the lines that it starts from scale 0 (covering num * 2 pages) and increments it until no pages left. > + * > + * For example, when the pages = 0xe81a, let's start 'scale' from > + * maximum, and find right 'num' for each 'scale': > + * > + * When scale = 3, we can flush no pages because the minumum > + * range is 2^(5*3 + 1) = 0x10000. > + * When scale = 2, the minimum range is 2^(5*2 + 1) = 0x800, we can > + * flush 0xe800 pages this time, the num = 0xe800/0x800 - 1 = 0x1c. > + * Remain pages is 0x1a; > + * When scale = 1, the minimum range is 2^(5*1 + 1) = 0x40, no page > + * can be flushed. > + * When scale = 0, we flush the remaining 0x1a pages, the num = > + * 0x1a/0x2 - 1 = 0xd. > + * > + * However, in most scenarios, the pages = 1 when flush_tlb_range() is > + * called. Start from scale = 3 or other proper value (such as scale = > + * ilog2(pages)), will incur extra overhead. > + * So increase 'scale' from 0 to maximum, the flush order is exactly > + * opposite to the example. > + */ I'd drop the example from the code, just move it to the commit log. > + while (pages > 0) { > + if (cpus_have_const_cap(ARM64_HAS_TLBI_RANGE) && > + pages % 2 == 0) { > + num = __TLBI_RANGE_NUM(pages, scale) - 1; > + if (num >= 0) { > + addr = __TLBI_VADDR_RANGE(start, asid, scale, > + num, tlb_level); > + __tlbi_last_level(rvale1is, rvae1is, addr, > + last_level, tlb_level); > + start += __TLBI_RANGE_PAGES(num, scale) << PAGE_SHIFT; > + pages -= __TLBI_RANGE_PAGES(num, scale); > + } > + scale++; > + continue; > } > + > + addr = __TLBI_VADDR(start, asid); > + __tlbi_last_level(vale1is, vae1is, addr, last_level, tlb_level); > + start += stride; > + pages -= stride >> PAGE_SHIFT; > } > dsb(ish); As I mentioned above, just keep the "if (last_level)" expanded here in both cases. Maybe you could place a "pages % 2 == 1" check first to avoid the indentation. 
Something like:

	while (pages > 0) {
		if (!cpus_have_const_cap(ARM64_HAS_TLBI_RANGE) ||
		    pages % 2 == 1) {
			...
			__tlbi_level();
			...
			continue;
		}

		num = __TLBI_RANGE_NUM(pages, scale) - 1;
		if (num >= 0) {
			...
		}
		scale++;
		continue;
	}

Thanks.

-- 
Catalin
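One way to picture the split suggested in this review (a base operand built from addr and asid only, with the range fields added by a separate range-level step) is the user-space sketch below. The bit positions come from the layout diagram in the patch; the function names are hypothetical, 4KB pages are assumed (PAGE_SHIFT = 12, TG = 1), and the 16-bit ASID mask is an added assumption that the patch does not apply.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT	12	/* assumption: 4KB pages */
#define SKETCH_TG_4K		1ULL	/* TLBI_TTL_TG_4K in the patch */

/* Base operand: BADDR in bits [36:0], ASID in bits [63:48]. */
static inline uint64_t sketch_range_vaddr(uint64_t addr, uint64_t asid)
{
	uint64_t ta = (addr >> SKETCH_PAGE_SHIFT) & ((1ULL << 37) - 1);

	return ta | ((asid & 0xffff) << 48);
}

/* Add TTL [38:37], NUM [43:39], SCALE [45:44] and TG [47:46]. */
static inline uint64_t sketch_range_level(uint64_t ta, unsigned int scale,
					  unsigned int num, unsigned int ttl)
{
	assert(scale <= 3 && num <= 31 && ttl <= 3);

	ta |= (uint64_t)ttl << 37;
	ta |= (uint64_t)num << 39;
	ta |= (uint64_t)scale << 44;
	ta |= SKETCH_TG_4K << 46;
	return ta;
}
```

Composing the two, as in sketch_range_level(sketch_range_vaddr(addr, asid), scale, num, ttl), gives the same fields as the single __TLBI_VADDR_RANGE macro while keeping the base operand analogous to __TLBI_VADDR.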
* Re: [PATCH v1 2/2] arm64: tlb: Use the TLBI RANGE feature in arm64
  2020-07-09 17:36 ` Catalin Marinas
@ 2020-07-10 6:07 ` Zhenyu Ye
  0 siblings, 0 replies; 6+ messages in thread
From: Zhenyu Ye @ 2020-07-10 6:07 UTC
To: Catalin Marinas
Cc: linux-arch, suzuki.poulose, maz, linux-kernel, xiexiangyou, steven.price, zhangshaokun, linux-mm, arm, prime.zeng, guohanjun, olof, kuhn.chenqun, will, linux-arm-kernel

Hi Catalin,

On 2020/7/10 1:36, Catalin Marinas wrote:
> On Thu, Jul 09, 2020 at 05:10:54PM +0800, Zhenyu Ye wrote:
>>  #define __tlbi_level(op, addr, level) do { \
>>  	u64 arg = addr; \
>>  	\
>>  	if (cpus_have_const_cap(ARM64_HAS_ARMv8_4_TTL) && \
>> +	    !cpus_have_const_cap(ARM64_HAS_TLBI_RANGE) && \
>>  	    level) { \
>>  		u64 ttl = level & 3; \
>> -		\
>> -		switch (PAGE_SIZE) { \
>> -		case SZ_4K: \
>> -			ttl |= TLBI_TTL_TG_4K << 2; \
>> -			break; \
>> -		case SZ_16K: \
>> -			ttl |= TLBI_TTL_TG_16K << 2; \
>> -			break; \
>> -		case SZ_64K: \
>> -			ttl |= TLBI_TTL_TG_64K << 2; \
>> -			break; \
>> -		} \
>> -		\
>> +		ttl |= get_trans_granule() << 2; \
>>  		arg &= ~TLBI_TTL_MASK; \
>>  		arg |= FIELD_PREP(TLBI_TTL_MASK, ttl); \
>>  	} \
>
> I think checking for !ARM64_HAS_TLBI_RANGE here is incorrect. I can see
> why you attempted this since the range and classic ops have a different
> position for the level but now you are not passing the TTL at all for
> the classic TLBI. It's also inconsistent to have the range ops get the
> level in the addr argument while the classic ops added in the
> __tlbi_level macro.
>

You are right, this is a serious problem. It can be avoided by removing
the check for ARM64_HAS_TLBI_RANGE and dropping __tlbi_last_level: just
call __tlbi() and __tlbi_user() when doing range ops.

> I'd rather have two sets of macros, __tlbi_level and __tlbi_range_level,
> called depending on whether you use classic or range ops.
>

Then we have to add __tlbi_user_range_level, too. And if we move num and
scale out of __TLBI_VADDR_RANGE, the macro will make little sense (addr
and asid could be moved out as well).

The __TLBI_VADDR macro is defined to create a properly formatted VA
operand for the TLBI, so how about adding the level to __TLBI_VADDR,
like this:

	#define __TLBI_VADDR(addr, asid, level) \
	({ \
		unsigned long __ta = (addr) >> 12; \
		__ta &= GENMASK_ULL(43, 0); \
		__ta |= (unsigned long)(asid) << 48; \
		if (cpus_have_const_cap(ARM64_HAS_ARMv8_4_TTL)) { \
			u64 ttl = (get_trans_granule() << 2) | \
				  ((level) & 3); \
			__ta |= ttl << 44; \
		} \
		__ta; \
	})

Then we should make sure __TLBI_VADDR is used for all TLBI operands.
But the related code has changed a lot in this merge window, so I prefer
to do this later, after all of the branches below have been merged:

	git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git kvm-arm64/el2-obj-v4.1
	git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git kvm-arm64/pre-nv-5.9
	git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git for-next/tlbi

For now, keep the range ops taking the level in the addr argument and
the classic ops adding the level in the __tlbi_level macro.

>> @@ -108,6 +119,49 @@
>>  	__tlbi_level(op, (arg | USER_ASID_FLAG), level); \
>>  } while (0)
>>
>> +#define __tlbi_last_level(op1, op2, arg, last_level, tlb_level) do { \
>> +	if (last_level) { \
>> +		__tlbi_level(op1, arg, tlb_level); \
>> +		__tlbi_user_level(op1, arg, tlb_level); \
>> +	} else { \
>> +		__tlbi_level(op2, arg, tlb_level); \
>> +		__tlbi_user_level(op2, arg, tlb_level); \
>> +	} \
>> +} while (0)
>
> And you could drop this altogether. I know it's slightly more lines of
> code but keeping it expanded in __flush_tlb_range() would be clearer.

Thanks,
Zhenyu
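When the TTL value is composed from the granule and the level, C operator precedence matters: `+` binds tighter than `<<`, and `&` binds looser, so an unparenthesized `tg << 2 + level & 3` parses as `(tg << (2 + level)) & 3`. The user-space sketch below shows both forms side by side. It assumes a 4KB granule (TG = 1) and that the TTL feature is present; the function names are hypothetical, and the 16-bit ASID mask is an added assumption.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_TG_4K	1ULL	/* TLBI_TTL_TG_4K in the patch */

/* TTL field: granule in the upper two bits, level in the lower two. */
static inline uint64_t sketch_ttl(uint64_t tg, unsigned int level)
{
	assert(tg <= 3);
	return (tg << 2) | (level & 3);
}

/* How `tg << 2 + level & 3` is parsed when written without parentheses. */
static inline uint64_t sketch_ttl_unparenthesized(uint64_t tg,
						  unsigned int level)
{
	return (tg << (2 + level)) & 3;
}

/* Classic operand: VA[55:12] in bits [43:0], TTL [47:44], ASID [63:48]. */
static inline uint64_t sketch_tlbi_vaddr(uint64_t addr, uint64_t asid,
					 unsigned int level)
{
	uint64_t ta = (addr >> 12) & ((1ULL << 44) - 1);

	ta |= (asid & 0xffff) << 48;
	ta |= sketch_ttl(SKETCH_TG_4K, level) << 44;
	return ta;
}
```

For a 4KB granule and level 3, sketch_ttl() gives 0b0111, whereas the unparenthesized form evaluates to 0, which would silently drop the level hint.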