* [PATCH] arm64: asid: Optimize cache_flush for SMT
From: guoren @ 2019-06-23 16:04 UTC
To: julien.grall, arnd, linux-kernel; +Cc: linux-csky, Guo Ren, Catalin Marinas

From: Guo Ren <ren_guo@c-sky.com>

The hardware threads of one core could share the same TLB for SMT+SMP
system. Assume hardware threads number sequence like this:

| 0 1 2 3 | 4 5 6 7 | 8 9 a b | c d e f |
   core1     core2     core3     core4

Current algorithm seems is correct for SMT+SMP, but it'll give some
duplicate local_tlb_flush. Because one hardware threads local_tlb_flush
will also flush other hardware threads' TLB entry in one core TLB.

So we can use bitmap to reduce local_tlb_flush for SMT.

C-SKY cores don't support SMT and the patch is no benefit for C-SKY.

Signed-off-by: Guo Ren <ren_guo@c-sky.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Julien Grall <julien.grall@arm.com>
---
 arch/csky/include/asm/asid.h |  4 ++++
 arch/csky/mm/asid.c          | 11 ++++++++++-
 arch/csky/mm/context.c       |  2 +-
 3 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/arch/csky/include/asm/asid.h b/arch/csky/include/asm/asid.h
index ac08b0f..f654492 100644
--- a/arch/csky/include/asm/asid.h
+++ b/arch/csky/include/asm/asid.h
@@ -23,6 +23,9 @@ struct asid_info
 	unsigned int ctxt_shift;
 	/* Callback to locally flush the context. */
 	void (*flush_cpu_ctxt_cb)(void);
+	/* To reduce duplicate tlb_flush for SMT */
+	unsigned int harts_per_core;
+	unsigned int harts_per_core_mask;
 };
 
 #define NUM_ASIDS(info) (1UL << ((info)->bits))
@@ -73,6 +76,7 @@ static inline void asid_check_context(struct asid_info *info,
 
 int asid_allocator_init(struct asid_info *info,
 			u32 bits, unsigned int asid_per_ctxt,
+			unsigned int harts_per_core,
 			void (*flush_cpu_ctxt_cb)(void));
 
 #endif
diff --git a/arch/csky/mm/asid.c b/arch/csky/mm/asid.c
index b2e9147..50a983e 100644
--- a/arch/csky/mm/asid.c
+++ b/arch/csky/mm/asid.c
@@ -148,8 +148,13 @@ void asid_new_context(struct asid_info *info, atomic64_t *pasid,
 		atomic64_set(pasid, asid);
 	}
 
-	if (cpumask_test_and_clear_cpu(cpu, &info->flush_pending))
+	if (cpumask_test_cpu(cpu, &info->flush_pending)) {
+		unsigned int i;
+		unsigned int harts_base = cpu & info->harts_per_core_mask;
 		info->flush_cpu_ctxt_cb();
+		for (i = 0; i < info->harts_per_core; i++)
+			cpumask_clear_cpu(harts_base + i, &info->flush_pending);
+	}
 
 	atomic64_set(&active_asid(info, cpu), asid);
 	cpumask_set_cpu(cpu, mm_cpumask(mm));
@@ -162,15 +167,19 @@ void asid_new_context(struct asid_info *info, atomic64_t *pasid,
  * @info: Pointer to the asid allocator structure
  * @bits: Number of ASIDs available
  * @asid_per_ctxt: Number of ASIDs to allocate per-context. ASIDs are
+ * @harts_per_core: Number hardware threads per core, must be 1, 2, 4, 8, 16 ...
  * allocated contiguously for a given context. This value should be a power of
  * 2.
  */
 int asid_allocator_init(struct asid_info *info,
 			u32 bits, unsigned int asid_per_ctxt,
+			unsigned int harts_per_core,
 			void (*flush_cpu_ctxt_cb)(void))
 {
 	info->bits = bits;
 	info->ctxt_shift = ilog2(asid_per_ctxt);
+	info->harts_per_core = harts_per_core;
+	info->harts_per_core_mask = ~((1 << ilog2(harts_per_core)) - 1);
 	info->flush_cpu_ctxt_cb = flush_cpu_ctxt_cb;
 	/*
 	 * Expect allocation after rollover to fail if we don't have at least
diff --git a/arch/csky/mm/context.c b/arch/csky/mm/context.c
index 0d95bdd..b58523b 100644
--- a/arch/csky/mm/context.c
+++ b/arch/csky/mm/context.c
@@ -30,7 +30,7 @@ static int asids_init(void)
 {
 	BUG_ON(((1 << CONFIG_CPU_ASID_BITS) - 1) <= num_possible_cpus());
 
-	if (asid_allocator_init(&asid_info, CONFIG_CPU_ASID_BITS, 1,
+	if (asid_allocator_init(&asid_info, CONFIG_CPU_ASID_BITS, 1, 1,
 				asid_flush_cpu_ctxt))
 		panic("Unable to initialize ASID allocator for %lu ASIDs\n",
 		      NUM_ASIDS(&asid_info));
-- 
2.7.4
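
The mask arithmetic in the patch can be tried outside the kernel. The following is a minimal user-space sketch, not the kernel code: `toy_cpumask_t`, `toy_harts_mask()` and `toy_flush_if_pending()` are hypothetical names, a 32-bit word stands in for the kernel cpumask, and the sketch assumes (as the patch does) that `harts_per_core` is a power of two and that the harts of a core have contiguous logical IDs.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel cpumask: one bit per logical CPU. */
typedef uint32_t toy_cpumask_t;

/*
 * Equivalent to the patch's ~((1 << ilog2(harts_per_core)) - 1) when
 * harts_per_core is a power of two (assumed here, not checked).
 */
static unsigned int toy_harts_mask(unsigned int harts_per_core)
{
	return ~(harts_per_core - 1u);
}

/*
 * Mirrors the patched branch in asid_new_context(): if this CPU has a
 * flush pending, flush once and clear the pending bit of every hart in
 * the same core. Returns 1 if a flush would be issued, 0 otherwise.
 */
static int toy_flush_if_pending(toy_cpumask_t *pending, unsigned int cpu,
				unsigned int harts_per_core)
{
	unsigned int base = cpu & toy_harts_mask(harts_per_core);
	unsigned int i;

	if (!(*pending & (1u << cpu)))
		return 0;

	/* The kernel code calls info->flush_cpu_ctxt_cb() at this point. */
	for (i = 0; i < harts_per_core; i++)
		*pending &= ~(1u << (base + i));
	return 1;
}
```

With 16 pending CPUs and four harts per core, a flush on CPU 5 clears bits 4 to 7, so a later call for CPU 6 finds nothing pending. With `harts_per_core` set to 1, as in the C-SKY call site, only the caller's own bit is cleared, which matches the behaviour before the patch.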
* Re: [PATCH] arm64: asid: Optimize cache_flush for SMT
From: Mark Rutland @ 2019-06-24 11:40 UTC
To: guoren
Cc: julien.grall, arnd, linux-kernel, linux-csky, Guo Ren, Catalin Marinas

I'm very confused by this patch. The title says arm64, yet the code is
under arch/csky/, and the code in question refers to HARTs, which IIUC
is RISC-V terminology.

On Mon, Jun 24, 2019 at 12:04:29AM +0800, guoren@kernel.org wrote:
> From: Guo Ren <ren_guo@c-sky.com>
>
> The hardware threads of one core could share the same TLB for SMT+SMP
> system. Assume hardware threads number sequence like this:
>
> | 0 1 2 3 | 4 5 6 7 | 8 9 a b | c d e f |
>    core1     core2     core3     core4

Given this is the Linux logical CPU ID rather than a physical CPU ID,
this assumption is not valid. For example, CPUs may be renumbered across
kexec.

Even if this were a physical CPU ID, this doesn't hold on arm64 (e.g.
due to big.LITTLE).

> Current algorithm seems is correct for SMT+SMP, but it'll give some
> duplicate local_tlb_flush. Because one hardware threads local_tlb_flush
> will also flush other hardware threads' TLB entry in one core TLB.

Does any architecture specification mandate that behaviour?

That isn't true for arm64, I have no idea whether RISC-V mandates that,
and as below it seems this is irrelevant on C-SKY.

> So we can use bitmap to reduce local_tlb_flush for SMT.
>
> C-SKY cores don't support SMT and the patch is no benefit for C-SKY.

As above, this patch is very confusing -- if this doesn't benefit C-SKY,
why modify the C-SKY code?

Thanks,
Mark.
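
The objection about logical CPU IDs can be made concrete with a small variation. This is a sketch under stated assumptions, not a proposed fix: `toy_siblings` is a hypothetical table listing, for each logical CPU, the CPUs that share its core, and the pairing chosen here (0/4, 1/5, 2/6, 3/7) is just one layout that `cpu & mask` arithmetic cannot describe. In the kernel, sibling information of this kind comes from the CPU topology code (for example `topology_sibling_cpumask()`). Whether a flush on one hart really covers its siblings remains a separate question that the table does not answer.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_NR_CPUS 8

/* Hypothetical stand-in for the kernel cpumask: one bit per logical CPU. */
typedef uint32_t toy_cpumask_t;

/*
 * Hypothetical sibling table: entry n has a bit set for every logical CPU
 * that shares a core with CPU n. Siblings are not contiguous here.
 */
static const toy_cpumask_t toy_siblings[TOY_NR_CPUS] = {
	0x11, 0x22, 0x44, 0x88, 0x11, 0x22, 0x44, 0x88,
};

/*
 * Same idea as the patch, but the set of bits to clear is looked up
 * instead of being computed from the logical CPU number.
 * Returns 1 if a flush would be issued, 0 otherwise.
 */
static int toy_flush_if_pending_topo(toy_cpumask_t *pending, unsigned int cpu)
{
	if (!(*pending & (1u << cpu)))
		return 0;

	/* A real implementation would flush the local TLB context here. */
	*pending &= ~toy_siblings[cpu];
	return 1;
}
```

Starting with all eight CPUs pending, a flush on CPU 1 also clears CPU 5, so CPU 5 skips its flush, while CPU 4 still flushes and clears CPU 0 along with itself.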
* Re: [PATCH] arm64: asid: Optimize cache_flush for SMT
From: Guo Ren @ 2019-06-24 12:25 UTC
To: Mark Rutland
Cc: Julien Grall, Arnd Bergmann, linux-kernel, linux-csky, Guo Ren, Catalin Marinas

On Mon, Jun 24, 2019 at 7:40 PM Mark Rutland <mark.rutland@arm.com> wrote:
>
> I'm very confused by this patch. The title says arm64, yet the code is
> under arch/csky/, and the code in question refers to HARTs, which IIUC
> is RISC-V terminology.

This patch is meant to answer Catalin's question:

> While the algorithm may seem fairly generic, the semantics have a few
> corner cases specific to each architecture. See [1] for a description of
> the semantics we need on arm64 (CnP is a feature where the hardware
> threads of the same core can share the TLB; the original algorithm
> violated the requirements when this feature was enabled).

Here is my reply to Catalin:
C-SKY SMP has only one hart per core, but here is a patch [1] with my
thoughts on duplicate TLB flushes for SMT:

[1] https://lore.kernel.org/linux-csky/1561305869-18872-1-git-send-email-guoren@kernel.org/T/#u

Our discussion is in this thread:
https://lore.kernel.org/linux-arm-kernel/20190624102209.ngwtosgr5fvp3ler@willie-the-truck/T/#m92396a2f238c9eece660cdc0f275e787531d4ec1

>
> On Mon, Jun 24, 2019 at 12:04:29AM +0800, guoren@kernel.org wrote:
> > From: Guo Ren <ren_guo@c-sky.com>
> >
> > The hardware threads of one core could share the same TLB for SMT+SMP
> > system. Assume hardware threads number sequence like this:
> >
> > | 0 1 2 3 | 4 5 6 7 | 8 9 a b | c d e f |
> >    core1     core2     core3     core4
>
> Given this is the Linux logical CPU ID rather than a physical CPU ID,
> this assumption is not valid. For example, CPUs may be renumbered across
> kexec.
>
> Even if this were a physical CPU ID, this doesn't hold on arm64 (e.g.
> due to big.LITTLE).

That's OK for csky: the C-SKY SMP logical CPU ID is the same as the
physical one.

>
> > Current algorithm seems is correct for SMT+SMP, but it'll give some
> > duplicate local_tlb_flush. Because one hardware threads local_tlb_flush
> > will also flush other hardware threads' TLB entry in one core TLB.
>
> Does any architecture specification mandate that behaviour?
>
> That isn't true for arm64, I have no idea whether RISC-V mandates that,
> and as below it seems this is irrelevant on C-SKY.

Harts in one core share the same TLB, and I think one hart flushing the
TLB will also affect the other harts in the same core. So we need only
one TLB flush per core.

>
> > So we can use bitmap to reduce local_tlb_flush for SMT.
> >
> > C-SKY cores don't support SMT and the patch is no benefit for C-SKY.
>
> As above, this patch is very confusing -- if this doesn't benefit C-SKY,
> why modify the C-SKY code?

Ditto: it's for Catalin's question, and this patch was compiled for csky.

Best Regards
Guo Ren
* Re: [PATCH] arm64: asid: Optimize cache_flush for SMT
From: Palmer Dabbelt @ 2019-06-25 7:25 UTC
To: mark.rutland
Cc: guoren, julien.grall, Arnd Bergmann, linux-kernel, linux-csky, ren_guo, catalin.marinas

On Mon, 24 Jun 2019 04:40:10 PDT (-0700), mark.rutland@arm.com wrote:
> I'm very confused by this patch. The title says arm64, yet the code is
> under arch/csky/, and the code in question refers to HARTs, which IIUC
> is RISC-V terminology.
>
> On Mon, Jun 24, 2019 at 12:04:29AM +0800, guoren@kernel.org wrote:
>> From: Guo Ren <ren_guo@c-sky.com>
>>
>> The hardware threads of one core could share the same TLB for SMT+SMP
>> system. Assume hardware threads number sequence like this:
>>
>> | 0 1 2 3 | 4 5 6 7 | 8 9 a b | c d e f |
>>    core1     core2     core3     core4
>
> Given this is the Linux logical CPU ID rather than a physical CPU ID,
> this assumption is not valid. For example, CPUs may be renumbered across
> kexec.
>
> Even if this were a physical CPU ID, this doesn't hold on arm64 (e.g.
> due to big.LITTLE).
>
>> Current algorithm seems is correct for SMT+SMP, but it'll give some
>> duplicate local_tlb_flush. Because one hardware threads local_tlb_flush
>> will also flush other hardware threads' TLB entry in one core TLB.
>
> Does any architecture specification mandate that behaviour?
>
> That isn't true for arm64, I have no idea whether RISC-V mandates that,
> and as below it seems this is irrelevant on C-SKY.

There is no event defined by RISC-V that ever requires implementations flush
the TLB of more than one hart at a time. There is also nothing in the
normative text of the RISC-V manuals that allows for any differentiation
between multiple threads on a single core and multiple cores (though I am
about to suggest adding two, against my will :)).

>> So we can use bitmap to reduce local_tlb_flush for SMT.