linux-csky.vger.kernel.org archive mirror
* [PATCH] arm64: asid: Optimize cache_flush for SMT
@ 2019-06-23 16:04 guoren
  2019-06-24 11:40 ` Mark Rutland
  0 siblings, 1 reply; 4+ messages in thread
From: guoren @ 2019-06-23 16:04 UTC (permalink / raw)
  To: julien.grall, arnd, linux-kernel; +Cc: linux-csky, Guo Ren, Catalin Marinas

From: Guo Ren <ren_guo@c-sky.com>

On an SMT+SMP system, the hardware threads of one core may share the
same TLB. Assume the hardware threads are numbered like this:

| 0 1 2 3 | 4 5 6 7 | 8 9 a b | c d e f |
   core1     core2     core3     core4

The current algorithm seems correct for SMT+SMP, but it issues duplicate
local_tlb_flush calls: a local_tlb_flush on one hardware thread also
flushes the other hardware threads' TLB entries in the core's shared TLB.

So we can use a bitmap to reduce duplicate local_tlb_flush calls for SMT.

C-SKY cores don't support SMT, so this patch brings no benefit to C-SKY.
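
For illustration, here is a minimal user-space model of that idea (this
is not kernel code; NR_CPUS, HARTS_PER_CORE and the bitmap below are
hypothetical stand-ins for the patch's flush_pending, harts_per_core and
harts_per_core_mask):

#include <stdio.h>

#define NR_CPUS		16
#define HARTS_PER_CORE	4	/* assumed to be a power of two */

/* Bit n set means CPU n still has a local TLB flush pending. */
static unsigned long flush_pending;

static void flush_on_cpu(unsigned int cpu)
{
	unsigned int base = cpu & ~(HARTS_PER_CORE - 1);	/* first hart of this core */
	unsigned int i;

	if (!(flush_pending & (1UL << cpu)))
		return;

	printf("CPU %u: local TLB flush (covers CPUs %u-%u)\n",
	       cpu, base, base + HARTS_PER_CORE - 1);

	/* One flush served the whole core, so clear every sibling's pending bit. */
	for (i = 0; i < HARTS_PER_CORE; i++)
		flush_pending &= ~(1UL << (base + i));
}

int main(void)
{
	unsigned int cpu;

	/* After rollover every CPU has a flush pending. */
	flush_pending = (1UL << NR_CPUS) - 1;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		flush_on_cpu(cpu);	/* prints only one flush per core */

	return 0;
}

An SMT port would then pass, for example, harts_per_core = 4 instead of
1 to asid_allocator_init().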

Signed-off-by: Guo Ren <ren_guo@c-sky.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Julien Grall <julien.grall@arm.com>
---
 arch/csky/include/asm/asid.h |  4 ++++
 arch/csky/mm/asid.c          | 11 ++++++++++-
 arch/csky/mm/context.c       |  2 +-
 3 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/arch/csky/include/asm/asid.h b/arch/csky/include/asm/asid.h
index ac08b0f..f654492 100644
--- a/arch/csky/include/asm/asid.h
+++ b/arch/csky/include/asm/asid.h
@@ -23,6 +23,9 @@ struct asid_info
 	unsigned int		ctxt_shift;
 	/* Callback to locally flush the context. */
 	void			(*flush_cpu_ctxt_cb)(void);
+	/* To reduce duplicate tlb_flush for SMT */
+	unsigned int		harts_per_core;
+	unsigned int		harts_per_core_mask;
 };
 
 #define NUM_ASIDS(info)			(1UL << ((info)->bits))
@@ -73,6 +76,7 @@ static inline void asid_check_context(struct asid_info *info,
 
 int asid_allocator_init(struct asid_info *info,
 			u32 bits, unsigned int asid_per_ctxt,
+			unsigned int harts_per_core,
 			void (*flush_cpu_ctxt_cb)(void));
 
 #endif
diff --git a/arch/csky/mm/asid.c b/arch/csky/mm/asid.c
index b2e9147..50a983e 100644
--- a/arch/csky/mm/asid.c
+++ b/arch/csky/mm/asid.c
@@ -148,8 +148,13 @@ void asid_new_context(struct asid_info *info, atomic64_t *pasid,
 		atomic64_set(pasid, asid);
 	}
 
-	if (cpumask_test_and_clear_cpu(cpu, &info->flush_pending))
+	if (cpumask_test_cpu(cpu, &info->flush_pending)) {
+		unsigned int i;
+		unsigned int harts_base = cpu & info->harts_per_core_mask;
 		info->flush_cpu_ctxt_cb();
+		for (i = 0; i < info->harts_per_core; i++)
+			cpumask_clear_cpu(harts_base + i, &info->flush_pending);
+	}
 
 	atomic64_set(&active_asid(info, cpu), asid);
 	cpumask_set_cpu(cpu, mm_cpumask(mm));
@@ -162,15 +167,19 @@ void asid_new_context(struct asid_info *info, atomic64_t *pasid,
  * @info: Pointer to the asid allocator structure
  * @bits: Number of ASIDs available
  * @asid_per_ctxt: Number of ASIDs to allocate per-context. ASIDs are
+ * @harts_per_core: Number of hardware threads per core, must be 1, 2, 4, 8, 16 ...
  * allocated contiguously for a given context. This value should be a power of
  * 2.
  */
 int asid_allocator_init(struct asid_info *info,
 			u32 bits, unsigned int asid_per_ctxt,
+			unsigned int harts_per_core,
 			void (*flush_cpu_ctxt_cb)(void))
 {
 	info->bits = bits;
 	info->ctxt_shift = ilog2(asid_per_ctxt);
+	info->harts_per_core = harts_per_core;
+	info->harts_per_core_mask = ~((1 << ilog2(harts_per_core)) - 1);
 	info->flush_cpu_ctxt_cb = flush_cpu_ctxt_cb;
 	/*
 	 * Expect allocation after rollover to fail if we don't have at least
diff --git a/arch/csky/mm/context.c b/arch/csky/mm/context.c
index 0d95bdd..b58523b 100644
--- a/arch/csky/mm/context.c
+++ b/arch/csky/mm/context.c
@@ -30,7 +30,7 @@ static int asids_init(void)
 {
 	BUG_ON(((1 << CONFIG_CPU_ASID_BITS) - 1) <= num_possible_cpus());
 
-	if (asid_allocator_init(&asid_info, CONFIG_CPU_ASID_BITS, 1,
+	if (asid_allocator_init(&asid_info, CONFIG_CPU_ASID_BITS, 1, 1,
 				asid_flush_cpu_ctxt))
 		panic("Unable to initialize ASID allocator for %lu ASIDs\n",
 		      NUM_ASIDS(&asid_info));
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH] arm64: asid: Optimize cache_flush for SMT
  2019-06-23 16:04 [PATCH] arm64: asid: Optimize cache_flush for SMT guoren
@ 2019-06-24 11:40 ` Mark Rutland
  2019-06-24 12:25   ` Guo Ren
  2019-06-25  7:25   ` Palmer Dabbelt
  0 siblings, 2 replies; 4+ messages in thread
From: Mark Rutland @ 2019-06-24 11:40 UTC (permalink / raw)
  To: guoren
  Cc: julien.grall, arnd, linux-kernel, linux-csky, Guo Ren, Catalin Marinas

I'm very confused by this patch. The title says arm64, yet the code is
under arch/csky/, and the code in question refers to HARTs, which IIUC
is RISC-V terminology.

On Mon, Jun 24, 2019 at 12:04:29AM +0800, guoren@kernel.org wrote:
> From: Guo Ren <ren_guo@c-sky.com>
> 
> On an SMT+SMP system, the hardware threads of one core may share the
> same TLB. Assume the hardware threads are numbered like this:
> 
> | 0 1 2 3 | 4 5 6 7 | 8 9 a b | c d e f |
>    core1     core2     core3     core4

Given this is the Linux logical CPU ID rather than a physical CPU ID,
this assumption is not valid. For example, CPUs may be renumbered across
kexec.

Even if this were a physical CPU ID, this doesn't hold on arm64 (e.g.
due to big.LITTLE).
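
If the aim is to skip redundant flushes for CPUs that share a TLB, the
grouping would need to come from the CPU topology rather than from
arithmetic on the logical CPU number. As a rough, untested sketch (and
only on systems whose topology actually records which logical CPUs are
thread siblings), something like this avoids baking in any ID layout:

	if (cpumask_test_cpu(cpu, &info->flush_pending)) {
		unsigned int sibling;

		info->flush_cpu_ctxt_cb();
		/*
		 * Clear the pending bit for every thread sibling of 'cpu',
		 * whatever their logical numbering happens to be.
		 */
		for_each_cpu(sibling, topology_sibling_cpumask(cpu))
			cpumask_clear_cpu(sibling, &info->flush_pending);
	}

Whether a local flush really covers those siblings at all is a separate
question, as below.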

> The current algorithm seems correct for SMT+SMP, but it issues duplicate
> local_tlb_flush calls: a local_tlb_flush on one hardware thread also
> flushes the other hardware threads' TLB entries in the core's shared TLB.

Does any architecture specification mandate that behaviour?

That isn't true for arm64, I have no idea whether RISC-V mandates that,
and as below it seems this is irrelevant on C-SKY.

> So we can use a bitmap to reduce duplicate local_tlb_flush calls for SMT.
> 
> C-SKY cores don't support SMT, so this patch brings no benefit to C-SKY.

As above, this patch is very confusing -- if this doesn't benefit C-SKY,
why modify the C-SKY code?

Thanks,
Mark.


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] arm64: asid: Optimize cache_flush for SMT
  2019-06-24 11:40 ` Mark Rutland
@ 2019-06-24 12:25   ` Guo Ren
  2019-06-25  7:25   ` Palmer Dabbelt
  1 sibling, 0 replies; 4+ messages in thread
From: Guo Ren @ 2019-06-24 12:25 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Julien Grall, Arnd Bergmann, linux-kernel, linux-csky, Guo Ren,
	Catalin Marinas

On Mon, Jun 24, 2019 at 7:40 PM Mark Rutland <mark.rutland@arm.com> wrote:
>
> I'm very confused by this patch. The title says arm64, yet the code is
> under arch/csky/, and the code in question refers to HARTs, which IIUC
> is RISC-V terminology.
This patch was posted to answer Catalin's question:
> While the algorithm may seem fairly generic, the semantics have a few
> corner cases specific to each architecture. See [1] for a description of
> the semantics we need on arm64 (CnP is a feature where the hardware
> threads of the same core can share the TLB; the original algorithm
> violated the requirements when this feature was enabled).
Here is my reply to Catalin:
C-SKY SMP has only one hart per core, but here is a patch [1] with my
thoughts on the SMT duplicate tlb flush:
[1] https://lore.kernel.org/linux-csky/1561305869-18872-1-git-send-email-guoren@kernel.org/T/#u

Our discussion is in this thread:
https://lore.kernel.org/linux-arm-kernel/20190624102209.ngwtosgr5fvp3ler@willie-the-truck/T/#m92396a2f238c9eece660cdc0f275e787531d4ec1

>
> On Mon, Jun 24, 2019 at 12:04:29AM +0800, guoren@kernel.org wrote:
> > From: Guo Ren <ren_guo@c-sky.com>
> >
> > On an SMT+SMP system, the hardware threads of one core may share the
> > same TLB. Assume the hardware threads are numbered like this:
> >
> > | 0 1 2 3 | 4 5 6 7 | 8 9 a b | c d e f |
> >    core1     core2     core3     core4
>
> Given this is the Linux logical CPU ID rather than a physical CPU ID,
> this assumption is not valid. For example, CPUs may be renumbered across
> kexec.
>
> Even if this were a physical CPU ID, this doesn't hold on arm64 (e.g.
> due to big.LITTLE).
That's OK for csky: the C-SKY SMP logical CPU ID is the same as the physical one.

>
> > The current algorithm seems correct for SMT+SMP, but it issues duplicate
> > local_tlb_flush calls: a local_tlb_flush on one hardware thread also
> > flushes the other hardware threads' TLB entries in the core's shared TLB.
>
> Does any architecture specification mandate that behaviour?
>
> That isn't true for arm64, I have no idea whether RISC-V mandates that,
> and as below it seems this is irrelevant on C-SKY.
Harts in one core share the same TLB, and I think one hart flushing its TLB
also affects the other harts in the same core. So we only need one TLB flush
per core.

>
> > So we can use a bitmap to reduce duplicate local_tlb_flush calls for SMT.
> >
> > C-SKY cores don't support SMT, so this patch brings no benefit to C-SKY.
>
> As above, this patch is very confusing -- if this doesn't benefit C-SKY,
> why modify the C-SKY code?
Ditto, it's for Catalin's question, and the patch compiles for csky.

Best Regards
 Guo Ren

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] arm64: asid: Optimize cache_flush for SMT
  2019-06-24 11:40 ` Mark Rutland
  2019-06-24 12:25   ` Guo Ren
@ 2019-06-25  7:25   ` Palmer Dabbelt
  1 sibling, 0 replies; 4+ messages in thread
From: Palmer Dabbelt @ 2019-06-25  7:25 UTC (permalink / raw)
  To: mark.rutland
  Cc: guoren, julien.grall, Arnd Bergmann, linux-kernel, linux-csky,
	ren_guo, catalin.marinas

On Mon, 24 Jun 2019 04:40:10 PDT (-0700), mark.rutland@arm.com wrote:
> I'm very confused by this patch. The title says arm64, yet the code is
> under arch/csky/, and the code in question refers to HARTs, which IIUC
> is RISC-V terminology.
>
> On Mon, Jun 24, 2019 at 12:04:29AM +0800, guoren@kernel.org wrote:
>> From: Guo Ren <ren_guo@c-sky.com>
>>
>> On an SMT+SMP system, the hardware threads of one core may share the
>> same TLB. Assume the hardware threads are numbered like this:
>>
>> | 0 1 2 3 | 4 5 6 7 | 8 9 a b | c d e f |
>>    core1     core2     core3     core4
>
> Given this is the Linux logical CPU ID rather than a physical CPU ID,
> this assumption is not valid. For example, CPUs may be renumbered across
> kexec.
>
> Even if this were a physical CPU ID, this doesn't hold on arm64 (e.g.
> due to big.LITTLE).
>
>> The current algorithm seems correct for SMT+SMP, but it issues duplicate
>> local_tlb_flush calls: a local_tlb_flush on one hardware thread also
>> flushes the other hardware threads' TLB entries in the core's shared TLB.
>
> Does any architecture specification mandate that behaviour?
>
> That isn't true for arm64, I have no idea whether RISC-V mandates that,
> and as below it seems this is irrelevant on C-SKY.

There is no event defined by RISC-V that ever requires implementations to
flush the TLB of more than one hart at a time.  There is also nothing in the
normative text of the RISC-V manuals that allows for any differentiation
between multiple threads on a single core and multiple cores (though I am about
to suggest adding two, against my will :)).

>> So we can use a bitmap to reduce duplicate local_tlb_flush calls for SMT.
>>
>> C-SKY cores don't support SMT, so this patch brings no benefit to C-SKY.
>
> As above, this patch is very confusing -- if this doesn't benefit C-SKY,
> why modify the C-SKY code?
>
> Thanks,
> Mark.
>

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2019-06-25  7:25 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-06-23 16:04 [PATCH] arm64: asid: Optimize cache_flush for SMT guoren
2019-06-24 11:40 ` Mark Rutland
2019-06-24 12:25   ` Guo Ren
2019-06-25  7:25   ` Palmer Dabbelt
