linux-riscv.lists.infradead.org archive mirror
 help / color / mirror / Atom feed
* [PATCH -next] arch_topology: Fix cache attributes detection in the CPU hotplug path
@ 2022-07-13 13:33 Sudeep Holla
  2022-07-13 14:03 ` Greg Kroah-Hartman
                   ` (4 more replies)
  0 siblings, 5 replies; 17+ messages in thread
From: Sudeep Holla @ 2022-07-13 13:33 UTC (permalink / raw)
  To: linux-kernel, conor.dooley
  Cc: Sudeep Holla, Greg Kroah-Hartman, Vincent Guittot,
	Dietmar Eggemann, Ionela Voinescu, Pierre Gondois,
	linux-arm-kernel, linux-riscv

init_cpu_topology() is called only once at the boot and all the cache
attributes are detected early for all the possible CPUs. However when
the CPUs are hotplugged out, the cacheinfo gets removed. While the
attributes are added back when the CPUs are hotplugged back in as part
of CPU hotplug state machine, it ends up called quite late after the
update_siblings_masks() are called in the secondary_start_kernel()
resulting in wrong llc_sibling_masks.

Move the call to detect_cache_attributes() inside update_siblings_masks()
to ensure the cacheinfo is updated before the LLC sibling masks are
updated. This will fix the incorrect LLC sibling masks generated when
the CPUs are hotplugged out and hotplugged back in again.

Reported-by: Ionela Voinescu <ionela.voinescu@arm.com>
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
---
 drivers/base/arch_topology.c | 16 ++++++----------
 1 file changed, 6 insertions(+), 10 deletions(-)

Hi Conor,

Ionela reported an issue with the CPU hotplug and as a fix I need to
move the call to detect_cache_attributes() which I had thought to keep
it there from first but for no reason had moved it to init_cpu_topology().

Wonder if this fixes the -ENOMEM on RISC-V as this one is called on the
cpu in the secondary CPUs init path while init_cpu_topology executed
detect_cache_attributes() for all possible CPUs much earlier. I think
this might help as the percpu memory might be initialised in this case.

Anyways give this a try, also test the CPU hotplug and check if nothing
is broken on RISC-V. We noticed this bug only on one platform while

Regards,
Sudeep

diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index 441e14ac33a4..0424b59b695e 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -732,7 +732,11 @@ const struct cpumask *cpu_clustergroup_mask(int cpu)
 void update_siblings_masks(unsigned int cpuid)
 {
 	struct cpu_topology *cpu_topo, *cpuid_topo = &cpu_topology[cpuid];
-	int cpu;
+	int cpu, ret;
+
+	ret = detect_cache_attributes(cpuid);
+	if (ret)
+		pr_info("Early cacheinfo failed, ret = %d\n", ret);
 	/* update core and thread sibling masks */
 	for_each_online_cpu(cpu) {
@@ -821,7 +825,7 @@ __weak int __init parse_acpi_topology(void)
 #if defined(CONFIG_ARM64) || defined(CONFIG_RISCV)
 void __init init_cpu_topology(void)
 {
-	int ret, cpu;
+	int ret;
 	reset_cpu_topology();
 	ret = parse_acpi_topology();
@@ -836,13 +840,5 @@ void __init init_cpu_topology(void)
 		reset_cpu_topology();
 		return;
 	}
-
-	for_each_possible_cpu(cpu) {
-		ret = detect_cache_attributes(cpu);
-		if (ret) {
-			pr_info("Early cacheinfo failed, ret = %d\n", ret);
-			break;
-		}
-	}
 }
 #endif
--2.37.1


_______________________________________________
linux-riscv mailing list
linux-riscv@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-riscv

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH -next] arch_topology: Fix cache attributes detection in the CPU hotplug path
  2022-07-13 13:33 [PATCH -next] arch_topology: Fix cache attributes detection in the CPU hotplug path Sudeep Holla
@ 2022-07-13 14:03 ` Greg Kroah-Hartman
  2022-07-13 14:18   ` Sudeep Holla
  2022-07-13 16:04 ` Conor.Dooley
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 17+ messages in thread
From: Greg Kroah-Hartman @ 2022-07-13 14:03 UTC (permalink / raw)
  To: Sudeep Holla
  Cc: linux-kernel, conor.dooley, Vincent Guittot, Dietmar Eggemann,
	Ionela Voinescu, Pierre Gondois, linux-arm-kernel, linux-riscv

On Wed, Jul 13, 2022 at 02:33:44PM +0100, Sudeep Holla wrote:
> init_cpu_topology() is called only once at the boot and all the cache
> attributes are detected early for all the possible CPUs. However when
> the CPUs are hotplugged out, the cacheinfo gets removed. While the
> attributes are added back when the CPUs are hotplugged back in as part
> of CPU hotplug state machine, it ends up called quite late after the
> update_siblings_masks() are called in the secondary_start_kernel()
> resulting in wrong llc_sibling_masks.
> 
> Move the call to detect_cache_attributes() inside update_siblings_masks()
> to ensure the cacheinfo is updated before the LLC sibling masks are
> updated. This will fix the incorrect LLC sibling masks generated when
> the CPUs are hotplugged out and hotplugged back in again.
> 
> Reported-by: Ionela Voinescu <ionela.voinescu@arm.com>
> Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
> ---
>  drivers/base/arch_topology.c | 16 ++++++----------
>  1 file changed, 6 insertions(+), 10 deletions(-)
> 
> Hi Conor,
> 
> Ionela reported an issue with the CPU hotplug and as a fix I need to
> move the call to detect_cache_attributes() which I had thought to keep
> it there from first but for no reason had moved it to init_cpu_topology().
> 
> Wonder if this fixes the -ENOMEM on RISC-V as this one is called on the
> cpu in the secondary CPUs init path while init_cpu_topology executed
> detect_cache_attributes() for all possible CPUs much earlier. I think
> this might help as the percpu memory might be initialised in this case.
> 
> Anyways give this a try, also test the CPU hotplug and check if nothing
> is broken on RISC-V. We noticed this bug only on one platform while
> 
> Regards,
> Sudeep
> 
> diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
> index 441e14ac33a4..0424b59b695e 100644
> --- a/drivers/base/arch_topology.c
> +++ b/drivers/base/arch_topology.c
> @@ -732,7 +732,11 @@ const struct cpumask *cpu_clustergroup_mask(int cpu)
>  void update_siblings_masks(unsigned int cpuid)
>  {
>  	struct cpu_topology *cpu_topo, *cpuid_topo = &cpu_topology[cpuid];
> -	int cpu;
> +	int cpu, ret;
> +
> +	ret = detect_cache_attributes(cpuid);
> +	if (ret)
> +		pr_info("Early cacheinfo failed, ret = %d\n", ret);

No erroring out?

thanks,

greg k-h

_______________________________________________
linux-riscv mailing list
linux-riscv@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-riscv

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH -next] arch_topology: Fix cache attributes detection in the CPU hotplug path
  2022-07-13 14:03 ` Greg Kroah-Hartman
@ 2022-07-13 14:18   ` Sudeep Holla
  0 siblings, 0 replies; 17+ messages in thread
From: Sudeep Holla @ 2022-07-13 14:18 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: linux-kernel, conor.dooley, Vincent Guittot, Dietmar Eggemann,
	Ionela Voinescu, Pierre Gondois, linux-arm-kernel, linux-riscv

On Wed, Jul 13, 2022 at 04:03:56PM +0200, Greg Kroah-Hartman wrote:
> On Wed, Jul 13, 2022 at 02:33:44PM +0100, Sudeep Holla wrote:
> > init_cpu_topology() is called only once at the boot and all the cache
> > attributes are detected early for all the possible CPUs. However when
> > the CPUs are hotplugged out, the cacheinfo gets removed. While the
> > attributes are added back when the CPUs are hotplugged back in as part
> > of CPU hotplug state machine, it ends up called quite late after the
> > update_siblings_masks() are called in the secondary_start_kernel()
> > resulting in wrong llc_sibling_masks.
> > 
> > Move the call to detect_cache_attributes() inside update_siblings_masks()
> > to ensure the cacheinfo is updated before the LLC sibling masks are
> > updated. This will fix the incorrect LLC sibling masks generated when
> > the CPUs are hotplugged out and hotplugged back in again.
> > 
> > Reported-by: Ionela Voinescu <ionela.voinescu@arm.com>
> > Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
> > ---
> >  drivers/base/arch_topology.c | 16 ++++++----------
> >  1 file changed, 6 insertions(+), 10 deletions(-)
> > 
> > Hi Conor,
> > 
> > Ionela reported an issue with the CPU hotplug and as a fix I need to
> > move the call to detect_cache_attributes() which I had thought to keep
> > it there from first but for no reason had moved it to init_cpu_topology().
> > 
> > Wonder if this fixes the -ENOMEM on RISC-V as this one is called on the
> > cpu in the secondary CPUs init path while init_cpu_topology executed
> > detect_cache_attributes() for all possible CPUs much earlier. I think
> > this might help as the percpu memory might be initialised in this case.
> > 
> > Anyways give this a try, also test the CPU hotplug and check if nothing
> > is broken on RISC-V. We noticed this bug only on one platform while
> > 
> > Regards,
> > Sudeep
> > 
> > diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
> > index 441e14ac33a4..0424b59b695e 100644
> > --- a/drivers/base/arch_topology.c
> > +++ b/drivers/base/arch_topology.c
> > @@ -732,7 +732,11 @@ const struct cpumask *cpu_clustergroup_mask(int cpu)
> >  void update_siblings_masks(unsigned int cpuid)
> >  {
> >  	struct cpu_topology *cpu_topo, *cpuid_topo = &cpu_topology[cpuid];
> > -	int cpu;
> > +	int cpu, ret;
> > +
> > +	ret = detect_cache_attributes(cpuid);
> > +	if (ret)
> > +		pr_info("Early cacheinfo failed, ret = %d\n", ret);
> 
> No erroring out?
> 

No, this is optional as not all platforms have cacheinfo in the DT and also
the scheduler must work even without the cache information. It may not produce
optimal performance but it must work.

Also we have seen on one RISC-V platform with probably low percpu allocation,
the early detection fails, but it works just fine later device_initcall().
That was the main reason for adding error log, but the idea is to continue
building the information for the scheduler domains even if the LLC information
can't be obtained. In case of failure, we assume all CPUs have only private
caches and no shared LLC.

Hope that makes sense. Let me know if you prefer to drop the error log or
anything else. I just added as we found cases of -ENOMEM on RISC-V and we
want to highlight that.

-- 
Regards,
Sudeep

_______________________________________________
linux-riscv mailing list
linux-riscv@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-riscv

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH -next] arch_topology: Fix cache attributes detection in the CPU hotplug path
  2022-07-13 13:33 [PATCH -next] arch_topology: Fix cache attributes detection in the CPU hotplug path Sudeep Holla
  2022-07-13 14:03 ` Greg Kroah-Hartman
@ 2022-07-13 16:04 ` Conor.Dooley
  2022-07-14 14:17 ` Conor.Dooley
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 17+ messages in thread
From: Conor.Dooley @ 2022-07-13 16:04 UTC (permalink / raw)
  To: sudeep.holla, linux-kernel
  Cc: gregkh, vincent.guittot, dietmar.eggemann, ionela.voinescu,
	pierre.gondois, linux-arm-kernel, linux-riscv

On 13/07/2022 14:33, Sudeep Holla wrote:
> Hi Conor,
> 
> Ionela reported an issue with the CPU hotplug and as a fix I need to
> move the call to detect_cache_attributes() which I had thought to keep
> it there from first but for no reason had moved it to init_cpu_topology().
> 
> Wonder if this fixes the -ENOMEM on RISC-V as this one is called on the
> cpu in the secondary CPUs init path while init_cpu_topology executed
> detect_cache_attributes() for all possible CPUs much earlier. I think
> this might help as the percpu memory might be initialised in this case.
> 
> Anyways give this a try, also test the CPU hotplug and check if nothing
> is broken on RISC-V. We noticed this bug only on one platform while

Coll, will give it a go tomorrow probably. I have a swiotlb issue that's
broken my boot to find first :/

_______________________________________________
linux-riscv mailing list
linux-riscv@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-riscv

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH -next] arch_topology: Fix cache attributes detection in the CPU hotplug path
  2022-07-13 13:33 [PATCH -next] arch_topology: Fix cache attributes detection in the CPU hotplug path Sudeep Holla
  2022-07-13 14:03 ` Greg Kroah-Hartman
  2022-07-13 16:04 ` Conor.Dooley
@ 2022-07-14 14:17 ` Conor.Dooley
  2022-07-14 15:01   ` Sudeep Holla
  2022-07-14 17:52 ` Ionela Voinescu
  2022-07-18 17:41 ` Guenter Roeck
  4 siblings, 1 reply; 17+ messages in thread
From: Conor.Dooley @ 2022-07-14 14:17 UTC (permalink / raw)
  To: sudeep.holla, linux-kernel, Conor.Dooley
  Cc: gregkh, vincent.guittot, dietmar.eggemann, ionela.voinescu,
	pierre.gondois, linux-arm-kernel, linux-riscv

On 13/07/2022 14:33, Sudeep Holla wrote:

Hey Sudeep,
I could not get this patch to actually apply, tried a couple
different versions of -next :/

It is in -next already though, which I suspect might be part of why
it does not apply.. Surely you can fast forward your arch_topology
for-next branch to gregs merge commit rather than generating this
from the premerge branch & re-merging into your branch that Stephen
picks up?

Either way, I tested it directly in -next since that's back to
booting for me today and ...

> init_cpu_topology() is called only once at the boot and all the cache
> attributes are detected early for all the possible CPUs. However when
> the CPUs are hotplugged out, the cacheinfo gets removed. While the
> attributes are added back when the CPUs are hotplugged back in as part
> of CPU hotplug state machine, it ends up called quite late after the
> update_siblings_masks() are called in the secondary_start_kernel()
> resulting in wrong llc_sibling_masks.
> 
> Move the call to detect_cache_attributes() inside update_siblings_masks()
> to ensure the cacheinfo is updated before the LLC sibling masks are
> updated. This will fix the incorrect LLC sibling masks generated when
> the CPUs are hotplugged out and hotplugged back in again.
> 
> Reported-by: Ionela Voinescu <ionela.voinescu@arm.com>
> Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
> ---
>   drivers/base/arch_topology.c | 16 ++++++----------
>   1 file changed, 6 insertions(+), 10 deletions(-)
> 
> Hi Conor,
> 
> Ionela reported an issue with the CPU hotplug and as a fix I need to
> move the call to detect_cache_attributes() which I had thought to keep
> it there from first but for no reason had moved it to init_cpu_topology().
> 
> Wonder if this fixes the -ENOMEM on RISC-V as this one is called on the
> cpu in the secondary CPUs init path while init_cpu_topology executed
> detect_cache_attributes() for all possible CPUs much earlier. I think
> this might help as the percpu memory might be initialised in this case.

Actually, we are now worse off than before:
     0.009813] smp: Bringing up secondary CPUs ...
[    0.011530] BUG: sleeping function called from invalid context at include/linux/sched/mm.h:274
[    0.011550] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 0, name: swapper/1
[    0.011566] preempt_count: 1, expected: 0
[    0.011580] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.19.0-rc6-next-20220714-dirty #1
[    0.011599] Hardware name: Microchip PolarFire-SoC Icicle Kit (DT)
[    0.011608] Call Trace:
[    0.011620] [<ffffffff80005070>] dump_backtrace+0x1c/0x24
[    0.011661] [<ffffffff8066b0c4>] show_stack+0x2c/0x38
[    0.011699] [<ffffffff806704a2>] dump_stack_lvl+0x40/0x58
[    0.011725] [<ffffffff806704ce>] dump_stack+0x14/0x1c
[    0.011745] [<ffffffff8002f42a>] __might_resched+0x100/0x10a
[    0.011772] [<ffffffff8002f472>] __might_sleep+0x3e/0x66
[    0.011793] [<ffffffff8014d774>] __kmalloc+0xd6/0x224
[    0.011825] [<ffffffff803d631c>] detect_cache_attributes+0x37a/0x448
[    0.011855] [<ffffffff803e8fbe>] update_siblings_masks+0x24/0x246
[    0.011885] [<ffffffff80005f32>] smp_callin+0x38/0x5c
[    0.015990] smp: Brought up 1 node, 4 CPUs

> 
> Anyways give this a try, also test the CPU hotplug and check if nothing
> is broken on RISC-V. We noticed this bug only on one platform while

So, our system monitor that runs openSBI does not actually support
any hotplug features yet, so:

# echo 0 > /sys/devices/system/cpu/cpu3/online
[   47.233627] CPU3: off
[   47.236018] CPU3 may not have stopped: 3
# echo 1 > /sys/devices/system/cpu/cpu3/online
[   54.911868] CPU3: failed to start

And this one confused the hell out of it...

# echo 0 > /sys/devices/system/cpu/cpu1/online
[ 2903.057706] CPU1: off
HSS_OpenSBI_Reboot() called
[ 2903.062447] CPU1 may not have stopped: 3
#
# [8.218591] HSS_Boot_PMPSetupHandler(): Hart1 setup complete

This is the hart that brought up openSBI so when the request
to offline it comes through it causes a system reboot haha

Either way, I think both imply that the hotplug code on the
Linux side is sane.

FWIW Sudeep, if you  want to add me as a reviewer for generic
arch topology stuff since I do care about testing it etc, please
feel free (although for the sake of my filters, make the email
conor@kernel.org if you do).

Thanks,
Conor.

> 
> Regards,
> Sudeep
> 
> diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
> index 441e14ac33a4..0424b59b695e 100644
> --- a/drivers/base/arch_topology.c
> +++ b/drivers/base/arch_topology.c
> @@ -732,7 +732,11 @@ const struct cpumask *cpu_clustergroup_mask(int cpu)
>   void update_siblings_masks(unsigned int cpuid)
>   {
>   	struct cpu_topology *cpu_topo, *cpuid_topo = &cpu_topology[cpuid];
> -	int cpu;
> +	int cpu, ret;
> +
> +	ret = detect_cache_attributes(cpuid);
> +	if (ret)
> +		pr_info("Early cacheinfo failed, ret = %d\n", ret);
>   	/* update core and thread sibling masks */
>   	for_each_online_cpu(cpu) {
> @@ -821,7 +825,7 @@ __weak int __init parse_acpi_topology(void)
>   #if defined(CONFIG_ARM64) || defined(CONFIG_RISCV)
>   void __init init_cpu_topology(void)
>   {
> -	int ret, cpu;
> +	int ret;
>   	reset_cpu_topology();
>   	ret = parse_acpi_topology();
> @@ -836,13 +840,5 @@ void __init init_cpu_topology(void)
>   		reset_cpu_topology();
>   		return;
>   	}
> -
> -	for_each_possible_cpu(cpu) {
> -		ret = detect_cache_attributes(cpu);
> -		if (ret) {
> -			pr_info("Early cacheinfo failed, ret = %d\n", ret);
> -			break;
> -		}
> -	}
>   }
>   #endif
> --2.37.1
> 
_______________________________________________
linux-riscv mailing list
linux-riscv@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-riscv

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH -next] arch_topology: Fix cache attributes detection in the CPU hotplug path
  2022-07-14 14:17 ` Conor.Dooley
@ 2022-07-14 15:01   ` Sudeep Holla
  2022-07-14 15:27     ` Conor.Dooley
  0 siblings, 1 reply; 17+ messages in thread
From: Sudeep Holla @ 2022-07-14 15:01 UTC (permalink / raw)
  To: Conor.Dooley
  Cc: linux-kernel, gregkh, vincent.guittot, dietmar.eggemann,
	ionela.voinescu, pierre.gondois, linux-arm-kernel, linux-riscv

On Thu, Jul 14, 2022 at 02:17:33PM +0000, Conor.Dooley@microchip.com wrote:
> On 13/07/2022 14:33, Sudeep Holla wrote:
> 
> Hey Sudeep,
> I could not get this patch to actually apply, tried a couple
> different versions of -next :/
>

That's strange.

> It is in -next already though, which I suspect might be part of why
> it does not apply.. 

Ah that could be the case.

> Surely you can fast forward your arch_topology
> for-next branch to gregs merge commit rather than generating this
> from the premerge branch & re-merging into your branch that Stephen
> picks up?
>

Greg has merged my branch and all those commits are untouched, so it shouldn't
cause any issue as the hash remains same in both the trees, I just added just
this one patch on the top. Did you see any issues with the merge, or are you
just speculating based on your understanding. As I said if Greg had picked
up patches directly, then I would have definitely based on his -next branch
to avoid duplicate hash, but that is not the case here.

> Either way, I tested it directly in -next since that's back to
> booting for me today and ...
>

Thanks.

> > init_cpu_topology() is called only once at the boot and all the cache
> > attributes are detected early for all the possible CPUs. However when
> > the CPUs are hotplugged out, the cacheinfo gets removed. While the
> > attributes are added back when the CPUs are hotplugged back in as part
> > of CPU hotplug state machine, it ends up called quite late after the
> > update_siblings_masks() are called in the secondary_start_kernel()
> > resulting in wrong llc_sibling_masks.
> > 
> > Move the call to detect_cache_attributes() inside update_siblings_masks()
> > to ensure the cacheinfo is updated before the LLC sibling masks are
> > updated. This will fix the incorrect LLC sibling masks generated when
> > the CPUs are hotplugged out and hotplugged back in again.
> > 
> > Reported-by: Ionela Voinescu <ionela.voinescu@arm.com>
> > Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
> > ---
> >   drivers/base/arch_topology.c | 16 ++++++----------
> >   1 file changed, 6 insertions(+), 10 deletions(-)
> > 
> > Hi Conor,
> > 
> > Ionela reported an issue with the CPU hotplug and as a fix I need to
> > move the call to detect_cache_attributes() which I had thought to keep
> > it there from first but for no reason had moved it to init_cpu_topology().
> > 
> > Wonder if this fixes the -ENOMEM on RISC-V as this one is called on the
> > cpu in the secondary CPUs init path while init_cpu_topology executed
> > detect_cache_attributes() for all possible CPUs much earlier. I think
> > this might help as the percpu memory might be initialised in this case.
> 
> Actually, we are now worse off than before:
>      0.009813] smp: Bringing up secondary CPUs ...
> [    0.011530] BUG: sleeping function called from invalid context at include/linux/sched/mm.h:274
> [    0.011550] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 0, name: swapper/1
> [    0.011566] preempt_count: 1, expected: 0
> [    0.011580] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.19.0-rc6-next-20220714-dirty #1
> [    0.011599] Hardware name: Microchip PolarFire-SoC Icicle Kit (DT)
> [    0.011608] Call Trace:
> [    0.011620] [<ffffffff80005070>] dump_backtrace+0x1c/0x24
> [    0.011661] [<ffffffff8066b0c4>] show_stack+0x2c/0x38
> [    0.011699] [<ffffffff806704a2>] dump_stack_lvl+0x40/0x58
> [    0.011725] [<ffffffff806704ce>] dump_stack+0x14/0x1c
> [    0.011745] [<ffffffff8002f42a>] __might_resched+0x100/0x10a
> [    0.011772] [<ffffffff8002f472>] __might_sleep+0x3e/0x66
> [    0.011793] [<ffffffff8014d774>] __kmalloc+0xd6/0x224
> [    0.011825] [<ffffffff803d631c>] detect_cache_attributes+0x37a/0x448
> [    0.011855] [<ffffffff803e8fbe>] update_siblings_masks+0x24/0x246
> [    0.011885] [<ffffffff80005f32>] smp_callin+0x38/0x5c
> [    0.015990] smp: Brought up 1 node, 4 CPUs
>

Interesting, need to check if it is not in atomic context on arm64.
Wonder if some configs are disabled and making this bug hidden. Let me
check.

One possible solution is to add GFP_ATOMIC to the allocation but I want
to make sure if it is legal to be in atomic context when calling
update_siblings_masks.

> > 
> > Anyways give this a try, also test the CPU hotplug and check if nothing
> > is broken on RISC-V. We noticed this bug only on one platform while
> 
> So, our system monitor that runs openSBI does not actually support
> any hotplug features yet, so:

OK, we can ignore hotplug on RISC-V for now then. We have tested on multiple
arm64 platforms(DT as well as ACPI).

-- 
Regards,
Sudeep

_______________________________________________
linux-riscv mailing list
linux-riscv@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-riscv

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH -next] arch_topology: Fix cache attributes detection in the CPU hotplug path
  2022-07-14 15:01   ` Sudeep Holla
@ 2022-07-14 15:27     ` Conor.Dooley
  2022-07-14 16:00       ` Sudeep Holla
  0 siblings, 1 reply; 17+ messages in thread
From: Conor.Dooley @ 2022-07-14 15:27 UTC (permalink / raw)
  To: sudeep.holla, Conor.Dooley
  Cc: linux-kernel, gregkh, vincent.guittot, dietmar.eggemann,
	ionela.voinescu, pierre.gondois, linux-arm-kernel, linux-riscv

On 14/07/2022 16:01, Sudeep Holla wrote:
> On Thu, Jul 14, 2022 at 02:17:33PM +0000, Conor.Dooley@microchip.com wrote:
>> On 13/07/2022 14:33, Sudeep Holla wrote:
>>
>> Hey Sudeep,
>> I could not get this patch to actually apply, tried a couple
>> different versions of -next :/
>>
> 
> That's strange.
> 
>> It is in -next already though, which I suspect might be part of why
>> it does not apply.. 
> 
> Ah that could be the case.
> 
>> Surely you can fast forward your arch_topology
>> for-next branch to gregs merge commit rather than generating this
>> from the premerge branch & re-merging into your branch that Stephen
>> picks up?
>>
> 
> Greg has merged my branch and all those commits are untouched, so it shouldn't
> cause any issue as the hash remains same in both the trees, I just added just
> this one patch on the top. Did you see any issues with the merge, or are you
> just speculating based on your understanding. 

Speculating based on it being a "could not construct ancestor" error.


>>
>> Actually, we are now worse off than before:
>>      0.009813] smp: Bringing up secondary CPUs ...
>> [    0.011530] BUG: sleeping function called from invalid context at include/linux/sched/mm.h:274
>> [    0.011550] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 0, name: swapper/1
>> [    0.011566] preempt_count: 1, expected: 0
>> [    0.011580] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.19.0-rc6-next-20220714-dirty #1
>> [    0.011599] Hardware name: Microchip PolarFire-SoC Icicle Kit (DT)
>> [    0.011608] Call Trace:
>> [    0.011620] [<ffffffff80005070>] dump_backtrace+0x1c/0x24
>> [    0.011661] [<ffffffff8066b0c4>] show_stack+0x2c/0x38
>> [    0.011699] [<ffffffff806704a2>] dump_stack_lvl+0x40/0x58
>> [    0.011725] [<ffffffff806704ce>] dump_stack+0x14/0x1c
>> [    0.011745] [<ffffffff8002f42a>] __might_resched+0x100/0x10a
>> [    0.011772] [<ffffffff8002f472>] __might_sleep+0x3e/0x66
>> [    0.011793] [<ffffffff8014d774>] __kmalloc+0xd6/0x224
>> [    0.011825] [<ffffffff803d631c>] detect_cache_attributes+0x37a/0x448
>> [    0.011855] [<ffffffff803e8fbe>] update_siblings_masks+0x24/0x246
>> [    0.011885] [<ffffffff80005f32>] smp_callin+0x38/0x5c
>> [    0.015990] smp: Brought up 1 node, 4 CPUs
>>
> 
> Interesting, need to check if it is not in atomic context on arm64.
> Wonder if some configs are disabled and making this bug hidden. Let me
> check.
> 
> One possible solution is to add GFP_ATOMIC to the allocation but I want
> to make sure if it is legal to be in atomic context when calling
> update_siblings_masks.
> 
>>>
>>> Anyways give this a try, also test the CPU hotplug and check if nothing
>>> is broken on RISC-V. We noticed this bug only on one platform while
>>
>> So, our system monitor that runs openSBI does not actually support
>> any hotplug features yet, so:
> 
> OK, we can ignore hotplug on RISC-V for now then. We have tested on multiple
> arm64 platforms(DT as well as ACPI).
> 

Well, other vendors implementations of firmware-come-bootloaders-
running-openSBI may support it, but (currently) ours does not.
But, if no-one else is speaking up about this, my arch-topo changes
or your original patchset...
_______________________________________________
linux-riscv mailing list
linux-riscv@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-riscv

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH -next] arch_topology: Fix cache attributes detection in the CPU hotplug path
  2022-07-14 15:27     ` Conor.Dooley
@ 2022-07-14 16:00       ` Sudeep Holla
  2022-07-14 16:10         ` Conor.Dooley
  0 siblings, 1 reply; 17+ messages in thread
From: Sudeep Holla @ 2022-07-14 16:00 UTC (permalink / raw)
  To: Conor.Dooley
  Cc: linux-kernel, gregkh, vincent.guittot, dietmar.eggemann,
	ionela.voinescu, pierre.gondois, linux-arm-kernel, linux-riscv

On Thu, Jul 14, 2022 at 03:27:09PM +0000, Conor.Dooley@microchip.com wrote:
> On 14/07/2022 16:01, Sudeep Holla wrote:
> > 
> > Interesting, need to check if it is not in atomic context on arm64.
> > Wonder if some configs are disabled and making this bug hidden. Let me
> > check.
> >

OK, it turns I didn't have necessary config options enabled. Enabling
them, I did see the BUG splat and changing allocation to GFP_ATOMIC
fixed the same. Can you try that please so that you can test if other
things are fine.

> > One possible solution is to add GFP_ATOMIC to the allocation but I want
> > to make sure if it is legal to be in atomic context when calling
> > update_siblings_masks.
> >

So I take is as legal and needs to be fixed to push my patch.

> >>>
> >>> Anyways give this a try, also test the CPU hotplug and check if nothing
> >>> is broken on RISC-V. We noticed this bug only on one platform while
> >>
> >> So, our system monitor that runs openSBI does not actually support
> >> any hotplug features yet, so:
> > 
> > OK, we can ignore hotplug on RISC-V for now then. We have tested on multiple
> > arm64 platforms(DT as well as ACPI).
> > 
> 
> Well, other vendors implementations of firmware-come-bootloaders-
> running-openSBI may support it, but (currently) ours does not.
> But, if no-one else is speaking up about this, my arch-topo changes
> or your original patchset...

OK

-- 
Regards,
Sudeep

_______________________________________________
linux-riscv mailing list
linux-riscv@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-riscv

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH -next] arch_topology: Fix cache attributes detection in the CPU hotplug path
  2022-07-14 16:00       ` Sudeep Holla
@ 2022-07-14 16:10         ` Conor.Dooley
  2022-07-15  9:11           ` Sudeep Holla
  0 siblings, 1 reply; 17+ messages in thread
From: Conor.Dooley @ 2022-07-14 16:10 UTC (permalink / raw)
  To: sudeep.holla
  Cc: linux-kernel, gregkh, vincent.guittot, dietmar.eggemann,
	ionela.voinescu, pierre.gondois, linux-arm-kernel, linux-riscv

On 14/07/2022 17:00, Sudeep Holla wrote:
> EXTERNAL EMAIL: Do not click links or open attachments unless you know the content is safe
> 
> On Thu, Jul 14, 2022 at 03:27:09PM +0000, Conor.Dooley@microchip.com wrote:
>> On 14/07/2022 16:01, Sudeep Holla wrote:
>>>
>>> Interesting, need to check if it is not in atomic context on arm64.
>>> Wonder if some configs are disabled and making this bug hidden. Let me
>>> check.
>>>
> 
> OK, it turns I didn't have necessary config options enabled. Enabling
> them, I did see the BUG splat and changing allocation to GFP_ATOMIC
> fixed the same. Can you try that please so that you can test if other
> things are fine.
> 
>>> One possible solution is to add GFP_ATOMIC to the allocation but I want
>>> to make sure if it is legal to be in atomic context when calling
>>> update_siblings_masks.
>>>
> 
> So I take is as legal and needs to be fixed to push my patch.
> 

With the GFP_ATOMIC, behaviour is the same as before for me.

Therefore, with the following diff & for RISC-V/DT:

Reviewed-by: Conor Dooley <conor.dooley@microchip.com>

diff --git a/drivers/base/cacheinfo.c b/drivers/base/cacheinfo.c
index 65d566ff24c4..4b5cd08c5a65 100644
--- a/drivers/base/cacheinfo.c
+++ b/drivers/base/cacheinfo.c
@@ -356,7 +356,7 @@ int detect_cache_attributes(unsigned int cpu)
                return -ENOENT;
 
        per_cpu_cacheinfo(cpu) = kcalloc(cache_leaves(cpu),
-                                        sizeof(struct cacheinfo), GFP_KERNEL);
+                                        sizeof(struct cacheinfo), GFP_ATOMIC);
        if (per_cpu_cacheinfo(cpu) == NULL) {
                cache_leaves(cpu) = 0;
                return -ENOMEM;
_______________________________________________
linux-riscv mailing list
linux-riscv@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-riscv

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH -next] arch_topology: Fix cache attributes detection in the CPU hotplug path
  2022-07-13 13:33 [PATCH -next] arch_topology: Fix cache attributes detection in the CPU hotplug path Sudeep Holla
                   ` (2 preceding siblings ...)
  2022-07-14 14:17 ` Conor.Dooley
@ 2022-07-14 17:52 ` Ionela Voinescu
  2022-07-18 17:41 ` Guenter Roeck
  4 siblings, 0 replies; 17+ messages in thread
From: Ionela Voinescu @ 2022-07-14 17:52 UTC (permalink / raw)
  To: Sudeep Holla
  Cc: linux-kernel, conor.dooley, Greg Kroah-Hartman, Vincent Guittot,
	Dietmar Eggemann, Pierre Gondois, linux-arm-kernel, linux-riscv

Hi Sudeep,

Thank you for the fix!

On Wednesday 13 Jul 2022 at 14:33:44 (+0100), Sudeep Holla wrote:
> init_cpu_topology() is called only once at the boot and all the cache
> attributes are detected early for all the possible CPUs. However when
> the CPUs are hotplugged out, the cacheinfo gets removed. While the
> attributes are added back when the CPUs are hotplugged back in as part
> of CPU hotplug state machine, it ends up called quite late after the
> update_siblings_masks() are called in the secondary_start_kernel()
> resulting in wrong llc_sibling_masks.
> 
> Move the call to detect_cache_attributes() inside update_siblings_masks()
> to ensure the cacheinfo is updated before the LLC sibling masks are
> updated. This will fix the incorrect LLC sibling masks generated when
> the CPUs are hotplugged out and hotplugged back in again.
> 
> Reported-by: Ionela Voinescu <ionela.voinescu@arm.com>
> Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
> ---
>  drivers/base/arch_topology.c | 16 ++++++----------
>  1 file changed, 6 insertions(+), 10 deletions(-)
> 
> Hi Conor,
> 
> Ionela reported an issue with the CPU hotplug and as a fix I need to
> move the call to detect_cache_attributes() which I had thought to keep
> it there from first but for no reason had moved it to init_cpu_topology().
> 
> Wonder if this fixes the -ENOMEM on RISC-V as this one is called on the
> cpu in the secondary CPUs init path while init_cpu_topology executed
> detect_cache_attributes() for all possible CPUs much earlier. I think
> this might help as the percpu memory might be initialised in this case.
> 
> Anyways give this a try, also test the CPU hotplug and check if nothing
> is broken on RISC-V. We noticed this bug only on one platform while
> 
> Regards,
> Sudeep
> 
> diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
> index 441e14ac33a4..0424b59b695e 100644
> --- a/drivers/base/arch_topology.c
> +++ b/drivers/base/arch_topology.c
> @@ -732,7 +732,11 @@ const struct cpumask *cpu_clustergroup_mask(int cpu)
>  void update_siblings_masks(unsigned int cpuid)
>  {
>  	struct cpu_topology *cpu_topo, *cpuid_topo = &cpu_topology[cpuid];
> -	int cpu;
> +	int cpu, ret;
> +
> +	ret = detect_cache_attributes(cpuid);
> +	if (ret)
> +		pr_info("Early cacheinfo failed, ret = %d\n", ret);
>  	/* update core and thread sibling masks */
>  	for_each_online_cpu(cpu) {
> @@ -821,7 +825,7 @@ __weak int __init parse_acpi_topology(void)
>  #if defined(CONFIG_ARM64) || defined(CONFIG_RISCV)
>  void __init init_cpu_topology(void)
>  {
> -	int ret, cpu;
> +	int ret;
>  	reset_cpu_topology();
>  	ret = parse_acpi_topology();
> @@ -836,13 +840,5 @@ void __init init_cpu_topology(void)
>  		reset_cpu_topology();
>  		return;
>  	}
> -
> -	for_each_possible_cpu(cpu) {
> -		ret = detect_cache_attributes(cpu);
> -		if (ret) {
> -			pr_info("Early cacheinfo failed, ret = %d\n", ret);
> -			break;
> -		}
> -	}
>  }
>  #endif
> --2.37.1
> 

Reviewed-by: Ionela Voinescu <ionela.voinescu@arm.com>
Tested-by:  Ionela Voinescu <ionela.voinescu@arm.com>

Kind regards,
Ionela.

_______________________________________________
linux-riscv mailing list
linux-riscv@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-riscv

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH -next] arch_topology: Fix cache attributes detection in the CPU hotplug path
  2022-07-14 16:10         ` Conor.Dooley
@ 2022-07-15  9:11           ` Sudeep Holla
  2022-07-15  9:16             ` Conor.Dooley
  0 siblings, 1 reply; 17+ messages in thread
From: Sudeep Holla @ 2022-07-15  9:11 UTC (permalink / raw)
  To: Conor.Dooley
  Cc: linux-kernel, gregkh, vincent.guittot, dietmar.eggemann,
	ionela.voinescu, pierre.gondois, linux-arm-kernel, linux-riscv

On Thu, Jul 14, 2022 at 04:10:36PM +0000, Conor.Dooley@microchip.com wrote:
> On 14/07/2022 17:00, Sudeep Holla wrote:
> > EXTERNAL EMAIL: Do not click links or open attachments unless you know the content is safe
> > 
> > On Thu, Jul 14, 2022 at 03:27:09PM +0000, Conor.Dooley@microchip.com wrote:
> >> On 14/07/2022 16:01, Sudeep Holla wrote:
> >>>
> >>> Interesting, need to check if it is not in atomic context on arm64.
> >>> Wonder if some configs are disabled and making this bug hidden. Let me
> >>> check.
> >>>
> > 
> > OK, it turns I didn't have necessary config options enabled. Enabling
> > them, I did see the BUG splat and changing allocation to GFP_ATOMIC
> > fixed the same. Can you try that please so that you can test if other
> > things are fine.
> > 
> >>> One possible solution is to add GFP_ATOMIC to the allocation but I want
> >>> to make sure if it is legal to be in atomic context when calling
> >>> update_siblings_masks.
> >>>
> > 
> > So I take is as legal and needs to be fixed to push my patch.
> > 
> 
> With the GFP_ATOMIC, behaviour is the same as before for me.
>

So you still get -ENOMEM failure on your platform. It is fine, just that
I am bit curious to know why as it succeeds at device_initcall later.
I was hoping this might fix your memory allocation failure.

> Therefore, with the following diff & for RISC-V/DT:
> 
> Reviewed-by: Conor Dooley <conor.dooley@microchip.com>
>

Thanks !

-- 
Regards,
Sudeep

_______________________________________________
linux-riscv mailing list
linux-riscv@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-riscv

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH -next] arch_topology: Fix cache attributes detection in the CPU hotplug path
  2022-07-15  9:11           ` Sudeep Holla
@ 2022-07-15  9:16             ` Conor.Dooley
  2022-07-15 14:04               ` Conor.Dooley
  0 siblings, 1 reply; 17+ messages in thread
From: Conor.Dooley @ 2022-07-15  9:16 UTC (permalink / raw)
  To: sudeep.holla
  Cc: linux-kernel, gregkh, vincent.guittot, dietmar.eggemann,
	ionela.voinescu, pierre.gondois, linux-arm-kernel, linux-riscv

On 15/07/2022 10:11, Sudeep Holla wrote:
> EXTERNAL EMAIL: Do not click links or open attachments unless you know the content is safe
> 
> On Thu, Jul 14, 2022 at 04:10:36PM +0000, Conor.Dooley@microchip.com wrote:
>> With the GFP_ATOMIC, behaviour is the same as before for me.
>>
> 
> So you still get -ENOMEM failure on your platform. It is fine, just that
> I am bit curious to know why as it succeeds at device_initcall later.
> I was hoping this might fix your memory allocation failure.

Correct. I have been staring at RISC-V init code the last days b/c of
another problem, I'll do a little more staring today and see if I can
figure out what's going on.

Thanks,
Conor.

_______________________________________________
linux-riscv mailing list
linux-riscv@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-riscv

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH -next] arch_topology: Fix cache attributes detection in the CPU hotplug path
  2022-07-15  9:16             ` Conor.Dooley
@ 2022-07-15 14:04               ` Conor.Dooley
  2022-07-15 15:41                 ` Sudeep Holla
  0 siblings, 1 reply; 17+ messages in thread
From: Conor.Dooley @ 2022-07-15 14:04 UTC (permalink / raw)
  To: sudeep.holla
  Cc: linux-kernel, gregkh, vincent.guittot, dietmar.eggemann,
	ionela.voinescu, pierre.gondois, linux-arm-kernel, linux-riscv

On 15/07/2022 10:16, Conor Dooley wrote:
> On 15/07/2022 10:11, Sudeep Holla wrote:
>> EXTERNAL EMAIL: Do not click links or open attachments unless you know the content is safe
>>
>> On Thu, Jul 14, 2022 at 04:10:36PM +0000, Conor.Dooley@microchip.com wrote:
>>> With the GFP_ATOMIC, behaviour is the same as before for me.
>>>
>>
>> So you still get -ENOMEM failure on your platform. It is fine, just that
>> I am bit curious to know why as it succeeds at device_initcall later.
>> I was hoping this might fix your memory allocation failure.
> 
> Correct. 

I take that back. On today's next with patch 2 applied, I don't see a
problem with no memory, so this does appear to have sorted the memory
allocation failure. Sorry for misleading you & thanks!

With my patches for store_cpu_topology on RISC-V I do see it though,
when called on the boot cpu. I must have mixed up which I had tested.
I have a fix for that though & will update my patches later.

Thanks & apologies!
Conor.

_______________________________________________
linux-riscv mailing list
linux-riscv@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-riscv

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH -next] arch_topology: Fix cache attributes detection in the CPU hotplug path
  2022-07-15 14:04               ` Conor.Dooley
@ 2022-07-15 15:41                 ` Sudeep Holla
  0 siblings, 0 replies; 17+ messages in thread
From: Sudeep Holla @ 2022-07-15 15:41 UTC (permalink / raw)
  To: Conor.Dooley
  Cc: linux-kernel, gregkh, vincent.guittot, dietmar.eggemann,
	Sudeep Holla, ionela.voinescu, pierre.gondois, linux-arm-kernel,
	linux-riscv

On Fri, Jul 15, 2022 at 02:04:41PM +0000, Conor.Dooley@microchip.com wrote:
> On 15/07/2022 10:16, Conor Dooley wrote:
> > On 15/07/2022 10:11, Sudeep Holla wrote:
> >> EXTERNAL EMAIL: Do not click links or open attachments unless you know the content is safe
> >>
> >> On Thu, Jul 14, 2022 at 04:10:36PM +0000, Conor.Dooley@microchip.com wrote:
> >>> With the GFP_ATOMIC, behaviour is the same as before for me.
> >>>
> >>
> >> So you still get -ENOMEM failure on your platform. It is fine, just that
> >> I am bit curious to know why as it succeeds at device_initcall later.
> >> I was hoping this might fix your memory allocation failure.
> > 
> > Correct. 
> 
> I take that back. On today's next with patch 2 applied, I don't see a
> problem with no memory, so this does appear to have sorted the memory
> allocation failure. Sorry for misleading you & thanks!
>

No worries. Glad to hear this fixed the memory allocation issue you had
on your platform, I was hopeful based on some analysis I did. I must have
done this from first, I think I had seen the bug in my initial testing and
moved the call to detect_cache_attributes() into init_cpu_topology() instead
which fixed it but broke hotplug which I didn't notice on few platforms
I tested until Ionela tested it on a particular platform.

> With my patches for store_cpu_topology on RISC-V I do see it though,
> when called on the boot cpu. I must have mixed up which I had tested.
> I have a fix for that though & will update my patches later.
> 

Sure.

-- 
Regards,
Sudeep

_______________________________________________
linux-riscv mailing list
linux-riscv@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-riscv

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH -next] arch_topology: Fix cache attributes detection in the CPU hotplug path
  2022-07-13 13:33 [PATCH -next] arch_topology: Fix cache attributes detection in the CPU hotplug path Sudeep Holla
                   ` (3 preceding siblings ...)
  2022-07-14 17:52 ` Ionela Voinescu
@ 2022-07-18 17:41 ` Guenter Roeck
  2022-07-18 17:57   ` Conor.Dooley
  4 siblings, 1 reply; 17+ messages in thread
From: Guenter Roeck @ 2022-07-18 17:41 UTC (permalink / raw)
  To: Sudeep Holla
  Cc: linux-kernel, conor.dooley, Greg Kroah-Hartman, Vincent Guittot,
	Dietmar Eggemann, Ionela Voinescu, Pierre Gondois,
	linux-arm-kernel, linux-riscv

On Wed, Jul 13, 2022 at 02:33:44PM +0100, Sudeep Holla wrote:
> init_cpu_topology() is called only once at the boot and all the cache
> attributes are detected early for all the possible CPUs. However when
> the CPUs are hotplugged out, the cacheinfo gets removed. While the
> attributes are added back when the CPUs are hotplugged back in as part
> of CPU hotplug state machine, it ends up called quite late after the
> update_siblings_masks() are called in the secondary_start_kernel()
> resulting in wrong llc_sibling_masks.
> 
> Move the call to detect_cache_attributes() inside update_siblings_masks()
> to ensure the cacheinfo is updated before the LLC sibling masks are
> updated. This will fix the incorrect LLC sibling masks generated when
> the CPUs are hotplugged out and hotplugged back in again.
> 
> Reported-by: Ionela Voinescu <ionela.voinescu@arm.com>
> Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
> ---
>  drivers/base/arch_topology.c | 16 ++++++----------
>  1 file changed, 6 insertions(+), 10 deletions(-)
> 
> Hi Conor,
> 
> Ionela reported an issue with the CPU hotplug and as a fix I need to
> move the call to detect_cache_attributes() which I had thought to keep
> it there from first but for no reason had moved it to init_cpu_topology().
> 
> Wonder if this fixes the -ENOMEM on RISC-V as this one is called on the
> cpu in the secondary CPUs init path while init_cpu_topology executed
> detect_cache_attributes() for all possible CPUs much earlier. I think
> this might help as the percpu memory might be initialised in this case.
> 
> Anyways give this a try, also test the CPU hotplug and check if nothing
> is broken on RISC-V. We noticed this bug only on one platform while
> 

arm64, with next-20220718:

...
[    0.823405] Detected PIPT I-cache on CPU1
[    0.824456] BUG: sleeping function called from invalid context at kernel/locking/semaphore.c:164
[    0.824550] in_atomic(): 1, irqs_disabled(): 128, non_block: 0, pid: 0, name: swapper/1
[    0.824600] preempt_count: 1, expected: 0
[    0.824633] RCU nest depth: 0, expected: 0
[    0.824899] no locks held by swapper/1/0.
[    0.825035] irq event stamp: 0
[    0.825072] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
[    0.826017] hardirqs last disabled at (0): [<ffff800008158870>] copy_process+0x5e0/0x18e4
[    0.826123] softirqs last  enabled at (0): [<ffff800008158870>] copy_process+0x5e0/0x18e4
[    0.826191] softirqs last disabled at (0): [<0000000000000000>] 0x0
[    0.826764] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.19.0-rc7-next-20220718 #1
[    0.827397] Call trace:
[    0.827456]  dump_backtrace.part.0+0xd4/0xe0
[    0.827574]  show_stack+0x18/0x50
[    0.827625]  dump_stack_lvl+0x9c/0xd8
[    0.827678]  dump_stack+0x18/0x34
[    0.827722]  __might_resched+0x178/0x220
[    0.827778]  __might_sleep+0x48/0x80
[    0.827833]  down_timeout+0x2c/0xa0
[    0.827896]  acpi_os_wait_semaphore+0x68/0x9c
[    0.827952]  acpi_ut_acquire_mutex+0x4c/0xb8
[    0.828008]  acpi_get_table+0x38/0xbc
[    0.828059]  acpi_find_last_cache_level+0x44/0x130
[    0.828112]  init_cache_level+0xb8/0xcc
[    0.828165]  detect_cache_attributes+0x240/0x580
[    0.828217]  update_siblings_masks+0x28/0x270
[    0.828270]  store_cpu_topology+0x64/0x74
[    0.828326]  secondary_start_kernel+0xd0/0x150
[    0.828386]  __secondary_switched+0xb0/0xb4

I know the problem has already been reported, but I think the backtrace
above is slightly different.

Guenter

_______________________________________________
linux-riscv mailing list
linux-riscv@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-riscv

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH -next] arch_topology: Fix cache attributes detection in the CPU hotplug path
  2022-07-18 17:41 ` Guenter Roeck
@ 2022-07-18 17:57   ` Conor.Dooley
  2022-07-19 10:29     ` Sudeep Holla
  0 siblings, 1 reply; 17+ messages in thread
From: Conor.Dooley @ 2022-07-18 17:57 UTC (permalink / raw)
  To: linux, sudeep.holla
  Cc: linux-kernel, Conor.Dooley, gregkh, vincent.guittot,
	dietmar.eggemann, ionela.voinescu, pierre.gondois,
	linux-arm-kernel, linux-riscv

On 18/07/2022 18:41, Guenter Roeck wrote:
> On Wed, Jul 13, 2022 at 02:33:44PM +0100, Sudeep Holla wrote:
>> init_cpu_topology() is called only once at the boot and all the cache
>> attributes are detected early for all the possible CPUs. However when
>> the CPUs are hotplugged out, the cacheinfo gets removed. While the
>> attributes are added back when the CPUs are hotplugged back in as part
>> of CPU hotplug state machine, it ends up called quite late after the
>> update_siblings_masks() are called in the secondary_start_kernel()
>> resulting in wrong llc_sibling_masks.
>>
>> Move the call to detect_cache_attributes() inside update_siblings_masks()
>> to ensure the cacheinfo is updated before the LLC sibling masks are
>> updated. This will fix the incorrect LLC sibling masks generated when
>> the CPUs are hotplugged out and hotplugged back in again.
>>
>> Reported-by: Ionela Voinescu <ionela.voinescu@arm.com>
>> Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
>> ---
>>  drivers/base/arch_topology.c | 16 ++++++----------
>>  1 file changed, 6 insertions(+), 10 deletions(-)
>>
>> Hi Conor,
>>
>> Ionela reported an issue with the CPU hotplug and as a fix I need to
>> move the call to detect_cache_attributes() which I had thought to keep
>> it there from first but for no reason had moved it to init_cpu_topology().
>>
>> Wonder if this fixes the -ENOMEM on RISC-V as this one is called on the
>> cpu in the secondary CPUs init path while init_cpu_topology executed
>> detect_cache_attributes() for all possible CPUs much earlier. I think
>> this might help as the percpu memory might be initialised in this case.
>>
>> Anyways give this a try, also test the CPU hotplug and check if nothing
>> is broken on RISC-V. We noticed this bug only on one platform while
>>
> 
> arm64, with next-20220718:
> 
> ...
> [    0.823405] Detected PIPT I-cache on CPU1
> [    0.824456] BUG: sleeping function called from invalid context at kernel/locking/semaphore.c:164
> [    0.824550] in_atomic(): 1, irqs_disabled(): 128, non_block: 0, pid: 0, name: swapper/1
> [    0.824600] preempt_count: 1, expected: 0
> [    0.824633] RCU nest depth: 0, expected: 0
> [    0.824899] no locks held by swapper/1/0.
> [    0.825035] irq event stamp: 0
> [    0.825072] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
> [    0.826017] hardirqs last disabled at (0): [<ffff800008158870>] copy_process+0x5e0/0x18e4
> [    0.826123] softirqs last  enabled at (0): [<ffff800008158870>] copy_process+0x5e0/0x18e4
> [    0.826191] softirqs last disabled at (0): [<0000000000000000>] 0x0
> [    0.826764] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.19.0-rc7-next-20220718 #1
> [    0.827397] Call trace:
> [    0.827456]  dump_backtrace.part.0+0xd4/0xe0
> [    0.827574]  show_stack+0x18/0x50
> [    0.827625]  dump_stack_lvl+0x9c/0xd8
> [    0.827678]  dump_stack+0x18/0x34
> [    0.827722]  __might_resched+0x178/0x220
> [    0.827778]  __might_sleep+0x48/0x80
> [    0.827833]  down_timeout+0x2c/0xa0
> [    0.827896]  acpi_os_wait_semaphore+0x68/0x9c
> [    0.827952]  acpi_ut_acquire_mutex+0x4c/0xb8
> [    0.828008]  acpi_get_table+0x38/0xbc
> [    0.828059]  acpi_find_last_cache_level+0x44/0x130
> [    0.828112]  init_cache_level+0xb8/0xcc
> [    0.828165]  detect_cache_attributes+0x240/0x580
> [    0.828217]  update_siblings_masks+0x28/0x270
> [    0.828270]  store_cpu_topology+0x64/0x74
> [    0.828326]  secondary_start_kernel+0xd0/0x150
> [    0.828386]  __secondary_switched+0xb0/0xb4
> 
> I know the problem has already been reported, but I think the backtrace
> above is slightly different.

Aye, I got a different BT on RISC-V + DT - but that should be fixed in
next-20220718. This is a different problem unfortunately.

Thanks,
Conor.
_______________________________________________
linux-riscv mailing list
linux-riscv@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-riscv

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH -next] arch_topology: Fix cache attributes detection in the CPU hotplug path
  2022-07-18 17:57   ` Conor.Dooley
@ 2022-07-19 10:29     ` Sudeep Holla
  0 siblings, 0 replies; 17+ messages in thread
From: Sudeep Holla @ 2022-07-19 10:29 UTC (permalink / raw)
  To: Conor.Dooley
  Cc: linux, linux-kernel, gregkh, vincent.guittot, dietmar.eggemann,
	ionela.voinescu, pierre.gondois, linux-arm-kernel, linux-riscv

On Mon, Jul 18, 2022 at 05:57:33PM +0000, Conor.Dooley@microchip.com wrote:
> On 18/07/2022 18:41, Guenter Roeck wrote:
> > On Wed, Jul 13, 2022 at 02:33:44PM +0100, Sudeep Holla wrote:
> >> init_cpu_topology() is called only once at the boot and all the cache
> >> attributes are detected early for all the possible CPUs. However when
> >> the CPUs are hotplugged out, the cacheinfo gets removed. While the
> >> attributes are added back when the CPUs are hotplugged back in as part
> >> of CPU hotplug state machine, it ends up called quite late after the
> >> update_siblings_masks() are called in the secondary_start_kernel()
> >> resulting in wrong llc_sibling_masks.
> >>
> >> Move the call to detect_cache_attributes() inside update_siblings_masks()
> >> to ensure the cacheinfo is updated before the LLC sibling masks are
> >> updated. This will fix the incorrect LLC sibling masks generated when
> >> the CPUs are hotplugged out and hotplugged back in again.
> >>
> >> Reported-by: Ionela Voinescu <ionela.voinescu@arm.com>
> >> Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
> >> ---
> >>  drivers/base/arch_topology.c | 16 ++++++----------
> >>  1 file changed, 6 insertions(+), 10 deletions(-)
> >>
> >> Hi Conor,
> >>
> >> Ionela reported an issue with the CPU hotplug and as a fix I need to
> >> move the call to detect_cache_attributes() which I had thought to keep
> >> it there from first but for no reason had moved it to init_cpu_topology().
> >>
> >> Wonder if this fixes the -ENOMEM on RISC-V as this one is called on the
> >> cpu in the secondary CPUs init path while init_cpu_topology executed
> >> detect_cache_attributes() for all possible CPUs much earlier. I think
> >> this might help as the percpu memory might be initialised in this case.
> >>
> >> Anyways give this a try, also test the CPU hotplug and check if nothing
> >> is broken on RISC-V. We noticed this bug only on one platform while
> >>
> > 
> > arm64, with next-20220718:
> > 
> > ...
> > [    0.823405] Detected PIPT I-cache on CPU1
> > [    0.824456] BUG: sleeping function called from invalid context at kernel/locking/semaphore.c:164
> > [    0.824550] in_atomic(): 1, irqs_disabled(): 128, non_block: 0, pid: 0, name: swapper/1
> > [    0.824600] preempt_count: 1, expected: 0
> > [    0.824633] RCU nest depth: 0, expected: 0
> > [    0.824899] no locks held by swapper/1/0.
> > [    0.825035] irq event stamp: 0
> > [    0.825072] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
> > [    0.826017] hardirqs last disabled at (0): [<ffff800008158870>] copy_process+0x5e0/0x18e4
> > [    0.826123] softirqs last  enabled at (0): [<ffff800008158870>] copy_process+0x5e0/0x18e4
> > [    0.826191] softirqs last disabled at (0): [<0000000000000000>] 0x0
> > [    0.826764] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.19.0-rc7-next-20220718 #1
> > [    0.827397] Call trace:
> > [    0.827456]  dump_backtrace.part.0+0xd4/0xe0
> > [    0.827574]  show_stack+0x18/0x50
> > [    0.827625]  dump_stack_lvl+0x9c/0xd8
> > [    0.827678]  dump_stack+0x18/0x34
> > [    0.827722]  __might_resched+0x178/0x220
> > [    0.827778]  __might_sleep+0x48/0x80
> > [    0.827833]  down_timeout+0x2c/0xa0
> > [    0.827896]  acpi_os_wait_semaphore+0x68/0x9c
> > [    0.827952]  acpi_ut_acquire_mutex+0x4c/0xb8
> > [    0.828008]  acpi_get_table+0x38/0xbc
> > [    0.828059]  acpi_find_last_cache_level+0x44/0x130
> > [    0.828112]  init_cache_level+0xb8/0xcc
> > [    0.828165]  detect_cache_attributes+0x240/0x580
> > [    0.828217]  update_siblings_masks+0x28/0x270
> > [    0.828270]  store_cpu_topology+0x64/0x74
> > [    0.828326]  secondary_start_kernel+0xd0/0x150
> > [    0.828386]  __secondary_switched+0xb0/0xb4
> > 
> > I know the problem has already been reported, but I think the backtrace
> > above is slightly different.
>

Thanks for the report, I forgot to run with lockdep on ACPI system. This is
trickier. I will take a look at it.

> Aye, I got a different BT on RISC-V + DT - but that should be fixed in
> next-20220718. This is a different problem unfortunately.

Yes, ACPI is bit different flow.

--
Regards,
Sudeep

_______________________________________________
linux-riscv mailing list
linux-riscv@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-riscv

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2022-07-19 10:29 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-07-13 13:33 [PATCH -next] arch_topology: Fix cache attributes detection in the CPU hotplug path Sudeep Holla
2022-07-13 14:03 ` Greg Kroah-Hartman
2022-07-13 14:18   ` Sudeep Holla
2022-07-13 16:04 ` Conor.Dooley
2022-07-14 14:17 ` Conor.Dooley
2022-07-14 15:01   ` Sudeep Holla
2022-07-14 15:27     ` Conor.Dooley
2022-07-14 16:00       ` Sudeep Holla
2022-07-14 16:10         ` Conor.Dooley
2022-07-15  9:11           ` Sudeep Holla
2022-07-15  9:16             ` Conor.Dooley
2022-07-15 14:04               ` Conor.Dooley
2022-07-15 15:41                 ` Sudeep Holla
2022-07-14 17:52 ` Ionela Voinescu
2022-07-18 17:41 ` Guenter Roeck
2022-07-18 17:57   ` Conor.Dooley
2022-07-19 10:29     ` Sudeep Holla

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).