linux-arm-kernel.lists.infradead.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] arm64: smp: fix smp_send_stop() behaviour
@ 2019-06-13 12:21 Cristian Marussi
  2019-06-17 18:09 ` Will Deacon
       [not found] ` <CANW9uyt1_Jt=Lk_Y7OQOEnSx7rZg5J5gQYcZcxU8TeZRYYHLCQ@mail.gmail.com>
  0 siblings, 2 replies; 9+ messages in thread
From: Cristian Marussi @ 2019-06-13 12:21 UTC (permalink / raw)
  To: linux-arm-kernel, catalin.marinas, will.deacon; +Cc: mark.rutland, dave.martin

On a 2-CPUs system, when one CPU is already online if the other
panics while starting-up, smp_send_stop() will fail to send any
STOP message to the other already online core, resulting in a
system still responsive and alive at the end of the panic procedure.
This patch makes smp_send_stop() account also for the online status
of the calling CPU while evaluating how many CPUs are effectively
online: this way, an adequate number of STOPs is sent, so enforcing
a proper freeze of the system at the end of panic even under the
above conditions.

Reported-by: Dave Martin <Dave.Martin@arm.com>
Signed-off-by: Cristian Marussi <cristian.marussi@arm.com>
---

This peculiar panic-procedure behaviour was exposed hitting a BUG()
while running a KSFT cpu-hotplug test on a 2-core ARMv8 model.
Such trigger-BUG() was fixed by a distinct commit already included
in Linux 5.2-rc4 [0]

[0] https://lore.kernel.org/linux-arm-kernel/1559576102-12156-1-git-send-email-Dave.Martin@arm.com/
---
 arch/arm64/kernel/smp.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index bb4b3f07761a..c7d604427883 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -971,8 +971,14 @@ void tick_broadcast(const struct cpumask *mask)
 void smp_send_stop(void)
 {
 	unsigned long timeout;
+	unsigned int this_cpu_online = cpu_online(smp_processor_id());
 
-	if (num_online_cpus() > 1) {
+	/*
+	 * If this CPU isn't fully online, it will not be counted in
+	 * num_online_cpus(): on a 2-CPU system this situation will
+	 * result in no message being sent to the other already online CPU.
+	 */
+	if (num_online_cpus() > this_cpu_online) {
 		cpumask_t mask;
 
 		cpumask_copy(&mask, cpu_online_mask);
@@ -985,10 +991,10 @@ void smp_send_stop(void)
 
 	/* Wait up to one second for other CPUs to stop */
 	timeout = USEC_PER_SEC;
-	while (num_online_cpus() > 1 && timeout--)
+	while (num_online_cpus() > this_cpu_online && timeout--)
 		udelay(1);
 
-	if (num_online_cpus() > 1)
+	if (num_online_cpus() > this_cpu_online)
 		pr_warning("SMP: failed to stop secondary CPUs %*pbl\n",
 			   cpumask_pr_args(cpu_online_mask));
 
-- 
2.17.1


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH] arm64: smp: fix smp_send_stop() behaviour
  2019-06-13 12:21 [PATCH] arm64: smp: fix smp_send_stop() behaviour Cristian Marussi
@ 2019-06-17 18:09 ` Will Deacon
  2019-06-18  9:36   ` James Morse
                     ` (2 more replies)
       [not found] ` <CANW9uyt1_Jt=Lk_Y7OQOEnSx7rZg5J5gQYcZcxU8TeZRYYHLCQ@mail.gmail.com>
  1 sibling, 3 replies; 9+ messages in thread
From: Will Deacon @ 2019-06-17 18:09 UTC (permalink / raw)
  To: Cristian Marussi
  Cc: mark.rutland, catalin.marinas, james.morse, dave.martin,
	linux-arm-kernel

[+James M]

On Thu, Jun 13, 2019 at 01:21:46PM +0100, Cristian Marussi wrote:
> On a 2-CPUs system, when one CPU is already online if the other
> panics while starting-up, smp_send_stop() will fail to send any
> STOP message to the other already online core, resulting in a
> system still responsive and alive at the end of the panic procedure.
> This patch makes smp_send_stop() account also for the online status
> of the calling CPU while evaluating how many CPUs are effectively
> online: this way, an adequate number of STOPs is sent, so enforcing
> a proper freeze of the system at the end of panic even under the
> above conditions.
> 
> Reported-by: Dave Martin <Dave.Martin@arm.com>
> Signed-off-by: Cristian Marussi <cristian.marussi@arm.com>
> ---
> 
> This peculiar panic-procedure behaviour was exposed hitting a BUG()
> while running a KSFT cpu-hotplug test on a 2-core ARMv8 model.
> Such trigger-BUG() was fixed by a distinct commit already included
> in Linux 5.2-rc4 [0]
> 
> [0] https://lore.kernel.org/linux-arm-kernel/1559576102-12156-1-git-send-email-Dave.Martin@arm.com/
> ---
>  arch/arm64/kernel/smp.c | 12 +++++++++---
>  1 file changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> index bb4b3f07761a..c7d604427883 100644
> --- a/arch/arm64/kernel/smp.c
> +++ b/arch/arm64/kernel/smp.c
> @@ -971,8 +971,14 @@ void tick_broadcast(const struct cpumask *mask)
>  void smp_send_stop(void)
>  {
>  	unsigned long timeout;
> +	unsigned int this_cpu_online = cpu_online(smp_processor_id());
>  
> -	if (num_online_cpus() > 1) {
> +	/*
> +	 * If this CPU isn't fully online, it will not be counted in
> +	 * num_online_cpus(): on a 2-CPU system this situation will
> +	 * result in no message being sent to the other already online CPU.
> +	 */
> +	if (num_online_cpus() > this_cpu_online) {
>  		cpumask_t mask;
>  
>  		cpumask_copy(&mask, cpu_online_mask);
> @@ -985,10 +991,10 @@ void smp_send_stop(void)
>  
>  	/* Wait up to one second for other CPUs to stop */
>  	timeout = USEC_PER_SEC;
> -	while (num_online_cpus() > 1 && timeout--)
> +	while (num_online_cpus() > this_cpu_online && timeout--)
>  		udelay(1);
>  
> -	if (num_online_cpus() > 1)
> +	if (num_online_cpus() > this_cpu_online)
>  		pr_warning("SMP: failed to stop secondary CPUs %*pbl\n",
>  			   cpumask_pr_args(cpu_online_mask));

Whilst this looks ok to me, I'm worried about whether or not we have this
sort of logic elsewhere. For example, do we need to fix
crash_smp_send_stop() (and possibly machine_kexec()) too?

Will

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] arm64: smp: fix smp_send_stop() behaviour
  2019-06-17 18:09 ` Will Deacon
@ 2019-06-18  9:36   ` James Morse
  2019-06-18  9:58     ` Cristian Marussi
  2019-06-18  9:41   ` Cristian Marussi
  2019-06-18 12:54   ` Russell King - ARM Linux admin
  2 siblings, 1 reply; 9+ messages in thread
From: James Morse @ 2019-06-18  9:36 UTC (permalink / raw)
  To: Will Deacon, Cristian Marussi
  Cc: mark.rutland, catalin.marinas, dave.martin, linux-arm-kernel

Hi Christian, Will,

On 17/06/2019 19:09, Will Deacon wrote:
> On Thu, Jun 13, 2019 at 01:21:46PM +0100, Cristian Marussi wrote:
>> On a 2-CPUs system, when one CPU is already online if the other
>> panics while starting-up, smp_send_stop() will fail to send any
>> STOP message to the other already online core, resulting in a
>> system still responsive and alive at the end of the panic procedure.
>> This patch makes smp_send_stop() account also for the online status
>> of the calling CPU while evaluating how many CPUs are effectively
>> online: this way, an adequate number of STOPs is sent, so enforcing
>> a proper freeze of the system at the end of panic even under the
>> above conditions.


>> diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
>> index bb4b3f07761a..c7d604427883 100644
>> --- a/arch/arm64/kernel/smp.c
>> +++ b/arch/arm64/kernel/smp.c
>> @@ -971,8 +971,14 @@ void tick_broadcast(const struct cpumask *mask)
>>  void smp_send_stop(void)
>>  {
>>  	unsigned long timeout;
>> +	unsigned int this_cpu_online = cpu_online(smp_processor_id());
>>  
>> -	if (num_online_cpus() > 1) {
>> +	/*
>> +	 * If this CPU isn't fully online, it will not be counted in
>> +	 * num_online_cpus(): on a 2-CPU system this situation will
>> +	 * result in no message being sent to the other already online CPU.
>> +	 */
>> +	if (num_online_cpus() > this_cpu_online) {
>>  		cpumask_t mask;
>>  
>>  		cpumask_copy(&mask, cpu_online_mask);
>> @@ -985,10 +991,10 @@ void smp_send_stop(void)
>>  
>>  	/* Wait up to one second for other CPUs to stop */
>>  	timeout = USEC_PER_SEC;
>> -	while (num_online_cpus() > 1 && timeout--)
>> +	while (num_online_cpus() > this_cpu_online && timeout--)
>>  		udelay(1);
>>  
>> -	if (num_online_cpus() > 1)
>> +	if (num_online_cpus() > this_cpu_online)
>>  		pr_warning("SMP: failed to stop secondary CPUs %*pbl\n",
>>  			   cpumask_pr_args(cpu_online_mask));
> 
> Whilst this looks ok to me, I'm worried about whether or not we have this
> sort of logic elsewhere. For example, do we need to fix
> crash_smp_send_stop() (and possibly machine_kexec()) too?

We should do the same in crash_smp_send_stop(), its possible a newly-online core triggers
kdump by calling panic() in the same way.

machine_kexec() is called on the last surviving cpu after migrate_to_reboot_cpu() has run.
At first glance it looks like this could never happen there, but kexec re-enables
cpu-hotplug (commit 011e4b02f1da), and we can reschedule before we start moving memory
around, so I'm not convinced its immune...



Thanks,

James

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] arm64: smp: fix smp_send_stop() behaviour
  2019-06-17 18:09 ` Will Deacon
  2019-06-18  9:36   ` James Morse
@ 2019-06-18  9:41   ` Cristian Marussi
  2019-06-18 12:46     ` Will Deacon
  2019-06-18 12:54   ` Russell King - ARM Linux admin
  2 siblings, 1 reply; 9+ messages in thread
From: Cristian Marussi @ 2019-06-18  9:41 UTC (permalink / raw)
  To: Will Deacon
  Cc: mark.rutland, catalin.marinas, james.morse, dave.martin,
	linux-arm-kernel

Hi Will

Thanks for the review.

On 17/06/2019 19:09, Will Deacon wrote:
> [+James M]
> 
> On Thu, Jun 13, 2019 at 01:21:46PM +0100, Cristian Marussi wrote:
>> On a 2-CPUs system, when one CPU is already online if the other
>> panics while starting-up, smp_send_stop() will fail to send any
>> STOP message to the other already online core, resulting in a
>> system still responsive and alive at the end of the panic procedure.
>> This patch makes smp_send_stop() account also for the online status
>> of the calling CPU while evaluating how many CPUs are effectively
>> online: this way, an adequate number of STOPs is sent, so enforcing
>> a proper freeze of the system at the end of panic even under the
>> above conditions.
>>
>> Reported-by: Dave Martin <Dave.Martin@arm.com>
>> Signed-off-by: Cristian Marussi <cristian.marussi@arm.com>
>> ---
>>
>> This peculiar panic-procedure behaviour was exposed hitting a BUG()
>> while running a KSFT cpu-hotplug test on a 2-core ARMv8 model.
>> Such trigger-BUG() was fixed by a distinct commit already included
>> in Linux 5.2-rc4 [0]
>>
>> [0] https://lore.kernel.org/linux-arm-kernel/1559576102-12156-1-git-send-email-Dave.Martin@arm.com/
>> ---
>>  arch/arm64/kernel/smp.c | 12 +++++++++---
>>  1 file changed, 9 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
>> index bb4b3f07761a..c7d604427883 100644
>> --- a/arch/arm64/kernel/smp.c
>> +++ b/arch/arm64/kernel/smp.c
>> @@ -971,8 +971,14 @@ void tick_broadcast(const struct cpumask *mask)
>>  void smp_send_stop(void)
>>  {
>>  	unsigned long timeout;
>> +	unsigned int this_cpu_online = cpu_online(smp_processor_id());
>>  
>> -	if (num_online_cpus() > 1) {
>> +	/*
>> +	 * If this CPU isn't fully online, it will not be counted in
>> +	 * num_online_cpus(): on a 2-CPU system this situation will
>> +	 * result in no message being sent to the other already online CPU.
>> +	 */
>> +	if (num_online_cpus() > this_cpu_online) {
>>  		cpumask_t mask;
>>  
>>  		cpumask_copy(&mask, cpu_online_mask);
>> @@ -985,10 +991,10 @@ void smp_send_stop(void)
>>  
>>  	/* Wait up to one second for other CPUs to stop */
>>  	timeout = USEC_PER_SEC;
>> -	while (num_online_cpus() > 1 && timeout--)
>> +	while (num_online_cpus() > this_cpu_online && timeout--)
>>  		udelay(1);
>>  
>> -	if (num_online_cpus() > 1)
>> +	if (num_online_cpus() > this_cpu_online)
>>  		pr_warning("SMP: failed to stop secondary CPUs %*pbl\n",
>>  			   cpumask_pr_args(cpu_online_mask));
> 
> Whilst this looks ok to me, I'm worried about whether or not we have this
> sort of logic elsewhere. For example, do we need to fix
> crash_smp_send_stop() (and possibly machine_kexec()) too?

I think we certainly have such logic in crash_smp_send_stop() too at least,
maybe it is just less easily exposed given the different use case of the function.

This wanted to be just a fix only against the observed troubled panic, but I
could extend it to cover similar issues spotted by code analysis, if deemed worth.

Thanks

Cristian


> 
> Will
> 


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] arm64: smp: fix smp_send_stop() behaviour
       [not found] ` <CANW9uyt1_Jt=Lk_Y7OQOEnSx7rZg5J5gQYcZcxU8TeZRYYHLCQ@mail.gmail.com>
@ 2019-06-18  9:56   ` Cristian Marussi
  0 siblings, 0 replies; 9+ messages in thread
From: Cristian Marussi @ 2019-06-18  9:56 UTC (permalink / raw)
  To: Itaru Kitayama
  Cc: mark.rutland, catalin.marinas, will.deacon, dave.martin,
	linux-arm-kernel

Hi Itaru

thanks for the review.

On 17/06/2019 20:58, Itaru Kitayama wrote:
> Could you avoid using the magic number like in udelay()?
> 

If you mean udelay(1) it is just that I avoided modifying anything which was not
strictly related to the fix addressed by this patch.

Thanks

Cristian

> On Thu, Jun 13, 2019 at 21:22 Cristian Marussi <cristian.marussi@arm.com>
> wrote:
> 
>> On a 2-CPUs system, when one CPU is already online if the other
>> panics while starting-up, smp_send_stop() will fail to send any
>> STOP message to the other already online core, resulting in a
>> system still responsive and alive at the end of the panic procedure.
>> This patch makes smp_send_stop() account also for the online status
>> of the calling CPU while evaluating how many CPUs are effectively
>> online: this way, an adequate number of STOPs is sent, so enforcing
>> a proper freeze of the system at the end of panic even under the
>> above conditions.
>>
>> Reported-by: Dave Martin <Dave.Martin@arm.com>
>> Signed-off-by: Cristian Marussi <cristian.marussi@arm.com>
>> ---
>>
>> This peculiar panic-procedure behaviour was exposed hitting a BUG()
>> while running a KSFT cpu-hotplug test on a 2-core ARMv8 model.
>> Such trigger-BUG() was fixed by a distinct commit already included
>> in Linux 5.2-rc4 [0]
>>
>> [0]
>> https://lore.kernel.org/linux-arm-kernel/1559576102-12156-1-git-send-email-Dave.Martin@arm.com/
>> ---
>>  arch/arm64/kernel/smp.c | 12 +++++++++---
>>  1 file changed, 9 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
>> index bb4b3f07761a..c7d604427883 100644
>> --- a/arch/arm64/kernel/smp.c
>> +++ b/arch/arm64/kernel/smp.c
>> @@ -971,8 +971,14 @@ void tick_broadcast(const struct cpumask *mask)
>>  void smp_send_stop(void)
>>  {
>>         unsigned long timeout;
>> +       unsigned int this_cpu_online = cpu_online(smp_processor_id());
>>
>> -       if (num_online_cpus() > 1) {
>> +       /*
>> +        * If this CPU isn't fully online, it will not be counted in
>> +        * num_online_cpus(): on a 2-CPU system this situation will
>> +        * result in no message being sent to the other already online CPU.
>> +        */
>> +       if (num_online_cpus() > this_cpu_online) {
>>                 cpumask_t mask;
>>
>>                 cpumask_copy(&mask, cpu_online_mask);
>> @@ -985,10 +991,10 @@ void smp_send_stop(void)
>>
>>         /* Wait up to one second for other CPUs to stop */
>>         timeout = USEC_PER_SEC;
>> -       while (num_online_cpus() > 1 && timeout--)
>> +       while (num_online_cpus() > this_cpu_online && timeout--)
>>                 udelay(1);
>>
>> -       if (num_online_cpus() > 1)
>> +       if (num_online_cpus() > this_cpu_online)
>>                 pr_warning("SMP: failed to stop secondary CPUs %*pbl\n",
>>                            cpumask_pr_args(cpu_online_mask));
>>
>> --
>> 2.17.1
>>
>>
>> _______________________________________________
>> linux-arm-kernel mailing list
>> linux-arm-kernel@lists.infradead.org
>> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
>>
> 


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] arm64: smp: fix smp_send_stop() behaviour
  2019-06-18  9:36   ` James Morse
@ 2019-06-18  9:58     ` Cristian Marussi
  0 siblings, 0 replies; 9+ messages in thread
From: Cristian Marussi @ 2019-06-18  9:58 UTC (permalink / raw)
  To: James Morse, Will Deacon
  Cc: mark.rutland, catalin.marinas, dave.martin, linux-arm-kernel

Hi James

On 18/06/2019 10:36, James Morse wrote:
> Hi Christian, Will,
> 
> On 17/06/2019 19:09, Will Deacon wrote:
>> On Thu, Jun 13, 2019 at 01:21:46PM +0100, Cristian Marussi wrote:
>>> On a 2-CPUs system, when one CPU is already online if the other
>>> panics while starting-up, smp_send_stop() will fail to send any
>>> STOP message to the other already online core, resulting in a
>>> system still responsive and alive at the end of the panic procedure.
>>> This patch makes smp_send_stop() account also for the online status
>>> of the calling CPU while evaluating how many CPUs are effectively
>>> online: this way, an adequate number of STOPs is sent, so enforcing
>>> a proper freeze of the system at the end of panic even under the
>>> above conditions.
> 
> 
>>> diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
>>> index bb4b3f07761a..c7d604427883 100644
>>> --- a/arch/arm64/kernel/smp.c
>>> +++ b/arch/arm64/kernel/smp.c
>>> @@ -971,8 +971,14 @@ void tick_broadcast(const struct cpumask *mask)
>>>  void smp_send_stop(void)
>>>  {
>>>  	unsigned long timeout;
>>> +	unsigned int this_cpu_online = cpu_online(smp_processor_id());
>>>  
>>> -	if (num_online_cpus() > 1) {
>>> +	/*
>>> +	 * If this CPU isn't fully online, it will not be counted in
>>> +	 * num_online_cpus(): on a 2-CPU system this situation will
>>> +	 * result in no message being sent to the other already online CPU.
>>> +	 */
>>> +	if (num_online_cpus() > this_cpu_online) {
>>>  		cpumask_t mask;
>>>  
>>>  		cpumask_copy(&mask, cpu_online_mask);
>>> @@ -985,10 +991,10 @@ void smp_send_stop(void)
>>>  
>>>  	/* Wait up to one second for other CPUs to stop */
>>>  	timeout = USEC_PER_SEC;
>>> -	while (num_online_cpus() > 1 && timeout--)
>>> +	while (num_online_cpus() > this_cpu_online && timeout--)
>>>  		udelay(1);
>>>  
>>> -	if (num_online_cpus() > 1)
>>> +	if (num_online_cpus() > this_cpu_online)
>>>  		pr_warning("SMP: failed to stop secondary CPUs %*pbl\n",
>>>  			   cpumask_pr_args(cpu_online_mask));
>>
>> Whilst this looks ok to me, I'm worried about whether or not we have this
>> sort of logic elsewhere. For example, do we need to fix
>> crash_smp_send_stop() (and possibly machine_kexec()) too?
> 
> We should do the same in crash_smp_send_stop(), its possible a newly-online core triggers
> kdump by calling panic() in the same way.
> 
> machine_kexec() is called on the last surviving cpu after migrate_to_reboot_cpu() has run.
> At first glance it looks like this could never happen there, but kexec re-enables
> cpu-hotplug (commit 011e4b02f1da), and we can reschedule before we start moving memory
> around, so I'm not convinced its immune...
> 
> 

I'll look into machine_kexec() behavior in these regards.

Thanks for the review !

Cristian
> 
> Thanks,
> 
> James
> 


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] arm64: smp: fix smp_send_stop() behaviour
  2019-06-18  9:41   ` Cristian Marussi
@ 2019-06-18 12:46     ` Will Deacon
  0 siblings, 0 replies; 9+ messages in thread
From: Will Deacon @ 2019-06-18 12:46 UTC (permalink / raw)
  To: Cristian Marussi
  Cc: mark.rutland, catalin.marinas, Will Deacon, james.morse,
	dave.martin, linux-arm-kernel

On Tue, Jun 18, 2019 at 10:41:32AM +0100, Cristian Marussi wrote:
> On 17/06/2019 19:09, Will Deacon wrote:
> >> diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> >> index bb4b3f07761a..c7d604427883 100644
> >> --- a/arch/arm64/kernel/smp.c
> >> +++ b/arch/arm64/kernel/smp.c
> >> @@ -971,8 +971,14 @@ void tick_broadcast(const struct cpumask *mask)
> >>  void smp_send_stop(void)
> >>  {
> >>  	unsigned long timeout;
> >> +	unsigned int this_cpu_online = cpu_online(smp_processor_id());
> >>  
> >> -	if (num_online_cpus() > 1) {
> >> +	/*
> >> +	 * If this CPU isn't fully online, it will not be counted in
> >> +	 * num_online_cpus(): on a 2-CPU system this situation will
> >> +	 * result in no message being sent to the other already online CPU.
> >> +	 */
> >> +	if (num_online_cpus() > this_cpu_online) {
> >>  		cpumask_t mask;
> >>  
> >>  		cpumask_copy(&mask, cpu_online_mask);
> >> @@ -985,10 +991,10 @@ void smp_send_stop(void)
> >>  
> >>  	/* Wait up to one second for other CPUs to stop */
> >>  	timeout = USEC_PER_SEC;
> >> -	while (num_online_cpus() > 1 && timeout--)
> >> +	while (num_online_cpus() > this_cpu_online && timeout--)
> >>  		udelay(1);
> >>  
> >> -	if (num_online_cpus() > 1)
> >> +	if (num_online_cpus() > this_cpu_online)
> >>  		pr_warning("SMP: failed to stop secondary CPUs %*pbl\n",
> >>  			   cpumask_pr_args(cpu_online_mask));
> > 
> > Whilst this looks ok to me, I'm worried about whether or not we have this
> > sort of logic elsewhere. For example, do we need to fix
> > crash_smp_send_stop() (and possibly machine_kexec()) too?
> 
> I think we certainly have such logic in crash_smp_send_stop() too at least,
> maybe it is just less easily exposed given the different use case of the function.
> 
> This wanted to be just a fix only against the observed troubled panic, but I
> could extend it to cover similar issues spotted by code analysis, if deemed worth.

Yes, please. Makes sense to fix these all at once.

Will

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] arm64: smp: fix smp_send_stop() behaviour
  2019-06-17 18:09 ` Will Deacon
  2019-06-18  9:36   ` James Morse
  2019-06-18  9:41   ` Cristian Marussi
@ 2019-06-18 12:54   ` Russell King - ARM Linux admin
  2019-06-18 17:40     ` Cristian Marussi
  2 siblings, 1 reply; 9+ messages in thread
From: Russell King - ARM Linux admin @ 2019-06-18 12:54 UTC (permalink / raw)
  To: Will Deacon
  Cc: mark.rutland, catalin.marinas, james.morse, linux-arm-kernel,
	dave.martin, Cristian Marussi

On Mon, Jun 17, 2019 at 07:09:13PM +0100, Will Deacon wrote:
> [+James M]
> 
> On Thu, Jun 13, 2019 at 01:21:46PM +0100, Cristian Marussi wrote:
> > On a 2-CPUs system, when one CPU is already online if the other
> > panics while starting-up, smp_send_stop() will fail to send any
> > STOP message to the other already online core, resulting in a
> > system still responsive and alive at the end of the panic procedure.
> > This patch makes smp_send_stop() account also for the online status
> > of the calling CPU while evaluating how many CPUs are effectively
> > online: this way, an adequate number of STOPs is sent, so enforcing
> > a proper freeze of the system at the end of panic even under the
> > above conditions.
> > 
> > Reported-by: Dave Martin <Dave.Martin@arm.com>
> > Signed-off-by: Cristian Marussi <cristian.marussi@arm.com>
> > ---
> > 
> > This peculiar panic-procedure behaviour was exposed hitting a BUG()
> > while running a KSFT cpu-hotplug test on a 2-core ARMv8 model.
> > Such trigger-BUG() was fixed by a distinct commit already included
> > in Linux 5.2-rc4 [0]
> > 
> > [0] https://lore.kernel.org/linux-arm-kernel/1559576102-12156-1-git-send-email-Dave.Martin@arm.com/
> > ---
> >  arch/arm64/kernel/smp.c | 12 +++++++++---
> >  1 file changed, 9 insertions(+), 3 deletions(-)
> > 
> > diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> > index bb4b3f07761a..c7d604427883 100644
> > --- a/arch/arm64/kernel/smp.c
> > +++ b/arch/arm64/kernel/smp.c
> > @@ -971,8 +971,14 @@ void tick_broadcast(const struct cpumask *mask)
> >  void smp_send_stop(void)
> >  {
> >  	unsigned long timeout;
> > +	unsigned int this_cpu_online = cpu_online(smp_processor_id());
> >  
> > -	if (num_online_cpus() > 1) {
> > +	/*
> > +	 * If this CPU isn't fully online, it will not be counted in
> > +	 * num_online_cpus(): on a 2-CPU system this situation will
> > +	 * result in no message being sent to the other already online CPU.
> > +	 */
> > +	if (num_online_cpus() > this_cpu_online) {
> >  		cpumask_t mask;
> >  
> >  		cpumask_copy(&mask, cpu_online_mask);
> > @@ -985,10 +991,10 @@ void smp_send_stop(void)
> >  
> >  	/* Wait up to one second for other CPUs to stop */
> >  	timeout = USEC_PER_SEC;
> > -	while (num_online_cpus() > 1 && timeout--)
> > +	while (num_online_cpus() > this_cpu_online && timeout--)
> >  		udelay(1);
> >  
> > -	if (num_online_cpus() > 1)
> > +	if (num_online_cpus() > this_cpu_online)
> >  		pr_warning("SMP: failed to stop secondary CPUs %*pbl\n",
> >  			   cpumask_pr_args(cpu_online_mask));
> 
> Whilst this looks ok to me, I'm worried about whether or not we have this
> sort of logic elsewhere. For example, do we need to fix
> crash_smp_send_stop() (and possibly machine_kexec()) too?

What about other architectures?  This, or very similar code, is present
on other architectures too.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] arm64: smp: fix smp_send_stop() behaviour
  2019-06-18 12:54   ` Russell King - ARM Linux admin
@ 2019-06-18 17:40     ` Cristian Marussi
  0 siblings, 0 replies; 9+ messages in thread
From: Cristian Marussi @ 2019-06-18 17:40 UTC (permalink / raw)
  To: Russell King - ARM Linux admin, Will Deacon
  Cc: mark.rutland, catalin.marinas, james.morse, dave.martin,
	linux-arm-kernel

Hi

On 18/06/2019 13:54, Russell King - ARM Linux admin wrote:
> On Mon, Jun 17, 2019 at 07:09:13PM +0100, Will Deacon wrote:
>> [+James M]
>>
>> On Thu, Jun 13, 2019 at 01:21:46PM +0100, Cristian Marussi wrote:
>>> On a 2-CPUs system, when one CPU is already online if the other
>>> panics while starting-up, smp_send_stop() will fail to send any
>>> STOP message to the other already online core, resulting in a
>>> system still responsive and alive at the end of the panic procedure.
>>> This patch makes smp_send_stop() account also for the online status
>>> of the calling CPU while evaluating how many CPUs are effectively
>>> online: this way, an adequate number of STOPs is sent, so enforcing
>>> a proper freeze of the system at the end of panic even under the
>>> above conditions.
>>>
>>> Reported-by: Dave Martin <Dave.Martin@arm.com>
>>> Signed-off-by: Cristian Marussi <cristian.marussi@arm.com>
>>> ---
>>>
>>> This peculiar panic-procedure behaviour was exposed hitting a BUG()
>>> while running a KSFT cpu-hotplug test on a 2-core ARMv8 model.
>>> Such trigger-BUG() was fixed by a distinct commit already included
>>> in Linux 5.2-rc4 [0]
>>>
>>> [0] https://lore.kernel.org/linux-arm-kernel/1559576102-12156-1-git-send-email-Dave.Martin@arm.com/
>>> ---
>>>  arch/arm64/kernel/smp.c | 12 +++++++++---
>>>  1 file changed, 9 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
>>> index bb4b3f07761a..c7d604427883 100644
>>> --- a/arch/arm64/kernel/smp.c
>>> +++ b/arch/arm64/kernel/smp.c
>>> @@ -971,8 +971,14 @@ void tick_broadcast(const struct cpumask *mask)
>>>  void smp_send_stop(void)
>>>  {
>>>  	unsigned long timeout;
>>> +	unsigned int this_cpu_online = cpu_online(smp_processor_id());
>>>  
>>> -	if (num_online_cpus() > 1) {
>>> +	/*
>>> +	 * If this CPU isn't fully online, it will not be counted in
>>> +	 * num_online_cpus(): on a 2-CPU system this situation will
>>> +	 * result in no message being sent to the other already online CPU.
>>> +	 */
>>> +	if (num_online_cpus() > this_cpu_online) {
>>>  		cpumask_t mask;
>>>  
>>>  		cpumask_copy(&mask, cpu_online_mask);
>>> @@ -985,10 +991,10 @@ void smp_send_stop(void)
>>>  
>>>  	/* Wait up to one second for other CPUs to stop */
>>>  	timeout = USEC_PER_SEC;
>>> -	while (num_online_cpus() > 1 && timeout--)
>>> +	while (num_online_cpus() > this_cpu_online && timeout--)
>>>  		udelay(1);
>>>  
>>> -	if (num_online_cpus() > 1)
>>> +	if (num_online_cpus() > this_cpu_online)
>>>  		pr_warning("SMP: failed to stop secondary CPUs %*pbl\n",
>>>  			   cpumask_pr_args(cpu_online_mask));
>>
>> Whilst this looks ok to me, I'm worried about whether or not we have this
>> sort of logic elsewhere. For example, do we need to fix
>> crash_smp_send_stop() (and possibly machine_kexec()) too?
> 
> What about other architectures?  This, or very similar code, is present
> on other architectures too.
> 
Thanks for the review,

indeed glancing quickly at other arch's, there is a lot of common things and
subtle differences in handling these situations...potentially some common logic
to fix once for all and abstract out of arch specific code...(or at least try to
do it)....so now a question arises to me: is it still worth to properly fix on
arm64 at first extending this patch (easier and quicker maybe) or should I opt
directly for the abstraction party ?

Thanks

Cristian


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2019-06-18 17:40 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-06-13 12:21 [PATCH] arm64: smp: fix smp_send_stop() behaviour Cristian Marussi
2019-06-17 18:09 ` Will Deacon
2019-06-18  9:36   ` James Morse
2019-06-18  9:58     ` Cristian Marussi
2019-06-18  9:41   ` Cristian Marussi
2019-06-18 12:46     ` Will Deacon
2019-06-18 12:54   ` Russell King - ARM Linux admin
2019-06-18 17:40     ` Cristian Marussi
     [not found] ` <CANW9uyt1_Jt=Lk_Y7OQOEnSx7rZg5J5gQYcZcxU8TeZRYYHLCQ@mail.gmail.com>
2019-06-18  9:56   ` Cristian Marussi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).