All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] cpu/hotplug: Fix rollback during error-out in takedown_cpu()
@ 2018-09-04  6:33 Neeraj Upadhyay
  2018-09-04 12:12 ` Mukesh Ojha
  2018-09-05 11:33 ` Thomas Gleixner
  0 siblings, 2 replies; 14+ messages in thread
From: Neeraj Upadhyay @ 2018-09-04  6:33 UTC (permalink / raw)
  To: tglx, josh, peterz, mingo, jiangshanlai, dzickus, brendan.jackman, malat
  Cc: linux-kernel, sramana, linux-arm-msm, Neeraj Upadhyay

If takedown_cpu() fails during _cpu_down(), st->state is reset,
by calling cpuhp_reset_state(). This results in an additional
increment of st->state, which results in CPUHP_AP_SMPBOOT_THREADS
state being skipped during rollback. Fix this by not calling
cpuhp_reset_state() and doing the state reset directly in
_cpu_down().

Fixes: 4dddfb5faa61 ("smp/hotplug: Rewrite AP state machine core")
Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
---
 kernel/cpu.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/kernel/cpu.c b/kernel/cpu.c
index aa7fe85..9f49edb 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -970,7 +970,14 @@ static int __ref _cpu_down(unsigned int cpu, int tasks_frozen,
 	 */
 	ret = cpuhp_down_callbacks(cpu, st, target);
 	if (ret && st->state > CPUHP_TEARDOWN_CPU && st->state < prev_state) {
-		cpuhp_reset_state(st, prev_state);
+		/*
+		 * As st->last is not set, cpuhp_reset_state() increments
+		 * st->state, which results in CPUHP_AP_SMPBOOT_THREADS being
+		 * skipped during rollback. So, don't use it here.
+		 */
+		st->rollback = true;
+		st->target = prev_state;
+		st->bringup = !st->bringup;
 		__cpuhp_kick_ap(st);
 	}
 
-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of the Code Aurora Forum, hosted by The Linux Foundation

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH] cpu/hotplug: Fix rollback during error-out in takedown_cpu()
  2018-09-04  6:33 [PATCH] cpu/hotplug: Fix rollback during error-out in takedown_cpu() Neeraj Upadhyay
@ 2018-09-04 12:12 ` Mukesh Ojha
  2018-09-05 11:33 ` Thomas Gleixner
  1 sibling, 0 replies; 14+ messages in thread
From: Mukesh Ojha @ 2018-09-04 12:12 UTC (permalink / raw)
  To: Neeraj Upadhyay, tglx, josh, peterz, mingo, jiangshanlai,
	dzickus, brendan.jackman, malat
  Cc: linux-kernel, sramana, linux-arm-msm



On 9/4/2018 12:03 PM, Neeraj Upadhyay wrote:
> If takedown_cpu() fails during _cpu_down(), st->state is reset,
> by calling cpuhp_reset_state(). This results in an additional
> increment of st->state, which results in CPUHP_AP_SMPBOOT_THREADS
> state being skipped during rollback. Fix this by not calling
> cpuhp_reset_state() and doing the state reset directly in
> _cpu_down().
>
> Fixes: 4dddfb5faa61 ("smp/hotplug: Rewrite AP state machine core")
> Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
> ---
>   kernel/cpu.c | 9 ++++++++-
>   1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index aa7fe85..9f49edb 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -970,7 +970,14 @@ static int __ref _cpu_down(unsigned int cpu, int tasks_frozen,
>   	 */
>   	ret = cpuhp_down_callbacks(cpu, st, target);
>   	if (ret && st->state > CPUHP_TEARDOWN_CPU && st->state < prev_state) {
> -		cpuhp_reset_state(st, prev_state);
> +		/*
> +		 * As st->last is not set, cpuhp_reset_state() increments
> +		 * st->state, which results in CPUHP_AP_SMPBOOT_THREADS being
> +		 * skipped during rollback. So, don't use it here.
> +		 */
> +		st->rollback = true;
> +		st->target = prev_state;
> +		st->bringup = !st->bringup;

Acked-by: Mukesh Ojha <mojha@codeaurora.org>

looks valid for this case.

>   		__cpuhp_kick_ap(st);
>   	}
>   

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] cpu/hotplug: Fix rollback during error-out in takedown_cpu()
  2018-09-04  6:33 [PATCH] cpu/hotplug: Fix rollback during error-out in takedown_cpu() Neeraj Upadhyay
  2018-09-04 12:12 ` Mukesh Ojha
@ 2018-09-05 11:33 ` Thomas Gleixner
  2018-09-05 12:09   ` Mukesh Ojha
  2018-09-05 12:23   ` Thomas Gleixner
  1 sibling, 2 replies; 14+ messages in thread
From: Thomas Gleixner @ 2018-09-05 11:33 UTC (permalink / raw)
  To: Neeraj Upadhyay
  Cc: josh, peterz, mingo, jiangshanlai, dzickus, brendan.jackman,
	malat, linux-kernel, sramana, linux-arm-msm

On Tue, 4 Sep 2018, Neeraj Upadhyay wrote:
> If takedown_cpu() fails during _cpu_down(), st->state is reset,
> by calling cpuhp_reset_state(). This results in an additional
> increment of st->state, which results in CPUHP_AP_SMPBOOT_THREADS
> state being skipped during rollback. Fix this by not calling
> cpuhp_reset_state() and doing the state reset directly in
> _cpu_down().
> 
> Fixes: 4dddfb5faa61 ("smp/hotplug: Rewrite AP state machine core")
> Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
> ---
>  kernel/cpu.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index aa7fe85..9f49edb 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -970,7 +970,14 @@ static int __ref _cpu_down(unsigned int cpu, int tasks_frozen,
>  	 */
>  	ret = cpuhp_down_callbacks(cpu, st, target);
>  	if (ret && st->state > CPUHP_TEARDOWN_CPU && st->state < prev_state) {
> -		cpuhp_reset_state(st, prev_state);
> +		/*
> +		 * As st->last is not set, cpuhp_reset_state() increments
> +		 * st->state, which results in CPUHP_AP_SMPBOOT_THREADS being
> +		 * skipped during rollback. So, don't use it here.
> +		 */
> +		st->rollback = true;
> +		st->target = prev_state;
> +		st->bringup = !st->bringup;

No, this is just papering over the actual problem.

The state inconsistency happens in take_cpu_down() when it returns with a
failure from __cpu_disable() because that returns with state = TEARDOWN_CPU
and st->state is then incremented in undo_cpu_down().

That's the real issue and we need to analyze the whole cpu_down rollback
logic first.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] cpu/hotplug: Fix rollback during error-out in takedown_cpu()
  2018-09-05 11:33 ` Thomas Gleixner
@ 2018-09-05 12:09   ` Mukesh Ojha
  2018-09-05 13:14     ` Thomas Gleixner
  2018-09-05 12:23   ` Thomas Gleixner
  1 sibling, 1 reply; 14+ messages in thread
From: Mukesh Ojha @ 2018-09-05 12:09 UTC (permalink / raw)
  To: Thomas Gleixner, Neeraj Upadhyay
  Cc: josh, peterz, mingo, jiangshanlai, dzickus, brendan.jackman,
	malat, linux-kernel, sramana, linux-arm-msm



On 9/5/2018 5:03 PM, Thomas Gleixner wrote:
> On Tue, 4 Sep 2018, Neeraj Upadhyay wrote:
>> If takedown_cpu() fails during _cpu_down(), st->state is reset,
>> by calling cpuhp_reset_state(). This results in an additional
>> increment of st->state, which results in CPUHP_AP_SMPBOOT_THREADS
>> state being skipped during rollback. Fix this by not calling
>> cpuhp_reset_state() and doing the state reset directly in
>> _cpu_down().
>>
>> Fixes: 4dddfb5faa61 ("smp/hotplug: Rewrite AP state machine core")
>> Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
>> ---
>>   kernel/cpu.c | 9 ++++++++-
>>   1 file changed, 8 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/cpu.c b/kernel/cpu.c
>> index aa7fe85..9f49edb 100644
>> --- a/kernel/cpu.c
>> +++ b/kernel/cpu.c
>> @@ -970,7 +970,14 @@ static int __ref _cpu_down(unsigned int cpu, int tasks_frozen,
>>   	 */
>>   	ret = cpuhp_down_callbacks(cpu, st, target);
>>   	if (ret && st->state > CPUHP_TEARDOWN_CPU && st->state < prev_state) {
>> -		cpuhp_reset_state(st, prev_state);
>> +		/*
>> +		 * As st->last is not set, cpuhp_reset_state() increments
>> +		 * st->state, which results in CPUHP_AP_SMPBOOT_THREADS being
>> +		 * skipped during rollback. So, don't use it here.
>> +		 */
>> +		st->rollback = true;
>> +		st->target = prev_state;
>> +		st->bringup = !st->bringup;
> No, this is just papering over the actual problem.
>
> The state inconsistency happens in take_cpu_down() when it returns with a
> failure from __cpu_disable() because that returns with state = TEARDOWN_CPU
> and st->state is then incremented in undo_cpu_down().
>
> That's the real issue and we need to analyze the whole cpu_down rollback
> logic first.

Could this be done like below ?

diff --git a/kernel/cpu.c b/kernel/cpu.c
index aa7fe85..47bce90 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -802,17 +802,18 @@ static int take_cpu_down(void *_param)
         int err, cpu = smp_processor_id();
         int ret;

-       /* Ensure this CPU doesn't handle any more interrupts. */
-       err = __cpu_disable();
-       if (err < 0)
-               return err;
-
         /*
          * We get here while we are in CPUHP_TEARDOWN_CPU state and we 
must not
          * do this step again.
          */
         WARN_ON(st->state != CPUHP_TEARDOWN_CPU);
         st->state--;
+
+       /* Ensure this CPU doesn't handle any more interrupts. */
+       err = __cpu_disable();
+       if (err < 0)
+               return err;
+
         /* Invoke the former CPU_DYING callbacks */

Thanks,
Mukesh
>
>
>
>
>
>

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH] cpu/hotplug: Fix rollback during error-out in takedown_cpu()
  2018-09-05 11:33 ` Thomas Gleixner
  2018-09-05 12:09   ` Mukesh Ojha
@ 2018-09-05 12:23   ` Thomas Gleixner
  2018-09-05 12:51     ` Neeraj Upadhyay
                       ` (2 more replies)
  1 sibling, 3 replies; 14+ messages in thread
From: Thomas Gleixner @ 2018-09-05 12:23 UTC (permalink / raw)
  To: Neeraj Upadhyay
  Cc: josh, peterz, mingo, jiangshanlai, dzickus, brendan.jackman,
	malat, linux-kernel, sramana, linux-arm-msm

On Wed, 5 Sep 2018, Thomas Gleixner wrote:
> On Tue, 4 Sep 2018, Neeraj Upadhyay wrote:
> >  	ret = cpuhp_down_callbacks(cpu, st, target);
> >  	if (ret && st->state > CPUHP_TEARDOWN_CPU && st->state < prev_state) {
> > -		cpuhp_reset_state(st, prev_state);
> > +		/*
> > +		 * As st->last is not set, cpuhp_reset_state() increments
> > +		 * st->state, which results in CPUHP_AP_SMPBOOT_THREADS being
> > +		 * skipped during rollback. So, don't use it here.
> > +		 */
> > +		st->rollback = true;
> > +		st->target = prev_state;
> > +		st->bringup = !st->bringup;
> 
> No, this is just papering over the actual problem.
> 
> The state inconsistency happens in take_cpu_down() when it returns with a
> failure from __cpu_disable() because that returns with state = TEARDOWN_CPU
> and st->state is then incremented in undo_cpu_down().
> 
> That's the real issue and we need to analyze the whole cpu_down rollback
> logic first.

And looking closer this is a general issue. Just that the TEARDOWN state
makes it simple to observe. It's universaly broken, when the first teardown
callback fails because, st->state is only decremented _AFTER_ the callback
returns success, but undo_cpu_down() increments unconditionally.

Patch below.

Thanks,

	tglx
----
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -916,7 +916,8 @@ static int cpuhp_down_callbacks(unsigned
 		ret = cpuhp_invoke_callback(cpu, st->state, false, NULL, NULL);
 		if (ret) {
 			st->target = prev_state;
-			undo_cpu_down(cpu, st);
+			if (st->state < prev_state)
+				undo_cpu_down(cpu, st);
 			break;
 		}
 	}
@@ -969,7 +970,7 @@ static int __ref _cpu_down(unsigned int
 	 * to do the further cleanups.
 	 */
 	ret = cpuhp_down_callbacks(cpu, st, target);
-	if (ret && st->state > CPUHP_TEARDOWN_CPU && st->state < prev_state) {
+	if (ret && st->state == CPUHP_TEARDOWN_CPU && st->state < prev_state) {
 		cpuhp_reset_state(st, prev_state);
 		__cpuhp_kick_ap(st);
 	}

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] cpu/hotplug: Fix rollback during error-out in takedown_cpu()
  2018-09-05 12:23   ` Thomas Gleixner
@ 2018-09-05 12:51     ` Neeraj Upadhyay
  2018-09-05 13:17       ` Thomas Gleixner
  2018-09-05 14:53     ` Sudeep Holla
  2018-09-06 13:52     ` [tip:smp/urgent] cpu/hotplug: Prevent state corruption on error rollback tip-bot for Thomas Gleixner
  2 siblings, 1 reply; 14+ messages in thread
From: Neeraj Upadhyay @ 2018-09-05 12:51 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: josh, peterz, mingo, jiangshanlai, dzickus, brendan.jackman,
	malat, linux-kernel, sramana, linux-arm-msm



On 09/05/2018 05:53 PM, Thomas Gleixner wrote:
> On Wed, 5 Sep 2018, Thomas Gleixner wrote:
>> On Tue, 4 Sep 2018, Neeraj Upadhyay wrote:
>>>   	ret = cpuhp_down_callbacks(cpu, st, target);
>>>   	if (ret && st->state > CPUHP_TEARDOWN_CPU && st->state < prev_state) {
>>> -		cpuhp_reset_state(st, prev_state);
>>> +		/*
>>> +		 * As st->last is not set, cpuhp_reset_state() increments
>>> +		 * st->state, which results in CPUHP_AP_SMPBOOT_THREADS being
>>> +		 * skipped during rollback. So, don't use it here.
>>> +		 */
>>> +		st->rollback = true;
>>> +		st->target = prev_state;
>>> +		st->bringup = !st->bringup;
>> No, this is just papering over the actual problem.
>>
>> The state inconsistency happens in take_cpu_down() when it returns with a
>> failure from __cpu_disable() because that returns with state = TEARDOWN_CPU
>> and st->state is then incremented in undo_cpu_down().
>>
>> That's the real issue and we need to analyze the whole cpu_down rollback
>> logic first.
> And looking closer this is a general issue. Just that the TEARDOWN state
> makes it simple to observe. It's universaly broken, when the first teardown
> callback fails because, st->state is only decremented _AFTER_ the callback
> returns success, but undo_cpu_down() increments unconditionally.
>
> Patch below.
>
> Thanks,
>
> 	tglx
As per my understanding, there are 2 problems here; one is fixed with 
your patch, and other is cpuhp_reset_state() is used during rollback 
from non-AP to AP state, which seem to result in 2 increments of 
st->state (one increment done by cpuhp_reset_state() and another by 
cpu_thread_fun()) .
> ----
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -916,7 +916,8 @@ static int cpuhp_down_callbacks(unsigned
>   		ret = cpuhp_invoke_callback(cpu, st->state, false, NULL, NULL);
>   		if (ret) {
>   			st->target = prev_state;
> -			undo_cpu_down(cpu, st);
> +			if (st->state < prev_state)
> +				undo_cpu_down(cpu, st);
>   			break;
>   		}
>   	}
> @@ -969,7 +970,7 @@ static int __ref _cpu_down(unsigned int
>   	 * to do the further cleanups.
>   	 */
>   	ret = cpuhp_down_callbacks(cpu, st, target);
> -	if (ret && st->state > CPUHP_TEARDOWN_CPU && st->state < prev_state) {
> +	if (ret && st->state == CPUHP_TEARDOWN_CPU && st->state < prev_state) {
>   		cpuhp_reset_state(st, prev_state);
>   		__cpuhp_kick_ap(st);
>   	}

-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of the Code Aurora Forum, hosted by The Linux Foundation

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] cpu/hotplug: Fix rollback during error-out in takedown_cpu()
  2018-09-05 12:09   ` Mukesh Ojha
@ 2018-09-05 13:14     ` Thomas Gleixner
  0 siblings, 0 replies; 14+ messages in thread
From: Thomas Gleixner @ 2018-09-05 13:14 UTC (permalink / raw)
  To: Mukesh Ojha
  Cc: Neeraj Upadhyay, josh, peterz, mingo, jiangshanlai, dzickus,
	brendan.jackman, malat, linux-kernel, sramana, linux-arm-msm

[-- Attachment #1: Type: text/plain, Size: 1003 bytes --]

On Wed, 5 Sep 2018, Mukesh Ojha wrote:
> On 9/5/2018 5:03 PM, Thomas Gleixner wrote:
> > > +		st->rollback = true;
> > > +		st->target = prev_state;
> > > +		st->bringup = !st->bringup;
> > No, this is just papering over the actual problem.
> > 
> > The state inconsistency happens in take_cpu_down() when it returns with a
> > failure from __cpu_disable() because that returns with state = TEARDOWN_CPU
> > and st->state is then incremented in undo_cpu_down().
> > 
> > That's the real issue and we need to analyze the whole cpu_down rollback
> > logic first.
> 
> Could this be done like below ?

IOW, more papering over the real root cause.

> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index aa7fe85..47bce90 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -802,17 +802,18 @@ static int take_cpu_down(void *_param)
>         int err, cpu = smp_processor_id();
>         int ret;

  ^^^^^^^

Please fix your mailer or your editor. That patch is whitespace damaged.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] cpu/hotplug: Fix rollback during error-out in takedown_cpu()
  2018-09-05 12:51     ` Neeraj Upadhyay
@ 2018-09-05 13:17       ` Thomas Gleixner
  2018-09-06  2:56         ` Neeraj Upadhyay
  0 siblings, 1 reply; 14+ messages in thread
From: Thomas Gleixner @ 2018-09-05 13:17 UTC (permalink / raw)
  To: Neeraj Upadhyay
  Cc: josh, peterz, mingo, jiangshanlai, dzickus, brendan.jackman,
	malat, linux-kernel, sramana, linux-arm-msm

On Wed, 5 Sep 2018, Neeraj Upadhyay wrote:
> On 09/05/2018 05:53 PM, Thomas Gleixner wrote:
> >
> > And looking closer this is a general issue. Just that the TEARDOWN state
> > makes it simple to observe. It's universaly broken, when the first teardown
> > callback fails because, st->state is only decremented _AFTER_ the callback
> > returns success, but undo_cpu_down() increments unconditionally.
> >
>
> As per my understanding, there are 2 problems here; one is fixed with your
> patch, and other is cpuhp_reset_state() is used during rollback from non-AP to
> AP state, which seem to result in 2 increments of st->state (one increment
> done by cpuhp_reset_state() and another by cpu_thread_fun()) .

And how did your hack fix that up magically? I'll have a look later today.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] cpu/hotplug: Fix rollback during error-out in takedown_cpu()
  2018-09-05 12:23   ` Thomas Gleixner
  2018-09-05 12:51     ` Neeraj Upadhyay
@ 2018-09-05 14:53     ` Sudeep Holla
  2018-09-05 15:19       ` Geert Uytterhoeven
  2018-09-06 13:52     ` [tip:smp/urgent] cpu/hotplug: Prevent state corruption on error rollback tip-bot for Thomas Gleixner
  2 siblings, 1 reply; 14+ messages in thread
From: Sudeep Holla @ 2018-09-05 14:53 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Neeraj Upadhyay, josh, peterz, mingo, jiangshanlai, dzickus,
	brendan.jackman, malat, linux-kernel, sramana, linux-arm-msm,
	Geert Uytterhoeven, Lorenzo Pieralisi, Sudeep Holla

On Wed, Sep 05, 2018 at 02:23:46PM +0200, Thomas Gleixner wrote:
> On Wed, 5 Sep 2018, Thomas Gleixner wrote:
> > On Tue, 4 Sep 2018, Neeraj Upadhyay wrote:
> > >  	ret = cpuhp_down_callbacks(cpu, st, target);
> > >  	if (ret && st->state > CPUHP_TEARDOWN_CPU && st->state < prev_state) {
> > > -		cpuhp_reset_state(st, prev_state);
> > > +		/*
> > > +		 * As st->last is not set, cpuhp_reset_state() increments
> > > +		 * st->state, which results in CPUHP_AP_SMPBOOT_THREADS being
> > > +		 * skipped during rollback. So, don't use it here.
> > > +		 */
> > > +		st->rollback = true;
> > > +		st->target = prev_state;
> > > +		st->bringup = !st->bringup;
> > 
> > No, this is just papering over the actual problem.
> > 
> > The state inconsistency happens in take_cpu_down() when it returns with a
> > failure from __cpu_disable() because that returns with state = TEARDOWN_CPU
> > and st->state is then incremented in undo_cpu_down().
> > 
> > That's the real issue and we need to analyze the whole cpu_down rollback
> > logic first.
> 
> And looking closer this is a general issue. Just that the TEARDOWN state
> makes it simple to observe. It's universaly broken, when the first teardown
> callback fails because, st->state is only decremented _AFTER_ the callback
> returns success, but undo_cpu_down() increments unconditionally.
> 
> Patch below.

This patch fixes the issue reported @[1]. Lorenzo did some debugging and
I wanted to have a look at it at some point but this discussion drew my
attention and sounded very similar[2]. So I did a quick test with this
patch and it fixes the issue.

--
Regards,
Sudeep

[1] https://lore.kernel.org/lkml/CAMuHMdVg868LgL5xTg5Dp5rReKxoo+8fRy+ETJiMxGWZCp+hWw@mail.gmail.com/
[2] https://lore.kernel.org/lkml/20180823131505.GA31558@red-moon/

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] cpu/hotplug: Fix rollback during error-out in takedown_cpu()
  2018-09-05 14:53     ` Sudeep Holla
@ 2018-09-05 15:19       ` Geert Uytterhoeven
  0 siblings, 0 replies; 14+ messages in thread
From: Geert Uytterhoeven @ 2018-09-05 15:19 UTC (permalink / raw)
  To: Sudeep Holla
  Cc: Thomas Gleixner, Neeraj Upadhyay, Josh Triplett, Peter Zijlstra,
	Ingo Molnar, Lai Jiangshan, dzickus, Brendan Jackman,
	Mathieu Malaterre, Linux Kernel Mailing List, sramana,
	linux-arm-msm, Lorenzo Pieralisi, Linux-Renesas

Hi Sudeep,

Thanks for the CC!

On Wed, Sep 5, 2018 at 4:54 PM Sudeep Holla <sudeep.holla@arm.com> wrote:
> On Wed, Sep 05, 2018 at 02:23:46PM +0200, Thomas Gleixner wrote:
> > On Wed, 5 Sep 2018, Thomas Gleixner wrote:
> > > On Tue, 4 Sep 2018, Neeraj Upadhyay wrote:
> > > >   ret = cpuhp_down_callbacks(cpu, st, target);
> > > >   if (ret && st->state > CPUHP_TEARDOWN_CPU && st->state < prev_state) {
> > > > -         cpuhp_reset_state(st, prev_state);
> > > > +         /*
> > > > +          * As st->last is not set, cpuhp_reset_state() increments
> > > > +          * st->state, which results in CPUHP_AP_SMPBOOT_THREADS being
> > > > +          * skipped during rollback. So, don't use it here.
> > > > +          */
> > > > +         st->rollback = true;
> > > > +         st->target = prev_state;
> > > > +         st->bringup = !st->bringup;
> > >
> > > No, this is just papering over the actual problem.
> > >
> > > The state inconsistency happens in take_cpu_down() when it returns with a
> > > failure from __cpu_disable() because that returns with state = TEARDOWN_CPU
> > > and st->state is then incremented in undo_cpu_down().
> > >
> > > That's the real issue and we need to analyze the whole cpu_down rollback
> > > logic first.
> >
> > And looking closer this is a general issue. Just that the TEARDOWN state
> > makes it simple to observe. It's universaly broken, when the first teardown
> > callback fails because, st->state is only decremented _AFTER_ the callback
> > returns success, but undo_cpu_down() increments unconditionally.
> >
> > Patch below.
>
> This patch fixes the issue reported @[1]. Lorenzo did some debugging and
> I wanted to have a look at it at some point but this discussion drew my
> attention and sounded very similar[2]. So I did a quick test with this
> patch and it fixes the issue.

> [1] https://lore.kernel.org/lkml/CAMuHMdVg868LgL5xTg5Dp5rReKxoo+8fRy+ETJiMxGWZCp+hWw@mail.gmail.com/
> [2] https://lore.kernel.org/lkml/20180823131505.GA31558@red-moon/

Thomas' patch fixes the issue for me:
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>


Gr{oetje,eeting}s,

                        Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] cpu/hotplug: Fix rollback during error-out in takedown_cpu()
  2018-09-05 13:17       ` Thomas Gleixner
@ 2018-09-06  2:56         ` Neeraj Upadhyay
  2018-09-06  8:18           ` Thomas Gleixner
  0 siblings, 1 reply; 14+ messages in thread
From: Neeraj Upadhyay @ 2018-09-06  2:56 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: josh, peterz, mingo, jiangshanlai, dzickus, brendan.jackman,
	malat, linux-kernel, sramana, linux-arm-msm



On 09/05/2018 06:47 PM, Thomas Gleixner wrote:
> On Wed, 5 Sep 2018, Neeraj Upadhyay wrote:
>> On 09/05/2018 05:53 PM, Thomas Gleixner wrote:
>>> And looking closer this is a general issue. Just that the TEARDOWN state
>>> makes it simple to observe. It's universaly broken, when the first teardown
>>> callback fails because, st->state is only decremented _AFTER_ the callback
>>> returns success, but undo_cpu_down() increments unconditionally.
>>>
>> As per my understanding, there are 2 problems here; one is fixed with your
>> patch, and other is cpuhp_reset_state() is used during rollback from non-AP to
>> AP state, which seem to result in 2 increments of st->state (one increment
>> done by cpuhp_reset_state() and another by cpu_thread_fun()) .
> And how did your hack fix that up magically? I'll have a look later today.
>
> Thanks,
>
> 	tglx

The hack fixes it by not calling cpuhp_reset_state() and doing rollback 
state reset inline in  _cpu_down().




-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of the Code Aurora Forum, hosted by The Linux Foundation

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] cpu/hotplug: Fix rollback during error-out in takedown_cpu()
  2018-09-06  2:56         ` Neeraj Upadhyay
@ 2018-09-06  8:18           ` Thomas Gleixner
  2018-09-06  9:00             ` Neeraj Upadhyay
  0 siblings, 1 reply; 14+ messages in thread
From: Thomas Gleixner @ 2018-09-06  8:18 UTC (permalink / raw)
  To: Neeraj Upadhyay
  Cc: josh, peterz, mingo, jiangshanlai, dzickus, brendan.jackman,
	malat, linux-kernel, sramana, linux-arm-msm

[-- Attachment #1: Type: text/plain, Size: 1311 bytes --]

On Thu, 6 Sep 2018, Neeraj Upadhyay wrote:
> On 09/05/2018 06:47 PM, Thomas Gleixner wrote:
> > On Wed, 5 Sep 2018, Neeraj Upadhyay wrote:
> > > On 09/05/2018 05:53 PM, Thomas Gleixner wrote:
> > > > And looking closer this is a general issue. Just that the TEARDOWN state
> > > > makes it simple to observe. It's universaly broken, when the first
> > > > teardown
> > > > callback fails because, st->state is only decremented _AFTER_ the
> > > > callback
> > > > returns success, but undo_cpu_down() increments unconditionally.
> > > > 
> > > As per my understanding, there are 2 problems here; one is fixed with your
> > > patch, and other is cpuhp_reset_state() is used during rollback from
> > > non-AP to
> > > AP state, which seem to result in 2 increments of st->state (one increment
> > > done by cpuhp_reset_state() and another by cpu_thread_fun()) .
> > And how did your hack fix that up magically? I'll have a look later today.
> > 
> > Thanks,
> > 
> > 	tglx
> 
> The hack fixes it by not calling cpuhp_reset_state() and doing rollback state
> reset inline in  _cpu_down().

And how is that any different from the proper patch I sent? It ends up in
the same state. So I have a hard time to understand your blurb above where
you claim that my patch just solves one of two problems?

Thanks,

	tglx


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] cpu/hotplug: Fix rollback during error-out in takedown_cpu()
  2018-09-06  8:18           ` Thomas Gleixner
@ 2018-09-06  9:00             ` Neeraj Upadhyay
  0 siblings, 0 replies; 14+ messages in thread
From: Neeraj Upadhyay @ 2018-09-06  9:00 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: josh, peterz, mingo, jiangshanlai, dzickus, brendan.jackman,
	malat, linux-kernel, sramana, linux-arm-msm



On 09/06/2018 01:48 PM, Thomas Gleixner wrote:
> On Thu, 6 Sep 2018, Neeraj Upadhyay wrote:
>> On 09/05/2018 06:47 PM, Thomas Gleixner wrote:
>>> On Wed, 5 Sep 2018, Neeraj Upadhyay wrote:
>>>> On 09/05/2018 05:53 PM, Thomas Gleixner wrote:
>>>>> And looking closer this is a general issue. Just that the TEARDOWN state
>>>>> makes it simple to observe. It's universaly broken, when the first
>>>>> teardown
>>>>> callback fails because, st->state is only decremented _AFTER_ the
>>>>> callback
>>>>> returns success, but undo_cpu_down() increments unconditionally.
>>>>>
>>>> As per my understanding, there are 2 problems here; one is fixed with your
>>>> patch, and other is cpuhp_reset_state() is used during rollback from
>>>> non-AP to
>>>> AP state, which seem to result in 2 increments of st->state (one increment
>>>> done by cpuhp_reset_state() and another by cpu_thread_fun()) .
>>> And how did your hack fix that up magically? I'll have a look later today.
>>>
>>> Thanks,
>>>
>>> 	tglx
>>
>> The hack fixes it by not calling cpuhp_reset_state() and doing rollback state
>> reset inline in  _cpu_down().
> 
> And how is that any different from the proper patch I sent? It ends up in
> the same state. So I have a hard time to understand your blurb above where
> you claim that my patch just solves one of two problems?
> 
> Thanks,
> 
> 	tglx
>

Yes, your patch solves the problem related to smpboot unparking being 
skipped during rollback and with the hack we end up in same state. The 
second thing, which I am referring to, is that there is one additional 
state increment. I missed the part that, it could be required, so that 
we reach CPUHP_AP_ONLINE_IDLE before calling __cpuhp_kick_ap(). So, it's 
not a problem.

* 	cpuhp_reset_state() does one state increment and we reach 
CPUHP_AP_ONLINE_IDLE.

	if (ret && st->state > CPUHP_TEARDOWN_CPU && st->state < prev_state) {
		cpuhp_reset_state(st, prev_state);
		__cpuhp_kick_ap(st);
	}

         CPUHP_TEARDOWN_CPU,
         CPUHP_AP_ONLINE_IDLE,
         CPUHP_AP_SMPBOOT_THREADS,


* cpuhp_thread_fun() does one more increment before invoking state 
callback (so, we skip CPUHP_AP_ONLINE_IDLE) and we reach 
CPUHP_AP_SMPBOOT_THREADS:

	static void cpuhp_thread_fun(unsigned int cpu)
	<snip>
		if (bringup) {
			st->state++;
			state = st->state;
	<snip>


-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of the Code Aurora Forum, hosted by The Linux Foundation

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [tip:smp/urgent] cpu/hotplug: Prevent state corruption on error rollback
  2018-09-05 12:23   ` Thomas Gleixner
  2018-09-05 12:51     ` Neeraj Upadhyay
  2018-09-05 14:53     ` Sudeep Holla
@ 2018-09-06 13:52     ` tip-bot for Thomas Gleixner
  2 siblings, 0 replies; 14+ messages in thread
From: tip-bot for Thomas Gleixner @ 2018-09-06 13:52 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: tglx, mingo, neeraju, hpa, linux-kernel, geert+renesas, sudeep.holla

Commit-ID:  69fa6eb7d6a64801ea261025cce9723d9442d773
Gitweb:     https://git.kernel.org/tip/69fa6eb7d6a64801ea261025cce9723d9442d773
Author:     Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Thu, 6 Sep 2018 15:21:38 +0200
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Thu, 6 Sep 2018 15:21:38 +0200

cpu/hotplug: Prevent state corruption on error rollback

When a teardown callback fails, the CPU hotplug code brings the CPU back to
the previous state. The previous state becomes the new target state. The
rollback happens in undo_cpu_down() which increments the state
unconditionally even if the state is already the same as the target.

As a consequence the next CPU hotplug operation will start at the wrong
state. This is easily to observe when __cpu_disable() fails.

Prevent the unconditional undo by checking the state vs. target before
incrementing state and fix up the consequently wrong conditional in the
unplug code which handles the failure of the final CPU take down on the
control CPU side.

Fixes: 4dddfb5faa61 ("smp/hotplug: Rewrite AP state machine core")
Reported-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
Tested-by: Sudeep Holla <sudeep.holla@arm.com>
Tested-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: josh@joshtriplett.org
Cc: peterz@infradead.org
Cc: jiangshanlai@gmail.com
Cc: dzickus@redhat.com
Cc: brendan.jackman@arm.com
Cc: malat@debian.org
Cc: sramana@codeaurora.org
Cc: linux-arm-msm@vger.kernel.org
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1809051419580.1416@nanos.tec.linutronix.de

----
---
 kernel/cpu.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/cpu.c b/kernel/cpu.c
index eb4041f78073..0097acec1c71 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -916,7 +916,8 @@ static int cpuhp_down_callbacks(unsigned int cpu, struct cpuhp_cpu_state *st,
 		ret = cpuhp_invoke_callback(cpu, st->state, false, NULL, NULL);
 		if (ret) {
 			st->target = prev_state;
-			undo_cpu_down(cpu, st);
+			if (st->state < prev_state)
+				undo_cpu_down(cpu, st);
 			break;
 		}
 	}
@@ -969,7 +970,7 @@ static int __ref _cpu_down(unsigned int cpu, int tasks_frozen,
 	 * to do the further cleanups.
 	 */
 	ret = cpuhp_down_callbacks(cpu, st, target);
-	if (ret && st->state > CPUHP_TEARDOWN_CPU && st->state < prev_state) {
+	if (ret && st->state == CPUHP_TEARDOWN_CPU && st->state < prev_state) {
 		cpuhp_reset_state(st, prev_state);
 		__cpuhp_kick_ap(st);
 	}

^ permalink raw reply related	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2018-09-06 13:52 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-09-04  6:33 [PATCH] cpu/hotplug: Fix rollback during error-out in takedown_cpu() Neeraj Upadhyay
2018-09-04 12:12 ` Mukesh Ojha
2018-09-05 11:33 ` Thomas Gleixner
2018-09-05 12:09   ` Mukesh Ojha
2018-09-05 13:14     ` Thomas Gleixner
2018-09-05 12:23   ` Thomas Gleixner
2018-09-05 12:51     ` Neeraj Upadhyay
2018-09-05 13:17       ` Thomas Gleixner
2018-09-06  2:56         ` Neeraj Upadhyay
2018-09-06  8:18           ` Thomas Gleixner
2018-09-06  9:00             ` Neeraj Upadhyay
2018-09-05 14:53     ` Sudeep Holla
2018-09-05 15:19       ` Geert Uytterhoeven
2018-09-06 13:52     ` [tip:smp/urgent] cpu/hotplug: Prevent state corruption on error rollback tip-bot for Thomas Gleixner

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.