linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2] cpu/hotplug: Do not bail-out in DYING/STARTING sections
@ 2022-05-23 16:05 Vincent Donnefort
  2022-05-25 16:52 ` Peter Zijlstra
                   ` (3 more replies)
  0 siblings, 4 replies; 10+ messages in thread
From: Vincent Donnefort @ 2022-05-23 16:05 UTC (permalink / raw)
  To: peterz, tglx
  Cc: linux-kernel, vschneid, kernel-team, Vincent Donnefort, Derek Dolney

The DYING/STARTING callbacks are not expected to fail. However, as reported
by Derek, drivers such as tboot are still free to return errors within
those sections. In that case, there's nothing the hotplug machinery can do,
so let's just proceed and log the failures.

Fixes: 453e41085183 (cpu/hotplug: Add cpuhp_invoke_callback_range())
Reported-by: Derek Dolney <z23@posteo.net>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>

---

v1 -> v2: 
   - Commit message rewording.
   - More details in the warnings.
   - Some variable renaming

diff --git a/kernel/cpu.c b/kernel/cpu.c
index bbad5e375d3b..c3617683459e 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -663,21 +663,51 @@ static bool cpuhp_next_state(bool bringup,
 	return true;
 }
 
-static int cpuhp_invoke_callback_range(bool bringup,
-				       unsigned int cpu,
-				       struct cpuhp_cpu_state *st,
-				       enum cpuhp_state target)
+static int _cpuhp_invoke_callback_range(bool bringup,
+					unsigned int cpu,
+					struct cpuhp_cpu_state *st,
+					enum cpuhp_state target,
+					bool nofail)
 {
 	enum cpuhp_state state;
-	int err = 0;
+	int ret = 0;
 
 	while (cpuhp_next_state(bringup, &state, st, target)) {
+		int err;
+
 		err = cpuhp_invoke_callback(cpu, state, bringup, NULL, NULL);
-		if (err)
+		if (!err)
+			continue;
+
+		if (nofail) {
+			pr_warn("CPU %u %s state %s (%d) failed (%d)\n",
+				cpu, bringup ? "UP" : "DOWN",
+				cpuhp_get_step(st->state)->name,
+				st->state, err);
+			ret = -1;
+		} else {
+			ret = err;
 			break;
+		}
 	}
 
-	return err;
+	return ret;
+}
+
+static inline int cpuhp_invoke_callback_range(bool bringup,
+					      unsigned int cpu,
+					      struct cpuhp_cpu_state *st,
+					      enum cpuhp_state target)
+{
+	return _cpuhp_invoke_callback_range(bringup, cpu, st, target, false);
+}
+
+static inline void cpuhp_invoke_callback_range_nofail(bool bringup,
+						      unsigned int cpu,
+						      struct cpuhp_cpu_state *st,
+						      enum cpuhp_state target)
+{
+	WARN_ON_ONCE(_cpuhp_invoke_callback_range(bringup, cpu, st, target, true));
 }
 
 static inline bool can_rollback_cpu(struct cpuhp_cpu_state *st)
@@ -999,7 +1029,6 @@ static int take_cpu_down(void *_param)
 	struct cpuhp_cpu_state *st = this_cpu_ptr(&cpuhp_state);
 	enum cpuhp_state target = max((int)st->target, CPUHP_AP_OFFLINE);
 	int err, cpu = smp_processor_id();
-	int ret;
 
 	/* Ensure this CPU doesn't handle any more interrupts. */
 	err = __cpu_disable();
@@ -1012,13 +1041,11 @@ static int take_cpu_down(void *_param)
 	 */
 	WARN_ON(st->state != (CPUHP_TEARDOWN_CPU - 1));
 
-	/* Invoke the former CPU_DYING callbacks */
-	ret = cpuhp_invoke_callback_range(false, cpu, st, target);
-
 	/*
+	 * Invoke the former CPU_DYING callbacks
 	 * DYING must not fail!
 	 */
-	WARN_ON_ONCE(ret);
+	cpuhp_invoke_callback_range_nofail(false, cpu, st, target);
 
 	/* Give up timekeeping duties */
 	tick_handover_do_timer();
@@ -1296,16 +1323,14 @@ void notify_cpu_starting(unsigned int cpu)
 {
 	struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu);
 	enum cpuhp_state target = min((int)st->target, CPUHP_AP_ONLINE);
-	int ret;
 
 	rcu_cpu_starting(cpu);	/* Enables RCU usage on this CPU. */
 	cpumask_set_cpu(cpu, &cpus_booted_once_mask);
-	ret = cpuhp_invoke_callback_range(true, cpu, st, target);
 
 	/*
 	 * STARTING must not fail!
 	 */
-	WARN_ON_ONCE(ret);
+	cpuhp_invoke_callback_range_nofail(true, cpu, st, target);
 }
 
 /*
-- 
2.36.1.124.g0e6072fb45-goog


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH v2] cpu/hotplug: Do not bail-out in DYING/STARTING sections
  2022-05-23 16:05 [PATCH v2] cpu/hotplug: Do not bail-out in DYING/STARTING sections Vincent Donnefort
@ 2022-05-25 16:52 ` Peter Zijlstra
  2022-05-26  8:24   ` Vincent Donnefort
  2022-05-26 11:48 ` Derek Dolney
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 10+ messages in thread
From: Peter Zijlstra @ 2022-05-25 16:52 UTC (permalink / raw)
  To: Vincent Donnefort; +Cc: tglx, linux-kernel, vschneid, kernel-team, Derek Dolney

On Mon, May 23, 2022 at 05:05:36PM +0100, Vincent Donnefort wrote:
> The DYING/STARTING callbacks are not expected to fail. However, as reported
> by Derek, drivers such as tboot are still free to return errors within
> those sections. In that case, there's nothing the hotplug machinery can do,
> so let's just proceed and log the failures.
> 

I'm confused. Why isn't this a driver bug?

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2] cpu/hotplug: Do not bail-out in DYING/STARTING sections
  2022-05-25 16:52 ` Peter Zijlstra
@ 2022-05-26  8:24   ` Vincent Donnefort
  2022-05-26 10:15     ` Peter Zijlstra
  0 siblings, 1 reply; 10+ messages in thread
From: Vincent Donnefort @ 2022-05-26  8:24 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: tglx, linux-kernel, vschneid, kernel-team, Derek Dolney

On Wed, May 25, 2022 at 06:52:48PM +0200, Peter Zijlstra wrote:
> On Mon, May 23, 2022 at 05:05:36PM +0100, Vincent Donnefort wrote:
> > The DYING/STARTING callbacks are not expected to fail. However, as reported
> > by Derek, drivers such as tboot are still free to return errors within
> > those sections. In that case, there's nothing the hotplug machinery can do,
> > so let's just proceed and log the failures.
> > 
> 
> I'm confused. Why isn't this a driver bug?

It is a entirely a driver bug which has been reported already. but 453e41085183
(cpu/hotplug: Add cpuhp_invoke_callback_range()) changed the behaviour so I
thought it would be worth to revert to the original one which is to not break
the entire up/down for a single driver error.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2] cpu/hotplug: Do not bail-out in DYING/STARTING sections
  2022-05-26  8:24   ` Vincent Donnefort
@ 2022-05-26 10:15     ` Peter Zijlstra
  2022-05-29 18:01       ` Derek Dolney
  0 siblings, 1 reply; 10+ messages in thread
From: Peter Zijlstra @ 2022-05-26 10:15 UTC (permalink / raw)
  To: Vincent Donnefort; +Cc: tglx, linux-kernel, vschneid, kernel-team, Derek Dolney

On Thu, May 26, 2022 at 09:24:28AM +0100, Vincent Donnefort wrote:
> On Wed, May 25, 2022 at 06:52:48PM +0200, Peter Zijlstra wrote:
> > On Mon, May 23, 2022 at 05:05:36PM +0100, Vincent Donnefort wrote:
> > > The DYING/STARTING callbacks are not expected to fail. However, as reported
> > > by Derek, drivers such as tboot are still free to return errors within
> > > those sections. In that case, there's nothing the hotplug machinery can do,
> > > so let's just proceed and log the failures.
> > > 
> > 
> > I'm confused. Why isn't this a driver bug?
> 
> It is a entirely a driver bug which has been reported already. but 453e41085183
> (cpu/hotplug: Add cpuhp_invoke_callback_range()) changed the behaviour so I
> thought it would be worth to revert to the original one which is to not break
> the entire up/down for a single driver error.

Ah I see. Fair enough I suppose.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2] cpu/hotplug: Do not bail-out in DYING/STARTING sections
  2022-05-23 16:05 [PATCH v2] cpu/hotplug: Do not bail-out in DYING/STARTING sections Vincent Donnefort
  2022-05-25 16:52 ` Peter Zijlstra
@ 2022-05-26 11:48 ` Derek Dolney
  2022-05-26 12:27 ` Derek Dolney
  2022-06-13 12:36 ` Thomas Gleixner
  3 siblings, 0 replies; 10+ messages in thread
From: Derek Dolney @ 2022-05-26 11:48 UTC (permalink / raw)
  To: Vincent Donnefort, peterz, tglx; +Cc: linux-kernel, vschneid, kernel-team

I tested this patch on the 5.12 commit that broke suspend and also on
the latest git 5.18 branch and this is good, suspend and resume are
working again.

Derek

On 5/23/22 12:05 PM, Vincent Donnefort wrote:
> The DYING/STARTING callbacks are not expected to fail. However, as reported
> by Derek, drivers such as tboot are still free to return errors within
> those sections. In that case, there's nothing the hotplug machinery can do,
> so let's just proceed and log the failures.
> 
> Fixes: 453e41085183 (cpu/hotplug: Add cpuhp_invoke_callback_range())
> Reported-by: Derek Dolney <z23@posteo.net>
> Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
> 
> ---
> 
> v1 -> v2: 
>    - Commit message rewording.
>    - More details in the warnings.
>    - Some variable renaming
> 
> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index bbad5e375d3b..c3617683459e 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -663,21 +663,51 @@ static bool cpuhp_next_state(bool bringup,
>  	return true;
>  }
>  
> -static int cpuhp_invoke_callback_range(bool bringup,
> -				       unsigned int cpu,
> -				       struct cpuhp_cpu_state *st,
> -				       enum cpuhp_state target)
> +static int _cpuhp_invoke_callback_range(bool bringup,
> +					unsigned int cpu,
> +					struct cpuhp_cpu_state *st,
> +					enum cpuhp_state target,
> +					bool nofail)
>  {
>  	enum cpuhp_state state;
> -	int err = 0;
> +	int ret = 0;
>  
>  	while (cpuhp_next_state(bringup, &state, st, target)) {
> +		int err;
> +
>  		err = cpuhp_invoke_callback(cpu, state, bringup, NULL, NULL);
> -		if (err)
> +		if (!err)
> +			continue;
> +
> +		if (nofail) {
> +			pr_warn("CPU %u %s state %s (%d) failed (%d)\n",
> +				cpu, bringup ? "UP" : "DOWN",
> +				cpuhp_get_step(st->state)->name,
> +				st->state, err);
> +			ret = -1;
> +		} else {
> +			ret = err;
>  			break;
> +		}
>  	}
>  
> -	return err;
> +	return ret;
> +}
> +
> +static inline int cpuhp_invoke_callback_range(bool bringup,
> +					      unsigned int cpu,
> +					      struct cpuhp_cpu_state *st,
> +					      enum cpuhp_state target)
> +{
> +	return _cpuhp_invoke_callback_range(bringup, cpu, st, target, false);
> +}
> +
> +static inline void cpuhp_invoke_callback_range_nofail(bool bringup,
> +						      unsigned int cpu,
> +						      struct cpuhp_cpu_state *st,
> +						      enum cpuhp_state target)
> +{
> +	WARN_ON_ONCE(_cpuhp_invoke_callback_range(bringup, cpu, st, target, true));
>  }
>  
>  static inline bool can_rollback_cpu(struct cpuhp_cpu_state *st)
> @@ -999,7 +1029,6 @@ static int take_cpu_down(void *_param)
>  	struct cpuhp_cpu_state *st = this_cpu_ptr(&cpuhp_state);
>  	enum cpuhp_state target = max((int)st->target, CPUHP_AP_OFFLINE);
>  	int err, cpu = smp_processor_id();
> -	int ret;
>  
>  	/* Ensure this CPU doesn't handle any more interrupts. */
>  	err = __cpu_disable();
> @@ -1012,13 +1041,11 @@ static int take_cpu_down(void *_param)
>  	 */
>  	WARN_ON(st->state != (CPUHP_TEARDOWN_CPU - 1));
>  
> -	/* Invoke the former CPU_DYING callbacks */
> -	ret = cpuhp_invoke_callback_range(false, cpu, st, target);
> -
>  	/*
> +	 * Invoke the former CPU_DYING callbacks
>  	 * DYING must not fail!
>  	 */
> -	WARN_ON_ONCE(ret);
> +	cpuhp_invoke_callback_range_nofail(false, cpu, st, target);
>  
>  	/* Give up timekeeping duties */
>  	tick_handover_do_timer();
> @@ -1296,16 +1323,14 @@ void notify_cpu_starting(unsigned int cpu)
>  {
>  	struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu);
>  	enum cpuhp_state target = min((int)st->target, CPUHP_AP_ONLINE);
> -	int ret;
>  
>  	rcu_cpu_starting(cpu);	/* Enables RCU usage on this CPU. */
>  	cpumask_set_cpu(cpu, &cpus_booted_once_mask);
> -	ret = cpuhp_invoke_callback_range(true, cpu, st, target);
>  
>  	/*
>  	 * STARTING must not fail!
>  	 */
> -	WARN_ON_ONCE(ret);
> +	cpuhp_invoke_callback_range_nofail(true, cpu, st, target);
>  }
>  
>  /*
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2] cpu/hotplug: Do not bail-out in DYING/STARTING sections
  2022-05-23 16:05 [PATCH v2] cpu/hotplug: Do not bail-out in DYING/STARTING sections Vincent Donnefort
  2022-05-25 16:52 ` Peter Zijlstra
  2022-05-26 11:48 ` Derek Dolney
@ 2022-05-26 12:27 ` Derek Dolney
  2022-06-13 12:36 ` Thomas Gleixner
  3 siblings, 0 replies; 10+ messages in thread
From: Derek Dolney @ 2022-05-26 12:27 UTC (permalink / raw)
  To: Vincent Donnefort, peterz, tglx; +Cc: linux-kernel, vschneid, kernel-team

Tested by: Derek Dolney <z23@posteo.net>

Suspend and resume with tboot are working again with this patch. Tested
on 5.12 broken commit and latest git 5.18.

On 5/23/22 12:05 PM, Vincent Donnefort wrote:
> The DYING/STARTING callbacks are not expected to fail. However, as reported
> by Derek, drivers such as tboot are still free to return errors within
> those sections. In that case, there's nothing the hotplug machinery can do,
> so let's just proceed and log the failures.
> 
> Fixes: 453e41085183 (cpu/hotplug: Add cpuhp_invoke_callback_range())
> Reported-by: Derek Dolney <z23@posteo.net>
> Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
> 
> ---
> 
> v1 -> v2: 
>    - Commit message rewording.
>    - More details in the warnings.
>    - Some variable renaming
> 
> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index bbad5e375d3b..c3617683459e 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -663,21 +663,51 @@ static bool cpuhp_next_state(bool bringup,
>  	return true;
>  }
>  
> -static int cpuhp_invoke_callback_range(bool bringup,
> -				       unsigned int cpu,
> -				       struct cpuhp_cpu_state *st,
> -				       enum cpuhp_state target)
> +static int _cpuhp_invoke_callback_range(bool bringup,
> +					unsigned int cpu,
> +					struct cpuhp_cpu_state *st,
> +					enum cpuhp_state target,
> +					bool nofail)
>  {
>  	enum cpuhp_state state;
> -	int err = 0;
> +	int ret = 0;
>  
>  	while (cpuhp_next_state(bringup, &state, st, target)) {
> +		int err;
> +
>  		err = cpuhp_invoke_callback(cpu, state, bringup, NULL, NULL);
> -		if (err)
> +		if (!err)
> +			continue;
> +
> +		if (nofail) {
> +			pr_warn("CPU %u %s state %s (%d) failed (%d)\n",
> +				cpu, bringup ? "UP" : "DOWN",
> +				cpuhp_get_step(st->state)->name,
> +				st->state, err);
> +			ret = -1;
> +		} else {
> +			ret = err;
>  			break;
> +		}
>  	}
>  
> -	return err;
> +	return ret;
> +}
> +
> +static inline int cpuhp_invoke_callback_range(bool bringup,
> +					      unsigned int cpu,
> +					      struct cpuhp_cpu_state *st,
> +					      enum cpuhp_state target)
> +{
> +	return _cpuhp_invoke_callback_range(bringup, cpu, st, target, false);
> +}
> +
> +static inline void cpuhp_invoke_callback_range_nofail(bool bringup,
> +						      unsigned int cpu,
> +						      struct cpuhp_cpu_state *st,
> +						      enum cpuhp_state target)
> +{
> +	WARN_ON_ONCE(_cpuhp_invoke_callback_range(bringup, cpu, st, target, true));
>  }
>  
>  static inline bool can_rollback_cpu(struct cpuhp_cpu_state *st)
> @@ -999,7 +1029,6 @@ static int take_cpu_down(void *_param)
>  	struct cpuhp_cpu_state *st = this_cpu_ptr(&cpuhp_state);
>  	enum cpuhp_state target = max((int)st->target, CPUHP_AP_OFFLINE);
>  	int err, cpu = smp_processor_id();
> -	int ret;
>  
>  	/* Ensure this CPU doesn't handle any more interrupts. */
>  	err = __cpu_disable();
> @@ -1012,13 +1041,11 @@ static int take_cpu_down(void *_param)
>  	 */
>  	WARN_ON(st->state != (CPUHP_TEARDOWN_CPU - 1));
>  
> -	/* Invoke the former CPU_DYING callbacks */
> -	ret = cpuhp_invoke_callback_range(false, cpu, st, target);
> -
>  	/*
> +	 * Invoke the former CPU_DYING callbacks
>  	 * DYING must not fail!
>  	 */
> -	WARN_ON_ONCE(ret);
> +	cpuhp_invoke_callback_range_nofail(false, cpu, st, target);
>  
>  	/* Give up timekeeping duties */
>  	tick_handover_do_timer();
> @@ -1296,16 +1323,14 @@ void notify_cpu_starting(unsigned int cpu)
>  {
>  	struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu);
>  	enum cpuhp_state target = min((int)st->target, CPUHP_AP_ONLINE);
> -	int ret;
>  
>  	rcu_cpu_starting(cpu);	/* Enables RCU usage on this CPU. */
>  	cpumask_set_cpu(cpu, &cpus_booted_once_mask);
> -	ret = cpuhp_invoke_callback_range(true, cpu, st, target);
>  
>  	/*
>  	 * STARTING must not fail!
>  	 */
> -	WARN_ON_ONCE(ret);
> +	cpuhp_invoke_callback_range_nofail(true, cpu, st, target);
>  }
>  
>  /*
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2] cpu/hotplug: Do not bail-out in DYING/STARTING sections
  2022-05-26 10:15     ` Peter Zijlstra
@ 2022-05-29 18:01       ` Derek Dolney
  0 siblings, 0 replies; 10+ messages in thread
From: Derek Dolney @ 2022-05-29 18:01 UTC (permalink / raw)
  To: Peter Zijlstra, Vincent Donnefort
  Cc: tglx, linux-kernel, vschneid, kernel-team

FYI there is also now a patch to fix the driver bug in testing by the
tboot devs at the moment, you could monitor the progress here:

https://sourceforge.net/p/tboot/mailman/message/37659164/

I tested this patch and it works for me.

Derek

On 5/26/22 6:15 AM, Peter Zijlstra wrote:
> On Thu, May 26, 2022 at 09:24:28AM +0100, Vincent Donnefort wrote:
>> On Wed, May 25, 2022 at 06:52:48PM +0200, Peter Zijlstra wrote:
>>> On Mon, May 23, 2022 at 05:05:36PM +0100, Vincent Donnefort wrote:
>>>> The DYING/STARTING callbacks are not expected to fail. However, as reported
>>>> by Derek, drivers such as tboot are still free to return errors within
>>>> those sections. In that case, there's nothing the hotplug machinery can do,
>>>> so let's just proceed and log the failures.
>>>>
>>>
>>> I'm confused. Why isn't this a driver bug?
>>
>> It is a entirely a driver bug which has been reported already. but 453e41085183
>> (cpu/hotplug: Add cpuhp_invoke_callback_range()) changed the behaviour so I
>> thought it would be worth to revert to the original one which is to not break
>> the entire up/down for a single driver error.
> 
> Ah I see. Fair enough I suppose.
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2] cpu/hotplug: Do not bail-out in DYING/STARTING sections
  2022-05-23 16:05 [PATCH v2] cpu/hotplug: Do not bail-out in DYING/STARTING sections Vincent Donnefort
                   ` (2 preceding siblings ...)
  2022-05-26 12:27 ` Derek Dolney
@ 2022-06-13 12:36 ` Thomas Gleixner
  2022-06-13 13:37   ` Vincent Donnefort
  3 siblings, 1 reply; 10+ messages in thread
From: Thomas Gleixner @ 2022-06-13 12:36 UTC (permalink / raw)
  To: Vincent Donnefort, peterz
  Cc: linux-kernel, vschneid, kernel-team, Vincent Donnefort, Derek Dolney

Vincent,

On Mon, May 23 2022 at 17:05, Vincent Donnefort wrote:
> +static int _cpuhp_invoke_callback_range(bool bringup,
> +					unsigned int cpu,
> +					struct cpuhp_cpu_state *st,
> +					enum cpuhp_state target,
> +					bool nofail)
>  {
>  	enum cpuhp_state state;
> -	int err = 0;
> +	int ret = 0;
>  
>  	while (cpuhp_next_state(bringup, &state, st, target)) {
> +		int err;
> +
>  		err = cpuhp_invoke_callback(cpu, state, bringup, NULL, NULL);
> -		if (err)
> +		if (!err)
> +			continue;
> +
> +		if (nofail) {
> +			pr_warn("CPU %u %s state %s (%d) failed (%d)\n",
> +				cpu, bringup ? "UP" : "DOWN",
> +				cpuhp_get_step(st->state)->name,
> +				st->state, err);
> +			ret = -1;

I have a hard time to map this to the changelog:

> those sections. In that case, there's nothing the hotplug machinery can do,
> so let's just proceed and log the failures.

That's still returning an error code at the end. Confused.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2] cpu/hotplug: Do not bail-out in DYING/STARTING sections
  2022-06-13 12:36 ` Thomas Gleixner
@ 2022-06-13 13:37   ` Vincent Donnefort
  2022-07-04 10:01     ` Thorsten Leemhuis
  0 siblings, 1 reply; 10+ messages in thread
From: Vincent Donnefort @ 2022-06-13 13:37 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: peterz, linux-kernel, vschneid, kernel-team, Derek Dolney

On Mon, Jun 13, 2022 at 02:36:18PM +0200, Thomas Gleixner wrote:
> Vincent,
> 
> On Mon, May 23 2022 at 17:05, Vincent Donnefort wrote:
> > +static int _cpuhp_invoke_callback_range(bool bringup,
> > +					unsigned int cpu,
> > +					struct cpuhp_cpu_state *st,
> > +					enum cpuhp_state target,
> > +					bool nofail)
> >  {
> >  	enum cpuhp_state state;
> > -	int err = 0;
> > +	int ret = 0;
> >  
> >  	while (cpuhp_next_state(bringup, &state, st, target)) {
> > +		int err;
> > +
> >  		err = cpuhp_invoke_callback(cpu, state, bringup, NULL, NULL);
> > -		if (err)
> > +		if (!err)
> > +			continue;
> > +
> > +		if (nofail) {
> > +			pr_warn("CPU %u %s state %s (%d) failed (%d)\n",
> > +				cpu, bringup ? "UP" : "DOWN",
> > +				cpuhp_get_step(st->state)->name,
> > +				st->state, err);
> > +			ret = -1;
> 
> I have a hard time to map this to the changelog:
> 
> > those sections. In that case, there's nothing the hotplug machinery can do,
> > so let's just proceed and log the failures.
> 
> That's still returning an error code at the end. Confused.

It is, but after returning from this function, only a warning will be raised
(cpuhp_invoke_callback_range_nofail()) instead of stopping the HP machinery
(cpuhp_invoke_callback_range()). How about this changelog?

  The DYING/STARTING callbacks are not expected to fail. However, as reported by
  Derek, drivers such as tboot are still free to return errors within those
  sections, which halts the hot(un)plug and leaves the CPU in an unrecoverable
  state.
  
  No rollback being possible there, let's only log the failures and proceed
  with the following steps. This restores the hotplug behaviour prior to
  453e41085183 (cpu/hotplug: Add cpuhp_invoke_callback_range())

> 
> Thanks,
> 
>         tglx

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2] cpu/hotplug: Do not bail-out in DYING/STARTING sections
  2022-06-13 13:37   ` Vincent Donnefort
@ 2022-07-04 10:01     ` Thorsten Leemhuis
  0 siblings, 0 replies; 10+ messages in thread
From: Thorsten Leemhuis @ 2022-07-04 10:01 UTC (permalink / raw)
  To: Vincent Donnefort, Thomas Gleixner
  Cc: peterz, linux-kernel, vschneid, kernel-team, Derek Dolney

On 13.06.22 15:37, Vincent Donnefort wrote:
> On Mon, Jun 13, 2022 at 02:36:18PM +0200, Thomas Gleixner wrote:
>> Vincent,
>>
>> On Mon, May 23 2022 at 17:05, Vincent Donnefort wrote:
>>> +static int _cpuhp_invoke_callback_range(bool bringup,
>>> +					unsigned int cpu,
>>> +					struct cpuhp_cpu_state *st,
>>> +					enum cpuhp_state target,
>>> +					bool nofail)
>>>  {
>>>  	enum cpuhp_state state;
>>> -	int err = 0;
>>> +	int ret = 0;
>>>  
>>>  	while (cpuhp_next_state(bringup, &state, st, target)) {
>>> +		int err;
>>> +
>>>  		err = cpuhp_invoke_callback(cpu, state, bringup, NULL, NULL);
>>> -		if (err)
>>> +		if (!err)
>>> +			continue;
>>> +
>>> +		if (nofail) {
>>> +			pr_warn("CPU %u %s state %s (%d) failed (%d)\n",
>>> +				cpu, bringup ? "UP" : "DOWN",
>>> +				cpuhp_get_step(st->state)->name,
>>> +				st->state, err);
>>> +			ret = -1;
>>
>> I have a hard time to map this to the changelog:
>>
>>> those sections. In that case, there's nothing the hotplug machinery can do,
>>> so let's just proceed and log the failures.
>>
>> That's still returning an error code at the end. Confused.
> 
> It is, but after returning from this function, only a warning will be raised
> (cpuhp_invoke_callback_range_nofail()) instead of stopping the HP machinery
> (cpuhp_invoke_callback_range()). How about this changelog?
> 
>   The DYING/STARTING callbacks are not expected to fail. However, as reported by
>   Derek, drivers such as tboot are still free to return errors within those
>   sections, which halts the hot(un)plug and leaves the CPU in an unrecoverable
>   state.
>   
>   No rollback being possible there, let's only log the failures and proceed
>   with the following steps. This restores the hotplug behaviour prior to
>   453e41085183 (cpu/hotplug: Add cpuhp_invoke_callback_range())

Vincent, what's up here? Did that patch make it further? It looks to me
like things stalled here, but maybe I'm missing something. I'm asking
because that fix was supposed to fix a regression I'm tracking.

BTW, if you respin this patch, could you please add proper 'Link:' tags
pointing to all reports about this issue? e.g. like this:

 Link: https://bugzilla.kernel.org/show_bug.cgi?id=215867

These tags are important, as they allow others to look into the
backstory now and years from now. That is why they should be placed in
cases like this, as Documentation/process/submitting-patches.rst and
Documentation/process/5.Posting.rst explain in more detail.
Additionally, my regression tracking bot ‘regzbot’ relies on these tags
to automatically connect reports with patches that are posted or
committed to fix the reported issue.

Ciao, Thorsten


^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2022-07-04 10:04 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-05-23 16:05 [PATCH v2] cpu/hotplug: Do not bail-out in DYING/STARTING sections Vincent Donnefort
2022-05-25 16:52 ` Peter Zijlstra
2022-05-26  8:24   ` Vincent Donnefort
2022-05-26 10:15     ` Peter Zijlstra
2022-05-29 18:01       ` Derek Dolney
2022-05-26 11:48 ` Derek Dolney
2022-05-26 12:27 ` Derek Dolney
2022-06-13 12:36 ` Thomas Gleixner
2022-06-13 13:37   ` Vincent Donnefort
2022-07-04 10:01     ` Thorsten Leemhuis

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).