[PATCH RFC 1/1] kernel/cpu: to track which CPUHP callback is failed

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH RFC 1/1] kernel/cpu: to track which CPUHP callback is failed
@ 2021-03-10  7:08 Dongli Zhang
  2021-04-05 21:22 ` Dongli Zhang
  0 siblings, 1 reply; 4+ messages in thread
From: Dongli Zhang @ 2021-03-10  7:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, qais.yousef, peterz, mpe, aneesh.kumar, ethp, npiggin,
	joe.jin, dongli.zhang

During bootup or cpu hotplug, the cpuhp_up_callbacks() calls many CPUHP
callbacks (e.g., perf, mm, workqueue, RCU, kvmclock and more) for each
cpu to online. It may roll back to its previous state if any of
callbacks is failed. As a result, the user will not be able to know
which callback is failed and usually the only symptom is cpu online
failure.

The error log is printed for once to have confirm which callback is
failed.

Cc: Joe Jin <joe.jin@oracle.com>
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
---
I used 'RFC' because WARN_ON_ONCE() is always used for the result from
cpuhp_invoke_callback(). I would prefer to get feedback from
maintainers/reviewers. Here I prefer to print the cpuhp name and state
value to help confirm the specific callback that is failed.

 kernel/cpu.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/cpu.c b/kernel/cpu.c
index 1b6302ecbabe..c7a719079272 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -621,6 +621,10 @@ static int cpuhp_up_callbacks(unsigned int cpu, struct cpuhp_cpu_state *st,
 		st->state++;
 		ret = cpuhp_invoke_callback(cpu, st->state, true, NULL, NULL);
 		if (ret) {
+			pr_err_once("CPUHP callback failure (%d) for cpu %u at %s (%d)\n",
+				    ret, cpu, cpuhp_get_step(st->state)->name,
+				    st->state);
+
 			if (can_rollback_cpu(st)) {
 				st->target = prev_state;
 				undo_cpu_up(cpu, st);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH RFC 1/1] kernel/cpu: to track which CPUHP callback is failed
  2021-03-10  7:08 [PATCH RFC 1/1] kernel/cpu: to track which CPUHP callback is failed Dongli Zhang
@ 2021-04-05 21:22 ` Dongli Zhang
  2021-04-06 16:10   ` Qais Yousef
  0 siblings, 1 reply; 4+ messages in thread
From: Dongli Zhang @ 2021-04-05 21:22 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, qais.yousef, peterz, mpe, aneesh.kumar, ethp, npiggin, joe.jin

May I have if there is any feedback on this? To pr_err_once() here helps narrow
down what is the root cause of cpu online failure.


The issue fixed by d7eb79c6290c ("KVM: kvmclock: Fix vCPUs > 64 can't be
online/hotpluged") is able to demonstrate how this pr_err_once() helps.

Due to VM kernel issue, at most 64 VCPUs can online if host clocksource is
switched to hpet.

# echo hpet > /sys/devices/system/clocksource/clocksource0/current_clocksource


By default, we have no idea why only 64 out of 100 VCPUs are online in VM. We
need to customize kernel to debug why some CPUs are not able to online.


We will have below and know the root cause is with kvmclock, if we add the
pr_err_once().

[    0.693112] CPUHP callback failure (-12) for cpu 64 at kvmclock:setup_percpu (66)

Thank you very much!

Dongli Zhang


On 3/9/21 11:08 PM, Dongli Zhang wrote:
> During bootup or cpu hotplug, the cpuhp_up_callbacks() calls many CPUHP
> callbacks (e.g., perf, mm, workqueue, RCU, kvmclock and more) for each
> cpu to online. It may roll back to its previous state if any of
> callbacks is failed. As a result, the user will not be able to know
> which callback is failed and usually the only symptom is cpu online
> failure.
> 
> The error log is printed for once to have confirm which callback is
> failed.
> 
> Cc: Joe Jin <joe.jin@oracle.com>
> Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
> ---
> I used 'RFC' because WARN_ON_ONCE() is always used for the result from
> cpuhp_invoke_callback(). I would prefer to get feedback from
> maintainers/reviewers. Here I prefer to print the cpuhp name and state
> value to help confirm the specific callback that is failed.
> 
>  kernel/cpu.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index 1b6302ecbabe..c7a719079272 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -621,6 +621,10 @@ static int cpuhp_up_callbacks(unsigned int cpu, struct cpuhp_cpu_state *st,
>  		st->state++;
>  		ret = cpuhp_invoke_callback(cpu, st->state, true, NULL, NULL);
>  		if (ret) {
> +			pr_err_once("CPUHP callback failure (%d) for cpu %u at %s (%d)\n",
> +				    ret, cpu, cpuhp_get_step(st->state)->name,
> +				    st->state);
> +
>  			if (can_rollback_cpu(st)) {
>  				st->target = prev_state;
>  				undo_cpu_up(cpu, st);
> 

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH RFC 1/1] kernel/cpu: to track which CPUHP callback is failed
  2021-04-05 21:22 ` Dongli Zhang
@ 2021-04-06 16:10   ` Qais Yousef
  2021-04-07 22:13     ` Dongli Zhang
  0 siblings, 1 reply; 4+ messages in thread
From: Qais Yousef @ 2021-04-06 16:10 UTC (permalink / raw)
  To: Dongli Zhang
  Cc: linux-kernel, tglx, peterz, mpe, aneesh.kumar, ethp, npiggin, joe.jin

On 04/05/21 14:22, Dongli Zhang wrote:
> May I have if there is any feedback on this? To pr_err_once() here helps narrow
> down what is the root cause of cpu online failure.
> 
> 
> The issue fixed by d7eb79c6290c ("KVM: kvmclock: Fix vCPUs > 64 can't be
> online/hotpluged") is able to demonstrate how this pr_err_once() helps.
> 
> Due to VM kernel issue, at most 64 VCPUs can online if host clocksource is
> switched to hpet.
> 
> # echo hpet > /sys/devices/system/clocksource/clocksource0/current_clocksource
> 
> 
> By default, we have no idea why only 64 out of 100 VCPUs are online in VM. We
> need to customize kernel to debug why some CPUs are not able to online.
> 
> 
> We will have below and know the root cause is with kvmclock, if we add the
> pr_err_once().

Isn't pr_debug() more appropriate here? Don't you want one on the
cpuhp_down_callbacks() too?

I *think* it's common now to have CONFIG_DYNAMIC_DEBUG enabled so one can
enable that line when they need to

	https://www.kernel.org/doc/html/latest/admin-guide/dynamic-debug-howto.html

I can see the appeal of pr_err_once() but I think this can fail for legitimate
reasons and is not necessarily strictly always an error?

Thanks

--
Qais Yousef

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH RFC 1/1] kernel/cpu: to track which CPUHP callback is failed
  2021-04-06 16:10   ` Qais Yousef
@ 2021-04-07 22:13     ` Dongli Zhang
  0 siblings, 0 replies; 4+ messages in thread
From: Dongli Zhang @ 2021-04-07 22:13 UTC (permalink / raw)
  To: Qais Yousef
  Cc: linux-kernel, tglx, peterz, mpe, aneesh.kumar, ethp, npiggin, joe.jin

On 4/6/21 9:10 AM, Qais Yousef wrote:
> On 04/05/21 14:22, Dongli Zhang wrote:
>> May I have if there is any feedback on this? To pr_err_once() here helps narrow
>> down what is the root cause of cpu online failure.
>>
>>
>> The issue fixed by d7eb79c6290c ("KVM: kvmclock: Fix vCPUs > 64 can't be
>> online/hotpluged") is able to demonstrate how this pr_err_once() helps.
>>
>> Due to VM kernel issue, at most 64 VCPUs can online if host clocksource is
>> switched to hpet.
>>
>> # echo hpet > /sys/devices/system/clocksource/clocksource0/current_clocksource
>>
>>
>> By default, we have no idea why only 64 out of 100 VCPUs are online in VM. We
>> need to customize kernel to debug why some CPUs are not able to online.
>>
>>
>> We will have below and know the root cause is with kvmclock, if we add the
>> pr_err_once().
> 
> Isn't pr_debug() more appropriate here? Don't you want one on the
> cpuhp_down_callbacks() too?

I will add one to cpuhp_down_callbacks() as well. Thank you very much for the
suggestion.

> 
> I *think* it's common now to have CONFIG_DYNAMIC_DEBUG enabled so one can
> enable that line when they need to
> 
> 	https://www.kernel.org/doc/html/latest/admin-guide/dynamic-debug-howto.html
> 
> I can see the appeal of pr_err_once() but I think this can fail for legitimate
> reasons and is not necessarily strictly always an error?

The reason I used pr_err_once() is to remind the sysadmin something bad (maybe
not necessarily strictly as error) used to happen. It is quite helpful for
developer/sysadmin that is not familiar with cpu/smp/bootup/online.

Otherwise, someone will need to manually configure kernel command line with
"loglevel=8 dyndbg="file kernel/cpu.c +p".

The good thing with pr_err_once is it is not based on the assumption that
someone already know the root cause is related to kernel/cpu.c.

The good thing with pr_debug is we will not need to print for at most once.

I will send v2 with pr_debug() and let all maintainers to decide if it is
fine/required to print something here. Some places use WARN_ON_ONCE() for the
return value of cpuhp_invoke_callback(). I prefer to use pr_debug/pr_err to
print extra debug information for cpu online/hotplug case.

Thank you very much!

Dongli Zhang

> 
> Thanks
> 
> --
> Qais Yousef
> 

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2021-04-07 22:14 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-10  7:08 [PATCH RFC 1/1] kernel/cpu: to track which CPUHP callback is failed Dongli Zhang
2021-04-05 21:22 ` Dongli Zhang
2021-04-06 16:10   ` Qais Yousef
2021-04-07 22:13     ` Dongli Zhang

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.