All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] Thermal: Fix lockup of cpu_down()
@ 2013-07-16 18:02 Steven Rostedt
  2013-07-16 18:19 ` Srinivas Pandruvada
  2013-07-16 18:34 ` Srinivas Pandruvada
  0 siblings, 2 replies; 5+ messages in thread
From: Steven Rostedt @ 2013-07-16 18:02 UTC (permalink / raw)
  To: LKML; +Cc: Srinivas Pandruvada, Zhang Rui, Andrew Morton, Tejun Heo

Commit f1a18a105 "Thermal: CPU Package temperature thermal" had code
that did a get_online_cpus(), run a loop and then do a
put_online_cpus(). The problem is that the loop had an error exit that
would skip the put_online_cpus() part.

In the error exit part of the function, it also did a get_online_cpus(),
run a loop and then put_online_cpus(). The only way to get to the error
exit part is with get_online_cpus() already performed. If this error
condition is hit, the system will be prevented from taking CPUs offline.
The process taking the CPU offline will lock up hard.

Removing the get_online_cpus() removes the lockup as the hotplug CPU
refcount is back to zero.

This was bisected with ktest.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

diff --git a/drivers/thermal/x86_pkg_temp_thermal.c b/drivers/thermal/x86_pkg_temp_thermal.c
index 5de56f6..d47624c 100644
--- a/drivers/thermal/x86_pkg_temp_thermal.c
+++ b/drivers/thermal/x86_pkg_temp_thermal.c
@@ -592,7 +592,6 @@ static int __init pkg_temp_thermal_init(void)
 	return 0;
 
 err_ret:
-	get_online_cpus();
 	for_each_online_cpu(i)
 		put_core_offline(i);
 	put_online_cpus();



^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH] Thermal: Fix lockup of cpu_down()
  2013-07-16 18:02 [PATCH] Thermal: Fix lockup of cpu_down() Steven Rostedt
@ 2013-07-16 18:19 ` Srinivas Pandruvada
  2013-07-16 18:33   ` Steven Rostedt
  2013-07-16 18:34 ` Srinivas Pandruvada
  1 sibling, 1 reply; 5+ messages in thread
From: Srinivas Pandruvada @ 2013-07-16 18:19 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: LKML, Zhang Rui, Andrew Morton, Tejun Heo

Thanks. How did you trigger this error condition? Is it a code review or 
you have some way to reproduce?

Regards,
Srinivas

On 07/16/2013 11:02 AM, Steven Rostedt wrote:
> Commit f1a18a105 "Thermal: CPU Package temperature thermal" had code
> that did a get_online_cpus(), run a loop and then do a
> put_online_cpus(). The problem is that the loop had an error exit that
> would skip the put_online_cpus() part.
>
> In the error exit part of the function, it also did a get_online_cpus(),
> run a loop and then put_online_cpus(). The only way to get to the error
> exit part is with get_online_cpus() already performed. If this error
> condition is hit, the system will be prevented from taking CPUs offline.
> The process taking the CPU offline will lock up hard.
>
> Removing the get_online_cpus() removes the lockup as the hotplug CPU
> refcount is back to zero.
>
> This was bisected with ktest.
>
> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
>
> diff --git a/drivers/thermal/x86_pkg_temp_thermal.c b/drivers/thermal/x86_pkg_temp_thermal.c
> index 5de56f6..d47624c 100644
> --- a/drivers/thermal/x86_pkg_temp_thermal.c
> +++ b/drivers/thermal/x86_pkg_temp_thermal.c
> @@ -592,7 +592,6 @@ static int __init pkg_temp_thermal_init(void)
>   	return 0;
>   
>   err_ret:
> -	get_online_cpus();
>   	for_each_online_cpu(i)
>   		put_core_offline(i);
>   	put_online_cpus();
>
>
>


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] Thermal: Fix lockup of cpu_down()
  2013-07-16 18:19 ` Srinivas Pandruvada
@ 2013-07-16 18:33   ` Steven Rostedt
  2013-07-16 18:44     ` Srinivas Pandruvada
  0 siblings, 1 reply; 5+ messages in thread
From: Steven Rostedt @ 2013-07-16 18:33 UTC (permalink / raw)
  To: Srinivas Pandruvada; +Cc: LKML, Zhang Rui, Andrew Morton, Tejun Heo

On Tue, 2013-07-16 at 11:19 -0700, Srinivas Pandruvada wrote:
> Thanks. How did you trigger this error condition? Is it a code review or 
> you have some way to reproduce?

No, my tests do a cpu hotplug stress and the system would hang. I had to
bisect it to find the bug and it came to this code. What was weird is
that the module wasn't loaded. Then I ran the ftrace function tracer
stared by the kernel command line with the following:

 ftrace=function ftrace_filter=get_online_cpus,put_online_cpus

and after I booted up, I ran:

cat /debug/tracing/trace | perl -e '
my @stack;
while (<>) {
	if (/get_online/) {
		push @stack, $_;
	} elsif (/put_online/) {
		pop @stack;
	}
}
foreach my $line (@stack) {
	print $line;
}'

And it showed that get_online_cpus() was called twice without a matching
put_online_cpu(). The strange thing was the calls had no parent
function. Which is when I realized that the module was loaded but then
failed to init, and was unloaded. Which explains why it didn't show up
in my lsmod.

Then it was just the matter of looking at all the calls to
get_online_cpu() in the commit, and it was rather obvious to what the
bug was.

With the patch applied, the lockup went away.

-- Steve




^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] Thermal: Fix lockup of cpu_down()
  2013-07-16 18:02 [PATCH] Thermal: Fix lockup of cpu_down() Steven Rostedt
  2013-07-16 18:19 ` Srinivas Pandruvada
@ 2013-07-16 18:34 ` Srinivas Pandruvada
  1 sibling, 0 replies; 5+ messages in thread
From: Srinivas Pandruvada @ 2013-07-16 18:34 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: LKML, Zhang Rui, Andrew Morton, Tejun Heo

On 07/16/2013 11:02 AM, Steven Rostedt wrote:
> Commit f1a18a105 "Thermal: CPU Package temperature thermal" had code
> that did a get_online_cpus(), run a loop and then do a
> put_online_cpus(). The problem is that the loop had an error exit that
> would skip the put_online_cpus() part.
>
> In the error exit part of the function, it also did a get_online_cpus(),
> run a loop and then put_online_cpus(). The only way to get to the error
> exit part is with get_online_cpus() already performed. If this error
> condition is hit, the system will be prevented from taking CPUs offline.
> The process taking the CPU offline will lock up hard.
>
> Removing the get_online_cpus() removes the lockup as the hotplug CPU
> refcount is back to zero.
>
> This was bisected with ktest.
>
> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
>
> diff --git a/drivers/thermal/x86_pkg_temp_thermal.c b/drivers/thermal/x86_pkg_temp_thermal.c
> index 5de56f6..d47624c 100644
> --- a/drivers/thermal/x86_pkg_temp_thermal.c
> +++ b/drivers/thermal/x86_pkg_temp_thermal.c
> @@ -592,7 +592,6 @@ static int __init pkg_temp_thermal_init(void)
>   	return 0;
>   
>   err_ret:
> -	get_online_cpus();
>   	for_each_online_cpu(i)
>   		put_core_offline(i);
>   	put_online_cpus();
>
Agreed.
>


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] Thermal: Fix lockup of cpu_down()
  2013-07-16 18:33   ` Steven Rostedt
@ 2013-07-16 18:44     ` Srinivas Pandruvada
  0 siblings, 0 replies; 5+ messages in thread
From: Srinivas Pandruvada @ 2013-07-16 18:44 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: LKML, Zhang Rui, Andrew Morton, Tejun Heo

On 07/16/2013 11:33 AM, Steven Rostedt wrote:
> On Tue, 2013-07-16 at 11:19 -0700, Srinivas Pandruvada wrote:
>> Thanks. How did you trigger this error condition? Is it a code review or
>> you have some way to reproduce?
> No, my tests do a cpu hotplug stress and the system would hang. I had to
> bisect it to find the bug and it came to this code. What was weird is
> that the module wasn't loaded. Then I ran the ftrace function tracer
> stared by the kernel command line with the following:
>
>   ftrace=function ftrace_filter=get_online_cpus,put_online_cpus
>
> and after I booted up, I ran:
>
> cat /debug/tracing/trace | perl -e '
> my @stack;
> while (<>) {
> 	if (/get_online/) {
> 		push @stack, $_;
> 	} elsif (/put_online/) {
> 		pop @stack;
> 	}
> }
> foreach my $line (@stack) {
> 	print $line;
> }'
>
> And it showed that get_online_cpus() was called twice without a matching
> put_online_cpu(). The strange thing was the calls had no parent
> function. Which is when I realized that the module was loaded but then
> failed to init, and was unloaded. Which explains why it didn't show up
> in my lsmod.
>
> Then it was just the matter of looking at all the calls to
> get_online_cpu() in the commit, and it was rather obvious to what the
> bug was.
>
> With the patch applied, the lockup went away.
>
> -- Steve
Thanks for your help in debugging and isolating.
>
>
>


^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2013-07-16 18:38 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-07-16 18:02 [PATCH] Thermal: Fix lockup of cpu_down() Steven Rostedt
2013-07-16 18:19 ` Srinivas Pandruvada
2013-07-16 18:33   ` Steven Rostedt
2013-07-16 18:44     ` Srinivas Pandruvada
2013-07-16 18:34 ` Srinivas Pandruvada

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.