All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] x86/mce/therm_throt: Fix the access of uninitialized therm_work
@ 2020-01-06  6:41 Chuansheng Liu
  2020-01-06  7:07 ` Borislav Petkov
  0 siblings, 1 reply; 9+ messages in thread
From: Chuansheng Liu @ 2020-01-06  6:41 UTC (permalink / raw)
  To: linux-kernel; +Cc: tony.luck, bp, tglx, mingo, hpa, chuansheng.liu

In ICL platform, it is easy to hit bootup failure with panic
in thermal interrupt handler during early bootup stage.

Such issue makes my platform almost can not boot up with
latest kernel code.

The call stack is like:
kernel BUG at kernel/timer/timer.c:1152!

Call Trace:
__queue_delayed_work
queue_delayed_work_on
therm_throt_process
intel_thermal_interrupt
...

When one CPU is up, the irq is enabled prior to CPU UP
notification which will then initialize therm_worker.
Such race will cause the posssibility that interrupt
handler therm_throt_process() accesses uninitialized
therm_work, then system hit panic at very early bootup
stage.

In my ICL platform, it can be reproduced in several times
of reboot stress. With this fix, the system keeps alive
for more than 200 times of reboot stress.

Signed-off-by: Chuansheng Liu <chuansheng.liu@intel.com>
---
 arch/x86/kernel/cpu/mce/therm_throt.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/therm_throt.c b/arch/x86/kernel/cpu/mce/therm_throt.c
index b38010b541d6..7320eb3ac029 100644
--- a/arch/x86/kernel/cpu/mce/therm_throt.c
+++ b/arch/x86/kernel/cpu/mce/therm_throt.c
@@ -86,6 +86,7 @@ struct _thermal_state {
 	unsigned long		total_time_ms;
 	bool			rate_control_active;
 	bool			new_event;
+	bool			therm_work_active;
 	u8			level;
 	u8			sample_index;
 	u8			sample_count;
@@ -359,7 +360,9 @@ static void therm_throt_process(bool new_event, int event, int level)
 
 		state->baseline_temp = temp;
 		state->last_interrupt_time = now;
-		schedule_delayed_work_on(this_cpu, &state->therm_work, THERM_THROT_POLL_INTERVAL);
+		if (state->therm_work_active)
+			schedule_delayed_work_on(this_cpu, &state->therm_work,
+					THERM_THROT_POLL_INTERVAL);
 	} else if (old_event && state->last_interrupt_time) {
 		unsigned long throttle_time;
 
@@ -473,7 +476,8 @@ static int thermal_throttle_online(unsigned int cpu)
 
 	INIT_DELAYED_WORK(&state->package_throttle.therm_work, throttle_active_work);
 	INIT_DELAYED_WORK(&state->core_throttle.therm_work, throttle_active_work);
-
+	state->package_throttle.therm_work_active = true;
+	state->core_throttle.therm_work_active = true;
 	return thermal_throttle_add_dev(dev, cpu);
 }
 
@@ -482,6 +486,8 @@ static int thermal_throttle_offline(unsigned int cpu)
 	struct thermal_state *state = &per_cpu(thermal_state, cpu);
 	struct device *dev = get_cpu_device(cpu);
 
+	state->package_throttle.therm_work_active = false;
+	state->core_throttle.therm_work_active = false;
 	cancel_delayed_work(&state->package_throttle.therm_work);
 	cancel_delayed_work(&state->core_throttle.therm_work);
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH] x86/mce/therm_throt: Fix the access of uninitialized therm_work
  2020-01-06  6:41 [PATCH] x86/mce/therm_throt: Fix the access of uninitialized therm_work Chuansheng Liu
@ 2020-01-06  7:07 ` Borislav Petkov
  2020-01-06  7:11   ` Ingo Molnar
  2020-01-06  9:22   ` Liu, Chuansheng
  0 siblings, 2 replies; 9+ messages in thread
From: Borislav Petkov @ 2020-01-06  7:07 UTC (permalink / raw)
  To: Chuansheng Liu; +Cc: linux-kernel, tony.luck, tglx, mingo, hpa

On Mon, Jan 06, 2020 at 06:41:55AM +0000, Chuansheng Liu wrote:
> In ICL platform, it is easy to hit bootup failure with panic
> in thermal interrupt handler during early bootup stage.
> 
> Such issue makes my platform almost can not boot up with
> latest kernel code.
> 
> The call stack is like:
> kernel BUG at kernel/timer/timer.c:1152!
> 
> Call Trace:
> __queue_delayed_work
> queue_delayed_work_on
> therm_throt_process
> intel_thermal_interrupt
> ...
> 
> When one CPU is up, the irq is enabled prior to CPU UP
> notification which will then initialize therm_worker.

You mean the unmasking of the thermal vector at the end of
intel_init_thermal()?

If so, why don't you move that to the end of the notifier and unmask it
only after all the necessary work like setting up the workqueues etc, is
done, and save yourself adding yet another silly bool?

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] x86/mce/therm_throt: Fix the access of uninitialized therm_work
  2020-01-06  7:07 ` Borislav Petkov
@ 2020-01-06  7:11   ` Ingo Molnar
  2020-01-06  9:24     ` Liu, Chuansheng
  2020-01-11  4:20     ` Liu, Chuansheng
  2020-01-06  9:22   ` Liu, Chuansheng
  1 sibling, 2 replies; 9+ messages in thread
From: Ingo Molnar @ 2020-01-06  7:11 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Chuansheng Liu, linux-kernel, tony.luck, tglx, mingo, hpa

* Borislav Petkov <bp@alien8.de> wrote:

> On Mon, Jan 06, 2020 at 06:41:55AM +0000, Chuansheng Liu wrote:
> > In ICL platform, it is easy to hit bootup failure with panic
> > in thermal interrupt handler during early bootup stage.
> > 
> > Such issue makes my platform almost can not boot up with
> > latest kernel code.
> > 
> > The call stack is like:
> > kernel BUG at kernel/timer/timer.c:1152!
> > 
> > Call Trace:
> > __queue_delayed_work
> > queue_delayed_work_on
> > therm_throt_process
> > intel_thermal_interrupt
> > ...
> > 
> > When one CPU is up, the irq is enabled prior to CPU UP
> > notification which will then initialize therm_worker.
> 
> You mean the unmasking of the thermal vector at the end of
> intel_init_thermal()?
> 
> If so, why don't you move that to the end of the notifier and unmask it
> only after all the necessary work like setting up the workqueues etc, is
> done, and save yourself adding yet another silly bool?

A debugging WARN_ON_ONCE() when the workqueue is not initialized yet 
would also be useful I suspect. This would turn any remaining race-crash 
boot failure in this area into a warning.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 9+ messages in thread

* RE: [PATCH] x86/mce/therm_throt: Fix the access of uninitialized therm_work
  2020-01-06  7:07 ` Borislav Petkov
  2020-01-06  7:11   ` Ingo Molnar
@ 2020-01-06  9:22   ` Liu, Chuansheng
  2020-01-06 10:01     ` Boris Petkov
  1 sibling, 1 reply; 9+ messages in thread
From: Liu, Chuansheng @ 2020-01-06  9:22 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: linux-kernel, Luck, Tony, tglx, mingo, hpa



> -----Original Message-----
> From: Borislav Petkov <bp@alien8.de>
> Sent: Monday, January 6, 2020 3:08 PM
> To: Liu, Chuansheng <chuansheng.liu@intel.com>
> Cc: linux-kernel@vger.kernel.org; Luck, Tony <tony.luck@intel.com>;
> tglx@linutronix.de; mingo@redhat.com; hpa@zytor.com
> Subject: Re: [PATCH] x86/mce/therm_throt: Fix the access of uninitialized
> therm_work
> 
> On Mon, Jan 06, 2020 at 06:41:55AM +0000, Chuansheng Liu wrote:
> > In ICL platform, it is easy to hit bootup failure with panic
> > in thermal interrupt handler during early bootup stage.
> >
> > Such issue makes my platform almost can not boot up with
> > latest kernel code.
> >
> > The call stack is like:
> > kernel BUG at kernel/timer/timer.c:1152!
> >
> > Call Trace:
> > __queue_delayed_work
> > queue_delayed_work_on
> > therm_throt_process
> > intel_thermal_interrupt
> > ...
> >
> > When one CPU is up, the irq is enabled prior to CPU UP
> > notification which will then initialize therm_worker.
> 
> You mean the unmasking of the thermal vector at the end of
> intel_init_thermal()?
Exactly, and there is one local CPU irq enable later too.

> 
> If so, why don't you move that to the end of the notifier and unmask it
> only after all the necessary work like setting up the workqueues etc, is
> done, and save yourself adding yet another silly bool?
> 
Thanks for your suggestion, I am just worried about the interrupt delay.
I traced there is about 2s gap between unmask interrupt and workqueue
Initialization. If you think it is OK to ignore this delay, I will make another
simple patch as you suggested😊

Best Regards
Chuansheng



^ permalink raw reply	[flat|nested] 9+ messages in thread

* RE: [PATCH] x86/mce/therm_throt: Fix the access of uninitialized therm_work
  2020-01-06  7:11   ` Ingo Molnar
@ 2020-01-06  9:24     ` Liu, Chuansheng
  2020-01-11  4:20     ` Liu, Chuansheng
  1 sibling, 0 replies; 9+ messages in thread
From: Liu, Chuansheng @ 2020-01-06  9:24 UTC (permalink / raw)
  To: Ingo Molnar, Borislav Petkov; +Cc: linux-kernel, Luck, Tony, tglx, mingo, hpa



> -----Original Message-----
> From: Ingo Molnar <mingo.kernel.org@gmail.com> On Behalf Of Ingo Molnar
> Sent: Monday, January 6, 2020 3:11 PM
> To: Borislav Petkov <bp@alien8.de>
> Cc: Liu, Chuansheng <chuansheng.liu@intel.com>; linux-kernel@vger.kernel.org;
> Luck, Tony <tony.luck@intel.com>; tglx@linutronix.de; mingo@redhat.com;
> hpa@zytor.com
> Subject: Re: [PATCH] x86/mce/therm_throt: Fix the access of uninitialized
> therm_work
> 
> * Borislav Petkov <bp@alien8.de> wrote:
> 
> > On Mon, Jan 06, 2020 at 06:41:55AM +0000, Chuansheng Liu wrote:
> > > In ICL platform, it is easy to hit bootup failure with panic
> > > in thermal interrupt handler during early bootup stage.
> > >
> > > Such issue makes my platform almost can not boot up with
> > > latest kernel code.
> > >
> > > The call stack is like:
> > > kernel BUG at kernel/timer/timer.c:1152!
> > >
> > > Call Trace:
> > > __queue_delayed_work
> > > queue_delayed_work_on
> > > therm_throt_process
> > > intel_thermal_interrupt
> > > ...
> > >
> > > When one CPU is up, the irq is enabled prior to CPU UP
> > > notification which will then initialize therm_worker.
> >
> > You mean the unmasking of the thermal vector at the end of
> > intel_init_thermal()?
> >
> > If so, why don't you move that to the end of the notifier and unmask it
> > only after all the necessary work like setting up the workqueues etc, is
> > done, and save yourself adding yet another silly bool?
> 
> A debugging WARN_ON_ONCE() when the workqueue is not initialized yet
> would also be useful I suspect. This would turn any remaining race-crash
> boot failure in this area into a warning.
> 
Thanks Ingo.
That's a good suggestion, I can try to make another patch about it later.

Best Regards
Chuansheng

^ permalink raw reply	[flat|nested] 9+ messages in thread

* RE: [PATCH] x86/mce/therm_throt: Fix the access of uninitialized therm_work
  2020-01-06  9:22   ` Liu, Chuansheng
@ 2020-01-06 10:01     ` Boris Petkov
  2020-01-06 10:10       ` Liu, Chuansheng
  0 siblings, 1 reply; 9+ messages in thread
From: Boris Petkov @ 2020-01-06 10:01 UTC (permalink / raw)
  To: Liu, Chuansheng; +Cc: linux-kernel, Luck, Tony, tglx, mingo, hpa

On January 6, 2020 10:22:06 AM GMT+01:00, "Liu, Chuansheng" <chuansheng.liu@intel.com> wrote:
>I traced there is about 2s gap between unmask interrupt and workqueue
>Initialization. 

And that is a problem because?

You setup workqueue etc and *then* unmask the irq. 

-- 
Sent from a small device: formatting sux and brevity is inevitable.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* RE: [PATCH] x86/mce/therm_throt: Fix the access of uninitialized therm_work
  2020-01-06 10:01     ` Boris Petkov
@ 2020-01-06 10:10       ` Liu, Chuansheng
  2020-01-06 10:37         ` Boris Petkov
  0 siblings, 1 reply; 9+ messages in thread
From: Liu, Chuansheng @ 2020-01-06 10:10 UTC (permalink / raw)
  To: Boris Petkov; +Cc: linux-kernel, Luck, Tony, tglx, mingo, hpa



> -----Original Message-----
> From: Boris Petkov <bp@alien8.de>
> Sent: Monday, January 6, 2020 6:01 PM
> To: Liu, Chuansheng <chuansheng.liu@intel.com>
> Cc: linux-kernel@vger.kernel.org; Luck, Tony <tony.luck@intel.com>;
> tglx@linutronix.de; mingo@redhat.com; hpa@zytor.com
> Subject: RE: [PATCH] x86/mce/therm_throt: Fix the access of uninitialized
> therm_work
> 
> On January 6, 2020 10:22:06 AM GMT+01:00, "Liu, Chuansheng"
> <chuansheng.liu@intel.com> wrote:
> >I traced there is about 2s gap between unmask interrupt and workqueue
> >Initialization.
> 
> And that is a problem because?
> 
> You setup workqueue etc and *then* unmask the irq.
> 
Some previous experience shows:
If there is critical thermal alert, we still can take action in kernel side in this
2s, even though the workqueue is not ready, but interrupt handler can work
well.



^ permalink raw reply	[flat|nested] 9+ messages in thread

* RE: [PATCH] x86/mce/therm_throt: Fix the access of uninitialized therm_work
  2020-01-06 10:10       ` Liu, Chuansheng
@ 2020-01-06 10:37         ` Boris Petkov
  0 siblings, 0 replies; 9+ messages in thread
From: Boris Petkov @ 2020-01-06 10:37 UTC (permalink / raw)
  To: Liu, Chuansheng; +Cc: linux-kernel, Luck, Tony, tglx, mingo, hpa

On January 6, 2020 11:10:11 AM GMT+01:00, "Liu, Chuansheng" <chuansheng.liu@intel.com> wrote:
>If there is critical thermal alert, we still can take action in kernel
>side in this
>2s

What critical thermal alert during boot and what action can you take that early that can't wait 2 seconds? 

-- 
Sent from a small device: formatting sux and brevity is inevitable.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* RE: [PATCH] x86/mce/therm_throt: Fix the access of uninitialized therm_work
  2020-01-06  7:11   ` Ingo Molnar
  2020-01-06  9:24     ` Liu, Chuansheng
@ 2020-01-11  4:20     ` Liu, Chuansheng
  1 sibling, 0 replies; 9+ messages in thread
From: Liu, Chuansheng @ 2020-01-11  4:20 UTC (permalink / raw)
  To: Ingo Molnar, Borislav Petkov; +Cc: linux-kernel, Luck, Tony, tglx, mingo, hpa

Hi Ingo,

> -----Original Message-----
> From: Ingo Molnar <mingo.kernel.org@gmail.com> On Behalf Of Ingo Molnar
> Sent: Monday, January 6, 2020 3:11 PM
> To: Borislav Petkov <bp@alien8.de>
> Cc: Liu, Chuansheng <chuansheng.liu@intel.com>; linux-kernel@vger.kernel.org;
> Luck, Tony <tony.luck@intel.com>; tglx@linutronix.de; mingo@redhat.com;
> hpa@zytor.com
> Subject: Re: [PATCH] x86/mce/therm_throt: Fix the access of uninitialized
> therm_work
> 
> * Borislav Petkov <bp@alien8.de> wrote:
> 
> > On Mon, Jan 06, 2020 at 06:41:55AM +0000, Chuansheng Liu wrote:
> > > In ICL platform, it is easy to hit bootup failure with panic
> > > in thermal interrupt handler during early bootup stage.
> > >
> > > Such issue makes my platform almost can not boot up with
> > > latest kernel code.
> > >
> > > The call stack is like:
> > > kernel BUG at kernel/timer/timer.c:1152!
> > >
> > > Call Trace:
> > > __queue_delayed_work
> > > queue_delayed_work_on
> > > therm_throt_process
> > > intel_thermal_interrupt
> > > ...
> > >
> > > When one CPU is up, the irq is enabled prior to CPU UP
> > > notification which will then initialize therm_worker.
> >
> > You mean the unmasking of the thermal vector at the end of
> > intel_init_thermal()?
> >
> > If so, why don't you move that to the end of the notifier and unmask it
> > only after all the necessary work like setting up the workqueues etc, is
> > done, and save yourself adding yet another silly bool?
> 
> A debugging WARN_ON_ONCE() when the workqueue is not initialized yet
> would also be useful I suspect. This would turn any remaining race-crash
> boot failure in this area into a warning.
> 
Just checked the code, the WARN_ON_ONCE() is already there:
1622         WARN_ON_ONCE(timer->function != delayed_work_timer_fn);

With reproducing it, the corresponding log also shows before panic:
WARNING: CPU: 0 .... at kernel/workqueue.c:1622 __queue_delayed_work+0x73/0x90

Thanks for your reminder.


^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2020-01-11  4:20 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-01-06  6:41 [PATCH] x86/mce/therm_throt: Fix the access of uninitialized therm_work Chuansheng Liu
2020-01-06  7:07 ` Borislav Petkov
2020-01-06  7:11   ` Ingo Molnar
2020-01-06  9:24     ` Liu, Chuansheng
2020-01-11  4:20     ` Liu, Chuansheng
2020-01-06  9:22   ` Liu, Chuansheng
2020-01-06 10:01     ` Boris Petkov
2020-01-06 10:10       ` Liu, Chuansheng
2020-01-06 10:37         ` Boris Petkov

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.