All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] watchdog: Fix possible soft lockup warning at bootup
@ 2020-01-02 15:41 Waiman Long
  2020-01-02 20:08 ` Andrew Morton
  0 siblings, 1 reply; 4+ messages in thread
From: Waiman Long @ 2020-01-02 15:41 UTC (permalink / raw)
  To: Andrew Morton, Ingo Molnar, Thomas Gleixner
  Cc: linux-kernel, Mike Rapoport, Kees Cook, Waiman Long

It was found that watchdog soft lockup warning was displayed on some
arm64 server systems at bootup time:

 [   25.496379] watchdog: BUG: soft lockup - CPU#14 stuck for 22s!  [swapper/14:0]
 [   25.496381] Modules linked in:
 [   25.496386] CPU: 14 PID: 0 Comm: swapper/14 Tainted: G        W    L --------- -  - 4.18.0-rhel8.1+ #9
 [   25.496388] pstate: 60000009 (nZCv daif -PAN -UAO)
 [   25.496393] pc : arch_cpu_idle+0x34/0x140
 [   25.496395] lr : arch_cpu_idle+0x30/0x140
 [   25.496397] sp : ffff000021f4ff10
 [   25.496398] x29: ffff000021f4ff10 x28: 0000000000000000
 [   25.496401] x27: 0000000000000000 x26: ffff809f483c0000
 [   25.496404] x25: 0000000000000000 x24: ffff00001145c03c
 [   25.496407] x23: ffff00001110c9f8 x22: ffff000011453708
 [   25.496410] x21: ffff00001145bffc x20: 0000000000004000
 [   25.496413] x19: ffff0000110f0018 x18: 0000000000000010
 [   25.496416] x17: 0000000000000cc8 x16: 0000000000000000
 [   25.496419] x15: ffffffffffffffff x14: ffff000011453708
 [   25.496422] x13: ffff000091cc5caf x12: ffff000011cc5cb7
 [   25.496424] x11: 6572203030642072 x10: 0000000000000d10
 [   25.496427] x9 : ffff000021f4fe80 x8 : ffff809f483c0d70
 [   25.496430] x7 : 00000000b123f581 x6 : 00000000ffff8ae1
 [   25.496433] x5 : 00000000ffffffff x4 : 0000809f6ac90000
 [   25.496436] x3 : 4000000000000000 x2 : ffff809f7bd9e9c0
 [   25.496439] x1 : ffff0000110f0018 x0 : ffff000021f4ff10
 [   25.496441] Call trace:
 [   25.496444]  arch_cpu_idle+0x34/0x140
 [   25.496447]  do_idle+0x210/0x288
 [   25.496449]  cpu_startup_entry+0x2c/0x30
 [   25.496452]  secondary_start_kernel+0x124/0x138

Further analysis of the situation revealed that the smp_init() call
itself took more than 20s for that 2-socket 56-core and 224-thread
server.

 [    0.115632] CPU1: Booted secondary processor 0x0000000100 [0x431f0af1]
   :
 [   27.177282] CPU223: Booted secondary processor 0x0000011b03 [0x431f0af1]

By adding some instrumentation code, it was found that for cpu 14,
watchdog_enable() was called early with a timestamp of 1. The first
watchdog timer callback for that cpu, however, happened really late at
the above 25s timestamp mark causing the watchdog logic to treat the
delay as a soft lockup.

On another arm64 system that doesn't show the soft lockup warning, the
watchdog timer callback happened earlier at the 5s timestamp mark with
the watchdog thread invoked shortly after that.

The reason why there was such a delay in the first watchdog timer
callback for that particular system wasn't fully known yet. Given
the fact that smp_init() can run for a long time on some systems,
it is probably more appropriate to enable the watchdog function after
smp_init() instead of before it.

Another way is to leave watchdog_touch_ts at 0 in watchdog_enable()
while the system is at the booting stage. Either one of those should
be able to eliminate the soft lockup warning on bootup.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 init/main.c       | 2 +-
 kernel/watchdog.c | 4 +++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/init/main.c b/init/main.c
index 1ecfd43ed464..82e231d69265 100644
--- a/init/main.c
+++ b/init/main.c
@@ -1204,9 +1204,9 @@ static noinline void __init kernel_init_freeable(void)
 	init_mm_internals();
 
 	do_pre_smp_initcalls();
-	lockup_detector_init();
 
 	smp_init();
+	lockup_detector_init();
 	sched_init_smp();
 
 	page_alloc_init_late();
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index f41334ef0971..8acb97372831 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -496,7 +496,9 @@ static void watchdog_enable(unsigned int cpu)
 		      HRTIMER_MODE_REL_PINNED_HARD);
 
 	/* Initialize timestamp */
-	__touch_watchdog();
+	if (system_state != SYSTEM_BOOTING)
+		__touch_watchdog();
+
 	/* Enable the perf event */
 	if (watchdog_enabled & NMI_WATCHDOG_ENABLED)
 		watchdog_nmi_enable(cpu);
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH] watchdog: Fix possible soft lockup warning at bootup
  2020-01-02 15:41 [PATCH] watchdog: Fix possible soft lockup warning at bootup Waiman Long
@ 2020-01-02 20:08 ` Andrew Morton
  2020-01-02 20:12   ` Waiman Long
  0 siblings, 1 reply; 4+ messages in thread
From: Andrew Morton @ 2020-01-02 20:08 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Thomas Gleixner, linux-kernel, Mike Rapoport, Kees Cook

On Thu,  2 Jan 2020 10:41:49 -0500 Waiman Long <longman@redhat.com> wrote:

> It was found that watchdog soft lockup warning was displayed on some
> arm64 server systems at bootup time:
> 
> ...
>
> Further analysis of the situation revealed that the smp_init() call
> itself took more than 20s for that 2-socket 56-core and 224-thread
> server.
> 
>  [    0.115632] CPU1: Booted secondary processor 0x0000000100 [0x431f0af1]
>    :
>  [   27.177282] CPU223: Booted secondary processor 0x0000011b03 [0x431f0af1]
> 
> By adding some instrumentation code, it was found that for cpu 14,
> watchdog_enable() was called early with a timestamp of 1. The first
> watchdog timer callback for that cpu, however, happened really late at
> the above 25s timestamp mark causing the watchdog logic to treat the
> delay as a soft lockup.
> 
> On another arm64 system that doesn't show the soft lockup warning, the
> watchdog timer callback happened earlier at the 5s timestamp mark with
> the watchdog thread invoked shortly after that.
> 
> The reason why there was such a delay in the first watchdog timer
> callback for that particular system wasn't fully known yet.

Mysteries are unwelcome.  Are you continuing to investigate this?

> Given
> the fact that smp_init() can run for a long time on some systems,
> it is probably more appropriate to enable the watchdog function after
> smp_init() instead of before it.
> 
> Another way is to leave watchdog_touch_ts at 0 in watchdog_enable()
> while the system is at the booting stage. Either one of those should
> be able to eliminate the soft lockup warning on bootup.
> 
> ...
>
> --- a/kernel/watchdog.c
> +++ b/kernel/watchdog.c
> @@ -496,7 +496,9 @@ static void watchdog_enable(unsigned int cpu)
>  		      HRTIMER_MODE_REL_PINNED_HARD);
>  
>  	/* Initialize timestamp */
> -	__touch_watchdog();
> +	if (system_state != SYSTEM_BOOTING)
> +		__touch_watchdog();

A comment which explains the system_state test would be appropriate
here.

>  	/* Enable the perf event */
>  	if (watchdog_enabled & NMI_WATCHDOG_ENABLED)
>  		watchdog_nmi_enable(cpu);


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] watchdog: Fix possible soft lockup warning at bootup
  2020-01-02 20:08 ` Andrew Morton
@ 2020-01-02 20:12   ` Waiman Long
  2020-01-03 15:17     ` Waiman Long
  0 siblings, 1 reply; 4+ messages in thread
From: Waiman Long @ 2020-01-02 20:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ingo Molnar, Thomas Gleixner, linux-kernel, Mike Rapoport, Kees Cook

On 1/2/20 3:08 PM, Andrew Morton wrote:
> On Thu,  2 Jan 2020 10:41:49 -0500 Waiman Long <longman@redhat.com> wrote:
>
>> It was found that watchdog soft lockup warning was displayed on some
>> arm64 server systems at bootup time:
>>
>> ...
>>
>> Further analysis of the situation revealed that the smp_init() call
>> itself took more than 20s for that 2-socket 56-core and 224-thread
>> server.
>>
>>  [    0.115632] CPU1: Booted secondary processor 0x0000000100 [0x431f0af1]
>>    :
>>  [   27.177282] CPU223: Booted secondary processor 0x0000011b03 [0x431f0af1]
>>
>> By adding some instrumentation code, it was found that for cpu 14,
>> watchdog_enable() was called early with a timestamp of 1. The first
>> watchdog timer callback for that cpu, however, happened really late at
>> the above 25s timestamp mark causing the watchdog logic to treat the
>> delay as a soft lockup.
>>
>> On another arm64 system that doesn't show the soft lockup warning, the
>> watchdog timer callback happened earlier at the 5s timestamp mark with
>> the watchdog thread invoked shortly after that.
>>
>> The reason why there was such a delay in the first watchdog timer
>> callback for that particular system wasn't fully known yet.
> Mysteries are unwelcome.  Are you continuing to investigate this?
Yes, I will do some more investigation as to why it took so long.
>> Given
>> the fact that smp_init() can run for a long time on some systems,
>> it is probably more appropriate to enable the watchdog function after
>> smp_init() instead of before it.
>>
>> Another way is to leave watchdog_touch_ts at 0 in watchdog_enable()
>> while the system is at the booting stage. Either one of those should
>> be able to eliminate the soft lockup warning on bootup.
>>
>> ...
>>
>> --- a/kernel/watchdog.c
>> +++ b/kernel/watchdog.c
>> @@ -496,7 +496,9 @@ static void watchdog_enable(unsigned int cpu)
>>  		      HRTIMER_MODE_REL_PINNED_HARD);
>>  
>>  	/* Initialize timestamp */
>> -	__touch_watchdog();
>> +	if (system_state != SYSTEM_BOOTING)
>> +		__touch_watchdog();
> A comment which explains the system_state test would be appropriate
> here.

Will do so.

>>  	/* Enable the perf event */
>>  	if (watchdog_enabled & NMI_WATCHDOG_ENABLED)
>>  		watchdog_nmi_enable(cpu);

Cheers,
Longman


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] watchdog: Fix possible soft lockup warning at bootup
  2020-01-02 20:12   ` Waiman Long
@ 2020-01-03 15:17     ` Waiman Long
  0 siblings, 0 replies; 4+ messages in thread
From: Waiman Long @ 2020-01-03 15:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ingo Molnar, Thomas Gleixner, linux-kernel, Mike Rapoport, Kees Cook

On 1/2/20 3:12 PM, Waiman Long wrote:
> On 1/2/20 3:08 PM, Andrew Morton wrote:
>> On Thu,  2 Jan 2020 10:41:49 -0500 Waiman Long <longman@redhat.com> wrote:
>>
>>> It was found that watchdog soft lockup warning was displayed on some
>>> arm64 server systems at bootup time:
>>>
>>> ...
>>>
>>> Further analysis of the situation revealed that the smp_init() call
>>> itself took more than 20s for that 2-socket 56-core and 224-thread
>>> server.
>>>
>>>  [    0.115632] CPU1: Booted secondary processor 0x0000000100 [0x431f0af1]
>>>    :
>>>  [   27.177282] CPU223: Booted secondary processor 0x0000011b03 [0x431f0af1]
>>>
>>> By adding some instrumentation code, it was found that for cpu 14,
>>> watchdog_enable() was called early with a timestamp of 1. The first
>>> watchdog timer callback for that cpu, however, happened really late at
>>> the above 25s timestamp mark causing the watchdog logic to treat the
>>> delay as a soft lockup.
>>>
>>> On another arm64 system that doesn't show the soft lockup warning, the
>>> watchdog timer callback happened earlier at the 5s timestamp mark with
>>> the watchdog thread invoked shortly after that.
>>>
>>> The reason why there was such a delay in the first watchdog timer
>>> callback for that particular system wasn't fully known yet.
>> Mysteries are unwelcome.  Are you continuing to investigate this?
> Yes, I will do some more investigation as to why it took so long.

As stated in my v2 patch, the soft lockup problem is caused by a clock
slowdown in during the init_smp() call. This is likely to be hardware
specific and I don't have enough information about the hardware to do an
in-depth investigation of this slow down. So I opt to do just a delay of
the init call as a workaround.


>>> Given
>>> the fact that smp_init() can run for a long time on some systems,
>>> it is probably more appropriate to enable the watchdog function after
>>> smp_init() instead of before it.
>>>
>>> Another way is to leave watchdog_touch_ts at 0 in watchdog_enable()
>>> while the system is at the booting stage. Either one of those should
>>> be able to eliminate the soft lockup warning on bootup.
>>>
>>> ...
>>>
>>> --- a/kernel/watchdog.c
>>> +++ b/kernel/watchdog.c
>>> @@ -496,7 +496,9 @@ static void watchdog_enable(unsigned int cpu)
>>>  		      HRTIMER_MODE_REL_PINNED_HARD);
>>>  
>>>  	/* Initialize timestamp */
>>> -	__touch_watchdog();
>>> +	if (system_state != SYSTEM_BOOTING)
>>> +		__touch_watchdog();
>> A comment which explains the system_state test would be appropriate
>> here.
> Will do so.

This change is not actually necessary. So it is taken out from the v2 patch.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2020-01-03 15:17 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-01-02 15:41 [PATCH] watchdog: Fix possible soft lockup warning at bootup Waiman Long
2020-01-02 20:08 ` Andrew Morton
2020-01-02 20:12   ` Waiman Long
2020-01-03 15:17     ` Waiman Long

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.