linux-kernel.vger.kernel.org archive mirror
From: Laurent Dufour <ldufour@linux.ibm.com>
To: Michael Ellerman <michaele@au1.ibm.com>,
	Nathan Lynch <nathanl@linux.ibm.com>
Cc: linux-kernel@vger.kernel.org, npiggin@gmail.com,
	paulus@samba.org, linuxppc-dev@lists.ozlabs.org,
	haren@linux.vnet.ibm.com
Subject: Re: [PATCH 0/2] Disabling NMI watchdog during LPM's memory transfer
Date: Thu, 9 Jun 2022 11:09:54 +0200
Message-ID: <acb1c167-1697-7b61-239e-02acc3de389f@linux.ibm.com>
In-Reply-To: <87zgimfff6.fsf@mpe.ellerman.id.au>

On 09/06/2022, 09:45:49, Michael Ellerman wrote:
> Nathan Lynch <nathanl@linux.ibm.com> writes:
>> Laurent Dufour <ldufour@linux.ibm.com> writes:
> ...
>>
>>> There are ongoing investigations to clarify where and how this latency is
>>> happening. I'm not excluding any other issue in the Linux kernel, but right
>>> now, this looks to be the best option to prevent system crash during
>>> LPM.
>>
>> It will prevent the likely crash mode for enterprise distros with
>> default watchdog tunables that our internal test environments happen to
>> use. But if someone were to run the same scenario with softlockup_panic
>> enabled, or with the RCU stall timeout lower than the watchdog
>> threshold, the failure mode would be different.
>>
>> Basically I'm saying:
>> * Some users may actually want the OS to panic when it's in this state,
>>   because their applications can't work correctly.
>> * But if we're going to inhibit one watchdog, we should inhibit them
>>   all.
> 
> I'm sympathetic to both of your arguments.
> 
> But I think there is a key difference between the NMI watchdog and other
> watchdogs, which is that the NMI watchdog will use the unsafe NMI to
> interrupt other CPUs, and that can cause the system to crash when other
> watchdogs would just print a backtrace.
> 
> We had the same problem with the rcu_sched stall detector until we
> changed it to use the "safe" NMI, see:
>   5cc05910f26e ("powerpc/64s: Wire up arch_trigger_cpumask_backtrace()")
> 
> 
> So even if the NMI watchdog is disabled there are still the other
> watchdogs enabled, which should print backtraces by default, and if
> desired can also be configured to cause a panic.
> 
> Instead of disabling the NMI watchdog, can we instead increase the
> timeout (by how much?) during LPM, so that it is less likely to fire in
> normal usage, but is still there as a backup if the system is completely
> clogged.

That's probably doable, by tweaking wd_smp_panic_timeout_tb and
wd_panic_timeout_tb while the LPM is in progress.

I'll add a new sysctl value so that the administrator can tune the
increase, or fully disable the NMI watchdog during LPM if desired.

I have no idea what the default factor should be; I guess it will have to
be determined empirically.
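
For illustration only, here is a minimal user-space sketch of the scaling
idea, not the reworked patch. It assumes a hypothetical percentage factor
where 100 leaves the timeouts unchanged and 0 stands for "NMI watchdog
disabled during LPM". The struct, the function names and those semantics
are assumptions; only the two timeout names come from the discussion above.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical stand-ins for wd_panic_timeout_tb and
 * wd_smp_panic_timeout_tb, both expressed in timebase ticks.
 */
struct wd_timeouts {
	uint64_t panic_tb;
	uint64_t smp_panic_tb;
};

/*
 * Scale a timeout by factor_pct percent (100 = unchanged).
 * The multiplication is split so that base_tb * factor_pct is never
 * computed in full, which avoids overflow for large tick counts.
 */
static uint64_t scale_timeout(uint64_t base_tb, unsigned int factor_pct)
{
	return base_tb / 100 * factor_pct + base_tb % 100 * factor_pct / 100;
}

/*
 * Apply the LPM factor to both timeouts.
 * Returns false, leaving the timeouts untouched, when factor_pct is 0,
 * which in this sketch means the caller should disable the watchdog.
 */
static bool wd_lpm_adjust(struct wd_timeouts *t, unsigned int factor_pct)
{
	if (factor_pct == 0)
		return false;

	t->panic_tb = scale_timeout(t->panic_tb, factor_pct);
	t->smp_panic_tb = scale_timeout(t->smp_panic_tb, factor_pct);
	return true;
}
```

With a factor of 200, a timeout of 5120000000 ticks (10 s at an assumed
512 MHz timebase) would become 10240000000 ticks; the original values
would have to be restored once LPM completes.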

I'll rework my patch along those lines.

cheers,
Laurent.



Thread overview: 9+ messages
2022-06-01 15:53 [PATCH 0/2] Disabling NMI watchdog during LPM's memory transfer Laurent Dufour
2022-06-01 15:53 ` [PATCH 1/2] powerpc/mobility: Wait for memory transfer to complete Laurent Dufour
2022-06-01 15:53 ` [PATCH 2/2] powerpc/mobility: disabling hard lockup watchdog during LPM Laurent Dufour
2022-06-06  1:41   ` kernel test robot
2022-06-02 17:58 ` [PATCH 0/2] Disabling NMI watchdog during LPM's memory transfer Nathan Lynch
2022-06-03  8:59   ` Laurent Dufour
2022-06-06 20:00     ` Nathan Lynch
2022-06-09  7:45       ` Michael Ellerman
2022-06-09  9:09         ` Laurent Dufour [this message]
