Re: [PATCH v4 13/17] watchdog/hardlockup: detect hard lockups using secondary (buddy) CPUs

From: Doug Anderson <dianders@chromium.org>
To: Nicholas Piggin <npiggin@gmail.com>
Cc: Petr Mladek <pmladek@suse.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Sumit Garg <sumit.garg@linaro.org>,
	Mark Rutland <mark.rutland@arm.com>,
	Matthias Kaehlcke <mka@chromium.org>,
	Stephane Eranian <eranian@google.com>,
	Stephen Boyd <swboyd@chromium.org>,
	ricardo.neri@intel.com, Tzung-Bi Shih <tzungbi@chromium.org>,
	Lecopzer Chen <lecopzer.chen@mediatek.com>,
	kgdb-bugreport@lists.sourceforge.net,
	Masayoshi Mizuma <msys.mizuma@gmail.com>,
	Guenter Roeck <groeck@chromium.org>,
	Pingfan Liu <kernelfans@gmail.com>,
	Andi Kleen <ak@linux.intel.com>, Ian Rogers <irogers@google.com>,
	linux-arm-kernel@lists.infradead.org,
	linux-perf-users@vger.kernel.org, ito-yuichi@fujitsu.com,
	Randy Dunlap <rdunlap@infradead.org>,
	Chen-Yu Tsai <wens@csie.org>,
	christophe.leroy@csgroup.eu, davem@davemloft.net,
	sparclinux@vger.kernel.org, mpe@ellerman.id.au,
	Will Deacon <will@kernel.org>,
	ravi.v.shankar@intel.com, linuxppc-dev@lists.ozlabs.org,
	Marc Zyngier <maz@kernel.org>,
	Catalin Marinas <catalin.marinas@arm.com>,
	Daniel Thompson <daniel.thompson@linaro.org>,
	Colin Cross <ccross@android.com>
Subject: Re: [PATCH v4 13/17] watchdog/hardlockup: detect hard lockups using secondary (buddy) CPUs
Date: Fri, 19 May 2023 10:23:06 -0700
Message-ID: <CAD=FV=X53VnS37YXkGUT7W=1ekS1YgznCbOiBQBSHuZLqpHa_A@mail.gmail.com>
In-Reply-To: <CAD=FV=WEp23wDm2=cFO66uSjqw1UfYSczGSrQh32DGiqHnUDkg@mail.gmail.com>

Hi,

On Mon, May 8, 2023 at 8:52 AM Doug Anderson <dianders@chromium.org> wrote:
>
> Hmmm, but I don't think you really need "all-to-all" checking to get
> the stacktraces you want, do you? Each CPU can be "watching" exactly
> one other CPU, but then when we actually lock up we could check all of
> them and dump stacks on all the ones that are locked up. I think this
> would be a fairly easy improvement for the buddy system. I'll leave it
> out for now just to keep things simple for the initial landing, but it
> wouldn't be hard to add. Then I think the two SMP systems  (buddy vs.
> all-to-all) would be equivalent in terms of functionality?

FWIW, I take back my "this would be fairly easy" comment. :-P ...or,
at least I'll acknowledge that the easy way has some tradeoffs. It
wouldn't be trivially easy to just snoop on the data of the other
buddies because the watching processors aren't necessarily
synchronized with each other.

That being said, if someone really wanted to report on other locked
CPUs before doing a panic() and was willing to delay the panic, it
probably wouldn't be too hard to put in a mode where the CPU that
detects the first lockup could do some extra work to look for lockups.
Maybe it could send a normal IPI to other CPUs and see if they respond
or maybe it could take over monitoring all CPUs and wait one extra
period.

In any case, I'm not planning on implementing this now, but I at
least wanted to document my thoughts. ;-)
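
For illustration only, here is a user-space sketch of the "take over
monitoring all CPUs and wait one extra period" idea. Nothing here is
from the patch series: NR_CPUS, hrtimer_interrupts[], snapshot[] and
all three helpers are hypothetical names, and a plain array stands in
for the per-CPU interrupt counts. The point is that the detecting CPU
takes its own snapshot, so it doesn't depend on the other watchers'
unsynchronized saved counts.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4	/* hypothetical CPU count for the sketch */

/* Stand-in for the per-CPU timer interrupt counts (names are assumed). */
static unsigned long hrtimer_interrupts[NR_CPUS];

/* Private snapshot owned by the CPU that detected the first lockup. */
static unsigned long snapshot[NR_CPUS];

/* Step 1: the detecting CPU records every CPU's current count. */
static void snapshot_all(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		snapshot[cpu] = hrtimer_interrupts[cpu];
}

/* Step 2: simulate one extra sample period; only live CPUs tick. */
static void run_one_period(const bool alive[NR_CPUS])
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (alive[cpu])
			hrtimer_interrupts[cpu]++;
}

/* Step 3: any CPU whose count did not move is reported as locked up. */
static unsigned int find_locked_cpus(int self)
{
	unsigned int mask = 0;

	assert(self >= 0 && self < NR_CPUS);
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		if (cpu == self)
			continue;	/* the detecting CPU is clearly running */
		if (hrtimer_interrupts[cpu] == snapshot[cpu])
			mask |= 1u << cpu;
	}
	return mask;
}
```

A caller would run snapshot_all(), wait one period, and then dump
stacks for every bit set in the mask from find_locked_cpus() before
calling panic(). The cost is the delay mentioned above: the panic
waits one extra period.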

> With my simplistic solution
> of just allowing the buddy detector to be enabled in parallel with a
> perf-based detector then we wouldn't have this level of coordination,
> but I'll assume that's OK for the initial landing.

I dug into this more as well and I also wanted to note that, at least
for now, I'm not going to include support to turn on both the buddy
and perf lockup detectors in the common core. To do this without
having them stomp on each other, I think we'd need extra coordination
or two copies of the interrupt count / saved interrupt count and, at
least at this point in time, it doesn't seem worth it for a halfway
solution. From everything I've heard, there is a push on many x86
machines to get off the perf lockup detector anyway to free up the
resources. Someone could look at adding this complexity later.
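
To make the "two copies" option concrete, here is a minimal
user-space sketch, assuming each detector keeps its own saved
interrupt count. The struct, the enum and both function names are
hypothetical and not taken from the series. With a single shared
saved count, a check by one detector right after the other would see
"count == saved" and report a lockup that isn't there; with one saved
copy per detector, each only compares against its own last
observation.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical identifiers for the two detectors. */
enum detector { DET_PERF, DET_BUDDY, DET_COUNT };

/* Per-CPU state: one live count, one saved copy per detector. */
struct cpu_watch_state {
	unsigned long hrtimer_interrupts;	/* bumped by the timer tick */
	unsigned long saved[DET_COUNT];		/* last value each detector saw */
};

/* Called from the watched CPU's timer interrupt. */
static void timer_tick(struct cpu_watch_state *s)
{
	s->hrtimer_interrupts++;
}

/*
 * Returns true if the count has not moved since this detector last
 * looked. Only this detector's saved copy is read or updated.
 */
static bool is_locked_up(struct cpu_watch_state *s, enum detector d)
{
	unsigned long now = s->hrtimer_interrupts;

	assert(d < DET_COUNT);
	if (s->saved[d] == now)
		return true;

	s->saved[d] = now;
	return false;
}
```

This is the "two copies" half of the tradeoff only. It roughly
doubles the per-CPU bookkeeping, which is part of why it doesn't seem
worth it for now.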

-Doug
