All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2 0/7] watchdog/softlockup: Report overall time and some cleanup
@ 2020-12-10 16:00 Petr Mladek
  2020-12-10 16:00 ` [PATCH v2 1/7] watchdog: Rename __touch_watchdog() to a better descriptive name Petr Mladek
                   ` (6 more replies)
  0 siblings, 7 replies; 15+ messages in thread
From: Petr Mladek @ 2020-12-10 16:00 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra
  Cc: Laurence Oberman, Vincent Whitchurch, Michal Hocko, linux-kernel,
	Petr Mladek

I dug deep into the softlockup watchdog history when time permitted
this year. And reworked the patchset that fixed timestamps and
cleaned up the code[1].

I split it into very small steps and did even more code clean up.
The result looks quite strightforward and I am pretty confident
with the changes.

[*] v1: https://lore.kernel.org/r/20191024114928.15377-1-pmladek@suse.com

Petr Mladek (7):
  watchdog: Rename __touch_watchdog() to a better descriptive name
  watchdog: Explicitly update timestamp when reporting softlockup
  watchdog/softlockup: Report the overall time of softlockups
  watchdog/softlockup: Remove logic that tried to prevent repeated
    reports
  watchdog: Fix barriers when printing backtraces from all CPUs
  watchdog: Cleanup handling of false positives
  Test softlockup

 fs/proc/consoles.c |  5 +++
 fs/proc/version.c  |  7 ++++
 kernel/watchdog.c  | 97 ++++++++++++++++++++++++++--------------------
 3 files changed, 66 insertions(+), 43 deletions(-)

-- 
2.26.2


^ permalink raw reply	[flat|nested] 15+ messages in thread
* [PATCH v2 0/7] watchdog/softlockup: Report overall time and some cleanup
@ 2021-03-11 12:21 Petr Mladek
  2021-03-11 12:21 ` [PATCH v2 6/7] watchdog: Cleanup handling of false positives Petr Mladek
  0 siblings, 1 reply; 15+ messages in thread
From: Petr Mladek @ 2021-03-11 12:21 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Andrew Morton
  Cc: Laurence Oberman, Vincent Whitchurch, Michal Hocko, linux-kernel,
	Petr Mladek

Resending. The original post is [1].

I dug deep into the softlockup watchdog history when time permitted
this year. And reworked the patchset that fixed timestamps and
cleaned up the code[2].

I split it into very small steps and did even more code clean up.
The result looks quite strightforward and I am pretty confident
with the changes.

[1] v2: https://lore.kernel.org/r/20201210160038.31441-1-pmladek@suse.com
[2] v1: https://lore.kernel.org/r/20191024114928.15377-1-pmladek@suse.com

Petr Mladek (7):
  watchdog: Rename __touch_watchdog() to a better descriptive name
  watchdog: Explicitly update timestamp when reporting softlockup
  watchdog/softlockup: Report the overall time of softlockups
  watchdog/softlockup: Remove logic that tried to prevent repeated
    reports
  watchdog: Fix barriers when printing backtraces from all CPUs
  watchdog: Cleanup handling of false positives
  Test softlockup

 fs/proc/consoles.c |  5 +++
 fs/proc/version.c  |  7 ++++
 kernel/watchdog.c  | 97 ++++++++++++++++++++++++++--------------------
 3 files changed, 66 insertions(+), 43 deletions(-)

-- 
2.26.2


^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: [PATCH v2 6/7] watchdog: Cleanup handling of false positives
@ 2021-05-16 10:46 Sergey Senozhatsky
  2021-05-17 15:01 ` Petr Mladek
  0 siblings, 1 reply; 15+ messages in thread
From: Sergey Senozhatsky @ 2021-05-16 10:46 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Laurence Oberman, Vincent Whitchurch, Michal Hocko, linux-kernel

Hi,

// This was never in my inbox, so sorry if I mess up the "Reply-to"
// Original message:  https://lore.kernel.org/lkml/20210311122130.6788-7-pmladek@suse.com/


>@@ -375,7 +375,14 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
> 	/* .. and repeat */
> 	hrtimer_forward_now(hrtimer, ns_to_ktime(sample_period));
>
> -	/* Reset the interval when touched externally by a known slow code. */
> +	/*
> +	 * If a virtual machine is stopped by the host it can look to
> +	 * the watchdog like a soft lockup. Check to see if the host
> +	 * stopped the vm before we process the timestamps.
> +	 */
> +	kvm_check_and_clear_guest_paused();
> +
[..]
>@@ -401,14 +405,6 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
> 	 */
> 	duration = is_softlockup(touch_ts, period_ts);
> 	if (unlikely(duration)) {
> -		/*
> -		 * If a virtual machine is stopped by the host it can look to
> -		 * the watchdog like a soft lockup, check to see if the host
> -		 * stopped the vm before we issue the warning
> -		 */
> -		if (kvm_check_and_clear_guest_paused())
> -			return HRTIMER_RESTART;

This looks racy to me. I believe kvm_check_and_clear_guest_paused()
was in the right place.

VCPU can be scheduled out/preepmpted any time at any point; and then
guest VM (or even the entire system) can be suspended. When we resume
the VM we continue from where we were preempted (from VCPU POW).

So what the old code did

watchdog_timer_fn()
{
	...
	<<!!>>

	// Suppose we are suspended here. When we are getting resumed
	// jiffies jump forward, which may look like a soft lockup.
	duration = is_softlockup(touch_ts, period_ts);
	if (unlikely(duration)) {
		// And this is where kvm_check_and_clear_guest_paused()
		// jumps in. We know already that jiffies have jumped,
		// we don't know if jiffies jumped because the VM was
		// suspended. And this is what we figure out here and
		// bail out
		if (kvm_check_and_clear_guest_paused())
			return HRTIMER_RESTART;
	}
}

The new code does the following

watchdog_timer_fn()
{
	...
	kvm_check_and_clear_guest_paused(); // PVCLOCK_GUEST_STOPPED is not set

	<<!!>>

	// Suppose the VM got suspended at this point. PVCLOCK_GUEST_STOPPED
	// is set, but we don't check it. jiffies will jump and this will look
	// like a lockup, but we don't check if jiffies jumped because the VM
	// was suspended
	duration = is_softlockup(touch_ts, period_ts);
	if (unlikely(duration)) {
		// report the lockup and perhaps panic the system,
		// depending on the configuration
	}
}

What am I missing?

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2021-05-17 15:10 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-12-10 16:00 [PATCH v2 0/7] watchdog/softlockup: Report overall time and some cleanup Petr Mladek
2020-12-10 16:00 ` [PATCH v2 1/7] watchdog: Rename __touch_watchdog() to a better descriptive name Petr Mladek
2020-12-10 16:00 ` [PATCH v2 2/7] watchdog: Explicitly update timestamp when reporting softlockup Petr Mladek
2020-12-10 16:00 ` [PATCH v2 3/7] watchdog/softlockup: Report the overall time of softlockups Petr Mladek
2020-12-10 16:00 ` [PATCH v2 4/7] watchdog/softlockup: Remove logic that tried to prevent repeated reports Petr Mladek
2020-12-10 16:00 ` [PATCH v2 5/7] watchdog: Fix barriers when printing backtraces from all CPUs Petr Mladek
2020-12-10 16:00 ` [PATCH v2 6/7] watchdog: Cleanup handling of false positives Petr Mladek
2020-12-10 16:00 ` [PATCH v2 7/7] Test softlockup Petr Mladek
2021-01-03 15:03   ` 3cc3ef45b2: RIP:version_proc_show kernel test robot
2021-01-03 15:03     ` kernel test robot
2021-01-05  9:37     ` Petr Mladek
2021-01-05  9:37       ` Petr Mladek
2021-03-11 12:21 [PATCH v2 0/7] watchdog/softlockup: Report overall time and some cleanup Petr Mladek
2021-03-11 12:21 ` [PATCH v2 6/7] watchdog: Cleanup handling of false positives Petr Mladek
2021-05-16 10:46 Sergey Senozhatsky
2021-05-17 15:01 ` Petr Mladek

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.