[PATCH v3 0/4] clocksource: Avoid incorrect hpet fallback

From: Waiman Long @ 2021-11-18 19:14 UTC
  To: John Stultz, Thomas Gleixner, Stephen Boyd, Feng Tang, Paul E. McKenney
  Cc: linux-kernel, Peter Zijlstra, Cassio Neri, Linus Walleij,
	Frederic Weisbecker, Waiman Long

It was found that when an x86 system was stressed by running various
benchmark suites, the clocksource watchdog might occasionally mark TSC
as unstable and fall back to hpet, which has a significant impact on
system performance.

 v3:
  - Remove the patch 1 hunks that change uncertainty_margin from 2 *
    WATCHDOG_MAX_SKEW to WATCHDOG_MAX_SKEW.
  - Use pr_info() for the "clock-skew test skipped" message instead of
    pr_warn().
  - Move the global clock_skew_skip into a new clock_skew_skipcnt
    field in the clocksource structure.
  - Replace the local_irq_save()/local_irq_restore() pair in
    __clocksource_select_watchdog() by local_irq_disable()/local_irq_enable().

The current watchdog clocksource skew threshold of 50us is found to be
insufficient. So patch 1 changes it back to 100us, the value used before
commit 2e27e793e280 ("clocksource: Reduce clocksource-skew threshold").
This patch also skips the current clock skew check if the consecutive
watchdog read-back delay contributes a major portion of the total delay.
On a 1-socket 64-thread test system, it was actually found that in one
of the test samples, the hpet-tsc-hpet delay was 95263ns, while the
corresponding hpet-hpet delay was 94425ns. So the majority of the delay
is caused by the hpet read.
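
As a rough illustration of the skip decision described above, here is a
minimal userspace sketch. It is not the code from patch 1: the helper
name is hypothetical, and reading "a major portion" as "more than half
of the total delay" is an assumption made only for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not taken from the patch: decide whether the
 * current clock-skew check should be skipped.
 *
 * wd_cs_wd_ns: delay of the watchdog-clocksource-watchdog read sequence
 * wd_wd_ns:    delay of two consecutive watchdog reads
 *
 * Assumption: "major portion" means more than half of the total delay.
 */
bool skew_check_should_skip(int64_t wd_cs_wd_ns, int64_t wd_wd_ns)
{
	assert(wd_cs_wd_ns >= 0 && wd_wd_ns >= 0);

	return 2 * wd_wd_ns > wd_cs_wd_ns;
}
```

With the sample quoted above (95263ns for hpet-tsc-hpet, 94425ns for
hpet-hpet), this sketch returns true, i.e. the check would be skipped
because nearly all of the delay comes from reading the watchdog itself.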

Patch 2 reduces the default clocksource_watchdog() retries to 2 as
suggested by Paul.

Patch 3 implements dynamic readjustment of the new internal
watchdog_max_skew variable in case the current value causes excessive
skipping of clock skew checks. The following reproducer provided by
Feng Tang was used to cause the test skipping:

  sudo stress-ng --timeout 30 --times --verify --metrics-brief --ioport <n>

where <n> is the number of cpus in the system.

A sample watchdog_max_skew readjustment output was:

[  197.771144] clocksource: timekeeping watchdog on CPU8: hpet wd-wd read-back delay of 92539ns
[  197.789589] clocksource: wd-tsc-wd read-back delay of 90933ns, clock-skew test skipped!
[  197.807145] clocksource: timekeeping watchdog on CPU8: watchdog_max_skew increased to 185078ns

To avoid excessive increase of watchdog_max_skew, a limit of
10*WATCHDOG_MAX_SKEW is used, over which the watchdog itself will be
marked unstable and a new watchdog will be selected if possible.
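
The readjustment and its cap can be sketched in userspace C as below.
This is only an illustration under stated assumptions, not the code in
patch 3: the helper and macro names are hypothetical, and the "twice the
wd-wd read-back delay" rule is inferred from the sample log above
(185078ns = 2 * 92539ns) rather than taken from the patch itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* 100us, the threshold restored by patch 1, expressed in ns. */
#define WATCHDOG_MAX_SKEW_NS	100000
/* Cap from the cover letter: 10 * WATCHDOG_MAX_SKEW (hypothetical name). */
#define WATCHDOG_SKEW_LIMIT_NS	(10 * (int64_t)WATCHDOG_MAX_SKEW_NS)

/*
 * Hypothetical helper: raise *max_skew_ns based on the measured delay of
 * two consecutive watchdog reads.
 *
 * Returns true if the threshold is within the cap (and possibly raised).
 * Returns false if the candidate value exceeds the cap, in which case
 * the caller would mark the watchdog itself unstable and try to select
 * another one; *max_skew_ns is left unchanged.
 */
bool readjust_max_skew(int64_t wd_wd_ns, int64_t *max_skew_ns)
{
	/* Assumption inferred from the sample log: new value = 2 * delay. */
	int64_t new_skew = 2 * wd_wd_ns;

	assert(wd_wd_ns >= 0 && max_skew_ns != NULL);

	if (new_skew > WATCHDOG_SKEW_LIMIT_NS)
		return false;
	if (new_skew > *max_skew_ns)
		*max_skew_ns = new_skew;
	return true;
}
```

Starting from 100000ns and feeding in the 92539ns delay from the log,
the sketch raises the threshold to 185078ns, matching the sample output.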

To exercise the code, WATCHDOG_MAX_SKEW was reduced to 10us. After
skipping 10 checks, the watchdog then fell back to acpi_pm. However,
the corresponding consecutive watchdog delay was still about the same,
leading to ping-ponging between hpet and acpi_pm becoming the watchdog.

Patch 4 adds a Kconfig option to allow the kernel builder to control the
actual WATCHDOG_MAX_SKEW threshold to be used.

Waiman Long (4):
  clocksource: Avoid accidental unstable marking of clocksources
  clocksource: Reduce the default clocksource_watchdog() retries to 2
  clocksource: Dynamically increase watchdog_max_skew
  clocksource: Add a Kconfig option for WATCHDOG_MAX_SKEW

 .../admin-guide/kernel-parameters.txt         |   4 +-
 include/linux/clocksource.h                   |   1 +
 kernel/time/Kconfig                           |   9 ++
 kernel/time/clocksource.c                     | 112 +++++++++++++++---
 4 files changed, 110 insertions(+), 16 deletions(-)

-- 
2.27.0
