'schedutil' (possibly) aberrant behavior surrounding suspend/resume process (timing/delay/run-away CPU issues)

* 'schedutil' (possibly) aberrant behavior surrounding suspend/resume process (timing/delay/run-away CPU issues)
@ 2022-05-21 22:13 Danny van Heumen
  2022-05-25  5:28 ` Viresh Kumar
  0 siblings, 1 reply; 7+ messages in thread
From: Danny van Heumen @ 2022-05-21 22:13 UTC
  To: “Rafael J. Wysocki”, Viresh Kumar, linux-pm

Hi all,

This is my first report sent directly to the Linux kernel mailing lists. I'll do my best, but I'll inevitably make some mistakes. This bug report is based on "educated" guessing, some insight and intuition, so I will try to explain all the curiosities and why I blame `schedutil`. To clarify: it seems that the behavior of schedutil triggers issues elsewhere.

I write this message not to say "this is the problem, fix it!" but rather in the hope that you can tell me whether you recognize potential things to look for, *if the reports make any sense*.

## short summary

On multiple computers, with kernel builds from distros and also vanilla custom (minimal) config builds by myself, I have had a number of issues that all center around suspend/resume with 'schedutil' scaling governor active. Some issues seem to be explained by other causes/bug-reports, but I report this here because `schedutil` is a common denominator.

Characteristics:
- Kabylake laptop (intel_pstate=passive): upon suspend, problems going into suspend, with "runaway processor" behavior: the CPU starts running at full power, causing excessive build-up of heat. (more details follow ...)
- Haswell laptop: upon suspend, it sometimes does not suspend; instead the screen turns off, but the system then returns to an active state.
- Pinebook Pro: upon resume from suspend (s2idle, I'm fairly certain): problems getting "display panel up in time", "eMMC/SDIO failing to initialize", and screen flickering from excessive refreshing at many times the normal blinking rate. (more details follow ...)

These issues start to happen upon suspend. The issues do not happen when `schedutil` is not involved.

My interpretation: somewhere just before, or just after entering suspend-state/initiating restore, `schedutil` messes up something w.r.t. timings: timings/delays/repeats happen at many times the intended speed.

## `schedutil` before suspend: all good

I have been using `schedutil` for many months. During normal operation there never seem to be any issues or unexpected events. Everything I describe happens around the suspend process, and sometimes as a consequence afterwards.

## Pinebook Pro issues

- original Debian kernel,
- custom-built kernel with very minimal config: versions 5.15.{38,39,41} 5.17.9
- `schedutil` cpufreq scaling governor
- Debian bullseye (original, no third-party kernels), up-to-date install
- no tlp, manual suspend procedure (either button- or lid-triggered)
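
Because the suspend variant on the Pinebook Pro is only "fairly certain" to be s2idle, it may help to confirm it before comparing runs. The kernel lists the available variants in `/sys/power/mem_sleep` and marks the active one with square brackets. The following is a minimal sketch of how that file could be read; the function names `parse_mem_sleep` and `current_mem_sleep` are made up for illustration and are not part of the original report.

```python
from pathlib import Path


def parse_mem_sleep(text):
    """Parse the contents of /sys/power/mem_sleep.

    The kernel lists the available suspend variants separated by spaces
    and wraps the currently selected one in square brackets,
    e.g. "s2idle [deep]".
    Returns (active, available); active is None if nothing is bracketed.
    """
    active = None
    available = []
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]
            active = token
        available.append(token)
    return active, available


def current_mem_sleep(path="/sys/power/mem_sleep"):
    """Read and parse the suspend variant from sysfs (hypothetical helper)."""
    return parse_mem_sleep(Path(path).read_text())
```

Calling `current_mem_sleep()` on the affected machine returns the variant that a suspend request will use, plus the list of variants the platform offers.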

The Pinebook Pro does not exhibit issues (AFAICT) upon entering suspend. However, resume will fail often, but not always. I have seen different symptoms exhibited:

1.) 3-line error message regarding analogix_dp_resume, about the rockchip module not succeeding (through analogix) in re-initializing the display panel in time. (It shows on that exact display panel, so clearly it worked.)
Those 3 lines of error message are being refreshed at many times the necessary rate. The excessive refreshing interferes with the ability to switch terminals (TTY1, TTY2, etc.): I see a "single frame blink" of the terminal, which then gets overwritten by the 3-line error message again. The same holds for switching to the GUI display terminal. The display does not have sufficient time to refresh and ends up in variations of an extra-long DPMS-off state before returning to the error-message screen.
Sometimes I am able to "interrupt" this excessive run-away behavior by keeping e.g. CTRL+ALT+1 pressed, such that it tries to switch TTY at "key-repeat" speed.

2.) error message regarding eMMC/SDIO issues. Similar to (1.): the eMMC fails to initialize and consequently I lose access to my persistent storage.

3.) The issue also occurs when entering sleep from prolonged idleness with the lid already closed (DPMS off + locked) as the starting point.

[x.] Not sure if relevant at all: "crashes with wifi firmware". A buggy early firmware version for a Broadcom wifi chip would occasionally crash, in particular around the suspend process.
I don't want to attribute this to schedutil, but maybe schedutil's behavior significantly increases the chance of this happening.

NOTE: I tried switching back to the 'ondemand' governor after the first issues had occurred with 'schedutil' active. However, reverting to 'ondemand' did not resolve the issues that were already outstanding at that point.

## Kabylake-laptop

- Distro-kernel (i.e. not vanilla)
- `intel_pstate=passive` kernel parameter
- `schedutil` as cpufreq scaling governor (udev-rule)

1.) Trigger suspend on the laptop. System goes into suspend state as expected. Then (after about 2 seconds) the fans start spinning and apparently excessive heat is being produced. (This was reported somewhere already as being an issue with PCH temp being too high, w.r.t. `S0ix` `intel_pch_thermal`.) However, I suspect that the high temperature may be caused by `schedutil`, because the laptop was not running intensive tasks of any kind. (It was idling.)
Upon resuming, the laptop has lost a significant amount of battery charge, consistent with the excessive CPU usage and the running cooling fans.

## Haswell laptop

1.) Not much to say: I enter suspend and find out later that the laptop did not truly suspend, but was held back and remained active.

## Without `schedutil`

In all cases, when taking schedutil out of the loop, these issues disappear. In the case of the Kabylake laptop, I set intel_pstate=active. All use the `ondemand` governor. Suspend/resume works, repeated many times over. Even if some errors are still shown, they no longer pose a problem. The excessive screen blinking behavior is not exhibited, for example.
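
When comparing runs with and without `schedutil`, it can be useful to record which governor each CPU is actually using before suspend and again after resume. The kernel exposes this per CPU in `/sys/devices/system/cpu/cpuN/cpufreq/scaling_governor`. The sketch below reads those files; the function name `read_governors` and its `cpu_root` parameter are assumptions for illustration, not something taken from the report.

```python
from pathlib import Path


def read_governors(cpu_root="/sys/devices/system/cpu"):
    """Map each CPU directory name (e.g. 'cpu0') to its current governor.

    Reads <cpu_root>/cpuN/cpufreq/scaling_governor for every CPU that
    exposes a cpufreq directory. CPUs without cpufreq support are skipped.
    """
    governors = {}
    pattern = "cpu[0-9]*/cpufreq/scaling_governor"
    for gov_file in sorted(Path(cpu_root).glob(pattern)):
        cpu_name = gov_file.parent.parent.name  # the 'cpuN' directory
        governors[cpu_name] = gov_file.read_text().strip()
    return governors
```

Logging the result of `read_governors()` immediately before a suspend and right after resume would show whether the governor in effect is the one that was intended for that test run.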

## Other things I have tried

As mentioned before, I have tried looking for other bugs. However, the issues seem to be too persistent. I have tried removing all external devices, using different USB ports, and using different suspend/resume settings in `/etc/systemd/sleep.conf`. I have tried the 'tlp' package for additional power-management tricks. I have tried switching governors, ... and probably more.

I realize this bug report is far from perfect. I'm afraid I cannot make the report much more exact/clear. I am happy to answer any questions you may have. I will keep an eye out for further peculiarities.

Regards,
Danny

PS: I hope I address this to the right maintainers. I checked here: https://docs.kernel.org/process/maintainers.html#cpu-frequency-scaling-framework
