[PATCH 0/7] psi: pressure stall information for CPU, memory, and IO

* [PATCH 0/7] psi: pressure stall information for CPU, memory, and IO
@ 2018-05-07 21:01 Johannes Weiner
  2018-05-07 21:01 ` [PATCH 1/7] mm: workingset: don't drop refault information prematurely Johannes Weiner
                   ` (8 more replies)
  0 siblings, 9 replies; 45+ messages in thread
From: Johannes Weiner @ 2018-05-07 21:01 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linux-block, cgroups
  Cc: Ingo Molnar, Peter Zijlstra, Andrew Morton, Tejun Heo,
	Balbir Singh, Mike Galbraith, Oliver Yang, Shakeel Butt, xxx xxx,
	Taras Kondratiuk, Daniel Walker, Vinayak Menon,
	Ruslan Ruslichenko, kernel-team

Hi,

I previously submitted a version of this patch set called "memdelay",
which translated delays from reclaim, swap-in, thrashing page cache
into a pressure percentage of lost walltime. I've since extended this
code to aggregate all delay states tracked by delayacct in order to
have generalized pressure/overcommit levels for CPU, memory, and IO.

There was feedback from Peter on the previous version that I have
incorporated as much as possible and as it still applies to this code:

	- got rid of the extra lock in the sched callbacks; all task
          state changes we care about serialize through rq->lock

	- got rid of ktime_get() inside the sched callbacks and
          switched time measuring to rq_clock()

	- got rid of all divisions inside the sched callbacks,
          tracking everything natively in ns now

I also moved this stuff into existing sched/stat.h callbacks, so it
doesn't get in the way in sched/core.c, and of course moved the whole
thing behind CONFIG_PSI since not everyone is going to want it.

Real-world applications

Since the last posting, we've begun using the data collected by this
code quite extensively at Facebook, and with several success stories.

First we used it on systems that frequently locked up in low memory
situations. The reason this happens is that the OOM killer is
triggered by reclaim not being able to make forward progress, but with
fast flash devices there is *always* some clean and uptodate cache to
reclaim; the OOM killer never kicks in, even as tasks wait 80-90% of
the time faulting executables. There is no situation where this ever
makes sense in practice. We wrote a <100 line POC python script to
monitor memory pressure and kill stuff manually, way before such
pathological thrashing.

We've since extended the python script into a more generic oomd that
we use all over the place, not just to avoid livelocks but also to
guarantee latency and throughput SLAs, since they're usually violated
way before the kernel OOM killer would ever kick in.

We also use the memory pressure info for loadshedding. Our batch job
infrastructure used to refuse new requests on heuristics based on RSS
and other existing VM metrics in an attempt to avoid OOM kills and
maximize utilization. Since it was still plagued by frequent OOM
kills, we switched it to shed load on psi memory pressure, which has
turned out to be a much better bellwether, and we managed to reduce
OOM kills drastically. Reducing the rate of OOM outages from the
worker pool raised its aggregate productivity, and we were able to
switch that service to smaller machines.

Lastly, we use cgroups to isolate a machine's main workload from
maintenance crap like package upgrades, logging, configuration, as
well as to prevent multiple workloads on a machine from stepping on
each others' toes. We were not able to do this properly without the
pressure metrics; we would see latency or bandwidth drops, but it
would often be hard to impossible to rootcause it post-mortem. We now
log and graph the pressure metrics for all containers in our fleet and
can trivially link service drops to resource pressure after the fact.

How do you use this?

A kernel with CONFIG_PSI=y will create a /proc/pressure directory with
3 files: cpu, memory, and io. If using cgroup2, cgroups will also have
cpu.pressure, memory.pressure and io.pressure files, which simply
calculate pressure at the cgroup level instead of system-wide.

The cpu file contains one line:

	some avg10=2.04 avg60=0.75 avg300=0.40 total=157656722

The averages give the percentage of walltime in which some tasks are
delayed on the runqueue while another task has the CPU. They're recent
averages over 10s, 1m, 5m windows, so you can tell short term trends
from long term ones, similarly to the load average.

What to make of this number? If CPU utilization is at 100% and CPU
pressure is 0, it means the system is perfectly utilized, with one
runnable thread per CPU and nobody waiting. At two or more runnable
tasks per CPU, the system is 100% overcommitted and the pressure
average will indicate as much. From a utilization perspective this is
a great state of course: no CPU cycles are being wasted, even when 50%
of the threads were to go idle (and most workloads do vary). From the
perspective of the individual job it's not great, however, and they
might do better with more resources. Depending on what your priority
is, an elevated "some" number may or may not require action.

The memory file contains two lines:

some avg10=70.24 avg60=68.52 avg300=69.91 total=3559632828
full avg10=57.59 avg60=58.06 avg300=60.38 total=3300487258

The some line is the same as for cpu: the time in which at least one
task is stalled on the resource.

The full line, however, indicates time in which *nobody* is using the
CPU productively due to pressure: all non-idle tasks could be waiting
on thrashing cache simultaneously. It can also happen when a single
reclaimer occupies the CPU, since nothing else can make forward
progress during that time. CPU cycles are being wasted. Significant
time spent in there is a good trigger for killing, moving jobs to
other machines, or dropping incoming requests, since neither the jobs
nor the machine overall is making too much headway.

The total= value gives the absolute stall time in microseconds. This
allows detecting latency spikes that might be too short to sway the
running averages. It also allows custom time averaging in case the
10s/1m/5m windows aren't adequate for the usecase (or are too coarse
with future hardware).

The io file is similar to memory. However, unlike CPU and memory, the
block layer doesn't have a concept of hardware contention. We cannot
know if the IO a task is waiting on is being performed by the device
or whether the device is busy with or slowed down other requests. As a
result, we can tell how many CPU cycles go to waste due to IO delays,
but we can not identify the competition factor in those delays.

These patches are against v4.17-rc4.

 Documentation/accounting/psi.txt                |  73 ++++
 Documentation/cgroup-v2.txt                     |  18 +
 arch/powerpc/platforms/cell/cpufreq_spudemand.c |   2 +-
 arch/powerpc/platforms/cell/spufs/sched.c       |   9 +-
 arch/s390/appldata/appldata_os.c                |   4 -
 drivers/cpuidle/governors/menu.c                |   4 -
 fs/proc/loadavg.c                               |   3 -
 include/linux/cgroup-defs.h                     |   4 +
 include/linux/cgroup.h                          |  15 +
 include/linux/delayacct.h                       |  23 +
 include/linux/mmzone.h                          |   1 +
 include/linux/page-flags.h                      |   5 +-
 include/linux/psi.h                             |  52 +++
 include/linux/psi_types.h                       |  84 ++++
 include/linux/sched.h                           |  10 +
 include/linux/sched/loadavg.h                   |  90 +++-
 include/linux/sched/stat.h                      |  10 +-
 include/linux/swap.h                            |   2 +-
 include/trace/events/mmflags.h                  |   1 +
 include/uapi/linux/taskstats.h                  |   6 +-
 init/Kconfig                                    |  20 +
 kernel/cgroup/cgroup.c                          |  45 +-
 kernel/debug/kdb/kdb_main.c                     |   7 +-
 kernel/delayacct.c                              |  15 +
 kernel/fork.c                                   |   4 +
 kernel/sched/Makefile                           |   1 +
 kernel/sched/core.c                             |   3 +
 kernel/sched/loadavg.c                          |  84 ----
 kernel/sched/psi.c                              | 499 ++++++++++++++++++++++
 kernel/sched/sched.h                            | 166 +++----
 kernel/sched/stats.h                            |  91 +++-
 mm/compaction.c                                 |   5 +
 mm/filemap.c                                    |  27 +-
 mm/huge_memory.c                                |   1 +
 mm/memcontrol.c                                 |   2 +
 mm/migrate.c                                    |   2 +
 mm/page_alloc.c                                 |  10 +
 mm/swap_state.c                                 |   1 +
 mm/vmscan.c                                     |  14 +
 mm/vmstat.c                                     |   1 +
 mm/workingset.c                                 | 113 +++--
 tools/accounting/getdelays.c                    |   8 +-
 42 files changed, 1279 insertions(+), 256 deletions(-)

^ permalink raw reply	[flat|nested] 45+ messages in thread