From: Johannes Weiner <firstname.lastname@example.org>
To: email@example.com, firstname.lastname@example.org, email@example.com, firstname.lastname@example.org
Cc: Ingo Molnar <email@example.com>, Peter Zijlstra <firstname.lastname@example.org>, Andrew Morton <email@example.com>, Tejun Heo <firstname.lastname@example.org>, Balbir Singh <email@example.com>, Mike Galbraith <firstname.lastname@example.org>, Oliver Yang <email@example.com>, Shakeel Butt <firstname.lastname@example.org>, xxx xxx <email@example.com>, Taras Kondratiuk <firstname.lastname@example.org>, Daniel Walker <email@example.com>, Vinayak Menon <firstname.lastname@example.org>, Ruslan Ruslichenko <email@example.com>, firstname.lastname@example.org
Subject: [PATCH 0/7] psi: pressure stall information for CPU, memory, and IO
Date: Mon, 7 May 2018 17:01:28 -0400
Message-ID: <email@example.com>

Hi,

I previously submitted a version of this patch set called "memdelay", which translated delays from reclaim, swap-in, thrashing page cache into a pressure percentage of lost walltime. I've since extended this code to aggregate all delay states tracked by delayacct in order to have generalized pressure/overcommit levels for CPU, memory, and IO.

There was feedback from Peter on the previous version that I have incorporated as much as possible and as it still applies to this code:

- got rid of the extra lock in the sched callbacks; all task state changes we care about serialize through rq->lock

- got rid of ktime_get() inside the sched callbacks and switched time measuring to rq_clock()

- got rid of all divisions inside the sched callbacks, tracking everything natively in ns now

I also moved this stuff into existing sched/stat.h callbacks, so it doesn't get in the way in sched/core.c, and of course moved the whole thing behind CONFIG_PSI since not everyone is going to want it.

Real-world applications

Since the last posting, we've begun using the data collected by this code quite extensively at Facebook, and with several success stories.

First we used it on systems that frequently locked up in low memory situations. The reason this happens is that the OOM killer is triggered by reclaim not being able to make forward progress, but with fast flash devices there is *always* some clean and uptodate cache to reclaim; the OOM killer never kicks in, even as tasks wait 80-90% of the time faulting executables. There is no situation where this ever makes sense in practice. We wrote a <100 line POC python script to monitor memory pressure and kill stuff manually, way before such pathological thrashing.

We've since extended the python script into a more generic oomd that we use all over the place, not just to avoid livelocks but also to guarantee latency and throughput SLAs, since they're usually violated way before the kernel OOM killer would ever kick in.

We also use the memory pressure info for loadshedding. Our batch job infrastructure used to refuse new requests on heuristics based on RSS and other existing VM metrics in an attempt to avoid OOM kills and maximize utilization. Since it was still plagued by frequent OOM kills, we switched it to shed load on psi memory pressure, which has turned out to be a much better bellwether, and we managed to reduce OOM kills drastically. Reducing the rate of OOM outages from the worker pool raised its aggregate productivity, and we were able to switch that service to smaller machines.

Lastly, we use cgroups to isolate a machine's main workload from maintenance crap like package upgrades, logging, configuration, as well as to prevent multiple workloads on a machine from stepping on each other's toes. We were not able to do this properly without the pressure metrics; we would see latency or bandwidth drops, but it would often be hard to impossible to root-cause them post-mortem.
We now log and graph the pressure metrics for all containers in our fleet and can trivially link service drops to resource pressure after the fact.

How do you use this?

A kernel with CONFIG_PSI=y will create a /proc/pressure directory with 3 files: cpu, memory, and io. If using cgroup2, cgroups will also have cpu.pressure, memory.pressure and io.pressure files, which simply calculate pressure at the cgroup level instead of system-wide.

The cpu file contains one line:

	some avg10=2.04 avg60=0.75 avg300=0.40 total=157656722

The averages give the percentage of walltime in which some tasks are delayed on the runqueue while another task has the CPU. They're recent averages over 10s, 1m, 5m windows, so you can tell short term trends from long term ones, similarly to the load average.

What to make of this number? If CPU utilization is at 100% and CPU pressure is 0, it means the system is perfectly utilized, with one runnable thread per CPU and nobody waiting. At two or more runnable tasks per CPU, the system is 100% overcommitted and the pressure average will indicate as much. From a utilization perspective this is a great state of course: no CPU cycles are being wasted, even if 50% of the threads were to go idle (and most workloads do vary). From the perspective of the individual job it's not great, however, and they might do better with more resources. Depending on what your priority is, an elevated "some" number may or may not require action.

The memory file contains two lines:

	some avg10=70.24 avg60=68.52 avg300=69.91 total=3559632828
	full avg10=57.59 avg60=58.06 avg300=60.38 total=3300487258

The some line is the same as for cpu: the time in which at least one task is stalled on the resource.

The full line, however, indicates time in which *nobody* is using the CPU productively due to pressure: all non-idle tasks could be waiting on thrashing cache simultaneously.
It can also happen when a single reclaimer occupies the CPU, since nothing else can make forward progress during that time. CPU cycles are being wasted. Significant time spent in there is a good trigger for killing, moving jobs to other machines, or dropping incoming requests, since neither the jobs nor the machine overall is making too much headway.

The total= value gives the absolute stall time in microseconds. This allows detecting latency spikes that might be too short to sway the running averages. It also allows custom time averaging in case the 10s/1m/5m windows aren't adequate for the use case (or are too coarse with future hardware).

The io file is similar to memory. However, unlike CPU and memory, the block layer doesn't have a concept of hardware contention. We cannot know if the IO a task is waiting on is being performed by the device or whether the device is busy with or slowed down by other requests. As a result, we can tell how many CPU cycles go to waste due to IO delays, but we cannot identify the competition factor in those delays.

These patches are against v4.17-rc4.

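As a rough illustration of the file format and of the custom averaging that total= makes possible, a userspace consumer could look something like the Python sketch below. This is not code from the patch set: the helper names parse_pressure(), stall_percent() and sample_stall_percent() are made up for the example, and it assumes the some/full line layout and the microsecond unit of total= described above.

```python
import time
from typing import Dict

# Sample contents of the memory pressure file, copied from the text above.
SAMPLE = (
    "some avg10=70.24 avg60=68.52 avg300=69.91 total=3559632828\n"
    "full avg10=57.59 avg60=58.06 avg300=60.38 total=3300487258\n"
)


def parse_pressure(text: str) -> Dict[str, Dict[str, float]]:
    """Parse pressure file contents into {"some": {...}, "full": {...}}."""
    result: Dict[str, Dict[str, float]] = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        entry: Dict[str, float] = {}
        for field in fields:
            key, value = field.split("=", 1)
            # total= is an integer stall time; the avg* fields are percentages.
            entry[key] = int(value) if key == "total" else float(value)
        result[kind] = entry
    return result


def stall_percent(total_before_us: int, total_after_us: int,
                  interval_s: float) -> float:
    """Percentage of walltime stalled between two total= samples.

    Assumes total= is in microseconds, as stated in the text above.
    """
    stalled_us = total_after_us - total_before_us
    return 100.0 * stalled_us / (interval_s * 1_000_000)


def sample_stall_percent(path: str = "/proc/pressure/memory",
                         kind: str = "some",
                         interval_s: float = 1.0) -> float:
    """Read a pressure file twice and average the stall over a custom window.

    Only works on a kernel that actually provides the file (CONFIG_PSI=y).
    """
    with open(path) as f:
        before = parse_pressure(f.read())[kind]["total"]
    time.sleep(interval_s)
    with open(path) as f:
        after = parse_pressure(f.read())[kind]["total"]
    return stall_percent(int(before), int(after), interval_s)
```

A monitor along the lines of the scripts mentioned earlier could call sample_stall_percent() in a loop and compare the result against a threshold of its own choosing; what to do above that threshold (kill, shed load, just log) is a policy decision this sketch leaves out.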
 Documentation/accounting/psi.txt                |  73 ++++
 Documentation/cgroup-v2.txt                     |  18 +
 arch/powerpc/platforms/cell/cpufreq_spudemand.c |   2 +-
 arch/powerpc/platforms/cell/spufs/sched.c       |   9 +-
 arch/s390/appldata/appldata_os.c                |   4 -
 drivers/cpuidle/governors/menu.c                |   4 -
 fs/proc/loadavg.c                               |   3 -
 include/linux/cgroup-defs.h                     |   4 +
 include/linux/cgroup.h                          |  15 +
 include/linux/delayacct.h                       |  23 +
 include/linux/mmzone.h                          |   1 +
 include/linux/page-flags.h                      |   5 +-
 include/linux/psi.h                             |  52 +++
 include/linux/psi_types.h                       |  84 ++++
 include/linux/sched.h                           |  10 +
 include/linux/sched/loadavg.h                   |  90 +++-
 include/linux/sched/stat.h                      |  10 +-
 include/linux/swap.h                            |   2 +-
 include/trace/events/mmflags.h                  |   1 +
 include/uapi/linux/taskstats.h                  |   6 +-
 init/Kconfig                                    |  20 +
 kernel/cgroup/cgroup.c                          |  45 +-
 kernel/debug/kdb/kdb_main.c                     |   7 +-
 kernel/delayacct.c                              |  15 +
 kernel/fork.c                                   |   4 +
 kernel/sched/Makefile                           |   1 +
 kernel/sched/core.c                             |   3 +
 kernel/sched/loadavg.c                          |  84 ----
 kernel/sched/psi.c                              | 499 ++++++++++++++++++++++
 kernel/sched/sched.h                            | 166 +++----
 kernel/sched/stats.h                            |  91 +++-
 mm/compaction.c                                 |   5 +
 mm/filemap.c                                    |  27 +-
 mm/huge_memory.c                                |   1 +
 mm/memcontrol.c                                 |   2 +
 mm/migrate.c                                    |   2 +
 mm/page_alloc.c                                 |  10 +
 mm/swap_state.c                                 |   1 +
 mm/vmscan.c                                     |  14 +
 mm/vmstat.c                                     |   1 +
 mm/workingset.c                                 | 113 +++--
 tools/accounting/getdelays.c                    |   8 +-
 42 files changed, 1279 insertions(+), 256 deletions(-)