From: Johannes Weiner <hannes@cmpxchg.org> To: Ingo Molnar <mingo@redhat.com>, Peter Zijlstra <peterz@infradead.org>, Andrew Morton <akpm@linux-foundation.org>, Rik van Riel <riel@redhat.com>, Mel Gorman <mgorman@suse.de> Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, kernel-team@fb.com Subject: [PATCH 0/3] memdelay: memory health metric for systems and workloads Date: Thu, 27 Jul 2017 11:30:07 -0400 [thread overview] Message-ID: <20170727153010.23347-1-hannes@cmpxchg.org> (raw) This patch series implements a fine-grained metric for memory health. It builds on top of the refault detection code to quantify the time lost on VM events that occur exclusively due a lack of memory and maps it into a percentage of lost walltime for the system and cgroups. Rationale When presented with a Linux system or container executing a workload, it's hard to judge the health of its memory situation. The statistics exported by the memory management subsystem can reveal smoking guns: page reclaim activity, major faults and refaults can be indicative of an unhealthy memory situation. But they don't actually quantify the cost a memory shortage imposes on the system or workload. How bad is it when 2000 pages are refaulting each second? If the data is stored contiguously on a fast flash drive, it might be okay. If the data is spread out all over a rotating disk, it could be a problem - unless the CPUs are still fully utilized, in which case adding memory wouldn't make things move faster, but instead wait for CPU time. A previous attempt to provide a health signal from the VM was the vmpressure interface, 70ddf637eebe ("memcg: add memory.pressure_level events"). This derives its pressure levels from recently observed reclaim efficiency. As pages are scanned but not reclaimed, the ratio is translated into levels of low, medium, and critical pressure. However, the vmpressure scale is too coarse for today's systems. The accuracy relies on storage being relatively slow compared to how fast the CPU can go through the LRUs, so that when LRU scan cycles outstrip IO completion rates the reclaim code runs into pages that are still reading from disk. But as solid state devices close this speed gap, and memory sizes are in the hundreds of gigabytes, this effect has almost completely disappeared. By the time the reclaim scanner runs into in-flight pages, the tasks in the system already spend a significant part of their runtime waiting for refaulting pages. The vmpressure range is compressed into the split second before OOM and misses large, practically relevant parts of the pressure spectrum. Knowing the exact time penalty that the kernel's paging activity is imposing on a workload is a powerful tool. It allows users to finetune a workload to available memory, but also detect and quantify minute regressions and improvements in the reclaim and caching algorithms. Structure The first patch cleans up the different loadavg callsites and macros as the memdelay averages are going to be tracked using these. The second patch adds a distinction between page cache transitions (inactive list refaults) and page cache thrashing (active list refaults), since only the latter are unproductive refaults. The third patch finally adds the memdelay accounting and interface: its scheduler side identifies productive and unproductive task states, and the VM side aggregates them into system and cgroup domain states and calculates moving averages of the time spent in each state. arch/powerpc/platforms/cell/spufs/sched.c | 3 - arch/s390/appldata/appldata_os.c | 4 - drivers/cpuidle/governors/menu.c | 4 - fs/proc/array.c | 8 + fs/proc/base.c | 2 + fs/proc/internal.h | 2 + fs/proc/loadavg.c | 3 - include/linux/cgroup.h | 14 ++ include/linux/memcontrol.h | 14 ++ include/linux/memdelay.h | 174 +++++++++++++++++ include/linux/mmzone.h | 1 + include/linux/page-flags.h | 5 +- include/linux/sched.h | 10 +- include/linux/sched/loadavg.h | 3 + include/linux/swap.h | 2 +- include/trace/events/mmflags.h | 1 + kernel/cgroup/cgroup.c | 4 +- kernel/debug/kdb/kdb_main.c | 7 +- kernel/fork.c | 4 + kernel/sched/Makefile | 2 +- kernel/sched/core.c | 20 ++ kernel/sched/memdelay.c | 112 +++++++++++ mm/Makefile | 2 +- mm/compaction.c | 4 + mm/filemap.c | 18 +- mm/huge_memory.c | 1 + mm/memcontrol.c | 25 +++ mm/memdelay.c | 289 ++++++++++++++++++++++++++++ mm/migrate.c | 2 + mm/page_alloc.c | 11 +- mm/swap_state.c | 1 + mm/vmscan.c | 10 + mm/vmstat.c | 1 + mm/workingset.c | 98 ++++++---- 34 files changed, 792 insertions(+), 69 deletions(-)
WARNING: multiple messages have this Message-ID (diff)
From: Johannes Weiner <hannes@cmpxchg.org> To: Ingo Molnar <mingo@redhat.com>, Peter Zijlstra <peterz@infradead.org>, Andrew Morton <akpm@linux-foundation.org>, Rik van Riel <riel@redhat.com>, Mel Gorman <mgorman@suse.de> Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, kernel-team@fb.com Subject: [PATCH 0/3] memdelay: memory health metric for systems and workloads Date: Thu, 27 Jul 2017 11:30:07 -0400 [thread overview] Message-ID: <20170727153010.23347-1-hannes@cmpxchg.org> (raw) This patch series implements a fine-grained metric for memory health. It builds on top of the refault detection code to quantify the time lost on VM events that occur exclusively due a lack of memory and maps it into a percentage of lost walltime for the system and cgroups. Rationale When presented with a Linux system or container executing a workload, it's hard to judge the health of its memory situation. The statistics exported by the memory management subsystem can reveal smoking guns: page reclaim activity, major faults and refaults can be indicative of an unhealthy memory situation. But they don't actually quantify the cost a memory shortage imposes on the system or workload. How bad is it when 2000 pages are refaulting each second? If the data is stored contiguously on a fast flash drive, it might be okay. If the data is spread out all over a rotating disk, it could be a problem - unless the CPUs are still fully utilized, in which case adding memory wouldn't make things move faster, but instead wait for CPU time. A previous attempt to provide a health signal from the VM was the vmpressure interface, 70ddf637eebe ("memcg: add memory.pressure_level events"). This derives its pressure levels from recently observed reclaim efficiency. As pages are scanned but not reclaimed, the ratio is translated into levels of low, medium, and critical pressure. However, the vmpressure scale is too coarse for today's systems. The accuracy relies on storage being relatively slow compared to how fast the CPU can go through the LRUs, so that when LRU scan cycles outstrip IO completion rates the reclaim code runs into pages that are still reading from disk. But as solid state devices close this speed gap, and memory sizes are in the hundreds of gigabytes, this effect has almost completely disappeared. By the time the reclaim scanner runs into in-flight pages, the tasks in the system already spend a significant part of their runtime waiting for refaulting pages. The vmpressure range is compressed into the split second before OOM and misses large, practically relevant parts of the pressure spectrum. Knowing the exact time penalty that the kernel's paging activity is imposing on a workload is a powerful tool. It allows users to finetune a workload to available memory, but also detect and quantify minute regressions and improvements in the reclaim and caching algorithms. Structure The first patch cleans up the different loadavg callsites and macros as the memdelay averages are going to be tracked using these. The second patch adds a distinction between page cache transitions (inactive list refaults) and page cache thrashing (active list refaults), since only the latter are unproductive refaults. The third patch finally adds the memdelay accounting and interface: its scheduler side identifies productive and unproductive task states, and the VM side aggregates them into system and cgroup domain states and calculates moving averages of the time spent in each state. arch/powerpc/platforms/cell/spufs/sched.c | 3 - arch/s390/appldata/appldata_os.c | 4 - drivers/cpuidle/governors/menu.c | 4 - fs/proc/array.c | 8 + fs/proc/base.c | 2 + fs/proc/internal.h | 2 + fs/proc/loadavg.c | 3 - include/linux/cgroup.h | 14 ++ include/linux/memcontrol.h | 14 ++ include/linux/memdelay.h | 174 +++++++++++++++++ include/linux/mmzone.h | 1 + include/linux/page-flags.h | 5 +- include/linux/sched.h | 10 +- include/linux/sched/loadavg.h | 3 + include/linux/swap.h | 2 +- include/trace/events/mmflags.h | 1 + kernel/cgroup/cgroup.c | 4 +- kernel/debug/kdb/kdb_main.c | 7 +- kernel/fork.c | 4 + kernel/sched/Makefile | 2 +- kernel/sched/core.c | 20 ++ kernel/sched/memdelay.c | 112 +++++++++++ mm/Makefile | 2 +- mm/compaction.c | 4 + mm/filemap.c | 18 +- mm/huge_memory.c | 1 + mm/memcontrol.c | 25 +++ mm/memdelay.c | 289 ++++++++++++++++++++++++++++ mm/migrate.c | 2 + mm/page_alloc.c | 11 +- mm/swap_state.c | 1 + mm/vmscan.c | 10 + mm/vmstat.c | 1 + mm/workingset.c | 98 ++++++---- 34 files changed, 792 insertions(+), 69 deletions(-) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next reply other threads:[~2017-07-27 15:30 UTC|newest] Thread overview: 43+ messages / expand[flat|nested] mbox.gz Atom feed top 2017-07-27 15:30 Johannes Weiner [this message] 2017-07-27 15:30 ` [PATCH 0/3] memdelay: memory health metric for systems and workloads Johannes Weiner 2017-07-27 15:30 ` [PATCH 1/3] sched/loadavg: consolidate LOAD_INT, LOAD_FRAC macros Johannes Weiner 2017-07-27 15:30 ` Johannes Weiner 2017-07-27 15:30 ` [PATCH 2/3] mm: workingset: tell cache transitions from workingset thrashing Johannes Weiner 2017-07-27 15:30 ` Johannes Weiner 2017-07-27 15:30 ` [PATCH 3/3] mm/sched: memdelay: memory health interface for systems and workloads Johannes Weiner 2017-07-27 15:30 ` Johannes Weiner 2017-07-27 15:56 ` Johannes Weiner 2017-07-27 15:56 ` Johannes Weiner 2017-07-29 9:10 ` Peter Zijlstra 2017-07-29 9:10 ` Peter Zijlstra 2017-07-30 15:28 ` Johannes Weiner 2017-07-30 15:28 ` Johannes Weiner 2017-07-31 8:31 ` Peter Zijlstra 2017-07-31 8:31 ` Peter Zijlstra 2017-07-31 18:41 ` Johannes Weiner 2017-07-31 18:41 ` Johannes Weiner 2017-07-31 19:49 ` Mike Galbraith 2017-07-31 19:49 ` Mike Galbraith 2017-07-31 20:38 ` Johannes Weiner 2017-07-31 20:38 ` Johannes Weiner 2017-08-01 2:23 ` Mike Galbraith 2017-08-01 2:23 ` Mike Galbraith 2017-08-01 7:57 ` Peter Zijlstra 2017-08-01 7:57 ` Peter Zijlstra 2017-08-01 12:26 ` Johannes Weiner 2017-08-01 12:26 ` Johannes Weiner 2017-08-13 14:52 ` Peter Zijlstra 2017-08-13 14:52 ` Peter Zijlstra 2017-07-29 13:31 ` kbuild test robot 2017-07-27 20:43 ` [PATCH 0/3] memdelay: memory health metric " Andrew Morton 2017-07-27 20:43 ` Andrew Morton 2017-07-28 19:43 ` Johannes Weiner 2017-07-28 19:43 ` Johannes Weiner 2017-08-02 8:11 ` Michal Hocko 2017-08-02 8:11 ` Michal Hocko 2017-07-29 2:48 ` Mike Galbraith 2017-07-29 2:48 ` Mike Galbraith 2017-07-29 3:21 ` Mike Galbraith 2017-07-29 3:21 ` Mike Galbraith 2017-07-29 6:38 ` Mike Galbraith 2017-07-29 6:38 ` Mike Galbraith
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=20170727153010.23347-1-hannes@cmpxchg.org \ --to=hannes@cmpxchg.org \ --cc=akpm@linux-foundation.org \ --cc=kernel-team@fb.com \ --cc=linux-kernel@vger.kernel.org \ --cc=linux-mm@kvack.org \ --cc=mgorman@suse.de \ --cc=mingo@redhat.com \ --cc=peterz@infradead.org \ --cc=riel@redhat.com \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: linkBe sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.