From: Andrew Morton <akpm@linux-foundation.org>
To: chris@chrisdown.name, hannes@cmpxchg.org, hughd@google.com,
	kuba@kernel.org, mhocko@kernel.org, mm-commits@vger.kernel.org,
	shakeelb@google.com, tj@kernel.org
Subject: + mm-automatically-penalize-tasks-with-high-swap-use.patch added to -mm tree
Date: Wed, 27 May 2020 14:33:10 -0700	[thread overview]
Message-ID: <20200527213310.adtflYbJn%akpm@linux-foundation.org> (raw)
In-Reply-To: <20200522222217.ee14ad7eda7aab1e6697da6c@linux-foundation.org>


The patch titled
     Subject: mm/memcg: automatically penalize tasks with high swap use
has been added to the -mm tree.  Its filename is
     mm-automatically-penalize-tasks-with-high-swap-use.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-automatically-penalize-tasks-with-high-swap-use.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-automatically-penalize-tasks-with-high-swap-use.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Jakub Kicinski <kuba@kernel.org>
Subject: mm/memcg: automatically penalize tasks with high swap use

Add a memory.swap.high knob, which can be used to protect the system from
swap exhaustion.  The mechanism used for penalizing is similar to the
memory.high penalty (sleep on return to user space).

That is not to say that the knob itself is equivalent to memory.high.  The
objective is rather to protect the system from potentially buggy tasks
consuming a lot of swap and impacting other tasks, or even bringing the
whole system to a standstill with complete swap exhaustion, hopefully
without the need to find per-task hard limits.

Slowing misbehaving tasks down gradually allows user space oom killers or
other protection mechanisms to react.  oomd and earlyoom already kill
based on swap exhaustion, and memory.swap.high protection will help
implement such userspace oom policies more reliably.

We can use one counter for the number of pages allocated under pressure,
to save task_struct space and avoid two separate hierarchy walks on the
hot path.  The exact overage is calculated on return to user space anyway.

Take the new high limit into account when determining if swap is "full". 
Borrowing the explanation from Johannes:

  The idea behind "swap full" is that as long as the workload has plenty
  of swap space available and it's not changing its memory contents, it
  makes sense to generously hold on to copies of data in the swap device,
  even after the swapin.  A later reclaim cycle can drop the page without
  any IO.  Trading disk space for IO.

  But the only two ways to reclaim a swap slot are when the page is
  faulted in and the references go away, or by scanning the virtual
  address space like swapoff does - which is very expensive (one could
  argue it's too expensive even for swapoff; it's often more practical
  to just reboot).

  So at some point in the fill level, we have to start freeing up swap
  slots on fault/swapin.  Otherwise we could eventually run out of swap
  slots while they're filled with copies of data that is also in RAM.

  We don't want to OOM a workload because its available swap space is
  filled with redundant cache.

Link: http://lkml.kernel.org/r/20200527195846.102707-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Chris Down <chris@chrisdown.name>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 Documentation/admin-guide/cgroup-v2.rst |   20 +++++
 include/linux/memcontrol.h              |    1 
 mm/memcontrol.c                         |   88 ++++++++++++++++++++--
 3 files changed, 102 insertions(+), 7 deletions(-)

--- a/Documentation/admin-guide/cgroup-v2.rst~mm-automatically-penalize-tasks-with-high-swap-use
+++ a/Documentation/admin-guide/cgroup-v2.rst
@@ -1374,6 +1374,22 @@ PAGE_SIZE multiple when read back.
 	The total amount of swap currently being used by the cgroup
 	and its descendants.
 
+  memory.swap.high
+	A read-write single value file which exists on non-root
+	cgroups.  The default is "max".
+
+	Swap usage throttle limit.  If a cgroup's swap usage exceeds
+	this limit, all its further allocations will be throttled to
+	allow userspace to implement custom out-of-memory procedures.
+
+	This limit marks a point of no return for the cgroup. It is NOT
+	designed to manage the amount of swapping a workload does
+	during regular operation. Compare to memory.swap.max, which
+	prohibits swapping past a set amount, but lets the cgroup
+	continue unimpeded as long as other memory can be reclaimed.
+
+	Healthy workloads are not expected to reach this limit.
+
   memory.swap.max
 	A read-write single value file which exists on non-root
 	cgroups.  The default is "max".
@@ -1387,6 +1403,10 @@ PAGE_SIZE multiple when read back.
 	otherwise, a value change in this file generates a file
 	modified event.
 
+	  high
+		The number of times the cgroup's swap usage was over
+		the high threshold.
+
 	  max
 		The number of times the cgroup's swap usage was about
 		to go over the max boundary and swap allocation
--- a/include/linux/memcontrol.h~mm-automatically-penalize-tasks-with-high-swap-use
+++ a/include/linux/memcontrol.h
@@ -45,6 +45,7 @@ enum memcg_memory_event {
 	MEMCG_MAX,
 	MEMCG_OOM,
 	MEMCG_OOM_KILL,
+	MEMCG_SWAP_HIGH,
 	MEMCG_SWAP_MAX,
 	MEMCG_SWAP_FAIL,
 	MEMCG_NR_MEMORY_EVENTS,
--- a/mm/memcontrol.c~mm-automatically-penalize-tasks-with-high-swap-use
+++ a/mm/memcontrol.c
@@ -2354,6 +2354,22 @@ static u64 mem_find_max_overage(struct m
 	return max_overage;
 }
 
+static u64 swap_find_max_overage(struct mem_cgroup *memcg)
+{
+	u64 overage, max_overage = 0;
+
+	do {
+		overage = calculate_overage(page_counter_read(&memcg->swap),
+					    READ_ONCE(memcg->swap.high));
+		if (overage)
+			memcg_memory_event(memcg, MEMCG_SWAP_HIGH);
+		max_overage = max(overage, max_overage);
+	} while ((memcg = parent_mem_cgroup(memcg)) &&
+		 !mem_cgroup_is_root(memcg));
+
+	return max_overage;
+}
+
 /*
  * Get the number of jiffies that we should penalise a mischievous cgroup which
  * is exceeding its memory.high by checking both it and its ancestors.
@@ -2415,6 +2431,9 @@ void mem_cgroup_handle_over_high(void)
 	penalty_jiffies = calculate_high_delay(memcg, nr_pages,
 					       mem_find_max_overage(memcg));
 
+	penalty_jiffies += calculate_high_delay(memcg, nr_pages,
+						swap_find_max_overage(memcg));
+
 	/*
 	 * Clamp the max delay per usermode return so as to still keep the
 	 * application moving forwards and also permit diagnostics, albeit
@@ -2605,13 +2624,32 @@ done_restock:
 	 * reclaim, the cost of mismatch is negligible.
 	 */
 	do {
-		if (page_counter_read(&memcg->memory) >
-		    READ_ONCE(memcg->memory.high)) {
-			/* Don't bother a random interrupted task */
-			if (in_interrupt()) {
+		bool mem_high, swap_high;
+
+		mem_high = page_counter_read(&memcg->memory) >
+			READ_ONCE(memcg->memory.high);
+		swap_high = page_counter_read(&memcg->swap) >
+			READ_ONCE(memcg->swap.high);
+
+		/* Don't bother a random interrupted task */
+		if (in_interrupt()) {
+			if (mem_high) {
 				schedule_work(&memcg->high_work);
 				break;
 			}
+			continue;
+		}
+
+		if (mem_high || swap_high) {
+			/*
+			 * The allocating tasks in this cgroup will need to do
+			 * reclaim or be throttled to prevent further growth
+			 * of the memory or swap footprints.
+			 *
+			 * Target some best-effort fairness between the tasks,
+			 * and distribute reclaim work and delay penalties
+			 * based on how much each task is actually allocating.
+			 */
 			current->memcg_nr_pages_over_high += batch;
 			set_notify_resume(current);
 			break;
@@ -5078,6 +5116,7 @@ mem_cgroup_css_alloc(struct cgroup_subsy
 
 	page_counter_set_high(&memcg->memory, PAGE_COUNTER_MAX);
 	memcg->soft_limit = PAGE_COUNTER_MAX;
+	page_counter_set_high(&memcg->swap, PAGE_COUNTER_MAX);
 	if (parent) {
 		memcg->swappiness = mem_cgroup_swappiness(parent);
 		memcg->oom_kill_disable = parent->oom_kill_disable;
@@ -5231,6 +5270,7 @@ static void mem_cgroup_css_reset(struct
 	page_counter_set_low(&memcg->memory, 0);
 	page_counter_set_high(&memcg->memory, PAGE_COUNTER_MAX);
 	memcg->soft_limit = PAGE_COUNTER_MAX;
+	page_counter_set_high(&memcg->swap, PAGE_COUNTER_MAX);
 	memcg_wb_domain_size_changed(memcg);
 }
 
@@ -7144,10 +7184,13 @@ bool mem_cgroup_swap_full(struct page *p
 	if (!memcg)
 		return false;
 
-	for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg))
-		if (page_counter_read(&memcg->swap) * 2 >=
-		    READ_ONCE(memcg->swap.max))
+	for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg)) {
+		unsigned long usage = page_counter_read(&memcg->swap);
+
+		if (usage * 2 >= READ_ONCE(memcg->swap.high) ||
+		    usage * 2 >= READ_ONCE(memcg->swap.max))
 			return true;
+	}
 
 	return false;
 }
@@ -7177,6 +7220,29 @@ static u64 swap_current_read(struct cgro
 	return (u64)page_counter_read(&memcg->swap) * PAGE_SIZE;
 }
 
+static int swap_high_show(struct seq_file *m, void *v)
+{
+	return seq_puts_memcg_tunable(m,
+		READ_ONCE(mem_cgroup_from_seq(m)->swap.high));
+}
+
+static ssize_t swap_high_write(struct kernfs_open_file *of,
+			       char *buf, size_t nbytes, loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	unsigned long high;
+	int err;
+
+	buf = strstrip(buf);
+	err = page_counter_memparse(buf, "max", &high);
+	if (err)
+		return err;
+
+	page_counter_set_high(&memcg->swap, high);
+
+	return nbytes;
+}
+
 static int swap_max_show(struct seq_file *m, void *v)
 {
 	return seq_puts_memcg_tunable(m,
@@ -7204,6 +7270,8 @@ static int swap_events_show(struct seq_f
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
 
+	seq_printf(m, "high %lu\n",
+		   atomic_long_read(&memcg->memory_events[MEMCG_SWAP_HIGH]));
 	seq_printf(m, "max %lu\n",
 		   atomic_long_read(&memcg->memory_events[MEMCG_SWAP_MAX]));
 	seq_printf(m, "fail %lu\n",
@@ -7219,6 +7287,12 @@ static struct cftype swap_files[] = {
 		.read_u64 = swap_current_read,
 	},
 	{
+		.name = "swap.high",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = swap_high_show,
+		.write = swap_high_write,
+	},
+	{
 		.name = "swap.max",
 		.flags = CFTYPE_NOT_ON_ROOT,
 		.seq_show = swap_max_show,
_

Patches currently in -mm which might be from kuba@kernel.org are

mm-prepare-for-swap-over-high-accounting-and-penalty-calculation.patch
mm-move-penalty-delay-clamping-out-of-calculate_high_delay.patch
mm-move-cgroup-high-memory-limit-setting-into-struct-page_counter.patch
mm-automatically-penalize-tasks-with-high-swap-use.patch

