* [RFC PATCH 0/8] memcg: Enable fine-grained per process memory control
@ 2020-08-17 14:08 Waiman Long
  2020-08-17 14:08 ` [RFC PATCH 1/8] memcg: Enable fine-grained control of over memory.high action Waiman Long
                   ` (9 more replies)
  0 siblings, 10 replies; 58+ messages in thread
From: Waiman Long @ 2020-08-17 14:08 UTC (permalink / raw)
  To: Andrew Morton, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	Jonathan Corbet, Alexey Dobriyan, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot
  Cc: linux-kernel, linux-doc, linux-fsdevel, cgroups, linux-mm, Waiman Long

The memory controller can be used to control and limit the amount of
physical memory used by a task. When a limit is set in "memory.high" in
a v2 non-root memory cgroup, the memory controller will try to reclaim
memory once the limit has been exceeded. Normally, that is enough to
keep the physical memory consumption of tasks in the memory cgroup at
or below the "memory.high" limit.

Sometimes, memory reclaim may not be able to recover memory at a rate
that keeps up with the physical memory allocation rate. In this case,
the physical memory consumption will keep on increasing.  When it reaches
"memory.max" for memory cgroup v2, or when the system runs out of free
memory, the OOM killer will be invoked to kill some tasks to free up
additional memory. However, one has little control over which tasks are
going to be killed by the OOM killer. Killing tasks that hold important
resources without freeing them first can create other system problems
down the road.

Users who do not want the OOM killer to kill random tasks in an
out-of-memory situation can use the memory control facility provided
by this new patchset, via prctl(2), to specify the mitigation action
to be applied to each task when the specified memory limit is exceeded.
Memory cgroup v2 must be in use.

The currently supported mitigation actions include the following:

 1) Return ENOMEM for some syscalls that allocate or handle memory
 2) Slow down the process for memory reclaim to catch up
 3) Send a specific signal to the task
 4) Kill the task

Users who want better memory control for their applications can either
modify the applications to call the prctl(2) syscall directly with the
new memory control command code, or write the desired action to the
newly provided memctl procfs files of those applications, provided
that the applications run in a non-root v2 memory cgroup.

Waiman Long (8):
  memcg: Enable fine-grained control of over memory.high action
  memcg, mm: Return ENOMEM or delay if memcg_over_limit
  memcg: Allow the use of task RSS memory as over-high action trigger
  fs/proc: Support a new procfs memctl file
  memcg: Allow direct per-task memory limit checking
  memcg: Introduce additional memory control slowdown if needed
  memcg: Enable logging of memory control mitigation action
  memcg: Add over-high action prctl() documentation

 Documentation/userspace-api/index.rst      |   1 +
 Documentation/userspace-api/memcontrol.rst | 174 ++++++++++++++++
 fs/proc/base.c                             | 109 ++++++++++
 include/linux/memcontrol.h                 |   4 +
 include/linux/sched.h                      |  24 +++
 include/uapi/linux/prctl.h                 |  48 +++++
 kernel/fork.c                              |   1 +
 kernel/sys.c                               |  16 ++
 mm/memcontrol.c                            | 227 +++++++++++++++++++++
 mm/mlock.c                                 |   6 +
 mm/mmap.c                                  |  12 ++
 mm/mprotect.c                              |   3 +
 12 files changed, 625 insertions(+)
 create mode 100644 Documentation/userspace-api/memcontrol.rst

-- 
2.18.1


^ permalink raw reply	[flat|nested] 58+ messages in thread

* [RFC PATCH 1/8] memcg: Enable fine-grained control of over memory.high action
  2020-08-17 14:08 [RFC PATCH 0/8] memcg: Enable fine-grained per process memory control Waiman Long
@ 2020-08-17 14:08 ` Waiman Long
  2020-08-17 14:30   ` Chris Down
  2020-08-17 16:44     ` Shakeel Butt
  2020-08-17 14:08 ` [RFC PATCH 2/8] memcg, mm: Return ENOMEM or delay if memcg_over_limit Waiman Long
                   ` (8 subsequent siblings)
  9 siblings, 2 replies; 58+ messages in thread
From: Waiman Long @ 2020-08-17 14:08 UTC (permalink / raw)
  To: Andrew Morton, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	Jonathan Corbet, Alexey Dobriyan, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot
  Cc: linux-kernel, linux-doc, linux-fsdevel, cgroups, linux-mm, Waiman Long

The memory controller can be used to control and limit the amount of
physical memory used by a task. When a limit is set in "memory.high"
in a non-root memory cgroup, the memory controller will try to reclaim
memory once the limit has been exceeded. Normally, that is enough to
keep the physical memory consumption of tasks in the memory cgroup at
or below the "memory.high" limit.

Sometimes, memory reclaim may not be able to recover memory at a rate
that keeps up with the physical memory allocation rate, especially
when rotating disks are used for swapping or writing back dirty pages.
In this case, the physical memory consumption will keep on increasing.
When it reaches "memory.max", or the system really runs out of memory,
the OOM killer will be invoked to kill some tasks to free up additional
memory. However, one has little control over which tasks are going to
be killed by the OOM killer.

Users who do not want the OOM killer to kill random tasks in an
out-of-memory situation need a better way to manage memory and to deal
with applications whose physical memory consumption rate is out of
control.

A new set of prctl(2) commands is added to provide a facility that
allows users to manage the physical memory consumption of each of their
applications and to control the mitigation actions to be taken when
those applications consume more physical memory than they are supposed
to use.

The new prctl(2) commands are PR_SET_MEMCONTROL and PR_GET_MEMCONTROL,
which set the memory control parameters and retrieve those parameters
respectively.  The four possible mitigation actions for a task that
exceeds its designated memory limit are:

 1) Return ENOMEM for some syscalls that allocate or handle memory
 2) Slow down the process for memory reclaim to catch up
 3) Send a specific signal to the task
 4) Kill the task

The parameters that can be specified in the new PR_SET_MEMCONTROL
commands are:

 arg2 - the mitigation action (bits 0-7), signal number (bits 8-15)
	and flags (bits 16-31).
 arg3 - the additional memory limit (in bytes) that will be added to
	memory.high as the real limit that will trigger the mitigation
	action.

The PR_MEMFLAG_SIGCONT flag is used to specify continuous signal delivery
instead of a one-shot signal.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/linux/memcontrol.h |   4 ++
 include/linux/sched.h      |   7 +++
 include/uapi/linux/prctl.h |  37 ++++++++++++
 kernel/sys.c               |  16 ++++++
 mm/memcontrol.c            | 114 +++++++++++++++++++++++++++++++++++++
 5 files changed, 178 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index d0b036123c6a..40e6ceb8209b 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -445,6 +445,10 @@ void mem_cgroup_uncharge_list(struct list_head *page_list);
 
 void mem_cgroup_migrate(struct page *oldpage, struct page *newpage);
 
+long mem_cgroup_over_high_get(struct task_struct *task, unsigned long item);
+long mem_cgroup_over_high_set(struct task_struct *task, unsigned long action,
+			      unsigned long limit);
+
 static struct mem_cgroup_per_node *
 mem_cgroup_nodeinfo(struct mem_cgroup *memcg, int nid)
 {
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 93ecd930efd3..c79d606d27ab 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1265,6 +1265,13 @@ struct task_struct {
 	/* Number of pages to reclaim on returning to userland: */
 	unsigned int			memcg_nr_pages_over_high;
 
+	/* Memory over-high action, flags, signal and limit */
+	unsigned char			memcg_over_high_action;
+	unsigned char			memcg_over_high_signal;
+	unsigned short			memcg_over_high_flags;
+	unsigned int			memcg_over_high_climit;
+	unsigned int			memcg_over_limit;
+
 	/* Used by memcontrol for targeted memcg charge: */
 	struct mem_cgroup		*active_memcg;
 #endif
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 07b4f8131e36..87970ae7b32c 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -238,4 +238,41 @@ struct prctl_mm_map {
 #define PR_SET_IO_FLUSHER		57
 #define PR_GET_IO_FLUSHER		58
 
+/* Per task fine-grained memory cgroup control */
+#define PR_GET_MEMCONTROL		59
+#define PR_SET_MEMCONTROL		60
+
+/*
+ * PR_SET_MEMCONTROL:
+ * 2 parameters are passed:
+ *  - Action word
+ *  - Memory cgroup additional memory limit
+ *
+ * The action word consists of 3 bit fields:
+ *  - Bits  0-7 : over-memory-limit action code
+ *  - Bits  8-15: signal number
+ *  - Bits 16-31: action flags
+ */
+
+/* Control values for PR_SET_MEMCONTROL over limit action */
+# define PR_MEMACT_NONE			0
+# define PR_MEMACT_ENOMEM		1	/* Deny memory request */
+# define PR_MEMACT_SLOWDOWN		2	/* Slow down the process */
+# define PR_MEMACT_SIGNAL		3	/* Send signal */
+# define PR_MEMACT_KILL			4	/* Kill the process */
+# define PR_MEMACT_MAX			PR_MEMACT_KILL
+
+/* Flags for PR_SET_MEMCONTROL */
+# define PR_MEMFLAG_SIGCONT		(1UL << 0) /* Continuous signal delivery */
+# define PR_MEMFLAG_MASK		PR_MEMFLAG_SIGCONT
+
+/* Action word masks */
+# define PR_MEMACT_MASK			0xff
+# define PR_MEMACT_SIG_SHIFT		8
+# define PR_MEMACT_FLG_SHIFT		16
+
+/* Return specified value for PR_GET_MEMCONTROL */
+# define PR_MEMGET_ACTION		0
+# define PR_MEMGET_CLIMIT		1
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/sys.c b/kernel/sys.c
index ca11af9d815d..644b86235d7f 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -64,6 +64,10 @@
 
 #include <linux/nospec.h>
 
+#ifdef CONFIG_MEMCG
+#include <linux/memcontrol.h>
+#endif
+
 #include <linux/kmsg_dump.h>
 /* Move somewhere else to avoid recompiling? */
 #include <generated/utsrelease.h>
@@ -2530,6 +2534,18 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 
 		error = (current->flags & PR_IO_FLUSHER) == PR_IO_FLUSHER;
 		break;
+#ifdef CONFIG_MEMCG
+	case PR_GET_MEMCONTROL:
+		if (arg3 || arg4 || arg5)
+			return -EINVAL;
+		error = mem_cgroup_over_high_get(me, arg2);
+		break;
+	case PR_SET_MEMCONTROL:
+		if (arg4 || arg5)
+			return -EINVAL;
+		error = mem_cgroup_over_high_set(me, arg2, arg3);
+		break;
+#endif
 	default:
 		error = -EINVAL;
 		break;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b807952b4d43..1106dac024ac 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -59,6 +59,7 @@
 #include <linux/tracehook.h>
 #include <linux/psi.h>
 #include <linux/seq_buf.h>
+#include <linux/prctl.h>
 #include "internal.h"
 #include <net/sock.h>
 #include <net/ip.h>
@@ -2628,6 +2629,71 @@ void mem_cgroup_handle_over_high(void)
 	css_put(&memcg->css);
 }
 
+/*
+ * Task specific action when over the high limit.
+ * Return true if an action has been taken or further check is not needed,
+ * false otherwise.
+ */
+static bool __mem_cgroup_over_high_action(struct mem_cgroup *memcg, u8 action)
+{
+	unsigned long mem;
+	bool ret = false;
+	struct mm_struct *mm = get_task_mm(current);
+	u8  signal = READ_ONCE(current->memcg_over_high_signal);
+	u16 flags  = READ_ONCE(current->memcg_over_high_flags);
+	u32 limit  = READ_ONCE(current->memcg_over_high_climit);
+
+	if (!mm)
+		return true;	/* No more check is needed */
+
+	current->memcg_over_limit = false;
+	if ((action == PR_MEMACT_SIGNAL) && !signal)
+		goto out;
+
+	mem = page_counter_read(&memcg->memory);
+	if (mem <= memcg->memory.high + limit)
+		goto out;
+
+	ret = true;
+	switch (action) {
+	case PR_MEMACT_ENOMEM:
+		WRITE_ONCE(current->memcg_over_limit, true);
+		break;
+	case PR_MEMACT_SLOWDOWN:
+		/* Slow down by yielding the cpu */
+		set_tsk_need_resched(current);
+		set_preempt_need_resched();
+		break;
+	case PR_MEMACT_KILL:
+		signal = SIGKILL;
+		fallthrough;
+	case PR_MEMACT_SIGNAL:
+		force_sig(signal);
+
+		/* Deliver signal only once if !PR_MEMFLAG_SIGCONT */
+		if (!(flags & PR_MEMFLAG_SIGCONT))
+			WRITE_ONCE(current->memcg_over_high_signal, 0);
+		break;
+	}
+
+out:
+	mmput(mm);
+	return ret;
+}
+
+/*
+ * Return true if an action has been taken or further check is not needed,
+ * false otherwise.
+ */
+static inline bool mem_cgroup_over_high_action(struct mem_cgroup *memcg)
+{
+	u8 action = READ_ONCE(current->memcg_over_high_action);
+
+	if (!action)
+		return true;	/* No more check is needed */
+	return __mem_cgroup_over_high_action(memcg, action);
+}
+
 static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 		      unsigned int nr_pages)
 {
@@ -2639,6 +2705,7 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	unsigned long nr_reclaimed;
 	bool may_swap = true;
 	bool drained = false;
+	bool taken = false;
 	unsigned long pflags;
 
 	if (mem_cgroup_is_root(memcg))
@@ -2797,6 +2864,9 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 		swap_high = page_counter_read(&memcg->swap) >
 			READ_ONCE(memcg->swap.high);
 
+		if (mem_high && !taken)
+			taken = mem_cgroup_over_high_action(memcg);
+
 		/* Don't bother a random interrupted task */
 		if (in_interrupt()) {
 			if (mem_high) {
@@ -6959,6 +7029,50 @@ void mem_cgroup_sk_free(struct sock *sk)
 		css_put(&sk->sk_memcg->css);
 }
 
+/*
+ * Get and set cgroup memory-over-high attributes.
+ */
+long mem_cgroup_over_high_get(struct task_struct *task, unsigned long item)
+{
+	switch (item) {
+	case PR_MEMGET_ACTION:
+		return task->memcg_over_high_action |
+		      (task->memcg_over_high_signal << PR_MEMACT_SIG_SHIFT) |
+		      (task->memcg_over_high_flags  << PR_MEMACT_FLG_SHIFT);
+
+	case PR_MEMGET_CLIMIT:
+		return (long)task->memcg_over_high_climit * PAGE_SIZE;
+	}
+	return -EINVAL;
+}
+
+long mem_cgroup_over_high_set(struct task_struct *task, unsigned long action,
+			      unsigned long limit)
+{
+	unsigned char  cmd   = action & PR_MEMACT_MASK;
+	unsigned char  sig   = (action >> PR_MEMACT_SIG_SHIFT) & PR_MEMACT_MASK;
+	unsigned short flags = action >> PR_MEMACT_FLG_SHIFT;
+
+	if ((cmd > PR_MEMACT_MAX) || (flags & ~PR_MEMFLAG_MASK) ||
+	    (sig >= _NSIG))
+		return -EINVAL;
+
+	WRITE_ONCE(task->memcg_over_high_action, cmd);
+	WRITE_ONCE(task->memcg_over_high_signal, sig);
+	WRITE_ONCE(task->memcg_over_high_flags, flags);
+
+	if (cmd == PR_MEMACT_NONE) {
+		WRITE_ONCE(task->memcg_over_high_climit, 0);
+	} else {
+		/*
+		 * Convert limits to # of pages
+		 */
+		limit = DIV_ROUND_UP(limit, PAGE_SIZE);
+		WRITE_ONCE(task->memcg_over_high_climit, limit);
+	}
+	return 0;
+}
+
 /**
  * mem_cgroup_charge_skmem - charge socket memory
  * @memcg: memcg to charge
-- 
2.18.1



* [RFC PATCH 2/8] memcg, mm: Return ENOMEM or delay if memcg_over_limit
  2020-08-17 14:08 [RFC PATCH 0/8] memcg: Enable fine-grained per process memory control Waiman Long
  2020-08-17 14:08 ` [RFC PATCH 1/8] memcg: Enable fine-grained control of over memory.high action Waiman Long
@ 2020-08-17 14:08 ` Waiman Long
  2020-08-17 14:08   ` Waiman Long
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 58+ messages in thread
From: Waiman Long @ 2020-08-17 14:08 UTC (permalink / raw)
  To: Andrew Morton, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	Jonathan Corbet, Alexey Dobriyan, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot
  Cc: linux-kernel, linux-doc, linux-fsdevel, cgroups, linux-mm, Waiman Long

The brk(), mmap(), mlock(), mlockall() and mprotect() syscalls are
modified to check the memcg_over_limit flag and return ENOMEM when it
is set and the memory control action is PR_MEMACT_ENOMEM.

In case the action is PR_MEMACT_SLOWDOWN, an artificial delay of 20ms
will be added to slow down the memory allocation syscalls.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/linux/sched.h | 16 ++++++++++++++++
 kernel/fork.c         |  1 +
 mm/memcontrol.c       | 25 +++++++++++++++++++++++--
 mm/mlock.c            |  6 ++++++
 mm/mmap.c             | 12 ++++++++++++
 mm/mprotect.c         |  3 +++
 6 files changed, 61 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index c79d606d27ab..9ec1bd072334 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1477,6 +1477,22 @@ static inline char task_state_to_char(struct task_struct *tsk)
 	return task_index_to_char(task_state_index(tsk));
 }
 
+#ifdef CONFIG_MEMCG
+extern bool mem_cgroup_check_over_limit(void);
+
+static inline bool mem_over_memcg_limit(void)
+{
+	if (READ_ONCE(current->memcg_over_limit))
+		return mem_cgroup_check_over_limit();
+	return false;
+}
+#else
+static inline bool mem_over_memcg_limit(void)
+{
+	return false;
+}
+#endif
+
 /**
  * is_global_init - check if a task structure is init. Since init
  * is free to have sub-threads we need to check tgid.
diff --git a/kernel/fork.c b/kernel/fork.c
index 4d32190861bd..61f9a9e5f857 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -940,6 +940,7 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 
 #ifdef CONFIG_MEMCG
 	tsk->active_memcg = NULL;
+	tsk->memcg_over_limit = false;
 #endif
 	return tsk;
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1106dac024ac..5cad7bb26d13 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2646,7 +2646,9 @@ static bool __mem_cgroup_over_high_action(struct mem_cgroup *memcg, u8 action)
 	if (!mm)
 		return true;	/* No more check is needed */
 
-	current->memcg_over_limit = false;
+	if (READ_ONCE(current->memcg_over_limit))
+		WRITE_ONCE(current->memcg_over_limit, false);
+
 	if ((action == PR_MEMACT_SIGNAL) && !signal)
 		goto out;
 
@@ -2660,7 +2662,11 @@ static bool __mem_cgroup_over_high_action(struct mem_cgroup *memcg, u8 action)
 		WRITE_ONCE(current->memcg_over_limit, true);
 		break;
 	case PR_MEMACT_SLOWDOWN:
-		/* Slow down by yielding the cpu */
+		/*
+		 * Slow down by yielding the cpu & adding delay to
+		 * memory allocation syscalls.
+		 */
+		WRITE_ONCE(current->memcg_over_limit, true);
 		set_tsk_need_resched(current);
 		set_preempt_need_resched();
 		break;
@@ -2694,6 +2700,21 @@ static inline bool mem_cgroup_over_high_action(struct mem_cgroup *memcg)
 	return __mem_cgroup_over_high_action(memcg, action);
 }
 
+/*
+ * Called from memory allocation syscalls.
+ * Return true if ENOMEM should be returned, false otherwise.
+ */
+bool mem_cgroup_check_over_limit(void)
+{
+	u8 action = READ_ONCE(current->memcg_over_high_action);
+
+	if (action == PR_MEMACT_ENOMEM)
+		return true;
+	if (action == PR_MEMACT_SLOWDOWN)
+		msleep(20);	/* Artificial delay of 20ms */
+	return false;
+}
+
 static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 		      unsigned int nr_pages)
 {
diff --git a/mm/mlock.c b/mm/mlock.c
index 93ca2bf30b4f..130d4b3fa0f5 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -678,6 +678,9 @@ static __must_check int do_mlock(unsigned long start, size_t len, vm_flags_t fla
 	if (!can_do_mlock())
 		return -EPERM;
 
+	if (mem_over_memcg_limit())
+		return -ENOMEM;
+
 	len = PAGE_ALIGN(len + (offset_in_page(start)));
 	start &= PAGE_MASK;
 
@@ -807,6 +810,9 @@ SYSCALL_DEFINE1(mlockall, int, flags)
 	if (!can_do_mlock())
 		return -EPERM;
 
+	if (mem_over_memcg_limit())
+		return -ENOMEM;
+
 	lock_limit = rlimit(RLIMIT_MEMLOCK);
 	lock_limit >>= PAGE_SHIFT;
 
diff --git a/mm/mmap.c b/mm/mmap.c
index 40248d84ad5f..873ccf2560a6 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -198,6 +198,10 @@ SYSCALL_DEFINE1(brk, unsigned long, brk)
 	bool downgraded = false;
 	LIST_HEAD(uf);
 
+	/* Too much memory used? */
+	if (mem_over_memcg_limit())
+		return -ENOMEM;
+
 	if (mmap_write_lock_killable(mm))
 		return -EINTR;
 
@@ -1407,6 +1411,10 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 	if (mm->map_count > sysctl_max_map_count)
 		return -ENOMEM;
 
+	/* Too much memory used? */
+	if (mem_over_memcg_limit())
+		return -ENOMEM;
+
 	/* Obtain the address to map to. we verify (or select) it and ensure
 	 * that it represents a valid section of the address space.
 	 */
@@ -1557,6 +1565,10 @@ unsigned long ksys_mmap_pgoff(unsigned long addr, unsigned long len,
 	struct file *file = NULL;
 	unsigned long retval;
 
+	/* Too much memory used? */
+	if (mem_over_memcg_limit())
+		return -ENOMEM;
+
 	if (!(flags & MAP_ANONYMOUS)) {
 		audit_mmap_fd(fd, flags);
 		file = fget(fd);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index ce8b8a5eacbb..b2c0f50bb0a0 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -519,6 +519,9 @@ static int do_mprotect_pkey(unsigned long start, size_t len,
 	const bool rier = (current->personality & READ_IMPLIES_EXEC) &&
 				(prot & PROT_READ);
 
+	if (mem_over_memcg_limit())
+		return -ENOMEM;
+
 	start = untagged_addr(start);
 
 	prot &= ~(PROT_GROWSDOWN|PROT_GROWSUP);
-- 
2.18.1



* [RFC PATCH 3/8] memcg: Allow the use of task RSS memory as over-high action trigger
@ 2020-08-17 14:08   ` Waiman Long
  0 siblings, 0 replies; 58+ messages in thread
From: Waiman Long @ 2020-08-17 14:08 UTC (permalink / raw)
  To: Andrew Morton, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	Jonathan Corbet, Alexey Dobriyan, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot
  Cc: linux-kernel, linux-doc, linux-fsdevel, cgroups, linux-mm, Waiman Long

The total memory consumption of a task as tracked by the memory cgroup
includes different types of memory like page cache, anonymous memory,
shared memory and kernel memory.

In a memory cgroup with multiple running tasks, using the total memory
consumption of all the tasks within the cgroup as the action trigger
may not be fair to tasks that don't contribute to the excessive memory
usage.

Page cache memory can typically be shared between multiple tasks. It
is also not easy to pin kernel memory usage to a specific task. That
leaves a task's anonymous (RSS) memory usage as the best proxy for its
contribution to the total memory consumption within the memory cgroup.

So a new set of PR_MEMFLAG_RSS_* flags is added to enable checking a
task's real RSS memory footprint as a trigger for the over-high action,
provided that the total memory consumption of the cgroup has exceeded
memory.high plus the additional memcg memory limit.
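The resulting two-level trigger can be sketched as a plain predicate
(a userspace model of the check in __mem_cgroup_over_high_action(), not
kernel code; all quantities in pages):

```c
#include <stdbool.h>

/*
 * Model of the two-level trigger added by this patch: the per-task RSS
 * check only applies once the cgroup as a whole has exceeded
 * memory.high + climit. All values are in pages.
 */
static bool over_high_triggered(unsigned long cgroup_usage,
				unsigned long memory_high,
				unsigned long climit,
				bool rss_flags_set,
				unsigned long task_rss,
				unsigned long plimit)
{
	if (cgroup_usage <= memory_high + climit)
		return false;		/* cgroup still within bounds */
	if (rss_flags_set && task_rss <= plimit)
		return false;		/* this task is not the offender */
	return true;			/* take the over-high action */
}
```

Without any PR_MEMFLAG_RSS_* flag the behavior is unchanged from patch
1: any task in an over-limit cgroup triggers the action.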

Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/linux/memcontrol.h |  2 +-
 include/linux/sched.h      |  3 ++-
 include/uapi/linux/prctl.h | 14 +++++++++++---
 kernel/sys.c               |  4 ++--
 mm/memcontrol.c            | 32 ++++++++++++++++++++++++++++++--
 5 files changed, 46 insertions(+), 9 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 40e6ceb8209b..562958cf79d8 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -447,7 +447,7 @@ void mem_cgroup_migrate(struct page *oldpage, struct page *newpage);
 
 long mem_cgroup_over_high_get(struct task_struct *task, unsigned long item);
 long mem_cgroup_over_high_set(struct task_struct *task, unsigned long action,
-			      unsigned long limit);
+			      unsigned long limit, unsigned long limit2);
 
 static struct mem_cgroup_per_node *
 mem_cgroup_nodeinfo(struct mem_cgroup *memcg, int nid)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9ec1bd072334..a1e9ac8b9b16 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1265,11 +1265,12 @@ struct task_struct {
 	/* Number of pages to reclaim on returning to userland: */
 	unsigned int			memcg_nr_pages_over_high;
 
-	/* Memory over-high action, flags, signal and limit */
+	/* Memory over-high action, flags, signal and limits */
 	unsigned char			memcg_over_high_action;
 	unsigned char			memcg_over_high_signal;
 	unsigned short			memcg_over_high_flags;
 	unsigned int			memcg_over_high_climit;
+	unsigned int			memcg_over_high_plimit;
 	unsigned int			memcg_over_limit;
 
 	/* Used by memcontrol for targeted memcg charge: */
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 87970ae7b32c..ef8d84c94b4a 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -244,9 +244,10 @@ struct prctl_mm_map {
 
 /*
  * PR_SET_MEMCONTROL:
- * 2 parameters are passed:
+ * 3 parameters are passed:
  *  - Action word
  *  - Memory cgroup additional memory limit
+ *  - Flag specific memory limit
  *
  * The action word consists of 3 bit fields:
  *  - Bits  0-7 : over-memory-limit action code
@@ -263,8 +264,14 @@ struct prctl_mm_map {
 # define PR_MEMACT_MAX			PR_MEMACT_KILL
 
 /* Flags for PR_SET_MEMCONTROL */
-# define PR_MEMFLAG_SIGCONT		(1UL << 0) /* Continuous signal delivery */
-# define PR_MEMFLAG_MASK		PR_MEMFLAG_SIGCONT
+# define PR_MEMFLAG_SIGCONT		(1UL <<  0) /* Continuous signal delivery */
+# define PR_MEMFLAG_RSS_ANON		(1UL <<  8) /* Check anonymous pages */
+# define PR_MEMFLAG_RSS_FILE		(1UL <<  9) /* Check file pages */
+# define PR_MEMFLAG_RSS_SHMEM		(1UL << 10) /* Check shmem pages */
+# define PR_MEMFLAG_RSS			(PR_MEMFLAG_RSS_ANON |\
+					 PR_MEMFLAG_RSS_FILE |\
+					 PR_MEMFLAG_RSS_SHMEM)
+# define PR_MEMFLAG_MASK		(PR_MEMFLAG_SIGCONT | PR_MEMFLAG_RSS)
 
 /* Action word masks */
 # define PR_MEMACT_MASK			0xff
@@ -274,5 +281,6 @@ struct prctl_mm_map {
 /* Return specified value for PR_GET_MEMCONTROL */
 # define PR_MEMGET_ACTION		0
 # define PR_MEMGET_CLIMIT		1
+# define PR_MEMGET_PLIMIT		2
 
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/sys.c b/kernel/sys.c
index 644b86235d7f..272f82227c2d 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2541,9 +2541,9 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 		error = mem_cgroup_over_high_get(me, arg2);
 		break;
 	case PR_SET_MEMCONTROL:
-		if (arg4 || arg5)
+		if (arg5)
 			return -EINVAL;
-		error = mem_cgroup_over_high_set(me, arg2, arg3);
+		error = mem_cgroup_over_high_set(me, arg2, arg3, arg4);
 		break;
 #endif
 	default:
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5cad7bb26d13..aa76bae7f408 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2629,6 +2629,12 @@ void mem_cgroup_handle_over_high(void)
 	css_put(&memcg->css);
 }
 
+static inline unsigned long
+get_rss_counter(struct mm_struct *mm, int mm_bit, u16 flags, int rss_bit)
+{
+	return (flags & rss_bit) ? get_mm_counter(mm, mm_bit) : 0;
+}
+
 /*
  * Task specific action when over the high limit.
  * Return true if an action has been taken or further check is not needed,
@@ -2656,6 +2662,22 @@ static bool __mem_cgroup_over_high_action(struct mem_cgroup *memcg, u8 action)
 	if (mem <= memcg->memory.high + limit)
 		goto out;
 
+	/*
+	 * Check RSS memory if any of the PR_MEMFLAG_RSS flags is set.
+	 */
+	if (flags & PR_MEMFLAG_RSS) {
+		mem = get_rss_counter(mm, MM_ANONPAGES, flags,
+				      PR_MEMFLAG_RSS_ANON) +
+		      get_rss_counter(mm, MM_FILEPAGES, flags,
+				      PR_MEMFLAG_RSS_FILE) +
+		      get_rss_counter(mm, MM_SHMEMPAGES, flags,
+				      PR_MEMFLAG_RSS_SHMEM);
+
+		limit = READ_ONCE(current->memcg_over_high_plimit);
+		if (mem <= limit)
+			goto out;
+	}
+
 	ret = true;
 	switch (action) {
 	case PR_MEMACT_ENOMEM:
@@ -7063,12 +7085,15 @@ long mem_cgroup_over_high_get(struct task_struct *task, unsigned long item)
 
 	case PR_MEMGET_CLIMIT:
 		return (long)task->memcg_over_high_climit * PAGE_SIZE;
+
+	case PR_MEMGET_PLIMIT:
+		return (long)task->memcg_over_high_plimit * PAGE_SIZE;
 	}
 	return -EINVAL;
 }
 
 long mem_cgroup_over_high_set(struct task_struct *task, unsigned long action,
-			      unsigned long limit)
+			      unsigned long limit, unsigned long limit2)
 {
 	unsigned char  cmd   = action & PR_MEMACT_MASK;
 	unsigned char  sig   = (action >> PR_MEMACT_SIG_SHIFT) & PR_MEMACT_MASK;
@@ -7084,12 +7109,15 @@ long mem_cgroup_over_high_set(struct task_struct *task, unsigned long action,
 
 	if (cmd == PR_MEMACT_NONE) {
 		WRITE_ONCE(task->memcg_over_high_climit, 0);
+		WRITE_ONCE(task->memcg_over_high_plimit, 0);
 	} else {
 		/*
 		 * Convert limits to # of pages
 		 */
-		limit = DIV_ROUND_UP(limit, PAGE_SIZE);
+		limit  = DIV_ROUND_UP(limit, PAGE_SIZE);
+		limit2 = DIV_ROUND_UP(limit2, PAGE_SIZE);
 		WRITE_ONCE(task->memcg_over_high_climit, limit);
+		WRITE_ONCE(task->memcg_over_high_plimit, limit2);
 	}
 	return 0;
 }
-- 
2.18.1



index 5cad7bb26d13..aa76bae7f408 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2629,6 +2629,12 @@ void mem_cgroup_handle_over_high(void)
 	css_put(&memcg->css);
 }
 
+static inline unsigned long
+get_rss_counter(struct mm_struct *mm, int mm_bit, u16 flags, int rss_bit)
+{
+	return (flags & rss_bit) ? get_mm_counter(mm, mm_bit) : 0;
+}
+
 /*
  * Task specific action when over the high limit.
  * Return true if an action has been taken or further check is not needed,
@@ -2656,6 +2662,22 @@ static bool __mem_cgroup_over_high_action(struct mem_cgroup *memcg, u8 action)
 	if (mem <= memcg->memory.high + limit)
 		goto out;
 
+	/*
+	 * Check RSS memory if any of the PR_MEMFLAG_RSS flags is set.
+	 */
+	if (flags & PR_MEMFLAG_RSS) {
+		mem = get_rss_counter(mm, MM_ANONPAGES, flags,
+				      PR_MEMFLAG_RSS_ANON) +
+		      get_rss_counter(mm, MM_FILEPAGES, flags,
+				      PR_MEMFLAG_RSS_FILE) +
+		      get_rss_counter(mm, MM_SHMEMPAGES, flags,
+				      PR_MEMFLAG_RSS_SHMEM);
+
+		limit = READ_ONCE(current->memcg_over_high_plimit);
+		if (mem <= limit)
+			goto out;
+	}
+
 	ret = true;
 	switch (action) {
 	case PR_MEMACT_ENOMEM:
@@ -7063,12 +7085,15 @@ long mem_cgroup_over_high_get(struct task_struct *task, unsigned long item)
 
 	case PR_MEMGET_CLIMIT:
 		return (long)task->memcg_over_high_climit * PAGE_SIZE;
+
+	case PR_MEMGET_PLIMIT:
+		return (long)task->memcg_over_high_plimit * PAGE_SIZE;
 	}
 	return -EINVAL;
 }
 
 long mem_cgroup_over_high_set(struct task_struct *task, unsigned long action,
-			      unsigned long limit)
+			      unsigned long limit, unsigned long limit2)
 {
 	unsigned char  cmd   = action & PR_MEMACT_MASK;
 	unsigned char  sig   = (action >> PR_MEMACT_SIG_SHIFT) & PR_MEMACT_MASK;
@@ -7084,12 +7109,15 @@ long mem_cgroup_over_high_set(struct task_struct *task, unsigned long action,
 
 	if (cmd == PR_MEMACT_NONE) {
 		WRITE_ONCE(task->memcg_over_high_climit, 0);
+		WRITE_ONCE(task->memcg_over_high_plimit, 0);
 	} else {
 		/*
 		 * Convert limits to # of pages
 		 */
-		limit = DIV_ROUND_UP(limit, PAGE_SIZE);
+		limit  = DIV_ROUND_UP(limit, PAGE_SIZE);
+		limit2 = DIV_ROUND_UP(limit2, PAGE_SIZE);
 		WRITE_ONCE(task->memcg_over_high_climit, limit);
+		WRITE_ONCE(task->memcg_over_high_plimit, limit2);
 	}
 	return 0;
 }
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [RFC PATCH 4/8] fs/proc: Support a new procfs memctl file
  2020-08-17 14:08 [RFC PATCH 0/8] memcg: Enable fine-grained per process memory control Waiman Long
                   ` (2 preceding siblings ...)
  2020-08-17 14:08   ` Waiman Long
@ 2020-08-17 14:08 ` Waiman Long
  2020-08-17 20:10   ` kernel test robot
  2020-08-17 20:10   ` [RFC PATCH] fs/proc: proc_memctl_operations can be static kernel test robot
  2020-08-17 14:08 ` [RFC PATCH 5/8] memcg: Allow direct per-task memory limit checking Waiman Long
                   ` (5 subsequent siblings)
  9 siblings, 2 replies; 58+ messages in thread
From: Waiman Long @ 2020-08-17 14:08 UTC (permalink / raw)
  To: Andrew Morton, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	Jonathan Corbet, Alexey Dobriyan, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot
  Cc: linux-kernel, linux-doc, linux-fsdevel, cgroups, linux-mm, Waiman Long

To allow system administrators to view and modify the over-high action
settings of a running application, a new /proc/<pid>/memctl file is
added to show the over-high action parameters and to allow their
modification.
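
The file holds a single "<action word> <climit> <plimit>" line, which
can be decoded the same way the kernel side does. A hedged userspace
sketch (the helper name is hypothetical; the mask and shift values
mirror the patch's prctl.h layout):

```c
#include <assert.h>
#include <stdio.h>

/* Action word layout per the patch: cmd | sig << 8 | flags << 16. */
#define PR_MEMACT_MASK		0xff
#define PR_MEMACT_SIG_SHIFT	8
#define PR_MEMACT_FLG_SHIFT	16

/*
 * Hypothetical helper: parse one memctl line into its components.
 * Limits are in bytes as written by the kernel; returns 0 on success.
 */
static int decode_memctl(const char *line, unsigned int *action,
			 unsigned int *signal, unsigned int *flags,
			 unsigned long *climit, unsigned long *plimit)
{
	unsigned long word;

	if (sscanf(line, "%lu %lu %lu", &word, climit, plimit) != 3)
		return -1;
	*action = word & PR_MEMACT_MASK;
	*signal = (word >> PR_MEMACT_SIG_SHIFT) & PR_MEMACT_MASK;
	*flags  = word >> PR_MEMACT_FLG_SHIFT;
	return 0;
}
```

For example, an action word of 69379 (0x10f03) decodes to command 3
(PR_MEMACT_SIGNAL), signal 15 and flag bit 0 set.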

Signed-off-by: Waiman Long <longman@redhat.com>
---
 fs/proc/base.c | 109 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 109 insertions(+)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 617db4e0faa0..3c9349ad1e37 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -88,6 +88,8 @@
 #include <linux/user_namespace.h>
 #include <linux/fs_struct.h>
 #include <linux/slab.h>
+#include <linux/prctl.h>
+#include <linux/ctype.h>
 #include <linux/sched/autogroup.h>
 #include <linux/sched/mm.h>
 #include <linux/sched/coredump.h>
@@ -3145,6 +3147,107 @@ static int proc_stack_depth(struct seq_file *m, struct pid_namespace *ns,
 }
 #endif /* CONFIG_STACKLEAK_METRICS */
 
+#ifdef CONFIG_MEMCG
+/*
+ * Memory cgroup control parameters
+ * <over_high_action> <limit1> <limit2>
+ */
+static ssize_t proc_memctl_read(struct file *file, char __user *buf,
+				 size_t count, loff_t *ppos)
+{
+	struct task_struct *task = get_proc_task(file_inode(file));
+	unsigned long action, limit1, limit2;
+	char buffer[80];
+	ssize_t len;
+
+	if (!task)
+		return -ESRCH;
+
+	action = task->memcg_over_high_action |
+		(task->memcg_over_high_signal << PR_MEMACT_SIG_SHIFT) |
+		(task->memcg_over_high_flags  << PR_MEMACT_FLG_SHIFT);
+	limit1 = (unsigned long)task->memcg_over_high_climit  * PAGE_SIZE;
+	limit2 = (unsigned long)task->memcg_over_high_plimit * PAGE_SIZE;
+
+	put_task_struct(task);
+	len = snprintf(buffer, sizeof(buffer), "%ld %ld %ld\n",
+		       action, limit1, limit2);
+	return simple_read_from_buffer(buf, count, ppos, buffer, len);
+}
+
+static ssize_t proc_memctl_write(struct file *file, const char __user *buf,
+				  size_t count, loff_t *offs)
+{
+	struct task_struct *task = get_proc_task(file_inode(file));
+	unsigned long vals[3];
+	char buffer[80];
+	char *ptr, *next;
+	int i, err;
+	unsigned int action, signal, flags;
+
+	if (!task)
+		return -ESRCH;
+	if (count  > sizeof(buffer) - 1)
+		count = sizeof(buffer) - 1;
+	if (copy_from_user(buffer, buf, count)) {
+		err = -EFAULT;
+		goto out;
+	}
+	buffer[count] = '\0';
+	next = buffer;
+
+	/*
+	 * Expect to find 3 numbers
+	 */
+	for (i = 0, ptr = buffer; i < 3; i++) {
+		ptr = skip_spaces(next);
+		if (!*ptr) {
+			err = -EINVAL;
+			goto out;
+		}
+
+		/* Skip non-space characters for next */
+		for (next = ptr; *next && !isspace(*next); next++)
+			;
+		if (isspace(*next))
+			*next++ = '\0';
+
+		err = kstrtoul(ptr, 0, &vals[i]);
+		if (err)
+			goto out;
+	}
+	action = vals[0] & PR_MEMACT_MASK;
+	signal = (vals[0] >> PR_MEMACT_SIG_SHIFT) & PR_MEMACT_MASK;
+	flags  = vals[0] >> PR_MEMACT_FLG_SHIFT;
+
+	/* Round up limits to number of pages */
+	vals[1] = DIV_ROUND_UP(vals[1], PAGE_SIZE);
+	vals[2] = DIV_ROUND_UP(vals[2], PAGE_SIZE);
+
+	/* Check input values */
+	if ((action > PR_MEMACT_MAX) || (signal >= _NSIG) ||
+	    (flags & ~PR_MEMFLAG_MASK)) {
+		err = -EINVAL;
+		goto out;
+	}
+
+	WRITE_ONCE(task->memcg_over_high_action, action);
+	WRITE_ONCE(task->memcg_over_high_signal, signal);
+	WRITE_ONCE(task->memcg_over_high_flags,  flags);
+	WRITE_ONCE(task->memcg_over_high_climit, vals[1]);
+	WRITE_ONCE(task->memcg_over_high_plimit, vals[2]);
+out:
+	put_task_struct(task);
+	return err < 0 ? err : count;
+}
+
+const struct file_operations proc_memctl_operations = {
+	.read   = proc_memctl_read,
+	.write  = proc_memctl_write,
+	.llseek	= generic_file_llseek,
+};
+#endif /* CONFIG_MEMCG */
+
 /*
  * Thread groups
  */
@@ -3258,6 +3361,9 @@ static const struct pid_entry tgid_base_stuff[] = {
 #ifdef CONFIG_PROC_PID_ARCH_STATUS
 	ONE("arch_status", S_IRUGO, proc_pid_arch_status),
 #endif
+#ifdef CONFIG_MEMCG
+	REG("memctl", 0644, proc_memctl_operations),
+#endif
 };
 
 static int proc_tgid_base_readdir(struct file *file, struct dir_context *ctx)
@@ -3587,6 +3693,9 @@ static const struct pid_entry tid_base_stuff[] = {
 #ifdef CONFIG_PROC_PID_ARCH_STATUS
 	ONE("arch_status", S_IRUGO, proc_pid_arch_status),
 #endif
+#ifdef CONFIG_MEMCG
+	REG("memctl", 0644, proc_memctl_operations),
+#endif
 };
 
 static int proc_tid_base_readdir(struct file *file, struct dir_context *ctx)
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [RFC PATCH 5/8] memcg: Allow direct per-task memory limit checking
  2020-08-17 14:08 [RFC PATCH 0/8] memcg: Enable fine-grained per process memory control Waiman Long
                   ` (3 preceding siblings ...)
  2020-08-17 14:08 ` [RFC PATCH 4/8] fs/proc: Support a new procfs memctl file Waiman Long
@ 2020-08-17 14:08 ` Waiman Long
  2020-08-17 14:08 ` [RFC PATCH 6/8] memcg: Introduce additional memory control slowdown if needed Waiman Long
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 58+ messages in thread
From: Waiman Long @ 2020-08-17 14:08 UTC (permalink / raw)
  To: Andrew Morton, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	Jonathan Corbet, Alexey Dobriyan, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot
  Cc: linux-kernel, linux-doc, linux-fsdevel, cgroups, linux-mm, Waiman Long

Up to now, the PR_SET_MEMCONTROL prctl(2) call enables the
user-specified action only if the total memory consumption of the
memory cgroup exceeds memory.high by the specified additional memory
threshold.

There are cases where a user may want direct memory consumption control
for certain applications even if the total cgroup memory consumption
has not yet exceeded the limit. One way of doing that is to create one
memory cgroup per application. However, if an application invokes other
helper applications, those helpers will fall into the same cgroup,
breaking the one-application-per-cgroup rule.

The alternative provided by this patch is to let users enable direct
per-task memory limit checking. This is intended for special use cases
and is not recommended for general use, as memory reclaim may not be
triggered even if the per-task memory limit has been exceeded.
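
The patch's validation rule — PR_MEMFLAG_DIRECT is accepted only
together with at least one PR_MEMFLAG_RSS_* flag and a non-zero
per-task limit, since there is then no cgroup limit to fall back on —
can be sketched as (illustrative userspace C; the helper name is made
up, the flag values mirror the patch's prctl.h):

```c
#include <assert.h>
#include <stdbool.h>

#define PR_MEMFLAG_DIRECT	(1UL << 1)
#define PR_MEMFLAG_RSS_ANON	(1UL << 8)
#define PR_MEMFLAG_RSS_FILE	(1UL << 9)
#define PR_MEMFLAG_RSS_SHMEM	(1UL << 10)
#define PR_MEMFLAG_RSS		(PR_MEMFLAG_RSS_ANON | PR_MEMFLAG_RSS_FILE | \
				 PR_MEMFLAG_RSS_SHMEM)

/*
 * Hypothetical helper mirroring the -EINVAL check added by this patch:
 * PR_MEMFLAG_DIRECT needs an RSS flag and a non-zero per-task limit.
 */
static bool direct_flags_valid(unsigned long flags, unsigned long plimit)
{
	if (!(flags & PR_MEMFLAG_DIRECT))
		return true;		/* rule only applies to DIRECT */
	return (flags & PR_MEMFLAG_RSS) && plimit;
}
```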

Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/uapi/linux/prctl.h |  4 ++-
 mm/memcontrol.c            | 52 +++++++++++++++++++++++++++-----------
 2 files changed, 40 insertions(+), 16 deletions(-)

diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index ef8d84c94b4a..7ba40e10737d 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -265,13 +265,15 @@ struct prctl_mm_map {
 
 /* Flags for PR_SET_MEMCONTROL */
 # define PR_MEMFLAG_SIGCONT		(1UL <<  0) /* Continuous signal delivery */
+# define PR_MEMFLAG_DIRECT		(1UL <<  1) /* Direct memory limit */
 # define PR_MEMFLAG_RSS_ANON		(1UL <<  8) /* Check anonymous pages */
 # define PR_MEMFLAG_RSS_FILE		(1UL <<  9) /* Check file pages */
 # define PR_MEMFLAG_RSS_SHMEM		(1UL << 10) /* Check shmem pages */
 # define PR_MEMFLAG_RSS			(PR_MEMFLAG_RSS_ANON |\
 					 PR_MEMFLAG_RSS_FILE |\
 					 PR_MEMFLAG_RSS_SHMEM)
-# define PR_MEMFLAG_MASK		(PR_MEMFLAG_SIGCONT | PR_MEMFLAG_RSS)
+# define PR_MEMFLAG_MASK		(PR_MEMFLAG_SIGCONT | PR_MEMFLAG_RSS |\
+					 PR_MEMFLAG_DIRECT)
 
 /* Action word masks */
 # define PR_MEMACT_MASK			0xff
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index aa76bae7f408..6488f8a10d66 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2640,27 +2640,27 @@ get_rss_counter(struct mm_struct *mm, int mm_bit, u16 flags, int rss_bit)
  * Return true if an action has been taken or further check is not needed,
  * false otherwise.
  */
-static bool __mem_cgroup_over_high_action(struct mem_cgroup *memcg, u8 action)
+static bool __mem_cgroup_over_high_action(struct mem_cgroup *memcg, u8 action,
+					  u16 flags)
 {
-	unsigned long mem;
+	unsigned long mem = 0;
 	bool ret = false;
 	struct mm_struct *mm = get_task_mm(current);
 	u8  signal = READ_ONCE(current->memcg_over_high_signal);
-	u16 flags  = READ_ONCE(current->memcg_over_high_flags);
-	u32 limit  = READ_ONCE(current->memcg_over_high_climit);
+	u32 limit;
 
 	if (!mm)
 		return true;	/* No more check is needed */
 
-	if (READ_ONCE(current->memcg_over_limit))
-		WRITE_ONCE(current->memcg_over_limit, false);
-
 	if ((action == PR_MEMACT_SIGNAL) && !signal)
 		goto out;
 
-	mem = page_counter_read(&memcg->memory);
-	if (mem <= memcg->memory.high + limit)
-		goto out;
+	if (memcg) {
+		mem = page_counter_read(&memcg->memory);
+		limit = READ_ONCE(current->memcg_over_high_climit);
+		if (mem <= memcg->memory.high + limit)
+			goto out;
+	}
 
 	/*
 	 * Check RSS memory if any of the PR_MEMFLAG_RSS flags is set.
@@ -2706,20 +2706,34 @@ static bool __mem_cgroup_over_high_action(struct mem_cgroup *memcg, u8 action)
 
 out:
 	mmput(mm);
-	return ret;
+	/*
+	 * We only need to do direct per-task memory limit checking once.
+	 */
+	return memcg ? ret : true;
 }
 
 /*
  * Return true if an action has been taken or further check is not needed,
  * false otherwise.
  */
-static inline bool mem_cgroup_over_high_action(struct mem_cgroup *memcg)
+static inline bool mem_cgroup_over_high_action(struct mem_cgroup *memcg,
+					       bool mem_high)
 {
 	u8 action = READ_ONCE(current->memcg_over_high_action);
+	u16 flags = READ_ONCE(current->memcg_over_high_flags);
 
 	if (!action)
 		return true;	/* No more check is needed */
-	return __mem_cgroup_over_high_action(memcg, action);
+
+	if (READ_ONCE(current->memcg_over_limit))
+		WRITE_ONCE(current->memcg_over_limit, false);
+
+	if (flags & PR_MEMFLAG_DIRECT)
+		memcg = NULL;	/* Direct per-task memory limit checking */
+	else if (!mem_high)
+		return false;
+
+	return __mem_cgroup_over_high_action(memcg, action, flags);
 }
 
 /*
@@ -2907,8 +2921,8 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 		swap_high = page_counter_read(&memcg->swap) >
 			READ_ONCE(memcg->swap.high);
 
-		if (mem_high && !taken)
-			taken = mem_cgroup_over_high_action(memcg);
+		if (!taken)
+			taken = mem_cgroup_over_high_action(memcg, mem_high);
 
 		/* Don't bother a random interrupted task */
 		if (in_interrupt()) {
@@ -7103,6 +7117,14 @@ long mem_cgroup_over_high_set(struct task_struct *task, unsigned long action,
 	    (sig >= _NSIG))
 		return -EINVAL;
 
+	/*
+	 * PR_MEMFLAG_DIRECT can only be set if any of the PR_MEMFLAG_RSS flag
+	 * is set and limit2 is non-zero.
+	 */
+	if ((flags & PR_MEMFLAG_DIRECT) &&
+	    (!(flags & PR_MEMFLAG_RSS) || !limit2))
+		return -EINVAL;
+
 	WRITE_ONCE(task->memcg_over_high_action, cmd);
 	WRITE_ONCE(task->memcg_over_high_signal, sig);
 	WRITE_ONCE(task->memcg_over_high_flags, flags);
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [RFC PATCH 6/8] memcg: Introduce additional memory control slowdown if needed
  2020-08-17 14:08 [RFC PATCH 0/8] memcg: Enable fine-grained per process memory control Waiman Long
                   ` (4 preceding siblings ...)
  2020-08-17 14:08 ` [RFC PATCH 5/8] memcg: Allow direct per-task memory limit checking Waiman Long
@ 2020-08-17 14:08 ` Waiman Long
  2020-08-17 14:08   ` Waiman Long
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 58+ messages in thread
From: Waiman Long @ 2020-08-17 14:08 UTC (permalink / raw)
  To: Andrew Morton, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	Jonathan Corbet, Alexey Dobriyan, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot
  Cc: linux-kernel, linux-doc, linux-fsdevel, cgroups, linux-mm, Waiman Long

For fast CPUs paired with slow disks, repeatedly yielding the CPU with
PR_MEMACT_SLOWDOWN may not slow down memory allocation enough for
memory reclaim to catch up. Moreover, when a large memory block is
mmap'ed and its pages are faulted in one by one, no allocation
syscalls are made, so the syscall delays are never activated during
that process.

To be safe, an additional variable delay of 20-5000 us is added in
__mem_cgroup_over_high_action() when the excess memory used is more
than 1/256 of the memory limit.
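
The delay arithmetic can be sketched as follows (illustrative only;
the helper name is made up, values are in pages as in the kernel code,
the result in microseconds): no delay until the excess tops 1/256 of
the limit, then 20 us per 1/256 chunk of excess, capped at 5000 us.

```c
#include <assert.h>

/* Hypothetical helper mirroring the patch's slowdown computation. */
static unsigned long slowdown_delay_us(unsigned long limit,
				       unsigned long excess)
{
	unsigned long threshold = limit >> 8;	/* 1/256 of the limit */
	unsigned long delay;

	if (!threshold || excess <= threshold)
		return 0;			/* no extra delay */
	delay = (excess / threshold) * 20UL;	/* 20 us per chunk */
	return delay < 5000UL ? delay : 5000UL;	/* cap at 5000 us */
}
```

With a limit of 25600 pages the threshold is 100 pages, so an excess of
150 pages yields a 20 us delay and a huge excess saturates at 5000 us.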

Signed-off-by: Waiman Long <longman@redhat.com>
---
 mm/memcontrol.c | 22 ++++++++++++++++------
 1 file changed, 16 insertions(+), 6 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6488f8a10d66..bddf3e659469 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2643,11 +2643,10 @@ get_rss_counter(struct mm_struct *mm, int mm_bit, u16 flags, int rss_bit)
 static bool __mem_cgroup_over_high_action(struct mem_cgroup *memcg, u8 action,
 					  u16 flags)
 {
-	unsigned long mem = 0;
+	unsigned long mem = 0, limit = 0, excess = 0;
 	bool ret = false;
 	struct mm_struct *mm = get_task_mm(current);
 	u8  signal = READ_ONCE(current->memcg_over_high_signal);
-	u32 limit;
 
 	if (!mm)
 		return true;	/* No more check is needed */
@@ -2657,9 +2656,10 @@ static bool __mem_cgroup_over_high_action(struct mem_cgroup *memcg, u8 action,
 
 	if (memcg) {
 		mem = page_counter_read(&memcg->memory);
-		limit = READ_ONCE(current->memcg_over_high_climit);
-		if (mem <= memcg->memory.high + limit)
+		limit = READ_ONCE(current->memcg_over_high_climit) + memcg->memory.high;
+		if (mem <= limit)
 			goto out;
+		excess = mem - limit;
 	}
 
 	/*
@@ -2676,6 +2676,7 @@ static bool __mem_cgroup_over_high_action(struct mem_cgroup *memcg, u8 action,
 		limit = READ_ONCE(current->memcg_over_high_plimit);
 		if (mem <= limit)
 			goto out;
+		excess = mem - limit;
 	}
 
 	ret = true;
@@ -2685,10 +2686,19 @@ static bool __mem_cgroup_over_high_action(struct mem_cgroup *memcg, u8 action,
 		break;
 	case PR_MEMACT_SLOWDOWN:
 		/*
-		 * Slow down by yielding the cpu & adding delay to
-		 * memory allocation syscalls.
+		 * Slow down by yielding the cpu & adding delay to memory
+		 * allocation syscalls.
+		 *
+		 * An additional 20-5000 us of delay is added in case the
+		 * excess memory is more than 1/256 of the limit.
 		 */
 		WRITE_ONCE(current->memcg_over_limit, true);
+		limit >>= 8;
+		if (limit && (excess > limit)) {
+			int delay = min(5000UL, excess/limit * 20UL);
+
+			udelay(delay);
+		}
 		set_tsk_need_resched(current);
 		set_preempt_need_resched();
 		break;
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [RFC PATCH 7/8] memcg: Enable logging of memory control mitigation action
@ 2020-08-17 14:08   ` Waiman Long
  0 siblings, 0 replies; 58+ messages in thread
From: Waiman Long @ 2020-08-17 14:08 UTC (permalink / raw)
  To: Andrew Morton, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	Jonathan Corbet, Alexey Dobriyan, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot
  Cc: linux-kernel, linux-doc, linux-fsdevel, cgroups, linux-mm, Waiman Long

Some of the mitigation actions of PR_SET_MEMCONTROL give no visible
sign that anything is being done inside the kernel. To make them more
visible, a new PR_MEMFLAG_LOG flag is added to enable logging of the
mitigation action taken to the kernel ring buffer.

The logging is done only once, when the mitigation action starts, by
setting an internal PR_MEMFLAG_LOGGED flag. This flag is cleared when
it is detected that memory consumption no longer exceeds memory.high.
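
The log-once behavior amounts to a tiny state machine over the task's
flag word, which can be modeled as (illustrative userspace sketch; the
helper is hypothetical, the flag values mirror the patch):

```c
#include <assert.h>
#include <stdbool.h>

#define PR_MEMFLAG_LOG		(1UL << 2)
#define PR_MEMFLAG_LOGGED	(1UL << 7)	/* internal, per the patch */

/*
 * Hypothetical helper: returns true when a message should be emitted.
 * A message is logged only when LOG is set and LOGGED is not; LOGGED
 * is then set, and cleared again once consumption drops back under
 * memory.high, re-arming logging for the next over-high episode.
 */
static bool should_log(unsigned long *flags, bool over_high)
{
	if (!over_high) {
		*flags &= ~PR_MEMFLAG_LOGGED;	/* re-arm */
		return false;
	}
	if ((*flags & (PR_MEMFLAG_LOG | PR_MEMFLAG_LOGGED)) != PR_MEMFLAG_LOG)
		return false;			/* disabled or already logged */
	*flags |= PR_MEMFLAG_LOGGED;
	return true;
}
```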

Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/uapi/linux/prctl.h |  1 +
 mm/memcontrol.c            | 34 +++++++++++++++++++++++++++++++++-
 2 files changed, 34 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 7ba40e10737d..faa7a51fc52a 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -266,6 +266,7 @@ struct prctl_mm_map {
 /* Flags for PR_SET_MEMCONTROL */
 # define PR_MEMFLAG_SIGCONT		(1UL <<  0) /* Continuous signal delivery */
 # define PR_MEMFLAG_DIRECT		(1UL <<  1) /* Direct memory limit */
+# define PR_MEMFLAG_LOG			(1UL <<  2) /* Log action done */
 # define PR_MEMFLAG_RSS_ANON		(1UL <<  8) /* Check anonymous pages */
 # define PR_MEMFLAG_RSS_FILE		(1UL <<  9) /* Check file pages */
 # define PR_MEMFLAG_RSS_SHMEM		(1UL << 10) /* Check shmem pages */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index bddf3e659469..5bda2dd755fc 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2640,6 +2640,7 @@ get_rss_counter(struct mm_struct *mm, int mm_bit, u16 flags, int rss_bit)
  * Return true if an action has been taken or further check is not needed,
  * false otherwise.
  */
+#define PR_MEMFLAG_LOGGED	(1UL << 7)	/* A log message printed */
 static bool __mem_cgroup_over_high_action(struct mem_cgroup *memcg, u8 action,
 					  u16 flags)
 {
@@ -2714,6 +2715,32 @@ static bool __mem_cgroup_over_high_action(struct mem_cgroup *memcg, u8 action,
 		break;
 	}
 
+	if ((flags & (PR_MEMFLAG_LOG|PR_MEMFLAG_LOGGED)) == PR_MEMFLAG_LOG) {
+		char name[80];
+		static const char * const acts[] = {
+			[PR_MEMACT_ENOMEM]   = "Action: return ENOMEM on some syscalls",
+			[PR_MEMACT_SLOWDOWN] = "Action: slow down process",
+			[PR_MEMACT_SIGNAL]   = "Action: send signal",
+			[PR_MEMACT_KILL]     = "Action: kill the process",
+		};
+
+		name[0] = '\0';
+		if (memcg)
+			cgroup_name(memcg->css.cgroup, name, sizeof(name));
+		else
+			strcpy(name, "N/A");
+
+		/*
+		 * Use printk_deferred() to minimize delay in the memory
+		 * allocation path.
+		 */
+		printk_deferred(KERN_INFO
+			"Cgroup: %s, Comm: %s, Pid: %d, Mem: %ld pages, %s\n",
+			name, current->comm, current->pid, mem, acts[action]);
+		WRITE_ONCE(current->memcg_over_high_flags,
+			   flags | PR_MEMFLAG_LOGGED);
+	}
+
 out:
 	mmput(mm);
 	/*
@@ -2740,8 +2767,13 @@ static inline bool mem_cgroup_over_high_action(struct mem_cgroup *memcg,
 
 	if (flags & PR_MEMFLAG_DIRECT)
 		memcg = NULL;	/* Direct per-task memory limit checking */
-	else if (!mem_high)
+	else if (!mem_high) {
+		/* Clear the PR_MEMFLAG_LOGGED flag, if set */
+		if (flags & PR_MEMFLAG_LOGGED)
+			WRITE_ONCE(current->memcg_over_high_flags,
+				   flags & ~PR_MEMFLAG_LOGGED);
 		return false;
+	}
 
 	return __mem_cgroup_over_high_action(memcg, action, flags);
 }
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [RFC PATCH 8/8] memcg: Add over-high action prctl() documentation
  2020-08-17 14:08 [RFC PATCH 0/8] memcg: Enable fine-grained per process memory control Waiman Long
                   ` (6 preceding siblings ...)
  2020-08-17 14:08   ` Waiman Long
@ 2020-08-17 14:08 ` Waiman Long
  2020-08-17 15:26 ` [RFC PATCH 0/8] memcg: Enable fine-grained per process memory control Michal Hocko
  2020-08-18  9:14   ` peterz-wEGCiKHe2LqWVfeAwA7xHQ
  9 siblings, 0 replies; 58+ messages in thread
From: Waiman Long @ 2020-08-17 14:08 UTC (permalink / raw)
  To: Andrew Morton, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	Jonathan Corbet, Alexey Dobriyan, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot
  Cc: linux-kernel, linux-doc, linux-fsdevel, cgroups, linux-mm, Waiman Long

A new memcontrol.rst documentation file is added to document the new
prctl(2) interface for setting and retrieving the over-high mitigation
action parameters.
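
For reference, the arg2 action word described by the new document can
be assembled as below (illustrative sketch; the encoder helper is made
up, the constants mirror the patch's prctl.h). A real call would then
look like prctl(PR_SET_MEMCONTROL, word, climit_bytes, plimit_bytes, 0).

```c
#include <assert.h>

#define PR_MEMACT_SIGNAL	3
#define PR_MEMACT_SIG_SHIFT	8
#define PR_MEMACT_FLG_SHIFT	16
#define PR_MEMFLAG_SIGCONT	(1UL << 0)

/*
 * Hypothetical helper: pack command (bits 0-7), signal number
 * (bits 8-15) and flags (bits 16-31) into one action word.
 */
static unsigned long memcontrol_word(unsigned int cmd, unsigned int sig,
				     unsigned int flags)
{
	return cmd | ((unsigned long)sig << PR_MEMACT_SIG_SHIFT) |
	       ((unsigned long)flags << PR_MEMACT_FLG_SHIFT);
}
```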

Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/userspace-api/index.rst      |   1 +
 Documentation/userspace-api/memcontrol.rst | 174 +++++++++++++++++++++
 2 files changed, 175 insertions(+)
 create mode 100644 Documentation/userspace-api/memcontrol.rst

diff --git a/Documentation/userspace-api/index.rst b/Documentation/userspace-api/index.rst
index 69fc5167e648..1c0fc7a7f4ec 100644
--- a/Documentation/userspace-api/index.rst
+++ b/Documentation/userspace-api/index.rst
@@ -23,6 +23,7 @@ place where this information is gathered.
    accelerators/ocxl
    ioctl/index
    media/index
+   memcontrol
 
 .. only::  subproject and html
 
diff --git a/Documentation/userspace-api/memcontrol.rst b/Documentation/userspace-api/memcontrol.rst
new file mode 100644
index 000000000000..0cfcc72ad5f0
--- /dev/null
+++ b/Documentation/userspace-api/memcontrol.rst
@@ -0,0 +1,174 @@
+==============
+Memory Control
+==============
+
+Memory controller can be used to control and limit the amount of
+physical memory used by a task. When a limit is set in "memory.high" in
+a v2 non-root memory cgroup, the memory controller will try to reclaim
+memory if the limit has been exceeded. Normally, that will be enough
+to keep the physical memory consumption of tasks in the memory cgroup
+to be around or below the "memory.high" limit.
+
+Sometimes, memory reclaim may not be able to recover memory in a rate
+that can catch up to the physical memory allocation rate. In this case,
+the physical memory consumption will keep on increasing.  For memory
+cgroup v2, when it is reaching "memory.max" or the system is running
+out of free memory, the OOM killer will be invoked to kill some tasks
+to free up additional memory. However, one has little control of which
+tasks are going to be killed by an OOM killer. Killing tasks that hold
+some important resources without freeing them first can create other
+system problems.
+
+Users who do not want the OOM killer to be invoked to kill random
+tasks in an out-of-memory situation can use the memory control facility
+provided by :manpage:`prctl(2)` to better manage the mitigation action
+that needs to be performed to an individual task when the specified
+memory limit is exceeded with memory cgroup v2 being used.
+
+The task to be controlled must be running in a non-root memory cgroup
+as no limit will be imposed on tasks running in the root memory cgroup.
+
+There are two prctl commands related to this:
+
+ * PR_SET_MEMCONTROL
+
+ * PR_GET_MEMCONTROL
+
+
+PR_SET_MEMCONTROL
+-----------------
+
+PR_SET_MEMCONTROL controls what action should be taken when the memory
+limit is exceeded.
+
+The arg2 of :manpage:`prctl(2)` sets the desired mitigation action. The
+action code consists of three different parts:
+
+ * Bits 0-7: action command
+
+ * Bits 8-15: signal number
+
+ * Bits 16-31: flags
+
+The currently supported action commands are:
+
+====== ================== ================================================
+Value  Define             Description
+====== ================== ================================================
+0      PR_MEMACT_NONE     Use the default memory cgroup behavior
+1      PR_MEMACT_ENOMEM   Return ENOMEM for selected syscalls that try to
+                          allocate more memory when the preset memory limit
+                          is exceeded
+2      PR_MEMACT_SLOWDOWN Slow down the process for memory reclaim to
+                          catch up when memory limit is exceeded
+3      PR_MEMACT_SIGNAL   Send a signal to the task that has exceeded
+                          preset memory limit
+4      PR_MEMACT_KILL     Kill the task that has exceeded preset memory
+                          limit
+====== ================== ================================================
+
+The currently supported flags are:
+
+====== ==================== ================================================
+Value  Define               Description
+====== ==================== ================================================
+0x01   PR_MEMFLAG_SIGCONT   Send a signal on every allocation request instead
+                            of a one-shot signal
+0x02   PR_MEMFLAG_DIRECT    Check per-task memory limit irrespective of cgroup
+                            setting
+0x04   PR_MEMFLAG_LOG       Log any actions taken to the kernel ring buffer
+0x10   PR_MEMFLAG_RSS_ANON  Check process anonymous memory
+0x20   PR_MEMFLAG_RSS_FILE  Check process page caches
+0x40   PR_MEMFLAG_RSS_SHMEM Check process shared memory
+0x70   PR_MEMFLAG_RSS       Equivalent to (PR_MEMFLAG_RSS_ANON |
+                            PR_MEMFLAG_RSS_FILE | PR_MEMFLAG_RSS_SHMEM)
+====== ==================== ================================================
+
+If the action command is PR_MEMACT_SIGNAL, bits 8-15 of the action
+code contain the signal number to be used when the memory limit is
+exceeded. By default, the signal number is reset after delivery so
+that the signal will be delivered only once; another PR_SET_MEMCONTROL
+command has to be issued to set the signal again. If the user wants
+a non-fatal signal to be delivered every time the memory limit is
+breached without doing another PR_SET_MEMCONTROL call, the
+PR_MEMFLAG_SIGCONT flag can be set.
+
+The arg3 of :manpage:`prctl(2)` sets an additional memory cgroup
+limit that is added to the value specified in the "memory.high"
+control file to get the real limit above which the specified action
+is triggered. This makes sure that the mitigation action is only
+taken when the kernel memory reclaim facility fails to limit the
+growth of physical memory usage.
+
+If any of the PR_MEMFLAG_RSS* flags is specified, arg4 contains the
+per-process memory limit that is compared against the sum of the
+specified RSS memory counters of the process. Unless the
+PR_MEMFLAG_DIRECT flag is set, action is only taken when this
+per-process limit is exceeded and the overall memory consumption has
+also exceeded the "memory.high" + arg3 limit.
+
+If the PR_MEMFLAG_DIRECT flag is set, however, the cgroup memory limit
+is ignored and a memory-over-limit check is performed on each memory
+allocation request, if applicable. This is reserved for special use
+cases and is not recommended for general use.
+
+
+PR_GET_MEMCONTROL
+-----------------
+
+PR_GET_MEMCONTROL returns the parameters set by a previous
+PR_SET_MEMCONTROL command.
+
+The arg2 of :manpage:`prctl(2)` sets the type of parameter to be
+returned. The possible values are:
+
+====== =================== ================================================
+Value  Define              Description
+====== =================== ================================================
+0      PR_MEMGET_ACTION    Return the action code - command, flags & signal
+1      PR_MEMGET_CLIMIT    Return the additional cgroup memory limit (in bytes)
+2      PR_MEMGET_PLIMIT    Return the process memory limit for PR_MEMFLAG_RSS*
+====== =================== ================================================
+
+
+/proc/<pid>/memctl
+------------------
+
+PR_GET_MEMCONTROL only returns the memory control settings of the
+calling task. To obtain this information for other tasks, the
+/proc/<pid>/memctl file can be read. This file reports three integer
+parameters:
+
+ * action code
+
+ * cgroup additional memory limit
+
+ * process memory limit for PR_MEMFLAG_RSS* flags
+
+These are the same values that are returned when the task calls
+:manpage:`prctl(2)` with the PR_GET_MEMCONTROL command and the
+PR_MEMGET_ACTION, PR_MEMGET_CLIMIT and PR_MEMGET_PLIMIT arguments
+respectively.
+
+Privileged users can also write to the memctl file directly to modify
+those parameters for a given task.
+
+This procfs file is present for each of the running threads of a
+process, so a separate write to each of them is needed to update the
+parameters for all the threads within a running process.
+
+Affected Syscalls
+-----------------
+
+The following system calls perform an additional check of the over-high
+memory usage flag that is set by the memory control facility described
+above:
+
+ * :manpage:`brk(2)`
+
+ * :manpage:`mlock(2)`
+
+ * :manpage:`mlock2(2)`
+
+ * :manpage:`mlockall(2)`
+
+ * :manpage:`mmap(2)`
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 1/8] memcg: Enable fine-grained control of over memory.high action
  2020-08-17 14:08 ` [RFC PATCH 1/8] memcg: Enable fine-grained control of over memory.high action Waiman Long
@ 2020-08-17 14:30   ` Chris Down
  2020-08-17 15:38     ` Waiman Long
  2020-08-17 16:44     ` Shakeel Butt
  1 sibling, 1 reply; 58+ messages in thread
From: Chris Down @ 2020-08-17 14:30 UTC (permalink / raw)
  To: Waiman Long
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	Jonathan Corbet, Alexey Dobriyan, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, linux-kernel, linux-doc,
	linux-fsdevel, cgroups, linux-mm

Abstractly, I think this really overcomplicates the API a lot. If these 
truly generally useful (and I think that remains to be demonstrated), they 
should be additions to the existing API, rather than a sidestep with prctl.

I also worry about some other more concrete things:

1. Doesn't this allow unprivileged applications to potentially bypass 
    memory.high constraints set by a system administrator?
2. What's the purpose of PR_MEMACT_KILL, compared to memory.max?
3. Why add this entirely separate signal delivery path when we already have 
    eventfd/poll/inotify support, which makes a lot more sense for modern 
    applications?

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 0/8] memcg: Enable fine-grained per process memory control
  2020-08-17 14:08 [RFC PATCH 0/8] memcg: Enable fine-grained per process memory control Waiman Long
                   ` (7 preceding siblings ...)
  2020-08-17 14:08 ` [RFC PATCH 8/8] memcg: Add over-high action prctl() documentation Waiman Long
@ 2020-08-17 15:26 ` Michal Hocko
  2020-08-17 15:55     ` Waiman Long
  2020-08-18  9:14   ` peterz-wEGCiKHe2LqWVfeAwA7xHQ
  9 siblings, 1 reply; 58+ messages in thread
From: Michal Hocko @ 2020-08-17 15:26 UTC (permalink / raw)
  To: Waiman Long
  Cc: Andrew Morton, Johannes Weiner, Vladimir Davydov,
	Jonathan Corbet, Alexey Dobriyan, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, linux-kernel, linux-doc,
	linux-fsdevel, cgroups, linux-mm

On Mon 17-08-20 10:08:23, Waiman Long wrote:
> Memory controller can be used to control and limit the amount of
> physical memory used by a task. When a limit is set in "memory.high" in
> a v2 non-root memory cgroup, the memory controller will try to reclaim
> memory if the limit has been exceeded. Normally, that will be enough
> to keep the physical memory consumption of tasks in the memory cgroup
> to be around or below the "memory.high" limit.
> 
> Sometimes, memory reclaim may not be able to recover memory in a rate
> that can catch up to the physical memory allocation rate. In this case,
> the physical memory consumption will keep on increasing.  When it reaches
> "memory.max" for memory cgroup v2 or when the system is running out of
> free memory, the OOM killer will be invoked to kill some tasks to free
> up additional memory. However, one has little control of which tasks
> are going to be killed by an OOM killer. Killing tasks that hold some
> important resources without freeing them first can create other system
> problems down the road.
> 
> Users who do not want the OOM killer to be invoked to kill random
> tasks in an out-of-memory situation can use the memory control
> facility provided by this new patchset via prctl(2) to better manage
> the mitigation action that needs to be performed to various tasks when
> the specified memory limit is exceeded with memory cgroup v2 being used.
> 
> The currently supported mitigation actions include the followings:
> 
>  1) Return ENOMEM for some syscalls that allocate or handle memory
>  2) Slow down the process for memory reclaim to catch up
>  3) Send a specific signal to the task
>  4) Kill the task
> 
> The users that want better memory control for their applicatons can
> either modify their applications to call the prctl(2) syscall directly
> with the new memory control command code or write the desired action to
> the newly provided memctl procfs files of their applications provided
> that those applications run in a non-root v2 memory cgroup.

prctl is fundamentally about per-process control while cgroup (not only
memcg) is about group of processes interface. How do those two interact
together? In other words what is the semantic when different processes
have a different views on the same underlying memcg event?

Also the above description doesn't really describe any usecase which
struggles with the existing interface. We already do allow slow down and
along with PSI also provide user space control over close to OOM
situation.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 1/8] memcg: Enable fine-grained control of over memory.high action
  2020-08-17 14:30   ` Chris Down
@ 2020-08-17 15:38     ` Waiman Long
  2020-08-17 16:11       ` Chris Down
  0 siblings, 1 reply; 58+ messages in thread
From: Waiman Long @ 2020-08-17 15:38 UTC (permalink / raw)
  To: Chris Down
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	Jonathan Corbet, Alexey Dobriyan, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, linux-kernel, linux-doc,
	linux-fsdevel, cgroups, linux-mm

On 8/17/20 10:30 AM, Chris Down wrote:
> Abstractly, I think this really overcomplicates the API a lot. If these 
> are truly generally useful (and I think that remains to be 
> demonstrated), they should be additions to the existing API, rather 
> than a sidestep with prctl.
This patchset is derived from customer requests. By the existing API, I 
suppose you mean the memory cgroup API, right? The reason to use prctl() 
is that there are users out there who want some kind of per-process 
control rather than control over a whole group of processes, short of 
creating one cgroup per process, which is not very efficient.
>
> I also worry about some other more concrete things:
>
> 1. Doesn't this allow unprivileged applications to potentially bypass 
>    memory.high constraints set by a system administrator?
The memory.high constraint is for triggering memory reclaim. The new 
mitigation actions introduced by this patchset will only be applied if 
memory reclaim alone fails to limit the physical memory consumption. The 
current memory cgroup memory reclaim code will not be affected by this 
patchset.
> 2. What's the purpose of PR_MEMACT_KILL, compared to memory.max?
A user can use this to specify which processes are less important and 
can be sacrificed first instead of the other more important ones in case 
they are really in an OOM situation. IOW, users can specify the order in 
which OOM kills happen.
> 3. Why add this entirely separate signal delivery path when we already 
> have eventfd/poll/inotify support, which makes a lot more sense for 
> modern    applications?

Good question, I will look further into this to see if it can be 
applicable in this case.

Cheers,
Longman



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 0/8] memcg: Enable fine-grained per process memory control
@ 2020-08-17 15:55     ` Waiman Long
  0 siblings, 0 replies; 58+ messages in thread
From: Waiman Long @ 2020-08-17 15:55 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Johannes Weiner, Vladimir Davydov,
	Jonathan Corbet, Alexey Dobriyan, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, linux-kernel, linux-doc,
	linux-fsdevel, cgroups, linux-mm

On 8/17/20 11:26 AM, Michal Hocko wrote:
> On Mon 17-08-20 10:08:23, Waiman Long wrote:
>> Memory controller can be used to control and limit the amount of
>> physical memory used by a task. When a limit is set in "memory.high" in
>> a v2 non-root memory cgroup, the memory controller will try to reclaim
>> memory if the limit has been exceeded. Normally, that will be enough
>> to keep the physical memory consumption of tasks in the memory cgroup
>> to be around or below the "memory.high" limit.
>>
>> Sometimes, memory reclaim may not be able to recover memory in a rate
>> that can catch up to the physical memory allocation rate. In this case,
>> the physical memory consumption will keep on increasing.  When it reaches
>> "memory.max" for memory cgroup v2 or when the system is running out of
>> free memory, the OOM killer will be invoked to kill some tasks to free
>> up additional memory. However, one has little control of which tasks
>> are going to be killed by an OOM killer. Killing tasks that hold some
>> important resources without freeing them first can create other system
>> problems down the road.
>>
>> Users who do not want the OOM killer to be invoked to kill random
>> tasks in an out-of-memory situation can use the memory control
>> facility provided by this new patchset via prctl(2) to better manage
>> the mitigation action that needs to be performed to various tasks when
>> the specified memory limit is exceeded with memory cgroup v2 being used.
>>
>> The currently supported mitigation actions include the followings:
>>
>>   1) Return ENOMEM for some syscalls that allocate or handle memory
>>   2) Slow down the process for memory reclaim to catch up
>>   3) Send a specific signal to the task
>>   4) Kill the task
>>
>> The users that want better memory control for their applicatons can
>> either modify their applications to call the prctl(2) syscall directly
>> with the new memory control command code or write the desired action to
>> the newly provided memctl procfs files of their applications provided
>> that those applications run in a non-root v2 memory cgroup.
> prctl is fundamentally about per-process control while cgroup (not only
> memcg) is about group of processes interface. How do those two interact
> together? In other words what is the semantic when different processes
> have a different views on the same underlying memcg event?
As said in a previous mail, this patchset is derived from a customer 
request and per-process control is exactly what the customer wants. That 
is why prctl() is used. This patchset is intended to supplement the 
existing memory cgroup features. Processes in a memory cgroup that don't 
use this new API will behave exactly like before. Only processes that 
opt to use this new API will have additional mitigation actions applied 
on them in case the additional limits are reached.
>
> Also the above description doesn't really describe any usecase which
> struggles with the existing interface. We already do allow slow down and
> along with PSI also provide user space control over close to OOM
> situation.
>
The customer that requested it was using Solaris. Solaris does allow 
per-process memory control and they have tools that rely on this 
capability. This patchset will help them migrate off Solaris more 
easily. I will look closer into how PSI can help here.

Thanks,
Longman


^ permalink raw reply	[flat|nested] 58+ messages in thread


* Re: [RFC PATCH 1/8] memcg: Enable fine-grained control of over memory.high action
  2020-08-17 15:38     ` Waiman Long
@ 2020-08-17 16:11       ` Chris Down
  0 siblings, 0 replies; 58+ messages in thread
From: Chris Down @ 2020-08-17 16:11 UTC (permalink / raw)
  To: Waiman Long
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	Jonathan Corbet, Alexey Dobriyan, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, linux-kernel, linux-doc,
	linux-fsdevel, cgroups, linux-mm

Waiman Long writes:
>On 8/17/20 10:30 AM, Chris Down wrote:
>>Abstractly, I think this really overcomplicates the API a lot. If 
>>these are truly generally useful (and I think that remains to be 
>>demonstrated), they should be additions to the existing API, rather 
>>than a sidestep with prctl.
>This patchset is derived from customer requests. With existing API, I 
>suppose you mean the memory cgroup API. Right? The reason to use 
>prctl() is that there are users out there who want some kind of 
>per-process control instead of for a whole group of processes unless 
>the users try to create one cgroup per process which is not very 
>efficient.

If using one cgroup per process is inefficient, then that's what needs to be 
fixed. Making the API extremely complex to reason about for every user isn't a 
good compromise when we're talking about an already niche use case.

>>I also worry about some other more concrete things:
>>
>>1. Doesn't this allow unprivileged applications to potentially 
>>bypass    memory.high constraints set by a system administrator?
>The memory.high constraint is for triggering memory reclaim. The new 
>mitigation actions introduced by this patchset will only be applied if 
>memory reclaim alone fails to limit the physical memory consumption. 
>The current memory cgroup memory reclaim code will not be affected by 
>this patchset.

memory.high isn't only for triggering memory reclaim, it's also about active 
throttling when the application fails to come back under it. Fundamentally 
it's supposed to indicate the point at which we expect the application to 
either cooperate or get forcibly descheduled -- take a look at where we call 
schedule_timeout_killable.

I really struggle to think about how all of those things should interact in 
this patchset.

>>2. What's the purpose of PR_MEMACT_KILL, compared to memory.max?
>A user can use this to specify which processes are less important and 
>can be sacrificed first instead of the other more important ones in 
>case they are really in a OOM situation. IOW, users can specify the 
>order where OOM kills can happen.

You can already do that with something like oomd, which has way more 
flexibility than this. Why codify this in the kernel instead of in a userspace 
agent?

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 1/8] memcg: Enable fine-grained control of over memory.high action
  2020-08-17 14:08 ` [RFC PATCH 1/8] memcg: Enable fine-grained control of over memory.high action Waiman Long
@ 2020-08-17 16:44     ` Shakeel Butt
  2020-08-17 16:44     ` Shakeel Butt
  1 sibling, 0 replies; 58+ messages in thread
From: Shakeel Butt @ 2020-08-17 16:44 UTC (permalink / raw)
  To: Waiman Long
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	Jonathan Corbet, Alexey Dobriyan, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, LKML, linux-doc, linux-fsdevel,
	Cgroups, Linux MM

On Mon, Aug 17, 2020 at 7:11 AM Waiman Long <longman@redhat.com> wrote:
>
> Memory controller can be used to control and limit the amount of
> physical memory used by a task. When a limit is set in "memory.high"
> in a non-root memory cgroup, the memory controller will try to reclaim
> memory if the limit has been exceeded. Normally, that will be enough
> to keep the physical memory consumption of tasks in the memory cgroup
> to be around or below the "memory.high" limit.
>
> Sometimes, memory reclaim may not be able to recover memory in a rate
> that can catch up to the physical memory allocation rate especially
> when rotating disks are used for swapping or writing dirty pages. In
> this case, the physical memory consumption will keep on increasing.

Isn't this the real underlying issue? Why not make the guarantees of
memory.high more strict instead of adding more interfaces and
complexity?

By the way, have you observed this issue on real workloads or some
test cases? It would be good to get a repro with simple test cases.

^ permalink raw reply	[flat|nested] 58+ messages in thread


* Re: [RFC PATCH 1/8] memcg: Enable fine-grained control of over memory.high action
  2020-08-17 16:44     ` Shakeel Butt
  (?)
@ 2020-08-17 16:56     ` Chris Down
  2020-08-18 19:12       ` Waiman Long
  -1 siblings, 1 reply; 58+ messages in thread
From: Chris Down @ 2020-08-17 16:56 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Waiman Long, Andrew Morton, Johannes Weiner, Michal Hocko,
	Vladimir Davydov, Jonathan Corbet, Alexey Dobriyan, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, LKML, linux-doc,
	linux-fsdevel, Cgroups, Linux MM

Shakeel Butt writes:
>> Sometimes, memory reclaim may not be able to recover memory in a rate
>> that can catch up to the physical memory allocation rate especially
>> when rotating disks are used for swapping or writing dirty pages. In
>> this case, the physical memory consumption will keep on increasing.
>
>Isn't this the real underlying issue? Why not make the guarantees of
>memory.high more strict instead of adding more interfaces and
>complexity?

Oh, thanks Shakeel for bringing this up. I missed this in the original 
changelog and I'm surprised that it's mentioned, since we do have protections 
against that.

Waiman, we already added artificial throttling if memory reclaim is not 
sufficiently achieved in 0e4b01df8659 ("mm, memcg: throttle allocators when 
failing reclaim over memory.high"), which has been present since v5.4. This 
should significantly inhibit physical memory consumption from increasing. What 
problems are you having with that? :-)

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 0/8] memcg: Enable fine-grained per process memory control
  2020-08-17 15:55     ` Waiman Long
  (?)
@ 2020-08-17 19:26     ` Michal Hocko
  2020-08-18 19:20         ` Waiman Long
  -1 siblings, 1 reply; 58+ messages in thread
From: Michal Hocko @ 2020-08-17 19:26 UTC (permalink / raw)
  To: Waiman Long
  Cc: Andrew Morton, Johannes Weiner, Vladimir Davydov,
	Jonathan Corbet, Alexey Dobriyan, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, linux-kernel, linux-doc,
	linux-fsdevel, cgroups, linux-mm

On Mon 17-08-20 11:55:37, Waiman Long wrote:
> On 8/17/20 11:26 AM, Michal Hocko wrote:
> > On Mon 17-08-20 10:08:23, Waiman Long wrote:
> > > Memory controller can be used to control and limit the amount of
> > > physical memory used by a task. When a limit is set in "memory.high" in
> > > a v2 non-root memory cgroup, the memory controller will try to reclaim
> > > memory if the limit has been exceeded. Normally, that will be enough
> > > to keep the physical memory consumption of tasks in the memory cgroup
> > > to be around or below the "memory.high" limit.
> > > 
> > > Sometimes, memory reclaim may not be able to recover memory in a rate
> > > that can catch up to the physical memory allocation rate. In this case,
> > > the physical memory consumption will keep on increasing.  When it reaches
> > > "memory.max" for memory cgroup v2 or when the system is running out of
> > > free memory, the OOM killer will be invoked to kill some tasks to free
> > > up additional memory. However, one has little control of which tasks
> > > are going to be killed by an OOM killer. Killing tasks that hold some
> > > important resources without freeing them first can create other system
> > > problems down the road.
> > > 
> > > Users who do not want the OOM killer to be invoked to kill random
> > > tasks in an out-of-memory situation can use the memory control
> > > facility provided by this new patchset via prctl(2) to better manage
> > > the mitigation action that needs to be performed to various tasks when
> > > the specified memory limit is exceeded with memory cgroup v2 being used.
> > > 
> > > The currently supported mitigation actions include the followings:
> > > 
> > >   1) Return ENOMEM for some syscalls that allocate or handle memory
> > >   2) Slow down the process for memory reclaim to catch up
> > >   3) Send a specific signal to the task
> > >   4) Kill the task
> > > 
> > > The users that want better memory control for their applicatons can
> > > either modify their applications to call the prctl(2) syscall directly
> > > with the new memory control command code or write the desired action to
> > > the newly provided memctl procfs files of their applications provided
> > > that those applications run in a non-root v2 memory cgroup.
> > prctl is fundamentally about per-process control while cgroup (not only
> > memcg) is about group of processes interface. How do those two interact
> > together? In other words what is the semantic when different processes
> > have a different views on the same underlying memcg event?
> As said in a previous mail, this patchset is derived from a customer request
> and per-process control is exactly what the customer wants. That is why
> prctl() is used. This patchset is intended to supplement the existing memory
> cgroup features. Processes in a memory cgroup that don't use this new API
> will behave exactly like before. Only processes that opt to use this new API
> will have additional mitigation actions applied on them in case the
> additional limits are reached.

Please keep in mind that you are proposing a new user API that we will
have to maintain for ever. That requires that the interface is
consistent and well defined. As I've said the fundamental problem with
this interface is that you are trying to hammer a process centric
interface into a framework that is fundamentally process group oriented.
Maybe there is a sensible way to do that without all sorts of weird
corner cases but I haven't seen any of that explained here.

Really just try to describe the semantics when two different tasks in the
same memcg have a different opinion on the same event. One wants ENOMEM
and the other wants a specific signal to be delivered. Right now the
behavior will be timing specific because who hits the oom path is
non-deterministic from the userspace POV. Let's say that you can somehow
handle that; now how are you going to implement ENOMEM for any context
other than the current task? I am pretty sure the more specific the
questions become, the more awkward this will get.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 4/8] fs/proc: Support a new procfs memctl file
  2020-08-17 14:08 ` [RFC PATCH 4/8] fs/proc: Support a new procfs memctl file Waiman Long
@ 2020-08-17 20:10   ` kernel test robot
  2020-08-17 20:10   ` [RFC PATCH] fs/proc: proc_memctl_operations can be static kernel test robot
  1 sibling, 0 replies; 58+ messages in thread
From: kernel test robot @ 2020-08-17 20:10 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 2231 bytes --]

Hi Waiman,

[FYI, it's a private test report for your RFC patch.]
[auto build test WARNING on hnaz-linux-mm/master]
[also build test WARNING on linus/master v5.9-rc1 next-20200817]
[cannot apply to tip/sched/core]
[If your patch is applied to the wrong git tree, kindly drop us a note.
When submitting patches, we suggest using '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Waiman-Long/memcg-Enable-fine-grained-per-process-memory-control/20200817-221201
base:   https://github.com/hnaz/linux-mm master
config: x86_64-randconfig-s022-20200817 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-15) 9.3.0
reproduce:
        # apt-get install sparse
        # sparse version: v0.6.2-180-g49f7e13a-dirty
        # save the attached .config to linux build tree
        make W=1 C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' ARCH=x86_64 

If you fix the issue, kindly add the following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>


sparse warnings: (new ones prefixed by >>)

   fs/proc/base.c:2231:25: sparse: sparse: cast to restricted fmode_t
   fs/proc/base.c:2288:42: sparse: sparse: cast from restricted fmode_t
   fs/proc/base.c:2385:48: sparse: sparse: cast from restricted fmode_t
>> fs/proc/base.c:3244:30: sparse: sparse: symbol 'proc_memctl_operations' was not declared. Should it be static?
   fs/proc/base.c: note: in included file (through include/linux/rcuwait.h, include/linux/percpu-rwsem.h, include/linux/fs.h, ...):
   include/linux/sched/signal.h:695:37: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct spinlock [usertype] *lock @@     got struct spinlock [noderef] __rcu * @@
   include/linux/sched/signal.h:695:37: sparse:     expected struct spinlock [usertype] *lock
   include/linux/sched/signal.h:695:37: sparse:     got struct spinlock [noderef] __rcu *
   fs/proc/base.c:1104:36: sparse: sparse: context imbalance in '__set_oom_adj' - unexpected unlock

Please review and possibly fold the followup patch.

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org

[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 41262 bytes --]

^ permalink raw reply	[flat|nested] 58+ messages in thread

* [RFC PATCH] fs/proc: proc_memctl_operations can be static
  2020-08-17 14:08 ` [RFC PATCH 4/8] fs/proc: Support a new procfs memctl file Waiman Long
  2020-08-17 20:10   ` kernel test robot
@ 2020-08-17 20:10   ` kernel test robot
  1 sibling, 0 replies; 58+ messages in thread
From: kernel test robot @ 2020-08-17 20:10 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 616 bytes --]


Signed-off-by: kernel test robot <lkp@intel.com>
---
 base.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 28a1afeb67a9c..463a92cf8a95d 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3241,7 +3241,7 @@ static ssize_t proc_memctl_write(struct file *file, const char __user *buf,
 	return err < 0 ? err : count;
 }
 
-const struct file_operations proc_memctl_operations = {
+static const struct file_operations proc_memctl_operations = {
 	.read   = proc_memctl_read,
 	.write  = proc_memctl_write,
 	.llseek	= generic_file_llseek,

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 0/8] memcg: Enable fine-grained per process memory control
@ 2020-08-18  9:14   ` peterz
  0 siblings, 0 replies; 58+ messages in thread
From: peterz @ 2020-08-18  9:14 UTC (permalink / raw)
  To: Waiman Long
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	Jonathan Corbet, Alexey Dobriyan, Ingo Molnar, Juri Lelli,
	Vincent Guittot, linux-kernel, linux-doc, linux-fsdevel, cgroups,
	linux-mm

On Mon, Aug 17, 2020 at 10:08:23AM -0400, Waiman Long wrote:
> Memory controller can be used to control and limit the amount of
> physical memory used by a task. When a limit is set in "memory.high" in
> a v2 non-root memory cgroup, the memory controller will try to reclaim
> memory if the limit has been exceeded. Normally, that will be enough
> to keep the physical memory consumption of tasks in the memory cgroup
> to be around or below the "memory.high" limit.
> 
> Sometimes, memory reclaim may not be able to recover memory in a rate
> that can catch up to the physical memory allocation rate. In this case,
> the physical memory consumption will keep on increasing. 

Then slow down the allocator? That's what we do for dirty pages too, we
slow down the dirtier when we run against the limits.


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 0/8] memcg: Enable fine-grained per process memory control
@ 2020-08-18  9:26     ` Michal Hocko
  0 siblings, 0 replies; 58+ messages in thread
From: Michal Hocko @ 2020-08-18  9:26 UTC (permalink / raw)
  To: peterz
  Cc: Waiman Long, Andrew Morton, Johannes Weiner, Vladimir Davydov,
	Jonathan Corbet, Alexey Dobriyan, Ingo Molnar, Juri Lelli,
	Vincent Guittot, linux-kernel, linux-doc, linux-fsdevel, cgroups,
	linux-mm

On Tue 18-08-20 11:14:53, Peter Zijlstra wrote:
> On Mon, Aug 17, 2020 at 10:08:23AM -0400, Waiman Long wrote:
> > Memory controller can be used to control and limit the amount of
> > physical memory used by a task. When a limit is set in "memory.high" in
> > a v2 non-root memory cgroup, the memory controller will try to reclaim
> > memory if the limit has been exceeded. Normally, that will be enough
> > to keep the physical memory consumption of tasks in the memory cgroup
> > to be around or below the "memory.high" limit.
> > 
> > Sometimes, memory reclaim may not be able to recover memory in a rate
> > that can catch up to the physical memory allocation rate. In this case,
> > the physical memory consumption will keep on increasing. 
> 
> Then slow down the allocator? That's what we do for dirty pages too, we
> slow down the dirtier when we run against the limits.

This is what we actually do. Have a look at mem_cgroup_handle_over_high.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 0/8] memcg: Enable fine-grained per process memory control
@ 2020-08-18  9:27     ` Chris Down
  0 siblings, 0 replies; 58+ messages in thread
From: Chris Down @ 2020-08-18  9:27 UTC (permalink / raw)
  To: peterz
  Cc: Waiman Long, Andrew Morton, Johannes Weiner, Michal Hocko,
	Vladimir Davydov, Jonathan Corbet, Alexey Dobriyan, Ingo Molnar,
	Juri Lelli, Vincent Guittot, linux-kernel, linux-doc,
	linux-fsdevel, cgroups, linux-mm

peterz@infradead.org writes:
>On Mon, Aug 17, 2020 at 10:08:23AM -0400, Waiman Long wrote:
>> Memory controller can be used to control and limit the amount of
>> physical memory used by a task. When a limit is set in "memory.high" in
>> a v2 non-root memory cgroup, the memory controller will try to reclaim
>> memory if the limit has been exceeded. Normally, that will be enough
>> to keep the physical memory consumption of tasks in the memory cgroup
>> to be around or below the "memory.high" limit.
>>
>> Sometimes, memory reclaim may not be able to recover memory in a rate
>> that can catch up to the physical memory allocation rate. In this case,
>> the physical memory consumption will keep on increasing.
>
>Then slow down the allocator? That's what we do for dirty pages too, we
>slow down the dirtier when we run against the limits.

We already do that since v5.4. I'm wondering whether Waiman's customer is just 
running with a too-old kernel without 0e4b01df865 ("mm, memcg: throttle 
allocators when failing reclaim over memory.high") backported.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 0/8] memcg: Enable fine-grained per process memory control
@ 2020-08-18  9:59       ` peterz
  0 siblings, 0 replies; 58+ messages in thread
From: peterz @ 2020-08-18  9:59 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Waiman Long, Andrew Morton, Johannes Weiner, Vladimir Davydov,
	Jonathan Corbet, Alexey Dobriyan, Ingo Molnar, Juri Lelli,
	Vincent Guittot, linux-kernel, linux-doc, linux-fsdevel, cgroups,
	linux-mm

On Tue, Aug 18, 2020 at 11:26:17AM +0200, Michal Hocko wrote:
> On Tue 18-08-20 11:14:53, Peter Zijlstra wrote:
> > On Mon, Aug 17, 2020 at 10:08:23AM -0400, Waiman Long wrote:
> > > Memory controller can be used to control and limit the amount of
> > > physical memory used by a task. When a limit is set in "memory.high" in
> > > a v2 non-root memory cgroup, the memory controller will try to reclaim
> > > memory if the limit has been exceeded. Normally, that will be enough
> > > to keep the physical memory consumption of tasks in the memory cgroup
> > > to be around or below the "memory.high" limit.
> > > 
> > > Sometimes, memory reclaim may not be able to recover memory in a rate
> > > that can catch up to the physical memory allocation rate. In this case,
> > > the physical memory consumption will keep on increasing. 
> > 
> > Then slow down the allocator? That's what we do for dirty pages too, we
> > slow down the dirtier when we run against the limits.
> 
> This is what we actually do. Have a look at mem_cgroup_handle_over_high.

But then how can it run away like Waiman suggested?

/me goes look... and finds MEMCG_MAX_HIGH_DELAY_JIFFIES.

That's a fail... :-(

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 0/8] memcg: Enable fine-grained per process memory control
  2020-08-18  9:27     ` Chris Down
  (?)
@ 2020-08-18 10:04     ` peterz
  2020-08-18 12:55         ` Matthew Wilcox
  -1 siblings, 1 reply; 58+ messages in thread
From: peterz @ 2020-08-18 10:04 UTC (permalink / raw)
  To: Chris Down
  Cc: Waiman Long, Andrew Morton, Johannes Weiner, Michal Hocko,
	Vladimir Davydov, Jonathan Corbet, Alexey Dobriyan, Ingo Molnar,
	Juri Lelli, Vincent Guittot, linux-kernel, linux-doc,
	linux-fsdevel, cgroups, linux-mm

On Tue, Aug 18, 2020 at 10:27:37AM +0100, Chris Down wrote:
> peterz@infradead.org writes:
> > On Mon, Aug 17, 2020 at 10:08:23AM -0400, Waiman Long wrote:
> > > Memory controller can be used to control and limit the amount of
> > > physical memory used by a task. When a limit is set in "memory.high" in
> > > a v2 non-root memory cgroup, the memory controller will try to reclaim
> > > memory if the limit has been exceeded. Normally, that will be enough
> > > to keep the physical memory consumption of tasks in the memory cgroup
> > > to be around or below the "memory.high" limit.
> > > 
> > > Sometimes, memory reclaim may not be able to recover memory in a rate
> > > that can catch up to the physical memory allocation rate. In this case,
> > > the physical memory consumption will keep on increasing.
> > 
> > Then slow down the allocator? That's what we do for dirty pages too, we
> > slow down the dirtier when we run against the limits.
> 
> We already do that since v5.4. I'm wondering whether Waiman's customer is
> just running with a too-old kernel without 0e4b01df865 ("mm, memcg: throttle
> allocators when failing reclaim over memory.high") backported.

That commit is fundamentally broken, it doesn't guarantee anything.

Please go read how the dirty throttling works (unless people wrecked
that since..).

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 0/8] memcg: Enable fine-grained per process memory control
  2020-08-18  9:59       ` peterz
  (?)
@ 2020-08-18 10:05       ` Michal Hocko
  2020-08-18 10:18         ` peterz
  -1 siblings, 1 reply; 58+ messages in thread
From: Michal Hocko @ 2020-08-18 10:05 UTC (permalink / raw)
  To: peterz
  Cc: Waiman Long, Andrew Morton, Johannes Weiner, Vladimir Davydov,
	Jonathan Corbet, Alexey Dobriyan, Ingo Molnar, Juri Lelli,
	Vincent Guittot, linux-kernel, linux-doc, linux-fsdevel, cgroups,
	linux-mm

On Tue 18-08-20 11:59:10, Peter Zijlstra wrote:
> On Tue, Aug 18, 2020 at 11:26:17AM +0200, Michal Hocko wrote:
> > On Tue 18-08-20 11:14:53, Peter Zijlstra wrote:
> > > On Mon, Aug 17, 2020 at 10:08:23AM -0400, Waiman Long wrote:
> > > > Memory controller can be used to control and limit the amount of
> > > > physical memory used by a task. When a limit is set in "memory.high" in
> > > > a v2 non-root memory cgroup, the memory controller will try to reclaim
> > > > memory if the limit has been exceeded. Normally, that will be enough
> > > > to keep the physical memory consumption of tasks in the memory cgroup
> > > > to be around or below the "memory.high" limit.
> > > > 
> > > > Sometimes, memory reclaim may not be able to recover memory in a rate
> > > > that can catch up to the physical memory allocation rate. In this case,
> > > > the physical memory consumption will keep on increasing. 
> > > 
> > > Then slow down the allocator? That's what we do for dirty pages too, we
> > > slow down the dirtier when we run against the limits.
> > 
> > This is what we actually do. Have a look at mem_cgroup_handle_over_high.
> 
> But then how can it run-away like Waiman suggested?

As Chris mentioned in another reply, this functionality is quite new.
 
> /me goes look... and finds MEMCG_MAX_HIGH_DELAY_JIFFIES.

We can certainly tune different backoff delays, but I suspect this is
not the problem here.
 
> That's a fail... :-(

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 0/8] memcg: Enable fine-grained per process memory control
@ 2020-08-18 10:17         ` Chris Down
  0 siblings, 0 replies; 58+ messages in thread
From: Chris Down @ 2020-08-18 10:17 UTC (permalink / raw)
  To: peterz
  Cc: Michal Hocko, Waiman Long, Andrew Morton, Johannes Weiner,
	Vladimir Davydov, Jonathan Corbet, Alexey Dobriyan, Ingo Molnar,
	Juri Lelli, Vincent Guittot, linux-kernel, linux-doc,
	linux-fsdevel, cgroups, linux-mm

peterz@infradead.org writes:
>But then how can it run-away like Waiman suggested?

Probably because he's not running with that commit at all. We and others use 
this to prevent runaway allocation on a huge range of production and desktop 
use cases and it works just fine.

>/me goes look... and finds MEMCG_MAX_HIGH_DELAY_JIFFIES.
>
>That's a fail... :-(

I'd ask that you understand a bit more about the tradeoffs and intentions of 
the patch before rushing in to declare its failure, considering it works just 
fine :-)

Clamping the maximal time allows the application to take some action to 
remediate the situation, while still being slowed down significantly. 2 seconds 
per allocation batch is still absolutely plenty for any use case I've come 
across. If you have evidence it isn't, then present that instead of vague 
notions of "wrongness".

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 0/8] memcg: Enable fine-grained per process memory control
  2020-08-18 10:05       ` Michal Hocko
@ 2020-08-18 10:18         ` peterz
  2020-08-18 10:30           ` Michal Hocko
  2020-08-18 13:49             ` Johannes Weiner
  0 siblings, 2 replies; 58+ messages in thread
From: peterz @ 2020-08-18 10:18 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Waiman Long, Andrew Morton, Johannes Weiner, Vladimir Davydov,
	Jonathan Corbet, Alexey Dobriyan, Ingo Molnar, Juri Lelli,
	Vincent Guittot, linux-kernel, linux-doc, linux-fsdevel, cgroups,
	linux-mm

On Tue, Aug 18, 2020 at 12:05:16PM +0200, Michal Hocko wrote:
> > But then how can it run-away like Waiman suggested?
> 
> As Chris mentioned in other reply. This functionality is quite new.
>  
> > /me goes look... and finds MEMCG_MAX_HIGH_DELAY_JIFFIES.
> 
> We can certainly tune a different backoff delays but I suspect this is
> not the problem here.

Tuning? That thing needs throwing out, it's fundamentally buggered. Why
didn't anybody look at how the I/O dirtying thing works first?

What you need is a feedback loop against the rate of freeing pages, and
when you near the saturation point, the allocation rate should exactly
match the freeing rate.

But this thing has nothing whatsoever like that.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 0/8] memcg: Enable fine-grained per process memory control
  2020-08-18 10:17         ` Chris Down
  (?)
@ 2020-08-18 10:26         ` peterz
  2020-08-18 10:35           ` Chris Down
  -1 siblings, 1 reply; 58+ messages in thread
From: peterz @ 2020-08-18 10:26 UTC (permalink / raw)
  To: Chris Down
  Cc: Michal Hocko, Waiman Long, Andrew Morton, Johannes Weiner,
	Vladimir Davydov, Jonathan Corbet, Alexey Dobriyan, Ingo Molnar,
	Juri Lelli, Vincent Guittot, linux-kernel, linux-doc,
	linux-fsdevel, cgroups, linux-mm

On Tue, Aug 18, 2020 at 11:17:56AM +0100, Chris Down wrote:

> I'd ask that you understand a bit more about the tradeoffs and intentions of
> the patch before rushing in to declare its failure, considering it works
> just fine :-)
> 
> Clamping the maximal time allows the application to take some action to
> remediate the situation, while still being slowed down significantly. 2
> seconds per allocation batch is still absolutely plenty for any use case
> I've come across. If you have evidence it isn't, then present that instead
> of vague notions of "wrongness".

There is no feedback from the freeing rate, therefore it cannot be
correct in maintaining a maximum amount of pages.

0.5 pages / sec is still non-zero, and if the free rate is 0, you'll
crawl across whatever limit was set without any bounds. This is math
101.

It's true that I haven't been paying attention to mm in a while, but I
was one of the original authors of the I/O dirty balancing, I do think I
understand how these things work.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 0/8] memcg: Enable fine-grained per process memory control
  2020-08-18 10:18         ` peterz
@ 2020-08-18 10:30           ` Michal Hocko
  2020-08-18 10:36             ` peterz
  2020-08-18 13:49             ` Johannes Weiner
  1 sibling, 1 reply; 58+ messages in thread
From: Michal Hocko @ 2020-08-18 10:30 UTC (permalink / raw)
  To: peterz
  Cc: Waiman Long, Andrew Morton, Johannes Weiner, Vladimir Davydov,
	Jonathan Corbet, Alexey Dobriyan, Ingo Molnar, Juri Lelli,
	Vincent Guittot, linux-kernel, linux-doc, linux-fsdevel, cgroups,
	linux-mm

On Tue 18-08-20 12:18:44, Peter Zijlstra wrote:
> On Tue, Aug 18, 2020 at 12:05:16PM +0200, Michal Hocko wrote:
> > > But then how can it run-away like Waiman suggested?
> > 
> > As Chris mentioned in other reply. This functionality is quite new.
> >  
> > > /me goes look... and finds MEMCG_MAX_HIGH_DELAY_JIFFIES.
> > 
> > We can certainly tune a different backoff delays but I suspect this is
> > not the problem here.
> 
> Tuning? That thing needs throwing out, it's fundamentally buggered. Why
> didn't anybody look at how the I/O drtying thing works first?
> 
> What you need is a feeback loop against the rate of freeing pages, and
> when you near the saturation point, the allocation rate should exactly
> match the freeing rate.
> 
> But this thing has nothing what so ever like that.

Existing use cases seem to be doing fine with the existing
implementation. If we find out that this is insufficient then we can
work on it, but I believe this is tangential to this email thread. There
are no indications that the current implementation doesn't throttle
enough. The proposal also aims at a much richer interface to define the
oom behavior.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 0/8] memcg: Enable fine-grained per process memory control
  2020-08-18 10:26         ` peterz
@ 2020-08-18 10:35           ` Chris Down
  0 siblings, 0 replies; 58+ messages in thread
From: Chris Down @ 2020-08-18 10:35 UTC (permalink / raw)
  To: peterz
  Cc: Michal Hocko, Waiman Long, Andrew Morton, Johannes Weiner,
	Vladimir Davydov, Jonathan Corbet, Alexey Dobriyan, Ingo Molnar,
	Juri Lelli, Vincent Guittot, linux-kernel, linux-doc,
	linux-fsdevel, cgroups, linux-mm

peterz@infradead.org writes:
>On Tue, Aug 18, 2020 at 11:17:56AM +0100, Chris Down wrote:
>
>> I'd ask that you understand a bit more about the tradeoffs and intentions of
>> the patch before rushing in to declare its failure, considering it works
>> just fine :-)
>>
>> Clamping the maximal time allows the application to take some action to
>> remediate the situation, while still being slowed down significantly. 2
>> seconds per allocation batch is still absolutely plenty for any use case
>> I've come across. If you have evidence it isn't, then present that instead
>> of vague notions of "wrongness".
>
>There is no feedback from the freeing rate, therefore it cannot be
>correct in maintaining a maximum amount of pages.

memory.high is not about maintaining a maximum amount of pages. It's strictly 
best-effort, and the ramifications of a breach are typically fundamentally 
different than for dirty throttling.

>0.5 pages / sec is still non-zero, and if the free rate is 0, you'll
>crawl across whatever limit was set without any bounds. This is math
>101.
>
>It's true that I haven't been paying attention to mm in a while, but I
>was one of the original authors of the I/O dirty balancing, I do think I
>understand how these things work.

You're suggesting we replace a well understood, easy to reason about model with 
something non-trivially more complex, all on the back of you suggesting that 
the current approach is "wrong" without any evidence or quantification.

Peter, we're not going to throw out perfectly functional memcg code simply 
because of your say-so, especially when you've not asked for information or 
context about the tradeoffs involved, or presented any evidence that something 
perverse is actually happening.

Prescribing a specific solution modelled on some other code path here without 
producing evidence or measurements specific to the nuances of this particular 
endpoint is not a recipe for success.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 0/8] memcg: Enable fine-grained per process memory control
  2020-08-18 10:30           ` Michal Hocko
@ 2020-08-18 10:36             ` peterz
  0 siblings, 0 replies; 58+ messages in thread
From: peterz @ 2020-08-18 10:36 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Waiman Long, Andrew Morton, Johannes Weiner, Vladimir Davydov,
	Jonathan Corbet, Alexey Dobriyan, Ingo Molnar, Juri Lelli,
	Vincent Guittot, linux-kernel, linux-doc, linux-fsdevel, cgroups,
	linux-mm

On Tue, Aug 18, 2020 at 12:30:59PM +0200, Michal Hocko wrote:
> The proposal also aims at much richer interface to define the
> oom behavior.

Oh yeah, I'm not defending any of that prctl() nonsense.

Just saying that from a math / control theory point of view, the current
thing is an abhorrent failure.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 0/8] memcg: Enable fine-grained per process memory control
@ 2020-08-18 12:55         ` Matthew Wilcox
  0 siblings, 0 replies; 58+ messages in thread
From: Matthew Wilcox @ 2020-08-18 12:55 UTC (permalink / raw)
  To: peterz
  Cc: Chris Down, Waiman Long, Andrew Morton, Johannes Weiner,
	Michal Hocko, Vladimir Davydov, Jonathan Corbet, Alexey Dobriyan,
	Ingo Molnar, Juri Lelli, Vincent Guittot, linux-kernel,
	linux-doc, linux-fsdevel, cgroups, linux-mm

On Tue, Aug 18, 2020 at 12:04:44PM +0200, peterz@infradead.org wrote:
> On Tue, Aug 18, 2020 at 10:27:37AM +0100, Chris Down wrote:
> > peterz@infradead.org writes:
> > > On Mon, Aug 17, 2020 at 10:08:23AM -0400, Waiman Long wrote:
> > > > Memory controller can be used to control and limit the amount of
> > > > physical memory used by a task. When a limit is set in "memory.high" in
> > > > a v2 non-root memory cgroup, the memory controller will try to reclaim
> > > > memory if the limit has been exceeded. Normally, that will be enough
> > > > to keep the physical memory consumption of tasks in the memory cgroup
> > > > to be around or below the "memory.high" limit.
> > > > 
> > > > Sometimes, memory reclaim may not be able to recover memory in a rate
> > > > that can catch up to the physical memory allocation rate. In this case,
> > > > the physical memory consumption will keep on increasing.
> > > 
> > > Then slow down the allocator? That's what we do for dirty pages too, we
> > > slow down the dirtier when we run against the limits.
> > 
> > We already do that since v5.4. I'm wondering whether Waiman's customer is
> > just running with a too-old kernel without 0e4b01df865 ("mm, memcg: throttle
> > allocators when failing reclaim over memory.high") backported.
> 
> That commit is fundamentally broken, it doesn't guarantee anything.
> 
> Please go read how the dirty throttling works (unless people wrecked
> that since..).

Of course they did.

https://lore.kernel.org/linux-mm/ce7975cd-6353-3f29-b52c-7a81b1d07caa@kernel.dk/

^ permalink raw reply	[flat|nested] 58+ messages in thread


* Re: [RFC PATCH 0/8] memcg: Enable fine-grained per process memory control
@ 2020-08-18 13:49             ` Johannes Weiner
  0 siblings, 0 replies; 58+ messages in thread
From: Johannes Weiner @ 2020-08-18 13:49 UTC (permalink / raw)
  To: peterz
  Cc: Michal Hocko, Waiman Long, Andrew Morton, Vladimir Davydov,
	Jonathan Corbet, Alexey Dobriyan, Ingo Molnar, Juri Lelli,
	Vincent Guittot, linux-kernel, linux-doc, linux-fsdevel, cgroups,
	linux-mm

On Tue, Aug 18, 2020 at 12:18:44PM +0200, peterz@infradead.org wrote:
> What you need is a feedback loop against the rate of freeing pages, and
> when you near the saturation point, the allocation rate should exactly
> match the freeing rate.

IO throttling solves a slightly different problem.

IO occurs in parallel to the workload's execution stream, and you're
trying to take the workload from dirtying at CPU speed to rate match
to the independent IO stream.

With memory allocations, though, freeing happens from inside the
execution stream of the workload. If you throttle allocations, you're
most likely throttling the freeing rate as well. And you'll slow down
reclaim scanning by the same amount as the page references, so it's
not making reclaim more successful either. The alloc/use/free
(im)balance is an inherent property of the workload, regardless of the
speed you're executing it at.

So the goal here is different. We're not trying to pace the workload
into some form of sustainability. Rather, it's for OOM handling. When
we detect the workload's alloc/use/free pattern is unsustainable given
available memory, we slow it down just enough to allow userspace to
implement OOM policy and job priorities (on containerized hosts these
tend to be too complex to express in the kernel's oom scoring system).

The exponential curve makes it look like we're trying to do some type
of feedback system, but it's really only to let minor infractions pass
and throttle unsustainable expansion ruthlessly. Drop-behind reclaim
can be a bit bumpy because we batch on the allocation side as well as
on the reclaim side, hence the fuzz factor there.

^ permalink raw reply	[flat|nested] 58+ messages in thread
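
[Editorial note: the shape Johannes describes -- let minor infractions
pass, throttle unsustainable expansion ruthlessly -- can be sketched as a
toy numeric model. This is illustrative only, NOT the kernel code; the
function name, the quadratic curve, and the `scale` constant are invented
here. Only the 2-second cap is grounded in the thread, mirroring the
MEMCG_MAX_HIGH_DELAY_JIFFIES clamp mentioned later.]

```python
# Toy model of the memory.high over-limit throttle (illustrative only,
# not the kernel implementation).  Small overages incur a negligible
# delay; runaway overage is penalized steeply, clamped to a 2-second
# ceiling per allocation batch.

MAX_HIGH_DELAY = 2.0   # seconds, analogous to MEMCG_MAX_HIGH_DELAY_JIFFIES

def high_delay(usage, high, scale=100.0):
    """Return a throttle delay for a cgroup whose usage exceeds high."""
    if usage <= high:
        return 0.0
    overage = (usage - high) / high      # fractional overage
    delay = scale * overage * overage    # minor infractions barely punished
    return min(delay, MAX_HIGH_DELAY)    # unsustainable expansion hits the cap

# A 1% overage costs ~10ms; a 20% overage saturates at the 2s ceiling.
assert high_delay(101, 100) < 0.05
assert high_delay(120, 100) == MAX_HIGH_DELAY
```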


* Re: [RFC PATCH 1/8] memcg: Enable fine-grained control of over memory.high action
  2020-08-17 16:56     ` Chris Down
@ 2020-08-18 19:12       ` Waiman Long
  0 siblings, 0 replies; 58+ messages in thread
From: Waiman Long @ 2020-08-18 19:12 UTC (permalink / raw)
  To: Chris Down, Shakeel Butt
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	Jonathan Corbet, Alexey Dobriyan, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, LKML, linux-doc, linux-fsdevel,
	Cgroups, Linux MM

On 8/17/20 12:56 PM, Chris Down wrote:
> Shakeel Butt writes:
>>> Sometimes, memory reclaim may not be able to recover memory in a rate
>>> that can catch up to the physical memory allocation rate especially
>>> when rotating disks are used for swapping or writing dirty pages. In
>>> this case, the physical memory consumption will keep on increasing.
>>
>> Isn't this the real underlying issue? Why not make the guarantees of
>> memory.high more strict instead of adding more interfaces and
>> complexity?
>
> Oh, thanks Shakeel for bringing this up. I missed this in the original 
> changelog and I'm surprised that it's mentioned, since we do have 
> protections against that.
>
> Waiman, we already added artificial throttling if memory reclaim is 
> not sufficiently achieved in 0e4b01df8659 ("mm, memcg: throttle 
> allocators when failing reclaim over memory.high"), which has been 
> present since v5.4. This should significantly inhibit physical memory 
> consumption from increasing. What problems are you having with that? :-)
>
Oh, I think I overlooked your patch. You are right, there is already 
throttling in place. So I need to re-examine my patch to see if it is 
still necessary, or reduce its scope.

Thanks,
Longman


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 1/8] memcg: Enable fine-grained control of over memory.high action
  2020-08-17 16:44     ` Shakeel Butt
  (?)
  (?)
@ 2020-08-18 19:14     ` Waiman Long
  -1 siblings, 0 replies; 58+ messages in thread
From: Waiman Long @ 2020-08-18 19:14 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	Jonathan Corbet, Alexey Dobriyan, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, LKML, linux-doc, linux-fsdevel,
	Cgroups, Linux MM

On 8/17/20 12:44 PM, Shakeel Butt wrote:
> On Mon, Aug 17, 2020 at 7:11 AM Waiman Long <longman@redhat.com> wrote:
>> Memory controller can be used to control and limit the amount of
>> physical memory used by a task. When a limit is set in "memory.high"
>> in a non-root memory cgroup, the memory controller will try to reclaim
>> memory if the limit has been exceeded. Normally, that will be enough
>> to keep the physical memory consumption of tasks in the memory cgroup
>> to be around or below the "memory.high" limit.
>>
>> Sometimes, memory reclaim may not be able to recover memory in a rate
>> that can catch up to the physical memory allocation rate especially
>> when rotating disks are used for swapping or writing dirty pages. In
>> this case, the physical memory consumption will keep on increasing.
> Isn't this the real underlying issue? Why not make the guarantees of
> memory.high more strict instead of adding more interfaces and
> complexity?
>
> By the way, have you observed this issue on real workloads or some
> test cases? It would be good to get a repro with simple test cases.
>
As said before, this is from a customer request. I will need to 
re-examine the existing features to see if they can satisfy the 
customer's needs.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 0/8] memcg: Enable fine-grained per process memory control
@ 2020-08-18 19:20         ` Waiman Long
  0 siblings, 0 replies; 58+ messages in thread
From: Waiman Long @ 2020-08-18 19:20 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Johannes Weiner, Vladimir Davydov,
	Jonathan Corbet, Alexey Dobriyan, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, linux-kernel, linux-doc,
	linux-fsdevel, cgroups, linux-mm

On 8/17/20 3:26 PM, Michal Hocko wrote:
> On Mon 17-08-20 11:55:37, Waiman Long wrote:
>> On 8/17/20 11:26 AM, Michal Hocko wrote:
>>> On Mon 17-08-20 10:08:23, Waiman Long wrote:
>>>> Memory controller can be used to control and limit the amount of
>>>> physical memory used by a task. When a limit is set in "memory.high" in
>>>> a v2 non-root memory cgroup, the memory controller will try to reclaim
>>>> memory if the limit has been exceeded. Normally, that will be enough
>>>> to keep the physical memory consumption of tasks in the memory cgroup
>>>> to be around or below the "memory.high" limit.
>>>>
>>>> Sometimes, memory reclaim may not be able to recover memory in a rate
>>>> that can catch up to the physical memory allocation rate. In this case,
>>>> the physical memory consumption will keep on increasing.  When it reaches
>>>> "memory.max" for memory cgroup v2 or when the system is running out of
>>>> free memory, the OOM killer will be invoked to kill some tasks to free
>>>> up additional memory. However, one has little control of which tasks
>>>> are going to be killed by an OOM killer. Killing tasks that hold some
>>>> important resources without freeing them first can create other system
>>>> problems down the road.
>>>>
>>>> Users who do not want the OOM killer to be invoked to kill random
>>>> tasks in an out-of-memory situation can use the memory control
>>>> facility provided by this new patchset via prctl(2) to better manage
>>>> the mitigation action that needs to be performed to various tasks when
>>>> the specified memory limit is exceeded with memory cgroup v2 being used.
>>>>
>>>> The currently supported mitigation actions include the followings:
>>>>
>>>>    1) Return ENOMEM for some syscalls that allocate or handle memory
>>>>    2) Slow down the process for memory reclaim to catch up
>>>>    3) Send a specific signal to the task
>>>>    4) Kill the task
>>>>
>>>> The users that want better memory control for their applications can
>>>> either modify their applications to call the prctl(2) syscall directly
>>>> with the new memory control command code or write the desired action to
>>>> the newly provided memctl procfs files of their applications provided
>>>> that those applications run in a non-root v2 memory cgroup.
>>> prctl is fundamentally about per-process control while cgroup (not only
>>> memcg) is about group of processes interface. How do those two interact
>>> together? In other words what is the semantic when different processes
>>> have a different views on the same underlying memcg event?
>> As said in a previous mail, this patchset is derived from a customer request
>> and per-process control is exactly what the customer wants. That is why
>> prctl() is used. This patchset is intended to supplement the existing memory
>> cgroup features. Processes in a memory cgroup that don't use this new API
>> will behave exactly like before. Only processes that opt to use this new API
>> will have additional mitigation actions applied on them in case the
>> additional limits are reached.
> Please keep in mind that you are proposing a new user API that we will
> have to maintain for ever. That requires that the interface is
> consistent and well defined. As I've said the fundamental problem with
> this interface is that you are trying to hammer a process centric
> interface into a framework that is fundamentally process group oriented.
> Maybe there is a sensible way to do that without all sorts of weird
> corner cases but I haven't seen any of that explained here.
>
> Really just try to describe a semantic when two different tasks in the
> same memcg have a different opinion on the same event. One wants ENOMEM
> and the other a specific signal to be delivered. Right now the behavior will
> be timing specific because who hits the oom path is non-deterministic
> from the userspace POV. Let's say that you can somehow handle that, now
> how are you going to implement ENOMEM for any context other than current
> task? I am pretty sure the more specific questions you will have the
> more this will get awkward.

The basic idea is to trigger a user-specified memory-over-high 
mitigation when the actual memory usage exceeds a threshold that is 
supposed to lie between "high" and "max". The additional limit that is 
passed in sets this extra threshold. We want to avoid OOM at all costs.

The ENOMEM error may not be suitable for all applications, as some of 
them may not be able to handle ENOMEM gracefully. That option is for 
applications that are designed to handle it.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 58+ messages in thread
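
[Editorial note: the semantics Waiman describes -- a per-task threshold
between "high" and "max", triggering one of the four mitigation actions
from the cover letter -- can be sketched as a toy policy model. The names,
signature, and dispatch logic below are hypothetical, not the patchset's
actual prctl(2) interface.]

```python
from enum import Enum

class Action(Enum):
    ENOMEM = 1     # fail memory-allocating syscalls with ENOMEM
    SLOWDOWN = 2   # throttle the process so reclaim can catch up
    SIGNAL = 3     # deliver a user-chosen signal
    KILL = 4       # kill the task

def mitigate(usage, high, max_, threshold, action):
    """Toy model: fire a per-task mitigation once usage crosses
    `threshold`, which must lie between memory.high and memory.max."""
    if not (high <= threshold <= max_):
        raise ValueError("threshold must sit between high and max")
    if usage <= threshold:
        return None        # below the extra threshold: no action
    return action          # above it: the task's chosen action fires

# Two tasks in the same cgroup can opt into different actions, which
# is exactly the per-process semantics being debated in this thread.
assert mitigate(90, 80, 120, 100, Action.ENOMEM) is None
assert mitigate(110, 80, 120, 100, Action.SIGNAL) is Action.SIGNAL
```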


* Re: [RFC PATCH 0/8] memcg: Enable fine-grained per process memory control
  2020-08-18  9:14   ` peterz
                     ` (2 preceding siblings ...)
  (?)
@ 2020-08-18 19:27   ` Waiman Long
  -1 siblings, 0 replies; 58+ messages in thread
From: Waiman Long @ 2020-08-18 19:27 UTC (permalink / raw)
  To: peterz
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	Jonathan Corbet, Alexey Dobriyan, Ingo Molnar, Juri Lelli,
	Vincent Guittot, linux-kernel, linux-doc, linux-fsdevel, cgroups,
	linux-mm

On 8/18/20 5:14 AM, peterz@infradead.org wrote:
> On Mon, Aug 17, 2020 at 10:08:23AM -0400, Waiman Long wrote:
>> Memory controller can be used to control and limit the amount of
>> physical memory used by a task. When a limit is set in "memory.high" in
>> a v2 non-root memory cgroup, the memory controller will try to reclaim
>> memory if the limit has been exceeded. Normally, that will be enough
>> to keep the physical memory consumption of tasks in the memory cgroup
>> to be around or below the "memory.high" limit.
>>
>> Sometimes, memory reclaim may not be able to recover memory in a rate
>> that can catch up to the physical memory allocation rate. In this case,
>> the physical memory consumption will keep on increasing.
> Then slow down the allocator? That's what we do for dirty pages too, we
> slow down the dirtier when we run against the limits.
>
I missed that allocator throttling is already being done in upstream 
code. So I will need to re-examine if this patch is necessary or not.

Thanks,
Longman


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 0/8] memcg: Enable fine-grained per process memory control
  2020-08-18  9:27     ` Chris Down
  (?)
  (?)
@ 2020-08-18 19:30     ` Waiman Long
  -1 siblings, 0 replies; 58+ messages in thread
From: Waiman Long @ 2020-08-18 19:30 UTC (permalink / raw)
  To: Chris Down, peterz
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	Jonathan Corbet, Alexey Dobriyan, Ingo Molnar, Juri Lelli,
	Vincent Guittot, linux-kernel, linux-doc, linux-fsdevel, cgroups,
	linux-mm

On 8/18/20 5:27 AM, Chris Down wrote:
> peterz@infradead.org writes:
>> On Mon, Aug 17, 2020 at 10:08:23AM -0400, Waiman Long wrote:
>>> Memory controller can be used to control and limit the amount of
>>> physical memory used by a task. When a limit is set in "memory.high" in
>>> a v2 non-root memory cgroup, the memory controller will try to reclaim
>>> memory if the limit has been exceeded. Normally, that will be enough
>>> to keep the physical memory consumption of tasks in the memory cgroup
>>> to be around or below the "memory.high" limit.
>>>
>>> Sometimes, memory reclaim may not be able to recover memory in a rate
>>> that can catch up to the physical memory allocation rate. In this case,
>>> the physical memory consumption will keep on increasing.
>>
>> Then slow down the allocator? That's what we do for dirty pages too, we
>> slow down the dirtier when we run against the limits.
>
> We already do that since v5.4. I'm wondering whether Waiman's customer 
> is just running with a too-old kernel without 0e4b01df865 ("mm, memcg: 
> throttle allocators when failing reclaim over memory.high") backported.
>
The fact is that we don't have that in RHEL8 yet and cgroup v2 is still 
not the default at the moment.

I am planning to backport the throttling patches to RHEL and hopefully 
we can switch to using cgroup v2 soon.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 0/8] memcg: Enable fine-grained per process memory control
  2020-08-18 12:55         ` Matthew Wilcox
  (?)
@ 2020-08-20  6:11         ` Dave Chinner
  -1 siblings, 0 replies; 58+ messages in thread
From: Dave Chinner @ 2020-08-20  6:11 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: peterz, Chris Down, Waiman Long, Andrew Morton, Johannes Weiner,
	Michal Hocko, Vladimir Davydov, Jonathan Corbet, Alexey Dobriyan,
	Ingo Molnar, Juri Lelli, Vincent Guittot, linux-kernel,
	linux-doc, linux-fsdevel, cgroups, linux-mm

On Tue, Aug 18, 2020 at 01:55:59PM +0100, Matthew Wilcox wrote:
> On Tue, Aug 18, 2020 at 12:04:44PM +0200, peterz@infradead.org wrote:
> > On Tue, Aug 18, 2020 at 10:27:37AM +0100, Chris Down wrote:
> > > peterz@infradead.org writes:
> > > > On Mon, Aug 17, 2020 at 10:08:23AM -0400, Waiman Long wrote:
> > > > > Memory controller can be used to control and limit the amount of
> > > > > physical memory used by a task. When a limit is set in "memory.high" in
> > > > > a v2 non-root memory cgroup, the memory controller will try to reclaim
> > > > > memory if the limit has been exceeded. Normally, that will be enough
> > > > > to keep the physical memory consumption of tasks in the memory cgroup
> > > > > to be around or below the "memory.high" limit.
> > > > > 
> > > > > Sometimes, memory reclaim may not be able to recover memory in a rate
> > > > > that can catch up to the physical memory allocation rate. In this case,
> > > > > the physical memory consumption will keep on increasing.
> > > > 
> > > > Then slow down the allocator? That's what we do for dirty pages too, we
> > > > slow down the dirtier when we run against the limits.
> > > 
> > > We already do that since v5.4. I'm wondering whether Waiman's customer is
> > > just running with a too-old kernel without 0e4b01df865 ("mm, memcg: throttle
> > > allocators when failing reclaim over memory.high") backported.
> > 
> > That commit is fundamentally broken, it doesn't guarantee anything.
> > 
> > Please go read how the dirty throttling works (unless people wrecked
> > that since..).
> 
> Of course they did.
> 
> https://lore.kernel.org/linux-mm/ce7975cd-6353-3f29-b52c-7a81b1d07caa@kernel.dk/

Different thing. That's memory reclaim throttling, not dirty page
throttling.  balance_dirty_pages() still works just fine as it does
not look at device congestion. page cleaning rate is accounted in
test_clear_page_writeback(), page dirtying rate is accounted
directly in balance_dirty_pages(). That feedback loop has not been
broken...

And I completely agree with Peter here - the control theory we
applied to the dirty throttling problem is still 100% valid and so
the algorithm still just works all these years later. I've only been
saying that allocation should use the same feedback model for
reclaim throttling since ~2011...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 58+ messages in thread
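
[Editorial note: the feedback loop Peter and Dave are advocating -- pace
allocations so that, near saturation, the allocation rate exactly matches
the freeing rate -- can be sketched as a small proportional controller.
This models the idea only; it is not balance_dirty_pages() or any kernel
code, and all names and constants are invented.]

```python
def allocation_pause(alloc_rate, free_rate, usage, high, max_,
                     batch_time=0.001):
    """Toy feedback throttle: as usage climbs from `high` toward `max_`,
    scale a per-batch pause so the effective allocation rate converges
    on the freeing rate.  Returns a pause in seconds; 0 when under high."""
    if usage <= high or free_rate <= 0 or alloc_rate <= 0:
        return 0.0
    pos = min((usage - high) / (max_ - high), 1.0)   # 0 at high, 1 at max
    # Target rate slides from full allocation speed down to the freeing rate.
    target = alloc_rate + (free_rate - alloc_rate) * pos
    # Pause long enough that each batch proceeds at the target rate.
    return batch_time * (alloc_rate / target - 1.0)

# At saturation (usage == max) the throttled rate matches freeing exactly.
pause = allocation_pause(2000, 500, 120, 80, 120)
effective = 2000 * 0.001 / (0.001 + pause)
assert abs(effective - 500) < 1e-6
assert allocation_pause(2000, 500, 70, 80, 120) == 0.0
```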

* Re: [RFC PATCH 0/8] memcg: Enable fine-grained per process memory control
@ 2020-08-21 19:37               ` Peter Zijlstra
  0 siblings, 0 replies; 58+ messages in thread
From: Peter Zijlstra @ 2020-08-21 19:37 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, Waiman Long, Andrew Morton, Vladimir Davydov,
	Jonathan Corbet, Alexey Dobriyan, Ingo Molnar, Juri Lelli,
	Vincent Guittot, linux-kernel, linux-doc, linux-fsdevel, cgroups,
	linux-mm

On Tue, Aug 18, 2020 at 09:49:00AM -0400, Johannes Weiner wrote:
> On Tue, Aug 18, 2020 at 12:18:44PM +0200, peterz@infradead.org wrote:
> > What you need is a feedback loop against the rate of freeing pages, and
> > when you near the saturation point, the allocation rate should exactly
> > match the freeing rate.
> 
> IO throttling solves a slightly different problem.
> 
> IO occurs in parallel to the workload's execution stream, and you're
> trying to take the workload from dirtying at CPU speed to rate match
> to the independent IO stream.
> 
> With memory allocations, though, freeing happens from inside the
> execution stream of the workload. If you throttle allocations, you're

For a single task, but even then you're making the argument that we need
to allocate memory to free memory, and we all know where that gets us.

But we're actually talking about a cgroup here, which is a collection of
tasks all doing things in parallel.

> most likely throttling the freeing rate as well. And you'll slow down
> reclaim scanning by the same amount as the page references, so it's
> not making reclaim more successful either. The alloc/use/free
> (im)balance is an inherent property of the workload, regardless of the
> speed you're executing it at.

Arguably seeing the rate drop to near 0 is a very good point to consider
running cgroup-OOM.

^ permalink raw reply	[flat|nested] 58+ messages in thread


* Re: [RFC PATCH 0/8] memcg: Enable fine-grained per process memory control
@ 2020-08-23  2:49           ` Waiman Long
  0 siblings, 0 replies; 58+ messages in thread
From: Waiman Long @ 2020-08-23  2:49 UTC (permalink / raw)
  To: Chris Down, peterz
  Cc: Michal Hocko, Andrew Morton, Johannes Weiner, Vladimir Davydov,
	Jonathan Corbet, Alexey Dobriyan, Ingo Molnar, Juri Lelli,
	Vincent Guittot, linux-kernel, linux-doc, linux-fsdevel, cgroups,
	linux-mm

On 8/18/20 6:17 AM, Chris Down wrote:
> peterz@infradead.org writes:
>> But then how can it run-away like Waiman suggested?
>
> Probably because he's not running with that commit at all. We and 
> others use this to prevent runaway allocation on a huge range of 
> production and desktop use cases and it works just fine.
>
>> /me goes look... and finds MEMCG_MAX_HIGH_DELAY_JIFFIES.
>>
>> That's a fail... :-(
>
> I'd ask that you understand a bit more about the tradeoffs and 
> intentions of the patch before rushing in to declare its failure, 
> considering it works just fine :-)
>
> Clamping the maximal time allows the application to take some action 
> to remediate the situation, while still being slowed down 
> significantly. 2 seconds per allocation batch is still absolutely 
> plenty for any use case I've come across. If you have evidence it 
> isn't, then present that instead of vague notions of "wrongness".
>
Sorry for the late reply.

I ran some tests on the latest kernel and it seems to work as expected.
I was running the test on an older kernel that doesn't have this patch,
and I was not aware of it beforehand.

Sorry for the confusion.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 0/8] memcg: Enable fine-grained per process memory control
  2020-08-21 19:37               ` Peter Zijlstra
  (?)
@ 2020-08-24 16:58               ` Johannes Weiner
  2020-09-07 11:47                 ` Chris Down
  2020-09-09 11:53                 ` Michal Hocko
  -1 siblings, 2 replies; 58+ messages in thread
From: Johannes Weiner @ 2020-08-24 16:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Michal Hocko, Waiman Long, Andrew Morton, Vladimir Davydov,
	Jonathan Corbet, Alexey Dobriyan, Ingo Molnar, Juri Lelli,
	Vincent Guittot, linux-kernel, linux-doc, linux-fsdevel, cgroups,
	linux-mm

On Fri, Aug 21, 2020 at 09:37:16PM +0200, Peter Zijlstra wrote:
> On Tue, Aug 18, 2020 at 09:49:00AM -0400, Johannes Weiner wrote:
> > On Tue, Aug 18, 2020 at 12:18:44PM +0200, peterz@infradead.org wrote:
> > > What you need is a feedback loop against the rate of freeing pages, and
> > > when you near the saturation point, the allocation rate should exactly
> > > match the freeing rate.
> > 
> > IO throttling solves a slightly different problem.
> > 
> > IO occurs in parallel to the workload's execution stream, and you're
> > trying to take the workload from dirtying at CPU speed to rate match
> > to the independent IO stream.
> > 
> > With memory allocations, though, freeing happens from inside the
> > execution stream of the workload. If you throttle allocations, you're
> 
> For a single task, but even then you're making the argument that we need
> to allocate memory to free memory, and we all know where that gets us.
>
> But we're actually talking about a cgroup here, which is a collection of
> tasks all doing things in parallel.

Right, but sharing a memory cgroup means sharing an LRU list, and that
transfers memory pressure and allocation burden between otherwise
independent tasks - if nothing else through cache misses on the
executables and libraries. I doubt that one task can go through
several comprehensive reclaim cycles on a shared LRU without
completely annihilating the latency or throughput targets of everybody
else in the group in most real world applications.

> > most likely throttling the freeing rate as well. And you'll slow down
> > reclaim scanning by the same amount as the page references, so it's
> > not making reclaim more successful either. The alloc/use/free
> > (im)balance is an inherent property of the workload, regardless of the
> > speed you're executing it at.
> 
> Arguably seeing the rate drop to near 0 is a very good point to consider
> running cgroup-OOM.

Agreed. In the past, that's actually what we did: In cgroup1, you
could disable the kernel OOM killer, and when reclaim failed at the
limit, the allocating task would be put on a waitqueue until woken up
by a freeing event. Conceptually this is clean & straight-forward.

However,

1. Putting allocation contexts with unknown locks to indefinite sleep
   caused deadlocks, for obvious reasons. Userspace OOM killing tends
   to take a lot of task-specific locks when scanning through /proc
   files for kill candidates, and can easily get stuck.

   Using bounded over indefinite waits is simply acknowledging that
   the deadlock potential when connecting arbitrary task stacks in the
   system through free->alloc ordering is equally difficult to plan
   out as alloc->free ordering.

   The non-cgroup OOM killer actually has the same deadlock potential,
   where the allocating/killing task can hold resources that the OOM
   victim requires to exit. The OOM reaper hides it, the static
   emergency reserves hide it - but to truly solve this problem, you
   would have to have full knowledge of memory & lock ordering
   dependencies of those tasks. And then can still end up with
   scenarios where the only answer is panic().

2. I don't recall ever seeing situations in cgroup1 where the precise
   matching of allocation rate to freeing rate has allowed cgroups to
   run sustainably after reclaim has failed. The practical benefit of
   a complicated feedback loop over something crude & robust once
   we're in an OOM situation is not apparent to me.

   [ That's different from the IO-throttling *while still doing
     reclaim* that Dave brought up. *That* justifies the same effort
     we put into dirty throttling. I'm only talking about the
     situation where reclaim has already failed and we need to
     facilitate userspace OOM handling. ]

So that was the motivation for the bounded sleeps. They do not
guarantee containment, but they provide a reasonable amount of time
for the userspace OOM handler to intervene, without deadlocking.


That all being said, the semantics of the new 'high' limit in cgroup2
have allowed us to move reclaim/limit enforcement out of the
allocation context and into the userspace return path.

See the call to mem_cgroup_handle_over_high() from
tracehook_notify_resume(), and the comments in try_charge() around
set_notify_resume().

This already solves the free->alloc ordering problem by allowing the
allocation to exceed the limit temporarily until at least all locks
are dropped, we know we can sleep etc., before performing enforcement.

That means we may not need the timed sleeps anymore for that purpose,
and could bring back directed waits for freeing-events again.

What do you think? Any hazards around indefinite sleeps in that resume
path? It's called before __rseq_handle_notify_resume and the
arch-specific resume callback (which appears to be a no-op currently).

Chris, Michal, what are your thoughts? It would certainly be simpler
conceptually on the memcg side.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 0/8] memcg: Enable fine-grained per process memory control
  2020-08-24 16:58               ` Johannes Weiner
@ 2020-09-07 11:47                 ` Chris Down
  2020-09-09 11:53                 ` Michal Hocko
  1 sibling, 0 replies; 58+ messages in thread
From: Chris Down @ 2020-09-07 11:47 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Peter Zijlstra, Michal Hocko, Waiman Long, Andrew Morton,
	Vladimir Davydov, Jonathan Corbet, Alexey Dobriyan, Ingo Molnar,
	Juri Lelli, Vincent Guittot, linux-kernel, linux-doc,
	linux-fsdevel, cgroups, linux-mm

Johannes Weiner writes:
>That all being said, the semantics of the new 'high' limit in cgroup2
>have allowed us to move reclaim/limit enforcement out of the
>allocation context and into the userspace return path.
>
>See the call to mem_cgroup_handle_over_high() from
>tracehook_notify_resume(), and the comments in try_charge() around
>set_notify_resume().
>
>This already solves the free->alloc ordering problem by allowing the
>allocation to exceed the limit temporarily until at least all locks
>are dropped, we know we can sleep etc., before performing enforcement.
>
>That means we may not need the timed sleeps anymore for that purpose,
>and could bring back directed waits for freeing-events again.
>
>What do you think? Any hazards around indefinite sleeps in that resume
>path? It's called before __rseq_handle_notify_resume and the
>arch-specific resume callback (which appears to be a no-op currently).
>
>Chris, Michal, what are your thoughts? It would certainly be simpler
>conceptually on the memcg side.

I'm not against that, although I personally don't feel very strongly about it 
either way, since the current behaviour clearly works in practice.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 0/8] memcg: Enable fine-grained per process memory control
  2020-08-24 16:58               ` Johannes Weiner
  2020-09-07 11:47                 ` Chris Down
@ 2020-09-09 11:53                 ` Michal Hocko
  1 sibling, 0 replies; 58+ messages in thread
From: Michal Hocko @ 2020-09-09 11:53 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Peter Zijlstra, Waiman Long, Andrew Morton, Vladimir Davydov,
	Jonathan Corbet, Alexey Dobriyan, Ingo Molnar, Juri Lelli,
	Vincent Guittot, linux-kernel, linux-doc, linux-fsdevel, cgroups,
	linux-mm

[Sorry, this slipped through the cracks]

On Mon 24-08-20 12:58:50, Johannes Weiner wrote:
> On Fri, Aug 21, 2020 at 09:37:16PM +0200, Peter Zijlstra wrote:
[...]
> > Arguably seeing the rate drop to near 0 is a very good point to consider
> > running cgroup-OOM.
> 
> Agreed. In the past, that's actually what we did: In cgroup1, you
> could disable the kernel OOM killer, and when reclaim failed at the
> limit, the allocating task would be put on a waitqueue until woken up
> by a freeing event. Conceptually this is clean & straight-forward.
> 
> However,
> 
> 1. Putting allocation contexts with unknown locks to indefinite sleep
>    caused deadlocks, for obvious reasons. Userspace OOM killing tends
>    to take a lot of task-specific locks when scanning through /proc
>    files for kill candidates, and can easily get stuck.
> 
>    Using bounded over indefinite waits is simply acknowledging that
>    the deadlock potential when connecting arbitrary task stacks in the
>    system through free->alloc ordering is equally difficult to plan
>    out as alloc->free ordering.
> 
>    The non-cgroup OOM killer actually has the same deadlock potential,
>    where the allocating/killing task can hold resources that the OOM
>    victim requires to exit. The OOM reaper hides it, the static
>    emergency reserves hide it - but to truly solve this problem, you
>    would have to have full knowledge of memory & lock ordering
>    dependencies of those tasks. And then can still end up with
>    scenarios where the only answer is panic().

Yes. Even killing all eligible tasks is not guaranteed to help the
situation, because a) resources might not be bound to a process lifetime
(e.g. tmpfs), or b) an ineligible task might be holding resources that
block others from doing the proper cleanup. The OOM reaper is here to
make sure we reclaim some of the victim's address space, and we go over
all eligible tasks rather than getting stuck at the first victim forever.
 
> 2. I don't recall ever seeing situations in cgroup1 where the precise
>    matching of allocation rate to freeing rate has allowed cgroups to
>    run sustainably after reclaim has failed. The practical benefit of
>    a complicated feedback loop over something crude & robust once
>    we're in an OOM situation is not apparent to me.

Yes, the usual course here is to go OOM and kill something. Running on
the very edge of (memcg) OOM doesn't tend to be sustainable, and I am
not sure it makes sense to optimize for that case.

>    [ That's different from the IO-throttling *while still doing
>      reclaim* that Dave brought up. *That* justifies the same effort
>      we put into dirty throttling. I'm only talking about the
>      situation where reclaim has already failed and we need to
>      facilitate userspace OOM handling. ]
> 
> So that was the motivation for the bounded sleeps. They do not
> guarantee containment, but they provide a reasonable amount of time
> for the userspace OOM handler to intervene, without deadlocking.

Yes, memory.high is mostly a best effort containment. We do have the
hard limit to put a stop on runaways or if you are watching for PSI then
the high limit throttling would give you enough idea to take an action
from the userspace.

> That all being said, the semantics of the new 'high' limit in cgroup2
> have allowed us to move reclaim/limit enforcement out of the
> allocation context and into the userspace return path.
> 
> See the call to mem_cgroup_handle_over_high() from
> tracehook_notify_resume(), and the comments in try_charge() around
> set_notify_resume().
> 
> This already solves the free->alloc ordering problem by allowing the
> allocation to exceed the limit temporarily until at least all locks
> are dropped, we know we can sleep etc., before performing enforcement.
> 
> That means we may not need the timed sleeps anymore for that purpose,
> and could bring back directed waits for freeing-events again.
> 
> What do you think? Any hazards around indefinite sleeps in that resume
> path? It's called before __rseq_handle_notify_resume and the
> arch-specific resume callback (which appears to be a no-op currently).
> 
> Chris, Michal, what are your thoughts? It would certainly be simpler
> conceptually on the memcg side.

I would need a more specific description. But as I've already said, it
doesn't seem that we need to fix any practical problem here. The high
limit implementation has changed quite a lot recently. I would rather
see it settle for a while and see how it behaves in a wider variety of
workloads before changing the implementation again.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 58+ messages in thread

end of thread, other threads:[~2020-09-09 15:14 UTC | newest]

Thread overview: 58+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-08-17 14:08 [RFC PATCH 0/8] memcg: Enable fine-grained per process memory control Waiman Long
2020-08-17 14:08 ` [RFC PATCH 1/8] memcg: Enable fine-grained control of over memory.high action Waiman Long
2020-08-17 14:30   ` Chris Down
2020-08-17 15:38     ` Waiman Long
2020-08-17 16:11       ` Chris Down
2020-08-17 16:44   ` Shakeel Butt
2020-08-17 16:44     ` Shakeel Butt
2020-08-17 16:56     ` Chris Down
2020-08-18 19:12       ` Waiman Long
2020-08-18 19:14     ` Waiman Long
2020-08-17 14:08 ` [RFC PATCH 2/8] memcg, mm: Return ENOMEM or delay if memcg_over_limit Waiman Long
2020-08-17 14:08 ` [RFC PATCH 3/8] memcg: Allow the use of task RSS memory as over-high action trigger Waiman Long
2020-08-17 14:08   ` Waiman Long
2020-08-17 14:08 ` [RFC PATCH 4/8] fs/proc: Support a new procfs memctl file Waiman Long
2020-08-17 20:10   ` kernel test robot
2020-08-17 20:10   ` [RFC PATCH] fs/proc: proc_memctl_operations can be static kernel test robot
2020-08-17 14:08 ` [RFC PATCH 5/8] memcg: Allow direct per-task memory limit checking Waiman Long
2020-08-17 14:08 ` [RFC PATCH 6/8] memcg: Introduce additional memory control slowdown if needed Waiman Long
2020-08-17 14:08 ` [RFC PATCH 7/8] memcg: Enable logging of memory control mitigation action Waiman Long
2020-08-17 14:08   ` Waiman Long
2020-08-17 14:08 ` [RFC PATCH 8/8] memcg: Add over-high action prctl() documentation Waiman Long
2020-08-17 15:26 ` [RFC PATCH 0/8] memcg: Enable fine-grained per process memory control Michal Hocko
2020-08-17 15:55   ` Waiman Long
2020-08-17 15:55     ` Waiman Long
2020-08-17 19:26     ` Michal Hocko
2020-08-18 19:20       ` Waiman Long
2020-08-18 19:20         ` Waiman Long
2020-08-18  9:14 ` peterz
2020-08-18  9:14   ` peterz-wEGCiKHe2LqWVfeAwA7xHQ
2020-08-18  9:26   ` Michal Hocko
2020-08-18  9:26     ` Michal Hocko
2020-08-18  9:59     ` peterz
2020-08-18  9:59       ` peterz-wEGCiKHe2LqWVfeAwA7xHQ
2020-08-18 10:05       ` Michal Hocko
2020-08-18 10:18         ` peterz
2020-08-18 10:30           ` Michal Hocko
2020-08-18 10:36             ` peterz
2020-08-18 13:49           ` Johannes Weiner
2020-08-18 13:49             ` Johannes Weiner
2020-08-21 19:37             ` Peter Zijlstra
2020-08-21 19:37               ` Peter Zijlstra
2020-08-24 16:58               ` Johannes Weiner
2020-09-07 11:47                 ` Chris Down
2020-09-09 11:53                 ` Michal Hocko
2020-08-18 10:17       ` Chris Down
2020-08-18 10:17         ` Chris Down
2020-08-18 10:26         ` peterz
2020-08-18 10:35           ` Chris Down
2020-08-23  2:49         ` Waiman Long
2020-08-23  2:49           ` Waiman Long
2020-08-18  9:27   ` Chris Down
2020-08-18  9:27     ` Chris Down
2020-08-18 10:04     ` peterz
2020-08-18 12:55       ` Matthew Wilcox
2020-08-18 12:55         ` Matthew Wilcox
2020-08-20  6:11         ` Dave Chinner
2020-08-18 19:30     ` Waiman Long
2020-08-18 19:27   ` Waiman Long
