linux-kernel.vger.kernel.org archive mirror
* [patch -mm] mm, oom: add global access to memory reserves on livelock
@ 2015-08-20 21:00 David Rientjes
  2015-08-20 23:10 ` Andrew Morton
  2015-08-21  8:17 ` Michal Hocko
  0 siblings, 2 replies; 13+ messages in thread
From: David Rientjes @ 2015-08-20 21:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Johannes Weiner, Oleg Nesterov, Michal Hocko,
	Vlastimil Babka, linux-kernel, linux-mm

On system oom, a process may fail to exit if its thread depends on a lock
held by another allocating process.

In this case, we can detect an oom kill livelock that requires memory
allocation to be successful to resolve.

This patch introduces an oom expiration, set to 5s, that defines how long
a thread has to exit after being oom killed.

When this period elapses, it is assumed that the thread cannot make
forward progress without help.  The only help the VM may provide is to
allow pending allocations to succeed, so it grants all allocators access
to memory reserves after reclaim and compaction have failed.

This patch does not allow global access to memory reserves on memcg oom
kill, but the functionality is there if extended.

An example livelock is possible with a kernel module (requires
EXPORT_SYMBOL(min_free_kbytes)):

#include <linux/delay.h>
#include <linux/kernel.h>
#include <linux/kthread.h>
#include <linux/mm.h>
#include <linux/module.h>
#include <linux/oom.h>

/*
 * The contended mutex that the allocating kthread is holding and that the oom
 * killed process is trying to acquire.
 */
static DEFINE_MUTEX(shared_mutex);

static int alloc_thread(void *param)
{
	struct task_struct *locking_thread = param;
	struct page *page, *next;
	unsigned int allocs_left;
	bool oom_killed = false;
	LIST_HEAD(pagelist);

	/* Allocate half of memory reserves after oom */
	allocs_left = (min_free_kbytes >> (PAGE_SHIFT - 10)) / 2;

	mutex_lock(&shared_mutex);

	/* Give locking_thread a chance to wakeup */
	msleep(1000);

	while (!oom_killed || allocs_left) {
		page = alloc_pages(GFP_KERNEL, 0);
		if (likely(page))
			list_add(&page->lru, &pagelist);

		cond_resched();

		if (unlikely(kthread_should_stop()))
			break;
		if (oom_killed) {
			allocs_left--;
			continue;
		}
		if (unlikely(fatal_signal_pending(locking_thread))) {
			/*
			 * The process trying to acquire shared_mutex has been
			 * killed.  Continue to allocate some memory to use
			 * reserves, and then drop the mutex.
			 */
			oom_killed = true;
		}
	}
	mutex_unlock(&shared_mutex);

	/* Release memory back to the system */
	list_for_each_entry_safe(page, next, &pagelist, lru) {
		list_del(&page->lru);
		__free_page(page);
		cond_resched();
	}
	return 0;
}

static int __init oom_livelock_init(void)
{
	struct task_struct *allocating_thread;

	allocating_thread = kthread_run(alloc_thread, current,
					"oom_test_alloc");
	if (unlikely(IS_ERR(allocating_thread))) {
		pr_err("oom_test_alloc: could not create oom_test_alloc\n");
		return PTR_ERR(allocating_thread);
	}

	/* Prefer to be the first process killed on system oom */
	set_current_oom_origin();

	/* Wait until the kthread has acquired the mutex */
	while (!mutex_is_locked(&shared_mutex))
		schedule();

	/* This should livelock without VM intervention */
	mutex_lock(&shared_mutex);
	mutex_unlock(&shared_mutex);

	kthread_stop(allocating_thread);
	return 0;
}

module_init(oom_livelock_init);

This will livelock a system running with this patch.  With an oom
expiration, allocating_thread eventually gets access to allocate memory
so that it can drop shared_mutex and the oom victim, insmod, exits.

The kernel log first shows the oom killed victim:

	oom_test_alloc invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0, oom_score_badness=5000 (enabled), memcg_scoring=disabled
	oom_test_alloc cpuset=/ mems_allowed=0-1
	...
	Out of Memory: Kill process 13692 (insmod) score 1142282919 or sacrifice child
	Out of Memory: Killed process 13692 (insmod) total-vm:4356kB, anon-rss:68kB, file-rss:284kB
	GOOGLE: oom-kill constraint=CONSTRAINT_NONE origin_memcg= kill_memcg= task=insmod pid=13692 uid=0

and then access to memory reserves being granted:

	WARNING: CPU: 17 PID: 13046 at mm/oom_kill.c:314 oom_scan_process_thread+0x16b/0x1f0()
	insmod (13692) has failed to exit -- global access to memory reserves started
	...
	Call Trace:
	 [<ffffffffbb5f0e67>] dump_stack+0x46/0x58
	 [<ffffffffbb069394>] warn_slowpath_common+0x94/0xc0
	 [<ffffffffbb069476>] warn_slowpath_fmt+0x46/0x50
	 [<ffffffffbb14ad1b>] oom_scan_process_thread+0x16b/0x1f0
	 [<ffffffffbb14b81b>] out_of_memory+0x39b/0x6f0
	 ...
	---[ end trace a26a290e84699a90 ]---
	Call Trace of insmod/13692:
	 [<ffffffffbb0d620c>] load_module+0x1cdc/0x23c0
	 [<ffffffffbb0d6996>] SyS_init_module+0xa6/0xd0
	 [<ffffffffbb5fb322>] system_call_fastpath+0x16/0x1b
	 [<ffffffffffffffff>] 0xffffffffffffffff

load_module+0x1cdc is where insmod blocks trying to acquire the shared
mutex in the unit test, which oom_test_alloc is holding while allocating.

And then the eventual exit of insmod:

	insmod (13692) exited -- global access to memory reserves ended

Signed-off-by: David Rientjes <rientjes@google.com>
---
 drivers/tty/sysrq.c   |  3 +-
 include/linux/oom.h   |  5 +--
 include/linux/sched.h |  1 +
 kernel/exit.c         |  4 +++
 mm/memcontrol.c       |  4 ++-
 mm/oom_kill.c         | 92 +++++++++++++++++++++++++++++++++++++++++++++++----
 mm/page_alloc.c       | 18 ++++++----
 7 files changed, 110 insertions(+), 17 deletions(-)

diff --git a/drivers/tty/sysrq.c b/drivers/tty/sysrq.c
--- a/drivers/tty/sysrq.c
+++ b/drivers/tty/sysrq.c
@@ -360,9 +360,10 @@ static void moom_callback(struct work_struct *ignored)
 		.gfp_mask = gfp_mask,
 		.order = -1,
 	};
+	bool unused;
 
 	mutex_lock(&oom_lock);
-	if (!out_of_memory(&oc))
+	if (!out_of_memory(&oc, &unused))
 		pr_info("OOM request ignored because killer is disabled\n");
 	mutex_unlock(&oom_lock);
 }
diff --git a/include/linux/oom.h b/include/linux/oom.h
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -87,9 +87,10 @@ extern void check_panic_on_oom(struct oom_control *oc,
 			       struct mem_cgroup *memcg);
 
 extern enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
-		struct task_struct *task, unsigned long totalpages);
+		struct task_struct *task, unsigned long totalpages,
+		bool *expire);
 
-extern bool out_of_memory(struct oom_control *oc);
+extern bool out_of_memory(struct oom_control *oc, bool *expired_oom_kill);
 
 extern void exit_oom_victim(void);
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -767,6 +767,7 @@ struct signal_struct {
 	short oom_score_adj;		/* OOM kill score adjustment */
 	short oom_score_adj_min;	/* OOM kill score adjustment min value.
 					 * Only settable by CAP_SYS_RESOURCE. */
+	unsigned long oom_kill_expire;	/* expiration time */
 
 	struct mutex cred_guard_mutex;	/* guard against foreign influences on
 					 * credential calculations
diff --git a/kernel/exit.c b/kernel/exit.c
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -427,6 +427,10 @@ static void exit_mm(struct task_struct *tsk)
 	}
 	atomic_inc(&mm->mm_count);
 	BUG_ON(mm != tsk->active_mm);
+	if (tsk->signal->oom_kill_expire &&
+	    time_after_eq(jiffies, tsk->signal->oom_kill_expire))
+		pr_info("%s (%d) exited -- global access to memory reserves ended\n",
+			tsk->comm, tsk->pid);
 	/* more a memory barrier than a real lock */
 	task_lock(tsk);
 	tsk->mm = NULL;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1330,10 +1330,12 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	for_each_mem_cgroup_tree(iter, memcg) {
 		struct css_task_iter it;
 		struct task_struct *task;
+		bool unused;
 
 		css_task_iter_start(&iter->css, &it);
 		while ((task = css_task_iter_next(&it))) {
-			switch (oom_scan_process_thread(&oc, task, totalpages)) {
+			switch (oom_scan_process_thread(&oc, task, totalpages,
+							&unused)) {
 			case OOM_SCAN_SELECT:
 				if (chosen)
 					put_task_struct(chosen);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -35,6 +35,7 @@
 #include <linux/freezer.h>
 #include <linux/ftrace.h>
 #include <linux/ratelimit.h>
+#include <linux/stacktrace.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/oom.h>
@@ -45,6 +46,14 @@ int sysctl_oom_dump_tasks = 1;
 
 DEFINE_MUTEX(oom_lock);
 
+/*
+ * If an oom killed thread cannot exit because another thread is holding a lock
+ * that it requires, then the oom killer cannot ensure forward progress.  When
+ * OOM_EXPIRE_MSECS lapses, provide all threads access to memory reserves so the
+ * thread holding the lock may drop it and the oom victim may exit.
+ */
+#define OOM_EXPIRE_MSECS	(5000)
+
 #ifdef CONFIG_NUMA
 /**
  * has_intersects_mems_allowed() - check task eligiblity for kill
@@ -254,8 +263,57 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc,
 }
 #endif
 
+#ifdef CONFIG_STACKTRACE
+#define MAX_STACK_TRACE_ENTRIES	(64)
+static unsigned long stack_trace_entries[MAX_STACK_TRACE_ENTRIES *
+					 sizeof(unsigned long)];
+static DEFINE_MUTEX(stack_trace_mutex);
+
+static void print_stacks_expired(struct task_struct *task)
+{
+	/* One set of stack traces every OOM_EXPIRE_MSECS */
+	static DEFINE_RATELIMIT_STATE(expire_rs, OOM_EXPIRE_MSECS / 1000 * HZ,
+				      1);
+	struct stack_trace trace = {
+		.nr_entries = 0,
+		.max_entries = ARRAY_SIZE(stack_trace_entries),
+		.entries = stack_trace_entries,
+		.skip = 2,
+	};
+
+	if (!__ratelimit(&expire_rs))
+		return;
+
+	WARN(true,
+	     "%s (%d) has failed to exit -- global access to memory reserves started\n",
+	     task->comm, task->pid);
+
+	/*
+	 * If cred_guard_mutex can't be acquired, this may be a mutex that is
+	 * being held causing the livelock.  Return without printing the stack.
+	 */
+	if (!mutex_trylock(&task->signal->cred_guard_mutex))
+		return;
+
+	mutex_lock(&stack_trace_mutex);
+	save_stack_trace_tsk(task, &trace);
+
+	pr_info("Call Trace of %s/%d:\n", task->comm, task->pid);
+	print_stack_trace(&trace, 0);
+
+	mutex_unlock(&stack_trace_mutex);
+	mutex_unlock(&task->signal->cred_guard_mutex);
+}
+#else
+static inline void print_stacks_expired(struct task_struct *task)
+{
+}
+#endif /* CONFIG_STACKTRACE */
+
+
 enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
-			struct task_struct *task, unsigned long totalpages)
+			struct task_struct *task, unsigned long totalpages,
+			bool *expired_oom_kill)
 {
 	if (oom_unkillable_task(task, NULL, oc->nodemask))
 		return OOM_SCAN_CONTINUE;
@@ -265,8 +323,14 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 	 * Don't allow any other task to have access to the reserves.
 	 */
 	if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
-		if (oc->order != -1)
+		if (oc->order != -1) {
+			if (task->mm && time_after_eq(jiffies,
+					task->signal->oom_kill_expire)) {
+				get_task_struct(task);
+				*expired_oom_kill = true;
+			}
 			return OOM_SCAN_ABORT;
+		}
 	}
 	if (!task->mm)
 		return OOM_SCAN_CONTINUE;
@@ -289,7 +353,8 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
  * number of 'points'.  Returns -1 on scan abort.
  */
 static struct task_struct *select_bad_process(struct oom_control *oc,
-		unsigned int *ppoints, unsigned long totalpages)
+		unsigned int *ppoints, unsigned long totalpages,
+		bool *expired_oom_kill)
 {
 	struct task_struct *g, *p;
 	struct task_struct *chosen = NULL;
@@ -299,7 +364,8 @@ static struct task_struct *select_bad_process(struct oom_control *oc,
 	for_each_process_thread(g, p) {
 		unsigned int points;
 
-		switch (oom_scan_process_thread(oc, p, totalpages)) {
+		switch (oom_scan_process_thread(oc, p, totalpages,
+						expired_oom_kill)) {
 		case OOM_SCAN_SELECT:
 			chosen = p;
 			chosen_points = ULONG_MAX;
@@ -308,6 +374,10 @@ static struct task_struct *select_bad_process(struct oom_control *oc,
 			continue;
 		case OOM_SCAN_ABORT:
 			rcu_read_unlock();
+			if (*expired_oom_kill) {
+				print_stacks_expired(p);
+				put_task_struct(p);
+			}
 			return (struct task_struct *)(-1UL);
 		case OOM_SCAN_OK:
 			break;
@@ -414,6 +484,10 @@ void mark_oom_victim(struct task_struct *tsk)
 	/* OOM killer might race with memcg OOM */
 	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
 		return;
+
+	tsk->signal->oom_kill_expire = jiffies +
+				       msecs_to_jiffies(OOM_EXPIRE_MSECS);
+
 	/*
 	 * Make sure that the task is woken up from uninterruptible sleep
 	 * if it is frozen because OOM killer wouldn't be able to free
@@ -637,8 +711,11 @@ EXPORT_SYMBOL_GPL(unregister_oom_notifier);
  * killing a random task (bad), letting the system crash (worse)
  * OR try to be smart about which process to kill. Note that we
  * don't have to be perfect here, we just have to be good.
+ *
+ * expired_oom_kill is true if waiting on a process that has exceeded its oom
+ * expiration to exit.
  */
-bool out_of_memory(struct oom_control *oc)
+bool out_of_memory(struct oom_control *oc, bool *expired_oom_kill)
 {
 	struct task_struct *p;
 	unsigned long totalpages;
@@ -686,7 +763,7 @@ bool out_of_memory(struct oom_control *oc)
 		return true;
 	}
 
-	p = select_bad_process(oc, &points, totalpages);
+	p = select_bad_process(oc, &points, totalpages, expired_oom_kill);
 	/* Found nothing?!?! Either we hang forever, or we panic. */
 	if (!p && oc->order != -1) {
 		dump_header(oc, NULL, NULL);
@@ -717,6 +794,7 @@ void pagefault_out_of_memory(void)
 		.gfp_mask = 0,
 		.order = 0,
 	};
+	bool unused;
 
 	if (mem_cgroup_oom_synchronize(true))
 		return;
@@ -724,7 +802,7 @@ void pagefault_out_of_memory(void)
 	if (!mutex_trylock(&oom_lock))
 		return;
 
-	if (!out_of_memory(&oc)) {
+	if (!out_of_memory(&oc, &unused)) {
 		/*
 		 * There shouldn't be any user tasks runnable while the
 		 * OOM killer is disabled, so the current task has to
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2717,7 +2717,8 @@ void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...)
 
 static inline struct page *
 __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
-	const struct alloc_context *ac, unsigned long *did_some_progress)
+	const struct alloc_context *ac, unsigned long *did_some_progress,
+	bool *expired_oom_kill)
 {
 	struct oom_control oc = {
 		.zonelist = ac->zonelist,
@@ -2776,7 +2777,8 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 			goto out;
 	}
 	/* Exhausted what can be done so it's blamo time */
-	if (out_of_memory(&oc) || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL))
+	if (out_of_memory(&oc, expired_oom_kill) ||
+	    WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL))
 		*did_some_progress = 1;
 out:
 	mutex_unlock(&oom_lock);
@@ -2947,7 +2949,7 @@ static void wake_all_kswapds(unsigned int order, const struct alloc_context *ac)
 }
 
 static inline int
-gfp_to_alloc_flags(gfp_t gfp_mask)
+gfp_to_alloc_flags(gfp_t gfp_mask, const bool expired_oom_kill)
 {
 	int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
 	const bool atomic = !(gfp_mask & (__GFP_WAIT | __GFP_NO_KSWAPD));
@@ -2987,6 +2989,8 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 				((current->flags & PF_MEMALLOC) ||
 				 unlikely(test_thread_flag(TIF_MEMDIE))))
 			alloc_flags |= ALLOC_NO_WATERMARKS;
+		else if (expired_oom_kill)
+			alloc_flags |= ALLOC_NO_WATERMARKS;
 	}
 #ifdef CONFIG_CMA
 	if (gfpflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
@@ -2997,7 +3001,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 
 bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 {
-	return !!(gfp_to_alloc_flags(gfp_mask) & ALLOC_NO_WATERMARKS);
+	return !!(gfp_to_alloc_flags(gfp_mask, false) & ALLOC_NO_WATERMARKS);
 }
 
 static inline struct page *
@@ -3012,6 +3016,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	enum migrate_mode migration_mode = MIGRATE_ASYNC;
 	bool deferred_compaction = false;
 	int contended_compaction = COMPACT_CONTENDED_NONE;
+	bool expired_oom_kill = false;
 
 	/*
 	 * In the slowpath, we sanity check order to avoid ever trying to
@@ -3041,7 +3046,7 @@ retry:
 	 * reclaim. Now things get more complex, so set up alloc_flags according
 	 * to how we want to proceed.
 	 */
-	alloc_flags = gfp_to_alloc_flags(gfp_mask);
+	alloc_flags = gfp_to_alloc_flags(gfp_mask, expired_oom_kill);
 
 	/*
 	 * Find the true preferred zone if the allocation is unconstrained by
@@ -3166,7 +3171,8 @@ retry:
 	}
 
 	/* Reclaim has failed us, start killing things */
-	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
+	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress,
+				     &expired_oom_kill);
 	if (page)
 		goto got_pg;
 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [patch -mm] mm, oom: add global access to memory reserves on livelock
  2015-08-20 21:00 [patch -mm] mm, oom: add global access to memory reserves on livelock David Rientjes
@ 2015-08-20 23:10 ` Andrew Morton
  2015-08-21  8:17 ` Michal Hocko
  1 sibling, 0 replies; 13+ messages in thread
From: Andrew Morton @ 2015-08-20 23:10 UTC (permalink / raw)
  To: David Rientjes
  Cc: Mel Gorman, Johannes Weiner, Oleg Nesterov, Michal Hocko,
	Vlastimil Babka, linux-kernel, linux-mm

On Thu, 20 Aug 2015 14:00:36 -0700 (PDT) David Rientjes <rientjes@google.com> wrote:

> On system oom, a process may fail to exit if its thread depends on a lock
> held by another allocating process.
> 
> In this case, we can detect an oom kill livelock that requires memory
> allocation to be successful to resolve.
> 
> This patch introduces an oom expiration, set to 5s, that defines how long
> a thread has to exit after being oom killed.
> 
> When this period elapses, it is assumed that the thread cannot make
> forward progress without help.  The only help the VM may provide is to
> allow pending allocations to succeed, so it grants all allocators access
> to memory reserves after reclaim and compaction have failed.
> 
> This patch does not allow global access to memory reserves on memcg oom
> kill, but the functionality is there if extended.

I'm struggling a bit to understand how this works.  afaict what happens
is that if some other (non-oom-killed) thread is spinning in the page
allocator and then __alloc_pages_may_oom() decides to oom-kill this
not-yet-oom-killed thread, out_of_memory() will then tell this process
"you can access page reserves", rather than oom-killing it.

I think.

If so, the "provide all threads" comment over the OOM_EXPIRE_MSECS
definition is a bit incomplete.

Also, there are a whole bunch of reasons why a caller to
out_of_memory() won't call into select_bad_process(), where all the
magic happens.  Such as task_will_free_mem(), !oom_unkillable_task(),
etc.  Can't those things prevent those threads from getting permission
to use page reserves?

I suspect I'm just not understanding the implementation here.  A fuller
explanation (preferably in the .c files!) would help.


Also...  the hard-wired 5 second delay is of course problematic.  What
goes wrong if this is reduced to zero?  ie, let non-oom-killed threads
access page reserves immediately?

>
> ...
>
> @@ -254,8 +263,57 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc,
>  }
>  #endif
>  
> +#ifdef CONFIG_STACKTRACE
> +#define MAX_STACK_TRACE_ENTRIES	(64)
> +static unsigned long stack_trace_entries[MAX_STACK_TRACE_ENTRIES *
> +					 sizeof(unsigned long)];
> +static DEFINE_MUTEX(stack_trace_mutex);
> +
> +static void print_stacks_expired(struct task_struct *task)
> +{
> +	/* One set of stack traces every OOM_EXPIRE_MS */
> +	static DEFINE_RATELIMIT_STATE(expire_rs, OOM_EXPIRE_MSECS / 1000 * HZ,
> +				      1);
> +	struct stack_trace trace = {
> +		.nr_entries = 0,
> +		.max_entries = ARRAY_SIZE(stack_trace_entries),
> +		.entries = stack_trace_entries,
> +		.skip = 2,
> +	};
> +
> +	if (!__ratelimit(&expire_rs))
> +		return;
> +
> +	WARN(true,
> +	     "%s (%d) has failed to exit -- global access to memory reserves started\n",
> +	     task->comm, task->pid);
> +
> +	/*
> +	 * If cred_guard_mutex can't be acquired, this may be a mutex that is
> +	 * being held causing the livelock.  Return without printing the stack.
> +	 */
> +	if (!mutex_trylock(&task->signal->cred_guard_mutex))
> +		return;
> +
> +	mutex_lock(&stack_trace_mutex);
> +	save_stack_trace_tsk(task, &trace);
> +
> +	pr_info("Call Trace of %s/%d:\n", task->comm, task->pid);
> +	print_stack_trace(&trace, 0);
> +
> +	mutex_unlock(&stack_trace_mutex);
> +	mutex_unlock(&task->signal->cred_guard_mutex);
> +}
> +#else
> +static inline void print_stacks_expired(struct task_struct *task)
> +{
> +}
> +#endif /* CONFIG_STACKTRACE */

That "%s (%d) has failed to exit" warning is still useful if
CONFIG_STACKTRACE=n and I suggest it be moved into the caller.

Alternatively, make that message in exit_mm() dependent on
CONFIG_STACKTRACE as well - it's a bit odd to print the "ended" message
without having printed the "started" message.




* Re: [patch -mm] mm, oom: add global access to memory reserves on livelock
  2015-08-20 21:00 [patch -mm] mm, oom: add global access to memory reserves on livelock David Rientjes
  2015-08-20 23:10 ` Andrew Morton
@ 2015-08-21  8:17 ` Michal Hocko
  2015-08-21 13:29   ` Tetsuo Handa
  2015-08-24 21:04   ` David Rientjes
  1 sibling, 2 replies; 13+ messages in thread
From: Michal Hocko @ 2015-08-21  8:17 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Mel Gorman, Johannes Weiner, Oleg Nesterov,
	Vlastimil Babka, linux-kernel, linux-mm, Tetsuo Handa

[CCing Tetsuo - he was really concerned about the oom deadlocks and he
 was proposing a timeout based solution as well]

On Thu 20-08-15 14:00:36, David Rientjes wrote:
> On system oom, a process may fail to exit if its thread depends on a lock
> held by another allocating process.
> 
> In this case, we can detect an oom kill livelock that requires memory
> allocation to be successful to resolve.
> 
> This patch introduces an oom expiration, set to 5s, that defines how long
> a thread has to exit after being oom killed.
> 
> When this period elapses, it is assumed that the thread cannot make
> forward progress without help.  The only help the VM may provide is to
> allow pending allocations to succeed, so it grants all allocators access
> to memory reserves after reclaim and compaction have failed.

There might be many threads waiting for the allocation and this can lead
to quick oom reserves depletion without releasing resources which are
holding back the oom victim. As Tetsuo has shown, such a load can be
generated from the userspace without root privileges so it is much
easier to make the system _completely_ unusable with this patch. Not that
having an OOM deadlock would be great but you still have emergency tools
like sysrq triggered OOM killer to attempt to sort the situation out.
Once you are out of reserves nothing will help you, though. So I think it
is a bad idea to give access to reserves without any throttling.

Johannes' idea to give a partial access to memory reserves to the task
which has invoked the OOM killer was much better IMO.
-- 
Michal Hocko
SUSE Labs


* Re: [patch -mm] mm, oom: add global access to memory reserves on livelock
  2015-08-21  8:17 ` Michal Hocko
@ 2015-08-21 13:29   ` Tetsuo Handa
  2015-08-24 21:10     ` David Rientjes
  2015-08-24 21:04   ` David Rientjes
  1 sibling, 1 reply; 13+ messages in thread
From: Tetsuo Handa @ 2015-08-21 13:29 UTC (permalink / raw)
  To: mhocko, rientjes
  Cc: akpm, mgorman, hannes, oleg, vbabka, linux-kernel, linux-mm

Michal Hocko wrote:
> [CCing Tetsuo - he was really concerned about the oom deadlocks and he
>  was proposing a timeout based solution as well]

Thank you for CCing me.
My proposal is http://lkml.kernel.org/r/201505232339.DAB00557.VFFLHMSOJFOOtQ@I-love.SAKURA.ne.jp .

> 
> On Thu 20-08-15 14:00:36, David Rientjes wrote:
> > On system oom, a process may fail to exit if its thread depends on a lock
> > held by another allocating process.
> > 
> > In this case, we can detect an oom kill livelock that requires memory
> > allocation to be successful to resolve.
> > 
> > This patch introduces an oom expiration, set to 5s, that defines how long
> > a thread has to exit after being oom killed.
> > 
> > When this period elapses, it is assumed that the thread cannot make
> > forward progress without help.  The only help the VM may provide is to
> > allow pending allocations to succeed, so it grants all allocators access
> > to memory reserves after reclaim and compaction have failed.

Why can't we think about choosing more OOM victims instead of granting access
to memory reserves?

> 
> There might be many threads waiting for the allocation and this can lead
> to quick oom reserves depletion without releasing resources which are
> holding back the oom victim. As Tetsuo has shown, such a load can be
> generated from the userspace without root privileges so it is much
> easier to make the system _completely_ unusable with this patch. Not that
> having an OOM deadlock would be great but you still have emergency tools
> like sysrq triggered OOM killer to attempt to sort the situation out.

Like I described in my proposal, the administrator might not be ready to use
SysRq. Automatic recovery based on a timeout is more useful in such cases
than manual emergency tools.

Also, SysRq might not be usable under OOM because workqueues can get stuck.
The panic_on_oom_timeout was first proposed using a workqueue but was
updated to use a timer because there is no guarantee that workqueues work
as expected under OOM.

> Once your are out of reserves nothing will help you, though. So I think it
> is a bad idea to give access to reserves without any throttling.

I agree.
But I also think that it is a bad idea to cling to only memory reserves.

> 
> Johannes' idea to give a partial access to memory reserves to the task
> which has invoked the OOM killer was much better IMO.

In a different thread, we are planning to change !__GFP_FS allocations.
Some !__GFP_FS allocations are about to acquire __GFP_NOFAIL, which could
in turn suffer from OOM deadlock because the possibility of hitting
"__GFP_NOFAIL allocations not only start calling out_of_memory() but also
start triggering OOM deadlock when out_of_memory() cannot choose next OOM
victim" is increased.

The panic_on_oom_timeout will be overkilling when choosing next OOM victim
can make forward progress. If a local unprivileged user discovers a method
for keeping the OOM state for panic_on_oom_timeout seconds, we will allow
that user to kill the system _completely_.

I think that "out_of_memory() cannot choose next OOM victim" problem
should be addressed before addressing "how to manage memory reserves"
problem.

> -- 
> Michal Hocko
> SUSE Labs
> 


* Re: [patch -mm] mm, oom: add global access to memory reserves on livelock
  2015-08-21  8:17 ` Michal Hocko
  2015-08-21 13:29   ` Tetsuo Handa
@ 2015-08-24 21:04   ` David Rientjes
  2015-08-25 14:25     ` Michal Hocko
  1 sibling, 1 reply; 13+ messages in thread
From: David Rientjes @ 2015-08-24 21:04 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Mel Gorman, Johannes Weiner, Oleg Nesterov,
	Vlastimil Babka, linux-kernel, linux-mm, Tetsuo Handa

On Fri, 21 Aug 2015, Michal Hocko wrote:

> There might be many threads waiting for the allocation and this can lead
> to quick oom reserves depletion without releasing resources which are
> holding back the oom victim. As Tetsuo has shown, such a load can be
> generated from the userspace without root privileges so it is much
> easier to make the system _completely_ unusable with this patch. Not that
> having an OOM deadlock would be great but you still have emergency tools
> like sysrq triggered OOM killer to attempt to sort the situation out.
> Once your are out of reserves nothing will help you, though. So I think it
> is a bad idea to give access to reserves without any throttling.
> 

I don't believe a solution that requires admin intervention is 
maintainable.  It would be better to reboot when memory reserves are fully 
depleted.

> Johannes' idea to give a partial access to memory reserves to the task
> which has invoked the OOM killer was much better IMO.

That's what this patch does, just without the "partial."  Processes are 
required to reclaim and then invoke the oom killer every time an 
allocation is made using memory reserves with this approach after the 
expiration has lapsed.

We can discuss only allowing partial access to memory reserves equal to 
ALLOC_HARD | ALLOC_HARDER, or defining a new watermark, but I'm concerned 
about what happens when that threshold is reached and the oom killer is 
still livelocked.  It would seem better to attempt recovery at whatever 
cost and then panic if fully depleted.


* Re: [patch -mm] mm, oom: add global access to memory reserves on livelock
  2015-08-21 13:29   ` Tetsuo Handa
@ 2015-08-24 21:10     ` David Rientjes
  2015-08-25 15:26       ` Michal Hocko
  0 siblings, 1 reply; 13+ messages in thread
From: David Rientjes @ 2015-08-24 21:10 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: mhocko, akpm, mgorman, hannes, oleg, vbabka, linux-kernel, linux-mm

On Fri, 21 Aug 2015, Tetsuo Handa wrote:

> Why can't we think about choosing more OOM victims instead of granting access
> to memory reserves?
> 

We have no indication of which thread is holding a mutex that would need 
to be killed, so we'd be randomly killing processes waiting for forward 
progress.  A worst-case scenario would be the thread is OOM_DISABLE and we 
kill every process on the system needlessly.  This problem obviously 
occurs often enough that killing all userspace isn't going to be a viable 
solution.

> Also, SysRq might not be usable under OOM because workqueues can get stuck.
> The panic_on_oom_timeout was first proposed using a workqueue but was
> updated to use a timer because there is no guarantee that workqueues work
> as expected under OOM.
> 

I don't know anything about a panic_on_oom_timeout, but panicking would 
only be a reasonable action if memory reserves were fully depleted.  That 
could easily be dealt with in the page allocator so there's no timeout 
involved.


* Re: [patch -mm] mm, oom: add global access to memory reserves on livelock
  2015-08-24 21:04   ` David Rientjes
@ 2015-08-25 14:25     ` Michal Hocko
  2015-08-25 23:41       ` David Rientjes
  0 siblings, 1 reply; 13+ messages in thread
From: Michal Hocko @ 2015-08-25 14:25 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Mel Gorman, Johannes Weiner, Oleg Nesterov,
	Vlastimil Babka, linux-kernel, linux-mm, Tetsuo Handa

On Mon 24-08-15 14:04:28, David Rientjes wrote:
> On Fri, 21 Aug 2015, Michal Hocko wrote:
> 
> > There might be many threads waiting for the allocation and this can lead
> > to quick oom reserves depletion without releasing resources which are
> > holding back the oom victim. As Tetsuo has shown, such a load can be
> > generated from the userspace without root privileges so it is much
> > easier to make the system _completely_ unusable with this patch. Not that
> > having an OOM deadlock would be great but you still have emergency tools
> > like sysrq triggered OOM killer to attempt to sort the situation out.
> > Once your are out of reserves nothing will help you, though. So I think it
> > is a bad idea to give access to reserves without any throttling.
> > 
> 
> I don't believe a solution that requires admin intervention is 
> maintainable.

Why?

> It would be better to reboot when memory reserves are fully depleted.

The question is when the reserves are depleted without any way to
replenish them. While playing with the GFP_NOFS patch set, which gives
__GFP_NOFAIL allocations access to memory reserves
(http://marc.info/?l=linux-mm&m=143876830916540&w=2), I could see the
warning hit while the system still recovered from the memory pressure.

> > Johannes' idea to give a partial access to memory reserves to the task
> > which has invoked the OOM killer was much better IMO.
> 
> That's what this patch does, just without the "partial."  With this 
> approach, processes are required to reclaim and then invoke the oom 
> killer every time an allocation is made using memory reserves after the 
> expiration has lapsed.
> 
> We can discuss only allowing partial access to memory reserves equal to 
> ALLOC_HIGH | ALLOC_HARDER, or defining a new watermark, but I'm concerned 
> about what happens when that threshold is reached and the oom killer is 
> still livelocked.  It would seem better to attempt recovery at whatever 
> cost and then panic if fully depleted.

I think an OOM reserve/watermark makes more sense. It will not solve the
livelock, but neither will granting full access to reserves. Partial
access, however, has the potential to leave some other means to intervene.
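
To make the idea concrete, here is a rough userspace sketch of such a
partial OOM reserve as an extra allocation level between the normal min
watermark and full reserve access. The level names and the one-half
fraction are invented for illustration; this is not kernel code:

```c
#include <stdbool.h>

/* A partial OOM reserve: livelocked-oom allocators may dip into part of
 * the min reserve, but not consume it entirely.  All names and the
 * one-half fraction are hypothetical. */

enum alloc_level {
	ALLOC_NORMAL,		/* must stay above the min watermark */
	ALLOC_OOM_RESERVE,	/* livelocked oom path: partial reserve access */
	ALLOC_FULL_RESERVES,	/* analogous to ALLOC_NO_WATERMARKS */
};

static bool watermark_ok(long free_pages, long min_wmark,
			 enum alloc_level level)
{
	switch (level) {
	case ALLOC_NORMAL:
		return free_pages > min_wmark;
	case ALLOC_OOM_RESERVE:
		/* dip into half of the min reserve but no further, leaving
		 * headroom for emergency paths such as a sysrq oom kill */
		return free_pages > min_wmark / 2;
	case ALLOC_FULL_RESERVES:
		return free_pages > 0;
	}
	return false;
}
```

The point of the extra level is exactly the "means to intervene" above:
even when livelocked allocators are dipping into reserves, the bottom
half remains untouched for other recovery paths.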

-- 
Michal Hocko
SUSE Labs


* Re: [patch -mm] mm, oom: add global access to memory reserves on livelock
  2015-08-24 21:10     ` David Rientjes
@ 2015-08-25 15:26       ` Michal Hocko
  0 siblings, 0 replies; 13+ messages in thread
From: Michal Hocko @ 2015-08-25 15:26 UTC (permalink / raw)
  To: David Rientjes
  Cc: Tetsuo Handa, akpm, mgorman, hannes, oleg, vbabka, linux-kernel,
	linux-mm

On Mon 24-08-15 14:10:10, David Rientjes wrote:
> On Fri, 21 Aug 2015, Tetsuo Handa wrote:
> 
> > Why can't we think about choosing more OOM victims instead of granting access
> > to memory reserves?
> > 
> 
> We have no indication of which thread is holding a mutex that would need 
> to be killed, so we'd be randomly killing processes waiting for forward 
> progress.  A worst-case scenario would be the thread is OOM_DISABLE and we 
> kill every process on the system needlessly.  This problem obviously 
> occurs often enough that killing all userspace isn't going to be a viable 
> solution.
> 
> > Also, SysRq might not be usable under OOM because workqueues can get stuck.
> > The panic_on_oom_timeout was first proposed using a workqueue but was
> > updated to use a timer because there is no guarantee that workqueues work
> > as expected under OOM.
> > 
> 
> I don't know anything about a panic_on_oom_timeout,

You were CCed on the discussion
http://lkml.kernel.org/r/20150609170310.GA8990%40dhcp22.suse.cz

> but panicking would 
> only be a reasonable action if memory reserves were fully depleted.  That 
> could easily be dealt with in the page allocator so there's no timeout 
> involved.

As noted in the other email, depletion alone is not a good indicator. The
system can still make forward progress even when reserves are
depleted.

-- 
Michal Hocko
SUSE Labs


* Re: [patch -mm] mm, oom: add global access to memory reserves on livelock
  2015-08-25 14:25     ` Michal Hocko
@ 2015-08-25 23:41       ` David Rientjes
  2015-08-26  7:01         ` Michal Hocko
  0 siblings, 1 reply; 13+ messages in thread
From: David Rientjes @ 2015-08-25 23:41 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Mel Gorman, Johannes Weiner, Oleg Nesterov,
	Vlastimil Babka, linux-kernel, linux-mm, Tetsuo Handa

On Tue, 25 Aug 2015, Michal Hocko wrote:

> > I don't believe a solution that requires admin intervention is 
> > maintainable.
> 
> Why?
> 

Because the company I work for has far too many machines for that to be 
possible.

> > It would be better to reboot when memory reserves are fully depleted.
> 
> The question is when are the reserves depleted without any way to
> replenish them. While playing with GFP_NOFS patch set which gives
> __GFP_NOFAIL allocations access to memory reserves
> (http://marc.info/?l=linux-mm&m=143876830916540&w=2) I could see the
> warning hit while the system still resurrected from the memory pressure.
> 

If there is a holder of a mutex that then allocates gigabytes of memory, 
no amount of memory reserves is going to assist in resolving an oom killer 
livelock, whether that's partial access to memory reserves or full access 
to memory reserves.

You're referring to two different conditions:

 (1) oom livelock as a result of an oom kill victim waiting on a lock that
     is held by an allocator, and

 (2) depletion of memory reserves, which can also happen today without 
     this patchset and we have fixed in the past.

This patch addresses (1) by giving the system a higher probability of 
making forward progress, absent the ability to determine which thread is 
holding the lock that the victim depends on.  It would be fine to address 
(2), panicking when memory reserves have been depleted, as a separate 
patch, since it is a separate problem that I agree now has a higher 
likelihood of happening.

> I think an OOM reserve/watermark makes more sense. It will not solve the
> livelock but neither granting the full access to reserves will. But the
> partial access has a potential to leave some other means to intervene.
> 

Unless the oom watermark was higher than the lowest access to memory 
reserves other than ALLOC_NO_WATERMARKS, then no forward progress would be 
made in this scenario.  I think it would be better to give access to that 
crucial last page that may solve the livelock to make forward progress, or 
panic as a result of complete depletion of memory reserves.  That panic() 
is a very trivial patch that can be checked in the allocator slowpath and 
addresses a problem that already exists today.
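
As a rough userspace sketch of that check (enum and helper names are
invented; the real allocator slowpath would consult per-zone free page
counters rather than a single number):

```c
#include <stdbool.h>

/* Grant reserve access while an oom livelock persists, and panic only
 * once reserves are completely gone.  All names are hypothetical. */

enum slowpath_action {
	SLOWPATH_RETRY,		 /* normal reclaim/compaction retry loop */
	SLOWPATH_GRANT_RESERVES, /* oom expiration lapsed: allow reserve access */
	SLOWPATH_PANIC,		 /* livelocked with nothing left to hand out */
};

static enum slowpath_action pick_slowpath_action(long nr_free_pages,
						 bool oom_livelocked)
{
	if (oom_livelocked) {
		if (nr_free_pages == 0)
			return SLOWPATH_PANIC;	/* complete depletion */
		return SLOWPATH_GRANT_RESERVES;	/* crucial last pages */
	}
	return SLOWPATH_RETRY;
}
```

The ordering matters: the panic is the last resort, taken only after the
crucial last pages have already been offered to the livelocked victim.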


* Re: [patch -mm] mm, oom: add global access to memory reserves on livelock
  2015-08-25 23:41       ` David Rientjes
@ 2015-08-26  7:01         ` Michal Hocko
  2015-08-26 22:23           ` David Rientjes
  0 siblings, 1 reply; 13+ messages in thread
From: Michal Hocko @ 2015-08-26  7:01 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Mel Gorman, Johannes Weiner, Oleg Nesterov,
	Vlastimil Babka, linux-kernel, linux-mm, Tetsuo Handa

On Tue 25-08-15 16:41:29, David Rientjes wrote:
> On Tue, 25 Aug 2015, Michal Hocko wrote:
> 
> > > I don't believe a solution that requires admin intervention is 
> > > maintainable.
> > 
> > Why?
> > 
> 
> Because the company I work for has far too many machines for that to be 
> possible.

OK I can see that manual intervention for hundreds of machines is not
practical. But not everybody is that large and there are users who might
want to be able to recover.
 
> > > It would be better to reboot when memory reserves are fully depleted.
> > 
> > The question is when are the reserves depleted without any way to
> > replenish them. While playing with GFP_NOFS patch set which gives
> > __GFP_NOFAIL allocations access to memory reserves
> > (http://marc.info/?l=linux-mm&m=143876830916540&w=2) I could see the
> > warning hit while the system still resurrected from the memory pressure.
> > 
> 
> If there is a holder of a mutex that then allocates gigabytes of memory, 
> no amount of memory reserves is going to assist in resolving an oom killer 
> livelock, whether that's partial access to memory reserves or full access 
> to memory reserves.

Sure, but do we have something like that in the kernel? I would argue it
would be terribly broken and a clear bug which should be fixed.

> You're referring to two different conditions:
> 
>  (1) oom livelock as a result of an oom kill victim waiting on a lock that
>      is held by an allocator, and
> 
>  (2) depletion of memory reserves, which can also happen today without 
>      this patchset and we have fixed in the past.
> 
> This patch addresses (1) by giving it a higher probability, absent the 
> ability to determine which thread is holding the lock that the victim 
> depends on, to make forward progress.  It would be fine to do (2) as a 
> separate patch, since it is a separate problem, that I agree has a higher 
> likelihood of happening now to panic when memory reserves have been 
> depleted.
> 
> > I think an OOM reserve/watermark makes more sense. It will not solve the
> > livelock but neither granting the full access to reserves will. But the
> > partial access has a potential to leave some others means to intervene.
> > 
> 
> Unless the oom watermark was higher than the lowest access to memory 
> reserves other than ALLOC_NO_WATERMARKS, then no forward progress would be 
> made in this scenario.  I think it would be better to give access to that 
> crucial last page that may solve the livelock to make forward progress, or 
> panic as a result of complete depletion of memory reserves.  That panic() 
> is a very trivial patch that can be checked in the allocator slowpath and 
> addresses a problem that already exists today.

Panicking the system is really trivial, no question about that. The
question is whether that panic would be premature, and that is what
I've tried to tell you. The patch I am referring to above gives the
__GFP_NOFAIL request full access to memory reserves (throttled by
oom_lock), but it still failed temporarily. More importantly, this
state wasn't permanent and the system recovered after a short time, so
panicking at the time would have been premature.

-- 
Michal Hocko
SUSE Labs


* Re: [patch -mm] mm, oom: add global access to memory reserves on livelock
  2015-08-26  7:01         ` Michal Hocko
@ 2015-08-26 22:23           ` David Rientjes
  2015-08-27 12:41             ` Michal Hocko
  0 siblings, 1 reply; 13+ messages in thread
From: David Rientjes @ 2015-08-26 22:23 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Mel Gorman, Johannes Weiner, Oleg Nesterov,
	Vlastimil Babka, linux-kernel, linux-mm, Tetsuo Handa

On Wed, 26 Aug 2015, Michal Hocko wrote:

> > Because the company I work for has far too many machines for that to be 
> > possible.
> 
> OK I can see that manual intervention for hundreds of machines is not
> practical. But not everybody is that large and there are users who might
> want to be able to recover.
>  

If Andrew would prefer moving in a direction where all Linux users are 
required to have their admin use sysrq+f to manually trigger an oom kill, 
which may or may not resolve the livelock since there's no way to 
determine which process is holding the common mutex (or even which 
processes are currently allocating), in such situations, then we can carry 
this patch internally.  I disagree with that solution for upstream Linux.

> > If there is a holder of a mutex that then allocates gigabytes of memory, 
> > no amount of memory reserves is going to assist in resolving an oom killer 
> > livelock, whether that's partial access to memory reserves or full access 
> > to memory reserves.
> 
> Sure, but do we have something like that in the kernel? I would argue it
> would be terribly broken and a clear bug which should be fixed.
> 

This is also why my patch dumps the stack trace of both threads: so we can 
evaluate the memory allocation of threads holding shared mutexes.  If it 
is excessive, we can report that and show that it is a common offender and 
see if we can mitigate that.

The scenario described, the full or partial depletion of memory reserves, 
does not need to be induced by a single user.  We don't control the order 
in which the mutex is grabbed, so it's multiplied by the number of threads 
that grab it, allocate memory, and drop it before the victim has a chance 
to grab it.  In the past, the oom killer would also increase the 
scheduling priority of a victim so it tried to resolve issues like this 
faster.

> > Unless the oom watermark was higher than the lowest access to memory 
> > reserves other than ALLOC_NO_WATERMARKS, then no forward progress would be 
> > made in this scenario.  I think it would be better to give access to that 
> > crucial last page that may solve the livelock to make forward progress, or 
> > panic as a result of complete depletion of memory reserves.  That panic() 
> > is a very trivial patch that can be checked in the allocator slowpath and 
> > addresses a problem that already exists today.
> 
> Panicking the system is really trivial, no question about that. The
> question is whether that panic would be premature. And that is what
> I've tried to tell you.

My patch has defined that by OOM_EXPIRE_MSECS.  The premise is that an oom 
victim with full access to memory reserves should never take more than 5s 
to exit, which I consider a very long time.  If it's increased, we see 
userspace responsiveness issues with our processes that monitor system 
health which timeout.
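
A minimal userspace sketch of that expiration check, with invented
struct and helper names (the patch's actual bookkeeping will differ):

```c
#include <stdbool.h>

/* Once a victim has had OOM_EXPIRE_MSECS (default 5000) to exit and has
 * not, pending allocators are granted access to memory reserves. */

#define OOM_EXPIRE_MSECS 5000UL

struct oom_victim {
	unsigned long killed_at_msecs;	/* when SIGKILL was delivered */
	bool exited;			/* has the victim finished exiting? */
};

/* Returns true when the victim has outlived its grace period, meaning
 * pending allocations should be allowed to dip into reserves. */
static bool oom_expired(const struct oom_victim *v, unsigned long now_msecs)
{
	if (v->exited)
		return false;
	return now_msecs - v->killed_at_msecs >= OOM_EXPIRE_MSECS;
}
```
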

Sure, it's always possible that a process that requires no mutexes that 
are held by allocators exits and frees a lot of memory 10s later, 5m 
later, etc., and the system can recover.  We have no guarantee of that 
happening, so the panic point needs to be defined where the VM gives up 
and it's unlikely anything can make forward progress.  My suggestion would 
be when memory reserves are fully depleted, but again, that is something 
that can be shown to happen today and is a separate issue.

> The patch I am referring to above gives the
> __GFP_NOFAIL request the full access to memory reserves (throttled by
> oom_lock) but it still failed temporarily. What is more important,
> though, this state wasn't permanent and the system recovered after short
> time so panicing at the time would be premature.
> 

I'm not addressing __GFP_NOFAIL in this patch, and __GFP_NOFAIL is not the 
pain point that this patch is addressing, so I consider it a different 
issue.  We don't allow atomic __GFP_NOFAIL allocations, so the context is 
really saying "I need this memory, and I'm willing to wait until it's 
available."  I only consider that to be significant to oom killer livelock 
if it is holding a mutex that the victim depends on to handle its SIGKILL.  
With my patch, that thread would also get access to memory reserves to 
make forward progress after the expiration has lapsed.


* Re: [patch -mm] mm, oom: add global access to memory reserves on livelock
  2015-08-26 22:23           ` David Rientjes
@ 2015-08-27 12:41             ` Michal Hocko
  2015-08-27 20:52               ` David Rientjes
  0 siblings, 1 reply; 13+ messages in thread
From: Michal Hocko @ 2015-08-27 12:41 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Mel Gorman, Johannes Weiner, Oleg Nesterov,
	Vlastimil Babka, linux-kernel, linux-mm, Tetsuo Handa

On Wed 26-08-15 15:23:07, David Rientjes wrote:
> On Wed, 26 Aug 2015, Michal Hocko wrote:
> 
> > > Because the company I work for has far too many machines for that to be 
> > > possible.
> > 
> > OK I can see that manual intervention for hundreds of machines is not
> > practical. But not everybody is that large and there are users who might
> > want to be able to recover.
> >  
> 
> If Andrew would prefer moving in a direction where all Linux users are 
> required to have their admin use sysrq+f to manually trigger an oom kill, 
> which may or may not resolve the livelock since there's no way to 
> determine which process is holding the common mutex (or even which 
> processes are currently allocating), in such situations, then we can carry 
> this patch internally.  I disagree with that solution for upstream Linux.

There are other possibilities than the manual sysrq intervention, e.g.
the already mentioned oom_{panic,reboot}_timeout, which has the
advantage that it allows the admin to opt in to the policy rather than
having it hard-coded into the kernel.
 
> > > If there is a holder of a mutex that then allocates gigabytes of memory, 
> > > no amount of memory reserves is going to assist in resolving an oom killer 
> > > livelock, whether that's partial access to memory reserves or full access 
> > > to memory reserves.
> > 
> > Sure, but do we have something like that in the kernel? I would argue it
> > would be terribly broken and a clear bug which should be fixed.
> > 
> 
> This is also why my patch dumps the stack trace of both threads: so we can 
> evaluate the memory allocation of threads holding shared mutexes.  If it 
> is excessive, we can report that and show that it is a common offender and 
> see if we can mitigate that.
> 
> The scenario described, the full or partial depletion of memory reserves, 
> does not need to be induced by a single user.  We don't control the order 
> in which the mutex is grabbed so it's multipled by the number of threads 
> that grab it, allocate memory, and drop it before the victim has a chance 
> to grab it.  In the past, the oom killer would also increase the 
> scheduling priority of a victim so it tried to resolve issues like this 
> faster.

> > > Unless the oom watermark was higher than the lowest access to memory 
> > > reserves other than ALLOC_NO_WATERMARKS, then no forward progress would be 
> > > made in this scenario.  I think it would be better to give access to that 
> > > crucial last page that may solve the livelock to make forward progress, or 
> > > panic as a result of complete depletion of memory reserves.  That panic() 
> > > is a very trivial patch that can be checked in the allocator slowpath and 
> > > addresses a problem that already exists today.
> > 
> > Panicking the system is really trivial, no question about that. The
> > question is whether that panic would be premature. And that is what
> > I've tried to tell you.
> 
> My patch has defined that by OOM_EXPIRE_MSECS.  The premise is that an oom 
> victim with full access to memory reserves should never take more than 5s 
> to exit, which I consider a very long time.  If it's increased, we see 
> userspace responsiveness issues with our processes that monitor system 
> health which timeout.

Yes, but it sounds very much like a policy that would be better defined
from userspace, because different users might have different
preferences.

-- 
Michal Hocko
SUSE Labs


* Re: [patch -mm] mm, oom: add global access to memory reserves on livelock
  2015-08-27 12:41             ` Michal Hocko
@ 2015-08-27 20:52               ` David Rientjes
  0 siblings, 0 replies; 13+ messages in thread
From: David Rientjes @ 2015-08-27 20:52 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Mel Gorman, Johannes Weiner, Oleg Nesterov,
	Vlastimil Babka, linux-kernel, linux-mm, Tetsuo Handa

On Thu, 27 Aug 2015, Michal Hocko wrote:

> > If Andrew would prefer moving in a direction where all Linux users are 
> > required to have their admin use sysrq+f to manually trigger an oom kill, 
> > which may or may not resolve the livelock since there's no way to 
> > determine which process is holding the common mutex (or even which 
> > processes are currently allocating), in such situations, then we can carry 
> > this patch internally.  I disagree with that solution for upstream Linux.
> 
> There are other possibilities than the manual sysrq intervention. E.g.
> the already mentioned oom_{panic,reboot}_timeout which has a little
> advantage that it allows admin to opt in into the policy rather than
> having it hard coded into the kernel.
>  

This falls under my scenario (2) from Tuesday's message:

 (2) depletion of memory reserves, which can also happen today without 
     this patchset and we have fixed in the past.

You can deplete memory reserves today without access to global reserves on 
oom livelock.  I'm indifferent to whether the machine panics as soon as 
memory reserves are fully depleted, independent of oom livelock and this 
patch to address it, or whether there is a configurable timeout.  It's an 
independent issue, though, since the oom killer is not the only way for 
this to happen and it seems there will be additional possibilities in the 
future (the __GFP_NOFAIL case you bring up).

> > My patch has defined that by OOM_EXPIRE_MSECS.  The premise is that an oom 
> > victim with full access to memory reserves should never take more than 5s 
> > to exit, which I consider a very long time.  If it's increased, we see 
> > userspace responsiveness issues with our processes that monitor system 
> > health which timeout.
> 
> Yes but it sounds very much like a policy which should better be defined
> from the userspace because different users might have different
> preferences.
> 

My patch internally actually does make this configurable through yet 
another VM sysctl and it defaults to what OOM_EXPIRE_MSECS does in my 
patch.  We would probably never increase it, but may decrease it from the 
default of 5000.  I was concerned about adding another sysctl that doesn't 
have a clear user.  If you feel that OOM_EXPIRE_MSECS is too small and 
believe there would be a user who desires their system to be livelocked 
for 10s, 5m, 1h, etc., then I can add the sysctl upstream as well, even if 
it's 
unjustified as far as I'm concerned.


end of thread, other threads:[~2015-08-27 20:52 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-08-20 21:00 [patch -mm] mm, oom: add global access to memory reserves on livelock David Rientjes
2015-08-20 23:10 ` Andrew Morton
2015-08-21  8:17 ` Michal Hocko
2015-08-21 13:29   ` Tetsuo Handa
2015-08-24 21:10     ` David Rientjes
2015-08-25 15:26       ` Michal Hocko
2015-08-24 21:04   ` David Rientjes
2015-08-25 14:25     ` Michal Hocko
2015-08-25 23:41       ` David Rientjes
2015-08-26  7:01         ` Michal Hocko
2015-08-26 22:23           ` David Rientjes
2015-08-27 12:41             ` Michal Hocko
2015-08-27 20:52               ` David Rientjes
