linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/4] oom: sysrq+f fixes + cleanups
@ 2015-07-08 13:04 Michal Hocko
  2015-07-08 13:04 ` [PATCH 1/4] oom: Do not panic when OOM killer is sysrq triggered Michal Hocko
                   ` (3 more replies)
  0 siblings, 4 replies; 24+ messages in thread
From: Michal Hocko @ 2015-07-08 13:04 UTC (permalink / raw)
  To: Andrew Morton; +Cc: David Rientjes, Jakob Unterwurzacher, linux-mm, LKML

Hi,
some of these patches have been posted already: http://marc.info/?l=linux-mm&m=143462145830969&w=2
This series contains an additional fix for another sysrq+f issue
reported off mailing list (patch #2).

First two patches are clear fixes. The third patch is from David with
my minor changes. The last patch is a cleanup but I have put it after
others because it has seen some opposition in the past.

Shortlog says:
David Rientjes (1):
      mm, oom: organize oom context into struct

Michal Hocko (3):
      oom: Do not panic when OOM killer is sysrq triggered
      oom: Do not invoke oom notifiers on sysrq+f
      oom: split out forced OOM killer

Thanks!

And diffstat:
 Documentation/sysrq.txt |   5 +-
 drivers/tty/sysrq.c     |   3 +-
 include/linux/oom.h     |  26 +++++----
 mm/memcontrol.c         |  13 +++--
 mm/oom_kill.c           | 141 +++++++++++++++++++++++++++---------------------
 mm/page_alloc.c         |   9 +++-
 6 files changed, 116 insertions(+), 81 deletions(-)



^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 1/4] oom: Do not panic when OOM killer is sysrq triggered
  2015-07-08 13:04 [PATCH 0/4] oom: sysrq+f fixes + cleanups Michal Hocko
@ 2015-07-08 13:04 ` Michal Hocko
  2015-07-08 23:36   ` David Rientjes
  2015-07-08 13:04 ` [PATCH 2/4] oom: Do not invoke oom notifiers on sysrq+f Michal Hocko
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 24+ messages in thread
From: Michal Hocko @ 2015-07-08 13:04 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Rientjes, Jakob Unterwurzacher, linux-mm, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.cz>

OOM killer might be triggered explicitly via sysrq+f. This is supposed
to kill a task no matter what e.g. a task is selected even though there
is an OOM victim on the way to exit. This is a big hammer for an admin
to help to resolve a memory short condition when the system is not able
to cope with it on its own in a reasonable time frame (e.g. when the
system is trashing or the OOM killer cannot make sufficient progress)

E.g. it doesn't make any sense to obey panic_on_oom setting because
a) administrator could have used other sysrqs to achieve the
panic/reboot and b) the policy would break an existing usecase to
kill a memory hog which would be recoverable unlike the panic which
might be configured for the real OOM condition.

It also doesn't make much sense to panic the system when there is no
OOM killable task because administrator might choose to do additional
steps before rebooting/panicking the system.

While we are there also add a comment explaining why
sysctl_oom_kill_allocating_task doesn't apply to sysrq triggered OOM
killer even though there is no explicit check and we subtly rely
on current->mm being NULL for the context from which it is triggered.

Also be more explicit about sysrq+f behavior in the documentation.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 Documentation/sysrq.txt |  5 ++++-
 mm/oom_kill.c           | 15 ++++++++++++---
 2 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/Documentation/sysrq.txt b/Documentation/sysrq.txt
index 0e307c94809a..7664e93411d2 100644
--- a/Documentation/sysrq.txt
+++ b/Documentation/sysrq.txt
@@ -75,7 +75,10 @@ On other - If you know of the key combos for other architectures, please
 
 'e'     - Send a SIGTERM to all processes, except for init.
 
-'f'	- Will call oom_kill to kill a memory hog process.
+'f'	- Will call oom_kill to kill a memory hog process. Please note that
+	  parallel OOM killer is ignored and a task is killed even though
+	  there was an oom victim selected already. panic_on_oom is ignored
+	  and the system doesn't panic if there are no oom killable tasks.
 
 'g'	- Used by kgdb (kernel debugger)
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index dff991e0681e..f2737d66f66a 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -687,8 +687,13 @@ bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 	constraint = constrained_alloc(zonelist, gfp_mask, nodemask,
 						&totalpages);
 	mpol_mask = (constraint == CONSTRAINT_MEMORY_POLICY) ? nodemask : NULL;
-	check_panic_on_oom(constraint, gfp_mask, order, mpol_mask, NULL);
+	if (!force_kill)
+		check_panic_on_oom(constraint, gfp_mask, order, mpol_mask, NULL);
 
+	/*
+	 * not affecting force_kill because sysrq triggered OOM killer runs from
+	 * the workqueue context so current->mm will be NULL
+	 */
 	if (sysctl_oom_kill_allocating_task && current->mm &&
 	    !oom_unkillable_task(current, NULL, nodemask) &&
 	    current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
@@ -702,8 +707,12 @@ bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 	p = select_bad_process(&points, totalpages, mpol_mask, force_kill);
 	/* Found nothing?!?! Either we hang forever, or we panic. */
 	if (!p) {
-		dump_header(NULL, gfp_mask, order, NULL, mpol_mask);
-		panic("Out of memory and no killable processes...\n");
+		if (!force_kill) {
+			dump_header(NULL, gfp_mask, order, NULL, mpol_mask);
+			panic("Out of memory and no killable processes...\n");
+		} else {
+			pr_info("Sysrq triggered out of memory. No killable task found...\n");
+		}
 	}
 	if (p != (void *)-1UL) {
 		oom_kill_process(p, gfp_mask, order, points, totalpages, NULL,
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 2/4] oom: Do not invoke oom notifiers on sysrq+f
  2015-07-08 13:04 [PATCH 0/4] oom: sysrq+f fixes + cleanups Michal Hocko
  2015-07-08 13:04 ` [PATCH 1/4] oom: Do not panic when OOM killer is sysrq triggered Michal Hocko
@ 2015-07-08 13:04 ` Michal Hocko
  2015-07-08 23:37   ` David Rientjes
  2015-07-08 13:04 ` [PATCH 3/4] mm, oom: organize oom context into struct Michal Hocko
  2015-07-08 13:04 ` [PATCH 4/4] oom: split out forced OOM killer Michal Hocko
  3 siblings, 1 reply; 24+ messages in thread
From: Michal Hocko @ 2015-07-08 13:04 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Rientjes, Jakob Unterwurzacher, linux-mm, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.cz>

A github user rfjakob has reported the following issue via IRC.
<rfjakob> Manually triggering the OOM killer does not work anymore in 4.0.5
<rfjakob> This is what it looks like: https://gist.github.com/rfjakob/346b7dc611fc3cdf4011
<rfjakob> Basically, what happens is that the GPU driver frees some memory, that satisfies the OOM killer
<rfjakob> But the memory is allocated immediately again, and in the, no processes are killed no matter how often you trigger the oom killer
<rfjakob> "in the end"

Quoting from the github:
"
[19291.202062] sysrq: SysRq : Manual OOM execution
[19291.208335] Purging GPU memory, 74399744 bytes freed, 8728576 bytes still pinned.
[19291.390767] sysrq: SysRq : Manual OOM execution
[19291.396792] Purging GPU memory, 74452992 bytes freed, 8728576 bytes still pinned.
[19291.560349] sysrq: SysRq : Manual OOM execution
[19291.566018] Purging GPU memory, 75489280 bytes freed, 8728576 bytes still pinned.
[19291.729944] sysrq: SysRq : Manual OOM execution
[19291.735686] Purging GPU memory, 74399744 bytes freed, 8728576 bytes still pinned.
[19291.918637] sysrq: SysRq : Manual OOM execution
[19291.924299] Purging GPU memory, 74403840 bytes freed, 8728576 bytes still pinned.
"

The issue is that sysrq+f (force_kill) gets confused by the regular OOM
heuristic which tries to prevent from OOM killer if some of the oom
notifier can relase a memory. The heuristic doesn't make much sense for
the sysrq+f path because this one is used by the administrator to kill
a memory hog.

Reported-by: Jakob Unterwurzacher <jakobunt@gmail.com>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 mm/oom_kill.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f2737d66f66a..0b1b0b25f928 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -661,10 +661,12 @@ bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 	if (oom_killer_disabled)
 		return false;
 
-	blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
-	if (freed > 0)
-		/* Got some memory back in the last second. */
-		goto out;
+	if (!force_kill) {
+		blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
+		if (freed > 0)
+			/* Got some memory back in the last second. */
+			goto out;
+	}
 
 	/*
 	 * If current has a pending SIGKILL or is exiting, then automatically
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 3/4] mm, oom: organize oom context into struct
  2015-07-08 13:04 [PATCH 0/4] oom: sysrq+f fixes + cleanups Michal Hocko
  2015-07-08 13:04 ` [PATCH 1/4] oom: Do not panic when OOM killer is sysrq triggered Michal Hocko
  2015-07-08 13:04 ` [PATCH 2/4] oom: Do not invoke oom notifiers on sysrq+f Michal Hocko
@ 2015-07-08 13:04 ` Michal Hocko
  2015-07-08 23:38   ` David Rientjes
  2015-07-08 13:04 ` [PATCH 4/4] oom: split out forced OOM killer Michal Hocko
  3 siblings, 1 reply; 24+ messages in thread
From: Michal Hocko @ 2015-07-08 13:04 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Rientjes, Jakob Unterwurzacher, linux-mm, LKML, Michal Hocko

From: David Rientjes <rientjes@google.com>

There are essential elements to an oom context that are passed around to
multiple functions.

Organize these elements into a new struct, struct oom_context, that
specifies the context for an oom condition.

This patch introduces no functional change.

[mhocko@suse.cz: s@oom_control@oom_context@]
[mhocko@suse.cz: do not initialize on stack oom_context with NULL or 0]
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 drivers/tty/sysrq.c |  10 ++++-
 include/linux/oom.h |  25 +++++++-----
 mm/memcontrol.c     |  13 +++---
 mm/oom_kill.c       | 115 +++++++++++++++++++++++-----------------------------
 mm/page_alloc.c     |   9 +++-
 5 files changed, 89 insertions(+), 83 deletions(-)

diff --git a/drivers/tty/sysrq.c b/drivers/tty/sysrq.c
index b20d2c0ec451..865b837a9aee 100644
--- a/drivers/tty/sysrq.c
+++ b/drivers/tty/sysrq.c
@@ -356,9 +356,15 @@ static struct sysrq_key_op sysrq_term_op = {
 
 static void moom_callback(struct work_struct *ignored)
 {
+	const gfp_t gfp_mask = GFP_KERNEL;
+	struct oom_context oc = {
+		.zonelist = node_zonelist(first_memory_node, gfp_mask),
+		.gfp_mask = gfp_mask,
+		.force_kill = true,
+	};
+
 	mutex_lock(&oom_lock);
-	if (!out_of_memory(node_zonelist(first_memory_node, GFP_KERNEL),
-			   GFP_KERNEL, 0, NULL, true))
+	if (!out_of_memory(&oc))
 		pr_info("OOM request ignored because killer is disabled\n");
 	mutex_unlock(&oom_lock);
 }
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 7deecb7bca5e..094407cb2d2e 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -12,6 +12,14 @@ struct notifier_block;
 struct mem_cgroup;
 struct task_struct;
 
+struct oom_context {
+	struct zonelist *zonelist;
+	nodemask_t	*nodemask;
+	gfp_t		gfp_mask;
+	int		order;
+	bool		force_kill;
+};
+
 /*
  * Types of limitations to the nodes from which allocations may occur
  */
@@ -57,21 +65,18 @@ extern unsigned long oom_badness(struct task_struct *p,
 
 extern int oom_kills_count(void);
 extern void note_oom_kill(void);
-extern void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
+extern void oom_kill_process(struct oom_context *oc, struct task_struct *p,
 			     unsigned int points, unsigned long totalpages,
-			     struct mem_cgroup *memcg, nodemask_t *nodemask,
-			     const char *message);
+			     struct mem_cgroup *memcg, const char *message);
 
-extern void check_panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask,
-			       int order, const nodemask_t *nodemask,
+extern void check_panic_on_oom(struct oom_context *oc,
+			       enum oom_constraint constraint,
 			       struct mem_cgroup *memcg);
 
-extern enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
-		unsigned long totalpages, const nodemask_t *nodemask,
-		bool force_kill);
+extern enum oom_scan_t oom_scan_process_thread(struct oom_context *oc,
+		struct task_struct *task, unsigned long totalpages);
 
-extern bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
-		int order, nodemask_t *mask, bool force_kill);
+extern bool out_of_memory(struct oom_context *oc);
 
 extern void exit_oom_victim(void);
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index acb93c554f6e..7ad5352bd3f0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1545,6 +1545,10 @@ static unsigned long mem_cgroup_get_limit(struct mem_cgroup *memcg)
 static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 				     int order)
 {
+	struct oom_context oc = {
+		.gfp_mask = gfp_mask,
+		.order = order,
+	};
 	struct mem_cgroup *iter;
 	unsigned long chosen_points = 0;
 	unsigned long totalpages;
@@ -1563,7 +1567,7 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 		goto unlock;
 	}
 
-	check_panic_on_oom(CONSTRAINT_MEMCG, gfp_mask, order, NULL, memcg);
+	check_panic_on_oom(&oc, CONSTRAINT_MEMCG, memcg);
 	totalpages = mem_cgroup_get_limit(memcg) ? : 1;
 	for_each_mem_cgroup_tree(iter, memcg) {
 		struct css_task_iter it;
@@ -1571,8 +1575,7 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 
 		css_task_iter_start(&iter->css, &it);
 		while ((task = css_task_iter_next(&it))) {
-			switch (oom_scan_process_thread(task, totalpages, NULL,
-							false)) {
+			switch (oom_scan_process_thread(&oc, task, totalpages)) {
 			case OOM_SCAN_SELECT:
 				if (chosen)
 					put_task_struct(chosen);
@@ -1610,8 +1613,8 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 
 	if (chosen) {
 		points = chosen_points * 1000 / totalpages;
-		oom_kill_process(chosen, gfp_mask, order, points, totalpages,
-				 memcg, NULL, "Memory cgroup out of memory");
+		oom_kill_process(&oc, chosen, points, totalpages, memcg,
+				 "Memory cgroup out of memory");
 	}
 unlock:
 	mutex_unlock(&oom_lock);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 0b1b0b25f928..01aa4cb86857 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -196,27 +196,26 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
  * Determine the type of allocation constraint.
  */
 #ifdef CONFIG_NUMA
-static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
-				gfp_t gfp_mask, nodemask_t *nodemask,
-				unsigned long *totalpages)
+static enum oom_constraint constrained_alloc(struct oom_context *oc,
+					     unsigned long *totalpages)
 {
 	struct zone *zone;
 	struct zoneref *z;
-	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
+	enum zone_type high_zoneidx = gfp_zone(oc->gfp_mask);
 	bool cpuset_limited = false;
 	int nid;
 
 	/* Default to all available memory */
 	*totalpages = totalram_pages + total_swap_pages;
 
-	if (!zonelist)
+	if (!oc->zonelist)
 		return CONSTRAINT_NONE;
 	/*
 	 * Reach here only when __GFP_NOFAIL is used. So, we should avoid
 	 * to kill current.We have to random task kill in this case.
 	 * Hopefully, CONSTRAINT_THISNODE...but no way to handle it, now.
 	 */
-	if (gfp_mask & __GFP_THISNODE)
+	if (oc->gfp_mask & __GFP_THISNODE)
 		return CONSTRAINT_NONE;
 
 	/*
@@ -224,17 +223,18 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
 	 * the page allocator means a mempolicy is in effect.  Cpuset policy
 	 * is enforced in get_page_from_freelist().
 	 */
-	if (nodemask && !nodes_subset(node_states[N_MEMORY], *nodemask)) {
+	if (oc->nodemask &&
+	    !nodes_subset(node_states[N_MEMORY], *oc->nodemask)) {
 		*totalpages = total_swap_pages;
-		for_each_node_mask(nid, *nodemask)
+		for_each_node_mask(nid, *oc->nodemask)
 			*totalpages += node_spanned_pages(nid);
 		return CONSTRAINT_MEMORY_POLICY;
 	}
 
 	/* Check this allocation failure is caused by cpuset's wall function */
-	for_each_zone_zonelist_nodemask(zone, z, zonelist,
-			high_zoneidx, nodemask)
-		if (!cpuset_zone_allowed(zone, gfp_mask))
+	for_each_zone_zonelist_nodemask(zone, z, oc->zonelist,
+			high_zoneidx, oc->nodemask)
+		if (!cpuset_zone_allowed(zone, oc->gfp_mask))
 			cpuset_limited = true;
 
 	if (cpuset_limited) {
@@ -246,20 +246,18 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
 	return CONSTRAINT_NONE;
 }
 #else
-static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
-				gfp_t gfp_mask, nodemask_t *nodemask,
-				unsigned long *totalpages)
+static enum oom_constraint constrained_alloc(struct oom_context *oc,
+					     unsigned long *totalpages)
 {
 	*totalpages = totalram_pages + total_swap_pages;
 	return CONSTRAINT_NONE;
 }
 #endif
 
-enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
-		unsigned long totalpages, const nodemask_t *nodemask,
-		bool force_kill)
+enum oom_scan_t oom_scan_process_thread(struct oom_context *oc,
+			struct task_struct *task, unsigned long totalpages)
 {
-	if (oom_unkillable_task(task, NULL, nodemask))
+	if (oom_unkillable_task(task, NULL, oc->nodemask))
 		return OOM_SCAN_CONTINUE;
 
 	/*
@@ -267,7 +265,7 @@ enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
 	 * Don't allow any other task to have access to the reserves.
 	 */
 	if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
-		if (!force_kill)
+		if (!oc->force_kill)
 			return OOM_SCAN_ABORT;
 	}
 	if (!task->mm)
@@ -280,7 +278,7 @@ enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
 	if (oom_task_origin(task))
 		return OOM_SCAN_SELECT;
 
-	if (task_will_free_mem(task) && !force_kill)
+	if (task_will_free_mem(task) && !oc->force_kill)
 		return OOM_SCAN_ABORT;
 
 	return OOM_SCAN_OK;
@@ -289,12 +287,9 @@ enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
 /*
  * Simple selection loop. We chose the process with the highest
  * number of 'points'.  Returns -1 on scan abort.
- *
- * (not docbooked, we don't want this one cluttering up the manual)
  */
-static struct task_struct *select_bad_process(unsigned int *ppoints,
-		unsigned long totalpages, const nodemask_t *nodemask,
-		bool force_kill)
+static struct task_struct *select_bad_process(struct oom_context *oc,
+		unsigned int *ppoints, unsigned long totalpages)
 {
 	struct task_struct *g, *p;
 	struct task_struct *chosen = NULL;
@@ -304,8 +299,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
 	for_each_process_thread(g, p) {
 		unsigned int points;
 
-		switch (oom_scan_process_thread(p, totalpages, nodemask,
-						force_kill)) {
+		switch (oom_scan_process_thread(oc, p, totalpages)) {
 		case OOM_SCAN_SELECT:
 			chosen = p;
 			chosen_points = ULONG_MAX;
@@ -318,7 +312,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
 		case OOM_SCAN_OK:
 			break;
 		};
-		points = oom_badness(p, NULL, nodemask, totalpages);
+		points = oom_badness(p, NULL, oc->nodemask, totalpages);
 		if (!points || points < chosen_points)
 			continue;
 		/* Prefer thread group leaders for display purposes */
@@ -380,13 +374,13 @@ static void dump_tasks(struct mem_cgroup *memcg, const nodemask_t *nodemask)
 	rcu_read_unlock();
 }
 
-static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
-			struct mem_cgroup *memcg, const nodemask_t *nodemask)
+static void dump_header(struct oom_context *oc, struct task_struct *p,
+			struct mem_cgroup *memcg)
 {
 	task_lock(current);
 	pr_warning("%s invoked oom-killer: gfp_mask=0x%x, order=%d, "
 		"oom_score_adj=%hd\n",
-		current->comm, gfp_mask, order,
+		current->comm, oc->gfp_mask, oc->order,
 		current->signal->oom_score_adj);
 	cpuset_print_task_mems_allowed(current);
 	task_unlock(current);
@@ -396,7 +390,7 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
 	else
 		show_mem(SHOW_MEM_FILTER_NODES);
 	if (sysctl_oom_dump_tasks)
-		dump_tasks(memcg, nodemask);
+		dump_tasks(memcg, oc->nodemask);
 }
 
 /*
@@ -487,10 +481,9 @@ void oom_killer_enable(void)
  * Must be called while holding a reference to p, which will be released upon
  * returning.
  */
-void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
+void oom_kill_process(struct oom_context *oc, struct task_struct *p,
 		      unsigned int points, unsigned long totalpages,
-		      struct mem_cgroup *memcg, nodemask_t *nodemask,
-		      const char *message)
+		      struct mem_cgroup *memcg, const char *message)
 {
 	struct task_struct *victim = p;
 	struct task_struct *child;
@@ -514,7 +507,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	task_unlock(p);
 
 	if (__ratelimit(&oom_rs))
-		dump_header(p, gfp_mask, order, memcg, nodemask);
+		dump_header(oc, p, memcg);
 
 	task_lock(p);
 	pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
@@ -537,7 +530,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 			/*
 			 * oom_badness() returns 0 if the thread is unkillable
 			 */
-			child_points = oom_badness(child, memcg, nodemask,
+			child_points = oom_badness(child, memcg, oc->nodemask,
 								totalpages);
 			if (child_points > victim_points) {
 				put_task_struct(victim);
@@ -600,8 +593,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 /*
  * Determines whether the kernel must panic because of the panic_on_oom sysctl.
  */
-void check_panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask,
-			int order, const nodemask_t *nodemask,
+void check_panic_on_oom(struct oom_context *oc, enum oom_constraint constraint,
 			struct mem_cgroup *memcg)
 {
 	if (likely(!sysctl_panic_on_oom))
@@ -615,7 +607,7 @@ void check_panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask,
 		if (constraint != CONSTRAINT_NONE)
 			return;
 	}
-	dump_header(NULL, gfp_mask, order, memcg, nodemask);
+	dump_header(oc, NULL, memcg);
 	panic("Out of memory: %s panic_on_oom is enabled\n",
 		sysctl_panic_on_oom == 2 ? "compulsory" : "system-wide");
 }
@@ -635,22 +627,16 @@ int unregister_oom_notifier(struct notifier_block *nb)
 EXPORT_SYMBOL_GPL(unregister_oom_notifier);
 
 /**
- * __out_of_memory - kill the "best" process when we run out of memory
- * @zonelist: zonelist pointer
- * @gfp_mask: memory allocation flags
- * @order: amount of memory being requested as a power of 2
- * @nodemask: nodemask passed to page allocator
- * @force_kill: true if a task must be killed, even if others are exiting
+ * out_of_memory - kill the "best" process when we run out of memory
+ * @oc: pointer to struct oom_context
  *
  * If we run out of memory, we have the choice between either
  * killing a random task (bad), letting the system crash (worse)
  * OR try to be smart about which process to kill. Note that we
  * don't have to be perfect here, we just have to be good.
  */
-bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
-		   int order, nodemask_t *nodemask, bool force_kill)
+bool out_of_memory(struct oom_context *oc)
 {
-	const nodemask_t *mpol_mask;
 	struct task_struct *p;
 	unsigned long totalpages;
 	unsigned long freed = 0;
@@ -661,7 +647,7 @@ bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 	if (oom_killer_disabled)
 		return false;
 
-	if (!force_kill) {
+	if (!oc->force_kill) {
 		blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
 		if (freed > 0)
 			/* Got some memory back in the last second. */
@@ -686,39 +672,38 @@ bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 	 * Check if there were limitations on the allocation (only relevant for
 	 * NUMA) that may require different handling.
 	 */
-	constraint = constrained_alloc(zonelist, gfp_mask, nodemask,
-						&totalpages);
-	mpol_mask = (constraint == CONSTRAINT_MEMORY_POLICY) ? nodemask : NULL;
-	if (!force_kill)
-		check_panic_on_oom(constraint, gfp_mask, order, mpol_mask, NULL);
+	constraint = constrained_alloc(oc, &totalpages);
+	if (constraint != CONSTRAINT_MEMORY_POLICY)
+		oc->nodemask = NULL;
+	if (!oc->force_kill)
+		check_panic_on_oom(oc, constraint, NULL);
 
 	/*
 	 * not affecting force_kill because sysrq triggered OOM killer runs from
 	 * the workqueue context so current->mm will be NULL
 	 */
 	if (sysctl_oom_kill_allocating_task && current->mm &&
-	    !oom_unkillable_task(current, NULL, nodemask) &&
+	    !oom_unkillable_task(current, NULL, oc->nodemask) &&
 	    current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
 		get_task_struct(current);
-		oom_kill_process(current, gfp_mask, order, 0, totalpages, NULL,
-				 nodemask,
+		oom_kill_process(oc, current, 0, totalpages, NULL,
 				 "Out of memory (oom_kill_allocating_task)");
 		goto out;
 	}
 
-	p = select_bad_process(&points, totalpages, mpol_mask, force_kill);
+	p = select_bad_process(oc, &points, totalpages);
 	/* Found nothing?!?! Either we hang forever, or we panic. */
 	if (!p) {
-		if (!force_kill) {
-			dump_header(NULL, gfp_mask, order, NULL, mpol_mask);
+		if (!oc->force_kill) {
+			dump_header(oc, NULL, NULL);
 			panic("Out of memory and no killable processes...\n");
 		} else {
 			pr_info("Sysrq triggered out of memory. No killable task found...\n");
 		}
 	}
 	if (p != (void *)-1UL) {
-		oom_kill_process(p, gfp_mask, order, points, totalpages, NULL,
-				 nodemask, "Out of memory");
+		oom_kill_process(oc, p, points, totalpages, NULL,
+				 "Out of memory");
 		killed = 1;
 	}
 out:
@@ -739,13 +724,15 @@ bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
  */
 void pagefault_out_of_memory(void)
 {
+	struct oom_context oc = { 0 };
+
 	if (mem_cgroup_oom_synchronize(true))
 		return;
 
 	if (!mutex_trylock(&oom_lock))
 		return;
 
-	if (!out_of_memory(NULL, 0, 0, NULL, false)) {
+	if (!out_of_memory(&oc)) {
 		/*
 		 * There shouldn't be any user tasks runnable while the
 		 * OOM killer is disabled, so the current task has to
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1f9ffbb087cb..4b172b4213b5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2680,6 +2680,12 @@ static inline struct page *
 __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	const struct alloc_context *ac, unsigned long *did_some_progress)
 {
+	struct oom_context oc = {
+		.zonelist = ac->zonelist,
+		.nodemask = ac->nodemask,
+		.gfp_mask = gfp_mask,
+		.order = order,
+	};
 	struct page *page;
 
 	*did_some_progress = 0;
@@ -2731,8 +2737,7 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 			goto out;
 	}
 	/* Exhausted what can be done so it's blamo time */
-	if (out_of_memory(ac->zonelist, gfp_mask, order, ac->nodemask, false)
-			|| WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL))
+	if (out_of_memory(&oc) || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL))
 		*did_some_progress = 1;
 out:
 	mutex_unlock(&oom_lock);
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 4/4] oom: split out forced OOM killer
  2015-07-08 13:04 [PATCH 0/4] oom: sysrq+f fixes + cleanups Michal Hocko
                   ` (2 preceding siblings ...)
  2015-07-08 13:04 ` [PATCH 3/4] mm, oom: organize oom context into struct Michal Hocko
@ 2015-07-08 13:04 ` Michal Hocko
  2015-07-08 23:41   ` David Rientjes
  3 siblings, 1 reply; 24+ messages in thread
From: Michal Hocko @ 2015-07-08 13:04 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Rientjes, Jakob Unterwurzacher, linux-mm, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.cz>

The forced OOM killing is currently wired into out_of_memory() call
even though their objective is different which makes the code ugly
and harder to follow. Generic out_of_memory path has to deal with
configuration settings and heuristics which are completely irrelevant
to the forced OOM killer (e.g. sysctl_oom_kill_allocating_task or
OOM killer prevention for already dying tasks). All of them are
either relying on explicit force_kill check or indirectly by checking
current->mm which is always NULL for sysrq+f. This is not nice, hard
to follow and error prone.

Let's pull forced OOM killer code out into a separate function
(force_out_of_memory) which is really trivial now.
As a bonus we can clearly state that this is a forced OOM killer
in the OOM message which is helpful to distinguish it from the
regular OOM killer.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 drivers/tty/sysrq.c |  9 +--------
 include/linux/oom.h |  1 +
 mm/oom_kill.c       | 57 ++++++++++++++++++++++++++++++++++++-----------------
 3 files changed, 41 insertions(+), 26 deletions(-)

diff --git a/drivers/tty/sysrq.c b/drivers/tty/sysrq.c
index 865b837a9aee..6a3def693ded 100644
--- a/drivers/tty/sysrq.c
+++ b/drivers/tty/sysrq.c
@@ -356,15 +356,8 @@ static struct sysrq_key_op sysrq_term_op = {
 
 static void moom_callback(struct work_struct *ignored)
 {
-	const gfp_t gfp_mask = GFP_KERNEL;
-	struct oom_context oc = {
-		.zonelist = node_zonelist(first_memory_node, gfp_mask),
-		.gfp_mask = gfp_mask,
-		.force_kill = true,
-	};
-
 	mutex_lock(&oom_lock);
-	if (!out_of_memory(&oc))
+	if (!force_out_of_memory())
 		pr_info("OOM request ignored because killer is disabled\n");
 	mutex_unlock(&oom_lock);
 }
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 094407cb2d2e..6af2d12d6134 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -77,6 +77,7 @@ extern enum oom_scan_t oom_scan_process_thread(struct oom_context *oc,
 		struct task_struct *task, unsigned long totalpages);
 
 extern bool out_of_memory(struct oom_context *oc);
+extern bool force_out_of_memory(void);
 
 extern void exit_oom_victim(void);
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 01aa4cb86857..6a0b09296236 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -627,6 +627,38 @@ int unregister_oom_notifier(struct notifier_block *nb)
 EXPORT_SYMBOL_GPL(unregister_oom_notifier);
 
 /**
+ * force_out_of_memory - forces OOM killer to kill a process
+ *
+ * Explicitly trigger the OOM killer. The system doesn't have to be under
+ * OOM condition (e.g. sysrq+f).
+ */
+bool force_out_of_memory(void)
+{
+	struct task_struct *p;
+	unsigned long totalpages;
+	unsigned int points;
+	const gfp_t gfp_mask = GFP_KERNEL;
+	struct oom_context oc = {
+		.zonelist = node_zonelist(first_memory_node, gfp_mask),
+		.gfp_mask = gfp_mask,
+		.force_kill = true,
+	};
+
+	if (oom_killer_disabled)
+		return false;
+
+	constrained_alloc(&oc, &totalpages);
+	p = select_bad_process(&oc, &points, totalpages);
+	if (p != (void *)-1UL)
+		oom_kill_process(&oc, p, points, totalpages, NULL,
+				 "Forced out of memory killer");
+	else
+		pr_warn("Sysrq triggered out of memory. No killable task found...\n");
+
+	return true;
+}
+
+/**
  * out_of_memory - kill the "best" process when we run out of memory
  * @oc: pointer to struct oom_context
  *
@@ -647,12 +679,10 @@ bool out_of_memory(struct oom_context *oc)
 	if (oom_killer_disabled)
 		return false;
 
-	if (!oc->force_kill) {
-		blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
-		if (freed > 0)
-			/* Got some memory back in the last second. */
-			goto out;
-	}
+	blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
+	if (freed > 0)
+		/* Got some memory back in the last second. */
+		goto out;
 
 	/*
 	 * If current has a pending SIGKILL or is exiting, then automatically
@@ -675,13 +705,8 @@ bool out_of_memory(struct oom_context *oc)
 	constraint = constrained_alloc(oc, &totalpages);
 	if (constraint != CONSTRAINT_MEMORY_POLICY)
 		oc->nodemask = NULL;
-	if (!oc->force_kill)
-		check_panic_on_oom(oc, constraint, NULL);
+	check_panic_on_oom(oc, constraint, NULL);
 
-	/*
-	 * not affecting force_kill because sysrq triggered OOM killer runs from
-	 * the workqueue context so current->mm will be NULL
-	 */
 	if (sysctl_oom_kill_allocating_task && current->mm &&
 	    !oom_unkillable_task(current, NULL, oc->nodemask) &&
 	    current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
@@ -694,12 +719,8 @@ bool out_of_memory(struct oom_context *oc)
 	p = select_bad_process(oc, &points, totalpages);
 	/* Found nothing?!?! Either we hang forever, or we panic. */
 	if (!p) {
-		if (!oc->force_kill) {
-			dump_header(oc, NULL, NULL);
-			panic("Out of memory and no killable processes...\n");
-		} else {
-			pr_info("Sysrq triggered out of memory. No killable task found...\n");
-		}
+		dump_header(oc, NULL, NULL);
+		panic("Out of memory and no killable processes...\n");
 	}
 	if (p != (void *)-1UL) {
 		oom_kill_process(oc, p, points, totalpages, NULL,
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/4] oom: Do not panic when OOM killer is sysrq triggered
  2015-07-08 13:04 ` [PATCH 1/4] oom: Do not panic when OOM killer is sysrq triggered Michal Hocko
@ 2015-07-08 23:36   ` David Rientjes
  2015-07-09  8:23     ` Michal Hocko
  0 siblings, 1 reply; 24+ messages in thread
From: David Rientjes @ 2015-07-08 23:36 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Jakob Unterwurzacher, linux-mm, LKML, Michal Hocko

On Wed, 8 Jul 2015, Michal Hocko wrote:

> From: Michal Hocko <mhocko@suse.cz>
> 
> OOM killer might be triggered explicitly via sysrq+f. This is supposed
> to kill a task no matter what e.g. a task is selected even though there
> is an OOM victim on the way to exit. This is a big hammer for an admin
> to help to resolve a memory short condition when the system is not able
> to cope with it on its own in a reasonable time frame (e.g. when the
> system is trashing or the OOM killer cannot make sufficient progress)
> 
> E.g. it doesn't make any sense to obey panic_on_oom setting because
> a) administrator could have used other sysrqs to achieve the
> panic/reboot and b) the policy would break an existing usecase to
> kill a memory hog which would be recoverable unlike the panic which
> might be configured for the real OOM condition.
> 
> It also doesn't make much sense to panic the system when there is no
> OOM killable task because administrator might choose to do additional
> steps before rebooting/panicking the system.
> 
> While we are there also add a comment explaining why
> sysctl_oom_kill_allocating_task doesn't apply to sysrq triggered OOM
> killer even though there is no explicit check and we subtly rely
> on current->mm being NULL for the context from which it is triggered.
> 
> Also be more explicit about sysrq+f behavior in the documentation.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.cz>

Nack, this is already handled by patch 2 in my series.  I understand that 
the titles were wrong for patches 2 and 3, but it doesn't mean we need to 
add hacks around the code before organizing this into struct oom_control 
or completely pointless comments and printks that will fill the kernel 
log.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/4] oom: Do not invoke oom notifiers on sysrq+f
  2015-07-08 13:04 ` [PATCH 2/4] oom: Do not invoke oom notifiers on sysrq+f Michal Hocko
@ 2015-07-08 23:37   ` David Rientjes
  2015-07-09  8:55     ` Michal Hocko
  0 siblings, 1 reply; 24+ messages in thread
From: David Rientjes @ 2015-07-08 23:37 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Jakob Unterwurzacher, linux-mm, LKML, Michal Hocko

On Wed, 8 Jul 2015, Michal Hocko wrote:

> From: Michal Hocko <mhocko@suse.cz>
> 
> A github user rfjakob has reported the following issue via IRC.
> <rfjakob> Manually triggering the OOM killer does not work anymore in 4.0.5
> <rfjakob> This is what it looks like: https://gist.github.com/rfjakob/346b7dc611fc3cdf4011
> <rfjakob> Basically, what happens is that the GPU driver frees some memory, that satisfies the OOM killer
> <rfjakob> But the memory is allocated immediately again, and in the, no processes are killed no matter how often you trigger the oom killer
> <rfjakob> "in the end"
> 
> Quoting from the github:
> "
> [19291.202062] sysrq: SysRq : Manual OOM execution
> [19291.208335] Purging GPU memory, 74399744 bytes freed, 8728576 bytes still pinned.
> [19291.390767] sysrq: SysRq : Manual OOM execution
> [19291.396792] Purging GPU memory, 74452992 bytes freed, 8728576 bytes still pinned.
> [19291.560349] sysrq: SysRq : Manual OOM execution
> [19291.566018] Purging GPU memory, 75489280 bytes freed, 8728576 bytes still pinned.
> [19291.729944] sysrq: SysRq : Manual OOM execution
> [19291.735686] Purging GPU memory, 74399744 bytes freed, 8728576 bytes still pinned.
> [19291.918637] sysrq: SysRq : Manual OOM execution
> [19291.924299] Purging GPU memory, 74403840 bytes freed, 8728576 bytes still pinned.
> "
> 
> The issue is that sysrq+f (force_kill) gets confused by the regular OOM
> heuristic which tries to prevent from OOM killer if some of the oom
> notifier can relase a memory. The heuristic doesn't make much sense for
> the sysrq+f path because this one is used by the administrator to kill
> a memory hog.
> 
> Reported-by: Jakob Unterwurzacher <jakobunt@gmail.com>
> Signed-off-by: Michal Hocko <mhocko@suse.cz>

Nack, the oom notify list has no place in the oom killer, it should be 
called in the page allocator before calling out_of_memory().  
out_of_memory() should serve a single, well defined purpose: kill a 
process.  If this were done, you wouldn't need random hacks like this in 
place.  This also shouldn't be included in a patchset that redefines the 
semantics of a forced oom kill, which is quite separate.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/4] mm, oom: organize oom context into struct
  2015-07-08 13:04 ` [PATCH 3/4] mm, oom: organize oom context into struct Michal Hocko
@ 2015-07-08 23:38   ` David Rientjes
  2015-07-09  8:56     ` Michal Hocko
  0 siblings, 1 reply; 24+ messages in thread
From: David Rientjes @ 2015-07-08 23:38 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Jakob Unterwurzacher, linux-mm, LKML, Michal Hocko

On Wed, 8 Jul 2015, Michal Hocko wrote:

> From: David Rientjes <rientjes@google.com>
> 
> There are essential elements to an oom context that are passed around to
> multiple functions.
> 
> Organize these elements into a new struct, struct oom_context, that
> specifies the context for an oom condition.
> 
> This patch introduces no functional change.
> 
> [mhocko@suse.cz: s@oom_control@oom_context@]
> [mhocko@suse.cz: do not initialize on stack oom_context with NULL or 0]
> Signed-off-by: David Rientjes <rientjes@google.com>
> Signed-off-by: Michal Hocko <mhocko@suse.cz>

I hate to appear to be nacking my own patch, but it was correct when 
included in my series and should be merged as part of the larger cleanup.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 4/4] oom: split out forced OOM killer
  2015-07-08 13:04 ` [PATCH 4/4] oom: split out forced OOM killer Michal Hocko
@ 2015-07-08 23:41   ` David Rientjes
  2015-07-09 10:05     ` Michal Hocko
  0 siblings, 1 reply; 24+ messages in thread
From: David Rientjes @ 2015-07-08 23:41 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Jakob Unterwurzacher, linux-mm, LKML, Michal Hocko

On Wed, 8 Jul 2015, Michal Hocko wrote:

> From: Michal Hocko <mhocko@suse.cz>
> 
> The forced OOM killing is currently wired into out_of_memory() call
> even though their objective is different which makes the code ugly
> and harder to follow. Generic out_of_memory path has to deal with
> configuration settings and heuristics which are completely irrelevant
> to the forced OOM killer (e.g. sysctl_oom_kill_allocating_task or
> OOM killer prevention for already dying tasks). All of them are
> either relying on explicit force_kill check or indirectly by checking
> current->mm which is always NULL for sysrq+f. This is not nice, hard
> to follow and error prone.
> 
> Let's pull forced OOM killer code out into a separate function
> (force_out_of_memory) which is really trivial now.
> As a bonus we can clearly state that this is a forced OOM killer
> in the OOM message which is helpful to distinguish it from the
> regular OOM killer.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.cz>

It's really absurd that we have to go through this over and over and that 
your patches are actually being merged into -mm just because you don't get 
the point.

We have no need for a force_out_of_memory() function.  None whatsoever.  
Keeping oc->force_kill around is just more pointless space on a very deep 
stack and I'm tired of fixing stack overflows.  I'm certainly not going to 
introduce others because you think it looks cleaner in the code when 
memory compaction does the exact same thing by using cc->order == -1 to 
mean explicit compaction.

This is turning into a complete waste of time.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/4] oom: Do not panic when OOM killer is sysrq triggered
  2015-07-08 23:36   ` David Rientjes
@ 2015-07-09  8:23     ` Michal Hocko
  2015-07-09 21:03       ` David Rientjes
  0 siblings, 1 reply; 24+ messages in thread
From: Michal Hocko @ 2015-07-09  8:23 UTC (permalink / raw)
  To: David Rientjes; +Cc: Andrew Morton, Jakob Unterwurzacher, linux-mm, LKML

On Wed 08-07-15 16:36:14, David Rientjes wrote:
> On Wed, 8 Jul 2015, Michal Hocko wrote:
> 
> > From: Michal Hocko <mhocko@suse.cz>
> > 
> > OOM killer might be triggered explicitly via sysrq+f. This is supposed
> > to kill a task no matter what e.g. a task is selected even though there
> > is an OOM victim on the way to exit. This is a big hammer for an admin
> > to help to resolve a memory short condition when the system is not able
> > to cope with it on its own in a reasonable time frame (e.g. when the
> > system is trashing or the OOM killer cannot make sufficient progress)
> > 
> > E.g. it doesn't make any sense to obey panic_on_oom setting because
> > a) administrator could have used other sysrqs to achieve the
> > panic/reboot and b) the policy would break an existing usecase to
> > kill a memory hog which would be recoverable unlike the panic which
> > might be configured for the real OOM condition.
> > 
> > It also doesn't make much sense to panic the system when there is no
> > OOM killable task because administrator might choose to do additional
> > steps before rebooting/panicking the system.
> > 
> > While we are there also add a comment explaining why
> > sysctl_oom_kill_allocating_task doesn't apply to sysrq triggered OOM
> > killer even though there is no explicit check and we subtly rely
> > on current->mm being NULL for the context from which it is triggered.
> > 
> > Also be more explicit about sysrq+f behavior in the documentation.
> > 
> > Signed-off-by: Michal Hocko <mhocko@suse.cz>
> 
> Nack, this is already handled by patch 2 in my series.  I understand that 

I guess you mean patch#3

> the titles were wrong for patches 2 and 3, but it doesn't mean we need to 
> add hacks around the code before organizing this into struct oom_control 

It is much easier to backport _fixes_ into older kernels (and yes I do
care about that) if they do not depend on other cleanups. So I do not
understand your point here. Besides that the cleanup really didn't make
much change to the actuall fix because one way or another you still have
to add a simple condition to rule out a heuristic/configuration which
doesn't apply to sysrq+f path.

So I am really lost in your argumentation here.

> or completely pointless comments and printks that will fill the kernel 
> log.

Could you explain what is so pointless about a comment which clarifies
the fact which is not obviously visible from the current function?

Also could you explain why the admin shouldn't get an information if
sysrq+f didn't kill anything because no eligible task has been found?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/4] oom: Do not invoke oom notifiers on sysrq+f
  2015-07-08 23:37   ` David Rientjes
@ 2015-07-09  8:55     ` Michal Hocko
  2015-07-09 21:07       ` David Rientjes
  0 siblings, 1 reply; 24+ messages in thread
From: Michal Hocko @ 2015-07-09  8:55 UTC (permalink / raw)
  To: David Rientjes; +Cc: Andrew Morton, Jakob Unterwurzacher, linux-mm, LKML

On Wed 08-07-15 16:37:49, David Rientjes wrote:
> On Wed, 8 Jul 2015, Michal Hocko wrote:
> 
> > From: Michal Hocko <mhocko@suse.cz>
> > 
> > A github user rfjakob has reported the following issue via IRC.
> > <rfjakob> Manually triggering the OOM killer does not work anymore in 4.0.5
> > <rfjakob> This is what it looks like: https://gist.github.com/rfjakob/346b7dc611fc3cdf4011
> > <rfjakob> Basically, what happens is that the GPU driver frees some memory, that satisfies the OOM killer
> > <rfjakob> But the memory is allocated immediately again, and in the, no processes are killed no matter how often you trigger the oom killer
> > <rfjakob> "in the end"
> > 
> > Quoting from the github:
> > "
> > [19291.202062] sysrq: SysRq : Manual OOM execution
> > [19291.208335] Purging GPU memory, 74399744 bytes freed, 8728576 bytes still pinned.
> > [19291.390767] sysrq: SysRq : Manual OOM execution
> > [19291.396792] Purging GPU memory, 74452992 bytes freed, 8728576 bytes still pinned.
> > [19291.560349] sysrq: SysRq : Manual OOM execution
> > [19291.566018] Purging GPU memory, 75489280 bytes freed, 8728576 bytes still pinned.
> > [19291.729944] sysrq: SysRq : Manual OOM execution
> > [19291.735686] Purging GPU memory, 74399744 bytes freed, 8728576 bytes still pinned.
> > [19291.918637] sysrq: SysRq : Manual OOM execution
> > [19291.924299] Purging GPU memory, 74403840 bytes freed, 8728576 bytes still pinned.
> > "
> > 
> > The issue is that sysrq+f (force_kill) gets confused by the regular OOM
> > heuristic which tries to prevent from OOM killer if some of the oom
> > notifier can relase a memory. The heuristic doesn't make much sense for
> > the sysrq+f path because this one is used by the administrator to kill
> > a memory hog.
> > 
> > Reported-by: Jakob Unterwurzacher <jakobunt@gmail.com>
> > Signed-off-by: Michal Hocko <mhocko@suse.cz>
> 
> Nack, the oom notify list has no place in the oom killer, it should be 
> called in the page allocator before calling out_of_memory().  

I cannot say I would like oom notifiers interface. Quite contrary, it is
just a crude hack. It is living outside of the shrinker interface which is
what the reclaim is using and it acts like the last attempt before OOM
(e.g. i915_gem_shrinker_init registers both "shrinkers"). So I am not
sure it belongs outside of the oom killer proper.

Besides that out_of_memory already contains shortcuts to prevent killing
a task. Why is this any different? I mean why shouldn't callers of
out_of_memory check whether the task is killed or existing before
calling out_of_memory?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/4] mm, oom: organize oom context into struct
  2015-07-08 23:38   ` David Rientjes
@ 2015-07-09  8:56     ` Michal Hocko
  2015-07-09 21:09       ` David Rientjes
  0 siblings, 1 reply; 24+ messages in thread
From: Michal Hocko @ 2015-07-09  8:56 UTC (permalink / raw)
  To: David Rientjes; +Cc: Andrew Morton, Jakob Unterwurzacher, linux-mm, LKML

On Wed 08-07-15 16:38:25, David Rientjes wrote:
> On Wed, 8 Jul 2015, Michal Hocko wrote:
> 
> > From: David Rientjes <rientjes@google.com>
> > 
> > There are essential elements to an oom context that are passed around to
> > multiple functions.
> > 
> > Organize these elements into a new struct, struct oom_context, that
> > specifies the context for an oom condition.
> > 
> > This patch introduces no functional change.
> > 
> > [mhocko@suse.cz: s@oom_control@oom_context@]
> > [mhocko@suse.cz: do not initialize on stack oom_context with NULL or 0]
> > Signed-off-by: David Rientjes <rientjes@google.com>
> > Signed-off-by: Michal Hocko <mhocko@suse.cz>
> 
> I hate to appear to be nacking my own patch, but it was correct when 
> included in my series

What is incorrect in this patch?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 4/4] oom: split out forced OOM killer
  2015-07-08 23:41   ` David Rientjes
@ 2015-07-09 10:05     ` Michal Hocko
  2015-07-09 21:27       ` David Rientjes
  0 siblings, 1 reply; 24+ messages in thread
From: Michal Hocko @ 2015-07-09 10:05 UTC (permalink / raw)
  To: David Rientjes; +Cc: Andrew Morton, Jakob Unterwurzacher, linux-mm, LKML

On Wed 08-07-15 16:41:23, David Rientjes wrote:
> On Wed, 8 Jul 2015, Michal Hocko wrote:
> 
> > From: Michal Hocko <mhocko@suse.cz>
> > 
> > The forced OOM killing is currently wired into out_of_memory() call
> > even though their objective is different which makes the code ugly
> > and harder to follow. Generic out_of_memory path has to deal with
> > configuration settings and heuristics which are completely irrelevant
> > to the forced OOM killer (e.g. sysctl_oom_kill_allocating_task or
> > OOM killer prevention for already dying tasks). All of them are
> > either relying on explicit force_kill check or indirectly by checking
> > current->mm which is always NULL for sysrq+f. This is not nice, hard
> > to follow and error prone.
> > 
> > Let's pull forced OOM killer code out into a separate function
> > (force_out_of_memory) which is really trivial now.
> > As a bonus we can clearly state that this is a forced OOM killer
> > in the OOM message which is helpful to distinguish it from the
> > regular OOM killer.
> > 
> > Signed-off-by: Michal Hocko <mhocko@suse.cz>
> 
> It's really absurd that we have to go through this over and over and that 
> your patches are actually being merged into -mm just because you don't get 
> the point.
> 
> We have no need for a force_out_of_memory() function.  None whatsoever.  

The reasons are explained in the changelog and I do not see a single
argument against any of them.

> Keeping oc->force_kill around is just more pointless space on a very deep 
> stack and I'm tired of fixing stack overflows.

This just doesn't make any sense. oc->force_kill vs oc->order =
-1 replacement is completely independent on this patch and can be
implemented on top of it if you really insist.

> I'm certainly not going to 
> introduce others because you think it looks cleaner in the code when 
> memory compaction does the exact same thing by using cc->order == -1 to 
> mean explicit compaction.
> 
> This is turning into a complete waste of time.

You know what? I am tired of your complete immunity to any arguments and
the way how you are pushing more hacks into an already cluttered code.

out_of_memory is a giant mess wrt. to force killing and you can see
at least two different bugs being there just because of the code
obfuscation. If this is the state that you want to keep, I do not
care. I wanted to fix real issues and do a clean up on top. You seem to
do anything to block that. I just give up.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/4] oom: Do not panic when OOM killer is sysrq triggered
  2015-07-09  8:23     ` Michal Hocko
@ 2015-07-09 21:03       ` David Rientjes
  2015-07-10  7:41         ` Michal Hocko
  0 siblings, 1 reply; 24+ messages in thread
From: David Rientjes @ 2015-07-09 21:03 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Andrew Morton, Jakob Unterwurzacher, linux-mm, LKML

On Thu, 9 Jul 2015, Michal Hocko wrote:

> > the titles were wrong for patches 2 and 3, but it doesn't mean we need to 
> > add hacks around the code before organizing this into struct oom_control 
> 
> It is much easier to backport _fixes_ into older kernels (and yes I do
> care about that) if they do not depend on other cleanups. So I do not
> understand your point here. Besides that the cleanup really didn't make
> much change to the actuall fix because one way or another you still have
> to add a simple condition to rule out a heuristic/configuration which
> doesn't apply to sysrq+f path.
> 
> So I am really lost in your argumentation here.
> 

This isn't a bugfix: sysrq+f has, at least for eight years, been able to 
panic the kernel.  We're not fixing a bug, we're changing behavior.  It's 
quite appropriate to reorganize code before a behavior change to make it 
cleaner.

> > or completely pointless comments and printks that will fill the kernel 
> > log.
> 
> Could you explain what is so pointless about a comment which clarifies
> the fact which is not obviously visible from the current function?
> 

It states the obvious, a kthread is not going to be oom killed for 
oom_kill_allocating_task: it's not only current->mm, but also 
oom_unkillable_task(), which quite explicitly checks for PF_KTHREAD.  I 
don't think any reader of this code will assume a kthread is going to be 
oom killed.

> Also could you explain why the admin shouldn't get an information if
> sysrq+f didn't kill anything because no eligible task has been found?

The kernel log is the only notification mechanism that we have of the 
kernel killing a process, we want to avoid spamming it unnecessarily.  The 
kernel log is not the appropriate place for your debugging information 
that would only specify that yes, out_of_memory() was called, but there 
was nothing actionable, especially when that trigger can be constantly 
invoked by userspace once panicking is no longer possible.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/4] oom: Do not invoke oom notifiers on sysrq+f
  2015-07-09  8:55     ` Michal Hocko
@ 2015-07-09 21:07       ` David Rientjes
  2015-07-10  7:40         ` Michal Hocko
  0 siblings, 1 reply; 24+ messages in thread
From: David Rientjes @ 2015-07-09 21:07 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Andrew Morton, Jakob Unterwurzacher, linux-mm, LKML

On Thu, 9 Jul 2015, Michal Hocko wrote:

> > Nack, the oom notify list has no place in the oom killer, it should be 
> > called in the page allocator before calling out_of_memory().  
> 
> I cannot say I would like oom notifiers interface. Quite contrary, it is
> just a crude hack. It is living outside of the shrinker interface which is
> what the reclaim is using and it acts like the last attempt before OOM
> (e.g. i915_gem_shrinker_init registers both "shrinkers").

I agree.

> So I am not
> sure it belongs outside of the oom killer proper.
> 

Umm it has nothing to do with oom killing, it quite obviously doesn't 
belong in the oom killer.  It belongs prior to invoking the oom killer if 
memory could be freed.

> Besides that out_of_memory already contains shortcuts to prevent killing
> a task. Why is this any different? I mean why shouldn't callers of
> out_of_memory check whether the task is killed or existing before
> calling out_of_memory?
> 

Because the oom killer is for oom killing and the most vital part of oom 
killing is the granting of memory reserves, otherwise no forward progress 
can be made.

The line between "out of memory" and "not out of memory" is quite clear 
and logic that handles "out of memory" situations belongs in the oom 
killer and logic that handles "not out of memory" situations belongs in 
the page allocator.  This shouldn't be surprising whatsoever, but if you 
insist me moving the code to where it belongs, I will.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/4] mm, oom: organize oom context into struct
  2015-07-09  8:56     ` Michal Hocko
@ 2015-07-09 21:09       ` David Rientjes
  0 siblings, 0 replies; 24+ messages in thread
From: David Rientjes @ 2015-07-09 21:09 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Andrew Morton, Jakob Unterwurzacher, linux-mm, LKML

On Thu, 9 Jul 2015, Michal Hocko wrote:

> > > From: David Rientjes <rientjes@google.com>
> > > 
> > > There are essential elements to an oom context that are passed around to
> > > multiple functions.
> > > 
> > > Organize these elements into a new struct, struct oom_context, that
> > > specifies the context for an oom condition.
> > > 
> > > This patch introduces no functional change.
> > > 
> > > [mhocko@suse.cz: s@oom_control@oom_context@]
> > > [mhocko@suse.cz: do not initialize on stack oom_context with NULL or 0]
> > > Signed-off-by: David Rientjes <rientjes@google.com>
> > > Signed-off-by: Michal Hocko <mhocko@suse.cz>
> > 
> > I hate to appear to be nacking my own patch, but it was correct when 
> > included in my series
> 
> What is incorrect in this patch?

Nothing, the statement is quite clear that it is also correct in my series 
which is much cleaner, reduces stack usage, removes more lines of code 
than yours, and organizes the oom killer code to avoid passing so many 
formals which should have been done long ago.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 4/4] oom: split out forced OOM killer
  2015-07-09 10:05     ` Michal Hocko
@ 2015-07-09 21:27       ` David Rientjes
  0 siblings, 0 replies; 24+ messages in thread
From: David Rientjes @ 2015-07-09 21:27 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Andrew Morton, Jakob Unterwurzacher, linux-mm, LKML

On Thu, 9 Jul 2015, Michal Hocko wrote:

> > > The forced OOM killing is currently wired into out_of_memory() call
> > > even though their objective is different which makes the code ugly
> > > and harder to follow. Generic out_of_memory path has to deal with
> > > configuration settings and heuristics which are completely irrelevant
> > > to the forced OOM killer (e.g. sysctl_oom_kill_allocating_task or
> > > OOM killer prevention for already dying tasks). All of them are
> > > either relying on explicit force_kill check or indirectly by checking
> > > current->mm which is always NULL for sysrq+f. This is not nice, hard
> > > to follow and error prone.
> > > 
> > > Let's pull forced OOM killer code out into a separate function
> > > (force_out_of_memory) which is really trivial now.
> > > As a bonus we can clearly state that this is a forced OOM killer
> > > in the OOM message which is helpful to distinguish it from the
> > > regular OOM killer.
> > > 
> > > Signed-off-by: Michal Hocko <mhocko@suse.cz>
> > 
> > It's really absurd that we have to go through this over and over and that 
> > your patches are actually being merged into -mm just because you don't get 
> > the point.
> > 
> > We have no need for a force_out_of_memory() function.  None whatsoever.  
> 
> The reasons are explained in the changelog and I do not see a single
> argument against any of them.
> 

We have a large number of checks in the oom killer to handle various 
circumstances.  Those include different sysctl behaviors, different oom 
contexts (system, mempolicy, cpuset, memcg) to handle, behavior on 
concurrent exiting or killed processes, different calling context, etc.  
That doesn't mean they are deserving of individual functions that 
duplicate logic that add more and more lines of code.

> > Keeping oc->force_kill around is just more pointless space on a very deep 
> > stack and I'm tired of fixing stack overflows.
> 
> This just doesn't make any sense. oc->force_kill vs oc->order =
> -1 replacement is completely independent on this patch and can be
> implemented on top of it if you really insist.
> 

You are introducing a separate function that duplicates logic to avoid 
adding two checks to existing conditionals.  That's what I disagree with.

> > I'm certainly not going to 
> > introduce others because you think it looks cleaner in the code when 
> > memory compaction does the exact same thing by using cc->order == -1 to 
> > mean explicit compaction.
> > 
> > This is turning into a complete waste of time.
> 
> You know what? I am tired of your complete immunity to any arguments and
> the way how you are pushing more hacks into an already cluttered code.
> 

I'm going to make one final comment on these constant reiterations of the 
same patchset and then move on.  I simply don't have the time to continue 
to discuss stylistic differences: in this case, I disagree with you 
introducing a new function that duplicates logic elsewhere to avoid adding 
two checks in existing conditions.

If we look at memory compaction, I see cc->order == -1 checks in four 
places.  cc->order == -1 means compaction was triggered explicitly from 
the command line, just as oc->order == -1 in my patchset means the oom 
killer was triggered explicitly from sysrq.

__compact_finished():
	/*
	 * order == -1 is expected when compacting via
	 * /proc/sys/vm/compact_memory
	 */
	if (cc->order == -1)
		return COMPACT_CONTINUE;

__compaction_suitable():
	/*
	 * order == -1 is expected when compacting via
	 * /proc/sys/vm/compact_memory
	 */
	if (order == -1)
		return COMPACT_CONTINUE;

__compact_pgdat():
		/*
		 * When called via /proc/sys/vm/compact_memory
		 * this makes sure we compact the whole zone regardless of
		 * cached scanner positions.
		 */
		if (cc->order == -1)
			__reset_isolation_suitable(zone);

		if (cc->order == -1 || !compaction_deferred(zone, cc->order))
			compact_zone(zone, cc);

We don't implement separate memory compaction scanners when triggered by 
the command line.  We simply check, where appropriate, if this is a full 
compaction scan or not.  In that case, cc->order doesn't matter since we 
aren't trying to allocate a page; this is the exact same as in my patchset 
since oc->order doesn't matter since we aren't concerned with the order of 
the failed page allocation.

I have never had trouble following Mel's code when it comes to the Linux 
VM.

I recognized this as an opportunity to remove data on the stack, which is 
always important for the page allocator and oom killer because it can be 
very deep, by doing the exact same thing.

check_panic_on_oom():
	/* Do not panic for oom kills triggered by sysrq */
	if (oc->order == -1)
		return;

and two changes to existing conditions to determine if we should panic or 
if we have an eligible victim.

What we don't do is what your patch does:

bool force_out_of_memory(void)
{
	struct task_struct *p;
	unsigned long totalpages;
	unsigned int points;
	const gfp_t gfp_mask = GFP_KERNEL;
	struct oom_context oc = {
		.zonelist = node_zonelist(first_memory_node, gfp_mask),
		.gfp_mask = gfp_mask,
		.force_kill = true,
	};

	if (oom_killer_disabled)
		return false;

	constrained_alloc(&oc, &totalpages);
	p = select_bad_process(&oc, &points, totalpages);
	if (p != (void *)-1UL)
		oom_kill_process(&oc, p, points, totalpages, NULL,
				"Forced out of memory killer");
	else
		pr_warn("Sysrq triggered out of memory. No killable task found...\n");

	return true;
}

which duplicates _all_ that logic that appears elsewhere.

I can also see that it changes the oom_kill_process() message which would 
break anybody who is parsing the kernel log, which is the only 
notification mechanism we have that the kernel killed a process, for what 
has always been printed on oom kill.

I think I've reviewed this same patch three times and I'm not going to 
respond further in regards to it.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/4] oom: Do not invoke oom notifiers on sysrq+f
  2015-07-09 21:07       ` David Rientjes
@ 2015-07-10  7:40         ` Michal Hocko
  2015-07-14 21:58           ` David Rientjes
  0 siblings, 1 reply; 24+ messages in thread
From: Michal Hocko @ 2015-07-10  7:40 UTC (permalink / raw)
  To: David Rientjes; +Cc: Andrew Morton, Jakob Unterwurzacher, linux-mm, LKML

On Thu 09-07-15 14:07:37, David Rientjes wrote:
> On Thu, 9 Jul 2015, Michal Hocko wrote:
[...]
> > So I am not
> > sure it belongs outside of the oom killer proper.
> > 
> 
> Umm it has nothing to do with oom killing, it quite obviously doesn't 
> belong in the oom killer.

The naming of the API would disagree. To me register_oom_notifier sounds
like a mechanism to be notified when we are oom.

> It belongs prior to invoking the oom killer if memory could be freed.

Shrinkers are there to reclaim and prevent from OOM. This API is a gray
zone. It looks generic method for the notification yet it allows to
prevent from oom killer. I can imagine somebody might abuse this
interface to implement OOM killer policies.

Anyway, I think it would be preferable to kill it altogether rather than
play with its placing. It will always be a questionable API.

[...]
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/4] oom: Do not panic when OOM killer is sysrq triggered
  2015-07-09 21:03       ` David Rientjes
@ 2015-07-10  7:41         ` Michal Hocko
  0 siblings, 0 replies; 24+ messages in thread
From: Michal Hocko @ 2015-07-10  7:41 UTC (permalink / raw)
  To: David Rientjes; +Cc: Andrew Morton, Jakob Unterwurzacher, linux-mm, LKML

On Thu 09-07-15 14:03:53, David Rientjes wrote:
> On Thu, 9 Jul 2015, Michal Hocko wrote:
> 
> > > the titles were wrong for patches 2 and 3, but it doesn't mean we need to 
> > > add hacks around the code before organizing this into struct oom_control 
> > 
> > It is much easier to backport _fixes_ into older kernels (and yes I do
> > care about that) if they do not depend on other cleanups. So I do not
> > understand your point here. Besides that the cleanup really didn't make
> > much change to the actuall fix because one way or another you still have
> > to add a simple condition to rule out a heuristic/configuration which
> > doesn't apply to sysrq+f path.
> > 
> > So I am really lost in your argumentation here.
> > 
> 
> This isn't a bugfix: sysrq+f has, at least for eight years, been able to 
> panic the kernel.

This is an unwanted behavior and that is why I call it a bug. The mere
fact that nobody has noticed because panic_on_oom is not used widely and
even less with sysrq+f has nothing to do with it.

> We're not fixing a bug, we're changing behavior.  It's 
> quite appropriate to reorganize code before a behavior change to make it 
> cleaner.
> 
> > > or completely pointless comments and printks that will fill the kernel 
> > > log.
> > 
> > Could you explain what is so pointless about a comment which clarifies
> > the fact which is not obviously visible from the current function?
> > 
> 
> It states the obvious, a kthread is not going to be oom killed for 
> oom_kill_allocating_task:

Sigh. The comment says that the force_kill path _runs_ from the kthread
context which is far from obvious in out_of_memory.

[...]

> > Also could you explain why the admin shouldn't get an information if
> > sysrq+f didn't kill anything because no eligible task has been found?
> 
> The kernel log is the only notification mechanism that we have of the 
> kernel killing a process, we want to avoid spamming it unnecessarily.  The 
> kernel log is not the appropriate place for your debugging information 
> that would only specify that yes, out_of_memory() was called, but there 
> was nothing actionable, especially when that trigger can be constantly 
> invoked by userspace once panicking is no longer possible.

So how would you find out that there is no oom killable task?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/4] oom: Do not invoke oom notifiers on sysrq+f
  2015-07-10  7:40         ` Michal Hocko
@ 2015-07-14 21:58           ` David Rientjes
  2015-07-15  9:42             ` Michal Hocko
  0 siblings, 1 reply; 24+ messages in thread
From: David Rientjes @ 2015-07-14 21:58 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Andrew Morton, Jakob Unterwurzacher, linux-mm, LKML

On Fri, 10 Jul 2015, Michal Hocko wrote:

> Shrinkers are there to reclaim and prevent from OOM. This API is a gray
> zone. It looks generic method for the notification yet it allows to
> prevent from oom killer. I can imagine somebody might abuse this
> interface to implement OOM killer policies.
> 
> Anyway, I think it would be preferable to kill it altogether rather than
> play with its placing. It will always be a questionable API.
> 

Agreed.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/4] oom: Do not invoke oom notifiers on sysrq+f
  2015-07-14 21:58           ` David Rientjes
@ 2015-07-15  9:42             ` Michal Hocko
  2015-07-15 22:21               ` David Rientjes
  2015-07-15 22:44               ` [patch -mm] mm, oom: move oom notifiers to page allocator David Rientjes
  0 siblings, 2 replies; 24+ messages in thread
From: Michal Hocko @ 2015-07-15  9:42 UTC (permalink / raw)
  To: David Rientjes; +Cc: Andrew Morton, Jakob Unterwurzacher, linux-mm, LKML

On Tue 14-07-15 14:58:55, David Rientjes wrote:
> On Fri, 10 Jul 2015, Michal Hocko wrote:
> 
> > Shrinkers are there to reclaim and prevent from OOM. This API is a gray
> > zone. It looks generic method for the notification yet it allows to
> > prevent from oom killer. I can imagine somebody might abuse this
> > interface to implement OOM killer policies.
> > 
> > Anyway, I think it would be preferable to kill it altogether rather than
> > play with its placing. It will always be a questionable API.
> > 
> 
> Agreed.

In such a case it would be still good to fix the bug fixed by this
patch.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/4] oom: Do not invoke oom notifiers on sysrq+f
  2015-07-15  9:42             ` Michal Hocko
@ 2015-07-15 22:21               ` David Rientjes
  2015-07-15 22:44               ` [patch -mm] mm, oom: move oom notifiers to page allocator David Rientjes
  1 sibling, 0 replies; 24+ messages in thread
From: David Rientjes @ 2015-07-15 22:21 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Andrew Morton, Jakob Unterwurzacher, linux-mm, LKML

On Wed, 15 Jul 2015, Michal Hocko wrote:

> > > Shrinkers are there to reclaim and prevent from OOM. This API is a gray
> > > zone. It looks generic method for the notification yet it allows to
> > > prevent from oom killer. I can imagine somebody might abuse this
> > > interface to implement OOM killer policies.
> > > 
> > > Anyway, I think it would be preferable to kill it altogether rather than
> > > play with its placing. It will always be a questionable API.
> > > 
> > 
> > Agreed.
> 
> In such a case it would be still good to fix the bug fixed by this
> patch.
> 

It's fixed if you follow the suggestion of moving the oom notification out 
of the oom killer where it doesn't belong.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [patch -mm] mm, oom: move oom notifiers to page allocator
  2015-07-15  9:42             ` Michal Hocko
  2015-07-15 22:21               ` David Rientjes
@ 2015-07-15 22:44               ` David Rientjes
  2015-07-16  7:12                 ` Michal Hocko
  1 sibling, 1 reply; 24+ messages in thread
From: David Rientjes @ 2015-07-15 22:44 UTC (permalink / raw)
  To: Andrew Morton, Michal Hocko
  Cc: Mel Gorman, Vlastimil Babka, Jakob Unterwurzacher, linux-mm,
	linux-kernel

OOM notifiers exist to give one last chance at reclaiming memory before
the oom killer does its work.

Thus, they don't actually belong in the oom killer proper, but rather in
the page allocator where reclaim is invoked.

Move the oom notifiers to their proper place: before out_of_memory(),
which now deals solely with providing access to memory reserves and
ensuring a process is exiting to free its memory.

This also fixes an issue that invoked the oom notifiers and aborted oom
kill when triggered manually with sysrq+f.  Sysrq+f now properly triggers
an oom kill in all instances.

Such callbacks should use register_shrinker() so they are a part of
reclaim, and there should be no need for oom notifiers at all.  Thus,
add a new comment directed to reclaim rather than continuing to use this
interface.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/oom_kill.c   | 20 --------------------
 mm/page_alloc.c | 22 ++++++++++++++++++++++
 2 files changed, 22 insertions(+), 20 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -615,20 +615,6 @@ void check_panic_on_oom(struct oom_control *oc, enum oom_constraint constraint,
 		sysctl_panic_on_oom == 2 ? "compulsory" : "system-wide");
 }
 
-static BLOCKING_NOTIFIER_HEAD(oom_notify_list);
-
-int register_oom_notifier(struct notifier_block *nb)
-{
-	return blocking_notifier_chain_register(&oom_notify_list, nb);
-}
-EXPORT_SYMBOL_GPL(register_oom_notifier);
-
-int unregister_oom_notifier(struct notifier_block *nb)
-{
-	return blocking_notifier_chain_unregister(&oom_notify_list, nb);
-}
-EXPORT_SYMBOL_GPL(unregister_oom_notifier);
-
 /**
  * out_of_memory - kill the "best" process when we run out of memory
  * @oc: pointer to struct oom_control
@@ -642,18 +628,12 @@ bool out_of_memory(struct oom_control *oc)
 {
 	struct task_struct *p;
 	unsigned long totalpages;
-	unsigned long freed = 0;
 	unsigned int uninitialized_var(points);
 	enum oom_constraint constraint = CONSTRAINT_NONE;
 
 	if (oom_killer_disabled)
 		return false;
 
-	blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
-	if (freed > 0)
-		/* Got some memory back in the last second. */
-		return true;
-
 	/*
 	 * If current has a pending SIGKILL or is exiting, then automatically
 	 * select it.  The goal is to allow it to allocate so that it may
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2676,6 +2676,23 @@ void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...)
 		show_mem(filter);
 }
 
+static BLOCKING_NOTIFIER_HEAD(oom_notify_list);
+/*
+ * Deprecated -- no new callers of this interface should be added.  Instead,
+ * use reclaim shrinkers: see register_shrinker().
+ */
+int register_oom_notifier(struct notifier_block *nb)
+{
+	return blocking_notifier_chain_register(&oom_notify_list, nb);
+}
+EXPORT_SYMBOL_GPL(register_oom_notifier);
+
+int unregister_oom_notifier(struct notifier_block *nb)
+{
+	return blocking_notifier_chain_unregister(&oom_notify_list, nb);
+}
+EXPORT_SYMBOL_GPL(unregister_oom_notifier);
+
 static inline struct page *
 __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	const struct alloc_context *ac, unsigned long *did_some_progress)
@@ -2736,6 +2753,11 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 		if (gfp_mask & __GFP_THISNODE)
 			goto out;
 	}
+
+	blocking_notifier_call_chain(&oom_notify_list, 0, did_some_progress);
+	if (*did_some_progress > 0)
+		goto out;
+
 	/* Exhausted what can be done so it's blamo time */
 	if (out_of_memory(&oc) || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL))
 		*did_some_progress = 1;

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [patch -mm] mm, oom: move oom notifiers to page allocator
  2015-07-15 22:44               ` [patch -mm] mm, oom: move oom notifiers to page allocator David Rientjes
@ 2015-07-16  7:12                 ` Michal Hocko
  0 siblings, 0 replies; 24+ messages in thread
From: Michal Hocko @ 2015-07-16  7:12 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Mel Gorman, Vlastimil Babka, Jakob Unterwurzacher,
	linux-mm, linux-kernel

On Wed 15-07-15 15:44:36, David Rientjes wrote:
> OOM notifiers exist to give one last chance at reclaiming memory before
> the oom killer does its work.
> 
> Thus, they don't actually belong in the oom killer proper, but rather in
> the page allocator where reclaim is invoked.

Why this is not needed in the page fault oom path anymore?

> Move the oom notifiers to their proper place: before out_of_memory(),
> which now deals solely with providing access to memory reserves and
> ensuring a process is exiting to free its memory.
> 
> This also fixes an issue that invoked the oom notifiers and aborted oom
> kill when triggered manually with sysrq+f.  Sysrq+f now properly triggers
> an oom kill in all instances.
> 
> Such callbacks should use register_shrinker() so they are a part of
> reclaim, and there should be no need for oom notifiers at all.  Thus,
> add a new comment directed to reclaim rather than continuing to use this
> interface.

I do not see anybody from the subsystems using oom notifiers CCed here
for their opinion.

I guess Reported-by could have been preserved

> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>  mm/oom_kill.c   | 20 --------------------
>  mm/page_alloc.c | 22 ++++++++++++++++++++++
>  2 files changed, 22 insertions(+), 20 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -615,20 +615,6 @@ void check_panic_on_oom(struct oom_control *oc, enum oom_constraint constraint,
>  		sysctl_panic_on_oom == 2 ? "compulsory" : "system-wide");
>  }
>  
> -static BLOCKING_NOTIFIER_HEAD(oom_notify_list);
> -
> -int register_oom_notifier(struct notifier_block *nb)
> -{
> -	return blocking_notifier_chain_register(&oom_notify_list, nb);
> -}
> -EXPORT_SYMBOL_GPL(register_oom_notifier);
> -
> -int unregister_oom_notifier(struct notifier_block *nb)
> -{
> -	return blocking_notifier_chain_unregister(&oom_notify_list, nb);
> -}
> -EXPORT_SYMBOL_GPL(unregister_oom_notifier);
> -
>  /**
>   * out_of_memory - kill the "best" process when we run out of memory
>   * @oc: pointer to struct oom_control
> @@ -642,18 +628,12 @@ bool out_of_memory(struct oom_control *oc)
>  {
>  	struct task_struct *p;
>  	unsigned long totalpages;
> -	unsigned long freed = 0;
>  	unsigned int uninitialized_var(points);
>  	enum oom_constraint constraint = CONSTRAINT_NONE;
>  
>  	if (oom_killer_disabled)
>  		return false;
>  
> -	blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
> -	if (freed > 0)
> -		/* Got some memory back in the last second. */
> -		return true;
> -
>  	/*
>  	 * If current has a pending SIGKILL or is exiting, then automatically
>  	 * select it.  The goal is to allow it to allocate so that it may
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2676,6 +2676,23 @@ void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...)
>  		show_mem(filter);
>  }
>  
> +static BLOCKING_NOTIFIER_HEAD(oom_notify_list);
> +/*
> + * Deprecated -- no new callers of this interface should be added.  Instead,
> + * use reclaim shrinkers: see register_shrinker().
> + */
> +int register_oom_notifier(struct notifier_block *nb)
> +{
> +	return blocking_notifier_chain_register(&oom_notify_list, nb);
> +}
> +EXPORT_SYMBOL_GPL(register_oom_notifier);
> +
> +int unregister_oom_notifier(struct notifier_block *nb)
> +{
> +	return blocking_notifier_chain_unregister(&oom_notify_list, nb);
> +}
> +EXPORT_SYMBOL_GPL(unregister_oom_notifier);
> +
>  static inline struct page *
>  __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
>  	const struct alloc_context *ac, unsigned long *did_some_progress)
> @@ -2736,6 +2753,11 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
>  		if (gfp_mask & __GFP_THISNODE)
>  			goto out;
>  	}
> +
> +	blocking_notifier_call_chain(&oom_notify_list, 0, did_some_progress);
> +	if (*did_some_progress > 0)
> +		goto out;
> +
>  	/* Exhausted what can be done so it's blamo time */
>  	if (out_of_memory(&oc) || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL))
>  		*did_some_progress = 1;

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2015-07-16  7:12 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-07-08 13:04 [PATCH 0/4] oom: sysrq+f fixes + cleanups Michal Hocko
2015-07-08 13:04 ` [PATCH 1/4] oom: Do not panic when OOM killer is sysrq triggered Michal Hocko
2015-07-08 23:36   ` David Rientjes
2015-07-09  8:23     ` Michal Hocko
2015-07-09 21:03       ` David Rientjes
2015-07-10  7:41         ` Michal Hocko
2015-07-08 13:04 ` [PATCH 2/4] oom: Do not invoke oom notifiers on sysrq+f Michal Hocko
2015-07-08 23:37   ` David Rientjes
2015-07-09  8:55     ` Michal Hocko
2015-07-09 21:07       ` David Rientjes
2015-07-10  7:40         ` Michal Hocko
2015-07-14 21:58           ` David Rientjes
2015-07-15  9:42             ` Michal Hocko
2015-07-15 22:21               ` David Rientjes
2015-07-15 22:44               ` [patch -mm] mm, oom: move oom notifiers to page allocator David Rientjes
2015-07-16  7:12                 ` Michal Hocko
2015-07-08 13:04 ` [PATCH 3/4] mm, oom: organize oom context into struct Michal Hocko
2015-07-08 23:38   ` David Rientjes
2015-07-09  8:56     ` Michal Hocko
2015-07-09 21:09       ` David Rientjes
2015-07-08 13:04 ` [PATCH 4/4] oom: split out forced OOM killer Michal Hocko
2015-07-08 23:41   ` David Rientjes
2015-07-09 10:05     ` Michal Hocko
2015-07-09 21:27       ` David Rientjes

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).