* [PATCH 0/2] mm, oom: do not grant oom victims full memory reserves access
@ 2017-07-27  9:03 ` Michal Hocko
  0 siblings, 0 replies; 55+ messages in thread
From: Michal Hocko @ 2017-07-27  9:03 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Rientjes, Johannes Weiner, Roman Gushchin, Tetsuo Handa,
	linux-mm, LKML

Hi,
this is part of a larger series I posted back in October last year [1].
I have dropped patch 3 because it was incorrect, and patch 4 is not
applicable without it.

The primary reason to apply patch 1 is to remove the risk of complete
memory depletion by oom victims. While this is only a theoretical risk
right now, there is demand for a memcg-aware oom killer which might
kill all processes inside a memcg, and that can be a lot of tasks. That
would make the risk quite real.

This issue is addressed by limiting access to memory reserves. We no
longer use TIF_MEMDIE to grant access and rely on tsk_is_oom_victim
instead. See patch 1 for more details. Patch 2 is a trivial follow-up
cleanup.

I would still like to get rid of TIF_MEMDIE completely, but I do not
have the time to do it now and it is not a pressing issue.

[1] http://lkml.kernel.org/r/20161004090009.7974-1-mhocko@kernel.org


* [PATCH 1/2] mm, oom: do not rely on TIF_MEMDIE for memory reserves access
  2017-07-27  9:03 ` Michal Hocko
@ 2017-07-27  9:03   ` Michal Hocko
  -1 siblings, 0 replies; 55+ messages in thread
From: Michal Hocko @ 2017-07-27  9:03 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Rientjes, Johannes Weiner, Roman Gushchin, Tetsuo Handa,
	linux-mm, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

For ages we have been relying on the TIF_MEMDIE thread flag to mark OOM
victims and then, among other things, to give these threads full access
to memory reserves. There are a few shortcomings of this
implementation, though.

First of all, and most seriously, full access to memory reserves is
quite dangerous because it leaves no safety room for the system to
operate and potentially take last emergency steps to move on.

Secondly, this flag is per task_struct while the OOM killer operates on
mm_struct granularity, so all processes sharing the given mm are
killed. Giving full access to all these task_structs could lead to a
quick depletion of the memory reserves. We have tried to reduce this
risk by giving TIF_MEMDIE only to the main thread and the currently
allocating task, but that doesn't really solve the problem while it
surely opens up room for corner cases - e.g. GFP_NO{FS,IO} requests
might loop inside the allocator without access to memory reserves
because a particular thread was not the group leader.

Now that we have the oom reaper, and all oom victims are reapable after
1b51e65eab64 ("oom, oom_reaper: allow to reap mm shared by the
kthreads"), we can be more conservative and grant only partial access
to memory reserves, because there is a reasonable chance of parallel
memory freeing. We still want some access to the reserves because we do
not want other consumers to eat up the victim's freed memory. oom
victims will still contend with __GFP_HIGH users, but those shouldn't
be so aggressive as to starve oom victims completely.

Introduce an ALLOC_OOM flag and give all tsk_is_oom_victim tasks access
to half of the reserves. This makes access to the reserves independent
of which task has passed through mark_oom_victim. Also drop all usage
of TIF_MEMDIE from the page allocator proper and replace it with
tsk_is_oom_victim as well, which finally makes page_alloc.c completely
TIF_MEMDIE free.
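
To illustrate the resulting watermark scaling, here is a minimal user
space sketch of the __zone_watermark_ok arithmetic (using the MMU flag
values from mm/internal.h; an illustration only, not kernel code):

  #include <stdio.h>

  #define ALLOC_OOM	0x08
  #define ALLOC_HARDER	0x10

  /* mirrors the alloc_harder branch of __zone_watermark_ok */
  static long effective_min(long mark, unsigned int alloc_flags)
  {
  	long min = mark;

  	if (alloc_flags & (ALLOC_HARDER | ALLOC_OOM)) {
  		if (alloc_flags & ALLOC_OOM)
  			min -= min / 2;	/* oom victims dig deeper */
  		else
  			min -= min / 4;	/* regular ALLOC_HARDER users */
  	}
  	return min;
  }

  int main(void)
  {
  	/* e.g. a min watermark of 1024 pages */
  	printf("normal       : %ld\n", effective_min(1024, 0));
  	printf("ALLOC_HARDER : %ld\n", effective_min(1024, ALLOC_HARDER));
  	printf("ALLOC_OOM    : %ld\n", effective_min(1024, ALLOC_OOM));
  	return 0;
  }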

CONFIG_MMU=n doesn't have the oom reaper, so stick to the original
ALLOC_NO_WATERMARKS approach there, but be careful, because such
victims might still deplete all the memory reserves. Keep the semantics
as close to the original implementation as possible and give them
access to memory reserves only up to exit_mm (when tsk->mm is cleared)
rather than for as long as tsk_is_oom_victim holds, which is until the
signal struct is gone.

There is demand to make the oom killer memcg aware, which will imply
many tasks killed at once. This change will allow such a usecase
without worrying about complete depletion of memory reserves.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/internal.h   | 11 +++++++++++
 mm/oom_kill.c   |  9 +++++----
 mm/page_alloc.c | 55 +++++++++++++++++++++++++++++++++++++++++++++----------
 3 files changed, 61 insertions(+), 14 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 24d88f084705..1ebcb1ed01b5 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -480,6 +480,17 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
 /* Mask to get the watermark bits */
 #define ALLOC_WMARK_MASK	(ALLOC_NO_WATERMARKS-1)
 
+/*
+ * Only MMU archs have async oom victim reclaim - aka oom_reaper so we
+ * cannot assume a reduced access to memory reserves is sufficient for
+ * !MMU
+ */
+#ifdef CONFIG_MMU
+#define ALLOC_OOM		0x08
+#else
+#define ALLOC_OOM		ALLOC_NO_WATERMARKS
+#endif
+
 #define ALLOC_HARDER		0x10 /* try to alloc harder */
 #define ALLOC_HIGH		0x20 /* __GFP_HIGH set */
 #define ALLOC_CPUSET		0x40 /* check for correct cpuset */
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 9e8b4f030c1c..c9f3569a76c7 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -824,7 +824,8 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 
 	/*
 	 * If the task is already exiting, don't alarm the sysadmin or kill
-	 * its children or threads, just set TIF_MEMDIE so it can die quickly
+	 * its children or threads, just give it access to memory reserves
+	 * so it can die quickly
 	 */
 	task_lock(p);
 	if (task_will_free_mem(p)) {
@@ -889,9 +890,9 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 	count_memcg_event_mm(mm, OOM_KILL);
 
 	/*
-	 * We should send SIGKILL before setting TIF_MEMDIE in order to prevent
-	 * the OOM victim from depleting the memory reserves from the user
-	 * space under its control.
+	 * We should send SIGKILL before granting access to memory reserves
+	 * in order to prevent the OOM victim from depleting the memory
+	 * reserves from the user space under its control.
 	 */
 	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
 	mark_oom_victim(victim);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 80e4adb4c360..5e5911f40014 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2930,7 +2930,7 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
 {
 	long min = mark;
 	int o;
-	const bool alloc_harder = (alloc_flags & ALLOC_HARDER);
+	const bool alloc_harder = (alloc_flags & (ALLOC_HARDER|ALLOC_OOM));
 
 	/* free_pages may go negative - that's OK */
 	free_pages -= (1 << order) - 1;
@@ -2943,10 +2943,19 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
 	 * the high-atomic reserves. This will over-estimate the size of the
 	 * atomic reserve but it avoids a search.
 	 */
-	if (likely(!alloc_harder))
+	if (likely(!alloc_harder)) {
 		free_pages -= z->nr_reserved_highatomic;
-	else
-		min -= min / 4;
+	} else {
+		/*
+		 * OOM victims can try even harder than normal ALLOC_HARDER
+		 * users
+		 */
+		if (alloc_flags & ALLOC_OOM)
+			min -= min / 2;
+		else
+			min -= min / 4;
+	}
+
 
 #ifdef CONFIG_CMA
 	/* If allocation can't use CMA areas don't use free CMA pages */
@@ -3184,7 +3193,7 @@ static void warn_alloc_show_mem(gfp_t gfp_mask, nodemask_t *nodemask)
 	 * of allowed nodes.
 	 */
 	if (!(gfp_mask & __GFP_NOMEMALLOC))
-		if (test_thread_flag(TIF_MEMDIE) ||
+		if (tsk_is_oom_victim(current) ||
 		    (current->flags & (PF_MEMALLOC | PF_EXITING)))
 			filter &= ~SHOW_MEM_FILTER_NODES;
 	if (in_interrupt() || !(gfp_mask & __GFP_DIRECT_RECLAIM))
@@ -3603,6 +3612,22 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 	return alloc_flags;
 }
 
+static bool oom_reserves_allowed(struct task_struct *tsk)
+{
+	if (!tsk_is_oom_victim(tsk))
+		return false;
+
+	/*
+	 * !MMU doesn't have oom reaper so we shouldn't risk the memory reserves
+	 * depletion and shouldn't give access to memory reserves past the
+	 * exit_mm
+	 */
+	if (!IS_ENABLED(CONFIG_MMU) && !tsk->mm)
+		return false;
+
+	return true;
+}
+
 bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 {
 	if (unlikely(gfp_mask & __GFP_NOMEMALLOC))
@@ -3614,7 +3639,7 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 		return true;
 	if (!in_interrupt() &&
 			((current->flags & PF_MEMALLOC) ||
-			 unlikely(test_thread_flag(TIF_MEMDIE))))
+			 oom_reserves_allowed(current)))
 		return true;
 
 	return false;
@@ -3770,6 +3795,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	unsigned long alloc_start = jiffies;
 	unsigned int stall_timeout = 10 * HZ;
 	unsigned int cpuset_mems_cookie;
+	bool reserves;
 
 	/*
 	 * In the slowpath, we sanity check order to avoid ever trying to
@@ -3875,15 +3901,24 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	if (gfp_mask & __GFP_KSWAPD_RECLAIM)
 		wake_all_kswapds(order, ac);
 
-	if (gfp_pfmemalloc_allowed(gfp_mask))
-		alloc_flags = ALLOC_NO_WATERMARKS;
+	/*
+	 * Distinguish requests which really need access to whole memory
+	 * reserves from oom victims which can live with their own reserve
+	 */
+	reserves = gfp_pfmemalloc_allowed(gfp_mask);
+	if (reserves) {
+		if (tsk_is_oom_victim(current))
+			alloc_flags = ALLOC_OOM;
+		else
+			alloc_flags = ALLOC_NO_WATERMARKS;
+	}
 
 	/*
 	 * Reset the zonelist iterators if memory policies can be ignored.
 	 * These allocations are high priority and system rather than user
 	 * orientated.
 	 */
-	if (!(alloc_flags & ALLOC_CPUSET) || (alloc_flags & ALLOC_NO_WATERMARKS)) {
+	if (!(alloc_flags & ALLOC_CPUSET) || reserves) {
 		ac->zonelist = node_zonelist(numa_node_id(), gfp_mask);
 		ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
 					ac->high_zoneidx, ac->nodemask);
@@ -3960,7 +3995,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		goto got_pg;
 
 	/* Avoid allocations with no watermarks from looping endlessly */
-	if (test_thread_flag(TIF_MEMDIE) &&
+	if (tsk_is_oom_victim(current) &&
 	    (alloc_flags == ALLOC_NO_WATERMARKS ||
 	     (gfp_mask & __GFP_NOMEMALLOC)))
 		goto nopage;
-- 
2.13.2


* [PATCH 2/2] mm: replace TIF_MEMDIE checks by tsk_is_oom_victim
  2017-07-27  9:03 ` Michal Hocko
@ 2017-07-27  9:03   ` Michal Hocko
  -1 siblings, 0 replies; 55+ messages in thread
From: Michal Hocko @ 2017-07-27  9:03 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Rientjes, Johannes Weiner, Roman Gushchin, Tetsuo Handa,
	linux-mm, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

TIF_MEMDIE is set only on tasks which were either directly selected by
the OOM killer or passed through mark_oom_victim from the allocator
path. tsk_is_oom_victim is more generic and allows identifying all
tasks (threads) which share the mm with the oom victim.

Please note that the freezer still needs to check TIF_MEMDIE because we
cannot thaw tasks which do not participate in oom_victims counting,
otherwise a !TIF_MEMDIE task could interfere after oom_disable returns.
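
For reference, tsk_is_oom_victim looks at the mm-based marker on the
shared signal struct rather than at a per-thread flag, which is why it
covers all of the victim's threads. As of this series' base it is
essentially (include/linux/oom.h):

  static inline bool tsk_is_oom_victim(struct task_struct *tsk)
  {
  	/* set by mark_oom_victim, kept until the signal struct is freed */
  	return tsk->signal->oom_mm;
  }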

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 kernel/cgroup/cpuset.c | 8 ++++----
 mm/memcontrol.c        | 2 +-
 mm/oom_kill.c          | 2 +-
 3 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index ca8376e5008c..1cc53dff0d94 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -2497,12 +2497,12 @@ static struct cpuset *nearest_hardwall_ancestor(struct cpuset *cs)
  * If we're in interrupt, yes, we can always allocate.  If @node is set in
  * current's mems_allowed, yes.  If it's not a __GFP_HARDWALL request and this
  * node is set in the nearest hardwalled cpuset ancestor to current's cpuset,
- * yes.  If current has access to memory reserves due to TIF_MEMDIE, yes.
+ * yes.  If current has access to memory reserves as an oom victim, yes.
  * Otherwise, no.
  *
  * GFP_USER allocations are marked with the __GFP_HARDWALL bit,
  * and do not allow allocations outside the current tasks cpuset
- * unless the task has been OOM killed as is marked TIF_MEMDIE.
+ * unless the task has been OOM killed.
  * GFP_KERNEL allocations are not so marked, so can escape to the
  * nearest enclosing hardwalled ancestor cpuset.
  *
@@ -2525,7 +2525,7 @@ static struct cpuset *nearest_hardwall_ancestor(struct cpuset *cs)
  * affect that:
  *	in_interrupt - any node ok (current task context irrelevant)
  *	GFP_ATOMIC   - any node ok
- *	TIF_MEMDIE   - any node ok
+ *	tsk_is_oom_victim   - any node ok
  *	GFP_KERNEL   - any node in enclosing hardwalled cpuset ok
  *	GFP_USER     - only nodes in current tasks mems allowed ok.
  */
@@ -2543,7 +2543,7 @@ bool __cpuset_node_allowed(int node, gfp_t gfp_mask)
 	 * Allow tasks that have access to memory reserves because they have
 	 * been OOM killed to get memory anywhere.
 	 */
-	if (unlikely(test_thread_flag(TIF_MEMDIE)))
+	if (unlikely(tsk_is_oom_victim(current)))
 		return true;
 	if (gfp_mask & __GFP_HARDWALL)	/* If hardwall request, stop here */
 		return false;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 544d47e5cbbd..86a48affb938 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1896,7 +1896,7 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * bypass the last charges so that they can exit quickly and
 	 * free their memory.
 	 */
-	if (unlikely(test_thread_flag(TIF_MEMDIE) ||
+	if (unlikely(tsk_is_oom_victim(current) ||
 		     fatal_signal_pending(current) ||
 		     current->flags & PF_EXITING))
 		goto force;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index c9f3569a76c7..65cc2f9aaa05 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -483,7 +483,7 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 	 *				[...]
 	 *				out_of_memory
 	 *				  select_bad_process
-	 *				    # no TIF_MEMDIE task selects new victim
+	 *				    # no TIF_MEMDIE, selects new victim
 	 *  unmap_page_range # frees some memory
 	 */
 	mutex_lock(&oom_lock);
-- 
2.13.2


* Re: [PATCH 2/2] mm: replace TIF_MEMDIE checks by tsk_is_oom_victim
  2017-07-27  9:03   ` Michal Hocko
@ 2017-07-27 14:01     ` Tetsuo Handa
  -1 siblings, 0 replies; 55+ messages in thread
From: Tetsuo Handa @ 2017-07-27 14:01 UTC (permalink / raw)
  To: mhocko, akpm; +Cc: rientjes, hannes, guro, linux-mm, linux-kernel, mhocko

Michal Hocko wrote:
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 544d47e5cbbd..86a48affb938 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1896,7 +1896,7 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	 * bypass the last charges so that they can exit quickly and
>  	 * free their memory.
>  	 */
> -	if (unlikely(test_thread_flag(TIF_MEMDIE) ||
> +	if (unlikely(tsk_is_oom_victim(current) ||
>  		     fatal_signal_pending(current) ||
>  		     current->flags & PF_EXITING))
>  		goto force;

Did we check http://lkml.kernel.org/r/20160909140508.GO4844@dhcp22.suse.cz ?

> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index c9f3569a76c7..65cc2f9aaa05 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -483,7 +483,7 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
>  	 *				[...]
>  	 *				out_of_memory
>  	 *				  select_bad_process
> -	 *				    # no TIF_MEMDIE task selects new victim
> +	 *				    # no TIF_MEMDIE, selects new victim
>  	 *  unmap_page_range # frees some memory
>  	 */
>  	mutex_lock(&oom_lock);

This comment is wrong. No MMF_OOM_SKIP mm selects new victim.


* Re: [PATCH 2/2] mm: replace TIF_MEMDIE checks by tsk_is_oom_victim
  2017-07-27 14:01     ` Tetsuo Handa
@ 2017-07-27 14:08       ` Tetsuo Handa
  -1 siblings, 0 replies; 55+ messages in thread
From: Tetsuo Handa @ 2017-07-27 14:08 UTC (permalink / raw)
  To: mhocko, akpm; +Cc: rientjes, hannes, guro, linux-mm, linux-kernel, mhocko

Tetsuo Handa wrote:
> Michal Hocko wrote:
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 544d47e5cbbd..86a48affb938 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -1896,7 +1896,7 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
> >  	 * bypass the last charges so that they can exit quickly and
> >  	 * free their memory.
> >  	 */
> > -	if (unlikely(test_thread_flag(TIF_MEMDIE) ||
> > +	if (unlikely(tsk_is_oom_victim(current) ||
> >  		     fatal_signal_pending(current) ||
> >  		     current->flags & PF_EXITING))
> >  		goto force;
> 
> Did we check http://lkml.kernel.org/r/20160909140508.GO4844@dhcp22.suse.cz ?
> 
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index c9f3569a76c7..65cc2f9aaa05 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -483,7 +483,7 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
> >  	 *				[...]
> >  	 *				out_of_memory
> >  	 *				  select_bad_process
> > -	 *				    # no TIF_MEMDIE task selects new victim
> > +	 *				    # no TIF_MEMDIE, selects new victim
> >  	 *  unmap_page_range # frees some memory
> >  	 */
> >  	mutex_lock(&oom_lock);
> 
> This comment is wrong. No MMF_OOM_SKIP mm selects new victim.
> 
Oops, it should read "MMF_OOM_SKIP mm selects new victim", according to
http://lkml.kernel.org/r/201706271952.FEB21375.SFJFHOQLOtVOMF@I-love.SAKURA.ne.jp .


* Re: [PATCH 2/2] mm: replace TIF_MEMDIE checks by tsk_is_oom_victim
  2017-07-27 14:01     ` Tetsuo Handa
@ 2017-07-27 14:18       ` Michal Hocko
  -1 siblings, 0 replies; 55+ messages in thread
From: Michal Hocko @ 2017-07-27 14:18 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: akpm, rientjes, hannes, guro, linux-mm, linux-kernel

On Thu 27-07-17 23:01:05, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 544d47e5cbbd..86a48affb938 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -1896,7 +1896,7 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
> >  	 * bypass the last charges so that they can exit quickly and
> >  	 * free their memory.
> >  	 */
> > -	if (unlikely(test_thread_flag(TIF_MEMDIE) ||
> > +	if (unlikely(tsk_is_oom_victim(current) ||
> >  		     fatal_signal_pending(current) ||
> >  		     current->flags & PF_EXITING))
> >  		goto force;
> 
> Did we check http://lkml.kernel.org/r/20160909140508.GO4844@dhcp22.suse.cz ?

I will double check.

> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index c9f3569a76c7..65cc2f9aaa05 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -483,7 +483,7 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
> >  	 *				[...]
> >  	 *				out_of_memory
> >  	 *				  select_bad_process
> > -	 *				    # no TIF_MEMDIE task selects new victim
> > +	 *				    # no TIF_MEMDIE, selects new victim
> >  	 *  unmap_page_range # frees some memory
> >  	 */
> >  	mutex_lock(&oom_lock);
> 
> This comment is wrong. No MMF_OOM_SKIP mm selects new victim.

This hunk shouldn't make it to the patch.

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 2/2] mm: replace TIF_MEMDIE checks by tsk_is_oom_victim
  2017-07-27 14:01     ` Tetsuo Handa
@ 2017-07-27 14:45       ` Michal Hocko
  -1 siblings, 0 replies; 55+ messages in thread
From: Michal Hocko @ 2017-07-27 14:45 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: akpm, rientjes, hannes, guro, linux-mm, linux-kernel

On Thu 27-07-17 23:01:05, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 544d47e5cbbd..86a48affb938 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -1896,7 +1896,7 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
> >  	 * bypass the last charges so that they can exit quickly and
> >  	 * free their memory.
> >  	 */
> > -	if (unlikely(test_thread_flag(TIF_MEMDIE) ||
> > +	if (unlikely(tsk_is_oom_victim(current) ||
> >  		     fatal_signal_pending(current) ||
> >  		     current->flags & PF_EXITING))
> >  		goto force;
> 
> Did we check http://lkml.kernel.org/r/20160909140508.GO4844@dhcp22.suse.cz ?

OK, so your concern was

> Does this test_thread_flag(TIF_MEMDIE) (or tsk_is_oom_victim(current)) make sense?
> 
> If current thread is OOM-killed, SIGKILL must be pending before arriving at
> do_exit() and PF_EXITING must be set after arriving at do_exit().

> But I can't find locations which do memory allocation between clearing
> SIGKILL and setting PF_EXITING.

I can't find them either and maybe there are none. But why do we care
in this particular patch, which merely replaces the TIF_MEMDIE check by
tsk_is_oom_victim? The code will surely not become less valid. If you
believe this check is redundant then send a patch with a clear
justification. But I would say that, at least from the robustness point
of view, I would just keep it there. We do not really have any control
over what happens between clearing signals and setting PF_EXITING.
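
If I read the exit path right, the window in question is roughly
(simplified from kernel/signal.c and kernel/exit.c):

  get_signal()		# dequeues SIGKILL, fatal_signal_pending()
			# is false from here on
  do_group_exit()
    do_exit()
      exit_signals()	# sets PF_EXITING
      exit_mm()		# clears tsk->mm

so anything allocating in between would see neither a pending fatal
signal nor PF_EXITING, and only the tsk_is_oom_victim check would
catch it.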
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 2/2] mm: replace TIF_MEMDIE checks by tsk_is_oom_victim
  2017-07-27 14:45       ` Michal Hocko
@ 2017-07-27 14:55         ` Roman Gushchin
  -1 siblings, 0 replies; 55+ messages in thread
From: Roman Gushchin @ 2017-07-27 14:55 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Tetsuo Handa, akpm, rientjes, hannes, linux-mm, linux-kernel

On Thu, Jul 27, 2017 at 04:45:44PM +0200, Michal Hocko wrote:
> On Thu 27-07-17 23:01:05, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > index 544d47e5cbbd..86a48affb938 100644
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -1896,7 +1896,7 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
> > >  	 * bypass the last charges so that they can exit quickly and
> > >  	 * free their memory.
> > >  	 */
> > > -	if (unlikely(test_thread_flag(TIF_MEMDIE) ||
> > > +	if (unlikely(tsk_is_oom_victim(current) ||
> > >  		     fatal_signal_pending(current) ||
> > >  		     current->flags & PF_EXITING))
> > >  		goto force;
> > 
> > Did we check http://lkml.kernel.org/r/20160909140508.GO4844@dhcp22.suse.cz ?
> 
> OK, so your concern was
> 
> > Does this test_thread_flag(TIF_MEMDIE) (or tsk_is_oom_victim(current)) make sense?
> > 
> > If current thread is OOM-killed, SIGKILL must be pending before arriving at
> > do_exit() and PF_EXITING must be set after arriving at do_exit().
> 
> > But I can't find locations which do memory allocation between clearing
> > SIGKILL and setting PF_EXITING.
> 
> I can't find them either and maybe there are none. But why do we care
> in this particular patch which merely replaces TIF_MEMDIE check by
> tsk_is_oom_victim? The code will surely not become less valid. If
> you believe this check is redundant then send a patch with the clear
> justification. But I would say, at least from the robustness point of
> view I would just keep it there. We do not really have any control on
> what happens between clearing signals and setting PF_EXITING.

I agree, this check is probably redundant, but it really makes no
difference; let's keep it bullet-proof. If we care about performance
here, we can rearrange the checks:
  if (unlikely(fatal_signal_pending(current) ||
  	     current->flags & PF_EXITING) ||
  	     tsk_is_oom_victim(current))
  	goto force;

Roman


* Re: [PATCH 2/2] mm: replace TIF_MEMDIE checks by tsk_is_oom_victim
  2017-07-27  9:03   ` Michal Hocko
  (?)
  (?)
@ 2017-07-29  8:33   ` kbuild test robot
  2017-07-31  6:46       ` Michal Hocko
  -1 siblings, 1 reply; 55+ messages in thread
From: kbuild test robot @ 2017-07-29  8:33 UTC (permalink / raw)
  To: Michal Hocko
  Cc: kbuild-all, Andrew Morton, David Rientjes, Johannes Weiner,
	Roman Gushchin, Tetsuo Handa, linux-mm, LKML, Michal Hocko

[-- Attachment #1: Type: text/plain, Size: 4865 bytes --]

Hi Michal,

[auto build test ERROR on cgroup/for-next]
[also build test ERROR on v4.13-rc2 next-20170728]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Michal-Hocko/mm-oom-do-not-rely-on-TIF_MEMDIE-for-memory-reserves-access/20170728-101955
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git for-next
config: i386-randconfig-c0-07291424 (attached as .config)
compiler: gcc-4.9 (Debian 4.9.4-2) 4.9.4
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All errors (new ones prefixed by >>):

   In file included from include/linux/ioport.h:12:0,
                    from include/linux/device.h:16,
                    from include/linux/node.h:17,
                    from include/linux/cpu.h:16,
                    from kernel/cgroup/cpuset.c:25:
   kernel/cgroup/cpuset.c: In function '__cpuset_node_allowed':
>> include/linux/compiler.h:123:18: error: implicit declaration of function 'tsk_is_oom_victim' [-Werror=implicit-function-declaration]
       static struct ftrace_likely_data  \
                     ^
   include/linux/compiler.h:156:30: note: in definition of macro '__trace_if'
     if (__builtin_constant_p(!!(cond)) ? !!(cond) :   \
                                 ^
   kernel/cgroup/cpuset.c:2546:2: note: in expansion of macro 'if'
     if (unlikely(tsk_is_oom_victim(current)))
     ^
   include/linux/compiler.h:146:24: note: in expansion of macro '__branch_check__'
    #  define unlikely(x) (__branch_check__(x, 0, __builtin_constant_p(x)))
                           ^
   kernel/cgroup/cpuset.c:2546:6: note: in expansion of macro 'unlikely'
     if (unlikely(tsk_is_oom_victim(current)))
         ^
   cc1: some warnings being treated as errors
--
   In file included from include/linux/ioport.h:12:0,
                    from include/linux/device.h:16,
                    from include/linux/node.h:17,
                    from include/linux/cpu.h:16,
                    from kernel//cgroup/cpuset.c:25:
   kernel//cgroup/cpuset.c: In function '__cpuset_node_allowed':
>> include/linux/compiler.h:123:18: error: implicit declaration of function 'tsk_is_oom_victim' [-Werror=implicit-function-declaration]
       static struct ftrace_likely_data  \
                     ^
   include/linux/compiler.h:156:30: note: in definition of macro '__trace_if'
     if (__builtin_constant_p(!!(cond)) ? !!(cond) :   \
                                 ^
   kernel//cgroup/cpuset.c:2546:2: note: in expansion of macro 'if'
     if (unlikely(tsk_is_oom_victim(current)))
     ^
   include/linux/compiler.h:146:24: note: in expansion of macro '__branch_check__'
    #  define unlikely(x) (__branch_check__(x, 0, __builtin_constant_p(x)))
                           ^
   kernel//cgroup/cpuset.c:2546:6: note: in expansion of macro 'unlikely'
     if (unlikely(tsk_is_oom_victim(current)))
         ^
   cc1: some warnings being treated as errors

vim +/tsk_is_oom_victim +123 include/linux/compiler.h

1f0d69a9 Steven Rostedt          2008-11-12  120  
d45ae1f7 Steven Rostedt (VMware  2017-01-17  121) #define __branch_check__(x, expect, is_constant) ({			\
1f0d69a9 Steven Rostedt          2008-11-12  122  			int ______r;					\
134e6a03 Steven Rostedt (VMware  2017-01-19 @123) 			static struct ftrace_likely_data		\
1f0d69a9 Steven Rostedt          2008-11-12  124  				__attribute__((__aligned__(4)))		\
45b79749 Steven Rostedt          2008-11-21  125  				__attribute__((section("_ftrace_annotated_branch"))) \
1f0d69a9 Steven Rostedt          2008-11-12  126  				______f = {				\
134e6a03 Steven Rostedt (VMware  2017-01-19  127) 				.data.func = __func__,			\
134e6a03 Steven Rostedt (VMware  2017-01-19  128) 				.data.file = __FILE__,			\
134e6a03 Steven Rostedt (VMware  2017-01-19  129) 				.data.line = __LINE__,			\
1f0d69a9 Steven Rostedt          2008-11-12  130  			};						\
d45ae1f7 Steven Rostedt (VMware  2017-01-17  131) 			______r = __builtin_expect(!!(x), expect);	\
d45ae1f7 Steven Rostedt (VMware  2017-01-17  132) 			ftrace_likely_update(&______f, ______r,		\
d45ae1f7 Steven Rostedt (VMware  2017-01-17  133) 					     expect, is_constant);	\
1f0d69a9 Steven Rostedt          2008-11-12  134  			______r;					\
1f0d69a9 Steven Rostedt          2008-11-12  135  		})
1f0d69a9 Steven Rostedt          2008-11-12  136  

:::::: The code at line 123 was first introduced by commit
:::::: 134e6a034cb004ed5acd3048792de70ced1c6cf5 tracing: Show number of constants profiled in likely profiler

:::::: TO: Steven Rostedt (VMware) <rostedt@goodmis.org>
:::::: CC: Steven Rostedt (VMware) <rostedt@goodmis.org>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 28292 bytes --]


* Re: [PATCH 2/2] mm: replace TIF_MEMDIE checks by tsk_is_oom_victim
  2017-07-29  8:33   ` kbuild test robot
@ 2017-07-31  6:46       ` Michal Hocko
  0 siblings, 0 replies; 55+ messages in thread
From: Michal Hocko @ 2017-07-31  6:46 UTC (permalink / raw)
  To: kbuild test robot
  Cc: kbuild-all, Andrew Morton, David Rientjes, Johannes Weiner,
	Roman Gushchin, Tetsuo Handa, linux-mm, LKML

On Sat 29-07-17 16:33:35, kbuild test robot wrote:
> Hi Michal,
> 
> [auto build test ERROR on cgroup/for-next]
> [also build test ERROR on v4.13-rc2 next-20170728]
> [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
> 
> url:    https://github.com/0day-ci/linux/commits/Michal-Hocko/mm-oom-do-not-rely-on-TIF_MEMDIE-for-memory-reserves-access/20170728-101955
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git for-next
> config: i386-randconfig-c0-07291424 (attached as .config)
> compiler: gcc-4.9 (Debian 4.9.4-2) 4.9.4
> reproduce:
>         # save the attached .config to linux build tree
>         make ARCH=i386 
> 
> All errors (new ones prefixed by >>):
> 
>    In file included from include/linux/ioport.h:12:0,
>                     from include/linux/device.h:16,
>                     from include/linux/node.h:17,
>                     from include/linux/cpu.h:16,
>                     from kernel/cgroup/cpuset.c:25:
>    kernel/cgroup/cpuset.c: In function '__cpuset_node_allowed':
> >> include/linux/compiler.h:123:18: error: implicit declaration of function 'tsk_is_oom_victim' [-Werror=implicit-function-declaration]

Thanks for the report. We need this:
---
commit 638b5ab1ed275f23b52a71941b66c8966d332cd7
Author: Michal Hocko <mhocko@suse.com>
Date:   Mon Jul 31 08:45:53 2017 +0200

    fold me
    
    - fix implicit declaration of function 'tsk_is_oom_victim' reported by
      0day
    
    Signed-off-by: Michal Hocko <mhocko@suse.com>

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 1cc53dff0d94..734ae4fa9775 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -56,6 +56,7 @@
 #include <linux/time64.h>
 #include <linux/backing-dev.h>
 #include <linux/sort.h>
+#include <linux/oom.h>
 
 #include <linux/uaccess.h>
 #include <linux/atomic.h>
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [PATCH 0/2] mm, oom: do not grant oom victims full memory reserves access
  2017-07-27  9:03 ` Michal Hocko
@ 2017-08-01 12:16   ` Michal Hocko
  -1 siblings, 0 replies; 55+ messages in thread
From: Michal Hocko @ 2017-08-01 12:16 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Rientjes, Johannes Weiner, Roman Gushchin, Tetsuo Handa,
	linux-mm, LKML

On Thu 27-07-17 11:03:55, Michal Hocko wrote:
> Hi,
> this is a part of a larger series I posted back in Oct last year [1]. I
> have dropped patch 3 because it was incorrect and patch 4 is not
> applicable without it.
> 
> The primary reason to apply patch 1 is to remove a risk of the complete
> memory depletion by oom victims. While this is a theoretical risk right
> now there is a demand for memcg aware oom killer which might kill all
> processes inside a memcg which can be a lot of tasks. That would make
> the risk quite real.
> 
> This issue is addressed by limiting access to memory reserves. We no
> longer use TIF_MEMDIE to grant the access and use tsk_is_oom_victim
> instead. See Patch 1 for more details. Patch 2 is a trivial follow up
> cleanup.

Any comments, concerns? Can we merge it?
 
> I would still like to get rid of TIF_MEMDIE completely but I do not have
> time to do it now and it is not a pressing issue.
> 
> [1] http://lkml.kernel.org/r/20161004090009.7974-1-mhocko@kernel.org

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 0/2] mm, oom: do not grant oom victims full memory reserves access
  2017-08-01 12:16   ` Michal Hocko
@ 2017-08-01 12:23     ` Roman Gushchin
  -1 siblings, 0 replies; 55+ messages in thread
From: Roman Gushchin @ 2017-08-01 12:23 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, David Rientjes, Johannes Weiner, Tetsuo Handa,
	linux-mm, LKML

On Tue, Aug 01, 2017 at 02:16:44PM +0200, Michal Hocko wrote:
> On Thu 27-07-17 11:03:55, Michal Hocko wrote:
> > Hi,
> > this is a part of a larger series I posted back in Oct last year [1]. I
> > have dropped patch 3 because it was incorrect and patch 4 is not
> > applicable without it.
> > 
> > The primary reason to apply patch 1 is to remove a risk of the complete
> > memory depletion by oom victims. While this is a theoretical risk right
> > now there is a demand for memcg aware oom killer which might kill all
> > processes inside a memcg which can be a lot of tasks. That would make
> > the risk quite real.
> > 
> > This issue is addressed by limiting access to memory reserves. We no
> > longer use TIF_MEMDIE to grant the access and use tsk_is_oom_victim
> > instead. See Patch 1 for more details. Patch 2 is a trivial follow up
> > cleanup.
> 
> Any comments, concerns? Can we merge it?

I've rebased the cgroup-aware OOM killer and run some tests.
Everything works well.

Thanks!

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 0/2] mm, oom: do not grant oom victims full memory reserves access
  2017-08-01 12:23     ` Roman Gushchin
@ 2017-08-01 12:29       ` Michal Hocko
  -1 siblings, 0 replies; 55+ messages in thread
From: Michal Hocko @ 2017-08-01 12:29 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, David Rientjes, Johannes Weiner, Tetsuo Handa,
	linux-mm, LKML

On Tue 01-08-17 13:23:44, Roman Gushchin wrote:
> On Tue, Aug 01, 2017 at 02:16:44PM +0200, Michal Hocko wrote:
> > On Thu 27-07-17 11:03:55, Michal Hocko wrote:
> > > Hi,
> > > this is a part of a larger series I posted back in Oct last year [1]. I
> > > have dropped patch 3 because it was incorrect and patch 4 is not
> > > applicable without it.
> > > 
> > > The primary reason to apply patch 1 is to remove a risk of the complete
> > > memory depletion by oom victims. While this is a theoretical risk right
> > > now there is a demand for memcg aware oom killer which might kill all
> > > processes inside a memcg which can be a lot of tasks. That would make
> > > the risk quite real.
> > > 
> > > This issue is addressed by limiting access to memory reserves. We no
> > > longer use TIF_MEMDIE to grant the access and use tsk_is_oom_victim
> > > instead. See Patch 1 for more details. Patch 2 is a trivial follow up
> > > cleanup.
> > 
> > Any comments, concerns? Can we merge it?
> 
> I've rebased the cgroup-aware OOM killer and ran some tests.
> Everything works well.

Thanks for your testing. Can I assume your Tested-by?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 0/2] mm, oom: do not grant oom victims full memory reserves access
  2017-08-01 12:29       ` Michal Hocko
@ 2017-08-01 12:42         ` Roman Gushchin
  -1 siblings, 0 replies; 55+ messages in thread
From: Roman Gushchin @ 2017-08-01 12:42 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, David Rientjes, Johannes Weiner, Tetsuo Handa,
	linux-mm, LKML

On Tue, Aug 01, 2017 at 02:29:05PM +0200, Michal Hocko wrote:
> On Tue 01-08-17 13:23:44, Roman Gushchin wrote:
> > On Tue, Aug 01, 2017 at 02:16:44PM +0200, Michal Hocko wrote:
> > > On Thu 27-07-17 11:03:55, Michal Hocko wrote:
> > > > Hi,
> > > > this is a part of a larger series I posted back in Oct last year [1]. I
> > > > have dropped patch 3 because it was incorrect and patch 4 is not
> > > > applicable without it.
> > > > 
> > > > The primary reason to apply patch 1 is to remove a risk of the complete
> > > > memory depletion by oom victims. While this is a theoretical risk right
> > > > now there is a demand for memcg aware oom killer which might kill all
> > > > processes inside a memcg which can be a lot of tasks. That would make
> > > > the risk quite real.
> > > > 
> > > > This issue is addressed by limiting access to memory reserves. We no
> > > > longer use TIF_MEMDIE to grant the access and use tsk_is_oom_victim
> > > > instead. See Patch 1 for more details. Patch 2 is a trivial follow up
> > > > cleanup.
> > > 
> > > Any comments, concerns? Can we merge it?
> > 
> > I've rebased the cgroup-aware OOM killer and ran some tests.
> > Everything works well.
> 
> Thanks for your testing. Can I assume your Tested-by?

Sure.

I wonder if we can get rid of TIF_MEMDIE completely if we count OOM
victims on a per-signal-struct rather than a per-thread basis? Say,
assign oom_mm using cmpxchg, and call exit_oom_victim() from
__exit_signal()? __thaw_task() can be called from mark_oom_victim()
unconditionally.
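
Something like this rough, untested sketch (hypothetical code, just to
illustrate the idea, not a tested patch):

/*
 * Hypothetical: mark victims on the signal_struct only and drop the
 * per-thread TIF_MEMDIE flag entirely.
 */
static void mark_oom_victim(struct task_struct *tsk)
{
	struct mm_struct *mm = tsk->mm;

	/* the first marker wins; oom_mm lives as long as signal_struct */
	if (!cmpxchg(&tsk->signal->oom_mm, NULL, mm))
		mmgrab(mm);

	/* no per-thread flag to test anymore, so thaw unconditionally */
	__thaw_task(tsk);
}

/*
 * ... and __exit_signal() would clear the victim state once the last
 * thread of the process is released.
 */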

Do you see any problems with this approach?

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 0/2] mm, oom: do not grant oom victims full memory reserves access
  2017-08-01 12:42         ` Roman Gushchin
@ 2017-08-01 12:54           ` Michal Hocko
  -1 siblings, 0 replies; 55+ messages in thread
From: Michal Hocko @ 2017-08-01 12:54 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, David Rientjes, Johannes Weiner, Tetsuo Handa,
	linux-mm, LKML

On Tue 01-08-17 13:42:38, Roman Gushchin wrote:
> On Tue, Aug 01, 2017 at 02:29:05PM +0200, Michal Hocko wrote:
> > On Tue 01-08-17 13:23:44, Roman Gushchin wrote:
> > > On Tue, Aug 01, 2017 at 02:16:44PM +0200, Michal Hocko wrote:
> > > > On Thu 27-07-17 11:03:55, Michal Hocko wrote:
> > > > > Hi,
> > > > > this is a part of a larger series I posted back in Oct last year [1]. I
> > > > > have dropped patch 3 because it was incorrect and patch 4 is not
> > > > > applicable without it.
> > > > > 
> > > > > The primary reason to apply patch 1 is to remove a risk of the complete
> > > > > memory depletion by oom victims. While this is a theoretical risk right
> > > > > now there is a demand for memcg aware oom killer which might kill all
> > > > > processes inside a memcg which can be a lot of tasks. That would make
> > > > > the risk quite real.
> > > > > 
> > > > > This issue is addressed by limiting access to memory reserves. We no
> > > > > longer use TIF_MEMDIE to grant the access and use tsk_is_oom_victim
> > > > > instead. See Patch 1 for more details. Patch 2 is a trivial follow up
> > > > > cleanup.
> > > > 
> > > > Any comments, concerns? Can we merge it?
> > > 
> > > I've rebased the cgroup-aware OOM killer and ran some tests.
> > > Everything works well.
> > 
> > Thanks for your testing. Can I assume your Tested-by?
> 
> Sure.

Thanks!

> I wonder if we can get rid of TIF_MEMDIE completely if we count OOM
> victims on a per-signal-struct rather than a per-thread basis? Say,
> assign oom_mm using cmpxchg, and call exit_oom_victim() from
> __exit_signal()? __thaw_task() can be called from mark_oom_victim()
> unconditionally.
> 
> Do you see any problems with this approach?

Ohh, I wish we could do that. All my previous attempts failed though. I
have always hit the problem of telling that the last thread of the
process is exiting, which is what decides when to call exit_oom_victim
and release the oom disable barrier. Maybe things have changed since I
tried last time, but this is tricky code. I will certainly get back to
it some day, but not likely anytime soon.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 1/2] mm, oom: do not rely on TIF_MEMDIE for memory reserves access
  2017-07-27  9:03   ` Michal Hocko
@ 2017-08-01 15:30     ` Tetsuo Handa
  -1 siblings, 0 replies; 55+ messages in thread
From: Tetsuo Handa @ 2017-08-01 15:30 UTC (permalink / raw)
  To: mhocko, akpm; +Cc: rientjes, hannes, guro, linux-mm, linux-kernel, mhocko

Michal Hocko wrote:
> CONFIG_MMU=n doesn't have oom reaper so let's stick to the original
> ALLOC_NO_WATERMARKS approach but be careful because they still might
> deplete all the memory reserves so keep the semantic as close to the
> original implementation as possible and give them access to memory
> reserves only up to exit_mm (when tsk->mm is cleared) rather than while
> tsk_is_oom_victim which is until signal struct is gone.

Currently memory allocations from __mmput() can use memory reserves but
this patch changes __mmput() not to use memory reserves. You say "keep
the semantic as close to the original implementation as possible" but
this change is not guaranteed to be safe.

> @@ -2943,10 +2943,19 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
>  	 * the high-atomic reserves. This will over-estimate the size of the
>  	 * atomic reserve but it avoids a search.
>  	 */
> -	if (likely(!alloc_harder))
> +	if (likely(!alloc_harder)) {
>  		free_pages -= z->nr_reserved_highatomic;
> -	else
> -		min -= min / 4;
> +	} else {
> +		/*
> +		 * OOM victims can try even harder than normal ALLOC_HARDER
> +		 * users
> +		 */
> +		if (alloc_flags & ALLOC_OOM)

ALLOC_OOM is ALLOC_NO_WATERMARKS if CONFIG_MMU=n.
I wonder whether this test makes sense for ALLOC_NO_WATERMARKS.

> +			min -= min / 2;
> +		else
> +			min -= min / 4;
> +	}
> +
>  
>  #ifdef CONFIG_CMA
>  	/* If allocation can't use CMA areas don't use free CMA pages */
> @@ -3603,6 +3612,22 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
>  	return alloc_flags;
>  }
>  
> +static bool oom_reserves_allowed(struct task_struct *tsk)
> +{
> +	if (!tsk_is_oom_victim(tsk))
> +		return false;
> +
> +	/*
> +	 * !MMU doesn't have oom reaper so we shouldn't risk the memory reserves
> +	 * depletion and shouldn't give access to memory reserves passed the
> +	 * exit_mm
> +	 */
> +	if (!IS_ENABLED(CONFIG_MMU) && !tsk->mm)
> +		return false;

Branching based on CONFIG_MMU is ugly. I suggest timeout-based selection
of the next OOM victim if CONFIG_MMU=n. Then we would no longer need to
worry about memory reserves depletion and we could treat both
configurations equally.

> +
> +	return true;
> +}
> +
>  bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
>  {
>  	if (unlikely(gfp_mask & __GFP_NOMEMALLOC))

> @@ -3770,6 +3795,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	unsigned long alloc_start = jiffies;
>  	unsigned int stall_timeout = 10 * HZ;
>  	unsigned int cpuset_mems_cookie;
> +	bool reserves;
>  
>  	/*
>  	 * In the slowpath, we sanity check order to avoid ever trying to
> @@ -3875,15 +3901,24 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	if (gfp_mask & __GFP_KSWAPD_RECLAIM)
>  		wake_all_kswapds(order, ac);
>  
> -	if (gfp_pfmemalloc_allowed(gfp_mask))
> -		alloc_flags = ALLOC_NO_WATERMARKS;
> +	/*
> +	 * Distinguish requests which really need access to whole memory
> +	 * reserves from oom victims which can live with their own reserve
> +	 */
> +	reserves = gfp_pfmemalloc_allowed(gfp_mask);
> +	if (reserves) {
> +		if (tsk_is_oom_victim(current))
> +			alloc_flags = ALLOC_OOM;

If reserves == true due to reasons other than tsk_is_oom_victim(current) == true
(e.g. __GFP_MEMALLOC), why reduce it?

> +		else
> +			alloc_flags = ALLOC_NO_WATERMARKS;
> +	}

If CONFIG_MMU=n, doing this test is silly.

if (tsk_is_oom_victim(current))
	alloc_flags = ALLOC_NO_WATERMARKS;
else
	alloc_flags = ALLOC_NO_WATERMARKS;

>  
>  	/*
>  	 * Reset the zonelist iterators if memory policies can be ignored.
>  	 * These allocations are high priority and system rather than user
>  	 * orientated.
>  	 */
> -	if (!(alloc_flags & ALLOC_CPUSET) || (alloc_flags & ALLOC_NO_WATERMARKS)) {
> +	if (!(alloc_flags & ALLOC_CPUSET) || reserves) {
>  		ac->zonelist = node_zonelist(numa_node_id(), gfp_mask);
>  		ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
>  					ac->high_zoneidx, ac->nodemask);
> @@ -3960,7 +3995,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  		goto got_pg;
>  
>  	/* Avoid allocations with no watermarks from looping endlessly */
> -	if (test_thread_flag(TIF_MEMDIE) &&
> +	if (tsk_is_oom_victim(current) &&
>  	    (alloc_flags == ALLOC_NO_WATERMARKS ||
>  	     (gfp_mask & __GFP_NOMEMALLOC)))
>  		goto nopage;

And you are silently changing the rule to "!costly __GFP_DIRECT_RECLAIM
allocations never fail (even when selected for OOM victims)" (i.e. updating
the too-small-to-fail memory allocation rule) by testing
alloc_flags == ALLOC_NO_WATERMARKS if CONFIG_MMU=y.

Applying this change might disturb memory allocation behavior. I don't like this patch.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 1/2] mm, oom: do not rely on TIF_MEMDIE for memory reserves access
  2017-08-01 15:30     ` Tetsuo Handa
@ 2017-08-01 16:52       ` Michal Hocko
  -1 siblings, 0 replies; 55+ messages in thread
From: Michal Hocko @ 2017-08-01 16:52 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: akpm, rientjes, hannes, guro, linux-mm, linux-kernel

On Wed 02-08-17 00:30:33, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > CONFIG_MMU=n doesn't have oom reaper so let's stick to the original
> > ALLOC_NO_WATERMARKS approach but be careful because they still might
> > deplete all the memory reserves so keep the semantic as close to the
> > original implementation as possible and give them access to memory
> > reserves only up to exit_mm (when tsk->mm is cleared) rather than while
> > tsk_is_oom_victim which is until signal struct is gone.
> 
> Currently memory allocations from __mmput() can use memory reserves but
> this patch changes __mmput() not to use memory reserves. You say "keep
> the semantic as close to the original implementation as possible" but
> this change is not guaranteed to be safe.

Yeah, it cannot. That's why I've said as close as possible rather than
equivalent. On the other hand I am wondering whether you have anything
specific in mind or whether this is just a formalistic nitpicking^Wremark.

> > @@ -2943,10 +2943,19 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
> >  	 * the high-atomic reserves. This will over-estimate the size of the
> >  	 * atomic reserve but it avoids a search.
> >  	 */
> > -	if (likely(!alloc_harder))
> > +	if (likely(!alloc_harder)) {
> >  		free_pages -= z->nr_reserved_highatomic;
> > -	else
> > -		min -= min / 4;
> > +	} else {
> > +		/*
> > +		 * OOM victims can try even harder than normal ALLOC_HARDER
> > +		 * users
> > +		 */
> > +		if (alloc_flags & ALLOC_OOM)
> 
> ALLOC_OOM is ALLOC_NO_WATERMARKS if CONFIG_MMU=n.
> I wonder whether this test makes sense for ALLOC_NO_WATERMARKS.

Yeah, it would be pointless because get_page_from_freelist will then
ignore the result of the watermark check for ALLOC_NO_WATERMARKS. It is
not harmful though. I didn't find a much better way without making the
code harder to read. Do you have any suggestions?
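
For reference, the relevant part of get_page_from_freelist, paraphrased
with details elided:

	if (!zone_watermark_fast(zone, order, mark,
				 ac_classzone_idx(ac), alloc_flags)) {
		[...]
		/* the watermark result is simply ignored here */
		if (alloc_flags & ALLOC_NO_WATERMARKS)
			goto try_this_zone;
		[...]
	}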

> > +			min -= min / 2;
> > +		else
> > +			min -= min / 4;
> > +	}
> > +
> >  
> >  #ifdef CONFIG_CMA
> >  	/* If allocation can't use CMA areas don't use free CMA pages */
> > @@ -3603,6 +3612,22 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
> >  	return alloc_flags;
> >  }
> >  
> > +static bool oom_reserves_allowed(struct task_struct *tsk)
> > +{
> > +	if (!tsk_is_oom_victim(tsk))
> > +		return false;
> > +
> > +	/*
> > +	 * !MMU doesn't have oom reaper so we shouldn't risk the memory reserves
> > +	 * depletion and shouldn't give access to memory reserves passed the
> > +	 * exit_mm
> > +	 */
> > +	if (!IS_ENABLED(CONFIG_MMU) && !tsk->mm)
> > +		return false;
> 
> Branching based on CONFIG_MMU is ugly. I suggest timeout-based selection
> of the next OOM victim if CONFIG_MMU=n.

I suggest we do not argue about nommu without actually optimizing for or
fixing nommu, which we are not doing here. I am not even sure memory
reserves can ever be depleted for that config.

Anyway I will go with the following instead:
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5e5911f40014..3510e06b3bf3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3618,11 +3618,10 @@ static bool oom_reserves_allowed(struct task_struct *tsk)
 		return false;
 
 	/*
-	 * !MMU doesn't have oom reaper so we shouldn't risk the memory reserves
-	 * depletion and shouldn't give access to memory reserves passed the
-	 * exit_mm
+	 * !MMU doesn't have oom reaper so give access to memory reserves
+	 * only to the thread with TIF_MEMDIE set
 	 */
-	if (!IS_ENABLED(CONFIG_MMU) && !tsk->mm)
+	if (!IS_ENABLED(CONFIG_MMU) && !test_thread_flag(TIF_MEMDIE))
 		return false;
 
 	return true;

This should preserve the original semantics. Is that acceptable to you?

> > @@ -3875,15 +3901,24 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> >  	if (gfp_mask & __GFP_KSWAPD_RECLAIM)
> >  		wake_all_kswapds(order, ac);
> >  
> > -	if (gfp_pfmemalloc_allowed(gfp_mask))
> > -		alloc_flags = ALLOC_NO_WATERMARKS;
> > +	/*
> > +	 * Distinguish requests which really need access to whole memory
> > +	 * reserves from oom victims which can live with their own reserve
> > +	 */
> > +	reserves = gfp_pfmemalloc_allowed(gfp_mask);
> > +	if (reserves) {
> > +		if (tsk_is_oom_victim(current))
> > +			alloc_flags = ALLOC_OOM;
> 
> If reserves == true due to reasons other than tsk_is_oom_victim(current) == true
> (e.g. __GFP_MEMALLOC), why reduce it?

Well the comment above tries to explain. I assume that the oom victim is
special here. a) it is on the way to die and b) we know that something
will be freeing memory in the background so I assume this is acceptable.
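
To put numbers on it (illustrative only): with a zone min watermark of
8192 pages, a plain ALLOC_HARDER request must keep free pages above
8192 - 8192/4 = 6144, an ALLOC_OOM victim may go down to
8192 - 8192/2 = 4096, and ALLOC_NO_WATERMARKS skips the watermark check
entirely.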
 
> > +		else
> > +			alloc_flags = ALLOC_NO_WATERMARKS;
> > +	}
> 
> If CONFIG_MMU=n, doing this test is silly.
> 
> if (tsk_is_oom_victim(current))
> 	alloc_flags = ALLOC_NO_WATERMARKS;
> else
> 	alloc_flags = ALLOC_NO_WATERMARKS;

I am pretty sure any compiler can see the outcome is the same, so the
check would be dropped in that case. I primarily wanted to avoid
additional ifdefery. I am open to suggestions for a better layout
though.
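
The open coded alternative at the call site would have to look something
like this (untested):

#ifdef CONFIG_MMU
	if (tsk_is_oom_victim(current))
		alloc_flags = ALLOC_OOM;
	else
		alloc_flags = ALLOC_NO_WATERMARKS;
#else
	alloc_flags = ALLOC_NO_WATERMARKS;
#endif

Defining ALLOC_OOM as ALLOC_NO_WATERMARKS for !MMU in mm/internal.h keeps
that out of the allocator slow path.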

> > @@ -3960,7 +3995,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> >  		goto got_pg;
> >  
> >  	/* Avoid allocations with no watermarks from looping endlessly */
> > -	if (test_thread_flag(TIF_MEMDIE) &&
> > +	if (tsk_is_oom_victim(current) &&
> >  	    (alloc_flags == ALLOC_NO_WATERMARKS ||
> >  	     (gfp_mask & __GFP_NOMEMALLOC)))
> >  		goto nopage;
> 
> And you are silently changing the rule to "!costly __GFP_DIRECT_RECLAIM
> allocations never fail (even when selected for OOM victims)" (i.e. updating
> the too-small-to-fail memory allocation rule) by testing
> alloc_flags == ALLOC_NO_WATERMARKS if CONFIG_MMU=y.

Oops, that is an oversight during the rebase.

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5e5911f40014..6593ff9de1d9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3996,7 +3996,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 
 	/* Avoid allocations with no watermarks from looping endlessly */
 	if (tsk_is_oom_victim(current) &&
-	    (alloc_flags == ALLOC_NO_WATERMARKS ||
+	    (alloc_flags == ALLOC_OOM ||
 	     (gfp_mask & __GFP_NOMEMALLOC)))
 		goto nopage;
 
Does this look better?
 
> Applying this change might disturb memory allocation behavior. I don't
> like this patch.

Do you see anything apart from nommu that would be an unfixable
roadblock?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [PATCH 1/2] mm, oom: do not rely on TIF_MEMDIE for memory reserves access
  2017-08-01 16:52       ` Michal Hocko
@ 2017-08-02  6:10         ` Michal Hocko
  -1 siblings, 0 replies; 55+ messages in thread
From: Michal Hocko @ 2017-08-02  6:10 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: akpm, rientjes, hannes, guro, linux-mm, linux-kernel

On Tue 01-08-17 18:52:42, Michal Hocko wrote:
> On Wed 02-08-17 00:30:33, Tetsuo Handa wrote:
[...]
> > > -	if (gfp_pfmemalloc_allowed(gfp_mask))
> > > -		alloc_flags = ALLOC_NO_WATERMARKS;
> > > +	/*
> > > +	 * Distinguish requests which really need access to whole memory
> > > +	 * reserves from oom victims which can live with their own reserve
> > > +	 */
> > > +	reserves = gfp_pfmemalloc_allowed(gfp_mask);
> > > +	if (reserves) {
> > > +		if (tsk_is_oom_victim(current))
> > > +			alloc_flags = ALLOC_OOM;
> > 
> > If reserves == true due to reasons other than tsk_is_oom_victim(current) == true
> > (e.g. __GFP_MEMALLOC), why reduce it?
> 
> Well the comment above tries to explain. I assume that the oom victim is
> special here. a) it is on the way to die and b) we know that something
> will be freeing memory in the background so I assume this is acceptable.

I was thinking about this some more. It is not that hard to achieve the
original semantics. The code is slightly uglier but acceptable, I guess.
What do you think about the following?
---
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3510e06b3bf3..7ae0f6d45614 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3627,21 +3627,31 @@ static bool oom_reserves_allowed(struct task_struct *tsk)
 	return true;
 }
 
-bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
+/*
+ * Distinguish requests which really need access to full memory
+ * reserves from oom victims which can live with a portion of it
+ */
+static inline int __gfp_pfmemalloc_flags(gfp_t gfp_mask)
 {
 	if (unlikely(gfp_mask & __GFP_NOMEMALLOC))
-		return false;
-
+		return 0;
 	if (gfp_mask & __GFP_MEMALLOC)
-		return true;
+		return ALLOC_NO_WATERMARKS;
 	if (in_serving_softirq() && (current->flags & PF_MEMALLOC))
-		return true;
-	if (!in_interrupt() &&
-			((current->flags & PF_MEMALLOC) ||
-			 oom_reserves_allowed(current)))
-		return true;
+		return ALLOC_NO_WATERMARKS;
+	if (!in_interrupt()) {
+		if (current->flags & PF_MEMALLOC)
+			return ALLOC_NO_WATERMARKS;
+		else if (oom_reserves_allowed(current))
+			return ALLOC_OOM;
+	}
 
-	return false;
+	return 0;
+}
+
+bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
+{
+	return __gfp_pfmemalloc_flags(gfp_mask) > 0;
 }
 
 /*
@@ -3794,7 +3804,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	unsigned long alloc_start = jiffies;
 	unsigned int stall_timeout = 10 * HZ;
 	unsigned int cpuset_mems_cookie;
-	bool reserves;
+	int reserves;
 
 	/*
 	 * In the slowpath, we sanity check order to avoid ever trying to
@@ -3900,17 +3910,9 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	if (gfp_mask & __GFP_KSWAPD_RECLAIM)
 		wake_all_kswapds(order, ac);
 
-	/*
-	 * Distinguish requests which really need access to whole memory
-	 * reserves from oom victims which can live with their own reserve
-	 */
-	reserves = gfp_pfmemalloc_allowed(gfp_mask);
-	if (reserves) {
-		if (tsk_is_oom_victim(current))
-			alloc_flags = ALLOC_OOM;
-		else
-			alloc_flags = ALLOC_NO_WATERMARKS;
-	}
+	reserves = __gfp_pfmemalloc_flags(gfp_mask);
+	if (reserves)
+		alloc_flags = reserves;
 
 	/*
 	 * Reset the zonelist iterators if memory policies can be ignored.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH v2 1/2] mm, oom: do not rely on TIF_MEMDIE for memory reserves access
  2017-07-27  9:03   ` Michal Hocko
@ 2017-08-02  8:29     ` Michal Hocko
  -1 siblings, 0 replies; 55+ messages in thread
From: Michal Hocko @ 2017-08-02  8:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Rientjes, Johannes Weiner, Roman Gushchin, Tetsuo Handa,
	linux-mm, LKML

With follow-up changes based on Tetsuo's review. I have dropped
Roman's Tested-by because at least the allocation exit behavior for oom
victims has changed.
---
commit b1cfab26a98d48d69dc106d0e0ab616728b04992
Author: Michal Hocko <mhocko@suse.com>
Date:   Wed Jul 26 16:38:40 2017 +0200

    mm, oom: do not rely on TIF_MEMDIE for memory reserves access
    
    For ages we have been relying on TIF_MEMDIE thread flag to mark OOM
    victims and then, among other things, to give these threads full
    access to memory reserves. There are a few shortcomings of this
    implementation, though.
    
    The first and most serious one is that full access to memory reserves
    is quite dangerous because it leaves no safety room for the system to
    operate and potentially take last emergency steps to move on.
    
    Secondly, this flag is per task_struct while the OOM killer operates
    on mm_struct granularity, so all processes sharing the given mm are
    killed. Giving full access to all these task_structs could lead to a
    quick depletion of the memory reserves. We have tried to reduce this
    risk by giving TIF_MEMDIE only to the main thread and the currently
    allocating task, but that doesn't really solve the problem while it
    surely opens up room for corner cases - e.g. GFP_NO{FS,IO} requests
    might loop inside the allocator without access to memory reserves
    because a particular thread was not the group leader.
    
    Now that we have the oom reaper and all oom victims are reapable after
    1b51e65eab64 ("oom, oom_reaper: allow to reap mm shared by the
    kthreads") we can be more conservative and grant only partial access to
    memory reserves because there are reasonable chances of parallel memory
    freeing. We still want some access to reserves because we do not want
    other consumers to eat up the victim's freed memory. oom victims will
    still contend with __GFP_HIGH users but those shouldn't be so
    aggressive as to starve oom victims completely.
    
    Introduce an ALLOC_OOM flag and give all tsk_is_oom_victim tasks access
    to half of the reserves. This makes the access to reserves independent
    of which task has passed through mark_oom_victim. Also drop any usage
    of TIF_MEMDIE from the page allocator proper and replace it by
    tsk_is_oom_victim as well, which will finally make page_alloc.c
    completely TIF_MEMDIE free.
    
    CONFIG_MMU=n doesn't have the oom reaper, so let's stick to the
    original ALLOC_NO_WATERMARKS approach.
    
    There is a demand to make the oom killer memcg aware, which will imply
    many tasks killed at once. This change will allow such a use case
    without worrying about complete depletion of memory reserves.
    
    Changes since v1
    - do not play tricks with nommu and grant access to memory reserves as
      long as TIF_MEMDIE is set
    - break out from allocation properly for oom victims as per Tetsuo
    - distinguish oom victims from other consumers of memory reserves in
      __gfp_pfmemalloc_flags - per Tetsuo
    
    Signed-off-by: Michal Hocko <mhocko@suse.com>

diff --git a/mm/internal.h b/mm/internal.h
index 24d88f084705..1ebcb1ed01b5 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -480,6 +480,17 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
 /* Mask to get the watermark bits */
 #define ALLOC_WMARK_MASK	(ALLOC_NO_WATERMARKS-1)
 
+/*
+ * Only MMU archs have async oom victim reclaim - aka oom_reaper so we
+ * cannot assume a reduced access to memory reserves is sufficient for
+ * !MMU
+ */
+#ifdef CONFIG_MMU
+#define ALLOC_OOM		0x08
+#else
+#define ALLOC_OOM		ALLOC_NO_WATERMARKS
+#endif
+
 #define ALLOC_HARDER		0x10 /* try to alloc harder */
 #define ALLOC_HIGH		0x20 /* __GFP_HIGH set */
 #define ALLOC_CPUSET		0x40 /* check for correct cpuset */
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 9e8b4f030c1c..c9f3569a76c7 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -824,7 +824,8 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 
 	/*
 	 * If the task is already exiting, don't alarm the sysadmin or kill
-	 * its children or threads, just set TIF_MEMDIE so it can die quickly
+	 * its children or threads, just give it access to memory reserves
+	 * so it can die quickly
 	 */
 	task_lock(p);
 	if (task_will_free_mem(p)) {
@@ -889,9 +890,9 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 	count_memcg_event_mm(mm, OOM_KILL);
 
 	/*
-	 * We should send SIGKILL before setting TIF_MEMDIE in order to prevent
-	 * the OOM victim from depleting the memory reserves from the user
-	 * space under its control.
+	 * We should send SIGKILL before granting access to memory reserves
+	 * in order to prevent the OOM victim from depleting the memory
+	 * reserves from the user space under its control.
 	 */
 	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
 	mark_oom_victim(victim);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 80e4adb4c360..7ae0f6d45614 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2930,7 +2930,7 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
 {
 	long min = mark;
 	int o;
-	const bool alloc_harder = (alloc_flags & ALLOC_HARDER);
+	const bool alloc_harder = (alloc_flags & (ALLOC_HARDER|ALLOC_OOM));
 
 	/* free_pages may go negative - that's OK */
 	free_pages -= (1 << order) - 1;
@@ -2943,10 +2943,19 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
 	 * the high-atomic reserves. This will over-estimate the size of the
 	 * atomic reserve but it avoids a search.
 	 */
-	if (likely(!alloc_harder))
+	if (likely(!alloc_harder)) {
 		free_pages -= z->nr_reserved_highatomic;
-	else
-		min -= min / 4;
+	} else {
+		/*
+		 * OOM victims can try even harder than normal ALLOC_HARDER
+		 * users
+		 */
+		if (alloc_flags & ALLOC_OOM)
+			min -= min / 2;
+		else
+			min -= min / 4;
+	}
+
 
 #ifdef CONFIG_CMA
 	/* If allocation can't use CMA areas don't use free CMA pages */
@@ -3184,7 +3193,7 @@ static void warn_alloc_show_mem(gfp_t gfp_mask, nodemask_t *nodemask)
 	 * of allowed nodes.
 	 */
 	if (!(gfp_mask & __GFP_NOMEMALLOC))
-		if (test_thread_flag(TIF_MEMDIE) ||
+		if (tsk_is_oom_victim(current) ||
 		    (current->flags & (PF_MEMALLOC | PF_EXITING)))
 			filter &= ~SHOW_MEM_FILTER_NODES;
 	if (in_interrupt() || !(gfp_mask & __GFP_DIRECT_RECLAIM))
@@ -3603,21 +3612,46 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 	return alloc_flags;
 }
 
-bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
+static bool oom_reserves_allowed(struct task_struct *tsk)
 {
-	if (unlikely(gfp_mask & __GFP_NOMEMALLOC))
+	if (!tsk_is_oom_victim(tsk))
+		return false;
+
+	/*
+	 * !MMU doesn't have oom reaper so give access to memory reserves
+	 * only to the thread with TIF_MEMDIE set
+	 */
+	if (!IS_ENABLED(CONFIG_MMU) && !test_thread_flag(TIF_MEMDIE))
 		return false;
 
+	return true;
+}
+
+/*
+ * Distinguish requests which really need access to full memory
+ * reserves from oom victims which can live with a portion of it
+ */
+static inline int __gfp_pfmemalloc_flags(gfp_t gfp_mask)
+{
+	if (unlikely(gfp_mask & __GFP_NOMEMALLOC))
+		return 0;
 	if (gfp_mask & __GFP_MEMALLOC)
-		return true;
+		return ALLOC_NO_WATERMARKS;
 	if (in_serving_softirq() && (current->flags & PF_MEMALLOC))
-		return true;
-	if (!in_interrupt() &&
-			((current->flags & PF_MEMALLOC) ||
-			 unlikely(test_thread_flag(TIF_MEMDIE))))
-		return true;
+		return ALLOC_NO_WATERMARKS;
+	if (!in_interrupt()) {
+		if (current->flags & PF_MEMALLOC)
+			return ALLOC_NO_WATERMARKS;
+		else if (oom_reserves_allowed(current))
+			return ALLOC_OOM;
+	}
 
-	return false;
+	return 0;
+}
+
+bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
+{
+	return __gfp_pfmemalloc_flags(gfp_mask) > 0;
 }
 
 /*
@@ -3770,6 +3804,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	unsigned long alloc_start = jiffies;
 	unsigned int stall_timeout = 10 * HZ;
 	unsigned int cpuset_mems_cookie;
+	int reserves;
 
 	/*
 	 * In the slowpath, we sanity check order to avoid ever trying to
@@ -3875,15 +3910,16 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	if (gfp_mask & __GFP_KSWAPD_RECLAIM)
 		wake_all_kswapds(order, ac);
 
-	if (gfp_pfmemalloc_allowed(gfp_mask))
-		alloc_flags = ALLOC_NO_WATERMARKS;
+	reserves = __gfp_pfmemalloc_flags(gfp_mask);
+	if (reserves)
+		alloc_flags = reserves;
 
 	/*
 	 * Reset the zonelist iterators if memory policies can be ignored.
 	 * These allocations are high priority and system rather than user
 	 * orientated.
 	 */
-	if (!(alloc_flags & ALLOC_CPUSET) || (alloc_flags & ALLOC_NO_WATERMARKS)) {
+	if (!(alloc_flags & ALLOC_CPUSET) || reserves) {
 		ac->zonelist = node_zonelist(numa_node_id(), gfp_mask);
 		ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
 					ac->high_zoneidx, ac->nodemask);
@@ -3960,8 +3996,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		goto got_pg;
 
 	/* Avoid allocations with no watermarks from looping endlessly */
-	if (test_thread_flag(TIF_MEMDIE) &&
-	    (alloc_flags == ALLOC_NO_WATERMARKS ||
+	if (tsk_is_oom_victim(current) &&
+	    (alloc_flags == ALLOC_OOM ||
 	     (gfp_mask & __GFP_NOMEMALLOC)))
 		goto nopage;
 
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [PATCH 1/2] mm, oom: do not rely on TIF_MEMDIE for memory reserves access
  2017-08-01 16:52       ` Michal Hocko
@ 2017-08-03  1:39         ` Tetsuo Handa
  -1 siblings, 0 replies; 55+ messages in thread
From: Tetsuo Handa @ 2017-08-03  1:39 UTC (permalink / raw)
  To: mhocko; +Cc: akpm, rientjes, hannes, guro, linux-mm, linux-kernel

Michal Hocko wrote:
> On Wed 02-08-17 00:30:33, Tetsuo Handa wrote:
> > > @@ -3603,6 +3612,22 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
> > >  	return alloc_flags;
> > >  }
> > >  
> > > +static bool oom_reserves_allowed(struct task_struct *tsk)
> > > +{
> > > +	if (!tsk_is_oom_victim(tsk))
> > > +		return false;
> > > +
> > > +	/*
> > > +	 * !MMU doesn't have oom reaper so we shouldn't risk the memory reserves
> > > +	 * depletion and shouldn't give access to memory reserves passed the
> > > +	 * exit_mm
> > > +	 */
> > > +	if (!IS_ENABLED(CONFIG_MMU) && !tsk->mm)
> > > +		return false;
> > 
> > Branching based on CONFIG_MMU is ugly. I suggest timeout based next OOM
> > victim selection if CONFIG_MMU=n.
> 
> I suggest we do not argue about nommu without actually optimizing for or
> fixing nommu which we are not here. I am even not sure memory reserves
> can ever be depleted for that config.

I don't think memory reserves can be depleted in a CONFIG_MMU=n environment.
But the reason the OOM reaper was introduced is not limited to handling
depletion of memory reserves. The OOM reaper was introduced because
OOM victims might get stuck indirectly waiting for other threads doing
memory allocation. You said

  > Yes, exit_aio is the only blocking call I know of currently. But I would
  > like this to be as robust as possible and so I do not want to rely on
  > the current implementation. This can change in future and I can
  > guarantee that nobody will think about the oom path when adding
  > something to the final __mmput path.

at http://lkml.kernel.org/r/20170726054533.GA960@dhcp22.suse.cz , but
how can you guarantee that nobody will think about the oom path when
adding something to the final __mmput() path, without also thinking
about the possibility of getting stuck waiting for memory allocation in
a CONFIG_MMU=n environment? As long as the possibility of getting stuck
remains, you should not assume that something you don't want will not
happen. It's time to treat CONFIG_MMU=n kernels the same as
CONFIG_MMU=y kernels.

If it is technically impossible (or not worthwhile) to implement the
OOM reaper for CONFIG_MMU=n kernels, I'm fine with a timeout based
approach like the one shown below. Then we no longer need branching
based on CONFIG_MMU.

 include/linux/mm_types.h |  3 +++
 mm/oom_kill.c            | 20 +++++++++++++++-----
 2 files changed, 18 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 7f384bb..374a2ae 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -504,6 +504,9 @@ struct mm_struct {
 	atomic_long_t hugetlb_usage;
 #endif
 	struct work_struct async_put_work;
+#ifndef CONFIG_MMU
+	unsigned long oom_victim_wait_timer;
+#endif
 } __randomize_layout;
 
 extern struct mm_struct init_mm;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 9e8b4f0..dd6239d 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -53,6 +53,17 @@
 
 DEFINE_MUTEX(oom_lock);
 
+static bool should_ignore_this_mm(struct mm_struct *mm)
+{
+#ifndef CONFIG_MMU
+	if (!mm->oom_victim_wait_timer)
+		mm->oom_victim_wait_timer = jiffies;
+	else if (time_after(jiffies, mm->oom_victim_wait_timer + HZ))
+		return true;
+#endif
+	return test_bit(MMF_OOM_SKIP, &mm->flags);
+};
+
 #ifdef CONFIG_NUMA
 /**
  * has_intersects_mems_allowed() - check task eligiblity for kill
@@ -188,9 +199,8 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
 	 * the middle of vfork
 	 */
 	adj = (long)p->signal->oom_score_adj;
-	if (adj == OOM_SCORE_ADJ_MIN ||
-			test_bit(MMF_OOM_SKIP, &p->mm->flags) ||
-			in_vfork(p)) {
+	if (adj == OOM_SCORE_ADJ_MIN || should_ignore_this_mm(p->mm) ||
+	    in_vfork(p)) {
 		task_unlock(p);
 		return 0;
 	}
@@ -303,7 +313,7 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
 	 * any memory is quite low.
 	 */
 	if (!is_sysrq_oom(oc) && tsk_is_oom_victim(task)) {
-		if (test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags))
+		if (should_ignore_this_mm(task->signal->oom_mm))
 			goto next;
 		goto abort;
 	}
@@ -783,7 +793,7 @@ static bool task_will_free_mem(struct task_struct *task)
 	 * This task has already been drained by the oom reaper so there are
 	 * only small chances it will free some more
 	 */
-	if (test_bit(MMF_OOM_SKIP, &mm->flags))
+	if (should_ignore_this_mm(mm))
 		return false;
 
 	if (atomic_read(&mm->mm_users) <= 1)
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [PATCH 1/2] mm, oom: do not rely on TIF_MEMDIE for memory reserves access
  2017-08-03  1:39         ` Tetsuo Handa
@ 2017-08-03  7:06           ` Michal Hocko
  -1 siblings, 0 replies; 55+ messages in thread
From: Michal Hocko @ 2017-08-03  7:06 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: akpm, rientjes, hannes, guro, linux-mm, linux-kernel

On Thu 03-08-17 10:39:42, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Wed 02-08-17 00:30:33, Tetsuo Handa wrote:
> > > > @@ -3603,6 +3612,22 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
> > > >  	return alloc_flags;
> > > >  }
> > > >  
> > > > +static bool oom_reserves_allowed(struct task_struct *tsk)
> > > > +{
> > > > +	if (!tsk_is_oom_victim(tsk))
> > > > +		return false;
> > > > +
> > > > +	/*
> > > > +	 * !MMU doesn't have oom reaper so we shouldn't risk the memory reserves
> > > > +	 * depletion and shouldn't give access to memory reserves passed the
> > > > +	 * exit_mm
> > > > +	 */
> > > > +	if (!IS_ENABLED(CONFIG_MMU) && !tsk->mm)
> > > > +		return false;
> > > 
> > > Branching based on CONFIG_MMU is ugly. I suggest timeout based next OOM
> > > victim selection if CONFIG_MMU=n.
> > 
> > I suggest we do not argue about nommu without actually optimizing for or
> > fixing nommu which we are not here. I am even not sure memory reserves
> > can ever be depleted for that config.
> 
> I don't think memory reserves can deplete for CONFIG_MMU=n environment.
> But the reason the OOM reaper was introduced is not limited to handling
> depletion of memory reserves. The OOM reaper was introduced because
> OOM victims might get stuck indirectly waiting for other threads doing
> memory allocation. You said
> 
>   > Yes, exit_aio is the only blocking call I know of currently. But I would
>   > like this to be as robust as possible and so I do not want to rely on
>   > the current implementation. This can change in future and I can
>   > guarantee that nobody will think about the oom path when adding
>   > something to the final __mmput path.
> 
> at http://lkml.kernel.org/r/20170726054533.GA960@dhcp22.suse.cz , but
> how can you guarantee that nobody will think about the oom path
> when adding something to the final __mmput() path without thinking
> about possibility of getting stuck waiting for memory allocation in
> CONFIG_MMU=n environment?

Look, I really appreciate your sentiment for the nommu platform, but
with an absolute lack of _any_ oom reports on that platform that I am
aware of, or of any reports about lockups during oom, I am less than
thrilled to add code to fix a problem which might not even exist. Nommu
is usually very special, with a very specific workload running (e.g. no
overcommit), so I strongly suspect that any OOM theories are highly
academic.

All I really care about is not regressing nommu as much as possible. So
can we get back to the proposed patch and the updates I have done to
address your review feedback, please?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 1/2] mm, oom: do not rely on TIF_MEMDIE for memory reserves access
  2017-08-03  7:06           ` Michal Hocko
@ 2017-08-03  8:03             ` Tetsuo Handa
  -1 siblings, 0 replies; 55+ messages in thread
From: Tetsuo Handa @ 2017-08-03  8:03 UTC (permalink / raw)
  To: mhocko; +Cc: akpm, rientjes, hannes, guro, linux-mm, linux-kernel

Michal Hocko wrote:
> Look, I really appreciate your sentiment for for nommu platform but with
> an absolute lack of _any_ oom reports on that platform that I am aware
> of nor any reports about lockups during oom I am less than thrilled to
> add a code to fix a problem which even might not exist. Nommu is usually
> very special with a very specific workload running (e.g. no overcommit)
> so I strongly suspect that any OOM theories are highly academic.

If you believe that there are really no oom reports, get rid of the OOM
killer completely.

> 
> All I do care about is to not regress nommu as much as possible. So can
> we get back to the proposed patch and updates I have done to address
> your review feedback please?

Not unless we get rid of the OOM killer for CONFIG_MMU=n.

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 170db4d..e931969 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3312,7 +3312,8 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
 		goto out;
 
 	/* Exhausted what can be done so it's blamo time */
-	if (out_of_memory(&oc) || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) {
+	if ((IS_ENABLED(CONFIG_MMU) && out_of_memory(&oc)) ||
+		WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) {
 		*did_some_progress = 1;
 
 		/*

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [PATCH 1/2] mm, oom: do not rely on TIF_MEMDIE for memory reserves access
  2017-08-03  8:03             ` Tetsuo Handa
@ 2017-08-03  8:21               ` Michal Hocko
  -1 siblings, 0 replies; 55+ messages in thread
From: Michal Hocko @ 2017-08-03  8:21 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: akpm, rientjes, hannes, guro, linux-mm, linux-kernel

On Thu 03-08-17 17:03:20, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > Look, I really appreciate your sentiment for for nommu platform but with
> > an absolute lack of _any_ oom reports on that platform that I am aware
> > of nor any reports about lockups during oom I am less than thrilled to
> > add a code to fix a problem which even might not exist. Nommu is usually
> > very special with a very specific workload running (e.g. no overcommit)
> > so I strongly suspect that any OOM theories are highly academic.
> 
> If you believe that there is really no oom report, get rid of the OOM
> killer completely.

I am not a user or even an owner of such a platform. As I've said, all
I care about is not regressing for those guys, and I believe that the
patch doesn't change nommu behavior in any risky way. If it does, point
the problems out and I will try to address them.
 
> > All I do care about is to not regress nommu as much as possible. So can
> > we get back to the proposed patch and updates I have done to address
> > your review feedback please?
> 
> No unless we get rid of the OOM killer if CONFIG_MMU=n.

Are you saying that you are going to nack the patch based on this
reasoning? This is just ridiculous.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2 1/2] mm, oom: do not rely on TIF_MEMDIE for memory reserves access
  2017-08-02  8:29     ` Michal Hocko
@ 2017-08-03  9:37       ` Mel Gorman
  -1 siblings, 0 replies; 55+ messages in thread
From: Mel Gorman @ 2017-08-03  9:37 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, David Rientjes, Johannes Weiner, Roman Gushchin,
	Tetsuo Handa, linux-mm, LKML

On Wed, Aug 02, 2017 at 10:29:14AM +0200, Michal Hocko wrote:
>     For ages we have been relying on TIF_MEMDIE thread flag to mark OOM
>     victims and then, among other things, to give these threads full
>     access to memory reserves. There are few shortcomings of this
>     implementation, though.
>     

I believe the original intent was that allocations from the exit path
would be small and relatively short-lived.

>     First of all and the most serious one is that the full access to memory
>     reserves is quite dangerous because we leave no safety room for the
>     system to operate and potentially do last emergency steps to move on.
>     
>     Secondly this flag is per task_struct while the OOM killer operates
>     on mm_struct granularity so all processes sharing the given mm are
>     killed. Giving the full access to all these task_structs could lead to
>     a quick memory reserves depletion.

This is a valid concern.

>     We have tried to reduce this risk by
>     giving TIF_MEMDIE only to the main thread and the currently allocating
>     task but that doesn't really solve this problem while it surely opens up
>     a room for corner cases - e.g. GFP_NO{FS,IO} requests might loop inside
>     the allocator without access to memory reserves because a particular
>     thread was not the group leader.
>     
>     Now that we have the oom reaper and that all oom victims are reapable
>     after 1b51e65eab64 ("oom, oom_reaper: allow to reap mm shared by the
>     kthreads") we can be more conservative and grant only partial access to
>     memory reserves because there are reasonable chances of the parallel
>     memory freeing. We still want some access to reserves because we do not
>     want other consumers to eat up the victim's freed memory. oom victims
>     will still contend with __GFP_HIGH users but those shouldn't be so
>     aggressive to starve oom victims completely.
>     
>     Introduce ALLOC_OOM flag and give all tsk_is_oom_victim tasks access to
>     the half of the reserves. This makes the access to reserves independent
>     on which task has passed through mark_oom_victim. Also drop any
>     usage of TIF_MEMDIE from the page allocator proper and replace it by
>     tsk_is_oom_victim as well which will make page_alloc.c completely
>     TIF_MEMDIE free finally.
>     
>     CONFIG_MMU=n doesn't have oom reaper so let's stick to the original
>     ALLOC_NO_WATERMARKS approach.
>     
>     There is a demand to make the oom killer memcg aware which will imply
>     many tasks killed at once. This change will allow such a usecase without
>     worrying about complete memory reserves depletion.
>     
>     Changes since v1
>     - do not play tricks with nommu and grant access to memory reserves as
>       long as TIF_MEMDIE is set
>     - break out from allocation properly for oom victims as per Tetsuo
>     - distinguish oom victims from other consumers of memory reserves in
>       __gfp_pfmemalloc_flags - per Tetsuo
>     
>     Signed-off-by: Michal Hocko <mhocko@suse.com>
> 
> diff --git a/mm/internal.h b/mm/internal.h
> index 24d88f084705..1ebcb1ed01b5 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -480,6 +480,17 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
>  /* Mask to get the watermark bits */
>  #define ALLOC_WMARK_MASK	(ALLOC_NO_WATERMARKS-1)
>  
> +/*
> + * Only MMU archs have async oom victim reclaim - aka oom_reaper so we
> + * cannot assume a reduced access to memory reserves is sufficient for
> + * !MMU
> + */
> +#ifdef CONFIG_MMU
> +#define ALLOC_OOM		0x08
> +#else
> +#define ALLOC_OOM		ALLOC_NO_WATERMARKS
> +#endif
> +
>  #define ALLOC_HARDER		0x10 /* try to alloc harder */
>  #define ALLOC_HIGH		0x20 /* __GFP_HIGH set */
>  #define ALLOC_CPUSET		0x40 /* check for correct cpuset */

Ok, no collision with the wmark indexes so that should be fine. While I
didn't check, I suspect that !MMU users also have relatively few CPUs,
so major contention is unlikely.
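
For reference, a sketch of the resulting bit layout (the ALLOC_WMARK_*
index values and ALLOC_NO_WATERMARKS below are assumed from the
mm/internal.h of this era rather than quoted from the patch):

/*
 * assumed layout:
 *   ALLOC_WMARK_MIN/LOW/HIGH  0x00-0x02  watermark index, covered by
 *                                        ALLOC_WMARK_MASK (0x03)
 *   ALLOC_NO_WATERMARKS       0x04
 *   ALLOC_OOM                 0x08       new flag, slots into the gap
 *   ALLOC_HARDER              0x10
 *   ALLOC_HIGH                0x20
 *   ALLOC_CPUSET              0x40
 */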

> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 9e8b4f030c1c..c9f3569a76c7 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -824,7 +824,8 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
>  
>  	/*
>  	 * If the task is already exiting, don't alarm the sysadmin or kill
> -	 * its children or threads, just set TIF_MEMDIE so it can die quickly
> +	 * its children or threads, just give it access to memory reserves
> +	 * so it can die quickly
>  	 */
>  	task_lock(p);
>  	if (task_will_free_mem(p)) {
> @@ -889,9 +890,9 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
>  	count_memcg_event_mm(mm, OOM_KILL);
>  
>  	/*
> -	 * We should send SIGKILL before setting TIF_MEMDIE in order to prevent
> -	 * the OOM victim from depleting the memory reserves from the user
> -	 * space under its control.
> +	 * We should send SIGKILL before granting access to memory reserves
> +	 * in order to prevent the OOM victim from depleting the memory
> +	 * reserves from the user space under its control.
>  	 */
>  	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
>  	mark_oom_victim(victim);

There is the secondary effect that some operations get aborted entirely
if a SIGKILL is pending, but that's neither here nor there.

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 80e4adb4c360..7ae0f6d45614 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2930,7 +2930,7 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
>  {
>  	long min = mark;
>  	int o;
> -	const bool alloc_harder = (alloc_flags & ALLOC_HARDER);
> +	const bool alloc_harder = (alloc_flags & (ALLOC_HARDER|ALLOC_OOM));
>  
>  	/* free_pages may go negative - that's OK */
>  	free_pages -= (1 << order) - 1;
> @@ -2943,10 +2943,19 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
>  	 * the high-atomic reserves. This will over-estimate the size of the
>  	 * atomic reserve but it avoids a search.
>  	 */
> -	if (likely(!alloc_harder))
> +	if (likely(!alloc_harder)) {
>  		free_pages -= z->nr_reserved_highatomic;
> -	else
> -		min -= min / 4;
> +	} else {
> +		/*
> +		 * OOM victims can try even harder than normal ALLOC_HARDER
> +		 * users
> +		 */
> +		if (alloc_flags & ALLOC_OOM)
> +			min -= min / 2;
> +		else
> +			min -= min / 4;
> +	}
> +
>  
>  #ifdef CONFIG_CMA
>  	/* If allocation can't use CMA areas don't use free CMA pages */

I see no problem here although the comment could be slightly better. It
suggests that OOM victims can try harder, but not why:

/*
 * OOM victims can try even harder than normal ALLOC_HARDER users on the
 * grounds that it's definitely going to be in the exit path shortly and
 * free memory. Any allocation it makes during the free path will be
 * small and short-lived.
 */
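
To make the thresholds concrete, a quick sketch (illustrative numbers,
not tree code):

	/* e.g. a min watermark of 1024 pages */
	long min = 1024;
	if (alloc_flags & ALLOC_OOM)
		min -= min / 2;		/* oom victim: 512 pages must stay free */
	else
		min -= min / 4;		/* ALLOC_HARDER: 768 pages must stay free */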

> @@ -3184,7 +3193,7 @@ static void warn_alloc_show_mem(gfp_t gfp_mask, nodemask_t *nodemask)
>  	 * of allowed nodes.
>  	 */
>  	if (!(gfp_mask & __GFP_NOMEMALLOC))
> -		if (test_thread_flag(TIF_MEMDIE) ||
> +		if (tsk_is_oom_victim(current) ||
>  		    (current->flags & (PF_MEMALLOC | PF_EXITING)))
>  			filter &= ~SHOW_MEM_FILTER_NODES;
>  	if (in_interrupt() || !(gfp_mask & __GFP_DIRECT_RECLAIM))

> @@ -3603,21 +3612,46 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
>  	return alloc_flags;
>  }
>  
> -bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
> +static bool oom_reserves_allowed(struct task_struct *tsk)
>  {
> -	if (unlikely(gfp_mask & __GFP_NOMEMALLOC))
> +	if (!tsk_is_oom_victim(tsk))
> +		return false;
> +
> +	/*
> +	 * !MMU doesn't have oom reaper so give access to memory reserves
> +	 * only to the thread with TIF_MEMDIE set
> +	 */
> +	if (!IS_ENABLED(CONFIG_MMU) && !test_thread_flag(TIF_MEMDIE))
>  		return false;
>  
> +	return true;
> +}
> +

Ok, there is a chance that a task selected as an OOM kill victim may be
in the middle of a __GFP_NOMEMALLOC allocation but I can't actually see
a problem with that. __GFP_NOMEMALLOC users are not going to be in the
exit path (which is what we care about for an OOM killed task) and the
caller should always be able to handle a failure.

> +/*
> + * Distinguish requests which really need access to full memory
> + * reserves from oom victims which can live with a portion of it
> + */
> +static inline int __gfp_pfmemalloc_flags(gfp_t gfp_mask)
> +{
> +	if (unlikely(gfp_mask & __GFP_NOMEMALLOC))
> +		return 0;
>  	if (gfp_mask & __GFP_MEMALLOC)
> -		return true;
> +		return ALLOC_NO_WATERMARKS;
>  	if (in_serving_softirq() && (current->flags & PF_MEMALLOC))
> -		return true;
> -	if (!in_interrupt() &&
> -			((current->flags & PF_MEMALLOC) ||
> -			 unlikely(test_thread_flag(TIF_MEMDIE))))
> -		return true;
> +		return ALLOC_NO_WATERMARKS;
> +	if (!in_interrupt()) {
> +		if (current->flags & PF_MEMALLOC)
> +			return ALLOC_NO_WATERMARKS;
> +		else if (oom_reserves_allowed(current))
> +			return ALLOC_OOM;
> +	}
>  
> -	return false;
> +	return 0;
> +}
> +
> +bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
> +{
> +	return __gfp_pfmemalloc_flags(gfp_mask) > 0;
>  }

Very subtle sign casting error here. If the flags ever use the high
bit, this wraps to a negative value and fails. It "shouldn't be
possible" but you could just remove the "> 0" there to be on the safe
side, or have __gfp_pfmemalloc_flags return unsigned.
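
A minimal userspace sketch of the failure mode, with a hypothetical
ALLOC_FUTURE flag in the high bit (made up purely for illustration):

#include <stdio.h>

#define ALLOC_FUTURE	0x80000000u	/* hypothetical flag in bit 31 */

static int gfp_flags_as_int(void)
{
	return ALLOC_FUTURE;	/* converts to a negative int on common ABIs */
}

int main(void)
{
	/* prints "granted: 0" - the "> 0" test misses the flag */
	printf("granted: %d\n", gfp_flags_as_int() > 0);
	return 0;
}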

>  
>  /*
> @@ -3770,6 +3804,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	unsigned long alloc_start = jiffies;
>  	unsigned int stall_timeout = 10 * HZ;
>  	unsigned int cpuset_mems_cookie;
> +	int reserves;
>  

This should be explicitly named to indicate it's about flags and not the
number of reserve pages or something else wacky.
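
i.e. something like this (naming sketch only):

	int reserve_flags;	/* ALLOC_* bits, not a count of pages */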

>  	/*
>  	 * In the slowpath, we sanity check order to avoid ever trying to
> @@ -3875,15 +3910,16 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	if (gfp_mask & __GFP_KSWAPD_RECLAIM)
>  		wake_all_kswapds(order, ac);
>  
> -	if (gfp_pfmemalloc_allowed(gfp_mask))
> -		alloc_flags = ALLOC_NO_WATERMARKS;
> +	reserves = __gfp_pfmemalloc_flags(gfp_mask);
> +	if (reserves)
> +		alloc_flags = reserves;
>  

And if it's reserve_flags you can save a branch with

reserve_flags = __gfp_pfmemalloc_flags(gfp_mask);
alloc_flags |= reserve_flags;

It won't make much difference considering how branch-intensive the allocator
is anyway.

>  	/*
>  	 * Reset the zonelist iterators if memory policies can be ignored.
>  	 * These allocations are high priority and system rather than user
>  	 * orientated.
>  	 */
> -	if (!(alloc_flags & ALLOC_CPUSET) || (alloc_flags & ALLOC_NO_WATERMARKS)) {
> +	if (!(alloc_flags & ALLOC_CPUSET) || reserves) {
>  		ac->zonelist = node_zonelist(numa_node_id(), gfp_mask);
>  		ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
>  					ac->high_zoneidx, ac->nodemask);
> @@ -3960,8 +3996,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  		goto got_pg;
>  
>  	/* Avoid allocations with no watermarks from looping endlessly */
> -	if (test_thread_flag(TIF_MEMDIE) &&
> -	    (alloc_flags == ALLOC_NO_WATERMARKS ||
> +	if (tsk_is_oom_victim(current) &&
> +	    (alloc_flags == ALLOC_OOM ||
>  	     (gfp_mask & __GFP_NOMEMALLOC)))
>  		goto nopage;
>  

Mostly I only found nit-picks, so whether you address them or not,

Acked-by: Mel Gorman <mgorman@techsingularity.net>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2 1/2] mm, oom: do not rely on TIF_MEMDIE for memory reserves access
@ 2017-08-03  9:37       ` Mel Gorman
  0 siblings, 0 replies; 55+ messages in thread
From: Mel Gorman @ 2017-08-03  9:37 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, David Rientjes, Johannes Weiner, Roman Gushchin,
	Tetsuo Handa, linux-mm, LKML

On Wed, Aug 02, 2017 at 10:29:14AM +0200, Michal Hocko wrote:
>     For ages we have been relying on TIF_MEMDIE thread flag to mark OOM
>     victims and then, among other things, to give these threads full
>     access to memory reserves. There are few shortcomings of this
>     implementation, though.
>     

I believe the original intent was that allocations from the exit path
would be small and relatively short-lived.

>     First of all and the most serious one is that the full access to memory
>     reserves is quite dangerous because we leave no safety room for the
>     system to operate and potentially do last emergency steps to move on.
>     
>     Secondly this flag is per task_struct while the OOM killer operates
>     on mm_struct granularity so all processes sharing the given mm are
>     killed. Giving the full access to all these task_structs could lead to
>     a quick memory reserves depletion.

This is a valid concern.

>     We have tried to reduce this risk by
>     giving TIF_MEMDIE only to the main thread and the currently allocating
>     task but that doesn't really solve this problem while it surely opens up
>     a room for corner cases - e.g. GFP_NO{FS,IO} requests might loop inside
>     the allocator without access to memory reserves because a particular
>     thread was not the group leader.
>     
>     Now that we have the oom reaper and that all oom victims are reapable
>     after 1b51e65eab64 ("oom, oom_reaper: allow to reap mm shared by the
>     kthreads") we can be more conservative and grant only partial access to
>     memory reserves because there are reasonable chances of the parallel
>     memory freeing. We still want some access to reserves because we do not
>     want other consumers to eat up the victim's freed memory. oom victims
>     will still contend with __GFP_HIGH users but those shouldn't be so
>     aggressive to starve oom victims completely.
>     
>     Introduce ALLOC_OOM flag and give all tsk_is_oom_victim tasks access to
>     the half of the reserves. This makes the access to reserves independent
>     on which task has passed through mark_oom_victim. Also drop any
>     usage of TIF_MEMDIE from the page allocator proper and replace it by
>     tsk_is_oom_victim as well which will make page_alloc.c completely
>     TIF_MEMDIE free finally.
>     
>     CONFIG_MMU=n doesn't have oom reaper so let's stick to the original
>     ALLOC_NO_WATERMARKS approach.
>     
>     There is a demand to make the oom killer memcg aware which will imply
>     many tasks killed at once. This change will allow such a usecase without
>     worrying about complete memory reserves depletion.
>     
>     Changes since v1
>     - do not play tricks with nommu and grant access to memory reserves as
>       long as TIF_MEMDIE is set
>     - break out from allocation properly for oom victims as per Tetsuo
>     - distinguish oom victims from other consumers of memory reserves in
>       __gfp_pfmemalloc_flags - per Tetsuo
>     
>     Signed-off-by: Michal Hocko <mhocko@suse.com>
> 
> diff --git a/mm/internal.h b/mm/internal.h
> index 24d88f084705..1ebcb1ed01b5 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -480,6 +480,17 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
>  /* Mask to get the watermark bits */
>  #define ALLOC_WMARK_MASK	(ALLOC_NO_WATERMARKS-1)
>  
> +/*
> + * Only MMU archs have async oom victim reclaim - aka oom_reaper so we
> + * cannot assume a reduced access to memory reserves is sufficient for
> + * !MMU
> + */
> +#ifdef CONFIG_MMU
> +#define ALLOC_OOM		0x08
> +#else
> +#define ALLOC_OOM		ALLOC_NO_WATERMARKS
> +#endif
> +
>  #define ALLOC_HARDER		0x10 /* try to alloc harder */
>  #define ALLOC_HIGH		0x20 /* __GFP_HIGH set */
>  #define ALLOC_CPUSET		0x40 /* check for correct cpuset */

Ok, no collision with the wmark indexes so that should be fine. While I
didn't check, I suspect that !MMU users also have relatively few CPUs to
allow major contention.

> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 9e8b4f030c1c..c9f3569a76c7 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -824,7 +824,8 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
>  
>  	/*
>  	 * If the task is already exiting, don't alarm the sysadmin or kill
> -	 * its children or threads, just set TIF_MEMDIE so it can die quickly
> +	 * its children or threads, just give it access to memory reserves
> +	 * so it can die quickly
>  	 */
>  	task_lock(p);
>  	if (task_will_free_mem(p)) {
> @@ -889,9 +890,9 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
>  	count_memcg_event_mm(mm, OOM_KILL);
>  
>  	/*
> -	 * We should send SIGKILL before setting TIF_MEMDIE in order to prevent
> -	 * the OOM victim from depleting the memory reserves from the user
> -	 * space under its control.
> +	 * We should send SIGKILL before granting access to memory reserves
> +	 * in order to prevent the OOM victim from depleting the memory
> +	 * reserves from the user space under its control.
>  	 */
>  	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
>  	mark_oom_victim(victim);

There is the secondary effect that some operations gets aborted entirely
if a SIGKILL is pending but that's neither here nor there.

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 80e4adb4c360..7ae0f6d45614 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2930,7 +2930,7 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
>  {
>  	long min = mark;
>  	int o;
> -	const bool alloc_harder = (alloc_flags & ALLOC_HARDER);
> +	const bool alloc_harder = (alloc_flags & (ALLOC_HARDER|ALLOC_OOM));
>  
>  	/* free_pages may go negative - that's OK */
>  	free_pages -= (1 << order) - 1;
> @@ -2943,10 +2943,19 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
>  	 * the high-atomic reserves. This will over-estimate the size of the
>  	 * atomic reserve but it avoids a search.
>  	 */
> -	if (likely(!alloc_harder))
> +	if (likely(!alloc_harder)) {
>  		free_pages -= z->nr_reserved_highatomic;
> -	else
> -		min -= min / 4;
> +	} else {
> +		/*
> +		 * OOM victims can try even harder than normal ALLOC_HARDER
> +		 * users
> +		 */
> +		if (alloc_flags & ALLOC_OOM)
> +			min -= min / 2;
> +		else
> +			min -= min / 4;
> +	}
> +
>  
>  #ifdef CONFIG_CMA
>  	/* If allocation can't use CMA areas don't use free CMA pages */

I see no problem here although the comment could be slightly better. It
suggests that OOM victims can try harder but not why:

/*
 * OOM victims can try even harder than normal ALLOC_HARDER users on the
 * grounds that it's definitely going to be in the exit path shortly and
 * free memory. Any allocation it makes during the free path will be
 * small and short-lived.
 */
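
Worked numbers, as a standalone sketch mirroring the hunk above (illustrative
only, not part of the patch):

	/* effective watermark for a zone with min = 1024 pages */
	static long effective_min(long min, unsigned int alloc_flags)
	{
		if (alloc_flags & ALLOC_OOM)
			return min - min / 2;	/* 1024 -> 512 */
		if (alloc_flags & ALLOC_HARDER)
			return min - min / 4;	/* 1024 -> 768 */
		return min;			/* 1024 -> 1024 */
	}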

> @@ -3184,7 +3193,7 @@ static void warn_alloc_show_mem(gfp_t gfp_mask, nodemask_t *nodemask)
>  	 * of allowed nodes.
>  	 */
>  	if (!(gfp_mask & __GFP_NOMEMALLOC))
> -		if (test_thread_flag(TIF_MEMDIE) ||
> +		if (tsk_is_oom_victim(current) ||
>  		    (current->flags & (PF_MEMALLOC | PF_EXITING)))
>  			filter &= ~SHOW_MEM_FILTER_NODES;
>  	if (in_interrupt() || !(gfp_mask & __GFP_DIRECT_RECLAIM))

> @@ -3603,21 +3612,46 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
>  	return alloc_flags;
>  }
>  
> -bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
> +static bool oom_reserves_allowed(struct task_struct *tsk)
>  {
> -	if (unlikely(gfp_mask & __GFP_NOMEMALLOC))
> +	if (!tsk_is_oom_victim(tsk))
> +		return false;
> +
> +	/*
> +	 * !MMU doesn't have oom reaper so give access to memory reserves
> +	 * only to the thread with TIF_MEMDIE set
> +	 */
> +	if (!IS_ENABLED(CONFIG_MMU) && !test_thread_flag(TIF_MEMDIE))
>  		return false;
>  
> +	return true;
> +}
> +

Ok, there is a chance that a task selected as an OOM kill victim may be
in the middle of a __GFP_NOMEMALLOC allocation but I can't actually see a
problem with that. __GFP_NOMEMALLOC users are not going to be in the exit
path (which we care about for an OOM killed task) and the caller should
always be able to handle a failure.
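
Something like this hypothetical networking-style caller, say (alloc_skb()
and NET_RX_DROP are real, the surrounding context is made up):

	struct sk_buff *skb = alloc_skb(len, GFP_ATOMIC | __GFP_NOMEMALLOC);
	if (!skb)
		return NET_RX_DROP;	/* opted out of reserves, must cope */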

> +/*
> + * Distinguish requests which really need access to full memory
> + * reserves from oom victims which can live with a portion of it
> + */
> +static inline int __gfp_pfmemalloc_flags(gfp_t gfp_mask)
> +{
> +	if (unlikely(gfp_mask & __GFP_NOMEMALLOC))
> +		return 0;
>  	if (gfp_mask & __GFP_MEMALLOC)
> -		return true;
> +		return ALLOC_NO_WATERMARKS;
>  	if (in_serving_softirq() && (current->flags & PF_MEMALLOC))
> -		return true;
> -	if (!in_interrupt() &&
> -			((current->flags & PF_MEMALLOC) ||
> -			 unlikely(test_thread_flag(TIF_MEMDIE))))
> -		return true;
> +		return ALLOC_NO_WATERMARKS;
> +	if (!in_interrupt()) {
> +		if (current->flags & PF_MEMALLOC)
> +			return ALLOC_NO_WATERMARKS;
> +		else if (oom_reserves_allowed(current))
> +			return ALLOC_OOM;
> +	}
>  
> -	return false;
> +	return 0;
> +}
> +
> +bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
> +{
> +	return __gfp_pfmemalloc_flags(gfp_mask) > 0;
>  }

Very subtle signedness error here. If the flags ever use the high bit,
this wraps and fails. It "shouldn't be possible" but you could just remove
the "> 0" there to be on the safe side or have __gfp_pfmemalloc_flags
return unsigned.
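
An illustrative userspace demo of the wrap, assuming a flag ever reached the
sign bit:

	#include <limits.h>
	#include <stdio.h>

	int main(void)
	{
		int flags = INT_MIN;		/* pretend bit 31 is a flag */
		printf("%d %d\n", flags > 0,	/* 0: negative as an int */
				  !!flags);	/* 1: the test we wanted */
		return 0;
	}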

>  
>  /*
> @@ -3770,6 +3804,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	unsigned long alloc_start = jiffies;
>  	unsigned int stall_timeout = 10 * HZ;
>  	unsigned int cpuset_mems_cookie;
> +	int reserves;
>  

This should be explicitly named to indicate it's about flags and not the
number of reserve pages or something else wacky.

>  	/*
>  	 * In the slowpath, we sanity check order to avoid ever trying to
> @@ -3875,15 +3910,16 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	if (gfp_mask & __GFP_KSWAPD_RECLAIM)
>  		wake_all_kswapds(order, ac);
>  
> -	if (gfp_pfmemalloc_allowed(gfp_mask))
> -		alloc_flags = ALLOC_NO_WATERMARKS;
> +	reserves = __gfp_pfmemalloc_flags(gfp_mask);
> +	if (reserves)
> +		alloc_flags = reserves;
>  

And if it's reserve_flags you can save a branch with

reserve_flags = __gfp_pfmemalloc_flags(gfp_mask);
alloc_flags |= reserve_flags;

It won't make much difference considering how branch-intensive the allocator
is anyway.

>  	/*
>  	 * Reset the zonelist iterators if memory policies can be ignored.
>  	 * These allocations are high priority and system rather than user
>  	 * orientated.
>  	 */
> -	if (!(alloc_flags & ALLOC_CPUSET) || (alloc_flags & ALLOC_NO_WATERMARKS)) {
> +	if (!(alloc_flags & ALLOC_CPUSET) || reserves) {
>  		ac->zonelist = node_zonelist(numa_node_id(), gfp_mask);
>  		ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
>  					ac->high_zoneidx, ac->nodemask);
> @@ -3960,8 +3996,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  		goto got_pg;
>  
>  	/* Avoid allocations with no watermarks from looping endlessly */
> -	if (test_thread_flag(TIF_MEMDIE) &&
> -	    (alloc_flags == ALLOC_NO_WATERMARKS ||
> +	if (tsk_is_oom_victim(current) &&
> +	    (alloc_flags == ALLOC_OOM ||
>  	     (gfp_mask & __GFP_NOMEMALLOC)))
>  		goto nopage;
>  

Mostly I only found nit-picks so whether you address them or not

Acked-by: Mel Gorman <mgorman@techsingularity.net>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2 1/2] mm, oom: do not rely on TIF_MEMDIE for memory reserves access
  2017-08-03  9:37       ` Mel Gorman
@ 2017-08-03 11:00         ` Michal Hocko
  -1 siblings, 0 replies; 55+ messages in thread
From: Michal Hocko @ 2017-08-03 11:00 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, David Rientjes, Johannes Weiner, Roman Gushchin,
	Tetsuo Handa, linux-mm, LKML

On Thu 03-08-17 10:37:01, Mel Gorman wrote:
> On Wed, Aug 02, 2017 at 10:29:14AM +0200, Michal Hocko wrote:
> >     For ages we have been relying on TIF_MEMDIE thread flag to mark OOM
> >     victims and then, among other things, to give these threads full
> >     access to memory reserves. There are few shortcomings of this
> >     implementation, though.
> >     
> 
> I believe the original intent was that allocations from the exit path
> would be small and relatively short-lived.

Yes.

> 
> >     First of all and the most serious one is that the full access to memory
> >     reserves is quite dangerous because we leave no safety room for the
> >     system to operate and potentially do last emergency steps to move on.
> >     
> >     Secondly this flag is per task_struct while the OOM killer operates
> >     on mm_struct granularity so all processes sharing the given mm are
> >     killed. Giving the full access to all these task_structs could lead to
> >     a quick memory reserves depletion.
> 
> This is a valid concern.
> 
> >     We have tried to reduce this risk by
> >     giving TIF_MEMDIE only to the main thread and the currently allocating
> >     task but that doesn't really solve this problem while it surely opens up
> >     a room for corner cases - e.g. GFP_NO{FS,IO} requests might loop inside
> >     the allocator without access to memory reserves because a particular
> >     thread was not the group leader.
> >     
> >     Now that we have the oom reaper and that all oom victims are reapable
> >     after 1b51e65eab64 ("oom, oom_reaper: allow to reap mm shared by the
> >     kthreads") we can be more conservative and grant only partial access to
> >     memory reserves because there are reasonable chances of the parallel
> >     memory freeing. We still want some access to reserves because we do not
> >     want other consumers to eat up the victim's freed memory. oom victims
> >     will still contend with __GFP_HIGH users but those shouldn't be so
> >     aggressive to starve oom victims completely.
> >     
> >     Introduce ALLOC_OOM flag and give all tsk_is_oom_victim tasks access to
> >     the half of the reserves. This makes the access to reserves independent
> >     on which task has passed through mark_oom_victim. Also drop any
> >     usage of TIF_MEMDIE from the page allocator proper and replace it by
> >     tsk_is_oom_victim as well which will make page_alloc.c completely
> >     TIF_MEMDIE free finally.
> >     
> >     CONFIG_MMU=n doesn't have oom reaper so let's stick to the original
> >     ALLOC_NO_WATERMARKS approach.
> >     
> >     There is a demand to make the oom killer memcg aware which will imply
> >     many tasks killed at once. This change will allow such a usecase without
> >     worrying about complete memory reserves depletion.
> >     
> >     Changes since v1
> >     - do not play tricks with nommu and grant access to memory reserves as
> >       long as TIF_MEMDIE is set
> >     - break out from allocation properly for oom victims as per Tetsuo
> >     - distinguish oom victims from other consumers of memory reserves in
> >       __gfp_pfmemalloc_flags - per Tetsuo
> >     
> >     Signed-off-by: Michal Hocko <mhocko@suse.com>
> > 
> > diff --git a/mm/internal.h b/mm/internal.h
> > index 24d88f084705..1ebcb1ed01b5 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -480,6 +480,17 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
> >  /* Mask to get the watermark bits */
> >  #define ALLOC_WMARK_MASK	(ALLOC_NO_WATERMARKS-1)
> >  
> > +/*
> > + * Only MMU archs have async oom victim reclaim - aka oom_reaper so we
> > + * cannot assume a reduced access to memory reserves is sufficient for
> > + * !MMU
> > + */
> > +#ifdef CONFIG_MMU
> > +#define ALLOC_OOM		0x08
> > +#else
> > +#define ALLOC_OOM		ALLOC_NO_WATERMARKS
> > +#endif
> > +
> >  #define ALLOC_HARDER		0x10 /* try to alloc harder */
> >  #define ALLOC_HIGH		0x20 /* __GFP_HIGH set */
> >  #define ALLOC_CPUSET		0x40 /* check for correct cpuset */
> 
> Ok, no collision with the wmark indexes so that should be fine. While I
> didn't check, I suspect that !MMU users also have relatively few CPUs,
> too few to cause major contention.

Well, I didn't try to improve the !MMU case because a) I do not know
whether there is a real problem with oom depletion there and b) I have
no way to test this. So I only focused on keeping the status quo for
nommu.

[...]

> > @@ -2943,10 +2943,19 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
> >  	 * the high-atomic reserves. This will over-estimate the size of the
> >  	 * atomic reserve but it avoids a search.
> >  	 */
> > -	if (likely(!alloc_harder))
> > +	if (likely(!alloc_harder)) {
> >  		free_pages -= z->nr_reserved_highatomic;
> > -	else
> > -		min -= min / 4;
> > +	} else {
> > +		/*
> > +		 * OOM victims can try even harder than normal ALLOC_HARDER
> > +		 * users
> > +		 */
> > +		if (alloc_flags & ALLOC_OOM)
> > +			min -= min / 2;
> > +		else
> > +			min -= min / 4;
> > +	}
> > +
> >  
> >  #ifdef CONFIG_CMA
> >  	/* If allocation can't use CMA areas don't use free CMA pages */
> 
> I see no problem here although the comment could be slightly better. It
> suggests that OOM victims can try harder but not why:
> 
> /*
>  * OOM victims can try even harder than normal ALLOC_HARDER users on the
>  * grounds that it's definitely going to be in the exit path shortly and
>  * free memory. Any allocation it makes during the free path will be
>  * small and short-lived.
>  */

OK, I have replaced the comment.
 
> > @@ -3603,21 +3612,46 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
> >  	return alloc_flags;
> >  }
> >  
> > -bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
> > +static bool oom_reserves_allowed(struct task_struct *tsk)
> >  {
> > -	if (unlikely(gfp_mask & __GFP_NOMEMALLOC))
> > +	if (!tsk_is_oom_victim(tsk))
> > +		return false;
> > +
> > +	/*
> > +	 * !MMU doesn't have oom reaper so give access to memory reserves
> > +	 * only to the thread with TIF_MEMDIE set
> > +	 */
> > +	if (!IS_ENABLED(CONFIG_MMU) && !test_thread_flag(TIF_MEMDIE))
> >  		return false;
> >  
> > +	return true;
> > +}
> > +
> 
> Ok, there is a chance that a task selected as an OOM kill victim may be
> in the middle of a __GFP_NOMEMALLOC allocation but I can't actually see a
> problem with that. __GFP_NOMEMALLOC users are not going to be in the exit
> path (which we care about for an OOM killed task) and the caller should
> always be able to handle a failure.

Not sure I understand. If the oom victim is doing __GFP_NOMEMALLOC then
we haven't been doing ALLOC_NO_WATERMARKS even before. So I preserve the
behavior here. Even though I am not sure this is a deliberate behavior
or more a result of the evolution of the code.
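
Both versions take the __GFP_NOMEMALLOC exit first, as the hunk above shows:

	/* before */ if (unlikely(gfp_mask & __GFP_NOMEMALLOC)) return false;
	/* after  */ if (unlikely(gfp_mask & __GFP_NOMEMALLOC)) return 0;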

> > +/*
> > + * Distinguish requests which really need access to full memory
> > + * reserves from oom victims which can live with a portion of it
> > + */
> > +static inline int __gfp_pfmemalloc_flags(gfp_t gfp_mask)
> > +{
> > +	if (unlikely(gfp_mask & __GFP_NOMEMALLOC))
> > +		return 0;
> >  	if (gfp_mask & __GFP_MEMALLOC)
> > -		return true;
> > +		return ALLOC_NO_WATERMARKS;
> >  	if (in_serving_softirq() && (current->flags & PF_MEMALLOC))
> > -		return true;
> > -	if (!in_interrupt() &&
> > -			((current->flags & PF_MEMALLOC) ||
> > -			 unlikely(test_thread_flag(TIF_MEMDIE))))
> > -		return true;
> > +		return ALLOC_NO_WATERMARKS;
> > +	if (!in_interrupt()) {
> > +		if (current->flags & PF_MEMALLOC)
> > +			return ALLOC_NO_WATERMARKS;
> > +		else if (oom_reserves_allowed(current))
> > +			return ALLOC_OOM;
> > +	}
> >  
> > -	return false;
> > +	return 0;
> > +}
> > +
> > +bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
> > +{
> > +	return __gfp_pfmemalloc_flags(gfp_mask) > 0;
> >  }
> 
> Very subtle signedness error here. If the flags ever use the high bit,
> this wraps and fails. It "shouldn't be possible" but you could just remove
> the "> 0" there to be on the safe side or have __gfp_pfmemalloc_flags
> return unsigned.

what about
	return !!__gfp_pfmemalloc_flags(gfp_mask);
 
> >  /*
> > @@ -3770,6 +3804,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> >  	unsigned long alloc_start = jiffies;
> >  	unsigned int stall_timeout = 10 * HZ;
> >  	unsigned int cpuset_mems_cookie;
> > +	int reserves;
> >  
> 
> This should be explicitly named to indicate it's about flags and not the
> number of reserve pages or something else wacky.

s@reserves@reserve_flags@?

> >  	/*
> >  	 * In the slowpath, we sanity check order to avoid ever trying to
> > @@ -3875,15 +3910,16 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> >  	if (gfp_mask & __GFP_KSWAPD_RECLAIM)
> >  		wake_all_kswapds(order, ac);
> >  
> > -	if (gfp_pfmemalloc_allowed(gfp_mask))
> > -		alloc_flags = ALLOC_NO_WATERMARKS;
> > +	reserves = __gfp_pfmemalloc_flags(gfp_mask);
> > +	if (reserves)
> > +		alloc_flags = reserves;
> >  
> 
> And if it's reserve_flags you can save a branch with
> 
> reserve_flags = __gfp_pfmemalloc_flags(gfp_mask);
> alloc_flags |= reserve_flags;
> 
> It won't make much difference considering how branch-intensive the allocator
> is anyway.

I did actually consider that but didn't want to do it because
I wanted to reset alloc_flags rather than create strange ALLOC_$FOO
combinations which would be harder to evaluate.
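
An illustrative contrast (flag values as in the mm/internal.h hunk above, the
snippet itself is not from the patch):

	unsigned int alloc_flags = ALLOC_CPUSET | ALLOC_HARDER;
	unsigned int reserve_flags = ALLOC_OOM;

	if (reserve_flags)
		alloc_flags = reserve_flags;	/* drops ALLOC_CPUSET et al. */
	/* alloc_flags |= reserve_flags;	   would keep the mixture */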
 
> >  	/*
> >  	 * Reset the zonelist iterators if memory policies can be ignored.
> >  	 * These allocations are high priority and system rather than user
> >  	 * orientated.
> >  	 */
> > -	if (!(alloc_flags & ALLOC_CPUSET) || (alloc_flags & ALLOC_NO_WATERMARKS)) {
> > +	if (!(alloc_flags & ALLOC_CPUSET) || reserves) {
> >  		ac->zonelist = node_zonelist(numa_node_id(), gfp_mask);
> >  		ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
> >  					ac->high_zoneidx, ac->nodemask);
> > @@ -3960,8 +3996,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> >  		goto got_pg;
> >  
> >  	/* Avoid allocations with no watermarks from looping endlessly */
> > -	if (test_thread_flag(TIF_MEMDIE) &&
> > -	    (alloc_flags == ALLOC_NO_WATERMARKS ||
> > +	if (tsk_is_oom_victim(current) &&
> > +	    (alloc_flags == ALLOC_OOM ||
> >  	     (gfp_mask & __GFP_NOMEMALLOC)))
> >  		goto nopage;
> >  
> 
> Mostly I only found nit-picks so whether you address them or not
> 
> Acked-by: Mel Gorman <mgorman@techsingularity.net>

Thanks a lot for your review. Here is an incremental diff on top
---
commit 8483306e108a25441d796a8da3df5b85d633d859
Author: Michal Hocko <mhocko@suse.com>
Date:   Thu Aug 3 12:58:49 2017 +0200

    fold me "mm, oom: do not rely on TIF_MEMDIE for memory reserves access"
    
    - clarify access to memory reserves in __zone_watermark_ok - per Mel
    - make int->bool conversion in gfp_pfmemalloc_allowed more robust - per
      Mel
    - s@reserves@reserve_flags@ in __alloc_pages_slowpath - per Mel

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7ae0f6d45614..90e331e4c077 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2948,7 +2948,9 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
 	} else {
 		/*
 		 * OOM victims can try even harder than normal ALLOC_HARDER
-		 * users
+		 * users on the grounds that it's definitely going to be in
+		 * the exit path shortly and free memory. Any allocation it
+		 * makes during the free path will be small and short-lived.
 		 */
 		if (alloc_flags & ALLOC_OOM)
 			min -= min / 2;
@@ -3651,7 +3653,7 @@ static inline int __gfp_pfmemalloc_flags(gfp_t gfp_mask)
 
 bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 {
-	return __gfp_pfmemalloc_flags(gfp_mask) > 0;
+	return !!__gfp_pfmemalloc_flags(gfp_mask);
 }
 
 /*
@@ -3804,7 +3806,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	unsigned long alloc_start = jiffies;
 	unsigned int stall_timeout = 10 * HZ;
 	unsigned int cpuset_mems_cookie;
-	int reserves;
+	int reserve_flags;
 
 	/*
 	 * In the slowpath, we sanity check order to avoid ever trying to
@@ -3910,16 +3912,16 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	if (gfp_mask & __GFP_KSWAPD_RECLAIM)
 		wake_all_kswapds(order, ac);
 
-	reserves = __gfp_pfmemalloc_flags(gfp_mask);
-	if (reserves)
-		alloc_flags = reserves;
+	reserve_flags = __gfp_pfmemalloc_flags(gfp_mask);
+	if (reserve_flags)
+		alloc_flags = reserve_flags;
 
 	/*
 	 * Reset the zonelist iterators if memory policies can be ignored.
 	 * These allocations are high priority and system rather than user
 	 * orientated.
 	 */
-	if (!(alloc_flags & ALLOC_CPUSET) || reserves) {
+	if (!(alloc_flags & ALLOC_CPUSET) || reserve_flags) {
 		ac->zonelist = node_zonelist(numa_node_id(), gfp_mask);
 		ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
 					ac->high_zoneidx, ac->nodemask);

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [PATCH v2 1/2] mm, oom: do not rely on TIF_MEMDIE for memory reserves access
  2017-08-03 11:00         ` Michal Hocko
@ 2017-08-03 12:22           ` Mel Gorman
  -1 siblings, 0 replies; 55+ messages in thread
From: Mel Gorman @ 2017-08-03 12:22 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, David Rientjes, Johannes Weiner, Roman Gushchin,
	Tetsuo Handa, linux-mm, LKML

On Thu, Aug 03, 2017 at 01:00:30PM +0200, Michal Hocko wrote:
> > Ok, no collision with the wmark indexes so that should be fine. While I
> > didn't check, I suspect that !MMU users also have relatively few CPUs,
> > too few to cause major contention.
> 
> Well, I didn't try to improve the !MMU case because a) I do not know
> whether there is a real problem with oom depletion there and b) I have
> no way to test this. So I only focused on keeping the status quo for
> nommu.
> 

I've no problem with that.

> > > @@ -3603,21 +3612,46 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
> > >  	return alloc_flags;
> > >  }
> > >  
> > > -bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
> > > +static bool oom_reserves_allowed(struct task_struct *tsk)
> > >  {
> > > -	if (unlikely(gfp_mask & __GFP_NOMEMALLOC))
> > > +	if (!tsk_is_oom_victim(tsk))
> > > +		return false;
> > > +
> > > +	/*
> > > +	 * !MMU doesn't have oom reaper so give access to memory reserves
> > > +	 * only to the thread with TIF_MEMDIE set
> > > +	 */
> > > +	if (!IS_ENABLED(CONFIG_MMU) && !test_thread_flag(TIF_MEMDIE))
> > >  		return false;
> > >  
> > > +	return true;
> > > +}
> > > +
> > 
> > Ok, there is a chance that a task selected as an OOM kill victim may be
> > in the middle of a __GFP_NOMEMALLOC allocation but I can't actually see a
> > problem with that. __GFP_NOMEMALLOC users are not going to be in the exit
> > path (which we care about for an OOM killed task) and the caller should
> > always be able to handle a failure.
> 
> Not sure I understand. If the oom victim is doing __GFP_NOMEMALLOC then
> we haven't been doing ALLOC_NO_WATERMARKS even before. So I preserve the
> behavior here. Even though I am not sure this is a deliberate behavior
> or more a result of the evolution of the code.
> 

The behaviour is fine as far as I can tell.

> > > +bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
> > > +{
> > > +	return __gfp_pfmemalloc_flags(gfp_mask) > 0;
> > >  }
> > 
> > Very subtle signedness error here. If the flags ever use the high bit,
> > this wraps and fails. It "shouldn't be possible" but you could just remove
> > the "> 0" there to be on the safe side or have __gfp_pfmemalloc_flags
> > return unsigned.
> 
> what about
> 	return !!__gfp_pfmemalloc_flags(gfp_mask);
>  

You could but it's overkill. Any value cast to bool should be safe as it's
meant to be immune from truncation concerns.
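
A small userspace illustration of that guarantee (hypothetical values):

	#include <stdbool.h>
	#include <stdio.h>

	int main(void)
	{
		long long x = 1LL << 32;  /* only bit 32 set */
		bool b = x;		  /* true: conversion is (x != 0) */
		char c = (char)x;	  /* 0: plain truncation loses it */
		printf("%d %d\n", b, c);
		return 0;
	}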

> > >  /*
> > > @@ -3770,6 +3804,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> > >  	unsigned long alloc_start = jiffies;
> > >  	unsigned int stall_timeout = 10 * HZ;
> > >  	unsigned int cpuset_mems_cookie;
> > > +	int reserves;
> > >  
> > 
> > This should be explicitly named to indicate it's about flags and not the
> > number of reserve pages or something else wacky.
> 
> s@reserves@reserve_flags@?
> 

That'll do.

> > >  	/*
> > >  	 * In the slowpath, we sanity check order to avoid ever trying to
> > > @@ -3875,15 +3910,16 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> > >  	if (gfp_mask & __GFP_KSWAPD_RECLAIM)
> > >  		wake_all_kswapds(order, ac);
> > >  
> > > -	if (gfp_pfmemalloc_allowed(gfp_mask))
> > > -		alloc_flags = ALLOC_NO_WATERMARKS;
> > > +	reserves = __gfp_pfmemalloc_flags(gfp_mask);
> > > +	if (reserves)
> > > +		alloc_flags = reserves;
> > >  
> > 
> > And if it's reserve_flags you can save a branch with
> > 
> > reserve_flags = __gfp_pfmemalloc_flags(gfp_mask);
> > alloc_flags |= reserve_flags;
> > 
> > It won't make much difference considering how branch-intensive the allocator
> > is anyway.
> 
> I did actually consider that but didn't want to do it because
> I wanted to reset alloc_flags rather than create strange ALLOC_$FOO
> combinations which would be harder to evaluate.
>  

Ok, it does implicitly clear flags like ALLOC_CPUSET which is fine in
this context but it must be remembered in the future if an alloc flag is
ever introduced that has meaning even for oom kill.

> > Mostly I only found nit-picks so whether you address them or not
> > 
> > Acked-by: Mel Gorman <mgorman@techsingularity.net>
> 
> Thanks a lot for your review. Here is an incremental diff on top

Looks fine. I am not a major fan of the !! because I think it's
unnecessary but it's not worth making a big deal out of. It's a
well-recognised idiom.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 0/2] mm, oom: do not grant oom victims full memory reserves access
  2017-07-27  9:03 ` Michal Hocko
@ 2017-08-07 14:21   ` Michal Hocko
  -1 siblings, 0 replies; 55+ messages in thread
From: Michal Hocko @ 2017-08-07 14:21 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Rientjes, Johannes Weiner, Roman Gushchin, Tetsuo Handa,
	linux-mm, LKML

On Thu 27-07-17 11:03:55, Michal Hocko wrote:
> Hi,
> this is a part of a larger series I posted back in Oct last year [1]. I
> have dropped patch 3 because it was incorrect and patch 4 is not
> applicable without it.
> 
> The primary reason to apply patch 1 is to remove a risk of the complete
> memory depletion by oom victims. While this is a theoretical risk right
> now there is a demand for memcg aware oom killer which might kill all
> processes inside a memcg which can be a lot of tasks. That would make
> the risk quite real.
> 
> This issue is addressed by limiting access to memory reserves. We no
> longer use TIF_MEMDIE to grant the access and use tsk_is_oom_victim
> instead. See Patch 1 for more details. Patch 2 is a trivial follow up
> cleanup.
> 
> I would still like to get rid of TIF_MEMDIE completely but I do not have
> time to do it now and it is not a pressing issue.
> 
> [1] http://lkml.kernel.org/r/20161004090009.7974-1-mhocko@kernel.org

Are there any more comments/questions? I will resubmit if not.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 1/2] mm, oom: do not rely on TIF_MEMDIE for memory reserves access
  2017-08-10  7:50 [PATCH v2 " Michal Hocko
@ 2017-08-10  7:50   ` Michal Hocko
  0 siblings, 0 replies; 55+ messages in thread
From: Michal Hocko @ 2017-08-10  7:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Tetsuo Handa, David Rientjes, Johannes Weiner,
	Roman Gushchin, linux-mm, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

For ages we have been relying on TIF_MEMDIE thread flag to mark OOM
victims and then, among other things, to give these threads full
access to memory reserves. There are few shortcomings of this
implementation, though.

First of all and the most serious one is that the full access to memory
reserves is quite dangerous because we leave no safety room for the
system to operate and potentially do last emergency steps to move on.

Secondly this flag is per task_struct while the OOM killer operates
on mm_struct granularity so all processes sharing the given mm are
killed. Giving the full access to all these task_structs could lead to
a quick memory reserves depletion. We have tried to reduce this risk by
giving TIF_MEMDIE only to the main thread and the currently allocating
task but that doesn't really solve this problem while it surely opens up
a room for corner cases - e.g. GFP_NO{FS,IO} requests might loop inside
the allocator without access to memory reserves because a particular
thread was not the group leader.

Now that we have the oom reaper and that all oom victims are reapable
after 1b51e65eab64 ("oom, oom_reaper: allow to reap mm shared by the
kthreads") we can be more conservative and grant only partial access to
memory reserves because there are reasonable chances of the parallel
memory freeing. We still want some access to reserves because we do not
want other consumers to eat up the victim's freed memory. oom victims
will still contend with __GFP_HIGH users but those shouldn't be so
aggressive to starve oom victims completely.

Introduce ALLOC_OOM flag and give all tsk_is_oom_victim tasks access to
the half of the reserves. This makes the access to reserves independent
on which task has passed through mark_oom_victim. Also drop any
usage of TIF_MEMDIE from the page allocator proper and replace it by
tsk_is_oom_victim as well which will make page_alloc.c completely
TIF_MEMDIE free finally.

CONFIG_MMU=n doesn't have oom reaper so let's stick to the original
ALLOC_NO_WATERMARKS approach.

There is a demand to make the oom killer memcg aware which will imply
many tasks killed at once. This change will allow such a usecase without
worrying about complete memory reserves depletion.

Changes since v1
- do not play tricks with nommu and grant access to memory reserves as
  long as TIF_MEMDIE is set
- break out from allocation properly for oom victims as per Tetsuo
- distinguish oom victims from other consumers of memory reserves in
  __gfp_pfmemalloc_flags - per Tetsuo
- clarify access to memory reserves in __zone_watermark_ok - per Mel
- make int->bool conversion in gfp_pfmemalloc_allowed more robust - per
  Mel
- s@reserves@reserve_flags@ in __alloc_pages_slowpath - per Mel

Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/internal.h   | 11 +++++++++
 mm/oom_kill.c   |  9 ++++---
 mm/page_alloc.c | 76 ++++++++++++++++++++++++++++++++++++++++++---------------
 3 files changed, 73 insertions(+), 23 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 24d88f084705..1ebcb1ed01b5 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -480,6 +480,17 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
 /* Mask to get the watermark bits */
 #define ALLOC_WMARK_MASK	(ALLOC_NO_WATERMARKS-1)
 
+/*
+ * Only MMU archs have async oom victim reclaim - aka oom_reaper so we
+ * cannot assume a reduced access to memory reserves is sufficient for
+ * !MMU
+ */
+#ifdef CONFIG_MMU
+#define ALLOC_OOM		0x08
+#else
+#define ALLOC_OOM		ALLOC_NO_WATERMARKS
+#endif
+
 #define ALLOC_HARDER		0x10 /* try to alloc harder */
 #define ALLOC_HIGH		0x20 /* __GFP_HIGH set */
 #define ALLOC_CPUSET		0x40 /* check for correct cpuset */
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 9e8b4f030c1c..c9f3569a76c7 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -824,7 +824,8 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 
 	/*
 	 * If the task is already exiting, don't alarm the sysadmin or kill
-	 * its children or threads, just set TIF_MEMDIE so it can die quickly
+	 * its children or threads, just give it access to memory reserves
+	 * so it can die quickly
 	 */
 	task_lock(p);
 	if (task_will_free_mem(p)) {
@@ -889,9 +890,9 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 	count_memcg_event_mm(mm, OOM_KILL);
 
 	/*
-	 * We should send SIGKILL before setting TIF_MEMDIE in order to prevent
-	 * the OOM victim from depleting the memory reserves from the user
-	 * space under its control.
+	 * We should send SIGKILL before granting access to memory reserves
+	 * in order to prevent the OOM victim from depleting the memory
+	 * reserves from the user space under its control.
 	 */
 	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
 	mark_oom_victim(victim);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 80e4adb4c360..90e331e4c077 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2930,7 +2930,7 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
 {
 	long min = mark;
 	int o;
-	const bool alloc_harder = (alloc_flags & ALLOC_HARDER);
+	const bool alloc_harder = (alloc_flags & (ALLOC_HARDER|ALLOC_OOM));
 
 	/* free_pages may go negative - that's OK */
 	free_pages -= (1 << order) - 1;
@@ -2943,10 +2943,21 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
 	 * the high-atomic reserves. This will over-estimate the size of the
 	 * atomic reserve but it avoids a search.
 	 */
-	if (likely(!alloc_harder))
+	if (likely(!alloc_harder)) {
 		free_pages -= z->nr_reserved_highatomic;
-	else
-		min -= min / 4;
+	} else {
+		/*
+		 * OOM victims can try even harder than normal ALLOC_HARDER
+		 * users on the grounds that it's definitely going to be in
+		 * the exit path shortly and free memory. Any allocation it
+		 * makes during the free path will be small and short-lived.
+		 */
+		if (alloc_flags & ALLOC_OOM)
+			min -= min / 2;
+		else
+			min -= min / 4;
+	}
+
 
 #ifdef CONFIG_CMA
 	/* If allocation can't use CMA areas don't use free CMA pages */
@@ -3184,7 +3195,7 @@ static void warn_alloc_show_mem(gfp_t gfp_mask, nodemask_t *nodemask)
 	 * of allowed nodes.
 	 */
 	if (!(gfp_mask & __GFP_NOMEMALLOC))
-		if (test_thread_flag(TIF_MEMDIE) ||
+		if (tsk_is_oom_victim(current) ||
 		    (current->flags & (PF_MEMALLOC | PF_EXITING)))
 			filter &= ~SHOW_MEM_FILTER_NODES;
 	if (in_interrupt() || !(gfp_mask & __GFP_DIRECT_RECLAIM))
@@ -3603,21 +3614,46 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 	return alloc_flags;
 }
 
-bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
+static bool oom_reserves_allowed(struct task_struct *tsk)
 {
-	if (unlikely(gfp_mask & __GFP_NOMEMALLOC))
+	if (!tsk_is_oom_victim(tsk))
+		return false;
+
+	/*
+	 * !MMU doesn't have oom reaper so give access to memory reserves
+	 * only to the thread with TIF_MEMDIE set
+	 */
+	if (!IS_ENABLED(CONFIG_MMU) && !test_thread_flag(TIF_MEMDIE))
 		return false;
 
+	return true;
+}
+
+/*
+ * Distinguish requests which really need access to full memory
+ * reserves from oom victims which can live with a portion of it
+ */
+static inline int __gfp_pfmemalloc_flags(gfp_t gfp_mask)
+{
+	if (unlikely(gfp_mask & __GFP_NOMEMALLOC))
+		return 0;
 	if (gfp_mask & __GFP_MEMALLOC)
-		return true;
+		return ALLOC_NO_WATERMARKS;
 	if (in_serving_softirq() && (current->flags & PF_MEMALLOC))
-		return true;
-	if (!in_interrupt() &&
-			((current->flags & PF_MEMALLOC) ||
-			 unlikely(test_thread_flag(TIF_MEMDIE))))
-		return true;
+		return ALLOC_NO_WATERMARKS;
+	if (!in_interrupt()) {
+		if (current->flags & PF_MEMALLOC)
+			return ALLOC_NO_WATERMARKS;
+		else if (oom_reserves_allowed(current))
+			return ALLOC_OOM;
+	}
 
-	return false;
+	return 0;
+}
+
+bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
+{
+	return !!__gfp_pfmemalloc_flags(gfp_mask);
 }
 
 /*
@@ -3770,6 +3806,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	unsigned long alloc_start = jiffies;
 	unsigned int stall_timeout = 10 * HZ;
 	unsigned int cpuset_mems_cookie;
+	int reserve_flags;
 
 	/*
 	 * In the slowpath, we sanity check order to avoid ever trying to
@@ -3875,15 +3912,16 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	if (gfp_mask & __GFP_KSWAPD_RECLAIM)
 		wake_all_kswapds(order, ac);
 
-	if (gfp_pfmemalloc_allowed(gfp_mask))
-		alloc_flags = ALLOC_NO_WATERMARKS;
+	reserve_flags = __gfp_pfmemalloc_flags(gfp_mask);
+	if (reserve_flags)
+		alloc_flags = reserve_flags;
 
 	/*
 	 * Reset the zonelist iterators if memory policies can be ignored.
 	 * These allocations are high priority and system rather than user
 	 * orientated.
 	 */
-	if (!(alloc_flags & ALLOC_CPUSET) || (alloc_flags & ALLOC_NO_WATERMARKS)) {
+	if (!(alloc_flags & ALLOC_CPUSET) || reserve_flags) {
 		ac->zonelist = node_zonelist(numa_node_id(), gfp_mask);
 		ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
 					ac->high_zoneidx, ac->nodemask);
@@ -3960,8 +3998,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		goto got_pg;
 
 	/* Avoid allocations with no watermarks from looping endlessly */
-	if (test_thread_flag(TIF_MEMDIE) &&
-	    (alloc_flags == ALLOC_NO_WATERMARKS ||
+	if (tsk_is_oom_victim(current) &&
+	    (alloc_flags == ALLOC_OOM ||
 	     (gfp_mask & __GFP_NOMEMALLOC)))
 		goto nopage;
 
-- 
2.13.2

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 1/2] mm, oom: do not rely on TIF_MEMDIE for memory reserves access
@ 2017-08-10  7:50   ` Michal Hocko
  0 siblings, 0 replies; 55+ messages in thread
From: Michal Hocko @ 2017-08-10  7:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Tetsuo Handa, David Rientjes, Johannes Weiner,
	Roman Gushchin, linux-mm, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

For ages we have been relying on TIF_MEMDIE thread flag to mark OOM
victims and then, among other things, to give these threads full
access to memory reserves. There are few shortcomings of this
implementation, though.

First of all and the most serious one is that the full access to memory
reserves is quite dangerous because we leave no safety room for the
system to operate and potentially do last emergency steps to move on.

Secondly this flag is per task_struct while the OOM killer operates
on mm_struct granularity so all processes sharing the given mm are
killed. Giving the full access to all these task_structs could lead to
a quick memory reserves depletion. We have tried to reduce this risk by
giving TIF_MEMDIE only to the main thread and the currently allocating
task but that doesn't really solve this problem while it surely opens up
a room for corner cases - e.g. GFP_NO{FS,IO} requests might loop inside
the allocator without access to memory reserves because a particular
thread was not the group leader.

Now that we have the oom reaper and all oom victims are reapable after
1b51e65eab64 ("oom, oom_reaper: allow to reap mm shared by the
kthreads"), we can be more conservative and grant only partial access
to memory reserves because there are reasonable chances of parallel
memory freeing. We still want some access to the reserves because we do
not want other consumers to eat up the victim's freed memory. oom
victims will still contend with __GFP_HIGH users, but those shouldn't
be so aggressive as to starve oom victims completely.
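
The partial access is implemented as a smaller watermark discount in
__zone_watermark_ok() - oom victims may eat into half of the min
watermark, other ALLOC_HARDER users only into a quarter (see the
mm/page_alloc.c hunk below). A stand-alone toy - the helper name and
the numbers are made up, the flag values are copied from the patch -
demonstrates the arithmetic:

	#include <stdio.h>

	#define ALLOC_OOM	0x08	/* values as in mm/internal.h below */
	#define ALLOC_HARDER	0x10

	/* effective watermark a request of the given class must stay above */
	static long effective_min(long min, int alloc_flags)
	{
		if (alloc_flags & ALLOC_OOM)		/* oom victim */
			min -= min / 2;
		else if (alloc_flags & ALLOC_HARDER)	/* e.g. atomic */
			min -= min / 4;
		return min;
	}

	int main(void)
	{
		long min = 1024;	/* pages; an arbitrary example */

		/* prints: normal: 1024, harder: 768, oom: 512 */
		printf("normal: %ld, harder: %ld, oom: %ld\n",
		       effective_min(min, 0),
		       effective_min(min, ALLOC_HARDER),
		       effective_min(min, ALLOC_OOM));
		return 0;
	}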

Introduce an ALLOC_OOM flag and give all tsk_is_oom_victim tasks access
to half of the reserves. This makes the access to reserves independent
of which task has passed through mark_oom_victim. Also drop any usage
of TIF_MEMDIE from the page allocator proper and replace it by
tsk_is_oom_victim as well, which finally makes page_alloc.c completely
TIF_MEMDIE free.
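
Note that tsk_is_oom_victim covers every thread in the victim's thread
group by construction because the mark lives in the shared
signal_struct; its definition at the time of this series was roughly
as follows (quoted from memory of include/linux/oom.h, so treat it as
a sketch rather than an authoritative citation):

	static inline bool tsk_is_oom_victim(struct task_struct *tsk)
	{
		/* oom_mm is set by mark_oom_victim() and, living in the
		 * signal_struct, is visible to all threads of the group */
		return tsk->signal->oom_mm;
	}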

CONFIG_MMU=n doesn't have the oom reaper, so let's stick to the
original ALLOC_NO_WATERMARKS approach there.

There is a demand to make the oom killer memcg aware, which will imply
many tasks killed at once. This change will allow such a usecase
without worrying about a complete depletion of memory reserves.

Changes since v1
- do not play tricks with nommu and grant access to memory reserves as
  long as TIF_MEMDIE is set
- break out from allocation properly for oom victims as per Tetsuo
- distinguish oom victims from other consumers of memory reserves in
  __gfp_pfmemalloc_flags - per Tetsuo
- clarify access to memory reserves in __zone_watermark_ok - per Mel
- make int->bool conversion in gfp_pfmemalloc_allowed more robust - per
  Mel
- s@reserves@reserve_flags@ in __alloc_pages_slowpath - per Mel

Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/internal.h   | 11 +++++++++
 mm/oom_kill.c   |  9 ++++---
 mm/page_alloc.c | 76 ++++++++++++++++++++++++++++++++++++++++++---------------
 3 files changed, 73 insertions(+), 23 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 24d88f084705..1ebcb1ed01b5 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -480,6 +480,17 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
 /* Mask to get the watermark bits */
 #define ALLOC_WMARK_MASK	(ALLOC_NO_WATERMARKS-1)
 
+/*
+ * Only MMU archs have async oom victim reclaim - aka oom_reaper so we
+ * cannot assume a reduced access to memory reserves is sufficient for
+ * !MMU
+ */
+#ifdef CONFIG_MMU
+#define ALLOC_OOM		0x08
+#else
+#define ALLOC_OOM		ALLOC_NO_WATERMARKS
+#endif
+
 #define ALLOC_HARDER		0x10 /* try to alloc harder */
 #define ALLOC_HIGH		0x20 /* __GFP_HIGH set */
 #define ALLOC_CPUSET		0x40 /* check for correct cpuset */
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 9e8b4f030c1c..c9f3569a76c7 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -824,7 +824,8 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 
 	/*
 	 * If the task is already exiting, don't alarm the sysadmin or kill
-	 * its children or threads, just set TIF_MEMDIE so it can die quickly
+	 * its children or threads, just give it access to memory reserves
+	 * so it can die quickly
 	 */
 	task_lock(p);
 	if (task_will_free_mem(p)) {
@@ -889,9 +890,9 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 	count_memcg_event_mm(mm, OOM_KILL);
 
 	/*
-	 * We should send SIGKILL before setting TIF_MEMDIE in order to prevent
-	 * the OOM victim from depleting the memory reserves from the user
-	 * space under its control.
+	 * We should send SIGKILL before granting access to memory reserves
+	 * in order to prevent the OOM victim from depleting the memory
+	 * reserves from the user space under its control.
 	 */
 	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
 	mark_oom_victim(victim);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 80e4adb4c360..90e331e4c077 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2930,7 +2930,7 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
 {
 	long min = mark;
 	int o;
-	const bool alloc_harder = (alloc_flags & ALLOC_HARDER);
+	const bool alloc_harder = (alloc_flags & (ALLOC_HARDER|ALLOC_OOM));
 
 	/* free_pages may go negative - that's OK */
 	free_pages -= (1 << order) - 1;
@@ -2943,10 +2943,21 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
 	 * the high-atomic reserves. This will over-estimate the size of the
 	 * atomic reserve but it avoids a search.
 	 */
-	if (likely(!alloc_harder))
+	if (likely(!alloc_harder)) {
 		free_pages -= z->nr_reserved_highatomic;
-	else
-		min -= min / 4;
+	} else {
+		/*
+		 * OOM victims can try even harder than normal ALLOC_HARDER
+		 * users on the grounds that it's definitely going to be in
+		 * the exit path shortly and free memory. Any allocation it
+		 * makes during the free path will be small and short-lived.
+		 */
+		if (alloc_flags & ALLOC_OOM)
+			min -= min / 2;
+		else
+			min -= min / 4;
+	}
+
 
 #ifdef CONFIG_CMA
 	/* If allocation can't use CMA areas don't use free CMA pages */
@@ -3184,7 +3195,7 @@ static void warn_alloc_show_mem(gfp_t gfp_mask, nodemask_t *nodemask)
 	 * of allowed nodes.
 	 */
 	if (!(gfp_mask & __GFP_NOMEMALLOC))
-		if (test_thread_flag(TIF_MEMDIE) ||
+		if (tsk_is_oom_victim(current) ||
 		    (current->flags & (PF_MEMALLOC | PF_EXITING)))
 			filter &= ~SHOW_MEM_FILTER_NODES;
 	if (in_interrupt() || !(gfp_mask & __GFP_DIRECT_RECLAIM))
@@ -3603,21 +3614,46 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 	return alloc_flags;
 }
 
-bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
+static bool oom_reserves_allowed(struct task_struct *tsk)
 {
-	if (unlikely(gfp_mask & __GFP_NOMEMALLOC))
+	if (!tsk_is_oom_victim(tsk))
+		return false;
+
+	/*
+	 * !MMU doesn't have oom reaper so give access to memory reserves
+	 * only to the thread with TIF_MEMDIE set
+	 */
+	if (!IS_ENABLED(CONFIG_MMU) && !test_thread_flag(TIF_MEMDIE))
 		return false;
 
+	return true;
+}
+
+/*
+ * Distinguish requests which really need access to full memory
+ * reserves from oom victims which can live with a portion of it
+ */
+static inline int __gfp_pfmemalloc_flags(gfp_t gfp_mask)
+{
+	if (unlikely(gfp_mask & __GFP_NOMEMALLOC))
+		return 0;
 	if (gfp_mask & __GFP_MEMALLOC)
-		return true;
+		return ALLOC_NO_WATERMARKS;
 	if (in_serving_softirq() && (current->flags & PF_MEMALLOC))
-		return true;
-	if (!in_interrupt() &&
-			((current->flags & PF_MEMALLOC) ||
-			 unlikely(test_thread_flag(TIF_MEMDIE))))
-		return true;
+		return ALLOC_NO_WATERMARKS;
+	if (!in_interrupt()) {
+		if (current->flags & PF_MEMALLOC)
+			return ALLOC_NO_WATERMARKS;
+		else if (oom_reserves_allowed(current))
+			return ALLOC_OOM;
+	}
 
-	return false;
+	return 0;
+}
+
+bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
+{
+	return !!__gfp_pfmemalloc_flags(gfp_mask);
 }
 
 /*
@@ -3770,6 +3806,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	unsigned long alloc_start = jiffies;
 	unsigned int stall_timeout = 10 * HZ;
 	unsigned int cpuset_mems_cookie;
+	int reserve_flags;
 
 	/*
 	 * In the slowpath, we sanity check order to avoid ever trying to
@@ -3875,15 +3912,16 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	if (gfp_mask & __GFP_KSWAPD_RECLAIM)
 		wake_all_kswapds(order, ac);
 
-	if (gfp_pfmemalloc_allowed(gfp_mask))
-		alloc_flags = ALLOC_NO_WATERMARKS;
+	reserve_flags = __gfp_pfmemalloc_flags(gfp_mask);
+	if (reserve_flags)
+		alloc_flags = reserve_flags;
 
 	/*
 	 * Reset the zonelist iterators if memory policies can be ignored.
 	 * These allocations are high priority and system rather than user
 	 * orientated.
 	 */
-	if (!(alloc_flags & ALLOC_CPUSET) || (alloc_flags & ALLOC_NO_WATERMARKS)) {
+	if (!(alloc_flags & ALLOC_CPUSET) || reserve_flags) {
 		ac->zonelist = node_zonelist(numa_node_id(), gfp_mask);
 		ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
 					ac->high_zoneidx, ac->nodemask);
@@ -3960,8 +3998,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		goto got_pg;
 
 	/* Avoid allocations with no watermarks from looping endlessly */
-	if (test_thread_flag(TIF_MEMDIE) &&
-	    (alloc_flags == ALLOC_NO_WATERMARKS ||
+	if (tsk_is_oom_victim(current) &&
+	    (alloc_flags == ALLOC_OOM ||
 	     (gfp_mask & __GFP_NOMEMALLOC)))
 		goto nopage;
 
-- 
2.13.2

^ permalink raw reply related	[flat|nested] 55+ messages in thread

end of thread, other threads:[~2017-08-10  7:50 UTC | newest]

Thread overview: 55+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-07-27  9:03 [PATCH 0/2] mm, oom: do not grant oom victims full memory reserves access Michal Hocko
2017-07-27  9:03 ` Michal Hocko
2017-07-27  9:03 ` [PATCH 1/2] mm, oom: do not rely on TIF_MEMDIE for " Michal Hocko
2017-07-27  9:03   ` Michal Hocko
2017-08-01 15:30   ` Tetsuo Handa
2017-08-01 15:30     ` Tetsuo Handa
2017-08-01 16:52     ` Michal Hocko
2017-08-01 16:52       ` Michal Hocko
2017-08-02  6:10       ` Michal Hocko
2017-08-02  6:10         ` Michal Hocko
2017-08-03  1:39       ` Tetsuo Handa
2017-08-03  1:39         ` Tetsuo Handa
2017-08-03  7:06         ` Michal Hocko
2017-08-03  7:06           ` Michal Hocko
2017-08-03  8:03           ` Tetsuo Handa
2017-08-03  8:03             ` Tetsuo Handa
2017-08-03  8:21             ` Michal Hocko
2017-08-03  8:21               ` Michal Hocko
2017-08-02  8:29   ` [PATCH v2 " Michal Hocko
2017-08-02  8:29     ` Michal Hocko
2017-08-03  9:37     ` Mel Gorman
2017-08-03  9:37       ` Mel Gorman
2017-08-03 11:00       ` Michal Hocko
2017-08-03 11:00         ` Michal Hocko
2017-08-03 12:22         ` Mel Gorman
2017-08-03 12:22           ` Mel Gorman
2017-07-27  9:03 ` [PATCH 2/2] mm: replace TIF_MEMDIE checks by tsk_is_oom_victim Michal Hocko
2017-07-27  9:03   ` Michal Hocko
2017-07-27 14:01   ` Tetsuo Handa
2017-07-27 14:01     ` Tetsuo Handa
2017-07-27 14:08     ` Tetsuo Handa
2017-07-27 14:08       ` Tetsuo Handa
2017-07-27 14:18     ` Michal Hocko
2017-07-27 14:18       ` Michal Hocko
2017-07-27 14:45     ` Michal Hocko
2017-07-27 14:45       ` Michal Hocko
2017-07-27 14:55       ` Roman Gushchin
2017-07-27 14:55         ` Roman Gushchin
2017-07-29  8:33   ` kbuild test robot
2017-07-31  6:46     ` Michal Hocko
2017-07-31  6:46       ` Michal Hocko
2017-08-01 12:16 ` [PATCH 0/2] mm, oom: do not grant oom victims full memory reserves access Michal Hocko
2017-08-01 12:16   ` Michal Hocko
2017-08-01 12:23   ` Roman Gushchin
2017-08-01 12:23     ` Roman Gushchin
2017-08-01 12:29     ` Michal Hocko
2017-08-01 12:29       ` Michal Hocko
2017-08-01 12:42       ` Roman Gushchin
2017-08-01 12:42         ` Roman Gushchin
2017-08-01 12:54         ` Michal Hocko
2017-08-01 12:54           ` Michal Hocko
2017-08-07 14:21 ` Michal Hocko
2017-08-07 14:21   ` Michal Hocko
2017-08-10  7:50 [PATCH v2 " Michal Hocko
2017-08-10  7:50 ` [PATCH 1/2] mm, oom: do not rely on TIF_MEMDIE for " Michal Hocko
2017-08-10  7:50   ` Michal Hocko
