* [RFC 0/4] mm, oom: get rid of TIF_MEMDIE
From: Michal Hocko @ 2016-09-01  9:51 UTC
  To: linux-mm
  Cc: David Rientjes, Tetsuo Handa, Johannes Weiner, Andrew Morton,
	LKML, Al Viro, Michal Hocko, Oleg Nesterov

Hi,
this is an early RFC to see whether the approach I've taken is acceptable.
The series is on top of the current mmotm tree (2016-08-31-16-06). I didn't
get to test it so it might be completely broken.

The primary point of this series is to finally get rid of TIF_MEMDIE.
Recent changes in the oom proper allow for that, I believe. Now that
all oom victims are reapable we no longer depend on ALLOC_NO_WATERMARKS
because the memory held by the victim is reclaimed asynchronously.
Partial access to memory reserves should be sufficient to guarantee
that the oom victim is not starved by other memory consumers. This
also means that we no longer have to pretend to be conservative and
give access to memory reserves to only one thread of the process at a
time. This is patch 1.
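
In allocator terms the idea is roughly the following (a sketch only;
patch 1 has the real check in __alloc_pages_slowpath):

	if (gfp_pfmemalloc_allowed(gfp_mask)) {
		if (tsk_is_oom_victim(current))
			/* partial reserves; the oom reaper frees the rest */
			alloc_flags = ALLOC_OOM;
		else
			alloc_flags = ALLOC_NO_WATERMARKS;
	}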

Patch 2 is a simple cleanup which turns TIF_MEMDIE users into
tsk_is_oom_victim users, which is process rather than thread centric.
None of those callers really needs to be thread aware AFAICS.

The tricky part then is exit_oom_victim vs. oom_killer_disable because
TIF_MEMDIE acted as a token there, giving us a way to count threads from
the process. It didn't work 100% reliably and had its own issues, but we
have to replace it with something which doesn't rely on counting threads
and instead finds the moment when all threads have reached a steady state
in do_exit. This is what patch 3 does and I would really appreciate it if
Oleg could double check my thinking there. I am also CCing Al on that one
because I am moving exit_io_context up in do_exit, right before
exit_notify.

The last patch just removes TIF_MEMDIE from the arch code because it is
no longer needed anywhere.

I would give it some more testing but I am leaving for a couple of days
and I wanted to check whether this is a sound way to go before that.

I really appreciate any feedback.

Michal Hocko (4):
      mm, oom: do not rely on TIF_MEMDIE for memory reserves access
      mm: replace TIF_MEMDIE checks by tsk_is_oom_victim
      mm, oom: do not rely on TIF_MEMDIE for exit_oom_victim
      arch: get rid of TIF_MEMDIE


The diffstat looks quite promising to me:
 arch/alpha/include/asm/thread_info.h      |  1 -
 arch/arc/include/asm/thread_info.h        |  2 --
 arch/arm/include/asm/thread_info.h        |  1 -
 arch/arm64/include/asm/thread_info.h      |  1 -
 arch/avr32/include/asm/thread_info.h      |  2 --
 arch/blackfin/include/asm/thread_info.h   |  1 -
 arch/c6x/include/asm/thread_info.h        |  1 -
 arch/cris/include/asm/thread_info.h       |  1 -
 arch/frv/include/asm/thread_info.h        |  1 -
 arch/h8300/include/asm/thread_info.h      |  1 -
 arch/hexagon/include/asm/thread_info.h    |  1 -
 arch/ia64/include/asm/thread_info.h       |  1 -
 arch/m32r/include/asm/thread_info.h       |  1 -
 arch/m68k/include/asm/thread_info.h       |  1 -
 arch/metag/include/asm/thread_info.h      |  1 -
 arch/microblaze/include/asm/thread_info.h |  1 -
 arch/mips/include/asm/thread_info.h       |  1 -
 arch/mn10300/include/asm/thread_info.h    |  1 -
 arch/nios2/include/asm/thread_info.h      |  1 -
 arch/openrisc/include/asm/thread_info.h   |  1 -
 arch/parisc/include/asm/thread_info.h     |  1 -
 arch/powerpc/include/asm/thread_info.h    |  1 -
 arch/s390/include/asm/thread_info.h       |  1 -
 arch/score/include/asm/thread_info.h      |  1 -
 arch/sh/include/asm/thread_info.h         |  1 -
 arch/sparc/include/asm/thread_info_32.h   |  1 -
 arch/sparc/include/asm/thread_info_64.h   |  1 -
 arch/tile/include/asm/thread_info.h       |  2 --
 arch/um/include/asm/thread_info.h         |  2 --
 arch/unicore32/include/asm/thread_info.h  |  1 -
 arch/x86/include/asm/thread_info.h        |  1 -
 arch/xtensa/include/asm/thread_info.h     |  1 -
 include/linux/sched.h                     |  2 +-
 kernel/cpuset.c                           |  9 ++---
 kernel/exit.c                             | 38 +++++++++++++++------
 kernel/freezer.c                          |  3 +-
 mm/internal.h                             | 11 ++++++
 mm/memcontrol.c                           |  2 +-
 mm/oom_kill.c                             | 42 +++++++++++++----------
 mm/page_alloc.c                           | 57 +++++++++++++++++++++++++------
 40 files changed, 118 insertions(+), 82 deletions(-)



* [RFC 1/4] mm, oom: do not rely on TIF_MEMDIE for memory reserves access
From: Michal Hocko @ 2016-09-01  9:51 UTC
  To: linux-mm
  Cc: David Rientjes, Tetsuo Handa, Johannes Weiner, Andrew Morton,
	LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

For ages we have been relying on the TIF_MEMDIE thread flag to mark OOM
victims and then, among other things, to give these threads full
access to memory reserves. There are a few shortcomings of this
implementation, though.

First of all, and most seriously, the full access to memory reserves is
quite dangerous because we leave no safety room for the system to operate
and potentially do last emergency steps to move on.

Secondly, this flag is per task_struct while the OOM killer operates
on mm_struct granularity, so all processes sharing the given mm are
killed. Giving full access to all these task_structs could lead to
a quick depletion of the memory reserves. We have tried to reduce this
risk by giving TIF_MEMDIE only to the main thread and the currently
allocating task, but that doesn't really solve the problem while it
surely opens up room for corner cases - e.g. GFP_NO{FS,IO} requests
might loop inside the allocator without access to memory reserves
because a particular thread was not the group leader.
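
To illustrate that corner case (a hypothetical sketch, thread roles
made up):

	/*
	 * Threads A (group leader) and B share the mm of an OOM-killed
	 * process:
	 *
	 *	A				B
	 *	gets TIF_MEMDIE			allocates with GFP_NOFS
	 *					  no TIF_MEMDIE
	 *					  -> no ALLOC_NO_WATERMARKS
	 *					  !__GFP_FS -> cannot invoke the
	 *					  OOM killer, so B can loop in
	 *					  the allocator indefinitely
	 */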

Now that we have the oom reaper and all oom victims are reapable
(after "oom, oom_reaper: allow to reap mm shared by the kthreads")
we can be more conservative and grant only partial access to memory
reserves because there is a reasonable chance of parallel memory
freeing. We still want some access to reserves because we do not
want other consumers to eat up the victim's freed memory. oom victims
will still contend with __GFP_HIGH users but those shouldn't be
aggressive enough to starve oom victims completely.

Introduce an ALLOC_OOM flag and give all tsk_is_oom_victim tasks access
to half of the reserves. This makes the access to reserves independent
of which task has passed through mark_oom_victim. Also drop any
usage of TIF_MEMDIE from the page allocator proper and replace it with
tsk_is_oom_victim as well, which finally makes page_alloc.c completely
TIF_MEMDIE free.
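
To make the halved reserve concrete (made-up numbers; the reductions in
__zone_watermark_ok combine when several flags are set together):

	/*
	 * Assume a zone min watermark of 4096 pages. Taking each flag alone:
	 *
	 *	no flags		     -> 4096 pages must stay free
	 *	ALLOC_HARDER (min -= min/4)  -> 3072 pages must stay free
	 *	ALLOC_HIGH   (min -= min/2)  -> 2048 pages must stay free
	 *	ALLOC_OOM    (min -= min/2)  -> 2048 pages must stay free
	 *	ALLOC_NO_WATERMARKS	     -> watermark ignored entirely
	 */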

CONFIG_MMU=n doesn't have the oom reaper so let's stick to the original
ALLOC_NO_WATERMARKS approach there. But be careful, because such victims
might still deplete all the memory reserves, so keep the semantic as close
to the original implementation as possible and give them access to memory
reserves only up to exit_mm (when tsk->mm is cleared) rather than for as
long as tsk_is_oom_victim holds, which is until the signal struct is gone.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/internal.h   | 11 +++++++++++
 mm/oom_kill.c   |  9 +++++----
 mm/page_alloc.c | 57 ++++++++++++++++++++++++++++++++++++++++++++++-----------
 3 files changed, 62 insertions(+), 15 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 5214bf8e3171..f693c4e61a0a 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -460,6 +460,17 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
 /* Mask to get the watermark bits */
 #define ALLOC_WMARK_MASK	(ALLOC_NO_WATERMARKS-1)
 
+/*
+ * Only MMU archs have async oom victim reclaim - aka the oom_reaper - so
+ * we cannot assume that reduced access to memory reserves is sufficient
+ * for !MMU
+ */
+#ifdef CONFIG_MMU
+#define ALLOC_OOM		0x08
+#else
+#define ALLOC_OOM		ALLOC_NO_WATERMARKS
+#endif
+
 #define ALLOC_HARDER		0x10 /* try to alloc harder */
 #define ALLOC_HIGH		0x20 /* __GFP_HIGH set */
 #define ALLOC_CPUSET		0x40 /* check for correct cpuset */
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 0034baf35f0c..b11977585c7b 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -816,7 +816,8 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 
 	/*
 	 * If the task is already exiting, don't alarm the sysadmin or kill
-	 * its children or threads, just set TIF_MEMDIE so it can die quickly
+	 * its children or threads, just give it access to memory reserves
+	 * so it can die quickly
 	 */
 	task_lock(p);
 	if (task_will_free_mem(p)) {
@@ -876,9 +877,9 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 	mm = victim->mm;
 	atomic_inc(&mm->mm_count);
 	/*
-	 * We should send SIGKILL before setting TIF_MEMDIE in order to prevent
-	 * the OOM victim from depleting the memory reserves from the user
-	 * space under its control.
+	 * We should send SIGKILL before granting access to memory reserves
+	 * in order to prevent the OOM victim from depleting the memory
+	 * reserves from the user space under its control.
 	 */
 	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
 	mark_oom_victim(victim);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ee3997859f14..b10025aa3dc7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2737,7 +2737,7 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
 {
 	long min = mark;
 	int o;
-	const bool alloc_harder = (alloc_flags & ALLOC_HARDER);
+	const bool alloc_harder = (alloc_flags & (ALLOC_HARDER|ALLOC_OOM));
 
 	/* free_pages may go negative - that's OK */
 	free_pages -= (1 << order) - 1;
@@ -2750,10 +2750,19 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
 	 * the high-atomic reserves. This will over-estimate the size of the
 	 * atomic reserve but it avoids a search.
 	 */
-	if (likely(!alloc_harder))
+	if (likely(!alloc_harder)) {
 		free_pages -= z->nr_reserved_highatomic;
-	else
-		min -= min / 4;
+	} else {
+		/*
+		 * OOM victims can try even harder than normal ALLOC_HARDER
+		 * users
+		 */
+		if (alloc_flags & ALLOC_OOM)
+			min -= min / 2;
+		else
+			min -= min / 4;
+	}
+
 
 #ifdef CONFIG_CMA
 	/* If allocation can't use CMA areas don't use free CMA pages */
@@ -2995,7 +3004,7 @@ void warn_alloc_failed(gfp_t gfp_mask, unsigned int order, const char *fmt, ...)
 	 * of allowed nodes.
 	 */
 	if (!(gfp_mask & __GFP_NOMEMALLOC))
-		if (test_thread_flag(TIF_MEMDIE) ||
+		if (tsk_is_oom_victim(current) ||
 		    (current->flags & (PF_MEMALLOC | PF_EXITING)))
 			filter &= ~SHOW_MEM_FILTER_NODES;
 	if (in_interrupt() || !(gfp_mask & __GFP_DIRECT_RECLAIM))
@@ -3309,6 +3318,22 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 	return alloc_flags;
 }
 
+static bool oom_reserves_allowed(struct task_struct *tsk)
+{
+	if (!tsk_is_oom_victim(tsk))
+		return false;
+
+	/*
+	 * !MMU doesn't have the oom reaper so we shouldn't risk depleting
+	 * the memory reserves and shouldn't give access to them past
+	 * exit_mm
+	 */
+	if (!IS_ENABLED(CONFIG_MMU) && !tsk->mm)
+		return false;
+
+	return true;
+}
+
 bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 {
 	if (unlikely(gfp_mask & __GFP_NOMEMALLOC))
@@ -3320,7 +3345,7 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 		return true;
 	if (!in_interrupt() &&
 			((current->flags & PF_MEMALLOC) ||
-			 unlikely(test_thread_flag(TIF_MEMDIE))))
+			 oom_reserves_allowed(current)))
 		return true;
 
 	return false;
@@ -3424,6 +3449,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 						struct alloc_context *ac)
 {
 	bool can_direct_reclaim = gfp_mask & __GFP_DIRECT_RECLAIM;
+	bool reserves;
 	struct page *page = NULL;
 	unsigned int alloc_flags;
 	unsigned long did_some_progress;
@@ -3514,15 +3540,24 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	if (gfp_mask & __GFP_KSWAPD_RECLAIM)
 		wake_all_kswapds(order, ac);
 
-	if (gfp_pfmemalloc_allowed(gfp_mask))
-		alloc_flags = ALLOC_NO_WATERMARKS;
+	/*
+	 * Distinguish requests which really need access to the whole memory
+	 * reserves from oom victims which can live with their own reserve
+	 */
+	reserves = gfp_pfmemalloc_allowed(gfp_mask);
+	if (reserves) {
+		if (tsk_is_oom_victim(current))
+			alloc_flags = ALLOC_OOM;
+		else
+			alloc_flags = ALLOC_NO_WATERMARKS;
+	}
 
 	/*
 	 * Reset the zonelist iterators if memory policies can be ignored.
 	 * These allocations are high priority and system rather than user
 	 * orientated.
 	 */
-	if (!(alloc_flags & ALLOC_CPUSET) || (alloc_flags & ALLOC_NO_WATERMARKS)) {
+	if (!(alloc_flags & ALLOC_CPUSET) || reserves) {
 		ac->zonelist = node_zonelist(numa_node_id(), gfp_mask);
 		ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
 					ac->high_zoneidx, ac->nodemask);
@@ -3558,8 +3593,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		goto nopage;
 	}
 
-	/* Avoid allocations with no watermarks from looping endlessly */
-	if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOFAIL))
+	/* Avoid allocations for oom victims from looping endlessly */
+	if (tsk_is_oom_victim(current) && !(gfp_mask & __GFP_NOFAIL))
 		goto nopage;
 
 
-- 
2.8.1


* [RFC 2/4] mm: replace TIF_MEMDIE checks by tsk_is_oom_victim
From: Michal Hocko @ 2016-09-01  9:51 UTC
  To: linux-mm
  Cc: David Rientjes, Tetsuo Handa, Johannes Weiner, Andrew Morton,
	LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

TIF_MEMDIE is set only for tasks which were either directly selected
by the OOM killer or passed through mark_oom_victim from the allocator
path. tsk_is_oom_victim is more generic and allows us to identify all
tasks (threads) which share the mm with the oom victim.
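
For reference, the helper relied on here looks like this (introduced
earlier in mmotm by "mm, oom: keep mm of the killed task available";
quoted from memory, see include/linux/oom.h for the authoritative
version):

	static inline bool tsk_is_oom_victim(struct task_struct *tsk)
	{
		/*
		 * oom_mm lives in the shared signal_struct so every thread
		 * of the victim's thread group sees the mark
		 */
		return tsk->signal->oom_mm;
	}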

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 kernel/cpuset.c | 9 +++++----
 mm/memcontrol.c | 2 +-
 mm/oom_kill.c   | 4 ++--
 3 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index c7fd2778ed50..8e370d9d63ee 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -54,6 +54,7 @@
 #include <linux/time64.h>
 #include <linux/backing-dev.h>
 #include <linux/sort.h>
+#include <linux/oom.h>
 
 #include <asm/uaccess.h>
 #include <linux/atomic.h>
@@ -2487,12 +2488,12 @@ static struct cpuset *nearest_hardwall_ancestor(struct cpuset *cs)
  * If we're in interrupt, yes, we can always allocate.  If @node is set in
  * current's mems_allowed, yes.  If it's not a __GFP_HARDWALL request and this
  * node is set in the nearest hardwalled cpuset ancestor to current's cpuset,
- * yes.  If current has access to memory reserves due to TIF_MEMDIE, yes.
+ * yes.  If current has access to memory reserves as an oom victim, yes.
  * Otherwise, no.
  *
  * GFP_USER allocations are marked with the __GFP_HARDWALL bit,
  * and do not allow allocations outside the current tasks cpuset
- * unless the task has been OOM killed as is marked TIF_MEMDIE.
+ * unless the task has been OOM killed.
  * GFP_KERNEL allocations are not so marked, so can escape to the
  * nearest enclosing hardwalled ancestor cpuset.
  *
@@ -2515,7 +2516,7 @@ static struct cpuset *nearest_hardwall_ancestor(struct cpuset *cs)
  * affect that:
  *	in_interrupt - any node ok (current task context irrelevant)
  *	GFP_ATOMIC   - any node ok
- *	TIF_MEMDIE   - any node ok
+ *	tsk_is_oom_victim - any node ok
  *	GFP_KERNEL   - any node in enclosing hardwalled cpuset ok
  *	GFP_USER     - only nodes in current tasks mems allowed ok.
  */
@@ -2533,7 +2534,7 @@ bool __cpuset_node_allowed(int node, gfp_t gfp_mask)
 	 * Allow tasks that have access to memory reserves because they have
 	 * been OOM killed to get memory anywhere.
 	 */
-	if (unlikely(test_thread_flag(TIF_MEMDIE)))
+	if (unlikely(tsk_is_oom_victim(current)))
 		return true;
 	if (gfp_mask & __GFP_HARDWALL)	/* If hardwall request, stop here */
 		return false;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9ee178ba7b71..df58733ca48e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1899,7 +1899,7 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * bypass the last charges so that they can exit quickly and
 	 * free their memory.
 	 */
-	if (unlikely(test_thread_flag(TIF_MEMDIE) ||
+	if (unlikely(tsk_is_oom_victim(current) ||
 		     fatal_signal_pending(current) ||
 		     current->flags & PF_EXITING))
 		goto force;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index b11977585c7b..e26529edcee3 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -477,7 +477,7 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 	 *				[...]
 	 *				out_of_memory
 	 *				  select_bad_process
-	 *				    # no TIF_MEMDIE task selects new victim
+	 *				    # no oom victim, selects new victim
 	 *  unmap_page_range # frees some memory
 	 */
 	mutex_lock(&oom_lock);
@@ -1078,7 +1078,7 @@ void pagefault_out_of_memory(void)
 		 * be a racing OOM victim for which oom_killer_disable()
 		 * is waiting for.
 		 */
-		WARN_ON(test_thread_flag(TIF_MEMDIE));
+		WARN_ON(tsk_is_oom_victim(current));
 	}
 
 	mutex_unlock(&oom_lock);
-- 
2.8.1


* [RFC 3/4] mm, oom: do not rely on TIF_MEMDIE for exit_oom_victim
From: Michal Hocko @ 2016-09-01  9:51 UTC
  To: linux-mm
  Cc: David Rientjes, Tetsuo Handa, Johannes Weiner, Andrew Morton,
	LKML, Michal Hocko, Oleg Nesterov, Al Viro

From: Michal Hocko <mhocko@suse.com>

mark_oom_victim and exit_oom_victim are used for oom_killer_disable
which should block as long as there are any oom victims alive. Up to now
we have relied on the TIF_MEMDIE task flag to count how many oom victims
we have. This is not optimal because only one thread receives this flag
at a time while the whole process (thread group) is killed and should
die. As a result we do not thaw the whole thread group and so a multi
threaded process can leave some threads behind in the fridge. We really
want to thaw all the threads.

This is not all that easy because there is no reliable way to count
threads in the process as the oom killer might race with copy_process.
So marking all threads with TIF_MEMDIE and incrementing oom_victims
accordingly is not safe. Also, the TIF_MEMDIE flag should just die, so
we had better come up with a different approach.

All we need to guarantee is that exit_oom_victim is called only once
per process, and at a time when the task can no longer access (possibly
suspended) devices or generate other IO (which would clobber the
suspended image). It seems we can rely on exit_notify for that because
we already have to detect the last thread there to do a cleanup. Let's
propagate that information up to do_exit and only call exit_oom_victim
for such a thread. With this in place we can safely increment oom_victims
only once per thread group and thaw all the threads from the process.
freezing_slow_path can rely on tsk_is_oom_victim now as well.

exit_io_context is currently called after exit_notify but it seems it is
safe to call it right before exit_notify because that point is already
past exit_files.
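
The resulting ordering in do_exit then looks roughly like this (a sketch
of the intent; see the diff below for the details):

	exit_mm(tsk);			/* no exit_oom_victim here anymore */
	/* ... */
	if (tsk->io_context)
		exit_io_context(tsk);	/* moved up, already past exit_files */

	/* exit_notify now returns true only for the last exiting thread */
	if (exit_notify(tsk, group_dead) && tsk_is_oom_victim(current))
		exit_oom_victim();	/* unblocks oom_killer_disable */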

Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/sched.h |  2 +-
 kernel/exit.c         | 38 ++++++++++++++++++++++++++++----------
 kernel/freezer.c      |  3 ++-
 mm/oom_kill.c         | 29 +++++++++++++++++------------
 4 files changed, 48 insertions(+), 24 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 770d01e7a68e..605e40b47992 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2660,7 +2660,7 @@ static inline void kernel_signal_stop(void)
 	schedule();
 }
 
-extern void release_task(struct task_struct * p);
+extern bool release_task(struct task_struct * p);
 extern int send_sig_info(int, struct siginfo *, struct task_struct *);
 extern int force_sigsegv(int, struct task_struct *);
 extern int force_sig_info(int, struct siginfo *, struct task_struct *);
diff --git a/kernel/exit.c b/kernel/exit.c
index 914088e8c2ac..c762416dbed1 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -165,10 +165,11 @@ static void delayed_put_task_struct(struct rcu_head *rhp)
 }
 
 
-void release_task(struct task_struct *p)
+bool release_task(struct task_struct *p)
 {
 	struct task_struct *leader;
 	int zap_leader;
+	bool last = false;
 repeat:
 	/* don't need to get the RCU readlock here - the process is dead and
 	 * can't be modifying its own credentials. But shut RCU-lockdep up */
@@ -197,8 +198,10 @@ void release_task(struct task_struct *p)
 		 * then we are the one who should release the leader.
 		 */
 		zap_leader = do_notify_parent(leader, leader->exit_signal);
-		if (zap_leader)
+		if (zap_leader) {
 			leader->exit_state = EXIT_DEAD;
+			last = true;
+		}
 	}
 
 	write_unlock_irq(&tasklist_lock);
@@ -208,6 +211,8 @@ void release_task(struct task_struct *p)
 	p = leader;
 	if (unlikely(zap_leader))
 		goto repeat;
+
+	return last;
 }
 
 /*
@@ -434,8 +439,6 @@ static void exit_mm(struct task_struct *tsk)
 	task_unlock(tsk);
 	mm_update_next_owner(mm);
 	mmput(mm);
-	if (test_thread_flag(TIF_MEMDIE))
-		exit_oom_victim();
 }
 
 static struct task_struct *find_alive_thread(struct task_struct *p)
@@ -584,12 +587,15 @@ static void forget_original_parent(struct task_struct *father,
 /*
  * Send signals to all our closest relatives so that they know
  * to properly mourn us..
+ *
+ * Returns true if this is the last thread from the thread group
  */
-static void exit_notify(struct task_struct *tsk, int group_dead)
+static bool exit_notify(struct task_struct *tsk, int group_dead)
 {
 	bool autoreap;
 	struct task_struct *p, *n;
 	LIST_HEAD(dead);
+	bool last = false;
 
 	write_lock_irq(&tasklist_lock);
 	forget_original_parent(tsk, &dead);
@@ -606,6 +612,7 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
 	} else if (thread_group_leader(tsk)) {
 		autoreap = thread_group_empty(tsk) &&
 			do_notify_parent(tsk, tsk->exit_signal);
+		last = thread_group_empty(tsk);
 	} else {
 		autoreap = true;
 	}
@@ -621,8 +628,11 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
 
 	list_for_each_entry_safe(p, n, &dead, ptrace_entry) {
 		list_del_init(&p->ptrace_entry);
-		release_task(p);
+		if (release_task(p) && p == tsk)
+			last = true;
 	}
+
+	return last;
 }
 
 #ifdef CONFIG_DEBUG_STACK_USAGE
@@ -766,7 +776,18 @@ void do_exit(long code)
 	TASKS_RCU(preempt_disable());
 	TASKS_RCU(tasks_rcu_i = __srcu_read_lock(&tasks_rcu_exit_srcu));
 	TASKS_RCU(preempt_enable());
-	exit_notify(tsk, group_dead);
+
+	if (tsk->io_context)
+		exit_io_context(tsk);
+
+	/*
+	 * Notify oom_killer_disable that the last oom thread is exiting.
+	 * We might have more threads running at this point but none of them
+	 * will access any devices beyond this point.
+	 */
+	if (exit_notify(tsk, group_dead) && tsk_is_oom_victim(current))
+		exit_oom_victim();
+
 	proc_exit_connector(tsk);
 	mpol_put_task_policy(tsk);
 #ifdef CONFIG_FUTEX
@@ -784,9 +805,6 @@ void do_exit(long code)
 	 */
 	tsk->flags |= PF_EXITPIDONE;
 
-	if (tsk->io_context)
-		exit_io_context(tsk);
-
 	if (tsk->splice_pipe)
 		free_pipe_info(tsk->splice_pipe);
 
diff --git a/kernel/freezer.c b/kernel/freezer.c
index 6f56a9e219fa..c6a64474a408 100644
--- a/kernel/freezer.c
+++ b/kernel/freezer.c
@@ -10,6 +10,7 @@
 #include <linux/syscalls.h>
 #include <linux/freezer.h>
 #include <linux/kthread.h>
+#include <linux/oom.h>
 
 /* total number of freezing conditions in effect */
 atomic_t system_freezing_cnt = ATOMIC_INIT(0);
@@ -42,7 +43,7 @@ bool freezing_slow_path(struct task_struct *p)
 	if (p->flags & (PF_NOFREEZE | PF_SUSPEND_TASK))
 		return false;
 
-	if (test_tsk_thread_flag(p, TIF_MEMDIE))
+	if (tsk_is_oom_victim(p))
 		return false;
 
 	if (pm_nosig_freezing || cgroup_freezing(p))
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index e26529edcee3..5dec6321ac7b 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -649,33 +649,38 @@ static inline void wake_oom_reaper(struct task_struct *tsk)
 static void mark_oom_victim(struct task_struct *tsk)
 {
 	struct mm_struct *mm = tsk->mm;
+	struct task_struct *t;
 
 	WARN_ON(oom_killer_disabled);
-	/* OOM killer might race with memcg OOM */
-	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
-		return;
 
 	/* oom_mm is bound to the signal struct life time. */
-	if (!cmpxchg(&tsk->signal->oom_mm, NULL, mm))
+	if (!cmpxchg(&tsk->signal->oom_mm, NULL, mm)) {
 		atomic_inc(&tsk->signal->oom_mm->mm_count);
 
+		/* Only count thread groups */
+		atomic_inc(&oom_victims);
+	}
+
 	/*
-	 * Make sure that the task is woken up from uninterruptible sleep
-	 * if it is frozen because OOM killer wouldn't be able to free
-	 * any memory and livelock. freezing_slow_path will tell the freezer
-	 * that TIF_MEMDIE tasks should be ignored.
+	 * Make sure that the whole thread group is woken up from
+	 * uninterruptible sleep if it is frozen because the oom victim
+	 * will free its memory completely only after exit.
+	 * freezing_slow_path will tell the freezer that oom victims
+	 * should be ignored.
 	 */
-	__thaw_task(tsk);
-	atomic_inc(&oom_victims);
+	rcu_read_lock();
+	for_each_thread(tsk, t)
+		__thaw_task(t);
+	rcu_read_unlock();
 }
 
 /**
  * exit_oom_victim - note the exit of an OOM victim
+ *
+ * Has to be called only once per thread group.
  */
 void exit_oom_victim(void)
 {
-	clear_thread_flag(TIF_MEMDIE);
-
 	if (!atomic_dec_return(&oom_victims))
 		wake_up_all(&oom_victims_wait);
 }
-- 
2.8.1


* [RFC 4/4] arch: get rid of TIF_MEMDIE
From: Michal Hocko @ 2016-09-01  9:51 UTC
  To: linux-mm
  Cc: David Rientjes, Tetsuo Handa, Johannes Weiner, Andrew Morton,
	LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Nobody relies on the flag so make it go away.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 arch/alpha/include/asm/thread_info.h      | 1 -
 arch/arc/include/asm/thread_info.h        | 2 --
 arch/arm/include/asm/thread_info.h        | 1 -
 arch/arm64/include/asm/thread_info.h      | 1 -
 arch/avr32/include/asm/thread_info.h      | 2 --
 arch/blackfin/include/asm/thread_info.h   | 1 -
 arch/c6x/include/asm/thread_info.h        | 1 -
 arch/cris/include/asm/thread_info.h       | 1 -
 arch/frv/include/asm/thread_info.h        | 1 -
 arch/h8300/include/asm/thread_info.h      | 1 -
 arch/hexagon/include/asm/thread_info.h    | 1 -
 arch/ia64/include/asm/thread_info.h       | 1 -
 arch/m32r/include/asm/thread_info.h       | 1 -
 arch/m68k/include/asm/thread_info.h       | 1 -
 arch/metag/include/asm/thread_info.h      | 1 -
 arch/microblaze/include/asm/thread_info.h | 1 -
 arch/mips/include/asm/thread_info.h       | 1 -
 arch/mn10300/include/asm/thread_info.h    | 1 -
 arch/nios2/include/asm/thread_info.h      | 1 -
 arch/openrisc/include/asm/thread_info.h   | 1 -
 arch/parisc/include/asm/thread_info.h     | 1 -
 arch/powerpc/include/asm/thread_info.h    | 1 -
 arch/s390/include/asm/thread_info.h       | 1 -
 arch/score/include/asm/thread_info.h      | 1 -
 arch/sh/include/asm/thread_info.h         | 1 -
 arch/sparc/include/asm/thread_info_32.h   | 1 -
 arch/sparc/include/asm/thread_info_64.h   | 1 -
 arch/tile/include/asm/thread_info.h       | 2 --
 arch/um/include/asm/thread_info.h         | 2 --
 arch/unicore32/include/asm/thread_info.h  | 1 -
 arch/x86/include/asm/thread_info.h        | 1 -
 arch/xtensa/include/asm/thread_info.h     | 1 -
 32 files changed, 36 deletions(-)

diff --git a/arch/alpha/include/asm/thread_info.h b/arch/alpha/include/asm/thread_info.h
index e9e90bfa2b50..8eac8743437e 100644
--- a/arch/alpha/include/asm/thread_info.h
+++ b/arch/alpha/include/asm/thread_info.h
@@ -65,7 +65,6 @@ register struct thread_info *__current_thread_info __asm__("$8");
 #define TIF_NEED_RESCHED	3	/* rescheduling necessary */
 #define TIF_SYSCALL_AUDIT	4	/* syscall audit active */
 #define TIF_DIE_IF_KERNEL	9	/* dik recursion lock */
-#define TIF_MEMDIE		13	/* is terminating due to OOM killer */
 #define TIF_POLLING_NRFLAG	14	/* idle is polling for TIF_NEED_RESCHED */
 
 #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
diff --git a/arch/arc/include/asm/thread_info.h b/arch/arc/include/asm/thread_info.h
index 2d79e527fa50..a3f236006a73 100644
--- a/arch/arc/include/asm/thread_info.h
+++ b/arch/arc/include/asm/thread_info.h
@@ -88,14 +88,12 @@ static inline __attribute_const__ struct thread_info *current_thread_info(void)
 #define TIF_SYSCALL_TRACE	15	/* syscall trace active */
 
 /* true if poll_idle() is polling TIF_NEED_RESCHED */
-#define TIF_MEMDIE		16
 
 #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
 #define _TIF_NOTIFY_RESUME	(1<<TIF_NOTIFY_RESUME)
 #define _TIF_SIGPENDING		(1<<TIF_SIGPENDING)
 #define _TIF_NEED_RESCHED	(1<<TIF_NEED_RESCHED)
 #define _TIF_SYSCALL_AUDIT	(1<<TIF_SYSCALL_AUDIT)
-#define _TIF_MEMDIE		(1<<TIF_MEMDIE)
 
 /* work to do on interrupt/exception return */
 #define _TIF_WORK_MASK		(_TIF_NEED_RESCHED | _TIF_SIGPENDING | \
diff --git a/arch/arm/include/asm/thread_info.h b/arch/arm/include/asm/thread_info.h
index 776757d1604a..6277e56f15fd 100644
--- a/arch/arm/include/asm/thread_info.h
+++ b/arch/arm/include/asm/thread_info.h
@@ -146,7 +146,6 @@ extern int vfp_restore_user_hwstate(struct user_vfp __user *,
 
 #define TIF_NOHZ		12	/* in adaptive nohz mode */
 #define TIF_USING_IWMMXT	17
-#define TIF_MEMDIE		18	/* is terminating due to OOM killer */
 #define TIF_RESTORE_SIGMASK	20
 
 #define _TIF_SIGPENDING		(1 << TIF_SIGPENDING)
diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
index abd64bd1f6d9..d78b3b2945a9 100644
--- a/arch/arm64/include/asm/thread_info.h
+++ b/arch/arm64/include/asm/thread_info.h
@@ -114,7 +114,6 @@ static inline struct thread_info *current_thread_info(void)
 #define TIF_SYSCALL_AUDIT	9
 #define TIF_SYSCALL_TRACEPOINT	10
 #define TIF_SECCOMP		11
-#define TIF_MEMDIE		18	/* is terminating due to OOM killer */
 #define TIF_FREEZE		19
 #define TIF_RESTORE_SIGMASK	20
 #define TIF_SINGLESTEP		21
diff --git a/arch/avr32/include/asm/thread_info.h b/arch/avr32/include/asm/thread_info.h
index d4d3079541ea..680be13234ab 100644
--- a/arch/avr32/include/asm/thread_info.h
+++ b/arch/avr32/include/asm/thread_info.h
@@ -70,7 +70,6 @@ static inline struct thread_info *current_thread_info(void)
 #define TIF_NEED_RESCHED        2       /* rescheduling necessary */
 #define TIF_BREAKPOINT		4	/* enter monitor mode on return */
 #define TIF_SINGLE_STEP		5	/* single step in progress */
-#define TIF_MEMDIE		6	/* is terminating due to OOM killer */
 #define TIF_RESTORE_SIGMASK	7	/* restore signal mask in do_signal */
 #define TIF_CPU_GOING_TO_SLEEP	8	/* CPU is entering sleep 0 mode */
 #define TIF_NOTIFY_RESUME	9	/* callback before returning to user */
@@ -82,7 +81,6 @@ static inline struct thread_info *current_thread_info(void)
 #define _TIF_NEED_RESCHED	(1 << TIF_NEED_RESCHED)
 #define _TIF_BREAKPOINT		(1 << TIF_BREAKPOINT)
 #define _TIF_SINGLE_STEP	(1 << TIF_SINGLE_STEP)
-#define _TIF_MEMDIE		(1 << TIF_MEMDIE)
 #define _TIF_CPU_GOING_TO_SLEEP (1 << TIF_CPU_GOING_TO_SLEEP)
 #define _TIF_NOTIFY_RESUME	(1 << TIF_NOTIFY_RESUME)
 
diff --git a/arch/blackfin/include/asm/thread_info.h b/arch/blackfin/include/asm/thread_info.h
index 2966b93850a1..a45ff075ab6a 100644
--- a/arch/blackfin/include/asm/thread_info.h
+++ b/arch/blackfin/include/asm/thread_info.h
@@ -79,7 +79,6 @@ static inline struct thread_info *current_thread_info(void)
 #define TIF_SYSCALL_TRACE	0	/* syscall trace active */
 #define TIF_SIGPENDING		1	/* signal pending */
 #define TIF_NEED_RESCHED	2	/* rescheduling necessary */
-#define TIF_MEMDIE		4	/* is terminating due to OOM killer */
 #define TIF_RESTORE_SIGMASK	5	/* restore signal mask in do_signal() */
 #define TIF_IRQ_SYNC		7	/* sync pipeline stage */
 #define TIF_NOTIFY_RESUME	8	/* callback before returning to user */
diff --git a/arch/c6x/include/asm/thread_info.h b/arch/c6x/include/asm/thread_info.h
index acc70c135ab8..22ff7b03641d 100644
--- a/arch/c6x/include/asm/thread_info.h
+++ b/arch/c6x/include/asm/thread_info.h
@@ -89,7 +89,6 @@ struct thread_info *current_thread_info(void)
 #define TIF_NEED_RESCHED	3	/* rescheduling necessary */
 #define TIF_RESTORE_SIGMASK	4	/* restore signal mask in do_signal() */
 
-#define TIF_MEMDIE		17	/* OOM killer killed process */
 
 #define TIF_WORK_MASK		0x00007FFE /* work on irq/exception return */
 #define TIF_ALLWORK_MASK	0x00007FFF /* work on any return to u-space */
diff --git a/arch/cris/include/asm/thread_info.h b/arch/cris/include/asm/thread_info.h
index 4ead1b40d2d7..79ebddc22aa3 100644
--- a/arch/cris/include/asm/thread_info.h
+++ b/arch/cris/include/asm/thread_info.h
@@ -70,7 +70,6 @@ struct thread_info {
 #define TIF_SIGPENDING		2	/* signal pending */
 #define TIF_NEED_RESCHED	3	/* rescheduling necessary */
 #define TIF_RESTORE_SIGMASK	9	/* restore signal mask in do_signal() */
-#define TIF_MEMDIE		17	/* is terminating due to OOM killer */
 
 #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
 #define _TIF_NOTIFY_RESUME	(1<<TIF_NOTIFY_RESUME)
diff --git a/arch/frv/include/asm/thread_info.h b/arch/frv/include/asm/thread_info.h
index ccba3b6ce918..993930f59d8e 100644
--- a/arch/frv/include/asm/thread_info.h
+++ b/arch/frv/include/asm/thread_info.h
@@ -86,7 +86,6 @@ register struct thread_info *__current_thread_info asm("gr15");
 #define TIF_NEED_RESCHED	3	/* rescheduling necessary */
 #define TIF_SINGLESTEP		4	/* restore singlestep on return to user mode */
 #define TIF_RESTORE_SIGMASK	5	/* restore signal mask in do_signal() */
-#define TIF_MEMDIE		7	/* is terminating due to OOM killer */
 
 #define _TIF_SYSCALL_TRACE	(1 << TIF_SYSCALL_TRACE)
 #define _TIF_NOTIFY_RESUME	(1 << TIF_NOTIFY_RESUME)
diff --git a/arch/h8300/include/asm/thread_info.h b/arch/h8300/include/asm/thread_info.h
index b408fe660cf8..68c10bce921e 100644
--- a/arch/h8300/include/asm/thread_info.h
+++ b/arch/h8300/include/asm/thread_info.h
@@ -73,7 +73,6 @@ static inline struct thread_info *current_thread_info(void)
 #define TIF_SIGPENDING		1	/* signal pending */
 #define TIF_NEED_RESCHED	2	/* rescheduling necessary */
 #define TIF_SINGLESTEP		3	/* singlestepping active */
-#define TIF_MEMDIE		4	/* is terminating due to OOM killer */
 #define TIF_RESTORE_SIGMASK	5	/* restore signal mask in do_signal() */
 #define TIF_NOTIFY_RESUME	6	/* callback before returning to user */
 #define TIF_SYSCALL_AUDIT	7	/* syscall auditing active */
diff --git a/arch/hexagon/include/asm/thread_info.h b/arch/hexagon/include/asm/thread_info.h
index b80fe1db7b64..e55c7d0a1755 100644
--- a/arch/hexagon/include/asm/thread_info.h
+++ b/arch/hexagon/include/asm/thread_info.h
@@ -112,7 +112,6 @@ register struct thread_info *__current_thread_info asm(QUOTED_THREADINFO_REG);
 #define TIF_SINGLESTEP          4       /* restore ss @ return to usr mode */
 #define TIF_RESTORE_SIGMASK     6       /* restore sig mask in do_signal() */
 /* true if poll_idle() is polling TIF_NEED_RESCHED */
-#define TIF_MEMDIE              17      /* OOM killer killed process */
 
 #define _TIF_SYSCALL_TRACE      (1 << TIF_SYSCALL_TRACE)
 #define _TIF_NOTIFY_RESUME      (1 << TIF_NOTIFY_RESUME)
diff --git a/arch/ia64/include/asm/thread_info.h b/arch/ia64/include/asm/thread_info.h
index 29bd59790d6c..321b23dc1bdd 100644
--- a/arch/ia64/include/asm/thread_info.h
+++ b/arch/ia64/include/asm/thread_info.h
@@ -97,7 +97,6 @@ struct thread_info {
 #define TIF_SYSCALL_AUDIT	3	/* syscall auditing active */
 #define TIF_SINGLESTEP		4	/* restore singlestep on return to user mode */
 #define TIF_NOTIFY_RESUME	6	/* resumption notification requested */
-#define TIF_MEMDIE		17	/* is terminating due to OOM killer */
 #define TIF_MCA_INIT		18	/* this task is processing MCA or INIT */
 #define TIF_DB_DISABLED		19	/* debug trap disabled for fsyscall */
 #define TIF_RESTORE_RSE		21	/* user RBS is newer than kernel RBS */
diff --git a/arch/m32r/include/asm/thread_info.h b/arch/m32r/include/asm/thread_info.h
index f630d9c30b28..bc54a574fad0 100644
--- a/arch/m32r/include/asm/thread_info.h
+++ b/arch/m32r/include/asm/thread_info.h
@@ -102,7 +102,6 @@ static inline unsigned int get_thread_fault_code(void)
 #define TIF_NOTIFY_RESUME	5	/* callback before returning to user */
 #define TIF_RESTORE_SIGMASK	8	/* restore signal mask in do_signal() */
 #define TIF_USEDFPU		16	/* FPU was used by this task this quantum (SMP) */
-#define TIF_MEMDIE		18	/* is terminating due to OOM killer */
 
 #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
 #define _TIF_SIGPENDING		(1<<TIF_SIGPENDING)
diff --git a/arch/m68k/include/asm/thread_info.h b/arch/m68k/include/asm/thread_info.h
index cee13c2e5161..ed497d31ea5d 100644
--- a/arch/m68k/include/asm/thread_info.h
+++ b/arch/m68k/include/asm/thread_info.h
@@ -68,7 +68,6 @@ static inline struct thread_info *current_thread_info(void)
 #define TIF_NEED_RESCHED	7	/* rescheduling necessary */
 #define TIF_DELAYED_TRACE	14	/* single step a syscall */
 #define TIF_SYSCALL_TRACE	15	/* syscall trace active */
-#define TIF_MEMDIE		16	/* is terminating due to OOM killer */
 #define TIF_RESTORE_SIGMASK	18	/* restore signal mask in do_signal */
 
 #endif	/* _ASM_M68K_THREAD_INFO_H */
diff --git a/arch/metag/include/asm/thread_info.h b/arch/metag/include/asm/thread_info.h
index 32677cc278aa..c506e5a61714 100644
--- a/arch/metag/include/asm/thread_info.h
+++ b/arch/metag/include/asm/thread_info.h
@@ -111,7 +111,6 @@ static inline int kstack_end(void *addr)
 #define TIF_SECCOMP		5	/* secure computing */
 #define TIF_RESTORE_SIGMASK	6	/* restore signal mask in do_signal() */
 #define TIF_NOTIFY_RESUME	7	/* callback before returning to user */
-#define TIF_MEMDIE		8	/* is terminating due to OOM killer */
 #define TIF_SYSCALL_TRACEPOINT	9	/* syscall tracepoint instrumentation */
 
 
diff --git a/arch/microblaze/include/asm/thread_info.h b/arch/microblaze/include/asm/thread_info.h
index e7e8954e9815..63cf6e8c086f 100644
--- a/arch/microblaze/include/asm/thread_info.h
+++ b/arch/microblaze/include/asm/thread_info.h
@@ -113,7 +113,6 @@ static inline struct thread_info *current_thread_info(void)
 #define TIF_NEED_RESCHED	3 /* rescheduling necessary */
 /* restore singlestep on return to user mode */
 #define TIF_SINGLESTEP		4
-#define TIF_MEMDIE		6	/* is terminating due to OOM killer */
 #define TIF_SYSCALL_AUDIT	9       /* syscall auditing active */
 #define TIF_SECCOMP		10      /* secure computing */
 
diff --git a/arch/mips/include/asm/thread_info.h b/arch/mips/include/asm/thread_info.h
index e309d8fcb516..3dd906330867 100644
--- a/arch/mips/include/asm/thread_info.h
+++ b/arch/mips/include/asm/thread_info.h
@@ -102,7 +102,6 @@ static inline struct thread_info *current_thread_info(void)
 #define TIF_UPROBE		6	/* breakpointed or singlestepping */
 #define TIF_RESTORE_SIGMASK	9	/* restore signal mask in do_signal() */
 #define TIF_USEDFPU		16	/* FPU was used by this task this quantum (SMP) */
-#define TIF_MEMDIE		18	/* is terminating due to OOM killer */
 #define TIF_NOHZ		19	/* in adaptive nohz mode */
 #define TIF_FIXADE		20	/* Fix address errors in software */
 #define TIF_LOGADE		21	/* Log address errors to syslog */
diff --git a/arch/mn10300/include/asm/thread_info.h b/arch/mn10300/include/asm/thread_info.h
index f5f90bbf019d..d992e6d1b718 100644
--- a/arch/mn10300/include/asm/thread_info.h
+++ b/arch/mn10300/include/asm/thread_info.h
@@ -145,7 +145,6 @@ void arch_release_thread_stack(unsigned long *stack);
 #define TIF_SINGLESTEP		4	/* restore singlestep on return to user mode */
 #define TIF_RESTORE_SIGMASK	5	/* restore signal mask in do_signal() */
 #define TIF_POLLING_NRFLAG	16	/* true if poll_idle() is polling TIF_NEED_RESCHED */
-#define TIF_MEMDIE		17	/* is terminating due to OOM killer */
 
 #define _TIF_SYSCALL_TRACE	+(1 << TIF_SYSCALL_TRACE)
 #define _TIF_NOTIFY_RESUME	+(1 << TIF_NOTIFY_RESUME)
diff --git a/arch/nios2/include/asm/thread_info.h b/arch/nios2/include/asm/thread_info.h
index d69c338bd19c..bf7d38c1c6e2 100644
--- a/arch/nios2/include/asm/thread_info.h
+++ b/arch/nios2/include/asm/thread_info.h
@@ -86,7 +86,6 @@ static inline struct thread_info *current_thread_info(void)
 #define TIF_NOTIFY_RESUME	1	/* resumption notification requested */
 #define TIF_SIGPENDING		2	/* signal pending */
 #define TIF_NEED_RESCHED	3	/* rescheduling necessary */
-#define TIF_MEMDIE		4	/* is terminating due to OOM killer */
 #define TIF_SECCOMP		5	/* secure computing */
 #define TIF_SYSCALL_AUDIT	6	/* syscall auditing active */
 #define TIF_RESTORE_SIGMASK	9	/* restore signal mask in do_signal() */
diff --git a/arch/openrisc/include/asm/thread_info.h b/arch/openrisc/include/asm/thread_info.h
index 6e619a79a401..7678a1b2dc64 100644
--- a/arch/openrisc/include/asm/thread_info.h
+++ b/arch/openrisc/include/asm/thread_info.h
@@ -108,7 +108,6 @@ register struct thread_info *current_thread_info_reg asm("r10");
 #define TIF_RESTORE_SIGMASK     9
 #define TIF_POLLING_NRFLAG	16	/* true if poll_idle() is polling						 * TIF_NEED_RESCHED
 					 */
-#define TIF_MEMDIE              17
 
 #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
 #define _TIF_NOTIFY_RESUME	(1<<TIF_NOTIFY_RESUME)
diff --git a/arch/parisc/include/asm/thread_info.h b/arch/parisc/include/asm/thread_info.h
index 7581330ea35b..05ea8af5865d 100644
--- a/arch/parisc/include/asm/thread_info.h
+++ b/arch/parisc/include/asm/thread_info.h
@@ -48,7 +48,6 @@ struct thread_info {
 #define TIF_NEED_RESCHED	2	/* rescheduling necessary */
 #define TIF_POLLING_NRFLAG	3	/* true if poll_idle() is polling TIF_NEED_RESCHED */
 #define TIF_32BIT               4       /* 32 bit binary */
-#define TIF_MEMDIE		5	/* is terminating due to OOM killer */
 #define TIF_RESTORE_SIGMASK	6	/* restore saved signal mask */
 #define TIF_SYSCALL_AUDIT	7	/* syscall auditing active */
 #define TIF_NOTIFY_RESUME	8	/* callback before returning to user */
diff --git a/arch/powerpc/include/asm/thread_info.h b/arch/powerpc/include/asm/thread_info.h
index cfc35195f95e..315a924af2ca 100644
--- a/arch/powerpc/include/asm/thread_info.h
+++ b/arch/powerpc/include/asm/thread_info.h
@@ -99,7 +99,6 @@ static inline struct thread_info *current_thread_info(void)
 #define TIF_SYSCALL_TRACEPOINT	15	/* syscall tracepoint instrumentation */
 #define TIF_EMULATE_STACK_STORE	16	/* Is an instruction emulation
 						for stack store? */
-#define TIF_MEMDIE		17	/* is terminating due to OOM killer */
 #if defined(CONFIG_PPC64)
 #define TIF_ELF2ABI		18	/* function descriptors must die! */
 #endif
diff --git a/arch/s390/include/asm/thread_info.h b/arch/s390/include/asm/thread_info.h
index f15c0398c363..7261dafde433 100644
--- a/arch/s390/include/asm/thread_info.h
+++ b/arch/s390/include/asm/thread_info.h
@@ -80,7 +80,6 @@ int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src);
 #define TIF_SYSCALL_TRACEPOINT	6	/* syscall tracepoint instrumentation */
 #define TIF_UPROBE		7	/* breakpointed or single-stepping */
 #define TIF_31BIT		16	/* 32bit process */
-#define TIF_MEMDIE		17	/* is terminating due to OOM killer */
 #define TIF_RESTORE_SIGMASK	18	/* restore signal mask in do_signal() */
 #define TIF_SINGLE_STEP		19	/* This task is single stepped */
 #define TIF_BLOCK_STEP		20	/* This task is block stepped */
diff --git a/arch/score/include/asm/thread_info.h b/arch/score/include/asm/thread_info.h
index 7d9ffb15c477..f6e1cc89cef9 100644
--- a/arch/score/include/asm/thread_info.h
+++ b/arch/score/include/asm/thread_info.h
@@ -78,7 +78,6 @@ register struct thread_info *__current_thread_info __asm__("r28");
 #define TIF_NEED_RESCHED	2	/* rescheduling necessary */
 #define TIF_NOTIFY_RESUME	5	/* callback before returning to user */
 #define TIF_RESTORE_SIGMASK	9	/* restore signal mask in do_signal() */
-#define TIF_MEMDIE		18	/* is terminating due to OOM killer */
 
 #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
 #define _TIF_SIGPENDING		(1<<TIF_SIGPENDING)
diff --git a/arch/sh/include/asm/thread_info.h b/arch/sh/include/asm/thread_info.h
index 6c65dcd470ab..36d15c6e36e5 100644
--- a/arch/sh/include/asm/thread_info.h
+++ b/arch/sh/include/asm/thread_info.h
@@ -117,7 +117,6 @@ extern void init_thread_xstate(void);
 #define TIF_NOTIFY_RESUME	7	/* callback before returning to user */
 #define TIF_SYSCALL_TRACEPOINT	8	/* for ftrace syscall instrumentation */
 #define TIF_POLLING_NRFLAG	17	/* true if poll_idle() is polling TIF_NEED_RESCHED */
-#define TIF_MEMDIE		18	/* is terminating due to OOM killer */
 
 #define _TIF_SYSCALL_TRACE	(1 << TIF_SYSCALL_TRACE)
 #define _TIF_SIGPENDING		(1 << TIF_SIGPENDING)
diff --git a/arch/sparc/include/asm/thread_info_32.h b/arch/sparc/include/asm/thread_info_32.h
index 229475f0d7ce..bcf81999db0b 100644
--- a/arch/sparc/include/asm/thread_info_32.h
+++ b/arch/sparc/include/asm/thread_info_32.h
@@ -110,7 +110,6 @@ register struct thread_info *current_thread_info_reg asm("g6");
 					 * this quantum (SMP) */
 #define TIF_POLLING_NRFLAG	9	/* true if poll_idle() is polling
 					 * TIF_NEED_RESCHED */
-#define TIF_MEMDIE		10	/* is terminating due to OOM killer */
 
 /* as above, but as bit values */
 #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
diff --git a/arch/sparc/include/asm/thread_info_64.h b/arch/sparc/include/asm/thread_info_64.h
index 3d7b925f6516..69612d8355f1 100644
--- a/arch/sparc/include/asm/thread_info_64.h
+++ b/arch/sparc/include/asm/thread_info_64.h
@@ -191,7 +191,6 @@ register struct thread_info *current_thread_info_reg asm("g6");
  *       an immediate value in instructions such as andcc.
  */
 /* flag bit 12 is available */
-#define TIF_MEMDIE		13	/* is terminating due to OOM killer */
 #define TIF_POLLING_NRFLAG	14
 
 #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
diff --git a/arch/tile/include/asm/thread_info.h b/arch/tile/include/asm/thread_info.h
index b7659b8f1117..1ecdc1111052 100644
--- a/arch/tile/include/asm/thread_info.h
+++ b/arch/tile/include/asm/thread_info.h
@@ -121,7 +121,6 @@ extern void _cpu_idle(void);
 #define TIF_SYSCALL_TRACE	4	/* syscall trace active */
 #define TIF_SYSCALL_AUDIT	5	/* syscall auditing active */
 #define TIF_SECCOMP		6	/* secure computing */
-#define TIF_MEMDIE		7	/* OOM killer at work */
 #define TIF_NOTIFY_RESUME	8	/* callback before returning to user */
 #define TIF_SYSCALL_TRACEPOINT	9	/* syscall tracepoint instrumentation */
 #define TIF_POLLING_NRFLAG	10	/* idle is polling for TIF_NEED_RESCHED */
@@ -134,7 +133,6 @@ extern void _cpu_idle(void);
 #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
 #define _TIF_SYSCALL_AUDIT	(1<<TIF_SYSCALL_AUDIT)
 #define _TIF_SECCOMP		(1<<TIF_SECCOMP)
-#define _TIF_MEMDIE		(1<<TIF_MEMDIE)
 #define _TIF_NOTIFY_RESUME	(1<<TIF_NOTIFY_RESUME)
 #define _TIF_SYSCALL_TRACEPOINT	(1<<TIF_SYSCALL_TRACEPOINT)
 #define _TIF_POLLING_NRFLAG	(1<<TIF_POLLING_NRFLAG)
diff --git a/arch/um/include/asm/thread_info.h b/arch/um/include/asm/thread_info.h
index 053baff03674..b13047eeaede 100644
--- a/arch/um/include/asm/thread_info.h
+++ b/arch/um/include/asm/thread_info.h
@@ -58,7 +58,6 @@ static inline struct thread_info *current_thread_info(void)
 #define TIF_SIGPENDING		1	/* signal pending */
 #define TIF_NEED_RESCHED	2	/* rescheduling necessary */
 #define TIF_RESTART_BLOCK	4
-#define TIF_MEMDIE		5	/* is terminating due to OOM killer */
 #define TIF_SYSCALL_AUDIT	6
 #define TIF_RESTORE_SIGMASK	7
 #define TIF_NOTIFY_RESUME	8
@@ -67,7 +66,6 @@ static inline struct thread_info *current_thread_info(void)
 #define _TIF_SYSCALL_TRACE	(1 << TIF_SYSCALL_TRACE)
 #define _TIF_SIGPENDING		(1 << TIF_SIGPENDING)
 #define _TIF_NEED_RESCHED	(1 << TIF_NEED_RESCHED)
-#define _TIF_MEMDIE		(1 << TIF_MEMDIE)
 #define _TIF_SYSCALL_AUDIT	(1 << TIF_SYSCALL_AUDIT)
 #define _TIF_SECCOMP		(1 << TIF_SECCOMP)
 
diff --git a/arch/unicore32/include/asm/thread_info.h b/arch/unicore32/include/asm/thread_info.h
index e79ad6d5b5b2..2487cf9dd41e 100644
--- a/arch/unicore32/include/asm/thread_info.h
+++ b/arch/unicore32/include/asm/thread_info.h
@@ -121,7 +121,6 @@ static inline struct thread_info *current_thread_info(void)
 #define TIF_NEED_RESCHED	1
 #define TIF_NOTIFY_RESUME	2	/* callback before returning to user */
 #define TIF_SYSCALL_TRACE	8
-#define TIF_MEMDIE		18
 #define TIF_RESTORE_SIGMASK	20
 
 #define _TIF_SIGPENDING		(1 << TIF_SIGPENDING)
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index b45ffdda3549..a897f177b004 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -97,7 +97,6 @@ struct thread_info {
 #define TIF_IA32		17	/* IA32 compatibility process */
 #define TIF_FORK		18	/* ret_from_fork */
 #define TIF_NOHZ		19	/* in adaptive nohz mode */
-#define TIF_MEMDIE		20	/* is terminating due to OOM killer */
 #define TIF_POLLING_NRFLAG	21	/* idle is polling for TIF_NEED_RESCHED */
 #define TIF_IO_BITMAP		22	/* uses I/O bitmap */
 #define TIF_FORCED_TF		24	/* true if TF in eflags artificially */
diff --git a/arch/xtensa/include/asm/thread_info.h b/arch/xtensa/include/asm/thread_info.h
index 7be2400f745a..791a0a0b5827 100644
--- a/arch/xtensa/include/asm/thread_info.h
+++ b/arch/xtensa/include/asm/thread_info.h
@@ -108,7 +108,6 @@ static inline struct thread_info *current_thread_info(void)
 #define TIF_SIGPENDING		1	/* signal pending */
 #define TIF_NEED_RESCHED	2	/* rescheduling necessary */
 #define TIF_SINGLESTEP		3	/* restore singlestep on return to user mode */
-#define TIF_MEMDIE		5	/* is terminating due to OOM killer */
 #define TIF_RESTORE_SIGMASK	6	/* restore signal mask in do_signal() */
 #define TIF_NOTIFY_RESUME	7	/* callback before returning to user */
 #define TIF_DB_DISABLED		8	/* debug trap disabled for syscall */
-- 
2.8.1


* Re: [RFC 1/4] mm, oom: do not rely on TIF_MEMDIE for memory reserves access
From: Tetsuo Handa @ 2016-09-04  1:49 UTC
  To: mhocko, linux-mm; +Cc: rientjes, hannes, akpm, linux-kernel, mhocko

Michal Hocko wrote:
> @@ -816,7 +816,8 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
>  
>  	/*
>  	 * If the task is already exiting, don't alarm the sysadmin or kill
> -	 * its children or threads, just set TIF_MEMDIE so it can die quickly
> +	 * its children or threads, just give it access to memory reserves
> +	 * so it can die quickly
>  	 */
>  	task_lock(p);
>  	if (task_will_free_mem(p)) {
> @@ -876,9 +877,9 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
>  	mm = victim->mm;
>  	atomic_inc(&mm->mm_count);
>  	/*
> -	 * We should send SIGKILL before setting TIF_MEMDIE in order to prevent
> -	 * the OOM victim from depleting the memory reserves from the user
> -	 * space under its control.
> +	 * We should send SIGKILL before granting access to memory reserves
> +	 * in order to prevent the OOM victim from depleting the memory
> +	 * reserves from the user space under its control.
>  	 */

Removing TIF_MEMDIE usage inside comments can be done as a cleanup
before this series.



> @@ -3309,6 +3318,22 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
>  	return alloc_flags;
>  }
>  
> +static bool oom_reserves_allowed(struct task_struct *tsk)
> +{
> +	if (!tsk_is_oom_victim(tsk))
> +		return false;
> +
> +	/*
> +	 * !MMU doesn't have the oom reaper so we shouldn't risk depleting
> +	 * the memory reserves and shouldn't give access to them past
> +	 * exit_mm
> +	 */
> +	if (!IS_ENABLED(CONFIG_MMU) && !tsk->mm)
> +		return false;
> +
> +	return true;
> +}
> +

Are you aware that you are making !MMU kernel allocations fail without
allowing access to memory reserves not only after returning from
exit_mm() but also inside __mmput() (called via mmput() from exit_mm())?
The comment says only after returning from exit_mm(), but this change
covers more than that.



> @@ -3558,8 +3593,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  		goto nopage;
>  	}
>  
> -	/* Avoid allocations with no watermarks from looping endlessly */
> -	if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOFAIL))
> +	/* Avoid allocations for oom victims from looping endlessly */
> +	if (tsk_is_oom_victim(current) && !(gfp_mask & __GFP_NOFAIL))
>  		goto nopage;

This change increases the possibility of giving up without trying
ALLOC_OOM (more allocation failure messages). Currently only one thread,
which remotely got TIF_MEMDIE when it was between gfp_to_alloc_flags()
and test_thread_flag(TIF_MEMDIE), will give up without trying
ALLOC_NO_WATERMARKS, while all threads which remotely got
current->signal->oom_mm when they were between gfp_to_alloc_flags() and
tsk_is_oom_victim(current) will give up without trying ALLOC_OOM. I think
we should make sure that ALLOC_OOM is tried (by using a variable which
remembers whether get_page_from_freelist(ALLOC_OOM) was tried).
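
Something along these lines, for illustration (a hypothetical sketch,
the variable name is made up):

	bool tried_oom_reserves = false;
	/* ... */
	reserves = gfp_pfmemalloc_allowed(gfp_mask);
	if (reserves) {
		if (tsk_is_oom_victim(current)) {
			alloc_flags = ALLOC_OOM;
			tried_oom_reserves = true;
		} else
			alloc_flags = ALLOC_NO_WATERMARKS;
	}
	/* ... */
	/* give up only after ALLOC_OOM was actually attempted */
	if (tried_oom_reserves && !(gfp_mask & __GFP_NOFAIL))
		goto nopage;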

We currently allow TIF_MEMDIE threads to try ALLOC_NO_WATERMARKS once
and give up without invoking the OOM killer. This change makes
current->signal->oom_mm threads try ALLOC_OOM once and give up without
invoking the OOM killer. This means that allocations by oom victims
which are trying to clean up might fail prematurely, but we don't want
to scatter __GFP_NOFAIL around. Since there are reasonable chances of
parallel memory freeing, we don't need to give up without invoking the
OOM killer again. I think that

-	/* Avoid allocations with no watermarks from looping endlessly */
-	if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOFAIL))
+#ifndef CONFIG_MMU
+	/* Avoid allocations for oom victims from looping endlessly */
+	if (tsk_is_oom_victim(current) && !(gfp_mask & __GFP_NOFAIL))
+		goto nopage;
+#endif

is possible.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC 2/4] mm: replace TIF_MEMDIE checks by tsk_is_oom_victim
  2016-09-01  9:51 ` [RFC 2/4] mm: replace TIF_MEMDIE checks by tsk_is_oom_victim Michal Hocko
@ 2016-09-04  1:49   ` Tetsuo Handa
  2016-09-09 14:05     ` Michal Hocko
  0 siblings, 1 reply; 21+ messages in thread
From: Tetsuo Handa @ 2016-09-04  1:49 UTC (permalink / raw)
  To: mhocko, linux-mm; +Cc: rientjes, hannes, akpm, linux-kernel, mhocko

Michal Hocko wrote:
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 9ee178ba7b71..df58733ca48e 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1899,7 +1899,7 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	 * bypass the last charges so that they can exit quickly and
>  	 * free their memory.
>  	 */
> -	if (unlikely(test_thread_flag(TIF_MEMDIE) ||
> +	if (unlikely(tsk_is_oom_victim(current) ||
>  		     fatal_signal_pending(current) ||
>  		     current->flags & PF_EXITING))
>  		goto force;

Does this test_thread_flag(TIF_MEMDIE) (or tsk_is_oom_victim(current)) make sense?

If the current thread is OOM-killed, SIGKILL must be pending before arriving
at do_exit() and PF_EXITING must be set after arriving at do_exit(). But I
can't find any location which allocates memory between clearing SIGKILL and
setting PF_EXITING.

When can test_thread_flag(TIF_MEMDIE) == T (or tsk_is_oom_victim(current) == T) &&
fatal_signal_pending(current) == F && (current->flags & PF_EXITING) == 0 happen?
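
As far as I can see the window would have to be here (simplified flow;
this is only my reading of the exit path):

	get_signal()                 /* dequeues the pending SIGKILL */
	  -> do_group_exit()
	       -> do_exit()
	            exit_signals()   /* sets PF_EXITING */

and I cannot find a memory allocation between those two points.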



> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index b11977585c7b..e26529edcee3 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -1078,7 +1078,7 @@ void pagefault_out_of_memory(void)
>  		 * be a racing OOM victim for which oom_killer_disable()
>  		 * is waiting for.
>  		 */
> -		WARN_ON(test_thread_flag(TIF_MEMDIE));
> +		WARN_ON(tsk_is_oom_victim(current));
>  	}
>  
>  	mutex_unlock(&oom_lock);

Does this WARN_ON() make sense?

When some user thread has called oom_killer_disable(), there are running
kernel threads but no running user threads except the one which
called oom_killer_disable(). Since oom_killer_disable() waits for
oom_lock, out_of_memory() called from here shall not return false
before oom_killer_disable() sets oom_killer_disabled = true. Thus,
the situations in which out_of_memory() called from here can return
false are limited to

 (a) the thread which triggered pagefaults after returning from
     oom_killer_disable()
 (b) an OOM victim which was thawed and triggered pagefaults from do_exit()
     after the thread which called oom_killer_disable() released oom_lock
 (c) kernel threads which triggered pagefaults after use_mm()

And since kernel threads are not subject to mark_oom_victim(),
test_thread_flag(TIF_MEMDIE) == F (or tsk_is_oom_victim(current) == F)
holds for kernel threads. Thus, the only situation in which out_of_memory()
called from here can return false and we hit this WARN_ON() is (b).

Even if (a) or (b) is possible, does continuously emitting backtraces
help? It seems to me that the system is in an OOM livelock situation and
we need to take a different action than emitting the same backtraces
forever (e.g. try to allocate a page using ALLOC_NO_WATERMARKS so that
the pagefault can be resolved, and panic() if that fails).
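
A rough sketch of that idea (untested; the probe and the panic policy
are nothing more than what is proposed above):

	/* in pagefault_out_of_memory(), instead of warning forever */
	struct page *page = alloc_pages(GFP_KERNEL | __GFP_MEMALLOC, 0);

	if (!page)
		panic("OOM livelock: pagefault cannot make progress");
	/* the probe succeeded, so the pagefault path can make progress */
	__free_page(page);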


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC 3/4] mm, oom: do not rely on TIF_MEMDIE for exit_oom_victim
  2016-09-01  9:51 ` [RFC 3/4] mm, oom: do not rely on TIF_MEMDIE for exit_oom_victim Michal Hocko
@ 2016-09-04  1:50   ` Tetsuo Handa
  2016-09-09 14:08     ` Michal Hocko
  2016-09-14 13:50   ` Michal Hocko
  1 sibling, 1 reply; 21+ messages in thread
From: Tetsuo Handa @ 2016-09-04  1:50 UTC (permalink / raw)
  To: mhocko, linux-mm; +Cc: rientjes, hannes, akpm, linux-kernel, mhocko, oleg, viro

Michal Hocko wrote:
> mark_oom_victim and exit_oom_victim are used for oom_killer_disable
> which should block as long as there are any oom victims alive. Up to now
> we have relied on the TIF_MEMDIE task flag to count how many oom victims
> we have. This is not optimal because only one thread receives this flag
> at a time while the whole process (thread group) is killed and should
> die. As a result we do not thaw the whole thread group and so a
> multi-threaded process can leave some threads behind in the fridge. We
> really want to thaw all the threads.
> 
> This is not all that easy because there is no reliable way to count
> threads in the process as the oom killer might race with copy_process.

What is wrong with racing with copy_process()? Threads doing copy_process()
are not frozen and thus we don't need to thaw such threads. Also, being
OOM-killed implies receiving SIGKILL. Thus, a newly created thread will also
enter do_exit().

> So marking all threads with TIF_MEMDIE and incrementing oom_victims
> accordingly is not safe. Also the TIF_MEMDIE flag should just die so
> we had better come up with a different approach.
> 
> All we need to guarantee is that exit_oom_victim is called (only once
> per process) at a time when no further access to (possibly suspended)
> devices and no other IO (which would clobber the suspended image) is
> possible. It seems we can rely on exit_notify for that because we
> already have to detect the last thread to do a cleanup. Let's propagate
> that information up to do_exit and only call exit_oom_victim for such
> a thread. With this in place we can safely increment oom_victims only
> once per thread group and thaw all the threads from the process.
> freezing_slow_path can now rely on tsk_is_oom_victim as well.

If marking all threads which belong to tsk thread group with TIF_MEMDIE
is not safe (due to possible race with copy_process()), how can

	rcu_read_lock();
	for_each_thread(tsk, t)
		__thaw_task(t);
	rcu_read_unlock();

in mark_oom_victim() guarantee that all threads which belong to tsk
thread group are thawed?

Unless all threads which belong to tsk thread group in __refrigerator()
are guaranteed to be thawed, they might fail to leave __refrigerator()
in order to enter do_exit() which means that exit_oom_victim() won't be
called.

Do we want to thaw OOM victims from the beginning? If the freezer
depends on CONFIG_MMU=y, we don't need to thaw OOM victims.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC 1/4] mm, oom: do not rely on TIF_MEMDIE for memory reserves access
  2016-09-04  1:49   ` Tetsuo Handa
@ 2016-09-09 14:00     ` Michal Hocko
  0 siblings, 0 replies; 21+ messages in thread
From: Michal Hocko @ 2016-09-09 14:00 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, rientjes, hannes, akpm, linux-kernel

On Sun 04-09-16 10:49:42, Tetsuo Handa wrote:
> Michal Hocko wrote:
[...]
> > @@ -3309,6 +3318,22 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
> >  	return alloc_flags;
> >  }
> >  
> > +static bool oom_reserves_allowed(struct task_struct *tsk)
> > +{
> > +	if (!tsk_is_oom_victim(tsk))
> > +		return false;
> > +
> > +	/*
> > +	 * !MMU doesn't have the oom reaper so we shouldn't risk memory
> > +	 * reserves depletion and shouldn't give access to memory reserves
> > +	 * past exit_mm
> > +	 */
> > +	if (!IS_ENABLED(CONFIG_MMU) && !tsk->mm)
> > +		return false;
> > +
> > +	return true;
> > +}
> > +
> 
> Are you aware that you are making !MMU kernel allocations fail without
> access to memory reserves not only after returning from exit_mm() but
> also from __mmput() (called via mmput() from exit_mm())?

Do we allocate from that path on !MMU, and would that be more broken than
the current code, which clears TIF_MEMDIE after mmput even when __mmput is
not called (aka somebody is holding a reference to the mm - e.g. a proc
file)?

> The comment talks only about after returning from exit_mm(), but this
> change goes further.

I can see that the comment is not ideal. Any suggestion how to make it
better?
 
> > @@ -3558,8 +3593,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> >  		goto nopage;
> >  	}
> >  
> > -	/* Avoid allocations with no watermarks from looping endlessly */
> > -	if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOFAIL))
> > +	/* Avoid allocations for oom victims from looping endlessly */
> > +	if (tsk_is_oom_victim(current) && !(gfp_mask & __GFP_NOFAIL))
> >  		goto nopage;
> 
> This change increases the possibility of giving up without trying ALLOC_OOM
> (i.e. more allocation failure messages). Currently only the one thread which
> remotely got TIF_MEMDIE while it was between gfp_to_alloc_flags() and the
> test_thread_flag(TIF_MEMDIE) check will give up without trying
> ALLOC_NO_WATERMARKS, whereas with this change all threads which remotely
> got current->signal->oom_mm while they were in that window will give up
> without trying ALLOC_OOM. I think we should make sure that ALLOC_OOM is
> tried (by using a variable which remembers whether
> get_page_from_freelist(ALLOC_OOM) was tried).

Technically speaking you are right but I am not really sure that this
matters all that much. This code has always been racy. If we ever
consider the race harmful we can reorganize the alloc slow path in a way
that guarantees at least one allocation attempt with ALLOC_OOM; I am just
not sure it is necessary right now. If this ever shows up as a problem
we would see a flood of allocation failures followed by the OOM report
so it would be quite easy to notice.

> We currently allow TIF_MEMDIE threads to try ALLOC_NO_WATERMARKS once
> and then give up without invoking the OOM killer. This change makes
> current->signal->oom_mm threads try ALLOC_OOM once and give up without
> invoking the OOM killer. This means that allocations which oom victims
> need for cleaning up might fail prematurely, but we don't want to
> scatter __GFP_NOFAIL around. Since there are reasonable chances of
> parallel memory freeing, we don't need to give up without invoking the
> OOM killer again. I think that
> 
> -	/* Avoid allocations with no watermarks from looping endlessly */
> -	if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOFAIL))
> +#ifndef CONFIG_MMU
> +	/* Avoid allocations for oom victims from looping endlessly */
> +	if (tsk_is_oom_victim(current) && !(gfp_mask & __GFP_NOFAIL))
> +		goto nopage;
> +#endif
> 
> is possible.

I would prefer to not spread out MMU ifdefs all over the place.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC 2/4] mm: replace TIF_MEMDIE checks by tsk_is_oom_victim
  2016-09-04  1:49   ` Tetsuo Handa
@ 2016-09-09 14:05     ` Michal Hocko
  0 siblings, 0 replies; 21+ messages in thread
From: Michal Hocko @ 2016-09-09 14:05 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, rientjes, hannes, akpm, linux-kernel

On Sun 04-09-16 10:49:52, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 9ee178ba7b71..df58733ca48e 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -1899,7 +1899,7 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
> >  	 * bypass the last charges so that they can exit quickly and
> >  	 * free their memory.
> >  	 */
> > -	if (unlikely(test_thread_flag(TIF_MEMDIE) ||
> > +	if (unlikely(tsk_is_oom_victim(current) ||
> >  		     fatal_signal_pending(current) ||
> >  		     current->flags & PF_EXITING))
> >  		goto force;
> 
> Does this test_thread_flag(TIF_MEMDIE) (or tsk_is_oom_victim(current)) make sense?
> 
> If the current thread is OOM-killed, SIGKILL must be pending before arriving
> at do_exit() and PF_EXITING must be set after arriving at do_exit(). But I
> can't find any location which allocates memory between clearing SIGKILL and
> setting PF_EXITING.
> 
> When can test_thread_flag(TIF_MEMDIE) == T (or tsk_is_oom_victim(current) == T) &&
> fatal_signal_pending(current) == F && (current->flags & PF_EXITING) == 0 happen?

Maybe you are right. I would have to double check. I am still drowning in
a pile of emails after vacation so it will take some time to do that.
Anyway I believe this would be worth a separate patch. This one just
mechanically replaces the TIF_MEMDIE check with tsk_is_oom_victim.

> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index b11977585c7b..e26529edcee3 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -1078,7 +1078,7 @@ void pagefault_out_of_memory(void)
> >  		 * be a racing OOM victim for which oom_killer_disable()
> >  		 * is waiting for.
> >  		 */
> > -		WARN_ON(test_thread_flag(TIF_MEMDIE));
> > +		WARN_ON(tsk_is_oom_victim(current));
> >  	}
> >  
> >  	mutex_unlock(&oom_lock);
> 
> Does this WARN_ON() make sense?
> 
> When some user thread has called oom_killer_disable(), there are running
> kernel threads but no running user threads except the one which
> called oom_killer_disable(). Since oom_killer_disable() waits for
> oom_lock, out_of_memory() called from here shall not return false
> before oom_killer_disable() sets oom_killer_disabled = true. Thus,
> the situations in which out_of_memory() called from here can return
> false are limited to
> 
>  (a) the thread which triggered pagefaults after returning from
>      oom_killer_disable()
>  (b) an OOM victim which was thawed and triggered pagefaults from do_exit()
>      after the thread which called oom_killer_disable() released oom_lock
>  (c) kernel threads which triggered pagefaults after use_mm()
> 
> And since kernel threads are not subject to mark_oom_victim(),
> test_thread_flag(TIF_MEMDIE) == F (or tsk_is_oom_victim(current) == F)
> holds for kernel threads. Thus, the only situation in which out_of_memory()
> called from here can return false and we hit this WARN_ON() is (b).
> 
> Even if (a) or (b) is possible, does continuously emitting backtraces
> help? It seems to me that the system is in an OOM livelock situation and
> we need to take a different action than emitting the same backtraces
> forever (e.g. try to allocate a page using ALLOC_NO_WATERMARKS so that
> the pagefault can be resolved, and panic() if that fails).

I have to think about this some more, so I cannot give it any answer
right now. But again, this is just a mechanical change. If your concern
is correct and the WARN_ON is really pointless it should be similarly
pointless with the TIF_MEMDIE check unless I am missing something.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC 3/4] mm, oom: do not rely on TIF_MEMDIE for exit_oom_victim
  2016-09-04  1:50   ` Tetsuo Handa
@ 2016-09-09 14:08     ` Michal Hocko
  2016-09-10  6:29       ` Tetsuo Handa
  0 siblings, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2016-09-09 14:08 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, rientjes, hannes, akpm, linux-kernel, oleg, viro

On Sun 04-09-16 10:50:02, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > mark_oom_victim and exit_oom_victim are used for oom_killer_disable
> > which should block as long as there are any oom victims alive. Up to now
> > we have relied on the TIF_MEMDIE task flag to count how many oom victims
> > we have. This is not optimal because only one thread receives this flag
> > at a time while the whole process (thread group) is killed and should
> > die. As a result we do not thaw the whole thread group and so a
> > multi-threaded process can leave some threads behind in the fridge. We
> > really want to thaw all the threads.
> > 
> > This is not all that easy because there is no reliable way to count
> > threads in the process as the oom killer might race with copy_process.
> 
> What is wrong with racing with copy_process()? Threads doing copy_process()
> are not frozen and thus we don't need to thaw such threads. Also, being
> OOM-killed implies receiving SIGKILL. Thus, a newly created thread will also
> enter do_exit().

The problem is that we cannot rely on signal->nr_threads to know when
the last thread is passing through exit so that we can declare the
whole group done and wake the waiter on the oom killer lock.
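
Roughly the race I mean (a hypothetical interleaving, not taken from a
real trace):

	exit path                            copy_process()
	---------                            --------------
	read signal->nr_threads (== N)
	wait for N threads to pass exit
	                                     signal->nr_threads++;
	                                     /* a new thread joins the
	                                      * group and exits later */
	declare the group done too early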

> > So marking all threads with TIF_MEMDIE and incrementing oom_victims
> > accordingly is not safe. Also the TIF_MEMDIE flag should just die so
> > we had better come up with a different approach.
> > 
> > All we need to guarantee is that exit_oom_victim is called (only once
> > per process) at a time when no further access to (possibly suspended)
> > devices and no other IO (which would clobber the suspended image) is
> > possible. It seems we can rely on exit_notify for that because we
> > already have to detect the last thread to do a cleanup. Let's propagate
> > that information up to do_exit and only call exit_oom_victim for such
> > a thread. With this in place we can safely increment oom_victims only
> > once per thread group and thaw all the threads from the process.
> > freezing_slow_path can now rely on tsk_is_oom_victim as well.
> 
> If marking all threads which belong to tsk thread group with TIF_MEMDIE
> is not safe (due to possible race with copy_process()), how can
> 
> 	rcu_read_lock();
> 	for_each_thread(tsk, t)
> 		__thaw_task(t);
> 	rcu_read_unlock();
> 
> in mark_oom_victim() guarantee that all threads which belong to tsk
> thread group are thawed?

Because all the frozen threads already have to be hashed, and those which
are in the middle of copy_process will be tsk_is_oom_victim and so the
freezer will skip them.

> Unless all threads which belong to tsk thread group in __refrigerator()
> are guaranteed to be thawed, they might fail to leave __refrigerator()
> in order to enter do_exit() which means that exit_oom_victim() won't be
> called.
> 
> Do we want to thaw OOM victims from the beginning? If the freezer
> depends on CONFIG_MMU=y, we don't need to thaw OOM victims.

We want to thaw them, at least at this stage, because the task might be
sitting on memory which is not reclaimable by the oom reaper (e.g.
different buffers of file descriptors etc.).

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC 3/4] mm, oom: do not rely on TIF_MEMDIE for exit_oom_victim
  2016-09-09 14:08     ` Michal Hocko
@ 2016-09-10  6:29       ` Tetsuo Handa
  2016-09-10 12:55         ` Tetsuo Handa
  0 siblings, 1 reply; 21+ messages in thread
From: Tetsuo Handa @ 2016-09-10  6:29 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm, rientjes, hannes, akpm, linux-kernel, oleg, viro

Michal Hocko wrote:
> On Sun 04-09-16 10:50:02, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > mark_oom_victim and exit_oom_victim are used for oom_killer_disable
> > > which should block as long as there are any oom victims alive. Up to now
> > > we have relied on the TIF_MEMDIE task flag to count how many oom victims
> > > we have. This is not optimal because only one thread receives this flag
> > > at a time while the whole process (thread group) is killed and should
> > > die. As a result we do not thaw the whole thread group and so a
> > > multi-threaded process can leave some threads behind in the fridge. We
> > > really want to thaw all the threads.
> > > 
> > > This is not all that easy because there is no reliable way to count
> > > threads in the process as the oom killer might race with copy_process.
> > 
> > What is wrong with racing with copy_process()? Threads doing copy_process()
> > are not frozen and thus we don't need to thaw such threads. Also, being
> > OOM-killed implies receiving SIGKILL. Thus, a newly created thread will also
> > enter do_exit().
> 
> The problem is that we cannot rely on signal->nr_threads to know when
> the last thread is passing through exit so that we can declare the
> whole group done and wake the waiter on the oom killer lock.

I don't think we need to rely on signal->nr_threads. Why can't we simply
do something like below (like error reporting in try_to_freeze_tasks()) ?

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f284e92..8a1c008 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -419,12 +419,6 @@ static void dump_header(struct oom_control *oc, struct task_struct *p)
 		dump_tasks(oc->memcg, oc->nodemask);
 }
 
-/*
- * Number of OOM victims in flight
- */
-static atomic_t oom_victims = ATOMIC_INIT(0);
-static DECLARE_WAIT_QUEUE_HEAD(oom_victims_wait);
-
 static bool oom_killer_disabled __read_mostly;
 
 #define K(x) ((x) << (PAGE_SHIFT-10))
@@ -666,7 +660,6 @@ static void mark_oom_victim(struct task_struct *tsk)
 	 * that TIF_MEMDIE tasks should be ignored.
 	 */
 	__thaw_task(tsk);
-	atomic_inc(&oom_victims);
 }
 
 /**
@@ -675,9 +668,6 @@ static void mark_oom_victim(struct task_struct *tsk)
 void exit_oom_victim(void)
 {
 	clear_thread_flag(TIF_MEMDIE);
-
-	if (!atomic_dec_return(&oom_victims))
-		wake_up_all(&oom_victims_wait);
 }
 
 /**
@@ -705,8 +695,6 @@ void oom_killer_enable(void)
  */
 bool oom_killer_disable(signed long timeout)
 {
-	signed long ret;
-
 	/*
 	 * Make sure to not race with an ongoing OOM killer. Check that the
 	 * current is not killed (possibly due to sharing the victim's memory).
@@ -716,14 +704,37 @@ bool oom_killer_disable(signed long timeout)
 	oom_killer_disabled = true;
 	mutex_unlock(&oom_lock);
 
-	ret = wait_event_interruptible_timeout(oom_victims_wait,
-			!atomic_read(&oom_victims), timeout);
-	if (ret <= 0) {
-		oom_killer_enable();
-		return false;
-	}
+	/*
+	 * Wait until all thawed threads reach final schedule() in do_exit()
+	 * in order to make sure that OOM victims no longer try to trigger I/O.
+	 *
+	 * Since freezing_slow_path(p) returns false if TIF_MEMDIE is set, we
+	 * need to check TASK_DEAD if TIF_MEMDIE is set.
+	 */
+	while (timeout > 0) {
+		struct task_struct *g, *p;
+		bool busy = false;
 
-	return true;
+		read_lock(&tasklist_lock);
+		for_each_process_thread(g, p) {
+			if (freezer_should_skip(p) || frozen(p) || p == current)
+				continue;
+			if (freezing(p) ||
+			    (p->state != TASK_DEAD &&
+			     test_tsk_thread_flag(p, TIF_MEMDIE))) {
+				busy = true;
+				goto out;
+			}
+		}
+out:
+		read_unlock(&tasklist_lock);
+		if (!busy)
+			return true;
+		schedule_timeout_killable(HZ / 10);
+		timeout -= HZ / 10;
+	}
+	oom_killer_enable();
+	return false;
 }
 
 static inline bool __task_will_free_mem(struct task_struct *task)

> 
> > > So marking all threads with TIF_MEMDIE and incrementing oom_victims
> > > accordingly is not safe. Also the TIF_MEMDIE flag should just die so
> > > we had better come up with a different approach.
> > > 
> > > All we need to guarantee is that exit_oom_victim is called (only once
> > > per process) at a time when no further access to (possibly suspended)
> > > devices and no other IO (which would clobber the suspended image) is
> > > possible. It seems we can rely on exit_notify for that because we
> > > already have to detect the last thread to do a cleanup. Let's propagate
> > > that information up to do_exit and only call exit_oom_victim for such
> > > a thread. With this in place we can safely increment oom_victims only
> > > once per thread group and thaw all the threads from the process.
> > > freezing_slow_path can now rely on tsk_is_oom_victim as well.
> > 
> > If marking all threads which belong to tsk thread group with TIF_MEMDIE
> > is not safe (due to possible race with copy_process()), how can
> > 
> > 	rcu_read_lock();
> > 	for_each_thread(tsk, t)
> > 		__thaw_task(t);
> > 	rcu_read_unlock();
> > 
> > in mark_oom_victim() guarantee that all threads which belong to tsk
> > thread group are thawed?
> 
> Because all the frozen threads already have to be hashed, and those which
> are in the middle of copy_process will be tsk_is_oom_victim and so the
> freezer will skip them.

Is it true for clone(CLONE_VM && !CLONE_SIGHAND) case?
tsk_is_oom_victim() is a per signal check but we don't call mark_oom_victim()
on each thread group sharing the victim's memory.

> > Unless all threads which belong to tsk thread group in __refrigerator()
> > are guaranteed to be thawed, they might fail to leave __refrigerator()
> > in order to enter do_exit() which means that exit_oom_victim() won't be
> > called.
> > 
> > Do we want to thaw OOM victims from the beginning? If the freezer
> > depends on CONFIG_MMU=y, we don't need to thaw OOM victims.
> 
> We want to thaw them, at least at this stage, because the task might be
> sitting on memory which is not reclaimable by the oom reaper (e.g.
> different buffers of file descriptors etc.).

If you worry about tasks which are sitting on memory which is not
reclaimable by the oom reaper, why don't you worry about tasks which
share mm and do not share signal (i.e. clone(CLONE_VM && !CLONE_SIGHAND)
tasks)? Thawing only tasks which share signal is a halfway job.
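
For completeness, such a task is trivial to create from userspace
(a minimal, contrived example):

	#define _GNU_SOURCE
	#include <sched.h>
	#include <stdlib.h>
	#include <unistd.h>

	static int worker(void *arg)
	{
		pause();	/* keeps using the shared mm */
		return 0;
	}

	int main(void)
	{
		char *stack = malloc(1024 * 1024);

		/*
		 * CLONE_VM without CLONE_SIGHAND (and therefore without
		 * CLONE_THREAD) shares the mm but not the signal_struct,
		 * so tsk_is_oom_victim() stays false for the child even
		 * when the task owning the same mm is selected.
		 */
		clone(worker, stack + 1024 * 1024, CLONE_VM, NULL);
		pause();
		return 0;
	}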


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [RFC 3/4] mm, oom: do not rely on TIF_MEMDIE for exit_oom_victim
  2016-09-10  6:29       ` Tetsuo Handa
@ 2016-09-10 12:55         ` Tetsuo Handa
  2016-09-12  9:11           ` Michal Hocko
  0 siblings, 1 reply; 21+ messages in thread
From: Tetsuo Handa @ 2016-09-10 12:55 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm, rientjes, hannes, akpm, linux-kernel, oleg, viro

Tetsuo Handa wrote:
> > > Do we want to thaw OOM victims from the beginning? If the freezer
> > > depends on CONFIG_MMU=y, we don't need to thaw OOM victims.
> > 
> > We want to thaw them, at least at this stage, because the task might be
> > sitting on memory which is not reclaimable by the oom reaper (e.g.
> > different buffers of file descriptors etc.).

I haven't heard an answer to the question whether the freezer depends on
CONFIG_MMU=y. But I assume the answer is yes here.

> 
> If you worry about tasks which are sitting on memory which is not
> reclaimable by the oom reaper, why don't you worry about tasks which
> share mm and do not share signal (i.e. clone(CLONE_VM && !CLONE_SIGHAND)
> tasks)? Thawing only tasks which share signal is a halfway job.
> 

Here is a different approach which does not thaw tasks at mark_oom_victim()
time but thaws them at oom_killer_disable() time. I think that we don't
need to distinguish OOM victims from killed/exiting tasks when we disable
the OOM killer, because trying to reclaim as much memory as possible is
preferable for reducing the possibility of memory allocation failures
after the OOM killer is disabled.

Compared to your approach

>  include/linux/sched.h |  2 +-
>  kernel/exit.c         | 38 ++++++++++++++++++++++++++++----------
>  kernel/freezer.c      |  3 ++-
>  mm/oom_kill.c         | 29 +++++++++++++++++------------
>  4 files changed, 48 insertions(+), 24 deletions(-)

, my approach does not touch exit logic.

 include/linux/sched.h |    4 ++
 kernel/freezer.c      |    2 -
 mm/oom_kill.c         |   75 ++++++++++++++++++++++++++++++++------------------
 3 files changed, 54 insertions(+), 27 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index af39baf..4c8278f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2308,6 +2308,7 @@ static inline void memalloc_noio_restore(unsigned int flags)
 #define PFA_SPREAD_PAGE  1      /* Spread page cache over cpuset */
 #define PFA_SPREAD_SLAB  2      /* Spread some slab caches over cpuset */
 #define PFA_LMK_WAITING  3      /* Lowmemorykiller is waiting */
+#define PFA_THAW_WAITING 4      /* A thawed thread waiting for termination */
 
 
 #define TASK_PFA_TEST(name, func)					\
@@ -2334,6 +2335,9 @@ TASK_PFA_CLEAR(SPREAD_SLAB, spread_slab)
 TASK_PFA_TEST(LMK_WAITING, lmk_waiting)
 TASK_PFA_SET(LMK_WAITING, lmk_waiting)
 
+TASK_PFA_TEST(THAW_WAITING, thaw_waiting)
+TASK_PFA_SET(THAW_WAITING, thaw_waiting)
+
 /*
  * task->jobctl flags
  */
diff --git a/kernel/freezer.c b/kernel/freezer.c
index 6f56a9e..5a80d4d 100644
--- a/kernel/freezer.c
+++ b/kernel/freezer.c
@@ -42,7 +42,7 @@ bool freezing_slow_path(struct task_struct *p)
 	if (p->flags & (PF_NOFREEZE | PF_SUSPEND_TASK))
 		return false;
 
-	if (test_tsk_thread_flag(p, TIF_MEMDIE))
+	if (task_thaw_waiting(p))
 		return false;
 
 	if (pm_nosig_freezing || cgroup_freezing(p))
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f284e92..599e256 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -419,12 +419,6 @@ static void dump_header(struct oom_control *oc, struct task_struct *p)
 		dump_tasks(oc->memcg, oc->nodemask);
 }
 
-/*
- * Number of OOM victims in flight
- */
-static atomic_t oom_victims = ATOMIC_INIT(0);
-static DECLARE_WAIT_QUEUE_HEAD(oom_victims_wait);
-
 static bool oom_killer_disabled __read_mostly;
 
 #define K(x) ((x) << (PAGE_SHIFT-10))
@@ -658,15 +652,6 @@ static void mark_oom_victim(struct task_struct *tsk)
 	/* oom_mm is bound to the signal struct life time. */
 	if (!cmpxchg(&tsk->signal->oom_mm, NULL, mm))
 		atomic_inc(&tsk->signal->oom_mm->mm_count);
-
-	/*
-	 * Make sure that the task is woken up from uninterruptible sleep
-	 * if it is frozen because OOM killer wouldn't be able to free
-	 * any memory and livelock. freezing_slow_path will tell the freezer
-	 * that TIF_MEMDIE tasks should be ignored.
-	 */
-	__thaw_task(tsk);
-	atomic_inc(&oom_victims);
 }
 
 /**
@@ -675,9 +660,6 @@ static void mark_oom_victim(struct task_struct *tsk)
 void exit_oom_victim(void)
 {
 	clear_thread_flag(TIF_MEMDIE);
-
-	if (!atomic_dec_return(&oom_victims))
-		wake_up_all(&oom_victims_wait);
 }
 
 /**
@@ -705,7 +687,9 @@ void oom_killer_enable(void)
  */
 bool oom_killer_disable(signed long timeout)
 {
-	signed long ret;
+	struct task_struct *p;
+	struct task_struct *t;
+	bool busy = false;
 
 	/*
 	 * Make sure to not race with an ongoing OOM killer. Check that the
@@ -716,14 +700,53 @@ bool oom_killer_disable(signed long timeout)
 	oom_killer_disabled = true;
 	mutex_unlock(&oom_lock);
 
-	ret = wait_event_interruptible_timeout(oom_victims_wait,
-			!atomic_read(&oom_victims), timeout);
-	if (ret <= 0) {
-		oom_killer_enable();
-		return false;
+	/*
+	 * Thaw all killed/exiting threads and wait for them to reach final
+	 * schedule() in do_exit() in order to reclaim as much memory as
+	 * possible and make sure that OOM victims no longer try to trigger
+	 * I/O.
+	 */
+	rcu_read_lock();
+	for_each_process_thread(p, t) {
+		if (frozen(t) &&
+		    (fatal_signal_pending(t) || (t->flags & PF_EXITING))) {
+			task_set_thaw_waiting(t);
+			/*
+			 * Thaw the task because it is frozen.
+			 * freezing_slow_path() will tell the freezer that
+			 * PFA_THAW_WAITING tasks should leave
+			 * __refrigerator().
+			 */
+			__thaw_task(t);
+			busy = true;
+		}
 	}
-
-	return true;
+	rcu_read_unlock();
+	if (likely(!busy))
+		return true;
+	timeout += jiffies;
+	while (time_before(jiffies, (unsigned long) timeout)) {
+		busy = false;
+		rcu_read_lock();
+		for_each_process_thread(p, t) {
+			if (task_thaw_waiting(t) && t->state != TASK_DEAD) {
+				busy = true;
+				goto out;
+			}
+		}
+out:
+		rcu_read_unlock();
+		if (!busy)
+			return true;
+		schedule_timeout_killable(HZ / 10);
+	}
+	/*
+	 * We thawed at least one killed/exiting thread but failed to wait for
+	 * them to reach final schedule() in do_exit(). Abort operation because
+	 * they might try to trigger I/O.
+	 */
+	oom_killer_enable();
+	return false;
 }
 
 static inline bool __task_will_free_mem(struct task_struct *task)


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [RFC 3/4] mm, oom: do not rely on TIF_MEMDIE for exit_oom_victim
  2016-09-10 12:55         ` Tetsuo Handa
@ 2016-09-12  9:11           ` Michal Hocko
  2016-09-13  6:25             ` Tetsuo Handa
  0 siblings, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2016-09-12  9:11 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, rientjes, hannes, akpm, linux-kernel, oleg, viro

On Sat 10-09-16 21:55:49, Tetsuo Handa wrote:
> Tetsuo Handa wrote:
> > > > Do we want to thaw OOM victims from the beginning? If the freezer
> > > > depends on CONFIG_MMU=y, we don't need to thaw OOM victims.
> > > 
> > > We want to thaw them, at least at this stage, because the task might be
> > > sitting on memory which is not reclaimable by the oom reaper (e.g.
> > > different buffers of file descriptors etc.).
> 
> I haven't heard an answer to the question whether the freezer depends on
> CONFIG_MMU=y. But I assume the answer is yes here.

I do not think it depends on CONFIG_MMU. At least CGROUP_FREEZER depends
on CONFIG_CGROUPS and that doesn't seem to have any direct dependency on
MMU.

> > 
> > If you worry about tasks which are sitting on memory which is not
> > reclaimable by the oom reaper, why don't you worry about tasks which
> > share mm and do not share signal (i.e. clone(CLONE_VM && !CLONE_SIGHAND)
> > tasks)? Thawing only tasks which share signal is a halfway job.
> > 
> 
> Here is a different approach which does not thaw tasks at mark_oom_victim()
> time but thaws them at oom_killer_disable() time. I think that we don't
> need to distinguish OOM victims from killed/exiting tasks when we disable
> the OOM killer, because trying to reclaim as much memory as possible is
> preferable for reducing the possibility of memory allocation failures
> after the OOM killer is disabled.

This makes the oom_killer_disable suspend specific which is imho not
necessary. While we do not have any other user outside of the suspend
path right now, and I hope we will not need any in the foreseeable
future, there is no real reason to do a hack like this if we can make
the implementation suspend independent.

> Compared to your approach
> 
> >  include/linux/sched.h |  2 +-
> >  kernel/exit.c         | 38 ++++++++++++++++++++++++++++----------
> >  kernel/freezer.c      |  3 ++-
> >  mm/oom_kill.c         | 29 +++++++++++++++++------------
> >  4 files changed, 48 insertions(+), 24 deletions(-)
> 
> , my approach does not touch exit logic.

I consider the exit path changes so minuscule that trading them for pm
specific code in the oom path sounds like a worse solution. Well, all that
assuming that the actual change is correct, of course.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC 3/4] mm, oom: do not rely on TIF_MEMDIE for exit_oom_victim
  2016-09-12  9:11           ` Michal Hocko
@ 2016-09-13  6:25             ` Tetsuo Handa
  2016-09-13  7:21               ` Michal Hocko
  0 siblings, 1 reply; 21+ messages in thread
From: Tetsuo Handa @ 2016-09-13  6:25 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm, rientjes, hannes, akpm, linux-kernel, oleg, viro

Michal Hocko wrote:
> On Sat 10-09-16 21:55:49, Tetsuo Handa wrote:
> > Tetsuo Handa wrote:
> > > If you worry about tasks which are sitting on memory which is not
> > > reclaimable by the oom reaper, why don't you worry about tasks which
> > > share mm and do not share signal (i.e. clone(CLONE_VM && !CLONE_SIGHAND)
> > > tasks)? Thawing only tasks which share signal is a halfway job.
> > > 
> > 
> > Here is a different approach which does not thaw tasks at mark_oom_victim()
> > time but thaws them at oom_killer_disable() time. I think that we don't
> > need to distinguish OOM victims from killed/exiting tasks when we disable
> > the OOM killer, because trying to reclaim as much memory as possible is
> > preferable for reducing the possibility of memory allocation failures
> > after the OOM killer is disabled.
> 
> This makes the oom_killer_disable suspend specific which is imho not
> necessary. While we do not have any other user outside of the suspend
> path right now, and I hope we will not need any in the foreseeable
> future, there is no real reason to do a hack like this if we can make
> the implementation suspend independent.

My intention is to somehow get rid of oom_killer_disable(). While I wrote
this approach, I again came to wonder why we need to disable the OOM killer
during suspend.

If the reason is that the OOM killer thaws already frozen OOM victims,
we won't have a reason to disable the OOM killer if the OOM killer does
not thaw OOM victims. We can rely on the OOM killer/reaper immediately
before we start taking a memory snapshot for suspend.

If the reason is that the OOM killer changes the SIGKILL pending state of
already frozen OOM victims while taking a memory snapshot, I think that
sending SIGKILL via not only SysRq-f but also SysRq-i will be problematic.

If the reason is that the OOM reaper changes the content of OOM victims'
mm_struct while taking a memory snapshot, what guarantees that the OOM
reaper does not call __oom_reap_task_mm()? We are not waiting for
oom_reaper_list to become NULL at oom_killer_disable(), because the patch
"oom, suspend: fix oom_killer_disable vs. pm suspend properly" removed
set_freezable() from oom_reaper(), which made oom_reaper() no longer
enter __refrigerator() at wait_event_freezable() in oom_reaper().


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC 3/4] mm, oom: do not rely on TIF_MEMDIE for exit_oom_victim
  2016-09-13  6:25             ` Tetsuo Handa
@ 2016-09-13  7:21               ` Michal Hocko
  0 siblings, 0 replies; 21+ messages in thread
From: Michal Hocko @ 2016-09-13  7:21 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, rientjes, hannes, akpm, linux-kernel, oleg, viro

On Tue 13-09-16 15:25:51, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Sat 10-09-16 21:55:49, Tetsuo Handa wrote:
> > > Tetsuo Handa wrote:
> > > > If you worry about tasks which are sitting on memory which is not
> > > > reclaimable by the oom reaper, why don't you worry about tasks which
> > > > share mm and do not share signal (i.e. clone(CLONE_VM && !CLONE_SIGHAND)
> > > > tasks)? Thawing only tasks which share signal is a halfway job.
> > > > 
> > > 
> > > Here is a different approach which does not thaw tasks at mark_oom_victim()
> > > time but thaws them at oom_killer_disable() time. I think that we don't
> > > need to distinguish OOM victims from killed/exiting tasks when we disable
> > > the OOM killer, because trying to reclaim as much memory as possible is
> > > preferable for reducing the possibility of memory allocation failures
> > > after the OOM killer is disabled.
> > 
> > This makes the oom_killer_disable suspend specific which is imho not
> > necessary. While we do not have any other user outside of the suspend
> > path right now, and I hope we will not need any in the foreseeable
> > future, there is no real reason to do a hack like this if we can make
> > the implementation suspend independent.
> 
> My intention is to somehow get rid of oom_killer_disable(). While I wrote
> this approach, I again came to wonder why we need to disable the OOM killer
> during suspend.
> 
> If the reason is that the OOM killer thaws already frozen OOM victims,
> we won't have a reason to disable the OOM killer if the OOM killer does
> not thaw OOM victims. We can rely on the OOM killer/reaper immediately
> before we start taking a memory snapshot for suspend.

Yes, if we don't have to wake already frozen tasks then life would
be easier. But as I've already mentioned, the async oom reaping doesn't
cover all we need, and tasks can also be frozen from userspace, which
means that this is under user control.
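
E.g. with the cgroup freezer the whole thing is a single write from
userspace (the cgroup path below is made up):

	#include <stdio.h>

	int main(void)
	{
		/* freezes every task in the group from userspace */
		FILE *f = fopen("/sys/fs/cgroup/freezer/test/freezer.state", "w");

		if (!f)
			return 1;
		fputs("FROZEN", f);
		fclose(f);
		return 0;
	}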

> If the reason is that the OOM killer changes the SIGKILL pending state of
> already frozen OOM victims while taking a memory snapshot, I think that
> sending SIGKILL via not only SysRq-f but also SysRq-i will be problematic.

Sysrq+i will not be a problem because that will not thaw any frozen
tasks.

> If the reason is that the OOM reaper changes the content of OOM victims'
> mm_struct while taking a memory snapshot,

I do not think this is a problem. But I have to think about this some
more. My thinking is that even if we saved the original content before
reaping it, all that matters is that the victim just goes away so it
cannot observe the corruption.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC 3/4] mm, oom: do not rely on TIF_MEMDIE for exit_oom_victim
  2016-09-01  9:51 ` [RFC 3/4] mm, oom: do not rely on TIF_MEMDIE for exit_oom_victim Michal Hocko
  2016-09-04  1:50   ` Tetsuo Handa
@ 2016-09-14 13:50   ` Michal Hocko
  1 sibling, 0 replies; 21+ messages in thread
From: Michal Hocko @ 2016-09-14 13:50 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: David Rientjes, Tetsuo Handa, Johannes Weiner, Andrew Morton,
	LKML, linux-mm, Al Viro

Hi Oleg,
does this sound sane/correct to you?

On Thu 01-09-16 11:51:03, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> mark_oom_victim and exit_oom_victim are used for oom_killer_disable
> which should block as long as there are any oom victims alive. Up to now
> we have relied on the TIF_MEMDIE task flag to count how many oom victims
> we have. This is not optimal because only one thread receives this flag
> at a time while the whole process (thread group) is killed and should
> die. As a result we do not thaw the whole thread group and so a
> multi-threaded process can leave some threads behind in the fridge. We
> really want to thaw all the threads.
> 
> This is not all that easy because there is no reliable way to count
> threads in the process as the oom killer might race with copy_process.
> So marking all threads with TIF_MEMDIE and incrementing oom_victims
> accordingly is not safe. Also the TIF_MEMDIE flag should just die so
> we had better come up with a different approach.
> 
> All we need to guarantee is that exit_oom_victim is called (only once
> per process) at a time when no further access to (possibly suspended)
> devices and no other IO (which would clobber the suspended image) is
> possible. It seems we can rely on exit_notify for that because we
> already have to detect the last thread to do a cleanup. Let's propagate
> that information up to do_exit and only call exit_oom_victim for such
> a thread. With this in place we can safely increment oom_victims only
> once per thread group and thaw all the threads from the process.
> freezing_slow_path can now rely on tsk_is_oom_victim as well.
> 
> exit_io_context is currently called after exit_notify but it seems it is
> safe to call it right before exit_notify because that point is already
> past exit_files.
> 
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  include/linux/sched.h |  2 +-
>  kernel/exit.c         | 38 ++++++++++++++++++++++++++++----------
>  kernel/freezer.c      |  3 ++-
>  mm/oom_kill.c         | 29 +++++++++++++++++------------
>  4 files changed, 48 insertions(+), 24 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 770d01e7a68e..605e40b47992 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2660,7 +2660,7 @@ static inline void kernel_signal_stop(void)
>  	schedule();
>  }
>  
> -extern void release_task(struct task_struct * p);
> +extern bool release_task(struct task_struct * p);
>  extern int send_sig_info(int, struct siginfo *, struct task_struct *);
>  extern int force_sigsegv(int, struct task_struct *);
>  extern int force_sig_info(int, struct siginfo *, struct task_struct *);
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 914088e8c2ac..c762416dbed1 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -165,10 +165,11 @@ static void delayed_put_task_struct(struct rcu_head *rhp)
>  }
>  
>  
> -void release_task(struct task_struct *p)
> +bool release_task(struct task_struct *p)
>  {
>  	struct task_struct *leader;
>  	int zap_leader;
> +	bool last = false;
>  repeat:
>  	/* don't need to get the RCU readlock here - the process is dead and
>  	 * can't be modifying its own credentials. But shut RCU-lockdep up */
> @@ -197,8 +198,10 @@ void release_task(struct task_struct *p)
>  		 * then we are the one who should release the leader.
>  		 */
>  		zap_leader = do_notify_parent(leader, leader->exit_signal);
> -		if (zap_leader)
> +		if (zap_leader) {
>  			leader->exit_state = EXIT_DEAD;
> +			last = true;
> +		}
>  	}
>  
>  	write_unlock_irq(&tasklist_lock);
> @@ -208,6 +211,8 @@ void release_task(struct task_struct *p)
>  	p = leader;
>  	if (unlikely(zap_leader))
>  		goto repeat;
> +
> +	return last;
>  }
>  
>  /*
> @@ -434,8 +439,6 @@ static void exit_mm(struct task_struct *tsk)
>  	task_unlock(tsk);
>  	mm_update_next_owner(mm);
>  	mmput(mm);
> -	if (test_thread_flag(TIF_MEMDIE))
> -		exit_oom_victim();
>  }
>  
>  static struct task_struct *find_alive_thread(struct task_struct *p)
> @@ -584,12 +587,15 @@ static void forget_original_parent(struct task_struct *father,
>  /*
>   * Send signals to all our closest relatives so that they know
>   * to properly mourn us..
> + *
> + * Returns true if this is the last thread from the thread group
>   */
> -static void exit_notify(struct task_struct *tsk, int group_dead)
> +static bool exit_notify(struct task_struct *tsk, int group_dead)
>  {
>  	bool autoreap;
>  	struct task_struct *p, *n;
>  	LIST_HEAD(dead);
> +	bool last = false;
>  
>  	write_lock_irq(&tasklist_lock);
>  	forget_original_parent(tsk, &dead);
> @@ -606,6 +612,7 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
>  	} else if (thread_group_leader(tsk)) {
>  		autoreap = thread_group_empty(tsk) &&
>  			do_notify_parent(tsk, tsk->exit_signal);
> +		last = thread_group_empty(tsk);
>  	} else {
>  		autoreap = true;
>  	}
> @@ -621,8 +628,11 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
>  
>  	list_for_each_entry_safe(p, n, &dead, ptrace_entry) {
>  		list_del_init(&p->ptrace_entry);
> -		release_task(p);
> +		if (release_task(p) && p == tsk)
> +			last = true;
>  	}
> +
> +	return last;
>  }
>  
>  #ifdef CONFIG_DEBUG_STACK_USAGE
> @@ -766,7 +776,18 @@ void do_exit(long code)
>  	TASKS_RCU(preempt_disable());
>  	TASKS_RCU(tasks_rcu_i = __srcu_read_lock(&tasks_rcu_exit_srcu));
>  	TASKS_RCU(preempt_enable());
> -	exit_notify(tsk, group_dead);
> +
> +	if (tsk->io_context)
> +		exit_io_context(tsk);
> +
> +	/*
> +	 * Notify oom_killer_disable that the last oom thread is exiting.
> +	 * We might have more threads running at this point but none of them
> +	 * will access any devices beyond this point.
> +	 */
> +	if (exit_notify(tsk, group_dead) && tsk_is_oom_victim(current))
> +		exit_oom_victim();
> +
>  	proc_exit_connector(tsk);
>  	mpol_put_task_policy(tsk);
>  #ifdef CONFIG_FUTEX
> @@ -784,9 +805,6 @@ void do_exit(long code)
>  	 */
>  	tsk->flags |= PF_EXITPIDONE;
>  
> -	if (tsk->io_context)
> -		exit_io_context(tsk);
> -
>  	if (tsk->splice_pipe)
>  		free_pipe_info(tsk->splice_pipe);
>  
> diff --git a/kernel/freezer.c b/kernel/freezer.c
> index 6f56a9e219fa..c6a64474a408 100644
> --- a/kernel/freezer.c
> +++ b/kernel/freezer.c
> @@ -10,6 +10,7 @@
>  #include <linux/syscalls.h>
>  #include <linux/freezer.h>
>  #include <linux/kthread.h>
> +#include <linux/oom.h>
>  
>  /* total number of freezing conditions in effect */
>  atomic_t system_freezing_cnt = ATOMIC_INIT(0);
> @@ -42,7 +43,7 @@ bool freezing_slow_path(struct task_struct *p)
>  	if (p->flags & (PF_NOFREEZE | PF_SUSPEND_TASK))
>  		return false;
>  
> -	if (test_tsk_thread_flag(p, TIF_MEMDIE))
> +	if (tsk_is_oom_victim(p))
>  		return false;
>  
>  	if (pm_nosig_freezing || cgroup_freezing(p))
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index e26529edcee3..5dec6321ac7b 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -649,33 +649,38 @@ static inline void wake_oom_reaper(struct task_struct *tsk)
>  static void mark_oom_victim(struct task_struct *tsk)
>  {
>  	struct mm_struct *mm = tsk->mm;
> +	struct task_struct *t;
>  
>  	WARN_ON(oom_killer_disabled);
> -	/* OOM killer might race with memcg OOM */
> -	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
> -		return;
>  
>  	/* oom_mm is bound to the signal struct life time. */
> -	if (!cmpxchg(&tsk->signal->oom_mm, NULL, mm))
> +	if (!cmpxchg(&tsk->signal->oom_mm, NULL, mm)) {
>  		atomic_inc(&tsk->signal->oom_mm->mm_count);
>  
> +		/* Only count thread groups */
> +		atomic_inc(&oom_victims);
> +	}
> +
>  	/*
> -	 * Make sure that the task is woken up from uninterruptible sleep
> -	 * if it is frozen because OOM killer wouldn't be able to free
> -	 * any memory and livelock. freezing_slow_path will tell the freezer
> -	 * that TIF_MEMDIE tasks should be ignored.
> +	 * Make sure that the whole thread group is woken up from
> +	 * uninterruptible sleep if it is frozen because the oom victim
> +	 * will free its memory completely only after exit.
> +	 * freezing_slow_path will tell the freezer that oom victims
> +	 * should be ignored.
>  	 */
> -	__thaw_task(tsk);
> -	atomic_inc(&oom_victims);
> +	rcu_read_lock();
> +	for_each_thread(tsk, t)
> +		__thaw_task(t);
> +	rcu_read_unlock();
>  }
>  
>  /**
>   * exit_oom_victim - note the exit of an OOM victim
> + *
> + * Has to be called only once per thread group.
>   */
>  void exit_oom_victim(void)
>  {
> -	clear_thread_flag(TIF_MEMDIE);
> -
>  	if (!atomic_dec_return(&oom_victims))
>  		wake_up_all(&oom_victims_wait);
>  }
> -- 
> 2.8.1
> 

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC 0/4] mm, oom: get rid of TIF_MEMDIE
  2016-09-01  9:51 [RFC 0/4] mm, oom: get rid of TIF_MEMDIE Michal Hocko
                   ` (3 preceding siblings ...)
  2016-09-01  9:51 ` [RFC 4/4] arch: get rid of TIF_MEMDIE Michal Hocko
@ 2016-09-15 14:41 ` Johannes Weiner
  2016-09-16  7:15   ` Michal Hocko
  4 siblings, 1 reply; 21+ messages in thread
From: Johannes Weiner @ 2016-09-15 14:41 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, David Rientjes, Tetsuo Handa, Andrew Morton, LKML,
	Al Viro, Michal Hocko, Oleg Nesterov

Hi Michal,

On Thu, Sep 01, 2016 at 11:51:00AM +0200, Michal Hocko wrote:
> Hi,
> this is an early RFC to see whether the approach I've taken is acceptable.
> The series is on top of the current mmotm tree (2016-08-31-16-06). I didn't
> get to test it so it might be completely broken.
> 
> The primary point of this series is to get rid of TIF_MEMDIE finally.
> Recent changes in the oom proper allows for that finally, I believe. Now
> that all the oom victims are reapable we are no longer depending on
> ALLOC_NO_WATERMARKS because the memory held by the victim is reclaimed
> asynchronously. A partial access to memory reserves should be sufficient
> just to guarantee that the oom victim is not starved due to other
> memory consumers. This also means that we do not have to pretend to be
> conservative and give access to memory reserves only to one thread from
> the process at the time. This is patch 1.
>
> Patch 2 is a simple cleanup which turns TIF_MEMDIE users to tsk_is_oom_victim
> which is process rather than thread centric. None of those callers really
> requires to be thread aware AFAICS.
> 
> The tricky part then is exit_oom_victim vs. oom_killer_disable because
> TIF_MEMDIE acted as a token there so we had a way to count threads from
> the process. It didn't work 100% reliably and had it own issues but we
> have to replace it with something which doesn't rely on counting threads
> but rather find a moment when all threads have reached steady state in
> do_exit. This is what patch 3 does and I would really appreciate if Oleg
> could double check my thinking there. I am also CCing Al on that one
> because I am moving exit_io_context up in do_exit right before exit_notify.

You're explaining the mechanical thing you are doing, but I'm having
trouble understanding why you want to get rid of TIF_MEMDIE. For one,
it's more code. And apparently, it's also more complicated than what
we have right now.

Can you please explain in the cover letter what's broken/undesirable?

Thanks


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC 0/4] mm, oom: get rid of TIF_MEMDIE
  2016-09-15 14:41 ` [RFC 0/4] mm, oom: " Johannes Weiner
@ 2016-09-16  7:15   ` Michal Hocko
  2016-09-19 16:18     ` Johannes Weiner
  0 siblings, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2016-09-16  7:15 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, David Rientjes, Tetsuo Handa, Andrew Morton, LKML,
	Al Viro, Oleg Nesterov

On Thu 15-09-16 10:41:18, Johannes Weiner wrote:
> Hi Michal,
> 
> On Thu, Sep 01, 2016 at 11:51:00AM +0200, Michal Hocko wrote:
> > Hi,
> > this is an early RFC to see whether the approach I've taken is acceptable.
> > The series is on top of the current mmotm tree (2016-08-31-16-06). I didn't
> > get to test it so it might be completely broken.
> > 
> > The primary point of this series is to get rid of TIF_MEMDIE finally.
> > Recent changes in the oom proper allows for that finally, I believe. Now
> > that all the oom victims are reapable we are no longer depending on
> > ALLOC_NO_WATERMARKS because the memory held by the victim is reclaimed
> > asynchronously. A partial access to memory reserves should be sufficient
> > just to guarantee that the oom victim is not starved due to other
> > memory consumers. This also means that we do not have to pretend to be
> > conservative and give access to memory reserves only to one thread from
> > the process at the time. This is patch 1.
> >
> > Patch 2 is a simple cleanup which turns TIF_MEMDIE users to tsk_is_oom_victim
> > which is process rather than thread centric. None of those callers really
> > requires to be thread aware AFAICS.
> > 
> > The tricky part then is exit_oom_victim vs. oom_killer_disable because
> > TIF_MEMDIE acted as a token there so we had a way to count threads from
> > the process. It didn't work 100% reliably and had it own issues but we
> > have to replace it with something which doesn't rely on counting threads
> > but rather find a moment when all threads have reached steady state in
> > do_exit. This is what patch 3 does and I would really appreciate if Oleg
> > could double check my thinking there. I am also CCing Al on that one
> > because I am moving exit_io_context up in do_exit right before exit_notify.
> 
> You're explaining the mechanical thing you are doing, but I'm having
> trouble understanding why you want to get rid of TIF_MEMDIE. For one,
> it's more code. And apparently, it's also more complicated than what
> we have right now.
> 
> Can you please explain in the cover letter what's broken/undesirable?

Sure, I will extend the cover letter when submitting the series again.
This RFC was mostly aimed at correctness so I focused more on the
technical details. Patch 1 should contain some reasoning. Do you find it
sufficient, or should I extend on top of that?

Thanks!

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC 0/4] mm, oom: get rid of TIF_MEMDIE
  2016-09-16  7:15   ` Michal Hocko
@ 2016-09-19 16:18     ` Johannes Weiner
  2016-09-19 19:02       ` Michal Hocko
  0 siblings, 1 reply; 21+ messages in thread
From: Johannes Weiner @ 2016-09-19 16:18 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, David Rientjes, Tetsuo Handa, Andrew Morton, LKML,
	Al Viro, Oleg Nesterov

On Fri, Sep 16, 2016 at 09:15:17AM +0200, Michal Hocko wrote:
> On Thu 15-09-16 10:41:18, Johannes Weiner wrote:
> > Hi Michal,
> > 
> > On Thu, Sep 01, 2016 at 11:51:00AM +0200, Michal Hocko wrote:
> > > Hi,
> > > this is an early RFC to see whether the approach I've taken is acceptable.
> > > The series is on top of the current mmotm tree (2016-08-31-16-06). I didn't
> > > get to test it so it might be completely broken.
> > > 
> > > The primary point of this series is to finally get rid of TIF_MEMDIE.
> > > Recent changes in the oom killer proper allow for that, I believe. Now
> > > that all the oom victims are reapable we no longer depend on
> > > ALLOC_NO_WATERMARKS, because the memory held by the victim is reclaimed
> > > asynchronously. Partial access to memory reserves should be sufficient
> > > to guarantee that the oom victim is not starved by other memory
> > > consumers. This also means that we do not have to pretend to be
> > > conservative and give access to memory reserves to only one thread from
> > > the process at a time. This is patch 1.
> > >
> > > Patch 2 is a simple cleanup which converts TIF_MEMDIE users to
> > > tsk_is_oom_victim, which is process rather than thread centric. None of
> > > those callers really needs to be thread aware AFAICS.
> > > 
> > > The tricky part then is exit_oom_victim vs. oom_killer_disable, because
> > > TIF_MEMDIE acted as a token there and so gave us a way to count threads
> > > from the process. It didn't work 100% reliably and had its own issues,
> > > but we have to replace it with something which doesn't rely on counting
> > > threads and instead finds a moment when all threads have reached a
> > > steady state in do_exit. This is what patch 3 does and I would really
> > > appreciate it if Oleg could double check my thinking there. I am also
> > > CCing Al on that one because I am moving exit_io_context up in do_exit,
> > > right before exit_notify.
> > 
> > You're explaining the mechanical thing you are doing, but I'm having
> > trouble understanding why you want to get rid of TIF_MEMDIE. For one,
> > it's more code. And apparently, it's also more complicated than what
> > we have right now.
> > 
> > Can you please explain in the cover letter what's broken/undesirable?
> 
> Sure, I will extend the cover letter when submitting the series again.
> This RFC was mostly aimed at correctness so I focused more on the
> technical details. Patch 1 should contain some reasoning. Do you find it
> sufficient, or should I extend on top of that?

: For ages we have been relying on the TIF_MEMDIE thread flag to mark
: OOM victims and then, among other things, to give these threads full
: access to memory reserves. There are a few shortcomings to this
: implementation, though.
: 
: First of all, and the most serious one, full access to memory reserves
: is quite dangerous because we leave no safety room for the system to
: operate and potentially do last emergency steps to move on.

Do we encounter this in practice? I think one of the clues is that you
introduce the patch with "for ages we have been doing X", so I'd like
to see a more practical explanation of how the way we did it was flawed.

: Secondly, this flag is per task_struct while the OOM killer operates
: on mm_struct granularity, so all processes sharing the given mm are
: killed. Giving full access to all these task_structs could lead to a
: quick depletion of memory reserves. We have tried to reduce this risk
: by giving TIF_MEMDIE only to the main thread and the currently
: allocating task, but that doesn't really solve the problem while it
: surely opens up room for corner cases - e.g. GFP_NO{FS,IO} requests
: might loop inside the allocator without access to memory reserves
: because a particular thread was not the group leader.

Same here I guess.

It *sounds* to me like there are two different things going on
here. One being the access to emergency reserves, the other being the
synchronization token to count OOM victims. Maybe it would be easier
to separate those issues out in the line of argument?
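
To make the two roles concrete - this is only a rough sketch of how the
flag is used today, not exact code:

	/* role 1: the victim gets unrestricted access to memory reserves */
	if (test_thread_flag(TIF_MEMDIE))
		alloc_flags |= ALLOC_NO_WATERMARKS;

	/* role 2: a token counted by oom_killer_disable() so it can wait
	 * for all victims to pass through exit_oom_victim() */
	void exit_oom_victim(struct task_struct *tsk)
	{
		if (!test_and_clear_tsk_thread_flag(tsk, TIF_MEMDIE))
			return;

		if (!atomic_dec_return(&oom_victims))
			wake_up_all(&oom_victims_wait);
	}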

For emergency reserves, you make the point that we now have the reaper
and don't rely on the reserves as much anymore. However, we need to
consider that the reaper relies on the mmap_sem and is thus not as
robust as the concept of an emergency pool to guarantee fwd progress.
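
What I mean is that the reaper has roughly this shape, so a writer
holding mmap_sem blocks any reaping (a sketch with an illustrative name,
not the exact reaper code):

	static bool oom_reap_task_mm_sketch(struct mm_struct *mm)
	{
		/* the reaper must never sleep on the lock ... */
		if (!down_read_trylock(&mm->mmap_sem))
			return false;	/* ... the caller retries, then gives up */

		/* tear down the victim's private anonymous mappings */
		up_read(&mm->mmap_sem);
		return true;
	}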

For synchronization, I don't quite get the argument. The patch 4
changelog uses terms like "not optimal" and says that we "should" be
doing certain things, but doesn't explain the consequences of the
"wrong" behavior, the impact of frozen threads getting left behind, etc.

That stuff would be useful to know, both in the cover letter (higher
level user impact) and the changelogs (more detailed user impact).

I.e. sell the problem before selling the solution :-)

Thanks!

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC 0/4] mm, oom: get rid of TIF_MEMDIE
  2016-09-19 16:18     ` Johannes Weiner
@ 2016-09-19 19:02       ` Michal Hocko
  0 siblings, 0 replies; 21+ messages in thread
From: Michal Hocko @ 2016-09-19 19:02 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, David Rientjes, Tetsuo Handa, Andrew Morton, LKML,
	Al Viro, Oleg Nesterov

On Mon 19-09-16 12:18:37, Johannes Weiner wrote:
> On Fri, Sep 16, 2016 at 09:15:17AM +0200, Michal Hocko wrote:
[...]
> : For ages we have been relying on the TIF_MEMDIE thread flag to mark
> : OOM victims and then, among other things, to give these threads full
> : access to memory reserves. There are a few shortcomings to this
> : implementation, though.
> : 
> : First of all, and the most serious one, full access to memory reserves
> : is quite dangerous because we leave no safety room for the system to
> : operate and potentially do last emergency steps to move on.
> 
> Do we encounter this in practice?

We rarely encounter any of this in practice. Even OOM itself is an
exceptional situation, and most of the lockups I am trying to solve here
are close to non-existent. But that doesn't mean they are non-existent,
so as long as the resulting code makes sense and doesn't make the
situation overly more complicated, I would go with the enhancements.

> I think one of the clues is that you
> introduce the patch with "for ages we have been doing X", so I'd like
> to see a more practical explanation of how the way we did it was flawed.

OK, I will try harder to explain that. The core idea is that we couldn't
do anything better without async oom killing, because we were strictly
synchronous and only the victim itself could make any progress.

> : Secondly, this flag is per task_struct while the OOM killer operates
> : on mm_struct granularity, so all processes sharing the given mm are
> : killed. Giving full access to all these task_structs could lead to a
> : quick depletion of memory reserves. We have tried to reduce this risk
> : by giving TIF_MEMDIE only to the main thread and the currently
> : allocating task, but that doesn't really solve the problem while it
> : surely opens up room for corner cases - e.g. GFP_NO{FS,IO} requests
> : might loop inside the allocator without access to memory reserves
> : because a particular thread was not the group leader.
> 
> Same here I guess.
> 
> It *sounds* to me like there are two different things going on
> here. One being the access to emergency reserves, the other being the
> synchronization token to count OOM victims. Maybe it would be easier
> to separate those issues out in the line of argument?

Yes, this is precisely the problem. The meaning of the flag is
overloaded for multiple purposes, and it is hard to describe all of this
without getting tangled in all the details. Over time we have reduced
the locking side of the flag, but we still use it for oom_disable
synchronization and for memory reserves access.

> For emergency reserves, you make the point that we now have the reaper
> and don't rely on the reserves as much anymore. However, we need to
> consider that the reaper relies on the mmap_sem and is thus not as
> robust as the concept of an emergency pool to guarantee fwd progress.

Yes, it is not 100% reliable, but if we get stuck there we have a way to
move on to another victim, so we will not lock up. On the other hand, a
lockup due to mmap_sem held for write should be really rare, because the
lock is mostly taken killable, and it is taken outside of the oom victim
context even more rarely.
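
That said, the victims do not lose reserve access completely. Patch 1
replaces "no watermarks at all" with a partial headroom, along these
lines (a sketch only; the ALLOC_OOM name is illustrative, not
necessarily what the patch uses):

	/* __zone_watermark_ok() sketch: an oom victim may dip below the
	 * min watermark, but only half way, instead of ignoring
	 * watermarks entirely as ALLOC_NO_WATERMARKS did
	 */
	if (alloc_flags & ALLOC_OOM)
		min -= min / 2;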

> For synchronization, I don't quite get the argument. The patch 4
> changelog uses terms like "not optimal" and says that we "should" be
> doing certain things, but doesn't explain the consequences of the
> "wrong" behavior, the impact of frozen threads getting left behind, etc.

Yes, I will try harder. The primary point is the mismatch between the
oom killer being a per-mm operation and the flag being a per-thread
thing. For the freezer context it means that only one thread is woken up
while the others are still in the fridge. This wouldn't be a big deal
for the pm freezer, but the cgroup freezer allows a normal user to put
processes into the fridge, so victims might be hidden there
intentionally. The other part of the puzzle is that not all resources
are reclaimable by the async oom killer - pipe/socket buffers, say, can
consume a lot of memory while they are pinned by the process rather than
by its individual threads.
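
The wakeup side is easy to see from the current mark_oom_victim(), which
thaws only the single task it is handed (a trimmed sketch):

	void mark_oom_victim(struct task_struct *tsk)
	{
		if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
			return;

		/*
		 * Wake the victim so it can exit - but this thaws only
		 * this one thread; siblings of a multi-threaded victim
		 * stay in the fridge.
		 */
		__thaw_task(tsk);
		atomic_inc(&oom_victims);
	}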

> That stuff would be useful to know, both in the cover letter (higher
> level user impact) and the changelogs (more detailed user impact).
> 
> I.e. sell the problem before selling the solution :-)

Sure, I see what you are saying. I just feel there are so many subtle
details that it is hard to squeeze all of them into the changelog and
still make some sense ;)

Anyway, I will definitely try to be more specific in the next post.
Thanks for the feedback!

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2016-09-19 19:02 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-09-01  9:51 [RFC 0/4] mm, oom: get rid of TIF_MEMDIE Michal Hocko
2016-09-01  9:51 ` [RFC 1/4] mm, oom: do not rely on TIF_MEMDIE for memory reserves access Michal Hocko
2016-09-04  1:49   ` Tetsuo Handa
2016-09-09 14:00     ` Michal Hocko
2016-09-01  9:51 ` [RFC 2/4] mm: replace TIF_MEMDIE checks by tsk_is_oom_victim Michal Hocko
2016-09-04  1:49   ` Tetsuo Handa
2016-09-09 14:05     ` Michal Hocko
2016-09-01  9:51 ` [RFC 3/4] mm, oom: do not rely on TIF_MEMDIE for exit_oom_victim Michal Hocko
2016-09-04  1:50   ` Tetsuo Handa
2016-09-09 14:08     ` Michal Hocko
2016-09-10  6:29       ` Tetsuo Handa
2016-09-10 12:55         ` Tetsuo Handa
2016-09-12  9:11           ` Michal Hocko
2016-09-13  6:25             ` Tetsuo Handa
2016-09-13  7:21               ` Michal Hocko
2016-09-14 13:50   ` Michal Hocko
2016-09-01  9:51 ` [RFC 4/4] arch: get rid of TIF_MEMDIE Michal Hocko
2016-09-15 14:41 ` [RFC 0/4] mm, oom: " Johannes Weiner
2016-09-16  7:15   ` Michal Hocko
2016-09-19 16:18     ` Johannes Weiner
2016-09-19 19:02       ` Michal Hocko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).