[PATCH 1/5] mm,page_alloc: Update comment for last second allocation attempt.

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 1/5] mm,page_alloc: Update comment for last second allocation attempt.
@ 2017-11-08 11:01 Tetsuo Handa
  2017-11-08 11:01 ` [PATCH 2/5] mm,oom: Move last second allocation to inside the OOM killer Tetsuo Handa
                   ` (4 more replies)
  0 siblings, 5 replies; 18+ messages in thread
From: Tetsuo Handa @ 2017-11-08 11:01 UTC (permalink / raw)
  To: akpm
  Cc: linux-mm, Tetsuo Handa, Andrea Arcangeli, Johannes Weiner, Michal Hocko

__alloc_pages_may_oom() is doing last second allocation attempt using
ALLOC_WMARK_HIGH before calling out_of_memory(). This had two reasons.

The first reason is explained in the comment that it aims to catch
potential parallel OOM killing. But there is no longer parallel OOM
killing (in the sense that out_of_memory() is called "concurrently")
because we serialize out_of_memory() calls using oom_lock.

The second reason is explained by Andrea Arcangeli (who added that code)
that it aims to reduce the likelihood of OOM livelocks and be sure to
invoke the OOM killer. There was a risk of livelock or anyway of delayed
OOM killer invocation if ALLOC_WMARK_MIN is used, for relying on last
few pages which are constantly allocated and freed in the meantime will
not improve the situation. But there is no longer possibility of OOM
livelocks or failing to invoke the OOM killer because we need to mask
__GFP_DIRECT_RECLAIM for last second allocation attempt because oom_lock
prevents __GFP_DIRECT_RECLAIM && !__GFP_NORETRY allocations which last
second allocation attempt indirectly involve from failing.

Since the OOM killer does not always kill a process consuming significant
amount of memory (the OOM killer kills a process with highest OOM score
(or instead one of its children if any)), there will be cases where
ALLOC_WMARK_HIGH fails and ALLOC_WMARK_MIN succeeds.
Since the gap between ALLOC_WMARK_HIGH and ALLOC_WMARK_MIN can be changed
by /proc/sys/vm/min_free_kbytes parameter, using ALLOC_WMARK_MIN for last
second allocation attempt might be better for minimizing number of OOM
victims. But that change should be done in a separate patch. This patch
just clarifies that ALLOC_WMARK_HIGH is an arbitrary choice.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/page_alloc.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 536431b..613814c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3341,11 +3341,10 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
 	}

 	/*
-	 * Go through the zonelist yet one more time, keep very high watermark
-	 * here, this is only to catch a parallel oom killing, we must fail if
-	 * we're still under heavy pressure. But make sure that this reclaim
-	 * attempt shall not depend on __GFP_DIRECT_RECLAIM && !__GFP_NORETRY
-	 * allocation which will never fail due to oom_lock already held.
+	 * This allocation attempt must not depend on __GFP_DIRECT_RECLAIM &&
+	 * !__GFP_NORETRY allocation which will never fail due to oom_lock
+	 * already held. And since this allocation attempt does not sleep,
+	 * there is no reason we must use high watermark here.
 	 */
 	page = get_page_from_freelist((gfp_mask | __GFP_HARDWALL) &
 				      ~__GFP_DIRECT_RECLAIM, order,
-- 
1.8.3.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 2/5] mm,oom: Move last second allocation to inside the OOM killer.
  2017-11-08 11:01 [PATCH 1/5] mm,page_alloc: Update comment for last second allocation attempt Tetsuo Handa
@ 2017-11-08 11:01 ` Tetsuo Handa
  2017-11-08 14:50   ` Michal Hocko
  2017-11-08 11:01 ` [PATCH 3/5] mm,oom: Use ALLOC_OOM for OOM victim's last second allocation Tetsuo Handa
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 18+ messages in thread
From: Tetsuo Handa @ 2017-11-08 11:01 UTC (permalink / raw)
  To: akpm
  Cc: linux-mm, Tetsuo Handa, Andrea Arcangeli, Johannes Weiner, Michal Hocko

When out_of_memory() is called consecutively, sometimes doing last second
allocation attempt after selecting an OOM victim can succeed because
somebody (presumably previously killed OOM victims) might have managed to
free memory while we were selecting an OOM victim which can take quite
some time, for setting MMF_OOM_SKIP by exiting OOM victims is not
serialized by oom_lock. Therefore, this patch moves last second
allocation attempt to after selecting an OOM victim. This patch is
expected to reduce the time window for potentially pre-mature OOM
killing considerably.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Suggested-by: Michal Hocko <mhocko@suse.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/oom.h | 13 +++++++++++++
 mm/oom_kill.c       | 14 ++++++++++++++
 mm/page_alloc.c     | 40 ++++++++++++++++++++++++----------------
 3 files changed, 51 insertions(+), 16 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 01c91d8..27cd36b 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -14,6 +14,8 @@
 struct notifier_block;
 struct mem_cgroup;
 struct task_struct;
+struct alloc_context;
+struct page;
 
 /*
  * Details of the page allocation that triggered the oom killer that are used to
@@ -38,6 +40,15 @@ struct oom_control {
 	 */
 	const int order;
 
+	/* Context for really last second allocation attempt. */
+	const struct alloc_context *ac;
+	/*
+	 * Set by the OOM killer if ac != NULL and last second allocation
+	 * attempt succeeded. If ac != NULL, the caller must check for
+	 * page != NULL.
+	 */
+	struct page *page;
+
 	/* Used by oom implementation, do not set */
 	unsigned long totalpages;
 	struct task_struct *chosen;
@@ -102,6 +113,8 @@ extern unsigned long oom_badness(struct task_struct *p,
 
 extern struct task_struct *find_lock_task_mm(struct task_struct *p);
 
+extern struct page *alloc_pages_before_oomkill(const struct oom_control *oc);
+
 /* sysctls */
 extern int sysctl_oom_dump_tasks;
 extern int sysctl_oom_kill_allocating_task;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 85eced9..cf6f19b 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1065,6 +1065,9 @@ bool out_of_memory(struct oom_control *oc)
 	if (!is_memcg_oom(oc) && sysctl_oom_kill_allocating_task &&
 	    current->mm && !oom_unkillable_task(current, NULL, oc->nodemask) &&
 	    current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
+		oc->page = alloc_pages_before_oomkill(oc);
+		if (oc->page)
+			return true;
 		get_task_struct(current);
 		oc->chosen = current;
 		oom_kill_process(oc, "Out of memory (oom_kill_allocating_task)");
@@ -1072,6 +1075,17 @@ bool out_of_memory(struct oom_control *oc)
 	}
 
 	select_bad_process(oc);
+	/*
+	 * Try really last second allocation attempt after we selected an OOM
+	 * victim, for somebody might have managed to free memory while we were
+	 * selecting an OOM victim which can take quite some time.
+	 */
+	oc->page = alloc_pages_before_oomkill(oc);
+	if (oc->page) {
+		if (oc->chosen && oc->chosen != (void *)-1UL)
+			put_task_struct(oc->chosen);
+		return true;
+	}
 	/* Found nothing?!?! Either we hang forever, or we panic. */
 	if (!oc->chosen && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) {
 		dump_header(oc, NULL);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 613814c..764f24c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3325,6 +3325,7 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
 		.memcg = NULL,
 		.gfp_mask = gfp_mask,
 		.order = order,
+		.ac = ac,
 	};
 	struct page *page;
 
@@ -3340,18 +3341,6 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
 		return NULL;
 	}
 
-	/*
-	 * This allocation attempt must not depend on __GFP_DIRECT_RECLAIM &&
-	 * !__GFP_NORETRY allocation which will never fail due to oom_lock
-	 * already held. And since this allocation attempt does not sleep,
-	 * there is no reason we must use high watermark here.
-	 */
-	page = get_page_from_freelist((gfp_mask | __GFP_HARDWALL) &
-				      ~__GFP_DIRECT_RECLAIM, order,
-				      ALLOC_WMARK_HIGH|ALLOC_CPUSET, ac);
-	if (page)
-		goto out;
-
 	/* Coredumps can quickly deplete all memory reserves */
 	if (current->flags & PF_DUMPCORE)
 		goto out;
@@ -3386,16 +3375,18 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
 		goto out;
 
 	/* Exhausted what can be done so it's blamo time */
-	if (out_of_memory(&oc) || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) {
+	if (out_of_memory(&oc)) {
+		*did_some_progress = 1;
+		page = oc.page;
+	} else if (WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) {
 		*did_some_progress = 1;
 
 		/*
 		 * Help non-failing allocations by giving them access to memory
 		 * reserves
 		 */
-		if (gfp_mask & __GFP_NOFAIL)
-			page = __alloc_pages_cpuset_fallback(gfp_mask, order,
-					ALLOC_NO_WATERMARKS, ac);
+		page = __alloc_pages_cpuset_fallback(gfp_mask, order,
+						     ALLOC_NO_WATERMARKS, ac);
 	}
 out:
 	mutex_unlock(&oom_lock);
@@ -4155,6 +4146,23 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 	return page;
 }
 
+struct page *alloc_pages_before_oomkill(const struct oom_control *oc)
+{
+	/*
+	 * This allocation attempt must not depend on __GFP_DIRECT_RECLAIM &&
+	 * !__GFP_NORETRY allocation which will never fail due to oom_lock
+	 * already held. And since this allocation attempt does not sleep,
+	 * there is no reason we must use high watermark here.
+	 */
+	int alloc_flags = ALLOC_CPUSET | ALLOC_WMARK_HIGH;
+	gfp_t gfp_mask = oc->gfp_mask | __GFP_HARDWALL;
+
+	if (!oc->ac)
+		return NULL;
+	gfp_mask &= ~__GFP_DIRECT_RECLAIM;
+	return get_page_from_freelist(gfp_mask, oc->order, alloc_flags, oc->ac);
+}
+
 static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
 		int preferred_nid, nodemask_t *nodemask,
 		struct alloc_context *ac, gfp_t *alloc_mask,
-- 
1.8.3.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/5] mm,oom: Move last second allocation to inside the OOM killer.
  2017-11-08 11:01 ` [PATCH 2/5] mm,oom: Move last second allocation to inside the OOM killer Tetsuo Handa
@ 2017-11-08 14:50   ` Michal Hocko
  0 siblings, 0 replies; 18+ messages in thread
From: Michal Hocko @ 2017-11-08 14:50 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: akpm, linux-mm, Andrea Arcangeli, Johannes Weiner

On Wed 08-11-17 20:01:45, Tetsuo Handa wrote:
> When out_of_memory() is called consecutively, sometimes doing last second
> allocation attempt after selecting an OOM victim can succeed because
> somebody (presumably previously killed OOM victims) might have managed to
> free memory while we were selecting an OOM victim which can take quite
> some time, for setting MMF_OOM_SKIP by exiting OOM victims is not
> serialized by oom_lock. Therefore, this patch moves last second
> allocation attempt to after selecting an OOM victim. This patch is
> expected to reduce the time window for potentially pre-mature OOM
> killing considerably.

the changelog is not ideal because the primary point of moving the
allocation is that the oom victim selection can take quite some time and
the oom situation might have changed during that time. Speculating about
MMF_OOM_SKIP is more confusing than helpful.

> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Suggested-by: Michal Hocko <mhocko@suse.com>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>

Anyway, I do agree with moving even though it is a layer violation. I
have seen few OOMs during a large process tear down and this could have
helped in those cases. I do not have the specific OOM report at hands
unfortunately.

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  include/linux/oom.h | 13 +++++++++++++
>  mm/oom_kill.c       | 14 ++++++++++++++
>  mm/page_alloc.c     | 40 ++++++++++++++++++++++++----------------
>  3 files changed, 51 insertions(+), 16 deletions(-)
> 
> diff --git a/include/linux/oom.h b/include/linux/oom.h
> index 01c91d8..27cd36b 100644
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -14,6 +14,8 @@
>  struct notifier_block;
>  struct mem_cgroup;
>  struct task_struct;
> +struct alloc_context;
> +struct page;
>  
>  /*
>   * Details of the page allocation that triggered the oom killer that are used to
> @@ -38,6 +40,15 @@ struct oom_control {
>  	 */
>  	const int order;
>  
> +	/* Context for really last second allocation attempt. */
> +	const struct alloc_context *ac;
> +	/*
> +	 * Set by the OOM killer if ac != NULL and last second allocation
> +	 * attempt succeeded. If ac != NULL, the caller must check for
> +	 * page != NULL.
> +	 */
> +	struct page *page;
> +
>  	/* Used by oom implementation, do not set */
>  	unsigned long totalpages;
>  	struct task_struct *chosen;
> @@ -102,6 +113,8 @@ extern unsigned long oom_badness(struct task_struct *p,
>  
>  extern struct task_struct *find_lock_task_mm(struct task_struct *p);
>  
> +extern struct page *alloc_pages_before_oomkill(const struct oom_control *oc);
> +
>  /* sysctls */
>  extern int sysctl_oom_dump_tasks;
>  extern int sysctl_oom_kill_allocating_task;
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 85eced9..cf6f19b 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -1065,6 +1065,9 @@ bool out_of_memory(struct oom_control *oc)
>  	if (!is_memcg_oom(oc) && sysctl_oom_kill_allocating_task &&
>  	    current->mm && !oom_unkillable_task(current, NULL, oc->nodemask) &&
>  	    current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
> +		oc->page = alloc_pages_before_oomkill(oc);
> +		if (oc->page)
> +			return true;
>  		get_task_struct(current);
>  		oc->chosen = current;
>  		oom_kill_process(oc, "Out of memory (oom_kill_allocating_task)");
> @@ -1072,6 +1075,17 @@ bool out_of_memory(struct oom_control *oc)
>  	}
>  
>  	select_bad_process(oc);
> +	/*
> +	 * Try really last second allocation attempt after we selected an OOM
> +	 * victim, for somebody might have managed to free memory while we were
> +	 * selecting an OOM victim which can take quite some time.
> +	 */
> +	oc->page = alloc_pages_before_oomkill(oc);
> +	if (oc->page) {
> +		if (oc->chosen && oc->chosen != (void *)-1UL)
> +			put_task_struct(oc->chosen);
> +		return true;
> +	}
>  	/* Found nothing?!?! Either we hang forever, or we panic. */
>  	if (!oc->chosen && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) {
>  		dump_header(oc, NULL);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 613814c..764f24c 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3325,6 +3325,7 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
>  		.memcg = NULL,
>  		.gfp_mask = gfp_mask,
>  		.order = order,
> +		.ac = ac,
>  	};
>  	struct page *page;
>  
> @@ -3340,18 +3341,6 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
>  		return NULL;
>  	}
>  
> -	/*
> -	 * This allocation attempt must not depend on __GFP_DIRECT_RECLAIM &&
> -	 * !__GFP_NORETRY allocation which will never fail due to oom_lock
> -	 * already held. And since this allocation attempt does not sleep,
> -	 * there is no reason we must use high watermark here.
> -	 */
> -	page = get_page_from_freelist((gfp_mask | __GFP_HARDWALL) &
> -				      ~__GFP_DIRECT_RECLAIM, order,
> -				      ALLOC_WMARK_HIGH|ALLOC_CPUSET, ac);
> -	if (page)
> -		goto out;
> -
>  	/* Coredumps can quickly deplete all memory reserves */
>  	if (current->flags & PF_DUMPCORE)
>  		goto out;
> @@ -3386,16 +3375,18 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
>  		goto out;
>  
>  	/* Exhausted what can be done so it's blamo time */
> -	if (out_of_memory(&oc) || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) {
> +	if (out_of_memory(&oc)) {
> +		*did_some_progress = 1;
> +		page = oc.page;
> +	} else if (WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) {
>  		*did_some_progress = 1;
>  
>  		/*
>  		 * Help non-failing allocations by giving them access to memory
>  		 * reserves
>  		 */
> -		if (gfp_mask & __GFP_NOFAIL)
> -			page = __alloc_pages_cpuset_fallback(gfp_mask, order,
> -					ALLOC_NO_WATERMARKS, ac);
> +		page = __alloc_pages_cpuset_fallback(gfp_mask, order,
> +						     ALLOC_NO_WATERMARKS, ac);
>  	}
>  out:
>  	mutex_unlock(&oom_lock);
> @@ -4155,6 +4146,23 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
>  	return page;
>  }
>  
> +struct page *alloc_pages_before_oomkill(const struct oom_control *oc)
> +{
> +	/*
> +	 * This allocation attempt must not depend on __GFP_DIRECT_RECLAIM &&
> +	 * !__GFP_NORETRY allocation which will never fail due to oom_lock
> +	 * already held. And since this allocation attempt does not sleep,
> +	 * there is no reason we must use high watermark here.
> +	 */
> +	int alloc_flags = ALLOC_CPUSET | ALLOC_WMARK_HIGH;
> +	gfp_t gfp_mask = oc->gfp_mask | __GFP_HARDWALL;
> +
> +	if (!oc->ac)
> +		return NULL;
> +	gfp_mask &= ~__GFP_DIRECT_RECLAIM;
> +	return get_page_from_freelist(gfp_mask, oc->order, alloc_flags, oc->ac);
> +}
> +
>  static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
>  		int preferred_nid, nodemask_t *nodemask,
>  		struct alloc_context *ac, gfp_t *alloc_mask,
> -- 
> 1.8.3.1
> 

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 3/5] mm,oom: Use ALLOC_OOM for OOM victim's last second allocation.
  2017-11-08 11:01 [PATCH 1/5] mm,page_alloc: Update comment for last second allocation attempt Tetsuo Handa
  2017-11-08 11:01 ` [PATCH 2/5] mm,oom: Move last second allocation to inside the OOM killer Tetsuo Handa
@ 2017-11-08 11:01 ` Tetsuo Handa
  2017-11-08 14:50   ` Michal Hocko
  2017-11-08 11:01 ` [PATCH 4/5] mm,oom: Remove oom_lock serialization from the OOM reaper Tetsuo Handa
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 18+ messages in thread
From: Tetsuo Handa @ 2017-11-08 11:01 UTC (permalink / raw)
  To: akpm
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Manish Jaggi,
	Michal Hocko, Oleg Nesterov, Vladimir Davydov

Manish Jaggi noticed that running LTP oom01/oom02 ltp tests with high core
count causes random kernel panics when an OOM victim which consumed memory
in a way the OOM reaper does not help was selected by the OOM killer [1].
Since commit 696453e66630ad45 ("mm, oom: task_will_free_mem should skip
oom_reaped tasks") changed task_will_free_mem(current) in out_of_memory()
to return false as soon as MMF_OOM_SKIP is set, many threads sharing the
victim's mm were not able to try allocation from memory reserves after the
OOM reaper gave up reclaiming memory.

Therefore, this patch allows OOM victims to use ALLOC_OOM watermark for
last second allocation attempt.

[1] http://lkml.kernel.org/r/e6c83a26-1d59-4afd-55cf-04e58bdde188@caviumnetworks.com

Fixes: 696453e66630ad45 ("mm, oom: task_will_free_mem should skip oom_reaped tasks")
Reported-by: Manish Jaggi <mjaggi@caviumnetworks.com>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: David Rientjes <rientjes@google.com>
---
 mm/page_alloc.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 764f24c..fbbc95a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4153,13 +4153,19 @@ struct page *alloc_pages_before_oomkill(const struct oom_control *oc)
 	 * !__GFP_NORETRY allocation which will never fail due to oom_lock
 	 * already held. And since this allocation attempt does not sleep,
 	 * there is no reason we must use high watermark here.
+	 * But anyway, make sure that OOM victims can try ALLOC_OOM watermark
+	 * in case they haven't tried ALLOC_OOM watermark.
 	 */
 	int alloc_flags = ALLOC_CPUSET | ALLOC_WMARK_HIGH;
 	gfp_t gfp_mask = oc->gfp_mask | __GFP_HARDWALL;
+	int reserve_flags;
 
 	if (!oc->ac)
 		return NULL;
 	gfp_mask &= ~__GFP_DIRECT_RECLAIM;
+	reserve_flags = __gfp_pfmemalloc_flags(gfp_mask);
+	if (reserve_flags)
+		alloc_flags = reserve_flags;
 	return get_page_from_freelist(gfp_mask, oc->order, alloc_flags, oc->ac);
 }
 
-- 
1.8.3.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH 3/5] mm,oom: Use ALLOC_OOM for OOM victim's last second allocation.
  2017-11-08 11:01 ` [PATCH 3/5] mm,oom: Use ALLOC_OOM for OOM victim's last second allocation Tetsuo Handa
@ 2017-11-08 14:50   ` Michal Hocko
  0 siblings, 0 replies; 18+ messages in thread
From: Michal Hocko @ 2017-11-08 14:50 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, linux-mm, David Rientjes, Manish Jaggi, Oleg Nesterov,
	Vladimir Davydov

On Wed 08-11-17 20:01:46, Tetsuo Handa wrote:
> Manish Jaggi noticed that running LTP oom01/oom02 ltp tests with high core
> count causes random kernel panics when an OOM victim which consumed memory
> in a way the OOM reaper does not help was selected by the OOM killer [1].
> Since commit 696453e66630ad45 ("mm, oom: task_will_free_mem should skip
> oom_reaped tasks") changed task_will_free_mem(current) in out_of_memory()
> to return false as soon as MMF_OOM_SKIP is set, many threads sharing the
> victim's mm were not able to try allocation from memory reserves after the
> OOM reaper gave up reclaiming memory.
> 
> Therefore, this patch allows OOM victims to use ALLOC_OOM watermark for
> last second allocation attempt.
> 
> [1] http://lkml.kernel.org/r/e6c83a26-1d59-4afd-55cf-04e58bdde188@caviumnetworks.com
> 
> Fixes: 696453e66630ad45 ("mm, oom: task_will_free_mem should skip oom_reaped tasks")
> Reported-by: Manish Jaggi <mjaggi@caviumnetworks.com>
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: David Rientjes <rientjes@google.com>

I do not like the changelog because it doesn't explain the race window
but the patch itself is OK. I have offered a better explanation [1] but
I will not really insist on my wording.
Acked-by: Michal Hocko <mhocko@suse.com>

[1] http://lkml.kernel.org/r/20171101135855.bqg2kuj6ao2cicqi@dhcp22.suse.cz

> ---
>  mm/page_alloc.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 764f24c..fbbc95a 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4153,13 +4153,19 @@ struct page *alloc_pages_before_oomkill(const struct oom_control *oc)
>  	 * !__GFP_NORETRY allocation which will never fail due to oom_lock
>  	 * already held. And since this allocation attempt does not sleep,
>  	 * there is no reason we must use high watermark here.
> +	 * But anyway, make sure that OOM victims can try ALLOC_OOM watermark
> +	 * in case they haven't tried ALLOC_OOM watermark.
>  	 */
>  	int alloc_flags = ALLOC_CPUSET | ALLOC_WMARK_HIGH;
>  	gfp_t gfp_mask = oc->gfp_mask | __GFP_HARDWALL;
> +	int reserve_flags;
>  
>  	if (!oc->ac)
>  		return NULL;
>  	gfp_mask &= ~__GFP_DIRECT_RECLAIM;
> +	reserve_flags = __gfp_pfmemalloc_flags(gfp_mask);
> +	if (reserve_flags)
> +		alloc_flags = reserve_flags;
>  	return get_page_from_freelist(gfp_mask, oc->order, alloc_flags, oc->ac);
>  }
>  
> -- 
> 1.8.3.1
> 

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 4/5] mm,oom: Remove oom_lock serialization from the OOM reaper.
  2017-11-08 11:01 [PATCH 1/5] mm,page_alloc: Update comment for last second allocation attempt Tetsuo Handa
  2017-11-08 11:01 ` [PATCH 2/5] mm,oom: Move last second allocation to inside the OOM killer Tetsuo Handa
  2017-11-08 11:01 ` [PATCH 3/5] mm,oom: Use ALLOC_OOM for OOM victim's last second allocation Tetsuo Handa
@ 2017-11-08 11:01 ` Tetsuo Handa
  2017-11-08 15:03   ` Michal Hocko
  2017-11-08 11:01 ` [PATCH 5/5] nommu,oom: Set MMF_OOM_SKIP without waiting for termination Tetsuo Handa
  2017-11-08 14:50 ` [PATCH 1/5] mm,page_alloc: Update comment for last second allocation attempt Michal Hocko
  4 siblings, 1 reply; 18+ messages in thread
From: Tetsuo Handa @ 2017-11-08 11:01 UTC (permalink / raw)
  To: akpm; +Cc: linux-mm, Tetsuo Handa

Since "mm,oom: Move last second allocation to inside the OOM killer."
changed to do last second allocation attempt after confirming that there
is no OOM victim's mm without MMF_OOM_SKIP set, we no longer need to
block the OOM reaper using oom_lock. This patch should allow start
reclaiming earlier than now.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 mm/oom_kill.c | 17 -----------------
 1 file changed, 17 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index cf6f19b..6949465 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -495,22 +495,6 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 	struct vm_area_struct *vma;
 	bool ret = true;
 
-	/*
-	 * We have to make sure to not race with the victim exit path
-	 * and cause premature new oom victim selection:
-	 * __oom_reap_task_mm		exit_mm
-	 *   mmget_not_zero
-	 *				  mmput
-	 *				    atomic_dec_and_test
-	 *				  exit_oom_victim
-	 *				[...]
-	 *				out_of_memory
-	 *				  select_bad_process
-	 *				    # no TIF_MEMDIE task selects new victim
-	 *  unmap_page_range # frees some memory
-	 */
-	mutex_lock(&oom_lock);
-
 	if (!down_read_trylock(&mm->mmap_sem)) {
 		ret = false;
 		trace_skip_task_reaping(tsk->pid);
@@ -584,7 +568,6 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 
 	trace_finish_task_reaping(tsk->pid);
 unlock_oom:
-	mutex_unlock(&oom_lock);
 	return ret;
 }
 
-- 
1.8.3.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/5] mm,oom: Remove oom_lock serialization from the OOM reaper.
  2017-11-08 11:01 ` [PATCH 4/5] mm,oom: Remove oom_lock serialization from the OOM reaper Tetsuo Handa
@ 2017-11-08 15:03   ` Michal Hocko
  0 siblings, 0 replies; 18+ messages in thread
From: Michal Hocko @ 2017-11-08 15:03 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: akpm, linux-mm

Thank you for CCing me (as an author of the code)

On Wed 08-11-17 20:01:47, Tetsuo Handa wrote:
> Since "mm,oom: Move last second allocation to inside the OOM killer."
> changed to do last second allocation attempt after confirming that there
> is no OOM victim's mm without MMF_OOM_SKIP set, we no longer need to
> block the OOM reaper using oom_lock. This patch should allow start
> reclaiming earlier than now.
> 
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/oom_kill.c | 17 -----------------
>  1 file changed, 17 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index cf6f19b..6949465 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -495,22 +495,6 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
>  	struct vm_area_struct *vma;
>  	bool ret = true;
>  
> -	/*
> -	 * We have to make sure to not race with the victim exit path
> -	 * and cause premature new oom victim selection:
> -	 * __oom_reap_task_mm		exit_mm
> -	 *   mmget_not_zero
> -	 *				  mmput
> -	 *				    atomic_dec_and_test
> -	 *				  exit_oom_victim
> -	 *				[...]
> -	 *				out_of_memory
> -	 *				  select_bad_process
> -	 *				    # no TIF_MEMDIE task selects new victim
> -	 *  unmap_page_range # frees some memory
> -	 */
> -	mutex_lock(&oom_lock);
> -
>  	if (!down_read_trylock(&mm->mmap_sem)) {
>  		ret = false;
>  		trace_skip_task_reaping(tsk->pid);
> @@ -584,7 +568,6 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
>  
>  	trace_finish_task_reaping(tsk->pid);
>  unlock_oom:
> -	mutex_unlock(&oom_lock);
>  	return ret;
>  }
>  
> -- 
> 1.8.3.1
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 5/5] nommu,oom: Set MMF_OOM_SKIP without waiting for termination.
  2017-11-08 11:01 [PATCH 1/5] mm,page_alloc: Update comment for last second allocation attempt Tetsuo Handa
                   ` (2 preceding siblings ...)
  2017-11-08 11:01 ` [PATCH 4/5] mm,oom: Remove oom_lock serialization from the OOM reaper Tetsuo Handa
@ 2017-11-08 11:01 ` Tetsuo Handa
  2017-11-08 14:56   ` Michal Hocko
  2017-11-08 16:24   ` Michal Hocko
  2017-11-08 14:50 ` [PATCH 1/5] mm,page_alloc: Update comment for last second allocation attempt Michal Hocko
  4 siblings, 2 replies; 18+ messages in thread
From: Tetsuo Handa @ 2017-11-08 11:01 UTC (permalink / raw)
  To: akpm; +Cc: linux-mm, Tetsuo Handa, Mel Gorman, Michal Hocko

Commit 212925802454672e ("mm: oom: let oom_reap_task and exit_mmap run
concurrently") moved the location of setting MMF_OOM_SKIP from __mmput()
in kernel/fork.c (which is used by both MMU and !MMU) to exit_mm() in
mm/mmap.c (which is used by MMU only). As a result, that commit required
OOM victims in !MMU kernels to disappear from the task list in order to
reenable the OOM killer, for !MMU kernels can no longer set MMF_OOM_SKIP
(unless the OOM victim's mm is shared with global init process).

While it would be possible to restore MMF_OOM_SKIP in __mmput() for !MMU
kernels, let's forget about possibility of OOM livelock for !MMU kernels
caused by failing to set MMF_OOM_SKIP, by setting MMF_OOM_SKIP at
oom_kill_process(), for the invocation of the OOM killer is a rare event
for !MMU systems from the beginning. By doing so, we can get rid of
special treatment for !MMU case in commit cd04ae1e2dc8e365 ("mm, oom:
do not rely on TIF_MEMDIE for memory reserves access"). And "mm,oom:
Use ALLOC_OOM for OOM victim's last second allocation." will allow the
OOM victim to try ALLOC_OOM (instead of ALLOC_NO_WATERMARKS) allocation
before killing more OOM victims.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
---
 mm/internal.h   |  9 ---------
 mm/oom_kill.c   |  7 +++++--
 mm/page_alloc.c | 12 +-----------
 3 files changed, 6 insertions(+), 22 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index e6bd351..f0eb8d90 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -481,16 +481,7 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
 /* Mask to get the watermark bits */
 #define ALLOC_WMARK_MASK	(ALLOC_NO_WATERMARKS-1)
 
-/*
- * Only MMU archs have async oom victim reclaim - aka oom_reaper so we
- * cannot assume a reduced access to memory reserves is sufficient for
- * !MMU
- */
-#ifdef CONFIG_MMU
 #define ALLOC_OOM		0x08
-#else
-#define ALLOC_OOM		ALLOC_NO_WATERMARKS
-#endif
 
 #define ALLOC_HARDER		0x10 /* try to alloc harder */
 #define ALLOC_HIGH		0x20 /* __GFP_HIGH set */
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 6949465..d57dcd5 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -647,6 +647,8 @@ static int __init oom_init(void)
 #else
 static inline void wake_oom_reaper(struct task_struct *tsk)
 {
+	/* tsk->mm != NULL because tsk == current or task_lock is held. */
+	set_bit(MMF_OOM_SKIP, &tsk->mm->flags);
 }
 #endif /* CONFIG_MMU */
 
@@ -829,7 +831,7 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 	unsigned int victim_points = 0;
 	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
 					      DEFAULT_RATELIMIT_BURST);
-	bool can_oom_reap = true;
+	bool can_oom_reap = IS_ENABLED(CONFIG_MMU);
 
 	/*
 	 * If the task is already exiting, don't alarm the sysadmin or kill
@@ -929,7 +931,6 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 			continue;
 		if (is_global_init(p)) {
 			can_oom_reap = false;
-			set_bit(MMF_OOM_SKIP, &mm->flags);
 			pr_info("oom killer %d (%s) has mm pinned by %d (%s)\n",
 					task_pid_nr(victim), victim->comm,
 					task_pid_nr(p), p->comm);
@@ -947,6 +948,8 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 
 	if (can_oom_reap)
 		wake_oom_reaper(victim);
+	else
+		set_bit(MMF_OOM_SKIP, &mm->flags);
 
 	mmdrop(mm);
 	put_task_struct(victim);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index fbbc95a..ff435f7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3711,17 +3711,7 @@ static void wake_all_kswapds(unsigned int order, const struct alloc_context *ac)
 
 static bool oom_reserves_allowed(struct task_struct *tsk)
 {
-	if (!tsk_is_oom_victim(tsk))
-		return false;
-
-	/*
-	 * !MMU doesn't have oom reaper so give access to memory reserves
-	 * only to the thread with TIF_MEMDIE set
-	 */
-	if (!IS_ENABLED(CONFIG_MMU) && !test_thread_flag(TIF_MEMDIE))
-		return false;
-
-	return true;
+	return tsk_is_oom_victim(tsk);
 }
 
 /*
-- 
1.8.3.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH 5/5] nommu,oom: Set MMF_OOM_SKIP without waiting for termination.
  2017-11-08 11:01 ` [PATCH 5/5] nommu,oom: Set MMF_OOM_SKIP without waiting for termination Tetsuo Handa
@ 2017-11-08 14:56   ` Michal Hocko
  2017-11-08 16:24   ` Michal Hocko
  1 sibling, 0 replies; 18+ messages in thread
From: Michal Hocko @ 2017-11-08 14:56 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: akpm, linux-mm, Mel Gorman

On Wed 08-11-17 20:01:48, Tetsuo Handa wrote:
> Commit 212925802454672e ("mm: oom: let oom_reap_task and exit_mmap run
> concurrently") moved the location of setting MMF_OOM_SKIP from __mmput()
> in kernel/fork.c (which is used by both MMU and !MMU) to exit_mm() in
> mm/mmap.c (which is used by MMU only). As a result, that commit required
> OOM victims in !MMU kernels to disappear from the task list in order to
> reenable the OOM killer, for !MMU kernels can no longer set MMF_OOM_SKIP
> (unless the OOM victim's mm is shared with global init process).

nack withtout demonstrating that the problem is real. It is true it
removes some lines but this is mostly this...
 
> While it would be possible to restore MMF_OOM_SKIP in __mmput() for !MMU
> kernels, let's forget about possibility of OOM livelock for !MMU kernels
> caused by failing to set MMF_OOM_SKIP, by setting MMF_OOM_SKIP at
> oom_kill_process(), for the invocation of the OOM killer is a rare event
> for !MMU systems from the beginning. By doing so, we can get rid of
> special treatment for !MMU case in commit cd04ae1e2dc8e365 ("mm, oom:
> do not rely on TIF_MEMDIE for memory reserves access"). And "mm,oom:
> Use ALLOC_OOM for OOM victim's last second allocation." will allow the
> OOM victim to try ALLOC_OOM (instead of ALLOC_NO_WATERMARKS) allocation
> before killing more OOM victims.
...
>  static bool oom_reserves_allowed(struct task_struct *tsk)
>  {
> -	if (!tsk_is_oom_victim(tsk))
> -		return false;
> -
> -	/*
> -	 * !MMU doesn't have oom reaper so give access to memory reserves
> -	 * only to the thread with TIF_MEMDIE set
> -	 */
> -	if (!IS_ENABLED(CONFIG_MMU) && !test_thread_flag(TIF_MEMDIE))
> -		return false;
> -
> -	return true;
> +	return tsk_is_oom_victim(tsk);
>  }

and the respective ALLOC_OOM change for nommu. The sole purpose of the
code was to prevent from potential problem pointed out by _you_ that
nommu doesn't have the oom reaper and as such we cannot rely on partial
oom reserves. So I am quite surprised that you no longer insist on
the nommu theoretical issue. AFAIR you insisted hard back then. I am not
really sure what has changed since then. I would love to ack a patch
which removes the conditional oom reserves handling with an explanation
why it is not a problem anymore.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 5/5] nommu,oom: Set MMF_OOM_SKIP without waiting for termination.
  2017-11-08 11:01 ` [PATCH 5/5] nommu,oom: Set MMF_OOM_SKIP without waiting for termination Tetsuo Handa
  2017-11-08 14:56   ` Michal Hocko
@ 2017-11-08 16:24   ` Michal Hocko
  2017-11-09 10:49     ` Tetsuo Handa
  1 sibling, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2017-11-08 16:24 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: akpm, linux-mm, Mel Gorman

On Wed 08-11-17 20:01:48, Tetsuo Handa wrote:
[...]
> @@ -829,7 +831,7 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
>  	unsigned int victim_points = 0;
>  	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
>  					      DEFAULT_RATELIMIT_BURST);
> -	bool can_oom_reap = true;
> +	bool can_oom_reap = IS_ENABLED(CONFIG_MMU);
>  
>  	/*
>  	 * If the task is already exiting, don't alarm the sysadmin or kill
> @@ -929,7 +931,6 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
>  			continue;
>  		if (is_global_init(p)) {
>  			can_oom_reap = false;
> -			set_bit(MMF_OOM_SKIP, &mm->flags);
>  			pr_info("oom killer %d (%s) has mm pinned by %d (%s)\n",
>  					task_pid_nr(victim), victim->comm,
>  					task_pid_nr(p), p->comm);
> @@ -947,6 +948,8 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
>  
>  	if (can_oom_reap)
>  		wake_oom_reaper(victim);
> +	else
> +		set_bit(MMF_OOM_SKIP, &mm->flags);
>  
>  	mmdrop(mm);
>  	put_task_struct(victim);

Also this looks completely broken. nommu kernels lose the premature oom
killing protection almost completely (they simply rely on the sleep
before dropping the oom_lock).
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 5/5] nommu,oom: Set MMF_OOM_SKIP without waiting for termination.
  2017-11-08 16:24   ` Michal Hocko
@ 2017-11-09 10:49     ` Tetsuo Handa
  2017-11-09 11:27       ` Michal Hocko
  0 siblings, 1 reply; 18+ messages in thread
From: Tetsuo Handa @ 2017-11-09 10:49 UTC (permalink / raw)
  To: mhocko; +Cc: akpm, linux-mm, mgorman

Michal Hocko wrote:
> On Wed 08-11-17 20:01:48, Tetsuo Handa wrote:
> > Commit 212925802454672e ("mm: oom: let oom_reap_task and exit_mmap run
> > concurrently") moved the location of setting MMF_OOM_SKIP from __mmput()
> > in kernel/fork.c (which is used by both MMU and !MMU) to exit_mm() in
> > mm/mmap.c (which is used by MMU only). As a result, that commit required
> > OOM victims in !MMU kernels to disappear from the task list in order to
> > reenable the OOM killer, for !MMU kernels can no longer set MMF_OOM_SKIP
> > (unless the OOM victim's mm is shared with global init process).
> 
> nack withtout demonstrating that the problem is real. It is true it
> removes some lines but this is mostly this...

Then, it is impossible unless somebody volunteers proving it.
I'm not a nommu kernel user.

>  
> > While it would be possible to restore MMF_OOM_SKIP in __mmput() for !MMU
> > kernels, let's forget about possibility of OOM livelock for !MMU kernels
> > caused by failing to set MMF_OOM_SKIP, by setting MMF_OOM_SKIP at
> > oom_kill_process(), for the invocation of the OOM killer is a rare event
> > for !MMU systems from the beginning. By doing so, we can get rid of
> > special treatment for !MMU case in commit cd04ae1e2dc8e365 ("mm, oom:
> > do not rely on TIF_MEMDIE for memory reserves access"). And "mm,oom:
> > Use ALLOC_OOM for OOM victim's last second allocation." will allow the
> > OOM victim to try ALLOC_OOM (instead of ALLOC_NO_WATERMARKS) allocation
> > before killing more OOM victims.
> ...
> >  static bool oom_reserves_allowed(struct task_struct *tsk)
> >  {
> > -	if (!tsk_is_oom_victim(tsk))
> > -		return false;
> > -
> > -	/*
> > -	 * !MMU doesn't have oom reaper so give access to memory reserves
> > -	 * only to the thread with TIF_MEMDIE set
> > -	 */
> > -	if (!IS_ENABLED(CONFIG_MMU) && !test_thread_flag(TIF_MEMDIE))
> > -		return false;
> > -
> > -	return true;
> > +	return tsk_is_oom_victim(tsk);
> >  }
> 
> and the respective ALLOC_OOM change for nommu. The sole purpose of the
> code was to prevent from potential problem pointed out by _you_ that
> nommu doesn't have the oom reaper and as such we cannot rely on partial
> oom reserves. So I am quite surprised that you no longer insist on
> the nommu theoretical issue. AFAIR you insisted hard back then. I am not
> really sure what has changed since then. I would love to ack a patch
> which removes the conditional oom reserves handling with an explanation
> why it is not a problem anymore.

Because this patch changes to guarantee that MMF_OOM_SKIP is set, based on
an assumption that OOM lockup for !MMU kernel is a theoretical issue and
invocation of the OOM killer is a rare event for !MMU systems.

> 
> On Wed 08-11-17 20:01:48, Tetsuo Handa wrote:
> [...]
> > @@ -829,7 +831,7 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
> >  	unsigned int victim_points = 0;
> >  	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
> >  					      DEFAULT_RATELIMIT_BURST);
> > -	bool can_oom_reap = true;
> > +	bool can_oom_reap = IS_ENABLED(CONFIG_MMU);
> >  
> >  	/*
> >  	 * If the task is already exiting, don't alarm the sysadmin or kill
> > @@ -929,7 +931,6 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
> >  			continue;
> >  		if (is_global_init(p)) {
> >  			can_oom_reap = false;
> > -			set_bit(MMF_OOM_SKIP, &mm->flags);
> >  			pr_info("oom killer %d (%s) has mm pinned by %d (%s)\n",
> >  					task_pid_nr(victim), victim->comm,
> >  					task_pid_nr(p), p->comm);
> > @@ -947,6 +948,8 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
> >  
> >  	if (can_oom_reap)
> >  		wake_oom_reaper(victim);
> > +	else
> > +		set_bit(MMF_OOM_SKIP, &mm->flags);
> >  
> >  	mmdrop(mm);
> >  	put_task_struct(victim);
> 
> Also this looks completely broken. nommu kernels lose the premature oom
> killing protection almost completely (they simply rely on the sleep
> before dropping the oom_lock).
> 

If you are worrying that setting MMF_OOM_SKIP immediately might cause
premature OOM killing), what we would afford is timeout-based approach
shown below, for it will be a waste of resource to add the OOM reaper kernel
thread which does nothing but setting MMF_OOM_SKIP.

---
 include/linux/mm_types.h |  3 +++
 mm/internal.h            |  9 ---------
 mm/oom_kill.c            | 16 +++++++++++++++-
 mm/page_alloc.c          | 12 +-----------
 4 files changed, 19 insertions(+), 21 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index cfd0ac4..ad60b33 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -501,6 +501,9 @@ struct mm_struct {
 	atomic_long_t hugetlb_usage;
 #endif
 	struct work_struct async_put_work;
+#ifndef CONFIG_MMU
+	unsigned long oom_victim_start;
+#endif
 
 #if IS_ENABLED(CONFIG_HMM)
 	/* HMM needs to track a few things per mm */
diff --git a/mm/internal.h b/mm/internal.h
index e6bd351..f0eb8d9 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -481,16 +481,7 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
 /* Mask to get the watermark bits */
 #define ALLOC_WMARK_MASK	(ALLOC_NO_WATERMARKS-1)
 
-/*
- * Only MMU archs have async oom victim reclaim - aka oom_reaper so we
- * cannot assume a reduced access to memory reserves is sufficient for
- * !MMU
- */
-#ifdef CONFIG_MMU
 #define ALLOC_OOM		0x08
-#else
-#define ALLOC_OOM		ALLOC_NO_WATERMARKS
-#endif
 
 #define ALLOC_HARDER		0x10 /* try to alloc harder */
 #define ALLOC_HIGH		0x20 /* __GFP_HIGH set */
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 1472917..f277619 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -324,7 +324,13 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
 	 * any memory is quite low.
 	 */
 	if (!is_sysrq_oom(oc) && tsk_is_oom_victim(task)) {
-		if (test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags))
+		struct mm_struct *mm = task->signal->oom_mm;
+
+#ifndef CONFIG_MMU
+		if (time_after(jiffies, mm->oom_victim_start + HZ))
+			set_bit(MMF_OOM_SKIP, &mm->flags);
+#endif
+		if (test_bit(MMF_OOM_SKIP, &mm->flags))
 			goto next;
 		goto abort;
 	}
@@ -671,6 +677,9 @@ static void mark_oom_victim(struct task_struct *tsk)
 	/* oom_mm is bound to the signal struct life time. */
 	if (!cmpxchg(&tsk->signal->oom_mm, NULL, mm))
 		mmgrab(tsk->signal->oom_mm);
+#ifndef CONFIG_MMU
+	mm->oom_victim_start = jiffies;
+#endif
 
 	/*
 	 * Make sure that the task is woken up from uninterruptible sleep
@@ -791,6 +800,11 @@ static bool task_will_free_mem(struct task_struct *task)
 	 * This task has already been drained by the oom reaper so there are
 	 * only small chances it will free some more
 	 */
+#ifndef CONFIG_MMU
+	if (tsk_is_oom_victim(task) &&
+	    time_after(jiffies, mm->oom_victim_start + HZ))
+		set_bit(MMF_OOM_SKIP, &mm->flags);
+#endif
 	if (test_bit(MMF_OOM_SKIP, &mm->flags))
 		return false;
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index fbbc95a..ff435f7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3711,17 +3711,7 @@ static void wake_all_kswapds(unsigned int order, const struct alloc_context *ac)
 
 static bool oom_reserves_allowed(struct task_struct *tsk)
 {
-	if (!tsk_is_oom_victim(tsk))
-		return false;
-
-	/*
-	 * !MMU doesn't have oom reaper so give access to memory reserves
-	 * only to the thread with TIF_MEMDIE set
-	 */
-	if (!IS_ENABLED(CONFIG_MMU) && !test_thread_flag(TIF_MEMDIE))
-		return false;
-
-	return true;
+	return tsk_is_oom_victim(tsk);
 }
 
 /*
-- 
1.8.3.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH 5/5] nommu,oom: Set MMF_OOM_SKIP without waiting for termination.
  2017-11-09 10:49     ` Tetsuo Handa
@ 2017-11-09 11:27       ` Michal Hocko
  0 siblings, 0 replies; 18+ messages in thread
From: Michal Hocko @ 2017-11-09 11:27 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: akpm, linux-mm, mgorman

On Thu 09-11-17 19:49:16, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Wed 08-11-17 20:01:48, Tetsuo Handa wrote:
> > > Commit 212925802454672e ("mm: oom: let oom_reap_task and exit_mmap run
> > > concurrently") moved the location of setting MMF_OOM_SKIP from __mmput()
> > > in kernel/fork.c (which is used by both MMU and !MMU) to exit_mm() in
> > > mm/mmap.c (which is used by MMU only). As a result, that commit required
> > > OOM victims in !MMU kernels to disappear from the task list in order to
> > > reenable the OOM killer, for !MMU kernels can no longer set MMF_OOM_SKIP
> > > (unless the OOM victim's mm is shared with global init process).
> > 
> > nack withtout demonstrating that the problem is real. It is true it
> > removes some lines but this is mostly this...
> 
> Then, it is impossible unless somebody volunteers proving it.
> I'm not a nommu kernel user.

Do not convolute the code for a non-existent problem. Full stop.
 
[...]
> > On Wed 08-11-17 20:01:48, Tetsuo Handa wrote:
> > [...]
> > > @@ -829,7 +831,7 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
> > >  	unsigned int victim_points = 0;
> > >  	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
> > >  					      DEFAULT_RATELIMIT_BURST);
> > > -	bool can_oom_reap = true;
> > > +	bool can_oom_reap = IS_ENABLED(CONFIG_MMU);
> > >  
> > >  	/*
> > >  	 * If the task is already exiting, don't alarm the sysadmin or kill
> > > @@ -929,7 +931,6 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
> > >  			continue;
> > >  		if (is_global_init(p)) {
> > >  			can_oom_reap = false;
> > > -			set_bit(MMF_OOM_SKIP, &mm->flags);
> > >  			pr_info("oom killer %d (%s) has mm pinned by %d (%s)\n",
> > >  					task_pid_nr(victim), victim->comm,
> > >  					task_pid_nr(p), p->comm);
> > > @@ -947,6 +948,8 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
> > >  
> > >  	if (can_oom_reap)
> > >  		wake_oom_reaper(victim);
> > > +	else
> > > +		set_bit(MMF_OOM_SKIP, &mm->flags);
> > >  
> > >  	mmdrop(mm);
> > >  	put_task_struct(victim);
> > 
> > Also this looks completely broken. nommu kernels lose the premature oom
> > killing protection almost completely (they simply rely on the sleep
> > before dropping the oom_lock).
> > 
> 
> If you are worrying that setting MMF_OOM_SKIP immediately might cause
> premature OOM killing), what we would afford is timeout-based approach
> shown below, for it will be a waste of resource to add the OOM reaper kernel
> thread which does nothing but setting MMF_OOM_SKIP.

No! See above
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/5] mm,page_alloc: Update comment for last second allocation attempt.
  2017-11-08 11:01 [PATCH 1/5] mm,page_alloc: Update comment for last second allocation attempt Tetsuo Handa
                   ` (3 preceding siblings ...)
  2017-11-08 11:01 ` [PATCH 5/5] nommu,oom: Set MMF_OOM_SKIP without waiting for termination Tetsuo Handa
@ 2017-11-08 14:50 ` Michal Hocko
  2017-11-09 10:45   ` Tetsuo Handa
  4 siblings, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2017-11-08 14:50 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: akpm, linux-mm, Andrea Arcangeli, Johannes Weiner

On Wed 08-11-17 20:01:44, Tetsuo Handa wrote:
> __alloc_pages_may_oom() is doing last second allocation attempt using
> ALLOC_WMARK_HIGH before calling out_of_memory(). This had two reasons.
> 
> The first reason is explained in the comment that it aims to catch
> potential parallel OOM killing. But there is no longer parallel OOM
> killing (in the sense that out_of_memory() is called "concurrently")
> because we serialize out_of_memory() calls using oom_lock.
> 
> The second reason is explained by Andrea Arcangeli (who added that code)
> that it aims to reduce the likelihood of OOM livelocks and be sure to
> invoke the OOM killer. There was a risk of livelock or anyway of delayed
> OOM killer invocation if ALLOC_WMARK_MIN is used, for relying on last
> few pages which are constantly allocated and freed in the meantime will
> not improve the situation.

> But there is no longer possibility of OOM
> livelocks or failing to invoke the OOM killer because we need to mask
> __GFP_DIRECT_RECLAIM for last second allocation attempt because oom_lock
> prevents __GFP_DIRECT_RECLAIM && !__GFP_NORETRY allocations which last
> second allocation attempt indirectly involve from failing.

This is an unfounded, misleading and actually even wrong statement that
has nothing to do with what Andrea had in mind. __GFP_DIRECT_RECLAIM
doesn't have anything to do with the livelock as I've already mentioned
several times already.

> Since the OOM killer does not always kill a process consuming significant
> amount of memory (the OOM killer kills a process with highest OOM score
> (or instead one of its children if any)), there will be cases where
> ALLOC_WMARK_HIGH fails and ALLOC_WMARK_MIN succeeds.

This is possible but not really interesting case as already explained.

> Since the gap between ALLOC_WMARK_HIGH and ALLOC_WMARK_MIN can be changed
> by /proc/sys/vm/min_free_kbytes parameter, using ALLOC_WMARK_MIN for last
> second allocation attempt might be better for minimizing number of OOM
> victims. But that change should be done in a separate patch. This patch
> just clarifies that ALLOC_WMARK_HIGH is an arbitrary choice.

Again unfounded claim.

That being said, the comment removing a note about parallel oom killing
is OK. I am not sure this is something worth a separate patch. The
changelog is just wrong and so Nack to the patch.

> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  mm/page_alloc.c | 9 ++++-----
>  1 file changed, 4 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 536431b..613814c 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3341,11 +3341,10 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
>  	}
>  
>  	/*
> -	 * Go through the zonelist yet one more time, keep very high watermark
> -	 * here, this is only to catch a parallel oom killing, we must fail if
> -	 * we're still under heavy pressure. But make sure that this reclaim
> -	 * attempt shall not depend on __GFP_DIRECT_RECLAIM && !__GFP_NORETRY
> -	 * allocation which will never fail due to oom_lock already held.
> +	 * This allocation attempt must not depend on __GFP_DIRECT_RECLAIM &&
> +	 * !__GFP_NORETRY allocation which will never fail due to oom_lock
> +	 * already held. And since this allocation attempt does not sleep,
> +	 * there is no reason we must use high watermark here.
>  	 */
>  	page = get_page_from_freelist((gfp_mask | __GFP_HARDWALL) &
>  				      ~__GFP_DIRECT_RECLAIM, order,
> -- 
> 1.8.3.1
> 

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/5] mm,page_alloc: Update comment for last second allocation attempt.
  2017-11-08 14:50 ` [PATCH 1/5] mm,page_alloc: Update comment for last second allocation attempt Michal Hocko
@ 2017-11-09 10:45   ` Tetsuo Handa
  2017-11-09 11:30     ` Michal Hocko
  0 siblings, 1 reply; 18+ messages in thread
From: Tetsuo Handa @ 2017-11-09 10:45 UTC (permalink / raw)
  To: mhocko; +Cc: akpm, linux-mm, aarcange, hannes

Michal Hocko wrote:
> On Wed 08-11-17 20:01:44, Tetsuo Handa wrote:
> > __alloc_pages_may_oom() is doing last second allocation attempt using
> > ALLOC_WMARK_HIGH before calling out_of_memory(). This had two reasons.
> > 
> > The first reason is explained in the comment that it aims to catch
> > potential parallel OOM killing. But there is no longer parallel OOM
> > killing (in the sense that out_of_memory() is called "concurrently")
> > because we serialize out_of_memory() calls using oom_lock.
> > 
> > The second reason is explained by Andrea Arcangeli (who added that code)
> > that it aims to reduce the likelihood of OOM livelocks and be sure to
> > invoke the OOM killer. There was a risk of livelock or anyway of delayed
> > OOM killer invocation if ALLOC_WMARK_MIN is used, for relying on last
> > few pages which are constantly allocated and freed in the meantime will
> > not improve the situation.

Above part is OK, isn't it?

> 
> > But there is no longer possibility of OOM
> > livelocks or failing to invoke the OOM killer because we need to mask
> > __GFP_DIRECT_RECLAIM for last second allocation attempt because oom_lock
> > prevents __GFP_DIRECT_RECLAIM && !__GFP_NORETRY allocations which last
> > second allocation attempt indirectly involve from failing.
> 
> This is an unfounded, misleading and actually even wrong statement that
> has nothing to do with what Andrea had in mind. __GFP_DIRECT_RECLAIM
> doesn't have anything to do with the livelock as I've already mentioned
> several times already.

I know that this part is not what Andrea had in mind when he added this comment.
What I'm saying is that "precondition has changed after Andrea added this comment"
and "these reasons which Andrea had in mind when he added this comment no longer
holds". I'm posting "for the record" purpose in order to describe reasons for
current code.

When we introduced oom_lock (or formerly the per-zone oom lock) for serializing invocation
of the OOM killer, we introduced two bugs at the same time. One bug is that since doing
__GFP_DIRECT_RECLAIM with oom_lock held can make __GFP_DIRECT_RECLAIM && !__GFP_NORETRY
allocations (which __GFP_DIRECT_RECLAIM indirectly involved) lockup, we need to avoid
__GFP_DIRECT_RECLAIM allocations with oom_lock held. This is why commit e746bf730a76fe53
("mm,page_alloc: don't call __node_reclaim() with oom_lock held.") was made. This in turn
forbids using __GFP_DIRECT_RECLAIM for last second allocation attempt which was not
forbidden when Andrea added this comment.

( The other bug is that we assumed that somebody is making progress for us when
mutex_trylock(&oom_lock) in __alloc_pages_may_oom() failed, for we did not take
scheduling priority into account when we introduced oom_lock. But the other bug
is not what I'm writing in this patch. You can forget about the other bug
regarding this patch. )

> 
> > Since the OOM killer does not always kill a process consuming significant
> > amount of memory (the OOM killer kills a process with highest OOM score
> > (or instead one of its children if any)), there will be cases where
> > ALLOC_WMARK_HIGH fails and ALLOC_WMARK_MIN succeeds.
> 
> This is possible but not really interesting case as already explained.
> 
> > Since the gap between ALLOC_WMARK_HIGH and ALLOC_WMARK_MIN can be changed
> > by /proc/sys/vm/min_free_kbytes parameter, using ALLOC_WMARK_MIN for last
> > second allocation attempt might be better for minimizing number of OOM
> > victims. But that change should be done in a separate patch. This patch
> > just clarifies that ALLOC_WMARK_HIGH is an arbitrary choice.
> 
> Again unfounded claim.

Since use of __GFP_DIRECT_RECLAIM for last second allocation attempt is now
forbidden due to oom_lock already held, possibility of failing last allocation
attempt has increased compared to when Andrea added this comment. Andrea said

  The high wmark is used to be sure the failure of reclaim isn't going to be
  ignored. If using the min wmark like you propose there's risk of livelock or
  anyway of delayed OOM killer invocation.

but there is no longer possibility of OOM livelock because __GFP_DIRECT_RECLAIM
is masked. Therefore, while using ALLOC_WMARK_HIGH might made sense before we
introduced oom_lock, using ALLOC_WMARK_HIGH no longer has strong background after
we introduced oom_lock. Therefore, I'm updating the comment in the source code,
with a suggestion in the changelog that ALLOC_WMARK_MIN might be better for
current code, in order to help someone who find this patch 5 or 10 years future
can figure out why we are using ALLOC_WMARK_HIGH (like you did at
http://lkml.kernel.org/r/20160128163802.GA15953@dhcp22.suse.cz ).

> 
> That being said, the comment removing a note about parallel oom killing
> is OK. I am not sure this is something worth a separate patch. The
> changelog is just wrong and so Nack to the patch.

So, I believe that the changelog is not wrong, and I don't want to preserve

  keep very high watermark here, this is only to catch a parallel oom killing,
  we must fail if we're still under heavy pressure

part which lost strong background.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/5] mm,page_alloc: Update comment for last second allocation attempt.
  2017-11-09 10:45   ` Tetsuo Handa
@ 2017-11-09 11:30     ` Michal Hocko
  2017-11-09 12:19       ` Tetsuo Handa
  0 siblings, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2017-11-09 11:30 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: akpm, linux-mm, aarcange, hannes

On Thu 09-11-17 19:45:04, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Wed 08-11-17 20:01:44, Tetsuo Handa wrote:
> > > __alloc_pages_may_oom() is doing last second allocation attempt using
> > > ALLOC_WMARK_HIGH before calling out_of_memory(). This had two reasons.
> > > 
> > > The first reason is explained in the comment that it aims to catch
> > > potential parallel OOM killing. But there is no longer parallel OOM
> > > killing (in the sense that out_of_memory() is called "concurrently")
> > > because we serialize out_of_memory() calls using oom_lock.
> > > 
> > > The second reason is explained by Andrea Arcangeli (who added that code)
> > > that it aims to reduce the likelihood of OOM livelocks and be sure to
> > > invoke the OOM killer. There was a risk of livelock or anyway of delayed
> > > OOM killer invocation if ALLOC_WMARK_MIN is used, for relying on last
> > > few pages which are constantly allocated and freed in the meantime will
> > > not improve the situation.
> 
> Above part is OK, isn't it?
> 
> > 
> > > But there is no longer possibility of OOM
> > > livelocks or failing to invoke the OOM killer because we need to mask
> > > __GFP_DIRECT_RECLAIM for last second allocation attempt because oom_lock
> > > prevents __GFP_DIRECT_RECLAIM && !__GFP_NORETRY allocations which last
> > > second allocation attempt indirectly involve from failing.
> > 
> > This is an unfounded, misleading and actually even wrong statement that
> > has nothing to do with what Andrea had in mind. __GFP_DIRECT_RECLAIM
> > doesn't have anything to do with the livelock as I've already mentioned
> > several times already.
> 
> I know that this part is not what Andrea had in mind when he added this comment.
> What I'm saying is that "precondition has changed after Andrea added this comment"
> and "these reasons which Andrea had in mind when he added this comment no longer
> holds". I'm posting "for the record" purpose in order to describe reasons for
> current code.
> 
> When we introduced oom_lock (or formerly the per-zone oom lock) for serializing invocation
> of the OOM killer, we introduced two bugs at the same time. One bug is that since doing
> __GFP_DIRECT_RECLAIM with oom_lock held can make __GFP_DIRECT_RECLAIM && !__GFP_NORETRY
> allocations (which __GFP_DIRECT_RECLAIM indirectly involved) lockup, we need to avoid
> __GFP_DIRECT_RECLAIM allocations with oom_lock held. This is why commit e746bf730a76fe53
> ("mm,page_alloc: don't call __node_reclaim() with oom_lock held.") was made. This in turn
> forbids using __GFP_DIRECT_RECLAIM for last second allocation attempt which was not
> forbidden when Andrea added this comment.

But this has anything to do with the original motivation for the high
watermark allocation.
 
> ( The other bug is that we assumed that somebody is making progress for us when
> mutex_trylock(&oom_lock) in __alloc_pages_may_oom() failed, for we did not take
> scheduling priority into account when we introduced oom_lock. But the other bug
> is not what I'm writing in this patch. You can forget about the other bug
> regarding this patch. )
> 
> > 
> > > Since the OOM killer does not always kill a process consuming significant
> > > amount of memory (the OOM killer kills a process with highest OOM score
> > > (or instead one of its children if any)), there will be cases where
> > > ALLOC_WMARK_HIGH fails and ALLOC_WMARK_MIN succeeds.
> > 
> > This is possible but not really interesting case as already explained.
> > 
> > > Since the gap between ALLOC_WMARK_HIGH and ALLOC_WMARK_MIN can be changed
> > > by /proc/sys/vm/min_free_kbytes parameter, using ALLOC_WMARK_MIN for last
> > > second allocation attempt might be better for minimizing number of OOM
> > > victims. But that change should be done in a separate patch. This patch
> > > just clarifies that ALLOC_WMARK_HIGH is an arbitrary choice.
> > 
> > Again unfounded claim.
> 
> Since use of __GFP_DIRECT_RECLAIM for last second allocation attempt is now
> forbidden due to oom_lock already held, possibility of failing last allocation
> attempt has increased compared to when Andrea added this comment. Andrea said
> 
>   The high wmark is used to be sure the failure of reclaim isn't going to be
>   ignored. If using the min wmark like you propose there's risk of livelock or
>   anyway of delayed OOM killer invocation.

Wrong. It just takes an unrelated single page alloc/free loop to prevent
from the oom killer invocation.
 
[...]
> So, I believe that the changelog is not wrong, and I don't want to preserve
> 
>   keep very high watermark here, this is only to catch a parallel oom killing,
>   we must fail if we're still under heavy pressure
> 
> part which lost strong background.

I do not see how. You simply do not address the original concern Andrea
had and keep repeating unrelated stuff.

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/5] mm,page_alloc: Update comment for last second allocation attempt.
  2017-11-09 11:30     ` Michal Hocko
@ 2017-11-09 12:19       ` Tetsuo Handa
  2017-11-09 12:25         ` Michal Hocko
  0 siblings, 1 reply; 18+ messages in thread
From: Tetsuo Handa @ 2017-11-09 12:19 UTC (permalink / raw)
  To: mhocko; +Cc: akpm, linux-mm, aarcange, hannes

Michal Hocko wrote:
> > So, I believe that the changelog is not wrong, and I don't want to preserve
> > 
> >   keep very high watermark here, this is only to catch a parallel oom killing,
> >   we must fail if we're still under heavy pressure
> > 
> > part which lost strong background.
> 
> I do not see how. You simply do not address the original concern Andrea
> had and keep repeating unrelated stuff.

What does "address the original concern Andrea had" mean?
I'm still thinking that the original concern Andrea had is no longer
valid in the current code because precondition has changed.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/5] mm,page_alloc: Update comment for last second allocation attempt.
  2017-11-09 12:19       ` Tetsuo Handa
@ 2017-11-09 12:25         ` Michal Hocko
  2017-11-09 12:32           ` Michal Hocko
  0 siblings, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2017-11-09 12:25 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: akpm, linux-mm, aarcange, hannes

On Thu 09-11-17 21:19:24, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > > So, I believe that the changelog is not wrong, and I don't want to preserve
> > > 
> > >   keep very high watermark here, this is only to catch a parallel oom killing,
> > >   we must fail if we're still under heavy pressure
> > > 
> > > part which lost strong background.
> > 
> > I do not see how. You simply do not address the original concern Andrea
> > had and keep repeating unrelated stuff.
> 
> What does "address the original concern Andrea had" mean?
> I'm still thinking that the original concern Andrea had is no longer
> valid in the current code because precondition has changed.

I am sorry but I am not going to repeat myself.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/5] mm,page_alloc: Update comment for last second allocation attempt.
  2017-11-09 12:25         ` Michal Hocko
@ 2017-11-09 12:32           ` Michal Hocko
  0 siblings, 0 replies; 18+ messages in thread
From: Michal Hocko @ 2017-11-09 12:32 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: akpm, linux-mm, aarcange, hannes

On Thu 09-11-17 13:25:19, Michal Hocko wrote:
> On Thu 09-11-17 21:19:24, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > > So, I believe that the changelog is not wrong, and I don't want to preserve
> > > > 
> > > >   keep very high watermark here, this is only to catch a parallel oom killing,
> > > >   we must fail if we're still under heavy pressure
> > > > 
> > > > part which lost strong background.
> > > 
> > > I do not see how. You simply do not address the original concern Andrea
> > > had and keep repeating unrelated stuff.
> > 
> > What does "address the original concern Andrea had" mean?
> > I'm still thinking that the original concern Andrea had is no longer
> > valid in the current code because precondition has changed.
> 
> I am sorry but I am not going to repeat myself.

In any case, if you want to change high->low watermark for the last
allocation then it deserves a separate patch with the justification,
user visible changes. All you do here is to make the comment disagree
with the code which is not an improvement at all. Quite contrary I would
dare to say.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2017-11-09 12:32 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-11-08 11:01 [PATCH 1/5] mm,page_alloc: Update comment for last second allocation attempt Tetsuo Handa
2017-11-08 11:01 ` [PATCH 2/5] mm,oom: Move last second allocation to inside the OOM killer Tetsuo Handa
2017-11-08 14:50   ` Michal Hocko
2017-11-08 11:01 ` [PATCH 3/5] mm,oom: Use ALLOC_OOM for OOM victim's last second allocation Tetsuo Handa
2017-11-08 14:50   ` Michal Hocko
2017-11-08 11:01 ` [PATCH 4/5] mm,oom: Remove oom_lock serialization from the OOM reaper Tetsuo Handa
2017-11-08 15:03   ` Michal Hocko
2017-11-08 11:01 ` [PATCH 5/5] nommu,oom: Set MMF_OOM_SKIP without waiting for termination Tetsuo Handa
2017-11-08 14:56   ` Michal Hocko
2017-11-08 16:24   ` Michal Hocko
2017-11-09 10:49     ` Tetsuo Handa
2017-11-09 11:27       ` Michal Hocko
2017-11-08 14:50 ` [PATCH 1/5] mm,page_alloc: Update comment for last second allocation attempt Michal Hocko
2017-11-09 10:45   ` Tetsuo Handa
2017-11-09 11:30     ` Michal Hocko
2017-11-09 12:19       ` Tetsuo Handa
2017-11-09 12:25         ` Michal Hocko
2017-11-09 12:32           ` Michal Hocko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).