* [PATCH] mm,oom: Try last second allocation before and after selecting an OOM victim.
@ 2017-10-28  8:07 Tetsuo Handa
  2017-10-30 14:18 ` Michal Hocko
From: Tetsuo Handa @ 2017-10-28  8:07 UTC (permalink / raw)
  To: akpm
  Cc: linux-mm, linux-kernel, Tetsuo Handa, Andrea Arcangeli,
	David Rientjes, Johannes Weiner, Manish Jaggi, Mel Gorman,
	Michal Hocko, Oleg Nesterov, Vladimir Davydov, Vlastimil Babka

This patch splits the last second allocation attempt into two locations, once
before selecting an OOM victim and again after selecting an OOM victim,
and uses the normal watermark for both last second allocation attempts.

As of linux-2.6.11, nothing prevented out_of_memory() from being called
concurrently. The TIF_MEMDIE test in select_bad_process() tried to avoid
needless OOM killing. Thus, it was safe to do a __GFP_DIRECT_RECLAIM
allocation (apart from the question of which watermark should be used) just
before calling out_of_memory().

As of linux-2.6.24, try_set_zone_oom() was added to
__alloc_pages_may_oom() by commit ff0ceb9deb6eb017 ("oom: serialize out
of memory calls") which effectively started acting as a kind of today's
mutex_trylock(&oom_lock).

As of linux-4.2, try_set_zone_oom() was replaced with oom_lock by
commit dc56401fc9f25e8f ("mm: oom_kill: simplify OOM killer locking").
At least by this time, it was no longer safe to do a
__GFP_DIRECT_RECLAIM allocation with oom_lock held.

And as of linux-4.13, last second allocation attempt stopped using
__GFP_DIRECT_RECLAIM by commit e746bf730a76fe53 ("mm,page_alloc: don't
call __node_reclaim() with oom_lock held.").

Therefore, there is no longer a valid reason to use ALLOC_WMARK_HIGH for the
last second allocation attempt [1]. This patch changes it to a normal
allocation attempt, with handling of ALLOC_OOM added in order to mitigate the
extra OOM victim selection problem reported by Manish Jaggi [2].

Doing a really last second allocation attempt after selecting an OOM victim
will also help the OOM reaper to start reclaiming memory without waiting
for oom_lock to be released.

[1] http://lkml.kernel.org/r/20160128163802.GA15953@dhcp22.suse.cz
[2] http://lkml.kernel.org/r/e6c83a26-1d59-4afd-55cf-04e58bdde188@caviumnetworks.com

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Fixes: 696453e66630ad45 ("mm, oom: task_will_free_mem should skip oom_reaped tasks")
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Manish Jaggi <mjaggi@caviumnetworks.com>
---
 include/linux/oom.h | 13 +++++++++++++
 mm/oom_kill.c       | 13 +++++++++++++
 mm/page_alloc.c     | 47 ++++++++++++++++++++++++++++++++++-------------
 3 files changed, 60 insertions(+), 13 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 76aac4c..eb92aa8 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -13,6 +13,8 @@
 struct notifier_block;
 struct mem_cgroup;
 struct task_struct;
+struct alloc_context;
+struct page;
 
 /*
  * Details of the page allocation that triggered the oom killer that are used to
@@ -37,6 +39,15 @@ struct oom_control {
 	 */
 	const int order;
 
+	/* Context for really last second allocation attempt. */
+	struct alloc_context *ac;
+	/*
+	 * Set by the OOM killer if ac != NULL and last second allocation
+	 * attempt succeeded. If ac != NULL, the caller must check for
+	 * page != NULL.
+	 */
+	struct page *page;
+
 	/* Used by oom implementation, do not set */
 	unsigned long totalpages;
 	struct task_struct *chosen;
@@ -101,6 +112,8 @@ extern unsigned long oom_badness(struct task_struct *p,
 
 extern struct task_struct *find_lock_task_mm(struct task_struct *p);
 
+extern struct page *alloc_pages_before_oomkill(struct oom_control *oc);
+
 /* sysctls */
 extern int sysctl_oom_dump_tasks;
 extern int sysctl_oom_kill_allocating_task;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 26add8a..dcde1d5 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -870,6 +870,19 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 	}
 	task_unlock(p);
 
+	/*
+	 * Try really last second allocation attempt after we selected an OOM
+	 * victim, for somebody might have managed to free memory while we were
+	 * selecting an OOM victim which can take quite some time.
+	 */
+	if (oc->ac) {
+		oc->page = alloc_pages_before_oomkill(oc);
+		if (oc->page) {
+			put_task_struct(p);
+			return;
+		}
+	}
+
 	if (__ratelimit(&oom_rs))
 		dump_header(oc, p);
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 97687b3..ba0ef7b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3265,7 +3265,7 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
 
 static inline struct page *
 __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
-	const struct alloc_context *ac, unsigned long *did_some_progress)
+	struct alloc_context *ac, unsigned long *did_some_progress)
 {
 	struct oom_control oc = {
 		.zonelist = ac->zonelist,
@@ -3273,6 +3273,7 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
 		.memcg = NULL,
 		.gfp_mask = gfp_mask,
 		.order = order,
+		.ac = ac,
 	};
 	struct page *page;
 
@@ -3289,15 +3290,11 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
 	}
 
 	/*
-	 * Go through the zonelist yet one more time, keep very high watermark
-	 * here, this is only to catch a parallel oom killing, we must fail if
-	 * we're still under heavy pressure. But make sure that this reclaim
-	 * attempt shall not depend on __GFP_DIRECT_RECLAIM && !__GFP_NORETRY
-	 * allocation which will never fail due to oom_lock already held.
+	 * Try almost last second allocation attempt before we select an OOM
+	 * victim, for somebody might have managed to free memory or the OOM
+	 * killer might have called mark_oom_victim(current).
 	 */
-	page = get_page_from_freelist((gfp_mask | __GFP_HARDWALL) &
-				      ~__GFP_DIRECT_RECLAIM, order,
-				      ALLOC_WMARK_HIGH|ALLOC_CPUSET, ac);
+	page = alloc_pages_before_oomkill(&oc);
 	if (page)
 		goto out;
 
@@ -3335,16 +3332,18 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
 		goto out;
 
 	/* Exhausted what can be done so it's blamo time */
-	if (out_of_memory(&oc) || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) {
+	if (out_of_memory(&oc)) {
+		*did_some_progress = 1;
+		page = oc.page;
+	} else if (WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) {
 		*did_some_progress = 1;
 
 		/*
 		 * Help non-failing allocations by giving them access to memory
 		 * reserves
 		 */
-		if (gfp_mask & __GFP_NOFAIL)
-			page = __alloc_pages_cpuset_fallback(gfp_mask, order,
-					ALLOC_NO_WATERMARKS, ac);
+		page = __alloc_pages_cpuset_fallback(gfp_mask, order,
+						     ALLOC_NO_WATERMARKS, ac);
 	}
 out:
 	mutex_unlock(&oom_lock);
@@ -4114,6 +4113,28 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 	return page;
 }
 
+struct page *alloc_pages_before_oomkill(struct oom_control *oc)
+{
+	/*
+	 * Make sure that this allocation attempt shall not depend on
+	 * __GFP_DIRECT_RECLAIM && !__GFP_NORETRY allocation, for the caller is
+	 * already holding oom_lock.
+	 */
+	const gfp_t gfp_mask = oc->gfp_mask & ~__GFP_DIRECT_RECLAIM;
+	struct alloc_context *ac = oc->ac;
+	unsigned int alloc_flags = gfp_to_alloc_flags(gfp_mask);
+	const int reserve_flags = __gfp_pfmemalloc_flags(gfp_mask);
+
+	/* Need to update zonelist if selected as OOM victim. */
+	if (reserve_flags) {
+		alloc_flags = reserve_flags;
+		ac->zonelist = node_zonelist(numa_node_id(), gfp_mask);
+		ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
+					ac->high_zoneidx, ac->nodemask);
+	}
+	return get_page_from_freelist(gfp_mask, oc->order, alloc_flags, ac);
+}
+
 static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
 		int preferred_nid, nodemask_t *nodemask,
 		struct alloc_context *ac, gfp_t *alloc_mask,
-- 
1.8.3.1


* Re: [PATCH] mm,oom: Try last second allocation before and after selecting an OOM victim.
  2017-10-28  8:07 [PATCH] mm,oom: Try last second allocation before and after selecting an OOM victim Tetsuo Handa
@ 2017-10-30 14:18 ` Michal Hocko
  2017-10-31 10:40   ` Tetsuo Handa
From: Michal Hocko @ 2017-10-30 14:18 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, linux-mm, linux-kernel, Andrea Arcangeli, David Rientjes,
	Johannes Weiner, Manish Jaggi, Mel Gorman, Oleg Nesterov,
	Vladimir Davydov, Vlastimil Babka

On Sat 28-10-17 17:07:09, Tetsuo Handa wrote:
> This patch splits last second allocation attempt into two locations, once
> before selecting an OOM victim and again after selecting an OOM victim,
> and uses normal watermark for last second allocation attempts.

Why do we need both?

> As of linux-2.6.11, nothing prevented from concurrently calling
> out_of_memory(). TIF_MEMDIE test in select_bad_process() tried to avoid
> needless OOM killing. Thus, it was safe to do __GFP_DIRECT_RECLAIM
> allocation (apart from which watermark should be used) just before
> calling out_of_memory().
> 
> As of linux-2.6.24, try_set_zone_oom() was added to
> __alloc_pages_may_oom() by commit ff0ceb9deb6eb017 ("oom: serialize out
> of memory calls") which effectively started acting as a kind of today's
> mutex_trylock(&oom_lock).
> 
> As of linux-4.2, try_set_zone_oom() was replaced with oom_lock by
> commit dc56401fc9f25e8f ("mm: oom_kill: simplify OOM killer locking").
> At least by this time, it became no longer safe to do
> __GFP_DIRECT_RECLAIM allocation with oom_lock held.
> 
> And as of linux-4.13, last second allocation attempt stopped using
> __GFP_DIRECT_RECLAIM by commit e746bf730a76fe53 ("mm,page_alloc: don't
> call __node_reclaim() with oom_lock held.").
> 
> Therefore, there is no longer valid reason to use ALLOC_WMARK_HIGH for
> last second allocation attempt [1].

Another reason to use the high watermark as explained by Andrea was
"
: Elaborating the comment: the reason for the high wmark is to reduce
: the likelihood of livelocks and be sure to invoke the OOM killer, if
: we're still under pressure and reclaim just failed. The high wmark is
: used to be sure the failure of reclaim isn't going to be ignored. If
: using the min wmark like you propose there's risk of livelock or
: anyway of delayed OOM killer invocation.
"

How is that affected by the changes in locking you describe above?

> And this patch changes to do normal
> allocation attempt, with handling of ALLOC_OOM added in order to mitigate
> extra OOM victim selection problem reported by Manish Jaggi [2].
> 
> Doing really last second allocation attempt after selecting an OOM victim
> will also help the OOM reaper to start reclaiming memory without waiting
> for oom_lock to be released.

The changelog is much more obscure than it really needs to be. You fail
to explain _why_ we need this and _what_ the actual problem is. You
are simply drowning in details here (btw. this is not the first time
your changelog has had this issue). Try to focus on _what_ the problem is,
_why_ we care and _how_ you are addressing it.
 
[...]

> +struct page *alloc_pages_before_oomkill(struct oom_control *oc)
> +{
> +	/*
> +	 * Make sure that this allocation attempt shall not depend on
> +	 * __GFP_DIRECT_RECLAIM && !__GFP_NORETRY allocation, for the caller is
> +	 * already holding oom_lock.
> +	 */
> +	const gfp_t gfp_mask = oc->gfp_mask & ~__GFP_DIRECT_RECLAIM;
> +	struct alloc_context *ac = oc->ac;
> +	unsigned int alloc_flags = gfp_to_alloc_flags(gfp_mask);
> +	const int reserve_flags = __gfp_pfmemalloc_flags(gfp_mask);
> +
> +	/* Need to update zonelist if selected as OOM victim. */
> +	if (reserve_flags) {
> +		alloc_flags = reserve_flags;
> +		ac->zonelist = node_zonelist(numa_node_id(), gfp_mask);
> +		ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
> +					ac->high_zoneidx, ac->nodemask);
> +	}

Why do we need this zone list rebuilding?

> +	return get_page_from_freelist(gfp_mask, oc->order, alloc_flags, ac);
> +}
> +
>  static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
>  		int preferred_nid, nodemask_t *nodemask,
>  		struct alloc_context *ac, gfp_t *alloc_mask,
> -- 
> 1.8.3.1

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm,oom: Try last second allocation before and after selecting an OOM victim.
  2017-10-30 14:18 ` Michal Hocko
@ 2017-10-31 10:40   ` Tetsuo Handa
  2017-10-31 12:10     ` Michal Hocko
From: Tetsuo Handa @ 2017-10-31 10:40 UTC (permalink / raw)
  To: mhocko, aarcange
  Cc: akpm, linux-mm, linux-kernel, rientjes, hannes, mjaggi, mgorman,
	oleg, vdavydov.dev, vbabka

Michal Hocko wrote:
> > +struct page *alloc_pages_before_oomkill(struct oom_control *oc)
> > +{
> > +	/*
> > +	 * Make sure that this allocation attempt shall not depend on
> > +	 * __GFP_DIRECT_RECLAIM && !__GFP_NORETRY allocation, for the caller is
> > +	 * already holding oom_lock.
> > +	 */
> > +	const gfp_t gfp_mask = oc->gfp_mask & ~__GFP_DIRECT_RECLAIM;
> > +	struct alloc_context *ac = oc->ac;
> > +	unsigned int alloc_flags = gfp_to_alloc_flags(gfp_mask);
> > +	const int reserve_flags = __gfp_pfmemalloc_flags(gfp_mask);
> > +
> > +	/* Need to update zonelist if selected as OOM victim. */
> > +	if (reserve_flags) {
> > +		alloc_flags = reserve_flags;
> > +		ac->zonelist = node_zonelist(numa_node_id(), gfp_mask);
> > +		ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
> > +					ac->high_zoneidx, ac->nodemask);
> > +	}
> 
> Why do we need this zone list rebuilding?

Why do we _not_ need this zone list rebuilding?

The reason I used __alloc_pages_slowpath() in alloc_pages_before_oomkill() is
to avoid duplicating code (such as checking for ALLOC_OOM and rebuilding the
zone list) which needs to be maintained in sync with __alloc_pages_slowpath().
If you don't like calling __alloc_pages_slowpath() from
alloc_pages_before_oomkill(), I'm OK with calling __alloc_pages_nodemask()
(with __GFP_DIRECT_RECLAIM/__GFP_NOFAIL cleared and __GFP_NOWARN set), since
direct reclaim functions can call __alloc_pages_nodemask() (with PF_MEMALLOC
set in order to avoid recursion of direct reclaim).

We are rebuilding the zone list if current was selected as an OOM victim,
because __gfp_pfmemalloc_flags() returns ALLOC_OOM if oom_reserves_allowed(current)
is true.

----------
static inline struct page *
__alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
                       struct alloc_context *ac)
{
(...snipped...)
retry:
(...snipped...)
	reserve_flags = __gfp_pfmemalloc_flags(gfp_mask);
	if (reserve_flags)
		alloc_flags = reserve_flags;

	/*
	 * Reset the zonelist iterators if memory policies can be ignored.
	 * These allocations are high priority and system rather than user
	 * orientated.
	 */
	if (!(alloc_flags & ALLOC_CPUSET) || reserve_flags) {
		ac->zonelist = node_zonelist(numa_node_id(), gfp_mask);
		ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
							     ac->high_zoneidx, ac->nodemask);
	}

	/* Attempt with potentially adjusted zonelist and alloc_flags */
	page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
	if (page)
		goto got_pg;

	/* Caller is not willing to reclaim, we can't balance anything */
	if (!can_direct_reclaim)
		goto nopage;
(...snipped...)
	/* Reclaim has failed us, start killing things */
	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
	if (page)
		goto got_pg;

	/* Avoid allocations with no watermarks from looping endlessly */
	if (tsk_is_oom_victim(current) &&
	    (alloc_flags == ALLOC_OOM ||
	     (gfp_mask & __GFP_NOMEMALLOC)))
		goto nopage;

	/* Retry as long as the OOM killer is making progress */
	if (did_some_progress) {
		no_progress_loops = 0;
		goto retry;
	}
(...snipped...)
}
----------

The reason I'm proposing this "mm,oom: Try last second allocation before and
after selecting an OOM victim." is that oom_reserves_allowed(current) can
become true while current is between the __gfp_pfmemalloc_flags(gfp_mask) call
and mutex_trylock(&oom_lock); as a result, an OOM victim can fail to try an
ALLOC_OOM attempt before the next OOM victim is selected when MMF_OOM_SKIP is
set quickly.
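
The window looks roughly like this (a simplified sketch of the relevant
__alloc_pages_slowpath() steps, using the function names quoted below; this
is an illustration of the race, not code from the patch):

----------
	reserve_flags = __gfp_pfmemalloc_flags(gfp_mask);
	/* returns 0 here because current is not yet an OOM victim */

	/*
	 * Meanwhile the OOM killer running on another CPU selects current
	 * as a victim and the OOM reaper quickly sets MMF_OOM_SKIP.
	 */

	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
	/*
	 * mutex_trylock(&oom_lock) succeeds, but the last second attempt
	 * still uses ALLOC_WMARK_HIGH rather than ALLOC_OOM, so
	 * out_of_memory() goes on to select the next OOM victim.
	 */
----------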

I tried to handle such a race window with "mm, oom: task_will_free_mem(current)
should ignore MMF_OOM_SKIP for once." [1], and your response [2] was

  Find a _robust_ solution rather than
  fiddling with try-once-more kind of hacks. E.g. do an allocation attempt
  _before_ we do any disruptive action (aka kill a victim). This would
  help other cases when we race with an exiting tasks or somebody managed
  to free memory while we were selecting an oom victim which can take
  quite some time.

. Therefore, I tried to handle such a race window with "mm,oom: Try last second
allocation after selecting an OOM victim." [3], and your response [4] was

  My primary question was

  : that the above link contains an explanation from Andrea that the reason
  : for the high wmark is to reduce the likelihood of livelocks and be sure
  : to invoke the OOM killer,

  I am not sure how much that reason applies to the current code but if it
  still applies then we should do the same for later
  last-minute-allocation as well. Having both and disagreeing is just a
  mess.

. Therefore, I proposed this "mm,oom: Try last second allocation before and
after selecting an OOM victim." which uses the same watermark, and this time
you are still worrying about no longer using ALLOC_WMARK_HIGH. You are giving
inconsistent messages here. If no longer using ALLOC_WMARK_HIGH carries some
risk (which Andrea needs to clarify), we can't stop using ALLOC_WMARK_HIGH
here. But we have to allow using ALLOC_OOM (before and/or after selecting an
OOM victim), which will result in the disagreement you don't like, because we
cannot stop using ALLOC_WMARK_HIGH while we still need to use ALLOC_OOM in
order to close the race window which [1] tries to handle.

[1] http://lkml.kernel.org/r/201708191523.BJH90621.MHOOFFQSOLJFtV@I-love.SAKURA.ne.jp
[2] http://lkml.kernel.org/r/20170821121022.GF25956@dhcp22.suse.cz
[3] http://lkml.kernel.org/r/1503577106-9196-2-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
[4] http://lkml.kernel.org/r/20170825080020.GE25498@dhcp22.suse.cz

> 
> > +	return get_page_from_freelist(gfp_mask, oc->order, alloc_flags, ac);
> > +}
> > +
> >  static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
> >  		int preferred_nid, nodemask_t *nodemask,
> >  		struct alloc_context *ac, gfp_t *alloc_mask,



> On Sat 28-10-17 17:07:09, Tetsuo Handa wrote:
> > This patch splits last second allocation attempt into two locations, once
> > before selecting an OOM victim and again after selecting an OOM victim,
> > and uses normal watermark for last second allocation attempts.
> 
> Why do we need both?

For two reasons.

(1) You said

      E.g. do an allocation attempt
      _before_ we do any disruptive action (aka kill a victim). This would
      help other cases when we race with an exiting tasks or somebody managed
      to free memory while we were selecting an oom victim which can take
      quite some time.

    at [2]. By doing the really last second allocation attempt after we select
    an OOM victim, we can remove the oom_lock serialization from the OOM reaper,
    which is currently there in order to avoid racing with MMF_OOM_SKIP.

(2) I said

      Since we call panic() before calling oom_kill_process() when there is
      no killable process, panic() will be prematurely called which could
      have been avoided if [2] is used. For example, if a multithreaded
      application running with a dedicated CPUs/memory was OOM-killed, we
      can wait until ALLOC_OOM allocation fails to solve OOM situation.

   at [3]. By doing the almost last second allocation attempt before we select
   an OOM victim, we can avoid needlessly calling panic() when there are no
   eligible threads other than existing OOM victim threads, as well as avoid
   needlessly calling out_of_memory(), given that you think selecting an OOM
   victim can take quite some time.

> 
> > As of linux-2.6.11, nothing prevented from concurrently calling
> > out_of_memory(). TIF_MEMDIE test in select_bad_process() tried to avoid
> > needless OOM killing. Thus, it was safe to do __GFP_DIRECT_RECLAIM
> > allocation (apart from which watermark should be used) just before
> > calling out_of_memory().
> > 
> > As of linux-2.6.24, try_set_zone_oom() was added to
> > __alloc_pages_may_oom() by commit ff0ceb9deb6eb017 ("oom: serialize out
> > of memory calls") which effectively started acting as a kind of today's
> > mutex_trylock(&oom_lock).
> > 
> > As of linux-4.2, try_set_zone_oom() was replaced with oom_lock by
> > commit dc56401fc9f25e8f ("mm: oom_kill: simplify OOM killer locking").
> > At least by this time, it became no longer safe to do
> > __GFP_DIRECT_RECLAIM allocation with oom_lock held.
> > 
> > And as of linux-4.13, last second allocation attempt stopped using
> > __GFP_DIRECT_RECLAIM by commit e746bf730a76fe53 ("mm,page_alloc: don't
> > call __node_reclaim() with oom_lock held.").
> > 
> > Therefore, there is no longer valid reason to use ALLOC_WMARK_HIGH for
> > last second allocation attempt [1].
> 
> Another reason to use the high watermark as explained by Andrea was
> "
> : Elaborating the comment: the reason for the high wmark is to reduce
> : the likelihood of livelocks and be sure to invoke the OOM killer, if
> : we're still under pressure and reclaim just failed. The high wmark is
> : used to be sure the failure of reclaim isn't going to be ignored. If
> : using the min wmark like you propose there's risk of livelock or
> : anyway of delayed OOM killer invocation.
> "
> 
> How is that affected by changes in locking you discribe above?

Andrea, please come out and explain why using ALLOC_WMARK_HIGH helped
avoid OOM livelock and delayed OOM killer invocation in linux-2.6.11.

Since we use !__GFP_DIRECT_RECLAIM for the almost/really last second allocation
attempts (because oom_lock is already held), there is no possibility of OOM
livelock nor of delayed OOM killer invocation. We stopped using
__GFP_DIRECT_RECLAIM for the last second allocation attempt after Andrea
explained the reason to use ALLOC_WMARK_HIGH, so the preconditions have
changed since Andrea explained it. If there is still a possibility of OOM
livelock or delayed OOM killer invocation caused by no longer using
ALLOC_WMARK_HIGH, please explain why (and we will fall back to [1]).

> 
> > And this patch changes to do normal
> > allocation attempt, with handling of ALLOC_OOM added in order to mitigate
> > extra OOM victim selection problem reported by Manish Jaggi [2].
> > 
> > Doing really last second allocation attempt after selecting an OOM victim
> > will also help the OOM reaper to start reclaiming memory without waiting
> > for oom_lock to be released.
> 
> The changelog is much more obscure than it really needs to be. You fail
> to explain _why_ we need this and and _what_ the actual problem is. You
> are simply drowning in details here (btw. this is not the first time
> your changelog has this issues). Try to focus on _what_ is the problem
> _why_ do we care and _how_ are you addressing it.
>  

The actual problem is [1], and your response [2] was

  Find a _robust_ solution rather than
  fiddling with try-once-more kind of hacks. E.g. do an allocation attempt
  _before_ we do any disruptive action (aka kill a victim). This would
  help other cases when we race with an exiting tasks or somebody managed
  to free memory while we were selecting an oom victim which can take
  quite some time.

. I tried to follow your response with [3], and your response [4] was

  This a lot of text which can be more confusing than helpful. Could you
  state the problem clearly without detours?

. You are suggesting that I obscure it by trying to find a _robust_
solution which would help other cases.


* Re: [PATCH] mm,oom: Try last second allocation before and after selecting an OOM victim.
  2017-10-31 10:40   ` Tetsuo Handa
@ 2017-10-31 12:10     ` Michal Hocko
  2017-10-31 12:42       ` Tetsuo Handa
From: Michal Hocko @ 2017-10-31 12:10 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: aarcange, akpm, linux-mm, linux-kernel, rientjes, hannes, mjaggi,
	mgorman, oleg, vdavydov.dev, vbabka

On Tue 31-10-17 19:40:09, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > > +struct page *alloc_pages_before_oomkill(struct oom_control *oc)
> > > +{
> > > +	/*
> > > +	 * Make sure that this allocation attempt shall not depend on
> > > +	 * __GFP_DIRECT_RECLAIM && !__GFP_NORETRY allocation, for the caller is
> > > +	 * already holding oom_lock.
> > > +	 */
> > > +	const gfp_t gfp_mask = oc->gfp_mask & ~__GFP_DIRECT_RECLAIM;
> > > +	struct alloc_context *ac = oc->ac;
> > > +	unsigned int alloc_flags = gfp_to_alloc_flags(gfp_mask);
> > > +	const int reserve_flags = __gfp_pfmemalloc_flags(gfp_mask);
> > > +
> > > +	/* Need to update zonelist if selected as OOM victim. */
> > > +	if (reserve_flags) {
> > > +		alloc_flags = reserve_flags;
> > > +		ac->zonelist = node_zonelist(numa_node_id(), gfp_mask);
> > > +		ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
> > > +					ac->high_zoneidx, ac->nodemask);
> > > +	}
> > 
> > Why do we need this zone list rebuilding?
> 
> Why we do _not_ need this zone list rebuilding?
> 
> The reason I used __alloc_pages_slowpath() in alloc_pages_before_oomkill() is
> to avoid duplicating code (such as checking for ALLOC_OOM and rebuilding zone
> list) which needs to be maintained in sync with __alloc_pages_slowpath().
>
> If you don't like calling __alloc_pages_slowpath() from
> alloc_pages_before_oomkill(), I'm OK with calling __alloc_pages_nodemask()
> (with __GFP_DIRECT_RECLAIM/__GFP_NOFAIL cleared and __GFP_NOWARN set), for
> direct reclaim functions can call __alloc_pages_nodemask() (with PF_MEMALLOC
> set in order to avoid recursion of direct reclaim).
> 
> We are rebuilding zone list if selected as an OOM victim, for
> __gfp_pfmemalloc_flags() returns ALLOC_OOM if oom_reserves_allowed(current)
> is true.

So your answer is copy&paste without a deeper understanding, right?

[...]

> The reason I'm proposing this "mm,oom: Try last second allocation before and
> after selecting an OOM victim." is that since oom_reserves_allowed(current) can
> become true when current is between post __gfp_pfmemalloc_flags(gfp_mask) and
> pre mutex_trylock(&oom_lock), an OOM victim can fail to try ALLOC_OOM attempt
> before selecting next OOM victim when MMF_OOM_SKIP was set quickly.

ENOPARSE. I am not even going to finish this email, sorry. This is way
beyond my time budget.

Can you actually come up with something that doesn't make one's head explode
and yet describes what the actual problem is and how you deal with it?

E.g something like this
"
OOM killer is invoked after all the reclaim attempts have failed and
there doesn't seem to be a viable chance for the situation to change.
__alloc_pages_may_oom tries to reduce chances of a race during OOM
handling by taking oom lock so only one caller is allowed to really
invoke the oom killer.

__alloc_pages_may_oom also tries last time ALLOC_WMARK_HIGH allocation
request before really invoking out_of_memory handler. This has two
motivations. The first one is explained by the comment and it aims to
catch potential parallel OOM killing and the second one was explained by
Andrea Arcangeli as follows:
: Elaborating the comment: the reason for the high wmark is to reduce
: the likelihood of livelocks and be sure to invoke the OOM killer, if
: we're still under pressure and reclaim just failed. The high wmark is
: used to be sure the failure of reclaim isn't going to be ignored. If
: using the min wmark like you propose there's risk of livelock or
: anyway of delayed OOM killer invocation.

While both have some merit, the first reason is mostly historical
because we have the explicit locking now and it is really unlikely that
the memory would be available right after we have given up trying.
Last attempt allocation makes some sense of course but considering that
the oom victim selection is quite an expensive operation which can take
a considerable amount of time it makes much more sense to retry the
allocation after the most expensive part rather than before. Therefore
move the last attempt right before we are trying to kill an oom victim
to rule potential races when somebody could have freed a lot of memory
in the meantime. This will reduce the time window for potentially
pre-mature OOM killing considerably.
"
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm,oom: Try last second allocation before and after selecting an OOM victim.
  2017-10-31 12:10     ` Michal Hocko
@ 2017-10-31 12:42       ` Tetsuo Handa
  2017-10-31 12:48         ` Michal Hocko
From: Tetsuo Handa @ 2017-10-31 12:42 UTC (permalink / raw)
  To: mhocko
  Cc: aarcange, akpm, linux-mm, linux-kernel, rientjes, hannes, mjaggi,
	mgorman, oleg, vdavydov.dev, vbabka

Michal Hocko wrote:
> On Tue 31-10-17 19:40:09, Tetsuo Handa wrote:
> > The reason I used __alloc_pages_slowpath() in alloc_pages_before_oomkill() is
> > to avoid duplicating code (such as checking for ALLOC_OOM and rebuilding zone
> > list) which needs to be maintained in sync with __alloc_pages_slowpath().
> >
> > If you don't like calling __alloc_pages_slowpath() from
> > alloc_pages_before_oomkill(), I'm OK with calling __alloc_pages_nodemask()
> > (with __GFP_DIRECT_RECLAIM/__GFP_NOFAIL cleared and __GFP_NOWARN set), for
> > direct reclaim functions can call __alloc_pages_nodemask() (with PF_MEMALLOC
> > set in order to avoid recursion of direct reclaim).
> > 
> > We are rebuilding zone list if selected as an OOM victim, for
> > __gfp_pfmemalloc_flags() returns ALLOC_OOM if oom_reserves_allowed(current)
> > is true.
> 
> So your answer is copy&paste without a deeper understanding, righ?

Right. I wanted to avoid duplicating code.
But I had to duplicate in order to allow OOM victims to try ALLOC_OOM.

> 
> [...]
> 
> > The reason I'm proposing this "mm,oom: Try last second allocation before and
> > after selecting an OOM victim." is that since oom_reserves_allowed(current) can
> > become true when current is between post __gfp_pfmemalloc_flags(gfp_mask) and
> > pre mutex_trylock(&oom_lock), an OOM victim can fail to try ALLOC_OOM attempt
> > before selecting next OOM victim when MMF_OOM_SKIP was set quickly.
> 
> ENOPARSE. I am not even going to finish this email sorry. This is way
> beyond my time budget.
> 
> Can you actually come with something that doesn't make ones head explode
> and yet describe what the actual problem is and how you deal with it?

http://lkml.kernel.org/r/201708191523.BJH90621.MHOOFFQSOLJFtV@I-love.SAKURA.ne.jp
is the least head-exploding option while it describes what the actual problem
is and how I deal with it.

> 
> E.g something like this
> "
> OOM killer is invoked after all the reclaim attempts have failed and
> there doesn't seem to be a viable chance for the situation to change.
> __alloc_pages_may_oom tries to reduce chances of a race during OOM
> handling by taking oom lock so only one caller is allowed to really
> invoke the oom killer.

OK.
> 
> __alloc_pages_may_oom also tries last time ALLOC_WMARK_HIGH allocation
> request before really invoking out_of_memory handler. This has two
> motivations. The first one is explained by the comment and it aims to
> catch potential parallel OOM killing and the second one was explained by
> Andrea Arcangeli as follows:
> : Elaborating the comment: the reason for the high wmark is to reduce
> : the likelihood of livelocks and be sure to invoke the OOM killer, if
> : we're still under pressure and reclaim just failed. The high wmark is
> : used to be sure the failure of reclaim isn't going to be ignored. If
> : using the min wmark like you propose there's risk of livelock or
> : anyway of delayed OOM killer invocation.
> 

OK.

> While both have some merit, the first reason is mostly historical
> because we have the explicit locking now and it is really unlikely that
> the memory would be available right after we have given up trying.
> Last attempt allocation makes some sense of course but considering that
> the oom victim selection is quite an expensive operation which can take
> a considerable amount of time it makes much more sense to retry the
> allocation after the most expensive part rather than before. Therefore
> move the last attempt right before we are trying to kill an oom victim
> to rule potential races when somebody could have freed a lot of memory
> in the meantime. This will reduce the time window for potentially
> pre-mature OOM killing considerably.

But this is about "doing last second allocation attempt after selecting
an OOM victim". This is not about "allowing OOM victims to try ALLOC_OOM
before selecting next OOM victim", which is the actual problem I'm trying
to deal with. Moving the last second allocation attempt from "before" to
"after" does not solve the problem if ALLOC_OOM cannot be used. What I'm
proposing is to allow OOM victims to try ALLOC_OOM.

> "


* Re: [PATCH] mm,oom: Try last second allocation before and after selecting an OOM victim.
  2017-10-31 12:42       ` Tetsuo Handa
@ 2017-10-31 12:48         ` Michal Hocko
  2017-10-31 13:13           ` Tetsuo Handa
From: Michal Hocko @ 2017-10-31 12:48 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: aarcange, akpm, linux-mm, linux-kernel, rientjes, hannes, mjaggi,
	mgorman, oleg, vdavydov.dev, vbabka

On Tue 31-10-17 21:42:23, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Tue 31-10-17 19:40:09, Tetsuo Handa wrote:
> > > The reason I used __alloc_pages_slowpath() in alloc_pages_before_oomkill() is
> > > to avoid duplicating code (such as checking for ALLOC_OOM and rebuilding zone
> > > list) which needs to be maintained in sync with __alloc_pages_slowpath().
> > >
> > > If you don't like calling __alloc_pages_slowpath() from
> > > alloc_pages_before_oomkill(), I'm OK with calling __alloc_pages_nodemask()
> > > (with __GFP_DIRECT_RECLAIM/__GFP_NOFAIL cleared and __GFP_NOWARN set), for
> > > direct reclaim functions can call __alloc_pages_nodemask() (with PF_MEMALLOC
> > > set in order to avoid recursion of direct reclaim).
> > > 
> > > We are rebuilding zone list if selected as an OOM victim, for
> > > __gfp_pfmemalloc_flags() returns ALLOC_OOM if oom_reserves_allowed(current)
> > > is true.
> > 
> > So your answer is copy&paste without a deeper understanding, righ?
> 
> Right. I wanted to avoid duplicating code.
> But I had to duplicate in order to allow OOM victims to try ALLOC_OOM.

I absolutely hate this cargo cult programming!

[...]

> > While both have some merit, the first reason is mostly historical
> > because we have the explicit locking now and it is really unlikely that
> > the memory would be available right after we have given up trying.
> > Last attempt allocation makes some sense of course but considering that
> > the oom victim selection is quite an expensive operation which can take
> > a considerable amount of time it makes much more sense to retry the
> > allocation after the most expensive part rather than before. Therefore
> > move the last attempt right before we are trying to kill an oom victim
> > to rule potential races when somebody could have freed a lot of memory
> > in the meantime. This will reduce the time window for potentially
> > pre-mature OOM killing considerably.
> 
> But this is about "doing last second allocation attempt after selecting
> an OOM victim". This is not about "allowing OOM victims to try ALLOC_OOM
> before selecting next OOM victim" which is the actual problem I'm trying
> to deal with.

then split it into two. First make the general case and then add a more
sophisticated one on top. Dealing with multiple issues at once is what makes
all those brain cells suffer.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm,oom: Try last second allocation before and after selecting an OOM victim.
  2017-10-31 12:48         ` Michal Hocko
@ 2017-10-31 13:13           ` Tetsuo Handa
  2017-10-31 13:22             ` Michal Hocko
From: Tetsuo Handa @ 2017-10-31 13:13 UTC (permalink / raw)
  To: mhocko
  Cc: aarcange, akpm, linux-mm, linux-kernel, rientjes, hannes, mjaggi,
	mgorman, oleg, vdavydov.dev, vbabka

Michal Hocko wrote:
> On Tue 31-10-17 21:42:23, Tetsuo Handa wrote:
> > > While both have some merit, the first reason is mostly historical
> > > because we have the explicit locking now and it is really unlikely that
> > > the memory would be available right after we have given up trying.
> > > Last attempt allocation makes some sense of course but considering that
> > > the oom victim selection is quite an expensive operation which can take
> > > a considerable amount of time it makes much more sense to retry the
> > > allocation after the most expensive part rather than before. Therefore
> > > move the last attempt right before we are trying to kill an oom victim
> > > to rule potential races when somebody could have freed a lot of memory
> > > in the meantime. This will reduce the time window for potentially
> > > pre-mature OOM killing considerably.
> > 
> > But this is about "doing last second allocation attempt after selecting
> > an OOM victim". This is not about "allowing OOM victims to try ALLOC_OOM
> > before selecting next OOM victim" which is the actual problem I'm trying
> > to deal with.
> 
> then split it into two. First make the general case and then add a more
> sophisticated on top. Dealing with multiple issues at once is what makes
> all those brain cells suffer.

I'm failing to understand. I was dealing with a single issue at once.
The single issue is "MMF_OOM_SKIP prematurely prevents OOM victims from trying
ALLOC_OOM before selecting next OOM victims". Then, what are the general case
and the more sophisticated one? I wonder what other than "MMF_OOM_SKIP should
allow OOM victims to try ALLOC_OOM for once before selecting next OOM victims"
can exist...


* Re: [PATCH] mm,oom: Try last second allocation before and after selecting an OOM victim.
  2017-10-31 13:13           ` Tetsuo Handa
@ 2017-10-31 13:22             ` Michal Hocko
  2017-10-31 13:51               ` Tetsuo Handa
From: Michal Hocko @ 2017-10-31 13:22 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: aarcange, akpm, linux-mm, linux-kernel, rientjes, hannes, mjaggi,
	mgorman, oleg, vdavydov.dev, vbabka

On Tue 31-10-17 22:13:05, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Tue 31-10-17 21:42:23, Tetsuo Handa wrote:
> > > > While both have some merit, the first reason is mostly historical
> > > > because we have the explicit locking now and it is really unlikely that
> > > > the memory would be available right after we have given up trying.
> > > > Last attempt allocation makes some sense of course but considering that
> > > > the oom victim selection is quite an expensive operation which can take
> > > > a considerable amount of time it makes much more sense to retry the
> > > > allocation after the most expensive part rather than before. Therefore
> > > > move the last attempt right before we are trying to kill an oom victim
> > > > to rule potential races when somebody could have freed a lot of memory
> > > > in the meantime. This will reduce the time window for potentially
> > > > pre-mature OOM killing considerably.
> > > 
> > > But this is about "doing last second allocation attempt after selecting
> > > an OOM victim". This is not about "allowing OOM victims to try ALLOC_OOM
> > > before selecting next OOM victim" which is the actual problem I'm trying
> > > to deal with.
> > 
> > then split it into two. First make the general case and then add a more
> > sophisticated on top. Dealing with multiple issues at once is what makes
> > all those brain cells suffer.
> 
> I'm failing to understand. I was dealing with single issue at once.
> The single issue is "MMF_OOM_SKIP prematurely prevents OOM victims from trying
> ALLOC_OOM before selecting next OOM victims". Then, what are the general case and
> a more sophisticated? I wonder what other than "MMF_OOM_SKIP should allow OOM
> victims to try ALLOC_OOM for once before selecting next OOM victims" can exist...

Try to think a little bit outside of your very specific and borderline use case
and it will become obvious. ALLOC_OOM is a trivial update on top of
moving get_page_from_freelist to oom_kill_process, which is a more
generic race window reducer.

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm,oom: Try last second allocation before and after selecting an OOM victim.
  2017-10-31 13:22             ` Michal Hocko
@ 2017-10-31 13:51               ` Tetsuo Handa
  2017-10-31 14:10                 ` Michal Hocko
From: Tetsuo Handa @ 2017-10-31 13:51 UTC (permalink / raw)
  To: mhocko
  Cc: aarcange, akpm, linux-mm, linux-kernel, rientjes, hannes, mjaggi,
	mgorman, oleg, vdavydov.dev, vbabka

Michal Hocko wrote:
> On Tue 31-10-17 22:13:05, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Tue 31-10-17 21:42:23, Tetsuo Handa wrote:
> > > > > While both have some merit, the first reason is mostly historical
> > > > > because we have the explicit locking now and it is really unlikely that
> > > > > the memory would be available right after we have given up trying.
> > > > > Last attempt allocation makes some sense of course but considering that
> > > > > the oom victim selection is quite an expensive operation which can take
> > > > > a considerable amount of time it makes much more sense to retry the
> > > > > allocation after the most expensive part rather than before. Therefore
> > > > > move the last attempt right before we are trying to kill an oom victim
> > > > > to rule potential races when somebody could have freed a lot of memory
> > > > > in the meantime. This will reduce the time window for potentially
> > > > > pre-mature OOM killing considerably.
> > > > 
> > > > But this is about "doing last second allocation attempt after selecting
> > > > an OOM victim". This is not about "allowing OOM victims to try ALLOC_OOM
> > > > before selecting next OOM victim" which is the actual problem I'm trying
> > > > to deal with.
> > > 
> > > then split it into two. First make the general case and then add a more
> > > sophisticated on top. Dealing with multiple issues at once is what makes
> > > all those brain cells suffer.
> > 
> > I'm failing to understand. I was dealing with single issue at once.
> > The single issue is "MMF_OOM_SKIP prematurely prevents OOM victims from trying
> > ALLOC_OOM before selecting next OOM victims". Then, what are the general case and
> > a more sophisticated? I wonder what other than "MMF_OOM_SKIP should allow OOM
> > victims to try ALLOC_OOM for once before selecting next OOM victims" can exist...
> 
> Try to think little bit out of your very specific and borderline usecase
> and it will become obvious. ALLOC_OOM is a trivial update on top of
> moving get_page_from_freelist to oom_kill_process which is a more
> generic race window reducer.

So, you meant "doing last second allocation attempt after selecting an OOM victim"
as the general case and "using ALLOC_OOM at last second allocation attempt" as the
more sophisticated one. Then, you won't object to conditionally switching between
ALLOC_WMARK_HIGH and ALLOC_OOM for the last second allocation attempt, will you?

But doing ALLOC_OOM for the last second allocation attempt from out_of_memory()
involves duplicating code (e.g. rebuilding the zone list). What is your preferred
approach? Duplicate the relevant code? Use get_page_from_freelist() without
rebuilding the zone list? Use __alloc_pages_nodemask()?


* Re: [PATCH] mm,oom: Try last second allocation before and after selecting an OOM victim.
  2017-10-31 13:51               ` Tetsuo Handa
@ 2017-10-31 14:10                 ` Michal Hocko
  2017-11-01 11:58                   ` Tetsuo Handa
From: Michal Hocko @ 2017-10-31 14:10 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: aarcange, akpm, linux-mm, linux-kernel, rientjes, hannes, mjaggi,
	mgorman, oleg, vdavydov.dev, vbabka

On Tue 31-10-17 22:51:49, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Tue 31-10-17 22:13:05, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > On Tue 31-10-17 21:42:23, Tetsuo Handa wrote:
> > > > > > While both have some merit, the first reason is mostly historical
> > > > > > because we have the explicit locking now and it is really unlikely that
> > > > > > the memory would be available right after we have given up trying.
> > > > > > Last attempt allocation makes some sense of course but considering that
> > > > > > the oom victim selection is quite an expensive operation which can take
> > > > > > a considerable amount of time it makes much more sense to retry the
> > > > > > allocation after the most expensive part rather than before. Therefore
> > > > > > move the last attempt right before we are trying to kill an oom victim
> > > > > > to rule potential races when somebody could have freed a lot of memory
> > > > > > in the meantime. This will reduce the time window for potentially
> > > > > > pre-mature OOM killing considerably.
> > > > > 
> > > > > But this is about "doing last second allocation attempt after selecting
> > > > > an OOM victim". This is not about "allowing OOM victims to try ALLOC_OOM
> > > > > before selecting next OOM victim" which is the actual problem I'm trying
> > > > > to deal with.
> > > > 
> > > > then split it into two. First make the general case and then add a more
> > > > sophisticated on top. Dealing with multiple issues at once is what makes
> > > > all those brain cells suffer.
> > > 
> > > I'm failing to understand. I was dealing with single issue at once.
> > > The single issue is "MMF_OOM_SKIP prematurely prevents OOM victims from trying
> > > ALLOC_OOM before selecting next OOM victims". Then, what are the general case and
> > > a more sophisticated? I wonder what other than "MMF_OOM_SKIP should allow OOM
> > > victims to try ALLOC_OOM for once before selecting next OOM victims" can exist...
> > 
> > Try to think little bit out of your very specific and borderline usecase
> > and it will become obvious. ALLOC_OOM is a trivial update on top of
> > moving get_page_from_freelist to oom_kill_process which is a more
> > generic race window reducer.
> 
> So, you meant "doing last second allocation attempt after selecting an OOM victim"
> as the general case and "using ALLOC_OOM at last second allocation attempt" as a
> more sophisticated. Then, you won't object conditionally switching ALLOC_WMARK_HIGH
> and ALLOC_OOM for last second allocation attempt, will you?

yes for oom_victims

> But doing ALLOC_OOM for last second allocation attempt from out_of_memory() involve
> duplicating code (e.g. rebuilding zone list).

Why would you do it? Do not blindly copy and paste code without
a good reason. What kind of problem does this actually solve?

> What is your preferred approach?
> Duplicate relevant code? Use get_page_from_freelist() without rebuilding the zone list?
> Use __alloc_pages_nodemask() ?

Just do what we do now with ALLOC_WMARK_HIGH and, in a separate patch, use
ALLOC_OOM for oom victims. There shouldn't be any reason to play
additional tricks here.
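
For example, a minimal sketch of that direction, assuming the existing
oom_reserves_allowed() and ALLOC_OOM helpers (an illustration of the
suggestion, not a posted patch):

----------
	/*
	 * Last second attempt in __alloc_pages_may_oom(): keep today's
	 * ALLOC_WMARK_HIGH request, and (as a separate change) let an OOM
	 * victim dip into the OOM memory reserves instead.
	 */
	unsigned int alloc_flags = ALLOC_WMARK_HIGH | ALLOC_CPUSET;

	if (oom_reserves_allowed(current) && !(gfp_mask & __GFP_NOMEMALLOC))
		alloc_flags = ALLOC_OOM;

	page = get_page_from_freelist((gfp_mask | __GFP_HARDWALL) &
				      ~__GFP_DIRECT_RECLAIM, order,
				      alloc_flags, ac);
----------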

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm,oom: Try last second allocation before and after selecting an OOM victim.
  2017-10-31 14:10                 ` Michal Hocko
@ 2017-11-01 11:58                   ` Tetsuo Handa
  2017-11-01 12:46                     ` Michal Hocko
From: Tetsuo Handa @ 2017-11-01 11:58 UTC (permalink / raw)
  To: mhocko
  Cc: aarcange, akpm, linux-mm, linux-kernel, rientjes, hannes, mjaggi,
	mgorman, oleg, vdavydov.dev, vbabka

Michal Hocko wrote:
> On Tue 31-10-17 22:51:49, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Tue 31-10-17 22:13:05, Tetsuo Handa wrote:
> > > > Michal Hocko wrote:
> > > > > On Tue 31-10-17 21:42:23, Tetsuo Handa wrote:
> > > > > > > While both have some merit, the first reason is mostly historical
> > > > > > > because we have the explicit locking now and it is really unlikely that
> > > > > > > the memory would be available right after we have given up trying.
> > > > > > > Last attempt allocation makes some sense of course but considering that
> > > > > > > the oom victim selection is quite an expensive operation which can take
> > > > > > > a considerable amount of time it makes much more sense to retry the
> > > > > > > allocation after the most expensive part rather than before. Therefore
> > > > > > > move the last attempt right before we are trying to kill an oom victim
> > > > > > > to rule potential races when somebody could have freed a lot of memory
> > > > > > > in the meantime. This will reduce the time window for potentially
> > > > > > > pre-mature OOM killing considerably.
> > > > > > 
> > > > > > But this is about "doing last second allocation attempt after selecting
> > > > > > an OOM victim". This is not about "allowing OOM victims to try ALLOC_OOM
> > > > > > before selecting next OOM victim" which is the actual problem I'm trying
> > > > > > to deal with.
> > > > > 
> > > > > then split it into two. First make the general case and then add a more
> > > > > sophisticated on top. Dealing with multiple issues at once is what makes
> > > > > all those brain cells suffer.
> > > > 
> > > > I'm failing to understand. I was dealing with single issue at once.
> > > > The single issue is "MMF_OOM_SKIP prematurely prevents OOM victims from trying
> > > > ALLOC_OOM before selecting next OOM victims". Then, what are the general case and
> > > > a more sophisticated? I wonder what other than "MMF_OOM_SKIP should allow OOM
> > > > victims to try ALLOC_OOM for once before selecting next OOM victims" can exist...
> > > 
> > > Try to think little bit out of your very specific and borderline usecase
> > > and it will become obvious. ALLOC_OOM is a trivial update on top of
> > > moving get_page_from_freelist to oom_kill_process which is a more
> > > generic race window reducer.
> > 
> > So, you meant "doing last second allocation attempt after selecting an OOM victim"
> > as the general case and "using ALLOC_OOM at last second allocation attempt" as a
> > more sophisticated. Then, you won't object conditionally switching ALLOC_WMARK_HIGH
> > and ALLOC_OOM for last second allocation attempt, will you?
> 
> yes for oom_victims

OK.

> 
> > But doing ALLOC_OOM for last second allocation attempt from out_of_memory() involve
> > duplicating code (e.g. rebuilding zone list).
> 
> Why would you do it? Do not blindly copy and paste code without
> a good reason. What kind of problem does this actually solve?

prepare_alloc_pages()/finalise_ac() initializes as

	ac->high_zoneidx = gfp_zone(gfp_mask);
	ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
	ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
						     ac->high_zoneidx, ac->nodemask);

and selecting as an OOM victim reinitializes as

	ac->zonelist = node_zonelist(numa_node_id(), gfp_mask);
	ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
						     ac->high_zoneidx, ac->nodemask);

and I assume that this reinitialization might affect which memory reserve
the OOM victim allocates from.

Do you mean that such a difference is too trivial to care about?

> 
> > What is your preferred approach?
> > Duplicate relevant code? Use get_page_from_freelist() without rebuilding the zone list?
> > Use __alloc_pages_nodemask() ?
> 
> Just do what we do now with ALLOC_WMARK_HIGH and in a separate patch use
> ALLOC_OOM for oom victims. There shouldn't be any reasons to play
> additional tricks here.
> 

Posted as http://lkml.kernel.org/r/1509537268-4726-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp .

But I'm still unable to understand why "moving get_page_from_freelist to
oom_kill_process" is better than "copying get_page_from_freelist to
oom_kill_process", because "moving" increases the possibility of allocation
failures when out_of_memory() is not called. Also, I'm still unable to understand
why ALLOC_WMARK_HIGH should be used. I think that using the regular watermark for
the last second allocation attempt is better (as described below).

__alloc_pages_may_oom() is doing last second allocation attempt using
ALLOC_WMARK_HIGH before calling out_of_memory(). This has two motivations.
The first one is explained by the comment that it aims to catch potential
parallel OOM killing and the second one was explained by Andrea Arcangeli
as follows:
: Elaborating the comment: the reason for the high wmark is to reduce
: the likelihood of livelocks and be sure to invoke the OOM killer, if
: we're still under pressure and reclaim just failed. The high wmark is
: used to be sure the failure of reclaim isn't going to be ignored. If
: using the min wmark like you propose there's risk of livelock or
: anyway of delayed OOM killer invocation.

But neither motivation applies to the current code. Regarding the former,
there is no parallel OOM killing (in the sense that out_of_memory() is
called "concurrently") because we serialize out_of_memory() calls using
oom_lock. Regarding the latter, there is no possibility of OOM livelocks
nor of failing to invoke the OOM killer, because we mask
__GFP_DIRECT_RECLAIM for the last second allocation attempt; holding oom_lock
prevents the __GFP_DIRECT_RECLAIM && !__GFP_NORETRY allocations which the last
second allocation attempt would otherwise depend on from failing.

However, parallel OOM killing still exists (in the sense that
out_of_memory() is called "consecutively") because the last second allocation
attempt cannot fully utilize memory reclaimed by the previous round of OOM
killer invocation due to the use of ALLOC_WMARK_HIGH. Sometimes a last second
allocation attempt after selecting an OOM victim can succeed because
somebody might have managed to free memory while we were selecting an OOM
victim, which can take quite some time. This suggests that giving up the last
second allocation attempt as soon as ALLOC_WMARK_HIGH fails can be premature.

Even if the last second allocation attempt after selecting an OOM victim would
still fail, an already killed OOM victim might allow ALLOC_WMARK_MIN to succeed.
We don't need to select the next OOM victims if ALLOC_WMARK_MIN can succeed.
Also, we don't need to select the next OOM victims if existing OOM victims can
proceed with ALLOC_OOM. And this is what we end up doing anyway if
mutex_trylock(&oom_lock) in __alloc_pages_may_oom() fails. But it is
theoretically possible to hit a sequence like:

  Thread1           Thread2           Thread3

    Enters __alloc_pages_may_oom().
                    Enters __alloc_pages_may_oom().
                                      Enters __alloc_pages_may_oom().
    Preempted by somebody else.
                    Preempted by somebody else.
                                      mutex_trylock(&oom_lock) succeeds.
                                      get_page_from_freelist(ALLOC_WMARK_HIGH) fails. And get_page_from_freelist(ALLOC_WMARK_MIN) would have failed.
                                      Calls out_of_memory() and kills a not-such-memhog victim.
                                      Calls mutex_unlock(&oom_lock)
                    Returns from preemption.
                    mutex_trylock(&oom_lock) succeeds.
                    get_page_from_freelist(ALLOC_WMARK_HIGH) fails. But get_page_from_freelist(ALLOC_WMARK_MIN) would have succeeded.
                    Calls out_of_memory() and kills next not-such-memhog victim.
                    Calls mutex_unlock(&oom_lock)
    Returns from preemption.
    mutex_trylock(&oom_lock) succeeds.
    get_page_from_freelist(ALLOC_WMARK_HIGH) fails. But get_page_from_freelist(ALLOC_WMARK_MIN) would have succeeded.
    Calls out_of_memory() and kills next not-such-memhog victim.
    Calls mutex_unlock(&oom_lock)

and Thread1/Thread2 would not have needed to OOM-kill if ALLOC_WMARK_MIN
had been used. When we hit a sequence like the above, using ALLOC_WMARK_HIGH
for the last second allocation attempt is unlikely to help avoid potential
parallel OOM killing. Rather, using ALLOC_WMARK_MIN is likely to help
avoid it.
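
Illustratively, the watermark choice argued for above would make the last
second attempt look roughly like the sketch below. This is not the actual
patch, and the helper name oom_reserves_allowed() is quoted from memory,
so treat it as an assumption:

	/*
	 * Rough sketch, not the actual patch: OOM victims may dip into the
	 * OOM reserves (ALLOC_OOM); everybody else retries with the normal
	 * (min) watermark instead of ALLOC_WMARK_HIGH.
	 */
	alloc_flags = oom_reserves_allowed(current) ? ALLOC_OOM
						    : ALLOC_WMARK_MIN;
	page = get_page_from_freelist((gfp_mask | __GFP_HARDWALL) &
				      ~__GFP_DIRECT_RECLAIM, order,
				      alloc_flags | ALLOC_CPUSET, ac);
	if (page)
		return page;	/* no need to select (another) OOM victim */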

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] mm,oom: Try last second allocation before and after selecting an OOM victim.
  2017-11-01 11:58                   ` Tetsuo Handa
@ 2017-11-01 12:46                     ` Michal Hocko
  2017-11-01 14:38                       ` Tetsuo Handa
  0 siblings, 1 reply; 15+ messages in thread
From: Michal Hocko @ 2017-11-01 12:46 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: aarcange, akpm, linux-mm, linux-kernel, rientjes, hannes, mjaggi,
	mgorman, oleg, vdavydov.dev, vbabka

On Wed 01-11-17 20:58:50, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Tue 31-10-17 22:51:49, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > On Tue 31-10-17 22:13:05, Tetsuo Handa wrote:
> > > > > Michal Hocko wrote:
> > > > > > On Tue 31-10-17 21:42:23, Tetsuo Handa wrote:
> > > > > > > > While both have some merit, the first reason is mostly historical
> > > > > > > > because we have the explicit locking now and it is really unlikely that
> > > > > > > > the memory would be available right after we have given up trying.
> > > > > > > > Last attempt allocation makes some sense of course but considering that
> > > > > > > > the oom victim selection is quite an expensive operation which can take
> > > > > > > > a considerable amount of time it makes much more sense to retry the
> > > > > > > > allocation after the most expensive part rather than before. Therefore
> > > > > > > > move the last attempt right before we are trying to kill an oom victim
> > > > > > > > to rule potential races when somebody could have freed a lot of memory
> > > > > > > > in the meantime. This will reduce the time window for potentially
> > > > > > > > pre-mature OOM killing considerably.
> > > > > > > 
> > > > > > > But this is about "doing last second allocation attempt after selecting
> > > > > > > an OOM victim". This is not about "allowing OOM victims to try ALLOC_OOM
> > > > > > > before selecting next OOM victim" which is the actual problem I'm trying
> > > > > > > to deal with.
> > > > > > 
> > > > > > then split it into two. First make the general case and then add a more
> > > > > > sophisticated on top. Dealing with multiple issues at once is what makes
> > > > > > all those brain cells suffer.
> > > > > 
> > > > > I'm failing to understand. I was dealing with single issue at once.
> > > > > The single issue is "MMF_OOM_SKIP prematurely prevents OOM victims from trying
> > > > > ALLOC_OOM before selecting next OOM victims". Then, what are the general case and
> > > > > a more sophisticated? I wonder what other than "MMF_OOM_SKIP should allow OOM
> > > > > victims to try ALLOC_OOM for once before selecting next OOM victims" can exist...
> > > > 
> > > > Try to think little bit out of your very specific and borderline usecase
> > > > and it will become obvious. ALLOC_OOM is a trivial update on top of
> > > > moving get_page_from_freelist to oom_kill_process which is a more
> > > > generic race window reducer.
> > > 
> > > So, you meant "doing last second allocation attempt after selecting an OOM victim"
> > > as the general case and "using ALLOC_OOM at last second allocation attempt" as a
> > > more sophisticated. Then, you won't object conditionally switching ALLOC_WMARK_HIGH
> > > and ALLOC_OOM for last second allocation attempt, will you?
> > 
> > yes for oom_victims
> 
> OK.
> 
> > 
> > > But doing ALLOC_OOM for last second allocation attempt from out_of_memory() involve
> > > duplicating code (e.g. rebuilding zone list).
> > 
> > Why would you do it? Do not blindly copy and paste code without
> > a good reason. What kind of problem does this actually solve?
> 
> prepare_alloc_pages()/finalise_ac() initializes as
> 
> 	ac->high_zoneidx = gfp_zone(gfp_mask);
> 	ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
> 	ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
> 						     ac->high_zoneidx, ac->nodemask);
> 
> and selecting as an OOM victim reinitializes as
> 
> 	ac->zonelist = node_zonelist(numa_node_id(), gfp_mask);
> 	ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
> 						     ac->high_zoneidx, ac->nodemask);
> 
> and I assume that this reinitialization might affect which memory reserve
> the OOM victim allocates from.
> 
> You mean such difference is too trivial to care about?

You keep repeating what the _current_ code does without explaining _why_
we need the same thing in the oom path. Could you finally answer my
question, please?

> > > What is your preferred approach?
> > > Duplicate relevant code? Use get_page_from_freelist() without rebuilding the zone list?
> > > Use __alloc_pages_nodemask() ?
> > 
> > Just do what we do now with ALLOC_WMARK_HIGH and in a separate patch use
> > ALLOC_OOM for oom victims. There shouldn't be any reasons to play
> > additional tricks here.
> > 
> 
> Posted as http://lkml.kernel.org/r/1509537268-4726-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp .
> 
> But I'm still unable to understand why "moving get_page_from_freelist to
> oom_kill_process" is better than "copying get_page_from_freelist to
> oom_kill_process", for "moving" increases possibility of allocation failures
> when out_of_memory() is not called.

The changelog I have provided to you should answer that. It is highly
unlikely that the high wmark would succeed _right_ after we have just
given up. If this assumption is not correct then we can _add_ such a call
based on real data rather than add more bloat "just because we used to
do that". As I've said, I completely hate cargo-cult programming. Do
not add more.

> Also, I'm still unable to understand why
> to use ALLOC_WMARK_HIGH. I think that using regular watermark for last second
> allocation attempt is better (as described below).

If you believe that a standard wmark is sufficient then make it a
separate patch with the full explanation why.

> __alloc_pages_may_oom() is doing last second allocation attempt using
> ALLOC_WMARK_HIGH before calling out_of_memory(). This has two motivations.
> The first one is explained by the comment that it aims to catch potential
> parallel OOM killing and the second one was explained by Andrea Arcangeli
> as follows:
> : Elaborating the comment: the reason for the high wmark is to reduce
> : the likelihood of livelocks and be sure to invoke the OOM killer, if
> : we're still under pressure and reclaim just failed. The high wmark is
> : used to be sure the failure of reclaim isn't going to be ignored. If
> : using the min wmark like you propose there's risk of livelock or
> : anyway of delayed OOM killer invocation.
> 
> But neither motivation applies to current code. Regarding the former,
> there is no parallel OOM killing (in the sense that out_of_memory() is
> called "concurrently") because we serialize out_of_memory() calls using
> oom_lock. Regarding the latter, there is no possibility of OOM livelocks
> nor possibility of failing to invoke the OOM killer because we mask
> __GFP_DIRECT_RECLAIM for last second allocation attempt because oom_lock
> prevents __GFP_DIRECT_RECLAIM && !__GFP_NORETRY allocations which last
> second allocation attempt depends on from failing.

Read that comment again. I believe you have misunderstood it. It is not
about gfp flags at all. It is that we simply never invoke the oom killer
just because of small allocation fluctuations.

[...]
>   Thread1           Thread2           Thread3
> 
>     Enters __alloc_pages_may_oom().
>                     Enters __alloc_pages_may_oom().
>                                       Enters __alloc_pages_may_oom().
>     Preempted by somebody else.
>                     Preempted by somebody else.
>                                       mutex_trylock(&oom_lock) succeeds.
>                                       get_page_from_freelist(ALLOC_WMARK_HIGH) fails. And get_page_from_freelist(ALLOC_WMARK_MIN) would have failed.
>                                       Calls out_of_memory() and kills a not-such-memhog victim.
>                                       Calls mutex_unlock(&oom_lock)
>                     Returns from preemption.
>                     mutex_trylock(&oom_lock) succeeds.
>                     get_page_from_freelist(ALLOC_WMARK_HIGH) fails. But get_page_from_freelist(ALLOC_WMARK_MIN) would have succeeded.
>                     Calls out_of_memory() and kills next not-such-memhog victim.
>                     Calls mutex_unlock(&oom_lock)
>     Returns from preemption.
>     mutex_trylock(&oom_lock) succeeds.
>     get_page_from_freelist(ALLOC_WMARK_HIGH) fails. But get_page_from_freelist(ALLOC_WMARK_MIN) would have succeeded.
>     Calls out_of_memory() and kills next not-such-memhog victim.
>     Calls mutex_unlock(&oom_lock)
> 
> and Thread1/Thread2 did not need to OOM-kill if ALLOC_WMARK_MIN were used.
> When we hit sequence like above, using ALLOC_WMARK_HIGH for last second allocation
> attempt unlikely helps avoiding potential parallel OOM killing. Rather, using
> ALLOC_WMARK_MIN likely helps avoiding potential parallel OOM killing.

I am not sure such a scenario matters all that much because it assumes
that the oom victim doesn't really free much memory [1] (basically less than
HIGH-MIN). Most OOM situations simply have a memory hog consuming a
significant amount of memory. Sure, you can construct a workload which
spans many zones (especially on NUMA systems with many nodes/zones), but
can we focus on reasonable workloads rather than overcomplicate things
without a good reason just because "we can screw up systems in so many
different ways"? In other words, please be reasonable...

[1] Take this as an example
Node 0, zone   Normal
  pages free     78348
        min      11522
        low      14402
        high     17282
        spanned  1368064
        present  1368064
        managed  1332983

which is a 5GB zone where high-min is ~20MB.
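
(As a rough check of that figure, assuming 4 KiB pages:

	high - min = 17282 - 11522 = 5760 pages
	5760 pages * 4 KiB       ~= 22.5 MiB

so the ~20MB estimate above is in the right ballpark.)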
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] mm,oom: Try last second allocation before and after selecting an OOM victim.
  2017-11-01 12:46                     ` Michal Hocko
@ 2017-11-01 14:38                       ` Tetsuo Handa
  2017-11-01 14:48                         ` Michal Hocko
  0 siblings, 1 reply; 15+ messages in thread
From: Tetsuo Handa @ 2017-11-01 14:38 UTC (permalink / raw)
  To: mhocko
  Cc: aarcange, akpm, linux-mm, linux-kernel, rientjes, hannes, mjaggi,
	mgorman, oleg, vdavydov.dev, vbabka

Michal Hocko wrote:
> On Wed 01-11-17 20:58:50, Tetsuo Handa wrote:
> > > > But doing ALLOC_OOM for last second allocation attempt from out_of_memory() involve
> > > > duplicating code (e.g. rebuilding zone list).
> > > 
> > > Why would you do it? Do not blindly copy and paste code without
> > > a good reason. What kind of problem does this actually solve?
> > 
> > prepare_alloc_pages()/finalise_ac() initializes as
> > 
> > 	ac->high_zoneidx = gfp_zone(gfp_mask);
> > 	ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
> > 	ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
> > 						     ac->high_zoneidx, ac->nodemask);
> > 
> > and selecting as an OOM victim reinitializes as
> > 
> > 	ac->zonelist = node_zonelist(numa_node_id(), gfp_mask);
> > 	ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
> > 						     ac->high_zoneidx, ac->nodemask);
> > 
> > and I assume that this reinitialization might affect which memory reserve
> > the OOM victim allocates from.
> > 
> > You mean such difference is too trivial to care about?
> 
> You keep repeating what the _current_ code does without explaining _why_
> do we need the same thing in the oom path. Could you finaly answer my
> question please?

Because I consider that following what the current code does is reasonable
unless there are explicit reasons not to.

> 
> > > > What is your preferred approach?
> > > > Duplicate relevant code? Use get_page_from_freelist() without rebuilding the zone list?
> > > > Use __alloc_pages_nodemask() ?
> > > 
> > > Just do what we do now with ALLOC_WMARK_HIGH and in a separate patch use
> > > ALLOC_OOM for oom victims. There shouldn't be any reasons to play
> > > additional tricks here.
> > > 
> > 
> > Posted as http://lkml.kernel.org/r/1509537268-4726-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp .
> > 
> > But I'm still unable to understand why "moving get_page_from_freelist to
> > oom_kill_process" is better than "copying get_page_from_freelist to
> > oom_kill_process", for "moving" increases possibility of allocation failures
> > when out_of_memory() is not called.
> 
> The changelog I have provided to you should answer that. It is highly
> unlikely there high wmark would succeed _right_ after we have just given
> up. If this assumption is not correct then we can _add_ such a call
> based on a real data rather than add more bloat "just because we used to
> do that". As I've said I completely hate the cargo cult programming. Do
> not add more.

I think it is highly unlikely that the high wmark would succeed. But
I don't think it is highly unlikely that the normal wmark would succeed.

> 
> > Also, I'm still unable to understand why
> > to use ALLOC_WMARK_HIGH. I think that using regular watermark for last second
> > allocation attempt is better (as described below).
> 
> If you believe that a standard wmark is sufficient then make it a
> separate patch with the full explanation why.
> 
> > __alloc_pages_may_oom() is doing last second allocation attempt using
> > ALLOC_WMARK_HIGH before calling out_of_memory(). This has two motivations.
> > The first one is explained by the comment that it aims to catch potential
> > parallel OOM killing and the second one was explained by Andrea Arcangeli
> > as follows:
> > : Elaborating the comment: the reason for the high wmark is to reduce
> > : the likelihood of livelocks and be sure to invoke the OOM killer, if
> > : we're still under pressure and reclaim just failed. The high wmark is
> > : used to be sure the failure of reclaim isn't going to be ignored. If
> > : using the min wmark like you propose there's risk of livelock or
> > : anyway of delayed OOM killer invocation.
> > 
> > But neither motivation applies to current code. Regarding the former,
> > there is no parallel OOM killing (in the sense that out_of_memory() is
> > called "concurrently") because we serialize out_of_memory() calls using
> > oom_lock. Regarding the latter, there is no possibility of OOM livelocks
> > nor possibility of failing to invoke the OOM killer because we mask
> > __GFP_DIRECT_RECLAIM for last second allocation attempt because oom_lock
> > prevents __GFP_DIRECT_RECLAIM && !__GFP_NORETRY allocations which last
> > second allocation attempt depends on from failing.
> 
> Read that comment again. I believe you have misunderstood it. It is not
> about gfp flags at all. It is that we simply never invoke oom killer
> just because of small allocation fluctuations.

???

Does "that comment" refer to

  Go through the zonelist yet one more time, keep very high watermark
  here, this is only to catch a parallel oom killing, we must fail if
  we're still under heavy pressure.

part? Then, I know it is not about gfp flags.

Am I misunderstanding what "we must fail" means? My interpretation is
that "we must call out_of_memory() if the system is under memory pressure
that is enough to fail ALLOC_WMARK_HIGH". And I think that "enough to fail
ALLOC_WMARK_MIN" is heavier pressure than "enough to fail ALLOC_WMARK_HIGH".

Does "that comment" refer to

  Elaborating the comment: the reason for the high wmark is to reduce
  the likelihood of livelocks and be sure to invoke the OOM killer, if
  we're still under pressure and reclaim just failed. The high wmark is
  used to be sure the failure of reclaim isn't going to be ignored. If
  using the min wmark like you propose there's risk of livelock or
  anyway of delayed OOM killer invocation.

part? Then, I know it is not about gfp flags.

But how can OOM livelock happen when the last second allocation does not
wait for memory reclaim (because __GFP_DIRECT_RECLAIM is masked) ?
The last second allocation shall return immediately, and we will call
out_of_memory() if the last second allocation failed.

> 
> [...]
> >   Thread1           Thread2           Thread3
> > 
> >     Enters __alloc_pages_may_oom().
> >                     Enters __alloc_pages_may_oom().
> >                                       Enters __alloc_pages_may_oom().
> >     Preempted by somebody else.
> >                     Preempted by somebody else.
> >                                       mutex_trylock(&oom_lock) succeeds.
> >                                       get_page_from_freelist(ALLOC_WMARK_HIGH) fails. And get_page_from_freelist(ALLOC_WMARK_MIN) would have failed.
> >                                       Calls out_of_memory() and kills a not-such-memhog victim.
> >                                       Calls mutex_unlock(&oom_lock)
> >                     Returns from preemption.
> >                     mutex_trylock(&oom_lock) succeeds.
> >                     get_page_from_freelist(ALLOC_WMARK_HIGH) fails. But get_page_from_freelist(ALLOC_WMARK_MIN) would have succeeded.
> >                     Calls out_of_memory() and kills next not-such-memhog victim.
> >                     Calls mutex_unlock(&oom_lock)
> >     Returns from preemption.
> >     mutex_trylock(&oom_lock) succeeds.
> >     get_page_from_freelist(ALLOC_WMARK_HIGH) fails. But get_page_from_freelist(ALLOC_WMARK_MIN) would have succeeded.
> >     Calls out_of_memory() and kills next not-such-memhog victim.
> >     Calls mutex_unlock(&oom_lock)
> > 
> > and Thread1/Thread2 did not need to OOM-kill if ALLOC_WMARK_MIN were used.
> > When we hit sequence like above, using ALLOC_WMARK_HIGH for last second allocation
> > attempt unlikely helps avoiding potential parallel OOM killing. Rather, using
> > ALLOC_WMARK_MIN likely helps avoiding potential parallel OOM killing.
> 
> I am not sure such a scenario matters all that much because it assumes
> that the oom victim doesn't really free much memory [1] (basically less than
> HIGH-MIN). Most OOM situation simply have a memory hog consuming
> significant amount of memory.

The OOM killer does not always kill a memory hog consuming a significant
amount of memory. The OOM killer kills the process with the highest OOM
score (or one of its children instead, if it has any). I don't think it is
appropriate to assume that an OOM victim will free enough memory for
ALLOC_WMARK_HIGH to succeed.
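
For context, that score is computed by oom_badness(); a simplified sketch
from memory (page-table pages are also counted in the real code, and
various special cases are omitted):

	/*
	 * p: candidate task, totalpages: usable memory in the OOM domain.
	 * The score is dominated by the task's memory footprint, but
	 * oom_score_adj shifts it by totalpages/1000 per unit, so the
	 * selected victim is not necessarily the biggest memory hog.
	 */
	points = get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS);
	points += (long)p->signal->oom_score_adj * totalpages / 1000;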

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] mm,oom: Try last second allocation before and after selecting an OOM victim.
  2017-11-01 14:38                       ` Tetsuo Handa
@ 2017-11-01 14:48                         ` Michal Hocko
  2017-11-01 15:37                           ` Tetsuo Handa
  0 siblings, 1 reply; 15+ messages in thread
From: Michal Hocko @ 2017-11-01 14:48 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: aarcange, akpm, linux-mm, linux-kernel, rientjes, hannes, mjaggi,
	mgorman, oleg, vdavydov.dev, vbabka

On Wed 01-11-17 23:38:49, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Wed 01-11-17 20:58:50, Tetsuo Handa wrote:
> > > > > But doing ALLOC_OOM for last second allocation attempt from out_of_memory() involve
> > > > > duplicating code (e.g. rebuilding zone list).
> > > > 
> > > > Why would you do it? Do not blindly copy and paste code without
> > > > a good reason. What kind of problem does this actually solve?
> > > 
> > > prepare_alloc_pages()/finalise_ac() initializes as
> > > 
> > > 	ac->high_zoneidx = gfp_zone(gfp_mask);
> > > 	ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
> > > 	ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
> > > 						     ac->high_zoneidx, ac->nodemask);
> > > 
> > > and selecting as an OOM victim reinitializes as
> > > 
> > > 	ac->zonelist = node_zonelist(numa_node_id(), gfp_mask);
> > > 	ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
> > > 						     ac->high_zoneidx, ac->nodemask);
> > > 
> > > and I assume that this reinitialization might affect which memory reserve
> > > the OOM victim allocates from.
> > > 
> > > You mean such difference is too trivial to care about?
> > 
> > You keep repeating what the _current_ code does without explaining _why_
> > do we need the same thing in the oom path. Could you finaly answer my
> > question please?
> 
> Because I consider that following what the current code does is reasonable
> unless there are explicit reasons not to follow.

Following this pattern makes the code a mess over time because nobody
remembers why something is done in a specific way anymore. Everybody just
keeps the ball rolling because they are afraid to change code they don't
understand. Don't do that!

[...]
> Does "that comment" refer to
> 
>   Elaborating the comment: the reason for the high wmark is to reduce
>   the likelihood of livelocks and be sure to invoke the OOM killer, if
>   we're still under pressure and reclaim just failed. The high wmark is
>   used to be sure the failure of reclaim isn't going to be ignored. If
>   using the min wmark like you propose there's risk of livelock or
>   anyway of delayed OOM killer invocation.
> 
> part? Then, I know it is not about gfp flags.
> 
> But how can OOM livelock happen when the last second allocation does not
> wait for memory reclaim (because __GFP_DIRECT_RECLAIM is masked) ?
> The last second allocation shall return immediately, and we will call
> out_of_memory() if the last second allocation failed.

I think Andrea just wanted to say that we do want to invoke the OOM killer
and resolve the memory pressure rather than keep looping in the
reclaim/oom path just because a few pages are allocated and freed in
the meantime.

[...]
> > I am not sure such a scenario matters all that much because it assumes
> > that the oom victim doesn't really free much memory [1] (basically less than
> > HIGH-MIN). Most OOM situation simply have a memory hog consuming
> > significant amount of memory.
> 
> The OOM killer does not always kill a memory hog consuming significant amount
> of memory. The OOM killer kills a process with highest OOM score (and instead
> one of its children if any). I don't think that assuming an OOM victim will free
> memory enough to succeed ALLOC_WMARK_HIGH is appropriate.

OK, so let's agree to disagree. I claim that we shouldn't care all that
much. If any of the current heuristics turns out to lead to killing too
many tasks then we should simply remove it rather than keep bloating
already complex code with more and more kludges.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] mm,oom: Try last second allocation before and after selecting an OOM victim.
  2017-11-01 14:48                         ` Michal Hocko
@ 2017-11-01 15:37                           ` Tetsuo Handa
  0 siblings, 0 replies; 15+ messages in thread
From: Tetsuo Handa @ 2017-11-01 15:37 UTC (permalink / raw)
  To: mhocko
  Cc: aarcange, akpm, linux-mm, linux-kernel, rientjes, hannes, mjaggi,
	mgorman, oleg, vdavydov.dev, vbabka

Michal Hocko wrote:
> > Does "that comment" refer to
> > 
> >   Elaborating the comment: the reason for the high wmark is to reduce
> >   the likelihood of livelocks and be sure to invoke the OOM killer, if
> >   we're still under pressure and reclaim just failed. The high wmark is
> >   used to be sure the failure of reclaim isn't going to be ignored. If
> >   using the min wmark like you propose there's risk of livelock or
> >   anyway of delayed OOM killer invocation.
> > 
> > part? Then, I know it is not about gfp flags.
> > 
> > But how can OOM livelock happen when the last second allocation does not
> > wait for memory reclaim (because __GFP_DIRECT_RECLAIM is masked) ?
> > The last second allocation shall return immediately, and we will call
> > out_of_memory() if the last second allocation failed.
> 
> I think Andrea just wanted to say that we do want to invoke OOM killer
> and resolve the memory pressure rather than keep looping in the
> reclaim/oom path just because there are few pages allocated and freed in
> the meantime.

I see. Then that motivation no longer applies to the current code, except that

> 
> [...]
> > > I am not sure such a scenario matters all that much because it assumes
> > > that the oom victim doesn't really free much memory [1] (basically less than
> > > HIGH-MIN). Most OOM situation simply have a memory hog consuming
> > > significant amount of memory.
> > 
> > The OOM killer does not always kill a memory hog consuming significant amount
> > of memory. The OOM killer kills a process with highest OOM score (and instead
> > one of its children if any). I don't think that assuming an OOM victim will free
> > memory enough to succeed ALLOC_WMARK_HIGH is appropriate.
> 
> OK, so let's agree to disagree. I claim that we shouldn't care all that
> much. If any of the current heuristics turns out to lead to killing too
> many tasks then we should simply remove it rather than keep bloating an
> already complex code with more and more kluges.

using ALLOC_WMARK_HIGH might cause more OOM-killing than ALLOC_WMARK_MIN.

Thanks for clarification.

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2017-11-01 15:37 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-10-28  8:07 [PATCH] mm,oom: Try last second allocation before and after selecting an OOM victim Tetsuo Handa
2017-10-30 14:18 ` Michal Hocko
2017-10-31 10:40   ` Tetsuo Handa
2017-10-31 12:10     ` Michal Hocko
2017-10-31 12:42       ` Tetsuo Handa
2017-10-31 12:48         ` Michal Hocko
2017-10-31 13:13           ` Tetsuo Handa
2017-10-31 13:22             ` Michal Hocko
2017-10-31 13:51               ` Tetsuo Handa
2017-10-31 14:10                 ` Michal Hocko
2017-11-01 11:58                   ` Tetsuo Handa
2017-11-01 12:46                     ` Michal Hocko
2017-11-01 14:38                       ` Tetsuo Handa
2017-11-01 14:48                         ` Michal Hocko
2017-11-01 15:37                           ` Tetsuo Handa
