linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/3 -v3] GFP_NOFAIL cleanups
@ 2016-12-20 13:49 Michal Hocko
  2016-12-20 13:49 ` [PATCH 1/3] mm: consolidate GFP_NOFAIL checks in the allocator slowpath Michal Hocko
                   ` (4 more replies)
  0 siblings, 5 replies; 24+ messages in thread
From: Michal Hocko @ 2016-12-20 13:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Tetsuo Handa, David Rientjes, Mel Gorman,
	Hillf Danton, linux-mm, LKML

Hi,
This series was posted [1] initially and later reduced to a single patch
[2].  Johannes then suggested [3] splitting up the second patch and making
the access to memory reserves by __GFP_NOFAIL requests which do not
invoke the OOM killer a separate change. That is patch 3 now.

Tetsuo has noticed [4] that recent changes have altered the GFP_NOFAIL
semantics for costly order requests. I believe the primary reason this
happened is that our GFP_NOFAIL checks are too scattered, which makes it
easy to forget to add one. That is why patch 1 consolidates all the
nofail handling in a single place. This should make the code more
maintainable.

Patch 2 on top is a further attempt to make the GFP_NOFAIL semantics
less surprising. As things stand, GFP_NOFAIL overrides the OOM killer
prevention code, which is both subtle and not really needed; patch 2 has
more details about the issues this might cause. We have also seen a
report where __GFP_NOFAIL|GFP_NOFS requests invoke the OOM killer
prematurely.

Patch 3 is an attempt to reduce chances of GFP_NOFAIL requests being
preempted by other memory consumers by giving them access to memory
reserves.
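
The combined effect of the three patches can be sketched as a toy policy
function. The flag bits, names and values below are illustrative stand-ins,
not the kernel's real definitions; only the decision logic mirrors the
series:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag bits, not the kernel's real values. */
#define __GFP_NOFAIL 0x1u
#define __GFP_FS     0x2u

/* Toy summary of the resulting policy for a looping __GFP_NOFAIL
 * request (struct and function names are hypothetical). */
struct nofail_policy {
	bool may_invoke_oom;   /* patch 2: no longer forced by __GFP_NOFAIL */
	bool may_use_reserves; /* patch 3: ALLOC_HARDER access instead */
};

static struct nofail_policy nofail_policy_for(unsigned int gfp_mask)
{
	struct nofail_policy p = { false, false };

	/* Non-failing requests get partial access to memory reserves. */
	if (gfp_mask & __GFP_NOFAIL)
		p.may_use_reserves = true;
	/* The OOM killer now depends on the usual rules (__GFP_FS etc.),
	 * not on __GFP_NOFAIL. */
	if (gfp_mask & __GFP_FS)
		p.may_invoke_oom = true;
	return p;
}
```

In this model a GFP_NOFS|__GFP_NOFAIL request, like the one in the report
referenced by patch 2, no longer reaches the OOM killer but may dip into
the reserves instead.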

[1] http://lkml.kernel.org/r/20161123064925.9716-1-mhocko@kernel.org
[2] http://lkml.kernel.org/r/20161214150706.27412-1-mhocko@kernel.org
[3] http://lkml.kernel.org/r/20161216173151.GA23182@cmpxchg.org
[4] http://lkml.kernel.org/r/1479387004-5998-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 1/3] mm: consolidate GFP_NOFAIL checks in the allocator slowpath
  2016-12-20 13:49 [PATCH 0/3 -v3] GFP_NOFAIL cleanups Michal Hocko
@ 2016-12-20 13:49 ` Michal Hocko
  2016-12-20 13:49 ` [PATCH 2/3] mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically Michal Hocko
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 24+ messages in thread
From: Michal Hocko @ 2016-12-20 13:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Tetsuo Handa, David Rientjes, Mel Gorman,
	Hillf Danton, linux-mm, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Tetsuo Handa has pointed out that 0a0337e0d1d1 ("mm, oom: rework oom
detection") has subtly changed the semantics for costly high order requests
with __GFP_NOFAIL and without __GFP_REPEAT; those can fail right now.
My code inspection didn't reveal any such users in the tree, but it is
true that this might lead to unexpected allocation failures and
subsequent OOPs.

The handling of GFP_NOFAIL in __alloc_pages_slowpath is currently hard
to follow. There are a few special cases, but we lack a catch-all place
to make sure that a non-failing allocation cannot fail. This patch
reorganizes the code a bit and puts all those special cases under the
nopage label, which is the generic go-to-fail path. Non-failing
allocations are retried there, while those that cannot retry, like
non-sleeping allocations, go to the failure point directly. This should
make the code flow much easier to follow and less error prone for
future changes.

While we are at it, we have to move the stall check up to catch
potentially looping non-failing allocations.
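
The reorganized flow can be modeled outside the kernel as a small sketch.
The flag values and the outcome enum below are illustrative stand-ins, not
the real kernel definitions; only the shape of the single catch-all check
mirrors the patch:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-ins for gfp bits (not the kernel's real values). */
#define __GFP_DIRECT_RECLAIM 0x01u
#define __GFP_NOFAIL         0x02u

enum slowpath_outcome { OUTCOME_RETRY, OUTCOME_FAIL };

/*
 * Model of the single catch-all "nopage" decision after this patch:
 * every path that would fail funnels through one check, so a
 * __GFP_NOFAIL request can no longer slip through a forgotten
 * special case.
 */
static enum slowpath_outcome nopage(unsigned int gfp_mask)
{
	bool can_direct_reclaim = gfp_mask & __GFP_DIRECT_RECLAIM;

	if (gfp_mask & __GFP_NOFAIL) {
		/* A non-sleeping __GFP_NOFAIL request cannot make
		 * progress, so it is the one case that is still
		 * allowed to fail (with a warning in the kernel). */
		if (!can_direct_reclaim)
			return OUTCOME_FAIL;
		return OUTCOME_RETRY;
	}
	return OUTCOME_FAIL;
}
```

The point of the consolidation is exactly that this decision exists in one
place rather than being scattered across several early-bail checks.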

Changes since v1
- do not skip direct reclaim for TIF_MEMDIE && GFP_NOFAIL as per Hillf
- do not skip __alloc_pages_may_oom for TIF_MEMDIE && GFP_NOFAIL as
  per Tetsuo

Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
---
 mm/page_alloc.c | 75 +++++++++++++++++++++++++++++++++------------------------
 1 file changed, 44 insertions(+), 31 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1c24112308d6..c8eed66d8abb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3647,35 +3647,21 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		goto got_pg;
 
 	/* Caller is not willing to reclaim, we can't balance anything */
-	if (!can_direct_reclaim) {
-		/*
-		 * All existing users of the __GFP_NOFAIL are blockable, so warn
-		 * of any new users that actually allow this type of allocation
-		 * to fail.
-		 */
-		WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL);
+	if (!can_direct_reclaim)
 		goto nopage;
-	}
 
-	/* Avoid recursion of direct reclaim */
-	if (current->flags & PF_MEMALLOC) {
-		/*
-		 * __GFP_NOFAIL request from this context is rather bizarre
-		 * because we cannot reclaim anything and only can loop waiting
-		 * for somebody to do a work for us.
-		 */
-		if (WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) {
-			cond_resched();
-			goto retry;
-		}
-		goto nopage;
+	/* Make sure we know about allocations which stall for too long */
+	if (time_after(jiffies, alloc_start + stall_timeout)) {
+		warn_alloc(gfp_mask,
+			"page allocation stalls for %ums, order:%u",
+			jiffies_to_msecs(jiffies-alloc_start), order);
+		stall_timeout += 10 * HZ;
 	}
 
-	/* Avoid allocations with no watermarks from looping endlessly */
-	if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOFAIL))
+	/* Avoid recursion of direct reclaim */
+	if (current->flags & PF_MEMALLOC)
 		goto nopage;
 
-
 	/* Try direct reclaim and then allocating */
 	page = __alloc_pages_direct_reclaim(gfp_mask, order, alloc_flags, ac,
 							&did_some_progress);
@@ -3699,14 +3685,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT))
 		goto nopage;
 
-	/* Make sure we know about allocations which stall for too long */
-	if (time_after(jiffies, alloc_start + stall_timeout)) {
-		warn_alloc(gfp_mask,
-			"page allocation stalls for %ums, order:%u",
-			jiffies_to_msecs(jiffies-alloc_start), order);
-		stall_timeout += 10 * HZ;
-	}
-
 	if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
 				 did_some_progress > 0, &no_progress_loops))
 		goto retry;
@@ -3728,6 +3706,10 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	if (page)
 		goto got_pg;
 
+	/* Avoid allocations with no watermarks from looping endlessly */
+	if (test_thread_flag(TIF_MEMDIE))
+		goto nopage;
+
 	/* Retry as long as the OOM killer is making progress */
 	if (did_some_progress) {
 		no_progress_loops = 0;
@@ -3735,6 +3717,37 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	}
 
 nopage:
+	/*
+	 * Make sure that __GFP_NOFAIL request doesn't leak out and make sure
+	 * we always retry
+	 */
+	if (gfp_mask & __GFP_NOFAIL) {
+		/*
+		 * All existing users of the __GFP_NOFAIL are blockable, so warn
+		 * of any new users that actually require GFP_NOWAIT
+		 */
+		if (WARN_ON_ONCE(!can_direct_reclaim))
+			goto fail;
+
+		/*
+		 * PF_MEMALLOC request from this context is rather bizarre
+		 * because we cannot reclaim anything and only can loop waiting
+		 * for somebody to do a work for us
+		 */
+		WARN_ON_ONCE(current->flags & PF_MEMALLOC);
+
+		/*
+		 * non failing costly orders are a hard requirement which we
+		 * are not prepared for much so let's warn about these users
+		 * so that we can identify them and convert them to something
+		 * else.
+		 */
+		WARN_ON_ONCE(order > PAGE_ALLOC_COSTLY_ORDER);
+
+		cond_resched();
+		goto retry;
+	}
+fail:
 	warn_alloc(gfp_mask,
 			"page allocation failure: order:%u", order);
 got_pg:
-- 
2.10.2


* [PATCH 2/3] mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically
  2016-12-20 13:49 [PATCH 0/3 -v3] GFP_NOFAIL cleanups Michal Hocko
  2016-12-20 13:49 ` [PATCH 1/3] mm: consolidate GFP_NOFAIL checks in the allocator slowpath Michal Hocko
@ 2016-12-20 13:49 ` Michal Hocko
  2016-12-20 15:31   ` Tetsuo Handa
                     ` (2 more replies)
  2016-12-20 13:49 ` [PATCH 3/3] mm: help __GFP_NOFAIL allocations which do not trigger OOM killer Michal Hocko
                   ` (2 subsequent siblings)
  4 siblings, 3 replies; 24+ messages in thread
From: Michal Hocko @ 2016-12-20 13:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Tetsuo Handa, David Rientjes, Mel Gorman,
	Hillf Danton, linux-mm, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

__alloc_pages_may_oom makes sure to skip the OOM killer depending on
the allocation request. This includes lowmem requests, costly high
order requests and others. For a long time __GFP_NOFAIL acted as an
override for all those rules. This is not documented and it can be quite
surprising as well. E.g. GFP_NOFS requests do not invoke the OOM
killer but GFP_NOFS|__GFP_NOFAIL does, so if we try to convert some of
the existing open coded loops around the allocator to nofail requests
(and we have done that in the past) then such a change would have a
non-trivial side effect which is far from obvious. Note that the primary
motivation for skipping the OOM killer is to prevent a premature
invocation.

The exception was added by 82553a937f12 ("oom: invoke oom killer
for __GFP_NOFAIL"). The changelog points out that the OOM killer has to
be invoked, otherwise the request would loop forever. But this
argument is rather weak because the OOM killer doesn't really guarantee
forward progress for those exceptional cases:
	- it will hardly help to form a costly order page, and the attempt
	  can end in a system panic because there is no OOM-killable task
	  left - I believe we certainly do not want to put the system
	  down just because there is a nasty driver asking for an order-9
	  page with GFP_NOFAIL without realizing all the consequences. It
	  is much better for such a request to loop forever than to cause
	  massive system disruption
	- lowmem is also highly unlikely to be freed by the OOM killer
	- a GFP_NOFS request could trigger while there is still a lot of
	  memory pinned by filesystems.

A premature OOM killer invocation is a real issue, as reported by Nils Holland:
	kworker/u4:5 invoked oom-killer: gfp_mask=0x2400840(GFP_NOFS|__GFP_NOFAIL), nodemask=0, order=0, oom_score_adj=0
	kworker/u4:5 cpuset=/ mems_allowed=0
	CPU: 1 PID: 2603 Comm: kworker/u4:5 Not tainted 4.9.0-gentoo #2
	Hardware name: Hewlett-Packard Compaq 15 Notebook PC/21F7, BIOS F.22 08/06/2014
	Workqueue: writeback wb_workfn (flush-btrfs-1)
	 eff0b604 c142bcce eff0b734 00000000 eff0b634 c1163332 00000000 00000292
	 eff0b634 c1431876 eff0b638 e7fb0b00 e7fa2900 e7fa2900 c1b58785 eff0b734
	 eff0b678 c110795f c1043895 eff0b664 c11075c7 00000007 00000000 00000000
	Call Trace:
	 [<c142bcce>] dump_stack+0x47/0x69
	 [<c1163332>] dump_header+0x60/0x178
	 [<c1431876>] ? ___ratelimit+0x86/0xe0
	 [<c110795f>] oom_kill_process+0x20f/0x3d0
	 [<c1043895>] ? has_capability_noaudit+0x15/0x20
	 [<c11075c7>] ? oom_badness.part.13+0xb7/0x130
	 [<c1107df9>] out_of_memory+0xd9/0x260
	 [<c110ba0b>] __alloc_pages_nodemask+0xbfb/0xc80
	 [<c110414d>] pagecache_get_page+0xad/0x270
	 [<c13664a6>] alloc_extent_buffer+0x116/0x3e0
	 [<c1334a2e>] btrfs_find_create_tree_block+0xe/0x10
	[...]
	Normal free:41332kB min:41368kB low:51708kB high:62048kB active_anon:0kB inactive_anon:0kB active_file:532748kB inactive_file:44kB unevictable:0kB writepending:24kB present:897016kB managed:836248kB mlocked:0kB slab_reclaimable:159448kB slab_unreclaimable:69608kB kernel_stack:1112kB pagetables:1404kB bounce:0kB free_pcp:528kB local_pcp:340kB free_cma:0kB
	lowmem_reserve[]: 0 0 21292 21292
	HighMem free:781660kB min:512kB low:34356kB high:68200kB active_anon:234740kB inactive_anon:360kB active_file:557232kB inactive_file:1127804kB unevictable:0kB writepending:2592kB present:2725384kB managed:2725384kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:800kB local_pcp:608kB free_cma:0kB

This is a GFP_NOFS|__GFP_NOFAIL request which invokes the OOM killer
because there is clearly nothing reclaimable in the Normal zone, while
there is a lot of page cache which is most probably pinned by the
filesystem but which GFP_NOFS cannot reclaim.

This patch simply removes the __GFP_NOFAIL special case in order to get
clearer semantics without surprising side effects.
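
The skip rules that now apply unconditionally (i.e. even to __GFP_NOFAIL
requests) can be modeled as a pure predicate. The struct, constants and
function name below are simplified stand-ins for the kernel's, chosen only
to mirror the checks in __alloc_pages_may_oom:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative constants; not the kernel's real values. */
#define PAGE_ALLOC_COSTLY_ORDER 3
#define ZONE_NORMAL 2

struct oom_request {
	unsigned int order;
	int high_zoneidx;
	bool dumping_core;
	bool gfp_thisnode;
};

/*
 * After this patch these checks apply to every request, __GFP_NOFAIL
 * included: the flag no longer overrides the decision to skip the
 * OOM killer.
 */
static bool may_invoke_oom_killer(const struct oom_request *req)
{
	if (req->dumping_core)	/* coredumps deplete memory reserves */
		return false;
	if (req->order > PAGE_ALLOC_COSTLY_ORDER)  /* won't help high order */
		return false;
	if (req->high_zoneidx < ZONE_NORMAL)	/* no needless kills for lowmem */
		return false;
	if (req->gfp_thisnode)	/* may not free memory on that node */
		return false;
	return true;
}
```

In the pre-patch code the whole predicate was short-circuited to true
whenever __GFP_NOFAIL was set; removing that branch is the entire change.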

Reported-by: Nils Holland <nholland@tisys.org>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c   |  2 +-
 mm/page_alloc.c | 49 ++++++++++++++++++++++++-------------------------
 2 files changed, 25 insertions(+), 26 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index ec9f11d4f094..12a6fce85f61 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1013,7 +1013,7 @@ bool out_of_memory(struct oom_control *oc)
 	 * make sure exclude 0 mask - all other users should have at least
 	 * ___GFP_DIRECT_RECLAIM to get here.
 	 */
-	if (oc->gfp_mask && !(oc->gfp_mask & (__GFP_FS|__GFP_NOFAIL)))
+	if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS))
 		return true;
 
 	/*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c8eed66d8abb..2dda7c3eba52 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3098,32 +3098,31 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	if (page)
 		goto out;
 
-	if (!(gfp_mask & __GFP_NOFAIL)) {
-		/* Coredumps can quickly deplete all memory reserves */
-		if (current->flags & PF_DUMPCORE)
-			goto out;
-		/* The OOM killer will not help higher order allocs */
-		if (order > PAGE_ALLOC_COSTLY_ORDER)
-			goto out;
-		/* The OOM killer does not needlessly kill tasks for lowmem */
-		if (ac->high_zoneidx < ZONE_NORMAL)
-			goto out;
-		if (pm_suspended_storage())
-			goto out;
-		/*
-		 * XXX: GFP_NOFS allocations should rather fail than rely on
-		 * other request to make a forward progress.
-		 * We are in an unfortunate situation where out_of_memory cannot
-		 * do much for this context but let's try it to at least get
-		 * access to memory reserved if the current task is killed (see
-		 * out_of_memory). Once filesystems are ready to handle allocation
-		 * failures more gracefully we should just bail out here.
-		 */
+	/* Coredumps can quickly deplete all memory reserves */
+	if (current->flags & PF_DUMPCORE)
+		goto out;
+	/* The OOM killer will not help higher order allocs */
+	if (order > PAGE_ALLOC_COSTLY_ORDER)
+		goto out;
+	/* The OOM killer does not needlessly kill tasks for lowmem */
+	if (ac->high_zoneidx < ZONE_NORMAL)
+		goto out;
+	if (pm_suspended_storage())
+		goto out;
+	/*
+	 * XXX: GFP_NOFS allocations should rather fail than rely on
+	 * other request to make a forward progress.
+	 * We are in an unfortunate situation where out_of_memory cannot
+	 * do much for this context but let's try it to at least get
+	 * access to memory reserved if the current task is killed (see
+	 * out_of_memory). Once filesystems are ready to handle allocation
+	 * failures more gracefully we should just bail out here.
+	 */
+
+	/* The OOM killer may not free memory on a specific node */
+	if (gfp_mask & __GFP_THISNODE)
+		goto out;
 
-		/* The OOM killer may not free memory on a specific node */
-		if (gfp_mask & __GFP_THISNODE)
-			goto out;
-	}
 	/* Exhausted what can be done so it's blamo time */
 	if (out_of_memory(&oc) || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) {
 		*did_some_progress = 1;
-- 
2.10.2


* [PATCH 3/3] mm: help __GFP_NOFAIL allocations which do not trigger OOM killer
  2016-12-20 13:49 [PATCH 0/3 -v3] GFP_NOFAIL cleanups Michal Hocko
  2016-12-20 13:49 ` [PATCH 1/3] mm: consolidate GFP_NOFAIL checks in the allocator slowpath Michal Hocko
  2016-12-20 13:49 ` [PATCH 2/3] mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically Michal Hocko
@ 2016-12-20 13:49 ` Michal Hocko
  2017-01-02 15:49 ` [PATCH 0/3 -v3] GFP_NOFAIL cleanups Michal Hocko
  2017-01-18 18:42 ` Michal Hocko
  4 siblings, 0 replies; 24+ messages in thread
From: Michal Hocko @ 2016-12-20 13:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Tetsuo Handa, David Rientjes, Mel Gorman,
	Hillf Danton, linux-mm, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Now that __GFP_NOFAIL doesn't override decisions to skip the OOM killer
we are left with requests which have to loop inside the allocator
without invoking the OOM killer (e.g. GFP_NOFS|__GFP_NOFAIL used by fs
code), and so, in very unlikely situations, they might loop forever -
e.g. other parallel requests could starve them.

This patch tries to limit the likelihood of such a lockup by giving
these __GFP_NOFAIL requests a chance to move on by consuming a small
part of the memory reserves. We are using ALLOC_HARDER, which should be
enough to prevent starvation by regular allocation requests, yet
shouldn't consume enough of the reserves to disrupt high priority
requests (ALLOC_HIGH).

While we are at it, let's introduce a helper __alloc_pages_cpuset_fallback
which enforces the cpusets but allows falling back to ignoring them if
the first attempt fails. __GFP_NOFAIL requests can be considered
important enough to allow a cpuset runaway in order for the system to
move on. It is highly unlikely that any of these will be GFP_USER anyway.
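
The two-step cpuset fallback in the new helper can be sketched in
isolation. The callback type and the stub allocators below are
hypothetical scaffolding for the sketch, not kernel API; only the
try-with-cpuset-then-retry-without shape mirrors the helper:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative flag bit, not the kernel's real value. */
#define ALLOC_CPUSET 0x1u

typedef void *(*freelist_fn)(unsigned int alloc_flags, void *ctx);

/* Model of __alloc_pages_cpuset_fallback's two attempts. */
static void *alloc_cpuset_fallback(freelist_fn get_page,
				   unsigned int alloc_flags, void *ctx)
{
	/* First attempt honours cpuset restrictions... */
	void *page = get_page(alloc_flags | ALLOC_CPUSET, ctx);

	/* ...but if the allowed nodes are depleted, retry ignoring them. */
	if (!page)
		page = get_page(alloc_flags, ctx);
	return page;
}

/* Stub: succeeds only without ALLOC_CPUSET, simulating depleted
 * cpuset nodes. */
static void *depleted_cpuset_stub(unsigned int alloc_flags, void *ctx)
{
	static char page;
	(void)ctx;
	return (alloc_flags & ALLOC_CPUSET) ? NULL : &page;
}

/* Stub: always succeeds, so the first (cpuset-honouring) attempt wins. */
static void *cpuset_ok_stub(unsigned int alloc_flags, void *ctx)
{
	static char page;
	(void)alloc_flags;
	(void)ctx;
	return &page;
}
```

Either way the request eventually gets a page, which is the property a
non-failing allocation needs.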

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/page_alloc.c | 46 ++++++++++++++++++++++++++++++++++++----------
 1 file changed, 36 insertions(+), 10 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2dda7c3eba52..e8e551015d48 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3064,6 +3064,26 @@ void warn_alloc(gfp_t gfp_mask, const char *fmt, ...)
 }
 
 static inline struct page *
+__alloc_pages_cpuset_fallback(gfp_t gfp_mask, unsigned int order,
+			      unsigned int alloc_flags,
+			      const struct alloc_context *ac)
+{
+	struct page *page;
+
+	page = get_page_from_freelist(gfp_mask, order,
+			alloc_flags|ALLOC_CPUSET, ac);
+	/*
+	 * fallback to ignore cpuset restriction if our nodes
+	 * are depleted
+	 */
+	if (!page)
+		page = get_page_from_freelist(gfp_mask, order,
+				alloc_flags, ac);
+
+	return page;
+}
+
+static inline struct page *
 __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	const struct alloc_context *ac, unsigned long *did_some_progress)
 {
@@ -3127,17 +3147,13 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	if (out_of_memory(&oc) || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) {
 		*did_some_progress = 1;
 
-		if (gfp_mask & __GFP_NOFAIL) {
-			page = get_page_from_freelist(gfp_mask, order,
-					ALLOC_NO_WATERMARKS|ALLOC_CPUSET, ac);
-			/*
-			 * fallback to ignore cpuset restriction if our nodes
-			 * are depleted
-			 */
-			if (!page)
-				page = get_page_from_freelist(gfp_mask, order,
+		/*
+		 * Help non-failing allocations by giving them access to memory
+		 * reserves
+		 */
+		if (gfp_mask & __GFP_NOFAIL)
+			page = __alloc_pages_cpuset_fallback(gfp_mask, order,
 					ALLOC_NO_WATERMARKS, ac);
-		}
 	}
 out:
 	mutex_unlock(&oom_lock);
@@ -3743,6 +3759,16 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		 */
 		WARN_ON_ONCE(order > PAGE_ALLOC_COSTLY_ORDER);
 
+		/*
+		 * Help non-failing allocations by giving them access to memory
+		 * reserves but do not use ALLOC_NO_WATERMARKS because this
+		 * could deplete whole memory reserves which would just make
+		 * the situation worse
+		 */
+		page = __alloc_pages_cpuset_fallback(gfp_mask, order, ALLOC_HARDER, ac);
+		if (page)
+			goto got_pg;
+
 		cond_resched();
 		goto retry;
 	}
-- 
2.10.2


* Re: [PATCH 2/3] mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically
  2016-12-20 13:49 ` [PATCH 2/3] mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically Michal Hocko
@ 2016-12-20 15:31   ` Tetsuo Handa
  2016-12-21  8:15     ` Michal Hocko
  2017-01-19 18:41   ` Johannes Weiner
  2017-01-20  8:33   ` Hillf Danton
  2 siblings, 1 reply; 24+ messages in thread
From: Tetsuo Handa @ 2016-12-20 15:31 UTC (permalink / raw)
  To: mhocko, akpm
  Cc: hannes, rientjes, mgorman, hillf.zj, linux-mm, linux-kernel, mhocko

Michal Hocko wrote:
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c8eed66d8abb..2dda7c3eba52 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3098,32 +3098,31 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
>  	if (page)
>  		goto out;
>  
> -	if (!(gfp_mask & __GFP_NOFAIL)) {
> -		/* Coredumps can quickly deplete all memory reserves */
> -		if (current->flags & PF_DUMPCORE)
> -			goto out;
> -		/* The OOM killer will not help higher order allocs */
> -		if (order > PAGE_ALLOC_COSTLY_ORDER)
> -			goto out;
> -		/* The OOM killer does not needlessly kill tasks for lowmem */
> -		if (ac->high_zoneidx < ZONE_NORMAL)
> -			goto out;
> -		if (pm_suspended_storage())
> -			goto out;
> -		/*
> -		 * XXX: GFP_NOFS allocations should rather fail than rely on
> -		 * other request to make a forward progress.
> -		 * We are in an unfortunate situation where out_of_memory cannot
> -		 * do much for this context but let's try it to at least get
> -		 * access to memory reserved if the current task is killed (see
> -		 * out_of_memory). Once filesystems are ready to handle allocation
> -		 * failures more gracefully we should just bail out here.
> -		 */
> +	/* Coredumps can quickly deplete all memory reserves */
> +	if (current->flags & PF_DUMPCORE)
> +		goto out;
> +	/* The OOM killer will not help higher order allocs */
> +	if (order > PAGE_ALLOC_COSTLY_ORDER)
> +		goto out;
> +	/* The OOM killer does not needlessly kill tasks for lowmem */
> +	if (ac->high_zoneidx < ZONE_NORMAL)
> +		goto out;
> +	if (pm_suspended_storage())
> +		goto out;
> +	/*
> +	 * XXX: GFP_NOFS allocations should rather fail than rely on
> +	 * other request to make a forward progress.
> +	 * We are in an unfortunate situation where out_of_memory cannot
> +	 * do much for this context but let's try it to at least get
> +	 * access to memory reserved if the current task is killed (see
> +	 * out_of_memory). Once filesystems are ready to handle allocation
> +	 * failures more gracefully we should just bail out here.
> +	 */
> +
> +	/* The OOM killer may not free memory on a specific node */
> +	if (gfp_mask & __GFP_THISNODE)
> +		goto out;
>  
> -		/* The OOM killer may not free memory on a specific node */
> -		if (gfp_mask & __GFP_THISNODE)
> -			goto out;
> -	}
>  	/* Exhausted what can be done so it's blamo time */
>  	if (out_of_memory(&oc) || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) {
>  		*did_some_progress = 1;

Why do we need to change this part in this patch?

This change silently prohibits invoking the OOM killer for e.g. a costly
GFP_KERNEL allocation. While it would be better if vmalloc() could be
used, there might be users who cannot accept vmalloc() as a fallback
(e.g. CONFIG_MMU=n, where vmalloc() == kmalloc()?).

This change is not "do not enforce OOM killer automatically" but "never
allow the OOM killer". No exception is allowed. If we change this part,
the title should be something stronger like "mm,oom: Never allow OOM
killer for coredumps, costly allocations, lowmem etc.".


* Re: [PATCH 2/3] mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically
  2016-12-20 15:31   ` Tetsuo Handa
@ 2016-12-21  8:15     ` Michal Hocko
  0 siblings, 0 replies; 24+ messages in thread
From: Michal Hocko @ 2016-12-21  8:15 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, hannes, rientjes, mgorman, hillf.zj, linux-mm, linux-kernel

On Wed 21-12-16 00:31:47, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index c8eed66d8abb..2dda7c3eba52 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -3098,32 +3098,31 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
> >  	if (page)
> >  		goto out;
> >  
> > -	if (!(gfp_mask & __GFP_NOFAIL)) {
> > -		/* Coredumps can quickly deplete all memory reserves */
> > -		if (current->flags & PF_DUMPCORE)
> > -			goto out;
> > -		/* The OOM killer will not help higher order allocs */
> > -		if (order > PAGE_ALLOC_COSTLY_ORDER)
> > -			goto out;
> > -		/* The OOM killer does not needlessly kill tasks for lowmem */
> > -		if (ac->high_zoneidx < ZONE_NORMAL)
> > -			goto out;
> > -		if (pm_suspended_storage())
> > -			goto out;
> > -		/*
> > -		 * XXX: GFP_NOFS allocations should rather fail than rely on
> > -		 * other request to make a forward progress.
> > -		 * We are in an unfortunate situation where out_of_memory cannot
> > -		 * do much for this context but let's try it to at least get
> > -		 * access to memory reserved if the current task is killed (see
> > -		 * out_of_memory). Once filesystems are ready to handle allocation
> > -		 * failures more gracefully we should just bail out here.
> > -		 */
> > +	/* Coredumps can quickly deplete all memory reserves */
> > +	if (current->flags & PF_DUMPCORE)
> > +		goto out;
> > +	/* The OOM killer will not help higher order allocs */
> > +	if (order > PAGE_ALLOC_COSTLY_ORDER)
> > +		goto out;
> > +	/* The OOM killer does not needlessly kill tasks for lowmem */
> > +	if (ac->high_zoneidx < ZONE_NORMAL)
> > +		goto out;
> > +	if (pm_suspended_storage())
> > +		goto out;
> > +	/*
> > +	 * XXX: GFP_NOFS allocations should rather fail than rely on
> > +	 * other request to make a forward progress.
> > +	 * We are in an unfortunate situation where out_of_memory cannot
> > +	 * do much for this context but let's try it to at least get
> > +	 * access to memory reserved if the current task is killed (see
> > +	 * out_of_memory). Once filesystems are ready to handle allocation
> > +	 * failures more gracefully we should just bail out here.
> > +	 */
> > +
> > +	/* The OOM killer may not free memory on a specific node */
> > +	if (gfp_mask & __GFP_THISNODE)
> > +		goto out;
> >  
> > -		/* The OOM killer may not free memory on a specific node */
> > -		if (gfp_mask & __GFP_THISNODE)
> > -			goto out;
> > -	}
> >  	/* Exhausted what can be done so it's blamo time */
> >  	if (out_of_memory(&oc) || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) {
> >  		*did_some_progress = 1;
> 
> Why do we need to change this part in this patch?
> 
> This change silently prohibits invoking the OOM killer for e.g. costly
> GFP_KERNEL allocation.

We have never allowed costly GFP_KERNEL requests to invoke the OOM
killer, and there is a good reason for that, which is even mentioned in
the changelog. The only change here is that __GFP_NOFAIL doesn't
override this decision - again for reasons mentioned in the changelog.

> While it would be better if vmalloc() can be used,
> there might be users who cannot accept vmalloc() as a fallback (e.g.
> CONFIG_MMU=n where vmalloc() == kmalloc() ?).

I have never heard any complaints about this in the past. If there is a
valid usecase then we can treat nommu specially. That would require more
changes though.

> This change is not "do not enforce OOM killer automatically" but
> "never allow OOM killer". No exception is allowed. If we change
> this part, title for this part should be something strong like
> "mm,oom: Never allow OOM killer for coredumps, costly allocations,
> lowmem etc.".

Sigh. We have never allowed the OOM killer for any of those cases, and
the only thing changed here is that __GFP_NOFAIL no longer overrides
those decisions, which is imho reflected in the title. If that is not
clear then I would suggest "mm, oom: do not override OOM killer
decisions with __GFP_NOFAIL".

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 0/3 -v3] GFP_NOFAIL cleanups
  2016-12-20 13:49 [PATCH 0/3 -v3] GFP_NOFAIL cleanups Michal Hocko
                   ` (2 preceding siblings ...)
  2016-12-20 13:49 ` [PATCH 3/3] mm: help __GFP_NOFAIL allocations which do not trigger OOM killer Michal Hocko
@ 2017-01-02 15:49 ` Michal Hocko
  2017-01-03  1:36   ` Tetsuo Handa
  2017-01-18 18:42 ` Michal Hocko
  4 siblings, 1 reply; 24+ messages in thread
From: Michal Hocko @ 2017-01-02 15:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Tetsuo Handa, David Rientjes, Mel Gorman,
	Hillf Danton, linux-mm, LKML

On Tue 20-12-16 14:49:01, Michal Hocko wrote:
> Hi,
> This has been posted [1] initially to later be reduced to a single patch
> [2].  Johannes then suggested [3] to split up the second patch and make
> the access to memory reserves by __GFP_NOFAIL requests which do not
> invoke the oom killer a separate change. This is patch 3 now.
> 
> Tetsuo has noticed [4] that recent changes have changed GFP_NOFAIL
> semantic for costly order requests. I believe that the primary reason
> why this happened is that our GFP_NOFAIL checks are too scattered
> and it is really easy to forget about adding one. That's why I am
> proposing patch 1 which consolidates all the nofail handling at a single
> place. This should help to make this code better maintainable.
> 
> Patch 2 on top is a further attempt to make GFP_NOFAIL semantic less
> surprising. As things stand currently GFP_NOFAIL overrides the oom killer
> prevention code which is both subtle and not really needed. The patch 2
> has more details about issues this might cause. We have also seen
> a report where __GFP_NOFAIL|GFP_NOFS requests cause the oom killer which
> is premature.
> 
> Patch 3 is an attempt to reduce chances of GFP_NOFAIL requests being
> preempted by other memory consumers by giving them access to memory
> reserves.

a friendly ping on this

> [1] http://lkml.kernel.org/r/20161123064925.9716-1-mhocko@kernel.org
> [2] http://lkml.kernel.org/r/20161214150706.27412-1-mhocko@kernel.org
> [3] http://lkml.kernel.org/r/20161216173151.GA23182@cmpxchg.org
> [4] http://lkml.kernel.org/r/1479387004-5998-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
> 

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 0/3 -v3] GFP_NOFAIL cleanups
  2017-01-02 15:49 ` [PATCH 0/3 -v3] GFP_NOFAIL cleanups Michal Hocko
@ 2017-01-03  1:36   ` Tetsuo Handa
  2017-01-03  8:42     ` Michal Hocko
  0 siblings, 1 reply; 24+ messages in thread
From: Tetsuo Handa @ 2017-01-03  1:36 UTC (permalink / raw)
  To: mhocko, akpm; +Cc: hannes, rientjes, mgorman, hillf.zj, linux-mm, linux-kernel

Michal Hocko wrote:
> On Tue 20-12-16 14:49:01, Michal Hocko wrote:
> > Hi,
> > This has been posted [1] initially to later be reduced to a single patch
> > [2].  Johannes then suggested [3] to split up the second patch and make
> > the access to memory reserves by __GFP_NOFAIL requests which do not
> > invoke the oom killer a separate change. This is patch 3 now.
> > 
> > Tetsuo has noticed [4] that recent changes have changed GFP_NOFAIL
> > semantic for costly order requests. I believe that the primary reason
> > why this happened is that our GFP_NOFAIL checks are too scattered
> > and it is really easy to forget about adding one. That's why I am
> > proposing patch 1 which consolidates all the nofail handling at a single
> > place. This should help to make this code better maintainable.
> > 
> > Patch 2 on top is a further attempt to make GFP_NOFAIL semantic less
> > surprising. As things stand currently GFP_NOFAIL overrides the oom killer
> > prevention code which is both subtle and not really needed. The patch 2
> > has more details about issues this might cause. We have also seen
> > a report where __GFP_NOFAIL|GFP_NOFS requests cause the oom killer which
> > is premature.
> > 
> > Patch 3 is an attempt to reduce chances of GFP_NOFAIL requests being
> > preempted by other memory consumers by giving them access to memory
> > reserves.
> 
> a friendly ping on this
> 
> > [1] http://lkml.kernel.org/r/20161123064925.9716-1-mhocko@kernel.org
> > [2] http://lkml.kernel.org/r/20161214150706.27412-1-mhocko@kernel.org
> > [3] http://lkml.kernel.org/r/20161216173151.GA23182@cmpxchg.org
> > [4] http://lkml.kernel.org/r/1479387004-5998-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp

I'm OK with "[PATCH 1/3] mm: consolidate GFP_NOFAIL checks in the allocator
slowpath" provided the changelog describes that this patch makes __GFP_NOFAIL
stronger than __GFP_NORETRY.

But I don't think "[PATCH 2/3] mm, oom: do not enfore OOM killer for __GFP_NOFAIL
automatically" is correct. Firstly, we need to confirm that

  "The pre-mature OOM killer is a real issue as reported by Nils Holland"

in the changelog is still true, because we have not tested with "[PATCH] mm, memcg:
fix the active list aging for lowmem requests when memcg is enabled" applied while
"[PATCH 2/3] mm, oom: do not enfore OOM killer for __GFP_NOFAIL automatically" and
"[PATCH 3/3] mm: help __GFP_NOFAIL allocations which do not trigger OOM killer"
are left unapplied.

Secondly, since you are using __GFP_NORETRY in "[PATCH] mm: introduce kv[mz]alloc
helpers" as a means of preventing OOM-killer invocation

	/*
	 * Make sure that larger requests are not too disruptive - no OOM
	 * killer and no allocation failure warnings as we have a fallback
	 */
	if (size > PAGE_SIZE)
		kmalloc_flags |= __GFP_NORETRY | __GFP_NOWARN;

, we could use __GFP_NORETRY as a means of preventing OOM-killer invocation
rather than applying "[PATCH 2/3] mm, oom: do not enfore OOM killer for
__GFP_NOFAIL automatically".

Additionally, although there currently seem to be no
kv[mz]alloc(GFP_KERNEL | __GFP_NOFAIL) users, kvmalloc_node() in
"[PATCH] mm: introduce kv[mz]alloc helpers" will behave confusingly if such a
user appears in the future, because "[PATCH 1/3] mm: consolidate GFP_NOFAIL
checks in the allocator slowpath" makes __GFP_NOFAIL stronger than
__GFP_NORETRY.

My concern with "[PATCH 3/3] mm: help __GFP_NOFAIL allocations which
do not trigger OOM killer" is

  "AFAIU, this is an allocation path which doesn't block a forward progress
   on a regular IO. It is merely a check whether there is a new medium in
   the CDROM (aka regular polling of the device). I really fail to see any
   reason why this one should get any access to memory reserves at all."

in http://lkml.kernel.org/r/20161218163727.GC8440@dhcp22.suse.cz .
While that trace is a __GFP_DIRECT_RECLAIM allocation and might not be blocking
other workqueue items which regular I/O depends on, I think there are
!__GFP_DIRECT_RECLAIM memory allocation requests for issuing SCSI commands
which could start failing as a result of helping GFP_NOFS | __GFP_NOFAIL
allocations with memory reserves. If a SCSI disk I/O request fails due to
GFP_ATOMIC allocation failures because we allowed an FS I/O request to use
the memory reserves, that adds a new problem.


* Re: [PATCH 0/3 -v3] GFP_NOFAIL cleanups
  2017-01-03  1:36   ` Tetsuo Handa
@ 2017-01-03  8:42     ` Michal Hocko
  2017-01-03 14:38       ` Tetsuo Handa
  0 siblings, 1 reply; 24+ messages in thread
From: Michal Hocko @ 2017-01-03  8:42 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, hannes, rientjes, mgorman, hillf.zj, linux-mm, linux-kernel

On Tue 03-01-17 10:36:31, Tetsuo Handa wrote:
[...]
> I'm OK with "[PATCH 1/3] mm: consolidate GFP_NOFAIL checks in the allocator
> slowpath" given that we describe that we make __GFP_NOFAIL stronger than
> __GFP_NORETRY with this patch in the changelog.

Again. __GFP_NORETRY | __GFP_NOFAIL is nonsense! I do not really see any
reason to describe all the nonsense combinations of gfp flags.

> But I don't think "[PATCH 2/3] mm, oom: do not enfore OOM killer for __GFP_NOFAIL
> automatically" is correct. Firstly, we need to confirm
> 
>   "The pre-mature OOM killer is a real issue as reported by Nils Holland"
> 
> in the changelog is still true because we haven't tested with "[PATCH] mm, memcg:
> fix the active list aging for lowmem requests when memcg is enabled" applied and
> without "[PATCH 2/3] mm, oom: do not enfore OOM killer for __GFP_NOFAIL
> automatically" and "[PATCH 3/3] mm: help __GFP_NOFAIL allocations which do not
> trigger OOM killer" applied.

Yes I have dropped the reference to this report already in my local
patch because in this particular case the issue was somewhere else
indeed!

> Secondly, as you are using __GFP_NORETRY in "[PATCH] mm: introduce kv[mz]alloc
> helpers" as a mean to enforce not to invoke the OOM killer
> 
> 	/*
> 	 * Make sure that larger requests are not too disruptive - no OOM
> 	 * killer and no allocation failure warnings as we have a fallback
> 	 */
> 	if (size > PAGE_SIZE)
> 		kmalloc_flags |= __GFP_NORETRY | __GFP_NOWARN;
> 
> , we can use __GFP_NORETRY as a mean to enforce not to invoke the OOM killer
> rather than applying "[PATCH 2/3] mm, oom: do not enfore OOM killer for
> __GFP_NOFAIL automatically".
> 
> Additionally, although currently there seems to be no
> kv[mz]alloc(GFP_KERNEL | __GFP_NOFAIL) users, kvmalloc_node() in
> "[PATCH] mm: introduce kv[mz]alloc helpers" will be confused when a
> kv[mz]alloc(GFP_KERNEL | __GFP_NOFAIL) user comes in in the future because
> "[PATCH 1/3] mm: consolidate GFP_NOFAIL checks in the allocator slowpath" makes
> __GFP_NOFAIL stronger than __GFP_NORETRY.

Using NOFAIL in kv[mz]alloc simply makes no sense at all. The vmalloc
fallback would be unreachable!

> My concern with "[PATCH 3/3] mm: help __GFP_NOFAIL allocations which
> do not trigger OOM killer" is
> 
>   "AFAIU, this is an allocation path which doesn't block a forward progress
>    on a regular IO. It is merely a check whether there is a new medium in
>    the CDROM (aka regular polling of the device). I really fail to see any
>    reason why this one should get any access to memory reserves at all."
> 
> in http://lkml.kernel.org/r/20161218163727.GC8440@dhcp22.suse.cz .
> Indeed that trace is a __GFP_DIRECT_RECLAIM and it might not be blocking
> other workqueue items which a regular I/O depend on, I think there are
> !__GFP_DIRECT_RECLAIM memory allocation requests for issuing SCSI commands
> which could potentially start failing due to helping GFP_NOFS | __GFP_NOFAIL
> allocations with memory reserves. If a SCSI disk I/O request fails due to
> GFP_ATOMIC memory allocation failures because we allow a FS I/O request to
> use memory reserves, it adds a new problem.

Do you have any example of such a request? Anything that requires
a forward progress during IO should be using mempools otherwise it
is broken pretty much by design already. Also IO depending on NOFS
allocations sounds pretty much broken already. So I suspect the above
reasoning is just bogus.

That being said, to summarize your arguments again: 1) you do not like
that the combination __GFP_NORETRY | __GFP_NOFAIL is not documented
to never fail, 2) based on that you argue that kv[mz]alloc with
__GFP_NOFAIL will never reach vmalloc, and 3) there might be some IO
paths depending on NOFS|NOFAIL allocations which would have a harder
time making forward progress.

I would call 1 and 2 just bogus and 3 highly dubious at best. Do not
get me wrong, but this is not what I call useful review feedback, let
alone a reason to block these patches. If there are any reasons not to
merge them, these are not they.

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 0/3 -v3] GFP_NOFAIL cleanups
  2017-01-03  8:42     ` Michal Hocko
@ 2017-01-03 14:38       ` Tetsuo Handa
  2017-01-03 16:25         ` Vlastimil Babka
  2017-01-03 20:40         ` Michal Hocko
  0 siblings, 2 replies; 24+ messages in thread
From: Tetsuo Handa @ 2017-01-03 14:38 UTC (permalink / raw)
  To: mhocko; +Cc: akpm, hannes, rientjes, mgorman, hillf.zj, linux-mm, linux-kernel

Michal Hocko wrote:
> On Tue 03-01-17 10:36:31, Tetsuo Handa wrote:
> [...]
> > I'm OK with "[PATCH 1/3] mm: consolidate GFP_NOFAIL checks in the allocator
> > slowpath" given that we describe that we make __GFP_NOFAIL stronger than
> > __GFP_NORETRY with this patch in the changelog.
> 
> Again. __GFP_NORETRY | __GFP_NOFAIL is nonsense! I do not really see any
> reason to describe all the nonsense combinations of gfp flags.

Before [PATCH 1/3]:

  __GFP_NORETRY is used as "Do not invoke the OOM killer. Fail allocation
  request even if __GFP_NOFAIL is specified if direct reclaim/compaction
  did not help."

  __GFP_NOFAIL is used as "Never fail allocation request unless __GFP_NORETRY
  is specified even if direct reclaim/compaction did not help."

After [PATCH 1/3]:

  __GFP_NORETRY is used as "Do not invoke the OOM killer. Fail allocation
  request unless __GFP_NOFAIL is specified."

  __GFP_NOFAIL is used as "Never fail allocation request even if direct
  reclaim/compaction did not help. Invoke the OOM killer unless __GFP_NORETRY is
  specified."

Thus, __GFP_NORETRY | __GFP_NOFAIL makes perfect sense as
"Never fail the allocation request even if direct reclaim/compaction did not
help, but do not invoke the OOM killer either."

> 
> > But I don't think "[PATCH 2/3] mm, oom: do not enfore OOM killer for __GFP_NOFAIL
> > automatically" is correct. Firstly, we need to confirm
> > 
> >   "The pre-mature OOM killer is a real issue as reported by Nils Holland"
> > 
> > in the changelog is still true because we haven't tested with "[PATCH] mm, memcg:
> > fix the active list aging for lowmem requests when memcg is enabled" applied and
> > without "[PATCH 2/3] mm, oom: do not enfore OOM killer for __GFP_NOFAIL
> > automatically" and "[PATCH 3/3] mm: help __GFP_NOFAIL allocations which do not
> > trigger OOM killer" applied.
> 
> Yes I have dropped the reference to this report already in my local
> patch because in this particular case the issue was somewhere else
> indeed!

OK.

> 
> > Secondly, as you are using __GFP_NORETRY in "[PATCH] mm: introduce kv[mz]alloc
> > helpers" as a mean to enforce not to invoke the OOM killer
> > 
> > 	/*
> > 	 * Make sure that larger requests are not too disruptive - no OOM
> > 	 * killer and no allocation failure warnings as we have a fallback
> > 	 */
> > 	if (size > PAGE_SIZE)
> > 		kmalloc_flags |= __GFP_NORETRY | __GFP_NOWARN;
> > 
> > , we can use __GFP_NORETRY as a mean to enforce not to invoke the OOM killer
> > rather than applying "[PATCH 2/3] mm, oom: do not enfore OOM killer for
> > __GFP_NOFAIL automatically".
> > 

As I wrote above, __GFP_NORETRY | __GFP_NOFAIL makes perfect sense.

> > Additionally, although currently there seems to be no
> > kv[mz]alloc(GFP_KERNEL | __GFP_NOFAIL) users, kvmalloc_node() in
> > "[PATCH] mm: introduce kv[mz]alloc helpers" will be confused when a
> > kv[mz]alloc(GFP_KERNEL | __GFP_NOFAIL) user comes in in the future because
> > "[PATCH 1/3] mm: consolidate GFP_NOFAIL checks in the allocator slowpath" makes
> > __GFP_NOFAIL stronger than __GFP_NORETRY.
> 
> Using NOFAIL in kv[mz]alloc simply makes no sense at all. The vmalloc
> fallback would be simply unreachable!

My intention is shown below.

 void *kvmalloc_node(size_t size, gfp_t flags, int node)
 {
 	gfp_t kmalloc_flags = flags;
 	void *ret;
 
 	/*
 	 * vmalloc uses GFP_KERNEL for some internal allocations (e.g page tables)
 	 * so the given set of flags has to be compatible.
 	 */
 	WARN_ON_ONCE((flags & GFP_KERNEL) != GFP_KERNEL);
 
 	/*
 	 * Make sure that larger requests are not too disruptive - no OOM
 	 * killer and no allocation failure warnings as we have a fallback
 	 */
-	if (size > PAGE_SIZE)
+	if (size > PAGE_SIZE) {
 		kmalloc_flags |= __GFP_NORETRY | __GFP_NOWARN;
+		kmalloc_flags &= ~__GFP_NOFAIL;
+	}
 
 	ret = kmalloc_node(size, kmalloc_flags, node);
 
 	/*
 	 * It doesn't really make sense to fallback to vmalloc for sub page
 	 * requests
 	 */
 	if (ret || size <= PAGE_SIZE)
 		return ret;
 
 	return __vmalloc_node_flags(size, node, flags);
 }

> 
> > My concern with "[PATCH 3/3] mm: help __GFP_NOFAIL allocations which
> > do not trigger OOM killer" is
> > 
> >   "AFAIU, this is an allocation path which doesn't block a forward progress
> >    on a regular IO. It is merely a check whether there is a new medium in
> >    the CDROM (aka regular polling of the device). I really fail to see any
> >    reason why this one should get any access to memory reserves at all."
> > 
> > in http://lkml.kernel.org/r/20161218163727.GC8440@dhcp22.suse.cz .
> > Indeed that trace is a __GFP_DIRECT_RECLAIM and it might not be blocking
> > other workqueue items which a regular I/O depend on, I think there are
> > !__GFP_DIRECT_RECLAIM memory allocation requests for issuing SCSI commands
> > which could potentially start failing due to helping GFP_NOFS | __GFP_NOFAIL
> > allocations with memory reserves. If a SCSI disk I/O request fails due to
> > GFP_ATOMIC memory allocation failures because we allow a FS I/O request to
> > use memory reserves, it adds a new problem.
> 
> Do you have any example of such a request? Anything that requires
> a forward progress during IO should be using mempools otherwise it
> is broken pretty much by design already. Also IO depending on NOFS
> allocations sounds pretty much broken already. So I suspect the above
> reasoning is just bogus.

You are missing my point. My question is "who needs the memory reserves".
I'm not saying that disk I/O depends on GFP_NOFS allocations. I'm worried
that [PATCH 3/3] consumes memory reserves while disk I/O also depends on
those reserves.

My understanding is that accessing SCSI disks uses the SCSI protocol, and
the SCSI driver allocates memory at runtime with GFP_ATOMIC in order to
issue SCSI commands. GFP_ATOMIC dips into some of the memory reserves, but
[PATCH 3/3] uses the reserves too. If [PATCH 3/3] consumes the reserves to
the level where GFP_ATOMIC can no longer succeed, I think it causes trouble.

I'm unable to obtain nice backtraces, but I think we can confirm that
GFP_ATOMIC allocations occur (e.g. sg_alloc_table_chained() calls
__sg_alloc_table(GFP_ATOMIC)) while we are using SCSI disks.

stap -DSTAP_NO_OVERLOAD=1 -DMAXSKIPPED=10000000 -e 'global in_scope; global traces;
probe begin { println("Ready."); }
probe kernel.function("scsi_*").call { in_scope[tid()]++; }
probe kernel.function("scsi_*").return { in_scope[tid()]--; }
probe kernel.function("__alloc_pages_nodemask") {
  if (in_scope[tid()] != 0) {
    bt = backtrace();
    if (!([bt,$gfp_mask] in traces)) {
      traces[bt,$gfp_mask] = 1;
      printf("mode=0x%x,order=%u by %s(%u)\n",
             $gfp_mask, $order, execname(), pid());
      print_backtrace();
      println();
    }
  }
}'

[  183.887841] ------------[ cut here ]------------
[  183.888847] WARNING: CPU: 0 PID: 0 at /root/systemtap.tmp/share/systemtap/runtime/linux/addr-map.c:42 lookup_bad_addr.isra.33+0x84/0x90 [stap_40159b816dfa3e3814a532dc982f551b_1_4182]
[  183.892011] Modules linked in: stap_40159b816dfa3e3814a532dc982f551b_1_4182(O) nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd glue_helper vmw_vsock_vmci_transport ppdev vmw_balloon vsock pcspkr sg parport_pc parport vmw_vmci i2c_piix4 shpchp ip_tables xfs libcrc32c sd_mod sr_mod cdrom ata_generic pata_acpi crc32c_intel
[  183.905779]  serio_raw mptspi scsi_transport_spi vmwgfx mptscsih ahci drm_kms_helper libahci syscopyarea sysfillrect mptbase sysimgblt fb_sys_fops ttm ata_piix drm e1000 i2c_core libata
[  183.909128] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G           O    4.10.0-rc2-next-20170103 #486
[  183.910839] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[  183.912815] Call Trace:
[  183.913326]  <IRQ>
[  183.913729]  dump_stack+0x85/0xc9
[  183.914366]  __warn+0xd1/0xf0
[  183.914930]  warn_slowpath_null+0x1d/0x20
[  183.915691]  lookup_bad_addr.isra.33+0x84/0x90 [stap_40159b816dfa3e3814a532dc982f551b_1_4182]
[  183.917539]  unwind_frame.constprop.79+0xbed/0x1270 [stap_40159b816dfa3e3814a532dc982f551b_1_4182]
[  183.919238]  _stp_stack_kernel_get+0x16e/0x4d0 [stap_40159b816dfa3e3814a532dc982f551b_1_4182]
[  183.920870]  ? gfp_pfmemalloc_allowed+0x80/0x80
[  183.921737]  _stp_stack_kernel_print+0x3e/0xc0 [stap_40159b816dfa3e3814a532dc982f551b_1_4182]
[  183.923583]  probe_3838+0x1f6/0x8c0 [stap_40159b816dfa3e3814a532dc982f551b_1_4182]
[  183.925005]  ? __alloc_pages_nodemask+0x1/0x4e0
[  183.925875]  ? __alloc_pages_nodemask+0x1/0x4e0
[  183.926724]  ? __alloc_pages_nodemask+0x1/0x4e0
[  183.927598]  enter_kprobe_probe+0x1e5/0x310 [stap_40159b816dfa3e3814a532dc982f551b_1_4182]
[  183.929164]  kprobe_ftrace_handler+0xe9/0x150
[  183.930260]  ? __alloc_pages_nodemask+0x5/0x4e0
[  183.931123]  ftrace_ops_assist_func+0xbf/0x110
[  183.932012]  ? sched_clock+0x9/0x10
[  183.932815]  ? sched_clock_cpu+0x84/0xb0
[  183.933550]  0xffffffffa00430d5
[  183.934947]  ? gfp_pfmemalloc_allowed+0x80/0x80
[  183.936581]  __alloc_pages_nodemask+0x5/0x4e0
[  183.938409]  alloc_pages_current+0x97/0x1b0
[  183.939971]  ? __alloc_pages_nodemask+0x5/0x4e0
[  183.941567]  ? alloc_pages_current+0x97/0x1b0
[  183.943173]  new_slab+0x3a8/0x6a0
[  183.944702]  ___slab_alloc+0x3a3/0x620
[  183.946157]  ? stp_lock_probe+0x4a/0xb0 [stap_40159b816dfa3e3814a532dc982f551b_1_4182]
[  183.948375]  ? mempool_alloc_slab+0x1c/0x20
[  183.950135]  ? mempool_alloc_slab+0x1c/0x20
[  183.951957]  __slab_alloc+0x46/0x7d
[  183.953415]  ? mempool_alloc_slab+0x1c/0x20
[  183.955001]  kmem_cache_alloc+0x2e8/0x370
[  183.956552]  ? aggr_pre_handler+0x3f/0x80
[  183.958428]  mempool_alloc_slab+0x1c/0x20
[  183.960042]  mempool_alloc+0x79/0x1b0
[  183.961512]  ? kprobe_ftrace_handler+0xa3/0x150
[  183.963082]  ? scsi_init_io+0x5/0x1d0
[  183.964501]  sg_pool_alloc+0x45/0x50
[  183.966064]  __sg_alloc_table+0xdf/0x150
[  183.967402]  ? sg_free_table_chained+0x30/0x30
[  183.969030]  sg_alloc_table_chained+0x3f/0x90
[  183.970405]  scsi_init_sgtable+0x31/0x70
[  183.971783]  copy_oldmem_page+0xd0/0xd0
[  183.973506]  copy_oldmem_page+0xd0/0xd0
[  183.974802]  sd_init_command+0x3c/0xb0 [sd_mod]
[  183.976187]  scsi_setup_cmnd+0xf0/0x150
[  183.977617]  copy_oldmem_page+0xd0/0xd0
[  183.978849]  ? scsi_prep_fn+0x5/0x170
[  183.980119]  copy_oldmem_page+0xd0/0xd0
[  183.981406]  scsi_request_fn+0x42/0x740
[  183.982760]  ? scsi_request_fn+0x5/0x740
[  183.984049]  copy_oldmem_page+0xd0/0xd0
[  183.985438]  blk_run_queue+0x26/0x40
[  183.986751]  scsi_kick_queue+0x25/0x30
[  183.987891]  copy_oldmem_page+0xd0/0xd0
[  183.989073]  copy_oldmem_page+0xd0/0xd0
[  183.990175]  copy_oldmem_page+0xd0/0xd0
[  183.991348]  copy_oldmem_page+0xd0/0xd0
[  183.992450]  copy_oldmem_page+0xd0/0xd0
[  183.993677]  blk_done_softirq+0xa8/0xd0
[  183.994766]  __do_softirq+0xc0/0x52d
[  183.995792]  irq_exit+0xf5/0x110
[  183.996748]  smp_call_function_single_interrupt+0x33/0x40
[  183.998230]  call_function_single_interrupt+0x9d/0xb0
[  183.999521] RIP: 0010:native_safe_halt+0x6/0x10
[  184.000791] RSP: 0018:ffffffff81c03dd0 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff04
[  184.002749] RAX: ffffffff81c18500 RBX: 0000000000000000 RCX: 0000000000000000
[  184.004440] RDX: ffffffff81c18500 RSI: 0000000000000001 RDI: ffffffff81c18500
[  184.006281] RBP: ffffffff81c03dd0 R08: 0000000000000000 R09: 0000000000000000
[  184.008063] R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000
[  184.009842] R13: ffffffff81c18500 R14: ffffffff81c18500 R15: 0000000000000000
[  184.011758]  </IRQ>
[  184.012952]  default_idle+0x23/0x1d0
[  184.014085]  arch_cpu_idle+0xf/0x20
[  184.015204]  default_idle_call+0x23/0x40
[  184.016347]  do_idle+0x162/0x230
[  184.017353]  cpu_startup_entry+0x71/0x80
[  184.018686]  rest_init+0x138/0x140
[  184.019961]  ? rest_init+0x5/0x140
[  184.021034]  start_kernel+0x4ba/0x4db
[  184.022305]  ? set_init_arg+0x55/0x55
[  184.023405]  ? early_idt_handler_array+0x120/0x120
[  184.024730]  x86_64_start_reservations+0x2a/0x2c
[  184.026121]  x86_64_start_kernel+0x14c/0x16f
[  184.027329]  start_cpu+0x14/0x14
[  184.028391] ---[ end trace cbfebb3ae93a99b9 ]---

> 
> That being said, to summarize your arguments again. 1) you do not like
> that a combination of __GFP_NORETRY | __GFP_NOFAIL is not documented
> to never fail,

Correct.

>                2) based on that you argue that kv[mvz]alloc with
> __GFP_NOFAIL will never reach vmalloc

Wrong.

>                                       and 3) that there might be some IO
> paths depending on NOFS|NOFAIL allocation which would have harder time
> to make forward progress.

Wrong.

> 
> I would call 1 and 2 just bogus and 3 highly dubious at best. Do not
> get me wrong but this is not what I call a useful review feedback yet
> alone a reason to block these patches. If there are any reasons to not
> merge them these are not those.


* Re: [PATCH 0/3 -v3] GFP_NOFAIL cleanups
  2017-01-03 14:38       ` Tetsuo Handa
@ 2017-01-03 16:25         ` Vlastimil Babka
  2017-01-03 20:40         ` Michal Hocko
  1 sibling, 0 replies; 24+ messages in thread
From: Vlastimil Babka @ 2017-01-03 16:25 UTC (permalink / raw)
  To: Tetsuo Handa, mhocko
  Cc: akpm, hannes, rientjes, mgorman, hillf.zj, linux-mm, linux-kernel

On 01/03/2017 03:38 PM, Tetsuo Handa wrote:
> Michal Hocko wrote:
>> On Tue 03-01-17 10:36:31, Tetsuo Handa wrote:
>> [...]
>> > I'm OK with "[PATCH 1/3] mm: consolidate GFP_NOFAIL checks in the allocator
>> > slowpath" given that we describe that we make __GFP_NOFAIL stronger than
>> > __GFP_NORETRY with this patch in the changelog.
>>
>> Again. __GFP_NORETRY | __GFP_NOFAIL is nonsense! I do not really see any
>> reason to describe all the nonsense combinations of gfp flags.
>
> Before [PATCH 1/3]:
>
>   __GFP_NORETRY is used as "Do not invoke the OOM killer. Fail allocation
>   request even if __GFP_NOFAIL is specified if direct reclaim/compaction
>   did not help."
>
>   __GFP_NOFAIL is used as "Never fail allocation request unless __GFP_NORETRY
>   is specified even if direct reclaim/compaction did not help."
>
> After [PATCH 1/3]:
>
>   __GFP_NORETRY is used as "Do not invoke the OOM killer. Fail allocation
>   request unless __GFP_NOFAIL is specified."
>
>   __GFP_NOFAIL is used as "Never fail allocation request even if direct
>   reclaim/compaction did not help. Invoke the OOM killer unless __GFP_NORETRY is
>   specified."
>
> Thus, __GFP_NORETRY | __GFP_NOFAIL perfectly makes sense as
> "Never fail allocation request if direct reclaim/compaction did not help.
> But do not invoke the OOM killer even if direct reclaim/compaction did not help."

It may technically do that, but how exactly is that useful, i.e. how does it
"make sense"? Patch 2/3 here makes sure that the OOM killer is not invoked
when the allocation context is "limited" and thus OOM might be premature
(despite __GFP_NOFAIL). What's the use case for __GFP_NORETRY | __GFP_NOFAIL?

>
>>
>> > But I don't think "[PATCH 2/3] mm, oom: do not enfore OOM killer for __GFP_NOFAIL
>> > automatically" is correct. Firstly, we need to confirm
>> >
>> >   "The pre-mature OOM killer is a real issue as reported by Nils Holland"
>> >
>> > in the changelog is still true because we haven't tested with "[PATCH] mm, memcg:
>> > fix the active list aging for lowmem requests when memcg is enabled" applied and
>> > without "[PATCH 2/3] mm, oom: do not enfore OOM killer for __GFP_NOFAIL
>> > automatically" and "[PATCH 3/3] mm: help __GFP_NOFAIL allocations which do not
>> > trigger OOM killer" applied.
>>
>> Yes I have dropped the reference to this report already in my local
>> patch because in this particular case the issue was somewhere else
>> indeed!
>
> OK.
>
>>
>> > Secondly, as you are using __GFP_NORETRY in "[PATCH] mm: introduce kv[mz]alloc
>> > helpers" as a mean to enforce not to invoke the OOM killer
>> >
>> > 	/*
>> > 	 * Make sure that larger requests are not too disruptive - no OOM
>> > 	 * killer and no allocation failure warnings as we have a fallback
>> > 	 */
>> > 	if (size > PAGE_SIZE)
>> > 		kmalloc_flags |= __GFP_NORETRY | __GFP_NOWARN;
>> >
>> > , we can use __GFP_NORETRY as a mean to enforce not to invoke the OOM killer
>> > rather than applying "[PATCH 2/3] mm, oom: do not enfore OOM killer for
>> > __GFP_NOFAIL automatically".
>> >
>
> As I wrote above, __GFP_NORETRY | __GFP_NOFAIL perfectly makes sense.
>
>> > Additionally, although currently there seems to be no
>> > kv[mz]alloc(GFP_KERNEL | __GFP_NOFAIL) users, kvmalloc_node() in
>> > "[PATCH] mm: introduce kv[mz]alloc helpers" will be confused when a
>> > kv[mz]alloc(GFP_KERNEL | __GFP_NOFAIL) user comes in in the future because
>> > "[PATCH 1/3] mm: consolidate GFP_NOFAIL checks in the allocator slowpath" makes
>> > __GFP_NOFAIL stronger than __GFP_NORETRY.
>>
>> Using NOFAIL in kv[mz]alloc simply makes no sense at all. The vmalloc
>> fallback would be simply unreachable!
>
> My intention is shown below.
>
>  void *kvmalloc_node(size_t size, gfp_t flags, int node)
>  {
>  	gfp_t kmalloc_flags = flags;
>  	void *ret;
>
>  	/*
>  	 * vmalloc uses GFP_KERNEL for some internal allocations (e.g page tables)
>  	 * so the given set of flags has to be compatible.
>  	 */
>  	WARN_ON_ONCE((flags & GFP_KERNEL) != GFP_KERNEL);
>
>  	/*
>  	 * Make sure that larger requests are not too disruptive - no OOM
>  	 * killer and no allocation failure warnings as we have a fallback
>  	 */
> -	if (size > PAGE_SIZE)
> +	if (size > PAGE_SIZE) {
>  		kmalloc_flags |= __GFP_NORETRY | __GFP_NOWARN;
> +		kmalloc_flags &= ~__GFP_NOFAIL;

This does make kvmalloc_node more robust against callers that would try to use
it with __GFP_NOFAIL, but is it a good idea to allow that right now? If there
are none yet (AFAIK?), we should rather let the existing WARN_ON kick in (which
won't happen if we strip __GFP_NOFAIL) and discuss a better solution for such a
new future caller.

Also this means the kmalloc() here cannot do "__GFP_NORETRY | __GFP_NOFAIL", so
I'm not sure how it relates to your points above - it is not an example showing
that the combination "makes perfect sense".

Thanks,
Vlastimil


* Re: [PATCH 0/3 -v3] GFP_NOFAIL cleanups
  2017-01-03 14:38       ` Tetsuo Handa
  2017-01-03 16:25         ` Vlastimil Babka
@ 2017-01-03 20:40         ` Michal Hocko
  2017-01-04 14:22           ` Tetsuo Handa
  1 sibling, 1 reply; 24+ messages in thread
From: Michal Hocko @ 2017-01-03 20:40 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, hannes, rientjes, mgorman, hillf.zj, linux-mm, linux-kernel

On Tue 03-01-17 23:38:30, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Tue 03-01-17 10:36:31, Tetsuo Handa wrote:
> > [...]
> > > I'm OK with "[PATCH 1/3] mm: consolidate GFP_NOFAIL checks in the allocator
> > > slowpath" given that we describe that we make __GFP_NOFAIL stronger than
> > > __GFP_NORETRY with this patch in the changelog.
> > 
> > Again. __GFP_NORETRY | __GFP_NOFAIL is nonsense! I do not really see any
> > reason to describe all the nonsense combinations of gfp flags.
> 
> Before [PATCH 1/3]:
> 
>   __GFP_NORETRY is used as "Do not invoke the OOM killer. Fail allocation
>   request even if __GFP_NOFAIL is specified if direct reclaim/compaction
>   did not help."
> 
>   __GFP_NOFAIL is used as "Never fail allocation request unless __GFP_NORETRY
>   is specified even if direct reclaim/compaction did not help."
> 
> After [PATCH 1/3]:
> 
>   __GFP_NORETRY is used as "Do not invoke the OOM killer. Fail allocation
>   request unless __GFP_NOFAIL is specified."
> 
>   __GFP_NOFAIL is used as "Never fail allocation request even if direct
>   reclaim/compaction did not help. Invoke the OOM killer unless __GFP_NORETRY is
>   specified."
> 
> Thus, __GFP_NORETRY | __GFP_NOFAIL perfectly makes sense as
> "Never fail allocation request if direct reclaim/compaction did not help.
> But do not invoke the OOM killer even if direct reclaim/compaction did not help."

Stop this! Seriously... This is just wasting time...

 * __GFP_NORETRY: The VM implementation must not retry indefinitely and will
 *   return NULL when direct reclaim and memory compaction have failed to allow
 *   the allocation to succeed.  The OOM killer is not called with the current
 *   implementation.

 * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
 *   cannot handle allocation failures. New users should be evaluated carefully
 *   (and the flag should be used only when there is no reasonable failure
 *   policy) but it is definitely preferable to use the flag rather than
 *   opencode endless loop around allocator.

Can you see how the two ask for opposite behavior?  Asking both never
to retry indefinitely and never to fail (i.e. to retry indefinitely)
simply doesn't make any sense in any reasonable universe I can think
of. Therefore I think it is fair to say that the behavior is undefined
when both are specified.

Considering there are _no_ users which would do that any further
discussion about this is just pointless and I will not respond to any
further emails in this direction.

This is just ridiculous!

[...]
>  void *kvmalloc_node(size_t size, gfp_t flags, int node)
>  {
>  	gfp_t kmalloc_flags = flags;
>  	void *ret;
>  
>  	/*
>  	 * vmalloc uses GFP_KERNEL for some internal allocations (e.g page tables)
>  	 * so the given set of flags has to be compatible.
>  	 */
>  	WARN_ON_ONCE((flags & GFP_KERNEL) != GFP_KERNEL);
>  
>  	/*
>  	 * Make sure that larger requests are not too disruptive - no OOM
>  	 * killer and no allocation failure warnings as we have a fallback
>  	 */
> -	if (size > PAGE_SIZE)
> +	if (size > PAGE_SIZE) {
>  		kmalloc_flags |= __GFP_NORETRY | __GFP_NOWARN;
> +		kmalloc_flags &= ~__GFP_NOFAIL;
> +	}

No, there are simply no users of this, and even if we had a legitimate
one it wouldn't be as simple as this. vmalloc _doesn't_ support
GFP_NOFAIL and it would be really non-trivial to implement it. If for
nothing else, there are unconditional GFP_KERNEL allocations in some
vmalloc paths (which is, btw., the reason why vmalloc is not GFP_NOFS
safe). It would take much more to add the non-failing semantic. And I
see _no_ reason to argue about this possibility when a) there is no such
user currently and b) it is not even clear whether we want to support
such a usecase.

[...]
> > > in http://lkml.kernel.org/r/20161218163727.GC8440@dhcp22.suse.cz .
> > > Indeed that trace is a __GFP_DIRECT_RECLAIM and it might not be blocking
> > > other workqueue items which a regular I/O depend on, I think there are
> > > !__GFP_DIRECT_RECLAIM memory allocation requests for issuing SCSI commands
> > > which could potentially start failing due to helping GFP_NOFS | __GFP_NOFAIL
> > > allocations with memory reserves. If a SCSI disk I/O request fails due to
> > > GFP_ATOMIC memory allocation failures because we allow a FS I/O request to
> > > use memory reserves, it adds a new problem.
> > 
> > Do you have any example of such a request? Anything that requires
> > forward progress during IO should be using mempools, otherwise it
> > is broken pretty much by design already. Also, IO depending on NOFS
> > allocations sounds pretty much broken already. So I suspect the above
> > reasoning is just bogus.
> 
> You are missing my point. My question is "who needs memory reserves".
> I'm not saying that disk I/O depends on GFP_NOFS allocation. I'm worrying
> that [PATCH 3/3] consumes memory reserves when disk I/O also depends on
> memory reserves.
> 
> My understanding is that when accessing SCSI disks, SCSI protocol is used.
> SCSI driver allocates memory at runtime for using SCSI protocol using
> GFP_ATOMIC. And GFP_ATOMIC uses some of memory reserves. But [PATCH 3/3]
> also uses memory reserves. If memory reserves are consumed by [PATCH 3/3]
> to the level where GFP_ATOMIC cannot succeed, I think it causes troubles.

Yes, and GFP_ATOMIC will have deeper access to memory reserves than what
we are giving access to in patch 3. There is a difference between
ALLOC_HARDER and ALLOC_HIGH. This is described in the changelog. Sure,
GFP_NOFAIL will eat into part of the reserves which GFP_ATOMIC (aka
__GFP_HIGH) could have used, but a) this shouldn't happen unless we are
really getting out of memory and b) it should help other presumably
important allocations (why would they be GFP_NOFAIL otherwise, right?).
So it is not just a free ticket to a scarce resource and IMHO it is
justified.
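The two reserve depths mentioned above can be modeled in userspace. The
`min -= min/2` and `min -= min/4` adjustments mirror the 4.9-era
`__zone_watermark_ok()` behavior for ALLOC_HIGH and ALLOC_HARDER; the flag
values and everything else here are a simplified sketch, not the actual mm
code:

```c
#include <stdbool.h>

#define ALLOC_HIGH   0x20  /* request carries __GFP_HIGH, e.g. GFP_ATOMIC */
#define ALLOC_HARDER 0x10  /* e.g. __GFP_NOFAIL with patch 3 applied */

/* Simplified model of __zone_watermark_ok(): how far below the min
 * watermark a request may reach depends on its alloc flags. */
static long effective_min(long min, unsigned int alloc_flags)
{
	if (alloc_flags & ALLOC_HIGH)
		min -= min / 2;   /* GFP_ATOMIC digs deeper ... */
	if (alloc_flags & ALLOC_HARDER)
		min -= min / 4;   /* ... than ALLOC_HARDER alone */
	return min;
}

static bool watermark_ok(long free_pages, long min, unsigned int alloc_flags)
{
	return free_pages >= effective_min(min, alloc_flags);
}
```

With min = 1000 pages, a plain request needs 1000 free pages, an
ALLOC_HARDER request 750, and an ALLOC_HIGH request only 500 - which is the
point being made: the GFP_NOFAIL reserve access in patch 3 is shallower
than what GFP_ATOMIC already gets.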

> I'm unable to obtain nice backtraces, but I think we can confirm that
> there are GFP_ATOMIC allocations (e.g. sg_alloc_table_chained() calls
> __sg_alloc_table(GFP_ATOMIC)) when we are using SCSI disks.

How are those blocking further progress? Failing atomic allocations are
nothing to lose sleep over. They cannot, pretty much by definition, be
relied on to make further progress.
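The mempool guarantee referred to earlier - preallocated elements provide
forward progress when the normal allocator fails - can be sketched with a
toy userspace model. Names and the pool size are invented; this is not the
kernel's mempool implementation:

```c
#include <stdlib.h>
#include <stddef.h>

/* Toy model of a kernel mempool: a few preallocated elements
 * guarantee forward progress when the normal allocator fails. */
struct toy_mempool {
	void *elems[4];
	int nr_free;
	size_t elem_size;
};

static int toy_mempool_init(struct toy_mempool *p, size_t elem_size)
{
	p->elem_size = elem_size;
	p->nr_free = 0;
	for (int i = 0; i < 4; i++) {
		void *e = malloc(elem_size);
		if (!e)
			return -1;
		p->elems[p->nr_free++] = e;
	}
	return 0;
}

/* alloc_fn stands in for the page allocator; it may fail. */
static void *toy_mempool_alloc(struct toy_mempool *p, void *(*alloc_fn)(size_t))
{
	void *e = alloc_fn(p->elem_size);
	if (e)
		return e;
	if (p->nr_free > 0)
		return p->elems[--p->nr_free];   /* fall back to the reserve */
	return NULL;  /* the real mempool_alloc() would sleep and retry */
}

static void toy_mempool_free(struct toy_mempool *p, void *e)
{
	if (p->nr_free < 4)
		p->elems[p->nr_free++] = e;      /* refill the reserve first */
	else
		free(e);
}
```

The design point is that the reserve is replenished by completed I/O
(freed elements), so the pool itself guarantees progress without touching
the global memory reserves at all.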

[...]

I am _really_ getting tired of this discussion. You are making wrong or
unfounded claims again and again. I have no idea what you are trying to
achieve here, but I simply do not see any sense in continuing this
discussion.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 0/3 -v3] GFP_NOFAIL cleanups
  2017-01-03 20:40         ` Michal Hocko
@ 2017-01-04 14:22           ` Tetsuo Handa
  2017-01-04 15:20             ` Michal Hocko
  0 siblings, 1 reply; 24+ messages in thread
From: Tetsuo Handa @ 2017-01-04 14:22 UTC (permalink / raw)
  To: mhocko; +Cc: akpm, hannes, rientjes, mgorman, hillf.zj, linux-mm, linux-kernel

Michal Hocko wrote:
> On Tue 03-01-17 23:38:30, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Tue 03-01-17 10:36:31, Tetsuo Handa wrote:
> > > [...]
> > > > I'm OK with "[PATCH 1/3] mm: consolidate GFP_NOFAIL checks in the allocator
> > > > slowpath" given that we describe that we make __GFP_NOFAIL stronger than
> > > > __GFP_NORETRY with this patch in the changelog.
> > > 
> > > Again. __GFP_NORETRY | __GFP_NOFAIL is nonsense! I do not really see any
> > > reason to describe all the nonsense combinations of gfp flags.
> > 
> > Before [PATCH 1/3]:
> > 
> >   __GFP_NORETRY is used as "Do not invoke the OOM killer. Fail allocation
> >   request even if __GFP_NOFAIL is specified if direct reclaim/compaction
> >   did not help."
> > 
> >   __GFP_NOFAIL is used as "Never fail allocation request unless __GFP_NORETRY
> >   is specified even if direct reclaim/compaction did not help."
> > 
> > After [PATCH 1/3]:
> > 
> >   __GFP_NORETRY is used as "Do not invoke the OOM killer. Fail allocation
> >   request unless __GFP_NOFAIL is specified."
> > 
> >   __GFP_NOFAIL is used as "Never fail allocation request even if direct
> >   reclaim/compaction did not help. Invoke the OOM killer unless __GFP_NORETRY is
> >   specified."
> > 
> > Thus, __GFP_NORETRY | __GFP_NOFAIL perfectly makes sense as
> > "Never fail allocation request if direct reclaim/compaction did not help.
> > But do not invoke the OOM killer even if direct reclaim/compaction did not help."
> 
> Stop this! Seriously... This is just wasting time...

You are free to ignore me. But

> 
>  * __GFP_NORETRY: The VM implementation must not retry indefinitely and will
>  *   return NULL when direct reclaim and memory compaction have failed to allow
>  *   the allocation to succeed.  The OOM killer is not called with the current
>  *   implementation.
> 
>  * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
>  *   cannot handle allocation failures. New users should be evaluated carefully
>  *   (and the flag should be used only when there is no reasonable failure
>  *   policy) but it is definitely preferable to use the flag rather than
>  *   opencode endless loop around allocator.
> 
> Can you see how the two are asking for opposite behavior?  Asking for
> not retrying for ever and not failing and rather retrying for ever
> simply doesn't make any sense in any reasonable universe I can think
> of. Therefore I think that it is fair to say that behavior is undefined
> when both are specified.

I consider that I'm using __GFP_NORETRY as a means to avoid calling the OOM
killer rather than to avoid retrying indefinitely. Therefore, I want

  __GFP_NOOOMKILL: The VM implementation must not call the OOM killer when
  direct reclaim and memory compaction have failed to allow the allocation
  to succeed.

and __GFP_NOOOMKILL | __GFP_NOFAIL makes sense.

Technically PATCH 1/3 allows __GFP_NOOOMKILL | __GFP_NOFAIL emulation
via __GFP_NORETRY | __GFP_NOFAIL. If you don't like such emulation,
I welcome __GFP_NOOOMKILL.
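The two-flag reading described above can be written down as a tiny
userspace decision model. The flag values and the function are made up for
illustration; this only encodes the post-PATCH-1/3 semantics as quoted
earlier in the thread, not the real slowpath:

```c
#include <stdbool.h>

#define __GFP_NORETRY 0x1
#define __GFP_NOFAIL  0x2

struct outcome {
	bool may_fail;     /* allocation may return NULL */
	bool invokes_oom;  /* OOM killer may be invoked */
};

/* Model of the slowpath after PATCH 1/3 as described above:
 * __GFP_NOFAIL never fails; __GFP_NORETRY suppresses the OOM killer. */
static struct outcome slowpath_after_patch1(unsigned int gfp)
{
	struct outcome o;

	o.may_fail = !(gfp & __GFP_NOFAIL);
	o.invokes_oom = !(gfp & __GFP_NORETRY);
	return o;
}
```

In this model, __GFP_NORETRY | __GFP_NOFAIL yields "never fail, never
OOM-kill" - exactly the __GFP_NOOOMKILL-style behavior being argued for.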

> 
> Considering there are _no_ users which would do that any further
> discussion about this is just pointless and I will not respond to any
> further emails in this direction.
> 
> This is just ridiculous!

Regardless of whether we define __GFP_NOOOMKILL, I wonder whether we need
PATCH 2/3 now, because currently premature OOM killer invocation due to
!__GFP_FS && __GFP_NOFAIL is a hypothetical problem. We can consider
PATCH 2/3 (or __GFP_NOOOMKILL) when someone reports OOM killer invocation
via !__GFP_FS && __GFP_NOFAIL and confirms that the memory counters say it
was premature enough to warrant suppressing the OOM killer invocation.

> 
> [...]
> >  void *kvmalloc_node(size_t size, gfp_t flags, int node)
> >  {
> >  	gfp_t kmalloc_flags = flags;
> >  	void *ret;
> >  
> >  	/*
> >  	 * vmalloc uses GFP_KERNEL for some internal allocations (e.g page tables)
> >  	 * so the given set of flags has to be compatible.
> >  	 */
> >  	WARN_ON_ONCE((flags & GFP_KERNEL) != GFP_KERNEL);
> >  
> >  	/*
> >  	 * Make sure that larger requests are not too disruptive - no OOM
> >  	 * killer and no allocation failure warnings as we have a fallback
> >  	 */
> > -	if (size > PAGE_SIZE)
> > +	if (size > PAGE_SIZE) {
> >  		kmalloc_flags |= __GFP_NORETRY | __GFP_NOWARN;
> > +		kmalloc_flags &= ~__GFP_NOFAIL;
> > +	}
> 
> No, there are simply no users of this, and even if we had a legitimate
> one it wouldn't be as simple as this. vmalloc _doesn't_ support
> GFP_NOFAIL and it would be really non-trivial to implement it. If for
> nothing else, there are unconditional GFP_KERNEL allocations in some
> vmalloc paths (which is, btw., the reason why vmalloc is not GFP_NOFS
> safe). It would take much more to add the non-failing semantic. And I
> see _no_ reason to argue about this possibility when a) there is no such
> user currently and b) it is not even clear whether we want to support
> such a usecase.

I didn't know vmalloc() doesn't support GFP_NOFAIL.
Anyway, we don't need to make such a change now because of a).

> 
> [...]
> > > > in http://lkml.kernel.org/r/20161218163727.GC8440@dhcp22.suse.cz .
> > > > Indeed that trace is a __GFP_DIRECT_RECLAIM and it might not be blocking
> > > > other workqueue items which a regular I/O depend on, I think there are
> > > > !__GFP_DIRECT_RECLAIM memory allocation requests for issuing SCSI commands
> > > > which could potentially start failing due to helping GFP_NOFS | __GFP_NOFAIL
> > > > allocations with memory reserves. If a SCSI disk I/O request fails due to
> > > > GFP_ATOMIC memory allocation failures because we allow a FS I/O request to
> > > > use memory reserves, it adds a new problem.
> > > 
> > > Do you have any example of such a request? Anything that requires
> > > forward progress during IO should be using mempools, otherwise it
> > > is broken pretty much by design already. Also, IO depending on NOFS
> > > allocations sounds pretty much broken already. So I suspect the above
> > > reasoning is just bogus.
> > 
> > You are missing my point. My question is "who needs memory reserves".
> > I'm not saying that disk I/O depends on GFP_NOFS allocation. I'm worrying
> > that [PATCH 3/3] consumes memory reserves when disk I/O also depends on
> > memory reserves.
> > 
> > My understanding is that when accessing SCSI disks, SCSI protocol is used.
> > SCSI driver allocates memory at runtime for using SCSI protocol using
> > GFP_ATOMIC. And GFP_ATOMIC uses some of memory reserves. But [PATCH 3/3]
> > also uses memory reserves. If memory reserves are consumed by [PATCH 3/3]
> > to the level where GFP_ATOMIC cannot succeed, I think it causes troubles.
> 
> Yes, and GFP_ATOMIC will have deeper access to memory reserves than what
> we are giving access to in patch 3. There is a difference between
> ALLOC_HARDER and ALLOC_HIGH. This is described in the changelog. Sure,
> GFP_NOFAIL will eat into part of the reserves which GFP_ATOMIC (aka
> __GFP_HIGH) could have used, but a) this shouldn't happen unless we are
> really getting out of memory and b) it should help other presumably
> important allocations (why would they be GFP_NOFAIL otherwise, right?).
> So it is not just a free ticket to a scarce resource and IMHO it is
> justified.

If you try to allow !__GFP_FS allocations to fail rather than retry,
someone might start scattering __GFP_NOFAIL regardless of importance.

> 
> > I'm unable to obtain nice backtraces, but I think we can confirm that
> > there are GFP_ATOMIC allocations (e.g. sg_alloc_table_chained() calls
> > __sg_alloc_table(GFP_ATOMIC)) when we are using SCSI disks.
> 
> How are those blocking further progress? Failing atomic allocations are
> nothing to lose sleep over. They cannot, pretty much by definition, be
> relied on to make further progress.

So, regarding the simple SCSI disk case, it is guaranteed that a disk I/O
request can recover from transient failures (e.g. a timeout?) and complete
unless fatal failures (e.g. broken hardware?) occur, isn't it? Then,
PATCH 3/3 would be helpful for this case.

What about other cases, such as loopback devices ( /dev/loopX ) and/or
networked storage? Is it also guaranteed that their I/O requests are never
blocked on memory allocation requests which are not allowed to access
memory reserves? If yes, PATCH 3/3 would be helpful. If no, I think
what we need is a mechanism to propagate permission to access memory
reserves, similar to the scope GFP_NOFS API.
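A scope API of the kind suggested above, in the style of the NOFS scope
API, might look roughly like the userspace sketch below: a per-task flag
grants every nested allocation access to reserves instead of marking each
gfp_mask individually. All names here are hypothetical, not an existing
kernel interface:

```c
/* Hypothetical scope API modeled on the save/restore pattern of the
 * NOFS scope API. The thread-local variable stands in for a bit in
 * current->flags; every name below is made up for illustration. */
static __thread unsigned int task_alloc_flags;

#define TASK_MEMALLOC_RESERVES 0x1

static unsigned int memalloc_reserves_save(void)
{
	unsigned int old = task_alloc_flags & TASK_MEMALLOC_RESERVES;

	task_alloc_flags |= TASK_MEMALLOC_RESERVES;
	return old;
}

static void memalloc_reserves_restore(unsigned int old)
{
	task_alloc_flags = (task_alloc_flags & ~TASK_MEMALLOC_RESERVES) | old;
}

/* The allocator would consult the task flag in addition to the gfp mask,
 * so allocations issued by lower layers inherit the permission. */
static int allocation_may_use_reserves(int gfp_nofail)
{
	return gfp_nofail || (task_alloc_flags & TASK_MEMALLOC_RESERVES);
}
```

The save/restore pair nests safely, which is what would let a filesystem
open a scope around an I/O submission and have driver-level allocations
inherit the reserve access.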

> 
> [...]
> 
> I am _really_ getting tired of this discussion. You are making wrong or
> unfounded claims again and again. I have no idea what you are trying to
> achieve here, but I simply do not see any sense in continuing this
> discussion.
> -- 
> Michal Hocko
> SUSE Labs
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 0/3 -v3] GFP_NOFAIL cleanups
  2017-01-04 14:22           ` Tetsuo Handa
@ 2017-01-04 15:20             ` Michal Hocko
  2017-01-05 10:50               ` Tetsuo Handa
  0 siblings, 1 reply; 24+ messages in thread
From: Michal Hocko @ 2017-01-04 15:20 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, hannes, rientjes, mgorman, hillf.zj, linux-mm, linux-kernel

On Wed 04-01-17 23:22:24, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Tue 03-01-17 23:38:30, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > On Tue 03-01-17 10:36:31, Tetsuo Handa wrote:
> > > > [...]
> > > > > I'm OK with "[PATCH 1/3] mm: consolidate GFP_NOFAIL checks in the allocator
> > > > > slowpath" given that we describe that we make __GFP_NOFAIL stronger than
> > > > > __GFP_NORETRY with this patch in the changelog.
> > > > 
> > > > Again. __GFP_NORETRY | __GFP_NOFAIL is nonsense! I do not really see any
> > > > reason to describe all the nonsense combinations of gfp flags.
> > > 
> > > Before [PATCH 1/3]:
> > > 
> > >   __GFP_NORETRY is used as "Do not invoke the OOM killer. Fail allocation
> > >   request even if __GFP_NOFAIL is specified if direct reclaim/compaction
> > >   did not help."
> > > 
> > >   __GFP_NOFAIL is used as "Never fail allocation request unless __GFP_NORETRY
> > >   is specified even if direct reclaim/compaction did not help."
> > > 
> > > After [PATCH 1/3]:
> > > 
> > >   __GFP_NORETRY is used as "Do not invoke the OOM killer. Fail allocation
> > >   request unless __GFP_NOFAIL is specified."
> > > 
> > >   __GFP_NOFAIL is used as "Never fail allocation request even if direct
> > >   reclaim/compaction did not help. Invoke the OOM killer unless __GFP_NORETRY is
> > >   specified."
> > > 
> > > Thus, __GFP_NORETRY | __GFP_NOFAIL perfectly makes sense as
> > > "Never fail allocation request if direct reclaim/compaction did not help.
> > > But do not invoke the OOM killer even if direct reclaim/compaction did not help."
> > 
> > Stop this! Seriously... This is just wasting time...
> 
> You are free to ignore me. But

my last reply in this subthread

> >  * __GFP_NORETRY: The VM implementation must not retry indefinitely and will
> >  *   return NULL when direct reclaim and memory compaction have failed to allow
> >  *   the allocation to succeed.  The OOM killer is not called with the current
> >  *   implementation.
> > 
> >  * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
> >  *   cannot handle allocation failures. New users should be evaluated carefully
> >  *   (and the flag should be used only when there is no reasonable failure
> >  *   policy) but it is definitely preferable to use the flag rather than
> >  *   opencode endless loop around allocator.
> > 
> > Can you see how the two are asking for opposite behavior?  Asking for
> > not retrying for ever and not failing and rather retrying for ever
> > simply doesn't make any sense in any reasonable universe I can think
> > of. Therefore I think that it is fair to say that behavior is undefined
> > when both are specified.
> 
> I consider that I'm using __GFP_NORETRY as a means to avoid calling the OOM
> killer rather than to avoid retrying indefinitely. Therefore, I want

This would be an abuse. Not invoking the oom killer is an implementation
detail and can change in the future. This is documented.

>   __GFP_NOOOMKILL: The VM implementation must not call the OOM killer when
>   direct reclaim and memory compaction have failed to allow the allocation
>   to succeed.

I am not going to give such a flag to users. The longer I look into how
those flags are used, the more I am convinced that they are very often
wrong when trying to be too clever. The decision whether to trigger the
OOM killer or not is an MM-internal thing, and _no code_ outside the MM
proper has any say in it.
 
> and __GFP_NOOOMKILL | __GFP_NOFAIL makes sense.

and this example just shows why I think that my cautiousness is
justified...

> Technically PATCH 1/3 allows __GFP_NOOOMKILL | __GFP_NOFAIL emulation
> via __GFP_NORETRY | __GFP_NOFAIL. If you don't like such emulation,
> I welcome __GFP_NOOOMKILL.
> 
> > 
> > Considering there are _no_ users which would do that any further
> > discussion about this is just pointless and I will not respond to any
> > further emails in this direction.
> > 
> > This is just ridiculous!
> 
> Regardless of whether we define __GFP_NOOOMKILL, I wonder whether we need
> PATCH 2/3 now, because currently premature OOM killer invocation due to
> !__GFP_FS && __GFP_NOFAIL is a hypothetical problem. We can consider
> PATCH 2/3 (or __GFP_NOOOMKILL) when someone reports OOM killer invocation
> via !__GFP_FS && __GFP_NOFAIL and confirms that the memory counters say it
> was premature enough to warrant suppressing the OOM killer invocation.

Again. GFP_NOFS should behave consistently regardless of GFP_NOFAIL. The
mere fact that the opencoded endless loop around GFP_NOFS behaves
differently is something that raises a red flag. I want to fix that. So no,
I really do not want to keep the status quo.

[...]
> > > I'm unable to obtain nice backtraces, but I think we can confirm that
> > > there are GFP_ATOMIC allocations (e.g. sg_alloc_table_chained() calls
> > > __sg_alloc_table(GFP_ATOMIC)) when we are using SCSI disks.
> > 
> > How are those blocking further progress? Failing atomic allocations are
> > nothing to lose sleep over. They cannot, pretty much by definition, be
> > relied on to make further progress.
> 
> So, regarding the simple SCSI disk case, it is guaranteed that a disk I/O
> request can recover from transient failures (e.g. a timeout?) and complete
> unless fatal failures (e.g. broken hardware?) occur, isn't it? Then,
> PATCH 3/3 would be helpful for this case.
> 
> What about other cases, such as loopback devices ( /dev/loopX ) and/or
> networked storage? Is it also guaranteed that their I/O requests are never
> blocked on memory allocation requests which are not allowed to access
> memory reserves? If yes, PATCH 3/3 would be helpful. If no, I think
> what we need is a mechanism to propagate permission to access memory
> reserves, similar to the scope GFP_NOFS API.

Again, anything which cannot recover from failing GFP_ATOMIC requests is
broken by definition.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 0/3 -v3] GFP_NOFAIL cleanups
  2017-01-04 15:20             ` Michal Hocko
@ 2017-01-05 10:50               ` Tetsuo Handa
  2017-01-05 11:54                 ` Michal Hocko
  0 siblings, 1 reply; 24+ messages in thread
From: Tetsuo Handa @ 2017-01-05 10:50 UTC (permalink / raw)
  To: mhocko; +Cc: akpm, hannes, rientjes, mgorman, hillf.zj, linux-mm, linux-kernel

Michal Hocko wrote:
> > > Stop this! Seriously... This is just wasting time...
> > 
> > You are free to ignore me. But
> 
> my last reply in this subthread
> 

OK. You can ignore me; I just won't give my Acked-by: or Reviewed-by: to this series.

My understanding is that we changed to tolerate __GFP_NOFAIL usage because
allocation failure leads to unacceptable side effects (e.g. remounting
read-only, kernel panic), not because the allocation helps reclaim memory.

  commit 647757197cd34fae ("mm: clarify __GFP_NOFAIL deprecation status")
  commit 277fb5fc177dc467 ("btrfs: use __GFP_NOFAIL in alloc_btrfs_bio")

If we allow access to memory reserves based on __GFP_NOFAIL, I don't know
whether __GFP_NOFAIL users chose the flag because it helps reclaim memory
or merely because allocation failure leads to unacceptable side effects.

  commit 7444a072c387a93e ("ext4: replace open coded nofail allocation in ext4_free_blocks()")
  commit adb7ef600cc9d9d1 ("ext4: use __GFP_NOFAIL in ext4_free_blocks()")
  commit c9af28fdd44922a6 ("ext4 crypto: don't let data integrity writebacks fail with ENOMEM")
  commit b32e4482aadfd132 ("fscrypto: don't let data integrity writebacks fail with ENOMEM")
  commit 80c545055dc7c1f7 ("f2fs: use __GFP_NOFAIL to avoid infinite loop")

If __GFP_NOFAIL users chose __GFP_NOFAIL based on whether allocation failure
leads to unacceptable side effects, allowing access to memory reserves based
on __GFP_NOFAIL might not help reclaim memory; something like the scope
GFP_NOFS API will be needed.

Anyway, I suggest merging the description update shown below into this
series and getting confirmation from all existing __GFP_NOFAIL users. If all
existing __GFP_NOFAIL users are OK with this series (in other words, are
informed of the risk caused by this series), I'm also OK with this series.

--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -135,16 +135,24 @@
  * __GFP_REPEAT: Try hard to allocate the memory, but the allocation attempt
  *   _might_ fail.  This depends upon the particular VM implementation.
  *
- * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
- *   cannot handle allocation failures. New users should be evaluated carefully
- *   (and the flag should be used only when there is no reasonable failure
- *   policy) but it is definitely preferable to use the flag rather than
- *   opencode endless loop around allocator.
- *
- * __GFP_NORETRY: The VM implementation must not retry indefinitely and will
- *   return NULL when direct reclaim and memory compaction have failed to allow
- *   the allocation to succeed.  The OOM killer is not called with the current
- *   implementation.
+ * __GFP_NOFAIL: The VM implementation must not give up even after direct
+ *   reclaim and memory compaction have failed to allow the allocation to
+ *   succeed. Note that since the OOM killer is not called with the current
+ *   implementation when direct reclaim and memory compaction have failed to
+ *   allow the allocation to succeed unless __GFP_FS is also used (and some
+ *   other conditions are met), e.g. GFP_NOFS | __GFP_NOFAIL allocation has
+ *   possibility of lockup. To reduce the possibility of lockup, __GFP_HIGH is
+ *   implicitly granted by the current implementation if __GFP_NOFAIL is used.
+ *   New users of __GFP_NOFAIL should be evaluated carefully (and __GFP_NOFAIL
+ *   should be used only when there is no reasonable failure policy) but it is
+ *   definitely preferable to use __GFP_NOFAIL rather than opencode endless
+ *   loop around allocator, for a stall detection check inside allocator will
+ *   likely be able to emit possible lockup warnings unless __GFP_NOWARN is
+ *   also used.
+ *
+ * __GFP_NORETRY: The VM implementation must give up as soon as direct reclaim
+ *   and memory compaction have failed to allow the allocation to succeed.
+ *   Therefore, __GFP_NORETRY cannot be used with __GFP_NOFAIL.
  */
 #define __GFP_IO	((__force gfp_t)___GFP_IO)
 #define __GFP_FS	((__force gfp_t)___GFP_FS)

I do not like the "mm, oom: get rid of TIF_MEMDIE" series because you have
not gotten confirmation from all users who might be affected by that series
(e.g. allocations inside do_exit() which currently do not fail starting to
fail). If you clarify the possible side effects and get confirmation from
the affected users (in case some users might need to add __GFP_NOFAIL),
I will be OK with that series as well.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 0/3 -v3] GFP_NOFAIL cleanups
  2017-01-05 10:50               ` Tetsuo Handa
@ 2017-01-05 11:54                 ` Michal Hocko
  0 siblings, 0 replies; 24+ messages in thread
From: Michal Hocko @ 2017-01-05 11:54 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, hannes, rientjes, mgorman, hillf.zj, linux-mm, linux-kernel

On Thu 05-01-17 19:50:23, Tetsuo Handa wrote:
[...]
> Anyway, I suggest merging description update shown below into this series and
> getting confirmation from all existing __GFP_NOFAIL users. If all existing
> __GFP_NOFAIL users are OK with this series (in other words, informed the risk
> caused by this series), I'm also OK with this series.
> 
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -135,16 +135,24 @@
>   * __GFP_REPEAT: Try hard to allocate the memory, but the allocation attempt
>   *   _might_ fail.  This depends upon the particular VM implementation.
>   *
> - * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
> - *   cannot handle allocation failures. New users should be evaluated carefully
> - *   (and the flag should be used only when there is no reasonable failure
> - *   policy) but it is definitely preferable to use the flag rather than
> - *   opencode endless loop around allocator.
> - *
> - * __GFP_NORETRY: The VM implementation must not retry indefinitely and will
> - *   return NULL when direct reclaim and memory compaction have failed to allow
> - *   the allocation to succeed.  The OOM killer is not called with the current
> - *   implementation.
> + * __GFP_NOFAIL: The VM implementation must not give up even after direct
> + *   reclaim and memory compaction have failed to allow the allocation to
> + *   succeed. Note that since the OOM killer is not called with the current
> + *   implementation when direct reclaim and memory compaction have failed to
> + *   allow the allocation to succeed unless __GFP_FS is also used (and some
> + *   other conditions are met), e.g. GFP_NOFS | __GFP_NOFAIL allocation has
> + *   possibility of lockup. To reduce the possibility of lockup, __GFP_HIGH is
> + *   implicitly granted by the current implementation if __GFP_NOFAIL is used.
> + *   New users of __GFP_NOFAIL should be evaluated carefully (and __GFP_NOFAIL
> + *   should be used only when there is no reasonable failure policy) but it is
> + *   definitely preferable to use __GFP_NOFAIL rather than opencode endless
> + *   loop around allocator, for a stall detection check inside allocator will
> + *   likely be able to emit possible lockup warnings unless __GFP_NOWARN is
> + *   also used.

This is both wrong and unnecessarily describing implementation details.
A non-failing allocation which must not give up can lock up pretty much by
definition. IMHO the current description is sufficient.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 0/3 -v3] GFP_NOFAIL cleanups
  2016-12-20 13:49 [PATCH 0/3 -v3] GFP_NOFAIL cleanups Michal Hocko
                   ` (3 preceding siblings ...)
  2017-01-02 15:49 ` [PATCH 0/3 -v3] GFP_NOFAIL cleanups Michal Hocko
@ 2017-01-18 18:42 ` Michal Hocko
  4 siblings, 0 replies; 24+ messages in thread
From: Michal Hocko @ 2017-01-18 18:42 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Tetsuo Handa, David Rientjes, Mel Gorman,
	Hillf Danton, linux-mm, LKML

On Tue 20-12-16 14:49:01, Michal Hocko wrote:
> Hi,
> This has been posted [1] initially to later be reduced to a single patch
> [2].  Johannes then suggested [3] to split up the second patch and make
> the access to memory reserves by __GFP_NOFAIL requests which do not
> invoke the oom killer a separate change. This is patch 3 now.
> 
> Tetsuo has noticed [4] that recent changes have changed GFP_NOFAIL
> semantic for costly order requests. I believe that the primary reason
> why this happened is that our GFP_NOFAIL checks are too scattered
> and it is really easy to forget about adding one. That's why I am
> proposing patch 1 which consolidates all the nofail handling at a single
> place. This should help to make this code better maintainable.
> 
> Patch 2 on top is a further attempt to make GFP_NOFAIL semantic less
> surprising. As things stand currently GFP_NOFAIL overrides the oom killer
> prevention code which is both subtle and not really needed. The patch 2
> has more details about issues this might cause. We have also seen
> a report where __GFP_NOFAIL|GFP_NOFS requests cause the oom killer which
> is premature.
> 
> Patch 3 is an attempt to reduce chances of GFP_NOFAIL requests being
> preempted by other memory consumers by giving them access to memory
> reserves.
> 
> [1] http://lkml.kernel.org/r/20161123064925.9716-1-mhocko@kernel.org
> [2] http://lkml.kernel.org/r/20161214150706.27412-1-mhocko@kernel.org
> [3] http://lkml.kernel.org/r/20161216173151.GA23182@cmpxchg.org
> [4] http://lkml.kernel.org/r/1479387004-5998-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp

Friendly ping on this.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/3] mm, oom: do not enfore OOM killer for __GFP_NOFAIL automatically
  2016-12-20 13:49 ` [PATCH 2/3] mm, oom: do not enfore OOM killer for __GFP_NOFAIL automatically Michal Hocko
  2016-12-20 15:31   ` Tetsuo Handa
@ 2017-01-19 18:41   ` Johannes Weiner
  2017-01-20  8:33   ` Hillf Danton
  2 siblings, 0 replies; 24+ messages in thread
From: Johannes Weiner @ 2017-01-19 18:41 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Tetsuo Handa, David Rientjes, Mel Gorman,
	Hillf Danton, linux-mm, LKML, Michal Hocko

On Tue, Dec 20, 2016 at 02:49:03PM +0100, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> __alloc_pages_may_oom makes sure to skip the OOM killer depending on
> the allocation request. This includes lowmem requests, costly high
> order requests and others. For a long time __GFP_NOFAIL acted as an
> override for all those rules. This is not documented and it can be quite
> surprising as well. E.g. GFP_NOFS requests are not invoking the OOM
> killer but GFP_NOFS|__GFP_NOFAIL does, so if we try to convert some of
> the existing open coded loops around the allocator to a nofail request
> (and we have done that in the past) then such a change would have a
> non-trivial side effect which is far from obvious. Note that the primary
> motivation for skipping the OOM killer is to prevent premature invocation.
> 
> The exception has been added by 82553a937f12 ("oom: invoke oom killer
> for __GFP_NOFAIL"). The changelog points out that the oom killer has to
> be invoked otherwise the request would be looping for ever. But this
> argument is rather weak because the OOM killer doesn't really guarantee
> a forward progress for those exceptional cases:
> 	- it will hardly help to form a costly order page, which in turn
> 	  can result in a system panic because of no oom-killable task in
> 	  the end - I believe we certainly do not want to put the system
> 	  down just because there is a nasty driver asking for an order-9
> 	  page with GFP_NOFAIL, not realizing all the consequences. It is
> 	  much better for this request to loop for ever than to cause
> 	  massive system disruption
> 	- lowmem is also highly unlikely to be freed during OOM killer
> 	- GFP_NOFS request could trigger while there is still a lot of
> 	  memory pinned by filesystems.
> 
> The premature OOM killer invocation is a real issue as reported by Nils Holland
> 	kworker/u4:5 invoked oom-killer: gfp_mask=0x2400840(GFP_NOFS|__GFP_NOFAIL), nodemask=0, order=0, oom_score_adj=0
> 	kworker/u4:5 cpuset=/ mems_allowed=0
> 	CPU: 1 PID: 2603 Comm: kworker/u4:5 Not tainted 4.9.0-gentoo #2
> 	Hardware name: Hewlett-Packard Compaq 15 Notebook PC/21F7, BIOS F.22 08/06/2014
> 	Workqueue: writeback wb_workfn (flush-btrfs-1)
> 	 eff0b604 c142bcce eff0b734 00000000 eff0b634 c1163332 00000000 00000292
> 	 eff0b634 c1431876 eff0b638 e7fb0b00 e7fa2900 e7fa2900 c1b58785 eff0b734
> 	 eff0b678 c110795f c1043895 eff0b664 c11075c7 00000007 00000000 00000000
> 	Call Trace:
> 	 [<c142bcce>] dump_stack+0x47/0x69
> 	 [<c1163332>] dump_header+0x60/0x178
> 	 [<c1431876>] ? ___ratelimit+0x86/0xe0
> 	 [<c110795f>] oom_kill_process+0x20f/0x3d0
> 	 [<c1043895>] ? has_capability_noaudit+0x15/0x20
> 	 [<c11075c7>] ? oom_badness.part.13+0xb7/0x130
> 	 [<c1107df9>] out_of_memory+0xd9/0x260
> 	 [<c110ba0b>] __alloc_pages_nodemask+0xbfb/0xc80
> 	 [<c110414d>] pagecache_get_page+0xad/0x270
> 	 [<c13664a6>] alloc_extent_buffer+0x116/0x3e0
> 	 [<c1334a2e>] btrfs_find_create_tree_block+0xe/0x10
> 	[...]
> 	Normal free:41332kB min:41368kB low:51708kB high:62048kB active_anon:0kB inactive_anon:0kB active_file:532748kB inactive_file:44kB unevictable:0kB writepending:24kB present:897016kB managed:836248kB mlocked:0kB slab_reclaimable:159448kB slab_unreclaimable:69608kB kernel_stack:1112kB pagetables:1404kB bounce:0kB free_pcp:528kB local_pcp:340kB free_cma:0kB
> 	lowmem_reserve[]: 0 0 21292 21292
> 	HighMem free:781660kB min:512kB low:34356kB high:68200kB active_anon:234740kB inactive_anon:360kB active_file:557232kB inactive_file:1127804kB unevictable:0kB writepending:2592kB present:2725384kB managed:2725384kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:800kB local_pcp:608kB free_cma:0kB
> 
> this is a GFP_NOFS|__GFP_NOFAIL request which invokes the OOM killer
> because there is clearly nothing reclaimable in the zone Normal while
> there is a lot of page cache which is most probably pinned by the fs but
> GFP_NOFS cannot reclaim it.
> 
> This patch simply removes the __GFP_NOFAIL special case in order to have
> a clearer semantic without surprising side effects.
> 
> Reported-by: Nils Holland <nholland@tisys.org>
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/3] mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically
  2016-12-20 13:49 ` [PATCH 2/3] mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically Michal Hocko
  2016-12-20 15:31   ` Tetsuo Handa
  2017-01-19 18:41   ` Johannes Weiner
@ 2017-01-20  8:33   ` Hillf Danton
  2017-01-24 12:40     ` Michal Hocko
  2 siblings, 1 reply; 24+ messages in thread
From: Hillf Danton @ 2017-01-20  8:33 UTC (permalink / raw)
  To: 'Michal Hocko', 'Andrew Morton'
  Cc: 'Johannes Weiner', 'Tetsuo Handa',
	'David Rientjes', 'Mel Gorman',
	linux-mm, 'LKML', 'Michal Hocko'


On Tuesday, December 20, 2016 9:49 PM Michal Hocko wrote: 
> 
> @@ -1013,7 +1013,7 @@ bool out_of_memory(struct oom_control *oc)
>  	 * make sure exclude 0 mask - all other users should have at least
>  	 * ___GFP_DIRECT_RECLAIM to get here.
>  	 */
> -	if (oc->gfp_mask && !(oc->gfp_mask & (__GFP_FS|__GFP_NOFAIL)))
> +	if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS))
>  		return true;
> 
As to GFP_NOFS|__GFP_NOFAIL request, can we check gfp mask
one bit after another?

	if (oc->gfp_mask) {
		if (!(oc->gfp_mask & __GFP_FS))
			return false;

		/* No service for request that can handle fail result itself */
		if (!(oc->gfp_mask & __GFP_NOFAIL))
			return false;
	}

thanks
Hillf


* Re: [PATCH 2/3] mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically
  2017-01-20  8:33   ` Hillf Danton
@ 2017-01-24 12:40     ` Michal Hocko
  2017-01-25  7:00       ` Hillf Danton
  0 siblings, 1 reply; 24+ messages in thread
From: Michal Hocko @ 2017-01-24 12:40 UTC (permalink / raw)
  To: Hillf Danton
  Cc: 'Andrew Morton', 'Johannes Weiner',
	'Tetsuo Handa', 'David Rientjes',
	'Mel Gorman', linux-mm, 'LKML'

On Fri 20-01-17 16:33:36, Hillf Danton wrote:
> 
> On Tuesday, December 20, 2016 9:49 PM Michal Hocko wrote: 
> > 
> > @@ -1013,7 +1013,7 @@ bool out_of_memory(struct oom_control *oc)
> >  	 * make sure exclude 0 mask - all other users should have at least
> >  	 * ___GFP_DIRECT_RECLAIM to get here.
> >  	 */
> > -	if (oc->gfp_mask && !(oc->gfp_mask & (__GFP_FS|__GFP_NOFAIL)))
> > +	if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS))
> >  		return true;
> > 
> As to GFP_NOFS|__GFP_NOFAIL request, can we check gfp mask
> one bit after another?
> 
> 	if (oc->gfp_mask) {
> 		if (!(oc->gfp_mask & __GFP_FS))
> 			return false;
> 
> 		/* No service for request that can handle fail result itself */
> 		if (!(oc->gfp_mask & __GFP_NOFAIL))
> 			return false;
> 	}

I really do not understand this request. This patch is removing the
__GFP_NOFAIL part... Besides that why should they return false?
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 2/3] mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically
  2017-01-24 12:40     ` Michal Hocko
@ 2017-01-25  7:00       ` Hillf Danton
  2017-01-25  7:59         ` Michal Hocko
  0 siblings, 1 reply; 24+ messages in thread
From: Hillf Danton @ 2017-01-25  7:00 UTC (permalink / raw)
  To: 'Michal Hocko'
  Cc: 'Andrew Morton', 'Johannes Weiner',
	'Tetsuo Handa', 'David Rientjes',
	'Mel Gorman', linux-mm, 'LKML'

On Tuesday, January 24, 2017 8:41 PM Michal Hocko wrote: 
> On Fri 20-01-17 16:33:36, Hillf Danton wrote:
> >
> > On Tuesday, December 20, 2016 9:49 PM Michal Hocko wrote:
> > >
> > > @@ -1013,7 +1013,7 @@ bool out_of_memory(struct oom_control *oc)
> > >  	 * make sure exclude 0 mask - all other users should have at least
> > >  	 * ___GFP_DIRECT_RECLAIM to get here.
> > >  	 */
> > > -	if (oc->gfp_mask && !(oc->gfp_mask & (__GFP_FS|__GFP_NOFAIL)))
> > > +	if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS))
> > >  		return true;
> > >
> > As to GFP_NOFS|__GFP_NOFAIL request, can we check gfp mask
> > one bit after another?
> >
> > 	if (oc->gfp_mask) {
> > 		if (!(oc->gfp_mask & __GFP_FS))
> > 			return false;
> >
> > 		/* No service for request that can handle fail result itself */
> > 		if (!(oc->gfp_mask & __GFP_NOFAIL))
> > 			return false;
> > 	}
> 
> I really do not understand this request. 

It's a request with both NOFS and NOFAIL set, and I think we can keep it
from hitting the OOM killer by shuffling the current gfp checks.
I hope it makes at least a bit of sense for your work.

> This patch is removing the __GFP_NOFAIL part... 

Yes, and I don't insist on handling NOFAIL requests inside the OOM code.
 
> Besides that why should they return false?

It's feedback to the page allocator that no kill was issued and that
extra attention is needed.

thanks
Hillf


* Re: [PATCH 2/3] mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically
  2017-01-25  7:00       ` Hillf Danton
@ 2017-01-25  7:59         ` Michal Hocko
  2017-01-25  8:41           ` Hillf Danton
  0 siblings, 1 reply; 24+ messages in thread
From: Michal Hocko @ 2017-01-25  7:59 UTC (permalink / raw)
  To: Hillf Danton
  Cc: 'Andrew Morton', 'Johannes Weiner',
	'Tetsuo Handa', 'David Rientjes',
	'Mel Gorman', linux-mm, 'LKML'

On Wed 25-01-17 15:00:51, Hillf Danton wrote:
> On Tuesday, January 24, 2017 8:41 PM Michal Hocko wrote: 
> > On Fri 20-01-17 16:33:36, Hillf Danton wrote:
> > >
> > > On Tuesday, December 20, 2016 9:49 PM Michal Hocko wrote:
> > > >
> > > > @@ -1013,7 +1013,7 @@ bool out_of_memory(struct oom_control *oc)
> > > >  	 * make sure exclude 0 mask - all other users should have at least
> > > >  	 * ___GFP_DIRECT_RECLAIM to get here.
> > > >  	 */
> > > > -	if (oc->gfp_mask && !(oc->gfp_mask & (__GFP_FS|__GFP_NOFAIL)))
> > > > +	if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS))
> > > >  		return true;
> > > >
> > > As to GFP_NOFS|__GFP_NOFAIL request, can we check gfp mask
> > > one bit after another?
> > >
> > > 	if (oc->gfp_mask) {
> > > 		if (!(oc->gfp_mask & __GFP_FS))
> > > 			return false;
> > >
> > > 		/* No service for request that can handle fail result itself */
> > > 		if (!(oc->gfp_mask & __GFP_NOFAIL))
> > > 			return false;
> > > 	}
> > 
> > I really do not understand this request. 
> 
> It's a request with both NOFS and NOFAIL set, and I think we can keep it
> from hitting the OOM killer by shuffling the current gfp checks.
> I hope it makes at least a bit of sense for your work.
> 

I still do not understand. The whole point we are doing the late
__GFP_FS check is explained in 3da88fb3bacf ("mm, oom: move GFP_NOFS
check to out_of_memory"). And the reason why I am _removing_
__GFP_NOFAIL is explained in the changelog of this patch.

> > This patch is removing the __GFP_NOFAIL part... 
> 
> Yes, and I don't insist on handling NOFAIL requests inside the OOM code.
>  
> > Besides that why should they return false?
> 
> It's feedback to the page allocator that no kill was issued and that
> extra attention is needed.

Be careful, the semantic of out_of_memory is different. Returning false
means that the oom killer has been disabled and so the allocation should
fail rather than loop for ever.

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 2/3] mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically
  2017-01-25  7:59         ` Michal Hocko
@ 2017-01-25  8:41           ` Hillf Danton
  2017-01-25 10:19             ` Michal Hocko
  0 siblings, 1 reply; 24+ messages in thread
From: Hillf Danton @ 2017-01-25  8:41 UTC (permalink / raw)
  To: 'Michal Hocko'
  Cc: 'Andrew Morton', 'Johannes Weiner',
	'Tetsuo Handa', 'David Rientjes',
	'Mel Gorman', linux-mm, 'LKML'

On Wednesday, January 25, 2017 4:00 PM Michal Hocko wrote: 
> On Wed 25-01-17 15:00:51, Hillf Danton wrote:
> > On Tuesday, January 24, 2017 8:41 PM Michal Hocko wrote:
> > > On Fri 20-01-17 16:33:36, Hillf Danton wrote:
> > > >
> > > > On Tuesday, December 20, 2016 9:49 PM Michal Hocko wrote:
> > > > >
> > > > > @@ -1013,7 +1013,7 @@ bool out_of_memory(struct oom_control *oc)
> > > > >  	 * make sure exclude 0 mask - all other users should have at least
> > > > >  	 * ___GFP_DIRECT_RECLAIM to get here.
> > > > >  	 */
> > > > > -	if (oc->gfp_mask && !(oc->gfp_mask & (__GFP_FS|__GFP_NOFAIL)))
> > > > > +	if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS))
> > > > >  		return true;
> > > > >
> > > > As to GFP_NOFS|__GFP_NOFAIL request, can we check gfp mask
> > > > one bit after another?
> > > >
> > > > 	if (oc->gfp_mask) {
> > > > 		if (!(oc->gfp_mask & __GFP_FS))
> > > > 			return false;
> > > >
> > > > 		/* No service for request that can handle fail result itself */
> > > > 		if (!(oc->gfp_mask & __GFP_NOFAIL))
> > > > 			return false;
> > > > 	}
> > >
> > > I really do not understand this request.
> >
> > It's a request with both NOFS and NOFAIL set, and I think we can keep it
> > from hitting the OOM killer by shuffling the current gfp checks.
> > I hope it makes at least a bit of sense for your work.
> >
> 
> I still do not understand. The whole point we are doing the late
> __GFP_FS check is explained in 3da88fb3bacf ("mm, oom: move GFP_NOFS
> check to out_of_memory"). And the reason why I am _removing_
> __GFP_NOFAIL is explained in the changelog of this patch.
> 
> > > This patch is removing the __GFP_NOFAIL part...
> >
> > Yes, and I don't insist on handling NOFAIL requests inside the OOM code.
> >
> > > Besides that why should they return false?
> >
> > It's feedback to the page allocator that no kill was issued and that
> > extra attention is needed.
> 
> Be careful, the semantic of out_of_memory is different. Returning false
> means that the oom killer has been disabled and so the allocation should
> fail rather than loop for ever.
> 
By returning false, I mean that the OOM killer is making no progress,
and I prefer to give up looping if the OOM killer can't help.
Failing the request is a change to the current semantics, and I would
have to test that it isn't harmful.

thanks
Hillf


* Re: [PATCH 2/3] mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically
  2017-01-25  8:41           ` Hillf Danton
@ 2017-01-25 10:19             ` Michal Hocko
  0 siblings, 0 replies; 24+ messages in thread
From: Michal Hocko @ 2017-01-25 10:19 UTC (permalink / raw)
  To: Hillf Danton
  Cc: 'Andrew Morton', 'Johannes Weiner',
	'Tetsuo Handa', 'David Rientjes',
	'Mel Gorman', linux-mm, 'LKML'

On Wed 25-01-17 16:41:54, Hillf Danton wrote:
> On Wednesday, January 25, 2017 4:00 PM Michal Hocko wrote: 
> > On Wed 25-01-17 15:00:51, Hillf Danton wrote:
> > > On Tuesday, January 24, 2017 8:41 PM Michal Hocko wrote:
> > > > On Fri 20-01-17 16:33:36, Hillf Danton wrote:
> > > > >
> > > > > On Tuesday, December 20, 2016 9:49 PM Michal Hocko wrote:
> > > > > >
> > > > > > @@ -1013,7 +1013,7 @@ bool out_of_memory(struct oom_control *oc)
> > > > > >  	 * make sure exclude 0 mask - all other users should have at least
> > > > > >  	 * ___GFP_DIRECT_RECLAIM to get here.
> > > > > >  	 */
> > > > > > -	if (oc->gfp_mask && !(oc->gfp_mask & (__GFP_FS|__GFP_NOFAIL)))
> > > > > > +	if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS))
> > > > > >  		return true;
> > > > > >
> > > > > As to GFP_NOFS|__GFP_NOFAIL request, can we check gfp mask
> > > > > one bit after another?
> > > > >
> > > > > 	if (oc->gfp_mask) {
> > > > > 		if (!(oc->gfp_mask & __GFP_FS))
> > > > > 			return false;
> > > > >
> > > > > 		/* No service for request that can handle fail result itself */
> > > > > 		if (!(oc->gfp_mask & __GFP_NOFAIL))
> > > > > 			return false;
> > > > > 	}
> > > >
> > > > I really do not understand this request.
> > >
> > > It's a request with both NOFS and NOFAIL set, and I think we can keep it
> > > from hitting the OOM killer by shuffling the current gfp checks.
> > > I hope it makes at least a bit of sense for your work.
> > >
> > 
> > I still do not understand. The whole point we are doing the late
> > __GFP_FS check is explained in 3da88fb3bacf ("mm, oom: move GFP_NOFS
> > check to out_of_memory"). And the reason why I am _removing_
> > __GFP_NOFAIL is explained in the changelog of this patch.
> > 
> > > > This patch is removing the __GFP_NOFAIL part...
> > >
> > > Yes, and I don't insist on handling NOFAIL requests inside the OOM code.
> > >
> > > > Besides that why should they return false?
> > >
> > > It's feedback to the page allocator that no kill was issued and that
> > > extra attention is needed.
> > 
> > Be careful, the semantic of out_of_memory is different. Returning false
> > means that the oom killer has been disabled and so the allocation should
> > fail rather than loop for ever.
> > 
> By returning false, I mean that the OOM killer is making no progress,
> and I prefer to give up looping if the OOM killer can't help.
> Failing the request is a change to the current semantics, and I would
> have to test that it isn't harmful.

And it is really off-topic for this particular patch, which is what
confused me. And no, this wouldn't fly. I tried that 2 years ago
and failed because the risk of unexpected ENOMEM is just too high.

-- 
Michal Hocko
SUSE Labs


Thread overview: 24+ messages
2016-12-20 13:49 [PATCH 0/3 -v3] GFP_NOFAIL cleanups Michal Hocko
2016-12-20 13:49 ` [PATCH 1/3] mm: consolidate GFP_NOFAIL checks in the allocator slowpath Michal Hocko
2016-12-20 13:49 ` [PATCH 2/3] mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically Michal Hocko
2016-12-20 15:31   ` Tetsuo Handa
2016-12-21  8:15     ` Michal Hocko
2017-01-19 18:41   ` Johannes Weiner
2017-01-20  8:33   ` Hillf Danton
2017-01-24 12:40     ` Michal Hocko
2017-01-25  7:00       ` Hillf Danton
2017-01-25  7:59         ` Michal Hocko
2017-01-25  8:41           ` Hillf Danton
2017-01-25 10:19             ` Michal Hocko
2016-12-20 13:49 ` [PATCH 3/3] mm: help __GFP_NOFAIL allocations which do not trigger OOM killer Michal Hocko
2017-01-02 15:49 ` [PATCH 0/3 -v3] GFP_NOFAIL cleanups Michal Hocko
2017-01-03  1:36   ` Tetsuo Handa
2017-01-03  8:42     ` Michal Hocko
2017-01-03 14:38       ` Tetsuo Handa
2017-01-03 16:25         ` Vlastimil Babka
2017-01-03 20:40         ` Michal Hocko
2017-01-04 14:22           ` Tetsuo Handa
2017-01-04 15:20             ` Michal Hocko
2017-01-05 10:50               ` Tetsuo Handa
2017-01-05 11:54                 ` Michal Hocko
2017-01-18 18:42 ` Michal Hocko
