linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/2] GFP_NOFAIL cleanups
@ 2016-12-01 15:25 Michal Hocko
  2016-12-01 15:25 ` [PATCH 1/2] mm: consolidate GFP_NOFAIL checks in the allocator slowpath Michal Hocko
  2016-12-01 15:25 ` [PATCH 2/2] mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically Michal Hocko
  0 siblings, 2 replies; 23+ messages in thread
From: Michal Hocko @ 2016-12-01 15:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Tetsuo Handa, Johannes Weiner, Mel Gorman,
	David Rientjes, linux-mm, LKML

Hi,
I have posted this as an RFC previously [1]. As there was no fundamental
disagreement, I would like to ask for inclusion.

Tetsuo has noticed [2] that recent changes have changed the GFP_NOFAIL
semantics for costly order requests. I believe that the primary reason
this happened is that our GFP_NOFAIL checks are too scattered and it is
really easy to forget to add one. That is why I am proposing patch 1,
which consolidates all the nofail handling in a single place. This
should make the code easier to maintain.
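The consolidation patch 1 aims for can be illustrated with a toy model of the
allocator slowpath. This is only a sketch: the flag values, the attempt_alloc()
stand-in, and the reduced retry logic are mine, not the kernel's; only the
shape (all failure paths funneling through one nopage label that re-checks
__GFP_NOFAIL) mirrors the patch.

```c
#include <stddef.h>

/* Illustrative bit values; the kernel's gfp.h layout differs. */
#define __GFP_DIRECT_RECLAIM	0x1u
#define __GFP_NOFAIL		0x2u

/* Stand-in for get_page_from_freelist(): fails a few times, then
 * succeeds, so retry loops in this model always terminate. */
static int attempts_left = 3;

static void *attempt_alloc(void)
{
	return (--attempts_left <= 0) ? (void *)1 : NULL;
}

/* All failure paths funnel through the single "nopage" label, which is
 * the one place that re-checks __GFP_NOFAIL before giving up. */
static void *slowpath(unsigned int gfp_mask)
{
	void *page;
retry:
	page = attempt_alloc();
	if (page)
		return page;

	/* callers that cannot sleep go straight to the failure point */
	if (!(gfp_mask & __GFP_DIRECT_RECLAIM))
		goto nopage;

	/* reclaim, compaction and the retry heuristics are elided */
	goto retry;

nopage:
	if (gfp_mask & __GFP_NOFAIL)
		goto retry;	/* a non-failing request never leaks NULL */
	return NULL;
}
```

With this shape, forgetting a __GFP_NOFAIL check on one of the bail-out paths
becomes impossible by construction: every path that would fail must pass
through nopage first.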

Patch 2 on top is a further attempt to make the GFP_NOFAIL semantics
less surprising. As things stand currently, __GFP_NOFAIL overrides the
OOM killer prevention code, which is both subtle and not really needed.
Patch 2 has more details about the issues this might cause.

I would consider both patches more a cleanup than anything else. Any
feedback is highly appreciated.

[1] http://lkml.kernel.org/r/20161123064925.9716-1-mhocko@kernel.org
[2] http://lkml.kernel.org/r/1479387004-5998-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 1/2] mm: consolidate GFP_NOFAIL checks in the allocator slowpath
  2016-12-01 15:25 [PATCH 0/2] GFP_NOFAIL cleanups Michal Hocko
@ 2016-12-01 15:25 ` Michal Hocko
  2016-12-01 15:25 ` [PATCH 2/2] mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically Michal Hocko
  1 sibling, 0 replies; 23+ messages in thread
From: Michal Hocko @ 2016-12-01 15:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Tetsuo Handa, Johannes Weiner, Mel Gorman,
	David Rientjes, linux-mm, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Tetsuo Handa has pointed out that 0a0337e0d1d1 ("mm, oom: rework oom
detection") has subtly changed the semantics for costly high order
requests with __GFP_NOFAIL and without __GFP_REPEAT; those can fail
right now. My code inspection didn't reveal any such users in the tree,
but it is true that this might lead to unexpected allocation failures
and subsequent OOPs.

The handling of __GFP_NOFAIL in __alloc_pages_slowpath is currently hard
to follow. There are a few special cases, but we are lacking a catch-all
place to be sure we will not miss any case where a non-failing
allocation might fail. This patch reorganizes the code a bit and puts
all those special cases under the nopage label, which is the generic
go-to-fail path. Non-failing allocations are retried there, while those
that cannot retry, like non-sleeping allocations, go to the failure
point directly. This should make the code flow much easier to follow and
less error prone for future changes.

While we are at it, we have to move the stall check up to catch
potentially looping non-failing allocations.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/page_alloc.c | 68 ++++++++++++++++++++++++++++++++++-----------------------
 1 file changed, 41 insertions(+), 27 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0fbfead6aa7d..76c0b6bb0baf 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3627,32 +3627,23 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		goto got_pg;
 
 	/* Caller is not willing to reclaim, we can't balance anything */
-	if (!can_direct_reclaim) {
-		/*
-		 * All existing users of the __GFP_NOFAIL are blockable, so warn
-		 * of any new users that actually allow this type of allocation
-		 * to fail.
-		 */
-		WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL);
+	if (!can_direct_reclaim)
 		goto nopage;
+
+	/* Make sure we know about allocations which stall for too long */
+	if (time_after(jiffies, alloc_start + stall_timeout)) {
+		warn_alloc(gfp_mask,
+			"page alloction stalls for %ums, order:%u",
+			jiffies_to_msecs(jiffies-alloc_start), order);
+		stall_timeout += 10 * HZ;
 	}
 
 	/* Avoid recursion of direct reclaim */
-	if (current->flags & PF_MEMALLOC) {
-		/*
-		 * __GFP_NOFAIL request from this context is rather bizarre
-		 * because we cannot reclaim anything and only can loop waiting
-		 * for somebody to do a work for us.
-		 */
-		if (WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) {
-			cond_resched();
-			goto retry;
-		}
+	if (current->flags & PF_MEMALLOC)
 		goto nopage;
-	}
 
 	/* Avoid allocations with no watermarks from looping endlessly */
-	if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOFAIL))
+	if (test_thread_flag(TIF_MEMDIE))
 		goto nopage;
 
 
@@ -3679,14 +3670,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT))
 		goto nopage;
 
-	/* Make sure we know about allocations which stall for too long */
-	if (time_after(jiffies, alloc_start + stall_timeout)) {
-		warn_alloc(gfp_mask,
-			"page alloction stalls for %ums, order:%u",
-			jiffies_to_msecs(jiffies-alloc_start), order);
-		stall_timeout += 10 * HZ;
-	}
-
 	if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
 				 did_some_progress > 0, &no_progress_loops))
 		goto retry;
@@ -3715,6 +3698,37 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	}
 
 nopage:
+	/*
+	 * Make sure that __GFP_NOFAIL request doesn't leak out and make sure
+	 * we always retry
+	 */
+	if (gfp_mask & __GFP_NOFAIL) {
+		/*
+		 * All existing users of the __GFP_NOFAIL are blockable, so warn
+		 * of any new users that actually require GFP_NOWAIT
+		 */
+		if (WARN_ON_ONCE(!can_direct_reclaim))
+			goto fail;
+
+		/*
+		 * PF_MEMALLOC request from this context is rather bizarre
+		 * because we cannot reclaim anything and only can loop waiting
+		 * for somebody to do a work for us
+		 */
+		WARN_ON_ONCE(current->flags & PF_MEMALLOC);
+
+		/*
+		 * non failing costly orders are a hard requirement which we
+		 * are not prepared for much so let's warn about these users
+		 * so that we can identify them and convert them to something
+		 * else.
+		 */
+		WARN_ON_ONCE(order > PAGE_ALLOC_COSTLY_ORDER);
+
+		cond_resched();
+		goto retry;
+	}
+fail:
 	warn_alloc(gfp_mask,
 			"page allocation failure: order:%u", order);
 got_pg:
-- 
2.10.2


* [PATCH 2/2] mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically
  2016-12-01 15:25 [PATCH 0/2] GFP_NOFAIL cleanups Michal Hocko
  2016-12-01 15:25 ` [PATCH 1/2] mm: consolidate GFP_NOFAIL checks in the allocator slowpath Michal Hocko
@ 2016-12-01 15:25 ` Michal Hocko
  2016-12-02  7:23   ` Vlastimil Babka
  2016-12-05 13:45   ` Tetsuo Handa
  1 sibling, 2 replies; 23+ messages in thread
From: Michal Hocko @ 2016-12-01 15:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Tetsuo Handa, Johannes Weiner, Mel Gorman,
	David Rientjes, linux-mm, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

__alloc_pages_may_oom makes sure to skip the OOM killer depending on
the allocation request. This includes lowmem requests, costly high
order requests and others. For a long time __GFP_NOFAIL acted as an
override for all those rules. This is not documented and it can be quite
surprising as well. E.g. GFP_NOFS requests do not invoke the OOM killer
but GFP_NOFS|__GFP_NOFAIL does, so if we try to convert some of the
existing open coded loops around the allocator to nofail requests (and
we have done that in the past), such a change would have a non-trivial
side effect which is not obvious. Note that the primary motivation for
skipping the OOM killer is to prevent premature invocation.

The exception was added by 82553a937f12 ("oom: invoke oom killer
for __GFP_NOFAIL"). The changelog points out that the OOM killer has to
be invoked, otherwise the request would loop forever. But this argument
is rather weak, because the OOM killer doesn't really guarantee any
forward progress for those exceptional cases. E.g. it will hardly help
to form a costly order page; I believe we certainly do not want to kill
all processes and eventually panic the system just because there is a
nasty driver asking for an order-9 page with __GFP_NOFAIL without
realizing all the consequences. It is much better for this request to
loop forever than to cause massive system disruption. Lowmem is also
highly unlikely to be freed by the OOM killer, and a GFP_NOFS request
could trigger while there is still a lot of memory pinned by
filesystems.

This patch simply removes the __GFP_NOFAIL special case in order to have
clearer semantics without surprising side effects. Instead we allow
nofail requests to access memory reserves to move forward, both when the
OOM killer is invoked and when it is suppressed. The
__alloc_pages_nowmark helper has been introduced for that purpose.
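The fallback order of the new helper can be sketched as a toy userspace model.
The name and the two-step structure mirror __alloc_pages_nowmark from the diff
below, but the flag values and the cpuset_nodes_depleted stand-in for freelist
state are invented for illustration.

```c
#include <stddef.h>
#include <stdbool.h>

/* Illustrative flag values; not the kernel's. */
#define ALLOC_NO_WATERMARKS	0x1u
#define ALLOC_CPUSET		0x2u

/* Stand-in for freelist state: when true, the nodes allowed by the
 * caller's cpuset have no free pages left. */
static bool cpuset_nodes_depleted;

static void *get_page(unsigned int alloc_flags)
{
	if ((alloc_flags & ALLOC_CPUSET) && cpuset_nodes_depleted)
		return NULL;
	return (void *)1;
}

/* Mirrors __alloc_pages_nowmark(): try within the cpuset first, and
 * only ignore the cpuset restriction when those nodes are depleted. */
static void *alloc_nowmark(void)
{
	void *page = get_page(ALLOC_NO_WATERMARKS | ALLOC_CPUSET);

	if (!page)
		page = get_page(ALLOC_NO_WATERMARKS);
	return page;
}
```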

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c   |  2 +-
 mm/page_alloc.c | 95 +++++++++++++++++++++++++++++++++++----------------------
 2 files changed, 59 insertions(+), 38 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index ec9f11d4f094..12a6fce85f61 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1013,7 +1013,7 @@ bool out_of_memory(struct oom_control *oc)
 	 * make sure exclude 0 mask - all other users should have at least
 	 * ___GFP_DIRECT_RECLAIM to get here.
 	 */
-	if (oc->gfp_mask && !(oc->gfp_mask & (__GFP_FS|__GFP_NOFAIL)))
+	if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS))
 		return true;
 
 	/*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 76c0b6bb0baf..7102641147c4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3044,6 +3044,25 @@ void warn_alloc(gfp_t gfp_mask, const char *fmt, ...)
 }
 
 static inline struct page *
+__alloc_pages_nowmark(gfp_t gfp_mask, unsigned int order,
+						const struct alloc_context *ac)
+{
+	struct page *page;
+
+	page = get_page_from_freelist(gfp_mask, order,
+			ALLOC_NO_WATERMARKS|ALLOC_CPUSET, ac);
+	/*
+	 * fallback to ignore cpuset restriction if our nodes
+	 * are depleted
+	 */
+	if (!page)
+		page = get_page_from_freelist(gfp_mask, order,
+				ALLOC_NO_WATERMARKS, ac);
+
+	return page;
+}
+
+static inline struct page *
 __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	const struct alloc_context *ac, unsigned long *did_some_progress)
 {
@@ -3078,47 +3097,41 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	if (page)
 		goto out;
 
-	if (!(gfp_mask & __GFP_NOFAIL)) {
-		/* Coredumps can quickly deplete all memory reserves */
-		if (current->flags & PF_DUMPCORE)
-			goto out;
-		/* The OOM killer will not help higher order allocs */
-		if (order > PAGE_ALLOC_COSTLY_ORDER)
-			goto out;
-		/* The OOM killer does not needlessly kill tasks for lowmem */
-		if (ac->high_zoneidx < ZONE_NORMAL)
-			goto out;
-		if (pm_suspended_storage())
-			goto out;
-		/*
-		 * XXX: GFP_NOFS allocations should rather fail than rely on
-		 * other request to make a forward progress.
-		 * We are in an unfortunate situation where out_of_memory cannot
-		 * do much for this context but let's try it to at least get
-		 * access to memory reserved if the current task is killed (see
-		 * out_of_memory). Once filesystems are ready to handle allocation
-		 * failures more gracefully we should just bail out here.
-		 */
+	/* Coredumps can quickly deplete all memory reserves */
+	if (current->flags & PF_DUMPCORE)
+		goto out;
+	/* The OOM killer will not help higher order allocs */
+	if (order > PAGE_ALLOC_COSTLY_ORDER)
+		goto out;
+	/* The OOM killer does not needlessly kill tasks for lowmem */
+	if (ac->high_zoneidx < ZONE_NORMAL)
+		goto out;
+	if (pm_suspended_storage())
+		goto out;
+	/*
+	 * XXX: GFP_NOFS allocations should rather fail than rely on
+	 * other request to make a forward progress.
+	 * We are in an unfortunate situation where out_of_memory cannot
+	 * do much for this context but let's try it to at least get
+	 * access to memory reserved if the current task is killed (see
+	 * out_of_memory). Once filesystems are ready to handle allocation
+	 * failures more gracefully we should just bail out here.
+	 */
+
+	/* The OOM killer may not free memory on a specific node */
+	if (gfp_mask & __GFP_THISNODE)
+		goto out;
 
-		/* The OOM killer may not free memory on a specific node */
-		if (gfp_mask & __GFP_THISNODE)
-			goto out;
-	}
 	/* Exhausted what can be done so it's blamo time */
-	if (out_of_memory(&oc) || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) {
+	if (out_of_memory(&oc)) {
 		*did_some_progress = 1;
 
-		if (gfp_mask & __GFP_NOFAIL) {
-			page = get_page_from_freelist(gfp_mask, order,
-					ALLOC_NO_WATERMARKS|ALLOC_CPUSET, ac);
-			/*
-			 * fallback to ignore cpuset restriction if our nodes
-			 * are depleted
-			 */
-			if (!page)
-				page = get_page_from_freelist(gfp_mask, order,
-					ALLOC_NO_WATERMARKS, ac);
-		}
+		/*
+		 * Help non-failing allocations by giving them access to memory
+		 * reserves
+		 */
+		if (gfp_mask & __GFP_NOFAIL)
+			page = __alloc_pages_nowmark(gfp_mask, order, ac);
 	}
 out:
 	mutex_unlock(&oom_lock);
@@ -3725,6 +3738,14 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		 */
 		WARN_ON_ONCE(order > PAGE_ALLOC_COSTLY_ORDER);
 
+		/*
+		 * Help non-failing allocations by giving them access to memory
+		 * reserves
+		 */
+		page = __alloc_pages_nowmark(gfp_mask, order, ac);
+		if (page)
+			goto got_pg;
+
 		cond_resched();
 		goto retry;
 	}
-- 
2.10.2


* Re: [PATCH 2/2] mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically
  2016-12-01 15:25 ` [PATCH 2/2] mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically Michal Hocko
@ 2016-12-02  7:23   ` Vlastimil Babka
  2016-12-05 13:45   ` Tetsuo Handa
  1 sibling, 0 replies; 23+ messages in thread
From: Vlastimil Babka @ 2016-12-02  7:23 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton
  Cc: Tetsuo Handa, Johannes Weiner, Mel Gorman, David Rientjes,
	linux-mm, LKML, Michal Hocko

On 12/01/2016 04:25 PM, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
>
> __alloc_pages_may_oom makes sure to skip the OOM killer depending on
> the allocation request. This includes lowmem requests, costly high
> order requests and others. For a long time __GFP_NOFAIL acted as an
> override for all those rules. This is not documented and it can be quite
> surprising as well. E.g. GFP_NOFS requests are not invoking the OOM
> killer but GFP_NOFS|__GFP_NOFAIL does so if we try to convert some of
> the existing open coded loops around allocator to nofail request (and we
> have done that in the past) then such a change would have a non trivial
> side effect which is not obvious. Note that the primary motivation for
> skipping the OOM killer is to prevent from pre-mature invocation.
>
> The exception has been added by 82553a937f12 ("oom: invoke oom killer
> for __GFP_NOFAIL"). The changelog points out that the oom killer has to
> be invoked otherwise the request would be looping for ever. But this
> argument is rather weak because the OOM killer doesn't really guarantee
> any forward progress for those exceptional cases - e.g. it will hardly
> help to form costly order - I believe we certainly do not want to kill
> all processes and eventually panic the system just because there is a
> nasty driver asking for order-9 page with GFP_NOFAIL not realizing all
> the consequences - it is much better this request would loop for ever
> than the massive system disruption, lowmem is also highly unlikely to be
> freed during OOM killer and GFP_NOFS request could trigger while there
> is still a lot of memory pinned by filesystems.
>
> This patch simply removes the __GFP_NOFAIL special case in order to have
> a more clear semantic without surprising side effects. Instead we do
> allow nofail requests to access memory reserves to move forward in both
> cases when the OOM killer is invoked and when it should be supressed.
> __alloc_pages_nowmark helper has been introduced for that purpose.
>
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>


* Re: [PATCH 2/2] mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically
  2016-12-01 15:25 ` [PATCH 2/2] mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically Michal Hocko
  2016-12-02  7:23   ` Vlastimil Babka
@ 2016-12-05 13:45   ` Tetsuo Handa
  2016-12-05 14:10     ` Michal Hocko
  1 sibling, 1 reply; 23+ messages in thread
From: Tetsuo Handa @ 2016-12-05 13:45 UTC (permalink / raw)
  To: mhocko, akpm
  Cc: vbabka, hannes, mgorman, rientjes, linux-mm, linux-kernel, mhocko

Michal Hocko wrote:
> __alloc_pages_may_oom makes sure to skip the OOM killer depending on
> the allocation request. This includes lowmem requests, costly high
> order requests and others. For a long time __GFP_NOFAIL acted as an
> override for all those rules. This is not documented and it can be quite
> surprising as well. E.g. GFP_NOFS requests are not invoking the OOM
> killer but GFP_NOFS|__GFP_NOFAIL does so if we try to convert some of
> the existing open coded loops around allocator to nofail request (and we
> have done that in the past) then such a change would have a non trivial
> side effect which is not obvious. Note that the primary motivation for
> skipping the OOM killer is to prevent from pre-mature invocation.
> 
> The exception has been added by 82553a937f12 ("oom: invoke oom killer
> for __GFP_NOFAIL"). The changelog points out that the oom killer has to
> be invoked otherwise the request would be looping for ever. But this
> argument is rather weak because the OOM killer doesn't really guarantee
> any forward progress for those exceptional cases - e.g. it will hardly
> help to form costly order - I believe we certainly do not want to kill
> all processes and eventually panic the system just because there is a
> nasty driver asking for order-9 page with GFP_NOFAIL not realizing all
> the consequences - it is much better this request would loop for ever
> than the massive system disruption, lowmem is also highly unlikely to be
> freed during OOM killer and GFP_NOFS request could trigger while there
> is still a lot of memory pinned by filesystems.

I disagree. I believe that a panic caused by the OOM killer is much,
much better than a locked up system. I hate to add new locations that
can lock up inside the page allocator. This is __GFP_NOFAIL and reclaim
has failed. Requiring the administrator to go in front of the console
and press SysRq-f until the locked up situation gets resolved is silly.

If there is a nasty driver asking for order-9 pages with __GFP_NOFAIL,
fix that driver.

> 
> This patch simply removes the __GFP_NOFAIL special case in order to have
> a more clear semantic without surprising side effects. Instead we do
> allow nofail requests to access memory reserves to move forward in both
> cases when the OOM killer is invoked and when it should be supressed.
> __alloc_pages_nowmark helper has been introduced for that purpose.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>


* Re: [PATCH 2/2] mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically
  2016-12-05 13:45   ` Tetsuo Handa
@ 2016-12-05 14:10     ` Michal Hocko
  2016-12-06  8:27       ` Michal Hocko
  2016-12-06 10:38       ` Tetsuo Handa
  0 siblings, 2 replies; 23+ messages in thread
From: Michal Hocko @ 2016-12-05 14:10 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, vbabka, hannes, mgorman, rientjes, linux-mm, linux-kernel

On Mon 05-12-16 22:45:19, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > __alloc_pages_may_oom makes sure to skip the OOM killer depending on
> > the allocation request. This includes lowmem requests, costly high
> > order requests and others. For a long time __GFP_NOFAIL acted as an
> > override for all those rules. This is not documented and it can be quite
> > surprising as well. E.g. GFP_NOFS requests are not invoking the OOM
> > killer but GFP_NOFS|__GFP_NOFAIL does so if we try to convert some of
> > the existing open coded loops around allocator to nofail request (and we
> > have done that in the past) then such a change would have a non trivial
> > side effect which is not obvious. Note that the primary motivation for
> > skipping the OOM killer is to prevent from pre-mature invocation.
> > 
> > The exception has been added by 82553a937f12 ("oom: invoke oom killer
> > for __GFP_NOFAIL"). The changelog points out that the oom killer has to
> > be invoked otherwise the request would be looping for ever. But this
> > argument is rather weak because the OOM killer doesn't really guarantee
> > any forward progress for those exceptional cases - e.g. it will hardly
> > help to form costly order - I believe we certainly do not want to kill
> > all processes and eventually panic the system just because there is a
> > nasty driver asking for order-9 page with GFP_NOFAIL not realizing all
> > the consequences - it is much better this request would loop for ever
> > than the massive system disruption, lowmem is also highly unlikely to be
> > freed during OOM killer and GFP_NOFS request could trigger while there
> > is still a lot of memory pinned by filesystems.
> 
> I disagree. I believe that panic caused by OOM killer is much much better
> than a locked up system. I hate to add new locations that can lockup inside
> page allocator. This is __GFP_NOFAIL and reclaim has failed.

As a matter of fact, any __GFP_NOFAIL allocation can lock up inside the
allocator. Full stop. There is no guaranteed way to make forward
progress with the current page allocator implementation.

So we are somewhere in the middle between premature and pointless
system disruption (GFP_NOFS with a lot of metadata, or a lowmem request)
where the OOM killer might not even help, and a potential lockup which
is inevitable with the current design. Dunno about you but I would
rather go with the first option. To be honest I really fail to
understand your line of argumentation. We have this
	do {
		cond_resched();
	} (page = alloc_page(GFP_NOFS));
vs.
	page = alloc_page(GFP_NOFS | __GFP_NOFAIL);

the first one doesn't invoke OOM killer while the later does. This
discrepancy just cannot make any sense... The same is true for

	alloc_page(GFP_DMA) vs alloc_page(GFP_DMA|__GFP_NOFAIL)

Now we can discuss whether it is a _good_ idea to not invoke OOM killer
for those exceptions but whatever we do __GFP_NOFAIL is not a way to
give such a subtle side effect. Or do you disagree even with that?
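The discrepancy being argued can be reduced to out_of_memory()'s bail-out
check before patch 2. The model below uses made-up flag values; GFP_NOFS here
is simply "a nonzero mask without __GFP_FS", which is all that matters for
this check.

```c
#include <stdbool.h>

/* Illustrative bits; not the kernel's gfp.h layout. */
#define __GFP_FS		0x1u
#define __GFP_NOFAIL		0x2u
#define __GFP_DIRECT_RECLAIM	0x4u
#define GFP_NOFS		__GFP_DIRECT_RECLAIM

/* Before patch 2: any nonzero mask carrying neither __GFP_FS nor
 * __GFP_NOFAIL skips the actual killing (a zero mask, i.e. a
 * non-allocation OOM such as sysrq, is always allowed through). */
static bool invokes_oom_killer(unsigned int gfp_mask)
{
	if (gfp_mask && !(gfp_mask & (__GFP_FS | __GFP_NOFAIL)))
		return false;
	return true;
}
```

This is exactly the inconsistency in the argument above: GFP_NOFS alone skips
the killer, while GFP_NOFS|__GFP_NOFAIL invokes it.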
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 2/2] mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically
  2016-12-05 14:10     ` Michal Hocko
@ 2016-12-06  8:27       ` Michal Hocko
  2016-12-06 10:38       ` Tetsuo Handa
  1 sibling, 0 replies; 23+ messages in thread
From: Michal Hocko @ 2016-12-06  8:27 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, vbabka, hannes, mgorman, rientjes, linux-mm, linux-kernel

On Mon 05-12-16 15:10:09, Michal Hocko wrote:
[...]
> So we are somewhere in the middle between pre-mature and pointless
> system disruption (GFP_NOFS with a lots of metadata or lowmem request)
> where the OOM killer even might not help and potential lockup which is
> inevitable with the current design. Dunno about you but I would rather
> go with the first option. To be honest I really fail to understand your
> line of argumentation. We have this
> 	do {
> 		cond_resched();
> 	} (page = alloc_page(GFP_NOFS));

This should have been while (!(page = alloc_page(GFP_NOFS))) of
course...

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 2/2] mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically
  2016-12-05 14:10     ` Michal Hocko
  2016-12-06  8:27       ` Michal Hocko
@ 2016-12-06 10:38       ` Tetsuo Handa
  2016-12-06 11:03         ` Vlastimil Babka
  2016-12-06 19:22         ` Michal Hocko
  1 sibling, 2 replies; 23+ messages in thread
From: Tetsuo Handa @ 2016-12-06 10:38 UTC (permalink / raw)
  To: mhocko; +Cc: akpm, vbabka, hannes, mgorman, rientjes, linux-mm, linux-kernel

Michal Hocko wrote:
> On Mon 05-12-16 22:45:19, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > __alloc_pages_may_oom makes sure to skip the OOM killer depending on
> > > the allocation request. This includes lowmem requests, costly high
> > > order requests and others. For a long time __GFP_NOFAIL acted as an
> > > override for all those rules. This is not documented and it can be quite
> > > surprising as well. E.g. GFP_NOFS requests are not invoking the OOM
> > > killer but GFP_NOFS|__GFP_NOFAIL does so if we try to convert some of
> > > the existing open coded loops around allocator to nofail request (and we
> > > have done that in the past) then such a change would have a non trivial
> > > side effect which is not obvious. Note that the primary motivation for
> > > skipping the OOM killer is to prevent from pre-mature invocation.
> > > 
> > > The exception has been added by 82553a937f12 ("oom: invoke oom killer
> > > for __GFP_NOFAIL"). The changelog points out that the oom killer has to
> > > be invoked otherwise the request would be looping for ever. But this
> > > argument is rather weak because the OOM killer doesn't really guarantee
> > > any forward progress for those exceptional cases - e.g. it will hardly
> > > help to form costly order - I believe we certainly do not want to kill
> > > all processes and eventually panic the system just because there is a
> > > nasty driver asking for order-9 page with GFP_NOFAIL not realizing all
> > > the consequences - it is much better this request would loop for ever
> > > than the massive system disruption, lowmem is also highly unlikely to be
> > > freed during OOM killer and GFP_NOFS request could trigger while there
> > > is still a lot of memory pinned by filesystems.
> > 
> > I disagree. I believe that panic caused by OOM killer is much much better
> > than a locked up system. I hate to add new locations that can lockup inside
> > page allocator. This is __GFP_NOFAIL and reclaim has failed.
> 
> As a matter of fact any __GFP_NOFAIL can lockup inside the allocator.

You are trying to increase the number of possible lockup locations by
changing the default behavior of __GFP_NOFAIL.

> Full stop. There is no guaranteed way to make a forward progress with
> the current page allocator implementation.

Then, will you accept kmallocwd until the page allocator implementation
can provide a guaranteed way to make forward progress?

> 
> So we are somewhere in the middle between pre-mature and pointless
> system disruption (GFP_NOFS with a lots of metadata or lowmem request)
> where the OOM killer even might not help and potential lockup which is
> inevitable with the current design. Dunno about you but I would rather
> go with the first option. To be honest I really fail to understand your
> line of argumentation. We have this
> 	do {
> 		cond_resched();
> 	} while (!(page = alloc_page(GFP_NOFS)));
> vs.
> 	page = alloc_page(GFP_NOFS | __GFP_NOFAIL);
> 
> the first one doesn't invoke OOM killer while the later does. This
> discrepancy just cannot make any sense... The same is true for
> 
> 	alloc_page(GFP_DMA) vs alloc_page(GFP_DMA|__GFP_NOFAIL)
> 
> Now we can discuss whether it is a _good_ idea to not invoke OOM killer
> for those exceptions but whatever we do __GFP_NOFAIL is not a way to
> give such a subtle side effect. Or do you disagree even with that?

"[PATCH 1/2] mm: consolidate GFP_NOFAIL checks in the allocator slowpath"
silently changes __GFP_NOFAIL vs. __GFP_NORETRY priority.

Currently, __GFP_NORETRY is stronger than __GFP_NOFAIL; __GFP_NOFAIL
allocation requests fail without invoking the OOM killer when both
__GFP_NORETRY and __GFP_NOFAIL are given.

With [PATCH 1/2], __GFP_NOFAIL becomes stronger than __GFP_NORETRY;
__GFP_NOFAIL allocation requests will loop forever without invoking
the OOM killer when both __GFP_NORETRY and __GFP_NOFAIL are given.
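The priority inversion described above can be sketched as a pair of
predicates. This is a toy reduction of the slowpath's control flow with
made-up flag values, not the kernel's code.

```c
#include <stdbool.h>

/* Illustrative bits only. */
#define __GFP_NORETRY	0x1u
#define __GFP_NOFAIL	0x2u

/* Before patch 1: the __GFP_NORETRY bail-out fires before any
 * __GFP_NOFAIL handling, so a request carrying both flags can fail. */
static bool can_fail_before(unsigned int gfp)
{
	if (gfp & __GFP_NORETRY)
		return true;
	return !(gfp & __GFP_NOFAIL);
}

/* After patch 1: the NORETRY bail-out jumps to the nopage label, where
 * __GFP_NOFAIL forces a retry, so the same request loops instead. */
static bool can_fail_after(unsigned int gfp)
{
	return !(gfp & __GFP_NOFAIL);
}
```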

Those callers which prefer lockup over panic can specify both
__GFP_NORETRY and __GFP_NOFAIL. You are trying to change behavior of
__GFP_NOFAIL without asking whether existing __GFP_NOFAIL users
want to invoke the OOM killer.

And the story is not specific to existing __GFP_NOFAIL users;
it applies to existing GFP_NOFS users as well.

Quoting from http://lkml.kernel.org/r/20161125131806.GB24353@dhcp22.suse.cz :
> > Will you look at http://marc.info/?t=120716967100004&r=1&w=2 which lead to
> > commit a02fe13297af26c1 ("selinux: prevent rentry into the FS") and commit
> > 869ab5147e1eead8 ("SELinux: more GFP_NOFS fixups to prevent selinux from
> > re-entering the fs code") ? My understanding is that mkdir() system call
> > caused memory allocation for inode creation and that memory allocation
> > caused memory reclaim which had to be !__GFP_FS.
> 
> I will have a look later, thanks for the points.

What is your answer to this problem? For those who prefer panic over lockup,
please provide a means to invoke the OOM killer (e.g. __GFP_WANT_OOM_KILLER).

If you provide __GFP_WANT_OOM_KILLER, you can change

-#define GFP_KERNEL      (__GFP_RECLAIM | __GFP_IO | __GFP_FS)
+#define GFP_KERNEL      (__GFP_RECLAIM | __GFP_IO | __GFP_FS | __GFP_WANT_OOM_KILLER)

in gfp.h and add __GFP_WANT_OOM_KILLER to any existing __GFP_NOFAIL/GFP_NOFS
users who prefer panic over lockup. Then, I think I'm fine with

-	if (oc->gfp_mask && !(oc->gfp_mask & (__GFP_FS|__GFP_NOFAIL)))
-		return true;
+	if (oc->gfp_mask && !(oc->gfp_mask & __GFP_WANT_OOM_KILLER))
+		return false;

in out_of_memory().
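The proposed opt-in check can be sketched as follows. __GFP_WANT_OOM_KILLER
is the hypothetical flag from this proposal and does not exist in the kernel;
the bit values are likewise invented. The zero-mask special case keeps
non-allocation OOM paths (e.g. sysrq-triggered OOM) working, as in the
existing out_of_memory() check.

```c
#include <stdbool.h>

/* Illustrative bits; __GFP_WANT_OOM_KILLER is a proposed flag only. */
#define __GFP_FS		0x1u
#define __GFP_WANT_OOM_KILLER	0x2u

/* Only requests that explicitly opt in may trigger the OOM killer. */
static bool may_invoke_oom_killer(unsigned int gfp_mask)
{
	return !gfp_mask || (gfp_mask & __GFP_WANT_OOM_KILLER);
}
```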

As you know, very few people respond to discussions regarding
almost-OOM situations. Changing default behavior without asking existing
users is annoying.


* Re: [PATCH 2/2] mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically
  2016-12-06 10:38       ` Tetsuo Handa
@ 2016-12-06 11:03         ` Vlastimil Babka
  2016-12-06 19:25           ` Michal Hocko
  2016-12-06 19:22         ` Michal Hocko
  1 sibling, 1 reply; 23+ messages in thread
From: Vlastimil Babka @ 2016-12-06 11:03 UTC (permalink / raw)
  To: Tetsuo Handa, mhocko
  Cc: akpm, hannes, mgorman, rientjes, linux-mm, linux-kernel

On 12/06/2016 11:38 AM, Tetsuo Handa wrote:
>>
>> So we are somewhere in the middle between pre-mature and pointless
>> system disruption (GFP_NOFS with a lots of metadata or lowmem request)
>> where the OOM killer even might not help and potential lockup which is
>> inevitable with the current design. Dunno about you but I would rather
>> go with the first option. To be honest I really fail to understand your
>> line of argumentation. We have this
>> 	do {
>> 		cond_resched();
>> 	} while (!(page = alloc_page(GFP_NOFS)));
>> vs.
>> 	page = alloc_page(GFP_NOFS | __GFP_NOFAIL);
>>
>> the first one doesn't invoke OOM killer while the later does. This
>> discrepancy just cannot make any sense... The same is true for
>>
>> 	alloc_page(GFP_DMA) vs alloc_page(GFP_DMA|__GFP_NOFAIL)
>>
>> Now we can discuss whether it is a _good_ idea to not invoke OOM killer
>> for those exceptions but whatever we do __GFP_NOFAIL is not a way to
>> give such a subtle side effect. Or do you disagree even with that?
> 
> "[PATCH 1/2] mm: consolidate GFP_NOFAIL checks in the allocator slowpath"
> silently changes __GFP_NOFAIL vs. __GFP_NORETRY priority.

I guess that wasn't intended?

> Currently, __GFP_NORETRY is stronger than __GFP_NOFAIL; __GFP_NOFAIL
> allocation requests fail without invoking the OOM killer when both
> __GFP_NORETRY and __GFP_NOFAIL are given.
> 
> With [PATCH 1/2], __GFP_NOFAIL becomes stronger than __GFP_NORETRY;
> __GFP_NOFAIL allocation requests will loop forever without invoking
> the OOM killer when both __GFP_NORETRY and __GFP_NOFAIL are given.

Does such a combination of flags make sense? Should we warn about it, or
even silently remove __GFP_NORETRY in that case?

> Those callers which prefer lockup over panic can specify both
> __GFP_NORETRY and __GFP_NOFAIL.

What lockup exactly, if __GFP_NORETRY did lead to failure?

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/2] mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically
  2016-12-06 10:38       ` Tetsuo Handa
  2016-12-06 11:03         ` Vlastimil Babka
@ 2016-12-06 19:22         ` Michal Hocko
  2016-12-08 12:53           ` Tetsuo Handa
  1 sibling, 1 reply; 23+ messages in thread
From: Michal Hocko @ 2016-12-06 19:22 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, vbabka, hannes, mgorman, rientjes, linux-mm, linux-kernel

On Tue 06-12-16 19:38:38, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Mon 05-12-16 22:45:19, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > __alloc_pages_may_oom makes sure to skip the OOM killer depending on
> > > > the allocation request. This includes lowmem requests, costly high
> > > > order requests and others. For a long time __GFP_NOFAIL acted as an
> > > > override for all those rules. This is not documented and it can be quite
> > > > surprising as well. E.g. GFP_NOFS requests do not invoke the OOM
> > > > killer but GFP_NOFS|__GFP_NOFAIL does, so if we try to convert some of
> > > > the existing open-coded loops around the allocator to nofail requests
> > > > (and we have done that in the past) then such a change would have a
> > > > non-trivial side effect which is not obvious. Note that the primary
> > > > motivation for skipping the OOM killer is to prevent premature invocation.
> > > > 
> > > > The exception has been added by 82553a937f12 ("oom: invoke oom killer
> > > > for __GFP_NOFAIL"). The changelog points out that the oom killer has to
> > > > be invoked, otherwise the request would loop forever. But this
> > > > argument is rather weak because the OOM killer doesn't really guarantee
> > > > any forward progress for those exceptional cases - e.g. it will hardly
> > > > help to form a costly order. I believe we certainly do not want to kill
> > > > all processes and eventually panic the system just because there is a
> > > > nasty driver asking for an order-9 page with GFP_NOFAIL, not realizing
> > > > all the consequences - it is much better for this request to loop
> > > > forever than to cause massive system disruption. Lowmem is also highly
> > > > unlikely to be freed by the OOM killer, and a GFP_NOFS request could
> > > > trigger while there is still a lot of memory pinned by filesystems.
> > > 
> > > I disagree. I believe that panic caused by OOM killer is much much better
> > > than a locked up system. I hate to add new locations that can lockup inside
> > > page allocator. This is __GFP_NOFAIL and reclaim has failed.
> > 
> > As a matter of fact any __GFP_NOFAIL can lockup inside the allocator.
> 
> You are trying to increase possible locations of lockups by changing
> default behavior of __GFP_NOFAIL.

I disagree. I have tried to explain that it is much more important to
have fewer silent side effects than to optimize for the very worst case. I
simply do not see __GFP_NOFAIL lockups so common that we should care, or
tweak their semantics in a weird way. It seems you prefer to optimize for
the absolute worst case, and even for that case you cannot offer anything
better than OOM killing random processes until the system somehow
resurrects or panics. I consider this a very bad design. So let's agree
to disagree here.

> > Full stop. There is no guaranteed way to make a forward progress with
> > the current page allocator implementation.
> 
> Then, will you accept kmallocwd until page allocator implementation
> can provide a guaranteed way to make a forward progress?

No, I find your kmallocwd too complex for the advantage it provides.
 
> > So we are somewhere in the middle between pre-mature and pointless
> > system disruption (GFP_NOFS with a lots of metadata or lowmem request)
> > where the OOM killer even might not help and potential lockup which is
> > inevitable with the current design. Dunno about you but I would rather
> > go with the first option. To be honest I really fail to understand your
> > line of argumentation. We have this
> > 	do {
> > 		cond_resched();
> > 	} while (!(page = alloc_page(GFP_NOFS)));
> > vs.
> > 	page = alloc_page(GFP_NOFS | __GFP_NOFAIL);
> > 
> > the first one doesn't invoke OOM killer while the later does. This
> > discrepancy just cannot make any sense... The same is true for
> > 
> > 	alloc_page(GFP_DMA) vs alloc_page(GFP_DMA|__GFP_NOFAIL)
> > 
> > Now we can discuss whether it is a _good_ idea to not invoke OOM killer
> > for those exceptions but whatever we do __GFP_NOFAIL is not a way to
> > give such a subtle side effect. Or do you disagree even with that?
> 
> "[PATCH 1/2] mm: consolidate GFP_NOFAIL checks in the allocator slowpath"
> silently changes __GFP_NOFAIL vs. __GFP_NORETRY priority.
> 
> Currently, __GFP_NORETRY is stronger than __GFP_NOFAIL; __GFP_NOFAIL
> allocation requests fail without invoking the OOM killer when both
> __GFP_NORETRY and __GFP_NOFAIL are given.

Sigh... __GFP_NORETRY | __GFP_NOFAIL _doesn't_ make _any_ sense
whatsoever.

> With [PATCH 1/2], __GFP_NOFAIL becomes stronger than __GFP_NORETRY;
> __GFP_NOFAIL allocation requests will loop forever without invoking
> the OOM killer when both __GFP_NORETRY and __GFP_NOFAIL are given.

So what? Strictly speaking __GFP_NOFAIL should always be stronger, but I
really fail to see why we should even consider __GFP_NORETRY in that
context. I definitely do not want to complicate the page fault path for
a nonsense combination of flags.

> Those callers which prefer lockup over panic can specify both
> __GFP_NORETRY and __GFP_NOFAIL.

No! This combination just doesn't make any sense. In the same way,
__GFP_REPEAT | GFP_NOWAIT or __GFP_REPEAT | __GFP_NORETRY make no sense
either. Please use common sense!

> You are trying to change behavior of
> __GFP_NOFAIL without asking whether existing __GFP_NOFAIL users
> want to invoke the OOM killer.

Invoking or not invoking the oom killer is the page allocator's internal
business. No code outside of the MM gets a say in those decisions.
The fact that we provide a lightweight allocation mode which doesn't
invoke the OOM killer is a mere implementation detail.

> And the story is not specific to existing __GFP_NOFAIL users;
> it applies to existing GFP_NOFS users as well.
> 
> Quoting from http://lkml.kernel.org/r/20161125131806.GB24353@dhcp22.suse.cz :
> > > Will you look at http://marc.info/?t=120716967100004&r=1&w=2 which lead to
> > > commit a02fe13297af26c1 ("selinux: prevent rentry into the FS") and commit
> > > 869ab5147e1eead8 ("SELinux: more GFP_NOFS fixups to prevent selinux from
> > > re-entering the fs code") ? My understanding is that mkdir() system call
> > > caused memory allocation for inode creation and that memory allocation
> > > caused memory reclaim which had to be !__GFP_FS.
> > 
> > I will have a look later, thanks for the points.
> 
> What is your answer to this problem? For those who prefer panic over lockup,
> please provide a means to invoke the OOM killer (e.g. __GFP_WANT_OOM_KILLER).

Please stop shifting discussion off the scope of the discussed patch.
This just distracts from the main point. I really think that at least
patch 1 is a good clean up and the second one makes a lot of sense as
well from the semantic point of view. Now you constantly push to the
extreme with very strong statements without any actual data point. This
is not really helpful!

If you believe that my argumentation is incorrect then you are free to
nak the patch with your reasoning. But please stop this nit picking on
nonsense combination of flags.

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/2] mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically
  2016-12-06 11:03         ` Vlastimil Babka
@ 2016-12-06 19:25           ` Michal Hocko
  0 siblings, 0 replies; 23+ messages in thread
From: Michal Hocko @ 2016-12-06 19:25 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Tetsuo Handa, akpm, hannes, mgorman, rientjes, linux-mm, linux-kernel

On Tue 06-12-16 12:03:02, Vlastimil Babka wrote:
> On 12/06/2016 11:38 AM, Tetsuo Handa wrote:
> >>
> >> So we are somewhere in the middle between premature and pointless
> >> system disruption (GFP_NOFS with a lot of metadata, or a lowmem request)
> >> where the OOM killer even might not help and potential lockup which is
> >> inevitable with the current design. Dunno about you but I would rather
> >> go with the first option. To be honest I really fail to understand your
> >> line of argumentation. We have this
> >> 	do {
> >> 		cond_resched();
> >> 	} while (!(page = alloc_page(GFP_NOFS)));
> >> vs.
> >> 	page = alloc_page(GFP_NOFS | __GFP_NOFAIL);
> >>
> >> the first one doesn't invoke the OOM killer while the latter does. This
> >> discrepancy just cannot make any sense... The same is true for
> >>
> >> 	alloc_page(GFP_DMA) vs alloc_page(GFP_DMA|__GFP_NOFAIL)
> >>
> >> Now we can discuss whether it is a _good_ idea to not invoke OOM killer
> >> for those exceptions but whatever we do __GFP_NOFAIL is not a way to
> >> give such a subtle side effect. Or do you disagree even with that?
> > 
> > "[PATCH 1/2] mm: consolidate GFP_NOFAIL checks in the allocator slowpath"
> > silently changes __GFP_NOFAIL vs. __GFP_NORETRY priority.
> 
> I guess that wasn't intended?

I even didn't think about that possibility because it just doesn't make
any sense.

> > Currently, __GFP_NORETRY is stronger than __GFP_NOFAIL; __GFP_NOFAIL
> > allocation requests fail without invoking the OOM killer when both
> > __GFP_NORETRY and __GFP_NOFAIL are given.
> > 
> > With [PATCH 1/2], __GFP_NOFAIL becomes stronger than __GFP_NORETRY;
> > __GFP_NOFAIL allocation requests will loop forever without invoking
> > the OOM killer when both __GFP_NORETRY and __GFP_NOFAIL are given.
> 
> Does such a combination of flags make sense? Should we warn about it, or
> even silently remove __GFP_NORETRY in that case?

No, this combination doesn't make any sense. I seriously doubt we should
even care about it; simply following the stronger requirement makes
more sense from a semantic point of view.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/2] mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically
  2016-12-06 19:22         ` Michal Hocko
@ 2016-12-08 12:53           ` Tetsuo Handa
  2016-12-08 13:47             ` Michal Hocko
  0 siblings, 1 reply; 23+ messages in thread
From: Tetsuo Handa @ 2016-12-08 12:53 UTC (permalink / raw)
  To: mhocko; +Cc: akpm, vbabka, hannes, mgorman, rientjes, linux-mm, linux-kernel

Michal Hocko wrote:
> On Tue 06-12-16 19:38:38, Tetsuo Handa wrote:
> > You are trying to increase possible locations of lockups by changing
> > default behavior of __GFP_NOFAIL.
> 
> I disagree. I have tried to explain that it is much more important to
> have fewer silent side effects than to optimize for the very worst case. I
> simply do not see __GFP_NOFAIL lockups so common that we should care, or
> tweak their semantics in a weird way. It seems you prefer to optimize for
> the absolute worst case, and even for that case you cannot offer anything
> better than OOM killing random processes until the system somehow
> resurrects or panics. I consider this a very bad design. So let's agree
> to disagree here.

You think that invoking the OOM killer with __GFP_NOFAIL is worse than
locking up with __GFP_NOFAIL. But I think that locking up with __GFP_NOFAIL
is worse than invoking the OOM killer with __GFP_NOFAIL. If we could agree
with calling __alloc_pages_nowmark() before out_of_memory() if __GFP_NOFAIL
is given, we can avoid locking up while minimizing possibility of invoking
the OOM killer...

I suggest "when you change something, ask users who are affected by
your change" because patch 2 involves a values-based conflict.

> > "[PATCH 1/2] mm: consolidate GFP_NOFAIL checks in the allocator slowpath"
> > silently changes __GFP_NOFAIL vs. __GFP_NORETRY priority.
> > 
> > Currently, __GFP_NORETRY is stronger than __GFP_NOFAIL; __GFP_NOFAIL
> > allocation requests fail without invoking the OOM killer when both
> > __GFP_NORETRY and __GFP_NOFAIL are given.
> 
> Sigh... __GFP_NORETRY | __GFP_NOFAIL _doesn't_ make _any_ sense
> whatsoever.
> 
> > With [PATCH 1/2], __GFP_NOFAIL becomes stronger than __GFP_NORETRY;
> > __GFP_NOFAIL allocation requests will loop forever without invoking
> > the OOM killer when both __GFP_NORETRY and __GFP_NOFAIL are given.
> 
> So what? Strictly speaking __GFP_NOFAIL should always be stronger, but I
> really fail to see why we should even consider __GFP_NORETRY in that
> context. I definitely do not want to complicate the page fault path for
> a nonsense combination of flags.
> 
> > Those callers which prefer lockup over panic can specify both
> > __GFP_NORETRY and __GFP_NOFAIL.
> 
> No! This combination just doesn't make any sense. In the same way,
> __GFP_REPEAT | GFP_NOWAIT or __GFP_REPEAT | __GFP_NORETRY make no sense
> either. Please use common sense!

I wonder why I'm accused so much. I mentioned that patch 2 might be
garbage because patch 1 alone unexpectedly provided a means to retry forever
without invoking the OOM killer. You are not describing that fact in the
description. You are not describing which combinations are valid and
which flag is the stronger requirement in gfp.h (e.g. __GFP_NOFAIL vs.
__GFP_NORETRY).

> Invoking or not invoking the oom killer is the page allocator's internal
> business. No code outside of the MM gets a say in those decisions.
> The fact that we provide a lightweight allocation mode which doesn't
> invoke the OOM killer is a mere implementation detail.

__GFP_NOFAIL allocation requests for e.g. fs writeback should be considered
code inside the MM because they are operations for reclaiming memory.
Such __GFP_NOFAIL allocation requests should be given a chance to choose
which one (possibility of lockup by not invoking the OOM killer or
possibility of panic by invoking the OOM killer) they prefer.

Therefore,

> If you believe that my argumentation is incorrect then you are free to
> nak the patch with your reasoning. But please stop this nit picking on
> nonsense combination of flags.

Nacked-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

on patch 2 unless "you explain these patches to __GFP_NOFAIL users and
provide a means to invoke the OOM killer if someone chooses the possibility of
panic" or "you accept kmallocwd".

> > > Full stop. There is no guaranteed way to make a forward progress with
> > > the current page allocator implementation.
> > 
> > Then, will you accept kmallocwd until page allocator implementation
> > can provide a guaranteed way to make a forward progress?
> 
> No, I find your kmallocwd too complex for the advantage it provides.

My kmallocwd provides two advantages.

One is that, if the cause of a lockup is not a memory allocation request,
kmallocwd gives us proof that your code is doing well. You seriously tend
to ignore corner cases, wishing that the absolute worst case won't
happen; that makes me seriously explore corner cases, as a secure coder,
for proactive protection; that irritates you further.
You say that I constantly push to the extreme with very strong statements
without any actual data point, but I say that you constantly reject
without any proof, just because you have never heard of the problem. The
reality is that we can hardly expect people to have the knowledge/skills
for reporting corner cases.

The other is that a synchronous mechanism (like warn_alloc()) is prone to
corner cases. We won't be able to catch all corner cases before people
are trapped by them. If the cause of a lockup is a memory allocation
request, kmallocwd gives us a trigger to take action. This keeps me away
from exploring corner cases which you think unlikely to happen. This
helps you to choose whatever semantics/logic you prefer.

Since I'm not a __GFP_NOFAIL user, I don't care as long as lockups are
detected and reported using a catch-all approach (i.e. an asynchronous
mechanism). Instead of flatly rejecting kmallocwd as "too complex",
will you explain what you find complex (as a reply to the kmallocwd
thread)? I had my reasons for writing it that way, but there would be
room for reducing its complexity if you explain the details.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/2] mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically
  2016-12-08 12:53           ` Tetsuo Handa
@ 2016-12-08 13:47             ` Michal Hocko
  2016-12-11 11:23               ` Tetsuo Handa
  0 siblings, 1 reply; 23+ messages in thread
From: Michal Hocko @ 2016-12-08 13:47 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, vbabka, hannes, mgorman, rientjes, linux-mm, linux-kernel

On Thu 08-12-16 21:53:44, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Tue 06-12-16 19:38:38, Tetsuo Handa wrote:
> > > You are trying to increase possible locations of lockups by changing
> > > default behavior of __GFP_NOFAIL.
> > 
> > I disagree. I have tried to explain that it is much more important to
> > have fewer silent side effects than to optimize for the very worst case. I
> > simply do not see __GFP_NOFAIL lockups so common that we should care, or
> > tweak their semantics in a weird way. It seems you prefer to optimize for
> > the absolute worst case, and even for that case you cannot offer anything
> > better than OOM killing random processes until the system somehow
> > resurrects or panics. I consider this a very bad design. So let's agree
> > to disagree here.
> 
> You think that invoking the OOM killer with __GFP_NOFAIL is worse than
> locking up with __GFP_NOFAIL.

Yes and I have explained why.

> But I think that locking up with __GFP_NOFAIL
> is worse than invoking the OOM killer with __GFP_NOFAIL.

Without any actual arguments based just on handwaving.

> If we could agree
> with calling __alloc_pages_nowmark() before out_of_memory() if __GFP_NOFAIL
> is given, we can avoid locking up while minimizing possibility of invoking
> the OOM killer...

I do not understand. We do __alloc_pages_nowmark even when oom is called
for GFP_NOFAIL.

> I suggest "when you change something, ask users who are affected by
> your change" because patch 2 involves a values-based conflict.
> 
[...]
> > > Those callers which prefer lockup over panic can specify both
> > > __GFP_NORETRY and __GFP_NOFAIL.
> > 
> > No! This combination just doesn't make any sense. The same way how
> > __GFP_REPEAT | GFP_NOWAIT or __GFP_REPEAT | __GFP_NORETRY make no sense
> > as well. Please use a common sense!
> 
> I wonder why I'm accused so much. I mentioned that patch 2 might be
> garbage because patch 1 alone unexpectedly provided a means to retry forever
> without invoking the OOM killer.

Which is the whole point of the patch, and the changelog is vocal about
that, even explaining why it is desirable not to override decisions when
the oom killer is not invoked. Please reread it and object if the
argument is not correct.

> You are not describing that fact in the
> description. You are not describing which combinations are valid and
> which flag is the stronger requirement in gfp.h (e.g. __GFP_NOFAIL vs.
> __GFP_NORETRY).

Sigh... I really fail to see why I should describe an impossible gfp
mask combination which is _not_ used in the kernel. Please stop this
strawman, I am really tired of it.
 
> > Invoking or not invoking the oom killer is the page allocator internal
> > business. No code outside of the MM is to talk about those decisions.
> > The fact that we provide a lightweight allocation mode which doesn't
> > invoke the OOM killer is a mere implementation detail.
> 
> __GFP_NOFAIL allocation requests for e.g. fs writeback should be considered
> code inside the MM because they are operations for reclaiming memory.
> Such __GFP_NOFAIL allocation requests should be given a chance to choose
> which one (possibility of lockup by not invoking the OOM killer or
> possibility of panic by invoking the OOM killer) they prefer.

Please be more specific. How and why should they choose that? Which
allocations are we talking about, and why do you believe that the current
implementation, with access to memory reserves, is not sufficient?

> Therefore,
> 
> > If you believe that my argumentation is incorrect then you are free to
> > nak the patch with your reasoning. But please stop this nit picking on
> > nonsense combination of flags.
> 
> Nacked-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> 
> on patch 2 unless "you explain these patches to __GFP_NOFAIL users and
> provide a means to invoke the OOM killer if someone chooses the possibility of
> panic"

I believe that the changelog contains my reasoning and so far I
haven't heard any _argument_ from you why they are wrong. You just
managed to nitpick on an impossible and pointless gfp_mask combination
and some handwaving on possible lockups without any backing arguments.
This is not something I would consider as a basis for a serious nack. So
if you really hate this patch then do please start being reasonable and
put some real arguments into your review without any off topics and/or
strawman arguments without any relevance.

> or "you accept kmallocwd".

Are you serious? Are you really suggesting that your patch has to be
accepted in order to have this one in?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/2] mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically
  2016-12-08 13:47             ` Michal Hocko
@ 2016-12-11 11:23               ` Tetsuo Handa
  2016-12-11 13:53                 ` Tetsuo Handa
  2016-12-12  8:48                 ` Michal Hocko
  0 siblings, 2 replies; 23+ messages in thread
From: Tetsuo Handa @ 2016-12-11 11:23 UTC (permalink / raw)
  To: mhocko; +Cc: akpm, vbabka, hannes, mgorman, rientjes, linux-mm, linux-kernel

Michal Hocko wrote:
> On Thu 08-12-16 21:53:44, Tetsuo Handa wrote:
> > If we could agree
> > with calling __alloc_pages_nowmark() before out_of_memory() if __GFP_NOFAIL
> > is given, we can avoid locking up while minimizing possibility of invoking
> > the OOM killer...
>
> I do not understand. We do __alloc_pages_nowmark even when oom is called
> for GFP_NOFAIL.

Where is that? I can find __alloc_pages_nowmark() after out_of_memory()
if __GFP_NOFAIL is given, but I can't find __alloc_pages_nowmark() before
out_of_memory() if __GFP_NOFAIL is given.

What I mean is below patch folded into
"[PATCH 1/2] mm: consolidate GFP_NOFAIL checks in the allocator slowpath".

--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3057,6 +3057,25 @@ void warn_alloc(gfp_t gfp_mask, const char *fmt, ...)
 }
 
 static inline struct page *
+__alloc_pages_nowmark(gfp_t gfp_mask, unsigned int order,
+			const struct alloc_context *ac)
+{
+	struct page *page;
+
+	page = get_page_from_freelist(gfp_mask, order,
+			ALLOC_NO_WATERMARKS|ALLOC_CPUSET, ac);
+	/*
+	 * fallback to ignore cpuset restriction if our nodes
+	 * are depleted
+	 */
+	if (!page)
+		page = get_page_from_freelist(gfp_mask, order,
+				ALLOC_NO_WATERMARKS, ac);
+
+	return page;
+}
+
+static inline struct page *
 __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	const struct alloc_context *ac, unsigned long *did_some_progress)
 {
@@ -3118,21 +3137,8 @@ void warn_alloc(gfp_t gfp_mask, const char *fmt, ...)
 			goto out;
 	}
 	/* Exhausted what can be done so it's blamo time */
-	if (out_of_memory(&oc) || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) {
+	if (out_of_memory(&oc))
 		*did_some_progress = 1;
-
-		if (gfp_mask & __GFP_NOFAIL) {
-			page = get_page_from_freelist(gfp_mask, order,
-					ALLOC_NO_WATERMARKS|ALLOC_CPUSET, ac);
-			/*
-			 * fallback to ignore cpuset restriction if our nodes
-			 * are depleted
-			 */
-			if (!page)
-				page = get_page_from_freelist(gfp_mask, order,
-					ALLOC_NO_WATERMARKS, ac);
-		}
-	}
 out:
 	mutex_unlock(&oom_lock);
 	return page;
@@ -3738,6 +3744,19 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 		 */
 		WARN_ON_ONCE(order > PAGE_ALLOC_COSTLY_ORDER);
 
+		/*
+		 * Help non-failing allocations by giving them access to memory
+		 * reserves
+		 */
+		page = __alloc_pages_nowmark(gfp_mask, order, ac);
+		if (page)
+			goto got_pg;
+
+		/* Memory reserves did not help. Start killing things. */
+		page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
+		if (page)
+			goto got_pg;
+
 		cond_resched();
 		goto retry;
 	}

> > Therefore,
> >
> > > If you believe that my argumentation is incorrect then you are free to
> > > nak the patch with your reasoning. But please stop this nit picking on
> > > nonsense combination of flags.
> >
> > Nacked-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> >
> > on patch 2 unless "you explain these patches to __GFP_NOFAIL users and
> > provide a means to invoke the OOM killer if someone chooses the possibility of
> > panic"
>
> I believe that the changelog contains my reasoning and so far I
> haven't heard any _argument_ from you why they are wrong. You just
> managed to nitpick on an impossible and pointless gfp_mask combination
> and some handwaving on possible lockups without any backing arguments.
> This is not something I would consider as a basis for a serious nack. So
> if you really hate this patch then do please start being reasonable and
> put some real arguments into your review without any off topics and/or
> strawman arguments without any relevance.
>

Are you aware that I'm not objecting to "change __GFP_NOFAIL not to invoke
the OOM killer"? What I'm objecting to is that you are trying to change
!__GFP_FS && !__GFP_NOFAIL allocation requests without offering a
transition plan like the one below.

--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1013,7 +1013,7 @@ bool out_of_memory(struct oom_control *oc)
 	 * make sure exclude 0 mask - all other users should have at least
 	 * ___GFP_DIRECT_RECLAIM to get here.
 	 */
-	if (oc->gfp_mask && !(oc->gfp_mask & (__GFP_FS|__GFP_NOFAIL)))
+	if (oc->gfp_mask && !(oc->gfp_mask & (__GFP_FS|__GFP_WANT_OOM_KILLER)))
 		return true;
 
 	/*

If you do want to push

--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1013,7 +1013,7 @@ bool out_of_memory(struct oom_control *oc)
 	 * make sure exclude 0 mask - all other users should have at least
 	 * ___GFP_DIRECT_RECLAIM to get here.
 	 */
-	if (oc->gfp_mask && !(oc->gfp_mask & (__GFP_FS|__GFP_NOFAIL)))
+	if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS))
 		return true;
 
 	/*

change, ask existing __GFP_NOFAIL users, with an explanation like the one below,

  I believe that __GFP_NOFAIL should not imply invocation of the OOM killer.
  Therefore, I want to change __GFP_NOFAIL not to invoke the OOM killer.
  But since currently the OOM killer is not invoked unless either __GFP_FS or
  __GFP_NOFAIL is specified, changing __GFP_NOFAIL not to invoke the OOM
  killer exposes e.g. GFP_NOFS | __GFP_NOFAIL users to a risk of livelock
  caused by not invoking the OOM killer. Although I can't prove that this
  change never causes livelock, I don't want to provide an alternative flag
  like __GFP_WANT_OOM_KILLER. Therefore, all existing __GFP_NOFAIL users must
  agree to accept the risk introduced by this change.

and confirm that all existing __GFP_NOFAIL users are willing to accept
the risk of livelocking by not invoking the OOM killer.

Unless you do this procedure, I continue:

Nacked-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

> > or "you accept kmallocwd".
>
> Are you serious? Are you really suggesting that your patch has to be
> accepted in order to have this one in?

I'm surprised you regard my kmallocwd as a precondition for applying this
patch, for I was assuming that you would choose the procedure above, which
is super straightforward.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/2] mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically
  2016-12-11 11:23               ` Tetsuo Handa
@ 2016-12-11 13:53                 ` Tetsuo Handa
  2016-12-12  8:52                   ` Michal Hocko
  2016-12-12  8:48                 ` Michal Hocko
  1 sibling, 1 reply; 23+ messages in thread
From: Tetsuo Handa @ 2016-12-11 13:53 UTC (permalink / raw)
  To: mhocko; +Cc: akpm, vbabka, hannes, mgorman, rientjes, linux-mm, linux-kernel

Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Thu 08-12-16 21:53:44, Tetsuo Handa wrote:
> > > If we could agree
> > > with calling __alloc_pages_nowmark() before out_of_memory() if __GFP_NOFAIL
> > > is given, we can avoid locking up while minimizing possibility of invoking
> > > the OOM killer...
> >
> > I do not understand. We do __alloc_pages_nowmark even when oom is called
> > for GFP_NOFAIL.
> 
> Where is that? I can find __alloc_pages_nowmark() after out_of_memory()
> if __GFP_NOFAIL is given, but I can't find __alloc_pages_nowmark() before
> out_of_memory() if __GFP_NOFAIL is given.
> 
> What I mean is below patch folded into
> "[PATCH 1/2] mm: consolidate GFP_NOFAIL checks in the allocator slowpath".
> 
Oops, I implemented the "__alloc_pages_nowmark() before out_of_memory() if
__GFP_NOFAIL is given" case incorrectly. An updated version is shown below.

--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3116,23 +3116,27 @@ void warn_alloc(gfp_t gfp_mask, const char *fmt, ...)
 		/* The OOM killer may not free memory on a specific node */
 		if (gfp_mask & __GFP_THISNODE)
 			goto out;
+	} else {
+		/*
+		 * Help non-failing allocations by giving them access to memory
+		 * reserves
+		 */
+		page = get_page_from_freelist(gfp_mask, order,
+					      ALLOC_NO_WATERMARKS|ALLOC_CPUSET, ac);
+		/*
+		 * fallback to ignore cpuset restriction if our nodes
+		 * are depleted
+		 */
+		if (!page)
+			page = get_page_from_freelist(gfp_mask, order,
+						      ALLOC_NO_WATERMARKS, ac);
+		if (page)
+			goto out;
 	}
+
 	/* Exhausted what can be done so it's blamo time */
-	if (out_of_memory(&oc) || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) {
+	if (out_of_memory(&oc))
 		*did_some_progress = 1;
-
-		if (gfp_mask & __GFP_NOFAIL) {
-			page = get_page_from_freelist(gfp_mask, order,
-					ALLOC_NO_WATERMARKS|ALLOC_CPUSET, ac);
-			/*
-			 * fallback to ignore cpuset restriction if our nodes
-			 * are depleted
-			 */
-			if (!page)
-				page = get_page_from_freelist(gfp_mask, order,
-					ALLOC_NO_WATERMARKS, ac);
-		}
-	}
 out:
 	mutex_unlock(&oom_lock);
 	return page;
@@ -3738,6 +3742,11 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 		 */
 		WARN_ON_ONCE(order > PAGE_ALLOC_COSTLY_ORDER);
 
+		/* Try memory reserves and then start killing things. */
+		page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
+		if (page)
+			goto got_pg;
+
 		cond_resched();
 		goto retry;
 	}

I'm calling __alloc_pages_may_oom() from the nopage: label because we reach
it without having called __alloc_pages_may_oom(), since PATCH 1/2 does not
stop enforcing the OOM killer for __GFP_NOFAIL.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/2] mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically
  2016-12-11 11:23               ` Tetsuo Handa
  2016-12-11 13:53                 ` Tetsuo Handa
@ 2016-12-12  8:48                 ` Michal Hocko
  2016-12-14 10:34                   ` Michal Hocko
  1 sibling, 1 reply; 23+ messages in thread
From: Michal Hocko @ 2016-12-12  8:48 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, vbabka, hannes, mgorman, rientjes, linux-mm, linux-kernel

On Sun 11-12-16 20:23:47, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Thu 08-12-16 21:53:44, Tetsuo Handa wrote:
[...]
> > > > If you believe that my argumentation is incorrect then you are free to
> > > > nak the patch with your reasoning. But please stop this nit picking on
> > > > nonsense combination of flags.
> > >
> > > Nacked-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > >
> > > on patch 2 unless "you explain these patches to __GFP_NOFAIL users and
> > > provide a mean to invoke the OOM killer if someone chose possibility of
> > > panic"
> >
> > I believe that the changelog contains my reasoning and so far I
> > haven't heard any _argument_ from you why they are wrong. You just
> > managed to nitpick on an impossible and pointless gfp_mask combination
> > and some handwaving on possible lockups without any backing arguments.
> > This is not something I would consider as a basis for a serious nack. So
> > if you really hate this patch then do please start being reasonable and
> > put some real arguments into your review without any off topics and/or
> > strawman arguments without any relevance.
> >
> 
> Are you aware that I'm not objecting to "change __GFP_NOFAIL not to invoke
> the OOM killer"? What I'm objecting to is that you are trying to change
> !__GFP_FS && !__GFP_NOFAIL allocation requests without offering a
> transition plan like the one below.
> 
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -1013,7 +1013,7 @@ bool out_of_memory(struct oom_control *oc)
>       * make sure exclude 0 mask - all other users should have at least
>       * ___GFP_DIRECT_RECLAIM to get here.
>       */
> -    if (oc->gfp_mask && !(oc->gfp_mask & (__GFP_FS|__GFP_NOFAIL)))
> +    if (oc->gfp_mask && !(oc->gfp_mask & (__GFP_FS|__GFP_WANT_OOM_KILLER))
>          return true;

I have already asked that but let me ask again. _Who_ would use this
flag and not risk a premature OOM killer invocation?

[...]

>   I believe that __GFP_NOFAIL should not imply invocation of the OOM killer.
>   Therefore, I want to change __GFP_NOFAIL not to invoke the OOM killer.
>   But since currently the OOM killer is not invoked unless either __GFP_FS or
>   __GFP_NOFAIL is specified, changing __GFP_NOFAIL not to invoke the OOM
>   killer introduces e.g. GFP_NOFS | __GFP_NOFAIL users a risk of livelocking
>   by not invoking the OOM killer. Although I can't prove that this change
>   never causes livelock, I don't want to provide an alternative flag like
>   __GFP_WANT_OOM_KILLER. Therefore, all existing __GFP_NOFAIL users must
>   agree with accepting the risk introduced by this change.

I think you are seriously misled here. First of all, I have gone through
GFP_NOFS | GFP_NOFAIL users and _none_ of them have added the nofail
flag to enforce the OOM killer. Those users just want to express that an
allocation failure is simply not acceptable. Most of them were simply
conversions from the open-coded
	do { } while (!(page = page_alloc(GFP_NOFS)));
loops, which _do_ not invoke the OOM killer. And that is the most
important point here. Why is the above open coded (and, as you say,
lockup prone) loop OK while the GFP_NOFAIL variant should behave any
differently?

> and confirm that all existing __GFP_NOFAIL users are willing to accept
> the risk of livelocking by not invoking the OOM killer.
> 
> Unless you do this procedure, I continue:
> 
> Nacked-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

I was hoping for some actual arguments but I am afraid this is circling
in a loop. You are still handwaving with theoretical lockups without any
actual proof they are real. While I am not saying the risk is not there,
I am also saying that there are other aspects to consider:
	- lockups will happen only if there are no other GFP_FS requests
	  which trigger the OOM which is quite unlikely in most
	  situations
	- triggering oom for GFP_NOFS | GFP_NOFAIL has a non-negligible
	  risk of premature OOM killer invocation for the same reason
	  we do not trigger oom for GFP_NOFS. Even worse, metadata heavy
	  workloads are much harder to contain, so this might be used as
	  a DoS vector.
	- one of the primary points of GFP_NOFAIL's existence is to
	  prevent open coding endless loops in the code because the page
	  allocator can handle most situations more gracefully (e.g.
	  grant access to memory reserves). Having a completely
	  different OOM killer behavior is both confusing and encourages
	  abuse. If we have users who definitely need to control the OOM
	  behavior then we should add a gfp flag for them. But this
	  needs a strong use case and consideration of whether there are
	  other options to work around it.

I can add the above to the changelog if you think this is helpful but I
still maintain my position that your "this might cause lockups
theoretically" is unfounded and not justified to block the patch. I will
of course retract this patch if you can demonstrate the issue is real or
that any of my argumentation in the changelog is not correct.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 2/2] mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically
  2016-12-11 13:53                 ` Tetsuo Handa
@ 2016-12-12  8:52                   ` Michal Hocko
  0 siblings, 0 replies; 23+ messages in thread
From: Michal Hocko @ 2016-12-12  8:52 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, vbabka, hannes, mgorman, rientjes, linux-mm, linux-kernel

On Sun 11-12-16 22:53:55, Tetsuo Handa wrote:
> Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Thu 08-12-16 21:53:44, Tetsuo Handa wrote:
> > > > If we could agree
> > > > with calling __alloc_pages_nowmark() before out_of_memory() if __GFP_NOFAIL
> > > > is given, we can avoid locking up while minimizing possibility of invoking
> > > > the OOM killer...
> > >
> > > I do not understand. We do __alloc_pages_nowmark even when oom is called
> > > for GFP_NOFAIL.
> > 
> > Where is that? I can find __alloc_pages_nowmark() after out_of_memory()
> > if __GFP_NOFAIL is given, but I can't find __alloc_pages_nowmark() before
> > out_of_memory() if __GFP_NOFAIL is given.
> > 
> > What I mean is below patch folded into
> > "[PATCH 1/2] mm: consolidate GFP_NOFAIL checks in the allocator slowpath".
> > 
> Oops, I wrongly implemented "__alloc_pages_nowmark() before out_of_memory() if
> __GFP_NOFAIL is given." case. Updated version is shown below.

If you want to introduce such a change then make sure to justify it
properly in the changelog. I will not comment on this change here
because I believe it is not directly needed for either of the two
patches.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 2/2] mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically
  2016-12-12  8:48                 ` Michal Hocko
@ 2016-12-14 10:34                   ` Michal Hocko
  0 siblings, 0 replies; 23+ messages in thread
From: Michal Hocko @ 2016-12-14 10:34 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, vbabka, hannes, mgorman, rientjes, linux-mm, linux-kernel

On Mon 12-12-16 09:48:37, Michal Hocko wrote:
> On Sun 11-12-16 20:23:47, Tetsuo Handa wrote:
[...]
> >   I believe that __GFP_NOFAIL should not imply invocation of the OOM killer.
> >   Therefore, I want to change __GFP_NOFAIL not to invoke the OOM killer.
> >   But since currently the OOM killer is not invoked unless either __GFP_FS or
> >   __GFP_NOFAIL is specified, changing __GFP_NOFAIL not to invoke the OOM
> >   killer introduces e.g. GFP_NOFS | __GFP_NOFAIL users a risk of livelocking
> >   by not invoking the OOM killer. Although I can't prove that this change
> >   never causes livelock, I don't want to provide an alternative flag like
> >   __GFP_WANT_OOM_KILLER. Therefore, all existing __GFP_NOFAIL users must
> >   agree with accepting the risk introduced by this change.
> 
> I think you are seriously misled here. First of all, I have gone through
> GFP_NOFS | GFP_NOFAIL users and _none_ of them have added the nofail
> flag to enforce the OOM killer. Those users just want to express that an
> allocation failure is simply not acceptable. Most of them were simply
> conversions from the open-coded
> 	do { } while (!(page = page_alloc(GFP_NOFS)));
> loops, which _do_ not invoke the OOM killer. And that is the most
> important point here. Why is the above open coded (and, as you say,
> lockup prone) loop OK while the GFP_NOFAIL variant should behave any
> differently?
> 
> > and confirm that all existing __GFP_NOFAIL users are willing to accept
> > the risk of livelocking by not invoking the OOM killer.
> > 
> > Unless you do this procedure, I continue:
> > 
> > Nacked-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> 
> I was hoping for some actual arguments but I am afraid this is circling
> in a loop. You are still handwaving with theoretical lockups without any
> actual proof they are real. While I am not saying the risk is not there,
> I am also saying that there are other aspects to consider:
> 	- lockups will happen only if there are no other GFP_FS requests
> 	  which trigger the OOM which is quite unlikely in most
> 	  situations
> 	- triggering oom for GFP_NOFS | GFP_NOFAIL has a non-negligible
> 	  risk of premature OOM killer invocation for the same reason
> 	  we do not trigger oom for GFP_NOFS. Even worse, metadata heavy
> 	  workloads are much harder to contain, so this might be used as
> 	  a DoS vector.
> 	- one of the primary points of GFP_NOFAIL's existence is to
> 	  prevent open coding endless loops in the code because the page
> 	  allocator can handle most situations more gracefully (e.g.
> 	  grant access to memory reserves). Having a completely
> 	  different OOM killer behavior is both confusing and encourages
> 	  abuse. If we have users who definitely need to control the OOM
> 	  behavior then we should add a gfp flag for them. But this
> 	  needs a strong use case and consideration of whether there are
> 	  other options to work around it.
> 
> I can add the above to the changelog if you think this is helpful but I
> still maintain my position that your "this might cause lockups
> theoretically" is unfounded and not justified to block the patch. I will
> of course retract this patch if you can demonstrate the issue is real or
> that any of my argumentation in the changelog is not correct.

I was thinking about this some more and realized that there is a
different risk which this patch would introduce and have to be
considered. Heavy GFP_NOFS | __GFP_NOFAIL users might actually deplete
memory reserves. This was less of a problem with the current code
because we invoke the oom killer and so at least _some_ memory might be
freed. I will think about it some more but I guess I will just allow a
partial access in the no-oom case. I will post the patch 1 in the
meantime because I believe this is a reasonable cleanup.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 2/2] mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically
  2016-12-17 11:17         ` Tetsuo Handa
@ 2016-12-18 16:37           ` Michal Hocko
  0 siblings, 0 replies; 23+ messages in thread
From: Michal Hocko @ 2016-12-18 16:37 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Johannes Weiner, Nils Holland, linux-kernel, linux-mm,
	Chris Mason, David Sterba, linux-btrfs

On Sat 17-12-16 20:17:07, Tetsuo Handa wrote:
[...]
> I feel that allowing access to memory reserves based on __GFP_NOFAIL might not
> make sense. My understanding is that the actual I/O operations triggered by I/O
> requests from filesystem code are processed by other threads. Even if we grant
> access to memory reserves to GFP_NOFS | __GFP_NOFAIL allocations by fs code,
> I think it is possible that memory allocations by the underlying bio code
> fail to make further progress unless memory reserves are granted as well.

The IO layer should rely on mempools to guarantee forward progress.

> Below is a typical trace which I observe under OOM lockuped situation (though
> this trace is from an OOM stress test using XFS).
> 
> ----------------------------------------
> [ 1845.187246] MemAlloc: kworker/2:1(14498) flags=0x4208060 switches=323636 seq=48 gfp=0x2400000(GFP_NOIO) order=0 delay=430400 uninterruptible
> [ 1845.187248] kworker/2:1     D12712 14498      2 0x00000080
> [ 1845.187251] Workqueue: events_freezable_power_ disk_events_workfn
> [ 1845.187252] Call Trace:
> [ 1845.187253]  ? __schedule+0x23f/0xba0
> [ 1845.187254]  schedule+0x38/0x90
> [ 1845.187255]  schedule_timeout+0x205/0x4a0
> [ 1845.187256]  ? del_timer_sync+0xd0/0xd0
> [ 1845.187257]  schedule_timeout_uninterruptible+0x25/0x30
> [ 1845.187258]  __alloc_pages_nodemask+0x1035/0x10e0
> [ 1845.187259]  ? alloc_request_struct+0x14/0x20
> [ 1845.187261]  alloc_pages_current+0x96/0x1b0
> [ 1845.187262]  ? bio_alloc_bioset+0x20f/0x2e0
> [ 1845.187264]  bio_copy_kern+0xc4/0x180
> [ 1845.187265]  blk_rq_map_kern+0x6f/0x120
> [ 1845.187268]  __scsi_execute.isra.23+0x12f/0x160
> [ 1845.187270]  scsi_execute_req_flags+0x8f/0x100
> [ 1845.187271]  sr_check_events+0xba/0x2b0 [sr_mod]
> [ 1845.187274]  cdrom_check_events+0x13/0x30 [cdrom]
> [ 1845.187275]  sr_block_check_events+0x25/0x30 [sr_mod]
> [ 1845.187276]  disk_check_events+0x5b/0x150
> [ 1845.187277]  disk_events_workfn+0x17/0x20
> [ 1845.187278]  process_one_work+0x1fc/0x750
> [ 1845.187279]  ? process_one_work+0x167/0x750
> [ 1845.187279]  worker_thread+0x126/0x4a0
> [ 1845.187280]  kthread+0x10a/0x140
> [ 1845.187281]  ? process_one_work+0x750/0x750
> [ 1845.187282]  ? kthread_create_on_node+0x60/0x60
> [ 1845.187283]  ret_from_fork+0x2a/0x40
> ----------------------------------------
> 
> I think that this GFP_NOIO allocation request needs to consume more memory reserves
> than GFP_NOFS allocation request to make progress. 

AFAIU, this is an allocation path which doesn't block a forward progress
on a regular IO. It is merely a check whether there is a new medium in
the CDROM (aka regular polling of the device). I really fail to see any
reason why this one should get any access to memory reserves at all.

I actually do not see any reason why it should be NOIO in the first
place but I am not familiar with this code much so there might be some
reasons for that. The fact that it might stall under a heavy memory
pressure is sad but who actually cares?

> Do we want to add __GFP_NOFAIL to this GFP_NOIO allocation request
> in order to allow access to memory reserves as well as GFP_NOFS |
> __GFP_NOFAIL allocation request?

Why?

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 2/2] mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically
  2016-12-16 22:12       ` Michal Hocko
@ 2016-12-17 11:17         ` Tetsuo Handa
  2016-12-18 16:37           ` Michal Hocko
  0 siblings, 1 reply; 23+ messages in thread
From: Tetsuo Handa @ 2016-12-17 11:17 UTC (permalink / raw)
  To: Michal Hocko, Johannes Weiner
  Cc: Nils Holland, linux-kernel, linux-mm, Chris Mason, David Sterba,
	linux-btrfs

Michal Hocko wrote:
> On Fri 16-12-16 12:31:51, Johannes Weiner wrote:
>>> @@ -3737,6 +3752,16 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>>>  		 */
>>>  		WARN_ON_ONCE(order > PAGE_ALLOC_COSTLY_ORDER);
>>>  
>>> +		/*
>>> +		 * Help non-failing allocations by giving them access to memory
>>> +		 * reserves but do not use ALLOC_NO_WATERMARKS because this
>>> +		 * could deplete whole memory reserves which would just make
>>> +		 * the situation worse
>>> +		 */
>>> +		page = __alloc_pages_cpuset_fallback(gfp_mask, order, ALLOC_HARDER, ac);
>>> +		if (page)
>>> +			goto got_pg;
>>> +
>>
>> But this should be a separate patch, IMO.
>>
>> Do we observe GFP_NOFS lockups when we don't do this? 
> 
> this is hard to tell but considering users like grow_dev_page we can get
> stuck with very slow progress, I believe. Those allocations could see
> some help.
> 
>> Don't we risk
>> premature exhaustion of the memory reserves, and it's better to wait
>> for other reclaimers to make some progress instead?
> 
> waiting for other reclaimers would be preferable but we should at least
> give these some priority, which is what ALLOC_HARDER should help with.
> 
>> Should we give
>> reserve access to all GFP_NOFS allocations, or just the ones from a
>> reclaim/cleaning context?
> 
> I would focus only for those which are important enough. Which are those
> is a harder question. But certainly those with GFP_NOFAIL are important
> enough.
> 
>> All that should go into the changelog of a separate allocation booster
>> patch, I think.
> 
> The reason I did both in the same patch is to address the concern about
> potential lockups when NOFS|NOFAIL cannot make any progress. I've chosen
> ALLOC_HARDER to grant a minimum portion of the reserves so that we do
> not risk blocking out other high priority users but still help at least
> a bit and prevent starvation when other reclaimers are faster to
> consume the reclaimed memory.
> 
> I can extend the changelog of course but I believe that having both
> changes together makes some sense. NOFS|NOFAIL allocations are not all
> that rare and sometimes we really depend on them making a further
> progress.
> 

I feel that allowing access to memory reserves based on __GFP_NOFAIL might not
make sense. My understanding is that the actual I/O operations triggered by I/O
requests from filesystem code are processed by other threads. Even if we grant
access to memory reserves to GFP_NOFS | __GFP_NOFAIL allocations by fs code,
I think it is possible that memory allocations by the underlying bio code
fail to make further progress unless memory reserves are granted as well.

Below is a typical trace which I observe under OOM lockuped situation (though
this trace is from an OOM stress test using XFS).

----------------------------------------
[ 1845.187246] MemAlloc: kworker/2:1(14498) flags=0x4208060 switches=323636 seq=48 gfp=0x2400000(GFP_NOIO) order=0 delay=430400 uninterruptible
[ 1845.187248] kworker/2:1     D12712 14498      2 0x00000080
[ 1845.187251] Workqueue: events_freezable_power_ disk_events_workfn
[ 1845.187252] Call Trace:
[ 1845.187253]  ? __schedule+0x23f/0xba0
[ 1845.187254]  schedule+0x38/0x90
[ 1845.187255]  schedule_timeout+0x205/0x4a0
[ 1845.187256]  ? del_timer_sync+0xd0/0xd0
[ 1845.187257]  schedule_timeout_uninterruptible+0x25/0x30
[ 1845.187258]  __alloc_pages_nodemask+0x1035/0x10e0
[ 1845.187259]  ? alloc_request_struct+0x14/0x20
[ 1845.187261]  alloc_pages_current+0x96/0x1b0
[ 1845.187262]  ? bio_alloc_bioset+0x20f/0x2e0
[ 1845.187264]  bio_copy_kern+0xc4/0x180
[ 1845.187265]  blk_rq_map_kern+0x6f/0x120
[ 1845.187268]  __scsi_execute.isra.23+0x12f/0x160
[ 1845.187270]  scsi_execute_req_flags+0x8f/0x100
[ 1845.187271]  sr_check_events+0xba/0x2b0 [sr_mod]
[ 1845.187274]  cdrom_check_events+0x13/0x30 [cdrom]
[ 1845.187275]  sr_block_check_events+0x25/0x30 [sr_mod]
[ 1845.187276]  disk_check_events+0x5b/0x150
[ 1845.187277]  disk_events_workfn+0x17/0x20
[ 1845.187278]  process_one_work+0x1fc/0x750
[ 1845.187279]  ? process_one_work+0x167/0x750
[ 1845.187279]  worker_thread+0x126/0x4a0
[ 1845.187280]  kthread+0x10a/0x140
[ 1845.187281]  ? process_one_work+0x750/0x750
[ 1845.187282]  ? kthread_create_on_node+0x60/0x60
[ 1845.187283]  ret_from_fork+0x2a/0x40
----------------------------------------

I think that this GFP_NOIO allocation request needs to consume more memory reserves
than GFP_NOFS allocation request to make progress. 
Do we want to add __GFP_NOFAIL to this GFP_NOIO allocation request in order to allow
access to memory reserves as well as GFP_NOFS | __GFP_NOFAIL allocation request?


* Re: [PATCH 2/2] mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically
  2016-12-16 17:31     ` Johannes Weiner
@ 2016-12-16 22:12       ` Michal Hocko
  2016-12-17 11:17         ` Tetsuo Handa
  0 siblings, 1 reply; 23+ messages in thread
From: Michal Hocko @ 2016-12-16 22:12 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Nils Holland, linux-kernel, linux-mm, Chris Mason, David Sterba,
	linux-btrfs

On Fri 16-12-16 12:31:51, Johannes Weiner wrote:
> On Fri, Dec 16, 2016 at 04:58:08PM +0100, Michal Hocko wrote:
> > @@ -1013,7 +1013,7 @@ bool out_of_memory(struct oom_control *oc)
> >  	 * make sure exclude 0 mask - all other users should have at least
> >  	 * ___GFP_DIRECT_RECLAIM to get here.
> >  	 */
> > -	if (oc->gfp_mask && !(oc->gfp_mask & (__GFP_FS|__GFP_NOFAIL)))
> > +	if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS))
> >  		return true;
> 
> This makes sense, we should go back to what we had here. Because it's
> not that the reported OOMs are premature - there is genuinely no more
> memory reclaimable from the allocating context - but that this class
> of allocations should never invoke the OOM killer in the first place.

agreed, at least not with the current implementation. If we had proper
accounting where we know that the memory pinned by the fs is not really
there then we could invoke the oom killer and be safe

> > @@ -3737,6 +3752,16 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> >  		 */
> >  		WARN_ON_ONCE(order > PAGE_ALLOC_COSTLY_ORDER);
> >  
> > +		/*
> > +		 * Help non-failing allocations by giving them access to memory
> > +		 * reserves but do not use ALLOC_NO_WATERMARKS because this
> > +		 * could deplete whole memory reserves which would just make
> > +		 * the situation worse
> > +		 */
> > +		page = __alloc_pages_cpuset_fallback(gfp_mask, order, ALLOC_HARDER, ac);
> > +		if (page)
> > +			goto got_pg;
> > +
> 
> But this should be a separate patch, IMO.
> 
> Do we observe GFP_NOFS lockups when we don't do this? 

this is hard to tell but considering users like grow_dev_page we can get
stuck with very slow progress, I believe. Those allocations could see
some help.

> Don't we risk
> premature exhaustion of the memory reserves, and it's better to wait
> for other reclaimers to make some progress instead?

waiting for other reclaimers would be preferable but we should at least
give these some priority, which is what ALLOC_HARDER should help with.

> Should we give
> reserve access to all GFP_NOFS allocations, or just the ones from a
> reclaim/cleaning context?

I would focus only for those which are important enough. Which are those
is a harder question. But certainly those with GFP_NOFAIL are important
enough.

> All that should go into the changelog of a separate allocation booster
> patch, I think.

The reason I did both in the same patch is to address the concern about
potential lockups when NOFS|NOFAIL cannot make any progress. I've chosen
ALLOC_HARDER to grant a minimum portion of the reserves so that we do
not risk blocking out other high priority users but still help at least
a bit and prevent starvation when other reclaimers are faster to consume
the reclaimed memory.

I can extend the changelog of course but I believe that having both
changes together makes some sense. NOFS|NOFAIL allocations are not all
that rare and sometimes we really depend on them making a further
progress.

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 2/2] mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically
  2016-12-16 15:58   ` [PATCH 2/2] mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically Michal Hocko
@ 2016-12-16 17:31     ` Johannes Weiner
  2016-12-16 22:12       ` Michal Hocko
  0 siblings, 1 reply; 23+ messages in thread
From: Johannes Weiner @ 2016-12-16 17:31 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Nils Holland, linux-kernel, linux-mm, Chris Mason, David Sterba,
	linux-btrfs, Michal Hocko

On Fri, Dec 16, 2016 at 04:58:08PM +0100, Michal Hocko wrote:
> @@ -1013,7 +1013,7 @@ bool out_of_memory(struct oom_control *oc)
>  	 * make sure exclude 0 mask - all other users should have at least
>  	 * ___GFP_DIRECT_RECLAIM to get here.
>  	 */
> -	if (oc->gfp_mask && !(oc->gfp_mask & (__GFP_FS|__GFP_NOFAIL)))
> +	if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS))
>  		return true;

This makes sense, we should go back to what we had here. Because it's
not that the reported OOMs are premature - there is genuinely no more
memory reclaimable from the allocating context - but that this class
of allocations should never invoke the OOM killer in the first place.

> @@ -3737,6 +3752,16 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  		 */
>  		WARN_ON_ONCE(order > PAGE_ALLOC_COSTLY_ORDER);
>  
> +		/*
> +		 * Help non-failing allocations by giving them access to memory
> +		 * reserves but do not use ALLOC_NO_WATERMARKS because this
> +		 * could deplete whole memory reserves which would just make
> +		 * the situation worse
> +		 */
> +		page = __alloc_pages_cpuset_fallback(gfp_mask, order, ALLOC_HARDER, ac);
> +		if (page)
> +			goto got_pg;
> +

But this should be a separate patch, IMO.

Do we observe GFP_NOFS lockups when we don't do this? Don't we risk
premature exhaustion of the memory reserves, and it's better to wait
for other reclaimers to make some progress instead? Should we give
reserve access to all GFP_NOFS allocations, or just the ones from a
reclaim/cleaning context? All that should go into the changelog of a
separate allocation booster patch, I think.


* [PATCH 2/2] mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically
  2016-12-16 15:58 ` OOM: Better, but still there on Michal Hocko
@ 2016-12-16 15:58   ` Michal Hocko
  2016-12-16 17:31     ` Johannes Weiner
  0 siblings, 1 reply; 23+ messages in thread
From: Michal Hocko @ 2016-12-16 15:58 UTC (permalink / raw)
  To: Nils Holland
  Cc: linux-kernel, linux-mm, Chris Mason, David Sterba, linux-btrfs,
	Michal Hocko

From: Michal Hocko <mhocko@suse.com>

__alloc_pages_may_oom makes sure to skip the OOM killer depending on
the allocation request. This includes lowmem requests, costly high
order requests and others. For a long time __GFP_NOFAIL acted as an
override for all those rules. This is not documented and it can be quite
surprising as well. E.g. GFP_NOFS requests are not invoking the OOM
killer but GFP_NOFS|__GFP_NOFAIL does, so if we try to convert some of
the existing open coded loops around the allocator to nofail requests
(and we have done that in the past) then such a change would have a
non-trivial side effect which is not obvious. Note that the primary
motivation for skipping the OOM killer is to prevent premature invocation.

The exception has been added by 82553a937f12 ("oom: invoke oom killer
for __GFP_NOFAIL"). The changelog points out that the oom killer has to
be invoked, otherwise the request would loop forever. But this
argument is rather weak because the OOM killer doesn't really guarantee
any forward progress for those exceptional cases:
	- it will hardly help to form a costly order page and it can in
	  turn result in a system panic because there is no oom killable
	  task left in the end - we certainly do not want to put the
	  system down just because there is a nasty driver asking for an
	  order-9 page with GFP_NOFAIL without realizing all the
	  consequences. It is much better for such a request to loop
	  forever than to cause massive system disruption
	- lowmem is also highly unlikely to be freed during OOM killer
	- GFP_NOFS request could trigger while there is still a lot of
	  memory pinned by filesystems.

The pre-mature OOM killer is a real issue as reported by Nils Holland
	kworker/u4:5 invoked oom-killer: gfp_mask=0x2400840(GFP_NOFS|__GFP_NOFAIL), nodemask=0, order=0, oom_score_adj=0
	kworker/u4:5 cpuset=/ mems_allowed=0
	CPU: 1 PID: 2603 Comm: kworker/u4:5 Not tainted 4.9.0-gentoo #2
	Hardware name: Hewlett-Packard Compaq 15 Notebook PC/21F7, BIOS F.22 08/06/2014
	Workqueue: writeback wb_workfn (flush-btrfs-1)
	 eff0b604 c142bcce eff0b734 00000000 eff0b634 c1163332 00000000 00000292
	 eff0b634 c1431876 eff0b638 e7fb0b00 e7fa2900 e7fa2900 c1b58785 eff0b734
	 eff0b678 c110795f c1043895 eff0b664 c11075c7 00000007 00000000 00000000
	Call Trace:
	 [<c142bcce>] dump_stack+0x47/0x69
	 [<c1163332>] dump_header+0x60/0x178
	 [<c1431876>] ? ___ratelimit+0x86/0xe0
	 [<c110795f>] oom_kill_process+0x20f/0x3d0
	 [<c1043895>] ? has_capability_noaudit+0x15/0x20
	 [<c11075c7>] ? oom_badness.part.13+0xb7/0x130
	 [<c1107df9>] out_of_memory+0xd9/0x260
	 [<c110ba0b>] __alloc_pages_nodemask+0xbfb/0xc80
	 [<c110414d>] pagecache_get_page+0xad/0x270
	 [<c13664a6>] alloc_extent_buffer+0x116/0x3e0
	 [<c1334a2e>] btrfs_find_create_tree_block+0xe/0x10
	[...]
	Normal free:41332kB min:41368kB low:51708kB high:62048kB active_anon:0kB inactive_anon:0kB active_file:532748kB inactive_file:44kB unevictable:0kB writepending:24kB present:897016kB managed:836248kB mlocked:0kB slab_reclaimable:159448kB slab_unreclaimable:69608kB kernel_stack:1112kB pagetables:1404kB bounce:0kB free_pcp:528kB local_pcp:340kB free_cma:0kB
	lowmem_reserve[]: 0 0 21292 21292
	HighMem free:781660kB min:512kB low:34356kB high:68200kB active_anon:234740kB inactive_anon:360kB active_file:557232kB inactive_file:1127804kB unevictable:0kB writepending:2592kB present:2725384kB managed:2725384kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:800kB local_pcp:608kB free_cma:0kB

this is a GFP_NOFS|__GFP_NOFAIL request which invokes the OOM killer
because there is clearly nothing reclaimable in the zone Normal while
there is a lot of page cache which is most probably pinned by the fs but
GFP_NOFS cannot reclaim it.

This patch simply removes the __GFP_NOFAIL special case in order to have
a clearer semantic without surprising side effects. Instead we allow
nofail requests to access memory reserves to move forward both when
the OOM killer is invoked and when it should be suppressed. In the
latter case we are more careful and only allow partial access because
we do not want to risk depleting the whole reserves. There are users
doing GFP_NOFS|__GFP_NOFAIL heavily (e.g. __getblk_gfp ->
grow_dev_page).

Introduce the __alloc_pages_cpuset_fallback helper which allows
bypassing allocation constraints for the given gfp mask while enforcing
cpusets whenever possible.

Reported-by: Nils Holland <nholland@tisys.org>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c   |  2 +-
 mm/page_alloc.c | 97 ++++++++++++++++++++++++++++++++++++---------------------
 2 files changed, 62 insertions(+), 37 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index ec9f11d4f094..12a6fce85f61 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1013,7 +1013,7 @@ bool out_of_memory(struct oom_control *oc)
 	 * make sure exclude 0 mask - all other users should have at least
 	 * ___GFP_DIRECT_RECLAIM to get here.
 	 */
-	if (oc->gfp_mask && !(oc->gfp_mask & (__GFP_FS|__GFP_NOFAIL)))
+	if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS))
 		return true;
 
 	/*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 095e2fa286de..d6bc3e4f1a0c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3057,6 +3057,26 @@ void warn_alloc(gfp_t gfp_mask, const char *fmt, ...)
 }
 
 static inline struct page *
+__alloc_pages_cpuset_fallback(gfp_t gfp_mask, unsigned int order,
+			      unsigned int alloc_flags,
+			      const struct alloc_context *ac)
+{
+	struct page *page;
+
+	page = get_page_from_freelist(gfp_mask, order,
+			alloc_flags|ALLOC_CPUSET, ac);
+	/*
+	 * fallback to ignore cpuset restriction if our nodes
+	 * are depleted
+	 */
+	if (!page)
+		page = get_page_from_freelist(gfp_mask, order,
+				alloc_flags, ac);
+
+	return page;
+}
+
+static inline struct page *
 __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	const struct alloc_context *ac, unsigned long *did_some_progress)
 {
@@ -3091,47 +3111,42 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	if (page)
 		goto out;
 
-	if (!(gfp_mask & __GFP_NOFAIL)) {
-		/* Coredumps can quickly deplete all memory reserves */
-		if (current->flags & PF_DUMPCORE)
-			goto out;
-		/* The OOM killer will not help higher order allocs */
-		if (order > PAGE_ALLOC_COSTLY_ORDER)
-			goto out;
-		/* The OOM killer does not needlessly kill tasks for lowmem */
-		if (ac->high_zoneidx < ZONE_NORMAL)
-			goto out;
-		if (pm_suspended_storage())
-			goto out;
-		/*
-		 * XXX: GFP_NOFS allocations should rather fail than rely on
-		 * other request to make a forward progress.
-		 * We are in an unfortunate situation where out_of_memory cannot
-		 * do much for this context but let's try it to at least get
-		 * access to memory reserved if the current task is killed (see
-		 * out_of_memory). Once filesystems are ready to handle allocation
-		 * failures more gracefully we should just bail out here.
-		 */
+	/* Coredumps can quickly deplete all memory reserves */
+	if (current->flags & PF_DUMPCORE)
+		goto out;
+	/* The OOM killer will not help higher order allocs */
+	if (order > PAGE_ALLOC_COSTLY_ORDER)
+		goto out;
+	/* The OOM killer does not needlessly kill tasks for lowmem */
+	if (ac->high_zoneidx < ZONE_NORMAL)
+		goto out;
+	if (pm_suspended_storage())
+		goto out;
+	/*
+	 * XXX: GFP_NOFS allocations should rather fail than rely on
+	 * other request to make a forward progress.
+	 * We are in an unfortunate situation where out_of_memory cannot
+	 * do much for this context but let's try it to at least get
+	 * access to memory reserved if the current task is killed (see
+	 * out_of_memory). Once filesystems are ready to handle allocation
+	 * failures more gracefully we should just bail out here.
+	 */
+
+	/* The OOM killer may not free memory on a specific node */
+	if (gfp_mask & __GFP_THISNODE)
+		goto out;
 
-		/* The OOM killer may not free memory on a specific node */
-		if (gfp_mask & __GFP_THISNODE)
-			goto out;
-	}
 	/* Exhausted what can be done so it's blamo time */
-	if (out_of_memory(&oc) || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) {
+	if (out_of_memory(&oc)) {
 		*did_some_progress = 1;
 
-		if (gfp_mask & __GFP_NOFAIL) {
-			page = get_page_from_freelist(gfp_mask, order,
-					ALLOC_NO_WATERMARKS|ALLOC_CPUSET, ac);
-			/*
-			 * fallback to ignore cpuset restriction if our nodes
-			 * are depleted
-			 */
-			if (!page)
-				page = get_page_from_freelist(gfp_mask, order,
+		/*
+		 * Help non-failing allocations by giving them access to memory
+		 * reserves
+		 */
+		if (gfp_mask & __GFP_NOFAIL)
+			page = __alloc_pages_cpuset_fallback(gfp_mask, order,
 					ALLOC_NO_WATERMARKS, ac);
-		}
 	}
 out:
 	mutex_unlock(&oom_lock);
@@ -3737,6 +3752,16 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		 */
 		WARN_ON_ONCE(order > PAGE_ALLOC_COSTLY_ORDER);
 
+		/*
+		 * Help non-failing allocations by giving them access to memory
+		 * reserves but do not use ALLOC_NO_WATERMARKS because this
+		 * could deplete whole memory reserves which would just make
+		 * the situation worse
+		 */
+		page = __alloc_pages_cpuset_fallback(gfp_mask, order, ALLOC_HARDER, ac);
+		if (page)
+			goto got_pg;
+
 		cond_resched();
 		goto retry;
 	}
-- 
2.10.2


Thread overview: 23+ messages
2016-12-01 15:25 [PATCH 0/2] GFP_NOFAIL cleanups Michal Hocko
2016-12-01 15:25 ` [PATCH 1/2] mm: consolidate GFP_NOFAIL checks in the allocator slowpath Michal Hocko
2016-12-01 15:25 ` [PATCH 2/2] mm, oom: do not enfore OOM killer for __GFP_NOFAIL automatically Michal Hocko
2016-12-02  7:23   ` Vlastimil Babka
2016-12-05 13:45   ` Tetsuo Handa
2016-12-05 14:10     ` Michal Hocko
2016-12-06  8:27       ` Michal Hocko
2016-12-06 10:38       ` Tetsuo Handa
2016-12-06 11:03         ` Vlastimil Babka
2016-12-06 19:25           ` Michal Hocko
2016-12-06 19:22         ` Michal Hocko
2016-12-08 12:53           ` Tetsuo Handa
2016-12-08 13:47             ` Michal Hocko
2016-12-11 11:23               ` Tetsuo Handa
2016-12-11 13:53                 ` Tetsuo Handa
2016-12-12  8:52                   ` Michal Hocko
2016-12-12  8:48                 ` Michal Hocko
2016-12-14 10:34                   ` Michal Hocko
2016-12-16  7:39 OOM: Better, but still there on 4.9 Michal Hocko
2016-12-16 15:58 ` OOM: Better, but still there on Michal Hocko
2016-12-16 15:58   ` [PATCH 2/2] mm, oom: do not enfore OOM killer for __GFP_NOFAIL automatically Michal Hocko
2016-12-16 17:31     ` Johannes Weiner
2016-12-16 22:12       ` Michal Hocko
2016-12-17 11:17         ` Tetsuo Handa
2016-12-18 16:37           ` Michal Hocko