linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/2] GFP_NOFAIL cleanups
@ 2016-12-01 15:25 Michal Hocko
  2016-12-01 15:25 ` [PATCH 1/2] mm: consolidate GFP_NOFAIL checks in the allocator slowpath Michal Hocko
  2016-12-01 15:25 ` [PATCH 2/2] mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically Michal Hocko
  0 siblings, 2 replies; 19+ messages in thread
From: Michal Hocko @ 2016-12-01 15:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Tetsuo Handa, Johannes Weiner, Mel Gorman,
	David Rientjes, linux-mm, LKML

Hi,
I have posted this as an RFC previously [1]. As there was no fundamental
disagreement, I would like to ask for inclusion.

Tetsuo has noticed [2] that recent changes have altered the GFP_NOFAIL
semantics for costly order requests. I believe the primary reason this
happened is that our GFP_NOFAIL checks are too scattered and it is really
easy to forget to add one. That's why I am proposing patch 1, which
consolidates all the nofail handling in a single place. This should make
the code easier to maintain.

Patch 2 on top is a further attempt to make the GFP_NOFAIL semantics less
surprising. As things stand, GFP_NOFAIL currently overrides the OOM killer
prevention code, which is both subtle and not really needed. Patch 2 has
more details about the issues this might cause.

I would consider both patches more a cleanup than anything else. Any
feedback is highly appreciated.

[1] http://lkml.kernel.org/r/20161123064925.9716-1-mhocko@kernel.org
[2] http://lkml.kernel.org/r/1479387004-5998-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 1/2] mm: consolidate GFP_NOFAIL checks in the allocator slowpath
  2016-12-01 15:25 [PATCH 0/2] GFP_NOFAIL cleanups Michal Hocko
@ 2016-12-01 15:25 ` Michal Hocko
  2016-12-01 15:25 ` [PATCH 2/2] mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically Michal Hocko
  1 sibling, 0 replies; 19+ messages in thread
From: Michal Hocko @ 2016-12-01 15:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Tetsuo Handa, Johannes Weiner, Mel Gorman,
	David Rientjes, linux-mm, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Tetsuo Handa has pointed out that 0a0337e0d1d1 ("mm, oom: rework oom
detection") has subtly changed the semantics for costly high order requests
with __GFP_NOFAIL and without __GFP_REPEAT; those can fail right now.
My code inspection didn't reveal any such users in the tree, but it is
true that this might lead to unexpected allocation failures and
subsequent oopses.

The GFP_NOFAIL handling in __alloc_pages_slowpath is currently hard to
follow. There are a few special cases, but we are lacking a catch-all place
to make sure we do not miss any case where a non-failing allocation might
fail. This patch reorganizes the code a bit and puts all those special
cases under the nopage label, which is the generic go-to-fail path.
Non-failing allocations are retried there, while those that cannot retry
(like non-sleeping allocations) go to the failure point directly. This
should make the code flow much easier to follow and less error prone for
future changes.

While we are at it, we have to move the stall check up to catch
potentially looping non-failing allocations.
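
In short, the resulting flow can be sketched like this (an illustrative,
heavily simplified summary; the authoritative change is the diff below):

	if (!can_direct_reclaim)
		goto nopage;	/* cannot sleep, so cannot retry */

	if (current->flags & PF_MEMALLOC)
		goto nopage;	/* avoid recursing into direct reclaim */

	/* ... reclaim, compaction and OOM killer retries ... */

nopage:
	if (gfp_mask & __GFP_NOFAIL) {
		/* a non-sleeping __GFP_NOFAIL request cannot be satisfied */
		if (WARN_ON_ONCE(!can_direct_reclaim))
			goto fail;
		/* warn about other unsupportable contexts but keep retrying */
		WARN_ON_ONCE(current->flags & PF_MEMALLOC);
		WARN_ON_ONCE(order > PAGE_ALLOC_COSTLY_ORDER);
		cond_resched();
		goto retry;
	}
fail:
	warn_alloc(gfp_mask, "page allocation failure: order:%u", order);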

Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/page_alloc.c | 68 ++++++++++++++++++++++++++++++++++-----------------------
 1 file changed, 41 insertions(+), 27 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0fbfead6aa7d..76c0b6bb0baf 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3627,32 +3627,23 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		goto got_pg;
 
 	/* Caller is not willing to reclaim, we can't balance anything */
-	if (!can_direct_reclaim) {
-		/*
-		 * All existing users of the __GFP_NOFAIL are blockable, so warn
-		 * of any new users that actually allow this type of allocation
-		 * to fail.
-		 */
-		WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL);
+	if (!can_direct_reclaim)
 		goto nopage;
+
+	/* Make sure we know about allocations which stall for too long */
+	if (time_after(jiffies, alloc_start + stall_timeout)) {
+		warn_alloc(gfp_mask,
+			"page alloction stalls for %ums, order:%u",
+			jiffies_to_msecs(jiffies-alloc_start), order);
+		stall_timeout += 10 * HZ;
 	}
 
 	/* Avoid recursion of direct reclaim */
-	if (current->flags & PF_MEMALLOC) {
-		/*
-		 * __GFP_NOFAIL request from this context is rather bizarre
-		 * because we cannot reclaim anything and only can loop waiting
-		 * for somebody to do a work for us.
-		 */
-		if (WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) {
-			cond_resched();
-			goto retry;
-		}
+	if (current->flags & PF_MEMALLOC)
 		goto nopage;
-	}
 
 	/* Avoid allocations with no watermarks from looping endlessly */
-	if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOFAIL))
+	if (test_thread_flag(TIF_MEMDIE))
 		goto nopage;
 
 
@@ -3679,14 +3670,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT))
 		goto nopage;
 
-	/* Make sure we know about allocations which stall for too long */
-	if (time_after(jiffies, alloc_start + stall_timeout)) {
-		warn_alloc(gfp_mask,
-			"page alloction stalls for %ums, order:%u",
-			jiffies_to_msecs(jiffies-alloc_start), order);
-		stall_timeout += 10 * HZ;
-	}
-
 	if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
 				 did_some_progress > 0, &no_progress_loops))
 		goto retry;
@@ -3715,6 +3698,37 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	}
 
 nopage:
+	/*
+	 * Make sure that __GFP_NOFAIL request doesn't leak out and make sure
+	 * we always retry
+	 */
+	if (gfp_mask & __GFP_NOFAIL) {
+		/*
+		 * All existing users of the __GFP_NOFAIL are blockable, so warn
+		 * of any new users that actually require GFP_NOWAIT
+		 */
+		if (WARN_ON_ONCE(!can_direct_reclaim))
+			goto fail;
+
+		/*
+		 * PF_MEMALLOC request from this context is rather bizarre
+		 * because we cannot reclaim anything and only can loop waiting
+		 * for somebody to do a work for us
+		 */
+		WARN_ON_ONCE(current->flags & PF_MEMALLOC);
+
+		/*
+		 * non failing costly orders are a hard requirement which we
+		 * are not prepared for much so let's warn about these users
+		 * so that we can identify them and convert them to something
+		 * else.
+		 */
+		WARN_ON_ONCE(order > PAGE_ALLOC_COSTLY_ORDER);
+
+		cond_resched();
+		goto retry;
+	}
+fail:
 	warn_alloc(gfp_mask,
 			"page allocation failure: order:%u", order);
 got_pg:
-- 
2.10.2

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 2/2] mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically
  2016-12-01 15:25 [PATCH 0/2] GFP_NOFAIL cleanups Michal Hocko
  2016-12-01 15:25 ` [PATCH 1/2] mm: consolidate GFP_NOFAIL checks in the allocator slowpath Michal Hocko
@ 2016-12-01 15:25 ` Michal Hocko
  2016-12-02  7:23   ` Vlastimil Babka
  2016-12-05 13:45   ` Tetsuo Handa
  1 sibling, 2 replies; 19+ messages in thread
From: Michal Hocko @ 2016-12-01 15:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Tetsuo Handa, Johannes Weiner, Mel Gorman,
	David Rientjes, linux-mm, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

__alloc_pages_may_oom makes sure to skip the OOM killer depending on
the allocation request. This includes lowmem requests, costly high
order requests and others. For a long time __GFP_NOFAIL acted as an
override for all those rules. This is not documented and it can be quite
surprising as well. E.g. GFP_NOFS requests do not invoke the OOM killer
but GFP_NOFS|__GFP_NOFAIL does, so if we try to convert some of the
existing open coded loops around the allocator to nofail requests (and we
have done that in the past), such a change would have a non-trivial side
effect which is not obvious. Note that the primary motivation for
skipping the OOM killer is to prevent its premature invocation.

The exception was added by 82553a937f12 ("oom: invoke oom killer
for __GFP_NOFAIL"). The changelog points out that the OOM killer has to
be invoked, otherwise the request would be looping forever. But this
argument is rather weak because the OOM killer doesn't really guarantee
any forward progress for those exceptional cases - e.g. it will hardly
help to form a costly order page. I believe we certainly do not want to
kill all processes and eventually panic the system just because there is
a nasty driver asking for an order-9 page with __GFP_NOFAIL without
realizing all the consequences - it is much better for such a request to
loop forever than to cause massive system disruption. Lowmem is also
highly unlikely to be freed by the OOM killer, and a GFP_NOFS request
could trigger it while there is still a lot of memory pinned by
filesystems.

This patch simply removes the __GFP_NOFAIL special case in order to have
clearer semantics without surprising side effects. Instead, we allow
nofail requests to access memory reserves to move forward, both when the
OOM killer is invoked and when it should be suppressed. The
__alloc_pages_nowmark helper has been introduced for that purpose.
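
In other words (an illustrative sketch only; see the diff below for the
actual change), a nofail request which is not allowed to invoke the OOM
killer now falls back to the memory reserves instead of forcing a kill:

	/* in the __GFP_NOFAIL branch of the slowpath nopage handling */
	if (gfp_mask & __GFP_NOFAIL) {
		/* give access to memory reserves before looping again */
		page = __alloc_pages_nowmark(gfp_mask, order, ac);
		if (page)
			goto got_pg;

		cond_resched();
		goto retry;	/* the OOM killer is no longer forced */
	}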

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c   |  2 +-
 mm/page_alloc.c | 95 +++++++++++++++++++++++++++++++++++----------------------
 2 files changed, 59 insertions(+), 38 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index ec9f11d4f094..12a6fce85f61 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1013,7 +1013,7 @@ bool out_of_memory(struct oom_control *oc)
 	 * make sure exclude 0 mask - all other users should have at least
 	 * ___GFP_DIRECT_RECLAIM to get here.
 	 */
-	if (oc->gfp_mask && !(oc->gfp_mask & (__GFP_FS|__GFP_NOFAIL)))
+	if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS))
 		return true;
 
 	/*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 76c0b6bb0baf..7102641147c4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3044,6 +3044,25 @@ void warn_alloc(gfp_t gfp_mask, const char *fmt, ...)
 }
 
 static inline struct page *
+__alloc_pages_nowmark(gfp_t gfp_mask, unsigned int order,
+						const struct alloc_context *ac)
+{
+	struct page *page;
+
+	page = get_page_from_freelist(gfp_mask, order,
+			ALLOC_NO_WATERMARKS|ALLOC_CPUSET, ac);
+	/*
+	 * fallback to ignore cpuset restriction if our nodes
+	 * are depleted
+	 */
+	if (!page)
+		page = get_page_from_freelist(gfp_mask, order,
+				ALLOC_NO_WATERMARKS, ac);
+
+	return page;
+}
+
+static inline struct page *
 __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	const struct alloc_context *ac, unsigned long *did_some_progress)
 {
@@ -3078,47 +3097,41 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	if (page)
 		goto out;
 
-	if (!(gfp_mask & __GFP_NOFAIL)) {
-		/* Coredumps can quickly deplete all memory reserves */
-		if (current->flags & PF_DUMPCORE)
-			goto out;
-		/* The OOM killer will not help higher order allocs */
-		if (order > PAGE_ALLOC_COSTLY_ORDER)
-			goto out;
-		/* The OOM killer does not needlessly kill tasks for lowmem */
-		if (ac->high_zoneidx < ZONE_NORMAL)
-			goto out;
-		if (pm_suspended_storage())
-			goto out;
-		/*
-		 * XXX: GFP_NOFS allocations should rather fail than rely on
-		 * other request to make a forward progress.
-		 * We are in an unfortunate situation where out_of_memory cannot
-		 * do much for this context but let's try it to at least get
-		 * access to memory reserved if the current task is killed (see
-		 * out_of_memory). Once filesystems are ready to handle allocation
-		 * failures more gracefully we should just bail out here.
-		 */
+	/* Coredumps can quickly deplete all memory reserves */
+	if (current->flags & PF_DUMPCORE)
+		goto out;
+	/* The OOM killer will not help higher order allocs */
+	if (order > PAGE_ALLOC_COSTLY_ORDER)
+		goto out;
+	/* The OOM killer does not needlessly kill tasks for lowmem */
+	if (ac->high_zoneidx < ZONE_NORMAL)
+		goto out;
+	if (pm_suspended_storage())
+		goto out;
+	/*
+	 * XXX: GFP_NOFS allocations should rather fail than rely on
+	 * other request to make a forward progress.
+	 * We are in an unfortunate situation where out_of_memory cannot
+	 * do much for this context but let's try it to at least get
+	 * access to memory reserved if the current task is killed (see
+	 * out_of_memory). Once filesystems are ready to handle allocation
+	 * failures more gracefully we should just bail out here.
+	 */
+
+	/* The OOM killer may not free memory on a specific node */
+	if (gfp_mask & __GFP_THISNODE)
+		goto out;
 
-		/* The OOM killer may not free memory on a specific node */
-		if (gfp_mask & __GFP_THISNODE)
-			goto out;
-	}
 	/* Exhausted what can be done so it's blamo time */
-	if (out_of_memory(&oc) || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) {
+	if (out_of_memory(&oc)) {
 		*did_some_progress = 1;
 
-		if (gfp_mask & __GFP_NOFAIL) {
-			page = get_page_from_freelist(gfp_mask, order,
-					ALLOC_NO_WATERMARKS|ALLOC_CPUSET, ac);
-			/*
-			 * fallback to ignore cpuset restriction if our nodes
-			 * are depleted
-			 */
-			if (!page)
-				page = get_page_from_freelist(gfp_mask, order,
-					ALLOC_NO_WATERMARKS, ac);
-		}
+		/*
+		 * Help non-failing allocations by giving them access to memory
+		 * reserves
+		 */
+		if (gfp_mask & __GFP_NOFAIL)
+			page = __alloc_pages_nowmark(gfp_mask, order, ac);
 	}
 out:
 	mutex_unlock(&oom_lock);
@@ -3725,6 +3738,14 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		 */
 		WARN_ON_ONCE(order > PAGE_ALLOC_COSTLY_ORDER);
 
+		/*
+		 * Help non-failing allocations by giving them access to memory
+		 * reserves
+		 */
+		page = __alloc_pages_nowmark(gfp_mask, order, ac);
+		if (page)
+			goto got_pg;
+
 		cond_resched();
 		goto retry;
 	}
-- 
2.10.2

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/2] mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically
  2016-12-01 15:25 ` [PATCH 2/2] mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically Michal Hocko
@ 2016-12-02  7:23   ` Vlastimil Babka
  2016-12-05 13:45   ` Tetsuo Handa
  1 sibling, 0 replies; 19+ messages in thread
From: Vlastimil Babka @ 2016-12-02  7:23 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton
  Cc: Tetsuo Handa, Johannes Weiner, Mel Gorman, David Rientjes,
	linux-mm, LKML, Michal Hocko

On 12/01/2016 04:25 PM, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
>
> [...]
>
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/2] mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically
  2016-12-01 15:25 ` [PATCH 2/2] mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically Michal Hocko
  2016-12-02  7:23   ` Vlastimil Babka
@ 2016-12-05 13:45   ` Tetsuo Handa
  2016-12-05 14:10     ` Michal Hocko
  1 sibling, 1 reply; 19+ messages in thread
From: Tetsuo Handa @ 2016-12-05 13:45 UTC (permalink / raw)
  To: mhocko, akpm
  Cc: vbabka, hannes, mgorman, rientjes, linux-mm, linux-kernel, mhocko

Michal Hocko wrote:
> __alloc_pages_may_oom makes sure to skip the OOM killer depending on
> the allocation request. This includes lowmem requests, costly high
> order requests and others. For a long time __GFP_NOFAIL acted as an
> override for all those rules. This is not documented and it can be quite
> surprising as well. E.g. GFP_NOFS requests are not invoking the OOM
> killer but GFP_NOFS|__GFP_NOFAIL does so if we try to convert some of
> the existing open coded loops around allocator to nofail request (and we
> have done that in the past) then such a change would have a non trivial
> side effect which is not obvious. Note that the primary motivation for
> skipping the OOM killer is to prevent from pre-mature invocation.
> 
> The exception has been added by 82553a937f12 ("oom: invoke oom killer
> for __GFP_NOFAIL"). The changelog points out that the oom killer has to
> be invoked otherwise the request would be looping for ever. But this
> argument is rather weak because the OOM killer doesn't really guarantee
> any forward progress for those exceptional cases - e.g. it will hardly
> help to form costly order - I believe we certainly do not want to kill
> all processes and eventually panic the system just because there is a
> nasty driver asking for order-9 page with GFP_NOFAIL not realizing all
> the consequences - it is much better this request would loop for ever
> than the massive system disruption, lowmem is also highly unlikely to be
> freed during OOM killer and GFP_NOFS request could trigger while there
> is still a lot of memory pinned by filesystems.

I disagree. I believe that panic caused by OOM killer is much much better
than a locked up system. I hate to add new locations that can lockup inside
page allocator. This is __GFP_NOFAIL and reclaim has failed. Expecting the
administrator to sit at the console and press SysRq-f until the lockup is
resolved is silly.

If there is a nasty driver asking for order-9 page with __GFP_NOFAIL, fix
that driver.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/2] mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically
  2016-12-05 13:45   ` Tetsuo Handa
@ 2016-12-05 14:10     ` Michal Hocko
  2016-12-06  8:27       ` Michal Hocko
  2016-12-06 10:38       ` Tetsuo Handa
  0 siblings, 2 replies; 19+ messages in thread
From: Michal Hocko @ 2016-12-05 14:10 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, vbabka, hannes, mgorman, rientjes, linux-mm, linux-kernel

On Mon 05-12-16 22:45:19, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > [...]
> 
> I disagree. I believe that panic caused by OOM killer is much much better
> than a locked up system. I hate to add new locations that can lockup inside
> page allocator. This is __GFP_NOFAIL and reclaim has failed.

As a matter of fact any __GFP_NOFAIL can lockup inside the allocator.
Full stop. There is no guaranteed way to make a forward progress with
the current page allocator implementation.

So we are somewhere in the middle between pre-mature and pointless
system disruption (GFP_NOFS with a lot of metadata or lowmem request)
where the OOM killer even might not help and potential lockup which is
inevitable with the current design. Dunno about you but I would rather
go with the first option. To be honest I really fail to understand your
line of argumentation. We have this
	do {
		cond_resched();
	} (page = alloc_page(GFP_NOFS));
vs.
	page = alloc_page(GFP_NOFS | __GFP_NOFAIL);

the first one doesn't invoke OOM killer while the latter does. This
discrepancy just cannot make any sense... The same is true for

	alloc_page(GFP_DMA) vs alloc_page(GFP_DMA|__GFP_NOFAIL)

Now we can discuss whether it is a _good_ idea to not invoke OOM killer
for those exceptions but whatever we do __GFP_NOFAIL is not a way to
give such a subtle side effect. Or do you disagree even with that?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/2] mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically
  2016-12-05 14:10     ` Michal Hocko
@ 2016-12-06  8:27       ` Michal Hocko
  2016-12-06 10:38       ` Tetsuo Handa
  1 sibling, 0 replies; 19+ messages in thread
From: Michal Hocko @ 2016-12-06  8:27 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, vbabka, hannes, mgorman, rientjes, linux-mm, linux-kernel

On Mon 05-12-16 15:10:09, Michal Hocko wrote:
[...]
> So we are somewhere in the middle between pre-mature and pointless
> system disruption (GFP_NOFS with a lot of metadata or lowmem request)
> where the OOM killer even might not help and potential lockup which is
> inevitable with the current design. Dunno about you but I would rather
> go with the first option. To be honest I really fail to understand your
> line of argumentation. We have this
> 	do {
> 		cond_resched();
> 	} (page = alloc_page(GFP_NOFS));

This should have been while (!(page = alloc_page(GFP_NOFS))) of
course...

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/2] mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically
  2016-12-05 14:10     ` Michal Hocko
  2016-12-06  8:27       ` Michal Hocko
@ 2016-12-06 10:38       ` Tetsuo Handa
  2016-12-06 11:03         ` Vlastimil Babka
  2016-12-06 19:22         ` Michal Hocko
  1 sibling, 2 replies; 19+ messages in thread
From: Tetsuo Handa @ 2016-12-06 10:38 UTC (permalink / raw)
  To: mhocko; +Cc: akpm, vbabka, hannes, mgorman, rientjes, linux-mm, linux-kernel

Michal Hocko wrote:
> On Mon 05-12-16 22:45:19, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > [...]
> > 
> > I disagree. I believe that panic caused by OOM killer is much much better
> > than a locked up system. I hate to add new locations that can lockup inside
> > page allocator. This is __GFP_NOFAIL and reclaim has failed.
> 
> As a matter of fact any __GFP_NOFAIL can lockup inside the allocator.

You are trying to increase possible locations of lockups by changing
default behavior of __GFP_NOFAIL.

> Full stop. There is no guaranteed way to make a forward progress with
> the current page allocator implementation.

Then, will you accept kmallocwd until page allocator implementation
can provide a guaranteed way to make a forward progress?

> 
> So we are somewhere in the middle between pre-mature and pointless
> system disruption (GFP_NOFS with a lot of metadata or lowmem request)
> where the OOM killer even might not help and potential lockup which is
> inevitable with the current design. Dunno about you but I would rather
> go with the first option. To be honest I really fail to understand your
> line of argumentation. We have this
> 	do {
> 		cond_resched();
> 	} while (!(page = alloc_page(GFP_NOFS)));
> vs.
> 	page = alloc_page(GFP_NOFS | __GFP_NOFAIL);
> 
> the first one doesn't invoke OOM killer while the latter does. This
> discrepancy just cannot make any sense... The same is true for
> 
> 	alloc_page(GFP_DMA) vs alloc_page(GFP_DMA|__GFP_NOFAIL)
> 
> Now we can discuss whether it is a _good_ idea to not invoke OOM killer
> for those exceptions but whatever we do __GFP_NOFAIL is not a way to
> give such a subtle side effect. Or do you disagree even with that?

"[PATCH 1/2] mm: consolidate GFP_NOFAIL checks in the allocator slowpath"
silently changes __GFP_NOFAIL vs. __GFP_NORETRY priority.

Currently, __GFP_NORETRY is stronger than __GFP_NOFAIL; __GFP_NOFAIL
allocation requests fail without invoking the OOM killer when both
__GFP_NORETRY and __GFP_NOFAIL are given.

With [PATCH 1/2], __GFP_NOFAIL becomes stronger than __GFP_NORETRY;
__GFP_NOFAIL allocation requests will loop forever without invoking
the OOM killer when both __GFP_NORETRY and __GFP_NOFAIL are given.
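
(An illustrative sketch of that interaction, assuming the existing
__GFP_NORETRY check in __alloc_pages_slowpath; this snippet is not part of
either patch:)

	/* upstream check, unchanged by the series */
	if (gfp_mask & __GFP_NORETRY)
		goto nopage;

	/* ... */

nopage:
	/* before PATCH 1/2: simply warn and fail */
	/* after PATCH 1/2: non-failing requests are retried instead */
	if (gfp_mask & __GFP_NOFAIL) {
		cond_resched();
		goto retry;	/* never reaches __alloc_pages_may_oom */
	}
	warn_alloc(gfp_mask, "page allocation failure: order:%u", order);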

Those callers which prefer lockup over panic can specify both
__GFP_NORETRY and __GFP_NOFAIL. You are trying to change behavior of
__GFP_NOFAIL without asking whether existing __GFP_NOFAIL users
want to invoke the OOM killer.

And the story is not specific to existing __GFP_NOFAIL users;
it applies to existing GFP_NOFS users as well.

Quoting from http://lkml.kernel.org/r/20161125131806.GB24353@dhcp22.suse.cz :
> > Will you look at http://marc.info/?t=120716967100004&r=1&w=2 which lead to
> > commit a02fe13297af26c1 ("selinux: prevent rentry into the FS") and commit
> > 869ab5147e1eead8 ("SELinux: more GFP_NOFS fixups to prevent selinux from
> > re-entering the fs code") ? My understanding is that mkdir() system call
> > caused memory allocation for inode creation and that memory allocation
> > caused memory reclaim which had to be !__GFP_FS.
> 
> I will have a look later, thanks for the points.

What is your answer to this problem? For those who prefer panic over lockup,
please provide a means to invoke the OOM killer (e.g. __GFP_WANT_OOM_KILLER).

If you provide __GFP_WANT_OOM_KILLER, you can change

-#define GFP_KERNEL      (__GFP_RECLAIM | __GFP_IO | __GFP_FS)
+#define GFP_KERNEL      (__GFP_RECLAIM | __GFP_IO | __GFP_FS | __GFP_WANT_OOM_KILLER)

in gfp.h and add __GFP_WANT_OOM_KILLER to any existing __GFP_NOFAIL/GFP_NOFS
users who prefer panic over lockup. Then, I think I'm fine with

-	if (oc->gfp_mask && !(oc->gfp_mask & (__GFP_FS|__GFP_NOFAIL)))
-		return true;
+	if (oc->gfp_mask && !(oc->gfp_mask & __GFP_WANT_OOM_KILLER))
+		return false;

in out_of_memory().

As you know, quite few people respond to discussions regarding near-OOM
situations. Changing the default behavior without asking existing users
is annoying.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/2] mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically
  2016-12-06 10:38       ` Tetsuo Handa
@ 2016-12-06 11:03         ` Vlastimil Babka
  2016-12-06 19:25           ` Michal Hocko
  2016-12-06 19:22         ` Michal Hocko
  1 sibling, 1 reply; 19+ messages in thread
From: Vlastimil Babka @ 2016-12-06 11:03 UTC (permalink / raw)
  To: Tetsuo Handa, mhocko
  Cc: akpm, hannes, mgorman, rientjes, linux-mm, linux-kernel

On 12/06/2016 11:38 AM, Tetsuo Handa wrote:
>>
>> So we are somewhere in the middle between pre-mature and pointless
>> system disruption (GFP_NOFS with a lot of metadata or lowmem request)
>> where the OOM killer even might not help and potential lockup which is
>> inevitable with the current design. Dunno about you but I would rather
>> go with the first option. To be honest I really fail to understand your
>> line of argumentation. We have this
>> 	do {
>> 		cond_resched();
>> 	} while (!(page = alloc_page(GFP_NOFS)));
>> vs.
>> 	page = alloc_page(GFP_NOFS | __GFP_NOFAIL);
>>
>> the first one doesn't invoke OOM killer while the latter does. This
>> discrepancy just cannot make any sense... The same is true for
>>
>> 	alloc_page(GFP_DMA) vs alloc_page(GFP_DMA|__GFP_NOFAIL)
>>
>> Now we can discuss whether it is a _good_ idea to not invoke OOM killer
>> for those exceptions but whatever we do __GFP_NOFAIL is not a way to
>> give such a subtle side effect. Or do you disagree even with that?
> 
> "[PATCH 1/2] mm: consolidate GFP_NOFAIL checks in the allocator slowpath"
> silently changes __GFP_NOFAIL vs. __GFP_NORETRY priority.

I guess that wasn't intended?

> Currently, __GFP_NORETRY is stronger than __GFP_NOFAIL; __GFP_NOFAIL
> allocation requests fail without invoking the OOM killer when both
> __GFP_NORETRY and __GFP_NOFAIL are given.
> 
> With [PATCH 1/2], __GFP_NOFAIL becomes stronger than __GFP_NORETRY;
> __GFP_NOFAIL allocation requests will loop forever without invoking
> the OOM killer when both __GFP_NORETRY and __GFP_NOFAIL are given.

Does such a combination of flags make sense? Should we warn about it, or
even silently remove __GFP_NORETRY in such a case?
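
(A purely hypothetical sketch combining both options - warning and dropping
the weaker flag; neither the check nor its placement is part of the posted
patches:)

	/* hypothetical: sanitize the contradictory combination early on */
	if (WARN_ON_ONCE((gfp_mask & (__GFP_NOFAIL | __GFP_NORETRY)) ==
				(__GFP_NOFAIL | __GFP_NORETRY)))
		gfp_mask &= ~__GFP_NORETRY;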

> Those callers which prefer lockup over panic can specify both
> __GFP_NORETRY and __GFP_NOFAIL.

What lockup exactly, if __GFP_NORETRY did lead to a failure?

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/2] mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically
  2016-12-06 10:38       ` Tetsuo Handa
  2016-12-06 11:03         ` Vlastimil Babka
@ 2016-12-06 19:22         ` Michal Hocko
  2016-12-08 12:53           ` Tetsuo Handa
  1 sibling, 1 reply; 19+ messages in thread
From: Michal Hocko @ 2016-12-06 19:22 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, vbabka, hannes, mgorman, rientjes, linux-mm, linux-kernel

On Tue 06-12-16 19:38:38, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Mon 05-12-16 22:45:19, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > [...]
> > > 
> > > I disagree. I believe that panic caused by OOM killer is much much better
> > > than a locked up system. I hate to add new locations that can lockup inside
> > > page allocator. This is __GFP_NOFAIL and reclaim has failed.
> > 
> > As a matter of fact any __GFP_NOFAIL can lockup inside the allocator.
> 
> You are trying to increase possible locations of lockups by changing
> default behavior of __GFP_NOFAIL.

I disagree. I have tried to explain that it is much more important to
have fewer silent side effects than optimize for the very worst case.  I
simply do not see __GFP_NOFAIL lockups so common to even care or tweak
their semantic in a weird way. It seems you prefer to optimize for the
absolute worst case and even for that case you cannot offer anything
better than randomly OOM killing random processes until the system
somehow resurrects or panics. I consider this a very bad design. So
let's agree to disagree here.

> > Full stop. There is no guaranteed way to make a forward progress with
> > the current page allocator implementation.
> 
> Then, will you accept kmallocwd until page allocator implementation
> can provide a guaranteed way to make a forward progress?

No, I find your kmallocwd too complex for the advantage it provides.
 
> > [...]
> 
> "[PATCH 1/2] mm: consolidate GFP_NOFAIL checks in the allocator slowpath"
> silently changes __GFP_NOFAIL vs. __GFP_NORETRY priority.
> 
> Currently, __GFP_NORETRY is stronger than __GFP_NOFAIL; __GFP_NOFAIL
> allocation requests fail without invoking the OOM killer when both
> __GFP_NORETRY and __GFP_NOFAIL are given.

Sigh... __GFP_NORETRY | __GFP_NOFAIL _doesn't_ make _any_ sense
whatsoever.

> With [PATCH 1/2], __GFP_NOFAIL becomes stronger than __GFP_NORETRY;
> __GFP_NOFAIL allocation requests will loop forever without invoking
> the OOM killer when both __GFP_NORETRY and __GFP_NOFAIL are given.

So what? Strictly speaking __GFP_NOFAIL should be always stronger but I
really fail to see why we should even consider __GFP_NORETRY in that
context. I definitely do not want to complicate the page fault path for
a nonsense combination of flags.

> Those callers which prefer lockup over panic can specify both
> __GFP_NORETRY and __GFP_NOFAIL.

No! This combination just doesn't make any sense. The same way how
__GFP_REPEAT | GFP_NOWAIT or __GFP_REPEAT | __GFP_NORETRY make no sense
as well. Please use common sense!

> You are trying to change behavior of
> __GFP_NOFAIL without asking whether existing __GFP_NOFAIL users
> want to invoke the OOM killer.

Invoking or not invoking the oom killer is the page allocator internal
business. No code outside of the MM is to talk about those decisions.
The fact that we provide a lightweight allocation mode which doesn't
invoke the OOM killer is a mere implementation detail.

> And the story is not specific to existing __GFP_NOFAIL users;
> it applies to existing GFP_NOFS users as well.
> 
> Quoting from http://lkml.kernel.org/r/20161125131806.GB24353@dhcp22.suse.cz :
> > > Will you look at http://marc.info/?t=120716967100004&r=1&w=2 which lead to
> > > commit a02fe13297af26c1 ("selinux: prevent rentry into the FS") and commit
> > > 869ab5147e1eead8 ("SELinux: more GFP_NOFS fixups to prevent selinux from
> > > re-entering the fs code") ? My understanding is that mkdir() system call
> > > caused memory allocation for inode creation and that memory allocation
> > > caused memory reclaim which had to be !__GFP_FS.
> > 
> > I will have a look later, thanks for the points.
> 
> What is your answer to this problem? For those who prefer panic over lockup,
> please provide a means to invoke the OOM killer (e.g. __GFP_WANT_OOM_KILLER).

Please stop shifting discussion off the scope of the discussed patch.
This just distracts from the main point. I really think that at least
patch 1 is a good clean up and the second one makes a lot of sense as
well from the semantic point of view. Now you constantly push to the
extreme with very strong statements without any actual data point. This
is not really helpful!

If you believe that my argumentation is incorrect then you are free to
nak the patch with your reasoning. But please stop this nit picking on
nonsense combination of flags.

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/2] mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically
  2016-12-06 11:03         ` Vlastimil Babka
@ 2016-12-06 19:25           ` Michal Hocko
  0 siblings, 0 replies; 19+ messages in thread
From: Michal Hocko @ 2016-12-06 19:25 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Tetsuo Handa, akpm, hannes, mgorman, rientjes, linux-mm, linux-kernel

On Tue 06-12-16 12:03:02, Vlastimil Babka wrote:
> On 12/06/2016 11:38 AM, Tetsuo Handa wrote:
> >> [...]
> > 
> > "[PATCH 1/2] mm: consolidate GFP_NOFAIL checks in the allocator slowpath"
> > silently changes __GFP_NOFAIL vs. __GFP_NORETRY priority.
> 
> I guess that wasn't intended?

I didn't even think about that possibility because it just doesn't make
any sense.

> > Currently, __GFP_NORETRY is stronger than __GFP_NOFAIL; __GFP_NOFAIL
> > allocation requests fail without invoking the OOM killer when both
> > __GFP_NORETRY and __GFP_NOFAIL are given.
> > 
> > With [PATCH 1/2], __GFP_NOFAIL becomes stronger than __GFP_NORETRY;
> > __GFP_NOFAIL allocation requests will loop forever without invoking
> > the OOM killer when both __GFP_NORETRY and __GFP_NOFAIL are given.
> 
> Does such a combination of flags make sense? Should we warn about it, or
> even silently remove __GFP_NORETRY in such a case?

No this combination doesn't make any sense. I seriously doubt we should
even care about it and simply following the stronger requirement makes
more sense from a semantic point of view.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/2] mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically
  2016-12-06 19:22         ` Michal Hocko
@ 2016-12-08 12:53           ` Tetsuo Handa
  2016-12-08 13:47             ` Michal Hocko
  0 siblings, 1 reply; 19+ messages in thread
From: Tetsuo Handa @ 2016-12-08 12:53 UTC (permalink / raw)
  To: mhocko; +Cc: akpm, vbabka, hannes, mgorman, rientjes, linux-mm, linux-kernel

Michal Hocko wrote:
> On Tue 06-12-16 19:38:38, Tetsuo Handa wrote:
> > You are trying to increase possible locations of lockups by changing
> > default behavior of __GFP_NOFAIL.
> 
> I disagree. I have tried to explain that it is much more important to
> have fewer silent side effects than optimize for the very worst case.  I
> simply do not see __GFP_NOFAIL lockups so common to even care or tweak
> their semantic in a weird way. It seems you prefer to optimize for the
> absolute worst case and even for that case you cannot offer anything
> better than randomly OOM killing random processes until the system
> somehow resurrects or panics. I consider this a very bad design. So
> let's agree to disagree here.

You think that invoking the OOM killer with __GFP_NOFAIL is worse than
locking up with __GFP_NOFAIL. But I think that locking up with __GFP_NOFAIL
is worse than invoking the OOM killer with __GFP_NOFAIL. If we could agree
with calling __alloc_pages_nowmark() before out_of_memory() if __GFP_NOFAIL
is given, we can avoid locking up while minimizing possibility of invoking
the OOM killer...

I suggest "when you change something, ask users who are affected by
your change" because patch 2 has values-based conflict.

> > "[PATCH 1/2] mm: consolidate GFP_NOFAIL checks in the allocator slowpath"
> > silently changes __GFP_NOFAIL vs. __GFP_NORETRY priority.
> > 
> > Currently, __GFP_NORETRY is stronger than __GFP_NOFAIL; __GFP_NOFAIL
> > allocation requests fail without invoking the OOM killer when both
> > __GFP_NORETRY and __GFP_NOFAIL are given.
> 
> Sigh... __GFP_NORETRY | __GFP_NOFAIL _doesn't_ make _any_ sense
> whatsoever.
> 
> > With [PATCH 1/2], __GFP_NOFAIL becomes stronger than __GFP_NORETRY;
> > __GFP_NOFAIL allocation requests will loop forever without invoking
> > the OOM killer when both __GFP_NORETRY and __GFP_NOFAIL are given.
> 
> So what? Strictly speaking __GFP_NOFAIL should be always stronger but I
> really fail to see why we should even consider __GFP_NORETRY in that
> context. I definitely do not want to complicate the page fault path for
> a nonsense combination of flags.
> 
> > Those callers which prefer lockup over panic can specify both
> > __GFP_NORETRY and __GFP_NOFAIL.
> 
> No! This combination just doesn't make any sense. The same way how
> __GFP_REPEAT | GFP_NOWAIT or __GFP_REPEAT | __GFP_NORETRY make no sense
> as well. Please use common sense!

I wonder why I'm accused so much. I mentioned that patch 2 might be
garbage because patch 1 alone unexpectedly provided a means to retry forever
without invoking the OOM killer. You are not describing that fact in the
description. You are not describing what combinations are valid and
which flag is the stronger requirement in gfp.h (e.g. __GFP_NOFAIL vs.
__GFP_NORETRY).

> Invoking or not invoking the oom killer is the page allocator internal
> business. No code outside of the MM is to talk about those decisions.
> The fact that we provide a lightweight allocation mode which doesn't
> invoke the OOM killer is a mere implementation detail.

__GFP_NOFAIL allocation requests for e.g. fs writeback are considered as
code inside the MM because they are operations for reclaiming memory.
Such __GFP_NOFAIL allocation requests should be given a chance to choose
which one (possibility of lockup by not invoking the OOM killer or
possibility of panic by invoking the OOM killer) they prefer.

Therefore,

> If you believe that my argumentation is incorrect then you are free to
> nak the patch with your reasoning. But please stop this nit picking on
> nonsense combination of flags.

Nacked-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

on patch 2 unless "you explain these patches to __GFP_NOFAIL users and
provide a means to invoke the OOM killer if someone chooses the possibility of
panic" or "you accept kmallocwd".

> > > Full stop. There is no guaranteed way to make a forward progress with
> > > the current page allocator implementation.
> > 
> > Then, will you accept kmallocwd until page allocator implementation
> > can provide a guaranteed way to make a forward progress?
> 
> No, I find your kmallocwd too complex for the advantage it provides.

My kmallocwd provides us with two advantages.

One is that, if the cause of a lockup is not a memory allocation request,
kmallocwd gives us proof that you are doing well. You seriously tend to
ignore corner cases with your wish that the absolute worst case won't
happen; that makes me seriously explore corner cases as a secure coder
for proactive protection; that irritates you further.
You say that I constantly push to the extreme with very strong statements
without any actual data point, but I say that you constantly reject
without any proof just because you have never heard of them. The reality is
that we can hardly expect people to have the knowledge/skills for reporting
corner cases.

The other is that a synchronous mechanism (like warn_alloc()) is prone to
corner cases. We won't be able to catch all corner cases before people
are trapped by them. If the cause of a lockup is a memory allocation request,
kmallocwd gives us a trigger to take action. This keeps me away from
exploring corner cases which you think are unlikely to happen. This lets you
choose whatever semantics/logic you prefer.

Since I'm not a __GFP_NOFAIL user, I don't care as long as lockups are
detected and reported using a catch-all approach (i.e. an asynchronous
mechanism). Instead of cruelly rejecting kmallocwd as "too complex",
will you explain why you feel it is complex (as a reply to the kmallocwd
thread)? I have my own reasons for writing it the way it is, but there
would be room for reducing its complexity if you explain the details.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/2] mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically
  2016-12-08 12:53           ` Tetsuo Handa
@ 2016-12-08 13:47             ` Michal Hocko
  2016-12-11 11:23               ` Tetsuo Handa
  0 siblings, 1 reply; 19+ messages in thread
From: Michal Hocko @ 2016-12-08 13:47 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, vbabka, hannes, mgorman, rientjes, linux-mm, linux-kernel

On Thu 08-12-16 21:53:44, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Tue 06-12-16 19:38:38, Tetsuo Handa wrote:
> > > You are trying to increase possible locations of lockups by changing
> > > default behavior of __GFP_NOFAIL.
> > 
> > I disagree. I have tried to explain that it is much more important to
> > have fewer silent side effects than optimize for the very worst case.  I
> > simply do not see __GFP_NOFAIL lockups so common to even care or tweak
> > their semantic in a weird way. It seems you prefer to optimize for the
> > absolute worst case and even for that case you cannot offer anything
> > better than randomly OOM killing random processes until the system
> > somehow resurrects or panics. I consider this a very bad design. So
> > let's agree to disagree here.
> 
> You think that invoking the OOM killer with __GFP_NOFAIL is worse than
> locking up with __GFP_NOFAIL.

Yes and I have explained why.

> But I think that locking up with __GFP_NOFAIL
> is worse than invoking the OOM killer with __GFP_NOFAIL.

Without any actual arguments based just on handwaving.

> If we could agree
> with calling __alloc_pages_nowmark() before out_of_memory() if __GFP_NOFAIL
> is given, we can avoid locking up while minimizing possibility of invoking
> the OOM killer...

I do not understand. We do __alloc_pages_nowmark even when oom is called
for GFP_NOFAIL.

> I suggest "when you change something, ask users who are affected by
> your change" because patch 2 has values-based conflict.
> 
[...]
> > > Those callers which prefer lockup over panic can specify both
> > > __GFP_NORETRY and __GFP_NOFAIL.
> > 
> > No! This combination just doesn't make any sense. The same way how
> > __GFP_REPEAT | GFP_NOWAIT or __GFP_REPEAT | __GFP_NORETRY make no sense
> > as well. Please use a common sense!
> 
> I wonder why I'm accused so much. I mentioned that patch 2 might be
> garbage because patch 1 alone unexpectedly provided a means to retry forever
> without invoking the OOM killer.

Which is the whole point of the patch and the changelog is vocal about
that. Even explaining why it is desirable to not override decisions when
the oom killer is not invoked. Please reread that and object if the
argument is not correct.

> You are not describing that fact in the
> description. You are not describing what combinations are valid and
> which flag is the stronger requirement in gfp.h (e.g. __GFP_NOFAIL vs.
> __GFP_NORETRY).

Sigh... I really fail to see why I should describe an impossible gfp
mask combination which is _not_ used in the kernel. Please stop this
strawman, I am really tired of it.
 
> > Invoking or not invoking the oom killer is the page allocator internal
> > business. No code outside of the MM is to talk about those decisions.
> > The fact that we provide a lightweight allocation mode which doesn't
> > invoke the OOM killer is a mere implementation detail.
> 
> __GFP_NOFAIL allocation requests for e.g. fs writeback are considered as
> code inside the MM because they are operations for reclaiming memory.
> Such __GFP_NOFAIL allocation requests should be given a chance to choose
> which one (possibility of lockup by not invoking the OOM killer or
> possibility of panic by invoking the OOM killer) they prefer.

Please be more specific. How and why should they choose that? Which
allocations are we talking about, and why do you believe that the current
implementation with access to memory reserves is not sufficient?

> Therefore,
> 
> > If you believe that my argumentation is incorrect then you are free to
> > nak the patch with your reasoning. But please stop this nit picking on
> > nonsense combination of flags.
> 
> Nacked-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> 
> on patch 2 unless "you explain these patches to __GFP_NOFAIL users and
> provide a means to invoke the OOM killer if someone chooses the possibility of
> panic"

I believe that the changelog contains my reasoning and so far I
haven't heard any _argument_ from you why they are wrong. You just
managed to nitpick on an impossible and pointless gfp_mask combination
and some handwaving on possible lockups without any backing arguments.
This is not something I would consider as a basis for a serious nack. So
if you really hate this patch then do please start being reasonable and
put some real arguments into your review without any off topics and/or
strawman arguments without any relevance.

> or "you accept kmallocwd".

Are you serious? Are you really suggesting that your patch has to be
accepted in order to have this one in?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/2] mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically
  2016-12-08 13:47             ` Michal Hocko
@ 2016-12-11 11:23               ` Tetsuo Handa
  2016-12-11 13:53                 ` Tetsuo Handa
  2016-12-12  8:48                 ` Michal Hocko
  0 siblings, 2 replies; 19+ messages in thread
From: Tetsuo Handa @ 2016-12-11 11:23 UTC (permalink / raw)
  To: mhocko; +Cc: akpm, vbabka, hannes, mgorman, rientjes, linux-mm, linux-kernel

Michal Hocko wrote:
> On Thu 08-12-16 21:53:44, Tetsuo Handa wrote:
> > If we could agree
> > with calling __alloc_pages_nowmark() before out_of_memory() if __GFP_NOFAIL
> > is given, we can avoid locking up while minimizing possibility of invoking
> > the OOM killer...
>
> I do not understand. We do __alloc_pages_nowmark even when oom is called
> for GFP_NOFAIL.

Where is that? I can find __alloc_pages_nowmark() after out_of_memory()
if __GFP_NOFAIL is given, but I can't find __alloc_pages_nowmark() before
out_of_memory() if __GFP_NOFAIL is given.

What I mean is the patch below, folded into
"[PATCH 1/2] mm: consolidate GFP_NOFAIL checks in the allocator slowpath".

--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3057,6 +3057,25 @@ void warn_alloc(gfp_t gfp_mask, const char *fmt, ...)
 }
 
 static inline struct page *
+__alloc_pages_nowmark(gfp_t gfp_mask, unsigned int order,
+                        const struct alloc_context *ac)
+{
+    struct page *page;
+
+    page = get_page_from_freelist(gfp_mask, order,
+            ALLOC_NO_WATERMARKS|ALLOC_CPUSET, ac);
+    /*
+     * fallback to ignore cpuset restriction if our nodes
+     * are depleted
+     */
+    if (!page)
+        page = get_page_from_freelist(gfp_mask, order,
+                ALLOC_NO_WATERMARKS, ac);
+
+    return page;
+}
+
+static inline struct page *
 __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
     const struct alloc_context *ac, unsigned long *did_some_progress)
 {
@@ -3118,21 +3137,8 @@ void warn_alloc(gfp_t gfp_mask, const char *fmt, ...)
             goto out;
     }
     /* Exhausted what can be done so it's blamo time */
-    if (out_of_memory(&oc) || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) {
+    if (out_of_memory(&oc))
         *did_some_progress = 1;
-
-        if (gfp_mask & __GFP_NOFAIL) {
-            page = get_page_from_freelist(gfp_mask, order,
-                    ALLOC_NO_WATERMARKS|ALLOC_CPUSET, ac);
-            /*
-             * fallback to ignore cpuset restriction if our nodes
-             * are depleted
-             */
-            if (!page)
-                page = get_page_from_freelist(gfp_mask, order,
-                    ALLOC_NO_WATERMARKS, ac);
-        }
-    }
 out:
     mutex_unlock(&oom_lock);
     return page;
@@ -3738,6 +3744,19 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
          */
         WARN_ON_ONCE(order > PAGE_ALLOC_COSTLY_ORDER);
 
+        /*
+         * Help non-failing allocations by giving them access to memory
+         * reserves
+         */
+        page = __alloc_pages_nowmark(gfp_mask, order, ac);
+        if (page)
+            goto got_pg;
+
+        /* Memory reserves did not help. Start killing things. */
+        page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
+        if (page)
+            goto got_pg;
+
         cond_resched();
         goto retry;
     }

> > Therefore,
> >
> > > If you believe that my argumentation is incorrect then you are free to
> > > nak the patch with your reasoning. But please stop this nit picking on
> > > nonsense combination of flags.
> >
> > Nacked-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> >
> > on patch 2 unless "you explain these patches to __GFP_NOFAIL users and
> > provide a mean to invoke the OOM killer if someone chose possibility of
> > panic"
>
> I believe that the changelog contains my reasoning and so far I
> haven't heard any _argument_ from you why they are wrong. You just
> managed to nitpick on an impossible and pointless gfp_mask combination
> and some handwaving on possible lockups without any backing arguments.
> This is not something I would consider as a basis for a serious nack. So
> if you really hate this patch then do please start being reasonable and
> put some real arguments into your review instead of going off topic
> and/or raising irrelevant strawman arguments.
>

Are you aware that I'm not objecting to "change __GFP_NOFAIL not to invoke
the OOM killer"? What I'm objecting to is that you are trying to change
!__GFP_FS && !__GFP_NOFAIL allocation requests without offering a
transition plan like the one below.

--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1013,7 +1013,7 @@ bool out_of_memory(struct oom_control *oc)
      * make sure exclude 0 mask - all other users should have at least
      * ___GFP_DIRECT_RECLAIM to get here.
      */
-    if (oc->gfp_mask && !(oc->gfp_mask & (__GFP_FS|__GFP_NOFAIL)))
+    if (oc->gfp_mask && !(oc->gfp_mask & (__GFP_FS|__GFP_WANT_OOM_KILLER)))
         return true;
 
     /*

If you do want to push

--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1013,7 +1013,7 @@ bool out_of_memory(struct oom_control *oc)
      * make sure exclude 0 mask - all other users should have at least
      * ___GFP_DIRECT_RECLAIM to get here.
      */
-    if (oc->gfp_mask && !(oc->gfp_mask & (__GFP_FS|__GFP_NOFAIL)))
+    if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS))
         return true;
 
     /*

change, ask the existing __GFP_NOFAIL users with an explanation like the one below

  I believe that __GFP_NOFAIL should not imply invocation of the OOM killer.
  Therefore, I want to change __GFP_NOFAIL not to invoke the OOM killer.
  But since currently the OOM killer is not invoked unless either __GFP_FS or
  __GFP_NOFAIL is specified, changing __GFP_NOFAIL not to invoke the OOM
  killer exposes e.g. GFP_NOFS | __GFP_NOFAIL users to a risk of livelocking
  by not invoking the OOM killer. Although I can't prove that this change
  never causes livelock, I don't want to provide an alternative flag like
  __GFP_WANT_OOM_KILLER. Therefore, all existing __GFP_NOFAIL users must
  agree with accepting the risk introduced by this change.

and confirm that all existing __GFP_NOFAIL users are willing to accept
the risk of livelocking by not invoking the OOM killer.

Unless you do this procedure, I continue:

Nacked-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

> > or "you accept kmallocwd".
>
> Are you serious? Are you really suggesting that your patch has to be
> accepted in order to have this one in?

I'm surprised you took my kmallocwd as one of the choices for getting this
patch applied, for I was assuming that you would choose the procedure
above, which is super straightforward.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/2] mm, oom: do not enfore OOM killer for __GFP_NOFAIL automatically
  2016-12-11 11:23               ` Tetsuo Handa
@ 2016-12-11 13:53                 ` Tetsuo Handa
  2016-12-12  8:52                   ` Michal Hocko
  2016-12-12  8:48                 ` Michal Hocko
  1 sibling, 1 reply; 19+ messages in thread
From: Tetsuo Handa @ 2016-12-11 13:53 UTC (permalink / raw)
  To: mhocko; +Cc: akpm, vbabka, hannes, mgorman, rientjes, linux-mm, linux-kernel

Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Thu 08-12-16 21:53:44, Tetsuo Handa wrote:
> > > If we could agree
> > > with calling __alloc_pages_nowmark() before out_of_memory() if __GFP_NOFAIL
> > > is given, we can avoid locking up while minimizing possibility of invoking
> > > the OOM killer...
> >
> > I do not understand. We do __alloc_pages_nowmark even when oom is called
> > for GFP_NOFAIL.
> 
> Where is that? I can find __alloc_pages_nowmark() after out_of_memory()
> if __GFP_NOFAIL is given, but I can't find __alloc_pages_nowmark() before
> out_of_memory() if __GFP_NOFAIL is given.
> 
> What I mean is below patch folded into
> "[PATCH 1/2] mm: consolidate GFP_NOFAIL checks in the allocator slowpath".
> 
Oops, I implemented the "__alloc_pages_nowmark() before out_of_memory() if
__GFP_NOFAIL is given" case incorrectly. An updated version is shown below.

--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3116,23 +3116,27 @@ void warn_alloc(gfp_t gfp_mask, const char *fmt, ...)
 		/* The OOM killer may not free memory on a specific node */
 		if (gfp_mask & __GFP_THISNODE)
 			goto out;
+	} else {
+		/*
+		 * Help non-failing allocations by giving them access to memory
+		 * reserves
+		 */
+		page = get_page_from_freelist(gfp_mask, order,
+					      ALLOC_NO_WATERMARKS|ALLOC_CPUSET, ac);
+		/*
+		 * fallback to ignore cpuset restriction if our nodes
+		 * are depleted
+		 */
+		if (!page)
+			page = get_page_from_freelist(gfp_mask, order,
+						      ALLOC_NO_WATERMARKS, ac);
+		if (page)
+			goto out;
 	}
+
 	/* Exhausted what can be done so it's blamo time */
-	if (out_of_memory(&oc) || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) {
+	if (out_of_memory(&oc))
 		*did_some_progress = 1;
-
-		if (gfp_mask & __GFP_NOFAIL) {
-			page = get_page_from_freelist(gfp_mask, order,
-					ALLOC_NO_WATERMARKS|ALLOC_CPUSET, ac);
-			/*
-			 * fallback to ignore cpuset restriction if our nodes
-			 * are depleted
-			 */
-			if (!page)
-				page = get_page_from_freelist(gfp_mask, order,
-					ALLOC_NO_WATERMARKS, ac);
-		}
-	}
 out:
 	mutex_unlock(&oom_lock);
 	return page;
@@ -3738,6 +3742,11 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 		 */
 		WARN_ON_ONCE(order > PAGE_ALLOC_COSTLY_ORDER);
 
+		/* Try memory reserves and then start killing things. */
+		page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
+		if (page)
+			goto got_pg;
+
 		cond_resched();
 		goto retry;
 	}

I'm calling __alloc_pages_may_oom() from the nopage: label because we can
reach it without having called __alloc_pages_may_oom(), since PATCH 1/2 is
not about no longer enforcing the OOM killer for __GFP_NOFAIL.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/2] mm, oom: do not enfore OOM killer for __GFP_NOFAIL automatically
  2016-12-11 11:23               ` Tetsuo Handa
  2016-12-11 13:53                 ` Tetsuo Handa
@ 2016-12-12  8:48                 ` Michal Hocko
  2016-12-14 10:34                   ` Michal Hocko
  1 sibling, 1 reply; 19+ messages in thread
From: Michal Hocko @ 2016-12-12  8:48 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, vbabka, hannes, mgorman, rientjes, linux-mm, linux-kernel

On Sun 11-12-16 20:23:47, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Thu 08-12-16 21:53:44, Tetsuo Handa wrote:
[...]
> > > > If you believe that my argumentation is incorrect then you are free to
> > > > nak the patch with your reasoning. But please stop this nit picking on
> > > > nonsense combination of flags.
> > >
> > > Nacked-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > >
> > > on patch 2 unless "you explain these patches to __GFP_NOFAIL users and
> > > provide a mean to invoke the OOM killer if someone chose possibility of
> > > panic"
> >
> > I believe that the changelog contains my reasoning and so far I
> > haven't heard any _argument_ from you why they are wrong. You just
> > managed to nitpick on an impossible and pointless gfp_mask combination
> > and some handwaving on possible lockups without any backing arguments.
> > This is not something I would consider as a basis for a serious nack. So
> > if you really hate this patch then do please start being reasonable and
> > put some real arguments into your review instead of going off topic
> > and/or raising irrelevant strawman arguments.
> >
> 
> Are you aware that I'm not objecting to "change __GFP_NOFAIL not to invoke
> the OOM killer"? What I'm objecting to is that you are trying to change
> !__GFP_FS && !__GFP_NOFAIL allocation requests without offering a
> transition plan like the one below.
> 
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -1013,7 +1013,7 @@ bool out_of_memory(struct oom_control *oc)
>       * make sure exclude 0 mask - all other users should have at least
>       * ___GFP_DIRECT_RECLAIM to get here.
>       */
> -    if (oc->gfp_mask && !(oc->gfp_mask & (__GFP_FS|__GFP_NOFAIL)))
> +    if (oc->gfp_mask && !(oc->gfp_mask & (__GFP_FS|__GFP_WANT_OOM_KILLER)))
>          return true;

I have already asked this, but let me ask again: _who_ would use this
flag without risking premature OOM killer invocation?
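(For concreteness only: a caller opting in to such a hypothetical flag,
which does not exist in the tree, would have to look roughly like this:

	/* illustrative sketch; __GFP_WANT_OOM_KILLER is hypothetical */
	item = kmalloc(sizeof(*item),
		       GFP_NOFS | __GFP_NOFAIL | __GFP_WANT_OOM_KILLER);

and it is not clear which call site would want to opt in to that.)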

[...]

>   I believe that __GFP_NOFAIL should not imply invocation of the OOM killer.
>   Therefore, I want to change __GFP_NOFAIL not to invoke the OOM killer.
>   But since currently the OOM killer is not invoked unless either __GFP_FS or
>   __GFP_NOFAIL is specified, changing __GFP_NOFAIL not to invoke the OOM
>   killer exposes e.g. GFP_NOFS | __GFP_NOFAIL users to a risk of livelocking
>   by not invoking the OOM killer. Although I can't prove that this change
>   never causes livelock, I don't want to provide an alternative flag like
>   __GFP_WANT_OOM_KILLER. Therefore, all existing __GFP_NOFAIL users must
>   agree with accepting the risk introduced by this change.

I think you are seriously misled here. First of all, I have gone through
GFP_NOFS | GFP_NOFAIL users and _none_ of them have added the nofail
flag to enforce the OOM killer. Those users just want to express that an
allocation failure is simply not acceptable. Most of them were simply
conversions from the open-coded
	do { } while (!(page = page_alloc(GFP_NOFS)));
loops, which do _not_ invoke the OOM killer. And that is the most
important point here. Why is the above open-coded (and, as you say,
lockup-prone) loop OK while the GFP_NOFAIL variant should behave any
differently?
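To make the comparison concrete, here is a minimal sketch (illustrative
only, not taken from either patch; the helper names are made up) of the
two call-site styles being compared:

	/* open-coded style: loops forever, never reaches the OOM killer */
	static struct page *grab_page_open_coded(void)
	{
		struct page *page;

		do {
			page = alloc_page(GFP_NOFS);
		} while (!page);

		return page;
	}

	/* converted style: the same "must not fail" intent via the flag */
	static struct page *grab_page_nofail(void)
	{
		return alloc_page(GFP_NOFS | __GFP_NOFAIL);
	}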

> and confirm that all existing __GFP_NOFAIL users are willing to accept
> the risk of livelocking by not invoking the OOM killer.
> 
> Unless you do this procedure, I continue:
> 
> Nacked-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

I was hoping for some actual arguments but I am afraid this is circling
in a loop. You are still handwaving with theoretical lockups without any
actual proof that they are real. While I am not saying the risk is not
there, I am also saying that there are other aspects to consider
	- lockups will happen only if there are no other GFP_FS requests
	  which trigger the OOM, which is quite unlikely in most
	  situations
	- triggering oom for GFP_NOFS | GFP_NOFAIL has a non-negligible
	  risk of premature OOM killer invocation for the same reason
	  we do not trigger oom for GFP_NOFS. Even worse, metadata-heavy
	  workloads are much harder to contain, so this might be used as
	  a DoS vector.
	- one of the primary points of GFP_NOFAIL's existence is to
	  prevent open-coded endless loops, because the page allocator
	  can handle most situations more gracefully (e.g. grant access
	  to memory reserves). Having a completely different OOM killer
	  behavior is both confusing and an invitation to abuse. If we
	  have users who definitely need to control the OOM behavior
	  then we should add a gfp flag for them. But that needs a
	  strong use case and consideration of whether there are other
	  options to work around it.

I can add the above to the changelog if you think this is helpful but I
still maintain my position that your "this might cause lockups
theoretically" is unfounded and not justified to block the patch. I will
of course retract this patch if you can demonstrate the issue is real or
that any of my argumentation in the changelog is not correct.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/2] mm, oom: do not enfore OOM killer for __GFP_NOFAIL automatically
  2016-12-11 13:53                 ` Tetsuo Handa
@ 2016-12-12  8:52                   ` Michal Hocko
  0 siblings, 0 replies; 19+ messages in thread
From: Michal Hocko @ 2016-12-12  8:52 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, vbabka, hannes, mgorman, rientjes, linux-mm, linux-kernel

On Sun 11-12-16 22:53:55, Tetsuo Handa wrote:
> Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Thu 08-12-16 21:53:44, Tetsuo Handa wrote:
> > > > If we could agree
> > > > with calling __alloc_pages_nowmark() before out_of_memory() if __GFP_NOFAIL
> > > > is given, we can avoid locking up while minimizing possibility of invoking
> > > > the OOM killer...
> > >
> > > I do not understand. We do __alloc_pages_nowmark even when oom is called
> > > for GFP_NOFAIL.
> > 
> > Where is that? I can find __alloc_pages_nowmark() after out_of_memory()
> > if __GFP_NOFAIL is given, but I can't find __alloc_pages_nowmark() before
> > out_of_memory() if __GFP_NOFAIL is given.
> > 
> > What I mean is below patch folded into
> > "[PATCH 1/2] mm: consolidate GFP_NOFAIL checks in the allocator slowpath".
> > 
> Oops, I implemented the "__alloc_pages_nowmark() before out_of_memory() if
> __GFP_NOFAIL is given" case incorrectly. An updated version is shown below.

If you want to introduce such a change then make sure to justify it
properly in the changelog. I will not comment on this change here
because I believe it is not directly needed for neither of the two
patches.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/2] mm, oom: do not enfore OOM killer for __GFP_NOFAIL automatically
  2016-12-12  8:48                 ` Michal Hocko
@ 2016-12-14 10:34                   ` Michal Hocko
  0 siblings, 0 replies; 19+ messages in thread
From: Michal Hocko @ 2016-12-14 10:34 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, vbabka, hannes, mgorman, rientjes, linux-mm, linux-kernel

On Mon 12-12-16 09:48:37, Michal Hocko wrote:
> On Sun 11-12-16 20:23:47, Tetsuo Handa wrote:
[...]
> >   I believe that __GFP_NOFAIL should not imply invocation of the OOM killer.
> >   Therefore, I want to change __GFP_NOFAIL not to invoke the OOM killer.
> >   But since currently the OOM killer is not invoked unless either __GFP_FS or
> >   __GFP_NOFAIL is specified, changing __GFP_NOFAIL not to invoke the OOM
> >   killer exposes e.g. GFP_NOFS | __GFP_NOFAIL users to a risk of livelocking
> >   by not invoking the OOM killer. Although I can't prove that this change
> >   never causes livelock, I don't want to provide an alternative flag like
> >   __GFP_WANT_OOM_KILLER. Therefore, all existing __GFP_NOFAIL users must
> >   agree with accepting the risk introduced by this change.
> 
> I think you are seriously misled here. First of all, I have gone through
> GFP_NOFS | GFP_NOFAIL users and _none_ of them have added the nofail
> flag to enforce the OOM killer. Those users just want to express that an
> allocation failure is simply not acceptable. Most of them were simply
> conversions from the open-coded
> 	do { } while (!(page = page_alloc(GFP_NOFS)));
> loops, which do _not_ invoke the OOM killer. And that is the most
> important point here. Why is the above open-coded (and, as you say,
> lockup-prone) loop OK while the GFP_NOFAIL variant should behave any
> differently?
> 
> > and confirm that all existing __GFP_NOFAIL users are willing to accept
> > the risk of livelocking by not invoking the OOM killer.
> > 
> > Unless you do this procedure, I continue:
> > 
> > Nacked-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> 
> I was hoping for some actual arguments but I am afraid this is circling
> in a loop. You are still handwaving with theoretical lockups without any
> actual proof that they are real. While I am not saying the risk is not
> there, I am also saying that there are other aspects to consider
> 	- lockups will happen only if there are no other GFP_FS requests
> 	  which trigger the OOM, which is quite unlikely in most
> 	  situations
> 	- triggering oom for GFP_NOFS | GFP_NOFAIL has a non-negligible
> 	  risk of premature OOM killer invocation for the same reason
> 	  we do not trigger oom for GFP_NOFS. Even worse, metadata-heavy
> 	  workloads are much harder to contain, so this might be used as
> 	  a DoS vector.
> 	- one of the primary points of GFP_NOFAIL's existence is to
> 	  prevent open-coded endless loops, because the page allocator
> 	  can handle most situations more gracefully (e.g. grant access
> 	  to memory reserves). Having a completely different OOM killer
> 	  behavior is both confusing and an invitation to abuse. If we
> 	  have users who definitely need to control the OOM behavior
> 	  then we should add a gfp flag for them. But that needs a
> 	  strong use case and consideration of whether there are other
> 	  options to work around it.
> 
> I can add the above to the changelog if you think this is helpful but I
> still maintain my position that your "this might cause lockups
> theoretically" is unfounded and not justified to block the patch. I will
> of course retract this patch if you can demonstrate the issue is real or
> that any of my argumentation in the changelog is not correct.

I was thinking about this some more and realized that there is a
different risk which this patch would introduce and which has to be
considered. Heavy GFP_NOFS | __GFP_NOFAIL users might actually deplete
the memory reserves. This was less of a problem with the current code
because we invoke the oom killer and so at least _some_ memory might be
freed. I will think about it some more, but I guess I will just allow
partial access to the reserves in the no-oom case. I will post patch 1 in
the meantime because I believe it is a reasonable cleanup.
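As a rough illustration of what such partial access could mean (this is
only a sketch of the idea, not a posted patch), the looping __GFP_NOFAIL
path could be handed the ALLOC_HARDER portion of the reserves instead of
ALLOC_NO_WATERMARKS, e.g.:

	/*
	 * Sketch only: in the nopage: handling, let a looping
	 * __GFP_NOFAIL request lower the watermarks rather than ignore
	 * them completely, so heavy GFP_NOFS | __GFP_NOFAIL users
	 * cannot drain the reserves entirely.
	 */
	if (gfp_mask & __GFP_NOFAIL) {
		page = get_page_from_freelist(gfp_mask, order,
				ALLOC_HARDER|ALLOC_CPUSET, ac);
		if (page)
			goto got_pg;

		cond_resched();
		goto retry;
	}

Whether ALLOC_HARDER is the right amount of access is exactly the open
question here.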
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 1/2] mm: consolidate GFP_NOFAIL checks in the allocator slowpath
  2016-12-16 15:58 ` OOM: Better, but still there on Michal Hocko
@ 2016-12-16 15:58   ` Michal Hocko
  0 siblings, 0 replies; 19+ messages in thread
From: Michal Hocko @ 2016-12-16 15:58 UTC (permalink / raw)
  To: Nils Holland
  Cc: linux-kernel, linux-mm, Chris Mason, David Sterba, linux-btrfs,
	Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Tetsuo Handa has pointed out that 0a0337e0d1d1 ("mm, oom: rework oom
detection") has subtly changed the semantics for costly high order
requests with __GFP_NOFAIL and without __GFP_REPEAT, and those can fail
right now.
My code inspection didn't reveal any such users in the tree but it is
true that this might lead to unexpected allocation failures and
subsequent OOPs.

__alloc_pages_slowpath wrt. GFP_NOFAIL is hard to follow currently.
There are a few special cases but we are lacking a catch-all place to be
sure we will not miss any case where a non-failing allocation might
fail. This patch reorganizes the code a bit and puts all those special
cases under the nopage label, which is the generic go-to-fail path.
Non-failing allocations are retried, while those that cannot retry, like
non-sleeping allocations, go to the failure point directly. This should
make the code flow much easier to follow and make it less error prone
for future changes.

While we are there we have to move the stall check up to catch
potentially looping non-failing allocations.

Changes since v1
- do not skip direct reclaim for TIF_MEMDIE && GFP_NOFAIL as per Hillf
- do not skip __alloc_pages_may_oom for TIF_MEMDIE && GFP_NOFAIL as
  per Tetsuo

Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
---
 mm/page_alloc.c | 75 +++++++++++++++++++++++++++++++++------------------------
 1 file changed, 44 insertions(+), 31 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3f2c9e535f7f..095e2fa286de 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3640,35 +3640,21 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		goto got_pg;
 
 	/* Caller is not willing to reclaim, we can't balance anything */
-	if (!can_direct_reclaim) {
-		/*
-		 * All existing users of the __GFP_NOFAIL are blockable, so warn
-		 * of any new users that actually allow this type of allocation
-		 * to fail.
-		 */
-		WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL);
+	if (!can_direct_reclaim)
 		goto nopage;
-	}
 
-	/* Avoid recursion of direct reclaim */
-	if (current->flags & PF_MEMALLOC) {
-		/*
-		 * __GFP_NOFAIL request from this context is rather bizarre
-		 * because we cannot reclaim anything and only can loop waiting
-		 * for somebody to do a work for us.
-		 */
-		if (WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) {
-			cond_resched();
-			goto retry;
-		}
-		goto nopage;
+	/* Make sure we know about allocations which stall for too long */
+	if (time_after(jiffies, alloc_start + stall_timeout)) {
+		warn_alloc(gfp_mask,
+			"page allocation stalls for %ums, order:%u",
+			jiffies_to_msecs(jiffies-alloc_start), order);
+		stall_timeout += 10 * HZ;
 	}
 
-	/* Avoid allocations with no watermarks from looping endlessly */
-	if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOFAIL))
+	/* Avoid recursion of direct reclaim */
+	if (current->flags & PF_MEMALLOC)
 		goto nopage;
 
-
 	/* Try direct reclaim and then allocating */
 	page = __alloc_pages_direct_reclaim(gfp_mask, order, alloc_flags, ac,
 							&did_some_progress);
@@ -3692,14 +3678,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT))
 		goto nopage;
 
-	/* Make sure we know about allocations which stall for too long */
-	if (time_after(jiffies, alloc_start + stall_timeout)) {
-		warn_alloc(gfp_mask,
-			"page allocation stalls for %ums, order:%u",
-			jiffies_to_msecs(jiffies-alloc_start), order);
-		stall_timeout += 10 * HZ;
-	}
-
 	if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
 				 did_some_progress > 0, &no_progress_loops))
 		goto retry;
@@ -3721,6 +3699,10 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	if (page)
 		goto got_pg;
 
+	/* Avoid allocations with no watermarks from looping endlessly */
+	if (test_thread_flag(TIF_MEMDIE))
+		goto nopage;
+
 	/* Retry as long as the OOM killer is making progress */
 	if (did_some_progress) {
 		no_progress_loops = 0;
@@ -3728,6 +3710,37 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	}
 
 nopage:
+	/*
+	 * Make sure that __GFP_NOFAIL request doesn't leak out and make sure
+	 * we always retry
+	 */
+	if (gfp_mask & __GFP_NOFAIL) {
+		/*
+		 * All existing users of the __GFP_NOFAIL are blockable, so warn
+		 * of any new users that actually require GFP_NOWAIT
+		 */
+		if (WARN_ON_ONCE(!can_direct_reclaim))
+			goto fail;
+
+		/*
+		 * PF_MEMALLOC request from this context is rather bizarre
+		 * because we cannot reclaim anything and only can loop waiting
+		 * for somebody to do a work for us
+		 */
+		WARN_ON_ONCE(current->flags & PF_MEMALLOC);
+
+		/*
+		 * non failing costly orders are a hard requirement which we
+		 * are not prepared for much so let's warn about these users
+		 * so that we can identify them and convert them to something
+		 * else.
+		 */
+		WARN_ON_ONCE(order > PAGE_ALLOC_COSTLY_ORDER);
+
+		cond_resched();
+		goto retry;
+	}
+fail:
 	warn_alloc(gfp_mask,
 			"page allocation failure: order:%u", order);
 got_pg:
-- 
2.10.2

^ permalink raw reply related	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2016-12-16 15:58 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-12-01 15:25 [PATCH 0/2] GFP_NOFAIL cleanups Michal Hocko
2016-12-01 15:25 ` [PATCH 1/2] mm: consolidate GFP_NOFAIL checks in the allocator slowpath Michal Hocko
2016-12-01 15:25 ` [PATCH 2/2] mm, oom: do not enfore OOM killer for __GFP_NOFAIL automatically Michal Hocko
2016-12-02  7:23   ` Vlastimil Babka
2016-12-05 13:45   ` Tetsuo Handa
2016-12-05 14:10     ` Michal Hocko
2016-12-06  8:27       ` Michal Hocko
2016-12-06 10:38       ` Tetsuo Handa
2016-12-06 11:03         ` Vlastimil Babka
2016-12-06 19:25           ` Michal Hocko
2016-12-06 19:22         ` Michal Hocko
2016-12-08 12:53           ` Tetsuo Handa
2016-12-08 13:47             ` Michal Hocko
2016-12-11 11:23               ` Tetsuo Handa
2016-12-11 13:53                 ` Tetsuo Handa
2016-12-12  8:52                   ` Michal Hocko
2016-12-12  8:48                 ` Michal Hocko
2016-12-14 10:34                   ` Michal Hocko
2016-12-16  7:39 OOM: Better, but still there on 4.9 Michal Hocko
2016-12-16 15:58 ` OOM: Better, but still there on Michal Hocko
2016-12-16 15:58   ` [PATCH 1/2] mm: consolidate GFP_NOFAIL checks in the allocator slowpath Michal Hocko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).