* [RFC 0/2] GFP_NOFAIL cleanups
@ 2016-11-23  6:49 ` Michal Hocko
  0 siblings, 0 replies; 24+ messages in thread
From: Michal Hocko @ 2016-11-23  6:49 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, Vlastimil Babka, David Rientjes, Johannes Weiner,
	Mel Gorman, Andrew Morton, LKML

Hi,
Tetsuo has noticed [1] that recent changes have altered the GFP_NOFAIL
semantics for costly order requests. I believe the primary reason this
happened is that our GFP_NOFAIL checks are too scattered, which makes it
really easy to forget to add one. That is why I am proposing patch 1,
which consolidates all the nofail handling in a single place. This
should make the code easier to maintain.

Patch 2 on top is a further attempt to make the GFP_NOFAIL semantics less
surprising. As things stand, __GFP_NOFAIL currently overrides the OOM
killer prevention code, which is both subtle and not really needed. Patch
2 has more details about the issues this might cause.

I would consider both patches more a cleanup than anything else. Any
feedback is highly appreciated.

[1] http://lkml.kernel.org/r/1479387004-5998-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp

* [RFC 1/2] mm: consolidate GFP_NOFAIL checks in the allocator slowpath
  2016-11-23  6:49 ` Michal Hocko
@ 2016-11-23  6:49   ` Michal Hocko
  -1 siblings, 0 replies; 24+ messages in thread
From: Michal Hocko @ 2016-11-23  6:49 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, Vlastimil Babka, David Rientjes, Johannes Weiner,
	Mel Gorman, Andrew Morton, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Tetsuo Handa has pointed out that 0a0337e0d1d1 ("mm, oom: rework oom
detection") has subtly changed the semantics of costly high order requests
with __GFP_NOFAIL and without __GFP_REPEAT, so such requests can now fail.
My code inspection did not reveal any such users in the tree, but it is
true that this might lead to unexpected allocation failures and
subsequent oopses.

The __GFP_NOFAIL handling in __alloc_pages_slowpath is currently hard to
follow. There are a few special cases, but we lack a catch-all place to
make sure we do not miss any case where a non-failing allocation might
fail. This patch reorganizes the code a bit and puts all those special
cases under the nopage label, which is the generic go-to-fail path.
Non-failing allocations are retried from there, while those that cannot
retry, like non-sleeping allocations, go to the failure point directly.
This should make the code flow much easier to follow and less error
prone for future changes.

While we are at it, the stall check has to be moved up so that it also
catches potentially looping non-failing allocations.
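
As an illustration (the struct and function below are hypothetical and not
part of this patch), callers of __GFP_NOFAIL traditionally do not check the
return value at all, so any slowpath change that lets such an allocation
fail turns directly into a NULL pointer dereference:

#include <linux/list.h>
#include <linux/slab.h>

struct example_record {
	struct list_head node;
	int payload;
};

/* Hypothetical caller relying on the __GFP_NOFAIL contract. */
static void example_add_record(struct list_head *head, int payload)
{
	struct example_record *rec;

	/* order-0, sleepable, must not fail */
	rec = kmalloc(sizeof(*rec), GFP_KERNEL | __GFP_NOFAIL);

	/* No NULL check on purpose: a failing "nofail" allocation oopses here. */
	rec->payload = payload;
	list_add(&rec->node, head);
}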

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/page_alloc.c | 68 ++++++++++++++++++++++++++++++++++-----------------------
 1 file changed, 41 insertions(+), 27 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0fbfead6aa7d..76c0b6bb0baf 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3627,32 +3627,23 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		goto got_pg;
 
 	/* Caller is not willing to reclaim, we can't balance anything */
-	if (!can_direct_reclaim) {
-		/*
-		 * All existing users of the __GFP_NOFAIL are blockable, so warn
-		 * of any new users that actually allow this type of allocation
-		 * to fail.
-		 */
-		WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL);
+	if (!can_direct_reclaim)
 		goto nopage;
+
+	/* Make sure we know about allocations which stall for too long */
+	if (time_after(jiffies, alloc_start + stall_timeout)) {
+		warn_alloc(gfp_mask,
+			"page alloction stalls for %ums, order:%u",
+			jiffies_to_msecs(jiffies-alloc_start), order);
+		stall_timeout += 10 * HZ;
 	}
 
 	/* Avoid recursion of direct reclaim */
-	if (current->flags & PF_MEMALLOC) {
-		/*
-		 * __GFP_NOFAIL request from this context is rather bizarre
-		 * because we cannot reclaim anything and only can loop waiting
-		 * for somebody to do a work for us.
-		 */
-		if (WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) {
-			cond_resched();
-			goto retry;
-		}
+	if (current->flags & PF_MEMALLOC)
 		goto nopage;
-	}
 
 	/* Avoid allocations with no watermarks from looping endlessly */
-	if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOFAIL))
+	if (test_thread_flag(TIF_MEMDIE))
 		goto nopage;
 
 
@@ -3679,14 +3670,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT))
 		goto nopage;
 
-	/* Make sure we know about allocations which stall for too long */
-	if (time_after(jiffies, alloc_start + stall_timeout)) {
-		warn_alloc(gfp_mask,
-			"page alloction stalls for %ums, order:%u",
-			jiffies_to_msecs(jiffies-alloc_start), order);
-		stall_timeout += 10 * HZ;
-	}
-
 	if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
 				 did_some_progress > 0, &no_progress_loops))
 		goto retry;
@@ -3715,6 +3698,37 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	}
 
 nopage:
+	/*
+	 * Make sure that __GFP_NOFAIL request doesn't leak out and make sure
+	 * we always retry
+	 */
+	if (gfp_mask & __GFP_NOFAIL) {
+		/*
+		 * All existing users of the __GFP_NOFAIL are blockable, so warn
+		 * of any new users that actually require GFP_NOWAIT
+		 */
+		if (WARN_ON_ONCE(!can_direct_reclaim))
+			goto fail;
+
+		/*
+		 * PF_MEMALLOC request from this context is rather bizarre
+		 * because we cannot reclaim anything and only can loop waiting
+		 * for somebody to do a work for us
+		 */
+		WARN_ON_ONCE(current->flags & PF_MEMALLOC);
+
+		/*
+		 * non failing costly orders are a hard requirement which we
+		 * are not prepared for much so let's warn about these users
+		 * so that we can identify them and convert them to something
+		 * else.
+		 */
+		WARN_ON_ONCE(order > PAGE_ALLOC_COSTLY_ORDER);
+
+		cond_resched();
+		goto retry;
+	}
+fail:
 	warn_alloc(gfp_mask,
 			"page allocation failure: order:%u", order);
 got_pg:
-- 
2.10.2

* [RFC 2/2] mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically
  2016-11-23  6:49 ` Michal Hocko
@ 2016-11-23  6:49   ` Michal Hocko
  -1 siblings, 0 replies; 24+ messages in thread
From: Michal Hocko @ 2016-11-23  6:49 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, Vlastimil Babka, David Rientjes, Johannes Weiner,
	Mel Gorman, Andrew Morton, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

__alloc_pages_may_oom makes sure to skip the OOM killer depending on
the allocation request. This includes lowmem requests, costly high
order requests and others. For a long time __GFP_NOFAIL acted as an
override for all those rules. This is not documented and it can be quite
surprising as well. E.g. GFP_NOFS requests do not invoke the OOM
killer, but GFP_NOFS|__GFP_NOFAIL does, so if we try to convert some of
the existing open coded loops around the allocator to nofail requests (and
we have done that in the past) then such a change would have a non-trivial
side effect which is not obvious. Note that the primary motivation for
skipping the OOM killer is to prevent premature invocation.

The exception was added by 82553a937f12 ("oom: invoke oom killer
for __GFP_NOFAIL"). The changelog points out that the OOM killer has to
be invoked, otherwise the request would loop forever. But this
argument is rather weak because the OOM killer does not really guarantee
any forward progress for those exceptional cases - e.g. it will hardly
help to form a costly order page. I believe we certainly do not want to
kill all processes and eventually panic the system just because there is
a nasty driver asking for an order-9 page with __GFP_NOFAIL without
realizing all the consequences - it is much better for such a request to
loop forever than to cause massive system disruption. Lowmem is also
highly unlikely to be freed by the OOM killer, and a GFP_NOFS request
could trigger while there is still a lot of memory pinned by filesystems.

This patch simply removes the __GFP_NOFAIL special case in order to have
clearer semantics without surprising side effects. Instead we allow
nofail requests to access memory reserves to move forward, both when the
OOM killer is invoked and when it should be suppressed. The
__alloc_pages_nowmark helper has been introduced for that purpose.
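
For illustration only (example_fs_get_block below is a made-up helper, not
taken from any filesystem): with the old behaviour, adding __GFP_NOFAIL to
a GFP_NOFS request silently re-enabled the OOM killer; with this patch the
request still never fails, but relies on retries plus access to memory
reserves instead of on killing tasks.

#include <linux/slab.h>

/*
 * Hypothetical filesystem-style allocation. GFP_NOFS alone skips the OOM
 * killer; before this patch or-ing in __GFP_NOFAIL overrode that, which is
 * the surprising side effect described above.
 */
static void *example_fs_get_block(size_t size)
{
	return kmalloc(size, GFP_NOFS | __GFP_NOFAIL);
}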

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c   |  2 +-
 mm/page_alloc.c | 95 +++++++++++++++++++++++++++++++++++----------------------
 2 files changed, 59 insertions(+), 38 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index ec9f11d4f094..12a6fce85f61 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1013,7 +1013,7 @@ bool out_of_memory(struct oom_control *oc)
 	 * make sure exclude 0 mask - all other users should have at least
 	 * ___GFP_DIRECT_RECLAIM to get here.
 	 */
-	if (oc->gfp_mask && !(oc->gfp_mask & (__GFP_FS|__GFP_NOFAIL)))
+	if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS))
 		return true;
 
 	/*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 76c0b6bb0baf..7102641147c4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3044,6 +3044,25 @@ void warn_alloc(gfp_t gfp_mask, const char *fmt, ...)
 }
 
 static inline struct page *
+__alloc_pages_nowmark(gfp_t gfp_mask, unsigned int order,
+						const struct alloc_context *ac)
+{
+	struct page *page;
+
+	page = get_page_from_freelist(gfp_mask, order,
+			ALLOC_NO_WATERMARKS|ALLOC_CPUSET, ac);
+	/*
+	 * fallback to ignore cpuset restriction if our nodes
+	 * are depleted
+	 */
+	if (!page)
+		page = get_page_from_freelist(gfp_mask, order,
+				ALLOC_NO_WATERMARKS, ac);
+
+	return page;
+}
+
+static inline struct page *
 __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	const struct alloc_context *ac, unsigned long *did_some_progress)
 {
@@ -3078,47 +3097,41 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	if (page)
 		goto out;
 
-	if (!(gfp_mask & __GFP_NOFAIL)) {
-		/* Coredumps can quickly deplete all memory reserves */
-		if (current->flags & PF_DUMPCORE)
-			goto out;
-		/* The OOM killer will not help higher order allocs */
-		if (order > PAGE_ALLOC_COSTLY_ORDER)
-			goto out;
-		/* The OOM killer does not needlessly kill tasks for lowmem */
-		if (ac->high_zoneidx < ZONE_NORMAL)
-			goto out;
-		if (pm_suspended_storage())
-			goto out;
-		/*
-		 * XXX: GFP_NOFS allocations should rather fail than rely on
-		 * other request to make a forward progress.
-		 * We are in an unfortunate situation where out_of_memory cannot
-		 * do much for this context but let's try it to at least get
-		 * access to memory reserved if the current task is killed (see
-		 * out_of_memory). Once filesystems are ready to handle allocation
-		 * failures more gracefully we should just bail out here.
-		 */
+	/* Coredumps can quickly deplete all memory reserves */
+	if (current->flags & PF_DUMPCORE)
+		goto out;
+	/* The OOM killer will not help higher order allocs */
+	if (order > PAGE_ALLOC_COSTLY_ORDER)
+		goto out;
+	/* The OOM killer does not needlessly kill tasks for lowmem */
+	if (ac->high_zoneidx < ZONE_NORMAL)
+		goto out;
+	if (pm_suspended_storage())
+		goto out;
+	/*
+	 * XXX: GFP_NOFS allocations should rather fail than rely on
+	 * other request to make a forward progress.
+	 * We are in an unfortunate situation where out_of_memory cannot
+	 * do much for this context but let's try it to at least get
+	 * access to memory reserved if the current task is killed (see
+	 * out_of_memory). Once filesystems are ready to handle allocation
+	 * failures more gracefully we should just bail out here.
+	 */
+
+	/* The OOM killer may not free memory on a specific node */
+	if (gfp_mask & __GFP_THISNODE)
+		goto out;
 
-		/* The OOM killer may not free memory on a specific node */
-		if (gfp_mask & __GFP_THISNODE)
-			goto out;
-	}
 	/* Exhausted what can be done so it's blamo time */
-	if (out_of_memory(&oc) || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) {
+	if (out_of_memory(&oc)) {
 		*did_some_progress = 1;
 
-		if (gfp_mask & __GFP_NOFAIL) {
-			page = get_page_from_freelist(gfp_mask, order,
-					ALLOC_NO_WATERMARKS|ALLOC_CPUSET, ac);
-			/*
-			 * fallback to ignore cpuset restriction if our nodes
-			 * are depleted
-			 */
-			if (!page)
-				page = get_page_from_freelist(gfp_mask, order,
-					ALLOC_NO_WATERMARKS, ac);
-		}
+		/*
+		 * Help non-failing allocations by giving them access to memory
+		 * reserves
+		 */
+		if (gfp_mask & __GFP_NOFAIL)
+			page = __alloc_pages_nowmark(gfp_mask, order, ac);
 	}
 out:
 	mutex_unlock(&oom_lock);
@@ -3725,6 +3738,14 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		 */
 		WARN_ON_ONCE(order > PAGE_ALLOC_COSTLY_ORDER);
 
+		/*
+		 * Help non-failing allocations by giving them access to memory
+		 * reserves
+		 */
+		page = __alloc_pages_nowmark(gfp_mask, order, ac);
+		if (page)
+			goto got_pg;
+
 		cond_resched();
 		goto retry;
 	}
-- 
2.10.2

* Re: [RFC 1/2] mm: consolidate GFP_NOFAIL checks in the allocator slowpath
  2016-11-23  6:49   ` Michal Hocko
@ 2016-11-23 10:43     ` Vlastimil Babka
  -1 siblings, 0 replies; 24+ messages in thread
From: Vlastimil Babka @ 2016-11-23 10:43 UTC (permalink / raw)
  To: Michal Hocko, linux-mm
  Cc: Tetsuo Handa, David Rientjes, Johannes Weiner, Mel Gorman,
	Andrew Morton, LKML, Michal Hocko

On 11/23/2016 07:49 AM, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
>
> Tetsuo Handa has pointed out that 0a0337e0d1d1 ("mm, oom: rework oom
> detection") has subtly changed the semantics of costly high order requests
> with __GFP_NOFAIL and without __GFP_REPEAT, so such requests can now fail.
> My code inspection did not reveal any such users in the tree, but it is
> true that this might lead to unexpected allocation failures and
> subsequent oopses.
>
> The __GFP_NOFAIL handling in __alloc_pages_slowpath is currently hard to
> follow. There are a few special cases, but we lack a catch-all place to
> make sure we do not miss any case where a non-failing allocation might
> fail. This patch reorganizes the code a bit and puts all those special
> cases under the nopage label, which is the generic go-to-fail path.
> Non-failing allocations are retried from there, while those that cannot
> retry, like non-sleeping allocations, go to the failure point directly.
> This should make the code flow much easier to follow and less error
> prone for future changes.
>
> While we are at it, the stall check has to be moved up so that it also
> catches potentially looping non-failing allocations.
>
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Yeah, that's much better than the current state.

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  mm/page_alloc.c | 68 ++++++++++++++++++++++++++++++++++-----------------------
>  1 file changed, 41 insertions(+), 27 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 0fbfead6aa7d..76c0b6bb0baf 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3627,32 +3627,23 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  		goto got_pg;
>
>  	/* Caller is not willing to reclaim, we can't balance anything */
> -	if (!can_direct_reclaim) {
> -		/*
> -		 * All existing users of the __GFP_NOFAIL are blockable, so warn
> -		 * of any new users that actually allow this type of allocation
> -		 * to fail.
> -		 */
> -		WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL);
> +	if (!can_direct_reclaim)
>  		goto nopage;
> +
> +	/* Make sure we know about allocations which stall for too long */
> +	if (time_after(jiffies, alloc_start + stall_timeout)) {
> +		warn_alloc(gfp_mask,
> +			"page alloction stalls for %ums, order:%u",
> +			jiffies_to_msecs(jiffies-alloc_start), order);
> +		stall_timeout += 10 * HZ;
>  	}
>
>  	/* Avoid recursion of direct reclaim */
> -	if (current->flags & PF_MEMALLOC) {
> -		/*
> -		 * __GFP_NOFAIL request from this context is rather bizarre
> -		 * because we cannot reclaim anything and only can loop waiting
> -		 * for somebody to do a work for us.
> -		 */
> -		if (WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) {
> -			cond_resched();
> -			goto retry;
> -		}
> +	if (current->flags & PF_MEMALLOC)
>  		goto nopage;
> -	}
>
>  	/* Avoid allocations with no watermarks from looping endlessly */
> -	if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOFAIL))
> +	if (test_thread_flag(TIF_MEMDIE))
>  		goto nopage;
>
>
> @@ -3679,14 +3670,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT))
>  		goto nopage;
>
> -	/* Make sure we know about allocations which stall for too long */
> -	if (time_after(jiffies, alloc_start + stall_timeout)) {
> -		warn_alloc(gfp_mask,
> -			"page alloction stalls for %ums, order:%u",
> -			jiffies_to_msecs(jiffies-alloc_start), order);
> -		stall_timeout += 10 * HZ;
> -	}
> -
>  	if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
>  				 did_some_progress > 0, &no_progress_loops))
>  		goto retry;
> @@ -3715,6 +3698,37 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	}
>
>  nopage:
> +	/*
> +	 * Make sure that __GFP_NOFAIL request doesn't leak out and make sure
> +	 * we always retry
> +	 */
> +	if (gfp_mask & __GFP_NOFAIL) {
> +		/*
> +		 * All existing users of the __GFP_NOFAIL are blockable, so warn
> +		 * of any new users that actually require GFP_NOWAIT
> +		 */
> +		if (WARN_ON_ONCE(!can_direct_reclaim))
> +			goto fail;
> +
> +		/*
> +		 * PF_MEMALLOC request from this context is rather bizarre
> +		 * because we cannot reclaim anything and only can loop waiting
> +		 * for somebody to do a work for us
> +		 */
> +		WARN_ON_ONCE(current->flags & PF_MEMALLOC);
> +
> +		/*
> +		 * non failing costly orders are a hard requirement which we
> +		 * are not prepared for much so let's warn about these users
> +		 * so that we can identify them and convert them to something
> +		 * else.
> +		 */
> +		WARN_ON_ONCE(order > PAGE_ALLOC_COSTLY_ORDER);
> +
> +		cond_resched();
> +		goto retry;
> +	}
> +fail:
>  	warn_alloc(gfp_mask,
>  			"page allocation failure: order:%u", order);
>  got_pg:
>

* Re: [RFC 2/2] mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically
  2016-11-23  6:49   ` Michal Hocko
@ 2016-11-23 12:19     ` Vlastimil Babka
  -1 siblings, 0 replies; 24+ messages in thread
From: Vlastimil Babka @ 2016-11-23 12:19 UTC (permalink / raw)
  To: Michal Hocko, linux-mm
  Cc: Tetsuo Handa, David Rientjes, Johannes Weiner, Mel Gorman,
	Andrew Morton, LKML, Michal Hocko

On 11/23/2016 07:49 AM, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
>
> __alloc_pages_may_oom makes sure to skip the OOM killer depending on
> the allocation request. This includes lowmem requests, costly high
> order requests and others. For a long time __GFP_NOFAIL acted as an
> override for all those rules. This is not documented and it can be quite
> surprising as well. E.g. GFP_NOFS requests do not invoke the OOM
> killer, but GFP_NOFS|__GFP_NOFAIL does, so if we try to convert some of
> the existing open coded loops around the allocator to nofail requests (and
> we have done that in the past) then such a change would have a non-trivial
> side effect which is not obvious. Note that the primary motivation for
> skipping the OOM killer is to prevent premature invocation.
>
> The exception was added by 82553a937f12 ("oom: invoke oom killer
> for __GFP_NOFAIL"). The changelog points out that the OOM killer has to
> be invoked, otherwise the request would loop forever. But this
> argument is rather weak because the OOM killer does not really guarantee
> any forward progress for those exceptional cases - e.g. it will hardly
> help to form a costly order page. I believe we certainly do not want to
> kill all processes and eventually panic the system just because there is
> a nasty driver asking for an order-9 page with __GFP_NOFAIL without
> realizing all the consequences - it is much better for such a request to
> loop forever than to cause massive system disruption. Lowmem is also
> highly unlikely to be freed by the OOM killer, and a GFP_NOFS request
> could trigger while there is still a lot of memory pinned by filesystems.
>
> This patch simply removes the __GFP_NOFAIL special case in order to have
> clearer semantics without surprising side effects. Instead we allow
> nofail requests to access memory reserves to move forward, both when the
> OOM killer is invoked and when it should be suppressed. The
> __alloc_pages_nowmark helper has been introduced for that purpose.
>
> Signed-off-by: Michal Hocko <mhocko@suse.com>

This makes some sense to me, but there might be unpleasant consequences, 
e.g. due to allowing costly allocations without reserves. I guess only 
testing will show...

Also some comments below.

> ---
>  mm/oom_kill.c   |  2 +-
>  mm/page_alloc.c | 95 +++++++++++++++++++++++++++++++++++----------------------
>  2 files changed, 59 insertions(+), 38 deletions(-)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index ec9f11d4f094..12a6fce85f61 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -1013,7 +1013,7 @@ bool out_of_memory(struct oom_control *oc)
>  	 * make sure exclude 0 mask - all other users should have at least
>  	 * ___GFP_DIRECT_RECLAIM to get here.
>  	 */
> -	if (oc->gfp_mask && !(oc->gfp_mask & (__GFP_FS|__GFP_NOFAIL)))
> +	if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS))
>  		return true;
>
>  	/*
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 76c0b6bb0baf..7102641147c4 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3044,6 +3044,25 @@ void warn_alloc(gfp_t gfp_mask, const char *fmt, ...)
>  }
>
>  static inline struct page *
> +__alloc_pages_nowmark(gfp_t gfp_mask, unsigned int order,
> +						const struct alloc_context *ac)
> +{
> +	struct page *page;
> +
> +	page = get_page_from_freelist(gfp_mask, order,
> +			ALLOC_NO_WATERMARKS|ALLOC_CPUSET, ac);
> +	/*
> +	 * fallback to ignore cpuset restriction if our nodes
> +	 * are depleted
> +	 */
> +	if (!page)
> +		page = get_page_from_freelist(gfp_mask, order,
> +				ALLOC_NO_WATERMARKS, ac);

Is this enough? Look at what __alloc_pages_slowpath() does since 
e46e7b77c909 ("mm, page_alloc: recalculate the preferred zoneref if the 
context can ignore memory policies").

...

> -	}
>  	/* Exhausted what can be done so it's blamo time */
> -	if (out_of_memory(&oc) || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) {
> +	if (out_of_memory(&oc)) {

This removes the warning, but also the check for __GFP_NOFAIL itself. 
Was it what you wanted?

>  		*did_some_progress = 1;
>
> -		if (gfp_mask & __GFP_NOFAIL) {
> -			page = get_page_from_freelist(gfp_mask, order,
> -					ALLOC_NO_WATERMARKS|ALLOC_CPUSET, ac);
> -			/*
> -			 * fallback to ignore cpuset restriction if our nodes
> -			 * are depleted
> -			 */
> -			if (!page)
> -				page = get_page_from_freelist(gfp_mask, order,
> -					ALLOC_NO_WATERMARKS, ac);
> -		}
> +		/*
> +		 * Help non-failing allocations by giving them access to memory
> +		 * reserves
> +		 */
> +		if (gfp_mask & __GFP_NOFAIL)
> +			page = __alloc_pages_nowmark(gfp_mask, order, ac);
>  	}
>  out:
>  	mutex_unlock(&oom_lock);
> @@ -3725,6 +3738,14 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  		 */
>  		WARN_ON_ONCE(order > PAGE_ALLOC_COSTLY_ORDER);
>
> +		/*
> +		 * Help non-failing allocations by giving them access to memory
> +		 * reserves
> +		 */
> +		page = __alloc_pages_nowmark(gfp_mask, order, ac);
> +		if (page)
> +			goto got_pg;
> +
>  		cond_resched();
>  		goto retry;
>  	}
>

* Re: [RFC 2/2] mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically
  2016-11-23 12:19     ` Vlastimil Babka
@ 2016-11-23 12:35       ` Michal Hocko
  -1 siblings, 0 replies; 24+ messages in thread
From: Michal Hocko @ 2016-11-23 12:35 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Johannes Weiner,
	Mel Gorman, Andrew Morton, LKML

On Wed 23-11-16 13:19:20, Vlastimil Babka wrote:
> On 11/23/2016 07:49 AM, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > __alloc_pages_may_oom makes sure to skip the OOM killer depending on
> > the allocation request. This includes lowmem requests, costly high
> > order requests and others. For a long time __GFP_NOFAIL acted as an
> > override for all those rules. This is not documented and it can be quite
> > surprising as well. E.g. GFP_NOFS requests do not invoke the OOM
> > killer, but GFP_NOFS|__GFP_NOFAIL does, so if we try to convert some of
> > the existing open coded loops around the allocator to nofail requests (and
> > we have done that in the past) then such a change would have a non-trivial
> > side effect which is not obvious. Note that the primary motivation for
> > skipping the OOM killer is to prevent premature invocation.
> >
> > The exception was added by 82553a937f12 ("oom: invoke oom killer
> > for __GFP_NOFAIL"). The changelog points out that the OOM killer has to
> > be invoked, otherwise the request would loop forever. But this
> > argument is rather weak because the OOM killer does not really guarantee
> > any forward progress for those exceptional cases - e.g. it will hardly
> > help to form a costly order page. I believe we certainly do not want to
> > kill all processes and eventually panic the system just because there is
> > a nasty driver asking for an order-9 page with __GFP_NOFAIL without
> > realizing all the consequences - it is much better for such a request to
> > loop forever than to cause massive system disruption. Lowmem is also
> > highly unlikely to be freed by the OOM killer, and a GFP_NOFS request
> > could trigger while there is still a lot of memory pinned by filesystems.
> >
> > This patch simply removes the __GFP_NOFAIL special case in order to have
> > clearer semantics without surprising side effects. Instead we allow
> > nofail requests to access memory reserves to move forward, both when the
> > OOM killer is invoked and when it should be suppressed. The
> > __alloc_pages_nowmark helper has been introduced for that purpose.
> > 
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> 
> This makes some sense to me, but there might be unpleasant consequences,
> e.g. due to allowing costly allocations without reserves.

I am not sure I understand. Did you mean with reserves? Anyway, my code
inspection showed that we are not really doing GFP_NOFAIL for costly
orders. This might change in the future, but even if we do that, this
shouldn't add a risk of depleting the reserves, right?

> I guess only testing will show...
> 
> Also some comments below.
[...]
> >  static inline struct page *
> > +__alloc_pages_nowmark(gfp_t gfp_mask, unsigned int order,
> > +						const struct alloc_context *ac)
> > +{
> > +	struct page *page;
> > +
> > +	page = get_page_from_freelist(gfp_mask, order,
> > +			ALLOC_NO_WATERMARKS|ALLOC_CPUSET, ac);
> > +	/*
> > +	 * fallback to ignore cpuset restriction if our nodes
> > +	 * are depleted
> > +	 */
> > +	if (!page)
> > +		page = get_page_from_freelist(gfp_mask, order,
> > +				ALLOC_NO_WATERMARKS, ac);
> 
> Is this enough? Look at what __alloc_pages_slowpath() does since
> e46e7b77c909 ("mm, page_alloc: recalculate the preferred zoneref if the
> context can ignore memory policies").

This is a one-time attempt to do the nowmark allocation. If we need to
do the recalculation then this should happen in the next round. Or am I
missing your question?

> 
> ...
> 
> > -	}
> >  	/* Exhausted what can be done so it's blamo time */
> > -	if (out_of_memory(&oc) || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) {
> > +	if (out_of_memory(&oc)) {
> 
> This removes the warning, but also the check for __GFP_NOFAIL itself. Was it
> what you wanted?

The point of the check was to keep looping for __GFP_NOFAIL requests
even when the OOM killer is disabled (out_of_memory returns false). We
are accomplishing that by 
> 
> >  		*did_some_progress = 1;
		^^^^ this
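
Schematically, the looping this refers to works like the following
(a simplified sketch of the slowpath retry handling, not verbatim code):

	/*
	 * Whenever __alloc_pages_may_oom reports progress, the slowpath
	 * resets its no-progress counter and retries rather than falling
	 * through to the failure path.
	 */
	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
	if (page)
		goto got_pg;

	if (did_some_progress) {
		no_progress_loops = 0;
		goto retry;
	}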

It is true we will not have the warning, but I am not really sure we care
all that much. In any case it wouldn't be all that hard to check for the
OOM killer being disabled and warn in the allocator slow path.

thanks for having a look!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC 2/2] mm, oom: do not enfore OOM killer for __GFP_NOFAIL automatically
@ 2016-11-23 12:35       ` Michal Hocko
  0 siblings, 0 replies; 24+ messages in thread
From: Michal Hocko @ 2016-11-23 12:35 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Johannes Weiner,
	Mel Gorman, Andrew Morton, LKML

On Wed 23-11-16 13:19:20, Vlastimil Babka wrote:
> On 11/23/2016 07:49 AM, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > __alloc_pages_may_oom makes sure to skip the OOM killer depending on
> > the allocation request. This includes lowmem requests, costly high
> > order requests and others. For a long time __GFP_NOFAIL acted as an
> > override for all those rules. This is not documented and it can be quite
> > surprising as well. E.g. GFP_NOFS requests are not invoking the OOM
> > killer but GFP_NOFS|__GFP_NOFAIL does so if we try to convert some of
> > the existing open coded loops around allocator to nofail request (and we
> > have done that in the past) then such a change would have a non trivial
> > side effect which is not obvious. Note that the primary motivation for
> > skipping the OOM killer is to prevent from pre-mature invocation.
> > 
> > The exception has been added by 82553a937f12 ("oom: invoke oom killer
> > for __GFP_NOFAIL"). The changelog points out that the oom killer has to
> > be invoked otherwise the request would be looping for ever. But this
> > argument is rather weak because the OOM killer doesn't really guarantee
> > any forward progress for those exceptional cases - e.g. it will hardly
> > help to form costly order - I believe we certainly do not want to kill
> > all processes and eventually panic the system just because there is a
> > nasty driver asking for order-9 page with GFP_NOFAIL not realizing all
> > the consequences - it is much better this request would loop for ever
> > than the massive system disruption, lowmem is also highly unlikely to be
> > freed during OOM killer and GFP_NOFS request could trigger while there
> > is still a lot of memory pinned by filesystems.
> > 
> > This patch simply removes the __GFP_NOFAIL special case in order to have
> > a more clear semantic without surprising side effects. Instead we do
> > allow nofail requests to access memory reserves to move forward in both
> > cases when the OOM killer is invoked and when it should be supressed.
> > __alloc_pages_nowmark helper has been introduced for that purpose.
> > 
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> 
> This makes some sense to me, but there might be unpleasant consequences,
> e.g. due to allowing costly allocations without reserves.

I am not sure I understand. Did you mean with reserves? Anyway, my code
inspection shown that we are not really doing GFP_NOFAIL for costly
orders. This might change in the future but even if we do that then this
shouldn't add a risk of the reserves depletion, right?

> I guess only testing will show...
> 
> Also some comments below.
[...]
> >  static inline struct page *
> > +__alloc_pages_nowmark(gfp_t gfp_mask, unsigned int order,
> > +						const struct alloc_context *ac)
> > +{
> > +	struct page *page;
> > +
> > +	page = get_page_from_freelist(gfp_mask, order,
> > +			ALLOC_NO_WATERMARKS|ALLOC_CPUSET, ac);
> > +	/*
> > +	 * fallback to ignore cpuset restriction if our nodes
> > +	 * are depleted
> > +	 */
> > +	if (!page)
> > +		page = get_page_from_freelist(gfp_mask, order,
> > +				ALLOC_NO_WATERMARKS, ac);
> 
> Is this enough? Look at what __alloc_pages_slowpath() does since
> e46e7b77c909 ("mm, page_alloc: recalculate the preferred zoneref if the
> context can ignore memory policies").

this is a one time attempt to do the nowmark allocation. If we need to
do the recalculation then this should happen in the next round. Or am I
missing your question?

> 
> ...
> 
> > -	}
> >  	/* Exhausted what can be done so it's blamo time */
> > -	if (out_of_memory(&oc) || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) {
> > +	if (out_of_memory(&oc)) {
> 
> This removes the warning, but also the check for __GFP_NOFAIL itself. Was it
> what you wanted?

The point of the check was to keep looping for __GFP_NOFAIL requests
even when the OOM killer is disabled (out_of_memory returns false). We
are accomplishing that by 
> 
> >  		*did_some_progress = 1;
		^^^^ this

it is true we will not have the warning but I am not really sure we care
all that much. In any case it wouldn't be all that hard to check for oom
killer disabled and warn on in the allocator slow path.
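
Just to illustrate (a sketch only - it assumes something like the
oom_killer_disabled flag is visible from the allocator slow path, the
exact plumbing might differ):

	/*
	 * sketch: warn when a nofail request can only keep looping because
	 * the OOM killer has been disabled
	 */
	WARN_ON_ONCE((gfp_mask & __GFP_NOFAIL) && oom_killer_disabled);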

thanks for having a look!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC 2/2] mm, oom: do not enfore OOM killer for __GFP_NOFAIL automatically
  2016-11-23  6:49   ` Michal Hocko
@ 2016-11-23 14:35     ` Tetsuo Handa
  -1 siblings, 0 replies; 24+ messages in thread
From: Tetsuo Handa @ 2016-11-23 14:35 UTC (permalink / raw)
  To: mhocko, linux-mm
  Cc: vbabka, rientjes, hannes, mgorman, akpm, linux-kernel, mhocko

Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> __alloc_pages_may_oom makes sure to skip the OOM killer depending on
> the allocation request. This includes lowmem requests, costly high
> order requests and others. For a long time __GFP_NOFAIL acted as an
> override for all those rules. This is not documented and it can be quite
> surprising as well. E.g. GFP_NOFS requests are not invoking the OOM
> killer but GFP_NOFS|__GFP_NOFAIL does so if we try to convert some of
> the existing open coded loops around allocator to nofail request (and we
> have done that in the past) then such a change would have a non trivial
> side effect which is not obvious. Note that the primary motivation for
> skipping the OOM killer is to prevent from pre-mature invocation.
> 
> The exception has been added by 82553a937f12 ("oom: invoke oom killer
> for __GFP_NOFAIL"). The changelog points out that the oom killer has to
> be invoked otherwise the request would be looping for ever. But this
> argument is rather weak because the OOM killer doesn't really guarantee
> any forward progress for those exceptional cases - e.g. it will hardly
> help to form costly order - I believe we certainly do not want to kill
> all processes and eventually panic the system just because there is a
> nasty driver asking for order-9 page with GFP_NOFAIL not realizing all
> the consequences - it is much better this request would loop for ever
> than the massive system disruption, lowmem is also highly unlikely to be
> freed during OOM killer and GFP_NOFS request could trigger while there
> is still a lot of memory pinned by filesystems.
> 
> This patch simply removes the __GFP_NOFAIL special case in order to have
> a more clear semantic without surprising side effects. Instead we do
> allow nofail requests to access memory reserves to move forward in both
> cases when the OOM killer is invoked and when it should be supressed.
> __alloc_pages_nowmark helper has been introduced for that purpose.

__alloc_pages_nowmark() likely works if order is 0, but there is no
guarantee that __alloc_pages_nowmark() can find order > 0 pages.
If __alloc_pages_nowmark() called by __GFP_NOFAIL could not find pages
with requested order due to fragmentation, __GFP_NOFAIL should invoke
the OOM killer. I believe that risking killing all processes and eventually
panicking the system is better than a __GFP_NOFAIL livelock.

I'm not happy that the caller cannot invoke the OOM killer unless __GFP_FS
or __GFP_NOFAIL is specified. I think we should get rid of the concept of
premature OOM killer invocation. That is, whenever requested pages cannot
be allocated and the caller does not want to fail, invoking the OOM killer
is no longer premature. Unfortunately, there seems to be cases where the
caller needs to use GFP_NOFS rather than GFP_KERNEL due to unclear dependency
between memory allocation by system calls and memory reclaim by filesystems.
But memory reclaim by filesystems are not the fault of userspace processes
which issued system calls. It is unfair to force legitimate processes to fail
system calls with ENOMEM when GFP_NOFS is used inside system calls instead of
killing memory hog processes using the OOM killer. The root cause is that we
blindly treat all memory allocation requests evenly using the same watermark
(with rough-grained exceptions such as __GFP_HIGH) and allow lower priority
memory allocations (e.g. memory for buffered writes) to consume memory to the
level where higher priority memory allocations (e.g. memory for disk I/O) has
to retry looping without invoking the OOM killer, instead of using different
watermarks based on purpose/importance/priority of individual memory
allocation requests so that higher priority memory allocations can invoke
the OOM killer.



> @@ -3725,6 +3738,14 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  		 */
>  		WARN_ON_ONCE(order > PAGE_ALLOC_COSTLY_ORDER);
>  
> +		/*
> +		 * Help non-failing allocations by giving them access to memory
> +		 * reserves
> +		 */
> +		page = __alloc_pages_nowmark(gfp_mask, order, ac);
> +		if (page)
> +			goto got_pg;
> +

Should no_progress_loops be reset to 0 before retrying?

>  		cond_resched();
>  		goto retry;
>  	}
> -- 
> 2.10.2
> 
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC 2/2] mm, oom: do not enfore OOM killer for __GFP_NOFAIL automatically
  2016-11-23 14:35     ` Tetsuo Handa
@ 2016-11-23 15:35       ` Michal Hocko
  -1 siblings, 0 replies; 24+ messages in thread
From: Michal Hocko @ 2016-11-23 15:35 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: linux-mm, vbabka, rientjes, hannes, mgorman, akpm, linux-kernel

On Wed 23-11-16 23:35:10, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > __alloc_pages_may_oom makes sure to skip the OOM killer depending on
> > the allocation request. This includes lowmem requests, costly high
> > order requests and others. For a long time __GFP_NOFAIL acted as an
> > override for all those rules. This is not documented and it can be quite
> > surprising as well. E.g. GFP_NOFS requests are not invoking the OOM
> > killer but GFP_NOFS|__GFP_NOFAIL does so if we try to convert some of
> > the existing open coded loops around allocator to nofail request (and we
> > have done that in the past) then such a change would have a non trivial
> > side effect which is not obvious. Note that the primary motivation for
> > skipping the OOM killer is to prevent from pre-mature invocation.
> > 
> > The exception has been added by 82553a937f12 ("oom: invoke oom killer
> > for __GFP_NOFAIL"). The changelog points out that the oom killer has to
> > be invoked otherwise the request would be looping for ever. But this
> > argument is rather weak because the OOM killer doesn't really guarantee
> > any forward progress for those exceptional cases - e.g. it will hardly
> > help to form costly order - I believe we certainly do not want to kill
> > all processes and eventually panic the system just because there is a
> > nasty driver asking for order-9 page with GFP_NOFAIL not realizing all
> > the consequences - it is much better this request would loop for ever
> > than the massive system disruption, lowmem is also highly unlikely to be
> > freed during OOM killer and GFP_NOFS request could trigger while there
> > is still a lot of memory pinned by filesystems.
> > 
> > This patch simply removes the __GFP_NOFAIL special case in order to have
> > a more clear semantic without surprising side effects. Instead we do
> > allow nofail requests to access memory reserves to move forward in both
> > cases when the OOM killer is invoked and when it should be supressed.
> > __alloc_pages_nowmark helper has been introduced for that purpose.
> 
> __alloc_pages_nowmark() likely works if order is 0, but there is no
> guarantee that __alloc_pages_nowmark() can find order > 0 pages.

Yes and it is not meant to be a guarantee. We are just trying to help
them out.

> If __alloc_pages_nowmark() called by __GFP_NOFAIL could not find pages
> with requested order due to fragmentation, __GFP_NOFAIL should invoke
> the OOM killer. I believe that risking killing all processes and eventually
> panicking the system is better than a __GFP_NOFAIL livelock.

I violently disagree. Just imagine a driver which asks for an order-9
page and cannot really continue without it so it uses GFP_NOFAIL. There
is absolutely no reason to disrupt or even put the whole system down
just because of this particular request. It might take for ever to
continue but that is to be expected when asking for such a hard
requirement.
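
Just to make this concrete (purely illustrative - as noted elsewhere in
the thread there is no such in-tree user right now):

	/* hypothetical driver which cannot make progress without a 2MB buffer */
	struct page *page = alloc_pages(GFP_KERNEL | __GFP_NOFAIL, 9);
	void *buf = page_address(page);

Such a caller explicitly asks to wait for however long it takes rather
than to handle a failure, and killing tasks will not make an order-9
page materialize any sooner.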

> I'm not happy that the caller cannot invoke the OOM killer unless __GFP_FS
> or __GFP_NOFAIL is specified. I think we should get rid of the concept of
> premature OOM killer invocation.

I am not happy about the GFP_NOFS situation either but this is far from
trivial to solve and certainly outside of the scope of these patches.
But as a matter of fact a GFP_NOFS context currently might lead to
premature OOMs which are a no-go for most users. Allowing __GFP_NOFAIL
to override this would just lead to hitting those premature OOMs too easily.

> That is, whenever requested pages cannot
> be allocated and the caller does not want to fail, invoking the OOM killer
> is no longer premature.

This is only true for the full reclaim contexts which GFP_NOFS is not.

> Unfortunately, there seems to be cases where the
> caller needs to use GFP_NOFS rather than GFP_KERNEL due to unclear dependency
> between memory allocation by system calls and memory reclaim by filesystems.

I do not understand your point here. Syscall is an entry point to the
kernel where we cannot recurse to the FS code so GFP_NOFS seems wrong
thing to ask.

> But memory reclaim by filesystems are not the fault of userspace processes
> which issued system calls. It is unfair to force legitimate processes to fail
> system calls with ENOMEM when GFP_NOFS is used inside system calls instead of
> killing memory hog processes using the OOM killer. The root cause is that we
> blindly treat all memory allocation requests evenly using the same watermark
> (with rough-grained exceptions such as __GFP_HIGH) and allow lower priority
> memory allocations (e.g. memory for buffered writes) to consume memory to the
> level where higher priority memory allocations (e.g. memory for disk I/O) has
> to retry looping without invoking the OOM killer, instead of using different
> watermarks based on purpose/importance/priority of individual memory
> allocation requests so that higher priority memory allocations can invoke
> the OOM killer.

The priority/importance of the allocation is really subjective. If you
ask, every subsystem will claim theirs to be the most important. The gfp
mask has been used to give some constraints for allocations. If you need a
forward progress guarantee then you have to build on top of that - e.g. use
mempools.
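
For completeness, a minimal mempool sketch (illustrative only, the
element count and the kmem cache are made up):

	/* pre-allocate 16 objects so allocations can always make progress */
	mempool_t *pool = mempool_create_slab_pool(16, my_object_cache);
	void *obj;

	obj = mempool_alloc(pool, GFP_NOIO);	/* waits for an element, does not fail */
	/* ... use obj ... */
	mempool_free(obj, pool);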

Anyway I fail to see how the above is related to this patch. My main
point here is that GFP_NOFAIL should not override the OOM decisions.

> > @@ -3725,6 +3738,14 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> >  		 */
> >  		WARN_ON_ONCE(order > PAGE_ALLOC_COSTLY_ORDER);
> >  
> > +		/*
> > +		 * Help non-failing allocations by giving them access to memory
> > +		 * reserves
> > +		 */
> > +		page = __alloc_pages_nowmark(gfp_mask, order, ac);
> > +		if (page)
> > +			goto got_pg;
> > +
> 
> Should no_progress_loops be reset to 0 before retrying?

Hmm, we might but is it necessary? The OOM path handles that properly
and nothing else but the oom detection really cares about no_progress_loops.
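
For reference, the relevant part of the slow path currently looks
roughly like this (simplified):

	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
	if (page)
		goto got_pg;

	/* Retry as long as the OOM killer is making progress for us */
	if (did_some_progress) {
		no_progress_loops = 0;
		goto retry;
	}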

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC 2/2] mm, oom: do not enfore OOM killer for __GFP_NOFAIL automatically
  2016-11-23 12:35       ` Michal Hocko
@ 2016-11-24  7:41         ` Vlastimil Babka
  -1 siblings, 0 replies; 24+ messages in thread
From: Vlastimil Babka @ 2016-11-24  7:41 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Johannes Weiner,
	Mel Gorman, Andrew Morton, LKML

On 11/23/2016 01:35 PM, Michal Hocko wrote:
> On Wed 23-11-16 13:19:20, Vlastimil Babka wrote:
>> This makes some sense to me, but there might be unpleasant consequences,
>> e.g. due to allowing costly allocations without reserves.
>
> I am not sure I understand. Did you mean with reserves? Anyway, my code

Yeah, with reserves/without watermarks checks. Sorry.

> inspection showed that we are not really doing GFP_NOFAIL for costly
> orders. This might change in the future but even if we do that then this
> shouldn't add a risk of depleting the reserves, right?

Well, it's true that it is unlikely that high-order pages will exist
at the min watermark, but if they do, a high-order page depletes more of
the reserves than an order-0 one. Anyway we have the WARN_ON_ONCE on costly
nofail allocations, so at least this won't happen silently...

>> I guess only testing will show...
>>
>> Also some comments below.
> [...]
>>>  static inline struct page *
>>> +__alloc_pages_nowmark(gfp_t gfp_mask, unsigned int order,
>>> +						const struct alloc_context *ac)
>>> +{
>>> +	struct page *page;
>>> +
>>> +	page = get_page_from_freelist(gfp_mask, order,
>>> +			ALLOC_NO_WATERMARKS|ALLOC_CPUSET, ac);
>>> +	/*
>>> +	 * fallback to ignore cpuset restriction if our nodes
>>> +	 * are depleted
>>> +	 */
>>> +	if (!page)
>>> +		page = get_page_from_freelist(gfp_mask, order,
>>> +				ALLOC_NO_WATERMARKS, ac);
>>
>> Is this enough? Look at what __alloc_pages_slowpath() does since
>> e46e7b77c909 ("mm, page_alloc: recalculate the preferred zoneref if the
>> context can ignore memory policies").
>
> this is a one time attempt to do the nowmark allocation. If we need to
> do the recalculation then this should happen in the next round. Or am I
> missing your question?

The next round no-watermarks allocation attempt in 
__alloc_pages_slowpath() uses different criteria than the new 
__alloc_pages_nowmark() callers. And it would be nicer to unify this as 
well, if possible.

>
>>
>> ...
>>
>>> -	}
>>>  	/* Exhausted what can be done so it's blamo time */
>>> -	if (out_of_memory(&oc) || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) {
>>> +	if (out_of_memory(&oc)) {
>>
>> This removes the warning, but also the check for __GFP_NOFAIL itself. Was it
>> what you wanted?
>
> The point of the check was to keep looping for __GFP_NOFAIL requests
> even when the OOM killer is disabled (out_of_memory returns false). We
> are accomplishing that by
>>
>>>  		*did_some_progress = 1;
> 		^^^^ this

But oom disabled means that this line is not reached?

> it is true we will not have the warning but I am not really sure we care
> all that much. In any case it wouldn't be all that hard to check for oom
> killer disabled and warn on in the allocator slow path.
>
> thanks for having a look!
>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC 2/2] mm, oom: do not enfore OOM killer for __GFP_NOFAIL automatically
  2016-11-24  7:41         ` Vlastimil Babka
@ 2016-11-24  7:51           ` Michal Hocko
  -1 siblings, 0 replies; 24+ messages in thread
From: Michal Hocko @ 2016-11-24  7:51 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Johannes Weiner,
	Mel Gorman, Andrew Morton, LKML

On Thu 24-11-16 08:41:30, Vlastimil Babka wrote:
> On 11/23/2016 01:35 PM, Michal Hocko wrote:
> > On Wed 23-11-16 13:19:20, Vlastimil Babka wrote:
[...]
> > > >  static inline struct page *
> > > > +__alloc_pages_nowmark(gfp_t gfp_mask, unsigned int order,
> > > > +						const struct alloc_context *ac)
> > > > +{
> > > > +	struct page *page;
> > > > +
> > > > +	page = get_page_from_freelist(gfp_mask, order,
> > > > +			ALLOC_NO_WATERMARKS|ALLOC_CPUSET, ac);
> > > > +	/*
> > > > +	 * fallback to ignore cpuset restriction if our nodes
> > > > +	 * are depleted
> > > > +	 */
> > > > +	if (!page)
> > > > +		page = get_page_from_freelist(gfp_mask, order,
> > > > +				ALLOC_NO_WATERMARKS, ac);
> > > 
> > > Is this enough? Look at what __alloc_pages_slowpath() does since
> > > e46e7b77c909 ("mm, page_alloc: recalculate the preferred zoneref if the
> > > context can ignore memory policies").
> > 
> > this is a one time attempt to do the nowmark allocation. If we need to
> > do the recalculation then this should happen in the next round. Or am I
> > missing your question?
> 
> The next round no-watermarks allocation attempt in __alloc_pages_slowpath()
> uses different criteria than the new __alloc_pages_nowmark() callers. And it
> would be nicer to unify this as well, if possible.

I am sorry but I still do not see your point. Could you be more specific
about why it matters? In other words this is what we were doing prior to
this patch already so I am not changing it. I just wrapped it into a helper
because I have to do the same thing in two places because of the oom vs.
no-oom paths.

> > > > -	}
> > > >  	/* Exhausted what can be done so it's blamo time */
> > > > -	if (out_of_memory(&oc) || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) {
> > > > +	if (out_of_memory(&oc)) {
> > > 
> > > This removes the warning, but also the check for __GFP_NOFAIL itself. Was it
> > > what you wanted?
> > 
> > The point of the check was to keep looping for __GFP_NOFAIL requests
> > even when the OOM killer is disabled (out_of_memory returns false). We
> > are accomplishing that by
> > > 
> > > >  		*did_some_progress = 1;
> > 		^^^^ this
> 
> But oom disabled means that this line is not reached?

Yes but it doesn't need to anymore because we have that "check NOFAIL on
nopage" check in the allocator slow path from the first patch. We didn't
have that previously so we had to "cheat" here.
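
I.e. roughly something like this at the nopage: label (a simplified
sketch only - see patch 1 for the exact form):

	nopage:
		if (gfp_mask & __GFP_NOFAIL) {
			/* nofail requests must keep retrying rather than fail */
			cond_resched();
			goto retry;
		}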

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC 2/2] mm, oom: do not enfore OOM killer for __GFP_NOFAIL automatically
  2016-11-23 15:35       ` Michal Hocko
@ 2016-11-25 12:00         ` Tetsuo Handa
  -1 siblings, 0 replies; 24+ messages in thread
From: Tetsuo Handa @ 2016-11-25 12:00 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm, vbabka, rientjes, hannes, mgorman, akpm, linux-kernel

Michal Hocko wrote:
> On Wed 23-11-16 23:35:10, Tetsuo Handa wrote:
> > If __alloc_pages_nowmark() called by __GFP_NOFAIL could not find pages
> > with requested order due to fragmentation, __GFP_NOFAIL should invoke
> > the OOM killer. I believe that risking killing all processes and eventually
> > panicking the system is better than a __GFP_NOFAIL livelock.
>
> I violently disagree. Just imagine a driver which asks for an order-9
> page and cannot really continue without it so it uses GFP_NOFAIL. There
> is absolutely no reason to disrupt or even put the whole system down
> just because of this particular request. It might take for ever to
> continue but that is to be expected when asking for such a hard
> requirement.

Did we find such in-tree drivers? If any, we likely already know it via
WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1)); in buffered_rmqueue().
Even if there were such out-of-tree drivers, we don't need to take care of
out-of-tree drivers.

> > Unfortunately, there seems to be cases where the
> > caller needs to use GFP_NOFS rather than GFP_KERNEL due to unclear dependency
> > between memory allocation by system calls and memory reclaim by filesystems.
>
> I do not understand your point here. Syscall is an entry point to the
> kernel where we cannot recurse to the FS code so GFP_NOFS seems wrong
> thing to ask.

Will you look at http://marc.info/?t=120716967100004&r=1&w=2 which lead to
commit a02fe13297af26c1 ("selinux: prevent rentry into the FS") and commit
869ab5147e1eead8 ("SELinux: more GFP_NOFS fixups to prevent selinux from
re-entering the fs code") ? My understanding is that mkdir() system call
caused memory allocation for inode creation and that memory allocation
caused memory reclaim which had to be !__GFP_FS.

And whether we need to use GFP_NOFS at specific point is very very unclear.
For example, security_inode_init_security() calls call_int_hook() macro
which calls smack_inode_init_security() if Smack is active.
smack_inode_init_security() uses GFP_NOFS for memory allocation.
security_inode_init_security() also calls evm_inode_init_security(), and
evm_inode_init_security() uses GFP_NOFS for memory allocation.
Looks consistent? Yes.

But evm_inode_init_security() also calls evm_init_hmac() which in turn calls
init_desc() which uses GFP_KERNEL for memory allocation. This is not consistent.

And security_inode_init_security() also calls initxattrs() callback which is
provided by filesystem code. For example, btrfs_initxattrs() is called if
security_inode_init_security() is called by btrfs. And btrfs_initxattrs() is
using GFP_KERNEL for memory allocation. This is not consistent too.

Either we are needlessly using GFP_NOFS with the risk of a retry-forever loop
or we are wrongly using GFP_KERNEL with the risk of a memory reclaim deadlock.
Apart from the fact that we need to make these GFP_NOFS/GFP_KERNEL usages
consistent (although whether we need to use GFP_NOFS at all is very unclear),
I do want to allow memory allocations from functions which are called by
system calls to invoke the OOM killer (e.g. __GFP_MAY_OOMKILL) rather than
risk a retry-forever loop (or fail that request) even if we need to use GFP_NOFS.
Also, I'm willing to give up memory allocations from functions which are
called by system calls if SIGKILL is pending (i.e. __GFP_KILLABLE).
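
Roughly, what I have in mind for the latter is something along these
lines in the allocator slow path (__GFP_KILLABLE does not exist today,
this is only a sketch of the idea):

	/* give up a killable allocation instead of looping once SIGKILL arrived */
	if ((gfp_mask & __GFP_KILLABLE) && fatal_signal_pending(current))
		goto nopage;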

Did you understand my point?

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC 2/2] mm, oom: do not enfore OOM killer for __GFP_NOFAIL automatically
  2016-11-25 12:00         ` Tetsuo Handa
@ 2016-11-25 13:18           ` Michal Hocko
  -1 siblings, 0 replies; 24+ messages in thread
From: Michal Hocko @ 2016-11-25 13:18 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: linux-mm, vbabka, rientjes, hannes, mgorman, akpm, linux-kernel

On Fri 25-11-16 21:00:52, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Wed 23-11-16 23:35:10, Tetsuo Handa wrote:
> > > If __alloc_pages_nowmark() called by __GFP_NOFAIL could not find pages
> > > with requested order due to fragmentation, __GFP_NOFAIL should invoke
> > > the OOM killer. I believe that risking killing all processes and eventually
> > > panicking the system is better than a __GFP_NOFAIL livelock.
> >
> > I violently disagree. Just imagine a driver which asks for an order-9
> > page and cannot really continue without it so it uses GFP_NOFAIL. There
> > is absolutely no reason to disrupt or even put the whole system down
> > just because of this particular request. It might take for ever to
> > continue but that is to be expected when asking for such a hard
> > requirement.
> 
> Did we find such in-tree drivers? If any, we likely already know it via
> WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1)); in buffered_rmqueue().
> Even if there were such out-of-tree drivers, we don't need to take care of
> out-of-tree drivers.

We do not have any costly + GFP_NOFAIL users in the tree from my quick
check. The whole point of this exercise is to make sure such a potential
user does not see unexpected side effects - e.g. when transforming open
coded endless loops into GFP_NOFAIL, which is imho preferable.
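
To be explicit about the transformation I mean (schematic example only):

	/* open coded variant which hides the nofail semantic */
	do {
		ptr = kmalloc(size, GFP_NOFS);
	} while (!ptr);

	/* explicit and easier to audit */
	ptr = kmalloc(size, GFP_NOFS | __GFP_NOFAIL);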

The bottom line is that GFP_NOFAIL is about never failing, not about getting
the memory as quickly as possible or at any price.

> > > Unfortunately, there seems to be cases where the
> > > caller needs to use GFP_NOFS rather than GFP_KERNEL due to unclear dependency
> > > between memory allocation by system calls and memory reclaim by filesystems.
> >
> > I do not understand your point here. Syscall is an entry point to the
> > kernel where we cannot recurse to the FS code so GFP_NOFS seems wrong
> > thing to ask.
> 
> Will you look at http://marc.info/?t=120716967100004&r=1&w=2 which lead to
> commit a02fe13297af26c1 ("selinux: prevent rentry into the FS") and commit
> 869ab5147e1eead8 ("SELinux: more GFP_NOFS fixups to prevent selinux from
> re-entering the fs code") ? My understanding is that mkdir() system call
> caused memory allocation for inode creation and that memory allocation
> caused memory reclaim which had to be !__GFP_FS.

I will have a look later, thanks for the points.
 
> And whether we need to use GFP_NOFS at specific point is very very unclear.

And that is exactly why I am pushing for a scoped GFP_NOFS usage where
the FS code marks those scopes which are dangerous from the reclaim
recursion point of view or for other FS-internal reasons, and the stacking
code shouldn't have to care at all. Spreading GFP_NOFS randomly is not at
all helpful, nor does it make the situation any better.
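
The idea is analogous to the already existing NOIO scope API (a sketch
only - the NOFS counterpart is not merged):

	unsigned int noio_flags;

	/* every allocation inside the scope implicitly behaves like GFP_NOIO */
	noio_flags = memalloc_noio_save();
	/* ... code which must not recurse into the block layer ... */
	memalloc_noio_restore(noio_flags);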

I am sorry but I would prefer not to discuss this part in this thread as
it is mostly off topic. The point I am trying to make here is to clean
up GFP_NOFAIL usage. And I argue that overriding the oom prevention
decisions just because of GFP_NOFAIL is wrong. So let's please stick
with this topic. I might be wrong and miss some legitimate case but then
I would like to hear about it.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread
