* [patch] mm: memcg: do not declare OOM from __GFP_NOFAIL allocations
@ 2013-11-22 17:17 Johannes Weiner
  2013-11-27  1:01 ` David Rientjes
  0 siblings, 1 reply; 17+ messages in thread
From: Johannes Weiner @ 2013-11-22 17:17 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Michal Hocko, linux-mm, cgroups, linux-kernel

84235de394d9 ("fs: buffer: move allocation failure loop into the
allocator") started recognizing __GFP_NOFAIL in memory cgroups but
forgot to disable the OOM killer.

Any task that does not fail the allocation will also never enter the
OOM completion path.  So don't declare an OOM state in this case, or
it will be leaked and the task will be able to bypass the limit until
the next userspace-triggered page fault cleans up the OOM state.

Reported-by: William Dauchy <wdauchy@gmail.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: stable@kernel.org # 3.12
---
 mm/memcontrol.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 13b9d0f..cc4f9cb 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2677,6 +2677,9 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
 	if (unlikely(task_in_memcg_oom(current)))
 		goto bypass;
 
+	if (gfp_mask & __GFP_NOFAIL)
+		oom = false;
+
 	/*
 	 * We always charge the cgroup the mm_struct belongs to.
 	 * The mm_struct's mem_cgroup changes on task migration if the
-- 
1.8.4.2
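
For readers without mm/memcontrol.c at hand, here is a small stand-alone
model of the decision the patch changes.  The helper name, flag value and
global below are invented for illustration; only the __GFP_NOFAIL check
mirrors the actual hunk.

#include <stdbool.h>
#include <stdio.h>

#define __GFP_NOFAIL 0x800u             /* toy value, not the kernel's */

static bool memcg_oom_state;            /* models the per-task OOM state */

enum charge_result { CHARGED, FAILED, BYPASSED };

static enum charge_result try_charge(unsigned int gfp_mask, bool reclaim_failed)
{
        bool oom = true;

        if (gfp_mask & __GFP_NOFAIL)    /* the fix */
                oom = false;

        if (!reclaim_failed)
                return CHARGED;

        if (oom)                        /* before the fix, NOFAIL took this too */
                memcg_oom_state = true; /* normally cleared at end of the fault */

        if (gfp_mask & __GFP_NOFAIL)    /* bypass added by 84235de394d9 */
                return BYPASSED;        /* succeeds, so the fault never fails */

        return FAILED;                  /* fault exit runs the OOM completion */
}

int main(void)
{
        try_charge(__GFP_NOFAIL, true);
        /* without the fix this would report a leaked OOM state (1) */
        printf("memcg_oom_state = %d\n", memcg_oom_state);
        return 0;
}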



* Re: [patch] mm: memcg: do not declare OOM from __GFP_NOFAIL allocations
  2013-11-22 17:17 [patch] mm: memcg: do not declare OOM from __GFP_NOFAIL allocations Johannes Weiner
@ 2013-11-27  1:01 ` David Rientjes
  2013-11-27  3:33   ` David Rientjes
  0 siblings, 1 reply; 17+ messages in thread
From: David Rientjes @ 2013-11-27  1:01 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Michal Hocko, linux-mm, cgroups, linux-kernel

On Fri, 22 Nov 2013, Johannes Weiner wrote:

> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 13b9d0f..cc4f9cb 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2677,6 +2677,9 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
>  	if (unlikely(task_in_memcg_oom(current)))
>  		goto bypass;
>  
> +	if (gfp_mask & __GFP_NOFAIL)
> +		oom = false;
> +
>  	/*
>  	 * We always charge the cgroup the mm_struct belongs to.
>  	 * The mm_struct's mem_cgroup changes on task migration if the

Sorry, I don't understand this.  What happens in the following scenario:

 - memory.usage_in_bytes == memory.limit_in_bytes,

 - memcg reclaim fails to reclaim memory, and

 - all processes (perhaps only one) attached to the memcg are doing one of
   the over dozen __GFP_NOFAIL allocations in the kernel?

How do we make forward progress if you cannot oom kill something?


* Re: [patch] mm: memcg: do not declare OOM from __GFP_NOFAIL allocations
  2013-11-27  1:01 ` David Rientjes
@ 2013-11-27  3:33   ` David Rientjes
  2013-11-27 16:39     ` Johannes Weiner
  0 siblings, 1 reply; 17+ messages in thread
From: David Rientjes @ 2013-11-27  3:33 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Michal Hocko, linux-mm, cgroups, linux-kernel

On Tue, 26 Nov 2013, David Rientjes wrote:

> On Fri, 22 Nov 2013, Johannes Weiner wrote:
> 
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 13b9d0f..cc4f9cb 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -2677,6 +2677,9 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
> >  	if (unlikely(task_in_memcg_oom(current)))
> >  		goto bypass;
> >  
> > +	if (gfp_mask & __GFP_NOFAIL)
> > +		oom = false;
> > +
> >  	/*
> >  	 * We always charge the cgroup the mm_struct belongs to.
> >  	 * The mm_struct's mem_cgroup changes on task migration if the
> 
> Sorry, I don't understand this.  What happens in the following scenario:
> 
>  - memory.usage_in_bytes == memory.limit_in_bytes,
> 
>  - memcg reclaim fails to reclaim memory, and
> 
>  - all processes (perhaps only one) attached to the memcg are doing one of
>    the over dozen __GFP_NOFAIL allocations in the kernel?
> 
> How do we make forward progress if you cannot oom kill something?
> 

Ah, this is because of 3168ecbe1c04 ("mm: memcg: use proper memcg in limit 
bypass") which just bypasses all of these allocations and charges the root 
memcg.  So if allocations want to bypass memcg isolation they just have to 
be __GFP_NOFAIL?
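
For reference, the bypass: label at the end of __mem_cgroup_try_charge()
boils down to roughly the following (an abbreviated sketch, not the
literal code):

bypass:
        *ptr = root_mem_cgroup; /* hand the page to the root memcg instead */
        return -EINTR;          /* tells the caller the charge was bypassed */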


* Re: [patch] mm: memcg: do not declare OOM from __GFP_NOFAIL allocations
  2013-11-27  3:33   ` David Rientjes
@ 2013-11-27 16:39     ` Johannes Weiner
  2013-11-27 21:38       ` David Rientjes
  0 siblings, 1 reply; 17+ messages in thread
From: Johannes Weiner @ 2013-11-27 16:39 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Michal Hocko, linux-mm, cgroups, linux-kernel

On Tue, Nov 26, 2013 at 07:33:12PM -0800, David Rientjes wrote:
> On Tue, 26 Nov 2013, David Rientjes wrote:
> 
> > On Fri, 22 Nov 2013, Johannes Weiner wrote:
> > 
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > index 13b9d0f..cc4f9cb 100644
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -2677,6 +2677,9 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
> > >  	if (unlikely(task_in_memcg_oom(current)))
> > >  		goto bypass;
> > >  
> > > +	if (gfp_mask & __GFP_NOFAIL)
> > > +		oom = false;
> > > +
> > >  	/*
> > >  	 * We always charge the cgroup the mm_struct belongs to.
> > >  	 * The mm_struct's mem_cgroup changes on task migration if the
> > 
> > Sorry, I don't understand this.  What happens in the following scenario:
> > 
> >  - memory.usage_in_bytes == memory.limit_in_bytes,
> > 
> >  - memcg reclaim fails to reclaim memory, and
> > 
> >  - all processes (perhaps only one) attached to the memcg are doing one of
> >    the over dozen __GFP_NOFAIL allocations in the kernel?
> > 
> > How do we make forward progress if you cannot oom kill something?

Bypass the limit.

> Ah, this is because of 3168ecbe1c04 ("mm: memcg: use proper memcg in limit 
> bypass") which just bypasses all of these allocations and charges the root 
> memcg.  So if allocations want to bypass memcg isolation they just have to 
> be __GFP_NOFAIL?

I don't think we have another option.


* Re: [patch] mm: memcg: do not declare OOM from __GFP_NOFAIL allocations
  2013-11-27 16:39     ` Johannes Weiner
@ 2013-11-27 21:38       ` David Rientjes
  2013-11-27 22:53         ` Johannes Weiner
  0 siblings, 1 reply; 17+ messages in thread
From: David Rientjes @ 2013-11-27 21:38 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Michal Hocko, linux-mm, cgroups, linux-kernel

On Wed, 27 Nov 2013, Johannes Weiner wrote:

> > Ah, this is because of 3168ecbe1c04 ("mm: memcg: use proper memcg in limit 
> > bypass") which just bypasses all of these allocations and charges the root 
> > memcg.  So if allocations want to bypass memcg isolation they just have to 
> > be __GFP_NOFAIL?
> 
> I don't think we have another option.
> 

We don't give __GFP_NOFAIL allocations access to memory reserves in the 
page allocator and we do call the oom killer for them so that a process is 
killed so that memory is freed.  Why do we have a different policy for 
memcg?


* Re: [patch] mm: memcg: do not declare OOM from __GFP_NOFAIL allocations
  2013-11-27 21:38       ` David Rientjes
@ 2013-11-27 22:53         ` Johannes Weiner
  2013-11-27 23:34           ` David Rientjes
  0 siblings, 1 reply; 17+ messages in thread
From: Johannes Weiner @ 2013-11-27 22:53 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Michal Hocko, linux-mm, cgroups, linux-kernel

On Wed, Nov 27, 2013 at 01:38:59PM -0800, David Rientjes wrote:
> On Wed, 27 Nov 2013, Johannes Weiner wrote:
> 
> > > Ah, this is because of 3168ecbe1c04 ("mm: memcg: use proper memcg in limit 
> > > bypass") which just bypasses all of these allocations and charges the root 
> > > memcg.  So if allocations want to bypass memcg isolation they just have to 
> > > be __GFP_NOFAIL?
> > 
> > I don't think we have another option.
> > 
> 
> We don't give __GFP_NOFAIL allocations access to memory reserves in the 
> page allocator and we do call the oom killer for them so that a process is 
> killed so that memory is freed.  Why do we have a different policy for 
> memcg?

Oh boy, that's the epic story we dealt with all throughout the last
merge window... ;-)

__GFP_NOFAIL allocations might come in with various filesystem locks
held that could prevent an OOM victim from exiting, so a loop around
the OOM killer in an allocation context is prone to loop endlessly.

Because of this, I changed memcg to never invoke the OOM killer from
the allocation context anymore, but to save it for the end of the page
fault handler.  __GFP_NOFAIL allocations cannot fail and thus do not
reach the end of the page fault, so no OOM kill invocation is possible.

Arguably, the page allocator should also just return NULL and leave
OOM killing to pagefault_out_of_memory(), but it is much less likely
to get stuck: the overall system has a better chance of producing free
memory, even without an OOM kill, than e.g. a memcg with a single
process and no swap.
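
In code terms, the scheme that went in around 3.12 is roughly the
following (heavily abbreviated; the function names exist in that code,
but the bodies and call sites are simplified):

        /* charge path: reclaim failed and oom == true */
        mem_cgroup_oom(memcg, gfp_mask, order); /* only records the OOM state on current */
        return -ENOMEM;                         /* unwind, dropping fs locks on the way out */

        /* end of the page fault, no locks held anymore */
        if (task_in_memcg_oom(current))
                mem_cgroup_oom_synchronize(true);       /* kill or wait, then clear the state */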


* Re: [patch] mm: memcg: do not declare OOM from __GFP_NOFAIL allocations
  2013-11-27 22:53         ` Johannes Weiner
@ 2013-11-27 23:34           ` David Rientjes
  2013-11-28 10:20             ` Michal Hocko
  0 siblings, 1 reply; 17+ messages in thread
From: David Rientjes @ 2013-11-27 23:34 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Michal Hocko, linux-mm, cgroups, linux-kernel

On Wed, 27 Nov 2013, Johannes Weiner wrote:

> > We don't give __GFP_NOFAIL allocations access to memory reserves in the 
> > page allocator and we do call the oom killer for them so that a process is 
> > killed so that memory is freed.  Why do we have a different policy for 
> > memcg?
> 
> Oh boy, that's the epic story we dealt with all throughout the last
> merge window... ;-)
> 
> __GFP_NOFAIL allocations might come in with various filesystem locks
> held that could prevent an OOM victim from exiting, so a loop around
> the OOM killer in an allocation context is prone to loop endlessly.
> 

Ok, so let's forget about GFP_KERNEL | __GFP_NOFAIL, since anything doing 
__GFP_FS should not be holding such locks; we have some of those in the 
drivers code, and it makes sense that they are doing GFP_KERNEL.

Focusing on the GFP_NOFS | __GFP_NOFAIL allocations in the filesystem 
code, the kernel oom killer (independent of memcg) never gets called 
because of !__GFP_FS, and they'll simply loop in the page allocator 
forever.

In the past, Andrew has expressed the desire to get rid of __GFP_NOFAIL 
entirely since it's flawed when combined with GFP_NOFS (and GFP_KERNEL | 
__GFP_NOFAIL could simply be reimplemented in the caller) because of the 
reason you point out in addition to making it very difficult in the page 
allocator to free memory independent of memcg.

So I'm wondering if we should just disable the oom killer in memcg for 
__GFP_NOFAIL as you've done here, but not bypass to the root memcg and 
just allow them to spin?  I think we should be focused on fixing the 
callers rather than breaking memcg isolation.
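
Roughly, as a purely illustrative sketch (the label and variable names
are invented):

        /* spin in the charge path instead of bypassing to the root memcg */
        if (gfp_mask & __GFP_NOFAIL) {
                oom = false;            /* still never declare memcg OOM */
                if (charge_failed)
                        goto retry;     /* reclaim and retry within the limit */
        }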


* Re: [patch] mm: memcg: do not declare OOM from __GFP_NOFAIL allocations
  2013-11-27 23:34           ` David Rientjes
@ 2013-11-28 10:20             ` Michal Hocko
  2013-11-29 23:46               ` David Rientjes
  0 siblings, 1 reply; 17+ messages in thread
From: Michal Hocko @ 2013-11-28 10:20 UTC (permalink / raw)
  To: David Rientjes
  Cc: Johannes Weiner, Andrew Morton, linux-mm, cgroups, linux-kernel

On Wed 27-11-13 15:34:24, David Rientjes wrote:
> On Wed, 27 Nov 2013, Johannes Weiner wrote:
> 
> > > We don't give __GFP_NOFAIL allocations access to memory reserves in the 
> > > page allocator and we do call the oom killer for them so that a process is 
> > > killed so that memory is freed.  Why do we have a different policy for 
> > > memcg?
> > 
> > Oh boy, that's the epic story we dealt with all throughout the last
> > merge window... ;-)
> > 
> > __GFP_NOFAIL allocations might come in with various filesystem locks
> > held that could prevent an OOM victim from exiting, so a loop around
> > the OOM killer in an allocation context is prone to loop endlessly.
> > 
> 
> Ok, so let's forget about GFP_KERNEL | __GFP_NOFAIL since anything doing 
> __GFP_FS should not be holding such locks, we have some of those in the 
> drivers code and that makes sense that they are doing GFP_KERNEL.
> 
> Focusing on the GFP_NOFS | __GFP_NOFAIL allocations in the filesystem 
> code, the kernel oom killer independent of memcg never gets called because 
> !__GFP_FS and they'll simply loop around the page allocator forever.
> 
> In the past, Andrew has expressed the desire to get rid of __GFP_NOFAIL 
> entirely since it's flawed when combined with GFP_NOFS (and GFP_KERNEL | 
> __GFP_NOFAIL could simply be reimplemented in the caller) because of the 
> reason you point out in addition to making it very difficult in the page 
> allocator to free memory independent of memcg.
> 
> So I'm wondering if we should just disable the oom killer in memcg for 
> __GFP_NOFAIL as you've done here, but not bypass to the root memcg and 
> just allow them to spin?  I think we should be focused on the fixing the 
> callers rather than breaking memcg isolation.

What if the callers simply cannot deal with the allocation failure?
84235de394d97 (fs: buffer: move allocation failure loop into the
allocator) describes one such case, where __getblk_slow desperately
tries to grow buffers, relying on reclaim to free something. As there
might be no reclaim going on, we are screwed.

That being said, while I do agree with you that we should strive for
isolation as much as possible, there are certain cases where this is
impossible to achieve without much worse consequences. For now, we
hope that __GFP_NOFAIL is used very sparingly.
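
The caller in question has roughly this shape (the retry loop in
fs/buffer.c's __getblk_slow, heavily simplified, error handling
omitted); there is simply no failure path to return through:

        for (;;) {
                struct buffer_head *bh;

                bh = __find_get_block(bdev, block, size);
                if (bh)
                        return bh;

                if (!grow_buffers(bdev, block, size))
                        free_more_memory();     /* kick reclaim, try again */
        }
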
-- 
Michal Hocko
SUSE Labs


* Re: [patch] mm: memcg: do not declare OOM from __GFP_NOFAIL allocations
  2013-11-28 10:20             ` Michal Hocko
@ 2013-11-29 23:46               ` David Rientjes
  2013-12-02 13:22                 ` Michal Hocko
  0 siblings, 1 reply; 17+ messages in thread
From: David Rientjes @ 2013-11-29 23:46 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Andrew Morton, linux-mm, cgroups, linux-kernel

On Thu, 28 Nov 2013, Michal Hocko wrote:

> > Ok, so let's forget about GFP_KERNEL | __GFP_NOFAIL since anything doing 
> > __GFP_FS should not be holding such locks, we have some of those in the 
> > drivers code and that makes sense that they are doing GFP_KERNEL.
> > 
> > Focusing on the GFP_NOFS | __GFP_NOFAIL allocations in the filesystem 
> > code, the kernel oom killer independent of memcg never gets called because 
> > !__GFP_FS and they'll simply loop around the page allocator forever.
> > 
> > In the past, Andrew has expressed the desire to get rid of __GFP_NOFAIL 
> > entirely since it's flawed when combined with GFP_NOFS (and GFP_KERNEL | 
> > __GFP_NOFAIL could simply be reimplemented in the caller) because of the 
> > reason you point out in addition to making it very difficult in the page 
> > allocator to free memory independent of memcg.
> > 
> > So I'm wondering if we should just disable the oom killer in memcg for 
> > __GFP_NOFAIL as you've done here, but not bypass to the root memcg and 
> > just allow them to spin?  I think we should be focused on the fixing the 
> > callers rather than breaking memcg isolation.
> 
> What if the callers simply cannot deal with the allocation failure?
> 84235de394d97 (fs: buffer: move allocation failure loop into the
> allocator) describes one such case when __getblk_slow tries desperately
> to grow buffers relying on the reclaim to free something. As there might
> be no reclaim going on we are screwed.
> 

My suggestion is to spin, not return NULL.  Bypassing to the root memcg 
can lead to a system oom condition whereas if memcg weren't involved at 
all the page allocator would just spin (because of !__GFP_FS).
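
To spell out what "just spin" means, the relevant part of
__alloc_pages_slowpath() at the time looks roughly like this
(abbreviated, most arguments elided):

rebalance:
        page = __alloc_pages_direct_reclaim(...);
        if (page)
                goto got_pg;

        if (!did_some_progress) {
                /* only __GFP_FS && !__GFP_NORETRY requests may OOM kill */
                if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
                        page = __alloc_pages_may_oom(...);
                        if (page)
                                goto got_pg;
                        goto restart;
                }
        }

        /*
         * __GFP_NOFAIL makes should_alloc_retry() return true forever, so
         * a GFP_NOFS | __GFP_NOFAIL request that reclaims nothing loops here.
         */
        if (should_alloc_retry(gfp_mask, order, did_some_progress,
                                        pages_reclaimed)) {
                wait_iff_congested(preferred_zone, BLK_RW_ASYNC, HZ/50);
                goto rebalance;
        }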

> That being said, while I do agree with you that we should strive for
> isolation as much as possible there are certain cases when this is
> impossible to achieve without seeing much worse consequences. For now,
> we hope that __GFP_NOFAIL is used very scarcely.

If that's true, why not bypass the per-zone min watermarks in the page 
allocator as well to allow these allocations to succeed?


* Re: [patch] mm: memcg: do not declare OOM from __GFP_NOFAIL allocations
  2013-11-29 23:46               ` David Rientjes
@ 2013-12-02 13:22                 ` Michal Hocko
  2013-12-02 23:02                   ` David Rientjes
  0 siblings, 1 reply; 17+ messages in thread
From: Michal Hocko @ 2013-12-02 13:22 UTC (permalink / raw)
  To: David Rientjes
  Cc: Johannes Weiner, Andrew Morton, linux-mm, cgroups, linux-kernel

On Fri 29-11-13 15:46:16, David Rientjes wrote:
> On Thu, 28 Nov 2013, Michal Hocko wrote:
> 
> > > Ok, so let's forget about GFP_KERNEL | __GFP_NOFAIL since anything doing 
> > > __GFP_FS should not be holding such locks, we have some of those in the 
> > > drivers code and that makes sense that they are doing GFP_KERNEL.
> > > 
> > > Focusing on the GFP_NOFS | __GFP_NOFAIL allocations in the filesystem 
> > > code, the kernel oom killer independent of memcg never gets called because 
> > > !__GFP_FS and they'll simply loop around the page allocator forever.
> > > 
> > > In the past, Andrew has expressed the desire to get rid of __GFP_NOFAIL 
> > > entirely since it's flawed when combined with GFP_NOFS (and GFP_KERNEL | 
> > > __GFP_NOFAIL could simply be reimplemented in the caller) because of the 
> > > reason you point out in addition to making it very difficult in the page 
> > > allocator to free memory independent of memcg.
> > > 
> > > So I'm wondering if we should just disable the oom killer in memcg for 
> > > __GFP_NOFAIL as you've done here, but not bypass to the root memcg and 
> > > just allow them to spin?  I think we should be focused on the fixing the 
> > > callers rather than breaking memcg isolation.
> > 
> > What if the callers simply cannot deal with the allocation failure?
> > 84235de394d97 (fs: buffer: move allocation failure loop into the
> > allocator) describes one such case when __getblk_slow tries desperately
> > to grow buffers relying on the reclaim to free something. As there might
> > be no reclaim going on we are screwed.
> > 
> 
> My suggestion is to spin, not return NULL. 

Spin on which level? The whole point of this change was to not spin
forever, because the caller might sit on top of other locks which might
prevent somebody else from dying even though it has been killed.

> Bypassing to the root memcg 
> can lead to a system oom condition whereas if memcg weren't involved at 
> all the page allocator would just spin (because of !__GFP_FS).

I am confused now. The page allocation has already happened at the time
we are doing the charge. So the global OOM would have happened already.

> > That being said, while I do agree with you that we should strive for
> > isolation as much as possible there are certain cases when this is
> > impossible to achieve without seeing much worse consequences. For now,
> > we hope that __GFP_NOFAIL is used very scarcely.
> 
> If that's true, why not bypass the per-zone min watermarks in the page 
> allocator as well to allow these allocations to succeed?

Allocations are already done. We simply cannot charge that allocation
because we have reached the hard limit. And said allocation might
prevent the OOM action from proceeding due to held locks.
-- 
Michal Hocko
SUSE Labs


* Re: [patch] mm: memcg: do not declare OOM from __GFP_NOFAIL allocations
  2013-12-02 13:22                 ` Michal Hocko
@ 2013-12-02 23:02                   ` David Rientjes
  2013-12-03 22:25                     ` Johannes Weiner
  0 siblings, 1 reply; 17+ messages in thread
From: David Rientjes @ 2013-12-02 23:02 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Andrew Morton, linux-mm, cgroups, linux-kernel

On Mon, 2 Dec 2013, Michal Hocko wrote:

> > > What if the callers simply cannot deal with the allocation failure?
> > > 84235de394d97 (fs: buffer: move allocation failure loop into the
> > > allocator) describes one such case when __getblk_slow tries desperately
> > > to grow buffers relying on the reclaim to free something. As there might
> > > be no reclaim going on we are screwed.
> > > 
> > 
> > My suggestion is to spin, not return NULL. 
> 
> Spin on which level? The whole point of this change was to not spin for
> ever because the caller might sit on top of other locks which might
> prevent somebody else to die although it has been killed.
> 

See my question about the non-memcg page allocator behavior below.

> > Bypassing to the root memcg 
> > can lead to a system oom condition whereas if memcg weren't involved at 
> > all the page allocator would just spin (because of !__GFP_FS).
> 
> I am confused now. The page allocation has already happened at the time
> we are doing the charge. So the global OOM would have happened already.
> 

That's precisely the point, the successful charges can allow additional 
page allocations to occur and cause system oom conditions if you don't 
have memcg isolation.  Some customers, including us, use memcg to ensure 
that a set of processes cannot use more resources than allowed.  Any 
bypass opens up the possibility of additional memory allocations that 
cause the system to be oom and then we end up requiring a userspace oom 
handler because our policy is complex enough that it cannot be effected 
simply by /proc/pid/oom_score_adj.

I'm not quite sure how significant of a point this is, though, because it 
depends on the caller doing the __GFP_NOFAIL allocations that allow the 
bypass.  If you're doing

	for (i = 0; i < 1 << 20; i++)
		page[i] = alloc_page(GFP_NOFS | __GFP_NOFAIL);

it can become significant, but I'm unsure of how much memory all callers 
end up allocating in this context.

> > > That being said, while I do agree with you that we should strive for
> > > isolation as much as possible there are certain cases when this is
> > > impossible to achieve without seeing much worse consequences. For now,
> > > we hope that __GFP_NOFAIL is used very scarcely.
> > 
> > If that's true, why not bypass the per-zone min watermarks in the page 
> > allocator as well to allow these allocations to succeed?
> 
> Allocations are already done. We simply cannot charge that allocation
> because we have reached the hard limit. And the said allocation might
> prevent OOM action to proceed due to held locks.

I'm referring to the generic non-memcg page allocator behavior.  Forget 
memcg for a moment.  What is the behavior in the _page_allocator_ for 
GFP_NOFS | __GFP_NOFAIL?  Do we spin forever if reclaim fails or do we 
bypass the per-zone min watermarks to allow it to allocate because "it 
needs to succeed, it may be holding filesystem locks"?

It's already been acknowledged in this thread that no bypassing is done 
in the page allocator and it just spins.  There's some handwaving saying 
that since the entire system is oom that there is a greater chance that 
memory will be freed by something else, but that's just handwaving and is 
certainly not guaranteed.

So, my question again: why not bypass the per-zone min watermarks in the 
page allocator?


* Re: [patch] mm: memcg: do not declare OOM from __GFP_NOFAIL allocations
  2013-12-02 23:02                   ` David Rientjes
@ 2013-12-03 22:25                     ` Johannes Weiner
  2013-12-03 23:40                       ` David Rientjes
  0 siblings, 1 reply; 17+ messages in thread
From: Johannes Weiner @ 2013-12-03 22:25 UTC (permalink / raw)
  To: David Rientjes
  Cc: Michal Hocko, Andrew Morton, linux-mm, cgroups, linux-kernel

On Mon, Dec 02, 2013 at 03:02:09PM -0800, David Rientjes wrote:
> On Mon, 2 Dec 2013, Michal Hocko wrote:
> 
> > > > What if the callers simply cannot deal with the allocation failure?
> > > > 84235de394d97 (fs: buffer: move allocation failure loop into the
> > > > allocator) describes one such case when __getblk_slow tries desperately
> > > > to grow buffers relying on the reclaim to free something. As there might
> > > > be no reclaim going on we are screwed.
> > > > 
> > > 
> > > My suggestion is to spin, not return NULL. 
> > 
> > Spin on which level? The whole point of this change was to not spin for
> > ever because the caller might sit on top of other locks which might
> > prevent somebody else to die although it has been killed.
> 
> See my question about the non-memcg page allocator behavior below.

No, please answer the question.

> > > Bypassing to the root memcg 
> > > can lead to a system oom condition whereas if memcg weren't involved at 
> > > all the page allocator would just spin (because of !__GFP_FS).
> > 
> > I am confused now. The page allocation has already happened at the time
> > we are doing the charge. So the global OOM would have happened already.
> > 
> 
> That's precisely the point, the successful charges can allow additional 
> page allocations to occur and cause system oom conditions if you don't 
> have memcg isolation.  Some customers, including us, use memcg to ensure 
> that a set of processes cannot use more resources than allowed.  Any 
> bypass opens up the possibility of additional memory allocations that 
> cause the system to be oom and then we end up requiring a userspace oom 
> handler because our policy is complex enough that it cannot be effected 
> simply by /proc/pid/oom_score_adj.
> 
> I'm not quite sure how significant of a point this is, though, because it 
> depends on the caller doing the __GFP_NOFAIL allocations that allow the 
> bypass.  If you're doing
> 
> 	for (i = 0; i < 1 << 20; i++)
> 		page[i] = alloc_page(GFP_NOFS | __GFP_NOFAIL);

Hyperbole serves no one.

> it can become significant, but I'm unsure of how much memory all callers 
> end up allocating in this context.
>
> > > > That being said, while I do agree with you that we should strive for
> > > > isolation as much as possible there are certain cases when this is
> > > > impossible to achieve without seeing much worse consequences. For now,
> > > > we hope that __GFP_NOFAIL is used very scarcely.
> > > 
> > > If that's true, why not bypass the per-zone min watermarks in the page 
> > > allocator as well to allow these allocations to succeed?
> > 
> > Allocations are already done. We simply cannot charge that allocation
> > because we have reached the hard limit. And the said allocation might
> > prevent OOM action to proceed due to held locks.
> 
> I'm referring to the generic non-memcg page allocator behavior.  Forget 
> memcg for a moment.  What is the behavior in the _page_allocator_ for 
> GFP_NOFS | __GFP_NOFAIL?  Do we spin forever if reclaim fails or do we 
> bypas the per-zone min watermarks to allow it to allocate because "it 
> needs to succeed, it may be holding filesystem locks"?
> 
> It's already been acknowledged in this thread that no bypassing is done 
> in the page allocator and it just spins.  There's some handwaving saying 
> that since the entire system is oom that there is a greater chance that 
> memory will be freed by something else, but that's just handwaving and is 
> certainly no guaranteed.

Do you have another explanation of why this deadlock is not triggering
in the global case?  It's pretty obvious that there is a deadlock that
can not be resolved unless some unrelated task intervenes, just read
__alloc_pages_slowpath().

But we had a concrete bug report for memcg where there was no other
task to intervene.  One was stuck in the OOM killer waiting for the
victim to exit, the victim was stuck on locks that the killer held.
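
Schematically (the lock is generic, for illustration):

        task A (charging, holds an fs lock)     task B (chosen OOM victim)
        -----------------------------------     --------------------------
        reclaim fails, invokes memcg OOM  --->  gets killed, must exit
        waits in the charge path for B          exit path needs A's fs lock
                                                blocks behind A: deadlock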

> So, my question again: why not bypass the per-zone min watermarks in the 
> page allocator?

I don't even know what your argument is supposed to be.  The fact that
we don't do it in the page allocator means that there can't be a bug
in memcg?


* Re: [patch] mm: memcg: do not declare OOM from __GFP_NOFAIL allocations
  2013-12-03 22:25                     ` Johannes Weiner
@ 2013-12-03 23:40                       ` David Rientjes
  2013-12-04  3:01                         ` Johannes Weiner
  0 siblings, 1 reply; 17+ messages in thread
From: David Rientjes @ 2013-12-03 23:40 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, Andrew Morton, linux-mm, cgroups, linux-kernel

On Tue, 3 Dec 2013, Johannes Weiner wrote:

> > > Spin on which level? The whole point of this change was to not spin for
> > > ever because the caller might sit on top of other locks which might
> > > prevent somebody else to die although it has been killed.
> > 
> > See my question about the non-memcg page allocator behavior below.
> 
> No, please answer the question.
> 

The question would be answered below, by having consistency in allocation 
and charging paths between both the page allocator and memcg.

> > I'm not quite sure how significant of a point this is, though, because it 
> > depends on the caller doing the __GFP_NOFAIL allocations that allow the 
> > bypass.  If you're doing
> > 
> > 	for (i = 0; i < 1 << 20; i++)
> > 		page[i] = alloc_page(GFP_NOFS | __GFP_NOFAIL);
> 
> Hyperbole serves no one.
> 

Since this bypasses all charges to the root memcg in oom conditions as a 
result of your patch, how do you ensure the "leakage" is contained to a 
small amount of memory?  Are we currently just trusting the users of 
__GFP_NOFAIL that they aren't allocating a large amount of memory?

> > I'm referring to the generic non-memcg page allocator behavior.  Forget 
> > memcg for a moment.  What is the behavior in the _page_allocator_ for 
> > GFP_NOFS | __GFP_NOFAIL?  Do we spin forever if reclaim fails or do we 
> > bypas the per-zone min watermarks to allow it to allocate because "it 
> > needs to succeed, it may be holding filesystem locks"?
> > 
> > It's already been acknowledged in this thread that no bypassing is done 
> > in the page allocator and it just spins.  There's some handwaving saying 
> > that since the entire system is oom that there is a greater chance that 
> > memory will be freed by something else, but that's just handwaving and is 
> > certainly no guaranteed.
> 
> Do you have another explanation of why this deadlock is not triggering
> in the global case?  It's pretty obvious that there is a deadlock that
> can not be resolved unless some unrelated task intervenes, just read
> __alloc_pages_slowpath().
> 
> But we had a concrete bug report for memcg where there was no other
> task to intervene.  One was stuck in the OOM killer waiting for the
> victim to exit, the victim was stuck on locks that the killer held.
> 

I believe the page allocator would be susceptible to the same deadlock if 
nothing else on the system can reclaim memory.  That belief comes from 
code inspection, which shows that __GFP_NOFAIL allocations are not 
guaranteed to ever succeed in the page allocator the way their charges 
now are (with your patch) guaranteed to succeed in memcg.  I do not have 
an example of such an incident.

> > So, my question again: why not bypass the per-zone min watermarks in the 
> > page allocator?
> 
> I don't even know what your argument is supposed to be.  The fact that
> we don't do it in the page allocator means that there can't be a bug
> in memcg?
> 

I'm asking if we should allow GFP_NOFS | __GFP_NOFAIL allocations in the 
page allocator to bypass per-zone min watermarks after reclaim has failed 
since the oom killer cannot be called in such a context so that the page 
allocator is not susceptible to the same deadlock without a complete 
depletion of memory reserves?

It's not an argument, it's a question.  Relax.


* Re: [patch] mm: memcg: do not declare OOM from __GFP_NOFAIL allocations
  2013-12-03 23:40                       ` David Rientjes
@ 2013-12-04  3:01                         ` Johannes Weiner
  2013-12-04  4:34                           ` Dave Chinner
  0 siblings, 1 reply; 17+ messages in thread
From: Johannes Weiner @ 2013-12-04  3:01 UTC (permalink / raw)
  To: David Rientjes
  Cc: Michal Hocko, Andrew Morton, linux-mm, cgroups, linux-kernel

On Tue, Dec 03, 2013 at 03:40:13PM -0800, David Rientjes wrote:
> On Tue, 3 Dec 2013, Johannes Weiner wrote:
> 
> > > > Spin on which level? The whole point of this change was to not spin for
> > > > ever because the caller might sit on top of other locks which might
> > > > prevent somebody else to die although it has been killed.
> > > 
> > > See my question about the non-memcg page allocator behavior below.
> > 
> > No, please answer the question.
> > 
> 
> The question would be answered below, by having consistency in allocation 
> and charging paths between both the page allocator and memcg.
> 
> > > I'm not quite sure how significant of a point this is, though, because it 
> > > depends on the caller doing the __GFP_NOFAIL allocations that allow the 
> > > bypass.  If you're doing
> > > 
> > > 	for (i = 0; i < 1 << 20; i++)
> > > 		page[i] = alloc_page(GFP_NOFS | __GFP_NOFAIL);
> > 
> > Hyperbole serves no one.
> > 
> 
> Since this bypasses all charges to the root memcg in oom conditions as a 
> result of your patch, how do you ensure the "leakage" is contained to a 
> small amount of memory?  Are we currently just trusting the users of 
> __GFP_NOFAIL that they aren't allocating a large amount of memory?

Yes, as answered in my first reply to you:

---

> Ah, this is because of 3168ecbe1c04 ("mm: memcg: use proper memcg in limit 
> bypass") which just bypasses all of these allocations and charges the root 
> memcg.  So if allocations want to bypass memcg isolation they just have to 
> be __GFP_NOFAIL?

I don't think we have another option.

---

Is there a specific reason you keep repeating the same questions?

> > > I'm referring to the generic non-memcg page allocator behavior.  Forget 
> > > memcg for a moment.  What is the behavior in the _page_allocator_ for 
> > > GFP_NOFS | __GFP_NOFAIL?  Do we spin forever if reclaim fails or do we 
> > > bypas the per-zone min watermarks to allow it to allocate because "it 
> > > needs to succeed, it may be holding filesystem locks"?
> > > 
> > > It's already been acknowledged in this thread that no bypassing is done 
> > > in the page allocator and it just spins.  There's some handwaving saying 
> > > that since the entire system is oom that there is a greater chance that 
> > > memory will be freed by something else, but that's just handwaving and is 
> > > certainly no guaranteed.
> > 
> > Do you have another explanation of why this deadlock is not triggering
> > in the global case?  It's pretty obvious that there is a deadlock that
> > can not be resolved unless some unrelated task intervenes, just read
> > __alloc_pages_slowpath().
> > 
> > But we had a concrete bug report for memcg where there was no other
> > task to intervene.  One was stuck in the OOM killer waiting for the
> > victim to exit, the victim was stuck on locks that the killer held.
> > 
> 
> I believe the page allocator would be susceptible to the same deadlock if 
> nothing else on the system can reclaim memory and that belief comes from 
> code inspection that shows __GFP_NOFAIL is not guaranteed to ever succeed 
> in the page allocator as their charges now are (with your patch) in memcg.  
> I do not have an example of such an incident.

Me neither.

> > > So, my question again: why not bypass the per-zone min watermarks in the 
> > > page allocator?
> > 
> > I don't even know what your argument is supposed to be.  The fact that
> > we don't do it in the page allocator means that there can't be a bug
> > in memcg?
> > 
> 
> I'm asking if we should allow GFP_NOFS | __GFP_NOFAIL allocations in the 
> page allocator to bypass per-zone min watermarks after reclaim has failed 
> since the oom killer cannot be called in such a context so that the page 
> allocator is not susceptible to the same deadlock without a complete 
> depletion of memory reserves?

Yes, I think so.

> It's not an argument, it's a question.  Relax.

Right.


* Re: [patch] mm: memcg: do not declare OOM from __GFP_NOFAIL allocations
  2013-12-04  3:01                         ` Johannes Weiner
@ 2013-12-04  4:34                           ` Dave Chinner
  2013-12-04  5:25                             ` Johannes Weiner
  0 siblings, 1 reply; 17+ messages in thread
From: Dave Chinner @ 2013-12-04  4:34 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David Rientjes, Michal Hocko, Andrew Morton, linux-mm, cgroups,
	linux-kernel

On Tue, Dec 03, 2013 at 10:01:01PM -0500, Johannes Weiner wrote:
> On Tue, Dec 03, 2013 at 03:40:13PM -0800, David Rientjes wrote:
> > On Tue, 3 Dec 2013, Johannes Weiner wrote:
> > I believe the page allocator would be susceptible to the same deadlock if 
> > nothing else on the system can reclaim memory and that belief comes from 
> > code inspection that shows __GFP_NOFAIL is not guaranteed to ever succeed 
> > in the page allocator as their charges now are (with your patch) in memcg.  
> > I do not have an example of such an incident.
> 
> Me neither.

Is this the sort of thing that you expect to see when GFP_NOFS |
GFP_NOFAIL type allocations continually fail?

http://oss.sgi.com/archives/xfs/2013-12/msg00095.html

XFS doesn't use GFP_NOFAIL; it does its own loop with GFP_NOWARN in
kmem_alloc(), so that if we get stuck for more than 100 attempts to
allocate it throws a warning, i.e. only when we really are stuck and
reclaim is not making any progress.
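
That loop looks roughly like this (3.12-era fs/xfs/kmem.c, abbreviated):

        void *
        kmem_alloc(size_t size, xfs_km_flags_t flags)
        {
                int     retries = 0;
                gfp_t   lflags = kmem_flags_convert(flags);     /* adds __GFP_NOWARN */
                void    *ptr;

                do {
                        ptr = kmalloc(size, lflags);
                        if (ptr || (flags & (KM_MAYFAIL|KM_NOSLEEP)))
                                return ptr;
                        if (!(++retries % 100))
                                xfs_err(NULL,
                                        "possible memory allocation deadlock in %s (mode:0x%x)",
                                        __func__, lflags);
                        congestion_wait(BLK_RW_ASYNC, HZ/50);
                } while (1);
        }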

This specific case is due to memory fragmentation preventing a 64k
memory allocation (due to the filesystem being configured with a 64k
directory block size), but GFP_NOFS | GFP_NOFAIL allocations happen
*all the time* in filesystems.

> > > > So, my question again: why not bypass the per-zone min watermarks in the 
> > > > page allocator?
> > > 
> > > I don't even know what your argument is supposed to be.  The fact that
> > > we don't do it in the page allocator means that there can't be a bug
> > > in memcg?
> > > 
> > 
> > I'm asking if we should allow GFP_NOFS | __GFP_NOFAIL allocations in the 
> > page allocator to bypass per-zone min watermarks after reclaim has failed 
> > since the oom killer cannot be called in such a context so that the page 
> > allocator is not susceptible to the same deadlock without a complete 
> > depletion of memory reserves?
> 
> Yes, I think so.

There be dragons. If memcgs deadlock in low memory conditions in
the presence of GFP_NOFS | GFP_NOFAIL allocations, then we need to
make the memcg reclaim design more robust, not work around it by
allowing filesystems to drain critical memory reserves needed for
other situations....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [patch] mm: memcg: do not declare OOM from __GFP_NOFAIL allocations
  2013-12-04  4:34                           ` Dave Chinner
@ 2013-12-04  5:25                             ` Johannes Weiner
  2013-12-04  6:10                               ` David Rientjes
  0 siblings, 1 reply; 17+ messages in thread
From: Johannes Weiner @ 2013-12-04  5:25 UTC (permalink / raw)
  To: Dave Chinner
  Cc: David Rientjes, Michal Hocko, Andrew Morton, linux-mm, cgroups,
	linux-kernel

On Wed, Dec 04, 2013 at 03:34:17PM +1100, Dave Chinner wrote:
> On Tue, Dec 03, 2013 at 10:01:01PM -0500, Johannes Weiner wrote:
> > On Tue, Dec 03, 2013 at 03:40:13PM -0800, David Rientjes wrote:
> > > On Tue, 3 Dec 2013, Johannes Weiner wrote:
> > > I believe the page allocator would be susceptible to the same deadlock if 
> > > nothing else on the system can reclaim memory and that belief comes from 
> > > code inspection that shows __GFP_NOFAIL is not guaranteed to ever succeed 
> > > in the page allocator as their charges now are (with your patch) in memcg.  
> > > I do not have an example of such an incident.
> > 
> > Me neither.
> 
> Is this the sort of thing that you expect to see when GFP_NOFS |
> GFP_NOFAIL type allocations continualy fail?
> 
> http://oss.sgi.com/archives/xfs/2013-12/msg00095.html
> 
> XFS doesn't use GFP_NOFAIL, it does it's own loop with GFP_NOWARN in
> kmem_alloc() so that if we get stuck for more than 100 attempts to
> allocate it throws a warning. i.e. only when we really are stuck and
> reclaim is not making any progress.
> 
> This specific case is due to memory fragmentation preventing a 64k
> memory allocation (due to the filesystem being configured with a 64k
> directory block size), but GFP_NOFS | GFP_NOFAIL allocations happen
> *all the time* in filesystems.

Yes, the question is whether this in itself is a practical problem,
regardless of whether you use __GFP_NOFAIL or a manual loop.

> > > > > So, my question again: why not bypass the per-zone min watermarks in the 
> > > > > page allocator?
> > > > 
> > > > I don't even know what your argument is supposed to be.  The fact that
> > > > we don't do it in the page allocator means that there can't be a bug
> > > > in memcg?
> > > > 
> > > 
> > > I'm asking if we should allow GFP_NOFS | __GFP_NOFAIL allocations in the 
> > > page allocator to bypass per-zone min watermarks after reclaim has failed 
> > > since the oom killer cannot be called in such a context so that the page 
> > > allocator is not susceptible to the same deadlock without a complete 
> > > depletion of memory reserves?
> > 
> > Yes, I think so.
> 
> There be dragons. If memcg's deadlock in low memory conditions in
> the presence of GFP_NOFS | GFP_NOFAIL allocations, then we need to
> make the memcg reclaim design more robust, not work around it by
> allowing filesystems to drain critical memory reserves needed for
> other situations....

The problems in the page allocator and memcg are entirely unrelated.
What we do in the memcg does not affect the page allocator and vice
versa.  However, they are problems of the same type, so we are trying
to find out whether both instances can have the same solution:

If GFP_NOFS | __GFP_NOFAIL allocations can not make forward progress
in direct reclaim, they are screwed: can't reclaim memory, can't OOM
kill, can't return NULL.  They are essentially stuck until a third
party intervenes.  This applies to both the page allocator and memcg.

In memcg, I fixed it by allowing the __GFP_NOFAIL task to bypass the
user-defined memory limit after reclaim fails.

David asks whether we should do the equivalent in the page allocator
and allow __GFP_NOFAIL allocations to dip into the emergency memory
reserves for the same reason.

I suggested that the situations are not entirely the same.  A memcg
might only have one or two tasks, so third-party intervention to
reduce memory usage in the memcg can range from unlikely to impossible,
whereas in the case of the page allocator, the likelihood of any task
in the system releasing or reclaiming memory is higher.

However, the GFP_NOFS | __GFP_NOFAIL task stuck in the page allocator
may hold filesystem locks that could prevent a third party from
freeing memory and/or exiting, so we cannot guarantee that only the
__GFP_NOFAIL task is getting stuck; it might well trap other tasks.
The same applies to open-coded GFP_NOFS allocation loops, of course,
unless they cycle the filesystem locks while looping.


* Re: [patch] mm: memcg: do not declare OOM from __GFP_NOFAIL allocations
  2013-12-04  5:25                             ` Johannes Weiner
@ 2013-12-04  6:10                               ` David Rientjes
  0 siblings, 0 replies; 17+ messages in thread
From: David Rientjes @ 2013-12-04  6:10 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Dave Chinner, Michal Hocko, Andrew Morton, linux-mm, cgroups,
	linux-kernel

On Wed, 4 Dec 2013, Johannes Weiner wrote:

> However, the GFP_NOFS | __GFP_NOFAIL task stuck in the page allocator
> may hold filesystem locks that could prevent a third party from
> freeing memory and/or exiting, so we can not guarantee that only the
> __GFP_NOFAIL task is getting stuck, it might well trap other tasks.
> The same applies to open-coded GFP_NOFS allocation loops of course
> unless they cycle the filesystem locks while looping.
> 

Yup.  I think we should do this:

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2631,6 +2631,11 @@ rebalance:
 						pages_reclaimed)) {
 		/* Wait for some write requests to complete then retry */
 		wait_iff_congested(preferred_zone, BLK_RW_ASYNC, HZ/50);
+
+		/* Allocations that cannot fail must allocate from somewhere */
+		if (gfp_mask & __GFP_NOFAIL)
+			alloc_flags |= ALLOC_HARDER;
+
 		goto rebalance;
 	} else {
 		/*

so that it gets the same behavior as GFP_ATOMIC and is allowed to allocate 
from memory reserves (although not enough to totally deplete memory).  We 
need to leave some memory reserves around in case another process with 
__GFP_FS invokes the oom killer and the victim needs memory to exit since 
the GFP_NOFS | __GFP_NOFAIL failure wasn't only because reclaim was 
limited due to !__GFP_FS.
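
For reference, the amount of reserve that ALLOC_HARDER grants is bounded
by the watermark check, roughly (from __zone_watermark_ok(), abbreviated;
the order handling is omitted):

        if (alloc_flags & ALLOC_HIGH)
                min -= min / 2;
        if (alloc_flags & ALLOC_HARDER)
                min -= min / 4;         /* dip a quarter below min, no further */

        if (free_pages <= min + z->lowmem_reserve[classzone_idx])
                return false;           /* full reserve access needs ALLOC_NO_WATERMARKS */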

The only downside of this is that it might become harder in the future to 
ever make a case for removing __GFP_NOFAIL entirely, since this changes 
the page allocator's behavior so that __GFP_NOFAIL is no longer 
equivalent to coding the retry directly in the caller.

On a tangent, GFP_NOWAIT | __GFP_NOFAIL and GFP_ATOMIC | __GFP_NOFAIL 
actually allow allocations to fail.  Nothing currently does that, but I 
wonder if we should do this for correctness:

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2535,17 +2535,19 @@ rebalance:
 		}
 	}
 
-	/* Atomic allocations - we can't balance anything */
-	if (!wait)
-		goto nopage;
-
-	/* Avoid recursion of direct reclaim */
-	if (current->flags & PF_MEMALLOC)
-		goto nopage;
-
-	/* Avoid allocations with no watermarks from looping endlessly */
-	if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOFAIL))
-		goto nopage;
+	if (likely(!(gfp_mask & __GFP_NOFAIL))) {
+		/* Atomic allocations - we can't balance anything */
+		if (!wait)
+			goto nopage;
+
+		/* Avoid recursion of direct reclaim */
+		if (current->flags & PF_MEMALLOC)
+			goto nopage;
+
+		/* Avoid allocations without watermarks from looping forever */
+		if (test_thread_flag(TIF_MEMDIE))
+			goto nopage;
+	}
 
 	/*
 	 * Try direct compaction. The first pass is asynchronous. Subsequent

It can be likely() because the __GFP_NOFAIL restart from the first patch 
above will likely now succeed since there's access to memory reserves and 
we never actually get here but once for __GFP_NOFAIL.  Thoughts on either 
patch?

