From: Johannes Weiner <hannes@cmpxchg.org>
To: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, torvalds@linux-foundation.org,
	akpm@linux-foundation.org, ying.huang@intel.com,
	aarcange@redhat.com, david@fromorbit.com, mhocko@suse.cz,
	tytso@mit.edu
Subject: Re: [patch 08/12] mm: page_alloc: wait for OOM killer progress before retrying
Date: Thu, 26 Mar 2015 07:24:45 -0400
Message-ID: <20150326112445.GC18560@cmpxchg.org>
In-Reply-To: <201503252315.FBJ09847.FSOtOJQFOMLFVH@I-love.SAKURA.ne.jp>

On Wed, Mar 25, 2015 at 11:15:48PM +0900, Tetsuo Handa wrote:
> Johannes Weiner wrote:
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 5cfda39b3268..e066ac7353a4 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -711,12 +711,15 @@ bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
> >  		killed = 1;
> >  	}
> >  out:
> > +	if (test_thread_flag(TIF_MEMDIE))
> > +		return true;
> >  	/*
> > -	 * Give the killed threads a good chance of exiting before trying to
> > -	 * allocate memory again.
> > +	 * Wait for any outstanding OOM victims to die.  In rare cases
> > +	 * victims can get stuck behind the allocating tasks, so the
> > +	 * wait needs to be bounded.  It's crude alright, but cheaper
> > +	 * than keeping a global dependency tree between all tasks.
> >  	 */
> > -	if (killed)
> > -		schedule_timeout_killable(1);
> > +	wait_event_timeout(oom_victims_wait, !atomic_read(&oom_victims), HZ);
> >  
> >  	return true;
> >  }
> 
> out_of_memory() returning true after a bounded wait effectively means
> waiting forever without choosing subsequent OOM victims when the first
> OOM victim fails to die. The system will lock up, won't it?

The OOM killer already refuses to choose another victim as long as the
first one hasn't exited, see oom_scan_process_thread().  That's why
later patches in this series introduce a reserve for OOM-killing tasks
and give nofail allocations access to emergency reserves, in case they
themselves prevent that single OOM victim from exiting.  But otherwise
victims should be exiting eventually.

> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index c1224ba45548..9ce9c4c083a0 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -2330,30 +2330,29 @@ void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...)
> >  }
> >  
> >  static inline struct page *
> > -__alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
> > +__alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, int alloc_flags,
> >  	const struct alloc_context *ac, unsigned long *did_some_progress)
> >  {
> > -	struct page *page;
> > +	struct page *page = NULL;
> >  
> >  	*did_some_progress = 0;
> >  
> >  	/*
> > -	 * Acquire the oom lock.  If that fails, somebody else is
> > -	 * making progress for us.
> > +	 * This allocating task can become the OOM victim itself at
> > +	 * any point before acquiring the lock.  In that case, exit
> > +	 * quickly and don't block on the lock held by another task
> > +	 * waiting for us to exit.
> >  	 */
> > -	if (!mutex_trylock(&oom_lock)) {
> > -		*did_some_progress = 1;
> > -		schedule_timeout_uninterruptible(1);
> > -		return NULL;
> > +	if (test_thread_flag(TIF_MEMDIE) || mutex_lock_killable(&oom_lock)) {
> > +		alloc_flags |= ALLOC_NO_WATERMARKS;
> > +		goto alloc;
> >  	}
> 
> When a thread group has 1000 threads and most of them are doing memory
> allocation requests, all of them will get fatal_signal_pending() == true
> when one of them is chosen by the OOM killer.
> This code will allow most of them to access memory reserves, won't it?

Ah, good point!  Only TIF_MEMDIE should get reserve access, not just
any dying thread.  Thanks, I'll fix it in v2.

> > @@ -2383,12 +2382,20 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
> >  		if (gfp_mask & __GFP_THISNODE)
> >  			goto out;
> >  	}
> > -	/* Exhausted what can be done so it's blamo time */
> > -	if (out_of_memory(ac->zonelist, gfp_mask, order, ac->nodemask, false)
> > -			|| WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL))
> > +
> > +	if (out_of_memory(ac->zonelist, gfp_mask, order, ac->nodemask, false)) {
> >  		*did_some_progress = 1;
> > +	} else {
> > +		/* Oops, these shouldn't happen with the OOM killer disabled */
> > +		if (WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL))
> > +			*did_some_progress = 1;
> > +	}
> 
> I think GFP_NOFAIL allocations need to involve the OOM killer rather
> than pretending that forward progress was made. If all in-flight
> allocation requests are GFP_NOFAIL, the system will lock up.

Hm?  They do involve the OOM killer, but once userspace is frozen for
suspend/hibernate we shouldn't kill and thaw random tasks anymore as
that might corrupt the memory snapshot, so nofail allocations are a
bug at this point.

> After all, if we wait for OOM killer progress before retrying, I think
> we should involve the OOM killer after some bounded timeout regardless
> of gfp flags, and let the OOM killer kill more threads after another
> bounded timeout. Otherwise, these corner cases will lock up the system.

Giving nofail allocations access to emergency reserves targets this
problem, but I agree with you that it's still possible for the system
to lock up once those reserves have been consumed and still no task
has made enough forward progress to release memory.  It is unlikely
but possible.

I will probably come back to the OOM victim timeout patch some time in
the future as that seems more robust.  It would also drastically
simplify memcg OOM handling.  But that patch was controversial in the
past and seemed beyond the scope of this patch set.
