linux-mm.kvack.org archive mirror
* [PATCH v2] mm/page_alloc: Wait for oom_lock before retrying.
@ 2017-07-16 10:59 Tetsuo Handa
  2017-07-17  8:56 ` Michal Hocko
  2017-07-17 15:24 ` Michal Hocko
  0 siblings, 2 replies; 7+ messages in thread
From: Tetsuo Handa @ 2017-07-16 10:59 UTC (permalink / raw)
  To: linux-mm, mhocko, hannes, rientjes; +Cc: linux-kernel, Tetsuo Handa

Since the memory reclaim path was never designed to handle scheduling
priority inversions, code locations that assume some code path will
eventually complete without using synchronization mechanisms can get
stuck (livelock), because CPU time is not guaranteed to be yielded to
the thread executing that code path.

mutex_trylock() in __alloc_pages_may_oom() (waiting for oom_lock) and
schedule_timeout_killable(1) in out_of_memory() (called with oom_lock
held) form one such location, and it was demonstrated using artificial
stressing that the system gets stuck effectively forever, because a
SCHED_IDLE priority thread is unable to resume execution at
schedule_timeout_killable(1) if a lot of !SCHED_IDLE priority threads
are wasting CPU time [1].

To solve this problem properly, a complete redesign and rewrite of the
whole memory reclaim path would be needed. But we are not going to
reimplement the whole stack (at least for the foreseeable future).

Thus, this patch works around the livelock by forcibly yielding enough
CPU time to the thread holding oom_lock, using the mutex_lock_killable()
mechanism, so that the OOM killer/reaper can use the CPU time yielded by
this patch. Of course, this patch does not help if the cause of the lack
of CPU time is somewhere else (e.g. a CPU-intensive computation running
with very high scheduling priority), but that is not the fault of this
patch. This patch only manages not to lock up when the cause of the lack
of CPU time is a direct reclaim storm wasting CPU time without making
any progress while waiting for oom_lock.

[1] http://lkml.kernel.org/r/201707142130.JJF10142.FHJFOQSOOtMVLF@I-love.SAKURA.ne.jp

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 mm/page_alloc.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 80e4adb..622ecbf 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3259,10 +3259,12 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
 	*did_some_progress = 0;
 
 	/*
-	 * Acquire the oom lock.  If that fails, somebody else is
-	 * making progress for us.
+	 * Acquire the oom lock. If that fails, somebody else should be making
+	 * progress for us. But if many threads are doing the same thing, the
+	 * owner of the oom lock can fail to make progress due to lack of CPU
+	 * time. Therefore, wait unless we get SIGKILL.
 	 */
-	if (!mutex_trylock(&oom_lock)) {
+	if (mutex_lock_killable(&oom_lock)) {
 		*did_some_progress = 1;
 		schedule_timeout_uninterruptible(1);
 		return NULL;
-- 
1.8.3.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: dont@kvack.org


* Re: [PATCH v2] mm/page_alloc: Wait for oom_lock before retrying.
  2017-07-16 10:59 [PATCH v2] mm/page_alloc: Wait for oom_lock before retrying Tetsuo Handa
@ 2017-07-17  8:56 ` Michal Hocko
  2017-07-17 13:50   ` Tetsuo Handa
  2017-07-17 15:24 ` Michal Hocko
  1 sibling, 1 reply; 7+ messages in thread
From: Michal Hocko @ 2017-07-17  8:56 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, hannes, rientjes, linux-kernel

On Sun 16-07-17 19:59:51, Tetsuo Handa wrote:
> Since the memory reclaim path was never designed to handle scheduling
> priority inversions, code locations that assume some code path will
> eventually complete without using synchronization mechanisms can get
> stuck (livelock), because CPU time is not guaranteed to be yielded to
> the thread executing that code path.
> 
> mutex_trylock() in __alloc_pages_may_oom() (waiting for oom_lock) and
> schedule_timeout_killable(1) in out_of_memory() (called with oom_lock
> held) form one such location, and it was demonstrated using artificial
> stressing that the system gets stuck effectively forever, because a
> SCHED_IDLE priority thread is unable to resume execution at
> schedule_timeout_killable(1) if a lot of !SCHED_IDLE priority threads
> are wasting CPU time [1].

I do not understand this. All the contending tasks will go and sleep for
1s. How can they preempt the lock holder?
-- 
Michal Hocko
SUSE Labs



* Re: [PATCH v2] mm/page_alloc: Wait for oom_lock before retrying.
  2017-07-17  8:56 ` Michal Hocko
@ 2017-07-17 13:50   ` Tetsuo Handa
  2017-07-17 15:15     ` Michal Hocko
  0 siblings, 1 reply; 7+ messages in thread
From: Tetsuo Handa @ 2017-07-17 13:50 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm, hannes, rientjes, linux-kernel

Michal Hocko wrote:
> On Sun 16-07-17 19:59:51, Tetsuo Handa wrote:
> > Since the memory reclaim path was never designed to handle scheduling
> > priority inversions, code locations that assume some code path will
> > eventually complete without using synchronization mechanisms can get
> > stuck (livelock), because CPU time is not guaranteed to be yielded to
> > the thread executing that code path.
> > 
> > mutex_trylock() in __alloc_pages_may_oom() (waiting for oom_lock) and
> > schedule_timeout_killable(1) in out_of_memory() (called with oom_lock
> > held) form one such location, and it was demonstrated using artificial
> > stressing that the system gets stuck effectively forever, because a
> > SCHED_IDLE priority thread is unable to resume execution at
> > schedule_timeout_killable(1) if a lot of !SCHED_IDLE priority threads
> > are wasting CPU time [1].
> 
> I do not understand this. All the contending tasks will go and sleep for
> 1s. How can they preempt the lock holder?

Not 1s. It sleeps for only 1 jiffy, which is 1ms if CONFIG_HZ=1000.

And 1ms may not be long enough to let the owner of oom_lock make
progress when there are many threads doing the same thing. I
demonstrated that a SCHED_IDLE oom_lock owner is completely defeated by
a bunch of !SCHED_IDLE contending threads.



* Re: [PATCH v2] mm/page_alloc: Wait for oom_lock before retrying.
  2017-07-17 13:50   ` Tetsuo Handa
@ 2017-07-17 15:15     ` Michal Hocko
  0 siblings, 0 replies; 7+ messages in thread
From: Michal Hocko @ 2017-07-17 15:15 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, hannes, rientjes, linux-kernel

On Mon 17-07-17 22:50:47, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Sun 16-07-17 19:59:51, Tetsuo Handa wrote:
> > > Since the memory reclaim path was never designed to handle scheduling
> > > priority inversions, code locations that assume some code path will
> > > eventually complete without using synchronization mechanisms can get
> > > stuck (livelock), because CPU time is not guaranteed to be yielded to
> > > the thread executing that code path.
> > > 
> > > mutex_trylock() in __alloc_pages_may_oom() (waiting for oom_lock) and
> > > schedule_timeout_killable(1) in out_of_memory() (called with oom_lock
> > > held) form one such location, and it was demonstrated using artificial
> > > stressing that the system gets stuck effectively forever, because a
> > > SCHED_IDLE priority thread is unable to resume execution at
> > > schedule_timeout_killable(1) if a lot of !SCHED_IDLE priority threads
> > > are wasting CPU time [1].
> > 
> > I do not understand this. All the contending tasks will go and sleep for
> > 1s. How can they preempt the lock holder?
> 
> Not 1s. It sleeps for only 1 jiffy, which is 1ms if CONFIG_HZ=1000.

Right, for some reason I read it as HZ. My bad!
-- 
Michal Hocko
SUSE Labs



* Re: [PATCH v2] mm/page_alloc: Wait for oom_lock before retrying.
  2017-07-16 10:59 [PATCH v2] mm/page_alloc: Wait for oom_lock before retrying Tetsuo Handa
  2017-07-17  8:56 ` Michal Hocko
@ 2017-07-17 15:24 ` Michal Hocko
  2017-07-17 21:42   ` Tetsuo Handa
  1 sibling, 1 reply; 7+ messages in thread
From: Michal Hocko @ 2017-07-17 15:24 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, hannes, rientjes, linux-kernel

On Sun 16-07-17 19:59:51, Tetsuo Handa wrote:
> Since the memory reclaim path was never designed to handle scheduling
> priority inversions, code locations that assume some code path will
> eventually complete without using synchronization mechanisms can get
> stuck (livelock), because CPU time is not guaranteed to be yielded to
> the thread executing that code path.
> 
> mutex_trylock() in __alloc_pages_may_oom() (waiting for oom_lock) and
> schedule_timeout_killable(1) in out_of_memory() (called with oom_lock
> held) form one such location, and it was demonstrated using artificial
> stressing that the system gets stuck effectively forever, because a
> SCHED_IDLE priority thread is unable to resume execution at
> schedule_timeout_killable(1) if a lot of !SCHED_IDLE priority threads
> are wasting CPU time [1].
> 
> To solve this problem properly, a complete redesign and rewrite of the
> whole memory reclaim path would be needed. But we are not going to
> reimplement the whole stack (at least for the foreseeable future).
> 
> Thus, this patch works around the livelock by forcibly yielding enough
> CPU time to the thread holding oom_lock, using the mutex_lock_killable()
> mechanism, so that the OOM killer/reaper can use the CPU time yielded by
> this patch. Of course, this patch does not help if the cause of the lack
> of CPU time is somewhere else (e.g. a CPU-intensive computation running
> with very high scheduling priority), but that is not the fault of this
> patch. This patch only manages not to lock up when the cause of the lack
> of CPU time is a direct reclaim storm wasting CPU time without making
> any progress while waiting for oom_lock.

I have to think about this some more. Hitting the oom_lock much more is
a problem while __oom_reap_task_mm still depends on the oom_lock. With
http://lkml.kernel.org/r/20170626130346.26314-1-mhocko@kernel.org it no
longer does.

Also this whole reasoning is a little bit dubious to me. The whole
reclaim stack might still preempt the holder of the lock, so you are
addressing only a very specific contention case where everybody hits
the oom. I suspect that a differently constructed testcase might result
in the same problem.
 

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH v2] mm/page_alloc: Wait for oom_lock before retrying.
  2017-07-17 15:24 ` Michal Hocko
@ 2017-07-17 21:42   ` Tetsuo Handa
  2017-07-18  9:08     ` Michal Hocko
  0 siblings, 1 reply; 7+ messages in thread
From: Tetsuo Handa @ 2017-07-17 21:42 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm, hannes, rientjes, linux-kernel

Michal Hocko wrote:
> On Sun 16-07-17 19:59:51, Tetsuo Handa wrote:
> > [...]
> 
> I have to think about this some more. Hitting the oom_lock much more
> is a problem while __oom_reap_task_mm still depends on the oom_lock.
> With http://lkml.kernel.org/r/20170626130346.26314-1-mhocko@kernel.org
> it no longer does.

I suggested preserving oom_lock serialization when setting MMF_OOM_SKIP
in reply to that post (unless we use some trick to force calling
get_page_from_freelist() after confirming that there is no
!MMF_OOM_SKIP mm).

> 
> Also this whole reasoning is a little bit dubious to me. The whole
> reclaim stack might still preempt the holder of the lock, so you are
> addressing only a very specific contention case where everybody hits
> the oom. I suspect that a differently constructed testcase might
> result in the same problem.

I think that direct reclaim/compaction is the primary source of CPU
time consumption, for there will be nothing more to do other than
get_page_from_freelist() and schedule_timeout_uninterruptible() if we
are waiting for somebody else to make progress using the OOM killer.
Thus, if we wait using mutex_lock_killable(), direct reclaim/compaction
will not be called (i.e. the rest of the whole reclaim stack will not
preempt the holder of the oom_lock) after each allocating thread fails
to acquire the oom_lock.



* Re: [PATCH v2] mm/page_alloc: Wait for oom_lock before retrying.
  2017-07-17 21:42   ` Tetsuo Handa
@ 2017-07-18  9:08     ` Michal Hocko
  0 siblings, 0 replies; 7+ messages in thread
From: Michal Hocko @ 2017-07-18  9:08 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, hannes, rientjes, linux-kernel

On Tue 18-07-17 06:42:31, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Sun 16-07-17 19:59:51, Tetsuo Handa wrote:
> > > [...]
> > 
> > I have to think about this some more. Hitting the oom_lock much
> > more is a problem while __oom_reap_task_mm still depends on the
> > oom_lock. With
> > http://lkml.kernel.org/r/20170626130346.26314-1-mhocko@kernel.org
> > it no longer does.
> 
> I suggested preserving oom_lock serialization when setting
> MMF_OOM_SKIP in reply to that post (unless we use some trick to force
> calling get_page_from_freelist() after confirming that there is no
> !MMF_OOM_SKIP mm).

If that is necessary, which I believe it is not, it should be discussed
in that email thread.
 
> > Also this whole reasoning is a little bit dubious to me. The whole
> > reclaim stack might still preempt the holder of the lock, so you
> > are addressing only a very specific contention case where everybody
> > hits the oom. I suspect that a differently constructed testcase
> > might result in the same problem.
> 
> I think that direct reclaim/compaction is the primary source of CPU
> time consumption, for there will be nothing more to do other than
> get_page_from_freelist() and schedule_timeout_uninterruptible() if we
> are waiting for somebody else to make progress using the OOM killer.
> Thus, if we wait using mutex_lock_killable(), direct reclaim/compaction
> will not be called (i.e. the rest of the whole reclaim stack will not
> preempt the holder of the oom_lock) after each allocating thread fails
> to acquire the oom_lock.

But you still assume that you are going to hit the oom path. That takes
some time because you have to go over all reclaim priorities, compaction
attempts and so on. Many allocation paths will eventually hit the oom
path, but many will simply keep retrying reclaim until they get there.

-- 
Michal Hocko
SUSE Labs



end of thread, other threads:[~2017-07-18  9:08 UTC | newest]

Thread overview: 7+ messages
2017-07-16 10:59 [PATCH v2] mm/page_alloc: Wait for oom_lock before retrying Tetsuo Handa
2017-07-17  8:56 ` Michal Hocko
2017-07-17 13:50   ` Tetsuo Handa
2017-07-17 15:15     ` Michal Hocko
2017-07-17 15:24 ` Michal Hocko
2017-07-17 21:42   ` Tetsuo Handa
2017-07-18  9:08     ` Michal Hocko
