linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 1/2] mm, vmscan: prevent kswapd livelock due to pfmemalloc-throttled process being killed
@ 2014-12-19 13:01 Vlastimil Babka
  2014-12-19 13:01 ` [PATCH 2/2] mm, vmscan: wake up all pfmemalloc-throttled processes at once Vlastimil Babka
  2014-12-19 15:57 ` [PATCH 1/2] mm, vmscan: prevent kswapd livelock due to pfmemalloc-throttled process being killed Michal Hocko
  0 siblings, 2 replies; 12+ messages in thread
From: Vlastimil Babka @ 2014-12-19 13:01 UTC (permalink / raw)
  To: Andrew Morton, linux-mm
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Vlastimil Babka,
	stable, Mel Gorman, Johannes Weiner, Michal Hocko,
	Vladimir Davydov, Rik van Riel

Charles Shirron and Paul Cassella from Cray Inc have reported kswapd stuck
in a busy loop with nothing left to balance, but kswapd_try_to_sleep() failing
to sleep. Their analysis found the cause to be a combination of several
factors:

1. A process is waiting in throttle_direct_reclaim() on pgdat->pfmemalloc_wait

2. The process has been killed (by OOM in this case), but has not yet been
   scheduled to remove itself from the waitqueue and die.

3. kswapd checks for throttled processes in prepare_kswapd_sleep():

        if (waitqueue_active(&pgdat->pfmemalloc_wait))
                wake_up(&pgdat->pfmemalloc_wait);

   However, for a process that was already killed, wake_up() does not remove
   the process from the waitqueue, since try_to_wake_up() checks its state
   first and returns false when the process is no longer waiting.

4. kswapd is running on the same CPU as the only CPU that the process is
   allowed to run on (through cpus_allowed, or possibly single-cpu system).

5. CONFIG_PREEMPT_NONE=y kernel is used. If there's nothing to balance, kswapd
   encounters no voluntary preemption points and repeatedly fails
   prepare_kswapd_sleep(), blocking the process from running and removing
   itself from the waitqueue, which would let kswapd sleep.

This patch fixes the issue by having kswapd call schedule() in situations
where it has tried to wake up a throttled process, but the wait queue is still
active. We have to be careful to do this outside of the prepare_to_wait() -
finish_wait() scope in kswapd_try_to_sleep().

Although it would be sufficient to limit the check to !PREEMPT configurations
to prevent the bug, even with preemption enabled it's better to schedule
immediately than to busy-loop until kswapd runs out of its CPU quantum.

Also we replace wake_up() with wake_up_all(), since it's more efficient than
to loop and wake processes one by one (now also rescheduling between each
iteration). Also update the comment prepare_kswapd_sleep() to hopefully more
clearly describe the races it is preventing.

Fixes: 5515061d22f0 ("mm: throttle direct reclaimers if PF_MEMALLOC reserves
                      are low and swap is backed by network storage")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>   # v3.6+
Cc: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Rik van Riel <riel@redhat.com>
---
I've CC'd also Peter and Ingo, since I'm not sure if the wait queue behavior
here is a feature of a bug. Could there be more code that relies on wake_up()
removing process from the wait queue immediately?

 mm/vmscan.c | 41 ++++++++++++++++++++++++++++++-----------
 1 file changed, 30 insertions(+), 11 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index bd9a72b..162c3f8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2914,23 +2914,28 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx)
  * Returns true if kswapd is ready to sleep
  */
 static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
-					int classzone_idx)
+				int classzone_idx, bool *woke_pfmemalloc)
 {
 	/* If a direct reclaimer woke kswapd within HZ/10, it's premature */
 	if (remaining)
 		return false;
 
 	/*
-	 * There is a potential race between when kswapd checks its watermarks
-	 * and a process gets throttled. There is also a potential race if
-	 * processes get throttled, kswapd wakes, a large process exits therby
-	 * balancing the zones that causes kswapd to miss a wakeup. If kswapd
-	 * is going to sleep, no process should be sleeping on pfmemalloc_wait
-	 * so wake them now if necessary. If necessary, processes will wake
-	 * kswapd and get throttled again
+	 * The throttled processes are normally woken up in balance_pgdat() as
+	 * soon as pfmemalloc_watermark_ok() is true. But there is a potential
+	 * race between when kswapd checks the watermarks and a process gets
+	 * throttled. There is also a potential race if processes get
+	 * throttled, kswapd wakes, a large process exits thereby balancing the
+	 * zones, which causes kswapd to exit balance_pgdat() before reaching
+	 * the wake up checks. If kswapd is going to sleep, no process should
+	 * be sleeping on pfmemalloc_wait so wake them now if necessary. If the
+	 * wake up is premature, processes will wake kswapd and get throttled
+	 * again. Since each party wakes the other party in its own
+	 * prepare_to_wait() scope, we do not miss a wake up.
 	 */
 	if (waitqueue_active(&pgdat->pfmemalloc_wait)) {
-		wake_up(&pgdat->pfmemalloc_wait);
+		wake_up_all(&pgdat->pfmemalloc_wait);
+		*woke_pfmemalloc = true;
 		return false;
 	}
 
@@ -3220,6 +3225,7 @@ out:
 static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
 {
 	long remaining = 0;
+	bool woke_pfmemalloc = false;
 	DEFINE_WAIT(wait);
 
 	if (freezing(current) || kthread_should_stop())
@@ -3228,7 +3234,8 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
 	prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
 
 	/* Try to sleep for a short interval */
-	if (prepare_kswapd_sleep(pgdat, order, remaining, classzone_idx)) {
+	if (prepare_kswapd_sleep(pgdat, order, remaining, classzone_idx,
+							&woke_pfmemalloc)) {
 		remaining = schedule_timeout(HZ/10);
 		finish_wait(&pgdat->kswapd_wait, &wait);
 		prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
@@ -3238,7 +3245,8 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
 	 * After a short sleep, check if it was a premature sleep. If not, then
 	 * go fully to sleep until explicitly woken up.
 	 */
-	if (prepare_kswapd_sleep(pgdat, order, remaining, classzone_idx)) {
+	if (prepare_kswapd_sleep(pgdat, order, remaining, classzone_idx,
+							&woke_pfmemalloc)) {
 		trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
 
 		/*
@@ -3270,6 +3278,17 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
 			count_vm_event(KSWAPD_HIGH_WMARK_HIT_QUICKLY);
 	}
 	finish_wait(&pgdat->kswapd_wait, &wait);
+
+	/*
+	 * If we tried to wake up processes in prepare_kswapd_sleep(), it's
+	 * possible that some process was already killed, in which case the
+	 * wake_up() does not remove it from the wait queue. It needs to have
+	 * a chance to run, but it might be constrained to the CPU we are
+	 * now using. This is fatal on non-preemptive systems, unless we
+	 * schedule.
+	 */
+	if (woke_pfmemalloc && waitqueue_active(&pgdat->pfmemalloc_wait))
+		schedule();
 }
 
 /*
-- 
2.1.2


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 2/2] mm, vmscan: wake up all pfmemalloc-throttled processes at once
  2014-12-19 13:01 [PATCH 1/2] mm, vmscan: prevent kswapd livelock due to pfmemalloc-throttled process being killed Vlastimil Babka
@ 2014-12-19 13:01 ` Vlastimil Babka
  2014-12-19 15:57 ` [PATCH 1/2] mm, vmscan: prevent kswapd livelock due to pfmemalloc-throttled process being killed Michal Hocko
  1 sibling, 0 replies; 12+ messages in thread
From: Vlastimil Babka @ 2014-12-19 13:01 UTC (permalink / raw)
  To: Andrew Morton, linux-mm
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Vlastimil Babka,
	Mel Gorman, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	Rik van Riel, stable

Kswapd in balance_pgdate() currently uses wake_up() on processes waiting in
throttle_direct_reclaim(), which only wakes up a single process. This might
leave processes waiting for longer than necessary, until the check is reached
in the next loop iteration. Processes might also be left waiting if zone was
fully balanced in single iteration. Note that the comment in balance_pgdat()
also says "Wake them", so waking up a single process does not seem intentional.

Thus, replace wake_up() with wake_up_all().

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Rik van Riel <riel@redhat.com>
---
 mm/vmscan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 162c3f8..60b999c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3178,7 +3178,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 		 */
 		if (waitqueue_active(&pgdat->pfmemalloc_wait) &&
 				pfmemalloc_watermark_ok(pgdat))
-			wake_up(&pgdat->pfmemalloc_wait);
+			wake_up_all(&pgdat->pfmemalloc_wait);
 
 		/*
 		 * Fragmentation may mean that the system cannot be rebalanced
-- 
2.1.2


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/2] mm, vmscan: prevent kswapd livelock due to pfmemalloc-throttled process being killed
  2014-12-19 13:01 [PATCH 1/2] mm, vmscan: prevent kswapd livelock due to pfmemalloc-throttled process being killed Vlastimil Babka
  2014-12-19 13:01 ` [PATCH 2/2] mm, vmscan: wake up all pfmemalloc-throttled processes at once Vlastimil Babka
@ 2014-12-19 15:57 ` Michal Hocko
  2014-12-19 18:28   ` Vladimir Davydov
  1 sibling, 1 reply; 12+ messages in thread
From: Michal Hocko @ 2014-12-19 15:57 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, linux-mm, linux-kernel, Ingo Molnar,
	Peter Zijlstra, stable, Mel Gorman, Johannes Weiner,
	Vladimir Davydov, Rik van Riel

On Fri 19-12-14 14:01:55, Vlastimil Babka wrote:
> Charles Shirron and Paul Cassella from Cray Inc have reported kswapd stuck
> in a busy loop with nothing left to balance, but kswapd_try_to_sleep() failing
> to sleep. Their analysis found the cause to be a combination of several
> factors:
> 
> 1. A process is waiting in throttle_direct_reclaim() on pgdat->pfmemalloc_wait
> 
> 2. The process has been killed (by OOM in this case), but has not yet been
>    scheduled to remove itself from the waitqueue and die.

pfmemalloc_wait is used as wait_event and that one uses
autoremove_wake_function for wake ups so the task shouldn't stay on the
queue if it was woken up. Moreover pfmemalloc_wait sleeps are killable
by the OOM killer AFAICS.

$ git grep "wait_event.*pfmemalloc_wait"
mm/vmscan.c:
wait_event_interruptible_timeout(pgdat->pfmemalloc_wait,
mm/vmscan.c:    wait_event_killable(zone->zone_pgdat->pfmemalloc_wait,))

So OOM killer would wake it up already and kswapd shouldn't see this
task on the waitqueue anymore.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/2] mm, vmscan: prevent kswapd livelock due to pfmemalloc-throttled process being killed
  2014-12-19 15:57 ` [PATCH 1/2] mm, vmscan: prevent kswapd livelock due to pfmemalloc-throttled process being killed Michal Hocko
@ 2014-12-19 18:28   ` Vladimir Davydov
  2014-12-19 23:05     ` Vlastimil Babka
  2014-12-20 10:47     ` Michal Hocko
  0 siblings, 2 replies; 12+ messages in thread
From: Vladimir Davydov @ 2014-12-19 18:28 UTC (permalink / raw)
  To: Michal Hocko, Vlastimil Babka
  Cc: Andrew Morton, linux-mm, linux-kernel, Ingo Molnar,
	Peter Zijlstra, stable, Mel Gorman, Johannes Weiner,
	Rik van Riel

Hi,

On Fri, Dec 19, 2014 at 04:57:47PM +0100, Michal Hocko wrote:
> On Fri 19-12-14 14:01:55, Vlastimil Babka wrote:
> > Charles Shirron and Paul Cassella from Cray Inc have reported kswapd stuck
> > in a busy loop with nothing left to balance, but kswapd_try_to_sleep() failing
> > to sleep. Their analysis found the cause to be a combination of several
> > factors:
> > 
> > 1. A process is waiting in throttle_direct_reclaim() on pgdat->pfmemalloc_wait
> > 
> > 2. The process has been killed (by OOM in this case), but has not yet been
> >    scheduled to remove itself from the waitqueue and die.
> 
> pfmemalloc_wait is used as wait_event and that one uses
> autoremove_wake_function for wake ups so the task shouldn't stay on the
> queue if it was woken up. Moreover pfmemalloc_wait sleeps are killable
> by the OOM killer AFAICS.
> 
> $ git grep "wait_event.*pfmemalloc_wait"
> mm/vmscan.c:
> wait_event_interruptible_timeout(pgdat->pfmemalloc_wait,
> mm/vmscan.c:    wait_event_killable(zone->zone_pgdat->pfmemalloc_wait,))
> 
> So OOM killer would wake it up already and kswapd shouldn't see this
> task on the waitqueue anymore.

OOM killer will wake up the process, but it won't remove it from the
pfmemalloc_wait queue. Therefore, if kswapd gets scheduled before the
dying process, it will see the wait queue being still active, but won't
be able to wake anyone up, because the waiting process has already been
woken by SIGKILL. I think this is what Vlastimil means.

So AFAIU the problem does exist. However, I think it could be fixed by
simply waking up all processes waiting on pfmemalloc_wait before putting
kswapd to sleep:

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 744e2b491527..2a123634c220 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2984,6 +2984,9 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
 	if (remaining)
 		return false;
 
+	if (!pgdat_balanced(pgdat, order, classzone_idx))
+		return false;
+
 	/*
 	 * There is a potential race between when kswapd checks its watermarks
 	 * and a process gets throttled. There is also a potential race if
@@ -2993,12 +2996,9 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
 	 * so wake them now if necessary. If necessary, processes will wake
 	 * kswapd and get throttled again
 	 */
-	if (waitqueue_active(&pgdat->pfmemalloc_wait)) {
-		wake_up(&pgdat->pfmemalloc_wait);
-		return false;
-	}
+	wake_up_all(&pgdat->pfmemalloc_wait);
 
-	return pgdat_balanced(pgdat, order, classzone_idx);
+	return true;
 }
 
 /*

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/2] mm, vmscan: prevent kswapd livelock due to pfmemalloc-throttled process being killed
  2014-12-19 18:28   ` Vladimir Davydov
@ 2014-12-19 23:05     ` Vlastimil Babka
  2014-12-20  9:24       ` Vladimir Davydov
  2014-12-20 10:47     ` Michal Hocko
  1 sibling, 1 reply; 12+ messages in thread
From: Vlastimil Babka @ 2014-12-19 23:05 UTC (permalink / raw)
  To: Vladimir Davydov, Michal Hocko
  Cc: Andrew Morton, linux-mm, linux-kernel, Ingo Molnar,
	Peter Zijlstra, stable, Mel Gorman, Johannes Weiner,
	Rik van Riel

On 19.12.2014 19:28, Vladimir Davydov wrote:
> Hi,
>
> On Fri, Dec 19, 2014 at 04:57:47PM +0100, Michal Hocko wrote:
>> On Fri 19-12-14 14:01:55, Vlastimil Babka wrote:
>>> Charles Shirron and Paul Cassella from Cray Inc have reported kswapd stuck
>>> in a busy loop with nothing left to balance, but kswapd_try_to_sleep() failing
>>> to sleep. Their analysis found the cause to be a combination of several
>>> factors:
>>>
>>> 1. A process is waiting in throttle_direct_reclaim() on pgdat->pfmemalloc_wait
>>>
>>> 2. The process has been killed (by OOM in this case), but has not yet been
>>>     scheduled to remove itself from the waitqueue and die.
>> pfmemalloc_wait is used as wait_event and that one uses
>> autoremove_wake_function for wake ups so the task shouldn't stay on the
>> queue if it was woken up. Moreover pfmemalloc_wait sleeps are killable
>> by the OOM killer AFAICS.
>>
>> $ git grep "wait_event.*pfmemalloc_wait"
>> mm/vmscan.c:
>> wait_event_interruptible_timeout(pgdat->pfmemalloc_wait,
>> mm/vmscan.c:    wait_event_killable(zone->zone_pgdat->pfmemalloc_wait,))
>>
>> So OOM killer would wake it up already and kswapd shouldn't see this
>> task on the waitqueue anymore.
> OOM killer will wake up the process, but it won't remove it from the
> pfmemalloc_wait queue. Therefore, if kswapd gets scheduled before the
> dying process, it will see the wait queue being still active, but won't
> be able to wake anyone up, because the waiting process has already been
> woken by SIGKILL. I think this is what Vlastimil means.

Yes, that's exactly what I think happens.

> So AFAIU the problem does exist. However, I think it could be fixed by
> simply waking up all processes waiting on pfmemalloc_wait before putting
> kswapd to sleep:

Hm I don't see how it helps? If any of the waiting processes were killed
and wants to run on kswapd's CPU to remove itself from the waitqueue,
it will still remain on the waitqueue, no?

> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 744e2b491527..2a123634c220 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2984,6 +2984,9 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
>   	if (remaining)
>   		return false;
>   
> +	if (!pgdat_balanced(pgdat, order, classzone_idx))
> +		return false;
> +
>   	/*
>   	 * There is a potential race between when kswapd checks its watermarks
>   	 * and a process gets throttled. There is also a potential race if
> @@ -2993,12 +2996,9 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
>   	 * so wake them now if necessary. If necessary, processes will wake
>   	 * kswapd and get throttled again
>   	 */
> -	if (waitqueue_active(&pgdat->pfmemalloc_wait)) {
> -		wake_up(&pgdat->pfmemalloc_wait);
> -		return false;
> -	}
> +	wake_up_all(&pgdat->pfmemalloc_wait);
>   
> -	return pgdat_balanced(pgdat, order, classzone_idx);
> +	return true;
>   }
>   
>   /*
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


---
This email has been checked for viruses by Avast antivirus software.
http://www.avast.com


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/2] mm, vmscan: prevent kswapd livelock due to pfmemalloc-throttled process being killed
  2014-12-19 23:05     ` Vlastimil Babka
@ 2014-12-20  9:24       ` Vladimir Davydov
  0 siblings, 0 replies; 12+ messages in thread
From: Vladimir Davydov @ 2014-12-20  9:24 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Michal Hocko, Andrew Morton, linux-mm, linux-kernel, Ingo Molnar,
	Peter Zijlstra, stable, Mel Gorman, Johannes Weiner,
	Rik van Riel

On Sat, Dec 20, 2014 at 12:05:58AM +0100, Vlastimil Babka wrote:
> On 19.12.2014 19:28, Vladimir Davydov wrote:
> >So AFAIU the problem does exist. However, I think it could be fixed by
> >simply waking up all processes waiting on pfmemalloc_wait before putting
> >kswapd to sleep:
> 
> Hm I don't see how it helps? If any of the waiting processes were killed
> and wants to run on kswapd's CPU to remove itself from the waitqueue,
> it will still remain on the waitqueue, no?

Yes, but do we really want all waiting processes to be removed from the
wait queue? AFAIU we just want them to be awake before putting kswapd to
sleep. If there's a process killed (and therefore woken) by the OOM
killer left on the wait queue after we called wake_up_all, it will see
pgdat_balanced=true as soon as it gets scheduled and pass away quickly.
All we have to do is drop the waitqueue_active check from kswapd.

Thanks,
Vladimir

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/2] mm, vmscan: prevent kswapd livelock due to pfmemalloc-throttled process being killed
  2014-12-19 18:28   ` Vladimir Davydov
  2014-12-19 23:05     ` Vlastimil Babka
@ 2014-12-20 10:47     ` Michal Hocko
  2014-12-20 14:18       ` Vladimir Davydov
  1 sibling, 1 reply; 12+ messages in thread
From: Michal Hocko @ 2014-12-20 10:47 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Vlastimil Babka, Andrew Morton, linux-mm, linux-kernel,
	Ingo Molnar, Peter Zijlstra, stable, Mel Gorman, Johannes Weiner,
	Rik van Riel

On Fri 19-12-14 21:28:15, Vladimir Davydov wrote:
> Hi,
> 
> On Fri, Dec 19, 2014 at 04:57:47PM +0100, Michal Hocko wrote:
> > On Fri 19-12-14 14:01:55, Vlastimil Babka wrote:
> > > Charles Shirron and Paul Cassella from Cray Inc have reported kswapd stuck
> > > in a busy loop with nothing left to balance, but kswapd_try_to_sleep() failing
> > > to sleep. Their analysis found the cause to be a combination of several
> > > factors:
> > > 
> > > 1. A process is waiting in throttle_direct_reclaim() on pgdat->pfmemalloc_wait
> > > 
> > > 2. The process has been killed (by OOM in this case), but has not yet been
> > >    scheduled to remove itself from the waitqueue and die.
> > 
> > pfmemalloc_wait is used as wait_event and that one uses
> > autoremove_wake_function for wake ups so the task shouldn't stay on the
> > queue if it was woken up. Moreover pfmemalloc_wait sleeps are killable
> > by the OOM killer AFAICS.
> > 
> > $ git grep "wait_event.*pfmemalloc_wait"
> > mm/vmscan.c:
> > wait_event_interruptible_timeout(pgdat->pfmemalloc_wait,
> > mm/vmscan.c:    wait_event_killable(zone->zone_pgdat->pfmemalloc_wait,))
> > 
> > So OOM killer would wake it up already and kswapd shouldn't see this
> > task on the waitqueue anymore.
> 
> OOM killer will wake up the process, but it won't remove it from the
> pfmemalloc_wait queue. Therefore, if kswapd gets scheduled before the
> dying process, it will see the wait queue being still active, but won't
> be able to wake anyone up, because the waiting process has already been
> woken by SIGKILL. I think this is what Vlastimil means.

OK, I see the point now. I didn't realize that autoremove_wake_function
doesn't remove the waiter from the queue if the state doesn't change.

> So AFAIU the problem does exist. However, I think it could be fixed by
> simply waking up all processes waiting on pfmemalloc_wait before putting
> kswapd to sleep:

I think that a simple cond_resched() in kswapd_try_to_sleep should be
sufficient and less risky fix, so basically what Vlastimil was proposing
in the beginning.

> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 744e2b491527..2a123634c220 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2984,6 +2984,9 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
>  	if (remaining)
>  		return false;
>  
> +	if (!pgdat_balanced(pgdat, order, classzone_idx))
> +		return false;
> +

What would be consequences of not waking up pfmemalloc waiters while the
node is not balanced?

>  	/*
>  	 * There is a potential race between when kswapd checks its watermarks
>  	 * and a process gets throttled. There is also a potential race if
> @@ -2993,12 +2996,9 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
>  	 * so wake them now if necessary. If necessary, processes will wake
>  	 * kswapd and get throttled again
>  	 */
> -	if (waitqueue_active(&pgdat->pfmemalloc_wait)) {
> -		wake_up(&pgdat->pfmemalloc_wait);
> -		return false;
> -	}
> +	wake_up_all(&pgdat->pfmemalloc_wait);
>  
> -	return pgdat_balanced(pgdat, order, classzone_idx);
> +	return true;
>  }
>  
>  /*

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/2] mm, vmscan: prevent kswapd livelock due to pfmemalloc-throttled process being killed
  2014-12-20 10:47     ` Michal Hocko
@ 2014-12-20 14:18       ` Vladimir Davydov
  2014-12-22 14:24         ` Michal Hocko
  0 siblings, 1 reply; 12+ messages in thread
From: Vladimir Davydov @ 2014-12-20 14:18 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vlastimil Babka, Andrew Morton, linux-mm, linux-kernel,
	Ingo Molnar, Peter Zijlstra, stable, Mel Gorman, Johannes Weiner,
	Rik van Riel

On Sat, Dec 20, 2014 at 11:47:46AM +0100, Michal Hocko wrote:
> On Fri 19-12-14 21:28:15, Vladimir Davydov wrote:
> > So AFAIU the problem does exist. However, I think it could be fixed by
> > simply waking up all processes waiting on pfmemalloc_wait before putting
> > kswapd to sleep:
> 
> I think that a simple cond_resched() in kswapd_try_to_sleep should be
> sufficient and less risky fix, so basically what Vlastimil was proposing
> in the beginning.

With such a solution we implicitly rely upon the scheduler
implementation, which AFAIU is wrong. E.g. suppose processes are
governed by FIFO and kswapd happens to have a higher prio than the
process killed by OOM. Then after cond_resched kswapd will be picked for
execution again, and the killing process won't have a chance to remove
itself from the wait queue.

> 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 744e2b491527..2a123634c220 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2984,6 +2984,9 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
> >  	if (remaining)
> >  		return false;
> >  
> > +	if (!pgdat_balanced(pgdat, order, classzone_idx))
> > +		return false;
> > +
> 
> What would be consequences of not waking up pfmemalloc waiters while the
> node is not balanced?

They will get woken up a bit later in balanced_pgdat. This might result
in latency spikes though. In order not to change the original behaviour
we could always wake all pfmemalloc waiters no matter if we are going to
sleep or not:

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 744e2b491527..a21e0bd563c3 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2993,10 +2993,7 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
 	 * so wake them now if necessary. If necessary, processes will wake
 	 * kswapd and get throttled again
 	 */
-	if (waitqueue_active(&pgdat->pfmemalloc_wait)) {
-		wake_up(&pgdat->pfmemalloc_wait);
-		return false;
-	}
+	wake_up_all(&pgdat->pfmemalloc_wait);
 
 	return pgdat_balanced(pgdat, order, classzone_idx);
 }

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/2] mm, vmscan: prevent kswapd livelock due to pfmemalloc-throttled process being killed
  2014-12-20 14:18       ` Vladimir Davydov
@ 2014-12-22 14:24         ` Michal Hocko
  2014-12-22 16:25           ` Vladimir Davydov
  0 siblings, 1 reply; 12+ messages in thread
From: Michal Hocko @ 2014-12-22 14:24 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Vlastimil Babka, Andrew Morton, linux-mm, linux-kernel,
	Ingo Molnar, Peter Zijlstra, stable, Mel Gorman, Johannes Weiner,
	Rik van Riel

On Sat 20-12-14 17:18:24, Vladimir Davydov wrote:
> On Sat, Dec 20, 2014 at 11:47:46AM +0100, Michal Hocko wrote:
> > On Fri 19-12-14 21:28:15, Vladimir Davydov wrote:
> > > So AFAIU the problem does exist. However, I think it could be fixed by
> > > simply waking up all processes waiting on pfmemalloc_wait before putting
> > > kswapd to sleep:
> > 
> > I think that a simple cond_resched() in kswapd_try_to_sleep should be
> > sufficient and less risky fix, so basically what Vlastimil was proposing
> > in the beginning.
> 
> With such a solution we implicitly rely upon the scheduler
> implementation, which AFAIU is wrong.

But this is a scheduling problem, isn't it? !PREEMPT kernel with a
kernel thread looping without a scheduling point which prevents the
woken task to run due to cpu affinity...

> E.g. suppose processes are
> governed by FIFO and kswapd happens to have a higher prio than the
> process killed by OOM. Then after cond_resched kswapd will be picked for
> execution again, and the killing process won't have a chance to remove
> itself from the wait queue.

Except that kswapd runs as SCHED_NORMAL with 0 priority.

> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index 744e2b491527..2a123634c220 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -2984,6 +2984,9 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
> > >  	if (remaining)
> > >  		return false;
> > >  
> > > +	if (!pgdat_balanced(pgdat, order, classzone_idx))
> > > +		return false;
> > > +
> > 
> > What would be consequences of not waking up pfmemalloc waiters while the
> > node is not balanced?
> 
> They will get woken up a bit later in balanced_pgdat. This might result
> in latency spikes though. In order not to change the original behaviour
> we could always wake all pfmemalloc waiters no matter if we are going to
> sleep or not:
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 744e2b491527..a21e0bd563c3 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2993,10 +2993,7 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
>  	 * so wake them now if necessary. If necessary, processes will wake
>  	 * kswapd and get throttled again
>  	 */
> -	if (waitqueue_active(&pgdat->pfmemalloc_wait)) {
> -		wake_up(&pgdat->pfmemalloc_wait);
> -		return false;
> -	}
> +	wake_up_all(&pgdat->pfmemalloc_wait);
>  
>  	return pgdat_balanced(pgdat, order, classzone_idx);

So you are relying on scheduling points somewhere down the
balance_pgdat. That should be sufficient. I am still quite surprised
that we have an OOM victim still on the queue and balanced pgdat here
because OOM victim didn't have chance to free memory. So somebody else
must have released a lot of memory after OOM.

This patch seems better than the one from Vlastimil. Care to post it
with the full changelog, please?

Thanks!

>  }

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/2] mm, vmscan: prevent kswapd livelock due to pfmemalloc-throttled process being killed
  2014-12-22 14:24         ` Michal Hocko
@ 2014-12-22 16:25           ` Vladimir Davydov
  2014-12-22 19:33             ` Vlastimil Babka
  2014-12-23 10:16             ` Michal Hocko
  0 siblings, 2 replies; 12+ messages in thread
From: Vladimir Davydov @ 2014-12-22 16:25 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vlastimil Babka, Andrew Morton, linux-mm, linux-kernel,
	Ingo Molnar, Peter Zijlstra, stable, Mel Gorman, Johannes Weiner,
	Rik van Riel

On Mon, Dec 22, 2014 at 03:24:35PM +0100, Michal Hocko wrote:
> On Sat 20-12-14 17:18:24, Vladimir Davydov wrote:
> > On Sat, Dec 20, 2014 at 11:47:46AM +0100, Michal Hocko wrote:
> > > On Fri 19-12-14 21:28:15, Vladimir Davydov wrote:
> > > > So AFAIU the problem does exist. However, I think it could be fixed by
> > > > simply waking up all processes waiting on pfmemalloc_wait before putting
> > > > kswapd to sleep:
> > > 
> > > I think that a simple cond_resched() in kswapd_try_to_sleep should be
> > > sufficient and less risky fix, so basically what Vlastimil was proposing
> > > in the beginning.
> > 
> > With such a solution we implicitly rely upon the scheduler
> > implementation, which AFAIU is wrong.
> 
> But this is a scheduling problem, isn't it? !PREEMPT kernel with a
> kernel thread looping without a scheduling point which prevents the
> woken task to run due to cpu affinity...

I wouldn't say so. To me it looks more like an abuse of the workqueue
API. AFAIU an abstract scheduler algorithm isn't supposed to guarantee
that an arbitrary process will ever get scheduled unless the CPU is
idle. Of course, my example below looks contrived. Nobody will raise
kswapd priority for it to preempt other processes. However, IMO we
shouldn't rely on that in the mm subsys, which is rather orthogonal to
the sched.

> 
> > E.g. suppose processes are
> > governed by FIFO and kswapd happens to have a higher prio than the
> > process killed by OOM. Then after cond_resched kswapd will be picked for
> > execution again, and the killing process won't have a chance to remove
> > itself from the wait queue.
> 
> Except that kswapd runs as SCHED_NORMAL with 0 priority.
> 
> > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > index 744e2b491527..2a123634c220 100644
> > > > --- a/mm/vmscan.c
> > > > +++ b/mm/vmscan.c
> > > > @@ -2984,6 +2984,9 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
> > > >  	if (remaining)
> > > >  		return false;
> > > >  
> > > > +	if (!pgdat_balanced(pgdat, order, classzone_idx))
> > > > +		return false;
> > > > +
> > > 
> > > What would be consequences of not waking up pfmemalloc waiters while the
> > > node is not balanced?
> > 
> > They will get woken up a bit later in balanced_pgdat. This might result
> > in latency spikes though. In order not to change the original behaviour
> > we could always wake all pfmemalloc waiters no matter if we are going to
> > sleep or not:
> > 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 744e2b491527..a21e0bd563c3 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2993,10 +2993,7 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
> >  	 * so wake them now if necessary. If necessary, processes will wake
> >  	 * kswapd and get throttled again
> >  	 */
> > -	if (waitqueue_active(&pgdat->pfmemalloc_wait)) {
> > -		wake_up(&pgdat->pfmemalloc_wait);
> > -		return false;
> > -	}
> > +	wake_up_all(&pgdat->pfmemalloc_wait);
> >  
> >  	return pgdat_balanced(pgdat, order, classzone_idx);
> 
> So you are relying on scheduling points somewhere down the
> balance_pgdat. That should be sufficient. I am still quite surprised
> that we have an OOM victim still on the queue and balanced pgdat here
> because OOM victim didn't have chance to free memory. So somebody else
> must have released a lot of memory after OOM.
> 
> This patch seems better than the one from Vlastimil. Care to post it
> with the full changelog, please?

Attached below (merged with 2/2). I haven't checked that it does fix the
issue, because I don't have the reproducer, so it should be committed
only if Vlastimil approves it.

From: Vlastimil Babka <vbabka@suse.cz>
Subject: [PATCH] mm, vmscan: prevent kswapd livelock due to
 pfmemalloc-throttled process being killed

Charles Shirron and Paul Cassella from Cray Inc have reported kswapd stuck
in a busy loop with nothing left to balance, but kswapd_try_to_sleep() failing
to sleep. Their analysis found the cause to be a combination of several
factors:

1. A process is waiting in throttle_direct_reclaim() on pgdat->pfmemalloc_wait

2. The process has been killed (by OOM in this case), but has not yet been
   scheduled to remove itself from the waitqueue and die.

3. kswapd checks for throttled processes in prepare_kswapd_sleep() and
   do not put itself to sleep if there are any:

        if (waitqueue_active(&pgdat->pfmemalloc_wait)) {
                wake_up(&pgdat->pfmemalloc_wait);
                return false;
        }

   However, for a process that was already killed, wake_up() does not remove
   the process from the waitqueue, since try_to_wake_up() checks its state
   first and returns false when the process is no longer waiting.

4. kswapd is running on the same CPU as the only CPU that the process is
   allowed to run on (through cpus_allowed, or possibly single-cpu system).

5. CONFIG_PREEMPT_NONE=y kernel is used. If there's nothing to balance, kswapd
   encounters no voluntary preemption points and repeatedly fails
   prepare_kswapd_sleep(), blocking the process from running and removing
   itself from the waitqueue, which would let kswapd sleep.

So, the source of the problem is that we prevent kswapd from going to
sleep until there are processes waiting on the pfmemalloc_wait queue,
and a process waiting on a queue is guaranteed to be removed from the
queue only when it gets scheduled. This was done to avoid the race
between kswapd checking pfmemalloc_wait and a process getting throttled
as the comment in prepare_kswapd_sleep() explains.

However, it isn't necessary to postpone kswapd sleep until the
pfmemalloc_wait queue empties. To eliminate the race, it's actually
enough to guarantee that all processes waiting on pfmemalloc_wait queue
have been woken up by the time we put kswapd to sleep.

This patch therefore fixes this issue by substituting 'wake_up' with
'wake_up_all' and removing 'return false' in the code snippet from
prepare_kswapd_sleep() above.

Also, it replaces wake_up with wake_up_all in balance_pgdat(), because:
 - using wake_up there might leave processes waiting for longer than
   necessary, until the check is reached in the next loop iteration;
 - processes might also be left waiting even if zone was fully balanced
   in single iteration;
 - the comment says "wake them" so the author of the commit that
   introduced pfmemalloc_wait seemed to mean wake_up_all;
 - this corresponds to how we wake processes waiting on pfmemalloc_wait
   in prepare_kswapd_sleep.

Fixes: 5515061d22f0 ("mm: throttle direct reclaimers if PF_MEMALLOC reserves
                      are low and swap is backed by network storage")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: <stable@vger.kernel.org>   # v3.6+
Cc: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
---
Changes in v2:
 - instead of introducing yet another schedule() point in
   kswapd_try_to_sleep(), allow kswapd to sleep even if the
   pfmemalloc_wait queue is active, waking *all* throttled
   processes before going to sleep

 mm/vmscan.c |    8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5e8772b2b9ef..65287944b2cf 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2961,10 +2961,8 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
 	 * so wake them now if necessary. If necessary, processes will wake
 	 * kswapd and get throttled again
 	 */
-	if (waitqueue_active(&pgdat->pfmemalloc_wait)) {
-		wake_up(&pgdat->pfmemalloc_wait);
-		return false;
-	}
+	if (waitqueue_active(&pgdat->pfmemalloc_wait))
+		wake_up_all(&pgdat->pfmemalloc_wait);
 
 	return pgdat_balanced(pgdat, order, classzone_idx);
 }
@@ -3205,7 +3203,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 		 */
 		if (waitqueue_active(&pgdat->pfmemalloc_wait) &&
 				pfmemalloc_watermark_ok(pgdat))
-			wake_up(&pgdat->pfmemalloc_wait);
+			wake_up_all(&pgdat->pfmemalloc_wait);
 
 		/*
 		 * Fragmentation may mean that the system cannot be rebalanced
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/2] mm, vmscan: prevent kswapd livelock due to pfmemalloc-throttled process being killed
  2014-12-22 16:25           ` Vladimir Davydov
@ 2014-12-22 19:33             ` Vlastimil Babka
  2014-12-23 10:16             ` Michal Hocko
  1 sibling, 0 replies; 12+ messages in thread
From: Vlastimil Babka @ 2014-12-22 19:33 UTC (permalink / raw)
  To: Vladimir Davydov, Michal Hocko
  Cc: Andrew Morton, linux-mm, linux-kernel, Ingo Molnar,
	Peter Zijlstra, stable, Mel Gorman, Johannes Weiner,
	Rik van Riel

On 22.12.2014 17:25, Vladimir Davydov wrote:
>
>>> E.g. suppose processes are
>>> governed by FIFO and kswapd happens to have a higher prio than the
>>> process killed by OOM. Then after cond_resched kswapd will be picked for
>>> execution again, and the killing process won't have a chance to remove
>>> itself from the wait queue.
>> Except that kswapd runs as SCHED_NORMAL with 0 priority.
>>
>>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>>> index 744e2b491527..2a123634c220 100644
>>>>> --- a/mm/vmscan.c
>>>>> +++ b/mm/vmscan.c
>>>>> @@ -2984,6 +2984,9 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
>>>>>   	if (remaining)
>>>>>   		return false;
>>>>>   
>>>>> +	if (!pgdat_balanced(pgdat, order, classzone_idx))
>>>>> +		return false;
>>>>> +
>>>> What would be consequences of not waking up pfmemalloc waiters while the
>>>> node is not balanced?
>>> They will get woken up a bit later in balanced_pgdat. This might result
>>> in latency spikes though. In order not to change the original behaviour
>>> we could always wake all pfmemalloc waiters no matter if we are going to
>>> sleep or not:
>>>
>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>> index 744e2b491527..a21e0bd563c3 100644
>>> --- a/mm/vmscan.c
>>> +++ b/mm/vmscan.c
>>> @@ -2993,10 +2993,7 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
>>>   	 * so wake them now if necessary. If necessary, processes will wake
>>>   	 * kswapd and get throttled again
>>>   	 */
>>> -	if (waitqueue_active(&pgdat->pfmemalloc_wait)) {
>>> -		wake_up(&pgdat->pfmemalloc_wait);
>>> -		return false;
>>> -	}
>>> +	wake_up_all(&pgdat->pfmemalloc_wait);
>>>   
>>>   	return pgdat_balanced(pgdat, order, classzone_idx);
>> So you are relying on scheduling points somewhere down the
>> balance_pgdat. That should be sufficient. I am still quite surprised
>> that we have an OOM victim still on the queue and balanced pgdat here
>> because OOM victim didn't have chance to free memory. So somebody else
>> must have released a lot of memory after OOM.
>>
>> This patch seems better than the one from Vlastimil. Care to post it
>> with the full changelog, please?
> Attached below (merged with 2/2). I haven't checked that it does fix the
> issue, because I don't have the reproducer, so it should be committed
> only if Vlastimil approves it.

I agree it's the right fix, thanks a lot. We only have a synthetic 
reproducer,
as the real scenario would be hard to trigger reliably. I can test it 
later, but
I think it's reasonably clear the patch will help.
I would just personaly keep the comment clarification in the patch, but it's
not a critical issue.

Vlastimil

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/2] mm, vmscan: prevent kswapd livelock due to pfmemalloc-throttled process being killed
  2014-12-22 16:25           ` Vladimir Davydov
  2014-12-22 19:33             ` Vlastimil Babka
@ 2014-12-23 10:16             ` Michal Hocko
  1 sibling, 0 replies; 12+ messages in thread
From: Michal Hocko @ 2014-12-23 10:16 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Vlastimil Babka, Andrew Morton, linux-mm, linux-kernel,
	Ingo Molnar, Peter Zijlstra, stable, Mel Gorman, Johannes Weiner,
	Rik van Riel

On Mon 22-12-14 19:25:58, Vladimir Davydov wrote:
[...]
> From: Vlastimil Babka <vbabka@suse.cz>
> Subject: [PATCH] mm, vmscan: prevent kswapd livelock due to
>  pfmemalloc-throttled process being killed
> 
> Charles Shirron and Paul Cassella from Cray Inc have reported kswapd stuck
> in a busy loop with nothing left to balance, but kswapd_try_to_sleep() failing
> to sleep. Their analysis found the cause to be a combination of several
> factors:
> 
> 1. A process is waiting in throttle_direct_reclaim() on pgdat->pfmemalloc_wait
> 
> 2. The process has been killed (by OOM in this case), but has not yet been
>    scheduled to remove itself from the waitqueue and die.
> 
> 3. kswapd checks for throttled processes in prepare_kswapd_sleep() and
>    do not put itself to sleep if there are any:
> 
>         if (waitqueue_active(&pgdat->pfmemalloc_wait)) {
>                 wake_up(&pgdat->pfmemalloc_wait);
>                 return false;
>         }
> 
>    However, for a process that was already killed, wake_up() does not remove
>    the process from the waitqueue, since try_to_wake_up() checks its state
>    first and returns false when the process is no longer waiting.
> 
> 4. kswapd is running on the same CPU as the only CPU that the process is
>    allowed to run on (through cpus_allowed, or possibly single-cpu system).
> 
> 5. CONFIG_PREEMPT_NONE=y kernel is used. If there's nothing to balance, kswapd
>    encounters no voluntary preemption points and repeatedly fails
>    prepare_kswapd_sleep(), blocking the process from running and removing
>    itself from the waitqueue, which would let kswapd sleep.
> 
> So, the source of the problem is that we prevent kswapd from going to
> sleep until there are processes waiting on the pfmemalloc_wait queue,
> and a process waiting on a queue is guaranteed to be removed from the
> queue only when it gets scheduled. This was done to avoid the race
> between kswapd checking pfmemalloc_wait and a process getting throttled
> as the comment in prepare_kswapd_sleep() explains.
> 
> However, it isn't necessary to postpone kswapd sleep until the
> pfmemalloc_wait queue empties. To eliminate the race, it's actually
> enough to guarantee that all processes waiting on pfmemalloc_wait queue
> have been woken up by the time we put kswapd to sleep.
> 
> This patch therefore fixes this issue by substituting 'wake_up' with
> 'wake_up_all' and removing 'return false' in the code snippet from
> prepare_kswapd_sleep() above.
> 
> Also, it replaces wake_up with wake_up_all in balance_pgdat(), because:
>  - using wake_up there might leave processes waiting for longer than
>    necessary, until the check is reached in the next loop iteration;
>  - processes might also be left waiting even if zone was fully balanced
>    in single iteration;
>  - the comment says "wake them" so the author of the commit that
>    introduced pfmemalloc_wait seemed to mean wake_up_all;
>  - this corresponds to how we wake processes waiting on pfmemalloc_wait
>    in prepare_kswapd_sleep.

I would still separate this into a separate patch because it is not
directly related to the issue. It also doesn't need to be backported to
the stable tree AFAIU.

> Fixes: 5515061d22f0 ("mm: throttle direct reclaimers if PF_MEMALLOC reserves
>                       are low and swap is backed by network storage")
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
> Cc: <stable@vger.kernel.org>   # v3.6+
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@suse.cz>
> Cc: Rik van Riel <riel@redhat.com>

Other than that the patch looks good to me.
Reviewed-by: Michal Hocko <mhocko@suse.cz>

> ---
> Changes in v2:
>  - instead of introducing yet another schedule() point in
>    kswapd_try_to_sleep(), allow kswapd to sleep even if the
>    pfmemalloc_wait queue is active, waking *all* throttled
>    processes before going to sleep
> 
>  mm/vmscan.c |    8 +++-----
>  1 file changed, 3 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 5e8772b2b9ef..65287944b2cf 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2961,10 +2961,8 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
>  	 * so wake them now if necessary. If necessary, processes will wake
>  	 * kswapd and get throttled again
>  	 */
> -	if (waitqueue_active(&pgdat->pfmemalloc_wait)) {
> -		wake_up(&pgdat->pfmemalloc_wait);
> -		return false;
> -	}
> +	if (waitqueue_active(&pgdat->pfmemalloc_wait))
> +		wake_up_all(&pgdat->pfmemalloc_wait);
>  
>  	return pgdat_balanced(pgdat, order, classzone_idx);
>  }
> @@ -3205,7 +3203,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
>  		 */
>  		if (waitqueue_active(&pgdat->pfmemalloc_wait) &&
>  				pfmemalloc_watermark_ok(pgdat))
> -			wake_up(&pgdat->pfmemalloc_wait);
> +			wake_up_all(&pgdat->pfmemalloc_wait);
>  
>  		/*
>  		 * Fragmentation may mean that the system cannot be rebalanced
> -- 
> 1.7.10.4
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2014-12-23 10:16 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-12-19 13:01 [PATCH 1/2] mm, vmscan: prevent kswapd livelock due to pfmemalloc-throttled process being killed Vlastimil Babka
2014-12-19 13:01 ` [PATCH 2/2] mm, vmscan: wake up all pfmemalloc-throttled processes at once Vlastimil Babka
2014-12-19 15:57 ` [PATCH 1/2] mm, vmscan: prevent kswapd livelock due to pfmemalloc-throttled process being killed Michal Hocko
2014-12-19 18:28   ` Vladimir Davydov
2014-12-19 23:05     ` Vlastimil Babka
2014-12-20  9:24       ` Vladimir Davydov
2014-12-20 10:47     ` Michal Hocko
2014-12-20 14:18       ` Vladimir Davydov
2014-12-22 14:24         ` Michal Hocko
2014-12-22 16:25           ` Vladimir Davydov
2014-12-22 19:33             ` Vlastimil Babka
2014-12-23 10:16             ` Michal Hocko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).