linux-kernel.vger.kernel.org archive mirror
* kswapd craziness in 3.7
@ 2012-11-27 20:48 Johannes Weiner
  2012-11-27 20:48 ` [patch] mm: vmscan: fix kswapd endless loop on higher order allocation Johannes Weiner
                   ` (3 more replies)
  0 siblings, 4 replies; 65+ messages in thread
From: Johannes Weiner @ 2012-11-27 20:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Rik van Riel, George Spelvin, Johannes Hirte,
	Tomas Racek, Jan Kara, Dave Hansen, Josh Boyer, Valdis.Kletnieks,
	Jiri Slaby, Thorsten Leemhuis, Zdenek Kabelac, Bruno Wolff III,
	Linus Torvalds, linux-mm, linux-kernel

Hi everyone,

I hope I included everybody that participated in the various threads
on kswapd getting stuck / exhibiting high CPU usage.  We were looking
at at least three root causes as far as I can see, so it's not really
clear who observed which problem.  Please correct me if the
reported-by, tested-by, bisected-by tags are incomplete.

One problem was, as it seems, overly aggressive reclaim due to scaling
up reclaim goals based on compaction failures.  This one was reverted
in 9671009 mm: revert "mm: vmscan: scale number of pages reclaimed by
reclaim/compaction based on failures".

Another one was an accounting problem where a freed higher order page
was underreported, and so kswapd had trouble restoring watermarks.
This one was fixed in ef6c5be fix incorrect NR_FREE_PAGES accounting
(appears like memory leak).

The third one is a problem with small zones, like the DMA zone, where
the high watermark is lower than the low watermark plus compaction gap
(2 * allocation size).  The zonelist reclaim in kswapd would do
nothing because all high watermarks are met, but the compaction logic
would find its own requirements unmet and loop over the zones again.
Indefinitely, until some third party would free enough memory to help
meet the higher compaction watermark.  The problematic code has been
there since the 3.4 merge window for non-THP higher order allocations
but has been more prominent since the 3.7 merge window, where kswapd
is also woken up for the much more common THP allocations.

The following patch should fix the third issue by making both reclaim
and compaction code in kswapd use the same predicate to determine
whether a zone is balanced or not.
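
To make the mismatch concrete, here is a minimal user-space sketch, not
kernel code.  The struct, the helper names and the page counts are made
up for illustration, and both checks are reduced to plain comparisons
against the free page count (the real zone_watermark_ok_safe() and
compaction_suitable() look at more than that).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for struct zone (illustration only). */
struct zone_model {
	unsigned long free_pages;
	unsigned long low_wmark;
	unsigned long high_wmark;
};

/* Reclaim side: balanced once free pages reach the high watermark. */
static bool reclaim_balanced(const struct zone_model *z)
{
	return z->free_pages >= z->high_wmark;
}

/* Compaction side: wants the low watermark plus twice the allocation size. */
static bool compaction_ok(const struct zone_model *z, int order)
{
	return z->free_pages >= z->low_wmark + (2UL << order);
}

/* One predicate for both sides, in the spirit of zone_balanced(). */
static bool zone_balanced_model(const struct zone_model *z, int order)
{
	if (!reclaim_balanced(z))
		return false;
	if (order && !compaction_ok(z, order))
		return false;
	return true;
}

/* Made-up numbers for a small, DMA-like zone and an order-9 request. */
static void small_zone_example(void)
{
	struct zone_model dma = { .free_pages = 100, .low_wmark = 40, .high_wmark = 50 };

	assert(reclaim_balanced(&dma));        /* reclaim sees nothing to do */
	assert(!compaction_ok(&dma, 9));       /* compaction wants 40 + 1024 pages */
	assert(!zone_balanced_model(&dma, 9)); /* unified check: not balanced */
}
```

With two separate checks, the reclaim pass skips this zone while the
compaction check keeps failing, which is the loop described above.  With
a single predicate the zone is simply reported as not balanced.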

Hopefully, the sum of all three fixes should tame kswapd enough for
3.7.

Johannes



* [patch] mm: vmscan: fix kswapd endless loop on higher order allocation
  2012-11-27 20:48 kswapd craziness in 3.7 Johannes Weiner
@ 2012-11-27 20:48 ` Johannes Weiner
  2012-11-27 20:58 ` kswapd craziness in 3.7 Linus Torvalds
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 65+ messages in thread
From: Johannes Weiner @ 2012-11-27 20:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Rik van Riel, George Spelvin, Johannes Hirte,
	Tomas Racek, Jan Kara, Dave Hansen, Josh Boyer, Valdis.Kletnieks,
	Jiri Slaby, Thorsten Leemhuis, Zdenek Kabelac, Bruno Wolff III,
	Linus Torvalds, linux-mm, linux-kernel

Kswapd does not in all places have the same criteria for a balanced
zone.  Zones are only being reclaimed when their high watermark is
breached, but compaction checks loop over the zonelist again when the
zone does not meet the low watermark plus two times the size of the
allocation.  This gets kswapd stuck in an endless loop over a small
zone, like the DMA zone, where the high watermark is smaller than the
compaction requirement.

Add a function, zone_balanced(), that checks the watermark, and, for
higher order allocations, if compaction has enough free memory.  Then
use it uniformly to check for balanced zones.

This makes sure that when the compaction watermark is not met, at
least reclaim happens and progress is made - or the zone is declared
unreclaimable at some point and skipped entirely.

Reported-by: George Spelvin <linux@horizon.com>
Reported-by: Johannes Hirte <johannes.hirte@fem.tu-ilmenau.de>
Reported-by: Tomas Racek <tracek@redhat.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Johannes Hirte <johannes.hirte@fem.tu-ilmenau.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 mm/vmscan.c | 27 ++++++++++++++++++---------
 1 file changed, 18 insertions(+), 9 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 48550c6..3b0aef4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2397,6 +2397,19 @@ static void age_active_anon(struct zone *zone, struct scan_control *sc)
 	} while (memcg);
 }
 
+static bool zone_balanced(struct zone *zone, int order,
+			  unsigned long balance_gap, int classzone_idx)
+{
+	if (!zone_watermark_ok_safe(zone, order, high_wmark_pages(zone) +
+				    balance_gap, classzone_idx, 0))
+		return false;
+
+	if (COMPACTION_BUILD && order && !compaction_suitable(zone, order))
+		return false;
+
+	return true;
+}
+
 /*
  * pgdat_balanced is used when checking if a node is balanced for high-order
  * allocations. Only zones that meet watermarks and are in a zone allowed
@@ -2475,8 +2488,7 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
 			continue;
 		}
 
-		if (!zone_watermark_ok_safe(zone, order, high_wmark_pages(zone),
-							i, 0))
+		if (!zone_balanced(zone, order, 0, i))
 			all_zones_ok = false;
 		else
 			balanced += zone->present_pages;
@@ -2585,8 +2597,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 				break;
 			}
 
-			if (!zone_watermark_ok_safe(zone, order,
-					high_wmark_pages(zone), 0, 0)) {
+			if (!zone_balanced(zone, order, 0, 0)) {
 				end_zone = i;
 				break;
 			} else {
@@ -2662,9 +2673,8 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 				testorder = 0;
 
 			if ((buffer_heads_over_limit && is_highmem_idx(i)) ||
-				    !zone_watermark_ok_safe(zone, testorder,
-					high_wmark_pages(zone) + balance_gap,
-					end_zone, 0)) {
+			    !zone_balanced(zone, testorder,
+					   balance_gap, end_zone)) {
 				shrink_zone(zone, &sc);
 
 				reclaim_state->reclaimed_slab = 0;
@@ -2691,8 +2701,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 				continue;
 			}
 
-			if (!zone_watermark_ok_safe(zone, testorder,
-					high_wmark_pages(zone), end_zone, 0)) {
+			if (!zone_balanced(zone, testorder, 0, end_zone)) {
 				all_zones_ok = 0;
 				/*
 				 * We are still under min water mark.  This
-- 
1.7.11.7



* Re: kswapd craziness in 3.7
  2012-11-27 20:48 kswapd craziness in 3.7 Johannes Weiner
  2012-11-27 20:48 ` [patch] mm: vmscan: fix kswapd endless loop on higher order allocation Johannes Weiner
@ 2012-11-27 20:58 ` Linus Torvalds
  2012-11-27 21:16   ` Rik van Riel
                     ` (2 more replies)
  2012-11-28  9:45 ` Mel Gorman
  2012-12-03 13:14 ` Jiri Slaby
  3 siblings, 3 replies; 65+ messages in thread
From: Linus Torvalds @ 2012-11-27 20:58 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Mel Gorman, Rik van Riel, George Spelvin,
	Johannes Hirte, Tomas Racek, Jan Kara, Dave Hansen, Josh Boyer,
	Valdis Kletnieks, Jiri Slaby, Thorsten Leemhuis, Zdenek Kabelac,
	Bruno Wolff III, linux-mm, Linux Kernel Mailing List

Note that in the meantime, I've also applied (through Andrew) the
patch that reverts commit c654345924f7 (see commit 82b212f40059
'Revert "mm: remove __GFP_NO_KSWAPD"').

I wonder if that revert may be bogus, and a result of this same issue.
Maybe that revert should be reverted, and replaced with your patch?

Mel? Zdenek? What's the status here?

                 Linus

On Tue, Nov 27, 2012 at 12:48 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> Hi everyone,
>
> I hope I included everybody that participated in the various threads
> on kswapd getting stuck / exhibiting high CPU usage.  We were looking
> at at least three root causes as far as I can see, so it's not really
> clear who observed which problem.  Please correct me if the
> reported-by, tested-by, bisected-by tags are incomplete.
>
> One problem was, as it seems, overly aggressive reclaim due to scaling
> up reclaim goals based on compaction failures.  This one was reverted
> in 9671009 mm: revert "mm: vmscan: scale number of pages reclaimed by
> reclaim/compaction based on failures".
>
> Another one was an accounting problem where a freed higher order page
> was underreported, and so kswapd had trouble restoring watermarks.
> This one was fixed in ef6c5be fix incorrect NR_FREE_PAGES accounting
> (appears like memory leak).
>
> The third one is a problem with small zones, like the DMA zone, where
> the high watermark is lower than the low watermark plus compaction gap
> (2 * allocation size).  The zonelist reclaim in kswapd would do
> nothing because all high watermarks are met, but the compaction logic
> would find its own requirements unmet and loop over the zones again.
> Indefinitely, until some third party would free enough memory to help
> meet the higher compaction watermark.  The problematic code has been
> there since the 3.4 merge window for non-THP higher order allocations
> but has been more prominent since the 3.7 merge window, where kswapd
> is also woken up for the much more common THP allocations.
>
> The following patch should fix the third issue by making both reclaim
> and compaction code in kswapd use the same predicate to determine
> whether a zone is balanced or not.
>
> Hopefully, the sum of all three fixes should tame kswapd enough for
> 3.7.
>
> Johannes
>


* Re: kswapd craziness in 3.7
  2012-11-27 20:58 ` kswapd craziness in 3.7 Linus Torvalds
@ 2012-11-27 21:16   ` Rik van Riel
  2012-11-27 21:49     ` Johannes Weiner
  2012-11-27 21:29   ` Johannes Weiner
  2012-11-28 13:35   ` Zdenek Kabelac
  2 siblings, 1 reply; 65+ messages in thread
From: Rik van Riel @ 2012-11-27 21:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Johannes Weiner, Andrew Morton, Mel Gorman, George Spelvin,
	Johannes Hirte, Tomas Racek, Jan Kara, Dave Hansen, Josh Boyer,
	Valdis Kletnieks, Jiri Slaby, Thorsten Leemhuis, Zdenek Kabelac,
	Bruno Wolff III, linux-mm, Linux Kernel Mailing List

On 11/27/2012 03:58 PM, Linus Torvalds wrote:
> Note that in the meantime, I've also applied (through Andrew) the
> patch that reverts commit c654345924f7 (see commit 82b212f40059
> 'Revert "mm: remove __GFP_NO_KSWAPD"').
>
> I wonder if that revert may be bogus, and a result of this same issue.
> Maybe that revert should be reverted, and replaced with your patch?
>
> Mel? Zdenek? What's the status here?

Mel posted several patches to fix the kswapd issue.  This one is
slightly more risky than the outright revert, but probably preferred
from a performance point of view:

https://lkml.org/lkml/2012/11/12/151

It works by skipping the kswapd wakeup for THP allocations, only
if compaction is deferred or contended.
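
As a rough illustration of that wakeup rule (this is not the code from
Mel's patch; the struct and function names below are hypothetical and
only model the decision as described here):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical description of one allocation that entered the slow path. */
struct alloc_attempt {
	bool is_thp;               /* transparent huge page allocation */
	bool compaction_deferred;  /* compaction is currently being deferred */
	bool compaction_contended; /* compaction backed off on lock contention */
};

/* Skip the wakeup only for THP, and only if compaction was deferred or contended. */
static bool should_wake_kswapd(const struct alloc_attempt *a)
{
	if (a->is_thp && (a->compaction_deferred || a->compaction_contended))
		return false;
	return true;
}

/* A few cases spelled out. */
static void wake_policy_example(void)
{
	struct alloc_attempt thp_contended = { .is_thp = true, .compaction_contended = true };
	struct alloc_attempt thp_plain = { .is_thp = true };
	struct alloc_attempt other_deferred = { .is_thp = false, .compaction_deferred = true };

	assert(!should_wake_kswapd(&thp_contended)); /* THP + contended: no wakeup */
	assert(should_wake_kswapd(&thp_plain));      /* THP otherwise: wake kswapd */
	assert(should_wake_kswapd(&other_deferred)); /* non-THP is unaffected */
}
```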

-- 
All rights reversed


* Re: kswapd craziness in 3.7
  2012-11-27 20:58 ` kswapd craziness in 3.7 Linus Torvalds
  2012-11-27 21:16   ` Rik van Riel
@ 2012-11-27 21:29   ` Johannes Weiner
  2012-11-28 13:35   ` Zdenek Kabelac
  2 siblings, 0 replies; 65+ messages in thread
From: Johannes Weiner @ 2012-11-27 21:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Mel Gorman, Rik van Riel, George Spelvin,
	Johannes Hirte, Tomas Racek, Jan Kara, Dave Hansen, Josh Boyer,
	Valdis Kletnieks, Jiri Slaby, Thorsten Leemhuis, Zdenek Kabelac,
	Bruno Wolff III, linux-mm, Linux Kernel Mailing List

On Tue, Nov 27, 2012 at 12:58:18PM -0800, Linus Torvalds wrote:
> Note that in the meantime, I've also applied (through Andrew) the
> patch that reverts commit c654345924f7 (see commit 82b212f40059
> 'Revert "mm: remove __GFP_NO_KSWAPD"').
> 
> I wonder if that revert may be bogus, and a result of this same issue.
> Maybe that revert should be reverted, and replaced with your patch?

The __GFP_NO_KSWAPD removal woke kswapd for THP reclaim and so it
exposed all these bugs that accumulated in there when higher order
kswapd reclaim was exercised less often.

The revert will hide the problem again, but doesn't make it go away
entirely, so I think we need my fix either way.

Whether you want to put the full THP weight back on the freshly fixed
higher order kswapd code for 3.7 is a different matter :-) At least we
would see quickly if it's still not working correctly...


* Re: kswapd craziness in 3.7
  2012-11-27 21:16   ` Rik van Riel
@ 2012-11-27 21:49     ` Johannes Weiner
  2012-11-27 22:02       ` Rik van Riel
  0 siblings, 1 reply; 65+ messages in thread
From: Johannes Weiner @ 2012-11-27 21:49 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Linus Torvalds, Andrew Morton, Mel Gorman, George Spelvin,
	Johannes Hirte, Tomas Racek, Jan Kara, Dave Hansen, Josh Boyer,
	Valdis Kletnieks, Jiri Slaby, Thorsten Leemhuis, Zdenek Kabelac,
	Bruno Wolff III, linux-mm, Linux Kernel Mailing List

On Tue, Nov 27, 2012 at 04:16:52PM -0500, Rik van Riel wrote:
> On 11/27/2012 03:58 PM, Linus Torvalds wrote:
> >Note that in the meantime, I've also applied (through Andrew) the
> >patch that reverts commit c654345924f7 (see commit 82b212f40059
> >'Revert "mm: remove __GFP_NO_KSWAPD"').
> >
> >I wonder if that revert may be bogus, and a result of this same issue.
> >Maybe that revert should be reverted, and replaced with your patch?
> >
> >Mel? Zdenek? What's the status here?
> 
> Mel posted several patches to fix the kswapd issue.  This one is
> slightly more risky than the outright revert, but probably preferred
> from a performance point of view:
> 
> https://lkml.org/lkml/2012/11/12/151
> 
> It works by skipping the kswapd wakeup for THP allocations, only
> if compaction is deferred or contended.

Just to clarify, this would be a replacement strictly for the
__GFP_NO_KSWAPD removal revert, to control how often kswapd is woken
up for higher order allocations like THP.

My patch is to fix how kswapd actually does higher order reclaim, and
it is required either way.

[ But isn't the _reason_ why the "wake up kswapd more carefully for
  THP" patch was written kind of moot now since it was developed
  against a crazy kswapd?  It would certainly need to be re-evaluated.
  My (limited) testing didn't show any issues anymore with waking
  kswapd unconditionally once it's fixed. ]


* Re: kswapd craziness in 3.7
  2012-11-27 21:49     ` Johannes Weiner
@ 2012-11-27 22:02       ` Rik van Riel
  2012-11-27 22:26         ` Johannes Weiner
  0 siblings, 1 reply; 65+ messages in thread
From: Rik van Riel @ 2012-11-27 22:02 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Linus Torvalds, Andrew Morton, Mel Gorman, George Spelvin,
	Johannes Hirte, Tomas Racek, Jan Kara, Dave Hansen, Josh Boyer,
	Valdis Kletnieks, Jiri Slaby, Thorsten Leemhuis, Zdenek Kabelac,
	Bruno Wolff III, linux-mm, Linux Kernel Mailing List

On 11/27/2012 04:49 PM, Johannes Weiner wrote:
> On Tue, Nov 27, 2012 at 04:16:52PM -0500, Rik van Riel wrote:
>> On 11/27/2012 03:58 PM, Linus Torvalds wrote:
>>> Note that in the meantime, I've also applied (through Andrew) the
>>> patch that reverts commit c654345924f7 (see commit 82b212f40059
>>> 'Revert "mm: remove __GFP_NO_KSWAPD"').
>>>
>>> I wonder if that revert may be bogus, and a result of this same issue.
>>> Maybe that revert should be reverted, and replaced with your patch?
>>>
>>> Mel? Zdenek? What's the status here?
>>
>> Mel posted several patches to fix the kswapd issue.  This one is
>> slightly more risky than the outright revert, but probably preferred
>> from a performance point of view:
>>
>> https://lkml.org/lkml/2012/11/12/151
>>
>> It works by skipping the kswapd wakeup for THP allocations, only
>> if compaction is deferred or contended.
>
> Just to clarify, this would be a replacement strictly for the
> __GFP_NO_KSWAPD removal revert, to control how often kswapd is woken
> up for higher order allocations like THP.
>
> My patch is to fix how kswapd actually does higher order reclaim, and
> it is required either way.
>
> [ But isn't the _reason_ why the "wake up kswapd more carefully for
>    THP" patch was written kind of moot now since it was developed
>    against a crazy kswapd?  It would certainly need to be re-evaluated.
>    My (limited) testing didn't show any issues anymore with waking
>    kswapd unconditionally once it's fixed. ]

Kswapd going crazy is certainly a large part of the problem.

However, that leaves the issue of page_alloc.c waking up
kswapd when the system is not actually low on memory.

Instead, kswapd is woken up because memory compaction failed,
potentially even due to lock contention during compaction!

Ideally the allocation code would only wake up kswapd if
memory needs to be freed, or in order for kswapd to do
memory compaction (so the allocator does not have to).

-- 
All rights reversed


* Re: kswapd craziness in 3.7
  2012-11-27 22:02       ` Rik van Riel
@ 2012-11-27 22:26         ` Johannes Weiner
  2012-11-27 23:19           ` Linus Torvalds
  0 siblings, 1 reply; 65+ messages in thread
From: Johannes Weiner @ 2012-11-27 22:26 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Linus Torvalds, Andrew Morton, Mel Gorman, George Spelvin,
	Johannes Hirte, Tomas Racek, Jan Kara, Dave Hansen, Josh Boyer,
	Valdis Kletnieks, Jiri Slaby, Thorsten Leemhuis, Zdenek Kabelac,
	Bruno Wolff III, linux-mm, Linux Kernel Mailing List

On Tue, Nov 27, 2012 at 05:02:36PM -0500, Rik van Riel wrote:
> On 11/27/2012 04:49 PM, Johannes Weiner wrote:
> >On Tue, Nov 27, 2012 at 04:16:52PM -0500, Rik van Riel wrote:
> >>On 11/27/2012 03:58 PM, Linus Torvalds wrote:
> >>>Note that in the meantime, I've also applied (through Andrew) the
> >>>patch that reverts commit c654345924f7 (see commit 82b212f40059
> >>>'Revert "mm: remove __GFP_NO_KSWAPD"').
> >>>
> >>>I wonder if that revert may be bogus, and a result of this same issue.
> >>>Maybe that revert should be reverted, and replaced with your patch?
> >>>
> >>>Mel? Zdenek? What's the status here?
> >>
> >>Mel posted several patches to fix the kswapd issue.  This one is
> >>slightly more risky than the outright revert, but probably preferred
> >>from a performance point of view:
> >>
> >>https://lkml.org/lkml/2012/11/12/151
> >>
> >>It works by skipping the kswapd wakeup for THP allocations, only
> >>if compaction is deferred or contended.
> >
> >Just to clarify, this would be a replacement strictly for the
> >__GFP_NO_KSWAPD removal revert, to control how often kswapd is woken
> >up for higher order allocations like THP.
> >
> >My patch is to fix how kswapd actually does higher order reclaim, and
> >it is required either way.
> >
> >[ But isn't the _reason_ why the "wake up kswapd more carefully for
> >   THP" patch was written kind of moot now since it was developed
> >   against a crazy kswapd?  It would certainly need to be re-evaluated.
> >   My (limited) testing didn't show any issues anymore with waking
> >   kswapd unconditionally once it's fixed. ]
> 
> Kswapd going crazy is certainly a large part of the problem.
> 
> However, that leaves the issue of page_alloc.c waking up
> kswapd when the system is not actually low on memory.
> 
> Instead, kswapd is woken up because memory compaction failed,
> potentially even due to lock contention during compaction!
> 
> Ideally the allocation code would only wake up kswapd if
> memory needs to be freed, or in order for kswapd to do
> memory compaction (so the allocator does not have to).

Maybe I missed something, but shouldn't this be solved with my patch?

The first scan over the zones finds the higher order watermark
breached, but the reclaim scan over the zones tests against order-0
(testorder) watermarks when compaction is suitable, i.e. no reclaim if
there are enough order-0 pages for compaction to work.  It should just
fall through to that zones_need_compaction condition at the end and
run compaction.

As such, it should always be appropriate to wake kswapd if allocations
fail.
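
A minimal sketch of the testorder selection described above, as a
user-space model rather than the actual balance_pgdat() code.  The
function and parameter names are hypothetical: compaction_enabled stands
in for COMPACTION_BUILD, and compaction_is_suitable for the outcome of
the compaction_suitable() check.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Pick the order that the reclaim pass tests watermarks against.
 * If compaction could work on this zone, only order-0 watermarks matter
 * for reclaim; forming the higher order page is left to compaction.
 */
static int reclaim_testorder(int order, bool compaction_enabled,
			     bool compaction_is_suitable)
{
	if (compaction_enabled && order && compaction_is_suitable)
		return 0;
	return order;
}

/* A few cases spelled out. */
static void testorder_example(void)
{
	assert(reclaim_testorder(9, true, true) == 0);  /* leave it to compaction */
	assert(reclaim_testorder(9, true, false) == 9); /* reclaim for the full order */
	assert(reclaim_testorder(9, false, true) == 9); /* no compaction built in */
	assert(reclaim_testorder(0, true, true) == 0);  /* order-0 is unchanged */
}
```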


* Re: kswapd craziness in 3.7
  2012-11-27 22:26         ` Johannes Weiner
@ 2012-11-27 23:19           ` Linus Torvalds
  2012-11-28 10:13             ` Mel Gorman
  0 siblings, 1 reply; 65+ messages in thread
From: Linus Torvalds @ 2012-11-27 23:19 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Rik van Riel, Andrew Morton, Mel Gorman, George Spelvin,
	Johannes Hirte, Tomas Racek, Jan Kara, Dave Hansen, Josh Boyer,
	Valdis Kletnieks, Jiri Slaby, Thorsten Leemhuis, Zdenek Kabelac,
	Bruno Wolff III, linux-mm, Linux Kernel Mailing List

On Tue, Nov 27, 2012 at 2:26 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> On Tue, Nov 27, 2012 at 05:02:36PM -0500, Rik van Riel wrote:
>>
>> Kswapd going crazy is certainly a large part of the problem.
>>
>> However, that leaves the issue of page_alloc.c waking up
>> kswapd when the system is not actually low on memory.
>>
>> Instead, kswapd is woken up because memory compaction failed,
>> potentially even due to lock contention during compaction!
>>
>> Ideally the allocation code would only wake up kswapd if
>> memory needs to be freed, or in order for kswapd to do
>> memory compaction (so the allocator does not have to).
>
> Maybe I missed something, but shouldn't this be solved with my patch?

Ok, guys. Cage fight!

The rules are simple: two men enter, one man leaves.

And the one who comes out gets to explain to me which patch(es) I
should apply, and which I should revert, if any.

My current guess is that I should apply the one Johannes just sent
("mm: vmscan: fix kswapd endless loop on higher order allocation")
after having added the cc to stable to it, and then revert the recent
revert (commit 82b212f40059).

But I await the Thunderdome. <Cue Tina Turner "We Don't Need Another Hero">

                      Linus


* Re: kswapd craziness in 3.7
  2012-11-27 20:48 kswapd craziness in 3.7 Johannes Weiner
  2012-11-27 20:48 ` [patch] mm: vmscan: fix kswapd endless loop on higher order allocation Johannes Weiner
  2012-11-27 20:58 ` kswapd craziness in 3.7 Linus Torvalds
@ 2012-11-28  9:45 ` Mel Gorman
  2012-12-03 15:23   ` Zdenek Kabelac
  2012-12-03 13:14 ` Jiri Slaby
  3 siblings, 1 reply; 65+ messages in thread
From: Mel Gorman @ 2012-11-28  9:45 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Rik van Riel, George Spelvin, Johannes Hirte,
	Thorsten Leemhuis, Tomas Racek, Jan Kara, Dave Hansen,
	Josh Boyer, Valdis.Kletnieks, Jiri Slaby, Zdenek Kabelac,
	Bruno Wolff III, Linus Torvalds, linux-mm, linux-kernel

(Adding Thorsten to cc)

On Tue, Nov 27, 2012 at 03:48:34PM -0500, Johannes Weiner wrote:
> Hi everyone,
> 
> I hope I included everybody that participated in the various threads
> on kswapd getting stuck / exhibiting high CPU usage.  We were looking
> at at least three root causes as far as I can see, so it's not really
> clear who observed which problem.  Please correct me if the
> reported-by, tested-by, bisected-by tags are incomplete.
> 
> One problem was, as it seems, overly aggressive reclaim due to scaling
> up reclaim goals based on compaction failures.  This one was reverted
> in 9671009 mm: revert "mm: vmscan: scale number of pages reclaimed by
> reclaim/compaction based on failures".
> 

This particular one would have been made worse by the accounting bug and
if kswapd was staying awake longer than necessary. As scaling the amount
of reclaim only for direct reclaim helped this problem a lot, I strongly
suspect the accounting bug was a factor.

However the benefit for this is marginal -- it primarily affects how
many THP pages we can allocate under stress. There is already a graceful
fallback path and a system under heavy reclaim pressure is not going to
notice the performance benefit of THP.

> Another one was an accounting problem where a freed higher order page
> was underreported, and so kswapd had trouble restoring watermarks.
> This one was fixed in ef6c5be fix incorrect NR_FREE_PAGES accounting
> (appears like memory leak).
> 

This almost certainly also requires the follow-on fix at
https://lkml.org/lkml/2012/11/26/225 for reasons I explained in
https://lkml.org/lkml/2012/11/27/190 .

> The third one is a problem with small zones, like the DMA zone, where
> the high watermark is lower than the low watermark plus compaction gap
> (2 * allocation size).  The zonelist reclaim in kswapd would do
> nothing because all high watermarks are met, but the compaction logic
> would find its own requirements unmet and loop over the zones again.
> Indefinitely, until some third party would free enough memory to help
> meet the higher compaction watermark.  The problematic code has been
> there since the 3.4 merge window for non-THP higher order allocations
> but has been more prominent since the 3.7 merge window, where kswapd
> is also woken up for the much more common THP allocations.
> 

Yes. 

> The following patch should fix the third issue by making both reclaim
> and compaction code in kswapd use the same predicate to determine
> whether a zone is balanced or not.
> 
> Hopefully, the sum of all three fixes should tame kswapd enough for
> 3.7.
> 

I'm not exactly sure of that. With just those patches it is possible for
THP allocations entering the slow path to keep kswapd continually awake
doing busy work. There was an alternative to the revert that covered this
case (https://lkml.org/lkml/2012/11/12/151), but it was not enough on its
own because kswapd would stay awake due to the bug you identified and fixed.

I went with the __GFP_NO_KSWAPD patch in this cycle because 3.6 was/is
very poor in how it handles THP after the removal of lumpy reclaim. 3.7
was shaping up to be even worse with multiple root causes too close to the
release date.  Taking kswapd out of the equation covered some of the
problems (yes, by hiding them) so it could be revisited but Johannes may
have finally squashed it.

However, if we revert the revert then I strongly recommend that it be
replaced with "Avoid waking kswapd for THP allocations when compaction is
deferred or contended".

-- 
Mel Gorman
SUSE Labs


* Re: kswapd craziness in 3.7
  2012-11-27 23:19           ` Linus Torvalds
@ 2012-11-28 10:13             ` Mel Gorman
  2012-11-28 10:51               ` Thorsten Leemhuis
                                 ` (2 more replies)
  0 siblings, 3 replies; 65+ messages in thread
From: Mel Gorman @ 2012-11-28 10:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Johannes Weiner, Rik van Riel, Andrew Morton, George Spelvin,
	Johannes Hirte, Tomas Racek, Jan Kara, Dave Hansen, Josh Boyer,
	Valdis Kletnieks, Jiri Slaby, Thorsten Leemhuis, Zdenek Kabelac,
	Bruno Wolff III, linux-mm, Linux Kernel Mailing List

On Tue, Nov 27, 2012 at 03:19:38PM -0800, Linus Torvalds wrote:
> On Tue, Nov 27, 2012 at 2:26 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > On Tue, Nov 27, 2012 at 05:02:36PM -0500, Rik van Riel wrote:
> >>
> >> Kswapd going crazy is certainly a large part of the problem.
> >>
> >> However, that leaves the issue of page_alloc.c waking up
> >> kswapd when the system is not actually low on memory.
> >>
> >> Instead, kswapd is woken up because memory compaction failed,
> >> potentially even due to lock contention during compaction!
> >>
> >> Ideally the allocation code would only wake up kswapd if
> >> memory needs to be freed, or in order for kswapd to do
> >> memory compaction (so the allocator does not have to).
> >
> > Maybe I missed something, but shouldn't this be solved with my patch?
> 
> Ok, guys. Cage fight!
> 
> The rules are simple: two men enter, one man leaves.
> 

I'm fairly scorch damaged from this whole cycle already. I won't need a
prop master to look the part for a thunderdome match.

> And the one who comes out gets to explain to me which patch(es) I
> should apply, and which I should revert, if any.
> 

Based on the reports I've seen I expect the following to work for 3.7

Keep
  96710098 mm: revert "mm: vmscan: scale number of pages reclaimed by reclaim/compaction based on failures"
  ef6c5be6 fix incorrect NR_FREE_PAGES accounting (appears like memory leak)

Revert
  82b212f4 Revert "mm: remove __GFP_NO_KSWAPD"

Merge
  mm: vmscan: fix kswapd endless loop on higher order allocation
  mm: Avoid waking kswapd for THP allocations when compaction is deferred or contended

Johannes' patch should remove the necessity for __GFP_NO_KSWAPD revert but I
think we should also avoid waking kswapd for THP allocations if compaction
is deferred. Johannes' patch might mean that kswapd quickly goes back
to sleep but it's still busy work.

3.6 is still known to be screwed in terms of THP because of the amount of
time it can spend in compaction after lumpy reclaim was removed. This is
my old list of patches I felt needed to be backported after 3.7 came out.
They are not tagged -stable, I'll be sending it to Greg manually.

e64c523 mm: compaction: abort compaction loop if lock is contended or run too long
3cc668f mm: compaction: move fatal signal check out of compact_checklock_irqsave
661c4cb mm: compaction: Update try_to_compact_pages()kerneldoc comment
2a1402a mm: compaction: acquire the zone->lru_lock as late as possible
f40d1e4 mm: compaction: acquire the zone->lock as late as possible
753341a revert "mm: have order > 0 compaction start off where it left"
bb13ffe mm: compaction: cache if a pageblock was scanned and no pages were isolated
c89511a mm: compaction: Restart compaction from near where it left off
6299702 mm: compaction: clear PG_migrate_skip based on compaction and reclaim activity
0db63d7 mm: compaction: correct the nr_strict va isolated check for CMA

Only Johannes' patch needs to be added to this list. kswapd is not woken
for THP in 3.6 but as it calls compaction for other high-order allocations
it still makes sense.

-- 
Mel Gorman
SUSE Labs


* Re: kswapd craziness in 3.7
  2012-11-28 10:13             ` Mel Gorman
@ 2012-11-28 10:51               ` Thorsten Leemhuis
  2012-11-28 16:42               ` Mel Gorman
  2012-11-28 22:52               ` Andrew Morton
  2 siblings, 0 replies; 65+ messages in thread
From: Thorsten Leemhuis @ 2012-11-28 10:51 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linus Torvalds, Johannes Weiner, Rik van Riel, Andrew Morton,
	George Spelvin, Johannes Hirte, Tomas Racek, Jan Kara,
	Dave Hansen, Josh Boyer, Valdis Kletnieks, Jiri Slaby,
	Zdenek Kabelac, Bruno Wolff III, linux-mm,
	Linux Kernel Mailing List

Mel Gorman wrote on 28.11.2012 11:13:
> On Tue, Nov 27, 2012 at 03:19:38PM -0800, Linus Torvalds wrote:
>> On Tue, Nov 27, 2012 at 2:26 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:
>> > On Tue, Nov 27, 2012 at 05:02:36PM -0500, Rik van Riel wrote:
>
>> And the one who comes out gets to explain to me which patch(es) I
>> should apply, and which I should revert, if any.
> 
> Based on the reports I've seen I expect the following to work for 3.7
> 
> Keep
>   96710098 mm: revert "mm: vmscan: scale number of pages reclaimed by reclaim/compaction based on failures"
>   ef6c5be6 fix incorrect NR_FREE_PAGES accounting (appears like memory leak)
> 
> Revert
>   82b212f4 Revert "mm: remove __GFP_NO_KSWAPD"
> 
> Merge
>   mm: vmscan: fix kswapd endless loop on higher order allocation
>   mm: Avoid waking kswapd for THP allocations when compaction is deferred or contended

I'll build a kernel with this combination and will give it a try. Maybe
one of those people that reported problems in
https://bugzilla.redhat.com/show_bug.cgi?id=866988 can try them, too.
There, two people recently reported that their problems were gone with
kernels that contained 82b212f4.

> Johannes' patch should remove the necessity for __GFP_NO_KSWAPD revert but I
> think we should also avoid waking kswapd for THP allocations if compaction
> is deferred. Johannes' patch might mean that kswapd quickly goes back
> to sleep but it's still busy work.

Is there a way to trigger (some benchmark?) and detect (something in
/proc/vmstat?) the problem Hannes' patch tries to fix?

Background: The two main problems that got me into this discussion
vanished thanks to 9671009 (mm: revert "mm: vmscan: scale number of pages
reclaimed by reclaim/compaction based on failures") and ef6c5be (fix
incorrect NR_FREE_PAGES accounting (appears like memory leak)). I
thought all my problems had gone, but after a few days of uptime
(I suspended and resumed the machine a few times in between, as
I was using it only in the evenings) kswapd now and then started
consuming nearly 100% of one CPU core for 10 to 15 seconds at a time (it
seems watching a YouTube video triggered it, and the machine was using a
little bit of swap space). I had just started debugging this when, due to
some stupid mistake
(https://plus.google.com/107616711159256259828/posts/GXuhf1LTien ), I
rebooted the machine :-/ So maybe I hit the problem Hannes' patch tries
to solve, but I'm not sure, and I have no easy way to verify quickly
whether the proposed patch combination helps.

Thorsten

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: kswapd craziness in 3.7
  2012-11-27 20:58 ` kswapd craziness in 3.7 Linus Torvalds
  2012-11-27 21:16   ` Rik van Riel
  2012-11-27 21:29   ` Johannes Weiner
@ 2012-11-28 13:35   ` Zdenek Kabelac
  2012-11-28 14:04     ` Jiri Slaby
  2 siblings, 1 reply; 65+ messages in thread
From: Zdenek Kabelac @ 2012-11-28 13:35 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Johannes Weiner, Andrew Morton, Mel Gorman, Rik van Riel,
	George Spelvin, Johannes Hirte, Tomas Racek, Jan Kara,
	Dave Hansen, Josh Boyer, Valdis Kletnieks, Jiri Slaby,
	Thorsten Leemhuis, Bruno Wolff III, linux-mm,
	Linux Kernel Mailing List

On 27.11.2012 21:58, Linus Torvalds wrote:
> Note that in the meantime, I've also applied (through Andrew) the
> patch that reverts commit c654345924f7 (see commit 82b212f40059
> 'Revert "mm: remove __GFP_NO_KSWAPD"').
>
> I wonder if that revert may be bogus, and a result of this same issue.
> Maybe that revert should be reverted, and replaced with your patch?
>
> Mel? Zdenek? What's the status here?
>


I've been running these for a longer period:

https://lkml.org/lkml/2012/11/5/308
https://lkml.org/lkml/2012/11/12/113

These two seem to have been merged in -rc7
(they disappeared after my git rebase).


I also added a slightly modified patch from Jiri
(https://lkml.org/lkml/2012/11/15/950).
(I'm unsure whether it still applies to -rc7.)

I also have this patch from Jan Kara <jack@suse.cz>:
fs: Fix imbalance in freeze protection in mark_files_ro()
(which has still not been applied upstream).

With these, I think I'm NOT seeing huge load from kswapd0
(at least within my not particularly long uptimes).


But I'm now also a frequent victim of my other report:

https://lkml.org/lkml/2012/11/15/369

The problem there is that if my T61 docking station has support for
'old hw' (i.e. serial output) enabled for docking in the BIOS, the
machine becomes unstable: either the 1st or the 2nd resume deadlocks
the machine, and the serial port gives just garbage.

Zdenek


>                   Linus
>
> On Tue, Nov 27, 2012 at 12:48 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:
>> Hi everyone,
>>
>> I hope I included everybody that participated in the various threads
>> on kswapd getting stuck / exhibiting high CPU usage.  We were looking
>> at at least three root causes as far as I can see, so it's not really
>> clear who observed which problem.  Please correct me if the
>> reported-by, tested-by, bisected-by tags are incomplete.
>>
>> One problem was, as it seems, overly aggressive reclaim due to scaling
>> up reclaim goals based on compaction failures.  This one was reverted
>> in 9671009 mm: revert "mm: vmscan: scale number of pages reclaimed by
>> reclaim/compaction based on failures".
>>
>> Another one was an accounting problem where a freed higher order page
>> was underreported, and so kswapd had trouble restoring watermarks.
>> This one was fixed in ef6c5be fix incorrect NR_FREE_PAGES accounting
>> (appears like memory leak).
>>
>> The third one is a problem with small zones, like the DMA zone, where
>> the high watermark is lower than the low watermark plus compaction gap
>> (2 * allocation size).  The zonelist reclaim in kswapd would do
>> nothing because all high watermarks are met, but the compaction logic
>> would find its own requirements unmet and loop over the zones again.
>> Indefinitely, until some third party would free enough memory to help
>> meet the higher compaction watermark.  The problematic code has been
>> there since the 3.4 merge window for non-THP higher order allocations
>> but has been more prominent since the 3.7 merge window, where kswapd
>> is also woken up for the much more common THP allocations.
>>
>> The following patch should fix the third issue by making both reclaim
>> and compaction code in kswapd use the same predicate to determine
>> whether a zone is balanced or not.
>>
>> Hopefully, the sum of all three fixes should tame kswapd enough for
>> 3.7.
>>
>> Johannes
>>


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: kswapd craziness in 3.7
  2012-11-28 13:35   ` Zdenek Kabelac
@ 2012-11-28 14:04     ` Jiri Slaby
  0 siblings, 0 replies; 65+ messages in thread
From: Jiri Slaby @ 2012-11-28 14:04 UTC (permalink / raw)
  To: Zdenek Kabelac
  Cc: Linus Torvalds, Johannes Weiner, Andrew Morton, Mel Gorman,
	Rik van Riel, George Spelvin, Johannes Hirte, Tomas Racek,
	Jan Kara, Dave Hansen, Josh Boyer, Valdis Kletnieks,
	Thorsten Leemhuis, Bruno Wolff III, linux-mm,
	Linux Kernel Mailing List

On 11/28/2012 02:35 PM, Zdenek Kabelac wrote:
> I also added a slightly modified patch from Jiri
> (https://lkml.org/lkml/2012/11/15/950).
> (I'm unsure whether it still applies to -rc7.)

It is needed for -next only. And if you have recent -next, it's already
there...

thanks,
-- 
js
suse labs

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: kswapd craziness in 3.7
  2012-11-28 10:13             ` Mel Gorman
  2012-11-28 10:51               ` Thorsten Leemhuis
@ 2012-11-28 16:42               ` Mel Gorman
  2012-11-28 22:52               ` Andrew Morton
  2 siblings, 0 replies; 65+ messages in thread
From: Mel Gorman @ 2012-11-28 16:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Johannes Weiner, Rik van Riel, Andrew Morton, George Spelvin,
	Johannes Hirte, Tomas Racek, Jan Kara, Dave Hansen, Josh Boyer,
	Valdis Kletnieks, Jiri Slaby, Thorsten Leemhuis, Zdenek Kabelac,
	Bruno Wolff III, linux-mm, Linux Kernel Mailing List

On Wed, Nov 28, 2012 at 10:13:59AM +0000, Mel Gorman wrote:
> On Tue, Nov 27, 2012 at 03:19:38PM -0800, Linus Torvalds wrote:
> > On Tue, Nov 27, 2012 at 2:26 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > On Tue, Nov 27, 2012 at 05:02:36PM -0500, Rik van Riel wrote:
> > >>
> > >> Kswapd going crazy is certainly a large part of the problem.
> > >>
> > >> However, that leaves the issue of page_alloc.c waking up
> > >> kswapd when the system is not actually low on memory.
> > >>
> > >> Instead, kswapd is woken up because memory compaction failed,
> > >> potentially even due to lock contention during compaction!
> > >>
> > >> Ideally the allocation code would only wake up kswapd if
> > >> memory needs to be freed, or in order for kswapd to do
> > >> memory compaction (so the allocator does not have to).
> > >
> > > Maybe I missed something, but shouldn't this be solved with my patch?
> > 
> > Ok, guys. Cage fight!
> > 
> > The rules are simple: two men enter, one man leaves.
> > 
> 
> I'm fairly scorch damaged from this whole cycle already. I won't need a
> prop master to look the part for a thunderdome match.
> 
> > And the one who comes out gets to explain to me which patch(es) I
> > should apply, and which I should revert, if any.
> > 
> 
> Based on the reports I've seen I expect the following to work for 3.7
> 
> Keep
>   96710098 mm: revert "mm: vmscan: scale number of pages reclaimed by reclaim/compaction based on failures"
>   ef6c5be6 fix incorrect NR_FREE_PAGES accounting (appears like memory leak)
> 
> Revert
>   82b212f4 Revert "mm: remove __GFP_NO_KSWAPD"
> 
> Merge
>   mm: vmscan: fix kswapd endless loop on higher order allocation
>   mm: Avoid waking kswapd for THP allocations when compaction is deferred or contended
> 

and
    mm: compaction: Fix return value of capture_free_page

but this one may already be in flight from Andrew's tree as he picked it
up already.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: kswapd craziness in 3.7
  2012-11-28 10:13             ` Mel Gorman
  2012-11-28 10:51               ` Thorsten Leemhuis
  2012-11-28 16:42               ` Mel Gorman
@ 2012-11-28 22:52               ` Andrew Morton
  2012-11-28 23:54                 ` Mel Gorman
  2 siblings, 1 reply; 65+ messages in thread
From: Andrew Morton @ 2012-11-28 22:52 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linus Torvalds, Johannes Weiner, Rik van Riel, George Spelvin,
	Johannes Hirte, Tomas Racek, Jan Kara, Dave Hansen, Josh Boyer,
	Valdis Kletnieks, Jiri Slaby, Thorsten Leemhuis, Zdenek Kabelac,
	Bruno Wolff III, linux-mm, Linux Kernel Mailing List

On Wed, 28 Nov 2012 10:13:59 +0000
Mel Gorman <mgorman@suse.de> wrote:

> Based on the reports I've seen I expect the following to work for 3.7
> 
> Keep
>   96710098 mm: revert "mm: vmscan: scale number of pages reclaimed by reclaim/compaction based on failures"
>   ef6c5be6 fix incorrect NR_FREE_PAGES accounting (appears like memory leak)
> 
> Revert
>   82b212f4 Revert "mm: remove __GFP_NO_KSWAPD"
> 
> Merge
>   mm: vmscan: fix kswapd endless loop on higher order allocation
>   mm: Avoid waking kswapd for THP allocations when compaction is deferred or contended

"mm: Avoid waking kswapd for THP ..." is marked "I have not tested it
myself" and when Zdenek tested it he hit an unexplained oom.

> Johannes' patch should remove the necessity for __GFP_NO_KSWAPD revert but I
> think we should also avoid waking kswapd for THP allocations if compaction
> is deferred. Johannes' patch might mean that kswapd quickly goes back
> to sleep but it's still busy work.
> 
> 3.6 is still known to be screwed in terms of THP because of the amount of
> time it can spend in compaction after lumpy reclaim was removed. This is
> my old list of patches I felt needed to be backported after 3.7 came out.
> They are not tagged -stable, I'll be sending it to Greg manually.
> 
> e64c523 mm: compaction: abort compaction loop if lock is contended or run too long
> 3cc668f mm: compaction: move fatal signal check out of compact_checklock_irqsave
> 661c4cb mm: compaction: Update try_to_compact_pages()kerneldoc comment
> 2a1402a mm: compaction: acquire the zone->lru_lock as late as possible
> f40d1e4 mm: compaction: acquire the zone->lock as late as possible
> 753341a revert "mm: have order > 0 compaction start off where it left"
> bb13ffe mm: compaction: cache if a pageblock was scanned and no pages were isolated
> c89511a mm: compaction: Restart compaction from near where it left off
> 6299702 mm: compaction: clear PG_migrate_skip based on compaction and reclaim activity
> 0db63d7 mm: compaction: correct the nr_strict va isolated check for CMA
> 
> Only Johannes' patch needs to be added to this list. kswapd is not woken
> for THP in 3.6 but as it calls compaction for other high-order allocations
> it still makes sense.

Please identify "Johannes' patch"?

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: kswapd craziness in 3.7
  2012-11-28 22:52               ` Andrew Morton
@ 2012-11-28 23:54                 ` Mel Gorman
  2012-11-29  0:14                   ` Andrew Morton
  2012-11-29 15:30                   ` Thorsten Leemhuis
  0 siblings, 2 replies; 65+ messages in thread
From: Mel Gorman @ 2012-11-28 23:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Johannes Weiner, Rik van Riel, George Spelvin,
	Johannes Hirte, Tomas Racek, Jan Kara, Dave Hansen, Josh Boyer,
	Valdis Kletnieks, Jiri Slaby, Thorsten Leemhuis, Zdenek Kabelac,
	Bruno Wolff III, linux-mm, Linux Kernel Mailing List

On Wed, Nov 28, 2012 at 02:52:15PM -0800, Andrew Morton wrote:
> On Wed, 28 Nov 2012 10:13:59 +0000
> Mel Gorman <mgorman@suse.de> wrote:
> 
> > Based on the reports I've seen I expect the following to work for 3.7
> > 
> > Keep
> >   96710098 mm: revert "mm: vmscan: scale number of pages reclaimed by reclaim/compaction based on failures"
> >   ef6c5be6 fix incorrect NR_FREE_PAGES accounting (appears like memory leak)
> > 
> > Revert
> >   82b212f4 Revert "mm: remove __GFP_NO_KSWAPD"
> > 
> > Merge
> >   mm: vmscan: fix kswapd endless loop on higher order allocation
> >   mm: Avoid waking kswapd for THP allocations when compaction is deferred or contended
> 
> "mm: Avoid waking kswapd for THP ..." is marked "I have not tested it
> myself" and when Zdenek tested it he hit an unexplained oom.
> 

I thought Zdenek was testing with __GFP_NO_KSWAPD when he hit that OOM.
Further, when he hit that OOM, it looked like a genuine OOM. He had no
swap configured and inactive/active file pages were very low. Finally,
the free pages for Normal looked off and could also have been affected by
the accounting bug. I'm looking at https://lkml.org/lkml/2012/11/18/132
here. Are you thinking of something else?

I have not tested with the patch admittedly but Thorsten has and seemed
to be ok with it https://lkml.org/lkml/2012/11/23/276.

> > Johannes' patch should remove the necessity for __GFP_NO_KSWAPD revert but I
> > think we should also avoid waking kswapd for THP allocations if compaction
> > is deferred. Johannes' patch might mean that kswapd quickly goes back
> > to sleep but it's still busy work.
> > 
> > 3.6 is still known to be screwed in terms of THP because of the amount of
> > time it can spend in compaction after lumpy reclaim was removed. This is
> > my old list of patches I felt needed to be backported after 3.7 came out.
> > They are not tagged -stable, I'll be sending it to Greg manually.
> > 
> > e64c523 mm: compaction: abort compaction loop if lock is contended or run too long
> > 3cc668f mm: compaction: move fatal signal check out of compact_checklock_irqsave
> > 661c4cb mm: compaction: Update try_to_compact_pages()kerneldoc comment
> > 2a1402a mm: compaction: acquire the zone->lru_lock as late as possible
> > f40d1e4 mm: compaction: acquire the zone->lock as late as possible
> > 753341a revert "mm: have order > 0 compaction start off where it left"
> > bb13ffe mm: compaction: cache if a pageblock was scanned and no pages were isolated
> > c89511a mm: compaction: Restart compaction from near where it left off
> > 6299702 mm: compaction: clear PG_migrate_skip based on compaction and reclaim activity
> > 0db63d7 mm: compaction: correct the nr_strict va isolated check for CMA
> > 
> > Only Johannes' patch needs to be added to this list. kswapd is not woken
> > for THP in 3.6 but as it calls compaction for other high-order allocations
> > it still makes sense.
> 
> Please identify "Johannes' patch"?

mm: vmscan: fix kswapd endless loop on higher order allocation

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: kswapd craziness in 3.7
  2012-11-28 23:54                 ` Mel Gorman
@ 2012-11-29  0:14                   ` Andrew Morton
  2012-11-29 15:30                   ` Thorsten Leemhuis
  1 sibling, 0 replies; 65+ messages in thread
From: Andrew Morton @ 2012-11-29  0:14 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linus Torvalds, Johannes Weiner, Rik van Riel, George Spelvin,
	Johannes Hirte, Tomas Racek, Jan Kara, Dave Hansen, Josh Boyer,
	Valdis Kletnieks, Jiri Slaby, Thorsten Leemhuis, Zdenek Kabelac,
	Bruno Wolff III, linux-mm, Linux Kernel Mailing List

On Wed, 28 Nov 2012 23:54:12 +0000
Mel Gorman <mgorman@suse.de> wrote:

> On Wed, Nov 28, 2012 at 02:52:15PM -0800, Andrew Morton wrote:
> > On Wed, 28 Nov 2012 10:13:59 +0000
> > Mel Gorman <mgorman@suse.de> wrote:
> > 
> > > Based on the reports I've seen I expect the following to work for 3.7
> > > 
> > > Keep
> > >   96710098 mm: revert "mm: vmscan: scale number of pages reclaimed by reclaim/compaction based on failures"
> > >   ef6c5be6 fix incorrect NR_FREE_PAGES accounting (appears like memory leak)
> > > 
> > > Revert
> > >   82b212f4 Revert "mm: remove __GFP_NO_KSWAPD"
> > > 
> > > Merge
> > >   mm: vmscan: fix kswapd endless loop on higher order allocation
> > >   mm: Avoid waking kswapd for THP allocations when compaction is deferred or contended
> > 
> > "mm: Avoid waking kswapd for THP ..." is marked "I have not tested it
> > myself" and when Zdenek tested it he hit an unexplained oom.
> > 
> 
> I thought Zdenek was testing with __GFP_NO_KSWAPD when he hit that OOM.
> Further, when he hit that OOM, it looked like a genuine OOM. He had no
> swap configured and inactive/active file pages were very low. Finally,
> the free pages for Normal looked off and could also have been affected by
> the accounting bug. I'm looking at https://lkml.org/lkml/2012/11/18/132
> here. Are you thinking of something else?

who, me, think?  I was trying to work out why I hadn't merged or queued
a patch which you felt was important.  Turned out it was because it
didn't look very tested and final.

> I have not tested with the patch admittedly but Thorsten has and seemed
> to be ok with it https://lkml.org/lkml/2012/11/23/276.

OK, I'll queue revert-revert-mm-remove-__gfp_no_kswapd.patch and the
patch from https://patchwork.kernel.org/patch/1728081/.

So what I'm currently sitting on for 3.7 is

mm-compaction-fix-return-value-of-capture_free_page.patch
mm-vmemmap-fix-wrong-use-of-virt_to_page.patch
mm-vmscan-fix-endless-loop-in-kswapd-balancing.patch
revert-revert-mm-remove-__gfp_no_kswapd.patch
mm-avoid-waking-kswapd-for-thp-allocations-when-compaction-is-deferred-or-contended.patch
mm-soft-offline-split-thp-at-the-beginning-of-soft_offline_page.patch

> > Please identify "Johannes' patch"?
> 
> mm: vmscan: fix kswapd endless loop on higher order allocation

OK, we have that.  I'll start a round of testing, do another -next drop
and send the above Linuswards tomorrow.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: kswapd craziness in 3.7
  2012-11-28 23:54                 ` Mel Gorman
  2012-11-29  0:14                   ` Andrew Morton
@ 2012-11-29 15:30                   ` Thorsten Leemhuis
  2012-11-29 17:05                     ` Johannes Weiner
  1 sibling, 1 reply; 65+ messages in thread
From: Thorsten Leemhuis @ 2012-11-29 15:30 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linus Torvalds, Johannes Weiner, Rik van Riel,
	George Spelvin, Johannes Hirte, Tomas Racek, Jan Kara,
	Dave Hansen, Josh Boyer, Valdis Kletnieks, Jiri Slaby,
	Zdenek Kabelac, Bruno Wolff III, linux-mm,
	Linux Kernel Mailing List

Mel Gorman wrote on 29.11.2012 00:54:
> On Wed, Nov 28, 2012 at 02:52:15PM -0800, Andrew Morton wrote:
>> On Wed, 28 Nov 2012 10:13:59 +0000
>> Mel Gorman <mgorman@suse.de> wrote:
>> 
>> > Based on the reports I've seen I expect the following to work for 3.7
>> > Keep
>> >   96710098 mm: revert "mm: vmscan: scale number of pages reclaimed by reclaim/compaction based on failures"
>> >   ef6c5be6 fix incorrect NR_FREE_PAGES accounting (appears like memory leak)
>> > Revert
>> >   82b212f4 Revert "mm: remove __GFP_NO_KSWAPD"
>> > Merge
>> >   mm: vmscan: fix kswapd endless loop on higher order allocation
>> >   mm: Avoid waking kswapd for THP allocations when compaction is deferred or contended
>> "mm: Avoid waking kswapd for THP ..." is marked "I have not tested it
>> myself" and when Zdenek tested it he hit an unexplained oom.
> I thought Zdenek was testing with __GFP_NO_KSWAPD when he hit that OOM.
> Further, when he hit that OOM, it looked like a genuine OOM. He had no
> swap configured and inactive/active file pages were very low. Finally,
> the free pages for Normal looked off and could also have been affected by
> the accounting bug. I'm looking at https://lkml.org/lkml/2012/11/18/132
> here. Are you thinking of something else?
> 
> I have not tested with the patch admittedly but Thorsten has and seemed
> to be ok with it https://lkml.org/lkml/2012/11/23/276.

Yeah, on my two main work horses a few different kernels based on rc6 or
rc7 worked fine with this patch. But sorry, it seems the patch doesn't
fix the problems Fedora user John Ellson sees, who tried kernels I built
in the Fedora buildsystem. Details:

In https://bugzilla.redhat.com/show_bug.cgi?id=866988#c35 he mentioned
his machine worked fine with a rc6 based kernel I built that contained
82b212f4 (Revert "mm: remove __GFP_NO_KSWAPD"). Before that he had tried
a kernel with the same baseline that contained "Avoid waking kswapd for
THP allocations when […]" instead and reported it didn't help on his
i686 machine (seems it helped the x86-64 one):
https://bugzilla.redhat.com/show_bug.cgi?id=866988#c33

He has now tried a recent mainline kernel I built 20 hours ago that is based
on a git checkout from roughly two days ago, reverts 82b212f4, and had
 * fix-kswapd-endless-loop-on-higher-order-allocation.patch
 * Avoid-waking-kswapd-for-THP-allocations-when.patch
 * mm-compaction-Fix-return-value-of-capture_free_page.patch
applied. In https://bugzilla.redhat.com/show_bug.cgi?id=866988#c39 and
comment 41 he reported that this kernel on his i686 host showed 100% CPU
usage by kswapd0 :-/

Build log for said kernel rpms (I'm quite sure I applied the patches
properly, but you know: mistakes happen, so be careful, maybe I did
something stupid somewhere...):
http://kojipkgs.fedoraproject.org//work/tasks/8253/4738253/build.log

I know, this makes things more complicated again; but I wanted to let
you guys know that some problem might still be lurking somewhere. Side
note: right now it seems that John, with kernels that contain
"Avoid-waking-kswapd-for-THP-allocations-when", can trigger the problem
more quickly (or only?) on i686 than on x86-64.

CU
Thorsten

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: kswapd craziness in 3.7
  2012-11-29 15:30                   ` Thorsten Leemhuis
@ 2012-11-29 17:05                     ` Johannes Weiner
  2012-11-30 12:39                       ` Thorsten Leemhuis
  0 siblings, 1 reply; 65+ messages in thread
From: Johannes Weiner @ 2012-11-29 17:05 UTC (permalink / raw)
  To: Thorsten Leemhuis
  Cc: Mel Gorman, Andrew Morton, Linus Torvalds, Rik van Riel,
	George Spelvin, Johannes Hirte, Tomas Racek, Jan Kara,
	Dave Hansen, Josh Boyer, Valdis Kletnieks, Jiri Slaby,
	Zdenek Kabelac, Bruno Wolff III, linux-mm,
	Linux Kernel Mailing List

On Thu, Nov 29, 2012 at 04:30:12PM +0100, Thorsten Leemhuis wrote:
> Mel Gorman wrote on 29.11.2012 00:54:
> > On Wed, Nov 28, 2012 at 02:52:15PM -0800, Andrew Morton wrote:
> >> On Wed, 28 Nov 2012 10:13:59 +0000
> >> Mel Gorman <mgorman@suse.de> wrote:
> >> 
> >> > Based on the reports I've seen I expect the following to work for 3.7
> >> > Keep
> >> >   96710098 mm: revert "mm: vmscan: scale number of pages reclaimed by reclaim/compaction based on failures"
> >> >   ef6c5be6 fix incorrect NR_FREE_PAGES accounting (appears like memory leak)
> >> > Revert
> >> >   82b212f4 Revert "mm: remove __GFP_NO_KSWAPD"
> >> > Merge
> >> >   mm: vmscan: fix kswapd endless loop on higher order allocation
> >> >   mm: Avoid waking kswapd for THP allocations when compaction is deferred or contended
> >> "mm: Avoid waking kswapd for THP ..." is marked "I have not tested it
> >> myself" and when Zdenek tested it he hit an unexplained oom.
> > I thought Zdenek was testing with __GFP_NO_KSWAPD when he hit that OOM.
> > Further, when he hit that OOM, it looked like a genuine OOM. He had no
> > swap configured and inactive/active file pages were very low. Finally,
> > the free pages for Normal looked off and could also have been affected by
> > the accounting bug. I'm looking at https://lkml.org/lkml/2012/11/18/132
> > here. Are you thinking of something else?
> > 
> > I have not tested with the patch admittedly but Thorsten has and seemed
> > to be ok with it https://lkml.org/lkml/2012/11/23/276.
> 
> Yeah, on my two main work horses a few different kernels based on rc6 or
> rc7 worked fine with this patch. But sorry, it seems the patch doesn't
> fix the problems Fedora user John Ellson sees, who tried kernels I built
> in the Fedora buildsystem. Details:
> 
> In https://bugzilla.redhat.com/show_bug.cgi?id=866988#c35 he mentioned
> his machine worked fine with a rc6 based kernel I built that contained
> 82b212f4 (Revert "mm: remove __GFP_NO_KSWAPD"). Before that he had tried
> a kernel with the same baseline that contained "Avoid waking kswapd for
> THP allocations when […]" instead and reported it didn't help on his
> i686 machine (seems it helped the x86-64 one):
> https://bugzilla.redhat.com/show_bug.cgi?id=866988#c33
> 
> He now tried a recent mainline kernel I built 20 hours ago that is based
> on a git checkout from round about two days ago, reverts 82b212f4, and had
>  * fix-kswapd-endless-loop-on-higher-order-allocation.patch
>  * Avoid-waking-kswapd-for-THP-allocations-when.patch
>  * mm-compaction-Fix-return-value-of-capture_free_page.patch
> applied. In https://bugzilla.redhat.com/show_bug.cgi?id=866988#c39 and
> comment 41 he reported that this kernel on his i686 host showed 100%cpu
> usage by kswapd0 :-/
> 
> Build log for said kernel rpms (I'm quite sure I applied the patches
> properly, but you know: mistakes happen, so be careful, maybe I did
> something stupid somewhere...):
> http://kojipkgs.fedoraproject.org//work/tasks/8253/4738253/build.log
> 
> I know, this makes things more complicated again; but I wanted to let
> you guys know that some problem might still be lurking somewhere. Side
> note: right now it seems John with kernels that contain
> "Avoid-waking-kswapd-for-THP-allocations-when" can trigger the problem
> quicker (or only?) on i686 than on x86-64.

Humm, highmem...  Could this be the lowmem protection forcing kswapd
to reclaim highmem at DEF_PRIORITY (not useful but burns CPU) every
time it's woken up?

This requires somebody to wake up kswapd regularly, though, and from
his report it's not quite clear to me if kswapd gets stuck or just has
really high CPU usage while the system is still under load.  The
initial post says he would expect "<5% cpu when idling" but his top
snippet in there shows there are other tasks running as well.  So does
it happen while the system is busy or when it's otherwise idle?

[ On the other hand, not waking kswapd from THP allocations seems to
  not show this problem on his i686 machine.  But it could also just
  be a tiny window of conditions aligning perfectly that drops kswapd
  in an endless loop, and the increased wakeups increase the
  probability of hitting it.  So, yeah, this would be good to know. ]

As the system is still responsive when this happens, any chance he
could capture /proc/zoneinfo and /proc/vmstat when kswapd goes
haywire?

Or even run perf record -a -g sleep 5; perf report > kswapd.txt?
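
For whoever ends up collecting the data, here is a minimal sketch of the
capture suggested above, in Python.  It assumes a Linux /proc; the function
names and the output directory are made up for illustration and do not
belong to any existing tool.  It saves /proc/zoneinfo and reports which
/proc/vmstat counters moved over a short interval.

```python
import time
from pathlib import Path


def parse_vmstat(text):
    """Turn 'name value' lines, as found in /proc/vmstat, into a dict."""
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip anything that is not exactly "<name> <unsigned integer>".
        if len(fields) == 2 and fields[1].isdigit():
            counters[fields[0]] = int(fields[1])
    return counters


def vmstat_delta(before, after):
    """Return only the counters whose value changed between two samples."""
    return {name: after[name] - before[name]
            for name in after
            if name in before and after[name] != before[name]}


def capture(outdir="kswapd-debug", interval=5.0):
    """Save zoneinfo plus two vmstat samples taken `interval` seconds apart.

    The directory name and the interval are arbitrary defaults.
    """
    out = Path(outdir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "zoneinfo.txt").write_text(Path("/proc/zoneinfo").read_text())
    first = Path("/proc/vmstat").read_text()
    time.sleep(interval)
    second = Path("/proc/vmstat").read_text()
    (out / "vmstat-1.txt").write_text(first)
    (out / "vmstat-2.txt").write_text(second)
    return vmstat_delta(parse_vmstat(first), parse_vmstat(second))
```

Calling capture() while kswapd0 is spinning and printing the returned dict
shows which counters are moving; the saved files can be attached to the
bug report as they are.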

Preferably with this patch applied, to rule out faulty lowmem
protection:

buffer_heads_over_limit can put kswapd into reclaim, but it's ignored
when figuring out whether the zone is balanced and so priority levels
are not descended and no progress is ever made.

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3b0aef4..73c4f5f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2400,6 +2400,14 @@ static void age_active_anon(struct zone *zone, struct scan_control *sc)
 static bool zone_balanced(struct zone *zone, int order,
 			  unsigned long balance_gap, int classzone_idx)
 {
+	/*
+	 * If the number of buffer_heads in the machine exceeds the
+	 * maximum allowed level and this node has a highmem zone,
+	 * force kswapd to reclaim from it to relieve lowmem pressure.
+	 */
+	if (is_highmem(zone) && buffer_heads_over_limit)
+		return false;
+
 	if (!zone_watermark_ok_safe(zone, order, high_wmark_pages(zone) +
 				    balance_gap, classzone_idx, 0))
 		return false;
@@ -2586,17 +2594,6 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 			 */
 			age_active_anon(zone, &sc);
 
-			/*
-			 * If the number of buffer_heads in the machine
-			 * exceeds the maximum allowed level and this node
-			 * has a highmem zone, force kswapd to reclaim from
-			 * it to relieve lowmem pressure.
-			 */
-			if (buffer_heads_over_limit && is_highmem_idx(i)) {
-				end_zone = i;
-				break;
-			}
-
 			if (!zone_balanced(zone, order, 0, 0)) {
 				end_zone = i;
 				break;
@@ -2672,8 +2669,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 						COMPACT_SKIPPED)
 				testorder = 0;
 
-			if ((buffer_heads_over_limit && is_highmem_idx(i)) ||
-			    !zone_balanced(zone, testorder,
+			if (!zone_balanced(zone, testorder,
 					   balance_gap, end_zone)) {
 				shrink_zone(zone, &sc);
 
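
To make the reasoning behind the patch above concrete, here is a toy model
of the balancing loop in Python.  It is a deliberate simplification, not
the kernel code: the Zone class, priorities_scanned() and the `patched`
switch are hypothetical names used only for illustration, and the model
reclaims nothing.  It only shows that, with the watermarks met and
buffer_heads_over_limit set, the old logic stops after one pass at
DEF_PRIORITY, while the patched predicate keeps descending.

```python
from dataclasses import dataclass

DEF_PRIORITY = 12  # same value as the kernel's DEF_PRIORITY


@dataclass
class Zone:
    """Toy stand-in for struct zone; every field is illustrative."""
    name: str
    free_pages: int
    high_wmark: int
    highmem: bool = False


def zone_balanced(zone, balance_gap, bh_over_limit, patched):
    # Patched behaviour: too many buffer_heads make a highmem zone count
    # as unbalanced, whatever its watermarks say.
    if patched and zone.highmem and bh_over_limit:
        return False
    return zone.free_pages >= zone.high_wmark + balance_gap


def priorities_scanned(zones, bh_over_limit, patched):
    """Return the priority levels the simplified loop visits."""
    visited = []
    for priority in range(DEF_PRIORITY, -1, -1):
        # Before the patch, a separate buffer_heads check forced highmem
        # zones into reclaim without consulting the balance predicate.
        needs_reclaim = any(
            (not patched and z.highmem and bh_over_limit)
            or not zone_balanced(z, 0, bh_over_limit, patched)
            for z in zones)
        if not needs_reclaim:
            break
        visited.append(priority)
        # shrink_zone() would run here; this model frees nothing.
        # The end-of-pass check relies on the balance predicate alone.
        if all(zone_balanced(z, 0, bh_over_limit, patched) for z in zones):
            break
    return visited
```

With one lowmem and one highmem zone, both above their high watermark, and
the buffer_heads limit exceeded, the unpatched variant visits only
DEF_PRIORITY on every wakeup, whereas the patched variant walks down
through all priority levels.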


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* Re: kswapd craziness in 3.7
  2012-11-29 17:05                     ` Johannes Weiner
@ 2012-11-30 12:39                       ` Thorsten Leemhuis
  2012-12-01  0:45                         ` Johannes Weiner
  0 siblings, 1 reply; 65+ messages in thread
From: Thorsten Leemhuis @ 2012-11-30 12:39 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Mel Gorman, Andrew Morton, Linus Torvalds, Rik van Riel,
	George Spelvin, Johannes Hirte, Tomas Racek, Jan Kara,
	Dave Hansen, Josh Boyer, Valdis Kletnieks, Jiri Slaby,
	Zdenek Kabelac, Bruno Wolff III, linux-mm,
	Linux Kernel Mailing List

Johannes Weiner wrote on 29.11.2012 18:05:
> On Thu, Nov 29, 2012 at 04:30:12PM +0100, Thorsten Leemhuis wrote:
>> Mel Gorman wrote on 29.11.2012 00:54:
>> > On Wed, Nov 28, 2012 at 02:52:15PM -0800, Andrew Morton wrote:
>> >> On Wed, 28 Nov 2012 10:13:59 +0000
>> >> Mel Gorman <mgorman@suse.de> wrote:
>> >> > Based on the reports I've seen I expect the following to work for 3.7
>> >> > Keep
>> >> >   96710098 mm: revert "mm: vmscan: scale number of pages reclaimed by reclaim/compaction based on failures"
>> >> >   ef6c5be6 fix incorrect NR_FREE_PAGES accounting (appears like memory leak)
>> >> > Revert
>> >> >   82b212f4 Revert "mm: remove __GFP_NO_KSWAPD"
>> >> > Merge
>> >> >   mm: vmscan: fix kswapd endless loop on higher order allocation
>> >> >   mm: Avoid waking kswapd for THP allocations when compaction is deferred or contended
>> >> "mm: Avoid waking kswapd for THP ..." is marked "I have not tested it
>> >> myself" and when Zdenek tested it he hit an unexplained oom.
>> > I thought Zdenek was testing with __GFP_NO_KSWAPD when he hit that OOM.
>> > Further, when he hit that OOM, it looked like a genuine OOM. He had no
>> > swap configured and inactive/active file pages were very low. Finally,
>> > the free pages for Normal looked off and could also have been affected by
>> > the accounting bug. I'm looking at https://lkml.org/lkml/2012/11/18/132
>> > here. Are you thinking of something else?
>> > I have not tested with the patch admittedly but Thorsten has and seemed
>> > to be ok with it https://lkml.org/lkml/2012/11/23/276.
>> Yeah, on my two main work horses a few different kernels based on rc6 or
>> rc7 worked fine with this patch. But sorry, it seems the patch doesn't
>> fix the problems Fedora user John Ellson sees, who tried kernels I built
>> in the Fedora buildsystem. Details:
> [...]
>> I know, this makes things more complicated again; but I wanted to let
>> you guys know that some problem might still be lurking somewhere. Side
>> note: right now it seems John with kernels that contain
>> "Avoid-waking-kswapd-for-THP-allocations-when" can trigger the problem
>> quicker (or only?) on i686 than on x86-64.
>
> Humm, highmem...  Could this be the lowmem protection forcing kswapd
> to reclaim highmem at DEF_PRIORITY (not useful but burns CPU) every
> time it's woken up?
> 
> This requires somebody to wake up kswapd regularly, though and from
> his report it's not quite clear to me if kswapd gets stuck or just has
> really high CPU usage while the system is still under load.  The
> initial post says he would expect "<5% cpu when idling" but his top
> snippet in there shows there are other tasks running as well.  So does
> it happen while the system is busy or when it's otherwise idle?
> 
> [ On the other hand, not waking kswapd from THP allocations seems to
>   not show this problem on his i686 machine.  But it could also just
>   be a tiny window of conditions aligning perfectly that drops kswapd
>   in an endless loop, and the increased wakeups increase the
>   probability of hitting it.  So, yeah, this would be good to know. ]
> 
> As the system is still responsive when this happens, any chance he
> could capture /proc/zoneinfo and /proc/vmstat when kswapd goes
> haywire?
> 
> Or even run perf record -a -g sleep 5; perf report > kswapd.txt?
> 
> Preferably with this patch applied, to rule out faulty lowmem
> protection:
> 
> buffer_heads_over_limit can put kswapd into reclaim, but it's ignored
> when figuring out whether the zone is balanced and so priority levels
> are not descended and no progress is ever made.

/me wonders how to elegantly get out of his man-in-the-middle position

John was able to reproduce the problem quickly with a kernel that 
contained the patch from your mail. For details see 

https://bugzilla.redhat.com/show_bug.cgi?id=866988#c42 and later

He provided the information there. Parts of it:

/proc/vmstat while kswapd0 at 100% cpu

nr_free_pages 196858
nr_inactive_anon 15804
nr_active_anon 65
nr_inactive_file 20792
nr_active_file 11307
nr_unevictable 0
nr_mlock 0
nr_anon_pages 14385
nr_mapped 2393
nr_file_pages 32563
nr_dirty 5
nr_writeback 0
nr_slab_reclaimable 3113
nr_slab_unreclaimable 4725
nr_page_table_pages 271
nr_kernel_stack 96
nr_unstable 0
nr_bounce 0
nr_vmscan_write 1487
nr_vmscan_immediate_reclaim 3
nr_writeback_temp 0
nr_isolated_anon 0
nr_isolated_file 0
nr_shmem 381
nr_dirtied 388323
nr_written 361128
nr_anon_transparent_hugepages 1
nr_free_cma 0
nr_dirty_threshold 38188
nr_dirty_background_threshold 19094
pgpgin 1057223
pgpgout 1552306
pswpin 8
pswpout 1487
pgalloc_dma 5548
pgalloc_normal 10651864
pgalloc_high 2191246
pgalloc_movable 0
pgfree 13055503
pgactivate 440358
pgdeactivate 259724
pgfault 31423675
pgmajfault 3760
pgrefill_dma 2174
pgrefill_normal 212914
pgrefill_high 51755
pgrefill_movable 0
pgsteal_kswapd_dma 1
pgsteal_kswapd_normal 202106
pgsteal_kswapd_high 36515
pgsteal_kswapd_movable 0
pgsteal_direct_dma 18
pgsteal_direct_normal 0
pgsteal_direct_high 3818
pgsteal_direct_movable 0
pgscan_kswapd_dma 1
pgscan_kswapd_normal 203044
pgscan_kswapd_high 40407
pgscan_kswapd_movable 0
pgscan_direct_dma 18
pgscan_direct_normal 0
pgscan_direct_high 4409
pgscan_direct_movable 0
pgscan_direct_throttle 0
pginodesteal 0
slabs_scanned 264192
kswapd_inodesteal 171676
kswapd_low_wmark_hit_quickly 0
kswapd_high_wmark_hit_quickly 26
kswapd_skip_congestion_wait 0
pageoutrun 117729182
allocstall 5
pgrotated 1628
compact_blocks_moved 313
compact_pages_moved 7192
compact_pagemigrate_failed 265
compact_stall 13
compact_fail 9
compact_success 4
htlb_buddy_alloc_success 0
htlb_buddy_alloc_fail 0
unevictable_pgs_culled 2985
unevictable_pgs_scanned 0
unevictable_pgs_rescued 1877
unevictable_pgs_mlocked 3965
unevictable_pgs_munlocked 3965
unevictable_pgs_cleared 0
unevictable_pgs_stranded 0
thp_fault_alloc 636
thp_fault_fallback 10
thp_collapse_alloc 342
thp_collapse_alloc_failed 2
thp_split 6


/proc/zoneinfo with kswapd0 at 100% cpu

Node 0, zone      DMA
  pages free     1655
        min      196
        low      245
        high     294
        scanned  0
        spanned  4080
        present  3951
    nr_free_pages 1655
    nr_inactive_anon 0
    nr_active_anon 0
    nr_inactive_file 0
    nr_active_file 0
    nr_unevictable 0
    nr_mlock     0
    nr_anon_pages 0
    nr_mapped    0
    nr_file_pages 0
    nr_dirty     0
    nr_writeback 0
    nr_slab_reclaimable 3
    nr_slab_unreclaimable 1
    nr_page_table_pages 0
    nr_kernel_stack 0
    nr_unstable  0
    nr_bounce    0
    nr_vmscan_write 0
    nr_vmscan_immediate_reclaim 0
    nr_writeback_temp 0
    nr_isolated_anon 0
    nr_isolated_file 0
    nr_shmem     0
    nr_dirtied   315
    nr_written   315
    nr_anon_transparent_hugepages 0
    nr_free_cma  0
        protection: (0, 861, 1000, 1000)
  pagesets
    cpu: 0
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 2
  all_unreclaimable: 1
  start_pfn:         16
  inactive_ratio:    1
Node 0, zone   Normal
  pages free     186234
        min      10953
        low      13691
        high     16429
        scanned  0
        spanned  222206
        present  220470
    nr_free_pages 186234
    nr_inactive_anon 3147
    nr_active_anon 2
    nr_inactive_file 14064
    nr_active_file 4672
    nr_unevictable 0
    nr_mlock     0
    nr_anon_pages 3028
    nr_mapped    216
    nr_file_pages 18857
    nr_dirty     8
    nr_writeback 0
    nr_slab_reclaimable 3110
    nr_slab_unreclaimable 4729
    nr_page_table_pages 62
    nr_kernel_stack 96
    nr_unstable  0
    nr_bounce    0
    nr_vmscan_write 311
    nr_vmscan_immediate_reclaim 2
    nr_writeback_temp 0
    nr_isolated_anon 0
    nr_isolated_file 0
    nr_shmem     114
    nr_dirtied   339809
    nr_written   315061
    nr_anon_transparent_hugepages 0
    nr_free_cma  0
        protection: (0, 0, 1111, 1111)
  pagesets
    cpu: 0
              count: 81
              high:  186
              batch: 31
  vm stats threshold: 8
  all_unreclaimable: 0
  start_pfn:         4096
  inactive_ratio:    1
Node 0, zone  HighMem
  pages free     8983
        min      34
        low      475
        high     917
        scanned  0
        spanned  35840
        present  35560
    nr_free_pages 8983
    nr_inactive_anon 12661
    nr_active_anon 64
    nr_inactive_file 6849
    nr_active_file 6500
    nr_unevictable 0
    nr_mlock     0
    nr_anon_pages 11357
    nr_mapped    2177
    nr_file_pages 13692
    nr_dirty     0
    nr_writeback 0
    nr_slab_reclaimable 0
    nr_slab_unreclaimable 0
    nr_page_table_pages 209
    nr_kernel_stack 0
    nr_unstable  0
    nr_bounce    0
    nr_vmscan_write 1176
    nr_vmscan_immediate_reclaim 1
    nr_writeback_temp 0
    nr_isolated_anon 0
    nr_isolated_file 0
    nr_shmem     267
    nr_dirtied   48189
    nr_written   45739
    nr_anon_transparent_hugepages 1
    nr_free_cma  0
        protection: (0, 0, 0, 0)
  pagesets
    cpu: 0
              count: 20
              high:  42
              batch: 7
  vm stats threshold: 4
  all_unreclaimable: 0
  start_pfn:         226302
  inactive_ratio:    1


First few lines of the perf profile while kswapd0 at 100% cpu

# ========
# captured on: Fri Nov 30 07:22:00 2012
# hostname : rawhide
# os release : 3.7.0-0.rc7.git1.2.van.main.knurd.kswap.3.fc18.i686
# perf version : 3.7.0-0.rc7.git1.2.van.main.knurd.kswap.3.fc18.i686
# arch : i686
# nrcpus online : 1
# nrcpus avail : 1
# cpudesc : QEMU Virtual CPU version 1.0.1
# cpuid : AuthenticAMD,6,2,3
# total memory : 1027716 kB
# cmdline : /usr/bin/perf record -g -a sleep 5 
# event : name = cpu-clock, type = 1, config = 0x0, config1 = 0x0, config2 = 0x0, excl_usr = 0, excl_kern = 0, excl_host = 0, excl_guest = 1, precise_ip = 0, id = { 7 }
# HEADER_CPU_TOPOLOGY info available, use -I to display
# pmu mappings: software = 1, tracepoint = 2, breakpoint = 5
# ========
#
# Samples: 20K of event 'cpu-clock'
# Event count (approx.): 20016
#
# Overhead      Command              Shared Object                               Symbol
# ........  ...........  .........................  ...................................
#
    16.52%      kswapd0  [kernel.kallsyms]          [k] idr_get_next                   
                |
                --- idr_get_next
                   |          
                   |--99.76%-- css_get_next
                   |          mem_cgroup_iter
                   |          |          
                   |          |--50.49%-- shrink_zone
                   |          |          kswapd
                   |          |          kthread
                   |          |          ret_from_kernel_thread
                   |          |          
                   |           --49.51%-- kswapd
                   |                     kthread
                   |                     ret_from_kernel_thread
                    --0.24%-- [...]

    11.23%      kswapd0  [kernel.kallsyms]          [k] prune_super                    
                |
                --- prune_super
                   |          
                   |--86.74%-- shrink_slab
                   |          kswapd
                   |          kthread
                   |          ret_from_kernel_thread
                   |          
                    --13.26%-- kswapd
                              kthread
                              ret_from_kernel_thread

    10.73%      kswapd0  [kernel.kallsyms]          [k] shrink_slab                    
                |
                --- shrink_slab
                   |          
                   |--99.58%-- kswapd
                   |          kthread
                   |          ret_from_kernel_thread
                    --0.42%-- [...]

     7.36%      kswapd0  [kernel.kallsyms]          [k] grab_super_passive             
                |
                --- grab_super_passive
                   |          
                   |--92.46%-- prune_super
                   |          shrink_slab
                   |          kswapd
                   |          kthread
                   |          ret_from_kernel_thread
                   |          
                    --7.54%-- shrink_slab
                              kswapd
                              kthread
                              ret_from_kernel_thread

     5.82%      kswapd0  [kernel.kallsyms]          [k] _raw_spin_lock                 
                |
                --- _raw_spin_lock
                   |          
                   |--34.28%-- put_super
                   |          drop_super
                   |          prune_super
                   |          shrink_slab
                   |          kswapd
                   |          kthread
                   |          ret_from_kernel_thread
                   |          
                   |--30.50%-- grab_super_passive
                   |          prune_super
                   |          shrink_slab
                   |          kswapd
                   |          kthread
                   |          ret_from_kernel_thread
                   |          
                   |--17.27%-- prune_super
                   |          shrink_slab
                   |          kswapd
                   |          kthread
                   |          ret_from_kernel_thread
                   |          
                   |--16.15%-- drop_super
                   |          prune_super
                   |          shrink_slab
                   |          kswapd
                   |          kthread
                   |          ret_from_kernel_thread
                   |          
                   |--1.20%-- mb_cache_shrink_fn
                   |          shrink_slab
                   |          kswapd
                   |          kthread
                   |          ret_from_kernel_thread
                   |          
                    --0.60%-- shrink_slab
                              kswapd
                              kthread
                              ret_from_kernel_thread

     4.43%      kswapd0  [kernel.kallsyms]          [k] fill_contig_page_info          
                |
                --- fill_contig_page_info
                   |          
                   |--99.10%-- fragmentation_index
                   |          compaction_suitable
                   |          kswapd
                   |          kthread
                   |          ret_from_kernel_thread
                   |          
                    --0.90%-- compaction_suitable
                              kswapd
                              kthread
                              ret_from_kernel_thread

     3.81%      kswapd0  [kernel.kallsyms]          [k] shrink_lruvec                  
                |
                --- shrink_lruvec
                   |          
                   |--99.34%-- shrink_zone
                   |          kswapd
                   |          kthread
                   |          ret_from_kernel_thread
                   |          
                    --0.66%-- kswapd
                              kthread
                              ret_from_kernel_thread

The rest at https://bugzilla.redhat.com/attachment.cgi?id=654977

CU
 Thorsten


* Re: kswapd craziness in 3.7
  2012-11-30 12:39                       ` Thorsten Leemhuis
@ 2012-12-01  0:45                         ` Johannes Weiner
  2012-12-03  8:30                           ` Thorsten Leemhuis
  0 siblings, 1 reply; 65+ messages in thread
From: Johannes Weiner @ 2012-12-01  0:45 UTC (permalink / raw)
  To: Thorsten Leemhuis
  Cc: Mel Gorman, Andrew Morton, Linus Torvalds, Rik van Riel,
	George Spelvin, Johannes Hirte, Tomas Racek, Jan Kara,
	Dave Hansen, Josh Boyer, Valdis Kletnieks, Jiri Slaby,
	Zdenek Kabelac, Bruno Wolff III, linux-mm,
	Linux Kernel Mailing List, John Ellson

Hi Thorsten,

On Fri, Nov 30, 2012 at 01:39:03PM +0100, Thorsten Leemhuis wrote:
> /me wonders how to elegantly get out of his man-in-the-middle position

You control the mighty koji :-)

But seriously, this is very helpful, thank you!  John now also Cc'd
directly.

> John was able to reproduce the problem quickly with a kernel that 
> contained the patch from your mail. For details see
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=866988#c42 and later
> 
> He provided the information there. Parts of it:

> /proc/vmstat while kswapd0 at 100% cpu
> /proc/zoneinfo with kswapd0 at 100% cpu
> perf profile

Thanks.

I'm quoting the interesting bits in order of the cars on my possibly
derailing train of thought:

> pageoutrun 117729182
> allocstall 5

Okay, so kswapd is stupidly looping but it's still managing to do its
actual job; nobody is dropping into direct reclaim.

> pgsteal_kswapd_dma 1
> pgsteal_kswapd_normal 202106
> pgsteal_kswapd_high 36515
> pgsteal_kswapd_movable 0

> pgscan_kswapd_dma 1
> pgscan_kswapd_normal 203044
> pgscan_kswapd_high 40407
> pgscan_kswapd_movable 0

Does not seem excessive, so apparently it also does not overreclaim.
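
To put rough numbers on that, here is a small sketch that sums the
kswapd scan and steal counters quoted above and relates them to
pageoutrun.  The dictionary and the helper name kswapd_totals are made
up for illustration, and reading pageoutrun as one count per pass
through kswapd's balancing loop is an assumption.

```python
# kswapd counters copied from the /proc/vmstat snapshot quoted in this thread.
vmstat = {
    "pgscan_kswapd_dma": 1,
    "pgscan_kswapd_normal": 203044,
    "pgscan_kswapd_high": 40407,
    "pgscan_kswapd_movable": 0,
    "pgsteal_kswapd_dma": 1,
    "pgsteal_kswapd_normal": 202106,
    "pgsteal_kswapd_high": 36515,
    "pgsteal_kswapd_movable": 0,
    "pageoutrun": 117729182,
    "allocstall": 5,
}


def kswapd_totals(stats):
    """Return (pages scanned, pages reclaimed) by kswapd over all zones."""
    scanned = sum(v for k, v in stats.items() if k.startswith("pgscan_kswapd_"))
    stolen = sum(v for k, v in stats.items() if k.startswith("pgsteal_kswapd_"))
    return scanned, stolen


scanned, stolen = kswapd_totals(vmstat)

# Share of scanned pages that were actually reclaimed.
reclaim_ratio = stolen / scanned

# Average pages scanned per balancing pass (assumed meaning of pageoutrun).
pages_per_run = scanned / vmstat["pageoutrun"]

print(f"scanned={scanned} stolen={stolen}")
print(f"reclaim ratio={reclaim_ratio:.3f} pages scanned per run={pages_per_run:.5f}")
```

About 98% of the scanned pages were reclaimed, and on average far less
than one page was scanned per pass, which fits a loop that mostly
spins without doing reclaim work.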

> Node 0, zone      DMA
>   pages free     1655
>         min      196
>         low      245
>         high     294

> Node 0, zone   Normal
>   pages free     186234
>         min      10953
>         low      13691
>         high     16429

> Node 0, zone  HighMem
>   pages free     8983
>         min      34
>         low      475
>         high     917

These are all well above their watermarks, yet kswapd is definitely
finding something wrong with one of these as it actually does drop
into the reclaim loop, so zone_balanced() must be returning false:

>     16.52%      kswapd0  [kernel.kallsyms]          [k] idr_get_next                   
>                 |
>                 --- idr_get_next
>                    |          
>                    |--99.76%-- css_get_next
>                    |          mem_cgroup_iter
>                    |          |          
>                    |          |--50.49%-- shrink_zone
>                    |          |          kswapd
>                    |          |          kthread
>                    |          |          ret_from_kernel_thread
>                    |          |          
>                    |           --49.51%-- kswapd
>                    |                     kthread
>                    |                     ret_from_kernel_thread
>                     --0.24%-- [...]
> 
>     11.23%      kswapd0  [kernel.kallsyms]          [k] prune_super                    
>                 |
>                 --- prune_super
>                    |          
>                    |--86.74%-- shrink_slab
>                    |          kswapd
>                    |          kthread
>                    |          ret_from_kernel_thread
>                    |          
>                     --13.26%-- kswapd
>                               kthread
>                               ret_from_kernel_thread

Spending so much time in shrink_zone and shrink_slab without
overreclaiming a zone, I would say that a) this always stays at
DEF_PRIORITY and b) it only loops on the DMA zone.  At DEF_PRIORITY,
the scan goal for file pages in the other zones would be > 0, for
example, so looping over them would show up as reclaim.
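
As a rough illustration of those scan goals, the sketch below shifts
the file LRU sizes from the zoneinfo snapshot right by DEF_PRIORITY.
It assumes DEF_PRIORITY is 12 and that the scan target is simply the
LRU size shifted by the priority; the real get_scan_count() also
weighs anon against file pages and handles memcg, which is ignored
here.  The helper name scan_goal is made up.

```python
DEF_PRIORITY = 12  # assumed default starting priority in mm/vmscan.c

# File LRU sizes per zone, copied from the /proc/zoneinfo snapshot above.
file_lru = {
    "DMA": {"inactive_file": 0, "active_file": 0},
    "Normal": {"inactive_file": 14064, "active_file": 4672},
    "HighMem": {"inactive_file": 6849, "active_file": 6500},
}


def scan_goal(lru_size, priority=DEF_PRIORITY):
    """Simplified scan target: the LRU size shifted right by the priority."""
    return lru_size >> priority


goals = {
    zone: {lru: scan_goal(size) for lru, size in lrus.items()}
    for zone, lrus in file_lru.items()
}

for zone, per_lru in goals.items():
    print(zone, per_lru)
```

Under these assumptions Normal and HighMem get small but non-zero
targets while DMA gets zero, so a loop that touched the bigger zones
would keep reclaiming from them.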

As the DMA zone watermarks are fine, it must be the fragmentation
index that indicates a lack of memory.  Plugging the 1655 free pages
into the fragmentation index formula indicates a lack of free memory
when these 1655 pages are lumped together in fewer than 9 blocks.
Not unrealistic, I think: on my desktop machine, the DMA zone's 3975
free pages are lumped together in only 12 blocks.  But on my system,
the DMA zone is either never used, so there is always at least one
page block available that could satisfy a huge page allocation
(fragmentation index == -1000), or the system is really close to OOM,
at which point the DMA zone is highly fragmented.  And keep in mind
that if the priority level goes below DEF_PRIORITY, as it does close
to OOM, the unreclaimable DMA zone is ignored anyway.  But the DMA
zone here is just barely used:

> Node 0, zone      DMA
[...]
>     nr_slab_reclaimable 3
>     nr_slab_unreclaimable 1
[...]
>     nr_dirtied   315
>     nr_written   315

which could explain a fragmentation index that asks for more free
memory while the watermarks are fine.
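
The "fewer than 9 blocks" figure can be reproduced with a small model
of the fragmentation index.  This is a sketch from memory of
__fragmentation_index() in mm/vmstat.c and the threshold check in
compaction_suitable(), not a copy of either.  It assumes an order-9
request (a 2 MB huge page with 4 KB base pages), since that order
reproduces the figure, and the default extfrag threshold of 500.  The
Python function names are made up.

```python
EXTFRAG_THRESHOLD = 500  # assumed default of vm.extfrag_threshold


def fragmentation_index(order, free_pages, free_blocks_total,
                        free_blocks_suitable=0):
    """Index scaled by 1000: near 0 means lack of memory, near 1000 fragmentation."""
    requested = 1 << order
    if free_blocks_total == 0:
        return 0
    if free_blocks_suitable:
        # A large enough free block exists, so the request would succeed.
        return -1000
    # Integer division throughout, mirroring the kernel's div_u64().
    return 1000 - (1000 + free_pages * 1000 // requested) // free_blocks_total


def compaction_skipped(index, threshold=EXTFRAG_THRESHOLD):
    """True when the index blames lack of memory rather than fragmentation."""
    return 0 <= index <= threshold


ORDER = 9          # assumption: THP-sized request
FREE_PAGES = 1655  # DMA zone free pages from the zoneinfo snapshot

indexes = {blocks: fragmentation_index(ORDER, FREE_PAGES, blocks)
           for blocks in (8, 9, 12)}

for blocks, index in indexes.items():
    print(f"{blocks} free blocks: index={index} "
          f"skipped={compaction_skipped(index)}")
```

With 8 free blocks the index lands below the threshold, so compaction
would be skipped for lack of memory; with 9 or more it lands above and
the failure is attributed to fragmentation instead.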

Why this all loops: there is one more inconsistency where the
conditions for reclaim and the conditions for compaction contradict
each other: reclaim also does not consider the DMA zone balanced, but
it needs only 25% of the whole node to be balanced, while compaction
requires every single zone to be balanced individually.

So these strict per-zone checks for compaction at the end of
balance_pgdat() are likely to be the culprits that keep kswapd looping
forever on this machine, trying to balance DMA for compaction while
reclaim decides it has enough balanced memory in the node overall.

I think we can just remove them: whenever the compaction code is
reached, the reclaim code balanced 25% of the memory available for the
classzone to be suitable for compaction.
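
To make the contradiction concrete, here is a toy model of the two
checks, fed with the zone sizes from the zoneinfo snapshot.  It is a
sketch under assumptions: the watermark test is reduced to free pages
above the high watermark, which zones are compactable is set by hand
to match the hypothesis above (only DMA is not), and all class and
function names are made up.

```python
from dataclasses import dataclass


@dataclass
class Zone:
    name: str
    present: int
    free: int
    high_wmark: int
    compactable: bool  # assumption set by hand, not measured


# Sizes from the /proc/zoneinfo snapshot; compactable flags are hypothetical.
ZONES = [
    Zone("DMA", present=3951, free=1655, high_wmark=294, compactable=False),
    Zone("Normal", present=220470, free=186234, high_wmark=16429, compactable=True),
    Zone("HighMem", present=35560, free=8983, high_wmark=917, compactable=True),
]


def zone_balanced(zone):
    """Simplified: above the high watermark and suitable for compaction."""
    return zone.free > zone.high_wmark and zone.compactable


def node_balanced(zones):
    """Reclaim's view: at least 25% of the node's pages sit in balanced zones."""
    present = sum(z.present for z in zones)
    balanced = sum(z.present for z in zones if zone_balanced(z))
    return balanced >= (present >> 2)


def kswapd_restarts(zones, per_zone_checks):
    """Would kswapd jump back to the start of its balancing loop?"""
    if not node_balanced(zones):
        return True  # reclaim itself still has work to do
    if per_zone_checks:
        # Compaction's stricter view: every single zone must be suitable.
        return any(not z.compactable for z in zones)
    return False


print("node balanced:", node_balanced(ZONES))
print("restarts with per-zone checks:", kswapd_restarts(ZONES, True))
print("restarts without per-zone checks:", kswapd_restarts(ZONES, False))
```

In this model reclaim considers the node balanced, yet the per-zone
checks send kswapd around again because of DMA alone; dropping them
lets the loop end.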

Mel?  Rik?

---
From: Johannes Weiner <hannes@cmpxchg.org>
Subject: [patch] mm: vmscan: do not keep kswapd looping forever due
 to individual uncompactable zones

When a zone meets its high watermark and is compactable in case of
higher order allocations, it contributes to the percentage of the
node's memory that is considered balanced.

This requirement, that a node be only partially balanced, came about
when kswapd was desperately trying to balance tiny zones when all
bigger zones in the node had plenty of free memory.  Arguably, the
same should apply to compaction: if a significant part of the node is
balanced enough to run compaction, do not get hung up on that tiny
zone that might never get in shape.

When the compaction logic in kswapd is reached, we know that at least
25% of the node's memory is balanced properly for compaction (see
zone_balanced and pgdat_balanced).  Remove the individual zone checks
that restart the kswapd cycle.

Otherwise, we may observe more endless looping in kswapd where the
compaction code loops back to reclaim because of a single zone and
reclaim does nothing because the node is considered balanced overall.

Reported-by: Thorsten Leemhuis <fedora@leemhuis.info>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/vmscan.c | 16 ----------------
 1 file changed, 16 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3b0aef4..486100f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2806,22 +2806,6 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 			if (!populated_zone(zone))
 				continue;
 
-			if (zone->all_unreclaimable &&
-			    sc.priority != DEF_PRIORITY)
-				continue;
-
-			/* Would compaction fail due to lack of free memory? */
-			if (COMPACTION_BUILD &&
-			    compaction_suitable(zone, order) == COMPACT_SKIPPED)
-				goto loop_again;
-
-			/* Confirm the zone is balanced for order-0 */
-			if (!zone_watermark_ok(zone, 0,
-					high_wmark_pages(zone), 0, 0)) {
-				order = sc.order = 0;
-				goto loop_again;
-			}
-
 			/* Check if the memory needs to be defragmented. */
 			if (zone_watermark_ok(zone, order,
 				    low_wmark_pages(zone), *classzone_idx, 0))
-- 
1.7.11.7



* Re: kswapd craziness in 3.7
  2012-12-01  0:45                         ` Johannes Weiner
@ 2012-12-03  8:30                           ` Thorsten Leemhuis
  2012-12-03 13:08                             ` Fedora repo (was: Re: kswapd craziness in 3.7) Borislav Petkov
  2012-12-03 19:42                             ` kswapd craziness in 3.7 Johannes Weiner
  0 siblings, 2 replies; 65+ messages in thread
From: Thorsten Leemhuis @ 2012-12-03  8:30 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Mel Gorman, Andrew Morton, Linus Torvalds, Rik van Riel,
	George Spelvin, Johannes Hirte, Tomas Racek, Jan Kara,
	Dave Hansen, Josh Boyer, Valdis Kletnieks, Jiri Slaby,
	Zdenek Kabelac, Bruno Wolff III, linux-mm,
	Linux Kernel Mailing List, John Ellson

Hi!

Johannes Weiner wrote on 01.12.2012 01:45:
> On Fri, Nov 30, 2012 at 01:39:03PM +0100, Thorsten Leemhuis wrote:
>> /me wonders how to elegantly get out of his man-in-the-middle position
> You control the mighty koji :-)

Something even a journalist can do ;-)

> But seriously, this is very helpful, thank you!

Np; BTW, in case anybody here on LKML cares: I started maintaining a
side repo (PPA in Ubuntu speak) a few weeks ago that offers vanilla
kernel builds (mainline and stable) for Fedora 17 and 18; see
https://fedoraproject.org/wiki/Kernel_Vanilla_Repositories
for details. It's not as good and up2date yet as I would like it, but
one has to start somewhere.

Back to topic:

> John now also Cc'd directly.
> 
>> John was able to reproduce the problem quickly with a kernel that 
>> contained the patch from your mail. For details see
>
> [stripped: all the gory details of what likely went wrong and led
> to the problem John sees or saw]
>
> ---
> From: Johannes Weiner <hannes@cmpxchg.org>
> Subject: [patch] mm: vmscan: do not keep kswapd looping forever due
>  to individual uncompactable zones
> 
> When a zone meets its high watermark and is compactable in case of
> higher order allocations, it contributes to the percentage of the
> node's memory that is considered balanced.
> [...]

FYI: I built a kernel with that patch. I've been running it on my x86_64
machine at home over the weekend and everything was working fine (just
as without the patch). John gave it a quick try and in
https://bugzilla.redhat.com/show_bug.cgi?id=866988#c57 reported:

"""
I just installed
kernel-3.7.0-0.rc7.git1.2.van.main.knurd.kswap.4.fc18.i686 and ran my
usual load that triggers the problem.  OK so far.  I'll check again in
24hours, but looking good so far.
"""

BTW, I built that kernel without the patch you mentioned in
http://thread.gmane.org/gmane.linux.kernel.mm/90911/focus=91153
("buffer_heads_over_limit can put kswapd into reclaim, but it's ignored
[...]"). It looked to me like that patch was only meant for debugging. Let
me know if that was wrong. Ohh, and I didn't update to a fresher
mainline checkout yet to make sure the base for John's testing didn't
change.

CU
 Thorsten


* Fedora repo (was: Re: kswapd craziness in 3.7)
  2012-12-03  8:30                           ` Thorsten Leemhuis
@ 2012-12-03 13:08                             ` Borislav Petkov
  2012-12-03 19:42                             ` kswapd craziness in 3.7 Johannes Weiner
  1 sibling, 0 replies; 65+ messages in thread
From: Borislav Petkov @ 2012-12-03 13:08 UTC (permalink / raw)
  To: Thorsten Leemhuis
  Cc: Johannes Weiner, Mel Gorman, Andrew Morton, Linus Torvalds,
	Rik van Riel, George Spelvin, Johannes Hirte, Tomas Racek,
	Jan Kara, Dave Hansen, Josh Boyer, Valdis Kletnieks, Jiri Slaby,
	Zdenek Kabelac, Bruno Wolff III, linux-mm,
	Linux Kernel Mailing List, John Ellson

On Mon, Dec 03, 2012 at 09:30:12AM +0100, Thorsten Leemhuis wrote:
> Np; BTW, in case anybody here on LKML cares: I started maintaining a
> side repo (PPA in Ubuntu speak) a few weeks ago that offers vanilla
> kernel builds (mainline and stable) for Fedora 17 and 18; see
> https://fedoraproject.org/wiki/Kernel_Vanilla_Repositories
> for details. It's not as good and up2date yet as I would like it, but
> one has to start somewhere.

Once you have this ready, you should send a more official mail to
lkml and relevant lists, with "[ANNOUNCE]" in its subject and an
explanation of how to use the repo, so that more people know about it.

Thanks.

-- 
Regards/Gruss,
    Boris.


* Re: kswapd craziness in 3.7
  2012-11-27 20:48 kswapd craziness in 3.7 Johannes Weiner
                   ` (2 preceding siblings ...)
  2012-11-28  9:45 ` Mel Gorman
@ 2012-12-03 13:14 ` Jiri Slaby
  2012-12-04  8:55   ` Jiri Slaby
  3 siblings, 1 reply; 65+ messages in thread
From: Jiri Slaby @ 2012-12-03 13:14 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Mel Gorman, Rik van Riel, George Spelvin,
	Johannes Hirte, Tomas Racek, Jan Kara, Dave Hansen, Josh Boyer,
	Valdis.Kletnieks, Thorsten Leemhuis, Zdenek Kabelac,
	Bruno Wolff III, Linus Torvalds, linux-mm, linux-kernel

On 11/27/2012 09:48 PM, Johannes Weiner wrote:
> I hope I included everybody that participated in the various threads
> on kswapd getting stuck / exhibiting high CPU usage.  We were looking
> at at least three root causes as far as I can see, so it's not really
> clear who observed which problem.  Please correct me if the
> reported-by, tested-by, bisected-by tags are incomplete.

Hi, I reported the problem for the first time but I got lost in the
patches flying around very early.

Whatever is in the current -next works for me, ever since -next was
resurrected last week after the two-week gap...

thanks,
-- 
js
suse labs


* Re: kswapd craziness in 3.7
  2012-11-28  9:45 ` Mel Gorman
@ 2012-12-03 15:23   ` Zdenek Kabelac
  2012-12-03 19:18     ` Johannes Weiner
  0 siblings, 1 reply; 65+ messages in thread
From: Zdenek Kabelac @ 2012-12-03 15:23 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Johannes Weiner, Andrew Morton, Rik van Riel, George Spelvin,
	Johannes Hirte, Thorsten Leemhuis, Tomas Racek, Jan Kara,
	Dave Hansen, Josh Boyer, Valdis.Kletnieks, Jiri Slaby,
	Bruno Wolff III, Linus Torvalds, linux-mm, linux-kernel

Dne 28.11.2012 10:45, Mel Gorman napsal(a):
> (Adding Thorsten to cc)
>
> On Tue, Nov 27, 2012 at 03:48:34PM -0500, Johannes Weiner wrote:
>> Hi everyone,
>>
>> I hope I included everybody that participated in the various threads
>> on kswapd getting stuck / exhibiting high CPU usage.  We were looking
>> at at least three root causes as far as I can see, so it's not really
>> clear who observed which problem.  Please correct me if the
>> reported-by, tested-by, bisected-by tags are incomplete.
>>
>> One problem was, as it seems, overly aggressive reclaim due to scaling
>> up reclaim goals based on compaction failures.  This one was reverted
>> in 9671009 mm: revert "mm: vmscan: scale number of pages reclaimed by
>> reclaim/compaction based on failures".
>>
>
> This particular one would have been made worse by the accounting bug and
> if kswapd was staying awake longer than necessary. As scaling the amount
> of reclaim only for direct reclaim helped this problem a lot, I strongly
> suspect the accounting bug was a factor.
>
> However the benefit for this is marginal -- it primarily affects how
> many THP pages we can allocate under stress. There is already a graceful
> fallback path and a system under heavy reclaim pressure is not going to
> notice the performance benefit of THP.
>
>> Another one was an accounting problem where a freed higher order page
>> was underreported, and so kswapd had trouble restoring watermarks.
>> This one was fixed in ef6c5be fix incorrect NR_FREE_PAGES accounting
>> (appears like memory leak).
>>
>
> This almost certainly also requires the follow-on fix at
> https://lkml.org/lkml/2012/11/26/225 for reasons I explained in
> https://lkml.org/lkml/2012/11/27/190 .
>
>> The third one is a problem with small zones, like the DMA zone, where
>> the high watermark is lower than the low watermark plus compaction gap
>> (2 * allocation size).  The zonelist reclaim in kswapd would do
>> nothing because all high watermarks are met, but the compaction logic
>> would find its own requirements unmet and loop over the zones again.
>> Indefinitely, until some third party would free enough memory to help
>> meet the higher compaction watermark.  The problematic code has been
>> there since the 3.4 merge window for non-THP higher order allocations
>> but has been more prominent since the 3.7 merge window, where kswapd
>> is also woken up for the much more common THP allocations.
>>
>
> Yes.
>
>> The following patch should fix the third issue by making both reclaim
>> and compaction code in kswapd use the same predicate to determine
>> whether a zone is balanced or not.
>>
>> Hopefully, the sum of all three fixes should tame kswapd enough for
>> 3.7.
>>
>
> Not exactly sure of that. With just those patches it is possible for
> allocations for THP entering the slow path to keep kswapd continually awake
> doing busy work. This was an alternative to the revert that covered that
> https://lkml.org/lkml/2012/11/12/151 but it was not enough because kswapd
> would stay awake due to the bug you identified and fixed.
>
> I went with the __GFP_NO_KSWAPD patch in this cycle because 3.6 was/is
> very poor in how it handles THP after the removal of lumpy reclaim. 3.7
> was shaping up to be even worse with multiple root causes too close to the
> release date.  Taking kswapd out of the equation covered some of the
> problems (yes, by hiding them) so it could be revisited but Johannes may
> have finally squashed it.
>
> However, if we revert the revert then I strongly recommend that it be
> replaced with "Avoid waking kswapd for THP allocations when compaction is
> deferred or contended".
>


Ok, bad news - I've been hit by the kswapd0 loop again -
my kernel at git commit cc19528bd3084c3c2d870b31a3578da8c69952f3 again
showed kswapd0 on the CPU for a couple of minutes.

It seemed to go away instantly when I dropped caches
(echo 3 >/proc/sys/vm/drop_caches)
(After that I had over 1G of free memory)

Here are some stats from before the drop, while kswapd0 was running:

kswapd0         R  running task        0    30      2 0x00000000
  ffff880133207b08 0000000000000082 ffff880133207b18 0000000000000246
  ffff880135b92340 ffff880133207fd8 ffff880133207fd8 ffff880133207fd8
  ffff880103098000 ffff880135b92340 0000000000000000 ffff880133206000
Call Trace:
  [<ffffffff815566b2>] preempt_schedule+0x42/0x60
  [<ffffffff81558555>] _raw_spin_unlock+0x55/0x60
  [<ffffffff81193b3c>] grab_super_passive+0x3c/0x90
  [<ffffffff81193bd6>] prune_super+0x46/0x1b0
  [<ffffffff81141eda>] shrink_slab+0xba/0x510
  [<ffffffff81185c3a>] ? mem_cgroup_iter+0x17a/0x2e0
  [<ffffffff81185b8a>] ? mem_cgroup_iter+0xca/0x2e0
  [<ffffffff81145141>] balance_pgdat+0x621/0x7e0
  [<ffffffff81145474>] kswapd+0x174/0x640
  [<ffffffff8106fd40>] ? __init_waitqueue_head+0x60/0x60
  [<ffffffff81145300>] ? balance_pgdat+0x7e0/0x7e0
  [<ffffffff8106f52b>] kthread+0xdb/0xe0
  [<ffffffff8106f450>] ? kthread_create_on_node+0x140/0x140
  [<ffffffff815604dc>] ret_from_fork+0x7c/0xb0
  [<ffffffff8106f450>] ? kthread_create_on_node+0x140/0x140

runnable tasks:
             task   PID         tree-key  switches  prio     exec-runtime         sum-exec        sum-sleep
----------------------------------------------------------------------------------------------------------
          kswapd0    30   8087056.792356     30543   120   8087056.792356    158938.479290 137131605.711862 /
      kworker/0:3 29833   8087050.792356    526664   120   8087050.792356     24710.527691  24775203.529553 /
R           bash 24767     43813.836355       121   120     43813.836355        40.855087     10579.107486 /autogroup-392

----

Showing all locks held in the system:
1 lock held by bash/10668:
  #0:  (&tty->atomic_read_lock){+.+.+.}, at: [<ffffffff813b3dc0>] n_tty_read+0x610/0x990
1 lock held by bash/10756:
  #0:  (&tty->atomic_read_lock){+.+.+.}, at: [<ffffffff813b3dc0>] n_tty_read+0x610/0x990
1 lock held by bash/26989:
  #0:  (&tty->atomic_read_lock){+.+.+.}, at: [<ffffffff813b3dc0>] n_tty_read+0x610/0x990
1 lock held by less/10268:
  #0:  (&tty->atomic_read_lock){+.+.+.}, at: [<ffffffff813b3dc0>] n_tty_read+0x610/0x990
1 lock held by less/19112:
  #0:  (&tty->atomic_read_lock){+.+.+.}, at: [<ffffffff813b3dc0>] n_tty_read+0x610/0x990
1 lock held by bash/13774:
  #0:  (&tty->atomic_read_lock){+.+.+.}, at: [<ffffffff813b3dc0>] n_tty_read+0x610/0x990
1 lock held by less/32444:
  #0:  (&tty->atomic_read_lock){+.+.+.}, at: [<ffffffff813b3dc0>] n_tty_read+0x610/0x990
2 locks held by bash/24767:
  #0:  (sysrq_key_table_lock){......}, at: [<ffffffff813bb553>] __handle_sysrq+0x33/0x190
  #1:  (tasklist_lock){.+.+..}, at: [<ffffffff810ad973>] debug_show_all_locks+0x43/0x2a0

=============================================

SysRq : HELP : loglevel(0-9) reBoot Crash show-all-locks(D) 
terminate-all-tasks(E) memory-full-oom-kill(F) kill-all-tasks(I) 
thaw-filesystems(J) saK show-backtrace-all-active-cpus(L) show-memory-usage(M) 
nice-all-RT-tasks(N) powerOff show-registers(P) show-all-timers(Q) unRaw Sync 
show-task-states(T) Unmount force-fb(V) show-blocked-tasks(W) 
dump-ftrace-buffer(Z)
SysRq : Show Memory
Mem-Info:
DMA per-cpu:
CPU    0: hi:    0, btch:   1 usd:   0
CPU    1: hi:    0, btch:   1 usd:   0
DMA32 per-cpu:
CPU    0: hi:  186, btch:  31 usd: 147
CPU    1: hi:  186, btch:  31 usd: 157
Normal per-cpu:
CPU    0: hi:  186, btch:  31 usd: 154
CPU    1: hi:  186, btch:  31 usd: 182
active_anon:610014 inactive_anon:16551 isolated_anon:0
  active_file:83258 inactive_file:151927 isolated_file:0
  unevictable:16 dirty:12 writeback:0 unstable:0
  free:72021 slab_reclaimable:18685 slab_unreclaimable:13682
  mapped:23445 shmem:29913 pagetables:7689 bounce:0
  free_cma:0
DMA free:15892kB min:260kB low:324kB high:388kB active_anon:0kB 
inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB 
isolated(anon):0kB isolated(file):0kB present:15644kB mlocked:0kB dirty:0kB 
writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:8kB 
kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB 
writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
lowmem_reserve[]: 0 2982 3927 3927
DMA32 free:251976kB min:51124kB low:63904kB high:76684kB active_anon:1738128kB 
inactive_anon:58108kB active_file:316652kB inactive_file:591328kB 
unevictable:16kB isolated(anon):0kB isolated(file):0kB present:3054528kB 
mlocked:16kB dirty:40kB writeback:0kB mapped:58684kB shmem:108216kB 
slab_reclaimable:38888kB slab_unreclaimable:15988kB kernel_stack:1416kB 
pagetables:8684kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB 
pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 945 945
Normal free:20216kB min:16196kB low:20244kB high:24292kB active_anon:701928kB 
inactive_anon:8096kB active_file:16380kB inactive_file:16380kB 
unevictable:48kB isolated(anon):0kB isolated(file):0kB present:967680kB 
mlocked:48kB dirty:8kB writeback:0kB mapped:35096kB shmem:11436kB 
slab_reclaimable:35852kB slab_unreclaimable:38732kB kernel_stack:3200kB 
pagetables:22072kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB 
pages_scanned:42 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
DMA: 1*4kB 0*8kB 1*16kB 0*32kB 2*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15892kB
DMA32: 56*4kB 577*8kB 754*16kB 1192*32kB 713*64kB 484*128kB 223*256kB 57*512kB 1*1024kB 1*2048kB 0*4096kB = 251976kB
Normal: 526*4kB 350*8kB 181*16kB 152*32kB 66*64kB 18*128kB 2*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 20216kB
265099 total pagecache pages
0 pages in swap cache
Swap cache stats: add 0, delete 0, find 0/0
Free swap  = 0kB
Total swap = 0kB
1032176 pages RAM
42790 pages reserved
672981 pages shared
820401 pages non-shared
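
As a rough cross-check of the memory dump above, here is a minimal Python
sketch.  It is not kernel output; the zone figures and buddy lists are copied
by hand from the dump, the 4 kB page size and order-9 (2048 kB) huge page size
are assumptions for x86, and the real kernel watermark checks also apply
lowmem_reserve, which is ignored here.  The helper names are made up for
illustration.

```python
# Values copied by hand from the dump above, all in kB. Simplified on
# purpose: the kernel's real watermark checks also apply lowmem_reserve.
ZONES = {
    "DMA":    {"free": 15892,  "min": 260,   "low": 324,   "high": 388},
    "DMA32":  {"free": 251976, "min": 51124, "low": 63904, "high": 76684},
    "Normal": {"free": 20216,  "min": 16196, "low": 20244, "high": 24292},
}

# Buddy lists: free block counts per zone, ordered 4kB, 8kB, ... 4096kB.
BUDDY_COUNTS = {
    "DMA":    [1, 0, 1, 0, 2, 1, 1, 0, 1, 1, 3],
    "DMA32":  [56, 577, 754, 1192, 713, 484, 223, 57, 1, 1, 0],
    "Normal": [526, 350, 181, 152, 66, 18, 2, 1, 0, 0, 0],
}

PAGE_KB = 4    # assumption: 4 kB base pages
THP_ORDER = 9  # assumption: 2048 kB huge pages, as on x86


def buddy_total_kb(counts):
    """Sum a buddy list: a block at index `order` is PAGE_KB << order kB."""
    return sum(n * (PAGE_KB << order) for order, n in enumerate(counts))


def zones_below_high(zones):
    """Names of zones whose free memory is under the high watermark."""
    return [name for name, z in zones.items() if z["free"] < z["high"]]


def thp_sized_blocks(counts):
    """Number of free blocks large enough to hold one huge page."""
    return sum(counts[THP_ORDER:])


if __name__ == "__main__":
    for name, counts in BUDDY_COUNTS.items():
        print(name, buddy_total_kb(counts), "kB free,",
              thp_sized_blocks(counts), "THP-sized blocks")
    print("below high watermark:", zones_below_high(ZONES))
```

With these numbers, only the Normal zone is below its high watermark
(20216 kB free against 24292 kB), and it has no free block of 2048 kB or
larger.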

vmstat:

nr_free_pages 72360
nr_inactive_anon 16501
nr_active_anon 609811
nr_inactive_file 151932
nr_active_file 83212
nr_unevictable 16
nr_mlock 16
nr_anon_pages 503314
nr_mapped 23443
nr_file_pages 264982
nr_dirty 234
nr_writeback 0
nr_slab_reclaimable 18685
nr_slab_unreclaimable 13682
nr_page_table_pages 7690
nr_kernel_stack 577
nr_unstable 0
nr_bounce 0
nr_vmscan_write 0
nr_vmscan_immediate_reclaim 29
nr_writeback_temp 0
nr_isolated_anon 0
nr_isolated_file 0
nr_shmem 29838
nr_dirtied 2206202
nr_written 2066654
nr_anon_transparent_hugepages 182
nr_free_cma 0
nr_dirty_threshold 13870
nr_dirty_background_threshold 6935
pgpgin 3224666
pgpgout 9329522
pswpin 0
pswpout 0
pgalloc_dma 2
pgalloc_dma32 100605413
pgalloc_normal 25009399
pgalloc_movable 0
pgfree 126647271
pgactivate 1185101
pgdeactivate 214747
pgfault 106494704
pgmajfault 9834
pgrefill_dma 0
pgrefill_dma32 99747
pgrefill_normal 232841
pgrefill_movable 0
pgsteal_kswapd_dma 0
pgsteal_kswapd_dma32 208294
pgsteal_kswapd_normal 162100
pgsteal_kswapd_movable 0
pgsteal_direct_dma 0
pgsteal_direct_dma32 11942
pgsteal_direct_normal 91155
pgsteal_direct_movable 0
pgscan_kswapd_dma 0
pgscan_kswapd_dma32 211693
pgscan_kswapd_normal 182157
pgscan_kswapd_movable 0
pgscan_direct_dma 0
pgscan_direct_dma32 12129
pgscan_direct_normal 96028
pgscan_direct_movable 0
pgscan_direct_throttle 0
pginodesteal 77546
slabs_scanned 784384
kswapd_inodesteal 47090
kswapd_low_wmark_hit_quickly 57
kswapd_high_wmark_hit_quickly 275
kswapd_skip_congestion_wait 0
pageoutrun 1636173
allocstall 175
pgrotated 73
compact_blocks_moved 80209
compact_pages_moved 345293
compact_pagemigrate_failed 64875
compact_stall 736
compact_fail 314
compact_success 422
htlb_buddy_alloc_success 0
htlb_buddy_alloc_fail 0
unevictable_pgs_culled 2848
unevictable_pgs_scanned 0
unevictable_pgs_rescued 3330
unevictable_pgs_mlocked 3346
unevictable_pgs_munlocked 3330
unevictable_pgs_cleared 0
unevictable_pgs_stranded 0
thp_fault_alloc 53631
thp_fault_fallback 1682
thp_collapse_alloc 13390
thp_collapse_alloc_failed 643
thp_split 2387
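
A few ratios can be derived from the counters above.  The following is a
minimal Python sketch over a hand-copied subset of the vmstat output; the
helper names are hypothetical, and the reading of pageoutrun (as far as I
understand, it is incremented each time balance_pgdat starts or restarts a
pass) is an assumption rather than something stated in this thread.

```python
# Subset of the counters above, copied by hand.
VMSTAT = {
    "pgscan_kswapd_dma32": 211693,
    "pgscan_kswapd_normal": 182157,
    "pgsteal_kswapd_dma32": 208294,
    "pgsteal_kswapd_normal": 162100,
    "pageoutrun": 1636173,
    "compact_stall": 736,
    "compact_fail": 314,
    "compact_success": 422,
    "thp_fault_alloc": 53631,
    "thp_fault_fallback": 1682,
}


def total(stats, prefix):
    """Sum every counter whose name starts with `prefix`."""
    return sum(v for k, v in stats.items() if k.startswith(prefix))


def kswapd_efficiency(stats):
    """Fraction of pages scanned by kswapd that were actually reclaimed."""
    scanned = total(stats, "pgscan_kswapd_")
    return total(stats, "pgsteal_kswapd_") / scanned if scanned else 0.0


def thp_fallback_rate(stats):
    """Fraction of THP faults that fell back to small pages."""
    attempts = stats["thp_fault_alloc"] + stats["thp_fault_fallback"]
    return stats["thp_fault_fallback"] / attempts if attempts else 0.0


def runs_per_scanned_page(stats):
    """kswapd passes (pageoutrun) per page scanned by kswapd."""
    scanned = total(stats, "pgscan_kswapd_")
    return stats["pageoutrun"] / scanned if scanned else 0.0


if __name__ == "__main__":
    print("kswapd efficiency: %.3f" % kswapd_efficiency(VMSTAT))
    print("THP fallback rate: %.3f" % thp_fallback_rate(VMSTAT))
    print("runs per scanned page: %.2f" % runs_per_scanned_page(VMSTAT))
```

kswapd reclaimed about 94% of the pages it scanned, yet pageoutrun is roughly
four times the number of pages kswapd scanned, which would be consistent with
kswapd restarting its loop many times without finding work to do.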


Zdenek



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: kswapd craziness in 3.7
  2012-12-03 15:23   ` Zdenek Kabelac
@ 2012-12-03 19:18     ` Johannes Weiner
  2012-12-04  9:05       ` Zdenek Kabelac
  0 siblings, 1 reply; 65+ messages in thread
From: Johannes Weiner @ 2012-12-03 19:18 UTC (permalink / raw)
  To: Zdenek Kabelac
  Cc: Mel Gorman, Andrew Morton, Rik van Riel, George Spelvin,
	Johannes Hirte, Thorsten Leemhuis, Tomas Racek, Jan Kara,
	Dave Hansen, Josh Boyer, Valdis.Kletnieks, Jiri Slaby,
	Bruno Wolff III, Linus Torvalds, linux-mm, linux-kernel

Szia Zdenek,

On Mon, Dec 03, 2012 at 04:23:15PM +0100, Zdenek Kabelac wrote:
> Ok, bad news - I've been hit by  kswapd0 loop again -
> my kernel git commit cc19528bd3084c3c2d870b31a3578da8c69952f3 again
> shown kswapd0 for couple minutes on CPU.
> 
> It seemed to go instantly away when I've drop caches
> (echo 3 >/proc/sys/vm/drop_caches)
> (After that I've had over 1G free memory)

Any chance you could retry with this patch on top?

Thanks!

---
From: Johannes Weiner <hannes@cmpxchg.org>
Subject: [patch] mm: vmscan: do not keep kswapd looping forever due
 to individual uncompactable zones

When a zone meets its high watermark and is compactable in case of
higher order allocations, it contributes to the percentage of the
node's memory that is considered balanced.

This requirement, that a node be only partially balanced, came about
when kswapd was desperately trying to balance tiny zones when all
bigger zones in the node had plenty of free memory.  Arguably, the
same should apply to compaction: if a significant part of the node is
balanced enough to run compaction, do not get hung up on that tiny
zone that might never get in shape.

When the compaction logic in kswapd is reached, we know that at least
25% of the node's memory is balanced properly for compaction (see
zone_balanced and pgdat_balanced).  Remove the individual zone checks
that restart the kswapd cycle.

Otherwise, we may observe more endless looping in kswapd where the
compaction code loops back to reclaim because of a single zone and
reclaim does nothing because the node is considered balanced overall.

Reported-by: Thorsten Leemhuis <fedora@leemhuis.info>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/vmscan.c | 16 ----------------
 1 file changed, 16 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3b0aef4..486100f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2806,22 +2806,6 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 			if (!populated_zone(zone))
 				continue;
 
-			if (zone->all_unreclaimable &&
-			    sc.priority != DEF_PRIORITY)
-				continue;
-
-			/* Would compaction fail due to lack of free memory? */
-			if (COMPACTION_BUILD &&
-			    compaction_suitable(zone, order) == COMPACT_SKIPPED)
-				goto loop_again;
-
-			/* Confirm the zone is balanced for order-0 */
-			if (!zone_watermark_ok(zone, 0,
-					high_wmark_pages(zone), 0, 0)) {
-				order = sc.order = 0;
-				goto loop_again;
-			}
-
 			/* Check if the memory needs to be defragmented. */
 			if (zone_watermark_ok(zone, order,
 				    low_wmark_pages(zone), *classzone_idx, 0))
-- 
1.7.11.7
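
The 25% rule mentioned in the changelog can be illustrated with a small
Python model.  This is a simplified sketch, not the kernel's pgdat_balanced():
it assumes the node counts as balanced once zones holding at least a quarter
of the present pages (up to the classzone index) are balanced, and it uses the
per-zone "present" sizes from the Show Memory dump earlier in this thread,
converted to 4 kB pages (an assumption about the page size).

```python
# Present pages per zone (DMA, DMA32, Normal), from the dump's
# present:15644kB / 3054528kB / 967680kB, assuming 4 kB pages.
PRESENT_PAGES = [15644 // 4, 3054528 // 4, 967680 // 4]


def pgdat_balanced(present_pages, zone_balanced, classzone_idx):
    """Simplified model of the 25% rule.

    present_pages: pages present in each zone, lowest zone first
    zone_balanced: one bool per zone, True if that zone is balanced
    classzone_idx: highest zone index usable by the allocation
    """
    considered = present_pages[:classzone_idx + 1]
    flags = zone_balanced[:classzone_idx + 1]
    balanced_pages = sum(p for p, ok in zip(considered, flags) if ok)
    # A quarter of the considered pages, rounded down.
    return balanced_pages >= sum(considered) // 4


if __name__ == "__main__":
    # A tiny DMA zone alone cannot balance the node ...
    print(pgdat_balanced(PRESENT_PAGES, [True, False, False], 2))
    # ... but an unbalanced zone does not block it either, as long as
    # the big DMA32 zone is fine.
    print(pgdat_balanced(PRESENT_PAGES, [False, True, False], 2))
```

In this model a single unbalanced zone cannot keep the node unbalanced as long
as a sufficiently large zone is in shape, which is the property the removed
per-zone checks were working against.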


* Re: kswapd craziness in 3.7
  2012-12-03  8:30                           ` Thorsten Leemhuis
  2012-12-03 13:08                             ` Fedora repo (was: Re: kswapd craziness in 3.7) Borislav Petkov
@ 2012-12-03 19:42                             ` Johannes Weiner
  2012-12-04 21:42                               ` Johannes Weiner
  2012-12-06  8:09                               ` Thorsten Leemhuis
  1 sibling, 2 replies; 65+ messages in thread
From: Johannes Weiner @ 2012-12-03 19:42 UTC (permalink / raw)
  To: Thorsten Leemhuis
  Cc: Mel Gorman, Andrew Morton, Linus Torvalds, Rik van Riel,
	George Spelvin, Johannes Hirte, Tomas Racek, Jan Kara,
	Dave Hansen, Josh Boyer, Valdis Kletnieks, Jiri Slaby,
	Zdenek Kabelac, Bruno Wolff III, linux-mm,
	Linux Kernel Mailing List, John Ellson

On Mon, Dec 03, 2012 at 09:30:12AM +0100, Thorsten Leemhuis wrote:
> >> John was able to reproduce the problem quickly with a kernel that 
> >> contained the patch from your mail. For details see
> >
> > [stripped: all the gory details of what likely went wrong and led
> > to the problem John sees or saw]
> >
> > ---
> > From: Johannes Weiner <hannes@cmpxchg.org>
> > Subject: [patch] mm: vmscan: do not keep kswapd looping forever due
> >  to individual uncompactable zones
> > 
> > When a zone meets its high watermark and is compactable in case of
> > higher order allocations, it contributes to the percentage of the
> > node's memory that is considered balanced.
> > [...]
> 
> FYI: I built a kernel with that patch. I've been running on my x86_64
> machine at home over the weekend and everything was working fine (just
> as without the patch). John gave it a quick try and in
> https://bugzilla.redhat.com/show_bug.cgi?id=866988#c57 reported:
> 
> """
> I just installed
> kernel-3.7.0-0.rc7.git1.2.van.main.knurd.kswap.4.fc18.i686 and ran my
> usual load that triggers the problem.  OK so far.  I'll check again in
> 24hours, but looking good so far.
> """

w00t!

> BTW, I built that kernel without the patch you mentioned in
> http://thread.gmane.org/gmane.linux.kernel.mm/90911/focus=91153
> ("buffer_heads_over_limit can put kswapd into reclaim, but it's ignored
> [...]) It looked to me like that patch was only meant for debugging. Let
> me know if that was wrong. Ohh, and I didn't update to a fresher
> mainline checkout yet to make sure the base for John's testing didn't
> change.

Ah, yes, the ApplyPatch is commented out.

I think we want that upstream as well, but it's not critical.  It'll
reduce kswapd CPU usage marginally on highmem systems in certain
situations, but I don't think any of the 100% CPU usage problems are
fixed by it.

Not rebasing sounds reasonable to me to verify the patch.  It might be
worth testing that the final version that will be 3.7 still works for
John, however, once that is done.  Just to be sure.


* Re: kswapd craziness in 3.7
  2012-12-03 13:14 ` Jiri Slaby
@ 2012-12-04  8:55   ` Jiri Slaby
  0 siblings, 0 replies; 65+ messages in thread
From: Jiri Slaby @ 2012-12-04  8:55 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Mel Gorman, Rik van Riel, George Spelvin,
	Johannes Hirte, Tomas Racek, Jan Kara, Dave Hansen, Josh Boyer,
	Valdis.Kletnieks, Thorsten Leemhuis, Zdenek Kabelac,
	Bruno Wolff III, Linus Torvalds, linux-mm, linux-kernel

On 12/03/2012 02:14 PM, Jiri Slaby wrote:
> On 11/27/2012 09:48 PM, Johannes Weiner wrote:
>> I hope I included everybody that participated in the various threads
>> on kswapd getting stuck / exhibiting high CPU usage.  We were looking
>> at at least three root causes as far as I can see, so it's not really
>> clear who observed which problem.  Please correct me if the
>> reported-by, tested-by, bisected-by tags are incomplete.
> 
> Hi, I reported the problem for the first time but I got lost in the
> patches flying around very early.
> 
> Whatever is in the current -next, works for me since -next was
> resurrected after the 2 weeks gap last week...

Bah, I always need to write an email to reproduce that. It's back:
3.7.0-rc7-next-20121130

[<ffffffff810b132a>] __cond_resched+0x2a/0x40
[<ffffffff81133770>] shrink_slab+0x1c0/0x2d0
[<ffffffff8113668d>] kswapd+0x65d/0xb50
[<ffffffff810a37b0>] kthread+0xc0/0xd0
[<ffffffff816ba4dc>] ret_from_fork+0x7c/0xb0
[<ffffffffffffffff>] 0xffffffffffffffff

Going to apply this:
https://lkml.org/lkml/2012/12/3/407
and wait another 5 days to see the results...

thanks,
-- 
js
suse labs


* Re: kswapd craziness in 3.7
  2012-12-03 19:18     ` Johannes Weiner
@ 2012-12-04  9:05       ` Zdenek Kabelac
  2012-12-04  9:15         ` Jiri Slaby
                           ` (2 more replies)
  0 siblings, 3 replies; 65+ messages in thread
From: Zdenek Kabelac @ 2012-12-04  9:05 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Mel Gorman, Andrew Morton, Rik van Riel, George Spelvin,
	Johannes Hirte, Thorsten Leemhuis, Tomas Racek, Jan Kara,
	Dave Hansen, Josh Boyer, Valdis.Kletnieks, Jiri Slaby,
	Bruno Wolff III, Linus Torvalds, linux-mm, linux-kernel

On 3.12.2012 20:18, Johannes Weiner wrote:
> Hi Zdenek,
>
> On Mon, Dec 03, 2012 at 04:23:15PM +0100, Zdenek Kabelac wrote:
>> Ok, bad news - I've been hit by  kswapd0 loop again -
>> my kernel git commit cc19528bd3084c3c2d870b31a3578da8c69952f3 again
>> shown kswapd0 for couple minutes on CPU.
>>
>> It seemed to go instantly away when I've drop caches
>> (echo 3 >/proc/sys/vm/drop_caches)
>> (After that I've had over 1G free memory)
>
> Any chance you could retry with this patch on top?
>
> ---
> From: Johannes Weiner <hannes@cmpxchg.org>
> Subject: [patch] mm: vmscan: do not keep kswapd looping forever due
>   to individual uncompactable zones
>
> ---
>   mm/vmscan.c | 16 ----------------
>   1 file changed, 16 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c


OK, I'm now running b69f0859dc8e633c5d8c06845811588fe17e68b3 (-rc8)
with your patch.  I'll be able to give some feedback after a couple of
days, provided I can keep my machine running without a reboot; I
previously had occasional problems with ACPI, now resolved by a patch
that is not yet in -rc8
(https://bugzilla.kernel.org/show_bug.cgi?id=51071).
I'm also using this extra patch: https://patchwork.kernel.org/patch/1792531/

What seems to be the triggering condition on my machine is running the
laptop for some days, with Thunderbird reaching 0.8G (I guess it must
keep all my news messages in memory to consume that much) and Firefox
at 1.3GB of consumed memory (presumably leaking massively in combination
with Flash).

Zdenek



* Re: kswapd craziness in 3.7
  2012-12-04  9:05       ` Zdenek Kabelac
@ 2012-12-04  9:15         ` Jiri Slaby
  2012-12-04 16:11           ` Johannes Weiner
  2012-12-04 16:15         ` Johannes Weiner
  2012-12-06 13:51         ` Zdenek Kabelac
  2 siblings, 1 reply; 65+ messages in thread
From: Jiri Slaby @ 2012-12-04  9:15 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Zdenek Kabelac, Mel Gorman, Andrew Morton, Rik van Riel,
	George Spelvin, Johannes Hirte, Thorsten Leemhuis, Tomas Racek,
	Jan Kara, Dave Hansen, Josh Boyer, Valdis.Kletnieks,
	Bruno Wolff III, Linus Torvalds, linux-mm, linux-kernel

On 12/04/2012 10:05 AM, Zdenek Kabelac wrote:
> On 3.12.2012 20:18, Johannes Weiner wrote:
>> Hi Zdenek,
>>
>> On Mon, Dec 03, 2012 at 04:23:15PM +0100, Zdenek Kabelac wrote:
>>> Ok, bad news - I've been hit by  kswapd0 loop again -
>>> my kernel git commit cc19528bd3084c3c2d870b31a3578da8c69952f3 again
>>> shown kswapd0 for couple minutes on CPU.
>>>
>>> It seemed to go instantly away when I've drop caches
>>> (echo 3 >/proc/sys/vm/drop_caches)
>>> (After that I've had over 1G free memory)
>>
>> Any chance you could retry with this patch on top?

It does not apply to -next :/. Should I try anything else?

>> From: Johannes Weiner <hannes@cmpxchg.org>
>> Subject: [patch] mm: vmscan: do not keep kswapd looping forever due
>>   to individual uncompactable zones
...
> What seems to be triggering condition on my machine - running laptop for
> some days - and having   Thunderbird reaching 0.8G (I guess they must
> keep all my news messages in memory to consume that size) and Firefox
> 1.3GB of consumed
> memory (assuming massive leaking with combination of flash)

Similar here, 5 days of uptime (suspend/resumes in between). FF 900M, TB
250M, java 1.1G, kvm 550M, X 400M, cache 1.5G out of 6G total mem. And boom.

thanks,
-- 
js
suse labs


* Re: kswapd craziness in 3.7
  2012-12-04  9:15         ` Jiri Slaby
@ 2012-12-04 16:11           ` Johannes Weiner
  2012-12-04 16:22             ` Jiri Slaby
  2012-12-08 10:35             ` Jiri Slaby
  0 siblings, 2 replies; 65+ messages in thread
From: Johannes Weiner @ 2012-12-04 16:11 UTC (permalink / raw)
  To: Jiri Slaby
  Cc: Zdenek Kabelac, Mel Gorman, Andrew Morton, Rik van Riel,
	George Spelvin, Johannes Hirte, Thorsten Leemhuis, Tomas Racek,
	Jan Kara, Dave Hansen, Josh Boyer, Valdis.Kletnieks,
	Bruno Wolff III, Linus Torvalds, linux-mm, linux-kernel

On Tue, Dec 04, 2012 at 10:15:09AM +0100, Jiri Slaby wrote:
> On 12/04/2012 10:05 AM, Zdenek Kabelac wrote:
> > On 3.12.2012 20:18, Johannes Weiner wrote:
> >> Hi Zdenek,
> >>
> >> On Mon, Dec 03, 2012 at 04:23:15PM +0100, Zdenek Kabelac wrote:
> >>> Ok, bad news - I've been hit by  kswapd0 loop again -
> >>> my kernel git commit cc19528bd3084c3c2d870b31a3578da8c69952f3 again
> >>> shown kswapd0 for couple minutes on CPU.
> >>>
> >>> It seemed to go instantly away when I've drop caches
> >>> (echo 3 >/proc/sys/vm/drop_caches)
> >>> (After that I've had over 1G free memory)
> >>
> >> Any chance you could retry with this patch on top?
> 
> It does not apply to -next :/. Should I try anything else?

COMPACTION_BUILD changed to IS_ENABLED(CONFIG_COMPACTION) in -next;
below is a version of the patch for -next.  I hope you don't run into
other problems that come out of -next craziness, because Linus is kinda
waiting for this to be resolved to release 3.7.  If you've always tested
against -next so far and it worked otherwise, don't change the
environment now, please.  If you just started, it would make more sense
to test based on 3.7-rc8.

Thanks!

---
From: Johannes Weiner <hannes@cmpxchg.org>
Subject: [patch] mm: vmscan: do not keep kswapd looping forever due
 to individual uncompactable zones

When a zone meets its high watermark and is compactable in case of
higher order allocations, it contributes to the percentage of the
node's memory that is considered balanced.

This requirement, that a node be only partially balanced, came about
when kswapd was desperately trying to balance tiny zones when all
bigger zones in the node had plenty of free memory.  Arguably, the
same should apply to compaction: if a significant part of the node is
balanced enough to run compaction, do not get hung up on that tiny
zone that might never get in shape.

When the compaction logic in kswapd is reached, we know that at least
25% of the node's memory is balanced properly for compaction (see
zone_balanced and pgdat_balanced).  Remove the individual zone checks
that restart the kswapd cycle.

Otherwise, we may observe more endless looping in kswapd where the
compaction code loops back to reclaim because of a single zone and
reclaim does nothing because the node is considered balanced overall.

Reported-by: Thorsten Leemhuis <fedora@leemhuis.info>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/vmscan.c | 16 ----------------
 1 file changed, 16 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3b0aef4..486100f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2806,22 +2806,6 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 			if (!populated_zone(zone))
 				continue;
 
-			if (zone->all_unreclaimable &&
-			    sc.priority != DEF_PRIORITY)
-				continue;
-
-			/* Would compaction fail due to lack of free memory? */
-			if (IS_ENABLED(CONFIG_COMPACTION) &&
-			    compaction_suitable(zone, order) == COMPACT_SKIPPED)
-				goto loop_again;
-
-			/* Confirm the zone is balanced for order-0 */
-			if (!zone_watermark_ok(zone, 0,
-					high_wmark_pages(zone), 0, 0)) {
-				order = sc.order = 0;
-				goto loop_again;
-			}
-
 			/* Check if the memory needs to be defragmented. */
 			if (zone_watermark_ok(zone, order,
 				    low_wmark_pages(zone), *classzone_idx, 0))
-- 
1.7.11.7
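
To illustrate why the removed compaction_suitable() check could disagree with
the reclaim side, here is a simplified Python model.  It assumes that
compaction wants free pages above the low watermark plus a gap of
2 << order pages (the "2 * allocation size" gap described at the start of
this thread), and that a watermark is met only when free pages strictly
exceed it.  The watermarks are the Normal zone values from Zdenek's dump
converted to 4 kB pages; the free page count is hypothetical, and
lowmem_reserve is ignored.

```python
def watermark_ok(free_pages, mark_pages):
    """Simplified order-0 check: free pages must exceed the mark."""
    return free_pages > mark_pages


def compaction_gap(order):
    """Extra free pages compaction wants: twice the allocation size."""
    return 2 << order


def compaction_skipped(free_pages, low_pages, order):
    """True if compaction would give up for lack of free memory."""
    return not watermark_ok(free_pages, low_pages + compaction_gap(order))


# Normal zone watermarks from the dump (low:20244kB, high:24292kB),
# assuming 4 kB pages.
LOW_PAGES = 20244 // 4
HIGH_PAGES = 24292 // 4

if __name__ == "__main__":
    order = 9          # assumption: THP-sized allocation
    free_pages = 6080  # hypothetical: just above the high watermark
    print("reclaim satisfied:", watermark_ok(free_pages, HIGH_PAGES))
    print("compaction skipped:",
          compaction_skipped(free_pages, LOW_PAGES, order))
```

For order 9 the compaction target (5061 + 1024 pages) lies above the high
watermark (6073 pages), so in this model reclaim can consider the zone done
while compaction still refuses to run.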




* Re: kswapd craziness in 3.7
  2012-12-04  9:05       ` Zdenek Kabelac
  2012-12-04  9:15         ` Jiri Slaby
@ 2012-12-04 16:15         ` Johannes Weiner
  2012-12-06 13:51         ` Zdenek Kabelac
  2 siblings, 0 replies; 65+ messages in thread
From: Johannes Weiner @ 2012-12-04 16:15 UTC (permalink / raw)
  To: Zdenek Kabelac
  Cc: Mel Gorman, Andrew Morton, Rik van Riel, George Spelvin,
	Johannes Hirte, Thorsten Leemhuis, Tomas Racek, Jan Kara,
	Dave Hansen, Josh Boyer, Valdis.Kletnieks, Jiri Slaby,
	Bruno Wolff III, Linus Torvalds, linux-mm, linux-kernel

On Tue, Dec 04, 2012 at 10:05:29AM +0100, Zdenek Kabelac wrote:
> On 3.12.2012 20:18, Johannes Weiner wrote:
> >Hi Zdenek,
> >
> >On Mon, Dec 03, 2012 at 04:23:15PM +0100, Zdenek Kabelac wrote:
> >>Ok, bad news - I've been hit by  kswapd0 loop again -
> >>my kernel git commit cc19528bd3084c3c2d870b31a3578da8c69952f3 again
> >>shown kswapd0 for couple minutes on CPU.
> >>
> >>It seemed to go instantly away when I've drop caches
> >>(echo 3 >/proc/sys/vm/drop_caches)
> >>(After that I've had over 1G free memory)
> >
> >Any chance you could retry with this patch on top?
> >
> >---
> >From: Johannes Weiner <hannes@cmpxchg.org>
> >Subject: [patch] mm: vmscan: do not keep kswapd looping forever due
> >  to individual uncompactable zones
> >
> >---
> >  mm/vmscan.c | 16 ----------------
> >  1 file changed, 16 deletions(-)
> >
> >diff --git a/mm/vmscan.c b/mm/vmscan.c
> 
> 
> Ok, I'm running now b69f0859dc8e633c5d8c06845811588fe17e68b3 (-rc8)
> with your patch.  I'll be able to give some feedback after couple
> days (if I keep my machine running without reboot - since before
> I had occasional problems with ACPI now resolved.
> (https://bugzilla.kernel.org/show_bug.cgi?id=51071)
> (patch not yet in -rc8)
> I'm also using this extra patch: https://patchwork.kernel.org/patch/1792531/

Okay, fingers crossed!  Thanks for persisting.

> What seems to be triggering condition on my machine - running laptop
> for some days - and having   Thunderbird reaching 0.8G (I guess they
> must keep all my news messages in memory to consume that size) and
> Firefox 1.3GB of consumed
> memory (assuming massive leaking with combination of flash)

Were you able to speed this process up in the past?  I.e. by doing a
search over all mail?  Watching 8 nyan cat videos in parallel?

If not, it's probably better not to change anything now...

Thanks!


* Re: kswapd craziness in 3.7
  2012-12-04 16:11           ` Johannes Weiner
@ 2012-12-04 16:22             ` Jiri Slaby
  2012-12-04 19:50               ` Johannes Weiner
  2012-12-08 10:35             ` Jiri Slaby
  1 sibling, 1 reply; 65+ messages in thread
From: Jiri Slaby @ 2012-12-04 16:22 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Zdenek Kabelac, Mel Gorman, Andrew Morton, Rik van Riel,
	George Spelvin, Johannes Hirte, Thorsten Leemhuis, Tomas Racek,
	Jan Kara, Dave Hansen, Josh Boyer, Valdis.Kletnieks,
	Bruno Wolff III, Linus Torvalds, linux-mm, linux-kernel

On 12/04/2012 05:11 PM, Johannes Weiner wrote:
>>>> Any chance you could retry with this patch on top?
>>
>> It does not apply to -next :/. Should I try anything else?
> 
> The COMPACTION_BUILD changed to IS_ENABLED(CONFIG_COMPACTION), below
> is a -next patch.  I hope you don't run into other problems that come
> out of -next craziness, because Linus is kinda waiting for this to be
> resolved to release 3.7.  If you've always tested against -next so far
> and it worked otherwise, don't change the environment now, please.  If
> you just started, it would make more sense to test based on 3.7-rc8.

I reported the issue as soon as it first appeared in -next, on Oct 12.
Since then I have been hitting it constantly (well, there was more than
one issue, I suppose, and not all of them have been fixed yet).
I run only -next...

Going to apply the patch now.

-- 
js
suse labs


* Re: kswapd craziness in 3.7
  2012-12-04 16:22             ` Jiri Slaby
@ 2012-12-04 19:50               ` Johannes Weiner
  0 siblings, 0 replies; 65+ messages in thread
From: Johannes Weiner @ 2012-12-04 19:50 UTC (permalink / raw)
  To: Jiri Slaby
  Cc: Zdenek Kabelac, Mel Gorman, Andrew Morton, Rik van Riel,
	George Spelvin, Johannes Hirte, Thorsten Leemhuis, Tomas Racek,
	Jan Kara, Dave Hansen, Josh Boyer, Valdis.Kletnieks,
	Bruno Wolff III, Linus Torvalds, linux-mm, linux-kernel

On Tue, Dec 04, 2012 at 05:22:38PM +0100, Jiri Slaby wrote:
> On 12/04/2012 05:11 PM, Johannes Weiner wrote:
> >>>> Any chance you could retry with this patch on top?
> >>
> >> It does not apply to -next :/. Should I try anything else?
> > 
> > The COMPACTION_BUILD changed to IS_ENABLED(CONFIG_COMPACTION), below
> > is a -next patch.  I hope you don't run into other problems that come
> > out of -next craziness, because Linus is kinda waiting for this to be
> > resolved to release 3.7.  If you've always tested against -next so far
> > and it worked otherwise, don't change the environment now, please.  If
> > you just started, it would make more sense to test based on 3.7-rc8.
> 
> I reported the issue as soon as it appeared in -next for the first time
> on Oct 12. Since then I'm constantly hitting the issue (well, there were
> more than one I suppose, but not all of them were fixed by now) until
> now. I run only -next...

Okay.  Yes, it was a couple of problems, but not everybody hit the
same subset.

> Going to apply the patch now.

Thanks!


* Re: kswapd craziness in 3.7
  2012-12-03 19:42                             ` kswapd craziness in 3.7 Johannes Weiner
@ 2012-12-04 21:42                               ` Johannes Weiner
  2012-12-05  3:01                                 ` Bruno Wolff III
  2012-12-06  8:09                               ` Thorsten Leemhuis
  1 sibling, 1 reply; 65+ messages in thread
From: Johannes Weiner @ 2012-12-04 21:42 UTC (permalink / raw)
  To: Thorsten Leemhuis
  Cc: Mel Gorman, Andrew Morton, Linus Torvalds, Rik van Riel,
	George Spelvin, Johannes Hirte, Tomas Racek, Jan Kara,
	Dave Hansen, Josh Boyer, Valdis Kletnieks, Jiri Slaby,
	Zdenek Kabelac, Bruno Wolff III, linux-mm,
	Linux Kernel Mailing List, John Ellson

On Mon, Dec 03, 2012 at 02:42:08PM -0500, Johannes Weiner wrote:
> On Mon, Dec 03, 2012 at 09:30:12AM +0100, Thorsten Leemhuis wrote:
> > >> John was able to reproduce the problem quickly with a kernel that 
> > >> contained the patch from your mail. For details see
> > >
> > > [stripped: all the gory details of what likely went wrong and led
> > > to the problem John sees or saw]
> > >
> > > ---
> > > From: Johannes Weiner <hannes@cmpxchg.org>
> > > Subject: [patch] mm: vmscan: do not keep kswapd looping forever due
> > >  to individual uncompactable zones
> > > 
> > > When a zone meets its high watermark and is compactable in case of
> > > higher order allocations, it contributes to the percentage of the
> > > node's memory that is considered balanced.
> > > [...]
> > 
> > FYI: I built a kernel with that patch. I've been running on my x86_64
> > machine at home over the weekend and everything was working fine (just
> > as without the patch). John gave it a quick try and in
> > https://bugzilla.redhat.com/show_bug.cgi?id=866988#c57 reported:
> > 
> > """
> > I just installed
> > kernel-3.7.0-0.rc7.git1.2.van.main.knurd.kswap.4.fc18.i686 and ran my
> > usual load that triggers the problem.  OK so far.  I'll check again in
> > 24hours, but looking good so far.
> > """
> 
> w00t!

Update from John in the BZ
(https://bugzilla.redhat.com/show_bug.cgi?id=866988#c62):

"Good news.

I've now been running both
  kernel-3.7.0-0.rc7.git1.2.van.main.knurd.kswap.4.fc18.i686
and
  kernel-3.7.0-0.rc7.git1.2.van.main.knurd.kswap.4.fc18.x86_64
for over 24hours with no evidence of problems with kswapd"

Now waiting for results from Jiri, Zdenek and Bruno...


* Re: kswapd craziness in 3.7
  2012-12-04 21:42                               ` Johannes Weiner
@ 2012-12-05  3:01                                 ` Bruno Wolff III
  2012-12-06 17:37                                   ` Bruno Wolff III
  0 siblings, 1 reply; 65+ messages in thread
From: Bruno Wolff III @ 2012-12-05  3:01 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Thorsten Leemhuis, Mel Gorman, Andrew Morton, Linus Torvalds,
	Rik van Riel, George Spelvin, Johannes Hirte, Tomas Racek,
	Jan Kara, Dave Hansen, Josh Boyer, Valdis Kletnieks, Jiri Slaby,
	Zdenek Kabelac, linux-mm, Linux Kernel Mailing List, John Ellson

On Tue, Dec 04, 2012 at 16:42:10 -0500,
   Johannes Weiner <hannes@cmpxchg.org> wrote:
>  kernel-3.7.0-0.rc7.git1.2.van.main.knurd.kswap.4.fc18.i686
>and
>  kernel-3.7.0-0.rc7.git1.2.van.main.knurd.kswap.4.fc18.x86_64
>for over 24hours with no evidence of problems with kswapd"
>
>Now waiting for results from Jiri, Zdenek and Bruno...

I have been running 3.7.0-0.rc7.git1.2.van.main.knurd.kswap.4.fc18.i686.PAE 
a bit over 23 hours and kswapd has accumulated one minute 8 seconds of
CPU time. I did several yum operations during that time and didn't see 
kswapd spike to 90+% CPU usage as I had seen in the past. With some kernels 
I wasn't reliably triggering the kswapd issue, so it may not be long enough 
to know for sure that the problem is fixed.

I also should note that when I tried 3.7.0-0.rc7.git3.2.fc19.i686.PAE I 
did see problems with kswapd hitting 90+% usage of a CPU.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: kswapd craziness in 3.7
  2012-12-03 19:42                             ` kswapd craziness in 3.7 Johannes Weiner
  2012-12-04 21:42                               ` Johannes Weiner
@ 2012-12-06  8:09                               ` Thorsten Leemhuis
  1 sibling, 0 replies; 65+ messages in thread
From: Thorsten Leemhuis @ 2012-12-06  8:09 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Mel Gorman, Andrew Morton, Linus Torvalds, Rik van Riel,
	George Spelvin, Johannes Hirte, Tomas Racek, Jan Kara,
	Dave Hansen, Josh Boyer, Valdis Kletnieks, Jiri Slaby,
	Zdenek Kabelac, Bruno Wolff III, linux-mm,
	Linux Kernel Mailing List, John Ellson

Hi!

Just a quick update

Johannes Weiner wrote on 03.12.2012 20:42:
> On Mon, Dec 03, 2012 at 09:30:12AM +0100, Thorsten Leemhuis wrote:
>
>> BTW, I built that kernel without the patch you mentioned in
>> http://thread.gmane.org/gmane.linux.kernel.mm/90911/focus=91153
>> ("buffer_heads_over_limit can put kswapd into reclaim, but it's ignored
>> [...]) It looked to me like that patch was only meant for debugging. Let
>> me know if that was wrong. Ohh, and I didn't update to a fresher
>> mainline checkout yet to make sure the base for John's testing didn't
>> change.
> 
> Ah, yes, the ApplyPatch is commented out.
> 
> I think we want that upstream as well, but it's not critical.
> [...]

Sorry, it had no "Signed-off-by", so I assumed it was just for debugging.

> Not rebasing sounds reasonable to me to verify the patch.  It might be
> worth testing that the final version that will be 3.7 still works for
> John, however, once that is done.  Just to be sure.

Just to be sure, yesterday I built an rc8 kernel with the patch
referenced above and the one that is not yet merged (these two, to be
precise: http://thread.gmane.org/gmane.linux.kernel.mm/90911/focus=91153
http://thread.gmane.org/gmane.linux.kernel.mm/90911/focus=91300
; all the other patches my kswap test kernels contained earlier were,
afaics, merged a few days ago) and mentioned it in the Fedora bug report.
John gave them a try and in
https://bugzilla.redhat.com/show_bug.cgi?id=866988#c65 reported "No
problems so far.  I'll check back again in ~24hours."

CU, Thorsten

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: kswapd craziness in 3.7
  2012-12-04  9:05       ` Zdenek Kabelac
  2012-12-04  9:15         ` Jiri Slaby
  2012-12-04 16:15         ` Johannes Weiner
@ 2012-12-06 13:51         ` Zdenek Kabelac
  2 siblings, 0 replies; 65+ messages in thread
From: Zdenek Kabelac @ 2012-12-06 13:51 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Mel Gorman, Andrew Morton, Rik van Riel, George Spelvin,
	Johannes Hirte, Thorsten Leemhuis, Tomas Racek, Jan Kara,
	Dave Hansen, Josh Boyer, Valdis.Kletnieks, Jiri Slaby,
	Bruno Wolff III, Linus Torvalds, linux-mm, linux-kernel

On 4.12.2012 10:05, Zdenek Kabelac wrote:
> On 3.12.2012 20:18, Johannes Weiner wrote:
>> Hi Zdenek,
>>
>> On Mon, Dec 03, 2012 at 04:23:15PM +0100, Zdenek Kabelac wrote:
>>> Ok, bad news - I've been hit by the kswapd0 loop again -
>>> my kernel at git commit cc19528bd3084c3c2d870b31a3578da8c69952f3 again
>>> showed kswapd0 on the CPU for a couple of minutes.
>>>
>>> It seemed to go away instantly when I dropped caches
>>> (echo 3 >/proc/sys/vm/drop_caches)
>>> (After that I had over 1G of free memory)
>>
>> Any chance you could retry with this patch on top?
>>
>> ---
>> From: Johannes Weiner <hannes@cmpxchg.org>
>> Subject: [patch] mm: vmscan: do not keep kswapd looping forever due
>>   to individual uncompactable zones
>>
>> ---
>>   mm/vmscan.c | 16 ----------------
>>   1 file changed, 16 deletions(-)
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>
>
> Ok, I'm running now b69f0859dc8e633c5d8c06845811588fe17e68b3 (-rc8)
> with your patch.  I'll be able to give some feedback after couple
> days (if I keep my machine running without reboot - since before

So just to give some positive info:

with 2 1/2 days of uptime, several suspend/resume cycles, and ff at 1.4GB,
the kswapd0 process has still used just 29 seconds of CPU time.

So the patch above might have helped, but I'll keep watching for a few
more days.

Zdenek

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: kswapd craziness in 3.7
  2012-12-05  3:01                                 ` Bruno Wolff III
@ 2012-12-06 17:37                                   ` Bruno Wolff III
  2012-12-06 19:31                                     ` Linus Torvalds
  0 siblings, 1 reply; 65+ messages in thread
From: Bruno Wolff III @ 2012-12-06 17:37 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Thorsten Leemhuis, Mel Gorman, Andrew Morton, Linus Torvalds,
	Rik van Riel, George Spelvin, Johannes Hirte, Tomas Racek,
	Jan Kara, Dave Hansen, Josh Boyer, Valdis Kletnieks, Jiri Slaby,
	Zdenek Kabelac, linux-mm, Linux Kernel Mailing List, John Ellson

On Tue, Dec 04, 2012 at 21:01:33 -0600,
   Bruno Wolff III <bruno@wolff.to> wrote:
>On Tue, Dec 04, 2012 at 16:42:10 -0500,
>  Johannes Weiner <hannes@cmpxchg.org> wrote:
>> kernel-3.7.0-0.rc7.git1.2.van.main.knurd.kswap.4.fc18.i686
>>and
>> kernel-3.7.0-0.rc7.git1.2.van.main.knurd.kswap.4.fc18.x86_64
>>for over 24hours with no evidence of problems with kswapd"
>>
>>Now waiting for results from Jiri, Zdenek and Bruno...
>
>I have been running 
>3.7.0-0.rc7.git1.2.van.main.knurd.kswap.4.fc18.i686.PAE a bit over 23 
>hours and kswapd has accumulated one minute 8 seconds of CPU time. I
>did several yum operations during that time and didn't see kswapd 
>spike to 90+% CPU usage as I had seen in the past. With some kernels 
>I wasn't reliably triggering the kswapd issue, so it may not be long 
>enough to know for sure that the problem is fixed.

I am now at a bit over 2 and 1/2 days with kswapd having used 1 minute 
53 seconds of CPU time.
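
For anyone who wants to track the same number without watching top, here
is a minimal sketch of how the accumulated CPU time can be read from a
line in /proc/<pid>/stat format. It assumes the layout documented in
proc(5), where fields 14 and 15 are utime and stime in clock ticks. The
function name stat_cpu_ticks is made up for this illustration and is not
part of any existing tool.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: parse one line in /proc/<pid>/stat format and
 * return utime + stime (fields 14 and 15) in clock ticks, or -1 if the
 * line cannot be parsed.
 */
long long stat_cpu_ticks(const char *line)
{
	unsigned long long utime, stime;
	/* comm (field 2) may contain spaces, so start after its last ')' */
	const char *p = strrchr(line, ')');

	if (!p)
		return -1;

	/* Skip fields 3..13 (eleven tokens), then read fields 14 and 15. */
	if (sscanf(p + 1,
		   " %*s %*s %*s %*s %*s %*s %*s %*s %*s %*s %*s %llu %llu",
		   &utime, &stime) != 2)
		return -1;

	return (long long)(utime + stime);
}
```

To use it, read /proc/<pid>/stat for the kswapd thread with fgets() and
divide the result by sysconf(_SC_CLK_TCK) to get seconds. With 100 ticks
per second, 6800 ticks would correspond to the 68 seconds quoted above.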

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: kswapd craziness in 3.7
  2012-12-06 17:37                                   ` Bruno Wolff III
@ 2012-12-06 19:31                                     ` Linus Torvalds
  2012-12-06 19:43                                       ` Rik van Riel
                                                         ` (2 more replies)
  0 siblings, 3 replies; 65+ messages in thread
From: Linus Torvalds @ 2012-12-06 19:31 UTC (permalink / raw)
  To: Bruno Wolff III
  Cc: Johannes Weiner, Thorsten Leemhuis, Mel Gorman, Andrew Morton,
	Rik van Riel, George Spelvin, Johannes Hirte, Tomas Racek,
	Jan Kara, Dave Hansen, Josh Boyer, Valdis Kletnieks, Jiri Slaby,
	Zdenek Kabelac, linux-mm, Linux Kernel Mailing List, John Ellson

Ok, people seem to be reporting success.

I've applied Johannes' last patch with the new tested-by tags.

Johannes (or anybody else, for that matter), please holler LOUDLY if
you disagree.. (or if I used the wrong version of the patch, there's
been several, afaik).

                 Linus

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: kswapd craziness in 3.7
  2012-12-06 19:31                                     ` Linus Torvalds
@ 2012-12-06 19:43                                       ` Rik van Riel
  2012-12-06 20:23                                       ` Johannes Weiner
  2012-12-08 12:06                                       ` Zlatko Calusic
  2 siblings, 0 replies; 65+ messages in thread
From: Rik van Riel @ 2012-12-06 19:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Bruno Wolff III, Johannes Weiner, Thorsten Leemhuis, Mel Gorman,
	Andrew Morton, George Spelvin, Johannes Hirte, Tomas Racek,
	Jan Kara, Dave Hansen, Josh Boyer, Valdis Kletnieks, Jiri Slaby,
	Zdenek Kabelac, linux-mm, Linux Kernel Mailing List, John Ellson

On 12/06/2012 02:31 PM, Linus Torvalds wrote:
> Ok, people seem to be reporting success.
>
> I've applied Johannes' last patch with the new tested-by tags.
>
> Johannes (or anybody else, for that matter), please holler LOUDLY if
> you disagree.. (or if I used the wrong version of the patch, there's
> been several, afaik).

Johannes's patch is a fairly big hammer, with kswapd not looping
back to the start when zones are still unbalanced.

However, the next allocation will wake up kswapd again, and
having kswapd stop early beats having it in an infinite loop.

I believe Johannes's patch will be fine for 3.7.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: kswapd craziness in 3.7
  2012-12-06 19:31                                     ` Linus Torvalds
  2012-12-06 19:43                                       ` Rik van Riel
@ 2012-12-06 20:23                                       ` Johannes Weiner
  2012-12-06 20:32                                         ` Rik van Riel
  2012-12-08 12:06                                       ` Zlatko Calusic
  2 siblings, 1 reply; 65+ messages in thread
From: Johannes Weiner @ 2012-12-06 20:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Bruno Wolff III, Thorsten Leemhuis, Mel Gorman, Andrew Morton,
	Rik van Riel, George Spelvin, Johannes Hirte, Tomas Racek,
	Jan Kara, Dave Hansen, Josh Boyer, Valdis Kletnieks, Jiri Slaby,
	Zdenek Kabelac, linux-mm, Linux Kernel Mailing List, John Ellson

On Thu, Dec 06, 2012 at 11:31:21AM -0800, Linus Torvalds wrote:
> Ok, people seem to be reporting success.
> 
> I've applied Johannes' last patch with the new tested-by tags.
> 
> Johannes (or anybody else, for that matter), please holler LOUDLY if
> you disagree.. (or if I used the wrong version of the patch, there's
> been several, afaik).

I just went back one more time and of course that's when I spot that I
forgot to remove the zone congestion clearing that depended on the now
removed checks to ensure the zone is balanced.  It's not too big of a
deal, just the /risk/ of increased CPU use from reclaim because we go
back to scanning zones that we previously deemed congested and slept a
little bit before continuing reclaim.

Sorry, I should have seen that earlier.

Removing it is a low-risk fix.  The clearing was kinda redundant anyway
(the preliminary zone check clears it for OK zones, and so does the
reclaim loop under the same criteria), so letting it stay is probably
more problematic for 3.7 than just dropping it...

---
From: Johannes Weiner <hannes@cmpxchg.org>
Subject: [patch] mm: vmscan: fix inappropriate zone congestion clearing

c702418 ("mm: vmscan: do not keep kswapd looping forever due to
individual uncompactable zones") removed zone watermark checks from
the compaction code in kswapd but left in the zone congestion
clearing, which now happens unconditionally on higher order reclaim.

This messes up the reclaim throttling logic for zones with
dirty/writeback pages, where zones should only lose their congestion
status when their watermarks have been restored.

Remove the clearing from the zone compaction section entirely.  The
preliminary zone check and the reclaim loop in kswapd will clear it if
the zone is considered balanced.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/vmscan.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 124bbfe..b7ed376 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2827,9 +2827,6 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 			if (zone_watermark_ok(zone, order,
 				    low_wmark_pages(zone), *classzone_idx, 0))
 				zones_need_compaction = 0;
-
-			/* If balanced, clear the congested flag */
-			zone_clear_flag(zone, ZONE_CONGESTED);
 		}
 
 		if (zones_need_compaction)
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* Re: kswapd craziness in 3.7
  2012-12-06 20:23                                       ` Johannes Weiner
@ 2012-12-06 20:32                                         ` Rik van Riel
  0 siblings, 0 replies; 65+ messages in thread
From: Rik van Riel @ 2012-12-06 20:32 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Linus Torvalds, Bruno Wolff III, Thorsten Leemhuis, Mel Gorman,
	Andrew Morton, George Spelvin, Johannes Hirte, Tomas Racek,
	Jan Kara, Dave Hansen, Josh Boyer, Valdis Kletnieks, Jiri Slaby,
	Zdenek Kabelac, linux-mm, Linux Kernel Mailing List, John Ellson

On 12/06/2012 03:23 PM, Johannes Weiner wrote:

> From: Johannes Weiner <hannes@cmpxchg.org>
> Subject: [patch] mm: vmscan: fix inappropriate zone congestion clearing
>
> c702418 ("mm: vmscan: do not keep kswapd looping forever due to
> individual uncompactable zones") removed zone watermark checks from
> the compaction code in kswapd but left in the zone congestion
> clearing, which now happens unconditionally on higher order reclaim.
>
> This messes up the reclaim throttling logic for zones with
> dirty/writeback pages, where zones should only lose their congestion
> status when their watermarks have been restored.
>
> Remove the clearing from the zone compaction section entirely.  The
> preliminary zone check and the reclaim loop in kswapd will clear it if
> the zone is considered balanced.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Reviewed-by: Rik van Riel <riel@redhat.com>


-- 
All rights reversed

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: kswapd craziness in 3.7
  2012-12-04 16:11           ` Johannes Weiner
  2012-12-04 16:22             ` Jiri Slaby
@ 2012-12-08 10:35             ` Jiri Slaby
  1 sibling, 0 replies; 65+ messages in thread
From: Jiri Slaby @ 2012-12-08 10:35 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Zdenek Kabelac, Mel Gorman, Andrew Morton, Rik van Riel,
	George Spelvin, Johannes Hirte, Thorsten Leemhuis, Tomas Racek,
	Jan Kara, Dave Hansen, Josh Boyer, Valdis.Kletnieks,
	Bruno Wolff III, Linus Torvalds, linux-mm, linux-kernel

On 12/04/2012 05:11 PM, Johannes Weiner wrote:
> On Tue, Dec 04, 2012 at 10:15:09AM +0100, Jiri Slaby wrote:
>> It does not apply to -next :/. Should I try anything else?
> 
> The COMPACTION_BUILD changed to IS_ENABLED(CONFIG_COMPACTION), below
> is a -next patch.  I hope you don't run into other problems that come
> out of -next craziness, because Linus is kinda waiting for this to be
> resolved to release 3.7.  If you've always tested against -next so far
> and it worked otherwise, don't change the environment now, please.  If
> you just started, it would make more sense to test based on 3.7-rc8.
> 
> Thanks!
> 
> ---
> From: Johannes Weiner <hannes@cmpxchg.org>
> Subject: [patch] mm: vmscan: do not keep kswapd looping forever due
>  to individual uncompactable zones
> 
> When a zone meets its high watermark and is compactable in case of
> higher order allocations, it contributes to the percentage of the
> node's memory that is considered balanced.
> 
> This requirement, that a node be only partially balanced, came about
> when kswapd was desperately trying to balance tiny zones when all
> bigger zones in the node had plenty of free memory.  Arguably, the
> same should apply to compaction: if a significant part of the node is
> balanced enough to run compaction, do not get hung up on that tiny
> zone that might never get in shape.
> 
> When the compaction logic in kswapd is reached, we know that at least
> 25% of the node's memory is balanced properly for compaction (see
> zone_balanced and pgdat_balanced).  Remove the individual zone checks
> that restart the kswapd cycle.
> 
> Otherwise, we may observe more endless looping in kswapd where the
> compaction code loops back to reclaim because of a single zone and
> reclaim does nothing because the node is considered balanced overall.
> 
> Reported-by: Thorsten Leemhuis <fedora@leemhuis.info>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Looks like it's gone with this patch now. Hopefully the send button
won't trigger the issue the same as the last time :).

> ---
>  mm/vmscan.c | 16 ----------------
>  1 file changed, 16 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 3b0aef4..486100f 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2806,22 +2806,6 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
>  			if (!populated_zone(zone))
>  				continue;
>  
> -			if (zone->all_unreclaimable &&
> -			    sc.priority != DEF_PRIORITY)
> -				continue;
> -
> -			/* Would compaction fail due to lack of free memory? */
> -			if (IS_ENABLED(CONFIG_COMPACTION) &&
> -			    compaction_suitable(zone, order) == COMPACT_SKIPPED)
> -				goto loop_again;
> -
> -			/* Confirm the zone is balanced for order-0 */
> -			if (!zone_watermark_ok(zone, 0,
> -					high_wmark_pages(zone), 0, 0)) {
> -				order = sc.order = 0;
> -				goto loop_again;
> -			}
> -
>  			/* Check if the memory needs to be defragmented. */
>  			if (zone_watermark_ok(zone, order,
>  				    low_wmark_pages(zone), *classzone_idx, 0))
> 


-- 
js
suse labs

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: kswapd craziness in 3.7
  2012-12-06 19:31                                     ` Linus Torvalds
  2012-12-06 19:43                                       ` Rik van Riel
  2012-12-06 20:23                                       ` Johannes Weiner
@ 2012-12-08 12:06                                       ` Zlatko Calusic
  2012-12-08 21:22                                         ` Zlatko Calusic
  2 siblings, 1 reply; 65+ messages in thread
From: Zlatko Calusic @ 2012-12-08 12:06 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Johannes Weiner, linux-mm, Linux Kernel Mailing List

On 06.12.2012 20:31, Linus Torvalds wrote:
> Ok, people seem to be reporting success.
>
> I've applied Johannes' last patch with the new tested-by tags.
>

I've been testing this patch since it was applied, and it certainly 
fixes the kswapd craziness issue, good work Johannes!

But, it's still not perfect yet, because I see that the system keeps 
lots of memory unused (free), where it previously used it all for the 
page cache (there's enough fs activity to warrant it).

I'm now testing the last piece of Johannes' changes (still not in git 
tree), and can report results in 24-48 hours.

Regards,
-- 
Zlatko

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: kswapd craziness in 3.7
  2012-12-08 12:06                                       ` Zlatko Calusic
@ 2012-12-08 21:22                                         ` Zlatko Calusic
  2012-12-09  1:01                                           ` Linus Torvalds
  0 siblings, 1 reply; 65+ messages in thread
From: Zlatko Calusic @ 2012-12-08 21:22 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Johannes Weiner, linux-mm, Linux Kernel Mailing List

On 08.12.2012 13:06, Zlatko Calusic wrote:
> On 06.12.2012 20:31, Linus Torvalds wrote:
>> Ok, people seem to be reporting success.
>>
>> I've applied Johannes' last patch with the new tested-by tags.
>>
>
> I've been testing this patch since it was applied, and it certainly
> fixes the kswapd craziness issue, good work Johannes!
>
> But, it's still not perfect yet, because I see that the system keeps
> lots of memory unused (free), where it previously used it all for the
> page cache (there's enough fs activity to warrant it).
>
> I'm now testing the last piece of Johannes' changes (still not in git
> tree), and can report results in 24-48 hours.
>
> Regards,

Or sooner... in short: nothing's changed!

On a 4GB RAM system, where applications use close to 2GB, kswapd likes 
to keep around 1GB free (unused), leaving only 1GB for page/buffer 
cache. If I force bigger page cache by reading a big file and thus use 
the unused 1GB of RAM, kswapd will soon (in a matter of minutes) evict 
those (or other) pages out and once again keep unused memory close to 1GB.

I guess it's not a showstopper, but it still counts as very bad memory
management, wasting lots of RAM.

As an additional data point, if memory pressure is slightly higher (say
a backup kicks in, keeping the page cache mostly full), kswapd goes into
D (uninterruptible sleep) state (function: congestion_wait) and the load
average goes up by 1. It recovers only when it successfully throws out
half of the page cache again.

Hope it helps.
-- 
Zlatko

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: kswapd craziness in 3.7
  2012-12-08 21:22                                         ` Zlatko Calusic
@ 2012-12-09  1:01                                           ` Linus Torvalds
  2012-12-09 21:59                                             ` Zdenek Kabelac
  2012-12-10 11:03                                             ` Mel Gorman
  0 siblings, 2 replies; 65+ messages in thread
From: Linus Torvalds @ 2012-12-09  1:01 UTC (permalink / raw)
  To: Zlatko Calusic
  Cc: Linus Torvalds, Johannes Weiner, Mel Gorman, linux-mm,
	Linux Kernel Mailing List



On Sat, 8 Dec 2012, Zlatko Calusic wrote:
> 
> Or sooner... in short: nothing's changed!
> 
> On a 4GB RAM system, where applications use close to 2GB, kswapd likes to keep
> around 1GB free (unused), leaving only 1GB for page/buffer cache. If I force
> bigger page cache by reading a big file and thus use the unused 1GB of RAM,
> kswapd will soon (in a matter of minutes) evict those (or other) pages out and
> once again keep unused memory close to 1GB.

Ok, guys, what was the reclaim or kswapd patch during the merge window 
that actually caused all of these insane problems? It seems it was more 
fundamentally buggered than the fifteen-million fixes for kswapd we have 
already picked up.

(Ok, I may be exaggerating the number of patches, but it's starting to 
feel that way - I thought that 3.7 was going to be a calm and easy 
release, but the kswapd issues seem to just keep happening. We've been 
fighting the kswapd changes for a while now.)

Trying to keep a gigabyte free (presumably because that way we have lots 
of high-order allocation pages) is ridiculous. Is it one of the compaction
changes? 

Mel? Ideas?

            Linus

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: kswapd craziness in 3.7
  2012-12-09  1:01                                           ` Linus Torvalds
@ 2012-12-09 21:59                                             ` Zdenek Kabelac
  2012-12-10 11:03                                             ` Mel Gorman
  1 sibling, 0 replies; 65+ messages in thread
From: Zdenek Kabelac @ 2012-12-09 21:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Zlatko Calusic, Johannes Weiner, Mel Gorman, linux-mm,
	Linux Kernel Mailing List

Dne 9.12.2012 02:01, Linus Torvalds napsal(a):
>
>
> On Sat, 8 Dec 2012, Zlatko Calusic wrote:
>>
>> Or sooner... in short: nothing's changed!
>>
>> On a 4GB RAM system, where applications use close to 2GB, kswapd likes to keep
>> around 1GB free (unused), leaving only 1GB for page/buffer cache. If I force
>> bigger page cache by reading a big file and thus use the unused 1GB of RAM,
>> kswapd will soon (in a matter of minutes) evict those (or other) pages out and
>> once again keep unused memory close to 1GB.
>
> Ok, guys, what was the reclaim or kswapd patch during the merge window
> that actually caused all of these insane problems? It seems it was more
> fundamentally buggered than the fifteen-million fixes for kswapd we have
> already picked up.
>
> (Ok, I may be exaggerating the number of patches, but it's starting to
> feel that way - I thought that 3.7 was going to be a calm and easy
> release, but the kswapd issues seem to just keep happening. We've been
> fighting the kswapd changes for a while now.)
>
> Trying to keep a gigabyte free (presumably because that way we have lots
> of high-order allocation pages) is ridiculous. Is it one of the compaction
> changes?
>
> Mel? Ideas?
>

Very true.

It's as simple as running

dd if=/dev/zero of=/tmp/zero bs=1M count=0 seek=1000000

and then

dd if=/tmp/zero of=/dev/null bs=1M

and kswapd fights with dd for CPU time....


Zdenek



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: kswapd craziness in 3.7
  2012-12-09  1:01                                           ` Linus Torvalds
  2012-12-09 21:59                                             ` Zdenek Kabelac
@ 2012-12-10 11:03                                             ` Mel Gorman
  2012-12-10 16:39                                               ` Johannes Weiner
  2012-12-10 18:29                                               ` Zlatko Calusic
  1 sibling, 2 replies; 65+ messages in thread
From: Mel Gorman @ 2012-12-10 11:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Zlatko Calusic, Rik van Riel, Johannes Weiner, linux-mm,
	Linux Kernel Mailing List

On Sat, Dec 08, 2012 at 05:01:42PM -0800, Linus Torvalds wrote:
> 
> 
> On Sat, 8 Dec 2012, Zlatko Calusic wrote:
> > 
> > Or sooner... in short: nothing's changed!
> > 
> > On a 4GB RAM system, where applications use close to 2GB, kswapd likes to keep
> > around 1GB free (unused), leaving only 1GB for page/buffer cache. If I force
> > bigger page cache by reading a big file and thus use the unused 1GB of RAM,
> > kswapd will soon (in a matter of minutes) evict those (or other) pages out and
> > once again keep unused memory close to 1GB.
> 
> Ok, guys, what was the reclaim or kswapd patch during the merge window 
> that actually caused all of these insane problems?

I believe commit c6543459 (mm: remove __GFP_NO_KSWAPD) is the primary
candidate. __GFP_NO_KSWAPD was originally introduced by THP because kswapd
was excessively reclaiming. kswapd would stay awake aggressively reclaiming
even if compaction was deferred. The flag was removed in this cycle when it
was expected that it was no longer necessary. I'm not foisting the blame
on Rik here, I was on the review list for that patch and did not identify
that it would cause this many problems either.

> It seems it was more 
> fundamentally buggered than the fifteen-million fixes for kswapd we have 
> already picked up.
> 

It was already fundamentally buggered up. The difference was it stayed
asleep for THP requests in earlier kernels.

There is a big difference between a direct reclaim/compaction for THP
and kswapd doing the same work. Direct reclaim/compaction will try once,
give up quickly and defer requests in the near future to avoid impacting
the system heavily for THP. The same applies for khugepaged.

kswapd is different. It can keep going until its watermarks for a THP
allocation are met. There are two reasons why it might keep going for a
long time. The first is that compaction is being inefficient, which we
know it may be due to crap like this

end_pfn = ALIGN(low_pfn + pageblock_nr_pages, pageblock_nr_pages);

and the second is if the highest zone is relatively small, because
compaction_suitable will keep saying that allocations are failing due to
insufficient amounts of memory in the highest zone. It'll reclaim a little
from this highest zone and then call shrink_slab(), potentially dumping a
large amount of memory. This may be the case for Zlatko, as with a 4G
machine his ZONE_NORMAL could be small depending on how the 32-bit address
space is used by his hardware.
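
The mismatch between the two checks can be sketched with a toy model. It
follows the description in the opening message of this thread, where
compaction wants the low watermark plus a gap of 2 * allocation size,
while the reclaim side is content once the high watermark is met. The
struct and function names below are invented for illustration and only
loosely mirror zone_watermark_ok() and compaction_suitable(); the real
checks also account for lowmem reserves and free-page order.

```c
#include <stdbool.h>

/* Toy zone holding only the fields this illustration needs. */
struct toy_zone {
	unsigned long free_pages;
	unsigned long low_wmark;
	unsigned long high_wmark;
};

/* Reclaim side: nothing to do once free pages exceed the high watermark. */
bool toy_reclaim_satisfied(const struct toy_zone *z)
{
	return z->free_pages > z->high_wmark;
}

/* Compaction side: wants low watermark plus 2 * allocation size. */
bool toy_compaction_suitable(const struct toy_zone *z, unsigned int order)
{
	unsigned long gap = 2UL << order;

	return z->free_pages > z->low_wmark + gap;
}

/* Stuck: reclaim sees no work, yet compaction still refuses to run. */
bool toy_zone_is_stuck(const struct toy_zone *z, unsigned int order)
{
	return toy_reclaim_satisfied(z) && !toy_compaction_suitable(z, order);
}
```

With made-up numbers, a zone with 200 free pages, a low watermark of 40
and a high watermark of 48 is fine for order-0 but stuck for an order-9
(THP) request, because the gap alone is 1024 pages.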

> (Ok, I may be exaggerating the number of patches, but it's starting to 
> feel that way - I thought that 3.7 was going to be a calm and easy 
> release, but the kswapd issues seem to just keep happening. We've been 
> fighting the kswapd changes for a while now.)
> 

Yes.

> Trying to keep a gigabyte free (presumably because that way we have lots 
> of high-order allocation pages) is ridiculous. Is it one of the compaction
> changes? 
> 

Not directly. Compaction has been a bigger factor after 3.5 due to the
removal of lumpy reclaim but it's not directly responsible for excessive
amounts of memory being kept free. The closest patch I'm aware of that
would cause problems of that nature would be commit 83fde0f2 (mm: vmscan:
scale number of pages reclaimed by reclaim/compaction based on failures)
and it has already been reverted by 96710098.

> Mel? Ideas?
> 

Consider reverting the revert of __GFP_NO_KSWAPD again until this can be
ironed out at a more reasonable pace. Rik? Johannes?

Verify if the shrinking slab is the issue with this brutally ugly
hack. Zlatko?

diff --git a/mm/vmscan.c b/mm/vmscan.c
index b7ed376..2189d20 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2550,6 +2550,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 	unsigned long balanced;
 	int i;
 	int end_zone = 0;	/* Inclusive.  0 = ZONE_DMA */
+	bool should_shrink_slab = true;
 	unsigned long total_scanned;
 	struct reclaim_state *reclaim_state = current->reclaim_state;
 	unsigned long nr_soft_reclaimed;
@@ -2695,7 +2696,8 @@ loop_again:
 				shrink_zone(zone, &sc);
 
 				reclaim_state->reclaimed_slab = 0;
-				nr_slab = shrink_slab(&shrink, sc.nr_scanned, lru_pages);
+				if (should_shrink_slab)
+					nr_slab = shrink_slab(&shrink, sc.nr_scanned, lru_pages);
 				sc.nr_reclaimed += reclaim_state->reclaimed_slab;
 				total_scanned += sc.nr_scanned;
 
@@ -2817,6 +2819,16 @@ out:
 	if (order) {
 		int zones_need_compaction = 1;
 
+		/*
+		 * Shrinking slab for high-order allocs can cause an excessive
+		 * amount of memory to be dumped. Only shrink slab once per
+		 * round for high-order allocs.
+		 *
+		 * This is a very stupid hack. balance_pgdat() is in serious
+		 * need of a rework
+		 */
+		should_shrink_slab = false;
+
 		for (i = 0; i <= end_zone; i++) {
 			struct zone *zone = pgdat->node_zones + i;
 

^ permalink raw reply related	[flat|nested] 65+ messages in thread

* Re: kswapd craziness in 3.7
  2012-12-10 11:03                                             ` Mel Gorman
@ 2012-12-10 16:39                                               ` Johannes Weiner
  2012-12-10 18:01                                                 ` Mel Gorman
  2012-12-10 18:29                                               ` Zlatko Calusic
  1 sibling, 1 reply; 65+ messages in thread
From: Johannes Weiner @ 2012-12-10 16:39 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linus Torvalds, Zlatko Calusic, Rik van Riel, linux-mm,
	Linux Kernel Mailing List

On Mon, Dec 10, 2012 at 11:03:37AM +0000, Mel Gorman wrote:
> On Sat, Dec 08, 2012 at 05:01:42PM -0800, Linus Torvalds wrote:
> > On Sat, 8 Dec 2012, Zlatko Calusic wrote:
> > > Or sooner... in short: nothing's changed!
> > > 
> > > On a 4GB RAM system, where applications use close to 2GB, kswapd likes to keep
> > > around 1GB free (unused), leaving only 1GB for page/buffer cache. If I force
> > > bigger page cache by reading a big file and thus use the unused 1GB of RAM,
> > > kswapd will soon (in a matter of minutes) evict those (or other) pages out and
> > > once again keep unused memory close to 1GB.
> > 
> > Ok, guys, what was the reclaim or kswapd patch during the merge window 
> > that actually caused all of these insane problems?
> 
> I believe commit c6543459 (mm: remove __GFP_NO_KSWAPD) is the primary
> candidate. __GFP_NO_KSWAPD was originally introduced by THP because kswapd
> was excessively reclaiming. kswapd would stay awake aggressively reclaiming
> even if compaction was deferred. The flag was removed in this cycle when it
> was expected that it was no longer necessary. I'm not foisting the blame
> on Rik here, I was on the review list for that patch and did not identify
> that it would cause this many problems either.
>
> > It seems it was more 
> > fundamentally buggered than the fifteen-million fixes for kswapd we have 
> > already picked up.
> 
> It was already fundamentally buggered up. The difference was it stayed
> asleep for THP requests in earlier kernels.
> 
> There is a big difference between a direct reclaim/compaction for THP
> and kswapd doing the same work. Direct reclaim/compaction will try once,
> give up quickly and defer requests in the near future to avoid impacting
> the system heavily for THP. The same applies for khugepaged.
> 
> kswapd is different. It can keep going until its watermarks for
> a THP allocation are met. Two reasons why it might keep going for a long
> time are that compaction is being inefficient which we know it may be due
> to crap like this
> 
> end_pfn = ALIGN(low_pfn + pageblock_nr_pages, pageblock_nr_pages);
> 
> and the second reason is if the highest zone is relatively small, because
> compaction_suitable will keep saying that allocations are failing due to
> insufficient amounts of memory in the highest zone. It'll reclaim a little
> from this highest zone and then call shrink_slab(), potentially dumping a
> large amount of memory. This may be the case for Zlatko as with a 4G machine
> his ZONE_NORMAL could be small depending on how the 32-bit address space
> is used by his hardware.
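
For reference, the quoted end_pfn expression is plain round-up arithmetic.
The sketch below only shows what it evaluates to for aligned and unaligned
values of low_pfn; the thread does not spell out why it is considered
inefficient, so no claim is made about that here. The macro mirrors the
usual power-of-two round-up, and the pageblock size of 512 pages is an
assumption (typical for x86-64 with 2MB huge pages), as are all names
ending in _sketch.

```c
/* Round x up to the next multiple of a, where a is a power of two. */
#define ALIGN_UP_SKETCH(x, a) (((x) + (a) - 1) & ~((a) - 1))

/* Assumed pageblock size in pages; not taken from the thread. */
#define PAGEBLOCK_NR_PAGES_SKETCH 512UL

/* Evaluates the quoted expression for a given starting pfn. */
static unsigned long end_pfn_sketch(unsigned long low_pfn)
{
	return ALIGN_UP_SKETCH(low_pfn + PAGEBLOCK_NR_PAGES_SKETCH,
			       PAGEBLOCK_NR_PAGES_SKETCH);
}
```

With these assumptions an aligned start of 512 gives an end of 1024 (one
pageblock), while an unaligned start of 100 also gives 1024, a range that
crosses the pageblock boundary at 512.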

Unlike direct reclaim, kswapd also never does sync migration.  Since
the fragmentation index is a ratio of free pages over free page
blocks, doing lightweight compaction that reduces the page blocks but
never really follows through to compact a THP block increases the free
memory requirement.
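
A rough user-space sketch of that effect, assuming the index is computed
roughly the way the kernel's fragmentation index is (scaled to 0..1000,
low values meaning "failing for lack of memory", high values meaning
"failing because of fragmentation"). The function name and the exact
formula are assumptions for illustration, not copied from this thread.

```c
#include <stdint.h>

/*
 * Fragmentation index sketch for a request of 2^order pages.
 * Assumes no free block is already large enough for the request.
 * Near 0: failure looks like a memory shortage.
 * Near 1000: failure looks like fragmentation.
 */
static int frag_index_sketch(unsigned int order, uint64_t free_pages,
			     uint64_t free_blocks_total)
{
	uint64_t requested = 1ULL << order;

	if (free_blocks_total == 0)
		return 0;

	return 1000 - (int)((1000 + free_pages * 1000ULL / requested) /
			    free_blocks_total);
}
```

For an order-9 request with 4096 free pages, 4096 separate free blocks
give an index of 998, while the same 4096 pages merged into 16 blocks
(still none big enough) give 438. If the compaction threshold is 500,
which is assumed here to be the default, the second case reads as
"not enough memory" and pushes towards more reclaim.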

I thought about the small Normal zone too.  Direct reclaim/compaction
is fine with one zone being able to provide a THP, but kswapd requires
25% of the node.  A small ZONE_NORMAL would not be able to meet this
and so the bigger DMA32 zone would also be required to be balanced for
the THP allocation.

> > Mel? Ideas?
> 
> Consider reverting the revert of __GFP_NO_KSWAPD again until this can be
> ironed out at a more reasonable pace. Rik? Johannes?

Yes, I also think we need more time for this.

> Verify if the shrinking slab is the issue with this brutally ugly
> hack. Zlatko?
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index b7ed376..2189d20 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2550,6 +2550,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
>  	unsigned long balanced;
>  	int i;
>  	int end_zone = 0;	/* Inclusive.  0 = ZONE_DMA */
> +	bool should_shrink_slab = true;
>  	unsigned long total_scanned;
>  	struct reclaim_state *reclaim_state = current->reclaim_state;
>  	unsigned long nr_soft_reclaimed;
> @@ -2695,7 +2696,8 @@ loop_again:
>  				shrink_zone(zone, &sc);
>  
>  				reclaim_state->reclaimed_slab = 0;
> -				nr_slab = shrink_slab(&shrink, sc.nr_scanned, lru_pages);
> +				if (should_shrink_slab)
> +					nr_slab = shrink_slab(&shrink, sc.nr_scanned, lru_pages);
>  				sc.nr_reclaimed += reclaim_state->reclaimed_slab;
>  				total_scanned += sc.nr_scanned;
>  
> @@ -2817,6 +2819,16 @@ out:
>  	if (order) {
>  		int zones_need_compaction = 1;
>  
> +		/*
> +		 * Shrinking slab for high-order allocs can cause an excessive
> +		 * amount of memory to be dumped. Only shrink slab once per
> +		 * round for high-order allocs.
> +		 *
> +		 * This is a very stupid hack. balance_pgdat() is in serious
> +		 * need of a rework
> +		 */
> +		should_shrink_slab = false;
> +
>  		for (i = 0; i <= end_zone; i++) {
>  			struct zone *zone = pgdat->node_zones + i;

I don't see a shrink_slab() invocation after this point since the
loop_again jumps in this loop were removed, so this shouldn't change
anything?
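
A toy model of that control flow, to make the point concrete. This is
not balance_pgdat(); count_shrink_calls() and its parameters are
hypothetical, and the counter merely stands in for shrink_slab().

```c
#include <stdbool.h>

/*
 * A reclaim loop that calls the shrinker on every pass, followed by a
 * flag update that only runs once the loop is over. Returns how many
 * times the shrinker stand-in was called.
 */
static int count_shrink_calls(int passes, bool clear_flag_after_loop)
{
	bool should_shrink_slab = true;
	int calls = 0;

	for (int pass = 0; pass < passes; pass++) {
		if (should_shrink_slab)
			calls++;	/* stands in for shrink_slab() */
	}

	/* Same placement as in the hunk: nothing reads the flag afterwards. */
	if (clear_flag_after_loop)
		should_shrink_slab = false;

	(void)should_shrink_slab;
	return calls;
}
```

With no jump back into the loop, count_shrink_calls(5, true) and
count_shrink_calls(5, false) both return 5.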

* Re: kswapd craziness in 3.7
  2012-12-10 16:39                                               ` Johannes Weiner
@ 2012-12-10 18:01                                                 ` Mel Gorman
  2012-12-10 18:33                                                   ` Zlatko Calusic
  0 siblings, 1 reply; 65+ messages in thread
From: Mel Gorman @ 2012-12-10 18:01 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Linus Torvalds, Zlatko Calusic, Rik van Riel, linux-mm,
	Linux Kernel Mailing List

On Mon, Dec 10, 2012 at 11:39:04AM -0500, Johannes Weiner wrote:
> On Mon, Dec 10, 2012 at 11:03:37AM +0000, Mel Gorman wrote:
> > On Sat, Dec 08, 2012 at 05:01:42PM -0800, Linus Torvalds wrote:
> > > On Sat, 8 Dec 2012, Zlatko Calusic wrote:
> > > > Or sooner... in short: nothing's changed!
> > > > 
> > > > On a 4GB RAM system, where applications use close to 2GB, kswapd likes to keep
> > > > around 1GB free (unused), leaving only 1GB for page/buffer cache. If I force
> > > > bigger page cache by reading a big file and thus use the unused 1GB of RAM,
> > > > kswapd will soon (in a matter of minutes) evict those (or other) pages out and
> > > > once again keep unused memory close to 1GB.
> > > 
> > > Ok, guys, what was the reclaim or kswapd patch during the merge window 
> > > that actually caused all of these insane problems?
> > 
> > I believe commit c6543459 (mm: remove __GFP_NO_KSWAPD) is the primary
> > candidate. __GFP_NO_KSWAPD was originally introduced by THP because kswapd
> > was excessively reclaiming. kswapd would stay awake aggressively reclaiming
> > even if compaction was deferred. The flag was removed in this cycle when it
> > was expected that it was no longer necessary. I'm not foisting the blame
> > on Rik here, I was on the review list for that patch and did not identify
> > that it would cause this many problems either.
> >
> > > It seems it was more 
> > > fundamentally buggered than the fifteen-million fixes for kswapd we have 
> > > already picked up.
> > 
> > It was already fundamentally buggered up. The difference was it stayed
> > asleep for THP requests in earlier kernels.
> > 
> > There is a big difference between a direct reclaim/compaction for THP
> > and kswapd doing the same work. Direct reclaim/compaction will try once,
> > give up quickly and defer requests in the near future to avoid impacting
> > the system heavily for THP. The same applies for khugepaged.
> > 
> > kswapd is different. It can keep going until its watermarks for
> > a THP allocation are met. Two reasons why it might keep going for a long
> > time are that compaction is being inefficient which we know it may be due
> > to crap like this
> > 
> > end_pfn = ALIGN(low_pfn + pageblock_nr_pages, pageblock_nr_pages);
> > 
> > and the second reason is if the highest zone is relatively small, because
> > compaction_suitable will keep saying that allocations are failing due to
> > insufficient amounts of memory in the highest zone. It'll reclaim a little
> > from this highest zone and then call shrink_slab(), potentially dumping a
> > large amount of memory. This may be the case for Zlatko as with a 4G machine
> > his ZONE_NORMAL could be small depending on how the 32-bit address space
> > is used by his hardware.
> 
> Unlike direct reclaim, kswapd also never does sync migration.  Since
> the fragmentation index is a ratio of free pages over free page
> blocks, doing lightweight compaction that reduces the page blocks but
> never really follows through to compact a THP block increases the free
> memory requirement.
> 

True.

> I thought about the small Normal zone too.  Direct reclaim/compaction
> is fine with one zone being able to provide a THP, but kswapd requires
> 25% of the node.  A small ZONE_NORMAL would not be able to meet this
> and so the bigger DMA32 zone would also be required to be balanced for
> the THP allocation.
> 

Also true.

> > > Mel? Ideas?
> > 
> > Consider reverting the revert of __GFP_NO_KSWAPD again until this can be
> > ironed out at a more reasonable pace. Rik? Johannes?
> 
> Yes, I also think we need more time for this.
> 

Yes, the last minute band-aids are just getting worse and the result is
more mess.

> <SNIP>
> 
> I don't see a shrink_slab() invocation after this point since the
> loop_again jumps in this loop were removed, so this shouldn't change
> anything?

/me slaps self

In this last-minute disaster, I'm not thinking properly at all any more. The
shrink slab disabling should have happened before the loop_again but even
then it's wrong because it's just covering over the problem.

The way order and testorder interact with how balanced is calculated means
that we potentially call shrink_slab() multiple times and that thing is
global in nature and basically uncontrolled. You could argue that we should
only call shrink_slab() if order-0 watermarks are not met but that will
not necessarily prevent kswapd reclaiming too much. It keeps going back
to balance_pgdat needing its list of requirements drawn up and some
major surgery, and we're not going to do that as a quick hack.

-- 
Mel Gorman
SUSE Labs

* Re: kswapd craziness in 3.7
  2012-12-10 11:03                                             ` Mel Gorman
  2012-12-10 16:39                                               ` Johannes Weiner
@ 2012-12-10 18:29                                               ` Zlatko Calusic
  1 sibling, 0 replies; 65+ messages in thread
From: Zlatko Calusic @ 2012-12-10 18:29 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linus Torvalds, Rik van Riel, Johannes Weiner, linux-mm,
	Linux Kernel Mailing List

On 10.12.2012 12:03, Mel Gorman wrote:
> There is a big difference between a direct reclaim/compaction for THP
> and kswapd doing the same work. Direct reclaim/compaction will try once,
> give up quickly and defer requests in the near future to avoid impacting
> the system heavily for THP. The same applies for khugepaged.
>
> kswapd is different. It can keep going until its watermarks for
> a THP allocation are met. Two reasons why it might keep going for a long
> time are that compaction is being inefficient which we know it may be due
> to crap like this
>
> end_pfn = ALIGN(low_pfn + pageblock_nr_pages, pageblock_nr_pages);
>
> and the second reason is if the highest zone is relatively small, because
> compaction_suitable will keep saying that allocations are failing due to
> insufficient amounts of memory in the highest zone. It'll reclaim a little
> from this highest zone and then call shrink_slab(), potentially dumping a
> large amount of memory. This may be the case for Zlatko as with a 4G machine
> his ZONE_NORMAL could be small depending on how the 32-bit address space
> is used by his hardware.
>

The kernel is 64-bit, if it makes any difference (userspace, though, is
still 32-bit). There's no swap (swap support not even compiled in). The 
zones are as follows:

On node 0 totalpages: 1048019
   DMA zone: 64 pages used for memmap
   DMA zone: 6 pages reserved
   DMA zone: 3913 pages, LIFO batch:0
   DMA32 zone: 16320 pages used for memmap
   DMA32 zone: 831109 pages, LIFO batch:31
   Normal zone: 3072 pages used for memmap
   Normal zone: 193535 pages, LIFO batch:31

If I understand correctly, you think that because 193535 pages in
ZONE_NORMAL is relatively small compared to 831109 pages of ZONE_DMA32
the system has a hard time balancing itself?
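
As a back-of-the-envelope check, the 25% rule mentioned earlier in the
thread can be applied to the zone sizes from the boot log above. The
helper names are hypothetical, and treating the logged per-zone page
counts as the sizes kswapd works with is an assumption.

```c
#include <stdbool.h>

/* Zone sizes in pages, copied from the boot log above. */
enum { DMA_PAGES = 3913, DMA32_PAGES = 831109, NORMAL_PAGES = 193535 };

/* Sum of the three zones on node 0. */
static unsigned long node_pages_sketch(void)
{
	return (unsigned long)DMA_PAGES + DMA32_PAGES + NORMAL_PAGES;
}

/*
 * True if the balanced zones cover at least a quarter of the node,
 * which is how the 25% requirement is read here.
 */
static bool node_balanced_sketch(unsigned long balanced_pages,
				 unsigned long node_pages)
{
	return balanced_pages >= (node_pages >> 2);
}
```

The three zones add up to 1028557 pages, so a quarter is 257139 pages.
Under these assumptions the Normal zone alone (193535 pages) falls short
of that, while DMA32 alone (831109 pages) exceeds it.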

Is there any way I could force and test a different memory layout? I'm
slightly lost among all the memory models (if I have a choice at all), so
if you have any suggestions, I'm all ears.

Maybe I could limit available memory and thus have only a DMA32 zone, just
to prove your theory? I remember doing tuning like that many years ago
when I had more time to play with Linux MM. Unfortunately I haven't had
much time lately, so I'm a bit rusty, but I'm willing to help with testing
and resolving this issue.

-- 
Zlatko

* Re: kswapd craziness in 3.7
  2012-12-10 18:01                                                 ` Mel Gorman
@ 2012-12-10 18:33                                                   ` Zlatko Calusic
  2012-12-10 19:13                                                     ` Linus Torvalds
  0 siblings, 1 reply; 65+ messages in thread
From: Zlatko Calusic @ 2012-12-10 18:33 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Johannes Weiner, Linus Torvalds, Rik van Riel, linux-mm,
	Linux Kernel Mailing List

On 10.12.2012 19:01, Mel Gorman wrote:
> In this last-minute disaster, I'm not thinking properly at all any more. The
> shrink slab disabling should have happened before the loop_again but even
> then it's wrong because it's just covering over the problem.
>
> The way order and testorder interact with how balanced is calculated means
> that we potentially call shrink_slab() multiple times and that thing is
> global in nature and basically uncontrolled. You could argue that we should
> only call shrink_slab() if order-0 watermarks are not met but that will
> not necessarily prevent kswapd reclaiming too much. It keeps going back
> to balance_pgdat needing its list of requirements drawn up and some
> major surgery, and we're not going to do that as a quick hack.
>

I was about to apply the patch that you sent, and reboot the server, but 
it seems there's no point because the patch is flawed?

Anyway, if and when you have a proper one, I'll be glad to test it for 
you and report results.
-- 
Zlatko

* Re: kswapd craziness in 3.7
  2012-12-10 18:33                                                   ` Zlatko Calusic
@ 2012-12-10 19:13                                                     ` Linus Torvalds
  2012-12-10 20:35                                                       ` Zlatko Calusic
  0 siblings, 1 reply; 65+ messages in thread
From: Linus Torvalds @ 2012-12-10 19:13 UTC (permalink / raw)
  To: Zlatko Calusic, Andrew Morton
  Cc: Mel Gorman, Johannes Weiner, Rik van Riel, linux-mm,
	Linux Kernel Mailing List

On Mon, Dec 10, 2012 at 10:33 AM, Zlatko Calusic
<zlatko.calusic@iskon.hr> wrote:
>
> I was about to apply the patch that you sent, and reboot the server, but it
> seems there's no point because the patch is flawed?
>
> Anyway, if and when you have a proper one, I'll be glad to test it for you
> and report results.

I have reverted (again) the __GFP_NO_KSWAPD removal, and considering
that it really looks like there are overwhelming reasons to have that
flag, I will *not* take some new patch to revert it. I'm getting
convinced that the original removal really was bogus, and had no
actual valid reason for it.

Part of that is that I noticed that non-THP allocations wanted to use
it too. The i915 driver had wanted to use __GFP_NO_KSWAPD because it
too didn't want to start some cleaning thread. The whole mindset that
kswapd is somehow better than direct reclaim, or needed when it fails,
is broken. Some allocations simply *will* fail, without necessarily
wanting kswapd to be started. THP - where the high order of the
allocation means that failure is inevitable under some fragmentation
circumstances - is just one such case.

I also reverted one of the "fix up the mess from removing
__GFP_NO_KSWAPD" patches, because that one was an obvious workaround
that tried to re-introduce the "let's not wake up kswapd after all for
that case" logic. It clashed with a clean revert, and it was pointless in
the presence of __GFP_NO_KSWAPD anyway.

I did *not* revert some of the other fixup patches that tried to help
kswapd balancing decisions and avoid excessive CPU use other ways. So
some remains of this whole saga do still remain, but they look fairly
minimal.

It's worth giving this as much testing as is at all possible, but at
the same time I really don't think I can delay 3.7 any more without
messing up the holiday season too much. So unless something obvious
pops up, I will do the release tonight. So testing will be minimal -
but it's not like we haven't gone back-and-forth on this several times
already, and we revert to *mostly* the same old state as 3.6 anyway,
so it should be fairly safe.

                       Linus

* Re: kswapd craziness in 3.7
  2012-12-10 19:13                                                     ` Linus Torvalds
@ 2012-12-10 20:35                                                       ` Zlatko Calusic
  2012-12-10 21:28                                                         ` Linus Torvalds
  2012-12-11  0:19                                                         ` Zlatko Calusic
  0 siblings, 2 replies; 65+ messages in thread
From: Zlatko Calusic @ 2012-12-10 20:35 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Mel Gorman, Johannes Weiner, Rik van Riel,
	linux-mm, Linux Kernel Mailing List

On 10.12.2012 20:13, Linus Torvalds wrote:
> 
> It's worth giving this as much testing as is at all possible, but at
> the same time I really don't think I can delay 3.7 any more without
> messing up the holiday season too much. So unless something obvious
> pops up, I will do the release tonight. So testing will be minimal -
> but it's not like we haven't gone back-and-forth on this several times
> already, and we revert to *mostly* the same old state as 3.6 anyway,
> so it should be fairly safe.
> 

It compiles and boots without a hitch, so it must be perfect. :)

Seriously, a few more hours need to pass, until I can provide more convincing data. That's how long it takes on this particular machine for memory pressure to build up and memory fragmentation to ensue. Only then I'll be able to tell how it really behaves. I promise to get back as soon as I can.

And funny thing that you mention i915, because yesterday my daughter managed to lock up our laptop hard (that was a first), and this is what I found in kern.log after restart:

Dec  9 21:29:42 titan vmunix: general protection fault: 0000 [#1] PREEMPT SMP 
Dec  9 21:29:42 titan vmunix: Modules linked in: vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) [last unloaded: microcode]
Dec  9 21:29:42 titan vmunix: CPU 2 
Dec  9 21:29:42 titan vmunix: Pid: 2523, comm: Xorg Tainted: G           O 3.7.0-rc8 #1 Hewlett-Packard HP Pavilion dv7 Notebook PC/144B
Dec  9 21:29:42 titan vmunix: RIP: 0010:[<ffffffff81090b9c>]  [<ffffffff81090b9c>] find_get_page+0x3c/0x90
Dec  9 21:29:42 titan vmunix: RSP: 0018:ffff88014d9f7928  EFLAGS: 00010246
Dec  9 21:29:42 titan vmunix: RAX: ffff880052594bc8 RBX: 0200000000000000 RCX: 00000000fffffffa
Dec  9 21:29:42 titan vmunix: RDX: 0000000000000001 RSI: ffff880052594bc8 RDI: 0000000000000000
Dec  9 21:29:42 titan vmunix: RBP: ffff88014d9f7948 R08: 0200000000000000 R09: ffff880052594b18
Dec  9 21:29:42 titan vmunix: R10: 57ffe4cbb74d1280 R11: 0000000000000000 R12: ffff88011c959a90
Dec  9 21:29:42 titan vmunix: R13: 0000000000000053 R14: 0000000000000000 R15: 0000000000000053
Dec  9 21:29:42 titan vmunix: FS:  00007fcd8d413880(0000) GS:ffff880157c80000(0000) knlGS:0000000000000000
Dec  9 21:29:42 titan vmunix: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec  9 21:29:42 titan vmunix: CR2: ffffffffff600400 CR3: 000000014d937000 CR4: 00000000000007e0
Dec  9 21:29:42 titan vmunix: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Dec  9 21:29:42 titan vmunix: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Dec  9 21:29:42 titan vmunix: Process Xorg (pid: 2523, threadinfo ffff88014d9f6000, task ffff88014d9c1260)
Dec  9 21:29:42 titan vmunix: Stack:
Dec  9 21:29:42 titan vmunix:  ffff88014d9f7958 ffff88011c959a88 0000000000000053 ffff88011c959a88
Dec  9 21:29:42 titan vmunix:  ffff88014d9f7978 ffffffff81090e21 0000000000000001 ffffea00014d1280
Dec  9 21:29:42 titan vmunix:  ffff88011c959960 0000000000000001 ffff88014d9f7a28 ffffffff810a1b60
Dec  9 21:29:42 titan vmunix: Call Trace:
Dec  9 21:29:42 titan vmunix:  [<ffffffff81090e21>] find_lock_page+0x21/0x80
Dec  9 21:29:42 titan vmunix:  [<ffffffff810a1b60>] shmem_getpage_gfp+0xa0/0x620
Dec  9 21:29:42 titan vmunix:  [<ffffffff810a224c>] shmem_read_mapping_page_gfp+0x2c/0x50
Dec  9 21:29:42 titan vmunix:  [<ffffffff812b3611>] i915_gem_object_get_pages_gtt+0xe1/0x270
Dec  9 21:29:42 titan vmunix:  [<ffffffff812b127f>] i915_gem_object_get_pages+0x4f/0x90
Dec  9 21:29:42 titan vmunix:  [<ffffffff812b1383>] i915_gem_object_bind_to_gtt+0xc3/0x4c0
Dec  9 21:29:42 titan vmunix:  [<ffffffff812b4413>] i915_gem_object_pin+0x123/0x190
Dec  9 21:29:42 titan vmunix:  [<ffffffff812b7d97>] i915_gem_execbuffer_reserve_object.isra.13+0x77/0x190
Dec  9 21:29:42 titan vmunix:  [<ffffffff812b8171>] i915_gem_execbuffer_reserve.isra.14+0x2c1/0x320
Dec  9 21:29:42 titan vmunix:  [<ffffffff812b87b2>] i915_gem_do_execbuffer.isra.17+0x5e2/0x11b0
Dec  9 21:29:42 titan vmunix:  [<ffffffff812b9894>] i915_gem_execbuffer2+0x94/0x280
Dec  9 21:29:42 titan vmunix:  [<ffffffff81287de3>] drm_ioctl+0x493/0x530
Dec  9 21:29:42 titan vmunix:  [<ffffffff812b9800>] ? i915_gem_execbuffer+0x480/0x480
Dec  9 21:29:42 titan vmunix:  [<ffffffff810d9cbf>] do_vfs_ioctl+0x8f/0x530
Dec  9 21:29:42 titan vmunix:  [<ffffffff810da1ab>] sys_ioctl+0x4b/0x90
Dec  9 21:29:42 titan vmunix:  [<ffffffff810c9e2d>] ? sys_read+0x4d/0xa0
Dec  9 21:29:42 titan vmunix:  [<ffffffff8154a4d2>] system_call_fastpath+0x16/0x1b
Dec  9 21:29:42 titan vmunix: Code: 63 08 48 83 ec 08 e8 84 9c fb ff 4c 89 ee 4c 89 e7 e8 89 b7 15 00 48 85 c0 48 89 c6 74 41 48 8b 18 48 85 db 74 1f f6 c3 03 75 3c <8b> 53 1c 85 d2 74 d9 8d 7a 01 89 d0 f0 0f b1 7b 1c 39 c2 75 23 
Dec  9 21:29:42 titan vmunix: RIP  [<ffffffff81090b9c>] find_get_page+0x3c/0x90
Dec  9 21:29:42 titan vmunix:  RSP <ffff88014d9f7928>

It seems that whenever (if ever?) __GFP_NO_KSWAPD removal is attempted again, the i915 driver will need to be taken better care of.
-- 
Zlatko

* Re: kswapd craziness in 3.7
  2012-12-10 20:35                                                       ` Zlatko Calusic
@ 2012-12-10 21:28                                                         ` Linus Torvalds
  2012-12-10 21:42                                                           ` Borislav Petkov
  2012-12-10 23:27                                                           ` Hugh Dickins
  2012-12-11  0:19                                                         ` Zlatko Calusic
  1 sibling, 2 replies; 65+ messages in thread
From: Linus Torvalds @ 2012-12-10 21:28 UTC (permalink / raw)
  To: Zlatko Calusic
  Cc: Andrew Morton, Mel Gorman, Johannes Weiner, Rik van Riel,
	linux-mm, Linux Kernel Mailing List, Hugh Dickins

[ Adding Hugh Dickins because of the shmem oops. ]

On Mon, Dec 10, 2012 at 12:35 PM, Zlatko Calusic
<zlatko.calusic@iskon.hr> wrote:
>
> And funny thing that you mention i915, because yesterday my daughter managed to lock up our laptop hard (that was a first), and this is what I found in kern.log after restart:
>
> Dec  9 21:29:42 titan vmunix: general protection fault: 0000 [#1] PREEMPT SMP
> Dec  9 21:29:42 titan vmunix: Modules linked in: vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) [last unloaded: microcode]
> Dec  9 21:29:42 titan vmunix: CPU 2
> Dec  9 21:29:42 titan vmunix: Pid: 2523, comm: Xorg Tainted: G           O 3.7.0-rc8 #1 Hewlett-Packard HP Pavilion dv7 Notebook PC/144B
> Dec  9 21:29:42 titan vmunix: RIP: 0010:[<ffffffff81090b9c>]  [<ffffffff81090b9c>] find_get_page+0x3c/0x90

Ho humm..

I'm not convinced this is related.

> Dec  9 21:29:42 titan vmunix: Call Trace:
> Dec  9 21:29:42 titan vmunix:  [<ffffffff81090e21>] find_lock_page+0x21/0x80
> Dec  9 21:29:42 titan vmunix:  [<ffffffff810a1b60>] shmem_getpage_gfp+0xa0/0x620
> Dec  9 21:29:42 titan vmunix:  [<ffffffff810a224c>] shmem_read_mapping_page_gfp+0x2c/0x50
> Dec  9 21:29:42 titan vmunix:  [<ffffffff812b3611>] i915_gem_object_get_pages_gtt+0xe1/0x270
> Dec  9 21:29:42 titan vmunix:  [<ffffffff812b127f>] i915_gem_object_get_pages+0x4f/0x90
> Dec  9 21:29:42 titan vmunix:  [<ffffffff812b1383>] i915_gem_object_bind_to_gtt+0xc3/0x4c0
> Dec  9 21:29:42 titan vmunix:  [<ffffffff812b4413>] i915_gem_object_pin+0x123/0x190
> Dec  9 21:29:42 titan vmunix:  [<ffffffff812b7d97>] i915_gem_execbuffer_reserve_object.isra.13+0x77/0x190
> Dec  9 21:29:42 titan vmunix:  [<ffffffff812b8171>] i915_gem_execbuffer_reserve.isra.14+0x2c1/0x320
> Dec  9 21:29:42 titan vmunix:  [<ffffffff812b87b2>] i915_gem_do_execbuffer.isra.17+0x5e2/0x11b0
> Dec  9 21:29:42 titan vmunix:  [<ffffffff812b9894>] i915_gem_execbuffer2+0x94/0x280
> Dec  9 21:29:42 titan vmunix:  [<ffffffff81287de3>] drm_ioctl+0x493/0x530
> Dec  9 21:29:42 titan vmunix:  [<ffffffff810d9cbf>] do_vfs_ioctl+0x8f/0x530
> Dec  9 21:29:42 titan vmunix:  [<ffffffff810da1ab>] sys_ioctl+0x4b/0x90
> Dec  9 21:29:42 titan vmunix:  [<ffffffff8154a4d2>] system_call_fastpath+0x16/0x1b
>
> It seems that whenever (if ever?) __GFP_NO_KSWAPD removal is attempted again, the i915 driver will need to be taken better care of.

That decodes to

  11: e8 89 b7 15 00       callq  0x15b79f  # radix_tree_lookup_slot
  16: 48 85 c0             test   %rax,%rax
  19: 48 89 c6             mov    %rax,%rsi
  1c: 74 41                 je     0x5f
  1e: 48 8b 18             mov    (%rax),%rbx  #
  21: 48 85 db             test   %rbx,%rbx
  24: 74 1f                 je     0x45
  26: f6 c3 03             test   $0x3,%bl
  29: 75 3c                 jne    0x67
  2b:* 8b 53 1c             mov    0x1c(%rbx),%edx     <-- trapping instruction
  2e: 85 d2                 test   %edx,%edx
  30: 74 d9                 je     0xb

where %rbx is 0x0200000000000000. That looks like it could be a
single-bit error, and should have been zero.

It's the "atomic_read(&page->counter)" which is part of
"page_cache_get_speculative()" as far as I can tell, and it's the
"page" pointer that is that odd (non-pointer) value. The fact that
%ecx contains the value "-6" makes me wonder if there was a -ENXIO
somewhere, though.
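
Both observations are easy to check with the register values from the
dump earlier in the thread (RBX and RCX). The sketch below is only that
arithmetic; the macro and function names are made up for illustration.

```c
#include <errno.h>
#include <stdint.h>

/* Register values copied from the oops earlier in the thread. */
#define OOPS_RBX 0x0200000000000000ULL
#define OOPS_RCX 0x00000000fffffffaULL

/* Number of set bits; 1 means the value is a single bit flip from zero. */
static int count_set_bits(uint64_t v)
{
	int n = 0;

	while (v) {
		v &= v - 1;	/* clear the lowest set bit */
		n++;
	}
	return n;
}

/* Index of the lowest set bit; v must be non-zero. */
static int lowest_set_bit(uint64_t v)
{
	int bit = 0;

	while (!(v & 1)) {
		v >>= 1;
		bit++;
	}
	return bit;
}

/* Interpret the low 32 bits of a register (%ecx within %rcx) as signed. */
static int64_t low32_as_signed(uint64_t reg)
{
	int64_t v = (int64_t)(reg & 0xffffffffULL);

	return v >= 0x80000000LL ? v - 0x100000000LL : v;
}
```

OOPS_RBX has exactly one bit set (bit 57), and the low 32 bits of
OOPS_RCX read as -6, which matches -ENXIO on Linux. An address with only
bit 57 set is non-canonical on x86-64, which would be consistent with
the report being a general protection fault rather than a page fault.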

None of it looks all that much related to whether the i915 driver uses
__GFP_NO_KSWAPD or not, though.

                Linus

* Re: kswapd craziness in 3.7
  2012-12-10 21:28                                                         ` Linus Torvalds
@ 2012-12-10 21:42                                                           ` Borislav Petkov
  2012-12-10 21:47                                                             ` Linus Torvalds
  2012-12-10 23:27                                                           ` Hugh Dickins
  1 sibling, 1 reply; 65+ messages in thread
From: Borislav Petkov @ 2012-12-10 21:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Zlatko Calusic, Andrew Morton, Mel Gorman, Johannes Weiner,
	Rik van Riel, linux-mm, Linux Kernel Mailing List, Hugh Dickins

On Mon, Dec 10, 2012 at 01:28:54PM -0800, Linus Torvalds wrote:
> [ Adding Hugh Dickins because of the shmem oops. ]
> 
> On Mon, Dec 10, 2012 at 12:35 PM, Zlatko Calusic
> <zlatko.calusic@iskon.hr> wrote:
> >
> > And funny thing that you mention i915, because yesterday my daughter managed to lock up our laptop hard (that was a first), and this is what I found in kern.log after restart:
> >
> > Dec  9 21:29:42 titan vmunix: general protection fault: 0000 [#1] PREEMPT SMP
> > Dec  9 21:29:42 titan vmunix: Modules linked in: vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) [last unloaded: microcode]
> > Dec  9 21:29:42 titan vmunix: CPU 2
> > Dec  9 21:29:42 titan vmunix: Pid: 2523, comm: Xorg Tainted: G           O 3.7.0-rc8 #1 Hewlett-Packard HP Pavilion dv7 Notebook PC/144B
> > Dec  9 21:29:42 titan vmunix: RIP: 0010:[<ffffffff81090b9c>]  [<ffffffff81090b9c>] find_get_page+0x3c/0x90
> 
> Ho humm..
> 
> I'm not convinced this is related.
> 
> > Dec  9 21:29:42 titan vmunix: Call Trace:
> > Dec  9 21:29:42 titan vmunix:  [<ffffffff81090e21>] find_lock_page+0x21/0x80
> > Dec  9 21:29:42 titan vmunix:  [<ffffffff810a1b60>] shmem_getpage_gfp+0xa0/0x620
> > Dec  9 21:29:42 titan vmunix:  [<ffffffff810a224c>] shmem_read_mapping_page_gfp+0x2c/0x50
> > Dec  9 21:29:42 titan vmunix:  [<ffffffff812b3611>] i915_gem_object_get_pages_gtt+0xe1/0x270
> > Dec  9 21:29:42 titan vmunix:  [<ffffffff812b127f>] i915_gem_object_get_pages+0x4f/0x90
> > Dec  9 21:29:42 titan vmunix:  [<ffffffff812b1383>] i915_gem_object_bind_to_gtt+0xc3/0x4c0
> > Dec  9 21:29:42 titan vmunix:  [<ffffffff812b4413>] i915_gem_object_pin+0x123/0x190
> > Dec  9 21:29:42 titan vmunix:  [<ffffffff812b7d97>] i915_gem_execbuffer_reserve_object.isra.13+0x77/0x190
> > Dec  9 21:29:42 titan vmunix:  [<ffffffff812b8171>] i915_gem_execbuffer_reserve.isra.14+0x2c1/0x320
> > Dec  9 21:29:42 titan vmunix:  [<ffffffff812b87b2>] i915_gem_do_execbuffer.isra.17+0x5e2/0x11b0
> > Dec  9 21:29:42 titan vmunix:  [<ffffffff812b9894>] i915_gem_execbuffer2+0x94/0x280
> > Dec  9 21:29:42 titan vmunix:  [<ffffffff81287de3>] drm_ioctl+0x493/0x530
> > Dec  9 21:29:42 titan vmunix:  [<ffffffff810d9cbf>] do_vfs_ioctl+0x8f/0x530
> > Dec  9 21:29:42 titan vmunix:  [<ffffffff810da1ab>] sys_ioctl+0x4b/0x90
> > Dec  9 21:29:42 titan vmunix:  [<ffffffff8154a4d2>] system_call_fastpath+0x16/0x1b
> >
> > It seems that whenever (if ever?) __GFP_NO_KSWAPD removal is attempted again, the i915 driver will need to be taken better care of.
> 
> That decodes to
> 
>   11: e8 89 b7 15 00       callq  0x15b79f  # radix_tree_lookup_slot
>   16: 48 85 c0             test   %rax,%rax
>   19: 48 89 c6             mov    %rax,%rsi
>   1c: 74 41                 je     0x5f
>   1e: 48 8b 18             mov    (%rax),%rbx  #
>   21: 48 85 db             test   %rbx,%rbx
>   24: 74 1f                 je     0x45
>   26: f6 c3 03             test   $0x3,%bl
>   29: 75 3c                 jne    0x67
>   2b:* 8b 53 1c             mov    0x1c(%rbx),%edx     <-- trapping instruction
>   2e: 85 d2                 test   %edx,%edx
>   30: 74 d9                 je     0xb
> 
> where %rbx is 0x0200000000000000. That looks like it could be a
> single-bit error, and should have been zero.
> 
> It's the "atomic_read(&page->counter)" which is part of
> "page_cache_get_speculative()" as far as I can tell, and it's the
> "page" pointer that is that odd (non-pointer) value. The fact that
> %ecx contains the value "-6" makes me wonder if there was a -ENXIO
> somewhere, though.
> 
> None of it looks all that much related to whether the i915 driver uses
> __GFP_NO_KSWAPD or not, though.

Aren't we gonna consider the out-of-tree vbox modules being loaded and
causing some corruptions like maybe the single-bit error above?

I'm also thinking of this here: https://lkml.org/lkml/2011/10/6/317

Hmm.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

* Re: kswapd craziness in 3.7
  2012-12-10 21:42                                                           ` Borislav Petkov
@ 2012-12-10 21:47                                                             ` Linus Torvalds
  2012-12-10 21:54                                                               ` Borislav Petkov
  0 siblings, 1 reply; 65+ messages in thread
From: Linus Torvalds @ 2012-12-10 21:47 UTC (permalink / raw)
  To: Borislav Petkov, Linus Torvalds, Zlatko Calusic, Andrew Morton,
	Mel Gorman, Johannes Weiner, Rik van Riel, linux-mm,
	Linux Kernel Mailing List, Hugh Dickins

On Mon, Dec 10, 2012 at 1:42 PM, Borislav Petkov <bp@alien8.de> wrote:
>
> Aren't we gonna consider the out-of-tree vbox modules being loaded and
> causing some corruptions like maybe the single-bit error above?
>
> I'm also thinking of this here: https://lkml.org/lkml/2011/10/6/317

Yup, that looks more likely, I agree.

                Linus

* Re: kswapd craziness in 3.7
  2012-12-10 21:47                                                             ` Linus Torvalds
@ 2012-12-10 21:54                                                               ` Borislav Petkov
  2012-12-10 22:15                                                                 ` Zlatko Calusic
  0 siblings, 1 reply; 65+ messages in thread
From: Borislav Petkov @ 2012-12-10 21:54 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Zlatko Calusic, Andrew Morton, Mel Gorman, Johannes Weiner,
	Rik van Riel, linux-mm, Linux Kernel Mailing List, Hugh Dickins

On Mon, Dec 10, 2012 at 01:47:23PM -0800, Linus Torvalds wrote:
> On Mon, Dec 10, 2012 at 1:42 PM, Borislav Petkov <bp@alien8.de> wrote:
> >
> > Aren't we gonna consider the out-of-tree vbox modules being loaded and
> > causing some corruptions like maybe the single-bit error above?
> >
> > I'm also thinking of this here: https://lkml.org/lkml/2011/10/6/317
> 
> Yup, that looks more likely, I agree.

@Zlatko: can your daughter try to retrigger the freeze without the vbox
modules loaded?

Thanks.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: kswapd craziness in 3.7
  2012-12-10 21:54                                                               ` Borislav Petkov
@ 2012-12-10 22:15                                                                 ` Zlatko Calusic
  0 siblings, 0 replies; 65+ messages in thread
From: Zlatko Calusic @ 2012-12-10 22:15 UTC (permalink / raw)
  To: Borislav Petkov, Linus Torvalds, Andrew Morton, Mel Gorman,
	Johannes Weiner, Rik van Riel, linux-mm,
	Linux Kernel Mailing List, Hugh Dickins

On 10.12.2012 22:54, Borislav Petkov wrote:
> On Mon, Dec 10, 2012 at 01:47:23PM -0800, Linus Torvalds wrote:
>> On Mon, Dec 10, 2012 at 1:42 PM, Borislav Petkov <bp@alien8.de> wrote:
>>>
>>> Aren't we gonna consider the out-of-tree vbox modules being loaded and
>>> causing some corruptions like maybe the single-bit error above?
>>>
>>> I'm also thinking of this here: https://lkml.org/lkml/2011/10/6/317
>>
>> Yup, that looks more likely, I agree.
>
> @Zlatko: can your daughter try to retrigger the freeze without the vbox
> modules loaded?
>

Sure thing! :)

Note, though, that the vbox modules were only loaded; no VM was running
at the time the lockup happened. But I've just read the whole thread you
mention above and I understand the concern. I'll make sure the vbox
modules are unloaded when not really needed (most of the time on that
machine), in case the lockup happens again.

Next time my daughter plays online games, I'll tell her she's actually 
serving a greater purpose, and let her take her time. :)
-- 
Zlatko

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: kswapd craziness in 3.7
  2012-12-10 21:28                                                         ` Linus Torvalds
  2012-12-10 21:42                                                           ` Borislav Petkov
@ 2012-12-10 23:27                                                           ` Hugh Dickins
  1 sibling, 0 replies; 65+ messages in thread
From: Hugh Dickins @ 2012-12-10 23:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Zlatko Calusic, Borislav Petkov, Andrew Morton, Mel Gorman,
	Johannes Weiner, Rik van Riel, linux-mm,
	Linux Kernel Mailing List

On Mon, 10 Dec 2012, Linus Torvalds wrote:
> [ Adding Hugh Dickins because of the shmem oops. ]

I had already noticed, and was about to reply; but only then refreshed
my mbox window, to find that you've already done it all for me: thanks.

> 
> On Mon, Dec 10, 2012 at 12:35 PM, Zlatko Calusic
> <zlatko.calusic@iskon.hr> wrote:
> >
> > And funny thing that you mention i915, because yesterday my daughter managed to lock up our laptop hard (that was a first), and this is what I found in kern.log after restart:
> >
> > Dec  9 21:29:42 titan vmunix: general protection fault: 0000 [#1] PREEMPT SMP
> > Dec  9 21:29:42 titan vmunix: Modules linked in: vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) [last unloaded: microcode]
> > Dec  9 21:29:42 titan vmunix: CPU 2
> > Dec  9 21:29:42 titan vmunix: Pid: 2523, comm: Xorg Tainted: G           O 3.7.0-rc8 #1 Hewlett-Packard HP Pavilion dv7 Notebook PC/144B
> > Dec  9 21:29:42 titan vmunix: RIP: 0010:[<ffffffff81090b9c>]  [<ffffffff81090b9c>] find_get_page+0x3c/0x90
> 
> Ho humm..
> 
> I'm not convinced this is related.
> 
> > Dec  9 21:29:42 titan vmunix: Call Trace:
> > Dec  9 21:29:42 titan vmunix:  [<ffffffff81090e21>] find_lock_page+0x21/0x80
> > Dec  9 21:29:42 titan vmunix:  [<ffffffff810a1b60>] shmem_getpage_gfp+0xa0/0x620
> > Dec  9 21:29:42 titan vmunix:  [<ffffffff810a224c>] shmem_read_mapping_page_gfp+0x2c/0x50
> > Dec  9 21:29:42 titan vmunix:  [<ffffffff812b3611>] i915_gem_object_get_pages_gtt+0xe1/0x270
> > Dec  9 21:29:42 titan vmunix:  [<ffffffff812b127f>] i915_gem_object_get_pages+0x4f/0x90
> > Dec  9 21:29:42 titan vmunix:  [<ffffffff812b1383>] i915_gem_object_bind_to_gtt+0xc3/0x4c0
> > Dec  9 21:29:42 titan vmunix:  [<ffffffff812b4413>] i915_gem_object_pin+0x123/0x190
> > Dec  9 21:29:42 titan vmunix:  [<ffffffff812b7d97>] i915_gem_execbuffer_reserve_object.isra.13+0x77/0x190
> > Dec  9 21:29:42 titan vmunix:  [<ffffffff812b8171>] i915_gem_execbuffer_reserve.isra.14+0x2c1/0x320
> > Dec  9 21:29:42 titan vmunix:  [<ffffffff812b87b2>] i915_gem_do_execbuffer.isra.17+0x5e2/0x11b0
> > Dec  9 21:29:42 titan vmunix:  [<ffffffff812b9894>] i915_gem_execbuffer2+0x94/0x280
> > Dec  9 21:29:42 titan vmunix:  [<ffffffff81287de3>] drm_ioctl+0x493/0x530
> > Dec  9 21:29:42 titan vmunix:  [<ffffffff810d9cbf>] do_vfs_ioctl+0x8f/0x530
> > Dec  9 21:29:42 titan vmunix:  [<ffffffff810da1ab>] sys_ioctl+0x4b/0x90
> > Dec  9 21:29:42 titan vmunix:  [<ffffffff8154a4d2>] system_call_fastpath+0x16/0x1b
> >
> > It seems that whenever (if ever?) GFP_NO_KSWAPD removal is attempted again, the i915 driver will need to be taken better care of.
> 
> That decodes to
> 
>   11: e8 89 b7 15 00       callq  0x15b79f  # radix_tree_lookup_slot
>   16: 48 85 c0             test   %rax,%rax
>   19: 48 89 c6             mov    %rax,%rsi
>   1c: 74 41                 je     0x5f
>   1e: 48 8b 18             mov    (%rax),%rbx  #
>   21: 48 85 db             test   %rbx,%rbx
>   24: 74 1f                 je     0x45
>   26: f6 c3 03             test   $0x3,%bl
>   29: 75 3c                 jne    0x67
>   2b:* 8b 53 1c             mov    0x1c(%rbx),%edx     <-- trapping instruction
>   2e: 85 d2                 test   %edx,%edx
>   30: 74 d9                 je     0xb
> 
> where %rbx is 0x0200000000000000. That looks like it could be a
> single-bit error, and should have been zero.
> 
> It's the "atomic_read(&page->counter)" which is part of
> "page_cache_get_speculative()" as far as I can tell, and it's the
> "page" pointer that is that odd (non-pointer) value. The fact that
> %ecx contains the value "-6" makes me wonder if there was a -ENXIO
> somewhere, though.

Yes, just what I was about to say; except I never considered the -6.

I was going to suggest it's a new notebook with not-so-good memory,
but see that Borislav has since made a better suggestion.
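
For illustration, the single-bit reasoning quoted above can be checked
mechanically. This is only a sketch: the helper names are made up for
this note, 48-bit virtual addresses are assumed (typical for x86-64
hardware of that time), and ENXIO is taken to be 6 as on Linux. A
non-canonical address would also be consistent with the oops being a
general protection fault rather than a page fault.

```python
ENXIO = 6  # assumption: Linux errno value behind the "-ENXIO" remark


def flipped_bits(value, expected=0, width=64):
    """Return the bit positions where value differs from expected."""
    diff = (value ^ expected) & ((1 << width) - 1)
    return [i for i in range(width) if (diff >> i) & 1]


def is_canonical(addr, va_bits=48):
    """True if bits 63..(va_bits - 1) of a 64-bit address are all equal."""
    top = addr >> (va_bits - 1)
    all_ones = (1 << (64 - va_bits + 1)) - 1
    return top == 0 or top == all_ones


rbx = 0x0200000000000000          # value reported for %rbx in the oops
print(flipped_bits(rbx))          # -> [57], a single bit away from zero
print(is_canonical(rbx + 0x1c))   # address read by "mov 0x1c(%rbx),%edx"
print(hex(-ENXIO & 0xffffffff))   # how -6 would look in a 32-bit register
```

Under these assumptions the "page" value differs from zero in exactly
one bit, and the address the trapping mov dereferences is not a valid
canonical pointer.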

> 
> None of it looks all that much related to whether the i915 driver uses
> GFP_NO_KSWAPD or not, though.

Yes, no evidence here of anything to delay 3.7 further.

I'm running on current git, and no problems observed; but then, I never
did see any of these kswapd problems anyway.  And, in particular, I was
unable to reproduce Zlatko's 1GB of 4GB kept free (on yesterday's tree,
with no swap) - I saw about 100MB kept free.

Hugh

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: kswapd craziness in 3.7
  2012-12-10 20:35                                                       ` Zlatko Calusic
  2012-12-10 21:28                                                         ` Linus Torvalds
@ 2012-12-11  0:19                                                         ` Zlatko Calusic
  2012-12-11 21:56                                                           ` Zlatko Calusic
  2012-12-19 22:24                                                           ` Zlatko Calusic
  1 sibling, 2 replies; 65+ messages in thread
From: Zlatko Calusic @ 2012-12-11  0:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Mel Gorman, Johannes Weiner, Rik van Riel,
	linux-mm, Linux Kernel Mailing List, Hugh Dickins

[-- Attachment #1: Type: text/plain, Size: 2432 bytes --]

> On 10.12.2012 20:13, Linus Torvalds wrote:
>>
>> It's worth giving this as much testing as is at all possible, but at
>> the same time I really don't think I can delay 3.7 any more without
>> messing up the holiday season too much. So unless something obvious
>> pops up, I will do the release tonight. So testing will be minimal -
>> but it's not like we haven't gone back-and-forth on this several times
>> already, and we revert to *mostly* the same old state as 3.6 anyway,
>> so it should be fairly safe.
>>

So, here's what I found. In short: close, but no cigar!

Kswapd is certainly no longer a CPU pig, and memory seems to be utilized
properly (the kernel still likes to keep 400MB free, somebody else can
confirm if that's to be expected on a 4GB THP-enabled machine). So it
looks very decent, and much better than anything I ran in the last 10
days, barring the !THP kernel.
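
For anyone who wants to compare numbers, a minimal sketch of how the
"kept free" share could be measured from /proc/meminfo. The function
name and the sample text are hypothetical; the sample simply mirrors the
400MB-of-4GB figure mentioned above and is not real output from the
machine in question.

```python
def free_fraction(meminfo_text):
    """Return MemFree / MemTotal from the text of /proc/meminfo."""
    fields = {}
    for line in meminfo_text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts:
            fields[name.strip()] = int(parts[0])  # values are in kB
    return fields["MemFree"] / fields["MemTotal"]


# Hypothetical sample: 4 GiB total, 400 MiB free.
sample = "MemTotal:        4194304 kB\nMemFree:          409600 kB\n"
print(round(free_fraction(sample) * 100, 1), "% free")
```

On a live system the same function can be fed the contents of
/proc/meminfo instead of the sample string.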

What remains a mystery is that kswapd occasionally still likes to get
stuck in a D state, only now it recovers faster than before (sometimes 
in a matter of seconds, but sometimes it takes a few minutes). Now, I 
admit it's a small, maybe even cosmetic issue. But, it could also be a 
warning sign of a bigger problem that will reveal itself on a more 
loaded machine.

I will now make one last attempt: I've just reverted two of Johannes'
commits that were also applied in an attempt to fix the breakage that
removing gfp_no_kswapd introduced, namely ed23ec4 & c702418. For various
reasons the results of this test will be available tomorrow, so it's
your call, Linus.

To better document this whole mess from my point of view, I've attached
two graphs. The first one nicely shows the kswapd frenzy a week ago (blue
mountains on the CPU graph). On Thu 06 & Mon 10 (until a few hours ago) I
ran !THP kernels; the better memory utilization is, I think, obvious
(look at the bottom graph, lots of green is evil). The CPU spikes at the
end of every day are daily backup runs, which are CPU, NOT I/O bound.
Notice the load average close to 1 on !THP kernels (as it should be), and
almost 2 (Fri & Sat 08) when the backup fought with kswapd (and also big
CPU iowait times in that case). Finally, today's run is somewhere in
between, that's why it deserves the "close, but no cigar" qualification. ;)

The last graph shows CPU usage in more detail: yesterday's run was on a
!THP kernel, today's THP run is the one with red spikes. That was kswapd
in D state, in congestion_wait().
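
To catch such episodes without watching top, a small sketch that reads
the task state out of a /proc/<pid>/stat line. The function names and
the sample line are hypothetical (the pid and the trailing fields are
invented); only the general layout "pid (comm) state ..." is relied on.

```python
def task_state(stat_line):
    """Return (comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or
    # parentheses, so split on the last ')' instead of on whitespace.
    left, _, right = stat_line.rpartition(")")
    comm = left.split("(", 1)[1]
    state = right.split()[0]
    return comm, state


def is_uninterruptible(stat_line):
    """True if the task is in uninterruptible sleep (state D)."""
    return task_state(stat_line)[1] == "D"


# Hypothetical sample line, truncated after a few fields.
sample = "35 (kswapd0) D 2 0 0 0 -1 0 0"
print(task_state(sample), is_uninterruptible(sample))
```

Polling this once a second for the kswapd pid and logging the
timestamps would show how long each D-state episode lasts.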

-- 
Zlatko

[-- Attachment #2: screenshot1.png --]
[-- Type: image/png, Size: 52963 bytes --]

[-- Attachment #3: screenshot2.png --]
[-- Type: image/png, Size: 15122 bytes --]

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: kswapd craziness in 3.7
  2012-12-11  0:19                                                         ` Zlatko Calusic
@ 2012-12-11 21:56                                                           ` Zlatko Calusic
  2012-12-19 22:24                                                           ` Zlatko Calusic
  1 sibling, 0 replies; 65+ messages in thread
From: Zlatko Calusic @ 2012-12-11 21:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Mel Gorman, Johannes Weiner, Rik van Riel,
	linux-mm, Linux Kernel Mailing List, Hugh Dickins

On 11.12.2012 01:19, Zlatko Calusic wrote:
>
> I will now make one last attempt: I've just reverted two of Johannes'
> commits that were also applied in an attempt to fix the breakage that
> removing gfp_no_kswapd introduced, namely ed23ec4 & c702418. For various
> reasons the results of this test will be available tomorrow, so it's
> your call, Linus.
>

To be honest, I don't see any difference with those two commits 
reverted. Like those lines never did much anyway, so it's probably good 
we got rid of them. :P

-- 
Zlatko

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: kswapd craziness in 3.7
  2012-12-11  0:19                                                         ` Zlatko Calusic
  2012-12-11 21:56                                                           ` Zlatko Calusic
@ 2012-12-19 22:24                                                           ` Zlatko Calusic
  1 sibling, 0 replies; 65+ messages in thread
From: Zlatko Calusic @ 2012-12-19 22:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Mel Gorman, Johannes Weiner, Rik van Riel,
	linux-mm, Linux Kernel Mailing List, Hugh Dickins

On 11.12.2012 01:19, Zlatko Calusic wrote:
>> On 10.12.2012 20:13, Linus Torvalds wrote:
>>>
>>> It's worth giving this as much testing as is at all possible, but at
>>> the same time I really don't think I can delay 3.7 any more without
>>> messing up the holiday season too much. So unless something obvious
>>> pops up, I will do the release tonight. So testing will be minimal -
>>> but it's not like we haven't gone back-and-forth on this several times
>>> already, and we revert to *mostly* the same old state as 3.6 anyway,
>>> so it should be fairly safe.
>>>
>
> So, here's what I found. In short: close, but no cigar!
>
> Kswapd is certainly no longer a CPU pig, and memory seems to be utilized
> properly (the kernel still likes to keep 400MB free, somebody else can
> confirm if that's to be expected on a 4GB THP-enabled machine). So it
> looks very decent, and much better than anything I ran in the last 10
> days, barring the !THP kernel.
>
> What remains a mystery is that kswapd occasionally still likes to get
> stuck in a D state, only now it recovers faster than before (sometimes
> in a matter of seconds, but sometimes it takes a few minutes). Now, I
> admit it's a small, maybe even cosmetic issue. But, it could also be a
> warning sign of a bigger problem that will reveal itself on a more
> loaded machine.
>

Ha, I nailed it!

The cigar aka the explanation together with a patch will follow shortly 
in a separate topic.

It's a genuine bug that has been with us for a long long time.
-- 
Zlatko

^ permalink raw reply	[flat|nested] 65+ messages in thread

end of thread, other threads:[~2012-12-19 22:24 UTC | newest]

Thread overview: 65+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-11-27 20:48 kswapd craziness in 3.7 Johannes Weiner
2012-11-27 20:48 ` [patch] mm: vmscan: fix kswapd endless loop on higher order allocation Johannes Weiner
2012-11-27 20:58 ` kswapd craziness in 3.7 Linus Torvalds
2012-11-27 21:16   ` Rik van Riel
2012-11-27 21:49     ` Johannes Weiner
2012-11-27 22:02       ` Rik van Riel
2012-11-27 22:26         ` Johannes Weiner
2012-11-27 23:19           ` Linus Torvalds
2012-11-28 10:13             ` Mel Gorman
2012-11-28 10:51               ` Thorsten Leemhuis
2012-11-28 16:42               ` Mel Gorman
2012-11-28 22:52               ` Andrew Morton
2012-11-28 23:54                 ` Mel Gorman
2012-11-29  0:14                   ` Andrew Morton
2012-11-29 15:30                   ` Thorsten Leemhuis
2012-11-29 17:05                     ` Johannes Weiner
2012-11-30 12:39                       ` Thorsten Leemhuis
2012-12-01  0:45                         ` Johannes Weiner
2012-12-03  8:30                           ` Thorsten Leemhuis
2012-12-03 13:08                             ` Fedora repo (was: Re: kswapd craziness in 3.7) Borislav Petkov
2012-12-03 19:42                             ` kswapd craziness in 3.7 Johannes Weiner
2012-12-04 21:42                               ` Johannes Weiner
2012-12-05  3:01                                 ` Bruno Wolff III
2012-12-06 17:37                                   ` Bruno Wolff III
2012-12-06 19:31                                     ` Linus Torvalds
2012-12-06 19:43                                       ` Rik van Riel
2012-12-06 20:23                                       ` Johannes Weiner
2012-12-06 20:32                                         ` Rik van Riel
2012-12-08 12:06                                       ` Zlatko Calusic
2012-12-08 21:22                                         ` Zlatko Calusic
2012-12-09  1:01                                           ` Linus Torvalds
2012-12-09 21:59                                             ` Zdenek Kabelac
2012-12-10 11:03                                             ` Mel Gorman
2012-12-10 16:39                                               ` Johannes Weiner
2012-12-10 18:01                                                 ` Mel Gorman
2012-12-10 18:33                                                   ` Zlatko Calusic
2012-12-10 19:13                                                     ` Linus Torvalds
2012-12-10 20:35                                                       ` Zlatko Calusic
2012-12-10 21:28                                                         ` Linus Torvalds
2012-12-10 21:42                                                           ` Borislav Petkov
2012-12-10 21:47                                                             ` Linus Torvalds
2012-12-10 21:54                                                               ` Borislav Petkov
2012-12-10 22:15                                                                 ` Zlatko Calusic
2012-12-10 23:27                                                           ` Hugh Dickins
2012-12-11  0:19                                                         ` Zlatko Calusic
2012-12-11 21:56                                                           ` Zlatko Calusic
2012-12-19 22:24                                                           ` Zlatko Calusic
2012-12-10 18:29                                               ` Zlatko Calusic
2012-12-06  8:09                               ` Thorsten Leemhuis
2012-11-27 21:29   ` Johannes Weiner
2012-11-28 13:35   ` Zdenek Kabelac
2012-11-28 14:04     ` Jiri Slaby
2012-11-28  9:45 ` Mel Gorman
2012-12-03 15:23   ` Zdenek Kabelac
2012-12-03 19:18     ` Johannes Weiner
2012-12-04  9:05       ` Zdenek Kabelac
2012-12-04  9:15         ` Jiri Slaby
2012-12-04 16:11           ` Johannes Weiner
2012-12-04 16:22             ` Jiri Slaby
2012-12-04 19:50               ` Johannes Weiner
2012-12-08 10:35             ` Jiri Slaby
2012-12-04 16:15         ` Johannes Weiner
2012-12-06 13:51         ` Zdenek Kabelac
2012-12-03 13:14 ` Jiri Slaby
2012-12-04  8:55   ` Jiri Slaby
