* [PATCH 0/4] Stop kswapd consuming 100% CPU when highest zone is small
@ 2011-06-24 14:44 ` Mel Gorman
  0 siblings, 0 replies; 82+ messages in thread
From: Mel Gorman @ 2011-06-24 14:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Pádraig Brady, James Bottomley, Colin King, Minchan Kim,
	Andrew Lutomirski, Rik van Riel, Johannes Weiner, linux-mm,
	linux-kernel, Mel Gorman

(Built this time and passed a basic sniff-test.)

During allocator-intensive workloads, kswapd will be woken frequently
causing free memory to oscillate between the high and min watermark.
This is expected behaviour.  Unfortunately, if the highest zone is
small, a problem occurs.

This seems to happen most with recent Sandy Bridge laptops, but that is
probably a coincidence as some of these laptops just happen to have a
small Normal zone. The reproduction case is almost always copying large
files, during which kswapd pegs at 100% CPU until the file is deleted
or the cache is dropped.

The problem is mostly down to sleeping_prematurely() keeping kswapd
awake when the highest zone is small and unreclaimable. It is
compounded by the fact that we shrink slabs even when not shrinking
zones, causing a lot of time to be spent in shrinkers and a lot of
memory to be reclaimed unnecessarily.

Patch 1 corrects sleeping_prematurely to check the zones matching
	the classzone_idx instead of all zones.

Patch 2 avoids shrinking slab when we are not shrinking a zone.

Patch 3 notes that sleeping_prematurely is checking lower zones against
	a high classzone, which is not what the allocators or
	balance_pgdat() are doing, leading to an artificial belief that
	kswapd should still be awake.

Patch 4 notes that when balance_pgdat() gives up on a high zone, the
	decision is not communicated to sleeping_prematurely().

This problem affects 2.6.38.8 for certain and is expected to affect
2.6.39 and 3.0-rc4 as well. If accepted, the patches need to go to
-stable to be picked up by distros; this series is against 3.0-rc4.
I've cc'd people who reported similar problems recently to see if they
still suffer from the problem and whether this fixes it.

 mm/vmscan.c |   59 +++++++++++++++++++++++++++++++++++------------------------
 1 files changed, 35 insertions(+), 24 deletions(-)

-- 
1.7.3.4


^ permalink raw reply	[flat|nested] 82+ messages in thread

* [PATCH 1/4] mm: vmscan: Correct check for kswapd sleeping in sleeping_prematurely
  2011-06-24 14:44 ` Mel Gorman
@ 2011-06-24 14:44   ` Mel Gorman
  -1 siblings, 0 replies; 82+ messages in thread
From: Mel Gorman @ 2011-06-24 14:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Pádraig Brady, James Bottomley, Colin King, Minchan Kim,
	Andrew Lutomirski, Rik van Riel, Johannes Weiner, linux-mm,
	linux-kernel, Mel Gorman

During allocator-intensive workloads, kswapd will be woken frequently
causing free memory to oscillate between the high and min watermark.
This is expected behaviour.

A problem occurs if the highest zone is small.  balance_pgdat()
only considers unreclaimable zones when priority is DEF_PRIORITY
but sleeping_prematurely considers all zones. It's possible for this
sequence to occur:

  1. kswapd wakes up and enters balance_pgdat()
  2. At DEF_PRIORITY, marks highest zone unreclaimable
  3. At DEF_PRIORITY-1, ignores highest zone setting end_zone
  4. At DEF_PRIORITY-1, calls shrink_slab freeing memory from
        highest zone, clearing all_unreclaimable. Highest zone
        is still unbalanced
  5. kswapd returns and calls sleeping_prematurely
  6. sleeping_prematurely looks at *all* zones, not just the ones
     being considered by balance_pgdat. The highest small zone
     has all_unreclaimable cleared but the zone is not
     balanced. all_zones_ok is false so kswapd stays awake

This patch corrects the behaviour of sleeping_prematurely to check
the zones balance_pgdat() checked.
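
For reference, the zone loop on the balance_pgdat() side of this
mismatch looks roughly like the sketch below (condensed from the
vmscan.c this series is against; the watermark checks and end_zone
selection are omitted). It is an illustration of why a zone skipped
here can still fail the check in sleeping_prematurely(), not the full
function.

	/*
	 * Condensed sketch: once a zone has been marked all_unreclaimable
	 * it is skipped at every priority below DEF_PRIORITY, so
	 * balance_pgdat() can finish with that zone still unbalanced.
	 */
	for (i = pgdat->nr_zones - 1; i >= 0; i--) {
		struct zone *zone = pgdat->node_zones + i;

		if (!populated_zone(zone))
			continue;

		if (zone->all_unreclaimable && priority != DEF_PRIORITY)
			continue;

		/* ... watermark checks that select end_zone ... */
	}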

Reported-and-tested-by: Pádraig Brady <P@draigBrady.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/vmscan.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8ff834e..841e3bf 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2323,7 +2323,7 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
 		return true;
 
 	/* Check the watermark levels */
-	for (i = 0; i < pgdat->nr_zones; i++) {
+	for (i = 0; i <= classzone_idx; i++) {
 		struct zone *zone = pgdat->node_zones + i;
 
 		if (!populated_zone(zone))
-- 
1.7.3.4


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 2/4] mm: vmscan: Do not apply pressure to slab if we are not applying pressure to zone
  2011-06-24 14:44 ` Mel Gorman
@ 2011-06-24 14:44   ` Mel Gorman
  -1 siblings, 0 replies; 82+ messages in thread
From: Mel Gorman @ 2011-06-24 14:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Pádraig Brady, James Bottomley, Colin King, Minchan Kim,
	Andrew Lutomirski, Rik van Riel, Johannes Weiner, linux-mm,
	linux-kernel, Mel Gorman

During allocator-intensive workloads, kswapd will be woken frequently
causing free memory to oscillate between the high and min watermark.
This is expected behaviour.

When kswapd applies pressure to zones during node balancing, it checks
if the zone is above a high+balance_gap threshold. If it is, it does
not apply pressure, but it still unconditionally shrinks slab on a
global basis, which is excessive. In the event kswapd is being kept
awake due to a small, unreclaimable highest zone, it skips zone
shrinking but still calls shrink_slab().
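
As a reminder of what that threshold is, balance_gap is computed in
balance_pgdat() roughly as in the sketch below (based on the 3.0-rc
code this series is against), i.e. a small slack capped at the low
watermark, so a zone stops receiving pressure shortly after it clears
its high watermark:

	/*
	 * Sketch of the balance_gap calculation: a small per-zone slack,
	 * capped at the low watermark, that is added to the high watermark
	 * when deciding whether to keep applying pressure to the zone.
	 */
	balance_gap = min(low_wmark_pages(zone),
		(zone->present_pages + KSWAPD_ZONE_BALANCE_GAP_RATIO-1) /
		KSWAPD_ZONE_BALANCE_GAP_RATIO);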

Once pressure has been applied, the check for the zone being
unreclaimable is made before the check of whether all_unreclaimable
should be set. This missed check can cause has_under_min_watermark_zone
to be set due to an unreclaimable zone, preventing kswapd from backing
off in congestion_wait().

Reported-and-tested-by: Pádraig Brady <P@draigBrady.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/vmscan.c |   23 +++++++++++++----------
 1 files changed, 13 insertions(+), 10 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 841e3bf..9cebed1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2507,18 +2507,18 @@ loop_again:
 				KSWAPD_ZONE_BALANCE_GAP_RATIO);
 			if (!zone_watermark_ok_safe(zone, order,
 					high_wmark_pages(zone) + balance_gap,
-					end_zone, 0))
+					end_zone, 0)) {
 				shrink_zone(priority, zone, &sc);
-			reclaim_state->reclaimed_slab = 0;
-			nr_slab = shrink_slab(&shrink, sc.nr_scanned, lru_pages);
-			sc.nr_reclaimed += reclaim_state->reclaimed_slab;
-			total_scanned += sc.nr_scanned;
 
-			if (zone->all_unreclaimable)
-				continue;
-			if (nr_slab == 0 &&
-			    !zone_reclaimable(zone))
-				zone->all_unreclaimable = 1;
+				reclaim_state->reclaimed_slab = 0;
+				nr_slab = shrink_slab(&shrink, sc.nr_scanned, lru_pages);
+				sc.nr_reclaimed += reclaim_state->reclaimed_slab;
+				total_scanned += sc.nr_scanned;
+
+				if (nr_slab == 0 && !zone_reclaimable(zone))
+					zone->all_unreclaimable = 1;
+			}
+
 			/*
 			 * If we've done a decent amount of scanning and
 			 * the reclaim ratio is low, start doing writepage
@@ -2528,6 +2528,9 @@ loop_again:
 			    total_scanned > sc.nr_reclaimed + sc.nr_reclaimed / 2)
 				sc.may_writepage = 1;
 
+			if (zone->all_unreclaimable)
+				continue;
+
 			if (!zone_watermark_ok_safe(zone, order,
 					high_wmark_pages(zone), end_zone, 0)) {
 				all_zones_ok = 0;
-- 
1.7.3.4


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 3/4] mm: vmscan: Evaluate the watermarks against the correct classzone
  2011-06-24 14:44 ` Mel Gorman
@ 2011-06-24 14:44   ` Mel Gorman
  -1 siblings, 0 replies; 82+ messages in thread
From: Mel Gorman @ 2011-06-24 14:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Pádraig Brady, James Bottomley, Colin King, Minchan Kim,
	Andrew Lutomirski, Rik van Riel, Johannes Weiner, linux-mm,
	linux-kernel, Mel Gorman

When deciding if kswapd is sleeping prematurely, the classzone is taken
into account, but this is different from what balance_pgdat() and the
allocator are doing. Specifically, the DMA zone will be checked based
on the classzone used when waking kswapd, which could be for a
GFP_KERNEL or GFP_HIGHMEM request. The lowmem reserve limit kicks in,
the watermark is not met and kswapd thinks it is sleeping prematurely,
so it stays awake in error.
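
The effect of the classzone on that check is easiest to see in the core
of __zone_watermark_ok(), sketched below in simplified form (the
order-dependent adjustments are left out): the lowmem reserve that is
added depends entirely on which classzone_idx is passed in, so checking
ZONE_DMA against a high classzone demands far more free memory than
checking it against itself.

	/*
	 * Simplified core of the watermark test: a low zone checked with a
	 * high classzone_idx must also cover that classzone's lowmem
	 * reserve, which a small DMA zone may never manage.
	 */
	if (free_pages <= min + z->lowmem_reserve[classzone_idx])
		return false;
	return true;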

Reported-and-tested-by: Pádraig Brady <P@draigBrady.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/vmscan.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9cebed1..a76b6cc2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2341,7 +2341,7 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
 		}
 
 		if (!zone_watermark_ok_safe(zone, order, high_wmark_pages(zone),
-							classzone_idx, 0))
+							i, 0))
 			all_zones_ok = false;
 		else
 			balanced += zone->present_pages;
-- 
1.7.3.4


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 4/4] mm: vmscan: Only read new_classzone_idx from pgdat when reclaiming successfully
  2011-06-24 14:44 ` Mel Gorman
@ 2011-06-24 14:44   ` Mel Gorman
  -1 siblings, 0 replies; 82+ messages in thread
From: Mel Gorman @ 2011-06-24 14:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Pádraig Brady, James Bottomley, Colin King, Minchan Kim,
	Andrew Lutomirski, Rik van Riel, Johannes Weiner, linux-mm,
	linux-kernel, Mel Gorman

During allocator-intensive workloads, kswapd will be woken frequently
causing free memory to oscillate between the high and min watermark.
This is expected behaviour.  Unfortunately, if the highest zone is
small, a problem occurs.

When balance_pgdat() returns, it may be at a lower classzone_idx than
it started with because the highest zone was unreclaimable. Before
checking if it should go to sleep though, it checks
pgdat->classzone_idx which, when there is no other activity, will be
MAX_NR_ZONES-1. kswapd interprets this as having been woken up while
reclaiming, so it skips scheduling and reclaims again. As there is no
useful reclaim work to do, it enters a loop of shrinking slab that
consumes loads of CPU until the highest zone becomes reclaimable for a
long period of time.

There are two problems here. 1) If the returned classzone or order is
lower, kswapd will continue reclaiming without scheduling. 2) If the
highest zone was marked unreclaimable but balance_pgdat() returns
immediately at DEF_PRIORITY, the new lower classzone is not
communicated back to kswapd() for sleeping.

This patch does two related things. First, if the end_zone is
unreclaimable, this information is communicated back. Second, if the
classzone or order was reduced due to failing to reclaim, new
information is not read from pgdat and instead an attempt is made to
go to sleep. Due to this, it is also necessary that
pgdat->classzone_idx be initialised each time to pgdat->nr_zones - 1
to avoid re-reads being interpreted as wakeups.
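
For context, the waker side updates pgdat roughly as in the sketch
below (simplified from wakeup_kswapd() in this kernel): requests only
ever raise kswapd_max_order and lower classzone_idx, which is why
kswapd resets pgdat->classzone_idx to the highest index and why a
re-read value still at that reset level means "no new request" rather
than a wakeup.

	/*
	 * Simplified waker side: a new request can only make kswapd's job
	 * harder (higher order, lower classzone). A pgdat->classzone_idx
	 * still at its reset value therefore indicates no new request.
	 */
	if (pgdat->kswapd_max_order < order) {
		pgdat->kswapd_max_order = order;
		pgdat->classzone_idx = min(pgdat->classzone_idx,
					   classzone_idx);
	}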

Reported-and-tested-by: Pádraig Brady <P@draigBrady.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/vmscan.c |   34 +++++++++++++++++++++-------------
 1 files changed, 21 insertions(+), 13 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index a76b6cc2..fe854d7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2448,7 +2448,6 @@ loop_again:
 			if (!zone_watermark_ok_safe(zone, order,
 					high_wmark_pages(zone), 0, 0)) {
 				end_zone = i;
-				*classzone_idx = i;
 				break;
 			}
 		}
@@ -2528,8 +2527,11 @@ loop_again:
 			    total_scanned > sc.nr_reclaimed + sc.nr_reclaimed / 2)
 				sc.may_writepage = 1;
 
-			if (zone->all_unreclaimable)
+			if (zone->all_unreclaimable) {
+				if (end_zone && end_zone == i)
+					end_zone--;
 				continue;
+			}
 
 			if (!zone_watermark_ok_safe(zone, order,
 					high_wmark_pages(zone), end_zone, 0)) {
@@ -2709,8 +2711,8 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
  */
 static int kswapd(void *p)
 {
-	unsigned long order;
-	int classzone_idx;
+	unsigned long order, new_order;
+	int classzone_idx, new_classzone_idx;
 	pg_data_t *pgdat = (pg_data_t*)p;
 	struct task_struct *tsk = current;
 
@@ -2740,17 +2742,23 @@ static int kswapd(void *p)
 	tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
 	set_freezable();
 
-	order = 0;
-	classzone_idx = MAX_NR_ZONES - 1;
+	order = new_order = 0;
+	classzone_idx = new_classzone_idx = pgdat->nr_zones - 1;
 	for ( ; ; ) {
-		unsigned long new_order;
-		int new_classzone_idx;
 		int ret;
 
-		new_order = pgdat->kswapd_max_order;
-		new_classzone_idx = pgdat->classzone_idx;
-		pgdat->kswapd_max_order = 0;
-		pgdat->classzone_idx = MAX_NR_ZONES - 1;
+		/*
+		 * If the last balance_pgdat was unsuccessful it's unlikely a
+		 * new request of a similar or harder type will succeed soon
+		 * so consider going to sleep on the basis we reclaimed at
+		 */
+		if (classzone_idx >= new_classzone_idx && order == new_order) {
+			new_order = pgdat->kswapd_max_order;
+			new_classzone_idx = pgdat->classzone_idx;
+			pgdat->kswapd_max_order =  0;
+			pgdat->classzone_idx = pgdat->nr_zones - 1;
+		}
+
 		if (order < new_order || classzone_idx > new_classzone_idx) {
 			/*
 			 * Don't sleep if someone wants a larger 'order'
@@ -2763,7 +2771,7 @@ static int kswapd(void *p)
 			order = pgdat->kswapd_max_order;
 			classzone_idx = pgdat->classzone_idx;
 			pgdat->kswapd_max_order = 0;
-			pgdat->classzone_idx = MAX_NR_ZONES - 1;
+			pgdat->classzone_idx = pgdat->nr_zones - 1;
 		}
 
 		ret = try_to_freeze();
-- 
1.7.3.4


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* Re: [PATCH 0/4] Stop kswapd consuming 100% CPU when highest zone is small
  2011-06-24 14:44 ` Mel Gorman
@ 2011-06-25 14:23   ` Andrew Lutomirski
  -1 siblings, 0 replies; 82+ messages in thread
From: Andrew Lutomirski @ 2011-06-25 14:23 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Pádraig Brady, James Bottomley, Colin King,
	Minchan Kim, Rik van Riel, Johannes Weiner, linux-mm,
	linux-kernel

On Fri, Jun 24, 2011 at 8:44 AM, Mel Gorman <mgorman@suse.de> wrote:
> (Built this time and passed a basic sniff-test.)
>
> During allocator-intensive workloads, kswapd will be woken frequently
> causing free memory to oscillate between the high and min watermark.
> This is expected behaviour.  Unfortunately, if the highest zone is
> small, a problem occurs.
>

[...]

I've been running these for a couple days with no problems, although I
haven't been trying to reproduce the problem.  (Well, no problems
related to memory management.)

I suspect that my pet unnecessary-OOM-kill bug is still around, but
that's probably not related, especially since I can trigger it if I
stick 8 GB of RAM in this laptop.

Thanks,
Andy

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 1/4] mm: vmscan: Correct check for kswapd sleeping in sleeping_prematurely
  2011-06-24 14:44   ` Mel Gorman
@ 2011-06-25 21:33     ` Rik van Riel
  -1 siblings, 0 replies; 82+ messages in thread
From: Rik van Riel @ 2011-06-25 21:33 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Pádraig Brady, James Bottomley, Colin King,
	Minchan Kim, Andrew Lutomirski, Johannes Weiner, linux-mm,
	linux-kernel

On 06/24/2011 10:44 AM, Mel Gorman wrote:
> During allocator-intensive workloads, kswapd will be woken frequently
> causing free memory to oscillate between the high and min watermark.
> This is expected behaviour.
>
> A problem occurs if the highest zone is small.  balance_pgdat()
> only considers unreclaimable zones when priority is DEF_PRIORITY
> but sleeping_prematurely considers all zones. It's possible for this
> sequence to occur
>
>    1. kswapd wakes up and enters balance_pgdat()
>    2. At DEF_PRIORITY, marks highest zone unreclaimable
>    3. At DEF_PRIORITY-1, ignores highest zone setting end_zone
>    4. At DEF_PRIORITY-1, calls shrink_slab freeing memory from
>          highest zone, clearing all_unreclaimable. Highest zone
>          is still unbalanced
>    5. kswapd returns and calls sleeping_prematurely
>    6. sleeping_prematurely looks at *all* zones, not just the ones
>       being considered by balance_pgdat. The highest small zone
>       has all_unreclaimable cleared but the zone is not
>       balanced. all_zones_ok is false so kswapd stays awake
>
> This patch corrects the behaviour of sleeping_prematurely to check
> the zones balance_pgdat() checked.
>
> Reported-and-tested-by: Pádraig Brady<P@draigBrady.com>
> Signed-off-by: Mel Gorman<mgorman@suse.de>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 2/4] mm: vmscan: Do not apply pressure to slab if we are not applying pressure to zone
  2011-06-24 14:44   ` Mel Gorman
@ 2011-06-25 21:40     ` Rik van Riel
  -1 siblings, 0 replies; 82+ messages in thread
From: Rik van Riel @ 2011-06-25 21:40 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Pádraig Brady, James Bottomley, Colin King,
	Minchan Kim, Andrew Lutomirski, Johannes Weiner, linux-mm,
	linux-kernel

On 06/24/2011 10:44 AM, Mel Gorman wrote:
> During allocator-intensive workloads, kswapd will be woken frequently
> causing free memory to oscillate between the high and min watermark.
> This is expected behaviour.
>
> When kswapd applies pressure to zones during node balancing, it checks
> if the zone is above a high+balance_gap threshold. If it is, it does
> not apply pressure but it unconditionally shrinks slab on a global
> basis which is excessive. In the event kswapd is being kept awake due to
> a high small unreclaimable zone, it skips zone shrinking but still
> calls shrink_slab().
>
> Once pressure has been applied, the check for zone being unreclaimable
> is being made before the check is made if all_unreclaimable should be
> set. This miss of unreclaimable can cause has_under_min_watermark_zone
> to be set due to an unreclaimable zone preventing kswapd backing off
> on congestion_wait().
>
> Reported-and-tested-by: Pádraig Brady<P@draigBrady.com>
> Signed-off-by: Mel Gorman<mgorman@suse.de>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 3/4] mm: vmscan: Evaluate the watermarks against the correct classzone
  2011-06-24 14:44   ` Mel Gorman
@ 2011-06-25 21:42     ` Rik van Riel
  -1 siblings, 0 replies; 82+ messages in thread
From: Rik van Riel @ 2011-06-25 21:42 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Pádraig Brady, James Bottomley, Colin King,
	Minchan Kim, Andrew Lutomirski, Johannes Weiner, linux-mm,
	linux-kernel

On 06/24/2011 10:44 AM, Mel Gorman wrote:
> When deciding if kswapd is sleeping prematurely, the classzone is
> taken into account but this is different to what balance_pgdat() and
> the allocator are doing. Specifically, the DMA zone will be checked
> based on the classzone used when waking kswapd which could be for a
> GFP_KERNEL or GFP_HIGHMEM request. The lowmem reserve limit kicks in,
> the watermark is not met and kswapd thinks its sleeping prematurely
> keeping kswapd awake in error.
>
> Reported-and-tested-by: Pádraig Brady<P@draigBrady.com>
> Signed-off-by: Mel Gorman<mgorman@suse.de>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 4/4] mm: vmscan: Only read new_classzone_idx from pgdat when reclaiming successfully
  2011-06-24 14:44   ` Mel Gorman
@ 2011-06-25 23:17     ` Rik van Riel
  -1 siblings, 0 replies; 82+ messages in thread
From: Rik van Riel @ 2011-06-25 23:17 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Pádraig Brady, James Bottomley, Colin King,
	Minchan Kim, Andrew Lutomirski, Johannes Weiner, linux-mm,
	linux-kernel

On 06/24/2011 10:44 AM, Mel Gorman wrote:
> During allocator-intensive workloads, kswapd will be woken frequently
> causing free memory to oscillate between the high and min watermark.
> This is expected behaviour.  Unfortunately, if the highest zone is
> small, a problem occurs.
>
> When balance_pgdat() returns, it may be at a lower classzone_idx than
> it started because the highest zone was unreclaimable. Before checking
> if it should go to sleep though, it checks pgdat->classzone_idx which
> when there is no other activity will be MAX_NR_ZONES-1. It interprets
> this as it has been woken up while reclaiming, skips scheduling and
> reclaims again. As there is no useful reclaim work to do, it enters
> into a loop of shrinking slab consuming loads of CPU until the highest
> zone becomes reclaimable for a long period of time.
>
> There are two problems here. 1) If the returned classzone or order is
> lower, it'll continue reclaiming without scheduling. 2) if the highest
> zone was marked unreclaimable but balance_pgdat() returns immediately
> at DEF_PRIORITY, the new lower classzone is not communicated back to
> kswapd() for sleeping.
>
> This patch does two things that are related. If the end_zone is
> unreclaimable, this information is communicated back. Second, if
> the classzone or order was reduced due to failing to reclaim, new
> information is not read from pgdat and instead an attempt is made to go
> to sleep. Due to this, it is also necessary that pgdat->classzone_idx
> be initialised each time to pgdat->nr_zones - 1 to avoid re-reads
> being interpreted as wakeups.
>
> Reported-and-tested-by: Pádraig Brady<P@draigBrady.com>
> Signed-off-by: Mel Gorman<mgorman@suse.de>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 1/4] mm: vmscan: Correct check for kswapd sleeping in sleeping_prematurely
  2011-06-24 14:44   ` Mel Gorman
@ 2011-06-27  6:10     ` Minchan Kim
  -1 siblings, 0 replies; 82+ messages in thread
From: Minchan Kim @ 2011-06-27  6:10 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Pádraig Brady, James Bottomley, Colin King,
	Andrew Lutomirski, Rik van Riel, Johannes Weiner, linux-mm,
	linux-kernel

On Fri, Jun 24, 2011 at 11:44 PM, Mel Gorman <mgorman@suse.de> wrote:
> During allocator-intensive workloads, kswapd will be woken frequently
> causing free memory to oscillate between the high and min watermark.
> This is expected behaviour.
>
> A problem occurs if the highest zone is small.  balance_pgdat()
> only considers unreclaimable zones when priority is DEF_PRIORITY
> but sleeping_prematurely considers all zones. It's possible for this
> sequence to occur
>
>  1. kswapd wakes up and enters balance_pgdat()
>  2. At DEF_PRIORITY, marks highest zone unreclaimable
>  3. At DEF_PRIORITY-1, ignores highest zone setting end_zone
>  4. At DEF_PRIORITY-1, calls shrink_slab freeing memory from
>        highest zone, clearing all_unreclaimable. Highest zone
>        is still unbalanced
>  5. kswapd returns and calls sleeping_prematurely
>  6. sleeping_prematurely looks at *all* zones, not just the ones
>     being considered by balance_pgdat. The highest small zone
>     has all_unreclaimable cleared but the zone is not
>     balanced. all_zones_ok is false so kswapd stays awake
>
> This patch corrects the behaviour of sleeping_prematurely to check
> the zones balance_pgdat() checked.
>
> Reported-and-tested-by: Pádraig Brady <P@draigBrady.com>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 3/4] mm: vmscan: Evaluate the watermarks against the correct classzone
  2011-06-24 14:44   ` Mel Gorman
@ 2011-06-27  6:53     ` Minchan Kim
  -1 siblings, 0 replies; 82+ messages in thread
From: Minchan Kim @ 2011-06-27  6:53 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Pádraig Brady, James Bottomley, Colin King,
	Andrew Lutomirski, Rik van Riel, Johannes Weiner, linux-mm,
	linux-kernel

On Fri, Jun 24, 2011 at 11:44 PM, Mel Gorman <mgorman@suse.de> wrote:
> When deciding if kswapd is sleeping prematurely, the classzone is
> taken into account but this is different to what balance_pgdat() and
> the allocator are doing. Specifically, the DMA zone will be checked
> based on the classzone used when waking kswapd which could be for a
> GFP_KERNEL or GFP_HIGHMEM request. The lowmem reserve limit kicks in,
> the watermark is not met and kswapd thinks its sleeping prematurely
> keeping kswapd awake in error.


I thought it was intentional when you first submitted the patch:
kswapd makes sure zones include enough free pages (i.e. including the
reserve limit of the zones above). But you seem to be seeing that, in
some situations, the DMA zone can never meet that requirement, so
kswapd doesn't sleep. Right?

>
> Reported-and-tested-by: Pádraig Brady <P@draigBrady.com>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> ---
>  mm/vmscan.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 9cebed1..a76b6cc2 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2341,7 +2341,7 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
>                }
>
>                if (!zone_watermark_ok_safe(zone, order, high_wmark_pages(zone),
> -                                                       classzone_idx, 0))
> +                                                       i, 0))

Isn't it better to use 0 instead of i?

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 3/4] mm: vmscan: Evaluate the watermarks against the correct classzone
  2011-06-27  6:53     ` Minchan Kim
@ 2011-06-28 12:52       ` Mel Gorman
  -1 siblings, 0 replies; 82+ messages in thread
From: Mel Gorman @ 2011-06-28 12:52 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, Pádraig Brady, James Bottomley, Colin King,
	Andrew Lutomirski, Rik van Riel, Johannes Weiner, linux-mm,
	linux-kernel

On Mon, Jun 27, 2011 at 03:53:04PM +0900, Minchan Kim wrote:
> On Fri, Jun 24, 2011 at 11:44 PM, Mel Gorman <mgorman@suse.de> wrote:
> > When deciding if kswapd is sleeping prematurely, the classzone is
> > taken into account but this is different to what balance_pgdat() and
> > the allocator are doing. Specifically, the DMA zone will be checked
> > based on the classzone used when waking kswapd which could be for a
> > GFP_KERNEL or GFP_HIGHMEM request. The lowmem reserve limit kicks in,
> > the watermark is not met and kswapd thinks its sleeping prematurely
> > keeping kswapd awake in error.
> 
> 
> I thought it was intentional when you submitted a patch firstly.

It was, but it also wasn't right.

> "Kswapd makes sure zones include enough free pages(ie, include reserve
> limit of above zones).
> But you seem to see DMA zone can't meet above requirement forever in
> some situation so that kswapd doesn't sleep.
> Right?
> 

Right.

> >
> > Reported-and-tested-by: Pádraig Brady <P@draigBrady.com>
> > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > ---
> >  mm/vmscan.c |    2 +-
> >  1 files changed, 1 insertions(+), 1 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 9cebed1..a76b6cc2 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2341,7 +2341,7 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
> >                }
> >
> >                if (!zone_watermark_ok_safe(zone, order, high_wmark_pages(zone),
> > -                                                       classzone_idx, 0))
> > +                                                       i, 0))
> 
> Isn't it  better to use 0 instead of i?
> 

I considered it but went with i as a compromise: it makes sure zones
include enough free pages without requiring that ZONE_DMA meet an
almost impossible requirement under continual memory pressure.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 1/4] mm: vmscan: Correct check for kswapd sleeping in sleeping_prematurely
  2011-06-24 14:44   ` Mel Gorman
@ 2011-06-28 21:49     ` Andrew Morton
  -1 siblings, 0 replies; 82+ messages in thread
From: Andrew Morton @ 2011-06-28 21:49 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Pádraig Brady, James Bottomley, Colin King, Minchan Kim,
	Andrew Lutomirski, Rik van Riel, Johannes Weiner, linux-mm,
	linux-kernel

On Fri, 24 Jun 2011 15:44:54 +0100
Mel Gorman <mgorman@suse.de> wrote:

> During allocator-intensive workloads, kswapd will be woken frequently
> causing free memory to oscillate between the high and min watermark.
> This is expected behaviour.
> 
> A problem occurs if the highest zone is small.  balance_pgdat()
> only considers unreclaimable zones when priority is DEF_PRIORITY
> but sleeping_prematurely considers all zones. It's possible for this
> sequence to occur
> 
>   1. kswapd wakes up and enters balance_pgdat()
>   2. At DEF_PRIORITY, marks highest zone unreclaimable
>   3. At DEF_PRIORITY-1, ignores highest zone setting end_zone
>   4. At DEF_PRIORITY-1, calls shrink_slab freeing memory from
>         highest zone, clearing all_unreclaimable. Highest zone
>         is still unbalanced
>   5. kswapd returns and calls sleeping_prematurely
>   6. sleeping_prematurely looks at *all* zones, not just the ones
>      being considered by balance_pgdat. The highest small zone
>      has all_unreclaimable cleared but the zone is not
>      balanced. all_zones_ok is false so kswapd stays awake
> 
> This patch corrects the behaviour of sleeping_prematurely to check
> the zones balance_pgdat() checked.

But kswapd is making progress: it's reclaiming slab.  Eventually that
won't work any more and all_unreclaimable will not be cleared and the
condition will fix itself up?



btw,

	if (!sleeping_prematurely(...))
		sleep();

hurts my brain.  My brain would prefer

	if (kswapd_should_sleep(...))
		sleep();

no?

> Reported-and-tested-by: Pádraig Brady <P@draigBrady.com>

But what were the before-and-after observations?  I don't understand
how this can cause a permanent cpuchew by kswapd.

> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2323,7 +2323,7 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
>  		return true;
>  
>  	/* Check the watermark levels */
> -	for (i = 0; i < pgdat->nr_zones; i++) {
> +	for (i = 0; i <= classzone_idx; i++) {
>  		struct zone *zone = pgdat->node_zones + i;
>  
>  		if (!populated_zone(zone))

The patch looks sensible.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 3/4] mm: vmscan: Evaluate the watermarks against the correct classzone
  2011-06-28 12:52       ` Mel Gorman
@ 2011-06-28 23:23         ` Minchan Kim
  -1 siblings, 0 replies; 82+ messages in thread
From: Minchan Kim @ 2011-06-28 23:23 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Pádraig Brady, James Bottomley, Colin King,
	Andrew Lutomirski, Rik van Riel, Johannes Weiner, linux-mm,
	linux-kernel

On Tue, Jun 28, 2011 at 9:52 PM, Mel Gorman <mgorman@suse.de> wrote:
> On Mon, Jun 27, 2011 at 03:53:04PM +0900, Minchan Kim wrote:
>> On Fri, Jun 24, 2011 at 11:44 PM, Mel Gorman <mgorman@suse.de> wrote:
>> > When deciding if kswapd is sleeping prematurely, the classzone is
>> > taken into account but this is different to what balance_pgdat() and
>> > the allocator are doing. Specifically, the DMA zone will be checked
>> > based on the classzone used when waking kswapd which could be for a
>> > GFP_KERNEL or GFP_HIGHMEM request. The lowmem reserve limit kicks in,
>> > the watermark is not met and kswapd thinks it's sleeping prematurely
>> > keeping kswapd awake in error.
>>
>>
>> I thought it was intentional when you first submitted the patch.
>
> It was, but it also wasn't right.
>
>> "Kswapd makes sure zones include enough free pages(ie, include reserve
>> limit of above zones).
>> But you seem to see DMA zone can't meet above requirement forever in
>> some situation so that kswapd doesn't sleep.
>> Right?
>>
>
> Right.
>
>> >
>> > Reported-and-tested-by: Pádraig Brady <P@draigBrady.com>
>> > Signed-off-by: Mel Gorman <mgorman@suse.de>
>> > ---
>> >  mm/vmscan.c |    2 +-
>> >  1 files changed, 1 insertions(+), 1 deletions(-)
>> >
>> > diff --git a/mm/vmscan.c b/mm/vmscan.c
>> > index 9cebed1..a76b6cc2 100644
>> > --- a/mm/vmscan.c
>> > +++ b/mm/vmscan.c
>> > @@ -2341,7 +2341,7 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
>> >                }
>> >
>> >                if (!zone_watermark_ok_safe(zone, order, high_wmark_pages(zone),
>> > -                                                       classzone_idx, 0))
>> > +                                                       i, 0))
>>
>> Isn't it  better to use 0 instead of i?
>>
>
> I considered it but went with i as a compromise: it ensures zones
> include enough free pages without requiring that ZONE_DMA meet an
> almost impossible requirement when under continual memory pressure.

I see.
Thanks, Mel.

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 3/4] mm: vmscan: Evaluate the watermarks against the correct classzone
  2011-06-24 14:44   ` Mel Gorman
@ 2011-06-28 23:23     ` Minchan Kim
  -1 siblings, 0 replies; 82+ messages in thread
From: Minchan Kim @ 2011-06-28 23:23 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Pádraig Brady, James Bottomley, Colin King,
	Andrew Lutomirski, Rik van Riel, Johannes Weiner, linux-mm,
	linux-kernel

On Fri, Jun 24, 2011 at 11:44 PM, Mel Gorman <mgorman@suse.de> wrote:
> When deciding if kswapd is sleeping prematurely, the classzone is
> taken into account but this is different to what balance_pgdat() and
> the allocator are doing. Specifically, the DMA zone will be checked
> based on the classzone used when waking kswapd which could be for a
> GFP_KERNEL or GFP_HIGHMEM request. The lowmem reserve limit kicks in,
> the watermark is not met and kswapd thinks it's sleeping prematurely
> keeping kswapd awake in error.
>
> Reported-and-tested-by: Pádraig Brady <P@draigBrady.com>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 2/4] mm: vmscan: Do not apply pressure to slab if we are not applying pressure to zone
  2011-06-24 14:44   ` Mel Gorman
@ 2011-06-28 23:38     ` Minchan Kim
  -1 siblings, 0 replies; 82+ messages in thread
From: Minchan Kim @ 2011-06-28 23:38 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Pádraig Brady, James Bottomley, Colin King,
	Andrew Lutomirski, Rik van Riel, Johannes Weiner, linux-mm,
	linux-kernel

On Fri, Jun 24, 2011 at 11:44 PM, Mel Gorman <mgorman@suse.de> wrote:
> During allocator-intensive workloads, kswapd will be woken frequently
> causing free memory to oscillate between the high and min watermark.
> This is expected behaviour.
>
> When kswapd applies pressure to zones during node balancing, it checks
> if the zone is above a high+balance_gap threshold. If it is, it does
> not apply pressure but it unconditionally shrinks slab on a global
> basis which is excessive. In the event kswapd is being kept awake due to
> a high small unreclaimable zone, it skips zone shrinking but still
> calls shrink_slab().
>
> Once pressure has been applied, the check for zone being unreclaimable
> is being made before the check is made if all_unreclaimable should be
> set. This miss of unreclaimable can cause has_under_min_watermark_zone
> to be set due to an unreclaimable zone preventing kswapd backing off
> on congestion_wait().
>
> Reported-and-tested-by: Pádraig Brady <P@draigBrady.com>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>

It does make sense.

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 1/4] mm: vmscan: Correct check for kswapd sleeping in sleeping_prematurely
  2011-06-28 21:49     ` Andrew Morton
@ 2011-06-29 10:57       ` Pádraig Brady
  -1 siblings, 0 replies; 82+ messages in thread
From: Pádraig Brady @ 2011-06-29 10:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, James Bottomley, Colin King, Minchan Kim,
	Andrew Lutomirski, Rik van Riel, Johannes Weiner, linux-mm,
	linux-kernel

On 28/06/11 22:49, Andrew Morton wrote:
> On Fri, 24 Jun 2011 15:44:54 +0100
> Mel Gorman <mgorman@suse.de> wrote:
> 
>> During allocator-intensive workloads, kswapd will be woken frequently
>> causing free memory to oscillate between the high and min watermark.
>> This is expected behaviour.
>>
>> A problem occurs if the highest zone is small.  balance_pgdat()
>> only considers unreclaimable zones when priority is DEF_PRIORITY
>> but sleeping_prematurely considers all zones. It's possible for this
>> sequence to occur
>>
>>   1. kswapd wakes up and enters balance_pgdat()
>>   2. At DEF_PRIORITY, marks highest zone unreclaimable
>>   3. At DEF_PRIORITY-1, ignores highest zone setting end_zone
>>   4. At DEF_PRIORITY-1, calls shrink_slab freeing memory from
>>         highest zone, clearing all_unreclaimable. Highest zone
>>         is still unbalanced
>>   5. kswapd returns and calls sleeping_prematurely
>>   6. sleeping_prematurely looks at *all* zones, not just the ones
>>      being considered by balance_pgdat. The highest small zone
>>      has all_unreclaimable cleared but the zone is not
>>      balanced. all_zones_ok is false so kswapd stays awake
>>
>> This patch corrects the behaviour of sleeping_prematurely to check
>> the zones balance_pgdat() checked.
> 
> But kswapd is making progress: it's reclaiming slab.  Eventually that
> won't work any more and all_unreclaimable will not be cleared and the
> condition will fix itself up?
> 
> 
> 
> btw,
> 
> 	if (!sleeping_prematurely(...))
> 		sleep();
> 
> hurts my brain.  My brain would prefer
> 
> 	if (kswapd_should_sleep(...))
> 		sleep();
> 
> no?
> 
>> Reported-and-tested-by: Pádraig Brady <P@draigBrady.com>
> 
> But what were the before-and-after observations?  I don't understand
> how this can cause a permanent cpuchew by kswapd.

Context:
  http://marc.info/?t=130865025500001&r=1&w=2
  https://bugzilla.redhat.com/show_bug.cgi?id=712019

Summary:

This will spin kswapd0 on my SNB laptop with 3GB RAM (with small normal zone):

    dd bs=1M count=3000 if=/dev/zero of=spin.test

Basically once a certain amount of data is cached,
kswapd0 will start spinning, until the data
is removed from cache (by `rm spin.test` for example).

cheers,
Pádraig.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 1/4] mm: vmscan: Correct check for kswapd sleeping in sleeping_prematurely
  2011-06-24 14:44   ` Mel Gorman
@ 2011-06-30  2:23     ` KOSAKI Motohiro
  -1 siblings, 0 replies; 82+ messages in thread
From: KOSAKI Motohiro @ 2011-06-30  2:23 UTC (permalink / raw)
  To: mgorman
  Cc: akpm, P, James.Bottomley, colin.king, minchan.kim, luto, riel,
	hannes, linux-mm, linux-kernel

(2011/06/24 23:44), Mel Gorman wrote:
> During allocator-intensive workloads, kswapd will be woken frequently
> causing free memory to oscillate between the high and min watermark.
> This is expected behaviour.
> 
> A problem occurs if the highest zone is small.  balance_pgdat()
> only considers unreclaimable zones when priority is DEF_PRIORITY
> but sleeping_prematurely considers all zones. It's possible for this
> sequence to occur
> 
>   1. kswapd wakes up and enters balance_pgdat()
>   2. At DEF_PRIORITY, marks highest zone unreclaimable
>   3. At DEF_PRIORITY-1, ignores highest zone setting end_zone
>   4. At DEF_PRIORITY-1, calls shrink_slab freeing memory from
>         highest zone, clearing all_unreclaimable. Highest zone
>         is still unbalanced
>   5. kswapd returns and calls sleeping_prematurely
>   6. sleeping_prematurely looks at *all* zones, not just the ones
>      being considered by balance_pgdat. The highest small zone
>      has all_unreclaimable cleared but the zone is not
>      balanced. all_zones_ok is false so kswapd stays awake
> 
> This patch corrects the behaviour of sleeping_prematurely to check
> the zones balance_pgdat() checked.
> 
> Reported-and-tested-by: Pádraig Brady <P@draigBrady.com>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> ---
>  mm/vmscan.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 8ff834e..841e3bf 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2323,7 +2323,7 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
>  		return true;
>  
>  	/* Check the watermark levels */
> -	for (i = 0; i < pgdat->nr_zones; i++) {
> +	for (i = 0; i <= classzone_idx; i++) {
>  		struct zone *zone = pgdat->node_zones + i;
>  
>  		if (!populated_zone(zone))

sorry for the delay.
	Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>





^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 2/4] mm: vmscan: Do not apply pressure to slab if we are not applying pressure to zone
  2011-06-24 14:44   ` Mel Gorman
@ 2011-06-30  2:37     ` KOSAKI Motohiro
  -1 siblings, 0 replies; 82+ messages in thread
From: KOSAKI Motohiro @ 2011-06-30  2:37 UTC (permalink / raw)
  To: mgorman
  Cc: akpm, P, James.Bottomley, colin.king, minchan.kim, luto, riel,
	hannes, linux-mm, linux-kernel

(2011/06/24 23:44), Mel Gorman wrote:
> During allocator-intensive workloads, kswapd will be woken frequently
> causing free memory to oscillate between the high and min watermark.
> This is expected behaviour.
> 
> When kswapd applies pressure to zones during node balancing, it checks
> if the zone is above a high+balance_gap threshold. If it is, it does
> not apply pressure but it unconditionally shrinks slab on a global
> basis which is excessive. In the event kswapd is being kept awake due to
> a high small unreclaimable zone, it skips zone shrinking but still
> calls shrink_slab().
> 
> Once pressure has been applied, the check for zone being unreclaimable
> is being made before the check is made if all_unreclaimable should be
> set. This miss of unreclaimable can cause has_under_min_watermark_zone
> to be set due to an unreclaimable zone preventing kswapd backing off
> on congestion_wait().
> 
> Reported-and-tested-by: Pádraig Brady <P@draigBrady.com>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> ---
>  mm/vmscan.c |   23 +++++++++++++----------
>  1 files changed, 13 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 841e3bf..9cebed1 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2507,18 +2507,18 @@ loop_again:
>  				KSWAPD_ZONE_BALANCE_GAP_RATIO);
>  			if (!zone_watermark_ok_safe(zone, order,
>  					high_wmark_pages(zone) + balance_gap,
> -					end_zone, 0))
> +					end_zone, 0)) {
>  				shrink_zone(priority, zone, &sc);
> -			reclaim_state->reclaimed_slab = 0;
> -			nr_slab = shrink_slab(&shrink, sc.nr_scanned, lru_pages);
> -			sc.nr_reclaimed += reclaim_state->reclaimed_slab;
> -			total_scanned += sc.nr_scanned;
>  
> -			if (zone->all_unreclaimable)
> -				continue;
> -			if (nr_slab == 0 &&
> -			    !zone_reclaimable(zone))
> -				zone->all_unreclaimable = 1;
> +				reclaim_state->reclaimed_slab = 0;
> +				nr_slab = shrink_slab(&shrink, sc.nr_scanned, lru_pages);
> +				sc.nr_reclaimed += reclaim_state->reclaimed_slab;
> +				total_scanned += sc.nr_scanned;
> +
> +				if (nr_slab == 0 && !zone_reclaimable(zone))
> +					zone->all_unreclaimable = 1;
> +			}
> +
>  			/*
>  			 * If we've done a decent amount of scanning and
>  			 * the reclaim ratio is low, start doing writepage
> @@ -2528,6 +2528,9 @@ loop_again:
>  			    total_scanned > sc.nr_reclaimed + sc.nr_reclaimed / 2)
>  				sc.may_writepage = 1;
>  
> +			if (zone->all_unreclaimable)
> +				continue;
> +
>  			if (!zone_watermark_ok_safe(zone, order,
>  					high_wmark_pages(zone), end_zone, 0)) {
>  				all_zones_ok = 0;

Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>



^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 4/4] mm: vmscan: Only read new_classzone_idx from pgdat when reclaiming successfully
  2011-06-24 14:44   ` Mel Gorman
@ 2011-06-30  9:05     ` KOSAKI Motohiro
  -1 siblings, 0 replies; 82+ messages in thread
From: KOSAKI Motohiro @ 2011-06-30  9:05 UTC (permalink / raw)
  To: mgorman
  Cc: akpm, P, James.Bottomley, colin.king, minchan.kim, luto, riel,
	hannes, linux-mm, linux-kernel

(2011/06/24 23:44), Mel Gorman wrote:
> During allocator-intensive workloads, kswapd will be woken frequently
> causing free memory to oscillate between the high and min watermark.
> This is expected behaviour.  Unfortunately, if the highest zone is
> small, a problem occurs.
> 
> When balance_pgdat() returns, it may be at a lower classzone_idx than
> it started because the highest zone was unreclaimable. Before checking
> if it should go to sleep though, it checks pgdat->classzone_idx which
> when there is no other activity will be MAX_NR_ZONES-1. It interprets
> this as it has been woken up while reclaiming, skips scheduling and
> reclaims again. As there is no useful reclaim work to do, it enters
> into a loop of shrinking slab consuming loads of CPU until the highest
> zone becomes reclaimable for a long period of time.
> 
> There are two problems here. 1) If the returned classzone or order is
> lower, it'll continue reclaiming without scheduling. 2) if the highest
> zone was marked unreclaimable but balance_pgdat() returns immediately
> at DEF_PRIORITY, the new lower classzone is not communicated back to
> kswapd() for sleeping.
> 
> This patch does two things that are related. If the end_zone is
> unreclaimable, this information is communicated back. Second, if
> the classzone or order was reduced due to failing to reclaim, new
> information is not read from pgdat and instead an attempt is made to go
> to sleep. Due to this, it is also necessary that pgdat->classzone_idx
> be initialised each time to pgdat->nr_zones - 1 to avoid re-reads
> being interpreted as wakeups.
> 
> Reported-and-tested-by: Pádraig Brady <P@draigBrady.com>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> ---
>  mm/vmscan.c |   34 +++++++++++++++++++++-------------
>  1 files changed, 21 insertions(+), 13 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index a76b6cc2..fe854d7 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2448,7 +2448,6 @@ loop_again:
>  			if (!zone_watermark_ok_safe(zone, order,
>  					high_wmark_pages(zone), 0, 0)) {
>  				end_zone = i;
> -				*classzone_idx = i;
>  				break;
>  			}
>  		}
> @@ -2528,8 +2527,11 @@ loop_again:
>  			    total_scanned > sc.nr_reclaimed + sc.nr_reclaimed / 2)
>  				sc.may_writepage = 1;
>  
> -			if (zone->all_unreclaimable)
> +			if (zone->all_unreclaimable) {
> +				if (end_zone && end_zone == i)
> +					end_zone--;
>  				continue;
> +			}
>  
>  			if (!zone_watermark_ok_safe(zone, order,
>  					high_wmark_pages(zone), end_zone, 0)) {
> @@ -2709,8 +2711,8 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
>   */
>  static int kswapd(void *p)
>  {
> -	unsigned long order;
> -	int classzone_idx;
> +	unsigned long order, new_order;
> +	int classzone_idx, new_classzone_idx;
>  	pg_data_t *pgdat = (pg_data_t*)p;
>  	struct task_struct *tsk = current;
>  
> @@ -2740,17 +2742,23 @@ static int kswapd(void *p)
>  	tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
>  	set_freezable();
>  
> -	order = 0;
> -	classzone_idx = MAX_NR_ZONES - 1;
> +	order = new_order = 0;
> +	classzone_idx = new_classzone_idx = pgdat->nr_zones - 1;
>  	for ( ; ; ) {
> -		unsigned long new_order;
> -		int new_classzone_idx;
>  		int ret;
>  
> -		new_order = pgdat->kswapd_max_order;
> -		new_classzone_idx = pgdat->classzone_idx;
> -		pgdat->kswapd_max_order = 0;
> -		pgdat->classzone_idx = MAX_NR_ZONES - 1;
> +		/*
> +		 * If the last balance_pgdat was unsuccessful it's unlikely a
> +		 * new request of a similar or harder type will succeed soon
> +		 * so consider going to sleep on the basis we reclaimed at
> +		 */
> +		if (classzone_idx >= new_classzone_idx && order == new_order) {

I'm confused by this. If we take the following scenario, new_classzone_idx may be garbage.

1. new_classzone_idx = pgdat->classzone_idx
2. kswapd_try_to_sleep()
3. classzone_idx = pgdat->classzone_idx
4. balance_pgdat()

Wouldn't we need to reinitialize new_classzone_idx and new_order on the
kswapd_try_to_sleep() path too?



> +			new_order = pgdat->kswapd_max_order;
> +			new_classzone_idx = pgdat->classzone_idx;
> +			pgdat->kswapd_max_order =  0;
> +			pgdat->classzone_idx = pgdat->nr_zones - 1;
> +		}
> +
>  		if (order < new_order || classzone_idx > new_classzone_idx) {
>  			/*
>  			 * Don't sleep if someone wants a larger 'order'


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 1/4] mm: vmscan: Correct check for kswapd sleeping in sleeping_prematurely
  2011-06-28 21:49     ` Andrew Morton
@ 2011-06-30  9:39       ` Mel Gorman
  -1 siblings, 0 replies; 82+ messages in thread
From: Mel Gorman @ 2011-06-30  9:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Pádraig Brady, James Bottomley, Colin King, Minchan Kim,
	Andrew Lutomirski, Rik van Riel, Johannes Weiner, linux-mm,
	linux-kernel

On Tue, Jun 28, 2011 at 02:49:00PM -0700, Andrew Morton wrote:
> On Fri, 24 Jun 2011 15:44:54 +0100
> Mel Gorman <mgorman@suse.de> wrote:
> 
> > During allocator-intensive workloads, kswapd will be woken frequently
> > causing free memory to oscillate between the high and min watermark.
> > This is expected behaviour.
> > 
> > A problem occurs if the highest zone is small.  balance_pgdat()
> > only considers unreclaimable zones when priority is DEF_PRIORITY
> > but sleeping_prematurely considers all zones. It's possible for this
> > sequence to occur
> > 
> >   1. kswapd wakes up and enters balance_pgdat()
> >   2. At DEF_PRIORITY, marks highest zone unreclaimable
> >   3. At DEF_PRIORITY-1, ignores highest zone setting end_zone
> >   4. At DEF_PRIORITY-1, calls shrink_slab freeing memory from
> >         highest zone, clearing all_unreclaimable. Highest zone
> >         is still unbalanced
> >   5. kswapd returns and calls sleeping_prematurely
> >   6. sleeping_prematurely looks at *all* zones, not just the ones
> >      being considered by balance_pgdat. The highest small zone
> >      has all_unreclaimable cleared but the zone is not
> >      balanced. all_zones_ok is false so kswapd stays awake
> > 
> > This patch corrects the behaviour of sleeping_prematurely to check
> > the zones balance_pgdat() checked.
> 
> But kswapd is making progress: it's reclaiming slab.  Eventually that
> won't work any more and all_unreclaimable will not be cleared and the
> condition will fix itself up?
> 

It might, but at that point we've dumped as much slab as we can, which
is very aggressive, and there is no guarantee the condition is fixed
up. For example, if fork is happening often enough due to terminal
usage, there may be just enough allocation requests satisfied from the
highest zone to clear all_unreclaimable during exit.

> btw,
> 
> 	if (!sleeping_prematurely(...))
> 		sleep();
> 
> hurts my brain.  My brain would prefer
> 
> 	if (kswapd_should_sleep(...))
> 		sleep();
> 
> no?
> 

kswapd_try_to_sleep -> should_sleep feels like it would hurt too. I
prefer the sleeping_prematurely name because it indicates what
condition we are checking, but I'm biased and generally suck at naming.

> > Reported-and-tested-by: Pádraig Brady <P@draigBrady.com>
> 
> But what were the before-and-after observations?  I don't understand
> how this can cause a permanent cpuchew by kswapd.
> 

Pádraig has reported on his before-and-after observations.

On its own, this patch doesn't entirely fix his problem because all
the patches are required but I felt that a rolled-up patch would be
too hard to review.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 4/4] mm: vmscan: Only read new_classzone_idx from pgdat when reclaiming successfully
  2011-06-30  9:05     ` KOSAKI Motohiro
@ 2011-06-30 10:19       ` Mel Gorman
  -1 siblings, 0 replies; 82+ messages in thread
From: Mel Gorman @ 2011-06-30 10:19 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: akpm, P, James.Bottomley, colin.king, minchan.kim, luto, riel,
	hannes, linux-mm, linux-kernel

On Thu, Jun 30, 2011 at 06:05:59PM +0900, KOSAKI Motohiro wrote:
> (2011/06/24 23:44), Mel Gorman wrote:
> > During allocator-intensive workloads, kswapd will be woken frequently
> > causing free memory to oscillate between the high and min watermark.
> > This is expected behaviour.  Unfortunately, if the highest zone is
> > small, a problem occurs.
> > 
> > When balance_pgdat() returns, it may be at a lower classzone_idx than
> > it started because the highest zone was unreclaimable. Before checking
> > if it should go to sleep though, it checks pgdat->classzone_idx which
> > when there is no other activity will be MAX_NR_ZONES-1. It interprets
> > this as it has been woken up while reclaiming, skips scheduling and
> > reclaims again. As there is no useful reclaim work to do, it enters
> > into a loop of shrinking slab consuming loads of CPU until the highest
> > zone becomes reclaimable for a long period of time.
> > 
> > There are two problems here. 1) If the returned classzone or order is
> > lower, it'll continue reclaiming without scheduling. 2) if the highest
> > zone was marked unreclaimable but balance_pgdat() returns immediately
> > at DEF_PRIORITY, the new lower classzone is not communicated back to
> > kswapd() for sleeping.
> > 
> > This patch does two things that are related. If the end_zone is
> > unreclaimable, this information is communicated back. Second, if
> > the classzone or order was reduced due to failing to reclaim, new
> > information is not read from pgdat and instead an attempt is made to go
> > to sleep. Due to this, it is also necessary that pgdat->classzone_idx
> > be initialised each time to pgdat->nr_zones - 1 to avoid re-reads
> > being interpreted as wakeups.
> > 
> > Reported-and-tested-by: Pádraig Brady <P@draigBrady.com>
> > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > ---
> >  mm/vmscan.c |   34 +++++++++++++++++++++-------------
> >  1 files changed, 21 insertions(+), 13 deletions(-)
> > 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index a76b6cc2..fe854d7 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2448,7 +2448,6 @@ loop_again:
> >  			if (!zone_watermark_ok_safe(zone, order,
> >  					high_wmark_pages(zone), 0, 0)) {
> >  				end_zone = i;
> > -				*classzone_idx = i;
> >  				break;
> >  			}
> >  		}
> > @@ -2528,8 +2527,11 @@ loop_again:
> >  			    total_scanned > sc.nr_reclaimed + sc.nr_reclaimed / 2)
> >  				sc.may_writepage = 1;
> >  
> > -			if (zone->all_unreclaimable)
> > +			if (zone->all_unreclaimable) {
> > +				if (end_zone && end_zone == i)
> > +					end_zone--;
> >  				continue;
> > +			}
> >  
> >  			if (!zone_watermark_ok_safe(zone, order,
> >  					high_wmark_pages(zone), end_zone, 0)) {
> > @@ -2709,8 +2711,8 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
> >   */
> >  static int kswapd(void *p)
> >  {
> > -	unsigned long order;
> > -	int classzone_idx;
> > +	unsigned long order, new_order;
> > +	int classzone_idx, new_classzone_idx;
> >  	pg_data_t *pgdat = (pg_data_t*)p;
> >  	struct task_struct *tsk = current;
> >  
> > @@ -2740,17 +2742,23 @@ static int kswapd(void *p)
> >  	tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
> >  	set_freezable();
> >  
> > -	order = 0;
> > -	classzone_idx = MAX_NR_ZONES - 1;
> > +	order = new_order = 0;
> > +	classzone_idx = new_classzone_idx = pgdat->nr_zones - 1;
> >  	for ( ; ; ) {
> > -		unsigned long new_order;
> > -		int new_classzone_idx;
> >  		int ret;
> >  
> > -		new_order = pgdat->kswapd_max_order;
> > -		new_classzone_idx = pgdat->classzone_idx;
> > -		pgdat->kswapd_max_order = 0;
> > -		pgdat->classzone_idx = MAX_NR_ZONES - 1;
> > +		/*
> > +		 * If the last balance_pgdat was unsuccessful it's unlikely a
> > +		 * new request of a similar or harder type will succeed soon
> > +		 * so consider going to sleep on the basis we reclaimed at
> > +		 */
> > +		if (classzone_idx >= new_classzone_idx && order == new_order) {
> 
> I'm confusing this. If we take a following scenario, new_classzone_idx may be garbage.
> 
> 1. new_classzone_idx = pgdat->classzone_idx
> 2. kswapd_try_to_sleep()
> 3. classzone_idx = pgdat->classzone_idx
> 4. balance_pgdat()
> 
> Wouldn't we need to reinitialize new_classzone_idx nad new_order at kswapd_try_to_sleep()
> path too?
> 

I don't understand your question. new_classzone_idx is initialised
before the kswapd main loop and, after this patch, is only updated
when balance_pgdat() balanced successfully, but the following
situation can arise

1. Read for balance-request-A (order, classzone) pair
2. Fail balance_pgdat
3. Sleep based on (order, classzone) pair
4. Wake for balance-request-B (order, classzone) pair where
   balance-request-B != balance-request-A
5. Succeed balance_pgdat
6. Compare order,classzone with balance-request-A, which treats
   balance_pgdat() as a failure and tries to go to sleep

This is not the same as new_classzone_idx being "garbage" but is it
what you mean? If so, is this your proposed fix?

diff --git a/mm/vmscan.c b/mm/vmscan.c
index fe854d7..1a518e6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2770,6 +2770,8 @@ static int kswapd(void *p)
 			kswapd_try_to_sleep(pgdat, order, classzone_idx);
 			order = pgdat->kswapd_max_order;
 			classzone_idx = pgdat->classzone_idx;
+			new_order = order;
+			new_classzone_idx = classzone_idx;
 			pgdat->kswapd_max_order = 0;
 			pgdat->classzone_idx = pgdat->nr_zones - 1;
 		}
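
For reference, the combined effect of the two changes on the main loop
is roughly as follows. This is a simplified paraphrase assembled from
the hunks quoted in this thread, not the exact mm/vmscan.c code;
freezer handling, kthread_should_stop() and tracepoints are omitted:

order = new_order = 0;
classzone_idx = new_classzone_idx = pgdat->nr_zones - 1;
for ( ; ; ) {
	/*
	 * Only read a new request from pgdat if the previous
	 * balance_pgdat() satisfied what it was asked for
	 */
	if (classzone_idx >= new_classzone_idx && order == new_order) {
		new_order = pgdat->kswapd_max_order;
		new_classzone_idx = pgdat->classzone_idx;
		pgdat->kswapd_max_order = 0;
		pgdat->classzone_idx = pgdat->nr_zones - 1;
	}

	if (order < new_order || classzone_idx > new_classzone_idx) {
		/* a harder request arrived while reclaiming, don't sleep */
		order = new_order;
		classzone_idx = new_classzone_idx;
	} else {
		kswapd_try_to_sleep(pgdat, order, classzone_idx);
		order = pgdat->kswapd_max_order;
		classzone_idx = pgdat->classzone_idx;
		/* resync so a stale pair is not compared above */
		new_order = order;
		new_classzone_idx = classzone_idx;
		pgdat->kswapd_max_order = 0;
		pgdat->classzone_idx = pgdat->nr_zones - 1;
	}

	order = balance_pgdat(pgdat, order, &classzone_idx);
}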

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* Re: [PATCH 4/4] mm: vmscan: Only read new_classzone_idx from pgdat when reclaiming successfully
  2011-06-24 14:44   ` Mel Gorman
@ 2011-07-19 16:09     ` Minchan Kim
  -1 siblings, 0 replies; 82+ messages in thread
From: Minchan Kim @ 2011-07-19 16:09 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Pádraig Brady, James Bottomley, Colin King,
	Andrew Lutomirski, Rik van Riel, Johannes Weiner, linux-mm,
	linux-kernel

Hi Mel,

Too late review.
At that time, I had no time to look into this patch.

On Fri, Jun 24, 2011 at 03:44:57PM +0100, Mel Gorman wrote:
> During allocator-intensive workloads, kswapd will be woken frequently
> causing free memory to oscillate between the high and min watermark.
> This is expected behaviour.  Unfortunately, if the highest zone is
> small, a problem occurs.
> 
> When balance_pgdat() returns, it may be at a lower classzone_idx than
> it started because the highest zone was unreclaimable. Before checking

Yes.

> if it should go to sleep though, it checks pgdat->classzone_idx which
> when there is no other activity will be MAX_NR_ZONES-1. It interprets

Yes.

> this as it has been woken up while reclaiming, skips scheduling and

Hmm. I can't understand this part.
If balance_pgdat returns lower classzone and there is no other activity,
new_classzone_idx is always MAX_NR_ZONES - 1 so that classzone_idx would be less than
new_classzone_idx. It means it doesn't skip scheduling.

Do I miss something?

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 4/4] mm: vmscan: Only read new_classzone_idx from pgdat when reclaiming successfully
  2011-07-19 16:09     ` Minchan Kim
@ 2011-07-20 10:48       ` Mel Gorman
  -1 siblings, 0 replies; 82+ messages in thread
From: Mel Gorman @ 2011-07-20 10:48 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, Pádraig Brady, James Bottomley, Colin King,
	Andrew Lutomirski, Rik van Riel, Johannes Weiner, linux-mm,
	linux-kernel

On Wed, Jul 20, 2011 at 01:09:03AM +0900, Minchan Kim wrote:
> Hi Mel,
> 
> Too late review.

Never too late.

> At that time, I had no time to look into this patch.
> 
> On Fri, Jun 24, 2011 at 03:44:57PM +0100, Mel Gorman wrote:
> > During allocator-intensive workloads, kswapd will be woken frequently
> > causing free memory to oscillate between the high and min watermark.
> > This is expected behaviour.  Unfortunately, if the highest zone is
> > small, a problem occurs.
> > 
> > When balance_pgdat() returns, it may be at a lower classzone_idx than
> > it started because the highest zone was unreclaimable. Before checking
> 
> Yes.
> 
> > if it should go to sleep though, it checks pgdat->classzone_idx which
> > when there is no other activity will be MAX_NR_ZONES-1. It interprets
> 
> Yes.
> 
> > this as it has been woken up while reclaiming, skips scheduling and
> 
> Hmm. I can't understand this part.
> If balance_pgdat returns lower classzone and there is no other activity,
> new_classzone_idx is always MAX_NR_ZONES - 1 so that classzone_idx would be less than
> new_classzone_idx. It means it doesn't skip scheduling.
> 
> Do I miss something?
> 

It was a few weeks ago so I don't remember if this is the exact
sequence I had in mind at the time of writing, but an example sequence
of events for a node whose highest populated zone is ZONE_NORMAL,
very small, and gets set all_unreclaimable by balance_pgdat() is
below. The key is the "very small" part because pages are getting
freed in the zone but the small size means that unreclaimable gets
set easily.

/*
 * kswapd is woken up for ZONE_NORMAL (as this is the preferred zone
 * as ZONE_HIGHMEM is not populated.
 */

order = pgdat->kswapd_max_order;
classzone_idx = pgdat->classzone_idx;				/* classzone_idx == ZONE_NORMAL */
pgdat->kswapd_max_order = 0;
pgdat->classzone_idx = MAX_NR_ZONES - 1;
order = balance_pgdat(pgdat, order, &classzone_idx);		/* classzone_idx == ZONE_NORMAL even though
								 * the highest zone was set unreclaimable
								 * and it exited scanning ZONE_DMA32
								 * because we did not communicate that
								 * information back
								 */
new_order = pgdat->kswapd_max_order;				/* new_order = 0 */
new_classzone_idx = pgdat->classzone_idx;			/* new_classzone_idx == ZONE_HIGHMEM
								 * because that is what classzone_idx
								 * gets reset to
								 */
if (order < new_order || classzone_idx > new_classzone_idx) {
	/* does not sleep, this branch not taken */
} else {
	/* tries to sleep, goes here */
	try_to_sleep(ZONE_NORMAL)
		sleeping_prematurely(ZONE_NORMAL)		/* finds zone unbalanced so skips scheduling */
        order = pgdat->kswapd_max_order;
        classzone_idx = pgdat->classzone_idx;			/* classzone_idx == ZONE_HIGHMEM now which
								 * is higher than what it was originally
								 * woken for
								 */
}

/* Looped around to balance_pgdat() again */
order = balance_pgdat()

Between when all_unreclaimable is set and when kswapd goes fully
to sleep, a page is freed clearing all_unreclaimable, so kswapd
rechecks all the zones, finds the highest one is not balanced and
skips scheduling.

A variation is that the lower zones are above the low watermark so
the page allocator is not waking kswapd and it should sleep on the
waitqueue. However, it only schedules for HZ/10, during which a page
is freed, the highest zone gets all_unreclaimable cleared and so it
stays awake. In this case, it has reached a scheduling point but it
is not going fully to sleep on the waitqueue as it should.
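
To make the HZ/10 point concrete, the sleep path has roughly this
shape (a simplified paraphrase for discussion with the declarations of
wait/remaining omitted, not the exact kswapd_try_to_sleep() source):

/* simplified: two-stage sleep guarded by sleeping_prematurely() */
prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);

if (!sleeping_prematurely(pgdat, order, remaining, classzone_idx)) {
	remaining = schedule_timeout(HZ/10);	/* short nap first */
	finish_wait(&pgdat->kswapd_wait, &wait);
	prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
}

/*
 * Only go fully to sleep if the node still looks balanced after the
 * nap. If a page freed in the tiny highest zone cleared
 * all_unreclaimable in the meantime, the recheck fails and kswapd
 * stays awake, which is the loop described above.
 */
if (!sleeping_prematurely(pgdat, order, remaining, classzone_idx))
	schedule();

finish_wait(&pgdat->kswapd_wait, &wait);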

I see now the problem with the changelog: it sucks and could have
been a lot better at explaining why kswapd stays awake when the
information is not communicated back, and why classzone_idx being set
to MAX_NR_ZONES-1 is sloppy :(

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 4/4] mm: vmscan: Only read new_classzone_idx from pgdat when reclaiming successfully
  2011-07-20 10:48       ` Mel Gorman
@ 2011-07-21 15:30         ` Minchan Kim
  -1 siblings, 0 replies; 82+ messages in thread
From: Minchan Kim @ 2011-07-21 15:30 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Pádraig Brady, James Bottomley, Colin King,
	Andrew Lutomirski, Rik van Riel, Johannes Weiner, linux-mm,
	linux-kernel

On Wed, Jul 20, 2011 at 11:48:47AM +0100, Mel Gorman wrote:
> On Wed, Jul 20, 2011 at 01:09:03AM +0900, Minchan Kim wrote:
> > Hi Mel,
> > 
> > Too late review.
> 
> Never too late.
> 
> > At that time, I had no time to look into this patch.
> > 
> > On Fri, Jun 24, 2011 at 03:44:57PM +0100, Mel Gorman wrote:
> > > During allocator-intensive workloads, kswapd will be woken frequently
> > > causing free memory to oscillate between the high and min watermark.
> > > This is expected behaviour.  Unfortunately, if the highest zone is
> > > small, a problem occurs.
> > > 
> > > When balance_pgdat() returns, it may be at a lower classzone_idx than
> > > it started because the highest zone was unreclaimable. Before checking
> > 
> > Yes.
> > 
> > > if it should go to sleep though, it checks pgdat->classzone_idx which
> > > when there is no other activity will be MAX_NR_ZONES-1. It interprets
> > 
> > Yes.
> > 
> > > this as it has been woken up while reclaiming, skips scheduling and
> > 
> > Hmm. I can't understand this part.
> > If balance_pgdat returns lower classzone and there is no other activity,
> > new_classzone_idx is always MAX_NR_ZONES - 1 so that classzone_idx would be less than
> > new_classzone_idx. It means it doesn't skip scheduling.
> > 
> > Do I miss something?
> > 
> 
> It was a few weeks ago so I don't rememember if this is the exact
> sequence I had in mind at the time of writing but an example sequence
> of events is for a node whose highest populated zone is ZONE_NORMAL,
> very small, and gets set all_unreclaimable by balance_pgdat() looks
> is below. The key is the "very small" part because pages are getting
> freed in the zone but the small size means that unreclaimable gets
> set easily.
> 
> /*
>  * kswapd is woken up for ZONE_NORMAL (as this is the preferred zone
>  * as ZONE_HIGHMEM is not populated.
>  */
> 
> order = pgdat->kswapd_max_order;
> classzone_idx = pgdat->classzone_idx;				/* classzone_idx == ZONE_NORMAL */
> pgdat->kswapd_max_order = 0;
> pgdat->classzone_idx = MAX_NR_ZONES - 1;
> order = balance_pgdat(pgdat, order, &classzone_idx);		/* classzone_idx == ZONE_NORMAL even though
> 								 * the highest zone was set unreclaimable
> 								 * and it exited scanning ZONE_DMA32
> 								 * because we did not communicate that
> 								 * information back

								Yes. It's too bad.

> 								 */
> new_order = pgdat->kswapd_max_order;				/* new_order = 0 */
> new_classzone_idx = pgdat->classzone_idx;			/* new_classzone_idx == ZONE_HIGHMEM
> 								 * because that is what classzone_idx
> 								 * gets reset to

								Yes. new_classzone_idx is ZONE_HIGHMEM.

> 								 */
> if (order < new_order || classzone_idx > new_classzone_idx) {
> 	/* does not sleep, this branch not taken */
> } else {
> 	/* tries to sleep, goes here */
> 	try_to_sleep(ZONE_NORMAL)
> 		sleeping_prematurely(ZONE_NORMAL)		/* finds zone unbalanced so skips scheduling */
>         order = pgdat->kswapd_max_order;
>         classzone_idx = pgdat->classzone_idx;			/* classzone_idx == ZONE_HIGHMEM now which
> 								 * is higher than what it was originally
> 								 * woken for
> 								 */

								But is it a problem?
								it should be reset to ZONE_NORMAL in balance_pgdat as high zone isn't populated.
> }
> 
> /* Looped around to balance_pgdat() again */
> order = balance_pgdat()
> 
> Between when all_unreclaimable is set and before before kswapd
> goes fully to sleep, a page is freed clearing all_reclaimable so
> it rechecks all the zones, find the highest one is not balanced and
> skip scheduling.

Yes and it could be repeated forever.
Apparently, we should fix it with this patch, but I have a question about this patch.

Quote from your patch

> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index a76b6cc2..fe854d7 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2448,7 +2448,6 @@ loop_again:
>                       if (!zone_watermark_ok_safe(zone, order,
>                                       high_wmark_pages(zone), 0, 0)) {
>                               end_zone = i;
> -                             *classzone_idx = i;
>                               break;
>                       }
>               }
> @@ -2528,8 +2527,11 @@ loop_again:
>                           total_scanned > sc.nr_reclaimed + sc.nr_reclaimed / 2)
>                               sc.may_writepage = 1;
>  
> -                     if (zone->all_unreclaimable)
> +                     if (zone->all_unreclaimable) {
> +                             if (end_zone && end_zone == i)
> +                                     end_zone--;

Until now, it's good.

>                               continue;
> +                     }
>  
>                       if (!zone_watermark_ok_safe(zone, order,
>                                       high_wmark_pages(zone), end_zone, 0)) {
> @@ -2709,8 +2711,8 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
>   */
>  static int kswapd(void *p)
>  {
> -     unsigned long order;
> -     int classzone_idx;
> +     unsigned long order, new_order;
> +     int classzone_idx, new_classzone_idx;
>       pg_data_t *pgdat = (pg_data_t*)p;
>       struct task_struct *tsk = current;
>  
> @@ -2740,17 +2742,23 @@ static int kswapd(void *p)
>       tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
>       set_freezable();
>  
> -     order = 0;
> -     classzone_idx = MAX_NR_ZONES - 1;
> +     order = new_order = 0;
> +     classzone_idx = new_classzone_idx = pgdat->nr_zones - 1;
>       for ( ; ; ) {
> -             unsigned long new_order;
> -             int new_classzone_idx;
>               int ret;
>  
> -             new_order = pgdat->kswapd_max_order;
> -             new_classzone_idx = pgdat->classzone_idx;
> -             pgdat->kswapd_max_order = 0;
> -             pgdat->classzone_idx = MAX_NR_ZONES - 1;
> +             /*
> +              * If the last balance_pgdat was unsuccessful it's unlikely a
> +              * new request of a similar or harder type will succeed soon
> +              * so consider going to sleep on the basis we reclaimed at
> +              */
> +             if (classzone_idx >= new_classzone_idx && order == new_order) {
> +                     new_order = pgdat->kswapd_max_order;
> +                     new_classzone_idx = pgdat->classzone_idx;
> +                     pgdat->kswapd_max_order =  0;
> +                     pgdat->classzone_idx = pgdat->nr_zones - 1;
> +             }
> +

But in this part.
Why do we need this?
Although we pass the high zone instead of the zone we reclaimed at, it would be reset to
ZONE_NORMAL if it's not populated. If the high zone is populated and it couldn't meet
the watermark, it could be balanced and the next normal zone would be balanced, too.

Could you explain what problem happens without this part?

>               if (order < new_order || classzone_idx > new_classzone_idx) {
>                       /*
>                        * Don't sleep if someone wants a larger 'order'
> @@ -2763,7 +2771,7 @@ static int kswapd(void *p)
>                       order = pgdat->kswapd_max_order;
>                       classzone_idx = pgdat->classzone_idx;
>                       pgdat->kswapd_max_order = 0;
> -                     pgdat->classzone_idx = MAX_NR_ZONES - 1;
> +                     pgdat->classzone_idx = pgdat->nr_zones - 1;
>               }
>  
>               ret = try_to_freeze();
> -- 
> 1.7.3.4
> 



> 
> A variation is that it the lower zones are above the low watermark so
> the page allocator is not waking kswapd and it should sleep on the
> waitqueue. However, it only schedules for HZ/10 during which a page
> is freed, the highest zone gets all_unreclaimable cleared and so it
> stays awake. In this case, it has reached a scheduling point but it
> is not going fully to sleep on the waitqueue as it should.
> 
> I see now the problem with the changelog, it sucks and could have
> been a lot better at explaining why kswapd stays awake when the
> information is not communicated back and why classzone_idx being set
> to MAX_NR_ZONES-1 is sloppy :(
> 
> -- 
> Mel Gorman
> SUSE Labs

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 0/4] Stop kswapd consuming 100% CPU when highest zone is small
  2011-06-24 14:44 ` Mel Gorman
@ 2011-07-21 15:37   ` Minchan Kim
  -1 siblings, 0 replies; 82+ messages in thread
From: Minchan Kim @ 2011-07-21 15:37 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Pádraig Brady, James Bottomley, Colin King,
	Andrew Lutomirski, Rik van Riel, Johannes Weiner, linux-mm,
	linux-kernel

On Fri, Jun 24, 2011 at 03:44:53PM +0100, Mel Gorman wrote:
> (Built this time and passed a basic sniff-test.)
> 
> During allocator-intensive workloads, kswapd will be woken frequently
> causing free memory to oscillate between the high and min watermark.
> This is expected behaviour.  Unfortunately, if the highest zone is
> small, a problem occurs.
> 
> This seems to happen most with recent sandybridge laptops but it's
> probably a co-incidence as some of these laptops just happen to have
> a small Normal zone. The reproduction case is almost always during
> copying large files that kswapd pegs at 100% CPU until the file is
> deleted or cache is dropped.
> 
> The problem is mostly down to sleeping_prematurely() keeping kswapd
> awake when the highest zone is small and unreclaimable and compounded
> by the fact we shrink slabs even when not shrinking zones causing a lot
> of time to be spent in shrinkers and a lot of memory to be reclaimed.
> 
> Patch 1 corrects sleeping_prematurely to check the zones matching
> 	the classzone_idx instead of all zones.
> 
> Patch 2 avoids shrinking slab when we are not shrinking a zone.
> 
> Patch 3 notes that sleeping_prematurely is checking lower zones against
> 	a high classzone which is not what allocators or balance_pgdat()
> 	is doing leading to an artifical believe that kswapd should be
> 	still awake.
> 
> Patch 4 notes that when balance_pgdat() gives up on a high zone that the
> 	decision is not communicated to sleeping_prematurely()
> 
> This problem affects 2.6.38.8 for certain and is expected to affect
> 2.6.39 and 3.0-rc4 as well. If accepted, they need to go to -stable
> to be picked up by distros and this series is against 3.0-rc4. I've
> cc'd people that reported similar problems recently to see if they
> still suffer from the problem and if this fixes it.
> 

Good!
This patch solved the problem.
But there is still a mystery.

In the log, we could see excessive shrink_slab calls.
And as you know, we merged a patch which adds a cond_resched() at the end of
the function in shrink_slab. So other tasks should get the CPU and we should
not see kswapd at 100% CPU, I think.

Do you have any idea about this?

>  mm/vmscan.c |   59 +++++++++++++++++++++++++++++++++++------------------------
>  1 files changed, 35 insertions(+), 24 deletions(-)
> 
> -- 
> 1.7.3.4
> 

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 4/4] mm: vmscan: Only read new_classzone_idx from pgdat when reclaiming successfully
  2011-07-21 15:30         ` Minchan Kim
@ 2011-07-21 16:07           ` Mel Gorman
  -1 siblings, 0 replies; 82+ messages in thread
From: Mel Gorman @ 2011-07-21 16:07 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, Pádraig Brady, James Bottomley, Colin King,
	Andrew Lutomirski, Rik van Riel, Johannes Weiner, linux-mm,
	linux-kernel

On Fri, Jul 22, 2011 at 12:30:07AM +0900, Minchan Kim wrote:
> On Wed, Jul 20, 2011 at 11:48:47AM +0100, Mel Gorman wrote:
> > On Wed, Jul 20, 2011 at 01:09:03AM +0900, Minchan Kim wrote:
> > > Hi Mel,
> > > 
> > > Too late review.
> > 
> > Never too late.
> > 
> > > At that time, I had no time to look into this patch.
> > > 
> > > On Fri, Jun 24, 2011 at 03:44:57PM +0100, Mel Gorman wrote:
> > > > During allocator-intensive workloads, kswapd will be woken frequently
> > > > causing free memory to oscillate between the high and min watermark.
> > > > This is expected behaviour.  Unfortunately, if the highest zone is
> > > > small, a problem occurs.
> > > > 
> > > > When balance_pgdat() returns, it may be at a lower classzone_idx than
> > > > it started because the highest zone was unreclaimable. Before checking
> > > 
> > > Yes.
> > > 
> > > > if it should go to sleep though, it checks pgdat->classzone_idx which
> > > > when there is no other activity will be MAX_NR_ZONES-1. It interprets
> > > 
> > > Yes.
> > > 
> > > > this as it has been woken up while reclaiming, skips scheduling and
> > > 
> > > Hmm. I can't understand this part.
> > > If balance_pgdat returns lower classzone and there is no other activity,
> > > new_classzone_idx is always MAX_NR_ZONES - 1 so that classzone_idx would be less than
> > > new_classzone_idx. It means it doesn't skip scheduling.
> > > 
> > > Do I miss something?
> > > 
> > 
> > It was a few weeks ago so I don't rememember if this is the exact
> > sequence I had in mind at the time of writing but an example sequence
> > of events is for a node whose highest populated zone is ZONE_NORMAL,
> > very small, and gets set all_unreclaimable by balance_pgdat() looks
> > is below. The key is the "very small" part because pages are getting
> > freed in the zone but the small size means that unreclaimable gets
> > set easily.
> > 
> > /*
> >  * kswapd is woken up for ZONE_NORMAL (as this is the preferred zone
> >  * as ZONE_HIGHMEM is not populated.
> >  */
> > 
> > order = pgdat->kswapd_max_order;
> > classzone_idx = pgdat->classzone_idx;				/* classzone_idx == ZONE_NORMAL */
> > pgdat->kswapd_max_order = 0;
> > pgdat->classzone_idx = MAX_NR_ZONES - 1;
> > order = balance_pgdat(pgdat, order, &classzone_idx);		/* classzone_idx == ZONE_NORMAL even though
> > 								 * the highest zone was set unreclaimable
> > 								 * and it exited scanning ZONE_DMA32
> > 								 * because we did not communicate that
> > 								 * information back
> 
> 								Yes. It's too bad.
> 
> > 								 */
> > new_order = pgdat->kswapd_max_order;				/* new_order = 0 */
> > new_classzone_idx = pgdat->classzone_idx;			/* new_classzone_idx == ZONE_HIGHMEM
> > 								 * because that is what classzone_idx
> > 								 * gets reset to
> 
> 								Yes. new_classzone_idx is ZONE_HIGHMEM.
> 
> > 								 */
> > if (order < new_order || classzone_idx > new_classzone_idx) {
> > 	/* does not sleep, this branch not taken */
> > } else {
> > 	/* tries to sleep, goes here */
> > 	try_to_sleep(ZONE_NORMAL)
> > 		sleeping_prematurely(ZONE_NORMAL)		/* finds zone unbalanced so skips scheduling */
> >         order = pgdat->kswapd_max_order;
> >         classzone_idx = pgdat->classzone_idx;			/* classzone_idx == ZONE_HIGHMEM now which
> > 								 * is higher than what it was originally
> > 								 * woken for
> > 								 */
> 
> 								But is it a problem?
> 								it should be reset to ZONE_NORMAL in balance_pgdat as high zone isn't populated.

At the very least, it's sloppy.

> > }
> > 
> > /* Looped around to balance_pgdat() again */
> > order = balance_pgdat()
> > 
> > Between when all_unreclaimable is set and before before kswapd
> > goes fully to sleep, a page is freed clearing all_reclaimable so
> > it rechecks all the zones, find the highest one is not balanced and
> > skip scheduling.
> 
> Yes and it could be repeated forever.

Resulting in chewing up large amounts of CPU.

> Apparently, we should fix wit this patch but I have a qustion about this patch.
> 
> Quote from your patch
> 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index a76b6cc2..fe854d7 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2448,7 +2448,6 @@ loop_again:
> >                       if (!zone_watermark_ok_safe(zone, order,
> >                                       high_wmark_pages(zone), 0, 0)) {
> >                               end_zone = i;
> > -                             *classzone_idx = i;
> >                               break;
> >                       }
> >               }
> > @@ -2528,8 +2527,11 @@ loop_again:
> >                           total_scanned > sc.nr_reclaimed + sc.nr_reclaimed / 2)
> >                               sc.may_writepage = 1;
> >  
> > -                     if (zone->all_unreclaimable)
> > +                     if (zone->all_unreclaimable) {
> > +                             if (end_zone && end_zone == i)
> > +                                     end_zone--;
> 
> Until now, It's good.
> 
> >                               continue;
> > +                     }
> >  
> >                       if (!zone_watermark_ok_safe(zone, order,
> >                                       high_wmark_pages(zone), end_zone, 0)) {
> > @@ -2709,8 +2711,8 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
> >   */
> >  static int kswapd(void *p)
> >  {
> > -     unsigned long order;
> > -     int classzone_idx;
> > +     unsigned long order, new_order;
> > +     int classzone_idx, new_classzone_idx;
> >       pg_data_t *pgdat = (pg_data_t*)p;
> >       struct task_struct *tsk = current;
> >  
> > @@ -2740,17 +2742,23 @@ static int kswapd(void *p)
> >       tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
> >       set_freezable();
> >  
> > -     order = 0;
> > -     classzone_idx = MAX_NR_ZONES - 1;
> > +     order = new_order = 0;
> > +     classzone_idx = new_classzone_idx = pgdat->nr_zones - 1;
> >       for ( ; ; ) {
> > -             unsigned long new_order;
> > -             int new_classzone_idx;
> >               int ret;
> >  
> > -             new_order = pgdat->kswapd_max_order;
> > -             new_classzone_idx = pgdat->classzone_idx;
> > -             pgdat->kswapd_max_order = 0;
> > -             pgdat->classzone_idx = MAX_NR_ZONES - 1;
> > +             /*
> > +              * If the last balance_pgdat was unsuccessful it's unlikely a
> > +              * new request of a similar or harder type will succeed soon
> > +              * so consider going to sleep on the basis we reclaimed at
> > +              */
> > +             if (classzone_idx >= new_classzone_idx && order == new_order) {
> > +                     new_order = pgdat->kswapd_max_order;
> > +                     new_classzone_idx = pgdat->classzone_idx;
> > +                     pgdat->kswapd_max_order =  0;
> > +                     pgdat->classzone_idx = pgdat->nr_zones - 1;
> > +             }
> > +
> 
> But in this part.
> Why do we need this?

Let's say it's a fork-heavy workload and it is routinely being woken
for order-1 allocations and the highest zone is very small. For the
most part, it's ok because the allocations are being satisfied from
the lower zones which kswapd has no problem balancing.

However, by reading the information even after failing to
balance, kswapd continues balancing for order-1 due to reading
pgdat->kswapd_max_order, each time failing for the highest zone. It
only takes one wakeup request per balance_pgdat() to keep kswapd
awake trying to balance the highest zone in a continual loop.

By avoiding this read, kswapd will try to go to sleep after checking
all the watermarks and all_unreclaimable. If the watermarks are ok, it
will sleep until woken up due to the lower zones hitting their min
watermarks.
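
Put another way, the guard is doing this (re-annotating the hunk
quoted above for the fork-heavy case; nothing new, just the intent
spelled out):

/*
 * order/classzone_idx still hold the request that just failed for
 * the highest zone. An unguarded read here would pick up the next
 * order-1 wakeup from the fork-heavy workload and loop straight
 * back into balance_pgdat() without ever reaching the sleep check.
 */
if (classzone_idx >= new_classzone_idx && order == new_order) {
	/* the last balance met the previous request, take a new one */
	new_order = pgdat->kswapd_max_order;
	new_classzone_idx = pgdat->classzone_idx;
	pgdat->kswapd_max_order = 0;
	pgdat->classzone_idx = pgdat->nr_zones - 1;
}
/* otherwise fall through and let kswapd_try_to_sleep() decide */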

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 4/4] mm: vmscan: Only read new_classzone_idx from pgdat when reclaiming successfully
@ 2011-07-21 16:07           ` Mel Gorman
  0 siblings, 0 replies; 82+ messages in thread
From: Mel Gorman @ 2011-07-21 16:07 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, P?draig Brady, James Bottomley, Colin King,
	Andrew Lutomirski, Rik van Riel, Johannes Weiner, linux-mm,
	linux-kernel

On Fri, Jul 22, 2011 at 12:30:07AM +0900, Minchan Kim wrote:
> On Wed, Jul 20, 2011 at 11:48:47AM +0100, Mel Gorman wrote:
> > On Wed, Jul 20, 2011 at 01:09:03AM +0900, Minchan Kim wrote:
> > > Hi Mel,
> > > 
> > > Too late review.
> > 
> > Never too late.
> > 
> > > At that time, I had no time to look into this patch.
> > > 
> > > On Fri, Jun 24, 2011 at 03:44:57PM +0100, Mel Gorman wrote:
> > > > During allocator-intensive workloads, kswapd will be woken frequently
> > > > causing free memory to oscillate between the high and min watermark.
> > > > This is expected behaviour.  Unfortunately, if the highest zone is
> > > > small, a problem occurs.
> > > > 
> > > > When balance_pgdat() returns, it may be at a lower classzone_idx than
> > > > it started because the highest zone was unreclaimable. Before checking
> > > 
> > > Yes.
> > > 
> > > > if it should go to sleep though, it checks pgdat->classzone_idx which
> > > > when there is no other activity will be MAX_NR_ZONES-1. It interprets
> > > 
> > > Yes.
> > > 
> > > > this as it has been woken up while reclaiming, skips scheduling and
> > > 
> > > Hmm. I can't understand this part.
> > > If balance_pgdat returns lower classzone and there is no other activity,
> > > new_classzone_idx is always MAX_NR_ZONES - 1 so that classzone_idx would be less than
> > > new_classzone_idx. It means it doesn't skip scheduling.
> > > 
> > > Do I miss something?
> > > 
> > 
> > It was a few weeks ago so I don't rememember if this is the exact
> > sequence I had in mind at the time of writing but an example sequence
> > of events is for a node whose highest populated zone is ZONE_NORMAL,
> > very small, and gets set all_unreclaimable by balance_pgdat() looks
> > is below. The key is the "very small" part because pages are getting
> > freed in the zone but the small size means that unreclaimable gets
> > set easily.
> > 
> > /*
> >  * kswapd is woken up for ZONE_NORMAL (as this is the preferred zone
> >  * as ZONE_HIGHMEM is not populated.
> >  */
> > 
> > order = pgdat->kswapd_max_order;
> > classzone_idx = pgdat->classzone_idx;				/* classzone_idx == ZONE_NORMAL */
> > pgdat->kswapd_max_order = 0;
> > pgdat->classzone_idx = MAX_NR_ZONES - 1;
> > order = balance_pgdat(pgdat, order, &classzone_idx);		/* classzone_idx == ZONE_NORMAL even though
> > 								 * the highest zone was set unreclaimable
> > 								 * and it exited scanning ZONE_DMA32
> > 								 * because we did not communicate that
> > 								 * information back
> 
> 								Yes. It's too bad.
> 
> > 								 */
> > new_order = pgdat->kswapd_max_order;				/* new_order = 0 */
> > new_classzone_idx = pgdat->classzone_idx;			/* new_classzone_idx == ZONE_HIGHMEM
> > 								 * because that is what classzone_idx
> > 								 * gets reset to
> 
> 								Yes. new_classzone_idx is ZONE_HIGHMEM.
> 
> > 								 */
> > if (order < new_order || classzone_idx > new_classzone_idx) {
> > 	/* does not sleep, this branch not taken */
> > } else {
> > 	/* tries to sleep, goes here */
> > 	try_to_sleep(ZONE_NORMAL)
> > 		sleeping_prematurely(ZONE_NORMAL)		/* finds zone unbalanced so skips scheduling */
> >         order = pgdat->kswapd_max_order;
> >         classzone_idx = pgdat->classzone_idx;			/* classzone_idx == ZONE_HIGHMEM now which
> > 								 * is higher than what it was originally
> > 								 * woken for
> > 								 */
> 
> 								But is it a problem?
> 								it should be reset to ZONE_NORMAL in balance_pgdat as high zone isn't populated.

At the very least, it's sloppy.

> > }
> > 
> > /* Looped around to balance_pgdat() again */
> > order = balance_pgdat()
> > 
> > Between when all_unreclaimable is set and before before kswapd
> > goes fully to sleep, a page is freed clearing all_reclaimable so
> > it rechecks all the zones, find the highest one is not balanced and
> > skip scheduling.
> 
> Yes and it could be repeated forever.

Resulting in chewing up large amounts of CPU.

> Apparently, we should fix wit this patch but I have a qustion about this patch.
> 
> Quote from your patch
> 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index a76b6cc2..fe854d7 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2448,7 +2448,6 @@ loop_again:
> >                       if (!zone_watermark_ok_safe(zone, order,
> >                                       high_wmark_pages(zone), 0, 0)) {
> >                               end_zone = i;
> > -                             *classzone_idx = i;
> >                               break;
> >                       }
> >               }
> > @@ -2528,8 +2527,11 @@ loop_again:
> >                           total_scanned > sc.nr_reclaimed + sc.nr_reclaimed / 2)
> >                               sc.may_writepage = 1;
> >  
> > -                     if (zone->all_unreclaimable)
> > +                     if (zone->all_unreclaimable) {
> > +                             if (end_zone && end_zone == i)
> > +                                     end_zone--;
> 
> Until now, It's good.
> 
> >                               continue;
> > +                     }
> >  
> >                       if (!zone_watermark_ok_safe(zone, order,
> >                                       high_wmark_pages(zone), end_zone, 0)) {
> > @@ -2709,8 +2711,8 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
> >   */
> >  static int kswapd(void *p)
> >  {
> > -     unsigned long order;
> > -     int classzone_idx;
> > +     unsigned long order, new_order;
> > +     int classzone_idx, new_classzone_idx;
> >       pg_data_t *pgdat = (pg_data_t*)p;
> >       struct task_struct *tsk = current;
> >  
> > @@ -2740,17 +2742,23 @@ static int kswapd(void *p)
> >       tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
> >       set_freezable();
> >  
> > -     order = 0;
> > -     classzone_idx = MAX_NR_ZONES - 1;
> > +     order = new_order = 0;
> > +     classzone_idx = new_classzone_idx = pgdat->nr_zones - 1;
> >       for ( ; ; ) {
> > -             unsigned long new_order;
> > -             int new_classzone_idx;
> >               int ret;
> >  
> > -             new_order = pgdat->kswapd_max_order;
> > -             new_classzone_idx = pgdat->classzone_idx;
> > -             pgdat->kswapd_max_order = 0;
> > -             pgdat->classzone_idx = MAX_NR_ZONES - 1;
> > +             /*
> > +              * If the last balance_pgdat was unsuccessful it's unlikely a
> > +              * new request of a similar or harder type will succeed soon
> > +              * so consider going to sleep on the basis we reclaimed at
> > +              */
> > +             if (classzone_idx >= new_classzone_idx && order == new_order) {
> > +                     new_order = pgdat->kswapd_max_order;
> > +                     new_classzone_idx = pgdat->classzone_idx;
> > +                     pgdat->kswapd_max_order =  0;
> > +                     pgdat->classzone_idx = pgdat->nr_zones - 1;
> > +             }
> > +
> 
> But in this part.
> Why do we need this?

Let's say it's a fork-heavy workload and kswapd is routinely being woken
for order-1 allocations and the highest zone is very small. For the
most part, it's ok because the allocations are being satisfied from
the lower zones which kswapd has no problem balancing.

However, by re-reading pgdat->kswapd_max_order even after failing
to balance, kswapd keeps trying to balance for order-1, each time
failing for the highest zone. It only takes one wakeup request per
balance_pgdat() call to keep kswapd awake, trying to balance the
highest zone in a continual loop.

By avoiding this read, kswapd will try and go to sleep after checking
all the watermarks and all_unreclaimable. If the watermarks are ok, it
will sleep until woken up due to the lower zones hitting their min
watermarks.
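
For clarity, the loop roughly ends up looking like this. This is a
simplified sketch of the intended flow rather than the literal patched
source (freezer and kthread_should_stop() handling omitted):

order = new_order = 0;
classzone_idx = new_classzone_idx = pgdat->nr_zones - 1;
for ( ; ; ) {
	/*
	 * Only accept a new request if the previous balance_pgdat()
	 * satisfied the one it was already given. If it gave up
	 * (returned a lower classzone_idx), keep the old request and
	 * fall through to the sleep check instead of re-reading pgdat.
	 */
	if (classzone_idx >= new_classzone_idx && order == new_order) {
		new_order = pgdat->kswapd_max_order;
		new_classzone_idx = pgdat->classzone_idx;
		pgdat->kswapd_max_order = 0;
		pgdat->classzone_idx = pgdat->nr_zones - 1;
	}

	if (order < new_order || classzone_idx > new_classzone_idx) {
		/* a harder request arrived; service it without sleeping */
		order = new_order;
		classzone_idx = new_classzone_idx;
	} else {
		kswapd_try_to_sleep(pgdat, order, classzone_idx);
		order = pgdat->kswapd_max_order;
		classzone_idx = pgdat->classzone_idx;
	}

	order = balance_pgdat(pgdat, order, &classzone_idx);
}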

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 0/4] Stop kswapd consuming 100% CPU when highest zone is small
  2011-07-21 15:37   ` Minchan Kim
@ 2011-07-21 16:09     ` Mel Gorman
  -1 siblings, 0 replies; 82+ messages in thread
From: Mel Gorman @ 2011-07-21 16:09 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, Pádraig Brady, James Bottomley, Colin King,
	Andrew Lutomirski, Rik van Riel, Johannes Weiner, linux-mm,
	linux-kernel

On Fri, Jul 22, 2011 at 12:37:22AM +0900, Minchan Kim wrote:
> On Fri, Jun 24, 2011 at 03:44:53PM +0100, Mel Gorman wrote:
> > (Built this time and passed a basic sniff-test.)
> > 
> > During allocator-intensive workloads, kswapd will be woken frequently
> > causing free memory to oscillate between the high and min watermark.
> > This is expected behaviour.  Unfortunately, if the highest zone is
> > small, a problem occurs.
> > 
> > This seems to happen most with recent sandybridge laptops but it's
> > probably a co-incidence as some of these laptops just happen to have
> > a small Normal zone. The reproduction case is almost always during
> > copying large files that kswapd pegs at 100% CPU until the file is
> > deleted or cache is dropped.
> > 
> > The problem is mostly down to sleeping_prematurely() keeping kswapd
> > awake when the highest zone is small and unreclaimable and compounded
> > by the fact we shrink slabs even when not shrinking zones causing a lot
> > of time to be spent in shrinkers and a lot of memory to be reclaimed.
> > 
> > Patch 1 corrects sleeping_prematurely to check the zones matching
> > 	the classzone_idx instead of all zones.
> > 
> > Patch 2 avoids shrinking slab when we are not shrinking a zone.
> > 
> > Patch 3 notes that sleeping_prematurely is checking lower zones against
> > 	a high classzone which is not what allocators or balance_pgdat()
> > 	is doing leading to an artifical believe that kswapd should be
> > 	still awake.
> > 
> > Patch 4 notes that when balance_pgdat() gives up on a high zone that the
> > 	decision is not communicated to sleeping_prematurely()
> > 
> > This problem affects 2.6.38.8 for certain and is expected to affect
> > 2.6.39 and 3.0-rc4 as well. If accepted, they need to go to -stable
> > to be picked up by distros and this series is against 3.0-rc4. I've
> > cc'd people that reported similar problems recently to see if they
> > still suffer from the problem and if this fixes it.
> > 
> 
> Good!
> This patch solved the problem.
> But there is still a mystery.
> 
> In log, we could see excessive shrink_slab calls.

Yes, because shrink_slab() was called on each loop through
balance_pgdat() even if the zone was balanced.


> And as you know, we had merged patch which adds cond_resched where last of the function
> in shrink_slab. So other task should get the CPU and we should not see
> 100% CPU of kswapd, I think.
> 

cond_resched() is not a substitute for going to sleep.
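
To illustrate the difference, here is a rough sketch (a hypothetical
helper written from memory of the 3.0-era API, not actual kernel code):
cond_resched() leaves kswapd runnable, so an otherwise idle CPU will
schedule it straight back in, while only blocking on kswapd_wait stops
it consuming CPU until wakeup_kswapd() is called.

/* hypothetical helper, for illustration only */
static void kswapd_yield_vs_sleep(pg_data_t *pgdat, int order, int classzone_idx)
{
	DEFINE_WAIT(wait);

	/*
	 * Yields the CPU only if another task is runnable; kswapd
	 * stays runnable and is picked straight back up otherwise.
	 */
	cond_resched();

	/*
	 * Going to sleep means blocking on kswapd_wait until
	 * wakeup_kswapd() wakes us, which is what actually stops
	 * the CPU burn.
	 */
	if (!sleeping_prematurely(pgdat, order, 0, classzone_idx)) {
		prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
		schedule();
		finish_wait(&pgdat->kswapd_wait, &wait);
	}
}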

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 0/4] Stop kswapd consuming 100% CPU when highest zone is small
  2011-07-21 16:09     ` Mel Gorman
@ 2011-07-21 16:24       ` Minchan Kim
  -1 siblings, 0 replies; 82+ messages in thread
From: Minchan Kim @ 2011-07-21 16:24 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Pádraig Brady, James Bottomley, Colin King,
	Andrew Lutomirski, Rik van Riel, Johannes Weiner, linux-mm,
	linux-kernel

On Thu, Jul 21, 2011 at 05:09:59PM +0100, Mel Gorman wrote:
> On Fri, Jul 22, 2011 at 12:37:22AM +0900, Minchan Kim wrote:
> > On Fri, Jun 24, 2011 at 03:44:53PM +0100, Mel Gorman wrote:
> > > (Built this time and passed a basic sniff-test.)
> > > 
> > > During allocator-intensive workloads, kswapd will be woken frequently
> > > causing free memory to oscillate between the high and min watermark.
> > > This is expected behaviour.  Unfortunately, if the highest zone is
> > > small, a problem occurs.
> > > 
> > > This seems to happen most with recent sandybridge laptops but it's
> > > probably a co-incidence as some of these laptops just happen to have
> > > a small Normal zone. The reproduction case is almost always during
> > > copying large files that kswapd pegs at 100% CPU until the file is
> > > deleted or cache is dropped.
> > > 
> > > The problem is mostly down to sleeping_prematurely() keeping kswapd
> > > awake when the highest zone is small and unreclaimable and compounded
> > > by the fact we shrink slabs even when not shrinking zones causing a lot
> > > of time to be spent in shrinkers and a lot of memory to be reclaimed.
> > > 
> > > Patch 1 corrects sleeping_prematurely to check the zones matching
> > > 	the classzone_idx instead of all zones.
> > > 
> > > Patch 2 avoids shrinking slab when we are not shrinking a zone.
> > > 
> > > Patch 3 notes that sleeping_prematurely is checking lower zones against
> > > 	a high classzone which is not what allocators or balance_pgdat()
> > > 	is doing leading to an artifical believe that kswapd should be
> > > 	still awake.
> > > 
> > > Patch 4 notes that when balance_pgdat() gives up on a high zone that the
> > > 	decision is not communicated to sleeping_prematurely()
> > > 
> > > This problem affects 2.6.38.8 for certain and is expected to affect
> > > 2.6.39 and 3.0-rc4 as well. If accepted, they need to go to -stable
> > > to be picked up by distros and this series is against 3.0-rc4. I've
> > > cc'd people that reported similar problems recently to see if they
> > > still suffer from the problem and if this fixes it.
> > > 
> > 
> > Good!
> > This patch solved the problem.
> > But there is still a mystery.
> > 
> > In log, we could see excessive shrink_slab calls.
> 
> Yes, because shrink_slab() was called on each loop through
> balance_pgdat() even if the zone was balanced.
> 
> 
> > And as you know, we had merged patch which adds cond_resched where last of the function
> > in shrink_slab. So other task should get the CPU and we should not see
> > 100% CPU of kswapd, I think.
> > 
> 
> cond_resched() is not a substitute for going to sleep.

Of course, it's not the same as sleeping, but other tasks should get the CPU and consume their time slices,
so we should never see 100% CPU consumption by kswapd.
No?

> 
> -- 
> Mel Gorman
> SUSE Labs

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 0/4] Stop kswapd consuming 100% CPU when highest zone is small
  2011-07-21 16:24       ` Minchan Kim
@ 2011-07-21 16:36         ` Andrew Lutomirski
  -1 siblings, 0 replies; 82+ messages in thread
From: Andrew Lutomirski @ 2011-07-21 16:36 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Mel Gorman, Andrew Morton, Pádraig Brady, James Bottomley,
	Colin King, Rik van Riel, Johannes Weiner, linux-mm,
	linux-kernel

On Thu, Jul 21, 2011 at 12:24 PM, Minchan Kim <minchan.kim@gmail.com> wrote:
> On Thu, Jul 21, 2011 at 05:09:59PM +0100, Mel Gorman wrote:
>> On Fri, Jul 22, 2011 at 12:37:22AM +0900, Minchan Kim wrote:
>> > On Fri, Jun 24, 2011 at 03:44:53PM +0100, Mel Gorman wrote:
>> > > (Built this time and passed a basic sniff-test.)
>> > >
>> > > During allocator-intensive workloads, kswapd will be woken frequently
>> > > causing free memory to oscillate between the high and min watermark.
>> > > This is expected behaviour.  Unfortunately, if the highest zone is
>> > > small, a problem occurs.
>> > >
>> > > This seems to happen most with recent sandybridge laptops but it's
>> > > probably a co-incidence as some of these laptops just happen to have
>> > > a small Normal zone. The reproduction case is almost always during
>> > > copying large files that kswapd pegs at 100% CPU until the file is
>> > > deleted or cache is dropped.
>> > >
>> > > The problem is mostly down to sleeping_prematurely() keeping kswapd
>> > > awake when the highest zone is small and unreclaimable and compounded
>> > > by the fact we shrink slabs even when not shrinking zones causing a lot
>> > > of time to be spent in shrinkers and a lot of memory to be reclaimed.
>> > >
>> > > Patch 1 corrects sleeping_prematurely to check the zones matching
>> > >   the classzone_idx instead of all zones.
>> > >
>> > > Patch 2 avoids shrinking slab when we are not shrinking a zone.
>> > >
>> > > Patch 3 notes that sleeping_prematurely is checking lower zones against
>> > >   a high classzone which is not what allocators or balance_pgdat()
>> > >   is doing leading to an artifical believe that kswapd should be
>> > >   still awake.
>> > >
>> > > Patch 4 notes that when balance_pgdat() gives up on a high zone that the
>> > >   decision is not communicated to sleeping_prematurely()
>> > >
>> > > This problem affects 2.6.38.8 for certain and is expected to affect
>> > > 2.6.39 and 3.0-rc4 as well. If accepted, they need to go to -stable
>> > > to be picked up by distros and this series is against 3.0-rc4. I've
>> > > cc'd people that reported similar problems recently to see if they
>> > > still suffer from the problem and if this fixes it.
>> > >
>> >
>> > Good!
>> > This patch solved the problem.
>> > But there is still a mystery.
>> >
>> > In log, we could see excessive shrink_slab calls.
>>
>> Yes, because shrink_slab() was called on each loop through
>> balance_pgdat() even if the zone was balanced.
>>
>>
>> > And as you know, we had merged patch which adds cond_resched where last of the function
>> > in shrink_slab. So other task should get the CPU and we should not see
>> > 100% CPU of kswapd, I think.
>> >
>>
>> cond_resched() is not a substitute for going to sleep.
>
> Of course, it's not equal with sleep but other task should get CPU and conusme their time slice
> So we should never see 100% CPU consumption of kswapd.
> No?

If the rest of the system is idle, then kswapd will happily use 100%
CPU.  (Or on a multi-core system, kswapd will use close to 100% of one
CPU even if another task is using the other one.  This is bad enough
on a desktop, but on a laptop you start to notice when your battery
dies.)

--Andy

>
>>
>> --
>> Mel Gorman
>> SUSE Labs
>
> --
> Kind regards,
> Minchan Kim
>

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 4/4] mm: vmscan: Only read new_classzone_idx from pgdat when reclaiming successfully
  2011-07-21 16:07           ` Mel Gorman
@ 2011-07-21 16:36             ` Minchan Kim
  -1 siblings, 0 replies; 82+ messages in thread
From: Minchan Kim @ 2011-07-21 16:36 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Pádraig Brady, James Bottomley, Colin King,
	Andrew Lutomirski, Rik van Riel, Johannes Weiner, linux-mm,
	linux-kernel

On Thu, Jul 21, 2011 at 05:07:06PM +0100, Mel Gorman wrote:
> On Fri, Jul 22, 2011 at 12:30:07AM +0900, Minchan Kim wrote:
> > On Wed, Jul 20, 2011 at 11:48:47AM +0100, Mel Gorman wrote:
> > > On Wed, Jul 20, 2011 at 01:09:03AM +0900, Minchan Kim wrote:
> > > > Hi Mel,
> > > > 
> > > > Too late review.
> > > 
> > > Never too late.
> > > 
> > > > At that time, I had no time to look into this patch.
> > > > 
> > > > On Fri, Jun 24, 2011 at 03:44:57PM +0100, Mel Gorman wrote:
> > > > > During allocator-intensive workloads, kswapd will be woken frequently
> > > > > causing free memory to oscillate between the high and min watermark.
> > > > > This is expected behaviour.  Unfortunately, if the highest zone is
> > > > > small, a problem occurs.
> > > > > 
> > > > > When balance_pgdat() returns, it may be at a lower classzone_idx than
> > > > > it started because the highest zone was unreclaimable. Before checking
> > > > 
> > > > Yes.
> > > > 
> > > > > if it should go to sleep though, it checks pgdat->classzone_idx which
> > > > > when there is no other activity will be MAX_NR_ZONES-1. It interprets
> > > > 
> > > > Yes.
> > > > 
> > > > > this as it has been woken up while reclaiming, skips scheduling and
> > > > 
> > > > Hmm. I can't understand this part.
> > > > If balance_pgdat returns lower classzone and there is no other activity,
> > > > new_classzone_idx is always MAX_NR_ZONES - 1 so that classzone_idx would be less than
> > > > new_classzone_idx. It means it doesn't skip scheduling.
> > > > 
> > > > Do I miss something?
> > > > 
> > > 
> > > It was a few weeks ago so I don't rememember if this is the exact
> > > sequence I had in mind at the time of writing but an example sequence
> > > of events is for a node whose highest populated zone is ZONE_NORMAL,
> > > very small, and gets set all_unreclaimable by balance_pgdat() looks
> > > is below. The key is the "very small" part because pages are getting
> > > freed in the zone but the small size means that unreclaimable gets
> > > set easily.
> > > 
> > > /*
> > >  * kswapd is woken up for ZONE_NORMAL (as this is the preferred zone
> > >  * as ZONE_HIGHMEM is not populated.
> > >  */
> > > 
> > > order = pgdat->kswapd_max_order;
> > > classzone_idx = pgdat->classzone_idx;				/* classzone_idx == ZONE_NORMAL */
> > > pgdat->kswapd_max_order = 0;
> > > pgdat->classzone_idx = MAX_NR_ZONES - 1;
> > > order = balance_pgdat(pgdat, order, &classzone_idx);		/* classzone_idx == ZONE_NORMAL even though
> > > 								 * the highest zone was set unreclaimable
> > > 								 * and it exited scanning ZONE_DMA32
> > > 								 * because we did not communicate that
> > > 								 * information back
> > 
> > 								Yes. It's too bad.
> > 
> > > 								 */
> > > new_order = pgdat->kswapd_max_order;				/* new_order = 0 */
> > > new_classzone_idx = pgdat->classzone_idx;			/* new_classzone_idx == ZONE_HIGHMEM
> > > 								 * because that is what classzone_idx
> > > 								 * gets reset to
> > 
> > 								Yes. new_classzone_idx is ZONE_HIGHMEM.
> > 
> > > 								 */
> > > if (order < new_order || classzone_idx > new_classzone_idx) {
> > > 	/* does not sleep, this branch not taken */
> > > } else {
> > > 	/* tries to sleep, goes here */
> > > 	try_to_sleep(ZONE_NORMAL)
> > > 		sleeping_prematurely(ZONE_NORMAL)		/* finds zone unbalanced so skips scheduling */
> > >         order = pgdat->kswapd_max_order;
> > >         classzone_idx = pgdat->classzone_idx;			/* classzone_idx == ZONE_HIGHMEM now which
> > > 								 * is higher than what it was originally
> > > 								 * woken for
> > > 								 */
> > 
> > 								But is it a problem?
> > 								it should be reset to ZONE_NORMAL in balance_pgdat as high zone isn't populated.
> 
> At the very least, it's sloppy.

Agree.

> 
> > > }
> > > 
> > > /* Looped around to balance_pgdat() again */
> > > order = balance_pgdat()
> > > 
> > > Between when all_unreclaimable is set and before before kswapd
> > > goes fully to sleep, a page is freed clearing all_reclaimable so
> > > it rechecks all the zones, find the highest one is not balanced and
> > > skip scheduling.
> > 
> > Yes and it could be repeated forever.
> 
> Resulting in chewing up large amounts of CPU.
> 
> > Apparently, we should fix wit this patch but I have a qustion about this patch.
> > 
> > Quote from your patch
> > 
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index a76b6cc2..fe854d7 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -2448,7 +2448,6 @@ loop_again:
> > >                       if (!zone_watermark_ok_safe(zone, order,
> > >                                       high_wmark_pages(zone), 0, 0)) {
> > >                               end_zone = i;
> > > -                             *classzone_idx = i;
> > >                               break;
> > >                       }
> > >               }
> > > @@ -2528,8 +2527,11 @@ loop_again:
> > >                           total_scanned > sc.nr_reclaimed + sc.nr_reclaimed / 2)
> > >                               sc.may_writepage = 1;
> > >  
> > > -                     if (zone->all_unreclaimable)
> > > +                     if (zone->all_unreclaimable) {
> > > +                             if (end_zone && end_zone == i)
> > > +                                     end_zone--;
> > 
> > Until now, It's good.
> > 
> > >                               continue;
> > > +                     }
> > >  
> > >                       if (!zone_watermark_ok_safe(zone, order,
> > >                                       high_wmark_pages(zone), end_zone, 0)) {
> > > @@ -2709,8 +2711,8 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
> > >   */
> > >  static int kswapd(void *p)
> > >  {
> > > -     unsigned long order;
> > > -     int classzone_idx;
> > > +     unsigned long order, new_order;
> > > +     int classzone_idx, new_classzone_idx;
> > >       pg_data_t *pgdat = (pg_data_t*)p;
> > >       struct task_struct *tsk = current;
> > >  
> > > @@ -2740,17 +2742,23 @@ static int kswapd(void *p)
> > >       tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
> > >       set_freezable();
> > >  
> > > -     order = 0;
> > > -     classzone_idx = MAX_NR_ZONES - 1;
> > > +     order = new_order = 0;
> > > +     classzone_idx = new_classzone_idx = pgdat->nr_zones - 1;
> > >       for ( ; ; ) {
> > > -             unsigned long new_order;
> > > -             int new_classzone_idx;
> > >               int ret;
> > >  
> > > -             new_order = pgdat->kswapd_max_order;
> > > -             new_classzone_idx = pgdat->classzone_idx;
> > > -             pgdat->kswapd_max_order = 0;
> > > -             pgdat->classzone_idx = MAX_NR_ZONES - 1;
> > > +             /*
> > > +              * If the last balance_pgdat was unsuccessful it's unlikely a
> > > +              * new request of a similar or harder type will succeed soon
> > > +              * so consider going to sleep on the basis we reclaimed at
> > > +              */
> > > +             if (classzone_idx >= new_classzone_idx && order == new_order) {
> > > +                     new_order = pgdat->kswapd_max_order;
> > > +                     new_classzone_idx = pgdat->classzone_idx;
> > > +                     pgdat->kswapd_max_order =  0;
> > > +                     pgdat->classzone_idx = pgdat->nr_zones - 1;
> > > +             }
> > > +
> > 
> > But in this part.
> > Why do we need this?
> 
> Lets say it's a fork-heavy workload and it is routinely being woken
> for order-1 allocations and the highest zone is very small. For the
> most part, it's ok because the allocations are being satisfied from
> the lower zones which kswapd has no problem balancing.
> 
> However, by reading the information even after failing to
> balance, kswapd continues balancing for order-1 due to reading
> pgdat->kswapd_max_order, each time failing for the highest zone. It
> only takes one wakeup request per balance_pgdat() to keep kswapd
> awake trying to balance the highest zone in a continual loop.

You made balance_pgdat's classzone_idx be communicated back, so the classzone_idx returned
would not be the high zone, and in [1/4] you changed sleeping_prematurely to consider only
classzone_idx, not nr_zones. So I think it should sleep if the lower zones are balanced.
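
For reference, the shape of the check after 1/4 as I understand it, as
a paraphrased sketch rather than the verbatim function (the exact
classzone/alloc_flags arguments are simplified here):

static bool sleeping_prematurely_sketch(pg_data_t *pgdat, int order,
					long remaining, int classzone_idx)
{
	int i;

	/* woken again before the short sleep timed out */
	if (remaining)
		return true;

	/* only the zones kswapd was actually woken for are considered */
	for (i = 0; i <= classzone_idx; i++) {
		struct zone *zone = pgdat->node_zones + i;

		if (!populated_zone(zone) || zone->all_unreclaimable)
			continue;

		if (!zone_watermark_ok_safe(zone, order,
					high_wmark_pages(zone), i, 0))
			return true;	/* still unbalanced: stay awake */
	}

	return false;	/* balanced for this request: safe to sleep */
}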

> 
> By avoiding this read, kswapd will try and go to sleep after checking
> all the watermarks and all_unreclaimable. If the watermarks are ok, it
> will sleep until woken up due to the lower zones hitting their min
> watermarks.
> 
> -- 
> Mel Gorman
> SUSE Labs

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 0/4] Stop kswapd consuming 100% CPU when highest zone is small
  2011-07-21 16:36         ` Andrew Lutomirski
@ 2011-07-21 16:42           ` Minchan Kim
  -1 siblings, 0 replies; 82+ messages in thread
From: Minchan Kim @ 2011-07-21 16:42 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: Mel Gorman, Andrew Morton, Pádraig Brady, James Bottomley,
	Colin King, Rik van Riel, Johannes Weiner, linux-mm,
	linux-kernel

On Thu, Jul 21, 2011 at 12:36:11PM -0400, Andrew Lutomirski wrote:
> On Thu, Jul 21, 2011 at 12:24 PM, Minchan Kim <minchan.kim@gmail.com> wrote:
> > On Thu, Jul 21, 2011 at 05:09:59PM +0100, Mel Gorman wrote:
> >> On Fri, Jul 22, 2011 at 12:37:22AM +0900, Minchan Kim wrote:
> >> > On Fri, Jun 24, 2011 at 03:44:53PM +0100, Mel Gorman wrote:
> >> > > (Built this time and passed a basic sniff-test.)
> >> > >
> >> > > During allocator-intensive workloads, kswapd will be woken frequently
> >> > > causing free memory to oscillate between the high and min watermark.
> >> > > This is expected behaviour.  Unfortunately, if the highest zone is
> >> > > small, a problem occurs.
> >> > >
> >> > > This seems to happen most with recent sandybridge laptops but it's
> >> > > probably a co-incidence as some of these laptops just happen to have
> >> > > a small Normal zone. The reproduction case is almost always during
> >> > > copying large files that kswapd pegs at 100% CPU until the file is
> >> > > deleted or cache is dropped.
> >> > >
> >> > > The problem is mostly down to sleeping_prematurely() keeping kswapd
> >> > > awake when the highest zone is small and unreclaimable and compounded
> >> > > by the fact we shrink slabs even when not shrinking zones causing a lot
> >> > > of time to be spent in shrinkers and a lot of memory to be reclaimed.
> >> > >
> >> > > Patch 1 corrects sleeping_prematurely to check the zones matching
> >> > >   the classzone_idx instead of all zones.
> >> > >
> >> > > Patch 2 avoids shrinking slab when we are not shrinking a zone.
> >> > >
> >> > > Patch 3 notes that sleeping_prematurely is checking lower zones against
> >> > >   a high classzone which is not what allocators or balance_pgdat()
> >> > >   is doing leading to an artifical believe that kswapd should be
> >> > >   still awake.
> >> > >
> >> > > Patch 4 notes that when balance_pgdat() gives up on a high zone that the
> >> > >   decision is not communicated to sleeping_prematurely()
> >> > >
> >> > > This problem affects 2.6.38.8 for certain and is expected to affect
> >> > > 2.6.39 and 3.0-rc4 as well. If accepted, they need to go to -stable
> >> > > to be picked up by distros and this series is against 3.0-rc4. I've
> >> > > cc'd people that reported similar problems recently to see if they
> >> > > still suffer from the problem and if this fixes it.
> >> > >
> >> >
> >> > Good!
> >> > This patch solved the problem.
> >> > But there is still a mystery.
> >> >
> >> > In log, we could see excessive shrink_slab calls.
> >>
> >> Yes, because shrink_slab() was called on each loop through
> >> balance_pgdat() even if the zone was balanced.
> >>
> >>
> >> > And as you know, we had merged patch which adds cond_resched where last of the function
> >> > in shrink_slab. So other task should get the CPU and we should not see
> >> > 100% CPU of kswapd, I think.
> >> >
> >>
> >> cond_resched() is not a substitute for going to sleep.
> >
> > Of course, it's not equal with sleep but other task should get CPU and conusme their time slice
> > So we should never see 100% CPU consumption of kswapd.
> > No?
> 
> If the rest of the system is idle, then kswapd will happily use 100%
> CPU.  (Or on a multi-core system, kswapd will use close to 100% of one

Of course. But at least we have a test program running, so I don't think the system is idle.

> CPU even if another task is using the other one.  This is bad enough
> on a desktop, but on a laptop you start to notice when your battery

Of course it's bad. :)
What I want to know is just what the exact cause of the 100% CPU usage is.
It might not be exactly 100%, but we might be using the word loosely.

> dies.)
> 
> --Andy
> 
> >
> >>
> >> --
> >> Mel Gorman
> >> SUSE Labs
> >
> > --
> > Kind regards,
> > Minchan Kim
> >

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 0/4] Stop kswapd consuming 100% CPU when highest zone is small
  2011-07-21 16:42           ` Minchan Kim
@ 2011-07-21 16:58             ` Andrew Lutomirski
  -1 siblings, 0 replies; 82+ messages in thread
From: Andrew Lutomirski @ 2011-07-21 16:58 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Mel Gorman, Andrew Morton, Pádraig Brady, James Bottomley,
	Colin King, Rik van Riel, Johannes Weiner, linux-mm,
	linux-kernel

On Thu, Jul 21, 2011 at 12:42 PM, Minchan Kim <minchan.kim@gmail.com> wrote:
> On Thu, Jul 21, 2011 at 12:36:11PM -0400, Andrew Lutomirski wrote:
>> On Thu, Jul 21, 2011 at 12:24 PM, Minchan Kim <minchan.kim@gmail.com> wrote:
>> > On Thu, Jul 21, 2011 at 05:09:59PM +0100, Mel Gorman wrote:
>> >> On Fri, Jul 22, 2011 at 12:37:22AM +0900, Minchan Kim wrote:
>> >> > On Fri, Jun 24, 2011 at 03:44:53PM +0100, Mel Gorman wrote:
>> >> > > (Built this time and passed a basic sniff-test.)
>> >> > >
>> >> > > During allocator-intensive workloads, kswapd will be woken frequently
>> >> > > causing free memory to oscillate between the high and min watermark.
>> >> > > This is expected behaviour.  Unfortunately, if the highest zone is
>> >> > > small, a problem occurs.
>> >> > >
>> >> > > This seems to happen most with recent sandybridge laptops but it's
>> >> > > probably a co-incidence as some of these laptops just happen to have
>> >> > > a small Normal zone. The reproduction case is almost always during
>> >> > > copying large files that kswapd pegs at 100% CPU until the file is
>> >> > > deleted or cache is dropped.
>> >> > >
>> >> > > The problem is mostly down to sleeping_prematurely() keeping kswapd
>> >> > > awake when the highest zone is small and unreclaimable and compounded
>> >> > > by the fact we shrink slabs even when not shrinking zones causing a lot
>> >> > > of time to be spent in shrinkers and a lot of memory to be reclaimed.
>> >> > >
>> >> > > Patch 1 corrects sleeping_prematurely to check the zones matching
>> >> > >   the classzone_idx instead of all zones.
>> >> > >
>> >> > > Patch 2 avoids shrinking slab when we are not shrinking a zone.
>> >> > >
>> >> > > Patch 3 notes that sleeping_prematurely is checking lower zones against
>> >> > >   a high classzone which is not what allocators or balance_pgdat()
>> >> > >   is doing leading to an artifical believe that kswapd should be
>> >> > >   still awake.
>> >> > >
>> >> > > Patch 4 notes that when balance_pgdat() gives up on a high zone that the
>> >> > >   decision is not communicated to sleeping_prematurely()
>> >> > >
>> >> > > This problem affects 2.6.38.8 for certain and is expected to affect
>> >> > > 2.6.39 and 3.0-rc4 as well. If accepted, they need to go to -stable
>> >> > > to be picked up by distros and this series is against 3.0-rc4. I've
>> >> > > cc'd people that reported similar problems recently to see if they
>> >> > > still suffer from the problem and if this fixes it.
>> >> > >
>> >> >
>> >> > Good!
>> >> > This patch solved the problem.
>> >> > But there is still a mystery.
>> >> >
>> >> > In log, we could see excessive shrink_slab calls.
>> >>
>> >> Yes, because shrink_slab() was called on each loop through
>> >> balance_pgdat() even if the zone was balanced.
>> >>
>> >>
>> >> > And as you know, we had merged patch which adds cond_resched where last of the function
>> >> > in shrink_slab. So other task should get the CPU and we should not see
>> >> > 100% CPU of kswapd, I think.
>> >> >
>> >>
>> >> cond_resched() is not a substitute for going to sleep.
>> >
>> > Of course, it's not equal with sleep but other task should get CPU and conusme their time slice
>> > So we should never see 100% CPU consumption of kswapd.
>> > No?
>>
>> If the rest of the system is idle, then kswapd will happily use 100%
>> CPU.  (Or on a multi-core system, kswapd will use close to 100% of one
>
> Of course. But at least, we have a test program and I think it's not idle.

The test program I used was 'top', which is pretty close to idle.

>
>> CPU even if another task is using the other one.  This is bad enough
>> on a desktop, but on a laptop you start to notice when your battery
>> dies.)
>
> Of course it's bad. :)
> What I want to know is just what's exact cause of 100% CPU usage.
> It might be not 100% but we might use the word sloppily.
>

Well, if you want to be pedantic, my laptop can, in theory, demonstrate
true 100% CPU usage.  Trigger the bug, suspend every other thread, and
listen to the laptop fan spin and feel the laptop get hot.  (The fan
is controlled by the EC and takes no CPU.)

In practice, the usage was close enough to 100% that it got rounded.

The cond_resched was enough to at least make the system responsive
instead of the hard freeze I used to get.
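
To illustrate the difference being discussed: cond_resched() only
yields the CPU if the scheduler wants to run something else and leaves
the caller runnable, whereas going to sleep parks the task on a
waitqueue until someone wakes it. A minimal sketch using the 3.0-era
helpers (the reclaim loop body here is hypothetical, not kswapd
itself):

/*
 * Sketch only: contrasts cond_resched() with actually sleeping on the
 * kswapd waitqueue.  The helpers, the kswapd_wait field and
 * TASK_INTERRUPTIBLE are real 3.0-era kernel interfaces; the reclaim
 * loop body is hypothetical.
 */
#include <linux/sched.h>
#include <linux/wait.h>
#include <linux/mmzone.h>

static void busy_but_polite(pg_data_t *pgdat)
{
	for ( ; ; ) {
		/* ... do some reclaim work ... */

		/*
		 * Yields the CPU only if something else is ready to
		 * run; this task stays runnable and is picked again,
		 * so an otherwise idle CPU still shows ~100% usage
		 * for it.
		 */
		cond_resched();
	}
}

static void actually_sleep(pg_data_t *pgdat)
{
	DEFINE_WAIT(wait);

	prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
	/* Not runnable again until wakeup_kswapd() wakes the queue. */
	schedule();
	finish_wait(&pgdat->kswapd_wait, &wait);
}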

--Andy

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 4/4] mm: vmscan: Only read new_classzone_idx from pgdat when reclaiming successfully
  2011-07-21 16:36             ` Minchan Kim
@ 2011-07-21 17:01               ` Mel Gorman
  -1 siblings, 0 replies; 82+ messages in thread
From: Mel Gorman @ 2011-07-21 17:01 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, Pádraig Brady, James Bottomley, Colin King,
	Andrew Lutomirski, Rik van Riel, Johannes Weiner, linux-mm,
	linux-kernel

On Fri, Jul 22, 2011 at 01:36:49AM +0900, Minchan Kim wrote:
> > > > <SNIP>
> > > > @@ -2740,17 +2742,23 @@ static int kswapd(void *p)
> > > >       tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
> > > >       set_freezable();
> > > >  
> > > > -     order = 0;
> > > > -     classzone_idx = MAX_NR_ZONES - 1;
> > > > +     order = new_order = 0;
> > > > +     classzone_idx = new_classzone_idx = pgdat->nr_zones - 1;
> > > >       for ( ; ; ) {
> > > > -             unsigned long new_order;
> > > > -             int new_classzone_idx;
> > > >               int ret;
> > > >  
> > > > -             new_order = pgdat->kswapd_max_order;
> > > > -             new_classzone_idx = pgdat->classzone_idx;
> > > > -             pgdat->kswapd_max_order = 0;
> > > > -             pgdat->classzone_idx = MAX_NR_ZONES - 1;
> > > > +             /*
> > > > +              * If the last balance_pgdat was unsuccessful it's unlikely a
> > > > +              * new request of a similar or harder type will succeed soon
> > > > +              * so consider going to sleep on the basis we reclaimed at
> > > > +              */
> > > > +             if (classzone_idx >= new_classzone_idx && order == new_order) {
> > > > +                     new_order = pgdat->kswapd_max_order;
> > > > +                     new_classzone_idx = pgdat->classzone_idx;
> > > > +                     pgdat->kswapd_max_order =  0;
> > > > +                     pgdat->classzone_idx = pgdat->nr_zones - 1;
> > > > +             }
> > > > +
> > > 
> > > But in this part.
> > > Why do we need this?
> > 
> > Lets say it's a fork-heavy workload and it is routinely being woken
> > for order-1 allocations and the highest zone is very small. For the
> > most part, it's ok because the allocations are being satisfied from
> > the lower zones which kswapd has no problem balancing.
> > 
> > However, by reading the information even after failing to
> > balance, kswapd continues balancing for order-1 due to reading
> > pgdat->kswapd_max_order, each time failing for the highest zone. It
> > only takes one wakeup request per balance_pgdat() to keep kswapd
> > awake trying to balance the highest zone in a continual loop.
> 
> You made balace_pgdat's classzone_idx as communicated back so classzone_idx returned
> would be not high zone and in [1/4], you changed that sleeping_prematurely consider only
> classzone_idx not nr_zones. So I think it should sleep if low zones is balanced.
> 

If a wakeup for order-1 happened during the last balance_pgdat(), the
classzone_idx communicated back from balance_pgdat() is lost and kswapd
will not sleep, given this ordering of events:

kswapd 									other processes
====== 									===============
order = balance_pgdat(pgdat, order, &classzone_idx);
									wakeup for order-1
kswapd balances lower zone 
									allocate from lower zone
balance_pgdat fails balance for highest zone, returns
	with lower classzone_idx and possibly lower order
new_order = pgdat->kswapd_max_order      (order == 1)
new_classzone_idx = pgdat->classzone_idx (highest zone)
if (order < new_order || classzone_idx > new_classzone_idx) {
        order = new_order;
        classzone_idx = new_classzone_idx; (failure from balance_pgdat() lost)
}
order = balance_pgdat(pgdat, order, &classzone_idx);

The wakeup for order-1 at any point during balance_pgdat() is enough to
keep kswapd awake even though the process that called wakeup_kswapd()
would be able to allocate from the lower zones without significant
difficulty.

This is why, if balance_pgdat() fails its request, kswapd should go to
sleep, as long as the watermarks for the lower zones are met, until it
is woken by another process.
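
To make that concrete, here is a simplified sketch of the kswapd()
main loop with the hunk quoted above applied. This is a sketch, not
the exact patch: the freezer, try_to_freeze() and kthread_stop()
handling is omitted; balance_pgdat() and kswapd_try_to_sleep() are the
real functions in mm/vmscan.c.

/*
 * Simplified sketch of the kswapd() main loop with the quoted hunk
 * applied; freezer and kthread_stop() handling omitted.
 */
static void kswapd_loop_sketch(pg_data_t *pgdat)
{
	unsigned long order, new_order;
	int classzone_idx, new_classzone_idx;

	order = new_order = 0;
	classzone_idx = new_classzone_idx = pgdat->nr_zones - 1;
	for ( ; ; ) {
		/*
		 * Only read a new request if the previous
		 * balance_pgdat() satisfied what was asked of it.
		 * Otherwise keep the (lower) order/classzone_idx it
		 * returned so the next step is an attempt to sleep,
		 * not another futile pass over the small highest zone.
		 */
		if (classzone_idx >= new_classzone_idx && order == new_order) {
			new_order = pgdat->kswapd_max_order;
			new_classzone_idx = pgdat->classzone_idx;
			pgdat->kswapd_max_order = 0;
			pgdat->classzone_idx = pgdat->nr_zones - 1;
		}

		if (order < new_order || classzone_idx > new_classzone_idx) {
			/* A harder request arrived while reclaiming. */
			order = new_order;
			classzone_idx = new_classzone_idx;
		} else {
			kswapd_try_to_sleep(pgdat, order, classzone_idx);
			order = pgdat->kswapd_max_order;
			classzone_idx = pgdat->classzone_idx;
		}

		order = balance_pgdat(pgdat, order, &classzone_idx);
	}
}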

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 4/4] mm: vmscan: Only read new_classzone_idx from pgdat when reclaiming successfully
  2011-07-21 17:01               ` Mel Gorman
@ 2011-07-22  0:21                 ` Minchan Kim
  -1 siblings, 0 replies; 82+ messages in thread
From: Minchan Kim @ 2011-07-22  0:21 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Pádraig Brady, James Bottomley, Colin King,
	Andrew Lutomirski, Rik van Riel, Johannes Weiner, linux-mm,
	linux-kernel

On Fri, Jul 22, 2011 at 2:01 AM, Mel Gorman <mgorman@suse.de> wrote:
> On Fri, Jul 22, 2011 at 01:36:49AM +0900, Minchan Kim wrote:
>> > > > <SNIP>
>> > > > @@ -2740,17 +2742,23 @@ static int kswapd(void *p)
>> > > >       tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
>> > > >       set_freezable();
>> > > >
>> > > > -     order = 0;
>> > > > -     classzone_idx = MAX_NR_ZONES - 1;
>> > > > +     order = new_order = 0;
>> > > > +     classzone_idx = new_classzone_idx = pgdat->nr_zones - 1;
>> > > >       for ( ; ; ) {
>> > > > -             unsigned long new_order;
>> > > > -             int new_classzone_idx;
>> > > >               int ret;
>> > > >
>> > > > -             new_order = pgdat->kswapd_max_order;
>> > > > -             new_classzone_idx = pgdat->classzone_idx;
>> > > > -             pgdat->kswapd_max_order = 0;
>> > > > -             pgdat->classzone_idx = MAX_NR_ZONES - 1;
>> > > > +             /*
>> > > > +              * If the last balance_pgdat was unsuccessful it's unlikely a
>> > > > +              * new request of a similar or harder type will succeed soon
>> > > > +              * so consider going to sleep on the basis we reclaimed at
>> > > > +              */
>> > > > +             if (classzone_idx >= new_classzone_idx && order == new_order) {
>> > > > +                     new_order = pgdat->kswapd_max_order;
>> > > > +                     new_classzone_idx = pgdat->classzone_idx;
>> > > > +                     pgdat->kswapd_max_order =  0;
>> > > > +                     pgdat->classzone_idx = pgdat->nr_zones - 1;
>> > > > +             }
>> > > > +
>> > >
>> > > But in this part.
>> > > Why do we need this?
>> >
>> > Lets say it's a fork-heavy workload and it is routinely being woken
>> > for order-1 allocations and the highest zone is very small. For the
>> > most part, it's ok because the allocations are being satisfied from
>> > the lower zones which kswapd has no problem balancing.
>> >
>> > However, by reading the information even after failing to
>> > balance, kswapd continues balancing for order-1 due to reading
>> > pgdat->kswapd_max_order, each time failing for the highest zone. It
>> > only takes one wakeup request per balance_pgdat() to keep kswapd
>> > awake trying to balance the highest zone in a continual loop.
>>
>> You made balace_pgdat's classzone_idx as communicated back so classzone_idx returned
>> would be not high zone and in [1/4], you changed that sleeping_prematurely consider only
>> classzone_idx not nr_zones. So I think it should sleep if low zones is balanced.
>>
>
> If a wakeup for order-1 happened during the last pgdat, the
> classzone_idx as communicated back from balance_pgdat() is lost and it
> will not sleep in this ordering of events
>
> kswapd                                                                  other processes
> ======                                                                  ===============
> order = balance_pgdat(pgdat, order, &classzone_idx);
>                                                                        wakeup for order-1
> kswapd balances lower zone
>                                                                        allocate from lower zone
> balance_pgdat fails balance for highest zone, returns
>        with lower classzone_idx and possibly lower order
> new_order = pgdat->kswapd_max_order      (order == 1)
> new_classzone_idx = pgdat->classzone_idx (highest zone)
> if (order < new_order || classzone_idx > new_classzone_idx) {
>        order = new_order;
>        classzone_idx = new_classzone_idx; (failure from balance_pgdat() lost)
> }
> order = balance_pgdat(pgdat, order, &classzone_idx);
>
> The wakup for order-1 at any point during balance_pgdat() is enough to
> keep kswapd awake even though the process that called wakeup_kswapd
> would be able to allocate from the lower zones without significant
> difficulty.
>
> This is why if balance_pgdat() fails its request, it should go to sleep
> if watermarks for the lower zones are met until woken by another
> process.

Hmm.

The role of kswapd is to reclaim pages in the background until all
zones meet the high watermark, to prevent costly direct reclaim. (Of
course, there are other reasons too, such as GFP_ATOMIC allocations.)
So by design it's not wrong for it to consume a lot of CPU as long as
no other task is ready to run. Every zone eventually becomes balanced
or unreclaimable, so the loop should end. However, the problem is that
a small highest zone has all_unreclaimable set and cleared too easily,
so the situation can persist forever, as in our example. So I think
the fundamental solution is to prevent all_unreclaimable from being
set and cleared so easily. Unfortunately, I have no idea how to do
that right now.

From a different viewpoint, the problem is that kswapd is too
aggressive: it is only best-effort, and if it fails we still have the
next wakeup and even direct reclaim as a last resort. From that point
of view, I think this patch is right and would be a good solution. My
remaining concern is then about your reply to KOSAKI's question.

I think your patch below is needed.

Quoting from your earlier reply:
"
1. Read for balance-request-A (order, classzone) pair
2. Fail balance_pgdat
3. Sleep based on (order, classzone) pair
4. Wake for balance-request-B (order, classzone) pair where
  balance-request-B != balance-request-A
5. Succeed balance_pgdat
6. Compare order,classzone with balance-request-A which will treat
  balance_pgdat() as fail and try go to sleep

This is not the same as new_classzone_idx being "garbage" but is it
what you mean? If so, is this your proposed fix?

diff --git a/mm/vmscan.c b/mm/vmscan.c
index fe854d7..1a518e6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2770,6 +2770,8 @@ static int kswapd(void *p)
                       kswapd_try_to_sleep(pgdat, order, classzone_idx);
                       order = pgdat->kswapd_max_order;
                       classzone_idx = pgdat->classzone_idx;
+                       new_order = order;
+                       new_classzone_idx = classzone_idx;
"



--
Kind regards,
Minchan Kim

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* Re: [PATCH 0/4] Stop kswapd consuming 100% CPU when highest zone is small
  2011-07-21 16:58             ` Andrew Lutomirski
@ 2011-07-22  0:30               ` Minchan Kim
  -1 siblings, 0 replies; 82+ messages in thread
From: Minchan Kim @ 2011-07-22  0:30 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: Mel Gorman, Andrew Morton, Pádraig Brady, James Bottomley,
	Colin King, Rik van Riel, Johannes Weiner, linux-mm,
	linux-kernel

On Fri, Jul 22, 2011 at 1:58 AM, Andrew Lutomirski <luto@mit.edu> wrote:
> On Thu, Jul 21, 2011 at 12:42 PM, Minchan Kim <minchan.kim@gmail.com> wrote:
>> On Thu, Jul 21, 2011 at 12:36:11PM -0400, Andrew Lutomirski wrote:
>>> On Thu, Jul 21, 2011 at 12:24 PM, Minchan Kim <minchan.kim@gmail.com> wrote:
>>> > On Thu, Jul 21, 2011 at 05:09:59PM +0100, Mel Gorman wrote:
>>> >> On Fri, Jul 22, 2011 at 12:37:22AM +0900, Minchan Kim wrote:
>>> >> > On Fri, Jun 24, 2011 at 03:44:53PM +0100, Mel Gorman wrote:
>>> >> > > (Built this time and passed a basic sniff-test.)
>>> >> > >
>>> >> > > During allocator-intensive workloads, kswapd will be woken frequently
>>> >> > > causing free memory to oscillate between the high and min watermark.
>>> >> > > This is expected behaviour.  Unfortunately, if the highest zone is
>>> >> > > small, a problem occurs.
>>> >> > >
>>> >> > > This seems to happen most with recent sandybridge laptops but it's
>>> >> > > probably a co-incidence as some of these laptops just happen to have
>>> >> > > a small Normal zone. The reproduction case is almost always during
>>> >> > > copying large files that kswapd pegs at 100% CPU until the file is
>>> >> > > deleted or cache is dropped.
>>> >> > >
>>> >> > > The problem is mostly down to sleeping_prematurely() keeping kswapd
>>> >> > > awake when the highest zone is small and unreclaimable and compounded
>>> >> > > by the fact we shrink slabs even when not shrinking zones causing a lot
>>> >> > > of time to be spent in shrinkers and a lot of memory to be reclaimed.
>>> >> > >
>>> >> > > Patch 1 corrects sleeping_prematurely to check the zones matching
>>> >> > >   the classzone_idx instead of all zones.
>>> >> > >
>>> >> > > Patch 2 avoids shrinking slab when we are not shrinking a zone.
>>> >> > >
>>> >> > > Patch 3 notes that sleeping_prematurely is checking lower zones against
>>> >> > >   a high classzone which is not what allocators or balance_pgdat()
>>> >> > >   is doing leading to an artifical believe that kswapd should be
>>> >> > >   still awake.
>>> >> > >
>>> >> > > Patch 4 notes that when balance_pgdat() gives up on a high zone that the
>>> >> > >   decision is not communicated to sleeping_prematurely()
>>> >> > >
>>> >> > > This problem affects 2.6.38.8 for certain and is expected to affect
>>> >> > > 2.6.39 and 3.0-rc4 as well. If accepted, they need to go to -stable
>>> >> > > to be picked up by distros and this series is against 3.0-rc4. I've
>>> >> > > cc'd people that reported similar problems recently to see if they
>>> >> > > still suffer from the problem and if this fixes it.
>>> >> > >
>>> >> >
>>> >> > Good!
>>> >> > This patch solved the problem.
>>> >> > But there is still a mystery.
>>> >> >
>>> >> > In log, we could see excessive shrink_slab calls.
>>> >>
>>> >> Yes, because shrink_slab() was called on each loop through
>>> >> balance_pgdat() even if the zone was balanced.
>>> >>
>>> >>
>>> >> > And as you know, we had merged patch which adds cond_resched where last of the function
>>> >> > in shrink_slab. So other task should get the CPU and we should not see
>>> >> > 100% CPU of kswapd, I think.
>>> >> >
>>> >>
>>> >> cond_resched() is not a substitute for going to sleep.
>>> >
>>> > Of course, it's not equal with sleep but other task should get CPU and conusme their time slice
>>> > So we should never see 100% CPU consumption of kswapd.
>>> > No?
>>>
>>> If the rest of the system is idle, then kswapd will happily use 100%
>>> CPU.  (Or on a multi-core system, kswapd will use close to 100% of one
>>
>> Of course. But at least, we have a test program and I think it's not idle.
>
> The test program I used was 'top', which is pretty close to idle.
>
>>
>>> CPU even if another task is using the other one.  This is bad enough
>>> on a desktop, but on a laptop you start to notice when your battery
>>> dies.)
>>
>> Of course it's bad. :)
>> What I want to know is just what's exact cause of 100% CPU usage.
>> It might be not 100% but we might use the word sloppily.
>>
>
> Well, if you want to pedantic, my laptop can, in theory, demonstrate
> true 100% CPU usage.  Trigger the bug, suspend every other thread, and
> listen to the laptop fan spin and feel the laptop get hot.  (The fan
> is controlled by the EC and takes no CPU.)
>
> In practice, the usage was close enough to 100% that it got rounded.
>
> The cond_resched was enough to at least make the system responsive
> instead of the hard freeze I used to get.

I don't want to be pedantic. :)
What I had in mind for 100% CPU usage was kswapd never yielding the
CPU and spinning on it, but from your example (i.e. cond_resched
making the system responsive), that's not the case. Most of the CPU
time was simply spent in kswapd rather than literally 100%. It seems
I was paranoid about the word, sorry for that.

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 4/4] mm: vmscan: Only read new_classzone_idx from pgdat when reclaiming successfully
  2011-07-22  0:21                 ` Minchan Kim
@ 2011-07-22  7:42                   ` Mel Gorman
  -1 siblings, 0 replies; 82+ messages in thread
From: Mel Gorman @ 2011-07-22  7:42 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, Pádraig Brady, James Bottomley, Colin King,
	Andrew Lutomirski, Rik van Riel, Johannes Weiner, linux-mm,
	linux-kernel

On Fri, Jul 22, 2011 at 09:21:57AM +0900, Minchan Kim wrote:
> On Fri, Jul 22, 2011 at 2:01 AM, Mel Gorman <mgorman@suse.de> wrote:
> > On Fri, Jul 22, 2011 at 01:36:49AM +0900, Minchan Kim wrote:
> >> > > > <SNIP>
> >> > > > @@ -2740,17 +2742,23 @@ static int kswapd(void *p)
> >> > > >       tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
> >> > > >       set_freezable();
> >> > > >
> >> > > > -     order = 0;
> >> > > > -     classzone_idx = MAX_NR_ZONES - 1;
> >> > > > +     order = new_order = 0;
> >> > > > +     classzone_idx = new_classzone_idx = pgdat->nr_zones - 1;
> >> > > >       for ( ; ; ) {
> >> > > > -             unsigned long new_order;
> >> > > > -             int new_classzone_idx;
> >> > > >               int ret;
> >> > > >
> >> > > > -             new_order = pgdat->kswapd_max_order;
> >> > > > -             new_classzone_idx = pgdat->classzone_idx;
> >> > > > -             pgdat->kswapd_max_order = 0;
> >> > > > -             pgdat->classzone_idx = MAX_NR_ZONES - 1;
> >> > > > +             /*
> >> > > > +              * If the last balance_pgdat was unsuccessful it's unlikely a
> >> > > > +              * new request of a similar or harder type will succeed soon
> >> > > > +              * so consider going to sleep on the basis we reclaimed at
> >> > > > +              */
> >> > > > +             if (classzone_idx >= new_classzone_idx && order == new_order) {
> >> > > > +                     new_order = pgdat->kswapd_max_order;
> >> > > > +                     new_classzone_idx = pgdat->classzone_idx;
> >> > > > +                     pgdat->kswapd_max_order =  0;
> >> > > > +                     pgdat->classzone_idx = pgdat->nr_zones - 1;
> >> > > > +             }
> >> > > > +
> >> > >
> >> > > But in this part.
> >> > > Why do we need this?
> >> >
> >> > Lets say it's a fork-heavy workload and it is routinely being woken
> >> > for order-1 allocations and the highest zone is very small. For the
> >> > most part, it's ok because the allocations are being satisfied from
> >> > the lower zones which kswapd has no problem balancing.
> >> >
> >> > However, by reading the information even after failing to
> >> > balance, kswapd continues balancing for order-1 due to reading
> >> > pgdat->kswapd_max_order, each time failing for the highest zone. It
> >> > only takes one wakeup request per balance_pgdat() to keep kswapd
> >> > awake trying to balance the highest zone in a continual loop.
> >>
> >> You made balace_pgdat's classzone_idx as communicated back so classzone_idx returned
> >> would be not high zone and in [1/4], you changed that sleeping_prematurely consider only
> >> classzone_idx not nr_zones. So I think it should sleep if low zones is balanced.
> >>
> >
> > If a wakeup for order-1 happened during the last pgdat, the
> > classzone_idx as communicated back from balance_pgdat() is lost and it
> > will not sleep in this ordering of events
> >
> > kswapd                                                                  other processes
> > ======                                                                  ===============
> > order = balance_pgdat(pgdat, order, &classzone_idx);
> >                                                                        wakeup for order-1
> > kswapd balances lower zone
> >                                                                        allocate from lower zone
> > balance_pgdat fails balance for highest zone, returns
> >        with lower classzone_idx and possibly lower order
> > new_order = pgdat->kswapd_max_order      (order == 1)
> > new_classzone_idx = pgdat->classzone_idx (highest zone)
> > if (order < new_order || classzone_idx > new_classzone_idx) {
> >        order = new_order;
> >        classzone_idx = new_classzone_idx; (failure from balance_pgdat() lost)
> > }
> > order = balance_pgdat(pgdat, order, &classzone_idx);
> >
> > The wakup for order-1 at any point during balance_pgdat() is enough to
> > keep kswapd awake even though the process that called wakeup_kswapd
> > would be able to allocate from the lower zones without significant
> > difficulty.
> >
> > This is why if balance_pgdat() fails its request, it should go to sleep
> > if watermarks for the lower zones are met until woken by another
> > process.
> 
> Hmm.
> 
> The role of kswapd is to reclaim pages by background until all of zone
> meet HIGH_WMARK to prevent costly direct reclaim.(Of course, there is
> another reason like GFP_ATOMIC).

kswapd does not necessarily have to balance every zone to prevent
direct reclaim. Again, if the highest zone is small, it does not
remain balanced for very long because it is often the first choice to
allocate from. It gets used up very quickly, but direct reclaim does
not stall because the lower zones are still available.
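
A rough sketch of the fallback being relied on here, assuming a single
node and ignoring zonelists, cpusets and alloc_flags (pick_zone() is a
hypothetical helper for illustration; zone_watermark_ok(),
low_wmark_pages() and populated_zone() are the real ones):

/*
 * Sketch: walk zones from the requested classzone downwards and take
 * the first one whose watermark is met, so a depleted small highest
 * zone does not, on its own, force direct reclaim.
 */
#include <linux/mm.h>
#include <linux/mmzone.h>

static struct zone *pick_zone(pg_data_t *pgdat, int classzone_idx, int order)
{
	int i;

	for (i = classzone_idx; i >= 0; i--) {
		struct zone *zone = &pgdat->node_zones[i];

		if (!populated_zone(zone))
			continue;
		if (zone_watermark_ok(zone, order, low_wmark_pages(zone),
				      classzone_idx, 0))
			return zone;	/* allocate from this zone */
	}
	return NULL;	/* nothing above the watermark: direct reclaim */
}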

> So it's not wrong to consume many cpu
> usage by design unless other tasks are ready.

It wastes power without making the system run any faster, it looks
odd to any user or administrator who is running top, and it generates
bug reports.

> It would be balanced or
> unreclaimable at last so it should end up. However, the problem is
> small part of highest zone is easily [set|reset] to be
> all_unreclaimabe so the situation could be forever like our example.
> So fundamental solution is to prevent it that all_unreclaimable is
> set/reset easily, I think.
> Unfortunately it have no idea now.

One way would be to have the allocator skip over it easily and to
implement a placement policy that relocates only long-lived, very old
pages to the highest zone, leaves them there, and has kswapd ignore
the zone. We don't have anything like this at the moment.

> In different viewpoint,  the problem is that it's too excessive
> because kswapd is just best-effort and if it got fails, we have next
> wakeup and even direct reclaim as last resort. In such POV, I think
> this patch is right and it would be a good solution. Then, other
> concern is on your reply about KOSAKI's question.
> 
> I think below your patch is needed.
> 
> Quote from
> "
> 1. Read for balance-request-A (order, classzone) pair
> 2. Fail balance_pgdat
> 3. Sleep based on (order, classzone) pair
> 4. Wake for balance-request-B (order, classzone) pair where
>   balance-request-B != balance-request-A
> 5. Succeed balance_pgdat
> 6. Compare order,classzone with balance-request-A which will treat
>   balance_pgdat() as fail and try go to sleep
> 
> This is not the same as new_classzone_idx being "garbage" but is it
> what you mean? If so, is this your proposed fix?
> 

That was the proposed fix, but the discussion died. I'll pick it up
again later and am keeping an eye out for any bugs that could be
attributed to it.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 0/4] Stop kswapd consuming 100% CPU when highest zone is small
  2011-07-22  0:30               ` Minchan Kim
@ 2011-07-22 13:21                 ` Andrew Lutomirski
  -1 siblings, 0 replies; 82+ messages in thread
From: Andrew Lutomirski @ 2011-07-22 13:21 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Mel Gorman, Andrew Morton, Pádraig Brady, James Bottomley,
	Colin King, Rik van Riel, Johannes Weiner, linux-mm,
	linux-kernel

On Thu, Jul 21, 2011 at 8:30 PM, Minchan Kim <minchan.kim@gmail.com> wrote:
> On Fri, Jul 22, 2011 at 1:58 AM, Andrew Lutomirski <luto@mit.edu> wrote:
>> On Thu, Jul 21, 2011 at 12:42 PM, Minchan Kim <minchan.kim@gmail.com> wrote:
>>> On Thu, Jul 21, 2011 at 12:36:11PM -0400, Andrew Lutomirski wrote:
>>>> On Thu, Jul 21, 2011 at 12:24 PM, Minchan Kim <minchan.kim@gmail.com> wrote:
>>>> > On Thu, Jul 21, 2011 at 05:09:59PM +0100, Mel Gorman wrote:
>>>> >> On Fri, Jul 22, 2011 at 12:37:22AM +0900, Minchan Kim wrote:
>>>> >> > On Fri, Jun 24, 2011 at 03:44:53PM +0100, Mel Gorman wrote:
>>>> >> > > (Built this time and passed a basic sniff-test.)
>>>> >> > >
>>>> >> > > During allocator-intensive workloads, kswapd will be woken frequently
>>>> >> > > causing free memory to oscillate between the high and min watermark.
>>>> >> > > This is expected behaviour.  Unfortunately, if the highest zone is
>>>> >> > > small, a problem occurs.
>>>> >> > >
>>>> >> > > This seems to happen most with recent sandybridge laptops but it's
>>>> >> > > probably a co-incidence as some of these laptops just happen to have
>>>> >> > > a small Normal zone. The reproduction case is almost always during
>>>> >> > > copying large files that kswapd pegs at 100% CPU until the file is
>>>> >> > > deleted or cache is dropped.
>>>> >> > >
>>>> >> > > The problem is mostly down to sleeping_prematurely() keeping kswapd
>>>> >> > > awake when the highest zone is small and unreclaimable and compounded
>>>> >> > > by the fact we shrink slabs even when not shrinking zones causing a lot
>>>> >> > > of time to be spent in shrinkers and a lot of memory to be reclaimed.
>>>> >> > >
>>>> >> > > Patch 1 corrects sleeping_prematurely to check the zones matching
>>>> >> > >   the classzone_idx instead of all zones.
>>>> >> > >
>>>> >> > > Patch 2 avoids shrinking slab when we are not shrinking a zone.
>>>> >> > >
>>>> >> > > Patch 3 notes that sleeping_prematurely is checking lower zones against
>>>> >> > >   a high classzone which is not what allocators or balance_pgdat()
>>>> >> > >   is doing leading to an artifical believe that kswapd should be
>>>> >> > >   still awake.
>>>> >> > >
>>>> >> > > Patch 4 notes that when balance_pgdat() gives up on a high zone that the
>>>> >> > >   decision is not communicated to sleeping_prematurely()
>>>> >> > >
>>>> >> > > This problem affects 2.6.38.8 for certain and is expected to affect
>>>> >> > > 2.6.39 and 3.0-rc4 as well. If accepted, they need to go to -stable
>>>> >> > > to be picked up by distros and this series is against 3.0-rc4. I've
>>>> >> > > cc'd people that reported similar problems recently to see if they
>>>> >> > > still suffer from the problem and if this fixes it.
>>>> >> > >
>>>> >> >
>>>> >> > Good!
>>>> >> > This patch solved the problem.
>>>> >> > But there is still a mystery.
>>>> >> >
>>>> >> > In log, we could see excessive shrink_slab calls.
>>>> >>
>>>> >> Yes, because shrink_slab() was called on each loop through
>>>> >> balance_pgdat() even if the zone was balanced.
>>>> >>
>>>> >>
>>>> >> > And as you know, we had merged patch which adds cond_resched where last of the function
>>>> >> > in shrink_slab. So other task should get the CPU and we should not see
>>>> >> > 100% CPU of kswapd, I think.
>>>> >> >
>>>> >>
>>>> >> cond_resched() is not a substitute for going to sleep.
>>>> >
>>>> > Of course, it's not equal to sleeping, but other tasks should get the CPU and consume their time slices.
>>>> > So we should never see 100% CPU consumption by kswapd.
>>>> > No?
>>>>
>>>> If the rest of the system is idle, then kswapd will happily use 100%
>>>> CPU.  (Or on a multi-core system, kswapd will use close to 100% of one
>>>
>>> Of course. But at least we have a test program running, and I think the system is not idle.
>>
>> The test program I used was 'top', which is pretty close to idle.
>>
>>>
>>>> CPU even if another task is using the other one.  This is bad enough
>>>> on a desktop, but on a laptop you start to notice when your battery
>>>> dies.)
>>>
>>> Of course it's bad. :)
>>> What I want to know is just the exact cause of the 100% CPU usage.
>>> It might not be 100%, but we might be using the word sloppily.
>>>
>>
>> Well, if you want to be pedantic, my laptop can, in theory, demonstrate
>> true 100% CPU usage.  Trigger the bug, suspend every other thread, and
>> listen to the laptop fan spin and feel the laptop get hot.  (The fan
>> is controlled by the EC and takes no CPU.)
>>
>> In practice, the usage was close enough to 100% that it got rounded.
>>
>> The cond_resched was enough to at least make the system responsive
>> instead of the hard freeze I used to get.
>
> I don't want to be pedantic. :)
> My thought about the 100% CPU usage was that kswapd doesn't yield the
> CPU and spins on it, but from your example (i.e. cond_resched makes
> the system responsive), that's not the case. Most of the CPU time was
> simply spent in kswapd, not literally 100%. It seems I was being
> paranoid about the word, sorry for that.

Ah, sorry.  I must have been unclear in my original email.

In 2.6.39, it made my system unresponsive.  With your cond_resched and
pgdat_balanced fixes, it just made kswapd eat all available CPU, but
the system still worked.

--Andy
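
To make the yield-versus-sleep distinction above concrete, here is a
minimal userspace sketch. It is not from this thread and not kernel
code: sched_yield() is only a rough analogue of cond_resched(), and the
file name in the build comment is made up. Run with no argument and the
loop eats close to 100% of one core on an otherwise idle machine,
because sched_yield() returns immediately when nothing else is runnable;
run with --sleep and the thread leaves the run queue until it is
signalled, which is what kswapd going back to sleep on its waitqueue
corresponds to.

/* Build: cc -pthread yield_vs_sleep.c -o yield_vs_sleep (name is made up) */
#include <pthread.h>
#include <sched.h>
#include <string.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  wake = PTHREAD_COND_INITIALIZER;

int main(int argc, char **argv)
{
	int do_sleep = (argc > 1 && strcmp(argv[1], "--sleep") == 0);

	for (;;) {
		/* stand-in for a reclaim pass that found nothing to do */

		if (!do_sleep) {
			/* yields the CPU only if another task is runnable;
			 * on an idle system it returns at once, so this
			 * loop keeps one core busy */
			sched_yield();
		} else {
			/* leaves the run queue entirely until
			 * pthread_cond_signal() is called; spurious
			 * wakeups just go around the loop again */
			pthread_mutex_lock(&lock);
			pthread_cond_wait(&wake, &lock);
			pthread_mutex_unlock(&lock);
		}
	}
	return 0;
}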

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 0/4] Stop kswapd consuming 100% CPU when highest zone is small
@ 2011-07-22 13:21                 ` Andrew Lutomirski
  0 siblings, 0 replies; 82+ messages in thread
From: Andrew Lutomirski @ 2011-07-22 13:21 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Mel Gorman, Andrew Morton, Pádraig Brady, James Bottomley,
	Colin King, Rik van Riel, Johannes Weiner, linux-mm,
	linux-kernel

On Thu, Jul 21, 2011 at 8:30 PM, Minchan Kim <minchan.kim@gmail.com> wrote:
> On Fri, Jul 22, 2011 at 1:58 AM, Andrew Lutomirski <luto@mit.edu> wrote:
>> On Thu, Jul 21, 2011 at 12:42 PM, Minchan Kim <minchan.kim@gmail.com> wrote:
>>> On Thu, Jul 21, 2011 at 12:36:11PM -0400, Andrew Lutomirski wrote:
>>>> On Thu, Jul 21, 2011 at 12:24 PM, Minchan Kim <minchan.kim@gmail.com> wrote:
>>>> > On Thu, Jul 21, 2011 at 05:09:59PM +0100, Mel Gorman wrote:
>>>> >> On Fri, Jul 22, 2011 at 12:37:22AM +0900, Minchan Kim wrote:
>>>> >> > On Fri, Jun 24, 2011 at 03:44:53PM +0100, Mel Gorman wrote:
>>>> >> > > (Built this time and passed a basic sniff-test.)
>>>> >> > >
>>>> >> > > During allocator-intensive workloads, kswapd will be woken frequently
>>>> >> > > causing free memory to oscillate between the high and min watermark.
>>>> >> > > This is expected behaviour.  Unfortunately, if the highest zone is
>>>> >> > > small, a problem occurs.
>>>> >> > >
>>>> >> > > This seems to happen most with recent sandybridge laptops, but that's
>>>> >> > > probably a coincidence as some of these laptops just happen to have
>>>> >> > > a small Normal zone. The reproduction case is almost always copying
>>>> >> > > large files, during which kswapd pegs at 100% CPU until the file is
>>>> >> > > deleted or the cache is dropped.
>>>> >> > >
>>>> >> > > The problem is mostly down to sleeping_prematurely() keeping kswapd
>>>> >> > > awake when the highest zone is small and unreclaimable, compounded
>>>> >> > > by the fact that we shrink slabs even when not shrinking zones, causing
>>>> >> > > a lot of time to be spent in shrinkers and a lot of memory to be reclaimed.
>>>> >> > >
>>>> >> > > Patch 1 corrects sleeping_prematurely to check the zones matching
>>>> >> > >   the classzone_idx instead of all zones.
>>>> >> > >
>>>> >> > > Patch 2 avoids shrinking slab when we are not shrinking a zone.
>>>> >> > >
>>>> >> > > Patch 3 notes that sleeping_prematurely is checking lower zones against
>>>> >> > >   a high classzone, which is not what allocators or balance_pgdat()
>>>> >> > >   do, leading to an artificial belief that kswapd should still
>>>> >> > >   be awake.
>>>> >> > >
>>>> >> > > Patch 4 notes that when balance_pgdat() gives up on a high zone that the
>>>> >> > >   decision is not communicated to sleeping_prematurely()
>>>> >> > >
>>>> >> > > This problem affects 2.6.38.8 for certain and is expected to affect
>>>> >> > > 2.6.39 and 3.0-rc4 as well. If accepted, they need to go to -stable
>>>> >> > > to be picked up by distros and this series is against 3.0-rc4. I've
>>>> >> > > cc'd people that reported similar problems recently to see if they
>>>> >> > > still suffer from the problem and if this fixes it.
>>>> >> > >
>>>> >> >
>>>> >> > Good!
>>>> >> > This patch solved the problem.
>>>> >> > But there is still a mystery.
>>>> >> >
>>>> >> > In log, we could see excessive shrink_slab calls.
>>>> >>
>>>> >> Yes, because shrink_slab() was called on each loop through
>>>> >> balance_pgdat() even if the zone was balanced.
>>>> >>
>>>> >>
>>>> >> > And as you know, we merged a patch which adds cond_resched() at the end
>>>> >> > of the function in shrink_slab(). So other tasks should get the CPU and
>>>> >> > we should not see 100% CPU usage from kswapd, I think.
>>>> >> >
>>>> >>
>>>> >> cond_resched() is not a substitute for going to sleep.
>>>> >
>>>> > Of course, it's not equal to sleeping, but other tasks should get the CPU and consume their time slices.
>>>> > So we should never see 100% CPU consumption by kswapd.
>>>> > No?
>>>>
>>>> If the rest of the system is idle, then kswapd will happily use 100%
>>>> CPU.  (Or on a multi-core system, kswapd will use close to 100% of one
>>>
>>> Of course. But at least we have a test program running, and I think the system is not idle.
>>
>> The test program I used was 'top', which is pretty close to idle.
>>
>>>
>>>> CPU even if another task is using the other one.  This is bad enough
>>>> on a desktop, but on a laptop you start to notice when your battery
>>>> dies.)
>>>
>>> Of course it's bad. :)
>>> What I want to know is just the exact cause of the 100% CPU usage.
>>> It might not be 100%, but we might be using the word sloppily.
>>>
>>
>> Well, if you want to be pedantic, my laptop can, in theory, demonstrate
>> true 100% CPU usage.  Trigger the bug, suspend every other thread, and
>> listen to the laptop fan spin and feel the laptop get hot.  (The fan
>> is controlled by the EC and takes no CPU.)
>>
>> In practice, the usage was close enough to 100% that it got rounded.
>>
>> The cond_resched was enough to at least make the system responsive
>> instead of the hard freeze I used to get.
>
> I don't want to be pedantic. :)
> My thought about the 100% CPU usage was that kswapd doesn't yield the
> CPU and spins on it, but from your example (i.e. cond_resched makes
> the system responsive), that's not the case. Most of the CPU time was
> simply spent in kswapd, not literally 100%. It seems I was being
> paranoid about the word, sorry for that.

Ah, sorry.  I must have been unclear in my original email.

In 2.6.39, it made my system unresponsive.  With your cond_resched and
pgdat_balanced fixes, it just made kswapd eat all available CPU, but
the system still worked.

--Andy

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 2/4] mm: vmscan: Do not apply pressure to slab if we are not applying pressure to zone
  2011-06-24 13:43   ` Mel Gorman
@ 2011-06-24 13:59     ` Mel Gorman
  -1 siblings, 0 replies; 82+ messages in thread
From: Mel Gorman @ 2011-06-24 13:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Pádraig Brady, James Bottomley, Colin King, Minchan Kim,
	Andrew Lutomirski, Rik van Riel, Johannes Weiner, linux-mm,
	linux-kernel

On Fri, Jun 24, 2011 at 02:43:16PM +0100, Mel Gorman wrote:
> During allocator-intensive workloads, kswapd will be woken frequently
> causing free memory to oscillate between the high and min watermark.
> This is expected behaviour.
> 

Bah, I accidentally exported a branch with a build error in this
patch. Will resend shortly.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 2/4] mm: vmscan: Do not apply pressure to slab if we are not applying pressure to zone
@ 2011-06-24 13:59     ` Mel Gorman
  0 siblings, 0 replies; 82+ messages in thread
From: Mel Gorman @ 2011-06-24 13:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Pádraig Brady, James Bottomley, Colin King, Minchan Kim,
	Andrew Lutomirski, Rik van Riel, Johannes Weiner, linux-mm,
	linux-kernel

On Fri, Jun 24, 2011 at 02:43:16PM +0100, Mel Gorman wrote:
> During allocator-intensive workloads, kswapd will be woken frequently
> causing free memory to oscillate between the high and min watermark.
> This is expected behaviour.
> 

Bah, I accidentally exported a branch with a build error in this
patch. Will resend shortly.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 82+ messages in thread

* [PATCH 2/4] mm: vmscan: Do not apply pressure to slab if we are not applying pressure to zone
  2011-06-24 13:43 Mel Gorman
@ 2011-06-24 13:43   ` Mel Gorman
  0 siblings, 0 replies; 82+ messages in thread
From: Mel Gorman @ 2011-06-24 13:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Pádraig Brady, James Bottomley, Colin King, Minchan Kim,
	Andrew Lutomirski, Rik van Riel, Johannes Weiner, linux-mm,
	linux-kernel, Mel Gorman

During allocator-intensive workloads, kswapd will be woken frequently
causing free memory to oscillate between the high and min watermark.
This is expected behaviour.

When kswapd applies pressure to zones during node balancing, it checks
if the zone is above a high+balance_gap threshold. If it is, it does
not apply pressure, but it still unconditionally shrinks slab on a
global basis, which is excessive. In the event kswapd is being kept
awake due to a small unreclaimable highest zone, it skips zone
shrinking but still calls shrink_slab().

Once pressure has been applied, the check for the zone being
unreclaimable is made before the check of whether all_unreclaimable
should be set. This missed marking can cause has_under_min_watermark_zone
to be set due to an unreclaimable zone, preventing kswapd from backing
off in congestion_wait().

Reported-and-tested-by: Pádraig Brady <P@draigBrady.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/vmscan.c |   21 ++++++++++++---------
 1 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 841e3bf..38665ec 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2509,16 +2509,16 @@ loop_again:
 					high_wmark_pages(zone) + balance_gap,
 					end_zone, 0))
 				shrink_zone(priority, zone, &sc);
-			reclaim_state->reclaimed_slab = 0;
-			nr_slab = shrink_slab(&shrink, sc.nr_scanned, lru_pages);
-			sc.nr_reclaimed += reclaim_state->reclaimed_slab;
-			total_scanned += sc.nr_scanned;
 
-			if (zone->all_unreclaimable)
-				continue;
-			if (nr_slab == 0 &&
-			    !zone_reclaimable(zone))
-				zone->all_unreclaimable = 1;
+				reclaim_state->reclaimed_slab = 0;
+				nr_slab = shrink_slab(&shrink, sc.nr_scanned, lru_pages);
+				sc.nr_reclaimed += reclaim_state->reclaimed_slab;
+				total_scanned += sc.nr_scanned;
+
+				if (nr_slab == 0 && !zone_reclaimable(zone))
+					zone->all_unreclaimable = 1;
+			}
+
 			/*
 			 * If we've done a decent amount of scanning and
 			 * the reclaim ratio is low, start doing writepage
@@ -2528,6 +2528,9 @@ loop_again:
 			    total_scanned > sc.nr_reclaimed + sc.nr_reclaimed / 2)
 				sc.may_writepage = 1;
 
+			if (zone->all_unreclaimable)
+				continue;
+
 			if (!zone_watermark_ok_safe(zone, order,
 					high_wmark_pages(zone), end_zone, 0)) {
 				all_zones_ok = 0;
-- 
1.7.3.4
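
As Mel's 13:59 follow-up above says, this posting of the patch carries a
build error: the hunk re-indents the shrink_slab() block and adds a
closing '}', but the matching '{' after the zone_watermark_ok_safe()
check is never added, so balance_pgdat() no longer compiles. The sketch
below is reconstructed from the hunk above as the presumed intent; it is
not taken from the resent patch:

        /* only shrink slab when pressure was actually applied to this zone */
        if (!zone_watermark_ok_safe(zone, order,
                        high_wmark_pages(zone) + balance_gap,
                        end_zone, 0)) {
                shrink_zone(priority, zone, &sc);

                reclaim_state->reclaimed_slab = 0;
                nr_slab = shrink_slab(&shrink, sc.nr_scanned, lru_pages);
                sc.nr_reclaimed += reclaim_state->reclaimed_slab;
                total_scanned += sc.nr_scanned;

                if (nr_slab == 0 && !zone_reclaimable(zone))
                        zone->all_unreclaimable = 1;
        }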


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 2/4] mm: vmscan: Do not apply pressure to slab if we are not applying pressure to zone
@ 2011-06-24 13:43   ` Mel Gorman
  0 siblings, 0 replies; 82+ messages in thread
From: Mel Gorman @ 2011-06-24 13:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Pádraig Brady, James Bottomley, Colin King, Minchan Kim,
	Andrew Lutomirski, Rik van Riel, Johannes Weiner, linux-mm,
	linux-kernel, Mel Gorman

During allocator-intensive workloads, kswapd will be woken frequently
causing free memory to oscillate between the high and min watermark.
This is expected behaviour.

When kswapd applies pressure to zones during node balancing, it checks
if the zone is above a high+balance_gap threshold. If it is, it does
not apply pressure, but it still unconditionally shrinks slab on a
global basis, which is excessive. In the event kswapd is being kept
awake due to a small unreclaimable highest zone, it skips zone
shrinking but still calls shrink_slab().

Once pressure has been applied, the check for the zone being
unreclaimable is made before the check of whether all_unreclaimable
should be set. This missed marking can cause has_under_min_watermark_zone
to be set due to an unreclaimable zone, preventing kswapd from backing
off in congestion_wait().

Reported-and-tested-by: Pádraig Brady <P@draigBrady.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/vmscan.c |   21 ++++++++++++---------
 1 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 841e3bf..38665ec 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2509,16 +2509,16 @@ loop_again:
 					high_wmark_pages(zone) + balance_gap,
 					end_zone, 0))
 				shrink_zone(priority, zone, &sc);
-			reclaim_state->reclaimed_slab = 0;
-			nr_slab = shrink_slab(&shrink, sc.nr_scanned, lru_pages);
-			sc.nr_reclaimed += reclaim_state->reclaimed_slab;
-			total_scanned += sc.nr_scanned;
 
-			if (zone->all_unreclaimable)
-				continue;
-			if (nr_slab == 0 &&
-			    !zone_reclaimable(zone))
-				zone->all_unreclaimable = 1;
+				reclaim_state->reclaimed_slab = 0;
+				nr_slab = shrink_slab(&shrink, sc.nr_scanned, lru_pages);
+				sc.nr_reclaimed += reclaim_state->reclaimed_slab;
+				total_scanned += sc.nr_scanned;
+
+				if (nr_slab == 0 && !zone_reclaimable(zone))
+					zone->all_unreclaimable = 1;
+			}
+
 			/*
 			 * If we've done a decent amount of scanning and
 			 * the reclaim ratio is low, start doing writepage
@@ -2528,6 +2528,9 @@ loop_again:
 			    total_scanned > sc.nr_reclaimed + sc.nr_reclaimed / 2)
 				sc.may_writepage = 1;
 
+			if (zone->all_unreclaimable)
+				continue;
+
 			if (!zone_watermark_ok_safe(zone, order,
 					high_wmark_pages(zone), end_zone, 0)) {
 				all_zones_ok = 0;
-- 
1.7.3.4


^ permalink raw reply related	[flat|nested] 82+ messages in thread

end of thread, other threads:[~2011-07-22 13:22 UTC | newest]

Thread overview: 82+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-06-24 14:44 [PATCH 0/4] Stop kswapd consuming 100% CPU when highest zone is small Mel Gorman
2011-06-24 14:44 ` Mel Gorman
2011-06-24 14:44 ` [PATCH 1/4] mm: vmscan: Correct check for kswapd sleeping in sleeping_prematurely Mel Gorman
2011-06-24 14:44   ` Mel Gorman
2011-06-25 21:33   ` Rik van Riel
2011-06-25 21:33     ` Rik van Riel
2011-06-27  6:10   ` Minchan Kim
2011-06-27  6:10     ` Minchan Kim
2011-06-28 21:49   ` Andrew Morton
2011-06-28 21:49     ` Andrew Morton
2011-06-29 10:57     ` Pádraig Brady
2011-06-29 10:57       ` Pádraig Brady
2011-06-30  9:39     ` Mel Gorman
2011-06-30  9:39       ` Mel Gorman
2011-06-30  2:23   ` KOSAKI Motohiro
2011-06-30  2:23     ` KOSAKI Motohiro
2011-06-24 14:44 ` [PATCH 2/4] mm: vmscan: Do not apply pressure to slab if we are not applying pressure to zone Mel Gorman
2011-06-24 14:44   ` Mel Gorman
2011-06-25 21:40   ` Rik van Riel
2011-06-25 21:40     ` Rik van Riel
2011-06-28 23:38   ` Minchan Kim
2011-06-28 23:38     ` Minchan Kim
2011-06-30  2:37   ` KOSAKI Motohiro
2011-06-30  2:37     ` KOSAKI Motohiro
2011-06-24 14:44 ` [PATCH 3/4] mm: vmscan: Evaluate the watermarks against the correct classzone Mel Gorman
2011-06-24 14:44   ` Mel Gorman
2011-06-25 21:42   ` Rik van Riel
2011-06-25 21:42     ` Rik van Riel
2011-06-27  6:53   ` Minchan Kim
2011-06-27  6:53     ` Minchan Kim
2011-06-28 12:52     ` Mel Gorman
2011-06-28 12:52       ` Mel Gorman
2011-06-28 23:23       ` Minchan Kim
2011-06-28 23:23         ` Minchan Kim
2011-06-28 23:23   ` Minchan Kim
2011-06-28 23:23     ` Minchan Kim
2011-06-24 14:44 ` [PATCH 4/4] mm: vmscan: Only read new_classzone_idx from pgdat when reclaiming successfully Mel Gorman
2011-06-24 14:44   ` Mel Gorman
2011-06-25 23:17   ` Rik van Riel
2011-06-25 23:17     ` Rik van Riel
2011-06-30  9:05   ` KOSAKI Motohiro
2011-06-30  9:05     ` KOSAKI Motohiro
2011-06-30 10:19     ` Mel Gorman
2011-06-30 10:19       ` Mel Gorman
2011-07-19 16:09   ` Minchan Kim
2011-07-19 16:09     ` Minchan Kim
2011-07-20 10:48     ` Mel Gorman
2011-07-20 10:48       ` Mel Gorman
2011-07-21 15:30       ` Minchan Kim
2011-07-21 15:30         ` Minchan Kim
2011-07-21 16:07         ` Mel Gorman
2011-07-21 16:07           ` Mel Gorman
2011-07-21 16:36           ` Minchan Kim
2011-07-21 16:36             ` Minchan Kim
2011-07-21 17:01             ` Mel Gorman
2011-07-21 17:01               ` Mel Gorman
2011-07-22  0:21               ` Minchan Kim
2011-07-22  0:21                 ` Minchan Kim
2011-07-22  7:42                 ` Mel Gorman
2011-07-22  7:42                   ` Mel Gorman
2011-06-25 14:23 ` [PATCH 0/4] Stop kswapd consuming 100% CPU when highest zone is small Andrew Lutomirski
2011-06-25 14:23   ` Andrew Lutomirski
2011-07-21 15:37 ` Minchan Kim
2011-07-21 15:37   ` Minchan Kim
2011-07-21 16:09   ` Mel Gorman
2011-07-21 16:09     ` Mel Gorman
2011-07-21 16:24     ` Minchan Kim
2011-07-21 16:24       ` Minchan Kim
2011-07-21 16:36       ` Andrew Lutomirski
2011-07-21 16:36         ` Andrew Lutomirski
2011-07-21 16:42         ` Minchan Kim
2011-07-21 16:42           ` Minchan Kim
2011-07-21 16:58           ` Andrew Lutomirski
2011-07-21 16:58             ` Andrew Lutomirski
2011-07-22  0:30             ` Minchan Kim
2011-07-22  0:30               ` Minchan Kim
2011-07-22 13:21               ` Andrew Lutomirski
2011-07-22 13:21                 ` Andrew Lutomirski
  -- strict thread matches above, loose matches on Subject: below --
2011-06-24 13:43 Mel Gorman
2011-06-24 13:43 ` [PATCH 2/4] mm: vmscan: Do not apply pressure to slab if we are not applying pressure to zone Mel Gorman
2011-06-24 13:43   ` Mel Gorman
2011-06-24 13:59   ` Mel Gorman
2011-06-24 13:59     ` Mel Gorman

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.