* [PATCH 0/4] Reduce impact to overall system of SLUB using high-order allocations V2
From: Mel Gorman @ 2011-05-13 14:03 UTC (permalink / raw)
  To: Andrew Morton
  Cc: James Bottomley, Colin King, Raghavendra D Prabhu, Jan Kara,
	Chris Mason, Christoph Lameter, Pekka Enberg, Rik van Riel,
	Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel,
	linux-ext4, Mel Gorman

Changelog since V1
  o kswapd should sleep if need_resched
  o Remove __GFP_REPEAT from GFP flags when speculatively using high
    orders so direct/compaction exits earlier
  o Remove __GFP_NORETRY for correctness
  o Correct logic in sleeping_prematurely
  o Leave SLUB using the default slub_max_order

There are a few reports of people experiencing hangs, with kswapd using
a large amount of CPU, when copying large amounts of data. These appear
to be due to recent reclaim changes.

SLUB using high orders is the trigger but not the root cause as SLUB
has been using high orders for a while. The following four patches
aim to fix the problems in reclaim while reducing the cost for SLUB
using those high orders.

Patch 1 corrects logic introduced by commit [1741c877: mm:
	kswapd: keep kswapd awake for high-order allocations until
	a percentage of the node is balanced] to allow kswapd to
	go to sleep when balanced for high orders.

Patch 2 prevents kswapd waking up in response to SLUB's speculative
	use of high orders.

Patch 3 further reduces the cost by preventing SLUB from entering
	direct compaction or reclaim paths on the grounds that falling
	back to order-0 should be cheaper.

Patch 4 notes that even when kswapd is failing to keep up with
	allocation requests, it should still go to sleep when its
	quota has expired to prevent it spinning.

My own data on this is not great. I haven't really been able to
reproduce the same problem locally.

The test case is simple. "download tar" wgets a large tar file and
stores it locally. "unpack" expands it (to 15 times physical RAM in
this case) and "delete source dirs" deletes it again. I also
experimented with copying the tar numerous times and into deeper
directories to increase the size, but the results were not
particularly interesting so I left it as one tar.

In the background, applications are launched both to vaguely simulate
activity on the desktop and to measure how long it takes applications
to start.

Test server, 4 CPU threads, x86_64, 2G of RAM, no PREEMPT, no COMPACTION, X running
LARGE COPY AND UNTAR
                      vanilla       fixprematurely  kswapd-nowwake slub-noexstep  kswapdsleep
download tar           95 ( 0.00%)   94 ( 1.06%)   94 ( 1.06%)   94 ( 1.06%)   94 ( 1.06%)
unpack tar            654 ( 0.00%)  649 ( 0.77%)  655 (-0.15%)  589 (11.04%)  598 ( 9.36%)
copy source files       0 ( 0.00%)    0 ( 0.00%)    0 ( 0.00%)    0 ( 0.00%)    0 ( 0.00%)
delete source dirs    327 ( 0.00%)  334 (-2.10%)  318 ( 2.83%)  325 ( 0.62%)  320 ( 2.19%)
MMTests Statistics: duration
User/Sys Time Running Test (seconds)        1139.7   1142.55   1149.78   1109.32   1113.26
Total Elapsed Time (seconds)               1341.59   1342.45   1324.90   1271.02   1247.35

MMTests Statistics: application launch
evolution-wait30     mean     34.92   34.96   34.92   34.92   35.08
gnome-terminal-find  mean      7.96    7.96    8.76    7.80    7.96
iceweasel-table      mean      7.93    7.81    7.73    7.65    7.88

evolution-wait30     stddev    0.96    1.22    1.27    1.20    1.15
gnome-terminal-find  stddev    3.02    3.09    3.51    2.99    3.02
iceweasel-table      stddev    1.05    0.90    1.09    1.11    1.11

Having SLUB avoid expensive steps in reclaim improves performance
by quite a bit, with the overall test completing 1.5 minutes
faster. Application launch times were not really affected, but launch
stalls are not something my test machine was suffering from in the
first place, so that result is not conclusive. The kswapd patches also
did not appear to help but, again, the test machine wasn't suffering
from that problem.

These patches are against 2.6.39-rc7. Again, testing would be
appreciated.

 Documentation/vm/slub.txt |    2 +-
 mm/page_alloc.c           |    3 ++-
 mm/slub.c                 |    5 +++--
 3 files changed, 6 insertions(+), 4 deletions(-)

 mm/page_alloc.c |    3 ++-
 mm/slub.c       |    3 ++-
 mm/vmscan.c     |    6 +++++-
 3 files changed, 9 insertions(+), 3 deletions(-)

-- 
1.7.3.4



* [PATCH 1/4] mm: vmscan: Correct use of pgdat_balanced in sleeping_prematurely
From: Mel Gorman @ 2011-05-13 14:03 UTC (permalink / raw)
  To: Andrew Morton
  Cc: James Bottomley, Colin King, Raghavendra D Prabhu, Jan Kara,
	Chris Mason, Christoph Lameter, Pekka Enberg, Rik van Riel,
	Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel,
	linux-ext4, Mel Gorman

Johannes Weiner pointed out that the logic in commit [1741c877: mm:
kswapd: keep kswapd awake for high-order allocations until a percentage
of the node is balanced] is backwards. Instead of allowing kswapd to go
to sleep when balanced for high-order allocations, it keeps kswapd
running uselessly.
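
For reference, a rough sketch of the intent, paraphrased from the
2.6.39-era sleeping_prematurely() rather than quoted exactly: the
function returns true when it is premature for kswapd to sleep, so for
order > 0 it should only keep kswapd awake while the node is still
unbalanced:

	/*
	 * Sketch: return true if kswapd should stay awake because
	 * sleeping now would be premature.
	 */
	if (order)
		/* balanced at this order => sleeping is not premature */
		return !pgdat_balanced(pgdat, balanced, classzone_idx);
	else
		return !all_zones_ok;

The unpatched code returned pgdat_balanced() directly, i.e. it reported
"premature" exactly when the node was already balanced, which is what
kept kswapd running uselessly.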

From-but-was-not-signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Will-sign-off-after-Johannes: Mel Gorman <mgorman@suse.de>
---
 mm/vmscan.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index f6b435c..af24d1e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2286,7 +2286,7 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
 	 * must be balanced
 	 */
 	if (order)
-		return pgdat_balanced(pgdat, balanced, classzone_idx);
+		return !pgdat_balanced(pgdat, balanced, classzone_idx);
 	else
 		return !all_zones_ok;
 }
-- 
1.7.3.4



* [PATCH 2/4] mm: slub: Do not wake kswapd for SLUBs speculative high-order allocations
From: Mel Gorman @ 2011-05-13 14:03 UTC (permalink / raw)
  To: Andrew Morton
  Cc: James Bottomley, Colin King, Raghavendra D Prabhu, Jan Kara,
	Chris Mason, Christoph Lameter, Pekka Enberg, Rik van Riel,
	Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel,
	linux-ext4, Mel Gorman

To avoid locking and per-cpu overhead, SLUB optimistically uses
high-order allocations and falls back to lower orders if they
fail.  However, by simply trying to allocate, kswapd is woken up to
start reclaiming at that order. On a desktop system, two users report
that the system is getting locked up with kswapd using large amounts
of CPU.  Using SLAB instead of SLUB made this problem go away.

This patch prevents kswapd being woken up for high-order allocations.
Testing indicated that with this patch applied, the system was much
harder to hang and even when it did, it eventually recovered.
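
For context, a minimal sketch of the allocator side this relies on,
paraphrased from the 2.6.39-era __alloc_pages_slowpath() in
mm/page_alloc.c (details may differ slightly from the exact source):

	/* Sketch: the slow path only wakes kswapd when the flag is absent */
	if (!(gfp_mask & __GFP_NO_KSWAPD))
		wake_all_kswapd(order, zonelist, high_zoneidx,
					zone_idx(preferred_zone));

With __GFP_NO_KSWAPD added to SLUB's speculative attempt, a failed
high-order allocation no longer leaves kswapd reclaiming at that order
in the background.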

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/slub.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 9d2e5e4..98c358d 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1170,7 +1170,7 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
 	 * Let the initial higher-order allocation fail under memory pressure
 	 * so we fall-back to the minimum order allocation.
 	 */
-	alloc_gfp = (flags | __GFP_NOWARN | __GFP_NORETRY) & ~__GFP_NOFAIL;
+	alloc_gfp = (flags | __GFP_NOWARN | __GFP_NORETRY | __GFP_NO_KSWAPD) & ~__GFP_NOFAIL;
 
 	page = alloc_slab_page(alloc_gfp, node, oo);
 	if (unlikely(!page)) {
-- 
1.7.3.4



* [PATCH 3/4] mm: slub: Do not take expensive steps for SLUBs speculative high-order allocations
From: Mel Gorman @ 2011-05-13 14:03 UTC (permalink / raw)
  To: Andrew Morton
  Cc: James Bottomley, Colin King, Raghavendra D Prabhu, Jan Kara,
	Chris Mason, Christoph Lameter, Pekka Enberg, Rik van Riel,
	Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel,
	linux-ext4, Mel Gorman

To avoid locking and per-cpu overhead, SLUB optimistically uses
high-order allocations and falls back to lower orders if they
fail. However, by simply trying to allocate, the caller can enter
compaction or reclaim - both of which are likely to cost more than the
benefit of using high-order pages in SLUB. On a desktop system, two
users report that the system is getting stalled with kswapd using large
amounts of CPU.

This patch prevents SLUB taking any expensive steps when trying to use
high-order allocations. Instead, it is expected to fall back to smaller
orders more aggressively. Testing was somewhat inconclusive on how much
this helped but it makes sense that falling back to order-0 allocations
is faster than entering compaction or direct reclaim.
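
For context, a simplified sketch of the allocate_slab() logic in
mm/slub.c that the fallback relies on, paraphrased from 2.6.39-era code
with this patch applied (not the exact source):

	/* Speculative attempt: no warning, no kswapd wakeup, and with
	 * __GFP_WAIT/__GFP_REPEAT cleared it cannot enter direct
	 * reclaim or compaction. */
	alloc_gfp = (flags | __GFP_NOWARN | __GFP_NO_KSWAPD) &
			~(__GFP_NOFAIL | __GFP_WAIT | __GFP_REPEAT);

	page = alloc_slab_page(alloc_gfp, node, oo);
	if (unlikely(!page)) {
		/* The cheap high-order attempt failed. Fall back to the
		 * minimum order using the caller's original flags, which
		 * may still reclaim if necessary. */
		oo = s->min;
		page = alloc_slab_page(flags, node, oo);
	}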

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/page_alloc.c |    3 ++-
 mm/slub.c       |    3 ++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9f8a97b..057f1e2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1972,6 +1972,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 {
 	int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
 	const gfp_t wait = gfp_mask & __GFP_WAIT;
+	const gfp_t can_wake_kswapd = !(gfp_mask & __GFP_NO_KSWAPD);
 
 	/* __GFP_HIGH is assumed to be the same as ALLOC_HIGH to save a branch. */
 	BUILD_BUG_ON(__GFP_HIGH != (__force gfp_t) ALLOC_HIGH);
@@ -1984,7 +1985,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 	 */
 	alloc_flags |= (__force int) (gfp_mask & __GFP_HIGH);
 
-	if (!wait) {
+	if (!wait && can_wake_kswapd) {
 		/*
 		 * Not worth trying to allocate harder for
 		 * __GFP_NOMEMALLOC even if it can't schedule.
diff --git a/mm/slub.c b/mm/slub.c
index 98c358d..c5797ab 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1170,7 +1170,8 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
 	 * Let the initial higher-order allocation fail under memory pressure
 	 * so we fall-back to the minimum order allocation.
 	 */
-	alloc_gfp = (flags | __GFP_NOWARN | __GFP_NORETRY | __GFP_NO_KSWAPD) & ~__GFP_NOFAIL;
+	alloc_gfp = (flags | __GFP_NOWARN | __GFP_NO_KSWAPD) &
+			~(__GFP_NOFAIL | __GFP_WAIT | __GFP_REPEAT);
 
 	page = alloc_slab_page(alloc_gfp, node, oo);
 	if (unlikely(!page)) {
-- 
1.7.3.4



* [PATCH 4/4] mm: vmscan: If kswapd has been running too long, allow it to sleep
From: Mel Gorman @ 2011-05-13 14:03 UTC (permalink / raw)
  To: Andrew Morton
  Cc: James Bottomley, Colin King, Raghavendra D Prabhu, Jan Kara,
	Chris Mason, Christoph Lameter, Pekka Enberg, Rik van Riel,
	Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel,
	linux-ext4, Mel Gorman

Under constant allocation pressure, kswapd can be in the situation where
sleeping_prematurely() will always return true even if kswapd has been
running a long time. Check if kswapd needs to be scheduled.
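
For context, a rough sketch of how kswapd uses sleeping_prematurely(),
paraphrased from the 2.6.39-era kswapd_try_to_sleep() in mm/vmscan.c
(not the exact source):

	prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);

	/* Try a short sleep first */
	if (!sleeping_prematurely(pgdat, order, remaining, classzone_idx))
		remaining = schedule_timeout(HZ/10);

	/* If the short sleep was not premature, sleep until woken */
	if (!sleeping_prematurely(pgdat, order, remaining, classzone_idx))
		schedule();

	finish_wait(&pgdat->kswapd_wait, &wait);

With this patch, need_resched() makes sleeping_prematurely() return
false, so a kswapd that has used up its timeslice reaches schedule()
instead of looping straight back into balance_pgdat().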

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/vmscan.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index af24d1e..4d24828 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2251,6 +2251,10 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
 	unsigned long balanced = 0;
 	bool all_zones_ok = true;
 
+	/* If kswapd has been running too long, just sleep */
+	if (need_resched())
+		return false;
+
 	/* If a direct reclaimer woke kswapd within HZ/10, it's premature */
 	if (remaining)
 		return true;
-- 
1.7.3.4



* Re: [PATCH 1/4] mm: vmscan: Correct use of pgdat_balanced in sleeping_prematurely
From: Johannes Weiner @ 2011-05-13 14:28 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, James Bottomley, Colin King, Raghavendra D Prabhu,
	Jan Kara, Chris Mason, Christoph Lameter, Pekka Enberg,
	Rik van Riel, linux-fsdevel, linux-mm, linux-kernel, linux-ext4

On Fri, May 13, 2011 at 03:03:21PM +0100, Mel Gorman wrote:
> Johannes Weiner poined out that the logic in commit [1741c877: mm:
> kswapd: keep kswapd awake for high-order allocations until a percentage
> of the node is balanced] is backwards. Instead of allowing kswapd to go
> to sleep when balancing for high order allocations, it keeps it kswapd
> running uselessly.
> 
> From-but-was-not-signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Thanks for picking it up, Mel.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

> Will-sign-off-after-Johannes: Mel Gorman <mgorman@suse.de>
> ---
>  mm/vmscan.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index f6b435c..af24d1e 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2286,7 +2286,7 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
>  	 * must be balanced
>  	 */
>  	if (order)
> -		return pgdat_balanced(pgdat, balanced, classzone_idx);
> +		return !pgdat_balanced(pgdat, balanced, classzone_idx);
>  	else
>  		return !all_zones_ok;
>  }
> -- 
> 1.7.3.4


* Re: [PATCH 0/4] Reduce impact to overall system of SLUB using high-order allocations V2
From: James Bottomley @ 2011-05-13 15:19 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Colin King, Raghavendra D Prabhu, Jan Kara,
	Chris Mason, Christoph Lameter, Pekka Enberg, Rik van Riel,
	Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel,
	linux-ext4

On Fri, 2011-05-13 at 15:03 +0100, Mel Gorman wrote:
> Changelog since V1
>   o kswapd should sleep if need_resched
>   o Remove __GFP_REPEAT from GFP flags when speculatively using high
>     orders so direct/compaction exits earlier
>   o Remove __GFP_NORETRY for correctness
>   o Correct logic in sleeping_prematurely
>   o Leave SLUB using the default slub_max_order
> 
> There are a few reports of people experiencing hangs when copying
> large amounts of data with kswapd using a large amount of CPU which
> appear to be due to recent reclaim changes.
> 
> SLUB using high orders is the trigger but not the root cause as SLUB
> has been using high orders for a while. The following four patches
> aim to fix the problems in reclaim while reducing the cost for SLUB
> using those high orders.
> 
> Patch 1 corrects logic introduced by commit [1741c877: mm:
> 	kswapd: keep kswapd awake for high-order allocations until
> 	a percentage of the node is balanced] to allow kswapd to
> 	go to sleep when balanced for high orders.
> 
> Patch 2 prevents kswapd waking up in response to SLUBs speculative
> 	use of high orders.
> 
> Patch 3 further reduces the cost by prevent SLUB entering direct
> 	compaction or reclaim paths on the grounds that falling
> 	back to order-0 should be cheaper.
> 
> Patch 4 notes that even when kswapd is failing to keep up with
> 	allocation requests, it should still go to sleep when its
> 	quota has expired to prevent it spinning.

This all works fine for me ... three untar runs and no kswapd hangs or
pegging the CPU at 99% ... in fact, kswapd rarely gets over 20%.

This isn't as good as the kswapd sleeping_prematurely() throttling
patch.  For total CPU time over a run of three 90GB untars, it's about
64s of CPU time with your patch rather than 6s, but that's vastly
better than the 15 minutes of CPU time kswapd was taking even under
PREEMPT.

James




* Re: [PATCH 0/4] Reduce impact to overall system of SLUB using high-order allocations V2
From: Christoph Lameter @ 2011-05-13 15:21 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, James Bottomley, Colin King, Raghavendra D Prabhu,
	Jan Kara, Chris Mason, Pekka Enberg, Rik van Riel,
	Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel,
	linux-ext4

On Fri, 13 May 2011, Mel Gorman wrote:

> SLUB using high orders is the trigger but not the root cause as SLUB
> has been using high orders for a while. The following four patches
> aim to fix the problems in reclaim while reducing the cost for SLUB
> using those high orders.
>
> Patch 1 corrects logic introduced by commit [1741c877: mm:
> 	kswapd: keep kswapd awake for high-order allocations until
> 	a percentage of the node is balanced] to allow kswapd to
> 	go to sleep when balanced for high orders.

The above looks good.

> Patch 2 prevents kswapd waking up in response to SLUBs speculative
> 	use of high orders.

Not sure if that is necessary since it seems that we triggered kswapd
before? Why not continue to do it? Once kswapd has enough higher-order
pages, kswapd should no longer be triggered, right?

> Patch 3 further reduces the cost by prevent SLUB entering direct
> 	compaction or reclaim paths on the grounds that falling
> 	back to order-0 should be cheaper.

It's cheaper for the reclaim path, true, but more expensive in terms of
SLUB's management costs for the data, and it also increases the memory
wasted. A higher order means denser packing of objects and less page
management overhead. Fallback is not free. Reasonable effort should be
made to allocate the page order requested.

> Patch 4 notes that even when kswapd is failing to keep up with
> 	allocation requests, it should still go to sleep when its
> 	quota has expired to prevent it spinning.

Looks good too.

Overall, it looks like the compaction logic and the modifications to
reclaim introduced recently with the intent of increasing the amount of
physically contiguous memory are not working as expected.

SLUB's chance of getting higher-order pages should be *increasing* as a
result of these changes. The above looks like the chances are decreasing
now.

This is a matter of future concern. The metadata management overhead
in the kernel is continually increasing since memory sizes keep growing
and we typically manage memory in 4k chunks. Through larger allocation
sizes we can reduce that management overhead, but we can only do this if
we have an effective way of defragmenting memory to get longer contiguous
chunks that can be managed with a single page struct.

Please make sure that compaction and related measures really work properly.

The patches suggest that the recent modifications are not improving the
situation.



* Re: [PATCH 0/4] Reduce impact to overall system of SLUB using high-order allocations V2
From: Mel Gorman @ 2011-05-13 15:43 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, James Bottomley, Colin King, Raghavendra D Prabhu,
	Jan Kara, Chris Mason, Pekka Enberg, Rik van Riel,
	Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel,
	linux-ext4

On Fri, May 13, 2011 at 10:21:46AM -0500, Christoph Lameter wrote:
> On Fri, 13 May 2011, Mel Gorman wrote:
> 
> > SLUB using high orders is the trigger but not the root cause as SLUB
> > has been using high orders for a while. The following four patches
> > aim to fix the problems in reclaim while reducing the cost for SLUB
> > using those high orders.
> >
> > Patch 1 corrects logic introduced by commit [1741c877: mm:
> > 	kswapd: keep kswapd awake for high-order allocations until
> > 	a percentage of the node is balanced] to allow kswapd to
> > 	go to sleep when balanced for high orders.
> 
> The above looks good.
> 

Ok.

> > Patch 2 prevents kswapd waking up in response to SLUBs speculative
> > 	use of high orders.
> 
> Not sure if that is necessary since it seems that we triggered kswapd
> before? Why not continue to do it? Once kswapd has enough higher order
> pages kswapd should no longer be triggered right?
> 

Because kswapd waking up isn't cheap and we are reclaiming pages
just so SLUB may get high-order pages in the future. As it's for
PAGE_ALLOC_COSTLY_ORDER, we are not entering lumpy reclaim and just
selecting a few random order-0 pages which may or may not help. There
is very little control over how many pages are getting freed if kswapd
is being woken frequently.

> > Patch 3 further reduces the cost by prevent SLUB entering direct
> > 	compaction or reclaim paths on the grounds that falling
> > 	back to order-0 should be cheaper.
> 
> Its cheaper for reclaim path true but more expensive in terms of SLUBs
> management costs of the data and it also increases the memory wasted.

Surely the reclaim cost exceeds SLUB management cost?

> A
> higher order means denser packing of objects less page management
> overhead. Fallback is not for free.

Neither is reclaiming a large bunch of pages. Worse, reclaiming
pages so SLUB gets a high-order page means it's likely to be stealing
MIGRATE_MOVABLE blocks, which eventually gives diminishing returns but
may not be noticeable for weeks. From a fragmentation perspective,
it's better if SLUB uses order-0 allocations when memory is low so
that SLUB pages continue to get packed into as few MIGRATE_UNMOVABLE
and MIGRATE_UNRECLAIMABLE blocks as possible.

>  Reasonable effort should be made to
> allocate the page order requested.
> 
> > Patch 4 notes that even when kswapd is failing to keep up with
> > 	allocation requests, it should still go to sleep when its
> > 	quota has expired to prevent it spinning.
> 
> Looks good too.
> 
> Overall, it looks like the compaction logic and the modifications to
> reclaim introduced recently with the intend to increase the amount of
> physically contiguous memory is not working as expected.
> 

The reclaim and kswapd damage was unintended and this is my fault
but reclaim/compaction still makes a lot more sense than lumpy
reclaim. Testing showed it disrupted the system a lot less and
allocated high-order pages faster with fewer pages reclaimed.

> SLUBs chance of getting higher order pages should be *increasing* as a
> result of these changes. The above looks like the chances are decreasing
> now.
> 

Patches 2 and 3 may mean that SLUB gets fewer high-order pages when
memory is low and instead depends on high-order pages being naturally
freed by SLUB as it recycles slabs of old objects. On the flip side,
fewer pages will be reclaimed. I'd expect the latter option to be
cheaper overall.

> This is a matter of future concern. The metadata management overhead
> in the kernel is continually increasing since memory sizes keep growing
> and we typically manage memory in 4k chunks. Through large allocation
> sizes we can reduce that management overhead but we can only do this if we
> have an effective way of defragmenting memory to get longer contiguous
> chunks that can be managed to a single page struct.
> 
> Please make sure that compaction and related measures really work properly.
> 

Local testing still shows them to be behaving as expected but then
again, I haven't reproduced the simple problem reported by Chris
and James despite using a few different laptops and two different
low-end servers.

> The patches suggest that the recent modifications are not improving the
> situation.
> 

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 0/4] Reduce impact to overall system of SLUB using high-order allocations V2
From: Mel Gorman @ 2011-05-13 15:52 UTC (permalink / raw)
  To: James Bottomley
  Cc: Andrew Morton, Colin King, Raghavendra D Prabhu, Jan Kara,
	Chris Mason, Christoph Lameter, Pekka Enberg, Rik van Riel,
	Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel,
	linux-ext4

On Fri, May 13, 2011 at 10:19:44AM -0500, James Bottomley wrote:
> On Fri, 2011-05-13 at 15:03 +0100, Mel Gorman wrote:
> > Changelog since V1
> >   o kswapd should sleep if need_resched
> >   o Remove __GFP_REPEAT from GFP flags when speculatively using high
> >     orders so direct/compaction exits earlier
> >   o Remove __GFP_NORETRY for correctness
> >   o Correct logic in sleeping_prematurely
> >   o Leave SLUB using the default slub_max_order
> > 
> > There are a few reports of people experiencing hangs when copying
> > large amounts of data with kswapd using a large amount of CPU which
> > appear to be due to recent reclaim changes.
> > 
> > SLUB using high orders is the trigger but not the root cause as SLUB
> > has been using high orders for a while. The following four patches
> > aim to fix the problems in reclaim while reducing the cost for SLUB
> > using those high orders.
> > 
> > Patch 1 corrects logic introduced by commit [1741c877: mm:
> > 	kswapd: keep kswapd awake for high-order allocations until
> > 	a percentage of the node is balanced] to allow kswapd to
> > 	go to sleep when balanced for high orders.
> > 
> > Patch 2 prevents kswapd waking up in response to SLUBs speculative
> > 	use of high orders.
> > 
> > Patch 3 further reduces the cost by prevent SLUB entering direct
> > 	compaction or reclaim paths on the grounds that falling
> > 	back to order-0 should be cheaper.
> > 
> > Patch 4 notes that even when kswapd is failing to keep up with
> > 	allocation requests, it should still go to sleep when its
> > 	quota has expired to prevent it spinning.
> 
> This all works fine for me ... three untar runs and no kswapd hangs or
> pegging the CPU at 99% ... in fact, kswapd rarely gets over 20%
> 

Good stuff, thanks.

> This isn't as good as the kswapd sleeping_prematurely() throttling
> patch. For total CPU time on a three 90GB untar run, it's about 64s of
> CPU time with your patch rather than 6s, but that's vastly better than
> the 15 minutes of CPU time kswapd was taking even under PREEMPT.
> 

The throttling patch is unfortunately a bit hand-wavy, being based on
the number of times it's entered and the time passed. It'll be even
harder to debug problems related to this in the future, particularly as
it's using global information (a static) for kswapd, which is per-node
(and could be worse in the future depending on what memcg does).

However, as you are testing against stable, can you also apply this
patch? [2876592f: mm: vmscan: stop reclaim/compaction earlier due to
insufficient progress if !__GFP_REPEAT]. It makes a difference as to
when reclaimers give up on high-orders and go to sleep.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 0/4] Reduce impact to overall system of SLUB using high-order allocations V2
  2011-05-13 14:03 ` Mel Gorman
@ 2011-05-14  8:34   ` Colin Ian King
  -1 siblings, 0 replies; 119+ messages in thread
From: Colin Ian King @ 2011-05-14  8:34 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, James Bottomley, Raghavendra D Prabhu, Jan Kara,
	Chris Mason, Christoph Lameter, Pekka Enberg, Rik van Riel,
	Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel,
	linux-ext4

On Fri, 2011-05-13 at 15:03 +0100, Mel Gorman wrote:
> Changelog since V1
>   o kswapd should sleep if need_resched
>   o Remove __GFP_REPEAT from GFP flags when speculatively using high
>     orders so direct/compaction exits earlier
>   o Remove __GFP_NORETRY for correctness
>   o Correct logic in sleeping_prematurely
>   o Leave SLUB using the default slub_max_order
> 
> There are a few reports of people experiencing hangs when copying
> large amounts of data with kswapd using a large amount of CPU which
> appear to be due to recent reclaim changes.
> 
> SLUB using high orders is the trigger but not the root cause as SLUB
> has been using high orders for a while. The following four patches
> aim to fix the problems in reclaim while reducing the cost for SLUB
> using those high orders.
> 
> Patch 1 corrects logic introduced by commit [1741c877: mm:
> 	kswapd: keep kswapd awake for high-order allocations until
> 	a percentage of the node is balanced] to allow kswapd to
> 	go to sleep when balanced for high orders.
> 
> Patch 2 prevents kswapd waking up in response to SLUBs speculative
> 	use of high orders.
> 
> Patch 3 further reduces the cost by prevent SLUB entering direct
> 	compaction or reclaim paths on the grounds that falling
> 	back to order-0 should be cheaper.
> 
> Patch 4 notes that even when kswapd is failing to keep up with
> 	allocation requests, it should still go to sleep when its
> 	quota has expired to prevent it spinning.
> 
> My own data on this is not great. I haven't really been able to
> reproduce the same problem locally.
> 
> The test case is simple. "download tar" wgets a large tar file and
> stores it locally. "unpack" is expanding it (15 times physical RAM
> in this case) and "delete source dirs" is the tarfile being deleted
> again. I also experimented with having the tar copied numerous times
> and into deeper directories to increase the size but the results were
> not particularly interesting so I left it as one tar.
> 
> In the background, applications are being launched to time to vaguely
> simulate activity on the desktop and to measure how long it takes
> applications to start.
> 
> Test server, 4 CPU threads, x86_64, 2G of RAM, no PREEMPT, no COMPACTION, X running
> LARGE COPY AND UNTAR
>                       vanilla       fixprematurely  kswapd-nowwake slub-noexstep  kswapdsleep
> download tar           95 ( 0.00%)   94 ( 1.06%)   94 ( 1.06%)   94 ( 1.06%)   94 ( 1.06%)
> unpack tar            654 ( 0.00%)  649 ( 0.77%)  655 (-0.15%)  589 (11.04%)  598 ( 9.36%)
> copy source files       0 ( 0.00%)    0 ( 0.00%)    0 ( 0.00%)    0 ( 0.00%)    0 ( 0.00%)
> delete source dirs    327 ( 0.00%)  334 (-2.10%)  318 ( 2.83%)  325 ( 0.62%)  320 ( 2.19%)
> MMTests Statistics: duration
> User/Sys Time Running Test (seconds)        1139.7   1142.55   1149.78   1109.32   1113.26
> Total Elapsed Time (seconds)               1341.59   1342.45   1324.90   1271.02   1247.35
> 
> MMTests Statistics: application launch
> evolution-wait30     mean     34.92   34.96   34.92   34.92   35.08
> gnome-terminal-find  mean      7.96    7.96    8.76    7.80    7.96
> iceweasel-table      mean      7.93    7.81    7.73    7.65    7.88
> 
> evolution-wait30     stddev    0.96    1.22    1.27    1.20    1.15
> gnome-terminal-find  stddev    3.02    3.09    3.51    2.99    3.02
> iceweasel-table      stddev    1.05    0.90    1.09    1.11    1.11
> 
> Having SLUB avoid expensive steps in reclaim improves performance
> by quite a bit with the overall test completing 1.5 minutes
> faster. Application launch times were not really affected but it's
> not something my test machine was suffering from in the first place
> so it's not really conclusive. The kswapd patches also did not appear
> to help but again, the test machine wasn't suffering that problem.
> 
> These patches are against 2.6.39-rc7. Again, testing would be
> appreciated.

These patches solve the problem for me. I've been soak testing the file
copy test for 3.5 hours with nearly 400 test cycles and observed no
lockups at all - rock solid. From the vmstat output, the system is
behaving sanely.
Thanks for finding a solution - much appreciated!

> 
>  Documentation/vm/slub.txt |    2 +-
>  mm/page_alloc.c           |    3 ++-
>  mm/slub.c                 |    5 +++--
>  3 files changed, 6 insertions(+), 4 deletions(-)
> 
>  mm/page_alloc.c |    3 ++-
>  mm/slub.c       |    3 ++-
>  mm/vmscan.c     |    6 +++++-
>  3 files changed, 9 insertions(+), 3 deletions(-)
> 



^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH 1/4] mm: vmscan: Correct use of pgdat_balanced in sleeping_prematurely
  2011-05-13 14:03   ` Mel Gorman
@ 2011-05-14 16:30     ` Minchan Kim
  -1 siblings, 0 replies; 119+ messages in thread
From: Minchan Kim @ 2011-05-14 16:30 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, James Bottomley, Colin King, Raghavendra D Prabhu,
	Jan Kara, Chris Mason, Christoph Lameter, Pekka Enberg,
	Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm,
	linux-kernel, linux-ext4

On Fri, May 13, 2011 at 11:03 PM, Mel Gorman <mgorman@suse.de> wrote:
> Johannes Weiner pointed out that the logic in commit [1741c877: mm:
> kswapd: keep kswapd awake for high-order allocations until a percentage
> of the node is balanced] is backwards. Instead of allowing kswapd to go
> to sleep when balancing for high order allocations, it keeps kswapd
> running uselessly.
>
> From-but-was-not-signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Will-sign-off-after-Johannes: Mel Gorman <mgorman@suse.de>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>

Nice catch! Hannes.
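
For anyone following along, the check being discussed sits at the end of
sleeping_prematurely() and the fix is, in outline, the following (a
sketch of the idea rather than the exact hunk):

	/*
	 * sleeping_prematurely() returns true when kswapd should stay awake.
	 * For high-order wakeups the backwards test kept kswapd running once
	 * the node *was* balanced; the fix inverts it:
	 */
	if (order)
		return !pgdat_balanced(pgdat, balanced, classzone_idx);
	else
		return !all_zones_ok;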

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH 4/4] mm: vmscan: If kswapd has been running too long, allow it to sleep
  2011-05-13 14:03   ` Mel Gorman
@ 2011-05-15 10:27     ` KOSAKI Motohiro
  -1 siblings, 0 replies; 119+ messages in thread
From: KOSAKI Motohiro @ 2011-05-15 10:27 UTC (permalink / raw)
  To: mgorman
  Cc: akpm, James.Bottomley, colin.king, raghu.prabhu13, jack,
	chris.mason, cl, penberg, riel, hannes, linux-fsdevel, linux-mm,
	linux-kernel, linux-ext4

(2011/05/13 23:03), Mel Gorman wrote:
> Under constant allocation pressure, kswapd can be in the situation where
> sleeping_prematurely() will always return true even if kswapd has been
> running a long time. Check if kswapd needs to be scheduled.
> 
> Signed-off-by: Mel Gorman<mgorman@suse.de>
> ---
>   mm/vmscan.c |    4 ++++
>   1 files changed, 4 insertions(+), 0 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index af24d1e..4d24828 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2251,6 +2251,10 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
>   	unsigned long balanced = 0;
>   	bool all_zones_ok = true;
> 
> +	/* If kswapd has been running too long, just sleep */
> +	if (need_resched())
> +		return false;
> +

Hmm... I don't like this patch so much. because this code does

- don't sleep if kswapd got context switch at shrink_inactive_list
- sleep if kswapd didn't

It seems to be semi random behavior.




^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH 4/4] mm: vmscan: If kswapd has been running too long, allow it to sleep
  2011-05-15 10:27     ` KOSAKI Motohiro
@ 2011-05-16  4:21       ` James Bottomley
  -1 siblings, 0 replies; 119+ messages in thread
From: James Bottomley @ 2011-05-16  4:21 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: mgorman, akpm, colin.king, raghu.prabhu13, jack, chris.mason, cl,
	penberg, riel, hannes, linux-fsdevel, linux-mm, linux-kernel,
	linux-ext4

On Sun, 2011-05-15 at 19:27 +0900, KOSAKI Motohiro wrote:
> (2011/05/13 23:03), Mel Gorman wrote:
> > Under constant allocation pressure, kswapd can be in the situation where
> > sleeping_prematurely() will always return true even if kswapd has been
> > running a long time. Check if kswapd needs to be scheduled.
> > 
> > Signed-off-by: Mel Gorman<mgorman@suse.de>
> > ---
> >   mm/vmscan.c |    4 ++++
> >   1 files changed, 4 insertions(+), 0 deletions(-)
> > 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index af24d1e..4d24828 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2251,6 +2251,10 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
> >   	unsigned long balanced = 0;
> >   	bool all_zones_ok = true;
> > 
> > +	/* If kswapd has been running too long, just sleep */
> > +	if (need_resched())
> > +		return false;
> > +
> 
> Hmm... I don't like this patch so much. because this code does
> 
> - don't sleep if kswapd got context switch at shrink_inactive_list

This isn't entirely true:  need_resched() will be false, so we'll follow
the normal path for determining whether to sleep or not, in effect
leaving the current behaviour unchanged.

> - sleep if kswapd didn't

This also isn't entirely true: whether need_resched() is true at this
point depends on a whole lot more than whether we did a context switch
in shrink_inactive. It mostly depends on how long we've been running
without giving up the CPU.  Generally that will mean we've been round
the shrinker loop hundreds to thousands of times without sleeping.

> It seems to be semi random behavior.

Well, we have to do something.  Chris Mason first suspected the hang was
a kswapd rescheduling problem a while ago.  We tried putting
cond_rescheds() in several places in the vmscan code, but to no avail.
The need_resched() in sleeping_prematurely() seems to be about the best
option.  The other option might be just to put a cond_resched() in
kswapd_try_to_sleep(), but that will really have about the same effect.
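
A minimal sketch of that alternative, assuming the 2.6.39
kswapd_try_to_sleep() signature (the placement is hypothetical, not a
tested patch):

	static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
	{
		/* Yield the CPU before working out whether kswapd may sleep */
		cond_resched();

		/* ... rest of the function unchanged ... */
	}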

James



^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH 4/4] mm: vmscan: If kswapd has been running too long, allow it to sleep
  2011-05-16  4:21       ` James Bottomley
@ 2011-05-16  5:04         ` Minchan Kim
  -1 siblings, 0 replies; 119+ messages in thread
From: Minchan Kim @ 2011-05-16  5:04 UTC (permalink / raw)
  To: James Bottomley
  Cc: KOSAKI Motohiro, mgorman, akpm, colin.king, raghu.prabhu13, jack,
	chris.mason, cl, penberg, riel, hannes, linux-fsdevel, linux-mm,
	linux-kernel, linux-ext4

On Mon, May 16, 2011 at 1:21 PM, James Bottomley
<James.Bottomley@hansenpartnership.com> wrote:
> On Sun, 2011-05-15 at 19:27 +0900, KOSAKI Motohiro wrote:
>> (2011/05/13 23:03), Mel Gorman wrote:
>> > Under constant allocation pressure, kswapd can be in the situation where
>> > sleeping_prematurely() will always return true even if kswapd has been
>> > running a long time. Check if kswapd needs to be scheduled.
>> >
>> > Signed-off-by: Mel Gorman<mgorman@suse.de>
>> > ---
>> >   mm/vmscan.c |    4 ++++
>> >   1 files changed, 4 insertions(+), 0 deletions(-)
>> >
>> > diff --git a/mm/vmscan.c b/mm/vmscan.c
>> > index af24d1e..4d24828 100644
>> > --- a/mm/vmscan.c
>> > +++ b/mm/vmscan.c
>> > @@ -2251,6 +2251,10 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
>> >     unsigned long balanced = 0;
>> >     bool all_zones_ok = true;
>> >
>> > +   /* If kswapd has been running too long, just sleep */
>> > +   if (need_resched())
>> > +           return false;
>> > +
>>
>> Hmm... I don't like this patch so much. because this code does
>>
>> - don't sleep if kswapd got context switch at shrink_inactive_list
>
> This isn't entirely true:  need_resched() will be false, so we'll follow
> the normal path for determining whether to sleep or not, in effect
> leaving the current behaviour unchanged.
>
>> - sleep if kswapd didn't
>
> This also isn't entirely true: whether need_resched() is true at this
> point depends on a whole lot more that whether we did a context switch
> in shrink_inactive. It mostly depends on how long we've been running
> without giving up the CPU.  Generally that will mean we've been round
> the shrinker loop hundreds to thousands of times without sleeping.
>
>> It seems to be semi random behavior.
>
> Well, we have to do something.  Chris Mason first suspected the hang was
> a kswapd rescheduling problem a while ago.  We tried putting
> cond_rescheds() in several places in the vmscan code, but to no avail.

Is it a result of a test with Hannes' patch (ie, !pgdat_balanced)?

If it isn't, it would be a nop regardless of putting cond_resched in vmscan.c,
because although we complete zone balancing, kswapd doesn't sleep as
pgdat_balanced returns the wrong result. Eventually the VM calls
balance_pgdat again, and in that case balance_pgdat returns without doing
any work as kswapd can't find any zone that doesn't have enough free pages
and jumps to out. kswapd could repeat this work indefinitely, so there is
no chance to call cond_resched.

But if your test was with Hannes' patch, I am very curious how
kswapd ends up consuming so much CPU.

> The need_resched() in sleeping_prematurely() seems to be about the best
> option.  The other option might be just to put a cond_resched() in
> kswapd_try_to_sleep(), but that will really have about the same effect.

I don't oppose it but before that, I think we have to know why kswapd
consumes so much CPU even though we applied Hannes' patch.

>
> James
>
>



-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH 0/4] Reduce impact to overall system of SLUB using high-order allocations V2
  2011-05-14  8:34   ` Colin Ian King
@ 2011-05-16  8:37     ` Mel Gorman
  -1 siblings, 0 replies; 119+ messages in thread
From: Mel Gorman @ 2011-05-16  8:37 UTC (permalink / raw)
  To: Colin Ian King
  Cc: Andrew Morton, James Bottomley, Raghavendra D Prabhu, Jan Kara,
	Chris Mason, Christoph Lameter, Pekka Enberg, Rik van Riel,
	Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel,
	linux-ext4

On Sat, May 14, 2011 at 10:34:33AM +0200, Colin Ian King wrote:
> On Fri, 2011-05-13 at 15:03 +0100, Mel Gorman wrote:
> > Changelog since V1
> >   o kswapd should sleep if need_resched
> >   o Remove __GFP_REPEAT from GFP flags when speculatively using high
> >     orders so direct/compaction exits earlier
> >   o Remove __GFP_NORETRY for correctness
> >   o Correct logic in sleeping_prematurely
> >   o Leave SLUB using the default slub_max_order
> > 
> > There are a few reports of people experiencing hangs when copying
> > large amounts of data with kswapd using a large amount of CPU which
> > appear to be due to recent reclaim changes.
> > 
> > SLUB using high orders is the trigger but not the root cause as SLUB
> > has been using high orders for a while. The following four patches
> > aim to fix the problems in reclaim while reducing the cost for SLUB
> > using those high orders.
> > 
> > Patch 1 corrects logic introduced by commit [1741c877: mm:
> > 	kswapd: keep kswapd awake for high-order allocations until
> > 	a percentage of the node is balanced] to allow kswapd to
> > 	go to sleep when balanced for high orders.
> > 
> > Patch 2 prevents kswapd waking up in response to SLUBs speculative
> > 	use of high orders.
> > 
> > Patch 3 further reduces the cost by prevent SLUB entering direct
> > 	compaction or reclaim paths on the grounds that falling
> > 	back to order-0 should be cheaper.
> > 
> > Patch 4 notes that even when kswapd is failing to keep up with
> > 	allocation requests, it should still go to sleep when its
> > 	quota has expired to prevent it spinning.
> > 
> > My own data on this is not great. I haven't really been able to
> > reproduce the same problem locally.
> > 
> > The test case is simple. "download tar" wgets a large tar file and
> > stores it locally. "unpack" is expanding it (15 times physical RAM
> > in this case) and "delete source dirs" is the tarfile being deleted
> > again. I also experimented with having the tar copied numerous times
> > and into deeper directories to increase the size but the results were
> > not particularly interesting so I left it as one tar.
> > 
> > In the background, applications are being launched to time to vaguely
> > simulate activity on the desktop and to measure how long it takes
> > applications to start.
> > 
> > Test server, 4 CPU threads, x86_64, 2G of RAM, no PREEMPT, no COMPACTION, X running
> > LARGE COPY AND UNTAR
> >                       vanilla       fixprematurely  kswapd-nowwake slub-noexstep  kswapdsleep
> > download tar           95 ( 0.00%)   94 ( 1.06%)   94 ( 1.06%)   94 ( 1.06%)   94 ( 1.06%)
> > unpack tar            654 ( 0.00%)  649 ( 0.77%)  655 (-0.15%)  589 (11.04%)  598 ( 9.36%)
> > copy source files       0 ( 0.00%)    0 ( 0.00%)    0 ( 0.00%)    0 ( 0.00%)    0 ( 0.00%)
> > delete source dirs    327 ( 0.00%)  334 (-2.10%)  318 ( 2.83%)  325 ( 0.62%)  320 ( 2.19%)
> > MMTests Statistics: duration
> > User/Sys Time Running Test (seconds)        1139.7   1142.55   1149.78   1109.32   1113.26
> > Total Elapsed Time (seconds)               1341.59   1342.45   1324.90   1271.02   1247.35
> > 
> > MMTests Statistics: application launch
> > evolution-wait30     mean     34.92   34.96   34.92   34.92   35.08
> > gnome-terminal-find  mean      7.96    7.96    8.76    7.80    7.96
> > iceweasel-table      mean      7.93    7.81    7.73    7.65    7.88
> > 
> > evolution-wait30     stddev    0.96    1.22    1.27    1.20    1.15
> > gnome-terminal-find  stddev    3.02    3.09    3.51    2.99    3.02
> > iceweasel-table      stddev    1.05    0.90    1.09    1.11    1.11
> > 
> > Having SLUB avoid expensive steps in reclaim improves performance
> > by quite a bit with the overall test completing 1.5 minutes
> > faster. Application launch times were not really affected but it's
> > not something my test machine was suffering from in the first place
> > so it's not really conclusive. The kswapd patches also did not appear
> > to help but again, the test machine wasn't suffering that problem.
> > 
> > These patches are against 2.6.39-rc7. Again, testing would be
> > appreciated.
> 
> These patches solve the problem for me.  I've been soak testing the file
> copy test
> for 3.5 hours with nearly 400 test cycles and observed no lockups at all
> - rock solid. From my observations from the output from vmstat the
> system is behaving sanely.
> Thanks for finding a solution - much appreciated!
> 

Can you tell me if just patches 1 and 4 fix the problem please? It'd be good
to know if this was only a reclaim-related problem. Thanks.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH 4/4] mm: vmscan: If kswapd has been running too long, allow it to sleep
  2011-05-15 10:27     ` KOSAKI Motohiro
@ 2011-05-16  8:45       ` Mel Gorman
  -1 siblings, 0 replies; 119+ messages in thread
From: Mel Gorman @ 2011-05-16  8:45 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: akpm, James.Bottomley, colin.king, raghu.prabhu13, jack,
	chris.mason, cl, penberg, riel, hannes, linux-fsdevel, linux-mm,
	linux-kernel, linux-ext4

On Sun, May 15, 2011 at 07:27:12PM +0900, KOSAKI Motohiro wrote:
> (2011/05/13 23:03), Mel Gorman wrote:
> > Under constant allocation pressure, kswapd can be in the situation where
> > sleeping_prematurely() will always return true even if kswapd has been
> > running a long time. Check if kswapd needs to be scheduled.
> > 
> > Signed-off-by: Mel Gorman<mgorman@suse.de>
> > ---
> >   mm/vmscan.c |    4 ++++
> >   1 files changed, 4 insertions(+), 0 deletions(-)
> > 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index af24d1e..4d24828 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2251,6 +2251,10 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
> >   	unsigned long balanced = 0;
> >   	bool all_zones_ok = true;
> > 
> > +	/* If kswapd has been running too long, just sleep */
> > +	if (need_resched())
> > +		return false;
> > +
> 
> Hmm... I don't like this patch so much. because this code does
> 
> - don't sleep if kswapd got context switch at shrink_inactive_list
> - sleep if kswapd didn't
> 
> It seems to be semi random behavior.
> 

It's possible to keep kswapd awake simply by allocating fast enough that
the watermarks are never balanced, making kswapd appear to consume 100%
of the CPU. This check causes kswapd to sleep in this case. The processes
doing the allocations will enter direct reclaim and probably stall while
processes that are not allocating will get some CPU time.
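
To show how that false return turns into an actual nap, the sleep path in
kswapd_try_to_sleep() looks roughly like this in 2.6.39 (simplified from
memory, so treat it as a sketch rather than the exact code):

	prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);

	/* sleeping_prematurely() == false means it is safe to doze */
	if (!sleeping_prematurely(pgdat, order, remaining, classzone_idx))
		remaining = schedule_timeout(HZ/10);	/* short trial sleep */

	/* only if the short sleep was undisturbed does kswapd sleep for real */
	if (!sleeping_prematurely(pgdat, order, remaining, classzone_idx))
		schedule();

	finish_wait(&pgdat->kswapd_wait, &wait);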

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH 4/4] mm: vmscan: If kswapd has been running too long, allow it to sleep
  2011-05-16  5:04         ` Minchan Kim
  (?)
@ 2011-05-16  8:45           ` Mel Gorman
  -1 siblings, 0 replies; 119+ messages in thread
From: Mel Gorman @ 2011-05-16  8:45 UTC (permalink / raw)
  To: Minchan Kim
  Cc: James Bottomley, KOSAKI Motohiro, akpm, colin.king,
	raghu.prabhu13, jack, chris.mason, cl, penberg, riel, hannes,
	linux-fsdevel, linux-mm, linux-kernel, linux-ext4

On Mon, May 16, 2011 at 02:04:00PM +0900, Minchan Kim wrote:
> On Mon, May 16, 2011 at 1:21 PM, James Bottomley
> <James.Bottomley@hansenpartnership.com> wrote:
> > On Sun, 2011-05-15 at 19:27 +0900, KOSAKI Motohiro wrote:
> >> (2011/05/13 23:03), Mel Gorman wrote:
> >> > Under constant allocation pressure, kswapd can be in the situation where
> >> > sleeping_prematurely() will always return true even if kswapd has been
> >> > running a long time. Check if kswapd needs to be scheduled.
> >> >
> >> > Signed-off-by: Mel Gorman<mgorman@suse.de>
> >> > ---
> >> >   mm/vmscan.c |    4 ++++
> >> >   1 files changed, 4 insertions(+), 0 deletions(-)
> >> >
> >> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> >> > index af24d1e..4d24828 100644
> >> > --- a/mm/vmscan.c
> >> > +++ b/mm/vmscan.c
> >> > @@ -2251,6 +2251,10 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
> >> >     unsigned long balanced = 0;
> >> >     bool all_zones_ok = true;
> >> >
> >> > +   /* If kswapd has been running too long, just sleep */
> >> > +   if (need_resched())
> >> > +           return false;
> >> > +
> >>
> >> Hmm... I don't like this patch so much. because this code does
> >>
> >> - don't sleep if kswapd got context switch at shrink_inactive_list
> >
> > This isn't entirely true:  need_resched() will be false, so we'll follow
> > the normal path for determining whether to sleep or not, in effect
> > leaving the current behaviour unchanged.
> >
> >> - sleep if kswapd didn't
> >
> > This also isn't entirely true: whether need_resched() is true at this
> > point depends on a whole lot more that whether we did a context switch
> > in shrink_inactive. It mostly depends on how long we've been running
> > without giving up the CPU.  Generally that will mean we've been round
> > the shrinker loop hundreds to thousands of times without sleeping.
> >
> >> It seems to be semi random behavior.
> >
> > Well, we have to do something.  Chris Mason first suspected the hang was
> > a kswapd rescheduling problem a while ago.  We tried putting
> > cond_rescheds() in several places in the vmscan code, but to no avail.
> 
> Is it a result of  test with patch of Hannes(ie, !pgdat_balanced)?
> 
> If it isn't, it would be nop regardless of putting cond_reshed at vmscan.c.
> Because, although we complete zone balancing, kswapd doesn't sleep as
> pgdat_balance returns wrong result. And at last VM calls
> balance_pgdat. In this case, balance_pgdat returns without any work as
> kswap couldn't find zones which have not enough free pages and goto
> out. kswapd could repeat this work infinitely. So you don't have a
> chance to call cond_resched.
> 
> But if your test was with Hanne's patch, I am very curious how come
> kswapd consumes CPU a lot.
> 
> > The need_resched() in sleeping_prematurely() seems to be about the best
> > option.  The other option might be just to put a cond_resched() in
> > kswapd_try_to_sleep(), but that will really have about the same effect.
> 
> I don't oppose it but before that, I think we have to know why kswapd
> consumes CPU a lot although we applied Hannes' patch.
> 

Because it's still possible for processes to allocate pages at the same
rate kswapd is freeing them, leading to a situation where kswapd does not
consider the zone balanced for prolonged periods of time.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH 4/4] mm: vmscan: If kswapd has been running too long, allow it to sleep
  2011-05-16  8:45           ` Mel Gorman
@ 2011-05-16  8:58             ` Minchan Kim
  -1 siblings, 0 replies; 119+ messages in thread
From: Minchan Kim @ 2011-05-16  8:58 UTC (permalink / raw)
  To: Mel Gorman
  Cc: James Bottomley, KOSAKI Motohiro, akpm, colin.king,
	raghu.prabhu13, jack, chris.mason, cl, penberg, riel, hannes,
	linux-fsdevel, linux-mm, linux-kernel, linux-ext4

On Mon, May 16, 2011 at 5:45 PM, Mel Gorman <mgorman@suse.de> wrote:
> On Mon, May 16, 2011 at 02:04:00PM +0900, Minchan Kim wrote:
>> On Mon, May 16, 2011 at 1:21 PM, James Bottomley
>> <James.Bottomley@hansenpartnership.com> wrote:
>> > On Sun, 2011-05-15 at 19:27 +0900, KOSAKI Motohiro wrote:
>> >> (2011/05/13 23:03), Mel Gorman wrote:
>> >> > Under constant allocation pressure, kswapd can be in the situation where
>> >> > sleeping_prematurely() will always return true even if kswapd has been
>> >> > running a long time. Check if kswapd needs to be scheduled.
>> >> >
>> >> > Signed-off-by: Mel Gorman<mgorman@suse.de>
>> >> > ---
>> >> >   mm/vmscan.c |    4 ++++
>> >> >   1 files changed, 4 insertions(+), 0 deletions(-)
>> >> >
>> >> > diff --git a/mm/vmscan.c b/mm/vmscan.c
>> >> > index af24d1e..4d24828 100644
>> >> > --- a/mm/vmscan.c
>> >> > +++ b/mm/vmscan.c
>> >> > @@ -2251,6 +2251,10 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
>> >> >     unsigned long balanced = 0;
>> >> >     bool all_zones_ok = true;
>> >> >
>> >> > +   /* If kswapd has been running too long, just sleep */
>> >> > +   if (need_resched())
>> >> > +           return false;
>> >> > +
>> >>
>> >> Hmm... I don't like this patch so much. because this code does
>> >>
>> >> - don't sleep if kswapd got context switch at shrink_inactive_list
>> >
>> > This isn't entirely true:  need_resched() will be false, so we'll follow
>> > the normal path for determining whether to sleep or not, in effect
>> > leaving the current behaviour unchanged.
>> >
>> >> - sleep if kswapd didn't
>> >
>> > This also isn't entirely true: whether need_resched() is true at this
>> > point depends on a whole lot more that whether we did a context switch
>> > in shrink_inactive. It mostly depends on how long we've been running
>> > without giving up the CPU.  Generally that will mean we've been round
>> > the shrinker loop hundreds to thousands of times without sleeping.
>> >
>> >> It seems to be semi random behavior.
>> >
>> > Well, we have to do something.  Chris Mason first suspected the hang was
>> > a kswapd rescheduling problem a while ago.  We tried putting
>> > cond_rescheds() in several places in the vmscan code, but to no avail.
>>
>> Is it a result of  test with patch of Hannes(ie, !pgdat_balanced)?
>>
>> If it isn't, it would be nop regardless of putting cond_reshed at vmscan.c.
>> Because, although we complete zone balancing, kswapd doesn't sleep as
>> pgdat_balance returns wrong result. And at last VM calls
>> balance_pgdat. In this case, balance_pgdat returns without any work as
>> kswap couldn't find zones which have not enough free pages and goto
>> out. kswapd could repeat this work infinitely. So you don't have a
>> chance to call cond_resched.
>>
>> But if your test was with Hanne's patch, I am very curious how come
>> kswapd consumes CPU a lot.
>>
>> > The need_resched() in sleeping_prematurely() seems to be about the best
>> > option.  The other option might be just to put a cond_resched() in
>> > kswapd_try_to_sleep(), but that will really have about the same effect.
>>
>> I don't oppose it but before that, I think we have to know why kswapd
>> consumes CPU a lot although we applied Hannes' patch.
>>
>
> Because it's still possible for processes to allocate pages at the same
> rate kswapd is freeing them leading to a situation where kswapd does not
> consider the zone balanced for prolonged periods of time.

We have cond_resched() in shrink_page_list, shrink_slab and balance_pgdat.
So I think kswapd can be scheduled out, although it will be scheduled back
in after a short time because the tasks that preempted it also need page
reclaim. If every task in the system needs reclaim, kswapd consuming 99%
of a CPU is a natural result, I think.
Am I missing something?

>
> --
> Mel Gorman
> SUSE Labs
>



-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH 4/4] mm: vmscan: If kswapd has been running too long, allow it to sleep
  2011-05-16  8:58             ` Minchan Kim
@ 2011-05-16 10:27               ` Mel Gorman
  -1 siblings, 0 replies; 119+ messages in thread
From: Mel Gorman @ 2011-05-16 10:27 UTC (permalink / raw)
  To: Minchan Kim
  Cc: James Bottomley, KOSAKI Motohiro, akpm, colin.king,
	raghu.prabhu13, jack, chris.mason, cl, penberg, riel, hannes,
	linux-fsdevel, linux-mm, linux-kernel, linux-ext4

On Mon, May 16, 2011 at 05:58:59PM +0900, Minchan Kim wrote:
> On Mon, May 16, 2011 at 5:45 PM, Mel Gorman <mgorman@suse.de> wrote:
> > On Mon, May 16, 2011 at 02:04:00PM +0900, Minchan Kim wrote:
> >> On Mon, May 16, 2011 at 1:21 PM, James Bottomley
> >> <James.Bottomley@hansenpartnership.com> wrote:
> >> > On Sun, 2011-05-15 at 19:27 +0900, KOSAKI Motohiro wrote:
> >> >> (2011/05/13 23:03), Mel Gorman wrote:
> >> >> > Under constant allocation pressure, kswapd can be in the situation where
> >> >> > sleeping_prematurely() will always return true even if kswapd has been
> >> >> > running a long time. Check if kswapd needs to be scheduled.
> >> >> >
> >> >> > Signed-off-by: Mel Gorman<mgorman@suse.de>
> >> >> > ---
> >> >> >   mm/vmscan.c |    4 ++++
> >> >> >   1 files changed, 4 insertions(+), 0 deletions(-)
> >> >> >
> >> >> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> >> >> > index af24d1e..4d24828 100644
> >> >> > --- a/mm/vmscan.c
> >> >> > +++ b/mm/vmscan.c
> >> >> > @@ -2251,6 +2251,10 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
> >> >> >     unsigned long balanced = 0;
> >> >> >     bool all_zones_ok = true;
> >> >> >
> >> >> > +   /* If kswapd has been running too long, just sleep */
> >> >> > +   if (need_resched())
> >> >> > +           return false;
> >> >> > +
> >> >>
> >> >> Hmm... I don't like this patch so much. because this code does
> >> >>
> >> >> - don't sleep if kswapd got context switch at shrink_inactive_list
> >> >
> >> > This isn't entirely true:  need_resched() will be false, so we'll follow
> >> > the normal path for determining whether to sleep or not, in effect
> >> > leaving the current behaviour unchanged.
> >> >
> >> >> - sleep if kswapd didn't
> >> >
> >> > This also isn't entirely true: whether need_resched() is true at this
> >> > point depends on a whole lot more that whether we did a context switch
> >> > in shrink_inactive. It mostly depends on how long we've been running
> >> > without giving up the CPU.  Generally that will mean we've been round
> >> > the shrinker loop hundreds to thousands of times without sleeping.
> >> >
> >> >> It seems to be semi random behavior.
> >> >
> >> > Well, we have to do something.  Chris Mason first suspected the hang was
> >> > a kswapd rescheduling problem a while ago.  We tried putting
> >> > cond_rescheds() in several places in the vmscan code, but to no avail.
> >>
> >> Is it a result of  test with patch of Hannes(ie, !pgdat_balanced)?
> >>
> >> If it isn't, it would be nop regardless of putting cond_reshed at vmscan.c.
> >> Because, although we complete zone balancing, kswapd doesn't sleep as
> >> pgdat_balance returns wrong result. And at last VM calls
> >> balance_pgdat. In this case, balance_pgdat returns without any work as
> >> kswap couldn't find zones which have not enough free pages and goto
> >> out. kswapd could repeat this work infinitely. So you don't have a
> >> chance to call cond_resched.
> >>
> >> But if your test was with Hanne's patch, I am very curious how come
> >> kswapd consumes CPU a lot.
> >>
> >> > The need_resched() in sleeping_prematurely() seems to be about the best
> >> > option.  The other option might be just to put a cond_resched() in
> >> > kswapd_try_to_sleep(), but that will really have about the same effect.
> >>
> >> I don't oppose it but before that, I think we have to know why kswapd
> >> consumes CPU a lot although we applied Hannes' patch.
> >>
> >
> > Because it's still possible for processes to allocate pages at the same
> > rate kswapd is freeing them leading to a situation where kswapd does not
> > consider the zone balanced for prolonged periods of time.
> 
> We have cond_resched in shrink_page_list, shrink_slab and balance_pgdat.
> So I think kswapd can be scheduled out although it's scheduled in
> after a short time as task scheduled also need page reclaim. Although
> all task in system need reclaim, kswapd cpu 99% consumption is a
> natural result, I think.
> Do I miss something?
> 

Let's see:

shrink_page_list() only applies if inactive pages were isolated,
	which in turn may not happen if all_unreclaimable is set in
	shrink_zones(). If, for whatever reason, all_unreclaimable is
	set on all zones, we can miss calling cond_resched().

shrink_slab() only applies if we are reclaiming slab pages. If the first
	shrinker returns -1, we do not call cond_resched(). If that
	first shrinker is dcache and __GFP_FS is not set, direct
	reclaimers will not shrink at all. However, if there are
	enough of them running, or if one of the other shrinkers
	runs for a very long time, kswapd could be starved waiting
	to acquire shrinker_rwsem and never reach the
	cond_resched().

balance_pgdat() only calls cond_resched() if the zones are not
	balanced. For a high-order allocation that is balanced, it
	checks order-0 again. During that window, order-0 might have
	become unbalanced, so it loops again for order-0 and returns
	to kswapd() that it was reclaiming for order-0. kswapd() can
	then find that a caller has rewoken it for a high-order
	allocation and re-enter balance_pgdat() without ever having
	called cond_resched().

While it appears unlikely, there are bad conditions which can result
in cond_resched() being avoided.
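
To make those paths concrete, here is a toy user-space model of the loop
structure described above (this is not the real mm/vmscan.c code; the
stub functions and flags are invented purely for illustration):

#include <stdbool.h>
#include <stdio.h>

/* Invented flags standing in for the three conditions described above */
static bool all_unreclaimable = true;		/* nothing was isolated */
static bool first_shrinker_failed = true;	/* first shrinker returned -1 */
static bool woken_for_high_order = true;	/* kswapd rewoken immediately */

static void cond_resched(void) { puts("cond_resched()"); }

static void model_shrink_page_list(void)
{
	if (!all_unreclaimable)
		cond_resched();		/* skipped: no pages were isolated */
}

static void model_shrink_slab(void)
{
	if (!first_shrinker_failed)
		cond_resched();		/* skipped: first shrinker bailed */
}

static void model_balance_pgdat(void)
{
	model_shrink_page_list();
	model_shrink_slab();
	/* the cond_resched() for unbalanced zones is also skipped on the
	 * high-order -> order-0 recheck path, so none is ever reached */
}

int main(void)
{
	int passes = 0;

	do {
		model_balance_pgdat();
		passes++;
	} while (woken_for_high_order && passes < 5);

	printf("%d passes through balance_pgdat(), cond_resched() never ran\n",
	       passes);
	return 0;
}

With all three conditions holding at once the loop spins without ever
yielding, which is the corner case the need_resched() check in
sleeping_prematurely() is meant to catch.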

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH 0/4] Reduce impact to overall system of SLUB using high-order allocations V2
  2011-05-16  8:37     ` Mel Gorman
@ 2011-05-16 11:24       ` Colin Ian King
  -1 siblings, 0 replies; 119+ messages in thread
From: Colin Ian King @ 2011-05-16 11:24 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, James Bottomley, Raghavendra D Prabhu, Jan Kara,
	Chris Mason, Christoph Lameter, Pekka Enberg, Rik van Riel,
	Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel,
	linux-ext4

On Mon, 2011-05-16 at 09:37 +0100, Mel Gorman wrote:
> On Sat, May 14, 2011 at 10:34:33AM +0200, Colin Ian King wrote:
> > On Fri, 2011-05-13 at 15:03 +0100, Mel Gorman wrote:
> > > Changelog since V1
> > >   o kswapd should sleep if need_resched
> > >   o Remove __GFP_REPEAT from GFP flags when speculatively using high
> > >     orders so direct/compaction exits earlier
> > >   o Remove __GFP_NORETRY for correctness
> > >   o Correct logic in sleeping_prematurely
> > >   o Leave SLUB using the default slub_max_order
> > > 
> > > There are a few reports of people experiencing hangs when copying
> > > large amounts of data with kswapd using a large amount of CPU which
> > > appear to be due to recent reclaim changes.
> > > 
> > > SLUB using high orders is the trigger but not the root cause as SLUB
> > > has been using high orders for a while. The following four patches
> > > aim to fix the problems in reclaim while reducing the cost for SLUB
> > > using those high orders.
> > > 
> > > Patch 1 corrects logic introduced by commit [1741c877: mm:
> > > 	kswapd: keep kswapd awake for high-order allocations until
> > > 	a percentage of the node is balanced] to allow kswapd to
> > > 	go to sleep when balanced for high orders.
> > > 
> > > Patch 2 prevents kswapd waking up in response to SLUBs speculative
> > > 	use of high orders.
> > > 
> > > Patch 3 further reduces the cost by prevent SLUB entering direct
> > > 	compaction or reclaim paths on the grounds that falling
> > > 	back to order-0 should be cheaper.
> > > 
> > > Patch 4 notes that even when kswapd is failing to keep up with
> > > 	allocation requests, it should still go to sleep when its
> > > 	quota has expired to prevent it spinning.
> > > 
> > > My own data on this is not great. I haven't really been able to
> > > reproduce the same problem locally.
> > > 
> > > The test case is simple. "download tar" wgets a large tar file and
> > > stores it locally. "unpack" is expanding it (15 times physical RAM
> > > in this case) and "delete source dirs" is the tarfile being deleted
> > > again. I also experimented with having the tar copied numerous times
> > > and into deeper directories to increase the size but the results were
> > > not particularly interesting so I left it as one tar.
> > > 
> > > In the background, applications are being launched to time to vaguely
> > > simulate activity on the desktop and to measure how long it takes
> > > applications to start.
> > > 
> > > Test server, 4 CPU threads, x86_64, 2G of RAM, no PREEMPT, no COMPACTION, X running
> > > LARGE COPY AND UNTAR
> > >                       vanilla       fixprematurely  kswapd-nowwake slub-noexstep  kswapdsleep
> > > download tar           95 ( 0.00%)   94 ( 1.06%)   94 ( 1.06%)   94 ( 1.06%)   94 ( 1.06%)
> > > unpack tar            654 ( 0.00%)  649 ( 0.77%)  655 (-0.15%)  589 (11.04%)  598 ( 9.36%)
> > > copy source files       0 ( 0.00%)    0 ( 0.00%)    0 ( 0.00%)    0 ( 0.00%)    0 ( 0.00%)
> > > delete source dirs    327 ( 0.00%)  334 (-2.10%)  318 ( 2.83%)  325 ( 0.62%)  320 ( 2.19%)
> > > MMTests Statistics: duration
> > > User/Sys Time Running Test (seconds)        1139.7   1142.55   1149.78   1109.32   1113.26
> > > Total Elapsed Time (seconds)               1341.59   1342.45   1324.90   1271.02   1247.35
> > > 
> > > MMTests Statistics: application launch
> > > evolution-wait30     mean     34.92   34.96   34.92   34.92   35.08
> > > gnome-terminal-find  mean      7.96    7.96    8.76    7.80    7.96
> > > iceweasel-table      mean      7.93    7.81    7.73    7.65    7.88
> > > 
> > > evolution-wait30     stddev    0.96    1.22    1.27    1.20    1.15
> > > gnome-terminal-find  stddev    3.02    3.09    3.51    2.99    3.02
> > > iceweasel-table      stddev    1.05    0.90    1.09    1.11    1.11
> > > 
> > > Having SLUB avoid expensive steps in reclaim improves performance
> > > by quite a bit with the overall test completing 1.5 minutes
> > > faster. Application launch times were not really affected but it's
> > > not something my test machine was suffering from in the first place
> > > so it's not really conclusive. The kswapd patches also did not appear
> > > to help but again, the test machine wasn't suffering that problem.
> > > 
> > > These patches are against 2.6.39-rc7. Again, testing would be
> > > appreciated.
> > 
> > These patches solve the problem for me.  I've been soak testing the file
> > copy test
> > for 3.5 hours with nearly 400 test cycles and observed no lockups at all
> > - rock solid. From my observations from the output from vmstat the
> > system is behaving sanely.
> > Thanks for finding a solution - much appreciated!
> > 
> 
> Can you tell me if just patches 1 and 4 fix the problem please? It'd be good
> to know if this was only a reclaim-related problem. Thanks.

Hi Mel,

Soak tested just patches 1 + 4 and it works fine.  I did 250 cycles over ~2
hours with no lockups, and the output from vmstat looked sane.

Colin

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH 1/4] mm: vmscan: Correct use of pgdat_balanced in sleeping_prematurely
  2011-05-13 14:03   ` Mel Gorman
@ 2011-05-16 14:30     ` Rik van Riel
  -1 siblings, 0 replies; 119+ messages in thread
From: Rik van Riel @ 2011-05-16 14:30 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, James Bottomley, Colin King, Raghavendra D Prabhu,
	Jan Kara, Chris Mason, Christoph Lameter, Pekka Enberg,
	Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel,
	linux-ext4

On 05/13/2011 10:03 AM, Mel Gorman wrote:
> Johannes Weiner poined out that the logic in commit [1741c877: mm:
> kswapd: keep kswapd awake for high-order allocations until a percentage
> of the node is balanced] is backwards. Instead of allowing kswapd to go
> to sleep when balancing for high order allocations, it keeps it kswapd
> running uselessly.
>
> From-but-was-not-signed-off-by: Johannes Weiner<hannes@cmpxchg.org>
> Will-sign-off-after-Johannes: Mel Gorman<mgorman@suse.de>

Reviewed-by: Rik van Riel<riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH 4/4] mm: vmscan: If kswapd has been running too long, allow it to sleep
  2011-05-13 14:03   ` Mel Gorman
@ 2011-05-16 14:30     ` Rik van Riel
  -1 siblings, 0 replies; 119+ messages in thread
From: Rik van Riel @ 2011-05-16 14:30 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, James Bottomley, Colin King, Raghavendra D Prabhu,
	Jan Kara, Chris Mason, Christoph Lameter, Pekka Enberg,
	Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel,
	linux-ext4

On 05/13/2011 10:03 AM, Mel Gorman wrote:
> Under constant allocation pressure, kswapd can be in the situation where
> sleeping_prematurely() will always return true even if kswapd has been
> running a long time. Check if kswapd needs to be scheduled.
>
> Signed-off-by: Mel Gorman<mgorman@suse.de>

Acked-by: Rik van Riel<riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH 2/4] mm: slub: Do not wake kswapd for SLUBs speculative high-order allocations
  2011-05-13 14:03   ` Mel Gorman
@ 2011-05-16 21:10     ` David Rientjes
  -1 siblings, 0 replies; 119+ messages in thread
From: David Rientjes @ 2011-05-16 21:10 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, James Bottomley, Colin King, Raghavendra D Prabhu,
	Jan Kara, Chris Mason, Christoph Lameter, Pekka Enberg,
	Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm,
	linux-kernel, linux-ext4

On Fri, 13 May 2011, Mel Gorman wrote:

> To avoid locking and per-cpu overhead, SLUB optimisically uses
> high-order allocations and falls back to lower allocations if they
> fail.  However, by simply trying to allocate, kswapd is woken up to
> start reclaiming at that order. On a desktop system, two users report
> that the system is getting locked up with kswapd using large amounts
> of CPU.  Using SLAB instead of SLUB made this problem go away.
> 
> This patch prevents kswapd being woken up for high-order allocations.
> Testing indicated that with this patch applied, the system was much
> harder to hang and even when it did, it eventually recovered.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Acked-by: David Rientjes <rientjes@google.com>

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH 3/4] mm: slub: Do not take expensive steps for SLUBs speculative high-order allocations
  2011-05-13 14:03   ` Mel Gorman
@ 2011-05-16 21:16     ` David Rientjes
  -1 siblings, 0 replies; 119+ messages in thread
From: David Rientjes @ 2011-05-16 21:16 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, James Bottomley, Colin King, Raghavendra D Prabhu,
	Jan Kara, Chris Mason, Christoph Lameter, Pekka Enberg,
	Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm,
	linux-kernel, linux-ext4

On Fri, 13 May 2011, Mel Gorman wrote:

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 9f8a97b..057f1e2 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1972,6 +1972,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
>  {
>  	int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
>  	const gfp_t wait = gfp_mask & __GFP_WAIT;
> +	const gfp_t can_wake_kswapd = !(gfp_mask & __GFP_NO_KSWAPD);
>  
>  	/* __GFP_HIGH is assumed to be the same as ALLOC_HIGH to save a branch. */
>  	BUILD_BUG_ON(__GFP_HIGH != (__force gfp_t) ALLOC_HIGH);
> @@ -1984,7 +1985,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
>  	 */
>  	alloc_flags |= (__force int) (gfp_mask & __GFP_HIGH);
>  
> -	if (!wait) {
> +	if (!wait && can_wake_kswapd) {
>  		/*
>  		 * Not worth trying to allocate harder for
>  		 * __GFP_NOMEMALLOC even if it can't schedule.
> diff --git a/mm/slub.c b/mm/slub.c
> index 98c358d..c5797ab 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1170,7 +1170,8 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
>  	 * Let the initial higher-order allocation fail under memory pressure
>  	 * so we fall-back to the minimum order allocation.
>  	 */
> -	alloc_gfp = (flags | __GFP_NOWARN | __GFP_NORETRY | __GFP_NO_KSWAPD) & ~__GFP_NOFAIL;
> +	alloc_gfp = (flags | __GFP_NOWARN | __GFP_NO_KSWAPD) &
> +			~(__GFP_NOFAIL | __GFP_WAIT | __GFP_REPEAT);
>  
>  	page = alloc_slab_page(alloc_gfp, node, oo);
>  	if (unlikely(!page)) {

It's unnecessary to clear __GFP_REPEAT; these !__GFP_NOFAIL allocations
will immediately fail.

alloc_gfp would probably benefit from having a comment about why 
__GFP_WAIT should be masked off here: that we don't want to do compaction 
or direct reclaim or retry the allocation more than once (so both 
__GFP_NORETRY and __GFP_REPEAT are no-ops).
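
For illustration, the allocate_slab() hunk might then look something like
this (a sketch of the suggestion only, not a tested patch; the comment
wording is mine):

	/*
	 * Speculative high-order attempt: mask off __GFP_WAIT so we never
	 * enter direct reclaim or compaction and the allocation is not
	 * retried (which also makes __GFP_NORETRY and __GFP_REPEAT
	 * irrelevant here).  On failure we fall back to the minimum order.
	 */
	alloc_gfp = (flags | __GFP_NOWARN | __GFP_NO_KSWAPD) &
			~(__GFP_NOFAIL | __GFP_WAIT);

	page = alloc_slab_page(alloc_gfp, node, oo);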

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH 4/4] mm: vmscan: If kswapd has been running too long, allow it to sleep
  2011-05-16 10:27               ` Mel Gorman
@ 2011-05-16 23:50                 ` Minchan Kim
  -1 siblings, 0 replies; 119+ messages in thread
From: Minchan Kim @ 2011-05-16 23:50 UTC (permalink / raw)
  To: Mel Gorman
  Cc: James Bottomley, KOSAKI Motohiro, akpm, colin.king,
	raghu.prabhu13, jack, chris.mason, cl, penberg, riel, hannes,
	linux-fsdevel, linux-mm, linux-kernel, linux-ext4

On Mon, May 16, 2011 at 7:27 PM, Mel Gorman <mgorman@suse.de> wrote:
> On Mon, May 16, 2011 at 05:58:59PM +0900, Minchan Kim wrote:
>> On Mon, May 16, 2011 at 5:45 PM, Mel Gorman <mgorman@suse.de> wrote:
>> > On Mon, May 16, 2011 at 02:04:00PM +0900, Minchan Kim wrote:
>> >> On Mon, May 16, 2011 at 1:21 PM, James Bottomley
>> >> <James.Bottomley@hansenpartnership.com> wrote:
>> >> > On Sun, 2011-05-15 at 19:27 +0900, KOSAKI Motohiro wrote:
>> >> >> (2011/05/13 23:03), Mel Gorman wrote:
>> >> >> > Under constant allocation pressure, kswapd can be in the situation where
>> >> >> > sleeping_prematurely() will always return true even if kswapd has been
>> >> >> > running a long time. Check if kswapd needs to be scheduled.
>> >> >> >
>> >> >> > Signed-off-by: Mel Gorman<mgorman@suse.de>
>> >> >> > ---
>> >> >> >   mm/vmscan.c |    4 ++++
>> >> >> >   1 files changed, 4 insertions(+), 0 deletions(-)
>> >> >> >
>> >> >> > diff --git a/mm/vmscan.c b/mm/vmscan.c
>> >> >> > index af24d1e..4d24828 100644
>> >> >> > --- a/mm/vmscan.c
>> >> >> > +++ b/mm/vmscan.c
>> >> >> > @@ -2251,6 +2251,10 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
>> >> >> >     unsigned long balanced = 0;
>> >> >> >     bool all_zones_ok = true;
>> >> >> >
>> >> >> > +   /* If kswapd has been running too long, just sleep */
>> >> >> > +   if (need_resched())
>> >> >> > +           return false;
>> >> >> > +
>> >> >>
>> >> >> Hmm... I don't like this patch so much. because this code does
>> >> >>
>> >> >> - don't sleep if kswapd got context switch at shrink_inactive_list
>> >> >
>> >> > This isn't entirely true:  need_resched() will be false, so we'll follow
>> >> > the normal path for determining whether to sleep or not, in effect
>> >> > leaving the current behaviour unchanged.
>> >> >
>> >> >> - sleep if kswapd didn't
>> >> >
>> >> > This also isn't entirely true: whether need_resched() is true at this
>> >> > point depends on a whole lot more that whether we did a context switch
>> >> > in shrink_inactive. It mostly depends on how long we've been running
>> >> > without giving up the CPU.  Generally that will mean we've been round
>> >> > the shrinker loop hundreds to thousands of times without sleeping.
>> >> >
>> >> >> It seems to be semi random behavior.
>> >> >
>> >> > Well, we have to do something.  Chris Mason first suspected the hang was
>> >> > a kswapd rescheduling problem a while ago.  We tried putting
>> >> > cond_rescheds() in several places in the vmscan code, but to no avail.
>> >>
>> >> Is it a result of  test with patch of Hannes(ie, !pgdat_balanced)?
>> >>
>> >> If it isn't, it would be nop regardless of putting cond_reshed at vmscan.c.
>> >> Because, although we complete zone balancing, kswapd doesn't sleep as
>> >> pgdat_balance returns wrong result. And at last VM calls
>> >> balance_pgdat. In this case, balance_pgdat returns without any work as
>> >> kswap couldn't find zones which have not enough free pages and goto
>> >> out. kswapd could repeat this work infinitely. So you don't have a
>> >> chance to call cond_resched.
>> >>
>> >> But if your test was with Hanne's patch, I am very curious how come
>> >> kswapd consumes CPU a lot.
>> >>
>> >> > The need_resched() in sleeping_prematurely() seems to be about the best
>> >> > option.  The other option might be just to put a cond_resched() in
>> >> > kswapd_try_to_sleep(), but that will really have about the same effect.
>> >>
>> >> I don't oppose it but before that, I think we have to know why kswapd
>> >> consumes CPU a lot although we applied Hannes' patch.
>> >>
>> >
>> > Because it's still possible for processes to allocate pages at the same
>> > rate kswapd is freeing them leading to a situation where kswapd does not
>> > consider the zone balanced for prolonged periods of time.
>>
>> We have cond_resched in shrink_page_list, shrink_slab and balance_pgdat.
>> So I think kswapd can be scheduled out although it's scheduled in
>> after a short time as task scheduled also need page reclaim. Although
>> all task in system need reclaim, kswapd cpu 99% consumption is a
>> natural result, I think.
>> Do I miss something?
>>
>
> Lets see;
>
> shrink_page_list() only applies if inactive pages were isolated
>        which in turn may not happen if all_unreclaimable is set in
>        shrink_zones(). If for whatver reason, all_unreclaimable is
>        set on all zones, we can miss calling cond_resched().
>
> shrink_slab only applies if we are reclaiming slab pages. If the first
>        shrinker returns -1, we do not call cond_resched(). If that
>        first shrinker is dcache and __GFP_FS is not set, direct
>        reclaimers will not shrink at all. However, if there are
>        enough of them running or if one of the other shrinkers
>        is running for a very long time, kswapd could be starved
>        acquiring the shrinker_rwsem and never reaching the
>        cond_resched().

Don't we have to move cond_resched?

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 292582c..633e761 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -231,8 +231,10 @@ unsigned long shrink_slab(struct shrink_control *shrink,
        if (scanned == 0)
                scanned = SWAP_CLUSTER_MAX;

-       if (!down_read_trylock(&shrinker_rwsem))
-               return 1;       /* Assume we'll be able to shrink next time */
+       if (!down_read_trylock(&shrinker_rwsem)) {
+               ret = 1;
+               goto out; /* Assume we'll be able to shrink next time */
+       }

        list_for_each_entry(shrinker, &shrinker_list, list) {
                unsigned long long delta;
@@ -280,12 +282,14 @@ unsigned long shrink_slab(struct shrink_control *shrink,
                        count_vm_events(SLABS_SCANNED, this_scan);
                        total_scan -= this_scan;

-                       cond_resched();
                }

                shrinker->nr += total_scan;
+               cond_resched();
        }
        up_read(&shrinker_rwsem);
+out:
+       cond_resched();
        return ret;
 }


>
> balance_pgdat() only calls cond_resched if the zones are not
>        balanced. For a high-order allocation that is balanced, it
>        checks order-0 again. During that window, order-0 might have
>        become unbalanced so it loops again for order-0 and returns
>        that was reclaiming for order-0 to kswapd(). It can then find
>        that a caller has rewoken kswapd for a high-order and re-enters
>        balance_pgdat() without ever have called cond_resched().

If kswapd reclaims order-o followed by high order, it would have a
chance to call cond_resched in shrink_page_list. But if all zones are
all_unreclaimable is set, balance_pgdat could return any work. Okay.
It does make sense.
By your scenario, someone wakes up kswapd with higher order, again.
So re-enters balance_pgdat without ever have called cond_resched.
But if someone wakes up higher order again, we can't have a chance to
call kswapd_try_to_sleep. So your patch effect would be nop, too.

It would be better to put cond_resched after balance_pgdat?

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 292582c..61c45d0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2753,6 +2753,7 @@ static int kswapd(void *p)
                if (!ret) {
                        trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
                        order = balance_pgdat(pgdat, order, &classzone_idx);
+                       cond_resched();
                }
        }
        return 0;

>
> While it appears unlikely, there are bad conditions which can result
> in cond_resched() being avoided.

>
> --
> Mel Gorman
> SUSE Labs
>



-- 
Kind regards,
Minchan Kim

^ permalink raw reply related	[flat|nested] 119+ messages in thread

* Re: [PATCH 4/4] mm: vmscan: If kswapd has been running too long, allow it to sleep
  2011-05-16 23:50                 ` Minchan Kim
  (?)
@ 2011-05-17  0:48                   ` Minchan Kim
  -1 siblings, 0 replies; 119+ messages in thread
From: Minchan Kim @ 2011-05-17  0:48 UTC (permalink / raw)
  To: Mel Gorman
  Cc: James Bottomley, KOSAKI Motohiro, akpm, colin.king,
	raghu.prabhu13, jack, chris.mason, cl, penberg, riel, hannes,
	linux-fsdevel, linux-mm, linux-kernel, linux-ext4

On Tue, May 17, 2011 at 8:50 AM, Minchan Kim <minchan.kim@gmail.com> wrote:
> On Mon, May 16, 2011 at 7:27 PM, Mel Gorman <mgorman@suse.de> wrote:
>> On Mon, May 16, 2011 at 05:58:59PM +0900, Minchan Kim wrote:
>>> On Mon, May 16, 2011 at 5:45 PM, Mel Gorman <mgorman@suse.de> wrote:
>>> > On Mon, May 16, 2011 at 02:04:00PM +0900, Minchan Kim wrote:
>>> >> On Mon, May 16, 2011 at 1:21 PM, James Bottomley
>>> >> <James.Bottomley@hansenpartnership.com> wrote:
>>> >> > On Sun, 2011-05-15 at 19:27 +0900, KOSAKI Motohiro wrote:
>>> >> >> (2011/05/13 23:03), Mel Gorman wrote:
>>> >> >> > Under constant allocation pressure, kswapd can be in the situation where
>>> >> >> > sleeping_prematurely() will always return true even if kswapd has been
>>> >> >> > running a long time. Check if kswapd needs to be scheduled.
>>> >> >> >
>>> >> >> > Signed-off-by: Mel Gorman<mgorman@suse.de>
>>> >> >> > ---
>>> >> >> >   mm/vmscan.c |    4 ++++
>>> >> >> >   1 files changed, 4 insertions(+), 0 deletions(-)
>>> >> >> >
>>> >> >> > diff --git a/mm/vmscan.c b/mm/vmscan.c
>>> >> >> > index af24d1e..4d24828 100644
>>> >> >> > --- a/mm/vmscan.c
>>> >> >> > +++ b/mm/vmscan.c
>>> >> >> > @@ -2251,6 +2251,10 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
>>> >> >> >     unsigned long balanced = 0;
>>> >> >> >     bool all_zones_ok = true;
>>> >> >> >
>>> >> >> > +   /* If kswapd has been running too long, just sleep */
>>> >> >> > +   if (need_resched())
>>> >> >> > +           return false;
>>> >> >> > +
>>> >> >>
>>> >> >> Hmm... I don't like this patch so much. because this code does
>>> >> >>
>>> >> >> - don't sleep if kswapd got context switch at shrink_inactive_list
>>> >> >
>>> >> > This isn't entirely true:  need_resched() will be false, so we'll follow
>>> >> > the normal path for determining whether to sleep or not, in effect
>>> >> > leaving the current behaviour unchanged.
>>> >> >
>>> >> >> - sleep if kswapd didn't
>>> >> >
>>> >> > This also isn't entirely true: whether need_resched() is true at this
>>> >> > point depends on a whole lot more that whether we did a context switch
>>> >> > in shrink_inactive. It mostly depends on how long we've been running
>>> >> > without giving up the CPU.  Generally that will mean we've been round
>>> >> > the shrinker loop hundreds to thousands of times without sleeping.
>>> >> >
>>> >> >> It seems to be semi random behavior.
>>> >> >
>>> >> > Well, we have to do something.  Chris Mason first suspected the hang was
>>> >> > a kswapd rescheduling problem a while ago.  We tried putting
>>> >> > cond_rescheds() in several places in the vmscan code, but to no avail.
>>> >>
>>> >> Is it a result of  test with patch of Hannes(ie, !pgdat_balanced)?
>>> >>
>>> >> If it isn't, it would be nop regardless of putting cond_reshed at vmscan.c.
>>> >> Because, although we complete zone balancing, kswapd doesn't sleep as
>>> >> pgdat_balance returns wrong result. And at last VM calls
>>> >> balance_pgdat. In this case, balance_pgdat returns without any work as
>>> >> kswap couldn't find zones which have not enough free pages and goto
>>> >> out. kswapd could repeat this work infinitely. So you don't have a
>>> >> chance to call cond_resched.
>>> >>
>>> >> But if your test was with Hanne's patch, I am very curious how come
>>> >> kswapd consumes CPU a lot.
>>> >>
>>> >> > The need_resched() in sleeping_prematurely() seems to be about the best
>>> >> > option.  The other option might be just to put a cond_resched() in
>>> >> > kswapd_try_to_sleep(), but that will really have about the same effect.
>>> >>
>>> >> I don't oppose it but before that, I think we have to know why kswapd
>>> >> consumes CPU a lot although we applied Hannes' patch.
>>> >>
>>> >
>>> > Because it's still possible for processes to allocate pages at the same
>>> > rate kswapd is freeing them leading to a situation where kswapd does not
>>> > consider the zone balanced for prolonged periods of time.
>>>
>>> We have cond_resched in shrink_page_list, shrink_slab and balance_pgdat.
>>> So I think kswapd can be scheduled out although it's scheduled in
>>> after a short time as task scheduled also need page reclaim. Although
>>> all task in system need reclaim, kswapd cpu 99% consumption is a
>>> natural result, I think.
>>> Do I miss something?
>>>
>>
>> Lets see;
>>
>> shrink_page_list() only applies if inactive pages were isolated
>>        which in turn may not happen if all_unreclaimable is set in
>>        shrink_zones(). If for whatver reason, all_unreclaimable is
>>        set on all zones, we can miss calling cond_resched().
>>
>> shrink_slab only applies if we are reclaiming slab pages. If the first
>>        shrinker returns -1, we do not call cond_resched(). If that
>>        first shrinker is dcache and __GFP_FS is not set, direct
>>        reclaimers will not shrink at all. However, if there are
>>        enough of them running or if one of the other shrinkers
>>        is running for a very long time, kswapd could be starved
>>        acquiring the shrinker_rwsem and never reaching the
>>        cond_resched().
>
> Don't we have to move cond_resched?
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 292582c..633e761 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -231,8 +231,10 @@ unsigned long shrink_slab(struct shrink_control *shrink,
>        if (scanned == 0)
>                scanned = SWAP_CLUSTER_MAX;
>
> -       if (!down_read_trylock(&shrinker_rwsem))
> -               return 1;       /* Assume we'll be able to shrink next time */
> +       if (!down_read_trylock(&shrinker_rwsem)) {
> +               ret = 1;
> +               goto out; /* Assume we'll be able to shrink next time */
> +       }
>
>        list_for_each_entry(shrinker, &shrinker_list, list) {
>                unsigned long long delta;
> @@ -280,12 +282,14 @@ unsigned long shrink_slab(struct shrink_control *shrink,
>                        count_vm_events(SLABS_SCANNED, this_scan);
>                        total_scan -= this_scan;
>
> -                       cond_resched();
>                }
>
>                shrinker->nr += total_scan;
> +               cond_resched();
>        }
>        up_read(&shrinker_rwsem);
> +out:
> +       cond_resched();
>        return ret;
>  }
>
>
>>
>> balance_pgdat() only calls cond_resched if the zones are not
>>        balanced. For a high-order allocation that is balanced, it
>>        checks order-0 again. During that window, order-0 might have
>>        become unbalanced so it loops again for order-0 and returns
>>        that was reclaiming for order-0 to kswapd(). It can then find
>>        that a caller has rewoken kswapd for a high-order and re-enters
>>        balance_pgdat() without ever have called cond_resched().
>
> If kswapd reclaims order-o followed by high order, it would have a
> chance to call cond_resched in shrink_page_list. But if all zones are
> all_unreclaimable is set, balance_pgdat could return any work.

    Typo : without any work.

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH 3/4] mm: slub: Do not take expensive steps for SLUBs speculative high-order allocations
  2011-05-16 21:16     ` David Rientjes
@ 2011-05-17  8:42       ` Mel Gorman
  -1 siblings, 0 replies; 119+ messages in thread
From: Mel Gorman @ 2011-05-17  8:42 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, James Bottomley, Colin King, Raghavendra D Prabhu,
	Jan Kara, Chris Mason, Christoph Lameter, Pekka Enberg,
	Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm,
	linux-kernel, linux-ext4

On Mon, May 16, 2011 at 02:16:46PM -0700, David Rientjes wrote:
> On Fri, 13 May 2011, Mel Gorman wrote:
> 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 9f8a97b..057f1e2 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1972,6 +1972,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
> >  {
> >  	int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
> >  	const gfp_t wait = gfp_mask & __GFP_WAIT;
> > +	const gfp_t can_wake_kswapd = !(gfp_mask & __GFP_NO_KSWAPD);
> >  
> >  	/* __GFP_HIGH is assumed to be the same as ALLOC_HIGH to save a branch. */
> >  	BUILD_BUG_ON(__GFP_HIGH != (__force gfp_t) ALLOC_HIGH);
> > @@ -1984,7 +1985,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
> >  	 */
> >  	alloc_flags |= (__force int) (gfp_mask & __GFP_HIGH);
> >  
> > -	if (!wait) {
> > +	if (!wait && can_wake_kswapd) {
> >  		/*
> >  		 * Not worth trying to allocate harder for
> >  		 * __GFP_NOMEMALLOC even if it can't schedule.
> > diff --git a/mm/slub.c b/mm/slub.c
> > index 98c358d..c5797ab 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -1170,7 +1170,8 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
> >  	 * Let the initial higher-order allocation fail under memory pressure
> >  	 * so we fall-back to the minimum order allocation.
> >  	 */
> > -	alloc_gfp = (flags | __GFP_NOWARN | __GFP_NORETRY | __GFP_NO_KSWAPD) & ~__GFP_NOFAIL;
> > +	alloc_gfp = (flags | __GFP_NOWARN | __GFP_NO_KSWAPD) &
> > +			~(__GFP_NOFAIL | __GFP_WAIT | __GFP_REPEAT);
> >  
> >  	page = alloc_slab_page(alloc_gfp, node, oo);
> >  	if (unlikely(!page)) {
> 
> It's unnecessary to clear __GFP_REPEAT, these !__GFP_NOFAIL allocations 
> will immediately fail.
> 

We can enter direct compaction or direct reclaim
at least once. If compaction is enabled and we enter
reclaim/compaction, the presence of __GFP_REPEAT makes a difference
in should_continue_reclaim().  With compaction disabled, the presence
of the flag is relevant in should_alloc_retry(), where it can cause us
to loop in the allocator instead of failing the SLUB allocation and
falling back.
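
As a rough illustration of the retry side of this, the allocator's
retry decision at the time looked approximately like the sketch below
(simplified, not an exact quote of mm/page_alloc.c):

static inline int
should_alloc_retry(gfp_t gfp_mask, unsigned int order,
		   unsigned long pages_reclaimed)
{
	/* Callers that set __GFP_NORETRY never loop */
	if (gfp_mask & __GFP_NORETRY)
		return 0;

	/* Low orders are retried as though __GFP_NOFAIL were set */
	if (order <= PAGE_ALLOC_COSTLY_ORDER)
		return 1;

	/*
	 * Costly orders keep retrying only while __GFP_REPEAT is set and
	 * reclaim has not yet freed an order worth of pages.  This is
	 * why leaving __GFP_REPEAT in a speculative SLUB allocation can
	 * keep it looping in the allocator instead of failing quickly
	 * and falling back to the minimum order.
	 */
	if ((gfp_mask & __GFP_REPEAT) && pages_reclaimed < (1 << order))
		return 1;

	return 0;
}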

Maybe you meant !__GFP_WAIT instead of !__GFP_NOFAIL, which makes
more sense. In that case, we clear both flags because
__GFP_REPEAT && !__GFP_WAIT is a senseless combination of flags.
If for whatever reason __GFP_WAIT was re-added, the presence of
__GFP_REPEAT could cause problems in reclaim that would be hard to
spot again.
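
For context, the reason the combination is senseless is that the
slowpath bails out before any of the retry logic when the caller
cannot sleep; roughly (again a sketch, not an exact quote):

	/*
	 * __alloc_pages_slowpath(): atomic allocations cannot balance
	 * anything, so without __GFP_WAIT we never reach direct
	 * reclaim/compaction or the should_alloc_retry() loop where
	 * __GFP_REPEAT would be examined.
	 */
	if (!wait)
		goto nopage;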

> alloc_gfp would probably benefit from having a comment about why 
> __GFP_WAIT should be masked off here: that we don't want to do compaction 
> or direct reclaim or retry the allocation more than once (so both 
> __GFP_NORETRY and __GFP_REPEAT are no-ops).

That would have been helpful all right; I should have caught that
and explained it properly. In the event there is a new version of
the patch, I'll add one. For the moment, I'm dropping this patch
entirely. Christoph wants to maintain SLUB's historic behaviour of
maximising the number of high-order pages it uses, and at the end of
the day, which option performs better depends entirely on the workload
and machine configuration.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH 4/4] mm: vmscan: If kswapd has been running too long, allow it to sleep
  2011-05-16 23:50                 ` Minchan Kim
  (?)
@ 2011-05-17 10:38                   ` Mel Gorman
  -1 siblings, 0 replies; 119+ messages in thread
From: Mel Gorman @ 2011-05-17 10:38 UTC (permalink / raw)
  To: Minchan Kim
  Cc: James Bottomley, KOSAKI Motohiro, akpm, colin.king,
	raghu.prabhu13, jack, chris.mason, cl, penberg, riel, hannes,
	linux-fsdevel, linux-mm, linux-kernel, linux-ext4

On Tue, May 17, 2011 at 08:50:44AM +0900, Minchan Kim wrote:
> On Mon, May 16, 2011 at 7:27 PM, Mel Gorman <mgorman@suse.de> wrote:
> > On Mon, May 16, 2011 at 05:58:59PM +0900, Minchan Kim wrote:
> >> On Mon, May 16, 2011 at 5:45 PM, Mel Gorman <mgorman@suse.de> wrote:
> >> > On Mon, May 16, 2011 at 02:04:00PM +0900, Minchan Kim wrote:
> >> >> On Mon, May 16, 2011 at 1:21 PM, James Bottomley
> >> >> <James.Bottomley@hansenpartnership.com> wrote:
> >> >> > On Sun, 2011-05-15 at 19:27 +0900, KOSAKI Motohiro wrote:
> >> >> >> (2011/05/13 23:03), Mel Gorman wrote:
> >> >> >> > Under constant allocation pressure, kswapd can be in the situation where
> >> >> >> > sleeping_prematurely() will always return true even if kswapd has been
> >> >> >> > running a long time. Check if kswapd needs to be scheduled.
> >> >> >> >
> >> >> >> > Signed-off-by: Mel Gorman<mgorman@suse.de>
> >> >> >> > ---
> >> >> >> >   mm/vmscan.c |    4 ++++
> >> >> >> >   1 files changed, 4 insertions(+), 0 deletions(-)
> >> >> >> >
> >> >> >> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> >> >> >> > index af24d1e..4d24828 100644
> >> >> >> > --- a/mm/vmscan.c
> >> >> >> > +++ b/mm/vmscan.c
> >> >> >> > @@ -2251,6 +2251,10 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
> >> >> >> >     unsigned long balanced = 0;
> >> >> >> >     bool all_zones_ok = true;
> >> >> >> >
> >> >> >> > +   /* If kswapd has been running too long, just sleep */
> >> >> >> > +   if (need_resched())
> >> >> >> > +           return false;
> >> >> >> > +
> >> >> >>
> >> >> >> Hmm... I don't like this patch so much. because this code does
> >> >> >>
> >> >> >> - don't sleep if kswapd got context switch at shrink_inactive_list
> >> >> >
> >> >> > This isn't entirely true:  need_resched() will be false, so we'll follow
> >> >> > the normal path for determining whether to sleep or not, in effect
> >> >> > leaving the current behaviour unchanged.
> >> >> >
> >> >> >> - sleep if kswapd didn't
> >> >> >
> >> >> > This also isn't entirely true: whether need_resched() is true at this
> >> >> > point depends on a whole lot more that whether we did a context switch
> >> >> > in shrink_inactive. It mostly depends on how long we've been running
> >> >> > without giving up the CPU.  Generally that will mean we've been round
> >> >> > the shrinker loop hundreds to thousands of times without sleeping.
> >> >> >
> >> >> >> It seems to be semi random behavior.
> >> >> >
> >> >> > Well, we have to do something.  Chris Mason first suspected the hang was
> >> >> > a kswapd rescheduling problem a while ago.  We tried putting
> >> >> > cond_rescheds() in several places in the vmscan code, but to no avail.
> >> >>
> >> >> Is it a result of  test with patch of Hannes(ie, !pgdat_balanced)?
> >> >>
> >> >> If it isn't, it would be nop regardless of putting cond_reshed at vmscan.c.
> >> >> Because, although we complete zone balancing, kswapd doesn't sleep as
> >> >> pgdat_balance returns wrong result. And at last VM calls
> >> >> balance_pgdat. In this case, balance_pgdat returns without any work as
> >> >> kswap couldn't find zones which have not enough free pages and goto
> >> >> out. kswapd could repeat this work infinitely. So you don't have a
> >> >> chance to call cond_resched.
> >> >>
> >> >> But if your test was with Hanne's patch, I am very curious how come
> >> >> kswapd consumes CPU a lot.
> >> >>
> >> >> > The need_resched() in sleeping_prematurely() seems to be about the best
> >> >> > option.  The other option might be just to put a cond_resched() in
> >> >> > kswapd_try_to_sleep(), but that will really have about the same effect.
> >> >>
> >> >> I don't oppose it but before that, I think we have to know why kswapd
> >> >> consumes CPU a lot although we applied Hannes' patch.
> >> >>
> >> >
> >> > Because it's still possible for processes to allocate pages at the same
> >> > rate kswapd is freeing them leading to a situation where kswapd does not
> >> > consider the zone balanced for prolonged periods of time.
> >>
> >> We have cond_resched in shrink_page_list, shrink_slab and balance_pgdat.
> >> So I think kswapd can be scheduled out although it's scheduled in
> >> after a short time as task scheduled also need page reclaim. Although
> >> all task in system need reclaim, kswapd cpu 99% consumption is a
> >> natural result, I think.
> >> Do I miss something?
> >>
> >
> > Lets see;
> >
> > shrink_page_list() only applies if inactive pages were isolated
> >        which in turn may not happen if all_unreclaimable is set in
> >        shrink_zones(). If for whatver reason, all_unreclaimable is
> >        set on all zones, we can miss calling cond_resched().
> >
> > shrink_slab only applies if we are reclaiming slab pages. If the first
> >        shrinker returns -1, we do not call cond_resched(). If that
> >        first shrinker is dcache and __GFP_FS is not set, direct
> >        reclaimers will not shrink at all. However, if there are
> >        enough of them running or if one of the other shrinkers
> >        is running for a very long time, kswapd could be starved
> >        acquiring the shrinker_rwsem and never reaching the
> >        cond_resched().
> 
> Don't we have to move cond_resched?
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 292582c..633e761 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -231,8 +231,10 @@ unsigned long shrink_slab(struct shrink_control *shrink,
>         if (scanned == 0)
>                 scanned = SWAP_CLUSTER_MAX;
> 
> -       if (!down_read_trylock(&shrinker_rwsem))
> -               return 1;       /* Assume we'll be able to shrink next time */
> +       if (!down_read_trylock(&shrinker_rwsem)) {
> +               ret = 1;
> +               goto out; /* Assume we'll be able to shrink next time */
> +       }
> 
>         list_for_each_entry(shrinker, &shrinker_list, list) {
>                 unsigned long long delta;
> @@ -280,12 +282,14 @@ unsigned long shrink_slab(struct shrink_control *shrink,
>                         count_vm_events(SLABS_SCANNED, this_scan);
>                         total_scan -= this_scan;
> 
> -                       cond_resched();
>                 }
> 
>                 shrinker->nr += total_scan;
> +               cond_resched();
>         }
>         up_read(&shrinker_rwsem);
> +out:
> +       cond_resched();
>         return ret;
>  }
> 

This makes some sense for the exit path, but if one or more of the
shrinkers takes a very long time without sleeping (extremely long
list searches, for example) then kswapd will not call cond_resched()
between shrinkers and will still consume a lot of CPU.

> >
> > balance_pgdat() only calls cond_resched if the zones are not
> >        balanced. For a high-order allocation that is balanced, it
> >        checks order-0 again. During that window, order-0 might have
> >        become unbalanced so it loops again for order-0 and returns
> >        that was reclaiming for order-0 to kswapd(). It can then find
> >        that a caller has rewoken kswapd for a high-order and re-enters
> >        balance_pgdat() without ever have called cond_resched().
> 
> If kswapd reclaims order-o followed by high order, it would have a
> chance to call cond_resched in shrink_page_list. But if all zones are
> all_unreclaimable is set, balance_pgdat could return any work. Okay.
> It does make sense.
> By your scenario, someone wakes up kswapd with higher order, again.
> So re-enters balance_pgdat without ever have called cond_resched.
> But if someone wakes up higher order again, we can't have a chance to
> call kswapd_try_to_sleep. So your patch effect would be nop, too.
> 
> It would be better to put cond_resched after balance_pgdat?
> 

That will leave kswapd runnable instead of going to sleep, but it
guarantees a scheduling point. Let's see if the problem is that
cond_resched() is being missed, although if that were the case then
patch 4 would truly be a no-op, but Colin has already reported that
patch 1 on its own didn't fix his problem. If the problem is
Sandybridge-specific, where kswapd remains runnable and consumes
large amounts of CPU in turbo mode, then we know that there are other
cond_resched() decisions that will need to be revisited.

Colin or James, would you be willing to test with patch 1 from this
series and Minchan's patch below? Thanks.

> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 292582c..61c45d0 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2753,6 +2753,7 @@ static int kswapd(void *p)
>                 if (!ret) {
>                         trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
>                         order = balance_pgdat(pgdat, order, &classzone_idx);
> +                       cond_resched();
>                 }
>         }
>         return 0;
> 
> >
> > While it appears unlikely, there are bad conditions which can result
> > in cond_resched() being avoided.
> 
> >
> > --
> > Mel Gorman
> > SUSE Labs
> >
> 
> 
> 
> -- 
> Kind regards,
> Minchan Kim

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH 4/4] mm: vmscan: If kswapd has been running too long, allow it to sleep
  2011-05-17 10:38                   ` Mel Gorman
@ 2011-05-17 13:50                     ` Colin Ian King
  -1 siblings, 0 replies; 119+ messages in thread
From: Colin Ian King @ 2011-05-17 13:50 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Minchan Kim, James Bottomley, KOSAKI Motohiro, akpm,
	raghu.prabhu13, jack, chris.mason, cl, penberg, riel, hannes,
	linux-fsdevel, linux-mm, linux-kernel, linux-ext4

On Tue, 2011-05-17 at 11:38 +0100, Mel Gorman wrote:
> On Tue, May 17, 2011 at 08:50:44AM +0900, Minchan Kim wrote:
> > On Mon, May 16, 2011 at 7:27 PM, Mel Gorman <mgorman@suse.de> wrote:
> > > On Mon, May 16, 2011 at 05:58:59PM +0900, Minchan Kim wrote:
> > >> On Mon, May 16, 2011 at 5:45 PM, Mel Gorman <mgorman@suse.de> wrote:
> > >> > On Mon, May 16, 2011 at 02:04:00PM +0900, Minchan Kim wrote:
> > >> >> On Mon, May 16, 2011 at 1:21 PM, James Bottomley
> > >> >> <James.Bottomley@hansenpartnership.com> wrote:
> > >> >> > On Sun, 2011-05-15 at 19:27 +0900, KOSAKI Motohiro wrote:
> > >> >> >> (2011/05/13 23:03), Mel Gorman wrote:
> > >> >> >> > Under constant allocation pressure, kswapd can be in the situation where
> > >> >> >> > sleeping_prematurely() will always return true even if kswapd has been
> > >> >> >> > running a long time. Check if kswapd needs to be scheduled.
> > >> >> >> >
> > >> >> >> > Signed-off-by: Mel Gorman<mgorman@suse.de>
> > >> >> >> > ---
> > >> >> >> >   mm/vmscan.c |    4 ++++
> > >> >> >> >   1 files changed, 4 insertions(+), 0 deletions(-)
> > >> >> >> >
> > >> >> >> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > >> >> >> > index af24d1e..4d24828 100644
> > >> >> >> > --- a/mm/vmscan.c
> > >> >> >> > +++ b/mm/vmscan.c
> > >> >> >> > @@ -2251,6 +2251,10 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
> > >> >> >> >     unsigned long balanced = 0;
> > >> >> >> >     bool all_zones_ok = true;
> > >> >> >> >
> > >> >> >> > +   /* If kswapd has been running too long, just sleep */
> > >> >> >> > +   if (need_resched())
> > >> >> >> > +           return false;
> > >> >> >> > +
> > >> >> >>
> > >> >> >> Hmm... I don't like this patch so much. because this code does
> > >> >> >>
> > >> >> >> - don't sleep if kswapd got context switch at shrink_inactive_list
> > >> >> >
> > >> >> > This isn't entirely true:  need_resched() will be false, so we'll follow
> > >> >> > the normal path for determining whether to sleep or not, in effect
> > >> >> > leaving the current behaviour unchanged.
> > >> >> >
> > >> >> >> - sleep if kswapd didn't
> > >> >> >
> > >> >> > This also isn't entirely true: whether need_resched() is true at this
> > >> >> > point depends on a whole lot more that whether we did a context switch
> > >> >> > in shrink_inactive. It mostly depends on how long we've been running
> > >> >> > without giving up the CPU.  Generally that will mean we've been round
> > >> >> > the shrinker loop hundreds to thousands of times without sleeping.
> > >> >> >
> > >> >> >> It seems to be semi random behavior.
> > >> >> >
> > >> >> > Well, we have to do something.  Chris Mason first suspected the hang was
> > >> >> > a kswapd rescheduling problem a while ago.  We tried putting
> > >> >> > cond_rescheds() in several places in the vmscan code, but to no avail.
> > >> >>
> > >> >> Is it a result of  test with patch of Hannes(ie, !pgdat_balanced)?
> > >> >>
> > >> >> If it isn't, it would be nop regardless of putting cond_reshed at vmscan.c.
> > >> >> Because, although we complete zone balancing, kswapd doesn't sleep as
> > >> >> pgdat_balance returns wrong result. And at last VM calls
> > >> >> balance_pgdat. In this case, balance_pgdat returns without any work as
> > >> >> kswap couldn't find zones which have not enough free pages and goto
> > >> >> out. kswapd could repeat this work infinitely. So you don't have a
> > >> >> chance to call cond_resched.
> > >> >>
> > >> >> But if your test was with Hanne's patch, I am very curious how come
> > >> >> kswapd consumes CPU a lot.
> > >> >>
> > >> >> > The need_resched() in sleeping_prematurely() seems to be about the best
> > >> >> > option.  The other option might be just to put a cond_resched() in
> > >> >> > kswapd_try_to_sleep(), but that will really have about the same effect.
> > >> >>
> > >> >> I don't oppose it but before that, I think we have to know why kswapd
> > >> >> consumes CPU a lot although we applied Hannes' patch.
> > >> >>
> > >> >
> > >> > Because it's still possible for processes to allocate pages at the same
> > >> > rate kswapd is freeing them leading to a situation where kswapd does not
> > >> > consider the zone balanced for prolonged periods of time.
> > >>
> > >> We have cond_resched in shrink_page_list, shrink_slab and balance_pgdat.
> > >> So I think kswapd can be scheduled out although it's scheduled in
> > >> after a short time as task scheduled also need page reclaim. Although
> > >> all task in system need reclaim, kswapd cpu 99% consumption is a
> > >> natural result, I think.
> > >> Do I miss something?
> > >>
> > >
> > > Lets see;
> > >
> > > shrink_page_list() only applies if inactive pages were isolated
> > >        which in turn may not happen if all_unreclaimable is set in
> > >        shrink_zones(). If for whatver reason, all_unreclaimable is
> > >        set on all zones, we can miss calling cond_resched().
> > >
> > > shrink_slab only applies if we are reclaiming slab pages. If the first
> > >        shrinker returns -1, we do not call cond_resched(). If that
> > >        first shrinker is dcache and __GFP_FS is not set, direct
> > >        reclaimers will not shrink at all. However, if there are
> > >        enough of them running or if one of the other shrinkers
> > >        is running for a very long time, kswapd could be starved
> > >        acquiring the shrinker_rwsem and never reaching the
> > >        cond_resched().
> > 
> > Don't we have to move cond_resched?
> > 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 292582c..633e761 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -231,8 +231,10 @@ unsigned long shrink_slab(struct shrink_control *shrink,
> >         if (scanned == 0)
> >                 scanned = SWAP_CLUSTER_MAX;
> > 
> > -       if (!down_read_trylock(&shrinker_rwsem))
> > -               return 1;       /* Assume we'll be able to shrink next time */
> > +       if (!down_read_trylock(&shrinker_rwsem)) {
> > +               ret = 1;
> > +               goto out; /* Assume we'll be able to shrink next time */
> > +       }
> > 
> >         list_for_each_entry(shrinker, &shrinker_list, list) {
> >                 unsigned long long delta;
> > @@ -280,12 +282,14 @@ unsigned long shrink_slab(struct shrink_control *shrink,
> >                         count_vm_events(SLABS_SCANNED, this_scan);
> >                         total_scan -= this_scan;
> > 
> > -                       cond_resched();
> >                 }
> > 
> >                 shrinker->nr += total_scan;
> > +               cond_resched();
> >         }
> >         up_read(&shrinker_rwsem);
> > +out:
> > +       cond_resched();
> >         return ret;
> >  }
> > 
> 
> This makes some sense for the exit path but if one or more of the
> shrinkers takes a very long time without sleeping (extremely long
> list searches for example) then kswapd will not call cond_resched()
> between shrinkers and still consume a lot of CPU.
> 
> > >
> > > balance_pgdat() only calls cond_resched if the zones are not
> > >        balanced. For a high-order allocation that is balanced, it
> > >        checks order-0 again. During that window, order-0 might have
> > >        become unbalanced so it loops again for order-0 and returns
> > >        that was reclaiming for order-0 to kswapd(). It can then find
> > >        that a caller has rewoken kswapd for a high-order and re-enters
> > >        balance_pgdat() without ever have called cond_resched().
> > 
> > If kswapd reclaims order-o followed by high order, it would have a
> > chance to call cond_resched in shrink_page_list. But if all zones are
> > all_unreclaimable is set, balance_pgdat could return any work. Okay.
> > It does make sense.
> > By your scenario, someone wakes up kswapd with higher order, again.
> > So re-enters balance_pgdat without ever have called cond_resched.
> > But if someone wakes up higher order again, we can't have a chance to
> > call kswapd_try_to_sleep. So your patch effect would be nop, too.
> > 
> > It would be better to put cond_resched after balance_pgdat?
> > 
> 
> Which will leave kswapd runnable instead of going to sleep but
> guarantees a scheduling point. Lets see if the problem is that
> cond_resched is being missed although if this was the case then patch
> 4 would truly be a no-op but Colin has already reported that patch 1 on
> its own didn't fix his problem. If the problem is sandybridge-specific
> where kswapd remains runnable and consuming large amounts of CPU in
> turbo mode then we know that there are other cond_resched() decisions
> that will need to be revisited.
> 
> Colin or James, would you be willing to test with patch 1 from this
> series and Minchan's patch below? Thanks.

This works fine.  Ran 250 test cycles for about 2 hours.

> 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 292582c..61c45d0 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2753,6 +2753,7 @@ static int kswapd(void *p)
> >                 if (!ret) {
> >                         trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
> >                         order = balance_pgdat(pgdat, order, &classzone_idx);
> > +                       cond_resched();
> >                 }
> >         }
> >         return 0;
> > 
> > >
> > > While it appears unlikely, there are bad conditions which can result
> > > in cond_resched() being avoided.
> > 
> > >
> > > --
> > > Mel Gorman
> > > SUSE Labs
> > >
> > 
> > 
> > 
> > -- 
> > Kind regards,
> > Minchan Kim
> 



^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH 3/4] mm: slub: Do not take expensive steps for SLUBs speculative high-order allocations
  2011-05-17  8:42       ` Mel Gorman
@ 2011-05-17 13:51         ` Christoph Lameter
  -1 siblings, 0 replies; 119+ messages in thread
From: Christoph Lameter @ 2011-05-17 13:51 UTC (permalink / raw)
  To: Mel Gorman
  Cc: David Rientjes, Andrew Morton, James Bottomley, Colin King,
	Raghavendra D Prabhu, Jan Kara, Chris Mason, Pekka Enberg,
	Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm,
	linux-kernel, linux-ext4

On Tue, 17 May 2011, Mel Gorman wrote:

> entirely. Christoph wants to maintain historic behaviour of SLUB to
> maximise the number of high-order pages it uses and at the end of the
> day, which option performs better depends entirely on the workload
> and machine configuration.

That is not what I meant. I would like more higher order allocations to
succeed. That does not mean that SLUB's allocation methods and the flags
it passes have to stay the same. You can change the SLUB behavior if it
helps.

I am just suspicious of compaction. If these mods are needed to reduce
the number of higher order pages used, then compaction does not have
the beneficial effect that it should have. It does not actually
increase the available higher order pages. Fix that first.




^ permalink raw reply	[flat|nested] 119+ messages in thread

* [PATCH] mm: vmscan: Correctly check if reclaimer should schedule during shrink_slab
  2011-05-17 13:50                     ` Colin Ian King
@ 2011-05-17 16:15                       ` Mel Gorman
  -1 siblings, 0 replies; 119+ messages in thread
From: Mel Gorman @ 2011-05-17 16:15 UTC (permalink / raw)
  To: akpm
  Cc: Minchan Kim, Colin Ian King, James Bottomley, KOSAKI Motohiro,
	akpm, raghu.prabhu13, jack, chris.mason, cl, penberg, riel,
	hannes, linux-fsdevel, linux-mm, linux-kernel, linux-ext4

It has been reported on some laptops that kswapd is consuming large
amounts of CPU and not being scheduled when SLUB is enabled during
large amounts of file copying. It is expected that this is due to
kswapd missing every cond_resched() point because:

shrink_page_list() calls cond_resched() if inactive pages were isolated
        which in turn may not happen if all_unreclaimable is set in
        shrink_zones(). If for whatever reason, all_unreclaimable is
        set on all zones, we can miss calling cond_resched().

balance_pgdat() only calls cond_resched if the zones are not
        balanced. For a high-order allocation that is balanced, it
        checks order-0 again. During that window, order-0 might have
        become unbalanced so it loops again for order-0 and returns
        that it was reclaiming for order-0 to kswapd(). It can then
        find that a caller has rewoken kswapd for a high-order and
        re-enters balance_pgdat() without ever calling cond_resched().

shrink_slab() only calls cond_resched() if we are reclaiming slab
	pages. If there are a large number of direct reclaimers, the
	shrinker_rwsem can be contended and prevent kswapd from calling
	cond_resched().

This patch modifies the shrink_slab() case. If the semaphore is
contended, the caller will still call cond_resched(). After each
successful call into a shrinker, the existing cond_resched() is
still necessary in case one shrinker call is particularly slow.

This patch replaces
mm-vmscan-if-kswapd-has-been-running-too-long-allow-it-to-sleep.patch
in -mm.

[mgorman@suse.de: Preserve call to cond_resched after each call into shrinker]
From: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/vmscan.c |    9 +++++++--
 1 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index af24d1e..0bed248 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -230,8 +230,11 @@ unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask,
 	if (scanned == 0)
 		scanned = SWAP_CLUSTER_MAX;
 
-	if (!down_read_trylock(&shrinker_rwsem))
-		return 1;	/* Assume we'll be able to shrink next time */
+	if (!down_read_trylock(&shrinker_rwsem)) {
+		/* Assume we'll be able to shrink next time */
+		ret = 1;
+		goto out;
+	}
 
 	list_for_each_entry(shrinker, &shrinker_list, list) {
 		unsigned long long delta;
@@ -282,6 +285,8 @@ unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask,
 		shrinker->nr += total_scan;
 	}
 	up_read(&shrinker_rwsem);
+out:
+	cond_resched();
 	return ret;
 }
 

^ permalink raw reply related	[flat|nested] 119+ messages in thread

* Re: [PATCH 3/4] mm: slub: Do not take expensive steps for SLUBs speculative high-order allocations
  2011-05-17 13:51         ` Christoph Lameter
@ 2011-05-17 16:22           ` Mel Gorman
  -1 siblings, 0 replies; 119+ messages in thread
From: Mel Gorman @ 2011-05-17 16:22 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: David Rientjes, Andrew Morton, James Bottomley, Colin King,
	Raghavendra D Prabhu, Jan Kara, Chris Mason, Pekka Enberg,
	Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm,
	linux-kernel, linux-ext4

On Tue, May 17, 2011 at 08:51:47AM -0500, Christoph Lameter wrote:
> On Tue, 17 May 2011, Mel Gorman wrote:
> 
> > entirely. Christoph wants to maintain historic behaviour of SLUB to
> > maximise the number of high-order pages it uses and at the end of the
> > day, which option performs better depends entirely on the workload
> > and machine configuration.
> 
> That is not what I meant. I would like more higher order allocations to
> succeed. That does not mean that slubs allocation methods and flags passed
> have to stay the same. You can change the slub behavior if it helps.
> 

In this particular patch, the success rate for high order allocations
would likely decrease in low memory conditions, although the latency of
calling the page allocator will be lower and the disruption to the
system will be less (no copying or reclaim of pages). My expectation
would be that it's cheaper for SLUB to fall back than to compact memory
or reclaim pages, even if this means a slab page is smaller until more
memory is free. However, if the "goodness" criterion is high order
allocation success rate, the patch shouldn't be merged.
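
As a rough illustration of that fallback, something along these lines;
the function name and the exact flag combination are a sketch for
discussion rather than the actual allocate_slab() code in slub.c:

	static struct page *alloc_slab_speculative(gfp_t flags, int high_order,
						   int min_order)
	{
		struct page *page;

		/*
		 * Speculative attempt at the preferred high order: clear
		 * __GFP_WAIT so neither direct reclaim nor compaction is
		 * entered, and avoid waking kswapd, so a failure is cheap.
		 */
		page = alloc_pages((flags | __GFP_NOWARN | __GFP_NO_KSWAPD) &
				   ~__GFP_WAIT, high_order);
		if (page)
			return page;

		/* Fall back to the smallest usable order with the full flags */
		return alloc_pages(flags, min_order);
	}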

> I am just suspicious of compaction. If these mods are needed to reduce the
> amount of higher order pages then compaction does not have the
> beneficial effect that it should have. It does not actually
> increase the available higher order pages. Fix that first.
> 

The problem being addressed was, at worst, the machine hanging and in
other cases kswapd being pinned at 99-100% CPU. It has now been shown
that modifying SLUB is not necessary to fix this because the bug was
in page reclaim. The high-order allocation success rate didn't come
into it.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH 3/4] mm: slub: Do not take expensive steps for SLUBs speculative high-order allocations
  2011-05-17 16:22           ` Mel Gorman
@ 2011-05-17 17:52             ` Christoph Lameter
  -1 siblings, 0 replies; 119+ messages in thread
From: Christoph Lameter @ 2011-05-17 17:52 UTC (permalink / raw)
  To: Mel Gorman
  Cc: David Rientjes, Andrew Morton, James Bottomley, Colin King,
	Raghavendra D Prabhu, Jan Kara, Chris Mason, Pekka Enberg,
	Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm,
	linux-kernel, linux-ext4

On Tue, 17 May 2011, Mel Gorman wrote:

> > That is not what I meant. I would like more higher order allocations to
> > succeed. That does not mean that slubs allocation methods and flags passed
> > have to stay the same. You can change the slub behavior if it helps.
> >
>
> In this particular patch, the success rate for high order allocations
> would likely decrease in low memory conditions albeit the latency when
> calling the page allocator will be lower and the disruption to the
> system will be less (no copying or reclaim of pages). My expectation
> would be that it's cheaper for SLUB to fall back than compact memory
> or reclaim pages even if this means a slab page is smaller until more
> memory is free. However, if the "goodness" criteria is high order
> allocation success rate, the patch shouldn't be merged.

The criterion is certainly overall system performance and not a high
order allocation rate.

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH 3/4] mm: slub: Do not take expensive steps for SLUBs speculative high-order allocations
  2011-05-17  8:42       ` Mel Gorman
@ 2011-05-17 19:31         ` David Rientjes
  -1 siblings, 0 replies; 119+ messages in thread
From: David Rientjes @ 2011-05-17 19:31 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, James Bottomley, Colin King, Raghavendra D Prabhu,
	Jan Kara, Chris Mason, Christoph Lameter, Pekka Enberg,
	Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm,
	linux-kernel, linux-ext4

On Tue, 17 May 2011, Mel Gorman wrote:

> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index 9f8a97b..057f1e2 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -1972,6 +1972,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
> > >  {
> > >  	int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
> > >  	const gfp_t wait = gfp_mask & __GFP_WAIT;
> > > +	const gfp_t can_wake_kswapd = !(gfp_mask & __GFP_NO_KSWAPD);
> > >  
> > >  	/* __GFP_HIGH is assumed to be the same as ALLOC_HIGH to save a branch. */
> > >  	BUILD_BUG_ON(__GFP_HIGH != (__force gfp_t) ALLOC_HIGH);
> > > @@ -1984,7 +1985,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
> > >  	 */
> > >  	alloc_flags |= (__force int) (gfp_mask & __GFP_HIGH);
> > >  
> > > -	if (!wait) {
> > > +	if (!wait && can_wake_kswapd) {
> > >  		/*
> > >  		 * Not worth trying to allocate harder for
> > >  		 * __GFP_NOMEMALLOC even if it can't schedule.
> > > diff --git a/mm/slub.c b/mm/slub.c
> > > index 98c358d..c5797ab 100644
> > > --- a/mm/slub.c
> > > +++ b/mm/slub.c
> > > @@ -1170,7 +1170,8 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
> > >  	 * Let the initial higher-order allocation fail under memory pressure
> > >  	 * so we fall-back to the minimum order allocation.
> > >  	 */
> > > -	alloc_gfp = (flags | __GFP_NOWARN | __GFP_NORETRY | __GFP_NO_KSWAPD) & ~__GFP_NOFAIL;
> > > +	alloc_gfp = (flags | __GFP_NOWARN | __GFP_NO_KSWAPD) &
> > > +			~(__GFP_NOFAIL | __GFP_WAIT | __GFP_REPEAT);
> > >  
> > >  	page = alloc_slab_page(alloc_gfp, node, oo);
> > >  	if (unlikely(!page)) {
> > 
> > It's unnecessary to clear __GFP_REPEAT, these !__GFP_NOFAIL allocations 
> > will immediately fail.
> > 
> 
> We can enter direct compaction or direct reclaim
> at least once. If compaction is enabled and we enter
> reclaim/compaction, the presence of __GFP_REPEAT makes a difference
> in should_continue_reclaim().  With compaction disabled, the presence
> of the flag is relevant in should_alloc_retry() with it being possible
> to loop in the allocator instead of failing the SLUB allocation and
> dropping back.
> 

You've cleared __GFP_WAIT, so it cannot enter direct compaction or direct 
reclaim, so clearing __GFP_REPEAT here doesn't actually do anything.  
That's why I suggested adding a comment about why you're clearing 
__GFP_WAIT: to make it obvious that these allocations will immediately 
fail if the alloc is unsuccessful and we don't need to add __GFP_NORETRY 
or remove __GFP_REPEAT.

> Maybe you meant !__GFP_WAIT instead of !__GFP_NOFAIL which makes
> more sense.

No, I meant !__GFP_NOFAIL since the high priority allocations (if 
PF_MEMALLOC or TIF_MEMDIE) will not loop forever looking for a page 
without that bit.  That allows this !__GFP_WAIT allocation to immediately 
fail.  __GFP_NORETRY and __GFP_REPEAT are no-ops unless you have 
__GFP_WAIT.
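
To make the ordering concrete, the allocator slow path goes roughly
like this (a heavily trimmed sketch of __alloc_pages_slowpath() from
that era, not the verbatim source):

	/* Atomic allocations - we can't balance anything */
	if (!wait)			/* i.e. no __GFP_WAIT */
		goto nopage;

	page = __alloc_pages_direct_compact(gfp_mask, order, ...);
	if (page)
		goto got_pg;

	page = __alloc_pages_direct_reclaim(gfp_mask, order, ...,
					    &did_some_progress);
	if (page)
		goto got_pg;

	pages_reclaimed += did_some_progress;
	if (should_alloc_retry(gfp_mask, order, pages_reclaimed)) {
		/* where __GFP_NORETRY and __GFP_REPEAT come into play */
		wait_iff_congested(preferred_zone, BLK_RW_ASYNC, HZ/50);
		goto rebalance;
	}

Once __GFP_WAIT is cleared, nothing below the goto nopage is reachable
for this speculative SLUB allocation.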

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH 3/4] mm: slub: Do not take expensive steps for SLUBs speculative high-order allocations
  2011-05-17 17:52             ` Christoph Lameter
@ 2011-05-17 19:35               ` David Rientjes
  -1 siblings, 0 replies; 119+ messages in thread
From: David Rientjes @ 2011-05-17 19:35 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Mel Gorman, Andrew Morton, James Bottomley, Colin King,
	Raghavendra D Prabhu, Jan Kara, Chris Mason, Pekka Enberg,
	Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm,
	linux-kernel, linux-ext4

On Tue, 17 May 2011, Christoph Lameter wrote:

> > In this particular patch, the success rate for high order allocations
> > would likely decrease in low memory conditions albeit the latency when
> > calling the page allocator will be lower and the disruption to the
> > system will be less (no copying or reclaim of pages). My expectation
> > would be that it's cheaper for SLUB to fall back than compact memory
> > or reclaim pages even if this means a slab page is smaller until more
> > memory is free. However, if the "goodness" criteria is high order
> > allocation success rate, the patch shouldn't be merged.
> 
> The criteria is certainly overall system performance and not a high order
> allocation rate.
> 

SLUB definitely depends on these higher order allocations being successful
for performance; dropping back to the min order is a last resort as
opposed to failing the kmalloc().  If it's the last resort, then it makes 
sense that we'd want to try both compaction and reclaim while we're 
already in the page allocator as we go down the slub slowpath.  Why not 
try just a little harder (compaction and/or reclaim) to alloc the cache's 
preferred order?
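
In terms of the diff above, trying harder presumably means keeping the
pre-patch masking in allocate_slab(), which leaves __GFP_WAIT set (so
one pass of direct compaction/reclaim is still possible) while
__GFP_NORETRY still prevents looping:

	alloc_gfp = (flags | __GFP_NOWARN | __GFP_NORETRY | __GFP_NO_KSWAPD) & ~__GFP_NOFAIL;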

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH 4/4] mm: vmscan: If kswapd has been running too long, allow it to sleep
  2011-05-16 10:27               ` Mel Gorman
@ 2011-05-18  0:26                 ` KOSAKI Motohiro
  -1 siblings, 0 replies; 119+ messages in thread
From: KOSAKI Motohiro @ 2011-05-18  0:26 UTC (permalink / raw)
  To: mgorman
  Cc: minchan.kim, James.Bottomley, akpm, colin.king, raghu.prabhu13,
	jack, chris.mason, cl, penberg, riel, hannes, linux-fsdevel,
	linux-mm, linux-kernel, linux-ext4

> Lets see;
>
> shrink_page_list() only applies if inactive pages were isolated
> 	which in turn may not happen if all_unreclaimable is set in
> 	shrink_zones(). If for whatever reason, all_unreclaimable is
> 	set on all zones, we can miss calling cond_resched().
>
> shrink_slab only applies if we are reclaiming slab pages. If the first
> 	shrinker returns -1, we do not call cond_resched(). If that
> 	first shrinker is dcache and __GFP_FS is not set, direct
> 	reclaimers will not shrink at all. However, if there are
> 	enough of them running or if one of the other shrinkers
> 	is running for a very long time, kswapd could be starved
> 	acquiring the shrinker_rwsem and never reaching the
> 	cond_resched().

OK.


>
> balance_pgdat() only calls cond_resched if the zones are not
> 	balanced. For a high-order allocation that is balanced, it
> 	checks order-0 again. During that window, order-0 might have
> 	become unbalanced so it loops again for order-0 and returns
> 	that was reclaiming for order-0 to kswapd(). It can then find
> 	that a caller has rewoken kswapd for a high-order and re-enters
> 	balance_pgdat() without ever have called cond_resched().

Then shouldn't balance_pgdat() call cond_resched() unconditionally?
The problem is not the 100% CPU consumption itself; if kswapd sleeps,
other processes simply have to reclaim old pages themselves. The problem
is that kswapd doesn't invoke a context switch, so other tasks hang.
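
A minimal sketch of that (hypothetical, not a tested patch) against
the balance_pgdat() priority loop:

	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
		/*
		 * Unconditional scheduling point: give up the CPU once
		 * per priority iteration even if the node already looks
		 * balanced and no reclaim work ends up being done.
		 */
		cond_resched();

		/* ... existing per-priority scanning and shrinking ... */
	}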




> While it appears unlikely, there are bad conditions which can result
> in cond_resched() being avoided.
>



^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH] mm: vmscan: Correctly check if reclaimer should schedule during shrink_slab
  2011-05-17 16:15                       ` Mel Gorman
@ 2011-05-18  0:45                         ` KOSAKI Motohiro
  -1 siblings, 0 replies; 119+ messages in thread
From: KOSAKI Motohiro @ 2011-05-18  0:45 UTC (permalink / raw)
  To: mgorman
  Cc: akpm, minchan.kim, colin.king, James.Bottomley, raghu.prabhu13,
	jack, chris.mason, cl, penberg, riel, hannes, linux-fsdevel,
	linux-mm, linux-kernel, linux-ext4

(2011/05/18 1:15), Mel Gorman wrote:
> It has been reported on some laptops that kswapd is consuming large
> amounts of CPU and not being scheduled when SLUB is enabled during
> large amounts of file copying. It is expected that this is due to
> kswapd missing every cond_resched() point because;
>
> shrink_page_list() calls cond_resched() if inactive pages were isolated
>          which in turn may not happen if all_unreclaimable is set in
>          shrink_zones(). If for whatever reason, all_unreclaimable is
>          set on all zones, we can miss calling cond_resched().
>
> balance_pgdat() only calls cond_resched if the zones are not
>          balanced. For a high-order allocation that is balanced, it
>          checks order-0 again. During that window, order-0 might have
>          become unbalanced so it loops again for order-0 and returns
>          that it was reclaiming for order-0 to kswapd(). It can then
>          find that a caller has rewoken kswapd for a high-order and
>          re-enters balance_pgdat() without ever calling cond_resched().
>
> shrink_slab only calls cond_resched() if we are reclaiming slab
> 	pages. If there are a large number of direct reclaimers, the
> 	shrinker_rwsem can be contended and prevent kswapd calling
> 	cond_resched().
>
> This patch modifies the shrink_slab() case. If the semaphore is
> contended, the caller will still check cond_resched(). After each
> successful call into a shrinker, the check for cond_resched() is
> still necessary in case one shrinker call is particularly slow.
>
> This patch replaces
> mm-vmscan-if-kswapd-has-been-running-too-long-allow-it-to-sleep.patch
> in -mm.
>
> [mgorman@suse.de: Preserve call to cond_resched after each call into shrinker]
> From: Minchan Kim<minchan.kim@gmail.com>
> Signed-off-by: Mel Gorman<mgorman@suse.de>

Looks good to me.
	Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
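
For reference, the shrink_slab() change described above amounts to
roughly this shape (a sketch rather than the exact hunk):

	if (!down_read_trylock(&shrinker_rwsem)) {
		/* Assume we'll be able to shrink next time */
		ret = 1;
		goto out;
	}

	list_for_each_entry(shrinker, &shrinker_list, list) {
		while (total_scan >= SHRINK_BATCH) {
			/* ... call into the shrinker ... */
			cond_resched();	/* kept: one shrinker call can be slow */
		}
		shrinker->nr += total_scan;
	}
	up_read(&shrinker_rwsem);
out:
	cond_resched();	/* reached even when the trylock fails */
	return ret;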



^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH 4/4] mm: vmscan: If kswapd has been running too long, allow it to sleep
  2011-05-16 23:50                 ` Minchan Kim
@ 2011-05-18  1:05                   ` KOSAKI Motohiro
  -1 siblings, 0 replies; 119+ messages in thread
From: KOSAKI Motohiro @ 2011-05-18  1:05 UTC (permalink / raw)
  To: minchan.kim
  Cc: mgorman, James.Bottomley, akpm, colin.king, raghu.prabhu13, jack,
	chris.mason, cl, penberg, riel, hannes, linux-fsdevel, linux-mm,
	linux-kernel, linux-ext4

> It would be better to put cond_resched after balance_pgdat?
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 292582c..61c45d0 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2753,6 +2753,7 @@ static int kswapd(void *p)
>                  if (!ret) {
>                          trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
>                          order = balance_pgdat(pgdat, order,&classzone_idx);
> +                       cond_resched();
>                  }
>          }
>          return 0;
>
>>>> While it appears unlikely, there are bad conditions which can result
>> in cond_resched() being avoided.

Every reclaim priority decrease or every shrink_zone() call would make
preemption more fine-grained, I think.



^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH 4/4] mm: vmscan: If kswapd has been running too long, allow it to sleep
  2011-05-17 10:38                   ` Mel Gorman
@ 2011-05-18  4:09                     ` James Bottomley
  -1 siblings, 0 replies; 119+ messages in thread
From: James Bottomley @ 2011-05-18  4:09 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Minchan Kim, KOSAKI Motohiro, akpm, colin.king, raghu.prabhu13,
	jack, chris.mason, cl, penberg, riel, hannes, linux-fsdevel,
	linux-mm, linux-kernel, linux-ext4

On Tue, 2011-05-17 at 11:38 +0100, Mel Gorman wrote:
> On Tue, May 17, 2011 at 08:50:44AM +0900, Minchan Kim wrote:
> > Don't we have to move cond_resched?
> > 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 292582c..633e761 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -231,8 +231,10 @@ unsigned long shrink_slab(struct shrink_control *shrink,
> >         if (scanned == 0)
> >                 scanned = SWAP_CLUSTER_MAX;
> > 
> > -       if (!down_read_trylock(&shrinker_rwsem))
> > -               return 1;       /* Assume we'll be able to shrink next time */
> > +       if (!down_read_trylock(&shrinker_rwsem)) {
> > +               ret = 1;
> > +               goto out; /* Assume we'll be able to shrink next time */
> > +       }
> > 
> >         list_for_each_entry(shrinker, &shrinker_list, list) {
> >                 unsigned long long delta;
> > @@ -280,12 +282,14 @@ unsigned long shrink_slab(struct shrink_control *shrink,
> >                         count_vm_events(SLABS_SCANNED, this_scan);
> >                         total_scan -= this_scan;
> > 
> > -                       cond_resched();
> >                 }
> > 
> >                 shrinker->nr += total_scan;
> > +               cond_resched();
> >         }
> >         up_read(&shrinker_rwsem);
> > +out:
> > +       cond_resched();
> >         return ret;
> >  }
> > 
> 
> This makes some sense for the exit path but if one or more of the
> shrinkers takes a very long time without sleeping (extremely long
> list searches for example) then kswapd will not call cond_resched()
> between shrinkers and still consume a lot of CPU.
> 
> > >
> > > balance_pgdat() only calls cond_resched if the zones are not
> > >        balanced. For a high-order allocation that is balanced, it
> > >        checks order-0 again. During that window, order-0 might have
> > >        become unbalanced so it loops again for order-0 and returns
> > >        that was reclaiming for order-0 to kswapd(). It can then find
> > >        that a caller has rewoken kswapd for a high-order and re-enters
> > >        balance_pgdat() without ever have called cond_resched().
> > 
> > If kswapd reclaims order-0 followed by high order, it would have a
> > chance to call cond_resched in shrink_page_list. But if all zones are
> > all_unreclaimable is set, balance_pgdat could return any work. Okay.
> > It does make sense.
> > By your scenario, someone wakes up kswapd with higher order, again.
> > So re-enters balance_pgdat without ever have called cond_resched.
> > But if someone wakes up higher order again, we can't have a chance to
> > call kswapd_try_to_sleep. So your patch effect would be nop, too.
> > 
> > It would be better to put cond_resched after balance_pgdat?
> > 
> 
> Which will leave kswapd runnable instead of going to sleep but
> guarantees a scheduling point. Lets see if the problem is that
> cond_resched is being missed although if this was the case then patch
> 4 would truly be a no-op but Colin has already reported that patch 1 on
> its own didn't fix his problem. If the problem is sandybridge-specific
> where kswapd remains runnable and consuming large amounts of CPU in
> turbo mode then we know that there are other cond_resched() decisions
> that will need to be revisited.
> 
> Colin or James, would you be willing to test with patch 1 from this
> series and Minchan's patch below? Thanks.

Yes, but unfortunately I'm on the road at the moment.  I won't get back
to the laptop showing the problem until late on Tuesday (24th).  If it
works for Colin, I'd assume it's OK.

James


> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 292582c..61c45d0 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2753,6 +2753,7 @@ static int kswapd(void *p)
> >                 if (!ret) {
> >                         trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
> >                         order = balance_pgdat(pgdat, order, &classzone_idx);
> > +                       cond_resched();
> >                 }
> >         }
> >         return 0;
> > 
> > >
> > > While it appears unlikely, there are bad conditions which can result
> > > in cond_resched() being avoided.
> > 
> > >
> > > --
> > > Mel Gorman
> > > SUSE Labs
> > >
> > 
> > 
> > 
> > -- 
> > Kind regards,
> > Minchan Kim
> 



^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH 4/4] mm: vmscan: If kswapd has been running too long, allow it to sleep
  2011-05-17 13:50                     ` Colin Ian King
@ 2011-05-18  4:19                       ` Minchan Kim
  -1 siblings, 0 replies; 119+ messages in thread
From: Minchan Kim @ 2011-05-18  4:19 UTC (permalink / raw)
  To: Colin Ian King
  Cc: Mel Gorman, James Bottomley, KOSAKI Motohiro, akpm,
	raghu.prabhu13, jack, chris.mason, cl, penberg, riel, hannes,
	linux-fsdevel, linux-mm, linux-kernel, linux-ext4

Hello Colin,

On Tue, May 17, 2011 at 10:50 PM, Colin Ian King
<colin.king@canonical.com> wrote:
> On Tue, 2011-05-17 at 11:38 +0100, Mel Gorman wrote:
>> On Tue, May 17, 2011 at 08:50:44AM +0900, Minchan Kim wrote:
>> > On Mon, May 16, 2011 at 7:27 PM, Mel Gorman <mgorman@suse.de> wrote:
>> > > On Mon, May 16, 2011 at 05:58:59PM +0900, Minchan Kim wrote:
>> > >> On Mon, May 16, 2011 at 5:45 PM, Mel Gorman <mgorman@suse.de> wrote:
>> > >> > On Mon, May 16, 2011 at 02:04:00PM +0900, Minchan Kim wrote:
>> > >> >> On Mon, May 16, 2011 at 1:21 PM, James Bottomley
>> > >> >> <James.Bottomley@hansenpartnership.com> wrote:
>> > >> >> > On Sun, 2011-05-15 at 19:27 +0900, KOSAKI Motohiro wrote:
>> > >> >> >> (2011/05/13 23:03), Mel Gorman wrote:
>> > >> >> >> > Under constant allocation pressure, kswapd can be in the situation where
>> > >> >> >> > sleeping_prematurely() will always return true even if kswapd has been
>> > >> >> >> > running a long time. Check if kswapd needs to be scheduled.
>> > >> >> >> >
>> > >> >> >> > Signed-off-by: Mel Gorman<mgorman@suse.de>
>> > >> >> >> > ---
>> > >> >> >> >   mm/vmscan.c |    4 ++++
>> > >> >> >> >   1 files changed, 4 insertions(+), 0 deletions(-)
>> > >> >> >> >
>> > >> >> >> > diff --git a/mm/vmscan.c b/mm/vmscan.c
>> > >> >> >> > index af24d1e..4d24828 100644
>> > >> >> >> > --- a/mm/vmscan.c
>> > >> >> >> > +++ b/mm/vmscan.c
>> > >> >> >> > @@ -2251,6 +2251,10 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
>> > >> >> >> >     unsigned long balanced = 0;
>> > >> >> >> >     bool all_zones_ok = true;
>> > >> >> >> >
>> > >> >> >> > +   /* If kswapd has been running too long, just sleep */
>> > >> >> >> > +   if (need_resched())
>> > >> >> >> > +           return false;
>> > >> >> >> > +
>> > >> >> >>
>> > >> >> >> Hmm... I don't like this patch so much. because this code does
>> > >> >> >>
>> > >> >> >> - don't sleep if kswapd got context switch at shrink_inactive_list
>> > >> >> >
>> > >> >> > This isn't entirely true:  need_resched() will be false, so we'll follow
>> > >> >> > the normal path for determining whether to sleep or not, in effect
>> > >> >> > leaving the current behaviour unchanged.
>> > >> >> >
>> > >> >> >> - sleep if kswapd didn't
>> > >> >> >
>> > >> >> > This also isn't entirely true: whether need_resched() is true at this
>> > >> >> > point depends on a whole lot more that whether we did a context switch
>> > >> >> > in shrink_inactive. It mostly depends on how long we've been running
>> > >> >> > without giving up the CPU.  Generally that will mean we've been round
>> > >> >> > the shrinker loop hundreds to thousands of times without sleeping.
>> > >> >> >
>> > >> >> >> It seems to be semi random behavior.
>> > >> >> >
>> > >> >> > Well, we have to do something.  Chris Mason first suspected the hang was
>> > >> >> > a kswapd rescheduling problem a while ago.  We tried putting
>> > >> >> > cond_rescheds() in several places in the vmscan code, but to no avail.
>> > >> >>
>> > >> >> Is it a result of  test with patch of Hannes(ie, !pgdat_balanced)?
>> > >> >>
>> > >> >> If it isn't, it would be nop regardless of putting cond_reshed at vmscan.c.
>> > >> >> Because, although we complete zone balancing, kswapd doesn't sleep as
>> > >> >> pgdat_balance returns wrong result. And at last VM calls
>> > >> >> balance_pgdat. In this case, balance_pgdat returns without any work as
>> > >> >> kswap couldn't find zones which have not enough free pages and goto
>> > >> >> out. kswapd could repeat this work infinitely. So you don't have a
>> > >> >> chance to call cond_resched.
>> > >> >>
>> > >> >> But if your test was with Hanne's patch, I am very curious how come
>> > >> >> kswapd consumes CPU a lot.
>> > >> >>
>> > >> >> > The need_resched() in sleeping_prematurely() seems to be about the best
>> > >> >> > option.  The other option might be just to put a cond_resched() in
>> > >> >> > kswapd_try_to_sleep(), but that will really have about the same effect.
>> > >> >>
>> > >> >> I don't oppose it but before that, I think we have to know why kswapd
>> > >> >> consumes CPU a lot although we applied Hannes' patch.
>> > >> >>
>> > >> >
>> > >> > Because it's still possible for processes to allocate pages at the same
>> > >> > rate kswapd is freeing them leading to a situation where kswapd does not
>> > >> > consider the zone balanced for prolonged periods of time.
>> > >>
>> > >> We have cond_resched in shrink_page_list, shrink_slab and balance_pgdat.
>> > >> So I think kswapd can be scheduled out although it's scheduled in
>> > >> after a short time as task scheduled also need page reclaim. Although
>> > >> all task in system need reclaim, kswapd cpu 99% consumption is a
>> > >> natural result, I think.
>> > >> Do I miss something?
>> > >>
>> > >
>> > > Lets see;
>> > >
>> > > shrink_page_list() only applies if inactive pages were isolated
>> > >        which in turn may not happen if all_unreclaimable is set in
>> > >        shrink_zones(). If for whatever reason, all_unreclaimable is
>> > >        set on all zones, we can miss calling cond_resched().
>> > >
>> > > shrink_slab only applies if we are reclaiming slab pages. If the first
>> > >        shrinker returns -1, we do not call cond_resched(). If that
>> > >        first shrinker is dcache and __GFP_FS is not set, direct
>> > >        reclaimers will not shrink at all. However, if there are
>> > >        enough of them running or if one of the other shrinkers
>> > >        is running for a very long time, kswapd could be starved
>> > >        acquiring the shrinker_rwsem and never reaching the
>> > >        cond_resched().
>> >
>> > Don't we have to move cond_resched?
>> >
>> > diff --git a/mm/vmscan.c b/mm/vmscan.c
>> > index 292582c..633e761 100644
>> > --- a/mm/vmscan.c
>> > +++ b/mm/vmscan.c
>> > @@ -231,8 +231,10 @@ unsigned long shrink_slab(struct shrink_control *shrink,
>> >         if (scanned == 0)
>> >                 scanned = SWAP_CLUSTER_MAX;
>> >
>> > -       if (!down_read_trylock(&shrinker_rwsem))
>> > -               return 1;       /* Assume we'll be able to shrink next time */
>> > +       if (!down_read_trylock(&shrinker_rwsem)) {
>> > +               ret = 1;
>> > +               goto out; /* Assume we'll be able to shrink next time */
>> > +       }
>> >
>> >         list_for_each_entry(shrinker, &shrinker_list, list) {
>> >                 unsigned long long delta;
>> > @@ -280,12 +282,14 @@ unsigned long shrink_slab(struct shrink_control *shrink,
>> >                         count_vm_events(SLABS_SCANNED, this_scan);
>> >                         total_scan -= this_scan;
>> >
>> > -                       cond_resched();
>> >                 }
>> >
>> >                 shrinker->nr += total_scan;
>> > +               cond_resched();
>> >         }
>> >         up_read(&shrinker_rwsem);
>> > +out:
>> > +       cond_resched();
>> >         return ret;
>> >  }
>> >
>>
>> This makes some sense for the exit path but if one or more of the
>> shrinkers takes a very long time without sleeping (extremely long
>> list searches for example) then kswapd will not call cond_resched()
>> between shrinkers and still consume a lot of CPU.
>>
>> > >
>> > > balance_pgdat() only calls cond_resched if the zones are not
>> > >        balanced. For a high-order allocation that is balanced, it
>> > >        checks order-0 again. During that window, order-0 might have
>> > >        become unbalanced so it loops again for order-0 and returns
>> > >        that was reclaiming for order-0 to kswapd(). It can then find
>> > >        that a caller has rewoken kswapd for a high-order and re-enters
>> > >        balance_pgdat() without ever have called cond_resched().
>> >
>> > If kswapd reclaims order-0 followed by high order, it would have a
>> > chance to call cond_resched in shrink_page_list. But if all zones are
>> > all_unreclaimable is set, balance_pgdat could return any work. Okay.
>> > It does make sense.
>> > By your scenario, someone wakes up kswapd with higher order, again.
>> > So re-enters balance_pgdat without ever have called cond_resched.
>> > But if someone wakes up higher order again, we can't have a chance to
>> > call kswapd_try_to_sleep. So your patch effect would be nop, too.
>> >
>> > It would be better to put cond_resched after balance_pgdat?
>> >
>>
>> Which will leave kswapd runnable instead of going to sleep but
>> guarantees a scheduling point. Lets see if the problem is that
>> cond_resched is being missed although if this was the case then patch
>> 4 would truly be a no-op but Colin has already reported that patch 1 on
>> its own didn't fix his problem. If the problem is sandybridge-specific
>> where kswapd remains runnable and consuming large amounts of CPU in
>> turbo mode then we know that there are other cond_resched() decisions
>> that will need to be revisited.
>>
>> Colin or James, would you be willing to test with patch 1 from this
>> series and Minchan's patch below? Thanks.
>
> This works OK fine.  Ran 250 test cycles for about 2 hours.

Thanks for the testing!
I would like to know exactly which patch you applied.
My modifications inserting cond_resched() are two:

1) the shrink_slab() function
2) kswapd(), right after balance_pgdat()

Was it 1), 2), or both?


Thanks
-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH 4/4] mm: vmscan: If kswapd has been running too long, allow it to sleep
  2011-05-18  1:05                   ` KOSAKI Motohiro
  (?)
@ 2011-05-18  5:44                     ` Minchan Kim
  -1 siblings, 0 replies; 119+ messages in thread
From: Minchan Kim @ 2011-05-18  5:44 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: mgorman, James.Bottomley, akpm, colin.king, raghu.prabhu13, jack,
	chris.mason, cl, penberg, riel, hannes, linux-fsdevel, linux-mm,
	linux-kernel, linux-ext4

On Wed, May 18, 2011 at 10:05 AM, KOSAKI Motohiro
<kosaki.motohiro@jp.fujitsu.com> wrote:
>> It would be better to put cond_resched after balance_pgdat?
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 292582c..61c45d0 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -2753,6 +2753,7 @@ static int kswapd(void *p)
>>                 if (!ret) {
>>                         trace_mm_vmscan_kswapd_wake(pgdat->node_id,
>> order);
>>                         order = balance_pgdat(pgdat,
>> order,&classzone_idx);
>> +                       cond_resched();
>>                 }
>>         }
>>         return 0;
>>
>>>>> While it appears unlikely, there are bad conditions which can result
>>>
>>> in cond_resched() being avoided.
>
> Every reclaim priority decrease or every shrink_zone() call would make
> preemption more fine-grained, I think.

It could be.
But in the direct reclaim case, I have a concern about a task losing
the pages it reclaimed to other tasks through preemption.

Hmm, anyway, we also need to test. How long should we keep bothering
them (Colin and James)? First of all, let's settle on one option between
us, ask them to test it, and then send the final patch to akpm.

1. shrink_slab
2. right after balance_pgdat
3. shrink_zone
4. reclaim priority decreasing routine.

Now, I vote 1) and 2).

Mel, KOSAKI?
-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH 4/4] mm: vmscan: If kswapd has been running too long, allow it to sleep
  2011-05-18  5:44                     ` Minchan Kim
@ 2011-05-18  6:05                       ` KOSAKI Motohiro
  -1 siblings, 0 replies; 119+ messages in thread
From: KOSAKI Motohiro @ 2011-05-18  6:05 UTC (permalink / raw)
  To: minchan.kim
  Cc: mgorman, James.Bottomley, akpm, colin.king, raghu.prabhu13, jack,
	chris.mason, cl, penberg, riel, hannes, linux-fsdevel, linux-mm,
	linux-kernel, linux-ext4

>>>>>> While it appears unlikely, there are bad conditions which can result
>>>>
>>>> in cond_resched() being avoided.
>>
>> Every reclaim priority decreasing or every shrink_zone() calling makes more
>> fine grained preemption. I think.
>
> It could be.
> But in direct reclaim case, I have a concern about losing pages
> reclaimed to other tasks by preemption.

Nope, I proposed to add cond_resched() into balance_pgdat().

> Hmm,, anyway, we also needs test.
> Hmm,, how long should we bother them(Colins and James)?
> First of all, Let's fix one just between us and ask test to them and
> send the last patch to akpm.
>
> 1. shrink_slab
> 2. right after balance_pgdat
> 3. shrink_zone
> 4. reclaim priority decreasing routine.
>
> Now, I vote 1) and 2).
>
> Mel, KOSAKI?

I think the following patch provides enough preemption.

Thanks.



 From e7d88be1916184ea7c93a6f2746b15c7a32d1973 Mon Sep 17 00:00:00 2001
From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Date: Wed, 18 May 2011 15:00:39 +0900
Subject: [PATCH] vmscan: balance_pgdat() call cond_resched() unconditionally

Under constant allocation pressure, kswapd can be in the situation where
sleeping_prematurely() will always return true even if kswapd has been
running a long time. Check if kswapd needs to be scheduled.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Colin King <colin.king@canonical.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
---
  mm/vmscan.c |    3 +--
  1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 19e179b..87c88fd 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2449,6 +2449,7 @@ loop_again:
  			sc.nr_reclaimed += reclaim_state->reclaimed_slab;
  			total_scanned += sc.nr_scanned;

+			cond_resched();
  			if (zone->all_unreclaimable)
  				continue;
  			if (nr_slab == 0 &&
@@ -2518,8 +2519,6 @@ out:
  	 *             for the node to be balanced
  	 */
  	if (!(all_zones_ok || (order && pgdat_balanced(pgdat, balanced, *classzone_idx)))) {
-		cond_resched();
-
  		try_to_freeze();

  		/*
-- 
1.7.3.1
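
In effect (a condensed sketch inferred from the hunks above, not a
literal excerpt), the per-zone scan loop in balance_pgdat() then has a
scheduling point that runs once per zone, even when the zone is marked
all_unreclaimable and the rest of the iteration is skipped:

static void scan_zones_sketch(pg_data_t *pgdat, int end_zone)
{
	int i;

	for (i = 0; i <= end_zone; i++) {
		struct zone *zone = pgdat->node_zones + i;

		/* ... shrink the zone and its slab, update counters ... */

		cond_resched();

		if (zone->all_unreclaimable)
			continue;

		/* ... watermark / balancing checks for this zone ... */
	}
}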





^ permalink raw reply related	[flat|nested] 119+ messages in thread

* Re: [PATCH 2/4] mm: slub: Do not wake kswapd for SLUBs speculative high-order allocations
  2011-05-16 21:10     ` David Rientjes
@ 2011-05-18  6:09       ` Pekka Enberg
  -1 siblings, 0 replies; 119+ messages in thread
From: Pekka Enberg @ 2011-05-18  6:09 UTC (permalink / raw)
  To: David Rientjes
  Cc: Mel Gorman, Andrew Morton, James Bottomley, Colin King,
	Raghavendra D Prabhu, Jan Kara, Chris Mason, Christoph Lameter,
	Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm,
	linux-kernel, linux-ext4

On 5/17/11 12:10 AM, David Rientjes wrote:
> On Fri, 13 May 2011, Mel Gorman wrote:
>
>> To avoid locking and per-cpu overhead, SLUB optimisically uses
>> high-order allocations and falls back to lower allocations if they
>> fail.  However, by simply trying to allocate, kswapd is woken up to
>> start reclaiming at that order. On a desktop system, two users report
>> that the system is getting locked up with kswapd using large amounts
>> of CPU.  Using SLAB instead of SLUB made this problem go away.
>>
>> This patch prevents kswapd being woken up for high-order allocations.
>> Testing indicated that with this patch applied, the system was much
>> harder to hang and even when it did, it eventually recovered.
>>
>> Signed-off-by: Mel Gorman<mgorman@suse.de>
> Acked-by: David Rientjes<rientjes@google.com>

Christoph? I think this patch is sane, although the original rationale
was to work around kswapd problems.

                 Pekka

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH 4/4] mm: vmscan: If kswapd has been running too long, allow it to sleep
  2011-05-18  4:19                       ` Minchan Kim
@ 2011-05-18  7:39                         ` Colin Ian King
  -1 siblings, 0 replies; 119+ messages in thread
From: Colin Ian King @ 2011-05-18  7:39 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Mel Gorman, James Bottomley, KOSAKI Motohiro, akpm,
	raghu.prabhu13, jack, chris.mason, cl, penberg, riel, hannes,
	linux-fsdevel, linux-mm, linux-kernel, linux-ext4

On Wed, 2011-05-18 at 13:19 +0900, Minchan Kim wrote:
> Hello Colin,
> 
> On Tue, May 17, 2011 at 10:50 PM, Colin Ian King
> <colin.king@canonical.com> wrote:
> > On Tue, 2011-05-17 at 11:38 +0100, Mel Gorman wrote:
> >> On Tue, May 17, 2011 at 08:50:44AM +0900, Minchan Kim wrote:
> >> > On Mon, May 16, 2011 at 7:27 PM, Mel Gorman <mgorman@suse.de> wrote:
> >> > > On Mon, May 16, 2011 at 05:58:59PM +0900, Minchan Kim wrote:
> >> > >> On Mon, May 16, 2011 at 5:45 PM, Mel Gorman <mgorman@suse.de> wrote:
> >> > >> > On Mon, May 16, 2011 at 02:04:00PM +0900, Minchan Kim wrote:
> >> > >> >> On Mon, May 16, 2011 at 1:21 PM, James Bottomley
> >> > >> >> <James.Bottomley@hansenpartnership.com> wrote:
> >> > >> >> > On Sun, 2011-05-15 at 19:27 +0900, KOSAKI Motohiro wrote:
> >> > >> >> >> (2011/05/13 23:03), Mel Gorman wrote:
> >> > >> >> >> > Under constant allocation pressure, kswapd can be in the situation where
> >> > >> >> >> > sleeping_prematurely() will always return true even if kswapd has been
> >> > >> >> >> > running a long time. Check if kswapd needs to be scheduled.
> >> > >> >> >> >
> >> > >> >> >> > Signed-off-by: Mel Gorman<mgorman@suse.de>
> >> > >> >> >> > ---
> >> > >> >> >> >   mm/vmscan.c |    4 ++++
> >> > >> >> >> >   1 files changed, 4 insertions(+), 0 deletions(-)
> >> > >> >> >> >
> >> > >> >> >> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> >> > >> >> >> > index af24d1e..4d24828 100644
> >> > >> >> >> > --- a/mm/vmscan.c
> >> > >> >> >> > +++ b/mm/vmscan.c
> >> > >> >> >> > @@ -2251,6 +2251,10 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
> >> > >> >> >> >     unsigned long balanced = 0;
> >> > >> >> >> >     bool all_zones_ok = true;
> >> > >> >> >> >
> >> > >> >> >> > +   /* If kswapd has been running too long, just sleep */
> >> > >> >> >> > +   if (need_resched())
> >> > >> >> >> > +           return false;
> >> > >> >> >> > +
> >> > >> >> >>
> >> > >> >> >> Hmm... I don't like this patch so much. because this code does
> >> > >> >> >>
> >> > >> >> >> - don't sleep if kswapd got context switch at shrink_inactive_list
> >> > >> >> >
> >> > >> >> > This isn't entirely true:  need_resched() will be false, so we'll follow
> >> > >> >> > the normal path for determining whether to sleep or not, in effect
> >> > >> >> > leaving the current behaviour unchanged.
> >> > >> >> >
> >> > >> >> >> - sleep if kswapd didn't
> >> > >> >> >
> >> > >> >> > This also isn't entirely true: whether need_resched() is true at this
> >> > >> >> > point depends on a whole lot more that whether we did a context switch
> >> > >> >> > in shrink_inactive. It mostly depends on how long we've been running
> >> > >> >> > without giving up the CPU.  Generally that will mean we've been round
> >> > >> >> > the shrinker loop hundreds to thousands of times without sleeping.
> >> > >> >> >
> >> > >> >> >> It seems to be semi random behavior.
> >> > >> >> >
> >> > >> >> > Well, we have to do something.  Chris Mason first suspected the hang was
> >> > >> >> > a kswapd rescheduling problem a while ago.  We tried putting
> >> > >> >> > cond_rescheds() in several places in the vmscan code, but to no avail.
> >> > >> >>
> >> > >> >> Is it a result of  test with patch of Hannes(ie, !pgdat_balanced)?
> >> > >> >>
> >> > >> >> If it isn't, it would be nop regardless of putting cond_reshed at vmscan.c.
> >> > >> >> Because, although we complete zone balancing, kswapd doesn't sleep as
> >> > >> >> pgdat_balance returns wrong result. And at last VM calls
> >> > >> >> balance_pgdat. In this case, balance_pgdat returns without any work as
> >> > >> >> kswap couldn't find zones which have not enough free pages and goto
> >> > >> >> out. kswapd could repeat this work infinitely. So you don't have a
> >> > >> >> chance to call cond_resched.
> >> > >> >>
> >> > >> >> But if your test was with Hanne's patch, I am very curious how come
> >> > >> >> kswapd consumes CPU a lot.
> >> > >> >>
> >> > >> >> > The need_resched() in sleeping_prematurely() seems to be about the best
> >> > >> >> > option.  The other option might be just to put a cond_resched() in
> >> > >> >> > kswapd_try_to_sleep(), but that will really have about the same effect.
> >> > >> >>
> >> > >> >> I don't oppose it but before that, I think we have to know why kswapd
> >> > >> >> consumes CPU a lot although we applied Hannes' patch.
> >> > >> >>
> >> > >> >
> >> > >> > Because it's still possible for processes to allocate pages at the same
> >> > >> > rate kswapd is freeing them leading to a situation where kswapd does not
> >> > >> > consider the zone balanced for prolonged periods of time.
> >> > >>
> >> > >> We have cond_resched in shrink_page_list, shrink_slab and balance_pgdat.
> >> > >> So I think kswapd can be scheduled out although it's scheduled in
> >> > >> after a short time as task scheduled also need page reclaim. Although
> >> > >> all task in system need reclaim, kswapd cpu 99% consumption is a
> >> > >> natural result, I think.
> >> > >> Do I miss something?
> >> > >>
> >> > >
> >> > > Lets see;
> >> > >
> >> > > shrink_page_list() only applies if inactive pages were isolated
> >> > >        which in turn may not happen if all_unreclaimable is set in
> >> > >        shrink_zones(). If for whatver reason, all_unreclaimable is
> >> > >        set on all zones, we can miss calling cond_resched().
> >> > >
> >> > > shrink_slab only applies if we are reclaiming slab pages. If the first
> >> > >        shrinker returns -1, we do not call cond_resched(). If that
> >> > >        first shrinker is dcache and __GFP_FS is not set, direct
> >> > >        reclaimers will not shrink at all. However, if there are
> >> > >        enough of them running or if one of the other shrinkers
> >> > >        is running for a very long time, kswapd could be starved
> >> > >        acquiring the shrinker_rwsem and never reaching the
> >> > >        cond_resched().
> >> >
> >> > Don't we have to move cond_resched?
> >> >
> >> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> >> > index 292582c..633e761 100644
> >> > --- a/mm/vmscan.c
> >> > +++ b/mm/vmscan.c
> >> > @@ -231,8 +231,10 @@ unsigned long shrink_slab(struct shrink_control *shrink,
> >> >         if (scanned == 0)
> >> >                 scanned = SWAP_CLUSTER_MAX;
> >> >
> >> > -       if (!down_read_trylock(&shrinker_rwsem))
> >> > -               return 1;       /* Assume we'll be able to shrink next time */
> >> > +       if (!down_read_trylock(&shrinker_rwsem)) {
> >> > +               ret = 1;
> >> > +               goto out; /* Assume we'll be able to shrink next time */
> >> > +       }
> >> >
> >> >         list_for_each_entry(shrinker, &shrinker_list, list) {
> >> >                 unsigned long long delta;
> >> > @@ -280,12 +282,14 @@ unsigned long shrink_slab(struct shrink_control *shrink,
> >> >                         count_vm_events(SLABS_SCANNED, this_scan);
> >> >                         total_scan -= this_scan;
> >> >
> >> > -                       cond_resched();
> >> >                 }
> >> >
> >> >                 shrinker->nr += total_scan;
> >> > +               cond_resched();
> >> >         }
> >> >         up_read(&shrinker_rwsem);
> >> > +out:
> >> > +       cond_resched();
> >> >         return ret;
> >> >  }
> >> >
> >>
> >> This makes some sense for the exit path but if one or more of the
> >> shrinkers takes a very long time without sleeping (extremely long
> >> list searches for example) then kswapd will not call cond_resched()
> >> between shrinkers and still consume a lot of CPU.
> >>
> >> > >
> >> > > balance_pgdat() only calls cond_resched if the zones are not
> >> > >        balanced. For a high-order allocation that is balanced, it
> >> > >        checks order-0 again. During that window, order-0 might have
> >> > >        become unbalanced so it loops again for order-0 and returns
> >> > >        that was reclaiming for order-0 to kswapd(). It can then find
> >> > >        that a caller has rewoken kswapd for a high-order and re-enters
> >> > >        balance_pgdat() without ever have called cond_resched().
> >> >
> >> > If kswapd reclaims order-0 followed by high order, it would have a
> >> > chance to call cond_resched in shrink_page_list. But if all zones have
> >> > all_unreclaimable set, balance_pgdat could return without doing any work. Okay.
> >> > It does make sense.
> >> > By your scenario, someone wakes up kswapd with higher order, again.
> >> > So re-enters balance_pgdat without ever have called cond_resched.
> >> > But if someone wakes up higher order again, we can't have a chance to
> >> > call kswapd_try_to_sleep. So your patch effect would be nop, too.
> >> >
> >> > It would be better to put cond_resched after balance_pgdat?
> >> >
> >>
> >> Which will leave kswapd runnable instead of going to sleep but
> >> guarantees a scheduling point. Lets see if the problem is that
> >> cond_resched is being missed although if this was the case then patch
> >> 4 would truly be a no-op but Colin has already reported that patch 1 on
> >> its own didn't fix his problem. If the problem is sandybridge-specific
> >> where kswapd remains runnable and consuming large amounts of CPU in
> >> turbo mode then we know that there are other cond_resched() decisions
> >> that will need to be revisited.
> >>
> >> Colin or James, would you be willing to test with patch 1 from this
> >> series and Minchan's patch below? Thanks.
> >
> > This works OK fine.  Ran 250 test cycles for about 2 hours.
> 
> Thanks for the testing!.
> I would like to know exact patch for you to apply.
> My modification of inserting cond_resched is two.
> 
> 1) shrink_slab function
> 2) kswapd right after balance_pgdat.
> 
> 1) or 2) ?
> Or
> Both?
> 
I just followed Mel's request, so it was patch 1 from the series and
*just* the following:


>Colin or James, would you be willing to test with patch 1 from this
>series and Minchan's patch below? Thanks.
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 292582c..61c45d0 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2753,6 +2753,7 @@ static int kswapd(void *p)
>                 if (!ret) {
>                         trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
>                         order = balance_pgdat(pgdat, order, &classzone_idx);
> +                       cond_resched();
>                 }
>         }
>         return 0;

Colin


> Thanks



^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH 4/4] mm: vmscan: If kswapd has been running too long, allow it to sleep
  2011-05-18  0:26                 ` KOSAKI Motohiro
@ 2011-05-18  9:57                   ` Mel Gorman
  -1 siblings, 0 replies; 119+ messages in thread
From: Mel Gorman @ 2011-05-18  9:57 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: minchan.kim, James.Bottomley, akpm, colin.king, raghu.prabhu13,
	jack, chris.mason, cl, penberg, riel, hannes, linux-fsdevel,
	linux-mm, linux-kernel, linux-ext4

On Wed, May 18, 2011 at 09:26:09AM +0900, KOSAKI Motohiro wrote:
> >Lets see;
> >
> >shrink_page_list() only applies if inactive pages were isolated
> >	which in turn may not happen if all_unreclaimable is set in
> >	shrink_zones(). If for whatver reason, all_unreclaimable is
> >	set on all zones, we can miss calling cond_resched().
> >
> >shrink_slab only applies if we are reclaiming slab pages. If the first
> >	shrinker returns -1, we do not call cond_resched(). If that
> >	first shrinker is dcache and __GFP_FS is not set, direct
> >	reclaimers will not shrink at all. However, if there are
> >	enough of them running or if one of the other shrinkers
> >	is running for a very long time, kswapd could be starved
> >	acquiring the shrinker_rwsem and never reaching the
> >	cond_resched().
> 
> OK.
> 
> 
> >
> >balance_pgdat() only calls cond_resched if the zones are not
> >	balanced. For a high-order allocation that is balanced, it
> >	checks order-0 again. During that window, order-0 might have
> >	become unbalanced so it loops again for order-0 and returns
> >	that was reclaiming for order-0 to kswapd(). It can then find
> >	that a caller has rewoken kswapd for a high-order and re-enters
> >	balance_pgdat() without ever have called cond_resched().
> 
> Then, Shouldn't balance_pgdat() call cond_resched() unconditionally?
> The problem is NOT 100% cpu consumption. if kswapd will sleep, other
> processes need to reclaim old pages. The problem is, kswapd doesn't
> invoke context switch and other tasks hang-up.
> 

Which the shrink_slab patch does (either version). What's the gain from
sprinkling more cond_resched() around? If you think there is, submit
another pair of patches (include patch 1 from this series) but I'm not
seeing the advantage myself.

> 
> >While it appears unlikely, there are bad conditions which can result
> >in cond_resched() being avoided.
> >
> 
> 

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH 4/4] mm: vmscan: If kswapd has been running too long, allow it to sleep
  2011-05-18  5:44                     ` Minchan Kim
  (?)
@ 2011-05-18  9:58                       ` Mel Gorman
  -1 siblings, 0 replies; 119+ messages in thread
From: Mel Gorman @ 2011-05-18  9:58 UTC (permalink / raw)
  To: Minchan Kim
  Cc: KOSAKI Motohiro, James.Bottomley, akpm, colin.king,
	raghu.prabhu13, jack, chris.mason, cl, penberg, riel, hannes,
	linux-fsdevel, linux-mm, linux-kernel, linux-ext4

On Wed, May 18, 2011 at 02:44:48PM +0900, Minchan Kim wrote:
> On Wed, May 18, 2011 at 10:05 AM, KOSAKI Motohiro
> <kosaki.motohiro@jp.fujitsu.com> wrote:
> >> It would be better to put cond_resched after balance_pgdat?
> >>
> >> diff --git a/mm/vmscan.c b/mm/vmscan.c
> >> index 292582c..61c45d0 100644
> >> --- a/mm/vmscan.c
> >> +++ b/mm/vmscan.c
> >> @@ -2753,6 +2753,7 @@ static int kswapd(void *p)
> >>                 if (!ret) {
> >>                         trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
> >>                         order = balance_pgdat(pgdat, order, &classzone_idx);
> >> +                       cond_resched();
> >>                 }
> >>         }
> >>         return 0;
> >>
> >>>>> While it appears unlikely, there are bad conditions which can result
> >>>
> >>> in cond_resched() being avoided.
> >
> > Every reclaim priority decreasing or every shrink_zone() calling makes more
> > fine grained preemption. I think.
> 
> It could be.
> But in direct reclaim case, I have a concern about losing pages
> reclaimed to other tasks by preemption.
> 
> Hmm,, anyway, we also needs test.
> Hmm,, how long should we bother them(Colins and James)?
> First of all, Let's fix one just between us and ask test to them and
> send the last patch to akpm.
> 
> 1. shrink_slab
> 2. right after balance_pgdat
> 3. shrink_zone
> 4. reclaim priority decreasing routine.
> 
> Now, I vote 1) and 2).
> 

I've already submitted a pair of patches for option 1. I don't think
option 2 gains us anything. I think it's more likely we should worry
about all_unreclaimable being set when shrink_slab is returning 0 and we
are encountering so many dirty pages that pages_scanned is high enough.
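
For context, a minimal sketch of the heuristic being referred to (my
paraphrase of the zone_reclaimable()-style check in mm/vmscan.c of this
period; treat the factor of 6 as an assumption rather than a quote): a
zone stops being considered reclaimable once it has been scanned several
times over without progress, which is easy to hit when shrink_slab()
returns 0 and most of the pages found are dirty.

static bool zone_is_reclaimable_sketch(unsigned long pages_scanned,
				       unsigned long reclaimable_pages)
{
	return pages_scanned < reclaimable_pages * 6;
}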

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH 2/4] mm: slub: Do not wake kswapd for SLUBs speculative high-order allocations
  2011-05-18  6:09       ` Pekka Enberg
@ 2011-05-18 17:21         ` Christoph Lameter
  -1 siblings, 0 replies; 119+ messages in thread
From: Christoph Lameter @ 2011-05-18 17:21 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: David Rientjes, Mel Gorman, Andrew Morton, James Bottomley,
	Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason,
	Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm,
	linux-kernel, linux-ext4

On Wed, 18 May 2011, Pekka Enberg wrote:

> On 5/17/11 12:10 AM, David Rientjes wrote:
> > On Fri, 13 May 2011, Mel Gorman wrote:
> >
> > > To avoid locking and per-cpu overhead, SLUB optimisically uses
> > > high-order allocations and falls back to lower allocations if they
> > > fail.  However, by simply trying to allocate, kswapd is woken up to
> > > start reclaiming at that order. On a desktop system, two users report
> > > that the system is getting locked up with kswapd using large amounts
> > > of CPU.  Using SLAB instead of SLUB made this problem go away.
> > >
> > > This patch prevents kswapd being woken up for high-order allocations.
> > > Testing indicated that with this patch applied, the system was much
> > > harder to hang and even when it did, it eventually recovered.
> > >
> > > Signed-off-by: Mel Gorman<mgorman@suse.de>
> > Acked-by: David Rientjes<rientjes@google.com>
>
> Christoph? I think this patch is sane although the original rationale was to
> workaround kswapd problems.

I am mostly fine with it. The concern that I have is that if there is a
long series of high-order allocs then at some point you would want
kswapd to be triggered instead of the high-order allocs constantly
failing.

Can we have a "trigger once in a while" functionality?
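
One possible shape for that, as a hypothetical sketch (the flag choices
and the interval are assumptions on my part, not what the series does):
suppress the kswapd wakeup for the speculative high-order attempt most
of the time, but let it through every Nth attempt so a sustained run of
failing high-order allocations still ends up kicking reclaim.

#define SPECULATIVE_WAKE_INTERVAL	1024

static atomic_t speculative_attempts = ATOMIC_INIT(0);

static struct page *alloc_slab_pages_sketch(gfp_t flags, int high_order,
					    int min_order)
{
	gfp_t alloc_gfp = flags | __GFP_NOWARN | __GFP_NORETRY;
	struct page *page;

	/* usual case: do not wake kswapd for the speculative attempt */
	if (atomic_inc_return(&speculative_attempts) %
	    SPECULATIVE_WAKE_INTERVAL)
		alloc_gfp |= __GFP_NO_KSWAPD;

	page = alloc_pages(alloc_gfp, high_order);
	if (!page)
		page = alloc_pages(flags, min_order);	/* low-order fallback */

	return page;
}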


^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH 4/4] mm: vmscan: If kswapd has been running too long, allow it to sleep
  2011-05-18  9:58                       ` Mel Gorman
@ 2011-05-18 22:55                         ` Minchan Kim
  -1 siblings, 0 replies; 119+ messages in thread
From: Minchan Kim @ 2011-05-18 22:55 UTC (permalink / raw)
  To: Mel Gorman
  Cc: KOSAKI Motohiro, James.Bottomley, akpm, colin.king,
	raghu.prabhu13, jack, chris.mason, cl, penberg, riel, hannes,
	linux-fsdevel, linux-mm, linux-kernel, linux-ext4

On Wed, May 18, 2011 at 6:58 PM, Mel Gorman <mgorman@suse.de> wrote:
> On Wed, May 18, 2011 at 02:44:48PM +0900, Minchan Kim wrote:
>> On Wed, May 18, 2011 at 10:05 AM, KOSAKI Motohiro
>> <kosaki.motohiro@jp.fujitsu.com> wrote:
>> >> It would be better to put cond_resched after balance_pgdat?
>> >>
>> >> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> >> index 292582c..61c45d0 100644
>> >> --- a/mm/vmscan.c
>> >> +++ b/mm/vmscan.c
>> >> @@ -2753,6 +2753,7 @@ static int kswapd(void *p)
>> >>                 if (!ret) {
>> >>                         trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
>> >>                         order = balance_pgdat(pgdat, order, &classzone_idx);
>> >> +                       cond_resched();
>> >>                 }
>> >>         }
>> >>         return 0;
>> >>
>> >>>>> While it appears unlikely, there are bad conditions which can result
>> >>>
>> >>> in cond_resched() being avoided.
>> >
>> > Every reclaim priority decreasing or every shrink_zone() calling makes more
>> > fine grained preemption. I think.
>>
>> It could be.
>> But in direct reclaim case, I have a concern about losing pages
>> reclaimed to other tasks by preemption.
>>
>> Hmm,, anyway, we also needs test.
>> Hmm,, how long should we bother them(Colins and James)?
>> First of all, Let's fix one just between us and ask test to them and
>> send the last patch to akpm.
>>
>> 1. shrink_slab
>> 2. right after balance_pgdat
>> 3. shrink_zone
>> 4. reclaim priority decreasing routine.
>>
>> Now, I vote 1) and 2).
>>
>
> I've already submitted a pair of patches for option 1. I don't think
> option 2 gains us anything. I think it's more likely we should worry
> about all_unreclaimable being set when shrink_slab is returning 0 and we
> are encountering so many dirty pages that pages_scanned is high enough.

Okay.

Colin reported he had no problem with patch 1 in this series plus mine
(i.e. just the cond_resched right after the balance_pgdat call, without
the shrink_slab patch).

If Colin's test is successful, I don't insist on mine.
(I don't want to drag this on for days :( )
If KOSAKI agrees, let's ask Colin to run that test and confirm the
final version.

KOSAKI, could you post your opinion?

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH 4/4] mm: vmscan: If kswapd has been running too long, allow it to sleep
  2011-05-18 22:55                         ` Minchan Kim
@ 2011-05-18 23:54                           ` KOSAKI Motohiro
  -1 siblings, 0 replies; 119+ messages in thread
From: KOSAKI Motohiro @ 2011-05-18 23:54 UTC (permalink / raw)
  To: minchan.kim
  Cc: mgorman, James.Bottomley, akpm, colin.king, raghu.prabhu13, jack,
	chris.mason, cl, penberg, riel, hannes, linux-fsdevel, linux-mm,
	linux-kernel, linux-ext4

>> I've already submitted a pair of patches for option 1. I don't think
>> option 2 gains us anything. I think it's more likely we should worry
>> about all_unreclaimable being set when shrink_slab is returning 0 and we
>> are encountering so many dirty pages that pages_scanned is high enough.
>
> Okay.
>
> Colin reported he had no problem with patch 1 in this series and
> mine(ie, just cond_resched right after balance_pgdat call without no
> patch of shrink_slab).
>
> If Colin's test is successful, I don't insist on mine.
> (I don't want to drag on for days :( )
> If KOSAKI agree, let's ask the test to Colin and confirm our last test.
>
> KOSAKI. Could you post a your opinion?

Yeah.
I have no reason to ignore Colin's test result either.


^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH] mm: vmscan: Correctly check if reclaimer should schedule during shrink_slab
  2011-05-17 16:15                       ` Mel Gorman
  (?)
@ 2011-05-19  0:03                         ` Minchan Kim
  -1 siblings, 0 replies; 119+ messages in thread
From: Minchan Kim @ 2011-05-19  0:03 UTC (permalink / raw)
  To: Mel Gorman
  Cc: akpm, Colin Ian King, James Bottomley, KOSAKI Motohiro,
	raghu.prabhu13, jack, chris.mason, cl, penberg, riel, hannes,
	linux-fsdevel, linux-mm, linux-kernel, linux-ext4

On Wed, May 18, 2011 at 1:15 AM, Mel Gorman <mgorman@suse.de> wrote:
> It has been reported on some laptops that kswapd is consuming large
> amounts of CPU and not being scheduled when SLUB is enabled during
> large amounts of file copying. It is expected that this is due to
> kswapd missing every cond_resched() point because:
>
> shrink_page_list() calls cond_resched() if inactive pages were isolated
>        which in turn may not happen if all_unreclaimable is set in
>        shrink_zones(). If for whatever reason, all_unreclaimable is
>        set on all zones, we can miss calling cond_resched().
>
> balance_pgdat() only calls cond_resched if the zones are not
>        balanced. For a high-order allocation that is balanced, it
>        checks order-0 again. During that window, order-0 might have
>        become unbalanced so it loops again for order-0 and returns
>        that it was reclaiming for order-0 to kswapd(). It can then
>        find that a caller has rewoken kswapd for a high-order and
>        re-enters balance_pgdat() without ever calling cond_resched().
>
> shrink_slab only calls cond_resched() if we are reclaiming slab
>        pages. If there are a large number of direct reclaimers, the
>        shrinker_rwsem can be contended and prevent kswapd calling
>        cond_resched().
>
> This patch modifies the shrink_slab() case. If the semaphore is
> contended, the caller will still check cond_resched(). After each
> successful call into a shrinker, the check for cond_resched() is
> still necessary in case one shrinker call is particularly slow.
>
> This patch replaces
> mm-vmscan-if-kswapd-has-been-running-too-long-allow-it-to-sleep.patch
> in -mm.
>
> [mgorman@suse.de: Preserve call to cond_resched after each call into shrinker]
> From: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
> Signed-off-by: Mel Gorman <mgorman@suse.de>

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH] mm: vmscan: Correctly check if reclaimer should schedule during shrink_slab
  2011-05-17 16:15                       ` Mel Gorman
  (?)
@ 2011-05-19  0:09                         ` Minchan Kim
  -1 siblings, 0 replies; 119+ messages in thread
From: Minchan Kim @ 2011-05-19  0:09 UTC (permalink / raw)
  To: Mel Gorman, Colin Ian King
  Cc: akpm, James Bottomley, KOSAKI Motohiro, raghu.prabhu13, jack,
	chris.mason, cl, penberg, riel, hannes, linux-fsdevel, linux-mm,
	linux-kernel, linux-ext4

Hi Colin.

Sorry for bothering you. :(
I hope this is the last test.

We (Mel, KOSAKI and I) have settled on an approach.

Could you test the patch below together with patch [1/4] of Mel's series
(i.e. the !pgdat_balanced check in sleeping_prematurely)?
If it is successful, we will try to merge this version instead of the
various cond_resched()-sprinkling versions.


On Wed, May 18, 2011 at 1:15 AM, Mel Gorman <mgorman@suse.de> wrote:
> It has been reported on some laptops that kswapd is consuming large
> amounts of CPU and not being scheduled when SLUB is enabled during
> large amounts of file copying. It is expected that this is due to
> kswapd missing every cond_resched() point because:
>
> shrink_page_list() calls cond_resched() if inactive pages were isolated
>        which in turn may not happen if all_unreclaimable is set in
>        shrink_zones(). If for whatever reason, all_unreclaimable is
>        set on all zones, we can miss calling cond_resched().
>
> balance_pgdat() only calls cond_resched if the zones are not
>        balanced. For a high-order allocation that is balanced, it
>        checks order-0 again. During that window, order-0 might have
>        become unbalanced so it loops again for order-0 and returns
>        that it was reclaiming for order-0 to kswapd(). It can then
>        find that a caller has rewoken kswapd for a high-order and
>        re-enters balance_pgdat() without ever calling cond_resched().
>
> shrink_slab only calls cond_resched() if we are reclaiming slab
>        pages. If there are a large number of direct reclaimers, the
>        shrinker_rwsem can be contended and prevent kswapd calling
>        cond_resched().
>
> This patch modifies the shrink_slab() case. If the semaphore is
> contended, the caller will still check cond_resched(). After each
> successful call into a shrinker, the check for cond_resched() is
> still necessary in case one shrinker call is particularly slow.
>
> This patch replaces
> mm-vmscan-if-kswapd-has-been-running-too-long-allow-it-to-sleep.patch
> in -mm.
>
> [mgorman@suse.de: Preserve call to cond_resched after each call into shrinker]
> From: Minchan Kim <minchan.kim@gmail.com>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> ---
>  mm/vmscan.c |    9 +++++++--
>  1 files changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index af24d1e..0bed248 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -230,8 +230,11 @@ unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask,
>        if (scanned == 0)
>                scanned = SWAP_CLUSTER_MAX;
>
> -       if (!down_read_trylock(&shrinker_rwsem))
> -               return 1;       /* Assume we'll be able to shrink next time */
> +       if (!down_read_trylock(&shrinker_rwsem)) {
> +               /* Assume we'll be able to shrink next time */
> +               ret = 1;
> +               goto out;
> +       }
>
>        list_for_each_entry(shrinker, &shrinker_list, list) {
>                unsigned long long delta;
> @@ -282,6 +285,8 @@ unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask,
>                shrinker->nr += total_scan;
>        }
>        up_read(&shrinker_rwsem);
> +out:
> +       cond_resched();
>        return ret;
>  }
>
>
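
(As an aside, to make the balance_pgdat() window described above concrete,
here is rough pseudo-code of the control flow as I read it -- the helper
name is made up and this is not the real kernel code:)

	/*
	 * Unpatched behaviour, simplified: kswapd can bounce between
	 * balance_pgdat() and a fresh high-order wakeup without ever
	 * reaching a cond_resched() point.
	 */
	for ( ; ; ) {
		/* hypothetical helper standing in for the sleep/wake logic */
		order = wait_for_kswapd_wakeup(pgdat);
		/*
		 * May end up rebalancing for order-0 and return, even
		 * though it was woken for a high order.
		 */
		order = balance_pgdat(pgdat, order, &classzone_idx);
		/*
		 * No cond_resched() here; if a caller has already rewoken
		 * kswapd for a high order, we loop straight back into
		 * balance_pgdat().
		 */
	}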



-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH] mm: vmscan: Correctly check if reclaimer should schedule during shrink_slab
  2011-05-19  0:09                         ` Minchan Kim
@ 2011-05-19 11:36                           ` Colin Ian King
  -1 siblings, 0 replies; 119+ messages in thread
From: Colin Ian King @ 2011-05-19 11:36 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Mel Gorman, akpm, James Bottomley, KOSAKI Motohiro,
	raghu.prabhu13, jack, chris.mason, cl, penberg, riel, hannes,
	linux-fsdevel, linux-mm, linux-kernel, linux-ext4

On Thu, 2011-05-19 at 09:09 +0900, Minchan Kim wrote:
> Hi Colin.
> 
> Sorry for bothering you. :(

No problem at all, I'm very happy to re-test.

> I hope this is the last test.
> 
> We (Mel, KOSAKI and I) have settled on an approach.
> 
> Could you test the patch below together with patch [1/4] of Mel's series
> (i.e. the !pgdat_balanced check in sleeping_prematurely)?
> If it is successful, we will try to merge this version instead of the
> various cond_resched()-sprinkling versions.

Tested with the patch below + patch [1/4] of Mel's series. 300 cycles,
2.5 hrs of soak testing: works OK.

Colin
> 
> 
> On Wed, May 18, 2011 at 1:15 AM, Mel Gorman <mgorman@suse.de> wrote:
> > It has been reported on some laptops that kswapd is consuming large
> > amounts of CPU and not being scheduled when SLUB is enabled during
> > large amounts of file copying. It is expected that this is due to
> > kswapd missing every cond_resched() point because:
> >
> > shrink_page_list() calls cond_resched() if inactive pages were isolated
> >        which in turn may not happen if all_unreclaimable is set in
> >        shrink_zones(). If for whatever reason, all_unreclaimable is
> >        set on all zones, we can miss calling cond_resched().
> >
> > balance_pgdat() only calls cond_resched if the zones are not
> >        balanced. For a high-order allocation that is balanced, it
> >        checks order-0 again. During that window, order-0 might have
> >        become unbalanced so it loops again for order-0 and returns
> >        that it was reclaiming for order-0 to kswapd(). It can then
> >        find that a caller has rewoken kswapd for a high-order and
> >        re-enters balance_pgdat() without ever calling cond_resched().
> >
> > shrink_slab only calls cond_resched() if we are reclaiming slab
> >        pages. If there are a large number of direct reclaimers, the
> >        shrinker_rwsem can be contended and prevent kswapd calling
> >        cond_resched().
> >
> > This patch modifies the shrink_slab() case. If the semaphore is
> > contended, the caller will still check cond_resched(). After each
> > successful call into a shrinker, the check for cond_resched() is
> > still necessary in case one shrinker call is particularly slow.
> >
> > This patch replaces
> > mm-vmscan-if-kswapd-has-been-running-too-long-allow-it-to-sleep.patch
> > in -mm.
> >
> > [mgorman@suse.de: Preserve call to cond_resched after each call into shrinker]
> > From: Minchan Kim <minchan.kim@gmail.com>
> > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > ---
> >  mm/vmscan.c |    9 +++++++--
> >  1 files changed, 7 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index af24d1e..0bed248 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -230,8 +230,11 @@ unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask,
> >        if (scanned == 0)
> >                scanned = SWAP_CLUSTER_MAX;
> >
> > -       if (!down_read_trylock(&shrinker_rwsem))
> > -               return 1;       /* Assume we'll be able to shrink next time */
> > +       if (!down_read_trylock(&shrinker_rwsem)) {
> > +               /* Assume we'll be able to shrink next time */
> > +               ret = 1;
> > +               goto out;
> > +       }
> >
> >        list_for_each_entry(shrinker, &shrinker_list, list) {
> >                unsigned long long delta;
> > @@ -282,6 +285,8 @@ unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask,
> >                shrinker->nr += total_scan;
> >        }
> >        up_read(&shrinker_rwsem);
> > +out:
> > +       cond_resched();
> >        return ret;
> >  }
> >
> >
> 
> 
> 



^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH] mm: vmscan: Correctly check if reclaimer should schedule during shrink_slab
  2011-05-19 11:36                           ` Colin Ian King
  (?)
@ 2011-05-20  0:06                             ` Minchan Kim
  -1 siblings, 0 replies; 119+ messages in thread
From: Minchan Kim @ 2011-05-20  0:06 UTC (permalink / raw)
  To: Colin Ian King
  Cc: Mel Gorman, akpm, James Bottomley, KOSAKI Motohiro,
	raghu.prabhu13, jack, chris.mason, cl, penberg, riel, hannes,
	linux-fsdevel, linux-mm, linux-kernel, linux-ext4

On Thu, May 19, 2011 at 8:36 PM, Colin Ian King
<colin.king@canonical.com> wrote:
> On Thu, 2011-05-19 at 09:09 +0900, Minchan Kim wrote:
>> Hi Colin.
>>
>> Sorry for bothering you. :(
>
> No problem at all, I'm very happy to re-test.
>
>> I hope this is the last test.
>>
>> We (Mel, KOSAKI and I) have settled on an approach.
>>
>> Could you test the patch below together with patch [1/4] of Mel's series
>> (i.e. the !pgdat_balanced check in sleeping_prematurely)?
>> If it is successful, we will try to merge this version instead of the
>> various cond_resched()-sprinkling versions.
>
> Tested with the patch below + patch [1/4] of Mel's series. 300 cycles,
> 2.5 hrs of soak testing: works OK.
>
> Colin

Thanks, Colin.
We are approaching a conclusion thanks to your help. :)

Mel, KOSAKI:
I will ask Andrew Lutomirski to test it as well.
If he doesn't see a problem, let's go with it.

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 119+ messages in thread

end of thread, other threads:[~2011-05-20  0:06 UTC | newest]

Thread overview: 119+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-05-13 14:03 [PATCH 0/4] Reduce impact to overall system of SLUB using high-order allocations V2 Mel Gorman
2011-05-13 14:03 ` Mel Gorman
2011-05-13 14:03 ` [PATCH 1/4] mm: vmscan: Correct use of pgdat_balanced in sleeping_prematurely Mel Gorman
2011-05-13 14:03   ` Mel Gorman
2011-05-13 14:28   ` Johannes Weiner
2011-05-13 14:28     ` Johannes Weiner
2011-05-14 16:30   ` Minchan Kim
2011-05-14 16:30     ` Minchan Kim
2011-05-16 14:30   ` Rik van Riel
2011-05-16 14:30     ` Rik van Riel
2011-05-13 14:03 ` [PATCH 2/4] mm: slub: Do not wake kswapd for SLUBs speculative high-order allocations Mel Gorman
2011-05-13 14:03   ` Mel Gorman
2011-05-16 21:10   ` David Rientjes
2011-05-16 21:10     ` David Rientjes
2011-05-18  6:09     ` Pekka Enberg
2011-05-18  6:09       ` Pekka Enberg
2011-05-18 17:21       ` Christoph Lameter
2011-05-18 17:21         ` Christoph Lameter
2011-05-13 14:03 ` [PATCH 3/4] mm: slub: Do not take expensive steps " Mel Gorman
2011-05-13 14:03   ` Mel Gorman
2011-05-16 21:16   ` David Rientjes
2011-05-16 21:16     ` David Rientjes
2011-05-17  8:42     ` Mel Gorman
2011-05-17  8:42       ` Mel Gorman
2011-05-17 13:51       ` Christoph Lameter
2011-05-17 13:51         ` Christoph Lameter
2011-05-17 16:22         ` Mel Gorman
2011-05-17 16:22           ` Mel Gorman
2011-05-17 17:52           ` Christoph Lameter
2011-05-17 17:52             ` Christoph Lameter
2011-05-17 19:35             ` David Rientjes
2011-05-17 19:35               ` David Rientjes
2011-05-17 19:31       ` David Rientjes
2011-05-17 19:31         ` David Rientjes
2011-05-13 14:03 ` [PATCH 4/4] mm: vmscan: If kswapd has been running too long, allow it to sleep Mel Gorman
2011-05-13 14:03   ` Mel Gorman
2011-05-15 10:27   ` KOSAKI Motohiro
2011-05-15 10:27     ` KOSAKI Motohiro
2011-05-16  4:21     ` James Bottomley
2011-05-16  4:21       ` James Bottomley
2011-05-16  5:04       ` Minchan Kim
2011-05-16  5:04         ` Minchan Kim
2011-05-16  8:45         ` Mel Gorman
2011-05-16  8:45           ` Mel Gorman
2011-05-16  8:45           ` Mel Gorman
2011-05-16  8:58           ` Minchan Kim
2011-05-16  8:58             ` Minchan Kim
2011-05-16  8:58             ` Minchan Kim
2011-05-16 10:27             ` Mel Gorman
2011-05-16 10:27               ` Mel Gorman
2011-05-16 10:27               ` Mel Gorman
2011-05-16 23:50               ` Minchan Kim
2011-05-16 23:50                 ` Minchan Kim
2011-05-17  0:48                 ` Minchan Kim
2011-05-17  0:48                   ` Minchan Kim
2011-05-17  0:48                   ` Minchan Kim
2011-05-17 10:38                 ` Mel Gorman
2011-05-17 10:38                   ` Mel Gorman
2011-05-17 10:38                   ` Mel Gorman
2011-05-17 13:50                   ` Colin Ian King
2011-05-17 13:50                     ` Colin Ian King
2011-05-17 16:15                     ` [PATCH] mm: vmscan: Correctly check if reclaimer should schedule during shrink_slab Mel Gorman
2011-05-17 16:15                       ` Mel Gorman
2011-05-18  0:45                       ` KOSAKI Motohiro
2011-05-18  0:45                         ` KOSAKI Motohiro
2011-05-19  0:03                       ` Minchan Kim
2011-05-19  0:03                         ` Minchan Kim
2011-05-19  0:03                         ` Minchan Kim
2011-05-19  0:09                       ` Minchan Kim
2011-05-19  0:09                         ` Minchan Kim
2011-05-19  0:09                         ` Minchan Kim
2011-05-19 11:36                         ` Colin Ian King
2011-05-19 11:36                           ` Colin Ian King
2011-05-20  0:06                           ` Minchan Kim
2011-05-20  0:06                             ` Minchan Kim
2011-05-20  0:06                             ` Minchan Kim
2011-05-18  4:19                     ` [PATCH 4/4] mm: vmscan: If kswapd has been running too long, allow it to sleep Minchan Kim
2011-05-18  4:19                       ` Minchan Kim
2011-05-18  7:39                       ` Colin Ian King
2011-05-18  7:39                         ` Colin Ian King
2011-05-18  4:09                   ` James Bottomley
2011-05-18  4:09                     ` James Bottomley
2011-05-18  1:05                 ` KOSAKI Motohiro
2011-05-18  1:05                   ` KOSAKI Motohiro
2011-05-18  5:44                   ` Minchan Kim
2011-05-18  5:44                     ` Minchan Kim
2011-05-18  5:44                     ` Minchan Kim
2011-05-18  6:05                     ` KOSAKI Motohiro
2011-05-18  6:05                       ` KOSAKI Motohiro
2011-05-18  9:58                     ` Mel Gorman
2011-05-18  9:58                       ` Mel Gorman
2011-05-18  9:58                       ` Mel Gorman
2011-05-18 22:55                       ` Minchan Kim
2011-05-18 22:55                         ` Minchan Kim
2011-05-18 23:54                         ` KOSAKI Motohiro
2011-05-18 23:54                           ` KOSAKI Motohiro
2011-05-18  0:26               ` KOSAKI Motohiro
2011-05-18  0:26                 ` KOSAKI Motohiro
2011-05-18  9:57                 ` Mel Gorman
2011-05-18  9:57                   ` Mel Gorman
2011-05-16  8:45     ` Mel Gorman
2011-05-16  8:45       ` Mel Gorman
2011-05-16 14:30   ` Rik van Riel
2011-05-16 14:30     ` Rik van Riel
2011-05-13 15:19 ` [PATCH 0/4] Reduce impact to overall system of SLUB using high-order allocations V2 James Bottomley
2011-05-13 15:19   ` James Bottomley
2011-05-13 15:19   ` James Bottomley
2011-05-13 15:52   ` Mel Gorman
2011-05-13 15:52     ` Mel Gorman
2011-05-13 15:21 ` Christoph Lameter
2011-05-13 15:21   ` Christoph Lameter
2011-05-13 15:43   ` Mel Gorman
2011-05-13 15:43     ` Mel Gorman
2011-05-14  8:34 ` Colin Ian King
2011-05-14  8:34   ` Colin Ian King
2011-05-16  8:37   ` Mel Gorman
2011-05-16  8:37     ` Mel Gorman
2011-05-16 11:24     ` Colin Ian King
2011-05-16 11:24       ` Colin Ian King
