* [PATCH 0/4] Reduce impact to overall system of SLUB using high-order allocations V2

From: Mel Gorman @ 2011-05-13 14:03 UTC
To: Andrew Morton
Cc: James Bottomley, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason,
    Christoph Lameter, Pekka Enberg, Rik van Riel, Johannes Weiner,
    linux-fsdevel, linux-mm, linux-kernel, linux-ext4, Mel Gorman

Changelog since V1
o kswapd should sleep if need_resched
o Remove __GFP_REPEAT from GFP flags when speculatively using high
  orders so direct reclaim/compaction exits earlier
o Remove __GFP_NORETRY for correctness
o Correct logic in sleeping_prematurely
o Leave SLUB using the default slub_max_order

There are a few reports of people experiencing hangs when copying
large amounts of data with kswapd using a large amount of CPU which
appear to be due to recent reclaim changes. SLUB using high orders
is the trigger but not the root cause as SLUB has been using high
orders for a while. The following four patches aim to fix the
problems in reclaim while reducing the cost for SLUB of using those
high orders.

Patch 1 corrects logic introduced by commit [1741c877: mm: kswapd:
	keep kswapd awake for high-order allocations until a percentage
	of the node is balanced] to allow kswapd to go to sleep when
	balanced for high orders.

Patch 2 prevents kswapd waking up in response to SLUB's speculative
	use of high orders.

Patch 3 further reduces the cost by preventing SLUB from entering
	direct compaction or reclaim paths on the grounds that falling
	back to order-0 should be cheaper.

Patch 4 notes that even when kswapd is failing to keep up with
	allocation requests, it should still go to sleep when its quota
	has expired to prevent it spinning.

My own data on this is not great. I haven't really been able to
reproduce the same problem locally.

The test case is simple. "download tar" wgets a large tar file and
stores it locally. "unpack" is expanding it (15 times physical RAM in
this case) and "delete source dirs" is the tarfile being deleted again.
I also experimented with having the tar copied numerous times and into
deeper directories to increase the size but the results were not
particularly interesting so I left it as one tar. In the background,
applications are launched and timed to vaguely simulate activity on
the desktop and to measure how long it takes applications to start.

Test server, 4 CPU threads, x86_64, 2G of RAM, no PREEMPT, no
COMPACTION, X running

LARGE COPY AND UNTAR
                       vanilla     fixprematurely  kswapd-nowwake  slub-noexstep   kswapdsleep
download tar        95 ( 0.00%)     94 ( 1.06%)     94 ( 1.06%)     94 ( 1.06%)    94 ( 1.06%)
unpack tar         654 ( 0.00%)    649 ( 0.77%)    655 (-0.15%)    589 (11.04%)   598 ( 9.36%)
copy source files    0 ( 0.00%)      0 ( 0.00%)      0 ( 0.00%)      0 ( 0.00%)     0 ( 0.00%)
delete source dirs 327 ( 0.00%)    334 (-2.10%)    318 ( 2.83%)    325 ( 0.62%)   320 ( 2.19%)

MMTests Statistics: duration
User/Sys Time Running Test (seconds)    1139.7   1142.55   1149.78   1109.32   1113.26
Total Elapsed Time (seconds)           1341.59   1342.45   1324.90   1271.02   1247.35

MMTests Statistics: application launch
evolution-wait30     mean     34.92   34.96   34.92   34.92   35.08
gnome-terminal-find  mean      7.96    7.96    8.76    7.80    7.96
iceweasel-table      mean      7.93    7.81    7.73    7.65    7.88

evolution-wait30     stddev    0.96    1.22    1.27    1.20    1.15
gnome-terminal-find  stddev    3.02    3.09    3.51    2.99    3.02
iceweasel-table      stddev    1.05    0.90    1.09    1.11    1.11

Having SLUB avoid expensive steps in reclaim improves performance by
quite a bit with the overall test completing 1.5 minutes faster.
Application launch times were not really affected but it's not
something my test machine was suffering from in the first place so
it's not really conclusive. The kswapd patches also did not appear to
help but again, the test machine wasn't suffering that problem.

These patches are against 2.6.39-rc7. Again, testing would be
appreciated.

 Documentation/vm/slub.txt |    2 +-
 mm/page_alloc.c           |    3 ++-
 mm/slub.c                 |    5 +++--
 3 files changed, 6 insertions(+), 4 deletions(-)

 mm/page_alloc.c |    3 ++-
 mm/slub.c       |    3 ++-
 mm/vmscan.c     |    6 +++++-
 3 files changed, 9 insertions(+), 3 deletions(-)

-- 
1.7.3.4
* [PATCH 1/4] mm: vmscan: Correct use of pgdat_balanced in sleeping_prematurely

From: Mel Gorman @ 2011-05-13 14:03 UTC
To: Andrew Morton
Cc: James Bottomley, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason,
    Christoph Lameter, Pekka Enberg, Rik van Riel, Johannes Weiner,
    linux-fsdevel, linux-mm, linux-kernel, linux-ext4, Mel Gorman

Johannes Weiner pointed out that the logic in commit [1741c877: mm:
kswapd: keep kswapd awake for high-order allocations until a percentage
of the node is balanced] is backwards. Instead of allowing kswapd to go
to sleep when balancing for high order allocations, it keeps kswapd
running uselessly.

From-but-was-not-signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Will-sign-off-after-Johannes: Mel Gorman <mgorman@suse.de>
---
 mm/vmscan.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index f6b435c..af24d1e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2286,7 +2286,7 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
 	 * must be balanced
 	 */
 	if (order)
-		return pgdat_balanced(pgdat, balanced, classzone_idx);
+		return !pgdat_balanced(pgdat, balanced, classzone_idx);
 	else
 		return !all_zones_ok;
 }
-- 
1.7.3.4
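To make the inversion concrete, here is a minimal standalone sketch of
the predicate's contract; the stub and values are illustrative, not the
kernel's actual implementation. sleeping_prematurely() answers "would
kswapd be going to sleep too early?", so for a high-order wakeup it
must return true only while the node is still unbalanced:

#include <stdbool.h>
#include <stdio.h>

/* Stand-in for the real watermark check in mm/vmscan.c. */
static bool pgdat_balanced(bool node_is_balanced)
{
	return node_is_balanced;
}

static bool sleeping_prematurely(int order, bool balanced, bool all_zones_ok)
{
	if (order)
		return !pgdat_balanced(balanced);	/* the corrected line */
	else
		return !all_zones_ok;
}

int main(void)
{
	/* Balanced node: not premature, kswapd may sleep (prints 0). */
	printf("%d\n", sleeping_prematurely(3, true, true));
	/* Unbalanced node: sleeping now would be premature (prints 1). */
	printf("%d\n", sleeping_prematurely(3, false, false));
	return 0;
}

Without the negation the two cases are swapped: kswapd keeps scanning
on an already-balanced node and naps on an unbalanced one, which
matches the reported behaviour of kswapd spinning uselessly.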
* Re: [PATCH 1/4] mm: vmscan: Correct use of pgdat_balanced in sleeping_prematurely

From: Johannes Weiner @ 2011-05-13 14:28 UTC
To: Mel Gorman
Cc: Andrew Morton, James Bottomley, Colin King, Raghavendra D Prabhu,
    Jan Kara, Chris Mason, Christoph Lameter, Pekka Enberg, Rik van Riel,
    linux-fsdevel, linux-mm, linux-kernel, linux-ext4

On Fri, May 13, 2011 at 03:03:21PM +0100, Mel Gorman wrote:
> Johannes Weiner pointed out that the logic in commit [1741c877: mm:
> kswapd: keep kswapd awake for high-order allocations until a percentage
> of the node is balanced] is backwards. Instead of allowing kswapd to go
> to sleep when balancing for high order allocations, it keeps kswapd
> running uselessly.
>
> From-but-was-not-signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Thanks for picking it up, Mel.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

> Will-sign-off-after-Johannes: Mel Gorman <mgorman@suse.de>
> ---
>  mm/vmscan.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index f6b435c..af24d1e 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2286,7 +2286,7 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
>  	 * must be balanced
>  	 */
>  	if (order)
> -		return pgdat_balanced(pgdat, balanced, classzone_idx);
> +		return !pgdat_balanced(pgdat, balanced, classzone_idx);
>  	else
>  		return !all_zones_ok;
>  }
> --
> 1.7.3.4
* Re: [PATCH 1/4] mm: vmscan: Correct use of pgdat_balanced in sleeping_prematurely

From: Minchan Kim @ 2011-05-14 16:30 UTC
To: Mel Gorman
Cc: Andrew Morton, James Bottomley, Colin King, Raghavendra D Prabhu,
    Jan Kara, Chris Mason, Christoph Lameter, Pekka Enberg, Rik van Riel,
    Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4

On Fri, May 13, 2011 at 11:03 PM, Mel Gorman <mgorman@suse.de> wrote:
> Johannes Weiner pointed out that the logic in commit [1741c877: mm:
> kswapd: keep kswapd awake for high-order allocations until a percentage
> of the node is balanced] is backwards. Instead of allowing kswapd to go
> to sleep when balancing for high order allocations, it keeps kswapd
> running uselessly.
>
> From-but-was-not-signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Will-sign-off-after-Johannes: Mel Gorman <mgorman@suse.de>

Reviewed-by: Minchan Kim <minchan.kim@gmail.com>

Nice catch, Hannes!

-- 
Kind regards,
Minchan Kim
* Re: [PATCH 1/4] mm: vmscan: Correct use of pgdat_balanced in sleeping_prematurely

From: Rik van Riel @ 2011-05-16 14:30 UTC
To: Mel Gorman
Cc: Andrew Morton, James Bottomley, Colin King, Raghavendra D Prabhu,
    Jan Kara, Chris Mason, Christoph Lameter, Pekka Enberg,
    Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4

On 05/13/2011 10:03 AM, Mel Gorman wrote:
> Johannes Weiner pointed out that the logic in commit [1741c877: mm:
> kswapd: keep kswapd awake for high-order allocations until a percentage
> of the node is balanced] is backwards. Instead of allowing kswapd to go
> to sleep when balancing for high order allocations, it keeps kswapd
> running uselessly.
>
> From-but-was-not-signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Will-sign-off-after-Johannes: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed
* [PATCH 2/4] mm: slub: Do not wake kswapd for SLUB's speculative high-order allocations

From: Mel Gorman @ 2011-05-13 14:03 UTC
To: Andrew Morton
Cc: James Bottomley, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason,
    Christoph Lameter, Pekka Enberg, Rik van Riel, Johannes Weiner,
    linux-fsdevel, linux-mm, linux-kernel, linux-ext4, Mel Gorman

To avoid locking and per-cpu overhead, SLUB optimistically uses
high-order allocations and falls back to lower allocations if they
fail. However, by simply trying to allocate, kswapd is woken up to
start reclaiming at that order. On a desktop system, two users report
that the system is getting locked up with kswapd using large amounts
of CPU. Using SLAB instead of SLUB made this problem go away.

This patch prevents kswapd being woken up for high-order allocations.
Testing indicated that with this patch applied, the system was much
harder to hang and even when it did, it eventually recovered.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/slub.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 9d2e5e4..98c358d 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1170,7 +1170,7 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
 	 * Let the initial higher-order allocation fail under memory pressure
 	 * so we fall-back to the minimum order allocation.
 	 */
-	alloc_gfp = (flags | __GFP_NOWARN | __GFP_NORETRY) & ~__GFP_NOFAIL;
+	alloc_gfp = (flags | __GFP_NOWARN | __GFP_NORETRY | __GFP_NO_KSWAPD) & ~__GFP_NOFAIL;
 
 	page = alloc_slab_page(alloc_gfp, node, oo);
 	if (unlikely(!page)) {
-- 
1.7.3.4
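The mechanism being relied on here is the page allocator's slow path,
which only wakes kswapd when __GFP_NO_KSWAPD is absent from the mask.
A standalone sketch of that gate follows; the flag values and the
wake_all_kswapd() stub are illustrative stand-ins, not the kernel's
definitions:

#include <stdio.h>

#define __GFP_NOWARN     0x01u	/* illustrative bit values only */
#define __GFP_NORETRY    0x02u
#define __GFP_NOFAIL     0x04u
#define __GFP_NO_KSWAPD  0x08u

/* Stand-in for the wakeup the slow path performs for normal callers. */
static void wake_all_kswapd(void)
{
	puts("kswapd woken");
}

static void slowpath_enter(unsigned gfp_mask)
{
	if (!(gfp_mask & __GFP_NO_KSWAPD))
		wake_all_kswapd();
	else
		puts("kswapd left asleep; SLUB falls back on failure");
}

int main(void)
{
	unsigned flags = 0;	/* pretend this came from the caller */
	unsigned alloc_gfp = (flags | __GFP_NOWARN | __GFP_NORETRY |
			      __GFP_NO_KSWAPD) & ~__GFP_NOFAIL;

	slowpath_enter(flags);		/* ordinary allocation */
	slowpath_enter(alloc_gfp);	/* SLUB's speculative attempt */
	return 0;
}

The speculative attempt can still succeed from already-free pages; the
change only stops it from nagging kswapd when it cannot.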
* Re: [PATCH 2/4] mm: slub: Do not wake kswapd for SLUB's speculative high-order allocations

From: David Rientjes @ 2011-05-16 21:10 UTC
To: Mel Gorman
Cc: Andrew Morton, James Bottomley, Colin King, Raghavendra D Prabhu,
    Jan Kara, Chris Mason, Christoph Lameter, Pekka Enberg, Rik van Riel,
    Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4

On Fri, 13 May 2011, Mel Gorman wrote:

> To avoid locking and per-cpu overhead, SLUB optimistically uses
> high-order allocations and falls back to lower allocations if they
> fail. However, by simply trying to allocate, kswapd is woken up to
> start reclaiming at that order. On a desktop system, two users report
> that the system is getting locked up with kswapd using large amounts
> of CPU. Using SLAB instead of SLUB made this problem go away.
>
> This patch prevents kswapd being woken up for high-order allocations.
> Testing indicated that with this patch applied, the system was much
> harder to hang and even when it did, it eventually recovered.
>
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Acked-by: David Rientjes <rientjes@google.com>
* Re: [PATCH 2/4] mm: slub: Do not wake kswapd for SLUB's speculative high-order allocations

From: Pekka Enberg @ 2011-05-18 6:09 UTC
To: David Rientjes
Cc: Mel Gorman, Andrew Morton, James Bottomley, Colin King,
    Raghavendra D Prabhu, Jan Kara, Chris Mason, Christoph Lameter,
    Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel,
    linux-ext4

On 5/17/11 12:10 AM, David Rientjes wrote:
> On Fri, 13 May 2011, Mel Gorman wrote:
>
>> To avoid locking and per-cpu overhead, SLUB optimistically uses
>> high-order allocations and falls back to lower allocations if they
>> fail. However, by simply trying to allocate, kswapd is woken up to
>> start reclaiming at that order. On a desktop system, two users report
>> that the system is getting locked up with kswapd using large amounts
>> of CPU. Using SLAB instead of SLUB made this problem go away.
>>
>> This patch prevents kswapd being woken up for high-order allocations.
>> Testing indicated that with this patch applied, the system was much
>> harder to hang and even when it did, it eventually recovered.
>>
>> Signed-off-by: Mel Gorman <mgorman@suse.de>
>
> Acked-by: David Rientjes <rientjes@google.com>

Christoph? I think this patch is sane although the original rationale
was to work around kswapd problems.

			Pekka
* Re: [PATCH 2/4] mm: slub: Do not wake kswapd for SLUB's speculative high-order allocations

From: Christoph Lameter @ 2011-05-18 17:21 UTC
To: Pekka Enberg
Cc: David Rientjes, Mel Gorman, Andrew Morton, James Bottomley,
    Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Rik van Riel,
    Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4

On Wed, 18 May 2011, Pekka Enberg wrote:

> On 5/17/11 12:10 AM, David Rientjes wrote:
> > On Fri, 13 May 2011, Mel Gorman wrote:
> >
> > > To avoid locking and per-cpu overhead, SLUB optimistically uses
> > > high-order allocations and falls back to lower allocations if they
> > > fail. However, by simply trying to allocate, kswapd is woken up to
> > > start reclaiming at that order. On a desktop system, two users report
> > > that the system is getting locked up with kswapd using large amounts
> > > of CPU. Using SLAB instead of SLUB made this problem go away.
> > >
> > > This patch prevents kswapd being woken up for high-order allocations.
> > > Testing indicated that with this patch applied, the system was much
> > > harder to hang and even when it did, it eventually recovered.
> > >
> > > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > Acked-by: David Rientjes <rientjes@google.com>
>
> Christoph? I think this patch is sane although the original rationale
> was to work around kswapd problems.

I am mostly fine with it. The concern I have is that if there is a long
series of high-order allocs then at some point you would want kswapd to
be triggered instead of high-order allocs constantly failing. Can we
have a "trigger once in a while" functionality?
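One shape such a "once in a while" trigger could take, purely as a
hypothetical sketch of the idea in kernel style (no such patch was
posted in this thread; the counter, the 64-attempt threshold and the
helper name are invented for illustration): let every 64th speculative
attempt keep the kswapd wakeup.

/*
 * Hypothetical sketch only: ratelimit the kswapd wakeup rather than
 * suppressing it entirely.  highorder_attempts and the threshold are
 * invented for illustration.
 */
static atomic_t highorder_attempts = ATOMIC_INIT(0);

static gfp_t speculative_slab_gfp(gfp_t flags)
{
	gfp_t alloc_gfp = (flags | __GFP_NOWARN | __GFP_NORETRY |
			   __GFP_NO_KSWAPD) & ~__GFP_NOFAIL;

	/* Every 64th attempt, let the wakeup through so kswapd can
	 * restock high-order pages in the background. */
	if ((atomic_inc_return(&highorder_attempts) & 63) == 0)
		alloc_gfp &= ~__GFP_NO_KSWAPD;

	return alloc_gfp;
}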
* [PATCH 3/4] mm: slub: Do not take expensive steps for SLUB's speculative high-order allocations

From: Mel Gorman @ 2011-05-13 14:03 UTC
To: Andrew Morton
Cc: James Bottomley, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason,
    Christoph Lameter, Pekka Enberg, Rik van Riel, Johannes Weiner,
    linux-fsdevel, linux-mm, linux-kernel, linux-ext4, Mel Gorman

To avoid locking and per-cpu overhead, SLUB optimistically uses
high-order allocations and falls back to lower allocations if they
fail. However, by simply trying to allocate, the caller can enter
compaction or reclaim - both of which are likely to cost more than the
benefit of using high-order pages in SLUB. On a desktop system, two
users report that the system is getting stalled with kswapd using
large amounts of CPU.

This patch prevents SLUB taking any expensive steps when trying to use
high-order allocations. Instead, it is expected to fall back to smaller
orders more aggressively. Testing was somewhat inconclusive on how much
this helped but it makes sense that falling back to order-0 allocations
is faster than entering compaction or direct reclaim.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/page_alloc.c |    3 ++-
 mm/slub.c       |    3 ++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9f8a97b..057f1e2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1972,6 +1972,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 {
 	int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
 	const gfp_t wait = gfp_mask & __GFP_WAIT;
+	const gfp_t can_wake_kswapd = !(gfp_mask & __GFP_NO_KSWAPD);
 
 	/* __GFP_HIGH is assumed to be the same as ALLOC_HIGH to save a branch. */
 	BUILD_BUG_ON(__GFP_HIGH != (__force gfp_t) ALLOC_HIGH);
@@ -1984,7 +1985,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 	 */
 	alloc_flags |= (__force int) (gfp_mask & __GFP_HIGH);
 
-	if (!wait) {
+	if (!wait && can_wake_kswapd) {
 		/*
 		 * Not worth trying to allocate harder for
 		 * __GFP_NOMEMALLOC even if it can't schedule.
diff --git a/mm/slub.c b/mm/slub.c
index 98c358d..c5797ab 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1170,7 +1170,8 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
 	 * Let the initial higher-order allocation fail under memory pressure
 	 * so we fall-back to the minimum order allocation.
 	 */
-	alloc_gfp = (flags | __GFP_NOWARN | __GFP_NORETRY | __GFP_NO_KSWAPD) & ~__GFP_NOFAIL;
+	alloc_gfp = (flags | __GFP_NOWARN | __GFP_NO_KSWAPD) &
+			~(__GFP_NOFAIL | __GFP_WAIT | __GFP_REPEAT);
 
 	page = alloc_slab_page(alloc_gfp, node, oo);
 	if (unlikely(!page)) {
-- 
1.7.3.4
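The effect of stripping __GFP_WAIT is that the speculative attempt
behaves like an atomic allocation: the slow path fails it before it can
reach direct compaction or direct reclaim. A minimal standalone sketch
of that decision (the flag bits and the helper are illustrative, not
the kernel's code):

#include <stdio.h>

#define __GFP_WAIT    0x01u	/* illustrative bit values only */
#define __GFP_REPEAT  0x02u

/* Stand-in for the slow path's branch after the kswapd handling:
 * callers that cannot sleep are failed before compaction/reclaim. */
static const char *slowpath_outcome(unsigned gfp_mask)
{
	if (!(gfp_mask & __GFP_WAIT))
		return "fails fast; SLUB drops to a smaller order";
	return "may enter direct compaction and direct reclaim";
}

int main(void)
{
	/* A typical sleeping allocation as SLUB might receive it. */
	unsigned flags = __GFP_WAIT | __GFP_REPEAT;
	/* The patch's masking applied to it. */
	unsigned speculative = flags & ~(__GFP_WAIT | __GFP_REPEAT);

	printf("ordinary GFP_KERNEL-style caller: %s\n",
	       slowpath_outcome(flags));
	printf("SLUB speculative attempt:         %s\n",
	       slowpath_outcome(speculative));
	return 0;
}

Clearing __GFP_REPEAT in the same expression matters for the reasons
Mel gives later in the thread: the flag only changes behaviour on paths
that __GFP_WAIT callers can reach.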
* Re: [PATCH 3/4] mm: slub: Do not take expensive steps for SLUB's speculative high-order allocations

From: David Rientjes @ 2011-05-16 21:16 UTC
To: Mel Gorman
Cc: Andrew Morton, James Bottomley, Colin King, Raghavendra D Prabhu,
    Jan Kara, Chris Mason, Christoph Lameter, Pekka Enberg, Rik van Riel,
    Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4

On Fri, 13 May 2011, Mel Gorman wrote:

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 9f8a97b..057f1e2 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1972,6 +1972,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
>  {
>  	int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
>  	const gfp_t wait = gfp_mask & __GFP_WAIT;
> +	const gfp_t can_wake_kswapd = !(gfp_mask & __GFP_NO_KSWAPD);
>  
>  	/* __GFP_HIGH is assumed to be the same as ALLOC_HIGH to save a branch. */
>  	BUILD_BUG_ON(__GFP_HIGH != (__force gfp_t) ALLOC_HIGH);
> @@ -1984,7 +1985,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
>  	 */
>  	alloc_flags |= (__force int) (gfp_mask & __GFP_HIGH);
>  
> -	if (!wait) {
> +	if (!wait && can_wake_kswapd) {
>  		/*
>  		 * Not worth trying to allocate harder for
>  		 * __GFP_NOMEMALLOC even if it can't schedule.
> diff --git a/mm/slub.c b/mm/slub.c
> index 98c358d..c5797ab 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1170,7 +1170,8 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
>  	 * Let the initial higher-order allocation fail under memory pressure
>  	 * so we fall-back to the minimum order allocation.
>  	 */
> -	alloc_gfp = (flags | __GFP_NOWARN | __GFP_NORETRY | __GFP_NO_KSWAPD) & ~__GFP_NOFAIL;
> +	alloc_gfp = (flags | __GFP_NOWARN | __GFP_NO_KSWAPD) &
> +			~(__GFP_NOFAIL | __GFP_WAIT | __GFP_REPEAT);
>  
>  	page = alloc_slab_page(alloc_gfp, node, oo);
>  	if (unlikely(!page)) {

It's unnecessary to clear __GFP_REPEAT; these !__GFP_NOFAIL allocations
will immediately fail.

alloc_gfp would probably benefit from having a comment about why
__GFP_WAIT should be masked off here: that we don't want to do
compaction or direct reclaim or retry the allocation more than once (so
both __GFP_NORETRY and __GFP_REPEAT are no-ops).
* Re: [PATCH 3/4] mm: slub: Do not take expensive steps for SLUB's speculative high-order allocations

From: Mel Gorman @ 2011-05-17 8:42 UTC
To: David Rientjes
Cc: Andrew Morton, James Bottomley, Colin King, Raghavendra D Prabhu,
    Jan Kara, Chris Mason, Christoph Lameter, Pekka Enberg, Rik van Riel,
    Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4

On Mon, May 16, 2011 at 02:16:46PM -0700, David Rientjes wrote:
> On Fri, 13 May 2011, Mel Gorman wrote:
>
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 9f8a97b..057f1e2 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1972,6 +1972,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
> >  {
> >  	int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
> >  	const gfp_t wait = gfp_mask & __GFP_WAIT;
> > +	const gfp_t can_wake_kswapd = !(gfp_mask & __GFP_NO_KSWAPD);
> >  
> >  	/* __GFP_HIGH is assumed to be the same as ALLOC_HIGH to save a branch. */
> >  	BUILD_BUG_ON(__GFP_HIGH != (__force gfp_t) ALLOC_HIGH);
> > @@ -1984,7 +1985,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
> >  	 */
> >  	alloc_flags |= (__force int) (gfp_mask & __GFP_HIGH);
> >  
> > -	if (!wait) {
> > +	if (!wait && can_wake_kswapd) {
> >  		/*
> >  		 * Not worth trying to allocate harder for
> >  		 * __GFP_NOMEMALLOC even if it can't schedule.
> > diff --git a/mm/slub.c b/mm/slub.c
> > index 98c358d..c5797ab 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -1170,7 +1170,8 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
> >  	 * Let the initial higher-order allocation fail under memory pressure
> >  	 * so we fall-back to the minimum order allocation.
> >  	 */
> > -	alloc_gfp = (flags | __GFP_NOWARN | __GFP_NORETRY | __GFP_NO_KSWAPD) & ~__GFP_NOFAIL;
> > +	alloc_gfp = (flags | __GFP_NOWARN | __GFP_NO_KSWAPD) &
> > +			~(__GFP_NOFAIL | __GFP_WAIT | __GFP_REPEAT);
> >  
> >  	page = alloc_slab_page(alloc_gfp, node, oo);
> >  	if (unlikely(!page)) {
>
> It's unnecessary to clear __GFP_REPEAT; these !__GFP_NOFAIL allocations
> will immediately fail.

We can enter direct compaction or direct reclaim at least once. If
compaction is enabled and we enter reclaim/compaction, the presence of
__GFP_REPEAT makes a difference in should_continue_reclaim(). With
compaction disabled, the presence of the flag is relevant in
should_alloc_retry(), with it being possible to loop in the allocator
instead of failing the SLUB allocation and dropping back.

Maybe you meant !__GFP_WAIT instead of !__GFP_NOFAIL, which makes more
sense. In that case, we clear both flags because __GFP_REPEAT &&
!__GFP_WAIT is a senseless combination of flags. If for whatever reason
__GFP_WAIT was re-added, the presence of __GFP_REPEAT could cause
problems in reclaim that would be hard to spot again.

> alloc_gfp would probably benefit from having a comment about why
> __GFP_WAIT should be masked off here: that we don't want to do
> compaction or direct reclaim or retry the allocation more than once (so
> both __GFP_NORETRY and __GFP_REPEAT are no-ops).

That would have been helpful all right. I should have caught that and
explained it properly. In the event there is a new version of the
patch, I'll add one. For the moment, I'm dropping this patch entirely.
Christoph wants to maintain the historic behaviour of SLUB to maximise
the number of high-order pages it uses and, at the end of the day,
which option performs better depends entirely on the workload and
machine configuration.

-- 
Mel Gorman
SUSE Labs
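Mel's point about the dead flag combination can be seen in miniature
below: a standalone sketch with invented flag values that roughly
mirrors the retry decision he describes, not the kernel's actual
should_alloc_retry():

#include <stdbool.h>
#include <stdio.h>

#define __GFP_WAIT    0x01u	/* illustrative values only */
#define __GFP_REPEAT  0x02u

/* Rough shape of the decision: only sleepers ever loop, and
 * __GFP_REPEAT extends how long a sleeper keeps looping. */
static bool may_retry(unsigned gfp_mask)
{
	if (!(gfp_mask & __GFP_WAIT))
		return false;			/* fail immediately */
	return gfp_mask & __GFP_REPEAT;		/* simplified */
}

int main(void)
{
	/* __GFP_REPEAT without __GFP_WAIT is dead weight (prints 0)... */
	printf("%d\n", may_retry(__GFP_REPEAT));
	/* ...but becomes live again if __GFP_WAIT returns (prints 1). */
	printf("%d\n", may_retry(__GFP_REPEAT | __GFP_WAIT));
	return 0;
}

This is why clearing both flags is the defensive choice: leaving
__GFP_REPEAT set would be harmless today but a trap for later changes.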
* Re: [PATCH 3/4] mm: slub: Do not take expensive steps for SLUB's speculative high-order allocations

From: Christoph Lameter @ 2011-05-17 13:51 UTC
To: Mel Gorman
Cc: David Rientjes, Andrew Morton, James Bottomley, Colin King,
    Raghavendra D Prabhu, Jan Kara, Chris Mason, Pekka Enberg,
    Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel,
    linux-ext4

On Tue, 17 May 2011, Mel Gorman wrote:

> entirely. Christoph wants to maintain the historic behaviour of SLUB to
> maximise the number of high-order pages it uses and, at the end of the
> day, which option performs better depends entirely on the workload
> and machine configuration.

That is not what I meant. I would like more higher-order allocations to
succeed. That does not mean that SLUB's allocation methods and the
flags passed have to stay the same. You can change the SLUB behavior if
it helps.

I am just suspicious of compaction. If these mods are needed to reduce
the amount of higher-order pages then compaction does not have the
beneficial effect that it should have. It does not actually increase
the available higher-order pages. Fix that first.
* Re: [PATCH 3/4] mm: slub: Do not take expensive steps for SLUB's speculative high-order allocations

From: Mel Gorman @ 2011-05-17 16:22 UTC
To: Christoph Lameter
Cc: David Rientjes, Andrew Morton, James Bottomley, Colin King,
    Raghavendra D Prabhu, Jan Kara, Chris Mason, Pekka Enberg,
    Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel,
    linux-ext4

On Tue, May 17, 2011 at 08:51:47AM -0500, Christoph Lameter wrote:
> On Tue, 17 May 2011, Mel Gorman wrote:
>
> > entirely. Christoph wants to maintain the historic behaviour of SLUB to
> > maximise the number of high-order pages it uses and, at the end of the
> > day, which option performs better depends entirely on the workload
> > and machine configuration.
>
> That is not what I meant. I would like more higher-order allocations to
> succeed. That does not mean that SLUB's allocation methods and the
> flags passed have to stay the same. You can change the SLUB behavior if
> it helps.

In this particular patch, the success rate for high-order allocations
would likely decrease in low memory conditions, albeit the latency when
calling the page allocator will be lower and the disruption to the
system will be less (no copying or reclaim of pages). My expectation
would be that it's cheaper for SLUB to fall back than to compact memory
or reclaim pages, even if this means a slab page is smaller until more
memory is free. However, if the "goodness" criterion is high-order
allocation success rate, the patch shouldn't be merged.

> I am just suspicious of compaction. If these mods are needed to reduce
> the amount of higher-order pages then compaction does not have the
> beneficial effect that it should have. It does not actually increase
> the available higher-order pages. Fix that first.

The problem being addressed was the machine being hung at worst and in
other cases having kswapd pinned at 99-100% CPU. It's now been shown
that modifying SLUB is not necessary to fix this because the bug was in
page reclaim. The high-order allocation success rate didn't come into
it.

-- 
Mel Gorman
SUSE Labs
* Re: [PATCH 3/4] mm: slub: Do not take expensive steps for SLUBs speculative high-order allocations
  From: Christoph Lameter @ 2011-05-17 17:52 UTC
  To: Mel Gorman
  Cc: David Rientjes, Andrew Morton, James Bottomley, Colin King,
      Raghavendra D Prabhu, Jan Kara, Chris Mason, Pekka Enberg,
      Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm,
      linux-kernel, linux-ext4

On Tue, 17 May 2011, Mel Gorman wrote:

> > That is not what I meant. I would like more higher-order allocations to
> > succeed. That does not mean that SLUB's allocation methods and the flags
> > it passes have to stay the same. You can change the SLUB behaviour if it
> > helps.
> >
>
> In this particular patch, the success rate for high-order allocations
> would likely decrease in low-memory conditions, although the latency
> when calling the page allocator will be lower and the disruption to the
> system will be less (no copying or reclaim of pages). My expectation
> would be that it's cheaper for SLUB to fall back than to compact memory
> or reclaim pages, even if this means a slab page is smaller until more
> memory is free. However, if the "goodness" criterion is the high-order
> allocation success rate, the patch shouldn't be merged.

The criterion is certainly overall system performance, not the
high-order allocation success rate.
* Re: [PATCH 3/4] mm: slub: Do not take expensive steps for SLUBs speculative high-order allocations
  From: David Rientjes @ 2011-05-17 19:35 UTC
  To: Christoph Lameter
  Cc: Mel Gorman, Andrew Morton, James Bottomley, Colin King,
      Raghavendra D Prabhu, Jan Kara, Chris Mason, Pekka Enberg,
      Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm,
      linux-kernel, linux-ext4

On Tue, 17 May 2011, Christoph Lameter wrote:

> > In this particular patch, the success rate for high-order allocations
> > would likely decrease in low-memory conditions, although the latency
> > when calling the page allocator will be lower and the disruption to the
> > system will be less (no copying or reclaim of pages). My expectation
> > would be that it's cheaper for SLUB to fall back than to compact memory
> > or reclaim pages, even if this means a slab page is smaller until more
> > memory is free. However, if the "goodness" criterion is the high-order
> > allocation success rate, the patch shouldn't be merged.
>
> The criterion is certainly overall system performance, not the
> high-order allocation success rate.
>

SLUB definitely depends on these higher-order allocations being
successful for performance; dropping back to the min order is a last
resort as opposed to failing the kmalloc(). If it is the last resort,
then it makes sense that we'd want to try both compaction and reclaim
while we're already in the page allocator as we go down the slub
slowpath. Why not try just a little harder (compaction and/or reclaim)
to allocate at the cache's preferred order?
* Re: [PATCH 3/4] mm: slub: Do not take expensive steps for SLUBs speculative high-order allocations
  From: David Rientjes @ 2011-05-17 19:31 UTC
  To: Mel Gorman
  Cc: Andrew Morton, James Bottomley, Colin King, Raghavendra D Prabhu,
      Jan Kara, Chris Mason, Christoph Lameter, Pekka Enberg,
      Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm,
      linux-kernel, linux-ext4

On Tue, 17 May 2011, Mel Gorman wrote:

> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index 9f8a97b..057f1e2 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -1972,6 +1972,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
> > >  {
> > >  	int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
> > >  	const gfp_t wait = gfp_mask & __GFP_WAIT;
> > > +	const gfp_t can_wake_kswapd = !(gfp_mask & __GFP_NO_KSWAPD);
> > >
> > >  	/* __GFP_HIGH is assumed to be the same as ALLOC_HIGH to save a branch. */
> > >  	BUILD_BUG_ON(__GFP_HIGH != (__force gfp_t) ALLOC_HIGH);
> > > @@ -1984,7 +1985,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
> > >  	 */
> > >  	alloc_flags |= (__force int) (gfp_mask & __GFP_HIGH);
> > >
> > > -	if (!wait) {
> > > +	if (!wait && can_wake_kswapd) {
> > >  		/*
> > >  		 * Not worth trying to allocate harder for
> > >  		 * __GFP_NOMEMALLOC even if it can't schedule.
> > > diff --git a/mm/slub.c b/mm/slub.c
> > > index 98c358d..c5797ab 100644
> > > --- a/mm/slub.c
> > > +++ b/mm/slub.c
> > > @@ -1170,7 +1170,8 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
> > >  	 * Let the initial higher-order allocation fail under memory pressure
> > >  	 * so we fall-back to the minimum order allocation.
> > >  	 */
> > > -	alloc_gfp = (flags | __GFP_NOWARN | __GFP_NORETRY | __GFP_NO_KSWAPD) & ~__GFP_NOFAIL;
> > > +	alloc_gfp = (flags | __GFP_NOWARN | __GFP_NO_KSWAPD) &
> > > +			~(__GFP_NOFAIL | __GFP_WAIT | __GFP_REPEAT);
> > >
> > >  	page = alloc_slab_page(alloc_gfp, node, oo);
> > >  	if (unlikely(!page)) {
> >
> > It's unnecessary to clear __GFP_REPEAT, these !__GFP_NOFAIL allocations
> > will immediately fail.
> >
>
> We can enter direct compaction or direct reclaim at least once. If
> compaction is enabled and we enter reclaim/compaction, the presence of
> __GFP_REPEAT makes a difference in should_continue_reclaim(). With
> compaction disabled, the presence of the flag is relevant in
> should_alloc_retry(), with it being possible to loop in the allocator
> instead of failing the SLUB allocation and dropping back.
>

You've cleared __GFP_WAIT, so the allocation cannot enter direct
compaction or direct reclaim, and clearing __GFP_REPEAT here doesn't
actually do anything. That's why I suggested adding a comment about why
you're clearing __GFP_WAIT: to make it obvious that these allocations
will immediately fail if the alloc is unsuccessful, and that we don't
need to add __GFP_NORETRY or remove __GFP_REPEAT.

> Maybe you meant !__GFP_WAIT instead of !__GFP_NOFAIL which makes
> more sense.

No, I meant !__GFP_NOFAIL, since without that bit the high-priority
allocations (if PF_MEMALLOC or TIF_MEMDIE) will not loop forever looking
for a page. That allows this !__GFP_WAIT allocation to fail immediately.
__GFP_NORETRY and __GFP_REPEAT are no-ops unless you have __GFP_WAIT.
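David's point follows from the shape of the allocator's slow path. A
condensed sketch of the 2.6.39-era __alloc_pages_slowpath() control flow
(heavily trimmed, with direct_compact()/direct_reclaim() standing in for
__alloc_pages_direct_compact()/__alloc_pages_direct_reclaim(), so an
illustration rather than the verbatim source):

static struct page *slowpath_sketch(gfp_t gfp_mask, unsigned int order)
{
	const gfp_t wait = gfp_mask & __GFP_WAIT;
	unsigned long pages_reclaimed = 0;
	struct page *page = NULL;

rebalance:
	/* Atomic allocations - we can't balance anything */
	if (!wait)
		goto nopage;

	/* Only reachable with __GFP_WAIT set: */
	page = direct_compact(gfp_mask, order);		/* direct compaction */
	if (page)
		return page;

	page = direct_reclaim(gfp_mask, order, &pages_reclaimed); /* direct reclaim */
	if (page)
		return page;

	/* ...and only here are __GFP_NORETRY/__GFP_REPEAT consulted */
	if (should_alloc_retry(gfp_mask, order, pages_reclaimed))
		goto rebalance;

nopage:
	return page;
}

With __GFP_WAIT cleared, the !wait test short-circuits everything below
it, which is why the extra flag manipulation is moot for this allocation.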
* [PATCH 4/4] mm: vmscan: If kswapd has been running too long, allow it to sleep
  From: Mel Gorman @ 2011-05-13 14:03 UTC
  To: Andrew Morton
  Cc: James Bottomley, Colin King, Raghavendra D Prabhu, Jan Kara,
      Chris Mason, Christoph Lameter, Pekka Enberg, Rik van Riel,
      Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel,
      linux-ext4, Mel Gorman

Under constant allocation pressure, kswapd can be in the situation where
sleeping_prematurely() will always return true even if kswapd has been
running a long time. Check if kswapd needs to be scheduled.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/vmscan.c | 4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index af24d1e..4d24828 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2251,6 +2251,10 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
 	unsigned long balanced = 0;
 	bool all_zones_ok = true;

+	/* If kswapd has been running too long, just sleep */
+	if (need_resched())
+		return false;
+
 	/* If a direct reclaimer woke kswapd within HZ/10, it's premature */
 	if (remaining)
 		return true;
--
1.7.3.4
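For context, sleeping_prematurely() gates both the short and the long
sleep in kswapd_try_to_sleep(), so an early "return false" from it sends
kswapd to sleep on both paths. A condensed sketch of the 2.6.39-era
function (freezer and kthread-stop checks omitted, so not the verbatim
source):

static void kswapd_try_to_sleep(pg_data_t *pgdat, int order,
				int classzone_idx)
{
	long remaining = 0;
	DEFINE_WAIT(wait);

	prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);

	/* Try to sleep for a short interval */
	if (!sleeping_prematurely(pgdat, order, remaining, classzone_idx)) {
		remaining = schedule_timeout(HZ/10);
		finish_wait(&pgdat->kswapd_wait, &wait);
		prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
	}

	/*
	 * After the short sleep, check whether it was premature. If not,
	 * go fully to sleep until explicitly woken up.
	 */
	if (!sleeping_prematurely(pgdat, order, remaining, classzone_idx))
		schedule();

	finish_wait(&pgdat->kswapd_wait, &wait);
}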
* Re: [PATCH 4/4] mm: vmscan: If kswapd has been running too long, allow it to sleep
  From: KOSAKI Motohiro @ 2011-05-15 10:27 UTC
  To: mgorman
  Cc: akpm, James.Bottomley, colin.king, raghu.prabhu13, jack,
      chris.mason, cl, penberg, riel, hannes, linux-fsdevel, linux-mm,
      linux-kernel, linux-ext4

(2011/05/13 23:03), Mel Gorman wrote:
> Under constant allocation pressure, kswapd can be in the situation where
> sleeping_prematurely() will always return true even if kswapd has been
> running a long time. Check if kswapd needs to be scheduled.
>
> Signed-off-by: Mel Gorman<mgorman@suse.de>
> ---
>  mm/vmscan.c | 4 ++++
>  1 files changed, 4 insertions(+), 0 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index af24d1e..4d24828 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2251,6 +2251,10 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
>  	unsigned long balanced = 0;
>  	bool all_zones_ok = true;
>
> +	/* If kswapd has been running too long, just sleep */
> +	if (need_resched())
> +		return false;
> +

Hmm... I don't like this patch so much, because this code:

- doesn't sleep if kswapd got a context switch at shrink_inactive_list
- sleeps if kswapd didn't

It seems to be semi-random behavior.
* Re: [PATCH 4/4] mm: vmscan: If kswapd has been running too long, allow it to sleep
  From: James Bottomley @ 2011-05-16 4:21 UTC
  To: KOSAKI Motohiro
  Cc: mgorman, akpm, colin.king, raghu.prabhu13, jack, chris.mason, cl,
      penberg, riel, hannes, linux-fsdevel, linux-mm, linux-kernel,
      linux-ext4

On Sun, 2011-05-15 at 19:27 +0900, KOSAKI Motohiro wrote:
> (2011/05/13 23:03), Mel Gorman wrote:
> > Under constant allocation pressure, kswapd can be in the situation where
> > sleeping_prematurely() will always return true even if kswapd has been
> > running a long time. Check if kswapd needs to be scheduled.
> >
> > Signed-off-by: Mel Gorman<mgorman@suse.de>
> > ---
> >  mm/vmscan.c | 4 ++++
> >  1 files changed, 4 insertions(+), 0 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index af24d1e..4d24828 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2251,6 +2251,10 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
> >  	unsigned long balanced = 0;
> >  	bool all_zones_ok = true;
> >
> > +	/* If kswapd has been running too long, just sleep */
> > +	if (need_resched())
> > +		return false;
> > +
>
> Hmm... I don't like this patch so much, because this code:
>
> - doesn't sleep if kswapd got a context switch at shrink_inactive_list

This isn't entirely true: need_resched() will be false, so we'll follow
the normal path for determining whether to sleep or not, in effect
leaving the current behaviour unchanged.

> - sleeps if kswapd didn't

This also isn't entirely true: whether need_resched() is true at this
point depends on a whole lot more than whether we did a context switch
in shrink_inactive. It mostly depends on how long we've been running
without giving up the CPU. Generally that will mean we've been round
the shrinker loop hundreds to thousands of times without sleeping.

> It seems to be semi-random behavior.

Well, we have to do something. Chris Mason first suspected the hang was
a kswapd rescheduling problem a while ago. We tried putting
cond_rescheds() in several places in the vmscan code, but to no avail.
The need_resched() in sleeping_prematurely() seems to be about the best
option. The other option might be just to put a cond_resched() in
kswapd_try_to_sleep(), but that will really have about the same effect.

James
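The alternative James mentions was not posted as a patch in this thread;
purely as an illustration, it would amount to something like:

static void kswapd_try_to_sleep(pg_data_t *pgdat, int order,
				int classzone_idx)
{
	/*
	 * Hypothetical alternative: give up the CPU once before deciding
	 * whether to sleep, so a long-running kswapd cannot starve other
	 * runnable tasks even when sleeping_prematurely() keeps it awake.
	 */
	cond_resched();

	/* ... followed by the existing prepare_to_wait() /
	 * sleeping_prematurely() logic, unchanged ... */
}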
* Re: [PATCH 4/4] mm: vmscan: If kswapd has been running too long, allow it to sleep
  From: Minchan Kim @ 2011-05-16 5:04 UTC
  To: James Bottomley
  Cc: KOSAKI Motohiro, mgorman, akpm, colin.king, raghu.prabhu13, jack,
      chris.mason, cl, penberg, riel, hannes, linux-fsdevel, linux-mm,
      linux-kernel, linux-ext4

On Mon, May 16, 2011 at 1:21 PM, James Bottomley
<James.Bottomley@hansenpartnership.com> wrote:
> On Sun, 2011-05-15 at 19:27 +0900, KOSAKI Motohiro wrote:
>> (2011/05/13 23:03), Mel Gorman wrote:
>> > Under constant allocation pressure, kswapd can be in the situation where
>> > sleeping_prematurely() will always return true even if kswapd has been
>> > running a long time. Check if kswapd needs to be scheduled.
>> >
>> > Signed-off-by: Mel Gorman<mgorman@suse.de>
>> > ---
>> >  mm/vmscan.c | 4 ++++
>> >  1 files changed, 4 insertions(+), 0 deletions(-)
>> >
>> > diff --git a/mm/vmscan.c b/mm/vmscan.c
>> > index af24d1e..4d24828 100644
>> > --- a/mm/vmscan.c
>> > +++ b/mm/vmscan.c
>> > @@ -2251,6 +2251,10 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
>> >  	unsigned long balanced = 0;
>> >  	bool all_zones_ok = true;
>> >
>> > +	/* If kswapd has been running too long, just sleep */
>> > +	if (need_resched())
>> > +		return false;
>> > +
>>
>> Hmm... I don't like this patch so much, because this code:
>>
>> - doesn't sleep if kswapd got a context switch at shrink_inactive_list
>
> This isn't entirely true: need_resched() will be false, so we'll follow
> the normal path for determining whether to sleep or not, in effect
> leaving the current behaviour unchanged.
>
>> - sleeps if kswapd didn't
>
> This also isn't entirely true: whether need_resched() is true at this
> point depends on a whole lot more than whether we did a context switch
> in shrink_inactive. It mostly depends on how long we've been running
> without giving up the CPU. Generally that will mean we've been round
> the shrinker loop hundreds to thousands of times without sleeping.
>
>> It seems to be semi-random behavior.
>
> Well, we have to do something. Chris Mason first suspected the hang was
> a kswapd rescheduling problem a while ago. We tried putting
> cond_rescheds() in several places in the vmscan code, but to no avail.

Was that the result of a test with Hannes' patch (i.e., !pgdat_balanced)?

If it isn't, it would be a nop regardless of putting cond_resched() in
vmscan.c, because although we complete zone balancing, kswapd doesn't
sleep as pgdat_balanced() returns the wrong result, and at last the VM
calls balance_pgdat(). In this case, balance_pgdat() returns without
doing any work because kswapd can't find any zones short of free pages
and hits the goto out. kswapd could repeat this work infinitely, so you
don't have a chance to call cond_resched().

But if your test was with Hannes' patch, I am very curious how kswapd
ends up consuming so much CPU.

> The need_resched() in sleeping_prematurely() seems to be about the best
> option. The other option might be just to put a cond_resched() in
> kswapd_try_to_sleep(), but that will really have about the same effect.

I don't oppose it, but before that, I think we have to know why kswapd
consumes so much CPU even though we applied Hannes' patch.

>
> James
>

--
Kind regards,
Minchan Kim
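The check Minchan refers to is the tail of sleeping_prematurely() once
the pgdat_balanced() logic is corrected. Roughly, and assuming the fix
in the form it was later merged (a sketch, not the verbatim source):

	/*
	 * For high-order requests, the balanced zones must hold at least
	 * 25% of the node's pages for kswapd to sleep; for order-0, all
	 * zones must be balanced. If this keeps reporting "premature",
	 * kswapd never sleeps and re-enters balance_pgdat() immediately.
	 */
	if (order)
		return !pgdat_balanced(pgdat, balanced, classzone_idx);
	else
		return !all_zones_ok;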
* Re: [PATCH 4/4] mm: vmscan: If kswapd has been running too long, allow it to sleep
  From: Mel Gorman @ 2011-05-16 8:45 UTC
  To: Minchan Kim
  Cc: James Bottomley, KOSAKI Motohiro, akpm, colin.king,
      raghu.prabhu13, jack, chris.mason, cl, penberg, riel, hannes,
      linux-fsdevel, linux-mm, linux-kernel, linux-ext4

On Mon, May 16, 2011 at 02:04:00PM +0900, Minchan Kim wrote:
> On Mon, May 16, 2011 at 1:21 PM, James Bottomley
> <James.Bottomley@hansenpartnership.com> wrote:
> > On Sun, 2011-05-15 at 19:27 +0900, KOSAKI Motohiro wrote:
> >> (2011/05/13 23:03), Mel Gorman wrote:
> >> > Under constant allocation pressure, kswapd can be in the situation where
> >> > sleeping_prematurely() will always return true even if kswapd has been
> >> > running a long time. Check if kswapd needs to be scheduled.
> >> >
> >> > Signed-off-by: Mel Gorman<mgorman@suse.de>
> >> > ---
> >> >  mm/vmscan.c | 4 ++++
> >> >  1 files changed, 4 insertions(+), 0 deletions(-)
> >> >
> >> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> >> > index af24d1e..4d24828 100644
> >> > --- a/mm/vmscan.c
> >> > +++ b/mm/vmscan.c
> >> > @@ -2251,6 +2251,10 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
> >> >  	unsigned long balanced = 0;
> >> >  	bool all_zones_ok = true;
> >> >
> >> > +	/* If kswapd has been running too long, just sleep */
> >> > +	if (need_resched())
> >> > +		return false;
> >> > +
> >>
> >> Hmm... I don't like this patch so much, because this code:
> >>
> >> - doesn't sleep if kswapd got a context switch at shrink_inactive_list
> >
> > This isn't entirely true: need_resched() will be false, so we'll follow
> > the normal path for determining whether to sleep or not, in effect
> > leaving the current behaviour unchanged.
> >
> >> - sleeps if kswapd didn't
> >
> > This also isn't entirely true: whether need_resched() is true at this
> > point depends on a whole lot more than whether we did a context switch
> > in shrink_inactive. It mostly depends on how long we've been running
> > without giving up the CPU. Generally that will mean we've been round
> > the shrinker loop hundreds to thousands of times without sleeping.
> >
> >> It seems to be semi-random behavior.
> >
> > Well, we have to do something. Chris Mason first suspected the hang was
> > a kswapd rescheduling problem a while ago. We tried putting
> > cond_rescheds() in several places in the vmscan code, but to no avail.
>
> Was that the result of a test with Hannes' patch (i.e., !pgdat_balanced)?
>
> If it isn't, it would be a nop regardless of putting cond_resched() in
> vmscan.c, because although we complete zone balancing, kswapd doesn't
> sleep as pgdat_balanced() returns the wrong result, and at last the VM
> calls balance_pgdat(). In this case, balance_pgdat() returns without
> doing any work because kswapd can't find any zones short of free pages
> and hits the goto out. kswapd could repeat this work infinitely, so you
> don't have a chance to call cond_resched().
>
> But if your test was with Hannes' patch, I am very curious how kswapd
> ends up consuming so much CPU.
>
> > The need_resched() in sleeping_prematurely() seems to be about the best
> > option. The other option might be just to put a cond_resched() in
> > kswapd_try_to_sleep(), but that will really have about the same effect.
>
> I don't oppose it, but before that, I think we have to know why kswapd
> consumes so much CPU even though we applied Hannes' patch.
>

Because it's still possible for processes to allocate pages at the same
rate kswapd is freeing them, leading to a situation where kswapd does
not consider the zone balanced for prolonged periods of time.

--
Mel Gorman
SUSE Labs
* Re: [PATCH 4/4] mm: vmscan: If kswapd has been running too long, allow it to sleep
  From: Minchan Kim @ 2011-05-16 8:58 UTC
  To: Mel Gorman
  Cc: James Bottomley, KOSAKI Motohiro, akpm, colin.king,
      raghu.prabhu13, jack, chris.mason, cl, penberg, riel, hannes,
      linux-fsdevel, linux-mm, linux-kernel, linux-ext4

On Mon, May 16, 2011 at 5:45 PM, Mel Gorman <mgorman@suse.de> wrote:
> On Mon, May 16, 2011 at 02:04:00PM +0900, Minchan Kim wrote:
>> On Mon, May 16, 2011 at 1:21 PM, James Bottomley
>> <James.Bottomley@hansenpartnership.com> wrote:
>> > On Sun, 2011-05-15 at 19:27 +0900, KOSAKI Motohiro wrote:
>> >> (2011/05/13 23:03), Mel Gorman wrote:
>> >> > Under constant allocation pressure, kswapd can be in the situation where
>> >> > sleeping_prematurely() will always return true even if kswapd has been
>> >> > running a long time. Check if kswapd needs to be scheduled.
>> >> >
>> >> > Signed-off-by: Mel Gorman<mgorman@suse.de>
>> >> > ---
>> >> >  mm/vmscan.c | 4 ++++
>> >> >  1 files changed, 4 insertions(+), 0 deletions(-)
>> >> >
>> >> > diff --git a/mm/vmscan.c b/mm/vmscan.c
>> >> > index af24d1e..4d24828 100644
>> >> > --- a/mm/vmscan.c
>> >> > +++ b/mm/vmscan.c
>> >> > @@ -2251,6 +2251,10 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
>> >> >  	unsigned long balanced = 0;
>> >> >  	bool all_zones_ok = true;
>> >> >
>> >> > +	/* If kswapd has been running too long, just sleep */
>> >> > +	if (need_resched())
>> >> > +		return false;
>> >> > +
>> >>
>> >> Hmm... I don't like this patch so much, because this code:
>> >>
>> >> - doesn't sleep if kswapd got a context switch at shrink_inactive_list
>> >
>> > This isn't entirely true: need_resched() will be false, so we'll follow
>> > the normal path for determining whether to sleep or not, in effect
>> > leaving the current behaviour unchanged.
>> >
>> >> - sleeps if kswapd didn't
>> >
>> > This also isn't entirely true: whether need_resched() is true at this
>> > point depends on a whole lot more than whether we did a context switch
>> > in shrink_inactive. It mostly depends on how long we've been running
>> > without giving up the CPU. Generally that will mean we've been round
>> > the shrinker loop hundreds to thousands of times without sleeping.
>> >
>> >> It seems to be semi-random behavior.
>> >
>> > Well, we have to do something. Chris Mason first suspected the hang was
>> > a kswapd rescheduling problem a while ago. We tried putting
>> > cond_rescheds() in several places in the vmscan code, but to no avail.
>>
>> Was that the result of a test with Hannes' patch (i.e., !pgdat_balanced)?
>>
>> If it isn't, it would be a nop regardless of putting cond_resched() in
>> vmscan.c, because although we complete zone balancing, kswapd doesn't
>> sleep as pgdat_balanced() returns the wrong result, and at last the VM
>> calls balance_pgdat(). In this case, balance_pgdat() returns without
>> doing any work because kswapd can't find any zones short of free pages
>> and hits the goto out. kswapd could repeat this work infinitely, so you
>> don't have a chance to call cond_resched().
>>
>> But if your test was with Hannes' patch, I am very curious how kswapd
>> ends up consuming so much CPU.
>>
>> > The need_resched() in sleeping_prematurely() seems to be about the best
>> > option. The other option might be just to put a cond_resched() in
>> > kswapd_try_to_sleep(), but that will really have about the same effect.
>>
>> I don't oppose it, but before that, I think we have to know why kswapd
>> consumes so much CPU even though we applied Hannes' patch.
>>
>
> Because it's still possible for processes to allocate pages at the same
> rate kswapd is freeing them, leading to a situation where kswapd does
> not consider the zone balanced for prolonged periods of time.

We have cond_resched() in shrink_page_list(), shrink_slab() and
balance_pgdat(), so I think kswapd can be scheduled out, although it
will be scheduled back in after a short time because the tasks it yields
to also need page reclaim. If every task in the system needs reclaim,
kswapd consuming 99% of a CPU is a natural result, I think. Am I
missing something?

> --
> Mel Gorman
> SUSE Labs

--
Kind regards,
Minchan Kim
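For reference, one of the reschedule points Minchan lists is at the top
of the shrink_page_list() loop. Condensed from the 2.6.39-era code, with
the actual reclaim work and error paths elided (a sketch, not the
verbatim source):

static unsigned long shrink_page_list_sketch(struct list_head *page_list,
					     struct zone *zone,
					     struct scan_control *sc)
{
	unsigned long nr_reclaimed = 0;
	struct page *page;

	while (!list_empty(page_list)) {
		cond_resched();	/* yield here if a resched is pending */

		page = lru_to_page(page_list);
		list_del(&page->lru);

		/* ... attempt to unmap and free 'page', bumping
		 * nr_reclaimed on success ... */
	}

	return nr_reclaimed;
}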
* Re: [PATCH 4/4] mm: vmscan: If kswapd has been running too long, allow it to sleep 2011-05-16 8:58 ` Minchan Kim (?) @ 2011-05-16 10:27 ` Mel Gorman -1 siblings, 0 replies; 119+ messages in thread From: Mel Gorman @ 2011-05-16 10:27 UTC (permalink / raw) To: Minchan Kim Cc: James Bottomley, KOSAKI Motohiro, akpm, colin.king, raghu.prabhu13, jack, chris.mason, cl, penberg, riel, hannes, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Mon, May 16, 2011 at 05:58:59PM +0900, Minchan Kim wrote: > On Mon, May 16, 2011 at 5:45 PM, Mel Gorman <mgorman@suse.de> wrote: > > On Mon, May 16, 2011 at 02:04:00PM +0900, Minchan Kim wrote: > >> On Mon, May 16, 2011 at 1:21 PM, James Bottomley > >> <James.Bottomley@hansenpartnership.com> wrote: > >> > On Sun, 2011-05-15 at 19:27 +0900, KOSAKI Motohiro wrote: > >> >> (2011/05/13 23:03), Mel Gorman wrote: > >> >> > Under constant allocation pressure, kswapd can be in the situation where > >> >> > sleeping_prematurely() will always return true even if kswapd has been > >> >> > running a long time. Check if kswapd needs to be scheduled. > >> >> > > >> >> > Signed-off-by: Mel Gorman<mgorman@suse.de> > >> >> > --- > >> >> > mm/vmscan.c | 4 ++++ > >> >> > 1 files changed, 4 insertions(+), 0 deletions(-) > >> >> > > >> >> > diff --git a/mm/vmscan.c b/mm/vmscan.c > >> >> > index af24d1e..4d24828 100644 > >> >> > --- a/mm/vmscan.c > >> >> > +++ b/mm/vmscan.c > >> >> > @@ -2251,6 +2251,10 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining, > >> >> > unsigned long balanced = 0; > >> >> > bool all_zones_ok = true; > >> >> > > >> >> > + /* If kswapd has been running too long, just sleep */ > >> >> > + if (need_resched()) > >> >> > + return false; > >> >> > + > >> >> > >> >> Hmm... I don't like this patch so much. because this code does > >> >> > >> >> - don't sleep if kswapd got context switch at shrink_inactive_list > >> > > >> > This isn't entirely true: need_resched() will be false, so we'll follow > >> > the normal path for determining whether to sleep or not, in effect > >> > leaving the current behaviour unchanged. > >> > > >> >> - sleep if kswapd didn't > >> > > >> > This also isn't entirely true: whether need_resched() is true at this > >> > point depends on a whole lot more that whether we did a context switch > >> > in shrink_inactive. It mostly depends on how long we've been running > >> > without giving up the CPU. Generally that will mean we've been round > >> > the shrinker loop hundreds to thousands of times without sleeping. > >> > > >> >> It seems to be semi random behavior. > >> > > >> > Well, we have to do something. Chris Mason first suspected the hang was > >> > a kswapd rescheduling problem a while ago. We tried putting > >> > cond_rescheds() in several places in the vmscan code, but to no avail. > >> > >> Is it a result of test with patch of Hannes(ie, !pgdat_balanced)? > >> > >> If it isn't, it would be nop regardless of putting cond_reshed at vmscan.c. > >> Because, although we complete zone balancing, kswapd doesn't sleep as > >> pgdat_balance returns wrong result. And at last VM calls > >> balance_pgdat. In this case, balance_pgdat returns without any work as > >> kswap couldn't find zones which have not enough free pages and goto > >> out. kswapd could repeat this work infinitely. So you don't have a > >> chance to call cond_resched. > >> > >> But if your test was with Hanne's patch, I am very curious how come > >> kswapd consumes CPU a lot. 
> >> > The need_resched() in sleeping_prematurely() seems to be about the best > >> > option. The other option might be just to put a cond_resched() in > >> > kswapd_try_to_sleep(), but that will really have about the same effect. > >> > >> I don't oppose it but before that, I think we have to know why kswapd > >> consumes CPU a lot although we applied Hannes' patch. > >> > > > > Because it's still possible for processes to allocate pages at the same > > rate kswapd is freeing them leading to a situation where kswapd does not > > consider the zone balanced for prolonged periods of time. > > We have cond_resched in shrink_page_list, shrink_slab and balance_pgdat. > So I think kswapd can be scheduled out although it's scheduled in > after a short time as task scheduled also need page reclaim. Although > all task in system need reclaim, kswapd cpu 99% consumption is a > natural result, I think. > Do I miss something? >

Let's see.

shrink_page_list() only applies if inactive pages were isolated, which in turn may not happen if all_unreclaimable is set in shrink_zones(). If, for whatever reason, all_unreclaimable is set on all zones, we can miss calling cond_resched().

shrink_slab() only applies if we are reclaiming slab pages. If the first shrinker returns -1, we do not call cond_resched(). If that first shrinker is dcache and __GFP_FS is not set, direct reclaimers will not shrink at all. However, if there are enough of them running, or if one of the other shrinkers runs for a very long time, kswapd could be starved acquiring the shrinker_rwsem and never reach the cond_resched().

balance_pgdat() only calls cond_resched() if the zones are not balanced. For a high-order allocation that is balanced, it checks order-0 again. During that window, order-0 might have become unbalanced, so it loops again for order-0 and returns to kswapd() that it was reclaiming for order-0. It can then find that a caller has rewoken kswapd for a high-order allocation and re-enter balance_pgdat() without ever having called cond_resched().

While it appears unlikely, there are bad conditions which can result in cond_resched() being avoided. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 119+ messages in thread
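To make the shrink_slab() case concrete, here is a condensed sketch of the loop shape being described. It follows the 2.6.39-era code but elides the batch-size arithmetic, and the do_shrinker_shrink() helper name is recalled from that era's tree rather than quoted from it.

	unsigned long shrink_slab_sketch(struct shrink_control *shrink)
	{
		struct shrinker *shrinker;
		unsigned long ret = 0;

		if (!down_read_trylock(&shrinker_rwsem))
			return 1;	/* early return: no cond_resched() here */

		list_for_each_entry(shrinker, &shrinker_list, list) {
			long total_scan = SHRINK_BATCH;	/* stands in for the real sizing */

			while (total_scan >= SHRINK_BATCH) {
				int shrink_ret = do_shrinker_shrink(shrinker, shrink,
								    SHRINK_BATCH);
				if (shrink_ret == -1)
					break;	/* e.g. dcache without __GFP_FS;
						   the cond_resched() below is skipped */
				total_scan -= SHRINK_BATCH;
				cond_resched();	/* only reached after a full batch */
			}
		}
		up_read(&shrinker_rwsem);
		return ret;
	}

If every shrinker bails out with -1, or the trylock fails, the whole call completes without a single scheduling point, which is the starvation being described.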
* Re: [PATCH 4/4] mm: vmscan: If kswapd has been running too long, allow it to sleep 2011-05-16 10:27 ` Mel Gorman @ 2011-05-16 23:50 ` Minchan Kim -1 siblings, 0 replies; 119+ messages in thread From: Minchan Kim @ 2011-05-16 23:50 UTC (permalink / raw) To: Mel Gorman Cc: James Bottomley, KOSAKI Motohiro, akpm, colin.king, raghu.prabhu13, jack, chris.mason, cl, penberg, riel, hannes, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Mon, May 16, 2011 at 7:27 PM, Mel Gorman <mgorman@suse.de> wrote: > On Mon, May 16, 2011 at 05:58:59PM +0900, Minchan Kim wrote: >> On Mon, May 16, 2011 at 5:45 PM, Mel Gorman <mgorman@suse.de> wrote: >> > On Mon, May 16, 2011 at 02:04:00PM +0900, Minchan Kim wrote: >> >> On Mon, May 16, 2011 at 1:21 PM, James Bottomley >> >> <James.Bottomley@hansenpartnership.com> wrote: >> >> > On Sun, 2011-05-15 at 19:27 +0900, KOSAKI Motohiro wrote: >> >> >> (2011/05/13 23:03), Mel Gorman wrote: >> >> >> > Under constant allocation pressure, kswapd can be in the situation where >> >> >> > sleeping_prematurely() will always return true even if kswapd has been >> >> >> > running a long time. Check if kswapd needs to be scheduled. >> >> >> > >> >> >> > Signed-off-by: Mel Gorman<mgorman@suse.de> >> >> >> > --- >> >> >> > mm/vmscan.c | 4 ++++ >> >> >> > 1 files changed, 4 insertions(+), 0 deletions(-) >> >> >> > >> >> >> > diff --git a/mm/vmscan.c b/mm/vmscan.c >> >> >> > index af24d1e..4d24828 100644 >> >> >> > --- a/mm/vmscan.c >> >> >> > +++ b/mm/vmscan.c >> >> >> > @@ -2251,6 +2251,10 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining, >> >> >> > unsigned long balanced = 0; >> >> >> > bool all_zones_ok = true; >> >> >> > >> >> >> > + /* If kswapd has been running too long, just sleep */ >> >> >> > + if (need_resched()) >> >> >> > + return false; >> >> >> > + >> >> >> >> >> >> Hmm... I don't like this patch so much. because this code does >> >> >> >> >> >> - don't sleep if kswapd got context switch at shrink_inactive_list >> >> > >> >> > This isn't entirely true: need_resched() will be false, so we'll follow >> >> > the normal path for determining whether to sleep or not, in effect >> >> > leaving the current behaviour unchanged. >> >> > >> >> >> - sleep if kswapd didn't >> >> > >> >> > This also isn't entirely true: whether need_resched() is true at this >> >> > point depends on a whole lot more that whether we did a context switch >> >> > in shrink_inactive. It mostly depends on how long we've been running >> >> > without giving up the CPU. Generally that will mean we've been round >> >> > the shrinker loop hundreds to thousands of times without sleeping. >> >> > >> >> >> It seems to be semi random behavior. >> >> > >> >> > Well, we have to do something. Chris Mason first suspected the hang was >> >> > a kswapd rescheduling problem a while ago. We tried putting >> >> > cond_rescheds() in several places in the vmscan code, but to no avail. >> >> >> >> Is it a result of test with patch of Hannes(ie, !pgdat_balanced)? >> >> >> >> If it isn't, it would be nop regardless of putting cond_reshed at vmscan.c. >> >> Because, although we complete zone balancing, kswapd doesn't sleep as >> >> pgdat_balance returns wrong result. And at last VM calls >> >> balance_pgdat. In this case, balance_pgdat returns without any work as >> >> kswap couldn't find zones which have not enough free pages and goto >> >> out. kswapd could repeat this work infinitely. So you don't have a >> >> chance to call cond_resched. 
>> >> But if your test was with Hanne's patch, I am very curious how come >> >> kswapd consumes CPU a lot. >> >> >> >> > The need_resched() in sleeping_prematurely() seems to be about the best >> >> > option. The other option might be just to put a cond_resched() in >> >> > kswapd_try_to_sleep(), but that will really have about the same effect. >> >> >> >> I don't oppose it but before that, I think we have to know why kswapd >> >> consumes CPU a lot although we applied Hannes' patch. >> >> >> > >> > Because it's still possible for processes to allocate pages at the same >> > rate kswapd is freeing them leading to a situation where kswapd does not >> > consider the zone balanced for prolonged periods of time. >> >> We have cond_resched in shrink_page_list, shrink_slab and balance_pgdat. >> So I think kswapd can be scheduled out although it's scheduled in >> after a short time as task scheduled also need page reclaim. Although >> all task in system need reclaim, kswapd cpu 99% consumption is a >> natural result, I think. >> Do I miss something? >> > > Lets see; > > shrink_page_list() only applies if inactive pages were isolated > which in turn may not happen if all_unreclaimable is set in > shrink_zones(). If for whatver reason, all_unreclaimable is > set on all zones, we can miss calling cond_resched(). > > shrink_slab only applies if we are reclaiming slab pages. If the first > shrinker returns -1, we do not call cond_resched(). If that > first shrinker is dcache and __GFP_FS is not set, direct > reclaimers will not shrink at all. However, if there are > enough of them running or if one of the other shrinkers > is running for a very long time, kswapd could be starved > acquiring the shrinker_rwsem and never reaching the > cond_resched().

Don't we have to move the cond_resched()?

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 292582c..633e761 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -231,8 +231,10 @@ unsigned long shrink_slab(struct shrink_control *shrink,
 	if (scanned == 0)
 		scanned = SWAP_CLUSTER_MAX;
 
-	if (!down_read_trylock(&shrinker_rwsem))
-		return 1;	/* Assume we'll be able to shrink next time */
+	if (!down_read_trylock(&shrinker_rwsem)) {
+		ret = 1;
+		goto out;	/* Assume we'll be able to shrink next time */
+	}
 
 	list_for_each_entry(shrinker, &shrinker_list, list) {
 		unsigned long long delta;
@@ -280,12 +282,14 @@ unsigned long shrink_slab(struct shrink_control *shrink,
 			count_vm_events(SLABS_SCANNED, this_scan);
 			total_scan -= this_scan;
 
-			cond_resched();
 		}
 
 		shrinker->nr += total_scan;
+		cond_resched();
 	}
 	up_read(&shrinker_rwsem);
+out:
+	cond_resched();
 	return ret;
 }

> > balance_pgdat() only calls cond_resched if the zones are not > balanced. For a high-order allocation that is balanced, it > checks order-0 again. During that window, order-0 might have > become unbalanced so it loops again for order-0 and returns > that was reclaiming for order-0 to kswapd(). It can then find > that a caller has rewoken kswapd for a high-order and re-enters > balance_pgdat() without ever have called cond_resched(). If kswapd reclaims order-0 followed by a high order, it would have a chance to call cond_resched() in shrink_page_list(). But if all_unreclaimable is set on all zones, balance_pgdat could return any work. Okay, it does make sense. In your scenario, someone wakes kswapd up with a higher order again, so it re-enters balance_pgdat() without ever having called cond_resched(). But if someone keeps waking it up with a higher order, we never get a chance to call kswapd_try_to_sleep(), so your patch's effect would be a nop, too.
Would it be better to put a cond_resched() after balance_pgdat()?

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 292582c..61c45d0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2753,6 +2753,7 @@ static int kswapd(void *p)
 		if (!ret) {
 			trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
 			order = balance_pgdat(pgdat, order, &classzone_idx);
+			cond_resched();
 		}
 	}
 	return 0;

> > While it appears unlikely, there are bad conditions which can result > in cond_resched() being avoided. > > -- > Mel Gorman > SUSE Labs > -- Kind regards, Minchan Kim ^ permalink raw reply related [flat|nested] 119+ messages in thread
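To place that hunk, here is a heavily abridged sketch of kswapd()'s main loop as it looked around 2.6.39; the wakeup bookkeeping is elided and the shape is reconstructed from the hunk above rather than copied from the tree.

	static int kswapd(void *p)
	{
		pg_data_t *pgdat = (pg_data_t *)p;
		int order = 0;
		int classzone_idx = MAX_NR_ZONES - 1;

		for ( ; ; ) {
			int ret;

			/*
			 * May return without sleeping while
			 * sleeping_prematurely() keeps objecting - the
			 * spin discussed in this thread.
			 */
			kswapd_try_to_sleep(pgdat, order, classzone_idx);

			order = pgdat->kswapd_max_order;	/* latest wakeup request */

			ret = try_to_freeze();
			if (kthread_should_stop())
				break;

			if (!ret) {
				trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
				order = balance_pgdat(pgdat, order, &classzone_idx);
				cond_resched();	/* the proposed scheduling point */
			}
		}
		return 0;
	}

A cond_resched() here would guarantee one scheduling point per balance_pgdat() pass even when kswapd never manages to sleep.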
* Re: [PATCH 4/4] mm: vmscan: If kswapd has been running too long, allow it to sleep 2011-05-16 23:50 ` Minchan Kim (?) @ 2011-05-17 0:48 ` Minchan Kim -1 siblings, 0 replies; 119+ messages in thread From: Minchan Kim @ 2011-05-17 0:48 UTC (permalink / raw) To: Mel Gorman Cc: James Bottomley, KOSAKI Motohiro, akpm, colin.king, raghu.prabhu13, jack, chris.mason, cl, penberg, riel, hannes, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Tue, May 17, 2011 at 8:50 AM, Minchan Kim <minchan.kim@gmail.com> wrote: > On Mon, May 16, 2011 at 7:27 PM, Mel Gorman <mgorman@suse.de> wrote: >> On Mon, May 16, 2011 at 05:58:59PM +0900, Minchan Kim wrote: >>> On Mon, May 16, 2011 at 5:45 PM, Mel Gorman <mgorman@suse.de> wrote: >>> > On Mon, May 16, 2011 at 02:04:00PM +0900, Minchan Kim wrote: >>> >> On Mon, May 16, 2011 at 1:21 PM, James Bottomley >>> >> <James.Bottomley@hansenpartnership.com> wrote: >>> >> > On Sun, 2011-05-15 at 19:27 +0900, KOSAKI Motohiro wrote: >>> >> >> (2011/05/13 23:03), Mel Gorman wrote: >>> >> >> > Under constant allocation pressure, kswapd can be in the situation where >>> >> >> > sleeping_prematurely() will always return true even if kswapd has been >>> >> >> > running a long time. Check if kswapd needs to be scheduled. >>> >> >> > >>> >> >> > Signed-off-by: Mel Gorman<mgorman@suse.de> >>> >> >> > --- >>> >> >> > mm/vmscan.c | 4 ++++ >>> >> >> > 1 files changed, 4 insertions(+), 0 deletions(-) >>> >> >> > >>> >> >> > diff --git a/mm/vmscan.c b/mm/vmscan.c >>> >> >> > index af24d1e..4d24828 100644 >>> >> >> > --- a/mm/vmscan.c >>> >> >> > +++ b/mm/vmscan.c >>> >> >> > @@ -2251,6 +2251,10 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining, >>> >> >> > unsigned long balanced = 0; >>> >> >> > bool all_zones_ok = true; >>> >> >> > >>> >> >> > + /* If kswapd has been running too long, just sleep */ >>> >> >> > + if (need_resched()) >>> >> >> > + return false; >>> >> >> > + >>> >> >> >>> >> >> Hmm... I don't like this patch so much. because this code does >>> >> >> >>> >> >> - don't sleep if kswapd got context switch at shrink_inactive_list >>> >> > >>> >> > This isn't entirely true: need_resched() will be false, so we'll follow >>> >> > the normal path for determining whether to sleep or not, in effect >>> >> > leaving the current behaviour unchanged. >>> >> > >>> >> >> - sleep if kswapd didn't >>> >> > >>> >> > This also isn't entirely true: whether need_resched() is true at this >>> >> > point depends on a whole lot more that whether we did a context switch >>> >> > in shrink_inactive. It mostly depends on how long we've been running >>> >> > without giving up the CPU. Generally that will mean we've been round >>> >> > the shrinker loop hundreds to thousands of times without sleeping. >>> >> > >>> >> >> It seems to be semi random behavior. >>> >> > >>> >> > Well, we have to do something. Chris Mason first suspected the hang was >>> >> > a kswapd rescheduling problem a while ago. We tried putting >>> >> > cond_rescheds() in several places in the vmscan code, but to no avail. >>> >> >>> >> Is it a result of test with patch of Hannes(ie, !pgdat_balanced)? >>> >> >>> >> If it isn't, it would be nop regardless of putting cond_reshed at vmscan.c. >>> >> Because, although we complete zone balancing, kswapd doesn't sleep as >>> >> pgdat_balance returns wrong result. And at last VM calls >>> >> balance_pgdat. In this case, balance_pgdat returns without any work as >>> >> kswap couldn't find zones which have not enough free pages and goto >>> >> out. 
kswapd could repeat this work infinitely. So you don't have a >>> >> chance to call cond_resched. >>> >> >>> >> But if your test was with Hanne's patch, I am very curious how come >>> >> kswapd consumes CPU a lot. >>> >> >>> >> > The need_resched() in sleeping_prematurely() seems to be about the best >>> >> > option. The other option might be just to put a cond_resched() in >>> >> > kswapd_try_to_sleep(), but that will really have about the same effect. >>> >> >>> >> I don't oppose it but before that, I think we have to know why kswapd >>> >> consumes CPU a lot although we applied Hannes' patch. >>> >> >>> > >>> > Because it's still possible for processes to allocate pages at the same >>> > rate kswapd is freeing them leading to a situation where kswapd does not >>> > consider the zone balanced for prolonged periods of time. >>> >>> We have cond_resched in shrink_page_list, shrink_slab and balance_pgdat. >>> So I think kswapd can be scheduled out although it's scheduled in >>> after a short time as task scheduled also need page reclaim. Although >>> all task in system need reclaim, kswapd cpu 99% consumption is a >>> natural result, I think. >>> Do I miss something? >>> >> >> Lets see; >> >> shrink_page_list() only applies if inactive pages were isolated >> which in turn may not happen if all_unreclaimable is set in >> shrink_zones(). If for whatver reason, all_unreclaimable is >> set on all zones, we can miss calling cond_resched(). >> >> shrink_slab only applies if we are reclaiming slab pages. If the first >> shrinker returns -1, we do not call cond_resched(). If that >> first shrinker is dcache and __GFP_FS is not set, direct >> reclaimers will not shrink at all. However, if there are >> enough of them running or if one of the other shrinkers >> is running for a very long time, kswapd could be starved >> acquiring the shrinker_rwsem and never reaching the >> cond_resched(). > > Don't we have to move cond_resched? > > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 292582c..633e761 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -231,8 +231,10 @@ unsigned long shrink_slab(struct shrink_control *shrink, > if (scanned == 0) > scanned = SWAP_CLUSTER_MAX; > > - if (!down_read_trylock(&shrinker_rwsem)) > - return 1; /* Assume we'll be able to shrink next time */ > + if (!down_read_trylock(&shrinker_rwsem)) { > + ret = 1; > + goto out; /* Assume we'll be able to shrink next time */ > + } > > list_for_each_entry(shrinker, &shrinker_list, list) { > unsigned long long delta; > @@ -280,12 +282,14 @@ unsigned long shrink_slab(struct shrink_control *shrink, > count_vm_events(SLABS_SCANNED, this_scan); > total_scan -= this_scan; > > - cond_resched(); > } > > shrinker->nr += total_scan; > + cond_resched(); > } > up_read(&shrinker_rwsem); > +out: > + cond_resched(); > return ret; > } > > >> >> balance_pgdat() only calls cond_resched if the zones are not >> balanced. For a high-order allocation that is balanced, it >> checks order-0 again. During that window, order-0 might have >> become unbalanced so it loops again for order-0 and returns >> that was reclaiming for order-0 to kswapd(). It can then find >> that a caller has rewoken kswapd for a high-order and re-enters >> balance_pgdat() without ever have called cond_resched(). > > If kswapd reclaims order-o followed by high order, it would have a > chance to call cond_resched in shrink_page_list. But if all zones are > all_unreclaimable is set, balance_pgdat could return any work. Typo : without any work. 
-- Kind regards, Minchan Kim ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH 4/4] mm: vmscan: If kswapd has been running too long, allow it to sleep 2011-05-16 23:50 ` Minchan Kim (?) @ 2011-05-17 10:38 ` Mel Gorman -1 siblings, 0 replies; 119+ messages in thread From: Mel Gorman @ 2011-05-17 10:38 UTC (permalink / raw) To: Minchan Kim Cc: James Bottomley, KOSAKI Motohiro, akpm, colin.king, raghu.prabhu13, jack, chris.mason, cl, penberg, riel, hannes, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Tue, May 17, 2011 at 08:50:44AM +0900, Minchan Kim wrote: > On Mon, May 16, 2011 at 7:27 PM, Mel Gorman <mgorman@suse.de> wrote: > > On Mon, May 16, 2011 at 05:58:59PM +0900, Minchan Kim wrote: > >> On Mon, May 16, 2011 at 5:45 PM, Mel Gorman <mgorman@suse.de> wrote: > >> > On Mon, May 16, 2011 at 02:04:00PM +0900, Minchan Kim wrote: > >> >> On Mon, May 16, 2011 at 1:21 PM, James Bottomley > >> >> <James.Bottomley@hansenpartnership.com> wrote: > >> >> > On Sun, 2011-05-15 at 19:27 +0900, KOSAKI Motohiro wrote: > >> >> >> (2011/05/13 23:03), Mel Gorman wrote: > >> >> >> > Under constant allocation pressure, kswapd can be in the situation where > >> >> >> > sleeping_prematurely() will always return true even if kswapd has been > >> >> >> > running a long time. Check if kswapd needs to be scheduled. > >> >> >> > > >> >> >> > Signed-off-by: Mel Gorman<mgorman@suse.de> > >> >> >> > --- > >> >> >> > mm/vmscan.c | 4 ++++ > >> >> >> > 1 files changed, 4 insertions(+), 0 deletions(-) > >> >> >> > > >> >> >> > diff --git a/mm/vmscan.c b/mm/vmscan.c > >> >> >> > index af24d1e..4d24828 100644 > >> >> >> > --- a/mm/vmscan.c > >> >> >> > +++ b/mm/vmscan.c > >> >> >> > @@ -2251,6 +2251,10 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining, > >> >> >> > unsigned long balanced = 0; > >> >> >> > bool all_zones_ok = true; > >> >> >> > > >> >> >> > + /* If kswapd has been running too long, just sleep */ > >> >> >> > + if (need_resched()) > >> >> >> > + return false; > >> >> >> > + > >> >> >> > >> >> >> Hmm... I don't like this patch so much. because this code does > >> >> >> > >> >> >> - don't sleep if kswapd got context switch at shrink_inactive_list > >> >> > > >> >> > This isn't entirely true: need_resched() will be false, so we'll follow > >> >> > the normal path for determining whether to sleep or not, in effect > >> >> > leaving the current behaviour unchanged. > >> >> > > >> >> >> - sleep if kswapd didn't > >> >> > > >> >> > This also isn't entirely true: whether need_resched() is true at this > >> >> > point depends on a whole lot more that whether we did a context switch > >> >> > in shrink_inactive. It mostly depends on how long we've been running > >> >> > without giving up the CPU. Generally that will mean we've been round > >> >> > the shrinker loop hundreds to thousands of times without sleeping. > >> >> > > >> >> >> It seems to be semi random behavior. > >> >> > > >> >> > Well, we have to do something. Chris Mason first suspected the hang was > >> >> > a kswapd rescheduling problem a while ago. We tried putting > >> >> > cond_rescheds() in several places in the vmscan code, but to no avail. > >> >> > >> >> Is it a result of test with patch of Hannes(ie, !pgdat_balanced)? > >> >> > >> >> If it isn't, it would be nop regardless of putting cond_reshed at vmscan.c. > >> >> Because, although we complete zone balancing, kswapd doesn't sleep as > >> >> pgdat_balance returns wrong result. And at last VM calls > >> >> balance_pgdat. 
In this case, balance_pgdat returns without any work as > >> >> kswap couldn't find zones which have not enough free pages and goto > >> >> out. kswapd could repeat this work infinitely. So you don't have a > >> >> chance to call cond_resched. > >> >> > >> >> But if your test was with Hanne's patch, I am very curious how come > >> >> kswapd consumes CPU a lot. > >> >> > >> >> > The need_resched() in sleeping_prematurely() seems to be about the best > >> >> > option. The other option might be just to put a cond_resched() in > >> >> > kswapd_try_to_sleep(), but that will really have about the same effect. > >> >> > >> >> I don't oppose it but before that, I think we have to know why kswapd > >> >> consumes CPU a lot although we applied Hannes' patch. > >> >> > >> > > >> > Because it's still possible for processes to allocate pages at the same > >> > rate kswapd is freeing them leading to a situation where kswapd does not > >> > consider the zone balanced for prolonged periods of time. > >> > >> We have cond_resched in shrink_page_list, shrink_slab and balance_pgdat. > >> So I think kswapd can be scheduled out although it's scheduled in > >> after a short time as task scheduled also need page reclaim. Although > >> all task in system need reclaim, kswapd cpu 99% consumption is a > >> natural result, I think. > >> Do I miss something? > >> > > > > Lets see; > > > > shrink_page_list() only applies if inactive pages were isolated > > which in turn may not happen if all_unreclaimable is set in > > shrink_zones(). If for whatver reason, all_unreclaimable is > > set on all zones, we can miss calling cond_resched(). > > > > shrink_slab only applies if we are reclaiming slab pages. If the first > > shrinker returns -1, we do not call cond_resched(). If that > > first shrinker is dcache and __GFP_FS is not set, direct > > reclaimers will not shrink at all. However, if there are > > enough of them running or if one of the other shrinkers > > is running for a very long time, kswapd could be starved > > acquiring the shrinker_rwsem and never reaching the > > cond_resched(). > > Don't we have to move cond_resched? > > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 292582c..633e761 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -231,8 +231,10 @@ unsigned long shrink_slab(struct shrink_control *shrink, > if (scanned == 0) > scanned = SWAP_CLUSTER_MAX; > > - if (!down_read_trylock(&shrinker_rwsem)) > - return 1; /* Assume we'll be able to shrink next time */ > + if (!down_read_trylock(&shrinker_rwsem)) { > + ret = 1; > + goto out; /* Assume we'll be able to shrink next time */ > + } > > list_for_each_entry(shrinker, &shrinker_list, list) { > unsigned long long delta; > @@ -280,12 +282,14 @@ unsigned long shrink_slab(struct shrink_control *shrink, > count_vm_events(SLABS_SCANNED, this_scan); > total_scan -= this_scan; > > - cond_resched(); > } > > shrinker->nr += total_scan; > + cond_resched(); > } > up_read(&shrinker_rwsem); > +out: > + cond_resched(); > return ret; > } > This makes some sense for the exit path but if one or more of the shrinkers takes a very long time without sleeping (extremely long list searches for example) then kswapd will not call cond_resched() between shrinkers and still consume a lot of CPU. > > > > balance_pgdat() only calls cond_resched if the zones are not > > balanced. For a high-order allocation that is balanced, it > > checks order-0 again. 
During that window, order-0 might have > > become unbalanced so it loops again for order-0 and returns > > that was reclaiming for order-0 to kswapd(). It can then find > > that a caller has rewoken kswapd for a high-order and re-enters > > balance_pgdat() without ever have called cond_resched(). > > If kswapd reclaims order-o followed by high order, it would have a > chance to call cond_resched in shrink_page_list. But if all zones are > all_unreclaimable is set, balance_pgdat could return any work. Okay. > It does make sense. > By your scenario, someone wakes up kswapd with higher order, again. > So re-enters balance_pgdat without ever have called cond_resched. > But if someone wakes up higher order again, we can't have a chance to > call kswapd_try_to_sleep. So your patch effect would be nop, too. > > It would be better to put cond_resched after balance_pgdat? > That would leave kswapd runnable instead of going to sleep, but it guarantees a scheduling point. Let's see whether the problem is that a cond_resched() is being missed; if that were the case then patch 4 would truly be a no-op, but Colin has already reported that patch 1 on its own didn't fix his problem. If the problem is sandybridge-specific, where kswapd remains runnable and consumes large amounts of CPU in turbo mode, then we know that there are other cond_resched() decisions that will need to be revisited. Colin or James, would you be willing to test with patch 1 from this series and Minchan's patch below? Thanks. > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 292582c..61c45d0 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -2753,6 +2753,7 @@ static int kswapd(void *p) > if (!ret) { > trace_mm_vmscan_kswapd_wake(pgdat->node_id, order); > order = balance_pgdat(pgdat, order, &classzone_idx); > + cond_resched(); > } > } > return 0; > > > > > While it appears unlikely, there are bad conditions which can result > > in cond_resched() being avoided. > > > > > -- > > Mel Gorman > > SUSE Labs > > > > > > -- > Kind regards, > Minchan Kim -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 119+ messages in thread
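[Editorial aside: the control flow being argued about above is easier to see outside the kernel. The following is a minimal standalone C sketch, not kernel code: kswapd, sleeping_prematurely() and cond_resched() are modelled with ordinary functions and sched_yield(), and the loop counter standing in for need_resched() is an invented stand-in. It shows how a premature-sleep check that never observes a pending reschedule can spin forever, and how the bail-out from patch 4 breaks the loop.]

/* Standalone model of the loop under discussion; NOT kernel code.
 * kswapd, sleeping_prematurely() and cond_resched() are faked so the
 * control flow can be compiled and run in userspace. */
#include <sched.h>
#include <stdbool.h>
#include <stdio.h>

static int loops_without_yield;

/* Stand-in for need_resched(): pretend the scheduler wants the CPU
 * back once we have looped "too long" without yielding. */
static bool model_need_resched(void)
{
        return loops_without_yield > 1000;
}

/* Stand-in for sleeping_prematurely(): under constant allocation
 * pressure the zones never look balanced, so without the bail-out
 * this would always say "stay awake". */
static bool model_sleeping_prematurely(bool with_patch4)
{
        /* If kswapd has been running too long, just sleep */
        if (with_patch4 && model_need_resched())
                return false;
        return true;    /* zones never balanced: keep running */
}

static void model_kswapd(bool with_patch4)
{
        while (model_sleeping_prematurely(with_patch4))
                loops_without_yield++;  /* "reclaim" work, no sleep */
        loops_without_yield = 0;
        sched_yield();                  /* the scheduling point */
}

int main(void)
{
        model_kswapd(true);     /* exits once the bail-out fires */
        printf("model kswapd reached its scheduling point\n");
        /* model_kswapd(false) would spin forever, like the report */
        return 0;
}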
* Re: [PATCH 4/4] mm: vmscan: If kswapd has been running too long, allow it to sleep 2011-05-17 10:38 ` Mel Gorman @ 2011-05-17 13:50 ` Colin Ian King -1 siblings, 0 replies; 119+ messages in thread From: Colin Ian King @ 2011-05-17 13:50 UTC (permalink / raw) To: Mel Gorman Cc: Minchan Kim, James Bottomley, KOSAKI Motohiro, akpm, raghu.prabhu13, jack, chris.mason, cl, penberg, riel, hannes, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Tue, 2011-05-17 at 11:38 +0100, Mel Gorman wrote: > On Tue, May 17, 2011 at 08:50:44AM +0900, Minchan Kim wrote: > > On Mon, May 16, 2011 at 7:27 PM, Mel Gorman <mgorman@suse.de> wrote: > > > On Mon, May 16, 2011 at 05:58:59PM +0900, Minchan Kim wrote: > > >> On Mon, May 16, 2011 at 5:45 PM, Mel Gorman <mgorman@suse.de> wrote: > > >> > On Mon, May 16, 2011 at 02:04:00PM +0900, Minchan Kim wrote: > > >> >> On Mon, May 16, 2011 at 1:21 PM, James Bottomley > > >> >> <James.Bottomley@hansenpartnership.com> wrote: > > >> >> > On Sun, 2011-05-15 at 19:27 +0900, KOSAKI Motohiro wrote: > > >> >> >> (2011/05/13 23:03), Mel Gorman wrote: > > >> >> >> > Under constant allocation pressure, kswapd can be in the situation where > > >> >> >> > sleeping_prematurely() will always return true even if kswapd has been > > >> >> >> > running a long time. Check if kswapd needs to be scheduled. > > >> >> >> > > > >> >> >> > Signed-off-by: Mel Gorman<mgorman@suse.de> > > >> >> >> > --- > > >> >> >> > mm/vmscan.c | 4 ++++ > > >> >> >> > 1 files changed, 4 insertions(+), 0 deletions(-) > > >> >> >> > > > >> >> >> > diff --git a/mm/vmscan.c b/mm/vmscan.c > > >> >> >> > index af24d1e..4d24828 100644 > > >> >> >> > --- a/mm/vmscan.c > > >> >> >> > +++ b/mm/vmscan.c > > >> >> >> > @@ -2251,6 +2251,10 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining, > > >> >> >> > unsigned long balanced = 0; > > >> >> >> > bool all_zones_ok = true; > > >> >> >> > > > >> >> >> > + /* If kswapd has been running too long, just sleep */ > > >> >> >> > + if (need_resched()) > > >> >> >> > + return false; > > >> >> >> > + > > >> >> >> > > >> >> >> Hmm... I don't like this patch so much. because this code does > > >> >> >> > > >> >> >> - don't sleep if kswapd got context switch at shrink_inactive_list > > >> >> > > > >> >> > This isn't entirely true: need_resched() will be false, so we'll follow > > >> >> > the normal path for determining whether to sleep or not, in effect > > >> >> > leaving the current behaviour unchanged. > > >> >> > > > >> >> >> - sleep if kswapd didn't > > >> >> > > > >> >> > This also isn't entirely true: whether need_resched() is true at this > > >> >> > point depends on a whole lot more that whether we did a context switch > > >> >> > in shrink_inactive. It mostly depends on how long we've been running > > >> >> > without giving up the CPU. Generally that will mean we've been round > > >> >> > the shrinker loop hundreds to thousands of times without sleeping. > > >> >> > > > >> >> >> It seems to be semi random behavior. > > >> >> > > > >> >> > Well, we have to do something. Chris Mason first suspected the hang was > > >> >> > a kswapd rescheduling problem a while ago. We tried putting > > >> >> > cond_rescheds() in several places in the vmscan code, but to no avail. > > >> >> > > >> >> Is it a result of test with patch of Hannes(ie, !pgdat_balanced)? > > >> >> > > >> >> If it isn't, it would be nop regardless of putting cond_reshed at vmscan.c. 
> > >> >> Because, although we complete zone balancing, kswapd doesn't sleep as > > >> >> pgdat_balance returns wrong result. And at last VM calls > > >> >> balance_pgdat. In this case, balance_pgdat returns without any work as > > >> >> kswap couldn't find zones which have not enough free pages and goto > > >> >> out. kswapd could repeat this work infinitely. So you don't have a > > >> >> chance to call cond_resched. > > >> >> > > >> >> But if your test was with Hanne's patch, I am very curious how come > > >> >> kswapd consumes CPU a lot. > > >> >> > > >> >> > The need_resched() in sleeping_prematurely() seems to be about the best > > >> >> > option. The other option might be just to put a cond_resched() in > > >> >> > kswapd_try_to_sleep(), but that will really have about the same effect. > > >> >> > > >> >> I don't oppose it but before that, I think we have to know why kswapd > > >> >> consumes CPU a lot although we applied Hannes' patch. > > >> >> > > >> > > > >> > Because it's still possible for processes to allocate pages at the same > > >> > rate kswapd is freeing them leading to a situation where kswapd does not > > >> > consider the zone balanced for prolonged periods of time. > > >> > > >> We have cond_resched in shrink_page_list, shrink_slab and balance_pgdat. > > >> So I think kswapd can be scheduled out although it's scheduled in > > >> after a short time as task scheduled also need page reclaim. Although > > >> all task in system need reclaim, kswapd cpu 99% consumption is a > > >> natural result, I think. > > >> Do I miss something? > > >> > > > > > > Lets see; > > > > > > shrink_page_list() only applies if inactive pages were isolated > > > which in turn may not happen if all_unreclaimable is set in > > > shrink_zones(). If for whatver reason, all_unreclaimable is > > > set on all zones, we can miss calling cond_resched(). > > > > > > shrink_slab only applies if we are reclaiming slab pages. If the first > > > shrinker returns -1, we do not call cond_resched(). If that > > > first shrinker is dcache and __GFP_FS is not set, direct > > > reclaimers will not shrink at all. However, if there are > > > enough of them running or if one of the other shrinkers > > > is running for a very long time, kswapd could be starved > > > acquiring the shrinker_rwsem and never reaching the > > > cond_resched(). > > > > Don't we have to move cond_resched? 
> > > > diff --git a/mm/vmscan.c b/mm/vmscan.c > > index 292582c..633e761 100644 > > --- a/mm/vmscan.c > > +++ b/mm/vmscan.c > > @@ -231,8 +231,10 @@ unsigned long shrink_slab(struct shrink_control *shrink, > > if (scanned == 0) > > scanned = SWAP_CLUSTER_MAX; > > > > - if (!down_read_trylock(&shrinker_rwsem)) > > - return 1; /* Assume we'll be able to shrink next time */ > > + if (!down_read_trylock(&shrinker_rwsem)) { > > + ret = 1; > > + goto out; /* Assume we'll be able to shrink next time */ > > + } > > > > list_for_each_entry(shrinker, &shrinker_list, list) { > > unsigned long long delta; > > @@ -280,12 +282,14 @@ unsigned long shrink_slab(struct shrink_control *shrink, > > count_vm_events(SLABS_SCANNED, this_scan); > > total_scan -= this_scan; > > > > - cond_resched(); > > } > > > > shrinker->nr += total_scan; > > + cond_resched(); > > } > > up_read(&shrinker_rwsem); > > +out: > > + cond_resched(); > > return ret; > > } > > > > This makes some sense for the exit path but if one or more of the > shrinkers takes a very long time without sleeping (extremely long > list searches for example) then kswapd will not call cond_resched() > between shrinkers and still consume a lot of CPU. > > > > > > > balance_pgdat() only calls cond_resched if the zones are not > > > balanced. For a high-order allocation that is balanced, it > > > checks order-0 again. During that window, order-0 might have > > > become unbalanced so it loops again for order-0 and returns > > > that was reclaiming for order-0 to kswapd(). It can then find > > > that a caller has rewoken kswapd for a high-order and re-enters > > > balance_pgdat() without ever have called cond_resched(). > > > > If kswapd reclaims order-o followed by high order, it would have a > > chance to call cond_resched in shrink_page_list. But if all zones are > > all_unreclaimable is set, balance_pgdat could return any work. Okay. > > It does make sense. > > By your scenario, someone wakes up kswapd with higher order, again. > > So re-enters balance_pgdat without ever have called cond_resched. > > But if someone wakes up higher order again, we can't have a chance to > > call kswapd_try_to_sleep. So your patch effect would be nop, too. > > > > It would be better to put cond_resched after balance_pgdat? > > > > Which will leave kswapd runnable instead of going to sleep but > guarantees a scheduling point. Lets see if the problem is that > cond_resched is being missed although if this was the case then patch > 4 would truly be a no-op but Colin has already reported that patch 1 on > its own didn't fix his problem. If the problem is sandybridge-specific > where kswapd remains runnable and consuming large amounts of CPU in > turbo mode then we know that there are other cond_resched() decisions > that will need to be revisited. > > Colin or James, would you be willing to test with patch 1 from this > series and Minchan's patch below? Thanks. This works fine. I ran 250 test cycles over about 2 hours. > > > diff --git a/mm/vmscan.c b/mm/vmscan.c > > index 292582c..61c45d0 100644 > > --- a/mm/vmscan.c > > +++ b/mm/vmscan.c > > @@ -2753,6 +2753,7 @@ static int kswapd(void *p) > > if (!ret) { > > trace_mm_vmscan_kswapd_wake(pgdat->node_id, order); > > order = balance_pgdat(pgdat, order, &classzone_idx); > > + cond_resched(); > > } > > } > > return 0; > > > > > > > > While it appears unlikely, there are bad conditions which can result > > > in cond_resched() being avoided.
> > > > > > > > -- > > > Mel Gorman > > > SUSE Labs > > > > > > > > > > > -- > > Kind regards, > > Minchan Kim > ^ permalink raw reply [flat|nested] 119+ messages in thread
* [PATCH] mm: vmscan: Correctly check if reclaimer should schedule during shrink_slab 2011-05-17 13:50 ` Colin Ian King @ 2011-05-17 16:15 ` Mel Gorman -1 siblings, 0 replies; 119+ messages in thread From: Mel Gorman @ 2011-05-17 16:15 UTC (permalink / raw) To: akpm Cc: Minchan Kim, Colin Ian King, James Bottomley, KOSAKI Motohiro, akpm, raghu.prabhu13, jack, chris.mason, cl, penberg, riel, hannes, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 It has been reported on some laptops that kswapd is consuming large amounts of CPU and not being scheduled when SLUB is enabled during large amounts of file copying. It is expected that this is due to kswapd missing every cond_resched() point because: shrink_page_list() calls cond_resched() if inactive pages were isolated which in turn may not happen if all_unreclaimable is set in shrink_zones(). If, for whatever reason, all_unreclaimable is set on all zones, we can miss calling cond_resched(). balance_pgdat() only calls cond_resched if the zones are not balanced. For a high-order allocation that is balanced, it checks order-0 again. During that window, order-0 might have become unbalanced so it loops again for order-0 and returns that it was reclaiming for order-0 to kswapd(). It can then find that a caller has rewoken kswapd for a high-order and re-enters balance_pgdat() without ever calling cond_resched(). shrink_slab only calls cond_resched() if we are reclaiming slab pages. If there are a large number of direct reclaimers, the shrinker_rwsem can be contended and prevent kswapd from calling cond_resched(). This patch modifies the shrink_slab() case. If the semaphore is contended, the caller will still check cond_resched(). After each successful call into a shrinker, the check for cond_resched() is still necessary in case one shrinker call is particularly slow. This patch replaces mm-vmscan-if-kswapd-has-been-running-too-long-allow-it-to-sleep.patch in -mm. [mgorman@suse.de: Preserve call to cond_resched after each call into shrinker] From: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Mel Gorman <mgorman@suse.de> --- mm/vmscan.c | 9 +++++++-- 1 files changed, 7 insertions(+), 2 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index af24d1e..0bed248 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -230,8 +230,11 @@ unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask, if (scanned == 0) scanned = SWAP_CLUSTER_MAX; - if (!down_read_trylock(&shrinker_rwsem)) - return 1; /* Assume we'll be able to shrink next time */ + if (!down_read_trylock(&shrinker_rwsem)) { + /* Assume we'll be able to shrink next time */ + ret = 1; + goto out; + } list_for_each_entry(shrinker, &shrinker_list, list) { unsigned long long delta; @@ -282,6 +285,8 @@ unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask, shrinker->nr += total_scan; } up_read(&shrinker_rwsem); +out: + cond_resched(); return ret; } ^ permalink raw reply related [flat|nested] 119+ messages in thread
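[Editorial aside: the structural point of the patch, independent of the kernel specifics, is that both the trylock-failure path and the normal path now funnel through a single exit containing the scheduling point, while a scheduling point is also kept after each potentially slow shrinker call. A rough standalone C sketch of the same shape, with pthread_rwlock_tryrdlock() and sched_yield() as invented stand-ins for down_read_trylock() and cond_resched() and the shrinker list faked, might look like this.]

/* Userspace sketch of the shrink_slab() restructuring above; the
 * lock, the "shrinkers" and the yields are stand-ins, NOT the kernel
 * implementation. Every return path passes the scheduling point. */
#include <pthread.h>
#include <sched.h>

static pthread_rwlock_t model_shrinker_rwsem = PTHREAD_RWLOCK_INITIALIZER;

static unsigned long model_shrink_slab(void)
{
        unsigned long ret = 0;
        int i;

        if (pthread_rwlock_tryrdlock(&model_shrinker_rwsem) != 0) {
                /* Assume we'll be able to shrink next time */
                ret = 1;
                goto out;
        }

        for (i = 0; i < 3; i++) {
                /* ... one shrinker call, possibly very slow ... */
                sched_yield();  /* keep a point after each shrinker */
        }

        pthread_rwlock_unlock(&model_shrinker_rwsem);
out:
        sched_yield();          /* reached on every exit path */
        return ret;
}

int main(void)
{
        return (int)model_shrink_slab();
}

[The detail that matters is the common out: label: a caller looping on this function can no longer starve other tasks of the CPU just because the trylock keeps failing.]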
* Re: [PATCH] mm: vmscan: Correctly check if reclaimer should schedule during shrink_slab 2011-05-17 16:15 ` Mel Gorman @ 2011-05-18 0:45 ` KOSAKI Motohiro -1 siblings, 0 replies; 119+ messages in thread From: KOSAKI Motohiro @ 2011-05-18 0:45 UTC (permalink / raw) To: mgorman Cc: akpm, minchan.kim, colin.king, James.Bottomley, raghu.prabhu13, jack, chris.mason, cl, penberg, riel, hannes, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 (2011/05/18 1:15), Mel Gorman wrote: > It has been reported on some laptops that kswapd is consuming large > amounts of CPU and not being scheduled when SLUB is enabled during > large amounts of file copying. It is expected that this is due to > kswapd missing every cond_resched() point because; > > shrink_page_list() calls cond_resched() if inactive pages were isolated > which in turn may not happen if all_unreclaimable is set in > shrink_zones(). If for whatver reason, all_unreclaimable is > set on all zones, we can miss calling cond_resched(). > > balance_pgdat() only calls cond_resched if the zones are not > balanced. For a high-order allocation that is balanced, it > checks order-0 again. During that window, order-0 might have > become unbalanced so it loops again for order-0 and returns > that it was reclaiming for order-0 to kswapd(). It can then > find that a caller has rewoken kswapd for a high-order and > re-enters balance_pgdat() without ever calling cond_resched(). > > shrink_slab only calls cond_resched() if we are reclaiming slab > pages. If there are a large number of direct reclaimers, the > shrinker_rwsem can be contended and prevent kswapd calling > cond_resched(). > > This patch modifies the shrink_slab() case. If the semaphore is > contended, the caller will still check cond_resched(). After each > successful call into a shrinker, the check for cond_resched() is > still necessary in case one shrinker call is particularly slow. > > This patch replaces > mm-vmscan-if-kswapd-has-been-running-too-long-allow-it-to-sleep.patch > in -mm. > > [mgorman@suse.de: Preserve call to cond_resched after each call into shrinker] > From: Minchan Kim<minchan.kim@gmail.com> > Signed-off-by: Mel Gorman<mgorman@suse.de> Looks good to me. Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH] mm: vmscan: Correctly check if reclaimer should schedule during shrink_slab 2011-05-17 16:15 ` Mel Gorman (?) @ 2011-05-19 0:03 ` Minchan Kim -1 siblings, 0 replies; 119+ messages in thread From: Minchan Kim @ 2011-05-19 0:03 UTC (permalink / raw) To: Mel Gorman Cc: akpm, Colin Ian King, James Bottomley, KOSAKI Motohiro, raghu.prabhu13, jack, chris.mason, cl, penberg, riel, hannes, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Wed, May 18, 2011 at 1:15 AM, Mel Gorman <mgorman@suse.de> wrote: > It has been reported on some laptops that kswapd is consuming large > amounts of CPU and not being scheduled when SLUB is enabled during > large amounts of file copying. It is expected that this is due to > kswapd missing every cond_resched() point because; > > shrink_page_list() calls cond_resched() if inactive pages were isolated > which in turn may not happen if all_unreclaimable is set in > shrink_zones(). If for whatver reason, all_unreclaimable is > set on all zones, we can miss calling cond_resched(). > > balance_pgdat() only calls cond_resched if the zones are not > balanced. For a high-order allocation that is balanced, it > checks order-0 again. During that window, order-0 might have > become unbalanced so it loops again for order-0 and returns > that it was reclaiming for order-0 to kswapd(). It can then > find that a caller has rewoken kswapd for a high-order and > re-enters balance_pgdat() without ever calling cond_resched(). > > shrink_slab only calls cond_resched() if we are reclaiming slab > pages. If there are a large number of direct reclaimers, the > shrinker_rwsem can be contended and prevent kswapd calling > cond_resched(). > > This patch modifies the shrink_slab() case. If the semaphore is > contended, the caller will still check cond_resched(). After each > successful call into a shrinker, the check for cond_resched() is > still necessary in case one shrinker call is particularly slow. > > This patch replaces > mm-vmscan-if-kswapd-has-been-running-too-long-allow-it-to-sleep.patch > in -mm. > > [mgorman@suse.de: Preserve call to cond_resched after each call into shrinker] > From: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Minchan Kim <minchan.kim@gmail.com> > Signed-off-by: Mel Gorman <mgorman@suse.de> -- Kind regards, Minchan Kim ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH] mm: vmscan: Correctly check if reclaimer should schedule during shrink_slab @ 2011-05-19 0:03 ` Minchan Kim 0 siblings, 0 replies; 119+ messages in thread From: Minchan Kim @ 2011-05-19 0:03 UTC (permalink / raw) To: Mel Gorman Cc: akpm, Colin Ian King, James Bottomley, KOSAKI Motohiro, raghu.prabhu13, jack, chris.mason, cl, penberg, riel, hannes, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Wed, May 18, 2011 at 1:15 AM, Mel Gorman <mgorman@suse.de> wrote: > It has been reported on some laptops that kswapd is consuming large > amounts of CPU and not being scheduled when SLUB is enabled during > large amounts of file copying. It is expected that this is due to > kswapd missing every cond_resched() point because; > > shrink_page_list() calls cond_resched() if inactive pages were isolated > which in turn may not happen if all_unreclaimable is set in > shrink_zones(). If for whatver reason, all_unreclaimable is > set on all zones, we can miss calling cond_resched(). > > balance_pgdat() only calls cond_resched if the zones are not > balanced. For a high-order allocation that is balanced, it > checks order-0 again. During that window, order-0 might have > become unbalanced so it loops again for order-0 and returns > that it was reclaiming for order-0 to kswapd(). It can then > find that a caller has rewoken kswapd for a high-order and > re-enters balance_pgdat() without ever calling cond_resched(). > > shrink_slab only calls cond_resched() if we are reclaiming slab > pages. If there are a large number of direct reclaimers, the > shrinker_rwsem can be contended and prevent kswapd calling > cond_resched(). > > This patch modifies the shrink_slab() case. If the semaphore is > contended, the caller will still check cond_resched(). After each > successful call into a shrinker, the check for cond_resched() is > still necessary in case one shrinker call is particularly slow. > > This patch replaces > mm-vmscan-if-kswapd-has-been-running-too-long-allow-it-to-sleep.patch > in -mm. > > [mgorman@suse.de: Preserve call to cond_resched after each call into shrinker] > From: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Minchan Kim <minchan.kim@gmail.com> > Signed-off-by: Mel Gorman <mgorman@suse.de> -- Kind regards, Minchan Kim -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH] mm: vmscan: Correctly check if reclaimer should schedule during shrink_slab 2011-05-17 16:15 ` Mel Gorman @ 2011-05-19 0:09 ` Minchan Kim 0 siblings, 0 replies; 119+ messages in thread From: Minchan Kim @ 2011-05-19 0:09 UTC (permalink / raw) To: Mel Gorman, Colin Ian King Cc: akpm, James Bottomley, KOSAKI Motohiro, raghu.prabhu13, jack, chris.mason, cl, penberg, riel, hannes, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 Hi Colin. Sorry for bothering you. :( I hope this test is the last. We (Mel, KOSAKI and I) have finalized our opinion. Could you test the patch below with patch [1/4] of Mel's series (i.e. the !pgdat_balanced check in sleeping_prematurely)? If it is successful, we will try to merge this version instead of the various cond_resched() sprinkling versions. On Wed, May 18, 2011 at 1:15 AM, Mel Gorman <mgorman@suse.de> wrote: > It has been reported on some laptops that kswapd is consuming large > amounts of CPU and not being scheduled when SLUB is enabled during > large amounts of file copying. It is expected that this is due to > kswapd missing every cond_resched() point because: > > shrink_page_list() calls cond_resched() if inactive pages were isolated > which in turn may not happen if all_unreclaimable is set in > shrink_zones(). If for whatever reason, all_unreclaimable is > set on all zones, we can miss calling cond_resched(). > > balance_pgdat() only calls cond_resched() if the zones are not > balanced. For a high-order allocation that is balanced, it > checks order-0 again. During that window, order-0 might have > become unbalanced so it loops again for order-0 and returns > that it was reclaiming for order-0 to kswapd(). It can then > find that a caller has re-woken kswapd for a high-order and > re-enters balance_pgdat() without ever calling cond_resched(). > > shrink_slab() only calls cond_resched() if we are reclaiming slab > pages. If there are a large number of direct reclaimers, the > shrinker_rwsem can be contended and prevent kswapd calling > cond_resched(). > > This patch modifies the shrink_slab() case. If the semaphore is > contended, the caller will still check cond_resched(). After each > successful call into a shrinker, the check for cond_resched() is > still necessary in case one shrinker call is particularly slow. > > This patch replaces > mm-vmscan-if-kswapd-has-been-running-too-long-allow-it-to-sleep.patch > in -mm. > > [mgorman@suse.de: Preserve call to cond_resched after each call into shrinker] > From: Minchan Kim <minchan.kim@gmail.com> > Signed-off-by: Mel Gorman <mgorman@suse.de> > --- > mm/vmscan.c | 9 +++++++-- > 1 files changed, 7 insertions(+), 2 deletions(-) > > diff --git a/mm/vmscan.c b/mm/vmscan.c > index af24d1e..0bed248 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -230,8 +230,11 @@ unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask, > if (scanned == 0) > scanned = SWAP_CLUSTER_MAX; > > - if (!down_read_trylock(&shrinker_rwsem)) > - return 1; /* Assume we'll be able to shrink next time */ > + if (!down_read_trylock(&shrinker_rwsem)) { > + /* Assume we'll be able to shrink next time */ > + ret = 1; > + goto out; > + } > > list_for_each_entry(shrinker, &shrinker_list, list) { > unsigned long long delta; > @@ -282,6 +285,8 @@ unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask, > shrinker->nr += total_scan; > } > up_read(&shrinker_rwsem); > +out: > + cond_resched(); > return ret; > } > > -- Kind regards, Minchan Kim ^ permalink raw reply [flat|nested] 119+ messages in thread
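For readers applying the hunks above, this is roughly the shape shrink_slab() takes once they are in place. It is a simplified sketch rather than the verbatim result: the per-shrinker delta/total_scan bookkeeping is elided, and run_one_shrinker() is an invented stand-in for the shrinker callback loop.

	unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask,
				  unsigned long lru_pages)
	{
		struct shrinker *shrinker;
		unsigned long ret = 0;

		if (scanned == 0)
			scanned = SWAP_CLUSTER_MAX;

		if (!down_read_trylock(&shrinker_rwsem)) {
			/* Assume we'll be able to shrink next time */
			ret = 1;
			goto out;
		}

		list_for_each_entry(shrinker, &shrinker_list, list) {
			/* delta/total_scan bookkeeping elided */
			ret += run_one_shrinker(shrinker, scanned, lru_pages);
			/* reschedule in case one shrinker call was very slow */
			cond_resched();
		}
		up_read(&shrinker_rwsem);
	out:
		/* the contended-rwsem path now reaches a scheduling point too */
		cond_resched();
		return ret;
	}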
* Re: [PATCH] mm: vmscan: Correctly check if reclaimer should schedule during shrink_slab 2011-05-19 0:09 ` Minchan Kim @ 2011-05-19 11:36 ` Colin Ian King 0 siblings, 0 replies; 119+ messages in thread From: Colin Ian King @ 2011-05-19 11:36 UTC (permalink / raw) To: Minchan Kim Cc: Mel Gorman, akpm, James Bottomley, KOSAKI Motohiro, raghu.prabhu13, jack, chris.mason, cl, penberg, riel, hannes, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Thu, 2011-05-19 at 09:09 +0900, Minchan Kim wrote: > Hi Colin. > > Sorry for bothering you. :( No problem at all, I'm very happy to re-test. > I hope this test is the last. > > We (Mel, KOSAKI and I) have finalized our opinion. > > Could you test the patch below with patch [1/4] of Mel's series (i.e. > the !pgdat_balanced check in sleeping_prematurely)? > If it is successful, we will try to merge this version instead of > the various cond_resched() sprinkling versions. Tested with the patch below + patch [1/4] of Mel's series. 300 cycles, 2.5 hrs of soak testing: works OK. Colin > > > On Wed, May 18, 2011 at 1:15 AM, Mel Gorman <mgorman@suse.de> wrote: > > It has been reported on some laptops that kswapd is consuming large > > amounts of CPU and not being scheduled when SLUB is enabled during > > large amounts of file copying. It is expected that this is due to > > kswapd missing every cond_resched() point because: > > > > shrink_page_list() calls cond_resched() if inactive pages were isolated > > which in turn may not happen if all_unreclaimable is set in > > shrink_zones(). If for whatever reason, all_unreclaimable is > > set on all zones, we can miss calling cond_resched(). > > > > balance_pgdat() only calls cond_resched() if the zones are not > > balanced. For a high-order allocation that is balanced, it > > checks order-0 again. During that window, order-0 might have > > become unbalanced so it loops again for order-0 and returns > > that it was reclaiming for order-0 to kswapd(). It can then > > find that a caller has re-woken kswapd for a high-order and > > re-enters balance_pgdat() without ever calling cond_resched(). > > > > shrink_slab() only calls cond_resched() if we are reclaiming slab > > pages. If there are a large number of direct reclaimers, the > > shrinker_rwsem can be contended and prevent kswapd calling > > cond_resched(). > > > > This patch modifies the shrink_slab() case. If the semaphore is > > contended, the caller will still check cond_resched(). After each > > successful call into a shrinker, the check for cond_resched() is > > still necessary in case one shrinker call is particularly slow. > > > > This patch replaces > > mm-vmscan-if-kswapd-has-been-running-too-long-allow-it-to-sleep.patch > > in -mm.
> > > > [mgorman@suse.de: Preserve call to cond_resched after each call into shrinker] > > From: Minchan Kim <minchan.kim@gmail.com> > > Signed-off-by: Mel Gorman <mgorman@suse.de> > > --- > > mm/vmscan.c | 9 +++++++-- > > 1 files changed, 7 insertions(+), 2 deletions(-) > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c > > index af24d1e..0bed248 100644 > > --- a/mm/vmscan.c > > +++ b/mm/vmscan.c > > @@ -230,8 +230,11 @@ unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask, > > if (scanned == 0) > > scanned = SWAP_CLUSTER_MAX; > > > > - if (!down_read_trylock(&shrinker_rwsem)) > > - return 1; /* Assume we'll be able to shrink next time */ > > + if (!down_read_trylock(&shrinker_rwsem)) { > > + /* Assume we'll be able to shrink next time */ > > + ret = 1; > > + goto out; > > + } > > > > list_for_each_entry(shrinker, &shrinker_list, list) { > > unsigned long long delta; > > @@ -282,6 +285,8 @@ unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask, > > shrinker->nr += total_scan; > > } > > up_read(&shrinker_rwsem); > > +out: > > + cond_resched(); > > return ret; > > } > > > > > > > ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH] mm: vmscan: Correctly check if reclaimer should schedule during shrink_slab 2011-05-19 11:36 ` Colin Ian King @ 2011-05-20 0:06 ` Minchan Kim 0 siblings, 0 replies; 119+ messages in thread From: Minchan Kim @ 2011-05-20 0:06 UTC (permalink / raw) To: Colin Ian King Cc: Mel Gorman, akpm, James Bottomley, KOSAKI Motohiro, raghu.prabhu13, jack, chris.mason, cl, penberg, riel, hannes, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Thu, May 19, 2011 at 8:36 PM, Colin Ian King <colin.king@canonical.com> wrote: > On Thu, 2011-05-19 at 09:09 +0900, Minchan Kim wrote: >> Hi Colin. >> >> Sorry for bothering you. :( > > No problem at all, I'm very happy to re-test. > >> I hope this test is the last. >> >> We (Mel, KOSAKI and I) have finalized our opinion. >> >> Could you test the patch below with patch [1/4] of Mel's series (i.e. >> the !pgdat_balanced check in sleeping_prematurely)? >> If it is successful, we will try to merge this version instead of >> the various cond_resched() sprinkling versions. > > Tested with the patch below + patch [1/4] of Mel's series. 300 cycles, > 2.5 hrs of soak testing: works OK. > > Colin Thanks, Colin. We are approaching the conclusion thanks to your help. :) Mel, KOSAKI. I will ask Andrew Lutomirski to test. If he doesn't hit a problem, then let's go. -- Kind regards, Minchan Kim ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH 4/4] mm: vmscan: If kswapd has been running too long, allow it to sleep 2011-05-17 13:50 ` Colin Ian King @ 2011-05-18 4:19 ` Minchan Kim 0 siblings, 0 replies; 119+ messages in thread From: Minchan Kim @ 2011-05-18 4:19 UTC (permalink / raw) To: Colin Ian King Cc: Mel Gorman, James Bottomley, KOSAKI Motohiro, akpm, raghu.prabhu13, jack, chris.mason, cl, penberg, riel, hannes, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 Hello Colin, On Tue, May 17, 2011 at 10:50 PM, Colin Ian King <colin.king@canonical.com> wrote: > On Tue, 2011-05-17 at 11:38 +0100, Mel Gorman wrote: >> On Tue, May 17, 2011 at 08:50:44AM +0900, Minchan Kim wrote: >> > On Mon, May 16, 2011 at 7:27 PM, Mel Gorman <mgorman@suse.de> wrote: >> > > On Mon, May 16, 2011 at 05:58:59PM +0900, Minchan Kim wrote: >> > >> On Mon, May 16, 2011 at 5:45 PM, Mel Gorman <mgorman@suse.de> wrote: >> > >> > On Mon, May 16, 2011 at 02:04:00PM +0900, Minchan Kim wrote: >> > >> >> On Mon, May 16, 2011 at 1:21 PM, James Bottomley >> > >> >> <James.Bottomley@hansenpartnership.com> wrote: >> > >> >> > On Sun, 2011-05-15 at 19:27 +0900, KOSAKI Motohiro wrote: >> > >> >> >> (2011/05/13 23:03), Mel Gorman wrote: >> > >> >> >> > Under constant allocation pressure, kswapd can be in the situation where >> > >> >> >> > sleeping_prematurely() will always return true even if kswapd has been >> > >> >> >> > running a long time. Check if kswapd needs to be scheduled. >> > >> >> >> > >> > >> >> >> > Signed-off-by: Mel Gorman <mgorman@suse.de> >> > >> >> >> > --- >> > >> >> >> > mm/vmscan.c | 4 ++++ >> > >> >> >> > 1 files changed, 4 insertions(+), 0 deletions(-) >> > >> >> >> > >> > >> >> >> > diff --git a/mm/vmscan.c b/mm/vmscan.c >> > >> >> >> > index af24d1e..4d24828 100644 >> > >> >> >> > --- a/mm/vmscan.c >> > >> >> >> > +++ b/mm/vmscan.c >> > >> >> >> > @@ -2251,6 +2251,10 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining, >> > >> >> >> > unsigned long balanced = 0; >> > >> >> >> > bool all_zones_ok = true; >> > >> >> >> > >> > >> >> >> > + /* If kswapd has been running too long, just sleep */ >> > >> >> >> > + if (need_resched()) >> > >> >> >> > + return false; >> > >> >> >> > + >> > >> >> >> >> > >> >> >> Hmm... I don't like this patch so much, because this code does >> > >> >> >> >> > >> >> >> - don't sleep if kswapd got a context switch at shrink_inactive_list >> > >> >> > >> > >> >> > This isn't entirely true: need_resched() will be false, so we'll follow >> > >> >> > the normal path for determining whether to sleep or not, in effect >> > >> >> > leaving the current behaviour unchanged. >> > >> >> > >> > >> >> >> - sleep if kswapd didn't >> > >> >> > >> > >> >> > This also isn't entirely true: whether need_resched() is true at this >> > >> >> > point depends on a whole lot more than whether we did a context switch >> > >> >> > in shrink_inactive. It mostly depends on how long we've been running >> > >> >> > without giving up the CPU. Generally that will mean we've been round >> > >> >> > the shrinker loop hundreds to thousands of times without sleeping. >> > >> >> > >> > >> >> >> It seems to be semi-random behavior. >> > >> >> > >> > >> >> > Well, we have to do something. Chris Mason first suspected the hang was >> > >> >> > a kswapd rescheduling problem a while ago. We tried putting >> > >> >> > cond_rescheds() in several places in the vmscan code, but to no avail. >> > >> >> >> > >> >> Is it a result of a test with Hannes' patch (i.e. !pgdat_balanced)?
>> > >> >> >> > >> >> If it isn't, it would be a nop regardless of putting cond_resched() in vmscan.c. >> > >> >> Because, although we complete zone balancing, kswapd doesn't sleep as >> > >> >> pgdat_balanced returns the wrong result. And at last the VM calls >> > >> >> balance_pgdat. In this case, balance_pgdat returns without any work as >> > >> >> kswapd couldn't find zones which have not enough free pages and goes to >> > >> >> out. kswapd could repeat this work infinitely. So you don't have a >> > >> >> chance to call cond_resched. >> > >> >> >> > >> >> But if your test was with Hannes' patch, I am very curious how come >> > >> >> kswapd consumes CPU a lot. >> > >> >> >> > >> >> > The need_resched() in sleeping_prematurely() seems to be about the best >> > >> >> > option. The other option might be just to put a cond_resched() in >> > >> >> > kswapd_try_to_sleep(), but that will really have about the same effect. >> > >> >> >> > >> >> I don't oppose it but before that, I think we have to know why kswapd >> > >> >> consumes CPU a lot although we applied Hannes' patch. >> > >> >> >> > >> > >> > >> > Because it's still possible for processes to allocate pages at the same >> > >> > rate kswapd is freeing them, leading to a situation where kswapd does not >> > >> > consider the zone balanced for prolonged periods of time. >> > >> >> > >> We have cond_resched in shrink_page_list, shrink_slab and balance_pgdat. >> > >> So I think kswapd can be scheduled out although it's scheduled in >> > >> after a short time as the tasks scheduled also need page reclaim. Although, >> > >> if all tasks in the system need reclaim, kswapd's 99% CPU consumption is a >> > >> natural result, I think. >> > >> Do I miss something? >> > >> >> > > >> > > Let's see; >> > > >> > > shrink_page_list() only applies if inactive pages were isolated >> > > which in turn may not happen if all_unreclaimable is set in >> > > shrink_zones(). If for whatever reason, all_unreclaimable is >> > > set on all zones, we can miss calling cond_resched(). >> > > >> > > shrink_slab() only applies if we are reclaiming slab pages. If the first >> > > shrinker returns -1, we do not call cond_resched(). If that >> > > first shrinker is dcache and __GFP_FS is not set, direct >> > > reclaimers will not shrink at all. However, if there are >> > > enough of them running or if one of the other shrinkers >> > > is running for a very long time, kswapd could be starved >> > > acquiring the shrinker_rwsem and never reaching the >> > > cond_resched(). >> > >> > Don't we have to move cond_resched?
>> > >> > diff --git a/mm/vmscan.c b/mm/vmscan.c >> > index 292582c..633e761 100644 >> > --- a/mm/vmscan.c >> > +++ b/mm/vmscan.c >> > @@ -231,8 +231,10 @@ unsigned long shrink_slab(struct shrink_control *shrink, >> > if (scanned == 0) >> > scanned = SWAP_CLUSTER_MAX; >> > >> > - if (!down_read_trylock(&shrinker_rwsem)) >> > - return 1; /* Assume we'll be able to shrink next time */ >> > + if (!down_read_trylock(&shrinker_rwsem)) { >> > + ret = 1; >> > + goto out; /* Assume we'll be able to shrink next time */ >> > + } >> > >> > list_for_each_entry(shrinker, &shrinker_list, list) { >> > unsigned long long delta; >> > @@ -280,12 +282,14 @@ unsigned long shrink_slab(struct shrink_control *shrink, >> > count_vm_events(SLABS_SCANNED, this_scan); >> > total_scan -= this_scan; >> > >> > - cond_resched(); >> > } >> > >> > shrinker->nr += total_scan; >> > + cond_resched(); >> > } >> > up_read(&shrinker_rwsem); >> > +out: >> > + cond_resched(); >> > return ret; >> > } >> > >> >> This makes some sense for the exit path but if one or more of the >> shrinkers takes a very long time without sleeping (extremely long >> list searches for example) then kswapd will not call cond_resched() >> between shrinkers and still consume a lot of CPU. >> >> > > >> > > balance_pgdat() only calls cond_resched() if the zones are not >> > > balanced. For a high-order allocation that is balanced, it >> > > checks order-0 again. During that window, order-0 might have >> > > become unbalanced so it loops again for order-0 and returns >> > > that it was reclaiming for order-0 to kswapd(). It can then find >> > > that a caller has re-woken kswapd for a high-order and re-enters >> > > balance_pgdat() without ever having called cond_resched(). >> > >> > If kswapd reclaims order-0 followed by high order, it would have a >> > chance to call cond_resched in shrink_page_list. But if all_unreclaimable >> > is set on all zones, balance_pgdat could return without doing any work. Okay. >> > It does make sense. >> > By your scenario, someone wakes up kswapd with a higher order, again. >> > So it re-enters balance_pgdat without ever having called cond_resched. >> > But if someone wakes it up with a higher order again, we don't have a chance to >> > call kswapd_try_to_sleep. So your patch's effect would be a nop, too. >> > >> > Would it be better to put cond_resched after balance_pgdat? >> > >> >> Which will leave kswapd runnable instead of going to sleep but >> guarantees a scheduling point. Let's see if the problem is that >> cond_resched() is being missed although if this was the case then patch >> 4 would truly be a no-op but Colin has already reported that patch 1 on >> its own didn't fix his problem. If the problem is sandybridge-specific >> where kswapd remains runnable and consuming large amounts of CPU in >> turbo mode then we know that there are other cond_resched() decisions >> that will need to be revisited. >> >> Colin or James, would you be willing to test with patch 1 from this >> series and Minchan's patch below? Thanks. > > This works OK. Ran 250 test cycles for about 2 hours. Thanks for the testing! I would like to know the exact patch you applied. My modifications inserting cond_resched() are two: 1) the shrink_slab function 2) kswapd, right after balance_pgdat. 1) or 2)? Or both? Thanks -- Kind regards, Minchan Kim ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH 4/4] mm: vmscan: If kswapd has been running too long, allow it to sleep 2011-05-18 4:19 ` Minchan Kim @ 2011-05-18 7:39 ` Colin Ian King 0 siblings, 0 replies; 119+ messages in thread From: Colin Ian King @ 2011-05-18 7:39 UTC (permalink / raw) To: Minchan Kim Cc: Mel Gorman, James Bottomley, KOSAKI Motohiro, akpm, raghu.prabhu13, jack, chris.mason, cl, penberg, riel, hannes, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Wed, 2011-05-18 at 13:19 +0900, Minchan Kim wrote: > Hello Colin, > > On Tue, May 17, 2011 at 10:50 PM, Colin Ian King > <colin.king@canonical.com> wrote: > > On Tue, 2011-05-17 at 11:38 +0100, Mel Gorman wrote: > >> On Tue, May 17, 2011 at 08:50:44AM +0900, Minchan Kim wrote: > >> > On Mon, May 16, 2011 at 7:27 PM, Mel Gorman <mgorman@suse.de> wrote: > >> > > On Mon, May 16, 2011 at 05:58:59PM +0900, Minchan Kim wrote: > >> > >> On Mon, May 16, 2011 at 5:45 PM, Mel Gorman <mgorman@suse.de> wrote: > >> > >> > On Mon, May 16, 2011 at 02:04:00PM +0900, Minchan Kim wrote: > >> > >> >> On Mon, May 16, 2011 at 1:21 PM, James Bottomley > >> > >> >> <James.Bottomley@hansenpartnership.com> wrote: > >> > >> >> > On Sun, 2011-05-15 at 19:27 +0900, KOSAKI Motohiro wrote: > >> > >> >> >> (2011/05/13 23:03), Mel Gorman wrote: > >> > >> >> >> > Under constant allocation pressure, kswapd can be in the situation where > >> > >> >> >> > sleeping_prematurely() will always return true even if kswapd has been > >> > >> >> >> > running a long time. Check if kswapd needs to be scheduled. > >> > >> >> >> > > >> > >> >> >> > Signed-off-by: Mel Gorman <mgorman@suse.de> > >> > >> >> >> > --- > >> > >> >> >> > mm/vmscan.c | 4 ++++ > >> > >> >> >> > 1 files changed, 4 insertions(+), 0 deletions(-) > >> > >> >> >> > > >> > >> >> >> > diff --git a/mm/vmscan.c b/mm/vmscan.c > >> > >> >> >> > index af24d1e..4d24828 100644 > >> > >> >> >> > --- a/mm/vmscan.c > >> > >> >> >> > +++ b/mm/vmscan.c > >> > >> >> >> > @@ -2251,6 +2251,10 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining, > >> > >> >> >> > unsigned long balanced = 0; > >> > >> >> >> > bool all_zones_ok = true; > >> > >> >> >> > > >> > >> >> >> > + /* If kswapd has been running too long, just sleep */ > >> > >> >> >> > + if (need_resched()) > >> > >> >> >> > + return false; > >> > >> >> >> > + > >> > >> >> >> > >> > >> >> >> Hmm... I don't like this patch so much, because this code does > >> > >> >> >> > >> > >> >> >> - don't sleep if kswapd got a context switch at shrink_inactive_list > >> > >> >> > > >> > >> >> > This isn't entirely true: need_resched() will be false, so we'll follow > >> > >> >> > the normal path for determining whether to sleep or not, in effect > >> > >> >> > leaving the current behaviour unchanged. > >> > >> >> > > >> > >> >> >> - sleep if kswapd didn't > >> > >> >> > > >> > >> >> > This also isn't entirely true: whether need_resched() is true at this > >> > >> >> > point depends on a whole lot more than whether we did a context switch > >> > >> >> > in shrink_inactive. It mostly depends on how long we've been running > >> > >> >> > without giving up the CPU. Generally that will mean we've been round > >> > >> >> > the shrinker loop hundreds to thousands of times without sleeping. > >> > >> >> > > >> > >> >> >> It seems to be semi-random behavior. > >> > >> >> > > >> > >> >> > Well, we have to do something. Chris Mason first suspected the hang was > >> > >> >> > a kswapd rescheduling problem a while ago.
We tried putting > >> > >> >> > cond_rescheds() in several places in the vmscan code, but to no avail. > >> > >> >> > >> > >> >> Is it a result of a test with Hannes' patch (i.e. !pgdat_balanced)? > >> > >> >> > >> > >> >> If it isn't, it would be a nop regardless of putting cond_resched() in vmscan.c. > >> > >> >> Because, although we complete zone balancing, kswapd doesn't sleep as > >> > >> >> pgdat_balanced returns the wrong result. And at last the VM calls > >> > >> >> balance_pgdat. In this case, balance_pgdat returns without any work as > >> > >> >> kswapd couldn't find zones which have not enough free pages and goes to > >> > >> >> out. kswapd could repeat this work infinitely. So you don't have a > >> > >> >> chance to call cond_resched. > >> > >> >> > >> > >> >> But if your test was with Hannes' patch, I am very curious how come > >> > >> >> kswapd consumes CPU a lot. > >> > >> >> > >> > >> >> > The need_resched() in sleeping_prematurely() seems to be about the best > >> > >> >> > option. The other option might be just to put a cond_resched() in > >> > >> >> > kswapd_try_to_sleep(), but that will really have about the same effect. > >> > >> >> > >> > >> >> I don't oppose it but before that, I think we have to know why kswapd > >> > >> >> consumes CPU a lot although we applied Hannes' patch. > >> > >> >> > >> > >> > > >> > >> > Because it's still possible for processes to allocate pages at the same > >> > >> > rate kswapd is freeing them, leading to a situation where kswapd does not > >> > >> > consider the zone balanced for prolonged periods of time. > >> > >> > >> > >> We have cond_resched in shrink_page_list, shrink_slab and balance_pgdat. > >> > >> So I think kswapd can be scheduled out although it's scheduled in > >> > >> after a short time as the tasks scheduled also need page reclaim. Although, > >> > >> if all tasks in the system need reclaim, kswapd's 99% CPU consumption is a > >> > >> natural result, I think. > >> > >> Do I miss something? > >> > >> > >> > > > >> > > Let's see; > >> > > > >> > > shrink_page_list() only applies if inactive pages were isolated > >> > > which in turn may not happen if all_unreclaimable is set in > >> > > shrink_zones(). If for whatever reason, all_unreclaimable is > >> > > set on all zones, we can miss calling cond_resched(). > >> > > > >> > > shrink_slab() only applies if we are reclaiming slab pages. If the first > >> > > shrinker returns -1, we do not call cond_resched(). If that > >> > > first shrinker is dcache and __GFP_FS is not set, direct > >> > > reclaimers will not shrink at all. However, if there are > >> > > enough of them running or if one of the other shrinkers > >> > > is running for a very long time, kswapd could be starved > >> > > acquiring the shrinker_rwsem and never reaching the > >> > > cond_resched(). > >> > > >> > Don't we have to move cond_resched?
> >> > > >> > diff --git a/mm/vmscan.c b/mm/vmscan.c > >> > index 292582c..633e761 100644 > >> > --- a/mm/vmscan.c > >> > +++ b/mm/vmscan.c > >> > @@ -231,8 +231,10 @@ unsigned long shrink_slab(struct shrink_control *shrink, > >> > if (scanned == 0) > >> > scanned = SWAP_CLUSTER_MAX; > >> > > >> > - if (!down_read_trylock(&shrinker_rwsem)) > >> > - return 1; /* Assume we'll be able to shrink next time */ > >> > + if (!down_read_trylock(&shrinker_rwsem)) { > >> > + ret = 1; > >> > + goto out; /* Assume we'll be able to shrink next time */ > >> > + } > >> > > >> > list_for_each_entry(shrinker, &shrinker_list, list) { > >> > unsigned long long delta; > >> > @@ -280,12 +282,14 @@ unsigned long shrink_slab(struct shrink_control *shrink, > >> > count_vm_events(SLABS_SCANNED, this_scan); > >> > total_scan -= this_scan; > >> > > >> > - cond_resched(); > >> > } > >> > > >> > shrinker->nr += total_scan; > >> > + cond_resched(); > >> > } > >> > up_read(&shrinker_rwsem); > >> > +out: > >> > + cond_resched(); > >> > return ret; > >> > } > >> > > >> > >> This makes some sense for the exit path but if one or more of the > >> shrinkers takes a very long time without sleeping (extremely long > >> list searches for example) then kswapd will not call cond_resched() > >> between shrinkers and still consume a lot of CPU. > >> > >> > > > >> > > balance_pgdat() only calls cond_resched() if the zones are not > >> > > balanced. For a high-order allocation that is balanced, it > >> > > checks order-0 again. During that window, order-0 might have > >> > > become unbalanced so it loops again for order-0 and returns > >> > > that it was reclaiming for order-0 to kswapd(). It can then find > >> > > that a caller has re-woken kswapd for a high-order and re-enters > >> > > balance_pgdat() without ever having called cond_resched(). > >> > > >> > If kswapd reclaims order-0 followed by high order, it would have a > >> > chance to call cond_resched in shrink_page_list. But if all_unreclaimable > >> > is set on all zones, balance_pgdat could return without doing any work. Okay. > >> > It does make sense. > >> > By your scenario, someone wakes up kswapd with a higher order, again. > >> > So it re-enters balance_pgdat without ever having called cond_resched. > >> > But if someone wakes it up with a higher order again, we don't have a chance to > >> > call kswapd_try_to_sleep. So your patch's effect would be a nop, too. > >> > > >> > Would it be better to put cond_resched after balance_pgdat? > >> > > >> > >> Which will leave kswapd runnable instead of going to sleep but > >> guarantees a scheduling point. Let's see if the problem is that > >> cond_resched() is being missed although if this was the case then patch > >> 4 would truly be a no-op but Colin has already reported that patch 1 on > >> its own didn't fix his problem. If the problem is sandybridge-specific > >> where kswapd remains runnable and consuming large amounts of CPU in > >> turbo mode then we know that there are other cond_resched() decisions > >> that will need to be revisited. > >> > >> Colin or James, would you be willing to test with patch 1 from this > >> series and Minchan's patch below? Thanks. > > > > This works OK. Ran 250 test cycles for about 2 hours. > > Thanks for the testing! > I would like to know the exact patch you applied. > My modifications inserting cond_resched() are two: > > 1) the shrink_slab function > 2) kswapd, right after balance_pgdat. > > 1) or 2)? > Or > Both?
> I just followed Mel's request, so patch 1 from the series and *just* the following: > Colin or James, would you be willing to test with patch 1 from this > series and Minchan's patch below? Thanks. > > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 292582c..61c45d0 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -2753,6 +2753,7 @@ static int kswapd(void *p) > if (!ret) { > trace_mm_vmscan_kswapd_wake(pgdat->node_id, order); > order = balance_pgdat(pgdat, order, &classzone_idx); > + cond_resched(); > } > } > return 0; Colin > Thanks ^ permalink raw reply [flat|nested] 119+ messages in thread
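For context, the one-line hunk above lands in kswapd()'s main loop, which looks roughly like this (a simplified sketch of the 2.6.39-era loop, with the order/classzone request bookkeeping elided; the surrounding calls are taken from the hunk's context and from mm/vmscan.c of that era):

	for ( ; ; ) {
		/* sleep if the node is balanced; wake on new requests */
		kswapd_try_to_sleep(pgdat, order, classzone_idx);

		ret = try_to_freeze();
		if (kthread_should_stop())
			break;

		if (!ret) {
			trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
			order = balance_pgdat(pgdat, order, &classzone_idx);
			/*
			 * The added cond_resched(): one guaranteed
			 * scheduling point per loop iteration even when
			 * balance_pgdat() itself never yields. The
			 * trade-off, as Mel notes in the quoted
			 * discussion, is that kswapd stays runnable here
			 * rather than being put back to sleep.
			 */
			cond_resched();
		}
	}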
* Re: [PATCH 4/4] mm: vmscan: If kswapd has been running too long, allow it to sleep 2011-05-17 10:38 ` Mel Gorman @ 2011-05-18 4:09 ` James Bottomley -1 siblings, 0 replies; 119+ messages in thread From: James Bottomley @ 2011-05-18 4:09 UTC (permalink / raw) To: Mel Gorman Cc: Minchan Kim, KOSAKI Motohiro, akpm, colin.king, raghu.prabhu13, jack, chris.mason, cl, penberg, riel, hannes, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Tue, 2011-05-17 at 11:38 +0100, Mel Gorman wrote: > On Tue, May 17, 2011 at 08:50:44AM +0900, Minchan Kim wrote: > > Don't we have to move cond_resched? > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c > > index 292582c..633e761 100644 > > --- a/mm/vmscan.c > > +++ b/mm/vmscan.c > > @@ -231,8 +231,10 @@ unsigned long shrink_slab(struct shrink_control *shrink, > > if (scanned == 0) > > scanned = SWAP_CLUSTER_MAX; > > > > - if (!down_read_trylock(&shrinker_rwsem)) > > - return 1; /* Assume we'll be able to shrink next time */ > > + if (!down_read_trylock(&shrinker_rwsem)) { > > + ret = 1; > > + goto out; /* Assume we'll be able to shrink next time */ > > + } > > > > list_for_each_entry(shrinker, &shrinker_list, list) { > > unsigned long long delta; > > @@ -280,12 +282,14 @@ unsigned long shrink_slab(struct shrink_control *shrink, > > count_vm_events(SLABS_SCANNED, this_scan); > > total_scan -= this_scan; > > > > - cond_resched(); > > } > > > > shrinker->nr += total_scan; > > + cond_resched(); > > } > > up_read(&shrinker_rwsem); > > +out: > > + cond_resched(); > > return ret; > > } > > > > This makes some sense for the exit path but if one or more of the > shrinkers takes a very long time without sleeping (extremely long > list searches for example) then kswapd will not call cond_resched() > between shrinkers and will still consume a lot of CPU. > > > > > > > balance_pgdat() only calls cond_resched if the zones are not > > > balanced. For a high-order allocation that is balanced, it > > > checks order-0 again. During that window, order-0 might have > > > become unbalanced so it loops again for order-0 and returns > > > that it was reclaiming for order-0 to kswapd(). It can then find > > > that a caller has rewoken kswapd for a high-order and re-enters > > > balance_pgdat() without ever having called cond_resched(). > > > > If kswapd reclaims order-0 followed by high order, it would have a > > chance to call cond_resched in shrink_page_list. But if all_unreclaimable > > is set on all zones, balance_pgdat could return without doing any work. Okay. > > It does make sense. > > By your scenario, someone wakes up kswapd with a higher order, again. > > So it re-enters balance_pgdat without ever having called cond_resched. > > But if someone wakes it up for a higher order again, we don't get a chance to > > call kswapd_try_to_sleep. So your patch's effect would be a nop, too. > > > > It would be better to put cond_resched after balance_pgdat? > > > > Which will leave kswapd runnable instead of going to sleep but > guarantees a scheduling point. Let's see if the problem is that > cond_resched is being missed, although if this was the case then patch > 4 would truly be a no-op, but Colin has already reported that patch 1 on > its own didn't fix his problem. If the problem is sandybridge-specific, > where kswapd remains runnable and consumes large amounts of CPU in > turbo mode, then we know that there are other cond_resched() decisions > that will need to be revisited. > > Colin or James, would you be willing to test with patch 1 from this > series and Minchan's patch below? Thanks.
Yes, but unfortunately I'm on the road at the moment. I won't get back to the laptop showing the problem until late on Tuesday (24th). If it works for Colin, I'd assume it's OK. James > > diff --git a/mm/vmscan.c b/mm/vmscan.c > > index 292582c..61c45d0 100644 > > --- a/mm/vmscan.c > > +++ b/mm/vmscan.c > > @@ -2753,6 +2753,7 @@ static int kswapd(void *p) > > if (!ret) { > > trace_mm_vmscan_kswapd_wake(pgdat->node_id, order); > > order = balance_pgdat(pgdat, order, &classzone_idx); > > + cond_resched(); > > } > > } > > return 0; > > > > > > > > While it appears unlikely, there are bad conditions which can result > > > in cond_resched() being avoided. > > > > > > > > -- > > > Mel Gorman > > > SUSE Labs > > > > > > > > > > > -- > > Kind regards, > > Minchan Kim > ^ permalink raw reply [flat|nested] 119+ messages in thread
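[Aside: a minimal userspace sketch of the exit-path restructuring in Minchan's shrink_slab() diff above. This is not kernel code: shrink_slab_sketch(), cond_resched_point() and the pthread rwlock are illustrative stand-ins for shrink_slab(), cond_resched() and shrinker_rwsem. It only demonstrates the shape of the fix, that funnelling every return through the out: label guarantees a scheduling point even on the trylock-failure path, which previously returned without one.]

#include <stdio.h>
#include <pthread.h>

static pthread_rwlock_t shrinker_rwsem = PTHREAD_RWLOCK_INITIALIZER;

/* Stand-in for cond_resched(): a voluntary preemption point. */
static void cond_resched_point(void)
{
	printf("scheduling point reached\n");
}

static unsigned long shrink_slab_sketch(void)
{
	unsigned long ret = 0;

	if (pthread_rwlock_tryrdlock(&shrinker_rwsem) != 0) {
		ret = 1;  /* Assume we'll be able to shrink next time */
		goto out; /* ...but still pass the scheduling point below */
	}

	/*
	 * The walk over the shrinker list would go here; the diff above
	 * also moves a cond_resched() so it runs once per shrinker.
	 */
	cond_resched_point();

	pthread_rwlock_unlock(&shrinker_rwsem);
out:
	cond_resched_point();
	return ret;
}

int main(void)
{
	shrink_slab_sketch();
	return 0;
}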
* Re: [PATCH 4/4] mm: vmscan: If kswapd has been running too long, allow it to sleep 2011-05-16 23:50 ` Minchan Kim @ 2011-05-18 1:05 ` KOSAKI Motohiro -1 siblings, 0 replies; 119+ messages in thread From: KOSAKI Motohiro @ 2011-05-18 1:05 UTC (permalink / raw) To: minchan.kim Cc: mgorman, James.Bottomley, akpm, colin.king, raghu.prabhu13, jack, chris.mason, cl, penberg, riel, hannes, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 > It would be better to put cond_resched after balance_pgdat? > > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 292582c..61c45d0 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -2753,6 +2753,7 @@ static int kswapd(void *p) > if (!ret) { > trace_mm_vmscan_kswapd_wake(pgdat->node_id, order); > order = balance_pgdat(pgdat, order,&classzone_idx); > + cond_resched(); > } > } > return 0; > >>>> While it appears unlikely, there are bad conditions which can result >> in cond_resched() being avoided. Calling cond_resched() on every reclaim priority decrease, or in every shrink_zone() call, would give more fine-grained preemption, I think. ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH 4/4] mm: vmscan: If kswapd has been running too long, allow it to sleep 2011-05-18 1:05 ` KOSAKI Motohiro (?) @ 2011-05-18 5:44 ` Minchan Kim -1 siblings, 0 replies; 119+ messages in thread From: Minchan Kim @ 2011-05-18 5:44 UTC (permalink / raw) To: KOSAKI Motohiro Cc: mgorman, James.Bottomley, akpm, colin.king, raghu.prabhu13, jack, chris.mason, cl, penberg, riel, hannes, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Wed, May 18, 2011 at 10:05 AM, KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote: >> It would be better to put cond_resched after balance_pgdat? >> >> diff --git a/mm/vmscan.c b/mm/vmscan.c >> index 292582c..61c45d0 100644 >> --- a/mm/vmscan.c >> +++ b/mm/vmscan.c >> @@ -2753,6 +2753,7 @@ static int kswapd(void *p) >> if (!ret) { >> trace_mm_vmscan_kswapd_wake(pgdat->node_id, >> order); >> order = balance_pgdat(pgdat, >> order,&classzone_idx); >> + cond_resched(); >> } >> } >> return 0; >> >>>>> While it appears unlikely, there are bad conditions which can result >>> >>> in cond_resched() being avoided. > > Calling cond_resched() on every reclaim priority decrease, or in every > > shrink_zone() call, would give more fine-grained preemption, I think. It could be. But in the direct reclaim case, I have a concern about tasks losing the pages they reclaimed to other tasks through preemption. Hmm, anyway, we also need to test. How long should we keep bothering them (Colin and James)? First of all, let's settle on one option between us, ask them to test it, and send the final patch to akpm. 1. shrink_slab 2. right after balance_pgdat 3. shrink_zone 4. the reclaim priority decreasing routine. For now, I vote for 1) and 2). Mel, KOSAKI? -- Kind regards, Minchan Kim ^ permalink raw reply [flat|nested] 119+ messages in thread
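[Aside: a compilable skeleton mapping the four candidate sites Minchan lists above. Everything here is a stub for illustration: cond_resched() is a no-op stand-in for the kernel primitive and the priority range is arbitrary; only the placement of the four calls corresponds to the options under discussion.]

#include <stdio.h>

/* No-op stand-in for the kernel's cond_resched(). */
static void cond_resched(void)
{
}

static void shrink_slab(void)
{
	/* 1. in shrink_slab(): once per shrinker and on every exit path */
	cond_resched();
}

static void shrink_zone(void)
{
	/* 3. in shrink_zone(): once per zone scanned */
	cond_resched();
}

static void balance_pgdat(void)
{
	int priority;

	for (priority = 12; priority >= 0; priority--) {
		shrink_zone();
		shrink_slab();
		/* 4. once per reclaim-priority decrease */
		cond_resched();
	}
}

static void kswapd_main_loop(int wakeups)
{
	while (wakeups--) {
		balance_pgdat();
		/* 2. right after balance_pgdat() returns */
		cond_resched();
	}
}

int main(void)
{
	kswapd_main_loop(1);
	printf("done\n");
	return 0;
}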
* Re: [PATCH 4/4] mm: vmscan: If kswapd has been running too long, allow it to sleep 2011-05-18 5:44 ` Minchan Kim @ 2011-05-18 6:05 ` KOSAKI Motohiro -1 siblings, 0 replies; 119+ messages in thread From: KOSAKI Motohiro @ 2011-05-18 6:05 UTC (permalink / raw) To: minchan.kim Cc: mgorman, James.Bottomley, akpm, colin.king, raghu.prabhu13, jack, chris.mason, cl, penberg, riel, hannes, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 >>>>>> While it appears unlikely, there are bad conditions which can result >>>> >>>> in cond_resched() being avoided. >> >> Calling cond_resched() on every reclaim priority decrease, or in every >> shrink_zone() call, would give more fine-grained preemption, I think. > > It could be. > But in the direct reclaim case, I have a concern about tasks losing > the pages they reclaimed to other tasks through preemption. Nope, I proposed to add cond_resched() into balance_pgdat(). > Hmm, anyway, we also need to test. > How long should we keep bothering them (Colin and James)? > First of all, let's settle on one option between us, ask them to test it, and > send the final patch to akpm. > > 1. shrink_slab > 2. right after balance_pgdat > 3. shrink_zone > 4. the reclaim priority decreasing routine. > > For now, I vote for 1) and 2). > > Mel, KOSAKI? I think the following patch provides enough preemption points. Thanks. From e7d88be1916184ea7c93a6f2746b15c7a32d1973 Mon Sep 17 00:00:00 2001 From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Date: Wed, 18 May 2011 15:00:39 +0900 Subject: [PATCH] vmscan: balance_pgdat() call cond_resched() unconditionally Under constant allocation pressure, kswapd can be in the situation where sleeping_prematurely() will always return true even if kswapd has been running a long time. Check if kswapd needs to be scheduled. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Mel Gorman <mgorman@suse.de> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Colin King <colin.king@canonical.com> Cc: Minchan Kim <minchan.kim@gmail.com> --- mm/vmscan.c | 3 +-- 1 files changed, 1 insertions(+), 2 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index 19e179b..87c88fd 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2449,6 +2449,7 @@ loop_again: sc.nr_reclaimed += reclaim_state->reclaimed_slab; total_scanned += sc.nr_scanned; + cond_resched(); if (zone->all_unreclaimable) continue; if (nr_slab == 0 && @@ -2518,8 +2519,6 @@ out: * for the node to be balanced */ if (!(all_zones_ok || (order && pgdat_balanced(pgdat, balanced, *classzone_idx)))) { - cond_resched(); - try_to_freeze(); /* -- 1.7.3.1 ^ permalink raw reply related [flat|nested] 119+ messages in thread
* Re: [PATCH 4/4] mm: vmscan: If kswapd has been running too long, allow it to sleep 2011-05-18 5:44 ` Minchan Kim (?) @ 2011-05-18 9:58 ` Mel Gorman -1 siblings, 0 replies; 119+ messages in thread From: Mel Gorman @ 2011-05-18 9:58 UTC (permalink / raw) To: Minchan Kim Cc: KOSAKI Motohiro, James.Bottomley, akpm, colin.king, raghu.prabhu13, jack, chris.mason, cl, penberg, riel, hannes, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Wed, May 18, 2011 at 02:44:48PM +0900, Minchan Kim wrote: > On Wed, May 18, 2011 at 10:05 AM, KOSAKI Motohiro > <kosaki.motohiro@jp.fujitsu.com> wrote: > >> It would be better to put cond_resched after balance_pgdat? > >> > >> diff --git a/mm/vmscan.c b/mm/vmscan.c > >> index 292582c..61c45d0 100644 > >> --- a/mm/vmscan.c > >> +++ b/mm/vmscan.c > >> @@ -2753,6 +2753,7 @@ static int kswapd(void *p) > >> if (!ret) { > >> trace_mm_vmscan_kswapd_wake(pgdat->node_id, > >> order); > >> order = balance_pgdat(pgdat, > >> order,&classzone_idx); > >> + cond_resched(); > >> } > >> } > >> return 0; > >> > >>>>> While it appears unlikely, there are bad conditions which can result > >>> > >>> in cond_resched() being avoided. > > > > Calling cond_resched() on every reclaim priority decrease, or in every > > > shrink_zone() call, would give more fine-grained preemption, I think. > > It could be. > But in the direct reclaim case, I have a concern about tasks losing > the pages they reclaimed to other tasks through preemption. > > Hmm, anyway, we also need to test. > How long should we keep bothering them (Colin and James)? > First of all, let's settle on one option between us, ask them to test it, and > send the final patch to akpm. > > 1. shrink_slab > 2. right after balance_pgdat > 3. shrink_zone > 4. the reclaim priority decreasing routine. > > For now, I vote for 1) and 2). > I've already submitted a pair of patches for option 1. I don't think option 2 gains us anything. I think it's more likely we should worry about all_unreclaimable being set when shrink_slab is returning 0 and we are encountering so many dirty pages that pages_scanned is high enough. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH 4/4] mm: vmscan: If kswapd has been running too long, allow it to sleep 2011-05-18 9:58 ` Mel Gorman @ 2011-05-18 22:55 ` Minchan Kim -1 siblings, 0 replies; 119+ messages in thread From: Minchan Kim @ 2011-05-18 22:55 UTC (permalink / raw) To: Mel Gorman Cc: KOSAKI Motohiro, James.Bottomley, akpm, colin.king, raghu.prabhu13, jack, chris.mason, cl, penberg, riel, hannes, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Wed, May 18, 2011 at 6:58 PM, Mel Gorman <mgorman@suse.de> wrote: > On Wed, May 18, 2011 at 02:44:48PM +0900, Minchan Kim wrote: >> On Wed, May 18, 2011 at 10:05 AM, KOSAKI Motohiro >> <kosaki.motohiro@jp.fujitsu.com> wrote: >> >> It would be better to put cond_resched after balance_pgdat? >> >> >> >> diff --git a/mm/vmscan.c b/mm/vmscan.c >> >> index 292582c..61c45d0 100644 >> >> --- a/mm/vmscan.c >> >> +++ b/mm/vmscan.c >> >> @@ -2753,6 +2753,7 @@ static int kswapd(void *p) >> >> if (!ret) { >> >> trace_mm_vmscan_kswapd_wake(pgdat->node_id, >> >> order); >> >> order = balance_pgdat(pgdat, >> >> order,&classzone_idx); >> >> + cond_resched(); >> >> } >> >> } >> >> return 0; >> >> >> >>>>> While it appears unlikely, there are bad conditions which can result >> >>> >> >>> in cond_resched() being avoided. >> > >> > Calling cond_resched() on every reclaim priority decrease, or in every >> > shrink_zone() call, would give more fine-grained preemption, I think. >> >> It could be. >> But in the direct reclaim case, I have a concern about tasks losing >> the pages they reclaimed to other tasks through preemption. >> >> Hmm, anyway, we also need to test. >> How long should we keep bothering them (Colin and James)? >> First of all, let's settle on one option between us, ask them to test it, and >> send the final patch to akpm. >> >> 1. shrink_slab >> 2. right after balance_pgdat >> 3. shrink_zone >> 4. the reclaim priority decreasing routine. >> >> For now, I vote for 1) and 2). >> > > I've already submitted a pair of patches for option 1. I don't think > option 2 gains us anything. I think it's more likely we should worry > about all_unreclaimable being set when shrink_slab is returning 0 and we > are encountering so many dirty pages that pages_scanned is high enough. Okay. Colin reported he had no problem with patch 1 in this series plus mine (i.e., just the cond_resched right after the balance_pgdat call, without the shrink_slab patch). If Colin's test is successful, I won't insist on mine. (I don't want to drag this on for days :( ) If KOSAKI agrees, let's ask Colin to run the test and confirm our final version. KOSAKI, could you post your opinion? -- Kind regards, Minchan Kim ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH 4/4] mm: vmscan: If kswapd has been running too long, allow it to sleep 2011-05-18 22:55 ` Minchan Kim @ 2011-05-18 23:54 ` KOSAKI Motohiro -1 siblings, 0 replies; 119+ messages in thread From: KOSAKI Motohiro @ 2011-05-18 23:54 UTC (permalink / raw) To: minchan.kim Cc: mgorman, James.Bottomley, akpm, colin.king, raghu.prabhu13, jack, chris.mason, cl, penberg, riel, hannes, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 >> I've already submitted a pair of patches for option 1. I don't think >> option 2 gains us anything. I think it's more likely we should worry >> about all_unreclaimable being set when shrink_slab is returning 0 and we >> are encountering so many dirty pages that pages_scanned is high enough. > > Okay. > > Colin reported he had no problem with patch 1 in this series plus > mine (i.e., just the cond_resched right after the balance_pgdat call, > without the shrink_slab patch). > > If Colin's test is successful, I won't insist on mine. > (I don't want to drag this on for days :( ) > If KOSAKI agrees, let's ask Colin to run the test and confirm our final version. > > KOSAKI, could you post your opinion? Yeah. I have no motivation to ignore Colin's test result, either. ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH 4/4] mm: vmscan: If kswapd has been running too long, allow it to sleep 2011-05-16 10:27 ` Mel Gorman @ 2011-05-18 0:26 ` KOSAKI Motohiro -1 siblings, 0 replies; 119+ messages in thread From: KOSAKI Motohiro @ 2011-05-18 0:26 UTC (permalink / raw) To: mgorman Cc: minchan.kim, James.Bottomley, akpm, colin.king, raghu.prabhu13, jack, chris.mason, cl, penberg, riel, hannes, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 > Let's see; > > shrink_page_list() only applies if inactive pages were isolated > which in turn may not happen if all_unreclaimable is set in > shrink_zones(). If for whatever reason, all_unreclaimable is > set on all zones, we can miss calling cond_resched(). > > shrink_slab only applies if we are reclaiming slab pages. If the first > shrinker returns -1, we do not call cond_resched(). If that > first shrinker is dcache and __GFP_FS is not set, direct > reclaimers will not shrink at all. However, if there are > enough of them running or if one of the other shrinkers > is running for a very long time, kswapd could be starved > acquiring the shrinker_rwsem and never reaching the > cond_resched(). OK. > > balance_pgdat() only calls cond_resched if the zones are not > balanced. For a high-order allocation that is balanced, it > checks order-0 again. During that window, order-0 might have > become unbalanced so it loops again for order-0 and returns > that it was reclaiming for order-0 to kswapd(). It can then find > that a caller has rewoken kswapd for a high-order and re-enters > balance_pgdat() without ever having called cond_resched(). Then shouldn't balance_pgdat() call cond_resched() unconditionally? The problem is NOT 100% CPU consumption; if kswapd sleeps, other processes simply need to reclaim old pages themselves. The problem is that kswapd doesn't invoke a context switch, so other tasks hang. > While it appears unlikely, there are bad conditions which can result > in cond_resched() being avoided. > ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH 4/4] mm: vmscan: If kswapd has been running too long, allow it to sleep 2011-05-18 0:26 ` KOSAKI Motohiro @ 2011-05-18 9:57 ` Mel Gorman -1 siblings, 0 replies; 119+ messages in thread From: Mel Gorman @ 2011-05-18 9:57 UTC (permalink / raw) To: KOSAKI Motohiro Cc: minchan.kim, James.Bottomley, akpm, colin.king, raghu.prabhu13, jack, chris.mason, cl, penberg, riel, hannes, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Wed, May 18, 2011 at 09:26:09AM +0900, KOSAKI Motohiro wrote: > >Let's see; > > > >shrink_page_list() only applies if inactive pages were isolated > > which in turn may not happen if all_unreclaimable is set in > > shrink_zones(). If for whatever reason, all_unreclaimable is > > set on all zones, we can miss calling cond_resched(). > > > >shrink_slab only applies if we are reclaiming slab pages. If the first > > shrinker returns -1, we do not call cond_resched(). If that > > first shrinker is dcache and __GFP_FS is not set, direct > > reclaimers will not shrink at all. However, if there are > > enough of them running or if one of the other shrinkers > > is running for a very long time, kswapd could be starved > > acquiring the shrinker_rwsem and never reaching the > > cond_resched(). > > OK. > > > > > >balance_pgdat() only calls cond_resched if the zones are not > > balanced. For a high-order allocation that is balanced, it > > checks order-0 again. During that window, order-0 might have > > become unbalanced so it loops again for order-0 and returns > > that it was reclaiming for order-0 to kswapd(). It can then find > > that a caller has rewoken kswapd for a high-order and re-enters > > balance_pgdat() without ever having called cond_resched(). > > Then shouldn't balance_pgdat() call cond_resched() unconditionally? > The problem is NOT 100% CPU consumption; if kswapd sleeps, other > processes simply need to reclaim old pages themselves. The problem is that kswapd doesn't > invoke a context switch, so other tasks hang. > Which the shrink_slab patch does (either version). What's the gain from sprinkling more cond_resched() around? If you think there is, submit another pair of patches (including patch 1 from this series), but I'm not seeing the advantage myself. > > >While it appears unlikely, there are bad conditions which can result > >in cond_resched() being avoided. > > > > -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH 4/4] mm: vmscan: If kswapd has been running too long, allow it to sleep 2011-05-15 10:27 ` KOSAKI Motohiro @ 2011-05-16 8:45 ` Mel Gorman -1 siblings, 0 replies; 119+ messages in thread From: Mel Gorman @ 2011-05-16 8:45 UTC (permalink / raw) To: KOSAKI Motohiro Cc: akpm, James.Bottomley, colin.king, raghu.prabhu13, jack, chris.mason, cl, penberg, riel, hannes, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Sun, May 15, 2011 at 07:27:12PM +0900, KOSAKI Motohiro wrote: > (2011/05/13 23:03), Mel Gorman wrote: > > Under constant allocation pressure, kswapd can be in the situation where > > sleeping_prematurely() will always return true even if kswapd has been > > running a long time. Check if kswapd needs to be scheduled. > > > > Signed-off-by: Mel Gorman<mgorman@suse.de> > > --- > > mm/vmscan.c | 4 ++++ > > 1 files changed, 4 insertions(+), 0 deletions(-) > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c > > index af24d1e..4d24828 100644 > > --- a/mm/vmscan.c > > +++ b/mm/vmscan.c > > @@ -2251,6 +2251,10 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining, > > unsigned long balanced = 0; > > bool all_zones_ok = true; > > > > + /* If kswapd has been running too long, just sleep */ > > + if (need_resched()) > > + return false; > > + > > Hmm... I don't like this patch so much. because this code does > > - don't sleep if kswapd got context switch at shrink_inactive_list > - sleep if kswapd didn't > > It seems to be semi random behavior. > It's possible to keep kswapd awake simply by allocating fast enough that the watermarks are never balanced, making kswapd appear to consume 100% of CPU. This check causes kswapd to sleep in this case. The processes doing the allocations will enter direct reclaim and probably stall while processes that are not allocating will get some CPU time. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 119+ messages in thread
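[Aside: the check Mel defends here sits behind a double negative that is easy to misread: sleeping_prematurely() returning true means "the sleep would be premature, keep reclaiming", so the need_resched() test makes kswapd yield by returning false. A small standalone sketch of that control flow; need_resched_flag and node_balanced are stand-ins for the real need_resched() and watermark checks.]

#include <stdbool.h>
#include <stdio.h>

static bool need_resched_flag; /* stand-in for need_resched() */
static bool node_balanced;     /* stand-in for the per-zone watermark checks */

/* true means "the sleep would be premature, keep reclaiming" */
static bool sleeping_prematurely(void)
{
	/* If kswapd has been running too long, just sleep */
	if (need_resched_flag)
		return false;

	return !node_balanced;
}

static void kswapd_try_to_sleep(void)
{
	if (!sleeping_prematurely())
		printf("kswapd sleeps; allocators fall back to direct reclaim\n");
	else
		printf("kswapd keeps reclaiming\n");
}

int main(void)
{
	/* Constant allocation pressure: the node never balances... */
	node_balanced = false;
	/* ...but kswapd's timeslice is exhausted. */
	need_resched_flag = true;
	kswapd_try_to_sleep();
	return 0;
}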
* Re: [PATCH 4/4] mm: vmscan: If kswapd has been running too long, allow it to sleep 2011-05-13 14:03 ` Mel Gorman @ 2011-05-16 14:30 ` Rik van Riel -1 siblings, 0 replies; 119+ messages in thread From: Rik van Riel @ 2011-05-16 14:30 UTC (permalink / raw) To: Mel Gorman Cc: Andrew Morton, James Bottomley, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Christoph Lameter, Pekka Enberg, Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On 05/13/2011 10:03 AM, Mel Gorman wrote: > Under constant allocation pressure, kswapd can be in the situation where > sleeping_prematurely() will always return true even if kswapd has been > running a long time. Check if kswapd needs to be scheduled. > > Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> -- All rights reversed ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH 0/4] Reduce impact to overall system of SLUB using high-order allocations V2 2011-05-13 14:03 ` Mel Gorman @ 2011-05-13 15:19 ` James Bottomley -1 siblings, 0 replies; 119+ messages in thread From: James Bottomley @ 2011-05-13 15:19 UTC (permalink / raw) To: Mel Gorman Cc: Andrew Morton, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Christoph Lameter, Pekka Enberg, Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Fri, 2011-05-13 at 15:03 +0100, Mel Gorman wrote: > Changelog since V1 > o kswapd should sleep if need_resched > o Remove __GFP_REPEAT from GFP flags when speculatively using high > orders so direct/compaction exits earlier > o Remove __GFP_NORETRY for correctness > o Correct logic in sleeping_prematurely > o Leave SLUB using the default slub_max_order > > There are a few reports of people experiencing hangs when copying > large amounts of data with kswapd using a large amount of CPU which > appear to be due to recent reclaim changes. > > SLUB using high orders is the trigger but not the root cause as SLUB > has been using high orders for a while. The following four patches > aim to fix the problems in reclaim while reducing the cost for SLUB > using those high orders. > > Patch 1 corrects logic introduced by commit [1741c877: mm: > kswapd: keep kswapd awake for high-order allocations until > a percentage of the node is balanced] to allow kswapd to > go to sleep when balanced for high orders. > > Patch 2 prevents kswapd waking up in response to SLUBs speculative > use of high orders. > > Patch 3 further reduces the cost by prevent SLUB entering direct > compaction or reclaim paths on the grounds that falling > back to order-0 should be cheaper. > > Patch 4 notes that even when kswapd is failing to keep up with > allocation requests, it should still go to sleep when its > quota has expired to prevent it spinning. This all works fine for me ... three untar runs and no kswapd hangs or pegging the CPU at 99% ... in fact, kswapd rarely gets over 20%. This isn't as good as the kswapd sleeping_prematurely() throttling patch. For total CPU time over three 90GB untar runs, it's about 64s of CPU time with your patches rather than 6s, but that's vastly better than the 15 minutes of CPU time kswapd was taking even under PREEMPT. James ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH 0/4] Reduce impact to overall system of SLUB using high-order allocations V2 2011-05-13 15:19 ` James Bottomley @ 2011-05-13 15:52 ` Mel Gorman -1 siblings, 0 replies; 119+ messages in thread From: Mel Gorman @ 2011-05-13 15:52 UTC (permalink / raw) To: James Bottomley Cc: Andrew Morton, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Christoph Lameter, Pekka Enberg, Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Fri, May 13, 2011 at 10:19:44AM -0500, James Bottomley wrote: > On Fri, 2011-05-13 at 15:03 +0100, Mel Gorman wrote: > > Changelog since V1 > > o kswapd should sleep if need_resched > > o Remove __GFP_REPEAT from GFP flags when speculatively using high > > orders so direct/compaction exits earlier > > o Remove __GFP_NORETRY for correctness > > o Correct logic in sleeping_prematurely > > o Leave SLUB using the default slub_max_order > > > > There are a few reports of people experiencing hangs when copying > > large amounts of data with kswapd using a large amount of CPU which > > appear to be due to recent reclaim changes. > > > > SLUB using high orders is the trigger but not the root cause as SLUB > > has been using high orders for a while. The following four patches > > aim to fix the problems in reclaim while reducing the cost for SLUB > > using those high orders. > > > > Patch 1 corrects logic introduced by commit [1741c877: mm: > > kswapd: keep kswapd awake for high-order allocations until > > a percentage of the node is balanced] to allow kswapd to > > go to sleep when balanced for high orders. > > > > Patch 2 prevents kswapd waking up in response to SLUBs speculative > > use of high orders. > > > > Patch 3 further reduces the cost by prevent SLUB entering direct > > compaction or reclaim paths on the grounds that falling > > back to order-0 should be cheaper. > > > > Patch 4 notes that even when kswapd is failing to keep up with > > allocation requests, it should still go to sleep when its > > quota has expired to prevent it spinning. > > This all works fine for me ... three untar runs and no kswapd hangs or > pegging the CPU at 99% ... in fact, kswapd rarely gets over 20%. > Good stuff, thanks. > This isn't as good as the kswapd sleeping_prematurely() throttling > patch. For total CPU time over three 90GB untar runs, it's about 64s of > CPU time with your patches rather than 6s, but that's vastly better than > the 15 minutes of CPU time kswapd was taking even under PREEMPT. > The throttling patch is unfortunately a bit hand-wavy, being based on the number of times it's entered and the time passed. It'll be even harder to debug problems related to this in the future, particularly as it uses global information (a static) for kswapd, which is per-node and could get worse in the future depending on what memcg does. However, as you are testing against stable, can you also apply this patch? [2876592f: mm: vmscan: stop reclaim/compaction earlier due to insufficient progress if !__GFP_REPEAT]. It makes a difference as to when reclaimers give up on high orders and go to sleep. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 119+ messages in thread
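As a rough illustration of what [2876592f] changes (a hedged reconstruction from the description above, not the literal commit; the helper name and the exact threshold are illustrative), the decision to continue reclaim/compaction gives up early for allocations without __GFP_REPEAT once enough pages have been freed for compaction to stand a chance:

static bool should_continue_reclaim_sketch(int order, gfp_t gfp_mask,
					   unsigned long nr_reclaimed,
					   unsigned long inactive_lru_pages)
{
	/* Roughly enough free pages for compaction at this order */
	unsigned long pages_for_compaction = 2UL << order;

	/*
	 * Speculative (!__GFP_REPEAT) allocations stop once reasonable
	 * progress has been made instead of scanning the LRU lists to
	 * exhaustion.
	 */
	if (!(gfp_mask & __GFP_REPEAT) &&
	    nr_reclaimed >= pages_for_compaction)
		return false;

	/* Otherwise keep reclaiming while there is anything left to scan */
	return inactive_lru_pages > pages_for_compaction;
}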
* Re: [PATCH 0/4] Reduce impact to overall system of SLUB using high-order allocations V2 2011-05-13 14:03 ` Mel Gorman @ 2011-05-13 15:21 ` Christoph Lameter -1 siblings, 0 replies; 119+ messages in thread From: Christoph Lameter @ 2011-05-13 15:21 UTC (permalink / raw) To: Mel Gorman Cc: Andrew Morton, James Bottomley, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Pekka Enberg, Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Fri, 13 May 2011, Mel Gorman wrote: > SLUB using high orders is the trigger but not the root cause as SLUB > has been using high orders for a while. The following four patches > aim to fix the problems in reclaim while reducing the cost for SLUB > using those high orders. > > Patch 1 corrects logic introduced by commit [1741c877: mm: > kswapd: keep kswapd awake for high-order allocations until > a percentage of the node is balanced] to allow kswapd to > go to sleep when balanced for high orders. The above looks good. > Patch 2 prevents kswapd waking up in response to SLUBs speculative > use of high orders. Not sure if that is necessary, since it seems that we triggered kswapd before. Why not continue to do it? Once kswapd has enough higher-order pages, it should no longer be triggered, right? > Patch 3 further reduces the cost by prevent SLUB entering direct > compaction or reclaim paths on the grounds that falling > back to order-0 should be cheaper. It's cheaper for the reclaim path, true, but more expensive in terms of SLUB's management costs for the data, and it also increases the memory wasted. A higher order means denser packing of objects and less page management overhead. Fallback is not for free. Reasonable effort should be made to allocate the page order requested. > Patch 4 notes that even when kswapd is failing to keep up with > allocation requests, it should still go to sleep when its > quota has expired to prevent it spinning. Looks good too. Overall, it looks like the compaction logic and the modifications to reclaim introduced recently with the intent of increasing the amount of physically contiguous memory are not working as expected. SLUB's chances of getting higher-order pages should be *increasing* as a result of these changes. The above looks like the chances are decreasing now. This is a matter of future concern. The metadata management overhead in the kernel is continually increasing since memory sizes keep growing and we typically manage memory in 4k chunks. Through larger allocation sizes we can reduce that management overhead, but we can only do this if we have an effective way of defragmenting memory to get longer contiguous chunks that can be managed with a single page struct. Please make sure that compaction and related measures really work properly. The patches suggest that the recent modifications are not improving the situation. ^ permalink raw reply [flat|nested] 119+ messages in thread
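To put numbers on the packing argument (figures chosen purely for illustration, not taken from the thread): with 4096-byte pages and a hypothetical 700-byte object, an order-0 slab holds 5 objects and wastes 596 bytes (about 14.5% of the page), while an order-3 slab of 32768 bytes holds 46 objects and wastes only 568 bytes (about 1.7%), using one slab's worth of management state instead of roughly ten. The question being debated in this thread is whether that saving justifies reclaim/compaction work when high-order pages are scarce.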
* Re: [PATCH 0/4] Reduce impact to overall system of SLUB using high-order allocations V2 2011-05-13 15:21 ` Christoph Lameter @ 2011-05-13 15:43 ` Mel Gorman -1 siblings, 0 replies; 119+ messages in thread From: Mel Gorman @ 2011-05-13 15:43 UTC (permalink / raw) To: Christoph Lameter Cc: Andrew Morton, James Bottomley, Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason, Pekka Enberg, Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Fri, May 13, 2011 at 10:21:46AM -0500, Christoph Lameter wrote: > On Fri, 13 May 2011, Mel Gorman wrote: > > > SLUB using high orders is the trigger but not the root cause as SLUB > > has been using high orders for a while. The following four patches > > aim to fix the problems in reclaim while reducing the cost for SLUB > > using those high orders. > > > > Patch 1 corrects logic introduced by commit [1741c877: mm: > > kswapd: keep kswapd awake for high-order allocations until > > a percentage of the node is balanced] to allow kswapd to > > go to sleep when balanced for high orders. > > The above looks good. > Ok. > > Patch 2 prevents kswapd waking up in response to SLUBs speculative > > use of high orders. > > Not sure if that is necessary, since it seems that we triggered kswapd > before. Why not continue to do it? Once kswapd has enough higher-order > pages, it should no longer be triggered, right? > Because kswapd waking up isn't cheap and we are reclaiming pages just so SLUB may get high-order pages in the future. As it's for PAGE_ALLOC_COSTLY_ORDER, we are not entering lumpy reclaim and are just selecting a few random order-0 pages, which may or may not help. There is very little control over how many pages get freed if kswapd is being woken frequently. > > Patch 3 further reduces the cost by prevent SLUB entering direct > > compaction or reclaim paths on the grounds that falling > > back to order-0 should be cheaper. > > It's cheaper for the reclaim path, true, but more expensive in terms of SLUB's > management costs for the data, and it also increases the memory wasted. Surely the reclaim cost exceeds the SLUB management cost? > A > higher order means denser packing of objects and less page management > overhead. Fallback is not for free. Neither is reclaiming a large bunch of pages. Worse, reclaiming pages so SLUB gets a high-order page means it's likely to be stealing MIGRATE_MOVABLE blocks, which eventually gives diminishing returns but may not be noticeable for weeks. From a fragmentation perspective, it's better if SLUB uses order-0 allocations when memory is low so that SLUB pages continue to get packed into as few MIGRATE_UNMOVABLE and MIGRATE_UNRECLAIMABLE blocks as possible. > Reasonable effort should be made to > allocate the page order requested. > > > Patch 4 notes that even when kswapd is failing to keep up with > > allocation requests, it should still go to sleep when its > > quota has expired to prevent it spinning. > > Looks good too. > > Overall, it looks like the compaction logic and the modifications to > reclaim introduced recently with the intent of increasing the amount of > physically contiguous memory are not working as expected. > The reclaim and kswapd damage was unintended and this is my fault, but reclaim/compaction still makes a lot more sense than lumpy reclaim. Testing showed it disrupted the system a lot less and allocated high-order pages faster with fewer pages reclaimed. > SLUB's chances of getting higher-order pages should be *increasing* as a > result of these changes. The above looks like the chances are decreasing > now. > Patches 2 and 3 may mean that SLUB gets fewer high-order pages when memory is low, depending instead on high-order pages being freed naturally as SLUB recycles slabs of old objects. On the flip-side, fewer pages will be reclaimed. I'd expect the latter to be cheaper overall. > This is a matter of future concern. The metadata management overhead > in the kernel is continually increasing since memory sizes keep growing > and we typically manage memory in 4k chunks. Through larger allocation > sizes we can reduce that management overhead, but we can only do this if we > have an effective way of defragmenting memory to get longer contiguous > chunks that can be managed with a single page struct. > > Please make sure that compaction and related measures really work properly. > Local testing still shows them to be behaving as expected, but then again, I haven't reproduced the simple problem reported by Chris and James despite using a few different laptops and two different low-end servers. > The patches suggest that the recent modifications are not improving the > situation. > -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 119+ messages in thread
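To make the combined effect of patches 2 and 3 concrete, a minimal sketch of the allocation strategy being discussed, assuming the 2.6.39-era SLUB structures (s->oo, s->min, oo_order()); the exact flag manipulation is an approximation of the patches, not the literal diff. The high-order attempt is purely opportunistic, and failure falls back to the cache's minimum order with the caller's original flags:

static struct page *allocate_slab_sketch(struct kmem_cache *s,
					 gfp_t flags, int node)
{
	struct page *page;
	gfp_t alloc_gfp;

	/*
	 * Speculative high-order attempt: do not wake kswapd (patch 2)
	 * and do not block in direct reclaim or compaction (patch 3).
	 * If nothing suitable is lying around, fail fast.
	 */
	alloc_gfp = (flags | __GFP_NOWARN | __GFP_NO_KSWAPD) &
		    ~(__GFP_WAIT | __GFP_NOFAIL | __GFP_REPEAT);
	page = alloc_pages_node(node, alloc_gfp, oo_order(s->oo));
	if (page)
		return page;

	/* Fall back to the minimum order with the caller's full flags */
	return alloc_pages_node(node, flags, oo_order(s->min));
}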
* Re: [PATCH 0/4] Reduce impact to overall system of SLUB using high-order allocations V2 2011-05-13 14:03 ` Mel Gorman @ 2011-05-14 8:34 ` Colin Ian King -1 siblings, 0 replies; 119+ messages in thread From: Colin Ian King @ 2011-05-14 8:34 UTC (permalink / raw) To: Mel Gorman Cc: Andrew Morton, James Bottomley, Raghavendra D Prabhu, Jan Kara, Chris Mason, Christoph Lameter, Pekka Enberg, Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Fri, 2011-05-13 at 15:03 +0100, Mel Gorman wrote: > Changelog since V1 > o kswapd should sleep if need_resched > o Remove __GFP_REPEAT from GFP flags when speculatively using high > orders so direct/compaction exits earlier > o Remove __GFP_NORETRY for correctness > o Correct logic in sleeping_prematurely > o Leave SLUB using the default slub_max_order > > There are a few reports of people experiencing hangs when copying > large amounts of data with kswapd using a large amount of CPU which > appear to be due to recent reclaim changes. > > SLUB using high orders is the trigger but not the root cause as SLUB > has been using high orders for a while. The following four patches > aim to fix the problems in reclaim while reducing the cost for SLUB > using those high orders. > > Patch 1 corrects logic introduced by commit [1741c877: mm: > kswapd: keep kswapd awake for high-order allocations until > a percentage of the node is balanced] to allow kswapd to > go to sleep when balanced for high orders. > > Patch 2 prevents kswapd waking up in response to SLUBs speculative > use of high orders. > > Patch 3 further reduces the cost by prevent SLUB entering direct > compaction or reclaim paths on the grounds that falling > back to order-0 should be cheaper. > > Patch 4 notes that even when kswapd is failing to keep up with > allocation requests, it should still go to sleep when its > quota has expired to prevent it spinning. > > My own data on this is not great. I haven't really been able to > reproduce the same problem locally. > > The test case is simple. "download tar" wgets a large tar file and > stores it locally. "unpack" is expanding it (15 times physical RAM > in this case) and "delete source dirs" is the tarfile being deleted > again. I also experimented with having the tar copied numerous times > and into deeper directories to increase the size but the results were > not particularly interesting so I left it as one tar. > > In the background, applications are being launched to time to vaguely > simulate activity on the desktop and to measure how long it takes > applications to start. 
> > Test server, 4 CPU threads, x86_64, 2G of RAM, no PREEMPT, no COMPACTION, X running > LARGE COPY AND UNTAR > vanilla fixprematurely kswapd-nowwake slub-noexstep kswapdsleep > download tar 95 ( 0.00%) 94 ( 1.06%) 94 ( 1.06%) 94 ( 1.06%) 94 ( 1.06%) > unpack tar 654 ( 0.00%) 649 ( 0.77%) 655 (-0.15%) 589 (11.04%) 598 ( 9.36%) > copy source files 0 ( 0.00%) 0 ( 0.00%) 0 ( 0.00%) 0 ( 0.00%) 0 ( 0.00%) > delete source dirs 327 ( 0.00%) 334 (-2.10%) 318 ( 2.83%) 325 ( 0.62%) 320 ( 2.19%) > MMTests Statistics: duration > User/Sys Time Running Test (seconds) 1139.7 1142.55 1149.78 1109.32 1113.26 > Total Elapsed Time (seconds) 1341.59 1342.45 1324.90 1271.02 1247.35 > > MMTests Statistics: application launch > evolution-wait30 mean 34.92 34.96 34.92 34.92 35.08 > gnome-terminal-find mean 7.96 7.96 8.76 7.80 7.96 > iceweasel-table mean 7.93 7.81 7.73 7.65 7.88 > > evolution-wait30 stddev 0.96 1.22 1.27 1.20 1.15 > gnome-terminal-find stddev 3.02 3.09 3.51 2.99 3.02 > iceweasel-table stddev 1.05 0.90 1.09 1.11 1.11 > > Having SLUB avoid expensive steps in reclaim improves performance > by quite a bit with the overall test completing 1.5 minutes > faster. Application launch times were not really affected but it's > not something my test machine was suffering from in the first place > so it's not really conclusive. The kswapd patches also did not appear > to help but again, the test machine wasn't suffering that problem. > > These patches are against 2.6.39-rc7. Again, testing would be > appreciated. These patches solve the problem for me. I've been soak testing the file copy test for 3.5 hours with nearly 400 test cycles and observed no lockups at all - rock solid. From my observations of the vmstat output, the system is behaving sanely. Thanks for finding a solution - much appreciated! > > Documentation/vm/slub.txt | 2 +- > mm/page_alloc.c | 3 ++- > mm/slub.c | 5 +++-- > 3 files changed, 6 insertions(+), 4 deletions(-) > > mm/page_alloc.c | 3 ++- > mm/slub.c | 3 ++- > mm/vmscan.c | 6 +++++- > 3 files changed, 9 insertions(+), 3 deletions(-) > ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH 0/4] Reduce impact to overall system of SLUB using high-order allocations V2 2011-05-14 8:34 ` Colin Ian King @ 2011-05-16 8:37 ` Mel Gorman -1 siblings, 0 replies; 119+ messages in thread From: Mel Gorman @ 2011-05-16 8:37 UTC (permalink / raw) To: Colin Ian King Cc: Andrew Morton, James Bottomley, Raghavendra D Prabhu, Jan Kara, Chris Mason, Christoph Lameter, Pekka Enberg, Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Sat, May 14, 2011 at 10:34:33AM +0200, Colin Ian King wrote: > On Fri, 2011-05-13 at 15:03 +0100, Mel Gorman wrote: > > Changelog since V1 > > o kswapd should sleep if need_resched > > o Remove __GFP_REPEAT from GFP flags when speculatively using high > > orders so direct/compaction exits earlier > > o Remove __GFP_NORETRY for correctness > > o Correct logic in sleeping_prematurely > > o Leave SLUB using the default slub_max_order > > > > There are a few reports of people experiencing hangs when copying > > large amounts of data with kswapd using a large amount of CPU which > > appear to be due to recent reclaim changes. > > > > SLUB using high orders is the trigger but not the root cause as SLUB > > has been using high orders for a while. The following four patches > > aim to fix the problems in reclaim while reducing the cost for SLUB > > using those high orders. > > > > Patch 1 corrects logic introduced by commit [1741c877: mm: > > kswapd: keep kswapd awake for high-order allocations until > > a percentage of the node is balanced] to allow kswapd to > > go to sleep when balanced for high orders. > > > > Patch 2 prevents kswapd waking up in response to SLUBs speculative > > use of high orders. > > > > Patch 3 further reduces the cost by prevent SLUB entering direct > > compaction or reclaim paths on the grounds that falling > > back to order-0 should be cheaper. > > > > Patch 4 notes that even when kswapd is failing to keep up with > > allocation requests, it should still go to sleep when its > > quota has expired to prevent it spinning. > > > > My own data on this is not great. I haven't really been able to > > reproduce the same problem locally. > > > > The test case is simple. "download tar" wgets a large tar file and > > stores it locally. "unpack" is expanding it (15 times physical RAM > > in this case) and "delete source dirs" is the tarfile being deleted > > again. I also experimented with having the tar copied numerous times > > and into deeper directories to increase the size but the results were > > not particularly interesting so I left it as one tar. > > > > In the background, applications are being launched to time to vaguely > > simulate activity on the desktop and to measure how long it takes > > applications to start. 
> > > > Test server, 4 CPU threads, x86_64, 2G of RAM, no PREEMPT, no COMPACTION, X running > > LARGE COPY AND UNTAR > > vanilla fixprematurely kswapd-nowwake slub-noexstep kswapdsleep > > download tar 95 ( 0.00%) 94 ( 1.06%) 94 ( 1.06%) 94 ( 1.06%) 94 ( 1.06%) > > unpack tar 654 ( 0.00%) 649 ( 0.77%) 655 (-0.15%) 589 (11.04%) 598 ( 9.36%) > > copy source files 0 ( 0.00%) 0 ( 0.00%) 0 ( 0.00%) 0 ( 0.00%) 0 ( 0.00%) > > delete source dirs 327 ( 0.00%) 334 (-2.10%) 318 ( 2.83%) 325 ( 0.62%) 320 ( 2.19%) > > MMTests Statistics: duration > > User/Sys Time Running Test (seconds) 1139.7 1142.55 1149.78 1109.32 1113.26 > > Total Elapsed Time (seconds) 1341.59 1342.45 1324.90 1271.02 1247.35 > > > > MMTests Statistics: application launch > > evolution-wait30 mean 34.92 34.96 34.92 34.92 35.08 > > gnome-terminal-find mean 7.96 7.96 8.76 7.80 7.96 > > iceweasel-table mean 7.93 7.81 7.73 7.65 7.88 > > > > evolution-wait30 stddev 0.96 1.22 1.27 1.20 1.15 > > gnome-terminal-find stddev 3.02 3.09 3.51 2.99 3.02 > > iceweasel-table stddev 1.05 0.90 1.09 1.11 1.11 > > > > Having SLUB avoid expensive steps in reclaim improves performance > > by quite a bit with the overall test completing 1.5 minutes > > faster. Application launch times were not really affected but it's > > not something my test machine was suffering from in the first place > > so it's not really conclusive. The kswapd patches also did not appear > > to help but again, the test machine wasn't suffering that problem. > > > > These patches are against 2.6.39-rc7. Again, testing would be > > appreciated. > > These patches solve the problem for me. I've been soak testing the file > copy test > for 3.5 hours with nearly 400 test cycles and observed no lockups at all > - rock solid. From my observations from the output from vmstat the > system is behaving sanely. > Thanks for finding a solution - much appreciated! > Can you tell me if just patches 1 and 4 fix the problem please? It'd be good to know if this was only a reclaim-related problem. Thanks. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH 0/4] Reduce impact to overall system of SLUB using high-order allocations V2 2011-05-16 8:37 ` Mel Gorman @ 2011-05-16 11:24 ` Colin Ian King -1 siblings, 0 replies; 119+ messages in thread From: Colin Ian King @ 2011-05-16 11:24 UTC (permalink / raw) To: Mel Gorman Cc: Andrew Morton, James Bottomley, Raghavendra D Prabhu, Jan Kara, Chris Mason, Christoph Lameter, Pekka Enberg, Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4 On Mon, 2011-05-16 at 09:37 +0100, Mel Gorman wrote: > On Sat, May 14, 2011 at 10:34:33AM +0200, Colin Ian King wrote: > > On Fri, 2011-05-13 at 15:03 +0100, Mel Gorman wrote: > > > Changelog since V1 > > > o kswapd should sleep if need_resched > > > o Remove __GFP_REPEAT from GFP flags when speculatively using high > > > orders so direct/compaction exits earlier > > > o Remove __GFP_NORETRY for correctness > > > o Correct logic in sleeping_prematurely > > > o Leave SLUB using the default slub_max_order > > > > > > There are a few reports of people experiencing hangs when copying > > > large amounts of data with kswapd using a large amount of CPU which > > > appear to be due to recent reclaim changes. > > > > > > SLUB using high orders is the trigger but not the root cause as SLUB > > > has been using high orders for a while. The following four patches > > > aim to fix the problems in reclaim while reducing the cost for SLUB > > > using those high orders. > > > > > > Patch 1 corrects logic introduced by commit [1741c877: mm: > > > kswapd: keep kswapd awake for high-order allocations until > > > a percentage of the node is balanced] to allow kswapd to > > > go to sleep when balanced for high orders. > > > > > > Patch 2 prevents kswapd waking up in response to SLUBs speculative > > > use of high orders. > > > > > > Patch 3 further reduces the cost by prevent SLUB entering direct > > > compaction or reclaim paths on the grounds that falling > > > back to order-0 should be cheaper. > > > > > > Patch 4 notes that even when kswapd is failing to keep up with > > > allocation requests, it should still go to sleep when its > > > quota has expired to prevent it spinning. > > > > > > My own data on this is not great. I haven't really been able to > > > reproduce the same problem locally. > > > > > > The test case is simple. "download tar" wgets a large tar file and > > > stores it locally. "unpack" is expanding it (15 times physical RAM > > > in this case) and "delete source dirs" is the tarfile being deleted > > > again. I also experimented with having the tar copied numerous times > > > and into deeper directories to increase the size but the results were > > > not particularly interesting so I left it as one tar. > > > > > > In the background, applications are being launched to time to vaguely > > > simulate activity on the desktop and to measure how long it takes > > > applications to start. 
> > > > > > Test server, 4 CPU threads, x86_64, 2G of RAM, no PREEMPT, no COMPACTION, X running > > > LARGE COPY AND UNTAR > > > vanilla fixprematurely kswapd-nowwake slub-noexstep kswapdsleep > > > download tar 95 ( 0.00%) 94 ( 1.06%) 94 ( 1.06%) 94 ( 1.06%) 94 ( 1.06%) > > > unpack tar 654 ( 0.00%) 649 ( 0.77%) 655 (-0.15%) 589 (11.04%) 598 ( 9.36%) > > > copy source files 0 ( 0.00%) 0 ( 0.00%) 0 ( 0.00%) 0 ( 0.00%) 0 ( 0.00%) > > > delete source dirs 327 ( 0.00%) 334 (-2.10%) 318 ( 2.83%) 325 ( 0.62%) 320 ( 2.19%) > > > MMTests Statistics: duration > > > User/Sys Time Running Test (seconds) 1139.7 1142.55 1149.78 1109.32 1113.26 > > > Total Elapsed Time (seconds) 1341.59 1342.45 1324.90 1271.02 1247.35 > > > > > > MMTests Statistics: application launch > > > evolution-wait30 mean 34.92 34.96 34.92 34.92 35.08 > > > gnome-terminal-find mean 7.96 7.96 8.76 7.80 7.96 > > > iceweasel-table mean 7.93 7.81 7.73 7.65 7.88 > > > > > > evolution-wait30 stddev 0.96 1.22 1.27 1.20 1.15 > > > gnome-terminal-find stddev 3.02 3.09 3.51 2.99 3.02 > > > iceweasel-table stddev 1.05 0.90 1.09 1.11 1.11 > > > > > > Having SLUB avoid expensive steps in reclaim improves performance > > > by quite a bit with the overall test completing 1.5 minutes > > > faster. Application launch times were not really affected but it's > > > not something my test machine was suffering from in the first place > > > so it's not really conclusive. The kswapd patches also did not appear > > > to help but again, the test machine wasn't suffering that problem. > > > > > > These patches are against 2.6.39-rc7. Again, testing would be > > > appreciated. > > > > These patches solve the problem for me. I've been soak testing the file > > copy test > > for 3.5 hours with nearly 400 test cycles and observed no lockups at all > > - rock solid. From my observations from the output from vmstat the > > system is behaving sanely. > > Thanks for finding a solution - much appreciated! > > > > Can you tell me if just patches 1 and 4 fix the problem please? It'd be good > to know if this was only a reclaim-related problem. Thanks. Hi Mel, Soak tested just patches 1 + 4 and works fine. Did 250 cycles for ~2 hours, no lockups, and the output from vmstat looked sane. Colin > ^ permalink raw reply [flat|nested] 119+ messages in thread
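The negated check above leans on the pgdat_balanced() helper introduced by commit 1741c877. For reference, a sketch of that helper as it appears in 2.6.39-era mm/vmscan.c; the 25% threshold implemented by the right shift is the "percentage of the node is balanced" that the changelog refers to.

/*
 * Reference sketch of the 2.6.39-era helper: a node counts as balanced
 * for a high-order kswapd wakeup once at least a quarter of the pages
 * in the zones up to classzone_idx sit in zones whose high watermark
 * is satisfied.
 */
static bool pgdat_balanced(pg_data_t *pgdat, unsigned long balanced_pages,
						int classzone_idx)
{
	unsigned long present_pages = 0;
	int i;

	/* Count the pages in every zone eligible for this allocation */
	for (i = 0; i <= classzone_idx; i++)
		present_pages += pgdat->node_zones[i].present_pages;

	/* Balanced if more than 25% of eligible pages are in balanced zones */
	return balanced_pages > (present_pages >> 2);
}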
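The gist of patches 2 and 3, which Colin's earlier 400-cycle run also covered, is a change to the GFP mask that SLUB's allocate_slab() uses for its first, speculative high-order attempt. The sketch below is reconstructed from the changelog, assuming the 2.6.39 shape of mm/slub.c (alloc_slab_page(), s->oo, s->min); it is not the posted diff, and the exact flag combination is an assumption pieced together from the changelog items about __GFP_REPEAT, __GFP_NORETRY and kswapd wakeups. Stat accounting and irq handling from the real function are elided.

/*
 * Illustrative sketch only -- reconstructed from the changelog, not the
 * posted diff. allocate_slab() first tries the cache's preferred
 * (possibly high) order and falls back to the minimum order on failure.
 */
static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
{
	struct page *page;
	struct kmem_cache_order_objects oo = s->oo;
	gfp_t alloc_gfp;

	/*
	 * Speculative high-order attempt: fail quietly (__GFP_NOWARN),
	 * do not wake kswapd (__GFP_NO_KSWAPD, patch 2), and do not
	 * enter direct reclaim/compaction or retry loops (__GFP_WAIT
	 * and __GFP_REPEAT cleared, patch 3). Falling back to order-0
	 * is assumed cheaper than reclaiming to satisfy a high order.
	 */
	alloc_gfp = (flags | __GFP_NOWARN | __GFP_NO_KSWAPD) &
			~(__GFP_NOFAIL | __GFP_WAIT | __GFP_REPEAT);

	page = alloc_slab_page(alloc_gfp, node, oo);
	if (unlikely(!page)) {
		/*
		 * Fall back to the minimum order using the caller's
		 * original flags, which may reclaim and wake kswapd.
		 */
		oo = s->min;
		page = alloc_slab_page(flags, node, oo);
	}
	return page;
}

The consequence is that a high-order slab allocation can only ever fail fast: the expensive reclaim machinery is reserved for the order-0 fallback that must succeed for the allocation to proceed at all.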
end of thread, other threads:[~2011-05-20  0:06 UTC | newest]

Thread overview: 119+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-05-13 14:03 [PATCH 0/4] Reduce impact to overall system of SLUB using high-order allocations V2 Mel Gorman
2011-05-13 14:03 ` [PATCH 1/4] mm: vmscan: Correct use of pgdat_balanced in sleeping_prematurely Mel Gorman
2011-05-13 14:28 ` Johannes Weiner
2011-05-14 16:30 ` Minchan Kim
2011-05-16 14:30 ` Rik van Riel
2011-05-13 14:03 ` [PATCH 2/4] mm: slub: Do not wake kswapd for SLUBs speculative high-order allocations Mel Gorman
2011-05-16 21:10 ` David Rientjes
2011-05-18  6:09 ` Pekka Enberg
2011-05-18 17:21 ` Christoph Lameter
2011-05-13 14:03 ` [PATCH 3/4] mm: slub: Do not take expensive steps " Mel Gorman
2011-05-16 21:16 ` David Rientjes
2011-05-17  8:42 ` Mel Gorman
2011-05-17 13:51 ` Christoph Lameter
2011-05-17 16:22 ` Mel Gorman
2011-05-17 17:52 ` Christoph Lameter
2011-05-17 19:35 ` David Rientjes
2011-05-17 19:31 ` David Rientjes
2011-05-13 14:03 ` [PATCH 4/4] mm: vmscan: If kswapd has been running too long, allow it to sleep Mel Gorman
2011-05-15 10:27 ` KOSAKI Motohiro
2011-05-16  4:21 ` James Bottomley
2011-05-16  5:04 ` Minchan Kim
2011-05-16  8:45 ` Mel Gorman
2011-05-16  8:58 ` Minchan Kim
2011-05-16 10:27 ` Mel Gorman
2011-05-16 23:50 ` Minchan Kim
2011-05-17  0:48 ` Minchan Kim
2011-05-17 10:38 ` Mel Gorman
2011-05-17 13:50 ` Colin Ian King
2011-05-17 16:15 ` [PATCH] mm: vmscan: Correctly check if reclaimer should schedule during shrink_slab Mel Gorman
2011-05-18  0:45 ` KOSAKI Motohiro
2011-05-19  0:03 ` Minchan Kim
2011-05-19  0:09 ` Minchan Kim
2011-05-19 11:36 ` Colin Ian King
2011-05-20  0:06 ` Minchan Kim
2011-05-18  4:19 ` [PATCH 4/4] mm: vmscan: If kswapd has been running too long, allow it to sleep Minchan Kim
2011-05-18  7:39 ` Colin Ian King
2011-05-18  4:09 ` James Bottomley
2011-05-18  1:05 ` KOSAKI Motohiro
2011-05-18  5:44 ` Minchan Kim
2011-05-18  6:05 ` KOSAKI Motohiro
2011-05-18  9:58 ` Mel Gorman
2011-05-18 22:55 ` Minchan Kim
2011-05-18 23:54 ` KOSAKI Motohiro
2011-05-18  0:26 ` KOSAKI Motohiro
2011-05-18  9:57 ` Mel Gorman
2011-05-16  8:45 ` Mel Gorman
2011-05-16 14:30 ` Rik van Riel
2011-05-13 15:19 ` [PATCH 0/4] Reduce impact to overall system of SLUB using high-order allocations V2 James Bottomley
2011-05-13 15:52 ` Mel Gorman
2011-05-13 15:21 ` Christoph Lameter
2011-05-13 15:43 ` Mel Gorman
2011-05-14  8:34 ` Colin Ian King
2011-05-16  8:37 ` Mel Gorman
2011-05-16 11:24 ` Colin Ian King