* [RFC PATCH 0/3] Reduce frequency of stalls due to zone_reclaim() on NUMA
@ 2011-07-11 13:01 ` Mel Gorman
0 siblings, 0 replies; 42+ messages in thread
From: Mel Gorman @ 2011-07-11 13:01 UTC (permalink / raw)
To: linux-mm; +Cc: linux-kernel, Mel Gorman
There have been a small number of complaints, reported on a
distribution bugzilla, about significant stalls while copying large
amounts of data on NUMA machines. In these cases, zone_reclaim was
enabled by default due to large NUMA distances. In general, the
complaints were not about the workload itself unless it was a file
server (in which case the recommendation was to disable zone_reclaim).
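For affected systems, the workaround mentioned above can be applied at
runtime; a minimal sketch, assuming the standard vm.zone_reclaim_mode
sysctl interface (this is administrative context, not part of the series):

```shell
# Check whether zone_reclaim is enabled (non-zero means enabled;
# kernels enable it by default when NUMA distances are large)
cat /proc/sys/vm/zone_reclaim_mode

# Disable it at runtime
sysctl -w vm.zone_reclaim_mode=0

# Persist the setting across reboots
echo "vm.zone_reclaim_mode = 0" >> /etc/sysctl.conf
```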
The stalls are mostly due to significant amounts of time spent
scanning the preferred zone for pages to free. After a failure, the
allocator might fall back to another node (as zonelists are often
node-ordered rather than zone-ordered) but it stalls again quickly on
the next allocation attempt. In bad cases, each page allocated results
in a full scan of the preferred zone.
This series aims to reduce some of the impact of zone_reclaim.
Patch 1 stops zone_reclaim using PF_SWAPWRITE. As it is a direct
reclaimer, it should not ignore device congestion.
Patch 2 checks the preferred zone for a recent allocation failure,
which is particularly important if zone_reclaim has failed
recently. This avoids rescanning the zone in the near future
and instead falls back to another node. This may hurt node
locality in some cases, but a zone_reclaim failure is more
expensive than a remote access.
Patch 3 clears the zlc information after direct reclaim. Otherwise,
zone_reclaim can mark zones full; direct reclaim may then
reclaim enough pages, but the zone is still not considered
for allocation.
This was tested on a 24-thread 2-node x86_64 machine. The tests were
focused on large amounts of IO. All tests were bound to the CPUs
on node-0 to avoid disturbances due to processes being scheduled on
different nodes. The kernels tested are
3.0-rc6-vanilla Vanilla 3.0-rc6
zlcfirst Patches 1+2 applied
zlcreconsider All patches applied
FS-Mark
./fs_mark -d /tmp/fsmark-10813 -D 100 -N 5000 -n 208 -L 35 -t 24 -S0 -s 524288
fsmark-3.0-rc6 3.0-rc6 3.0-rc6
0.2-vanilla zlcfirst zlcreconsider
Files/s min 54.90 ( 0.00%) 49.80 (-10.24%) 49.10 (-11.81%)
Files/s mean 100.11 ( 0.00%) 135.17 (25.94%) 146.93 (31.87%)
Files/s stddev 57.51 ( 0.00%) 138.97 (58.62%) 158.69 (63.76%)
Files/s max 361.10 ( 0.00%) 834.40 (56.72%) 802.40 (55.00%)
Overhead min 76704.00 ( 0.00%) 76501.00 ( 0.27%) 77784.00 (-1.39%)
Overhead mean 1485356.51 ( 0.00%) 1035797.83 (43.40%) 1594680.26 (-6.86%)
Overhead stddev 1848122.53 ( 0.00%) 881489.88 (109.66%) 1772354.90 ( 4.27%)
Overhead max 7989060.00 ( 0.00%) 3369118.00 (137.13%) 10135324.00 (-21.18%)
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 501.49 493.91 499.93
Total Elapsed Time (seconds) 2451.57 2257.48 2215.92
MMTests Statistics: vmstat
Page Ins 46268 63840 66008
Page Outs 90821596 90671128 88043732
Swap Ins 0 0 0
Swap Outs 0 0 0
Direct pages scanned 13091697 8966863 8971790
Kswapd pages scanned 0 1830011 1831116
Kswapd pages reclaimed 0 1829068 1829930
Direct pages reclaimed 13037777 8956828 8648314
Kswapd efficiency 100% 99% 99%
Kswapd velocity 0.000 810.643 826.346
Direct efficiency 99% 99% 96%
Direct velocity 5340.128 3972.068 4048.788
Percentage direct scans 100% 83% 83%
Page writes by reclaim 0 3 0
Slabs scanned 796672 720640 720256
Direct inode steals 7422667 7160012 7088638
Kswapd inode steals 0 1736840 2021238
The test completes far faster with a large increase in the number of
files created per second. Standard deviation is high because a small
number of iterations were much higher than the mean. The number of pages
scanned by zone_reclaim is reduced and more of the work is done by kswapd.
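The derived efficiency and velocity figures in the tables follow
directly from the raw counters and elapsed time. A small sketch of the
computation, assuming MMTests defines efficiency as pages reclaimed per
page scanned and velocity as pages scanned per second of elapsed time:

```python
def reclaim_stats(pages_scanned, pages_reclaimed, elapsed_seconds):
    """Derive the efficiency/velocity figures reported above."""
    # Efficiency: percentage of scanned pages that were reclaimed.
    # A scan count of zero is treated as 100% (no wasted scanning).
    if pages_scanned == 0:
        efficiency = 100.0
    else:
        efficiency = 100.0 * pages_reclaimed / pages_scanned
    # Velocity: pages scanned per second of wall-clock test time.
    velocity = pages_scanned / elapsed_seconds
    return efficiency, velocity

# zlcfirst FS-Mark kswapd figures from the table above
eff, vel = reclaim_stats(1830011, 1829068, 2257.48)
# vel comes out around 810.6 pages/sec, matching the report
```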
LARGE DD
3.0-rc6 3.0-rc6 3.0-rc6
vanilla zlcfirst zlcreconsider
download tar 59 ( 0.00%) 59 ( 0.00%) 55 ( 7.27%)
dd source files 527 ( 0.00%) 296 (78.04%) 320 (64.69%)
delete source 36 ( 0.00%) 19 (89.47%) 20 (80.00%)
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 125.03 118.98 122.01
Total Elapsed Time (seconds) 624.56 375.02 398.06
MMTests Statistics: vmstat
Page Ins 3594216 439368 407032
Page Outs 23380832 23380488 23377444
Swap Ins 0 0 0
Swap Outs 0 436 287
Direct pages scanned 17482342 69315973 82864918
Kswapd pages scanned 0 519123 575425
Kswapd pages reclaimed 0 466501 522487
Direct pages reclaimed 5858054 2732949 2712547
Kswapd efficiency 100% 89% 90%
Kswapd velocity 0.000 1384.254 1445.574
Direct efficiency 33% 3% 3%
Direct velocity 27991.453 184832.737 208171.929
Percentage direct scans 100% 99% 99%
Page writes by reclaim 0 5082 13917
Slabs scanned 17280 29952 35328
Direct inode steals 115257 1431122 332201
Kswapd inode steals 0 0 979532
This test downloads a large tarfile and copies it with dd a number of
times, similar to the most recent bug report I dealt with. Time to
completion is reduced. The number of pages scanned directly is still
disturbingly high and the efficiency low, but this is likely due to
the number of dirty pages encountered. The figures could probably be
improved with more work on how kswapd is used and how dirty pages
are handled, but that is separate work and this result is significant
on its own.
Streaming Mapped Writer
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 124.47 111.67 112.64
Total Elapsed Time (seconds) 2138.14 1816.30 1867.56
MMTests Statistics: vmstat
Page Ins 90760 89124 89516
Page Outs 121028340 120199524 120736696
Swap Ins 0 86 55
Swap Outs 0 0 0
Direct pages scanned 114989363 96461439 96330619
Kswapd pages scanned 56430948 56965763 57075875
Kswapd pages reclaimed 27743219 27752044 27766606
Direct pages reclaimed 49777 46884 36655
Kswapd efficiency 49% 48% 48%
Kswapd velocity 26392.541 31363.631 30561.736
Direct efficiency 0% 0% 0%
Direct velocity 53780.091 53108.759 51581.004
Percentage direct scans 67% 62% 62%
Page writes by reclaim 385 122 1513
Slabs scanned 43008 39040 42112
Direct inode steals 0 10 8
Kswapd inode steals 733 534 477
This test just creates a large file mapping and writes to it
linearly. Time to completion is again reduced.
The gains are mostly down to two things. In many cases there is
less scanning, as zone_reclaim simply gives up faster due to recent
failures. The second reason is that memory is used more efficiently:
instead of scanning the preferred zone every time, the allocator falls
back to another zone and uses it, improving overall memory utilisation.
Comments?
mm/page_alloc.c | 54 +++++++++++++++++++++++++++++++++++++++++-------------
mm/vmscan.c | 4 ++--
2 files changed, 43 insertions(+), 15 deletions(-)
--
1.7.3.4
* [PATCH 1/3] mm: vmscan: Do not use PF_SWAPWRITE from zone_reclaim
2011-07-11 13:01 ` Mel Gorman
@ 2011-07-11 13:01 ` Mel Gorman
0 siblings, 0 replies; 42+ messages in thread
From: Mel Gorman @ 2011-07-11 13:01 UTC (permalink / raw)
To: linux-mm; +Cc: linux-kernel, Mel Gorman
Zone reclaim is similar to direct reclaim in a number of respects.
PF_SWAPWRITE is used by kswapd to avoid a write-congestion check,
but it is also set for zone_reclaim, which is inappropriate.
Setting it potentially allows zone_reclaim users to cause large IO
stalls, which is worse than remote memory accesses.
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
mm/vmscan.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4f49535..ebef213 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3063,7 +3063,7 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
* and we also need to be able to write out pages for RECLAIM_WRITE
* and RECLAIM_SWAP.
*/
- p->flags |= PF_MEMALLOC | PF_SWAPWRITE;
+ p->flags |= PF_MEMALLOC;
lockdep_set_current_reclaim_state(gfp_mask);
reclaim_state.reclaimed_slab = 0;
p->reclaim_state = &reclaim_state;
@@ -3116,7 +3116,7 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
}
p->reclaim_state = NULL;
- current->flags &= ~(PF_MEMALLOC | PF_SWAPWRITE);
+ current->flags &= ~PF_MEMALLOC;
lockdep_clear_current_reclaim_state();
return sc.nr_reclaimed >= nr_pages;
}
--
1.7.3.4
* [PATCH 2/3] mm: page allocator: Initialise ZLC for first zone eligible for zone_reclaim
2011-07-11 13:01 ` Mel Gorman
@ 2011-07-11 13:01 ` Mel Gorman
0 siblings, 0 replies; 42+ messages in thread
From: Mel Gorman @ 2011-07-11 13:01 UTC (permalink / raw)
To: linux-mm; +Cc: linux-kernel, Mel Gorman
The zonelist cache (ZLC) is used, among other things, to record whether
zone_reclaim() failed for a particular zone recently. The intention
is to avoid the high cost of scanning extremely long zonelists or of
scanning uselessly within a zone.
Currently the zonelist cache is set up only after the first zone has
been considered and zone_reclaim() has been called. The objective was
to avoid a costly setup, but zone_reclaim is itself quite expensive. If
it is failing regularly, for example because the first eligible zone has
mostly mapped pages, the cost in scanning and allocation stalls is far
higher than the ZLC initialisation step.
This patch initialises the ZLC before the first eligible zone calls
zone_reclaim(). Once it is initialised, the zone is checked for a
recent zone_reclaim failure; if it has failed recently, the zone is
skipped. As the first zone is now being checked, additional care has
to be taken about zones marked full. A zone can be marked "full" merely
because it does not have enough unmapped pages for zone_reclaim, but
this is excessive as direct reclaim or kswapd may succeed where
zone_reclaim fails. Zones are now only marked "full" when zone_reclaim
has scanned but failed to reclaim enough pages.
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
mm/page_alloc.c | 35 ++++++++++++++++++++++-------------
1 files changed, 22 insertions(+), 13 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4e8985a..6913854 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1664,7 +1664,7 @@ zonelist_scan:
continue;
if ((alloc_flags & ALLOC_CPUSET) &&
!cpuset_zone_allowed_softwall(zone, gfp_mask))
- goto try_next_zone;
+ continue;
BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
@@ -1676,17 +1676,36 @@ zonelist_scan:
classzone_idx, alloc_flags))
goto try_this_zone;
+ if (NUMA_BUILD && !did_zlc_setup && nr_online_nodes > 1) {
+ /*
+ * we do zlc_setup if there are multiple nodes
+ * and before considering the first zone allowed
+ * by the cpuset.
+ */
+ allowednodes = zlc_setup(zonelist, alloc_flags);
+ zlc_active = 1;
+ did_zlc_setup = 1;
+ }
+
if (zone_reclaim_mode == 0)
goto this_zone_full;
+ /*
+ * As we may have just activated ZLC, check if the first
+ * eligible zone has failed zone_reclaim recently.
+ */
+ if (NUMA_BUILD && zlc_active &&
+ !zlc_zone_worth_trying(zonelist, z, allowednodes))
+ continue;
+
ret = zone_reclaim(zone, gfp_mask, order);
switch (ret) {
case ZONE_RECLAIM_NOSCAN:
/* did not scan */
- goto try_next_zone;
+ continue;
case ZONE_RECLAIM_FULL:
/* scanned but unreclaimable */
- goto this_zone_full;
+ continue;
default:
/* did we reclaim enough */
if (!zone_watermark_ok(zone, order, mark,
@@ -1703,16 +1722,6 @@ try_this_zone:
this_zone_full:
if (NUMA_BUILD)
zlc_mark_zone_full(zonelist, z);
-try_next_zone:
- if (NUMA_BUILD && !did_zlc_setup && nr_online_nodes > 1) {
- /*
- * we do zlc_setup after the first zone is tried but only
- * if there are multiple nodes make it worthwhile
- */
- allowednodes = zlc_setup(zonelist, alloc_flags);
- zlc_active = 1;
- did_zlc_setup = 1;
- }
}
if (unlikely(NUMA_BUILD && page == NULL && zlc_active)) {
--
1.7.3.4
* [PATCH 3/3] mm: page allocator: Reconsider zones for allocation after direct reclaim
2011-07-11 13:01 ` Mel Gorman
@ 2011-07-11 13:01 ` Mel Gorman
0 siblings, 0 replies; 42+ messages in thread
From: Mel Gorman @ 2011-07-11 13:01 UTC (permalink / raw)
To: linux-mm; +Cc: linux-kernel, Mel Gorman
With zone_reclaim_mode enabled, it's possible for zones to be considered
full in the zonelist_cache so that they are skipped in the future. If
the process enters direct reclaim, the ZLC may still consider zones to
be full even after pages have been reclaimed from them. Reconsider all
zones for allocation if direct reclaim returns successfully.
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
mm/page_alloc.c | 19 +++++++++++++++++++
1 files changed, 19 insertions(+), 0 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6913854..149409c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1616,6 +1616,21 @@ static void zlc_mark_zone_full(struct zonelist *zonelist, struct zoneref *z)
set_bit(i, zlc->fullzones);
}
+/*
+ * clear all zones full, called after direct reclaim makes progress so that
+ * a zone that was recently full is not skipped over for up to a second
+ */
+static void zlc_clear_zones_full(struct zonelist *zonelist)
+{
+ struct zonelist_cache *zlc; /* cached zonelist speedup info */
+
+ zlc = zonelist->zlcache_ptr;
+ if (!zlc)
+ return;
+
+ bitmap_zero(zlc->fullzones, MAX_ZONES_PER_ZONELIST);
+}
+
#else /* CONFIG_NUMA */
static nodemask_t *zlc_setup(struct zonelist *zonelist, int alloc_flags)
@@ -1963,6 +1978,10 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
if (unlikely(!(*did_some_progress)))
return NULL;
+ /* After successful reclaim, reconsider all zones for allocation */
+ if (NUMA_BUILD)
+ zlc_clear_zones_full(zonelist);
+
retry:
page = get_page_from_freelist(gfp_mask, nodemask, order,
zonelist, high_zoneidx,
--
1.7.3.4
* Re: [PATCH 1/3] mm: vmscan: Do not use PF_SWAPWRITE from zone_reclaim
2011-07-11 13:01 ` Mel Gorman
@ 2011-07-12 9:27 ` Minchan Kim
0 siblings, 0 replies; 42+ messages in thread
From: Minchan Kim @ 2011-07-12 9:27 UTC (permalink / raw)
To: Mel Gorman; +Cc: linux-mm, linux-kernel, Christoph Lameter, KOSAKI Motohiro
Hi Mel,
On Mon, Jul 11, 2011 at 10:01 PM, Mel Gorman <mgorman@suse.de> wrote:
> Zone reclaim is similar to direct reclaim in a number of respects.
> PF_SWAPWRITE is used by kswapd to avoid a write-congestion check,
> but it is also set for zone_reclaim, which is inappropriate.
> Setting it potentially allows zone_reclaim users to cause large IO
> stalls, which is worse than remote memory accesses.
As I read about zone_reclaim_mode in vm.txt, I think it's intentional.
It is meant to throttle processes that are writing large amounts
of data. The point is to prevent use of a remote node's free memory.
And we still have the comment; if you're right, you should remove it:
" * and we also need to be able to write out pages for RECLAIM_WRITE
 * and RECLAIM_SWAP."
And at least, we should Cc Christoph and KOSAKI.
>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> ---
> mm/vmscan.c | 4 ++--
> 1 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 4f49535..ebef213 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -3063,7 +3063,7 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
> * and we also need to be able to write out pages for RECLAIM_WRITE
> * and RECLAIM_SWAP.
> */
> - p->flags |= PF_MEMALLOC | PF_SWAPWRITE;
> + p->flags |= PF_MEMALLOC;
> lockdep_set_current_reclaim_state(gfp_mask);
> reclaim_state.reclaimed_slab = 0;
> p->reclaim_state = &reclaim_state;
> @@ -3116,7 +3116,7 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
> }
>
> p->reclaim_state = NULL;
> - current->flags &= ~(PF_MEMALLOC | PF_SWAPWRITE);
> + current->flags &= ~PF_MEMALLOC;
> lockdep_clear_current_reclaim_state();
> return sc.nr_reclaimed >= nr_pages;
> }
> --
> 1.7.3.4
>
--
Kind regards,
Minchan Kim
* Re: [PATCH 1/3] mm: vmscan: Do not use PF_SWAPWRITE from zone_reclaim
2011-07-12 9:27 ` Minchan Kim
@ 2011-07-12 9:40 ` KOSAKI Motohiro
-1 siblings, 0 replies; 42+ messages in thread
From: KOSAKI Motohiro @ 2011-07-12 9:40 UTC (permalink / raw)
To: minchan.kim; +Cc: mgorman, linux-mm, linux-kernel, cl
(2011/07/12 18:27), Minchan Kim wrote:
> Hi Mel,
>
> On Mon, Jul 11, 2011 at 10:01 PM, Mel Gorman <mgorman@suse.de> wrote:
>> Zone reclaim is similar to direct reclaim in a number of respects.
>> PF_SWAPWRITE is used by kswapd to avoid a write-congestion check
>> but it's also set for zone_reclaim, which is inappropriate.
>> Setting it potentially allows zone_reclaim users to cause large IO
>> stalls which is worse than remote memory accesses.
>
> As I read zone_reclaim_mode in vm.txt, I think it's intentional.
> It is meant to throttle processes which are writing large amounts
> of data. The point is to prevent use of a remote node's free memory.
>
> And we still have the comment. If you're right, you should remove the comment:
> " * and we also need to be able to write out pages for RECLAIM_WRITE
> * and RECLAIM_SWAP."
>
>
> And at least, we should Cc Christoph and KOSAKI.
Of course, I fully ack this. Do you remember that I posted the same patch
about one year ago? At that time, Mel disagreed with me and I'm glad to see
he has changed his mind. :)
>
>>
>> Signed-off-by: Mel Gorman <mgorman@suse.de>
>> ---
>> mm/vmscan.c | 4 ++--
>> 1 files changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 4f49535..ebef213 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -3063,7 +3063,7 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
>> * and we also need to be able to write out pages for RECLAIM_WRITE
>> * and RECLAIM_SWAP.
>> */
>> - p->flags |= PF_MEMALLOC | PF_SWAPWRITE;
>> + p->flags |= PF_MEMALLOC;
>> lockdep_set_current_reclaim_state(gfp_mask);
>> reclaim_state.reclaimed_slab = 0;
>> p->reclaim_state = &reclaim_state;
>> @@ -3116,7 +3116,7 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
>> }
>>
>> p->reclaim_state = NULL;
>> - current->flags &= ~(PF_MEMALLOC | PF_SWAPWRITE);
>> + current->flags &= ~PF_MEMALLOC;
>> lockdep_clear_current_reclaim_state();
>> return sc.nr_reclaimed >= nr_pages;
>> }
>> --
>> 1.7.3.4
>>
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 1/3] mm: vmscan: Do not use PF_SWAPWRITE from zone_reclaim
2011-07-12 9:40 ` KOSAKI Motohiro
@ 2011-07-12 9:55 ` Minchan Kim
-1 siblings, 0 replies; 42+ messages in thread
From: Minchan Kim @ 2011-07-12 9:55 UTC (permalink / raw)
To: KOSAKI Motohiro; +Cc: mgorman, linux-mm, linux-kernel, cl
Hi KOSAKI,
On Tue, Jul 12, 2011 at 6:40 PM, KOSAKI Motohiro
<kosaki.motohiro@jp.fujitsu.com> wrote:
> (2011/07/12 18:27), Minchan Kim wrote:
>> Hi Mel,
>>
>> On Mon, Jul 11, 2011 at 10:01 PM, Mel Gorman <mgorman@suse.de> wrote:
>>> Zone reclaim is similar to direct reclaim in a number of respects.
>>> PF_SWAPWRITE is used by kswapd to avoid a write-congestion check
>>> but it's also set for zone_reclaim, which is inappropriate.
>>> Setting it potentially allows zone_reclaim users to cause large IO
>>> stalls which is worse than remote memory accesses.
>>
>> As I read zone_reclaim_mode in vm.txt, I think it's intentional.
>> It is meant to throttle processes which are writing large amounts
>> of data. The point is to prevent use of a remote node's free memory.
>>
>> And we still have the comment. If you're right, you should remove the comment:
>> " * and we also need to be able to write out pages for RECLAIM_WRITE
>> * and RECLAIM_SWAP."
>>
>>
>> And at least, we should Cc Christoph and KOSAKI.
>
> Of course, I fully ack this. Do you remember that I posted the same patch
> about one year ago? At that time, Mel disagreed with me and I'm glad to see
> he has changed his mind. :)
I remember that but I don't know why Mel didn't ack at that time.
http://lkml.org/lkml/2010/8/5/44
Anyway, Hannes's bd2f6199cf introduced lumpy reclaim in
zone_reclaim, so it's natural for latency to increase when getting
big-order pages (i.e., it's a trade-off).
And as I read about zone_reclaim_mode in Documentation/sysctl/vm.txt,
I think the big latency (i.e., throttling of the process) is intentional,
to prevent stealing pages from other nodes.
I am not against this patch but, at least, we need the agreement of
Christoph and others, and if we agree on this change, we should change
vm.txt too.
--
Kind regards,
Minchan Kim
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 1/3] mm: vmscan: Do not use PF_SWAPWRITE from zone_reclaim
2011-07-12 9:40 ` KOSAKI Motohiro
@ 2011-07-12 10:14 ` Mel Gorman
-1 siblings, 0 replies; 42+ messages in thread
From: Mel Gorman @ 2011-07-12 10:14 UTC (permalink / raw)
To: KOSAKI Motohiro; +Cc: minchan.kim, linux-mm, linux-kernel, cl
On Tue, Jul 12, 2011 at 06:40:20PM +0900, KOSAKI Motohiro wrote:
> (2011/07/12 18:27), Minchan Kim wrote:
> > Hi Mel,
> >
> > On Mon, Jul 11, 2011 at 10:01 PM, Mel Gorman <mgorman@suse.de> wrote:
> >> Zone reclaim is similar to direct reclaim in a number of respects.
> >> PF_SWAPWRITE is used by kswapd to avoid a write-congestion check
> >> but it's also set for zone_reclaim, which is inappropriate.
> >> Setting it potentially allows zone_reclaim users to cause large IO
> >> stalls which is worse than remote memory accesses.
> >
> > As I read zone_reclaim_mode in vm.txt, I think it's intentional.
> > It is meant to throttle processes which are writing large amounts
> > of data. The point is to prevent use of a remote node's free memory.
> >
> > And we still have the comment. If you're right, you should remove the comment:
> > " * and we also need to be able to write out pages for RECLAIM_WRITE
> > * and RECLAIM_SWAP."
> >
> >
> > And at least, we should Cc Christoph and KOSAKI.
>
> Of course, I fully ack this. Do you remember that I posted the same patch
> about one year ago?
Nope, I didn't remember it at all :). I'll revive your Signed-off-by,
and sorry about that.
> At that time, Mel disagreed with me and I'm glad to see
> he has changed his mind. :)
>
Did I disagree because of this?
Simply that I believe the intention of PF_SWAPWRITE here was
to allow zone_reclaim() to aggressively reclaim memory if the
reclaim_mode allowed it as it was a statement that off-node
accesses are really not desired.
Or was some other problem brought up that I'm not thinking of now?
I no longer think the level of aggression is appropriate after seeing
how zone_reclaim can stall when just copying large amounts
of data on recent x86-64 NUMA machines. In the same mail, I said:
Ok. I am not fully convinced but I'll not block it either if you
believe it's necessary. My current understanding is that this
patch only makes a difference if the server is IO congested in
which case the system is struggling anyway and an off-node
access is going to be relatively small penalty overall.
Conceivably, having PF_SWAPWRITE set makes things worse in
that situation and the patch makes some sense.
While I still think this situation is hard to trigger, zone_reclaim
can cause significant stalls *without* IO and there is little point
making the situation even worse.
--
Mel Gorman
SUSE Labs
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 1/3] mm: vmscan: Do not use PF_SWAPWRITE from zone_reclaim
2011-07-12 9:55 ` Minchan Kim
@ 2011-07-12 15:43 ` Christoph Lameter
-1 siblings, 0 replies; 42+ messages in thread
From: Christoph Lameter @ 2011-07-12 15:43 UTC (permalink / raw)
To: Minchan Kim; +Cc: KOSAKI Motohiro, mgorman, linux-mm, linux-kernel
On Tue, 12 Jul 2011, Minchan Kim wrote:
> If I am not against this patch, at least, we need agreement of
> Christoph and others and if we agree this change, we changes vm.txt,
> too.
I think PF_SWAPWRITE should only be set if may_write was set earlier in
__zone_reclaim. If zone reclaim is not configured to do writeback then it
makes no sense to set the bit.
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 1/3] mm: vmscan: Do not use PF_SWAPWRITE from zone_reclaim
2011-07-12 10:14 ` Mel Gorman
@ 2011-07-13 0:34 ` KOSAKI Motohiro
-1 siblings, 0 replies; 42+ messages in thread
From: KOSAKI Motohiro @ 2011-07-13 0:34 UTC (permalink / raw)
To: mgorman; +Cc: minchan.kim, linux-mm, linux-kernel, cl
(2011/07/12 19:14), Mel Gorman wrote:
> On Tue, Jul 12, 2011 at 06:40:20PM +0900, KOSAKI Motohiro wrote:
>> (2011/07/12 18:27), Minchan Kim wrote:
>>> Hi Mel,
>>>
>>> On Mon, Jul 11, 2011 at 10:01 PM, Mel Gorman <mgorman@suse.de> wrote:
>>>> Zone reclaim is similar to direct reclaim in a number of respects.
>>>> PF_SWAPWRITE is used by kswapd to avoid a write-congestion check
>>>> but it's also set for zone_reclaim, which is inappropriate.
>>>> Setting it potentially allows zone_reclaim users to cause large IO
>>>> stalls which is worse than remote memory accesses.
>>>
>>> As I read zone_reclaim_mode in vm.txt, I think it's intentional.
>>> It is meant to throttle processes which are writing large amounts
>>> of data. The point is to prevent use of a remote node's free memory.
>>>
>>> And we still have the comment. If you're right, you should remove the comment:
>>> " * and we also need to be able to write out pages for RECLAIM_WRITE
>>> * and RECLAIM_SWAP."
>>>
>>>
>>> And at least, we should Cc Christoph and KOSAKI.
>>
>> Of course, I fully ack this. Do you remember that I posted the same patch
>> about one year ago?
>
> Nope, I didn't remember it at all :). I'll revive your Signed-off-by,
> and sorry about that.
No, don't be sorry. I think my explanation was not enough, and I couldn't
show performance results. At that time, I didn't have access to a large
NUMA machine. Thank you for paying attention to the latency issue. I'm
really glad.
>
>> At that time, Mel disagreed with me and I'm glad to see
>> he has changed his mind. :)
>>
>
> Did I disagree because of this?
>
> Simply that I believe the intention of PF_SWAPWRITE here was
> to allow zone_reclaim() to aggressively reclaim memory if the
> reclaim_mode allowed it as it was a statement that off-node
> accesses are really not desired.
>
> Or was some other problem brought up that I'm not thinking of now?
To be honest, my brain is volatile memory and my recollection is unclear.
As far as I remember, yes, that was the only problem.
> I no longer think the level of aggression is appropriate after seeing
> how zone_reclaim can stall when just copying large amounts
> of data on recent x86-64 NUMA machines. In the same mail, I said
>
> Ok. I am not fully convinced but I'll not block it either if you
> believe it's necessary. My current understanding is that this
> patch only makes a difference if the server is IO congested in
> which case the system is struggling anyway and an off-node
> access is going to be relatively small penalty overall.
> Conceivably, having PF_SWAPWRITE set makes things worse in
> that situation and the patch makes some sense.
>
> While I still think this situation is hard to trigger, zone_reclaim
> can cause significant stalls *without* IO and there is little point
> making the situation even worse.
And, again, I fully agree with your [0/3] description.
Thanks.
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 3/3] mm: page allocator: Reconsider zones for allocation after direct reclaim
2011-07-11 13:01 ` Mel Gorman
@ 2011-07-13 0:42 ` KOSAKI Motohiro
-1 siblings, 0 replies; 42+ messages in thread
From: KOSAKI Motohiro @ 2011-07-13 0:42 UTC (permalink / raw)
To: mgorman; +Cc: linux-mm, linux-kernel
(2011/07/11 22:01), Mel Gorman wrote:
> With zone_reclaim_mode enabled, it's possible for zones to be considered
> full in the zonelist_cache so they are skipped in the future. If the
> process enters direct reclaim, the ZLC may still consider zones to be
> full even after reclaiming pages. Reconsider all zones for allocation
> if direct reclaim returns successfully.
>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
Hmmm...
I like the concept, but I worry a bit about a corner case.
If users are using cpusets/mempolicy, direct reclaim doesn't scan all zones.
Then zlc_clear_zones_full() seems like too aggressive an operation.
Instead, couldn't we clear zlc->fullzones from kswapd?
> ---
> mm/page_alloc.c | 19 +++++++++++++++++++
> 1 files changed, 19 insertions(+), 0 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 6913854..149409c 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1616,6 +1616,21 @@ static void zlc_mark_zone_full(struct zonelist *zonelist, struct zoneref *z)
> set_bit(i, zlc->fullzones);
> }
>
> +/*
> + * clear all zones full, called after direct reclaim makes progress so that
> + * a zone that was recently full is not skipped over for up to a second
> + */
> +static void zlc_clear_zones_full(struct zonelist *zonelist)
> +{
> + struct zonelist_cache *zlc; /* cached zonelist speedup info */
> +
> + zlc = zonelist->zlcache_ptr;
> + if (!zlc)
> + return;
> +
> + bitmap_zero(zlc->fullzones, MAX_ZONES_PER_ZONELIST);
> +}
> +
> #else /* CONFIG_NUMA */
>
> static nodemask_t *zlc_setup(struct zonelist *zonelist, int alloc_flags)
> @@ -1963,6 +1978,10 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
> if (unlikely(!(*did_some_progress)))
> return NULL;
>
> + /* After successful reclaim, reconsider all zones for allocation */
> + if (NUMA_BUILD)
> + zlc_clear_zones_full(zonelist);
> +
> retry:
> page = get_page_from_freelist(gfp_mask, nodemask, order,
> zonelist, high_zoneidx,
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 2/3] mm: page allocator: Initialise ZLC for first zone eligible for zone_reclaim
2011-07-11 13:01 ` Mel Gorman
@ 2011-07-13 1:15 ` KOSAKI Motohiro
-1 siblings, 0 replies; 42+ messages in thread
From: KOSAKI Motohiro @ 2011-07-13 1:15 UTC (permalink / raw)
To: mgorman; +Cc: linux-mm, linux-kernel
(2011/07/11 22:01), Mel Gorman wrote:
> The zonelist cache (ZLC) is used among other things to record if
> zone_reclaim() failed for a particular zone recently. The intention
> is to avoid a high cost scanning extremely long zonelists or scanning
> within the zone uselessly.
>
> Currently the zonelist cache is setup only after the first zone has
> been considered and zone_reclaim() has been called. The objective was
> to avoid a costly setup but zone_reclaim is itself quite expensive. If
> it is failing regularly such as the first eligible zone having mostly
> mapped pages, the cost in scanning and allocation stalls is far higher
> than the ZLC initialisation step.
>
> This patch initialises ZLC before the first eligible zone calls
> zone_reclaim(). Once initialised, it is checked whether the zone
> failed zone_reclaim recently. If it has, the zone is skipped. As the
> first zone is now being checked, additional care has to be taken about
> zones marked full. A zone can be marked "full" because it should not
> have enough unmapped pages for zone_reclaim but this is excessive as
> direct reclaim or kswapd may succeed where zone_reclaim fails. Only
> mark zones "full" after zone_reclaim fails if it failed to reclaim
> enough pages after scanning.
>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
If I understand correctly, this patch's pros/cons are:
pros.
1) faster when zone reclaim doesn't work effectively
cons.
2) slower when zone reclaim is off
3) slower when zone reclaim works effectively
(2) and (3) happen more frequently than (1), correct?
At least, I think we need to keep zero impact when zone reclaim mode is off.
Thanks.
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 1/3] mm: vmscan: Do not use PF_SWAPWRITE from zone_reclaim
2011-07-12 15:43 ` Christoph Lameter
@ 2011-07-13 10:40 ` Mel Gorman
-1 siblings, 0 replies; 42+ messages in thread
From: Mel Gorman @ 2011-07-13 10:40 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Minchan Kim, KOSAKI Motohiro, linux-mm, linux-kernel
On Tue, Jul 12, 2011 at 10:43:47AM -0500, Christoph Lameter wrote:
> On Tue, 12 Jul 2011, Minchan Kim wrote:
>
> > If I am not against this patch, at least, we need agreement of
> > Christoph and others and if we agree this change, we changes vm.txt,
> > too.
>
> I think PF_SWAPWRITE should only be set if may_write was set earlier in
> __zone_reclaim. If zone reclaim is not configured to do writeback then it
> makes no sense to set the bit.
>
That would effectively make the patch a no-op as the check for
PF_SWAPWRITE only happens if may_write is set. The point of the patch is
that zone reclaim differs from direct reclaim in that direct reclaim
obeys congestion whereas zone reclaim does not. If you're saying that
this is the way it's meant to be, then fine, I'll drop the patch. While
I think it's a bad idea, I also didn't specifically test for problems
related to it, and I think the other two patches in the series are more
important.
--
Mel Gorman
SUSE Labs
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 2/3] mm: page allocator: Initialise ZLC for first zone eligible for zone_reclaim
2011-07-13 1:15 ` KOSAKI Motohiro
@ 2011-07-13 11:02 ` Mel Gorman
-1 siblings, 0 replies; 42+ messages in thread
From: Mel Gorman @ 2011-07-13 11:02 UTC (permalink / raw)
To: KOSAKI Motohiro; +Cc: linux-mm, linux-kernel
On Wed, Jul 13, 2011 at 10:15:15AM +0900, KOSAKI Motohiro wrote:
> (2011/07/11 22:01), Mel Gorman wrote:
> > The zonelist cache (ZLC) is used among other things to record if
> > zone_reclaim() failed for a particular zone recently. The intention
> > is to avoid a high cost scanning extremely long zonelists or scanning
> > within the zone uselessly.
> >
> > Currently the zonelist cache is setup only after the first zone has
> > been considered and zone_reclaim() has been called. The objective was
> > to avoid a costly setup but zone_reclaim is itself quite expensive. If
> > it is failing regularly such as the first eligible zone having mostly
> > mapped pages, the cost in scanning and allocation stalls is far higher
> > than the ZLC initialisation step.
> >
> > This patch initialises ZLC before the first eligible zone calls
> > zone_reclaim(). Once initialised, it is checked whether the zone
> > failed zone_reclaim recently. If it has, the zone is skipped. As the
> > first zone is now being checked, additional care has to be taken about
> > zones marked full. A zone can be marked "full" because it should not
> > have enough unmapped pages for zone_reclaim but this is excessive as
> > direct reclaim or kswapd may succeed where zone_reclaim fails. Only
> > mark zones "full" after zone_reclaim fails if it failed to reclaim
> > enough pages after scanning.
> >
> > Signed-off-by: Mel Gorman <mgorman@suse.de>
>
> If I understand correctly, this patch's pros/cons are:
>
> pros.
> 1) faster when zone reclaim doesn't work effectively
>
Yes.
> cons.
> 2) slower when zone reclaim is off
How is it slower with zone_reclaim off?
Before

	if (zone_reclaim_mode == 0)
		goto this_zone_full;
	...
this_zone_full:
	if (NUMA_BUILD)
		zlc_mark_zone_full(zonelist, z);
	if (NUMA_BUILD && !did_zlc_setup && nr_online_nodes > 1) {
		...
	}

After

	if (NUMA_BUILD && !did_zlc_setup && nr_online_nodes > 1) {
		...
	}
	if (zone_reclaim_mode == 0)
		goto this_zone_full;
this_zone_full:
	if (NUMA_BUILD)
		zlc_mark_zone_full(zonelist, z);

Bear in mind that if the watermarks are met on the first zone, the zlc
setup does not occur.
> 3) slower when zone reclaim works effectively
>
Marginally slower. It's now calling zlc_setup, so once a second it is
zeroing a bitmap, and it calls zlc_zone_worth_trying() on the first
zone, testing a bit on a cache-hot structure.
As the ineffective case can be triggered by a simple cp, I think the
cost is justified. Can you think of a better way of doing this?
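[Editor's note: the once-a-second cost described here can be modelled with a short sketch. This is not kernel code: last_full_zap and the one-second interval mirror the real zlcache fields, but a plain long counter stands in for jiffies/HZ.]

```c
#include <assert.h>

/* Once per "second", zlc_setup() re-zeroes the fullzones bitmap so
 * stale full markings persist for at most one second. Between zaps,
 * the per-allocation cost is testing a single bit. */
struct zlc_model {
	long last_full_zap;	/* stands in for jiffies at the last zap */
	unsigned long fullzones;
};

static void zlc_setup_model(struct zlc_model *zlc, long now, long hz)
{
	/* kernel equivalent: time_after(jiffies, last_full_zap + HZ) */
	if (now - zlc->last_full_zap >= hz) {
		zlc->fullzones = 0;
		zlc->last_full_zap = now;
	}
}
```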
> (2) and (3) happen more frequently than (1), correct?
Yes. I'd still expect zone_reclaim to be off on the majority of
machines and even when enabled, I think it's relatively rare we hit the
case where the workload is regularly falling over to the other node
except in the case where it's a file server. Still, a cp is not so
uncommon that the kernel should be allowed to slow to a crawl as a result.
> At least, I think we need to keep zero impact when zone reclaim mode is off.
>
I agree with this but I'm missing where we are taking the big hit with
zone_reclaim==0.
--
Mel Gorman
SUSE Labs
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 3/3] mm: page allocator: Reconsider zones for allocation after direct reclaim
2011-07-13 0:42 ` KOSAKI Motohiro
@ 2011-07-13 11:10 ` Mel Gorman
-1 siblings, 0 replies; 42+ messages in thread
From: Mel Gorman @ 2011-07-13 11:10 UTC (permalink / raw)
To: KOSAKI Motohiro; +Cc: linux-mm, linux-kernel
On Wed, Jul 13, 2011 at 09:42:39AM +0900, KOSAKI Motohiro wrote:
> (2011/07/11 22:01), Mel Gorman wrote:
> > With zone_reclaim_mode enabled, it's possible for zones to be considered
> > full in the zonelist_cache so they are skipped in the future. If the
> > process enters direct reclaim, the ZLC may still consider zones to be
> > full even after reclaiming pages. Reconsider all zones for allocation
> > if direct reclaim returns successfully.
> >
> > Signed-off-by: Mel Gorman <mgorman@suse.de>
>
> Hmmm...
>
> I like the concept, but I'm a bit worried about a corner case.
>
> If users are using cpusets/mempolicy, direct reclaim doesn't scan all zones.
> Then, zlc_clear_zones_full() seems like too aggressive an operation.
As the system is likely to be running slowly if it is in direct reclaim,
I felt the complexity of being careful about which zone was cleared was
not worth it.
> Instead, couldn't we turn zlc->fullzones off from kswapd?
>
Which zonelist should it clear (there are two) and when should it
happen? If it clears it on each cycle around balance_pgdat(), there
is no guarantee that it'll be cleared between when direct reclaim
finishes and an attempt is made to allocate.
--
Mel Gorman
SUSE Labs
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 2/3] mm: page allocator: Initialise ZLC for first zone eligible for zone_reclaim
2011-07-13 11:02 ` Mel Gorman
@ 2011-07-14 1:20 ` KOSAKI Motohiro
-1 siblings, 0 replies; 42+ messages in thread
From: KOSAKI Motohiro @ 2011-07-14 1:20 UTC (permalink / raw)
To: mgorman; +Cc: linux-mm, linux-kernel
(2011/07/13 20:02), Mel Gorman wrote:
> On Wed, Jul 13, 2011 at 10:15:15AM +0900, KOSAKI Motohiro wrote:
>> (2011/07/11 22:01), Mel Gorman wrote:
>>> The zonelist cache (ZLC) is used among other things to record if
>>> zone_reclaim() failed for a particular zone recently. The intention
>>> is to avoid a high cost scanning extremely long zonelists or scanning
>>> within the zone uselessly.
>>>
>>> Currently the zonelist cache is setup only after the first zone has
>>> been considered and zone_reclaim() has been called. The objective was
>>> to avoid a costly setup but zone_reclaim is itself quite expensive. If
>>> it is failing regularly such as the first eligible zone having mostly
>>> mapped pages, the cost in scanning and allocation stalls is far higher
>>> than the ZLC initialisation step.
>>>
>>> This patch initialises ZLC before the first eligible zone calls
>>> zone_reclaim(). Once initialised, it is checked whether the zone
>>> failed zone_reclaim recently. If it has, the zone is skipped. As the
>>> first zone is now being checked, additional care has to be taken about
>>> zones marked full. A zone can be marked "full" because it should not
>>> have enough unmapped pages for zone_reclaim but this is excessive as
>>> direct reclaim or kswapd may succeed where zone_reclaim fails. Only
>>> mark zones "full" after zone_reclaim fails if it failed to reclaim
>>> enough pages after scanning.
>>>
>>> Signed-off-by: Mel Gorman <mgorman@suse.de>
>>
>> If I understand correctly, this patch's pros/cons are:
>>
>> pros.
>> 1) faster when zone reclaim doesn't work effectively
>>
>
> Yes.
>
>> cons.
>> 2) slower when zone reclaim is off
>
> How is it slower with zone_reclaim off?
>
> Before
>
> if (zone_reclaim_mode == 0)
> goto this_zone_full;
> ...
> this_zone_full:
> if (NUMA_BUILD)
> zlc_mark_zone_full(zonelist, z);
> if (NUMA_BUILD && !did_zlc_setup && nr_online_nodes > 1) {
> ...
> }
>
> After
> if (NUMA_BUILD && !did_zlc_setup && nr_online_nodes > 1) {
> ...
> }
> if (zone_reclaim_mode == 0)
> goto this_zone_full;
> this_zone_full:
> if (NUMA_BUILD)
> zlc_mark_zone_full(zonelist, z);
>
> Bear in mind that if the watermarks are met on the first zone, the zlc
> setup does not occur.
Right you are. Thank you for correcting me.
>> 3) slower when zone reclaim works effectively
>>
>
> Marginally slower. It's now calling zlc setup so once a second it's
> zeroing a bitmap and calling zlc_zone_worth_trying() on the first
> zone testing a bit on a cache-hot structure.
>
> As the ineffective case can be triggered by a simple cp, I think the
> cost is justified. Can you think of a better way of doing this?
So, I've now revisited your numbers in [0/3] and concluded that your patch
improves the simple cp case too. Please forget my last mail; this patch
looks nicer.
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
>
>> (2) and (3) happen more frequently than (1), correct?
>
> Yes. I'd still expect zone_reclaim to be off on the majority of
> machines and even when enabled, I think it's relatively rare we hit the
> case where the workload is regularly falling over to the other node
> except in the case where it's a file server. Still, a cp is not so
> uncommon that the kernel should be allowed to slow to a crawl as a result.
>
>> At least, I think we need to keep zero impact when zone reclaim mode is off.
>>
>
> I agree with this but I'm missing where we are taking the big hit with
> zone_reclaim==0.
>
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 3/3] mm: page allocator: Reconsider zones for allocation after direct reclaim
2011-07-13 11:10 ` Mel Gorman
@ 2011-07-14 3:20 ` KOSAKI Motohiro
-1 siblings, 0 replies; 42+ messages in thread
From: KOSAKI Motohiro @ 2011-07-14 3:20 UTC (permalink / raw)
To: mgorman; +Cc: linux-mm, linux-kernel
(2011/07/13 20:10), Mel Gorman wrote:
> On Wed, Jul 13, 2011 at 09:42:39AM +0900, KOSAKI Motohiro wrote:
>> (2011/07/11 22:01), Mel Gorman wrote:
>>> With zone_reclaim_mode enabled, it's possible for zones to be considered
>>> full in the zonelist_cache so they are skipped in the future. If the
>>> process enters direct reclaim, the ZLC may still consider zones to be
>>> full even after reclaiming pages. Reconsider all zones for allocation
>>> if direct reclaim returns successfully.
>>>
>>> Signed-off-by: Mel Gorman <mgorman@suse.de>
>>
>> Hmmm...
>>
>> I like the concept, but I'm a bit worried about a corner case.
>>
>> If users are using cpusets/mempolicy, direct reclaim doesn't scan all zones.
>> Then, zlc_clear_zones_full() seems like too aggressive an operation.
>
> As the system is likely to be running slow if it is in direct reclaim
> that the complexity of being careful about which zone was cleared was
> not worth it.
>
>> Instead, couldn't we turn zlc->fullzones off from kswapd?
>>
>
> Which zonelist should it clear (there are two) and when should it
> happen? If it clears it on each cycle around balance_pgdat(), there
> is no guarantee that it'll be cleared between when direct reclaim
> finishes and an attempt is made to allocate.
Hmm..
Probably I'm missing the point of this patch. Why do we need
to guarantee a tightly coupled zlc cache and direct reclaim? IIUC, the
zlc cache means "avoid touching the free lists if they have no free mem".
So, any point where free pages increase is an acceptable place to clear it, I thought.
On the other hand, direct reclaim finishing is no guarantee that the
zones of the zonelist have enough free memory, because it has bail-out logic.
So, I think we don't need to care about the zonelist; kswapd can just turn off
the bits for its own node.
And, just curious: if we have a proper zlc clear point, why
do we need to keep the HZ timeout?
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 3/3] mm: page allocator: Reconsider zones for allocation after direct reclaim
2011-07-14 3:20 ` KOSAKI Motohiro
@ 2011-07-14 6:10 ` Mel Gorman
-1 siblings, 0 replies; 42+ messages in thread
From: Mel Gorman @ 2011-07-14 6:10 UTC (permalink / raw)
To: KOSAKI Motohiro; +Cc: linux-mm, linux-kernel
On Thu, Jul 14, 2011 at 12:20:38PM +0900, KOSAKI Motohiro wrote:
> (2011/07/13 20:10), Mel Gorman wrote:
> > On Wed, Jul 13, 2011 at 09:42:39AM +0900, KOSAKI Motohiro wrote:
> >> (2011/07/11 22:01), Mel Gorman wrote:
> >>> With zone_reclaim_mode enabled, it's possible for zones to be considered
> >>> full in the zonelist_cache so they are skipped in the future. If the
> >>> process enters direct reclaim, the ZLC may still consider zones to be
> >>> full even after reclaiming pages. Reconsider all zones for allocation
> >>> if direct reclaim returns successfully.
> >>>
> >>> Signed-off-by: Mel Gorman <mgorman@suse.de>
> >>
> >> Hmmm...
> >>
> >> I like the concept, but I'm a bit worried about a corner case.
> >>
> >> If users are using cpusets/mempolicy, direct reclaim doesn't scan all zones.
> >> Then, zlc_clear_zones_full() seems like too aggressive an operation.
> >
> > As the system is likely to be running slowly if it is in direct reclaim,
> > I felt the complexity of being careful about which zone was cleared was
> > not worth it.
> >
> >> Instead, couldn't we turn zlc->fullzones off from kswapd?
> >>
> >
> > Which zonelist should it clear (there are two) and when should it
> > happen? If it clears it on each cycle around balance_pgdat(), there
> > is no guarantee that it'll be cleared between when direct reclaim
> > finishes and an attempt is made to allocate.
>
> Hmm..
>
> Probably I'm now missing the point of this patch. Why do we need
> to guarantee tightly coupled zlc cache and direct reclaim?
Because direct reclaim may free enough memory such that the zlc cache
stating the zone is full is wrong.
> IIUC, the
> zlc cache means "avoid touching the free lists if they have no free mem".
> So, any point where free pages increase is an acceptable place to clear it, I thought.
> On the other hand, direct reclaim finishing is no guarantee that the
> zones of the zonelist have enough free memory, because it has bail-out logic.
>
It has no guarantee, but there is a reasonable expectation that direct
reclaim will free some memory, which means we should reconsider the
zone for allocation.
> So, I think we don't need to care about the zonelist; kswapd can just
> turn off its own node.
>
I don't understand what you mean by this.
> And, just curious: if we have a proper zlc clear point, why do we
> need to keep the HZ timeout?
>
Yes, because we are not guaranteed to call direct reclaim either. Memory
could be freed by a process exiting, and I'd rather not add cost to
the free path to find and clear all zonelists referencing the zone the
freed page belongs to.
--
Mel Gorman
SUSE Labs
* Re: [PATCH 2/3] mm: page allocator: Initialise ZLC for first zone eligible for zone_reclaim
2011-07-14 1:20 ` KOSAKI Motohiro
@ 2011-07-14 6:11 ` Mel Gorman
-1 siblings, 0 replies; 42+ messages in thread
From: Mel Gorman @ 2011-07-14 6:11 UTC (permalink / raw)
To: KOSAKI Motohiro; +Cc: linux-mm, linux-kernel
On Thu, Jul 14, 2011 at 10:20:12AM +0900, KOSAKI Motohiro wrote:
> (2011/07/13 20:02), Mel Gorman wrote:
> > On Wed, Jul 13, 2011 at 10:15:15AM +0900, KOSAKI Motohiro wrote:
> >> (2011/07/11 22:01), Mel Gorman wrote:
> >>> The zonelist cache (ZLC) is used among other things to record if
> >>> zone_reclaim() failed for a particular zone recently. The intention
> >>> is to avoid a high cost scanning extremely long zonelists or scanning
> >>> within the zone uselessly.
> >>>
> >>> Currently the zonelist cache is setup only after the first zone has
> >>> been considered and zone_reclaim() has been called. The objective was
> >>> to avoid a costly setup but zone_reclaim is itself quite expensive. If
> >>> it is failing regularly such as the first eligible zone having mostly
> >>> mapped pages, the cost in scanning and allocation stalls is far higher
> >>> than the ZLC initialisation step.
> >>>
> >>> This patch initialises ZLC before the first eligible zone calls
> >>> zone_reclaim(). Once initialised, it is checked whether the zone
> >>> failed zone_reclaim recently. If it has, the zone is skipped. As the
> >>> first zone is now being checked, additional care has to be taken about
> >>> zones marked full. A zone can be marked "full" merely because it did
> >>> not have enough unmapped pages for zone_reclaim, but this is excessive
> >>> as direct reclaim or kswapd may succeed where zone_reclaim fails. Only
> >>> mark zones "full" after zone_reclaim fails if it failed to reclaim
> >>> enough pages after scanning.
> >>>
> >>> Signed-off-by: Mel Gorman <mgorman@suse.de>
> >>
> >> If I understand correctly, this patch's pros/cons are:
> >>
> >> pros.
> >> 1) faster when zone reclaim doesn't work effectively
> >>
> >
> > Yes.
> >
> >> cons.
> >> 2) slower when zone reclaim is off
> >
> > How is it slower with zone_reclaim off?
> >
> > Before
> >
> >     if (zone_reclaim_mode == 0)
> >         goto this_zone_full;
> >     ...
> > this_zone_full:
> >     if (NUMA_BUILD)
> >         zlc_mark_zone_full(zonelist, z);
> >     if (NUMA_BUILD && !did_zlc_setup && nr_online_nodes > 1) {
> >         ...
> >     }
> >
> > After
> >
> >     if (NUMA_BUILD && !did_zlc_setup && nr_online_nodes > 1) {
> >         ...
> >     }
> >     if (zone_reclaim_mode == 0)
> >         goto this_zone_full;
> > this_zone_full:
> >     if (NUMA_BUILD)
> >         zlc_mark_zone_full(zonelist, z);
> >
> > Bear in mind that if the watermarks are met on the first zone, the zlc
> > setup does not occur.
>
> Right you are. Thank you for correcting me.
>
>
> >> 3) slower when zone reclaim works effectively
> >>
> >
> > Marginally slower. It's now calling zlc_setup() so, once a second, it's
> > zeroing a bitmap and calling zlc_zone_worth_trying() on the first
> > zone, which tests a bit on a cache-hot structure.
> >
> > As the ineffective case can be triggered by a simple cp, I think the
> > cost is justified. Can you think of a better way of doing this?
>
> So, I have now revisited your numbers in [0/3], and I've concluded that
> your patch improves the simple cp case too. Please forget my last mail;
> this patch looks nicer.
>
> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
>
Thanks.
--
Mel Gorman
SUSE Labs
* Re: [PATCH 3/3] mm: page allocator: Reconsider zones for allocation after direct reclaim
2011-07-14 6:10 ` Mel Gorman
@ 2011-07-21 9:35 ` KOSAKI Motohiro
-1 siblings, 0 replies; 42+ messages in thread
From: KOSAKI Motohiro @ 2011-07-21 9:35 UTC (permalink / raw)
To: mgorman; +Cc: linux-mm, linux-kernel
Hi
>> So, I think we don't need to care about the zonelist; kswapd can just
>> turn off its own node.
>
> I don't understand what you mean by this.
This was the answer to the following comments of yours.
> Instead, couldn't we turn zlc->fullzones off from kswapd?
> >
> > Which zonelist should it clear (there are two)
I mean, the buddy lists belong to a zone, not to a zonelist. Therefore,
kswapd doesn't need to look up a zonelist.
So, I'd suggest either of the following:
- use the direct reclaim path, but only clear the ZLC bits of zones in the reclaimed zonelist, not all of them, or
- use kswapd and only clear the ZLC bits when kswapd exits balance_pgdat()
I prefer adding a branch to the slow path (i.e. the reclaim path) rather than to the fast path.
>> And, just curious: if we have a proper zlc clear point, why do we
>> need to keep the HZ timeout?
>
> Yes because we are not guaranteed to call direct reclaim either. Memory
> could be freed by a process exiting and I'd rather not add cost to
> the free path to find and clear all zonelists referencing the zone the
> page being freed belongs to.
OK, it's a good trade-off. I agree we need to keep the HZ timeout.
* Re: [PATCH 3/3] mm: page allocator: Reconsider zones for allocation after direct reclaim
2011-07-21 9:35 ` KOSAKI Motohiro
@ 2011-07-21 10:31 ` Mel Gorman
-1 siblings, 0 replies; 42+ messages in thread
From: Mel Gorman @ 2011-07-21 10:31 UTC (permalink / raw)
To: KOSAKI Motohiro; +Cc: linux-mm, linux-kernel
On Thu, Jul 21, 2011 at 06:35:40PM +0900, KOSAKI Motohiro wrote:
> Hi
>
> >> So, I think we don't need to care about the zonelist; kswapd can just
> >> turn off its own node.
> >
> > I don't understand what you mean by this.
>
> This was the answer of following your comments.
>
> > Instead, couldn't we turn zlc->fullzones off from kswapd?
> > >
> > > Which zonelist should it clear (there are two)
>
> I mean, the buddy lists belong to a zone, not to a zonelist. Therefore,
> kswapd doesn't need to look up a zonelist.
>
> So, I'd suggest either of the following:
> - use the direct reclaim path, but only clear the ZLC bits of zones in the reclaimed zonelist, not all of them, or
We certainly could narrow the number of zones the bits are
cleared on by exporting knowledge of the ZLC to vmscan for use in
shrink_zones(). I think in practice the end result will be the same
though, as shrink_zones() examines all zones in the zonelist. How much
of a gain do you expect the additional complexity to give us?
> - use kswapd and only clear the ZLC bits when kswapd exits balance_pgdat()
>
That is potentially a long time if there are streaming readers keeping a
zone under the high watermark for a long time.
> I prefer adding a branch to the slow path (i.e. the reclaim path) rather than to the fast path.
>
The clearing of the zonelist is already happening after direct reclaim
which is the slow path. What fast path are you concerned with here?
--
Mel Gorman
SUSE Labs
end of thread, other threads:[~2011-07-21 10:31 UTC | newest]
Thread overview: 42+ messages
-- links below jump to the message on this page --
2011-07-11 13:01 [RFC PATCH 0/3] Reduce frequency of stalls due to zone_reclaim() on NUMA Mel Gorman
2011-07-11 13:01 ` [PATCH 1/3] mm: vmscan: Do use use PF_SWAPWRITE from zone_reclaim Mel Gorman
2011-07-12 9:27 ` Minchan Kim
2011-07-12 9:40 ` KOSAKI Motohiro
2011-07-12 9:55 ` Minchan Kim
2011-07-12 15:43 ` Christoph Lameter
2011-07-13 10:40 ` Mel Gorman
2011-07-12 10:14 ` Mel Gorman
2011-07-13 0:34 ` KOSAKI Motohiro
2011-07-11 13:01 ` [PATCH 2/3] mm: page allocator: Initialise ZLC for first zone eligible for zone_reclaim Mel Gorman
2011-07-13 1:15 ` KOSAKI Motohiro
2011-07-13 11:02 ` Mel Gorman
2011-07-14 1:20 ` KOSAKI Motohiro
2011-07-14 6:11 ` Mel Gorman
2011-07-11 13:01 ` [PATCH 3/3] mm: page allocator: Reconsider zones for allocation after direct reclaim Mel Gorman
2011-07-13 0:42 ` KOSAKI Motohiro
2011-07-13 11:10 ` Mel Gorman
2011-07-14 3:20 ` KOSAKI Motohiro
2011-07-14 6:10 ` Mel Gorman
2011-07-21 9:35 ` KOSAKI Motohiro
2011-07-21 10:31 ` Mel Gorman