* [PATCH 0/2] Reduce frequency of stalls due to zone_reclaim() on NUMA v2r1
@ 2011-07-15 15:08 ` Mel Gorman
  0 siblings, 0 replies; 34+ messages in thread
From: Mel Gorman @ 2011-07-15 15:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Minchan Kim, KOSAKI Motohiro, Christoph Lameter, Mel Gorman,
	linux-mm, linux-kernel

Sorry for the resend. I screwed up the patch numbers in the first
sending.

Changelog since v1
  o Dropped PF_SWAPWRITE change as discussions related to it stalled and
    it's not important for fixing the underlying problem.

There have been a small number of complaints about significant stalls
while copying large amounts of data on NUMA machines, reported on
a distribution bugzilla. In these cases, zone_reclaim was enabled
by default due to large NUMA distances. In general, the complaints
have not been about the workload itself unless it was a file server
(in which case the recommendation was to disable zone_reclaim).

The stalls are mostly due to significant amounts of time spent
scanning the preferred zone for pages to free. After a failure, the
allocator might fall back to another node (as zonelists are often
node-ordered rather than zone-ordered) but stalls again quickly on
the next allocation attempt. In bad cases, each page allocated results
in a full scan of the preferred zone.

Patch 1 checks the preferred zone for a recent allocation failure,
	which is particularly important if zone_reclaim has failed
	recently. This avoids rescanning the zone in the near future
	and instead falls back to another node. This may hurt node
	locality in some cases but a zone_reclaim failure is more
	expensive than a remote access.

Patch 2 clears the zlc information after direct reclaim. Otherwise,
	zone_reclaim can mark zones full and, even after direct reclaim
	frees enough pages, those zones are still not considered
	for allocation.

This was tested on a 24-thread 2-node x86_64 machine. The tests were
focused on large amounts of IO. All tests were bound to the CPUs
on node-0 to avoid disturbances due to processes being scheduled on
different nodes. The kernels tested are

3.0-rc6-vanilla		Vanilla 3.0-rc6
zlcfirst		Patch 1 applied
zlcreconsider		Patches 1+2 applied

FS-Mark
./fs_mark  -d  /tmp/fsmark-10813  -D  100  -N  5000  -n  208  -L  35  -t  24  -S0  -s  524288
                fsmark-3.0-rc6       3.0-rc6       		3.0-rc6
                   vanilla			 zlcfirs 	zlcreconsider
Files/s  min          54.90 ( 0.00%)       49.80 (-10.24%)       49.10 (-11.81%)
Files/s  mean        100.11 ( 0.00%)      135.17 (25.94%)      146.93 (31.87%)
Files/s  stddev       57.51 ( 0.00%)      138.97 (58.62%)      158.69 (63.76%)
Files/s  max         361.10 ( 0.00%)      834.40 (56.72%)      802.40 (55.00%)
Overhead min       76704.00 ( 0.00%)    76501.00 ( 0.27%)    77784.00 (-1.39%)
Overhead mean    1485356.51 ( 0.00%)  1035797.83 (43.40%)  1594680.26 (-6.86%)
Overhead stddev  1848122.53 ( 0.00%)   881489.88 (109.66%)  1772354.90 ( 4.27%)
Overhead max     7989060.00 ( 0.00%)  3369118.00 (137.13%) 10135324.00 (-21.18%)
MMTests Statistics: duration
User/Sys Time Running Test (seconds)        501.49    493.91    499.93
Total Elapsed Time (seconds)               2451.57   2257.48   2215.92

MMTests Statistics: vmstat
Page Ins                                       46268       63840       66008
Page Outs                                   90821596    90671128    88043732
Swap Ins                                           0           0           0
Swap Outs                                          0           0           0
Direct pages scanned                        13091697     8966863     8971790
Kswapd pages scanned                               0     1830011     1831116
Kswapd pages reclaimed                             0     1829068     1829930
Direct pages reclaimed                      13037777     8956828     8648314
Kswapd efficiency                               100%         99%         99%
Kswapd velocity                                0.000     810.643     826.346
Direct efficiency                                99%         99%         96%
Direct velocity                             5340.128    3972.068    4048.788
Percentage direct scans                         100%         83%         83%
Page writes by reclaim                             0           3           0
Slabs scanned                                 796672      720640      720256
Direct inode steals                          7422667     7160012     7088638
Kswapd inode steals                                0     1736840     2021238

The test completes far faster, with a large increase in the number of
files created per second. The standard deviation is high because a
small number of iterations were much higher than the mean. The number
of pages scanned by zone_reclaim is reduced and kswapd is used for
more work.

LARGE DD
               		3.0-rc6       3.0-rc6       3.0-rc6
                   	vanilla     zlcfirst     zlcreconsider
download tar           59 ( 0.00%)   59 ( 0.00%)   55 ( 7.27%)
dd source files       527 ( 0.00%)  296 (78.04%)  320 (64.69%)
delete source          36 ( 0.00%)   19 (89.47%)   20 (80.00%)
MMTests Statistics: duration
User/Sys Time Running Test (seconds)        125.03    118.98    122.01
Total Elapsed Time (seconds)                624.56    375.02    398.06

MMTests Statistics: vmstat
Page Ins                                     3594216      439368      407032
Page Outs                                   23380832    23380488    23377444
Swap Ins                                           0           0           0
Swap Outs                                          0         436         287
Direct pages scanned                        17482342    69315973    82864918
Kswapd pages scanned                               0      519123      575425
Kswapd pages reclaimed                             0      466501      522487
Direct pages reclaimed                       5858054     2732949     2712547
Kswapd efficiency                               100%         89%         90%
Kswapd velocity                                0.000    1384.254    1445.574
Direct efficiency                                33%          3%          3%
Direct velocity                            27991.453  184832.737  208171.929
Percentage direct scans                         100%         99%         99%
Page writes by reclaim                             0        5082       13917
Slabs scanned                                  17280       29952       35328
Direct inode steals                           115257     1431122      332201
Kswapd inode steals                                0           0      979532

This test downloads a large tarfile and copies it with dd a number of
times, similar to the most recent bug report I dealt with. Time to
completion is reduced. The number of pages scanned directly is still
disturbingly high, with low efficiency, but this is likely due to
the number of dirty pages encountered. The figures could probably be
improved with more work on how kswapd is used and how dirty pages
are handled, but that is separate work and this result is significant
on its own.

Streaming Mapped Writer
MMTests Statistics: duration
User/Sys Time Running Test (seconds)        124.47    111.67    112.64
Total Elapsed Time (seconds)               2138.14   1816.30   1867.56

MMTests Statistics: vmstat
Page Ins                                       90760       89124       89516
Page Outs                                  121028340   120199524   120736696
Swap Ins                                           0          86          55
Swap Outs                                          0           0           0
Direct pages scanned                       114989363    96461439    96330619
Kswapd pages scanned                        56430948    56965763    57075875
Kswapd pages reclaimed                      27743219    27752044    27766606
Direct pages reclaimed                         49777       46884       36655
Kswapd efficiency                                49%         48%         48%
Kswapd velocity                            26392.541   31363.631   30561.736
Direct efficiency                                 0%          0%          0%
Direct velocity                            53780.091   53108.759   51581.004
Percentage direct scans                          67%         62%         62%
Page writes by reclaim                           385         122        1513
Slabs scanned                                  43008       39040       42112
Direct inode steals                                0          10           8
Kswapd inode steals                              733         534         477

This test just creates a large file mapping and writes to it
linearly. Time to completion is again reduced.

The gains are mostly down to two things. First, in many cases there
is less scanning, as zone_reclaim simply gives up faster due to
recent failures. Second, memory is used more efficiently: instead
of scanning the preferred zone every time, the allocator falls back
to another zone and uses it instead, improving overall memory
utilisation.

 mm/page_alloc.c |   54 +++++++++++++++++++++++++++++++++++++++++-------------
 1 files changed, 41 insertions(+), 13 deletions(-)

-- 
1.7.3.4




* [PATCH 1/2] mm: page allocator: Initialise ZLC for first zone eligible for zone_reclaim
  2011-07-15 15:08 ` Mel Gorman
@ 2011-07-15 15:08   ` Mel Gorman
  -1 siblings, 0 replies; 34+ messages in thread
From: Mel Gorman @ 2011-07-15 15:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Minchan Kim, KOSAKI Motohiro, Christoph Lameter, Mel Gorman,
	linux-mm, linux-kernel

The zonelist cache (ZLC) is used, among other things, to record whether
zone_reclaim() failed for a particular zone recently. The intention
is to avoid the high cost of scanning extremely long zonelists, or of
uselessly scanning within a zone.

Currently the zonelist cache is set up only after the first zone has
been considered and zone_reclaim() has been called. The objective was
to avoid a costly setup, but zone_reclaim is itself quite expensive. If
it is failing regularly, such as when the first eligible zone has mostly
mapped pages, the cost in scanning and allocation stalls is far higher
than the ZLC initialisation step.

This patch initialises the ZLC before the first eligible zone calls
zone_reclaim(). Once initialised, it is checked whether the zone
failed zone_reclaim recently; if it has, the zone is skipped. As the
first zone is now being checked, additional care has to be taken with
zones marked full. A zone can be marked "full" merely because it is
not expected to have enough unmapped pages for zone_reclaim, but this
is excessive as direct reclaim or kswapd may succeed where zone_reclaim
fails. Zones are now marked "full" only when zone_reclaim has actually
scanned and failed to reclaim enough pages.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/page_alloc.c |   35 ++++++++++++++++++++++-------------
 1 files changed, 22 insertions(+), 13 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4e8985a..6913854 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1664,7 +1664,7 @@ zonelist_scan:
 				continue;
 		if ((alloc_flags & ALLOC_CPUSET) &&
 			!cpuset_zone_allowed_softwall(zone, gfp_mask))
-				goto try_next_zone;
+				continue;
 
 		BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
 		if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
@@ -1676,17 +1676,36 @@ zonelist_scan:
 				    classzone_idx, alloc_flags))
 				goto try_this_zone;
 
+			if (NUMA_BUILD && !did_zlc_setup && nr_online_nodes > 1) {
+				/*
+				 * we do zlc_setup if there are multiple nodes
+				 * and before considering the first zone allowed
+				 * by the cpuset.
+				 */
+				allowednodes = zlc_setup(zonelist, alloc_flags);
+				zlc_active = 1;
+				did_zlc_setup = 1;
+			}
+
 			if (zone_reclaim_mode == 0)
 				goto this_zone_full;
 
+			/*
+			 * As we may have just activated ZLC, check if the first
+			 * eligible zone has failed zone_reclaim recently.
+			 */
+			if (NUMA_BUILD && zlc_active &&
+				!zlc_zone_worth_trying(zonelist, z, allowednodes))
+				continue;
+
 			ret = zone_reclaim(zone, gfp_mask, order);
 			switch (ret) {
 			case ZONE_RECLAIM_NOSCAN:
 				/* did not scan */
-				goto try_next_zone;
+				continue;
 			case ZONE_RECLAIM_FULL:
 				/* scanned but unreclaimable */
-				goto this_zone_full;
+				continue;
 			default:
 				/* did we reclaim enough */
 				if (!zone_watermark_ok(zone, order, mark,
@@ -1703,16 +1722,6 @@ try_this_zone:
 this_zone_full:
 		if (NUMA_BUILD)
 			zlc_mark_zone_full(zonelist, z);
-try_next_zone:
-		if (NUMA_BUILD && !did_zlc_setup && nr_online_nodes > 1) {
-			/*
-			 * we do zlc_setup after the first zone is tried but only
-			 * if there are multiple nodes make it worthwhile
-			 */
-			allowednodes = zlc_setup(zonelist, alloc_flags);
-			zlc_active = 1;
-			did_zlc_setup = 1;
-		}
 	}
 
 	if (unlikely(NUMA_BUILD && page == NULL && zlc_active)) {
-- 
1.7.3.4




* [PATCH 2/2] mm: page allocator: Reconsider zones for allocation after direct reclaim
  2011-07-15 15:08 ` Mel Gorman
@ 2011-07-15 15:09   ` Mel Gorman
  -1 siblings, 0 replies; 34+ messages in thread
From: Mel Gorman @ 2011-07-15 15:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Minchan Kim, KOSAKI Motohiro, Christoph Lameter, Mel Gorman,
	linux-mm, linux-kernel

With zone_reclaim_mode enabled, it's possible for zones to be considered
full in the zonelist_cache so they are skipped in the future. If the
process enters direct reclaim, the ZLC may still consider zones to be
full even after reclaiming pages. Reconsider all zones for allocation
if direct reclaim returns successfully.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/page_alloc.c |   19 +++++++++++++++++++
 1 files changed, 19 insertions(+), 0 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6913854..149409c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1616,6 +1616,21 @@ static void zlc_mark_zone_full(struct zonelist *zonelist, struct zoneref *z)
 	set_bit(i, zlc->fullzones);
 }
 
+/*
+ * clear all zones full, called after direct reclaim makes progress so that
+ * a zone that was recently full is not skipped over for up to a second
+ */
+static void zlc_clear_zones_full(struct zonelist *zonelist)
+{
+	struct zonelist_cache *zlc;	/* cached zonelist speedup info */
+
+	zlc = zonelist->zlcache_ptr;
+	if (!zlc)
+		return;
+
+	bitmap_zero(zlc->fullzones, MAX_ZONES_PER_ZONELIST);
+}
+
 #else	/* CONFIG_NUMA */
 
 static nodemask_t *zlc_setup(struct zonelist *zonelist, int alloc_flags)
@@ -1963,6 +1978,10 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 	if (unlikely(!(*did_some_progress)))
 		return NULL;
 
+	/* After successful reclaim, reconsider all zones for allocation */
+	if (NUMA_BUILD)
+		zlc_clear_zones_full(zonelist);
+
 retry:
 	page = get_page_from_freelist(gfp_mask, nodemask, order,
 					zonelist, high_zoneidx,
-- 
1.7.3.4




* Re: [PATCH 1/2] mm: page allocator: Initialise ZLC for first zone eligible for zone_reclaim
  2011-07-15 15:08   ` Mel Gorman
@ 2011-07-18 14:56     ` Christoph Lameter
  -1 siblings, 0 replies; 34+ messages in thread
From: Christoph Lameter @ 2011-07-18 14:56 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Minchan Kim, KOSAKI Motohiro, linux-mm, linux-kernel

On Fri, 15 Jul 2011, Mel Gorman wrote:

> Currently the zonelist cache is setup only after the first zone has
> been considered and zone_reclaim() has been called. The objective was
> to avoid a costly setup but zone_reclaim is itself quite expensive. If
> it is failing regularly such as the first eligible zone having mostly
> mapped pages, the cost in scanning and allocation stalls is far higher
> than the ZLC initialisation step.

Would it not be easier to set zlc_active and allowednodes based on the
zone having an active ZLC at the start of get_pages()?

Buffered_rmqueue is handling the situation of a zone with a ZLC in a
weird way right now since it ignores the (potentially existing) ZLC
for the first pass. zlc_setup() does a lot of things. Is that because
there is a performance benefit?



^ permalink raw reply	[flat|nested] 34+ messages in thread


* Re: [PATCH 1/2] mm: page allocator: Initialise ZLC for first zone eligible for zone_reclaim
  2011-07-18 14:56     ` Christoph Lameter
@ 2011-07-18 16:05       ` Mel Gorman
  -1 siblings, 0 replies; 34+ messages in thread
From: Mel Gorman @ 2011-07-18 16:05 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Minchan Kim, KOSAKI Motohiro, linux-mm, linux-kernel

On Mon, Jul 18, 2011 at 09:56:31AM -0500, Christoph Lameter wrote:
> On Fri, 15 Jul 2011, Mel Gorman wrote:
> 
> > Currently the zonelist cache is setup only after the first zone has
> > been considered and zone_reclaim() has been called. The objective was
> > to avoid a costly setup but zone_reclaim is itself quite expensive. If
> > it is failing regularly such as the first eligible zone having mostly
> > mapped pages, the cost in scanning and allocation stalls is far higher
> > than the ZLC initialisation step.
> 
> Would it not be easier to set zlc_active and allowednodes based on the
> zone having an active ZLC at the start of get_pages()?
> 

What do you mean by a zone's active ZLC? Zonelists are on a per-node,
not a per-zone basis (see node_zonelist) so a zone doesn't have an
active ZLC as such. If zlc_active is set at the beginning of
get_page_from_freelist(), it implies that we are calling zlc_setup()
even when the watermarks are met, which is unnecessary.

> Buffered_rmqueue is handling the situation of a zone with an ZLC in a
> weird way right now since it ignores the (potentially existing) ZLC
> for the first pass.

Where does buffered_rmqueue() refer to a zonelist_cache?

> zlc_setup() does a lot of things. So that is because
> there is a performance benefit?
> 

I do not understand this question. Are you asking if zonelist_cache
has a performance benefit? The answer is "yes" because you can see
how performance with zone_reclaim degrades when the cache is not used
for the first zone.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread
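[The lazy-initialisation pattern Mel describes — only pay for zlc_setup() the first time a watermark check fails — can be sketched roughly as below. The zone struct, counter, and helper names are invented for illustration and do not match the kernel's types.]

```c
#include <assert.h>
#include <stddef.h>

/* Sketch: the ZLC is only set up on the first watermark failure, so
 * the common case (first zone has free pages) never calls zlc_setup().
 * All names and the zone model are made up. */
struct zone { int free_pages; int watermark; int zlc_full; };

static int zlc_setup_calls;	/* counts how often setup ran */

static void zlc_setup(void) { zlc_setup_calls++; }

static struct zone *get_page_from_freelist(struct zone *zones, int nr)
{
	int zlc_active = 0;
	int i;

	for (i = 0; i < nr; i++) {
		if (zlc_active && zones[i].zlc_full)
			continue;		/* cached "full" verdict: skip */
		if (zones[i].free_pages >= zones[i].watermark)
			return &zones[i];	/* fast path: no ZLC cost */
		if (!zlc_active) {		/* first failure: init cache */
			zlc_setup();
			zlc_active = 1;
		}
		zones[i].zlc_full = 1;		/* remember this zone failed */
	}
	return NULL;
}
```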

* Re: [PATCH 1/2] mm: page allocator: Initialise ZLC for first zone eligible for zone_reclaim
  2011-07-18 16:05       ` Mel Gorman
@ 2011-07-18 17:20         ` Christoph Lameter
  -1 siblings, 0 replies; 34+ messages in thread
From: Christoph Lameter @ 2011-07-18 17:20 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Minchan Kim, KOSAKI Motohiro, linux-mm, linux-kernel



On Mon, 18 Jul 2011, Mel Gorman wrote:

> On Mon, Jul 18, 2011 at 09:56:31AM -0500, Christoph Lameter wrote:
> > On Fri, 15 Jul 2011, Mel Gorman wrote:
> >
> > > Currently the zonelist cache is setup only after the first zone has
> > > been considered and zone_reclaim() has been called. The objective was
> > > to avoid a costly setup but zone_reclaim is itself quite expensive. If
> > > it is failing regularly such as the first eligible zone having mostly
> > > mapped pages, the cost in scanning and allocation stalls is far higher
> > > than the ZLC initialisation step.
> >
> > Would it not be easier to set zlc_active and allowednodes based on the
> > zone having an active ZLC at the start of get_pages()?
> >
>
> What do you mean by a zones active ZLC? zonelists are on a per-node,
> not a per-zone basis (see node_zonelist) so a zone doesn't have an
> active ZLC as such. If the zlc_active is set at the beginning of

Look at get_page_from_freelist(): It sets
zlc_active = 0 even though the zonelist under consideration may have a
ZLC. zlc_active = 0 can also mean that the function has not bothered to
look for the zlc information of the current zonelist.

> get_page_from_freelist(), it implies that we are calling zlc_setup()
> even when the watermarks are met which is unnecessary.

Ok then that decision to not call zlc_setup() for performance reasons is
what created the problem that you are trying to solve. In case the
first zone's watermarks are okay we can avoid calling zlc_setup().

What we do now have is checking for zlc_active in the loop just so that
the first time around we do not call zlc_setup().


We may be able to simplify the function by:

1.  Checking for the special case that the first zone is ok and that we do
not want to call zlc_setup before we get to the loop.

2. Do the zlc_setup() before the loop.

3. Remove the zlc_setup() code as you did from the loop, as well as the
checks for zlc_active. zlc_active becomes unnecessary since a zlc
is always available when we go through the loop.


^ permalink raw reply	[flat|nested] 34+ messages in thread
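[Christoph's three-step restructure can be sketched in the same toy terms: special-case the first zone, run zlc_setup() once before the loop, then loop with no zlc_active bookkeeping. Again, the zone model and names are invented, and note the first zone is re-checked inside the loop — the duplication Mel objects to below.]

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the proposed simplification.  Not kernel code. */
struct zone { int free_pages; int watermark; int zlc_full; };

static int zlc_setup_calls;

static void zlc_setup(void) { zlc_setup_calls++; }

static struct zone *get_page_simplified(struct zone *zones, int nr)
{
	int i;

	/* Step 1: common case, first zone is fine, no ZLC work at all. */
	if (nr > 0 && zones[0].free_pages >= zones[0].watermark)
		return &zones[0];

	/* Step 2: set up the ZLC once, before scanning the rest. */
	zlc_setup();

	/* Step 3: plain loop, no zlc_active tests.  The first zone is
	 * checked a second time here, which is the duplicated work the
	 * thread discusses. */
	for (i = 0; i < nr; i++) {
		if (zones[i].zlc_full)
			continue;
		if (zones[i].free_pages >= zones[i].watermark)
			return &zones[i];
		zones[i].zlc_full = 1;
	}
	return NULL;
}
```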

* Re: [PATCH 1/2] mm: page allocator: Initialise ZLC for first zone eligible for zone_reclaim
  2011-07-18 17:20         ` Christoph Lameter
@ 2011-07-18 21:13           ` Mel Gorman
  -1 siblings, 0 replies; 34+ messages in thread
From: Mel Gorman @ 2011-07-18 21:13 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Minchan Kim, KOSAKI Motohiro, linux-mm, linux-kernel

On Mon, Jul 18, 2011 at 12:20:11PM -0500, Christoph Lameter wrote:
> 
> 
> On Mon, 18 Jul 2011, Mel Gorman wrote:
> 
> > On Mon, Jul 18, 2011 at 09:56:31AM -0500, Christoph Lameter wrote:
> > > On Fri, 15 Jul 2011, Mel Gorman wrote:
> > >
> > > > Currently the zonelist cache is setup only after the first zone has
> > > > been considered and zone_reclaim() has been called. The objective was
> > > > to avoid a costly setup but zone_reclaim is itself quite expensive. If
> > > > it is failing regularly such as the first eligible zone having mostly
> > > > mapped pages, the cost in scanning and allocation stalls is far higher
> > > > than the ZLC initialisation step.
> > >
> > > Would it not be easier to set zlc_active and allowednodes based on the
> > > zone having an active ZLC at the start of get_pages()?
> > >
> >
> > What do you mean by a zones active ZLC? zonelists are on a per-node,
> > not a per-zone basis (see node_zonelist) so a zone doesn't have an
> > active ZLC as such. If the zlc_active is set at the beginning of
> 
> Look at get_page_from_freelist(): It sets
> zlc_active = 0 even through the zonelist under consideration may have a
> ZLC. zlc_active = 0 can also mean that the function has not bothered to
> look for the zlc information of the current zonelist.
> 

Yes. So? It's only necessary if the watermarks are not met.

> > get_page_from_freelist(), it implies that we are calling zlc_setup()
> > even when the watermarks are met which is unnecessary.
> 
> Ok then that decision to not call zlc_setup() for performance reasons is
> what created the problem that you are trying to solve. In case that the
> first zones watermarks are okay we can avoid calling zlc_setup().
> 

The original implementation did not check the ZLC in the first loop
at all. It wasn't just about avoiding the cost of setup. I suspect
this problem has been there a long time and it's taking this long
for bug reports to show up because NUMA machines are being used for
generic NUMA-unaware workloads.

> What we do now have is checking for zlc_active in the loop just so that
> the first time around we do not call zlc_setup().
> 

Yes, why incur the cost for the common case?

> We may be able to simplify the function by:
> 
> 1.  Checking for the special case that the first zone is ok and that we do
> not want to call zlc_setup before we get to the loop.
> 
> 2. Do the zlc_setup() before the loop.
> 
> 3. Remove the zlc_setup() code as you did from the loop as well as the
> checks for zlc_active. zlc_active becomes not necessary since a zlc
> is always available when we go through the loop.
> 

That initial test will involve duplication of things like the cpuset and
no watermarks check just to place the zlc_setup() in a different place.
I might be missing your point but it seems like the gain would be
marginal. Fancy posting a patch?

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] mm: page allocator: Initialise ZLC for first zone eligible for zone_reclaim
  2011-07-18 21:13           ` Mel Gorman
@ 2011-07-18 21:54             ` Christoph Lameter
  -1 siblings, 0 replies; 34+ messages in thread
From: Christoph Lameter @ 2011-07-18 21:54 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Minchan Kim, KOSAKI Motohiro, linux-mm, linux-kernel

On Mon, 18 Jul 2011, Mel Gorman wrote:

> > We may be able to simplify the function by:
> >
> > 1.  Checking for the special case that the first zone is ok and that we do
> > not want to call zlc_setup before we get to the loop.
> >
> > 2. Do the zlc_setup() before the loop.
> >
> > 3. Remove the zlc_setup() code as you did from the loop as well as the
> > checks for zlc_active. zlc_active becomes not necessary since a zlc
> > is always available when we go through the loop.
> >
>
> That initial test will involve duplication of things like the cpuset and
> no watermarks check just to place the zlc_setup() in a different place.
> I might be missing your point but it seems like the gain would be
> marginal. Fancy posting a patch?

Looked at it for some time. Would have to create a new function for the
watermark checks, the call to buffered_rmqueue and the marking of a zone as
full. After that the goto mess could be unraveled. But I am out of time
for today.



^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH] mm: page allocator: Reconsider zones for allocation after direct reclaim fix
  2011-07-15 15:09   ` Mel Gorman
@ 2011-07-19 11:46     ` Mel Gorman
  -1 siblings, 0 replies; 34+ messages in thread
From: Mel Gorman @ 2011-07-19 11:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Minchan Kim, KOSAKI Motohiro, Christoph Lameter, linux-mm, linux-kernel


mm/page_alloc.c: In function ‘__alloc_pages_direct_reclaim’:
mm/page_alloc.c:1983:3: error: implicit declaration of function ‘zlc_clear_zones_full’

This patch is a build fix for !CONFIG_NUMA that should be merged with
mm-page-allocator-reconsider-zones-for-allocation-after-direct-reclaim.patch .

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/page_alloc.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 149409c..0f50cdb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1647,6 +1647,10 @@ static int zlc_zone_worth_trying(struct zonelist *zonelist, struct zoneref *z,
 static void zlc_mark_zone_full(struct zonelist *zonelist, struct zoneref *z)
 {
 }
+
+static void zlc_clear_zones_full(struct zonelist *zonelist)
+{
+}
 #endif	/* CONFIG_NUMA */
 
 /*

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] mm: page allocator: Initialise ZLC for first zone eligible for zone_reclaim
  2011-07-18 21:54             ` Christoph Lameter
@ 2011-07-19 14:01               ` Christoph Lameter
  -1 siblings, 0 replies; 34+ messages in thread
From: Christoph Lameter @ 2011-07-19 14:01 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Minchan Kim, KOSAKI Motohiro, linux-mm, linux-kernel

Well we can unwind that complexity later I guess.

Reviewed-by: Christoph Lameter <cl@linux.com>


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] mm: page allocator: Initialise ZLC for first zone eligible for zone_reclaim
  2011-07-19 14:01               ` Christoph Lameter
@ 2011-07-20 18:08                 ` Christoph Lameter
  -1 siblings, 0 replies; 34+ messages in thread
From: Christoph Lameter @ 2011-07-20 18:08 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Minchan Kim, KOSAKI Motohiro, linux-mm, linux-kernel

Hmmm... Looking at get_page_from_freelist and considering speeding that up
in general: Could we move the whole watermark logic into the slow path?
Only check when we refill the per cpu queues?


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] mm: page allocator: Initialise ZLC for first zone eligible for zone_reclaim
  2011-07-20 18:08                 ` Christoph Lameter
@ 2011-07-20 19:18                   ` Mel Gorman
  -1 siblings, 0 replies; 34+ messages in thread
From: Mel Gorman @ 2011-07-20 19:18 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Minchan Kim, KOSAKI Motohiro, linux-mm, linux-kernel

On Wed, Jul 20, 2011 at 01:08:46PM -0500, Christoph Lameter wrote:
> Hmmm... Looking at get_page_from_freelist and considering speeding that up
> in general: Could we move the whole watermark logic into the slow path?
> Only check when we refill the per cpu queues?

Each CPU list can hold 186 pages (on my currently running
kernel at least) which is 744K. As I'm running with THP enabled,
the min watermark is 25852K so with 34 or more CPUs, there is a
risk that a zone would be fully depleted due to lack of watermark
checking. It's a bit unlikely that 34 CPUs would be on one node but the risk
is there. Without THP, the min watermark would have been something like
32K where it would be much easier to accidentally consume all memory.

Yes, moving the watermark checks to the slow path would be faster
but under some conditions, the system will lock up.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread
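[Mel's figures can be sanity-checked with simple arithmetic, assuming 4KiB pages; the helper names below are made up for the sketch: 186 pages per per-cpu list is 744KiB, so against a 25852KiB min watermark roughly 35 CPUs' worth of unaccounted per-cpu pages would cover the whole reserve.]

```c
#include <assert.h>

/* Back-of-the-envelope check of the numbers quoted in the mail above. */
static int per_cpu_list_kib(int pages_per_list, int page_kib)
{
	return pages_per_list * page_kib;
}

/* Smallest CPU count whose per-cpu lists alone exceed the watermark
 * (ceiling division). */
static int cpus_to_deplete(int watermark_kib, int per_cpu_kib)
{
	return (watermark_kib + per_cpu_kib - 1) / per_cpu_kib;
}
```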

* Re: [PATCH 1/2] mm: page allocator: Initialise ZLC for first zone eligible for zone_reclaim
  2011-07-20 19:18                   ` Mel Gorman
@ 2011-07-20 19:28                     ` Christoph Lameter
  -1 siblings, 0 replies; 34+ messages in thread
From: Christoph Lameter @ 2011-07-20 19:28 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Minchan Kim, KOSAKI Motohiro, linux-mm, linux-kernel

On Wed, 20 Jul 2011, Mel Gorman wrote:

> On Wed, Jul 20, 2011 at 01:08:46PM -0500, Christoph Lameter wrote:
> > Hmmm... Looking at get_page_from_freelist and considering speeding that up
> > in general: Could we move the whole watermark logic into the slow path?
> > Only check when we refill the per cpu queues?
>
> Each CPU list can hold 186 pages (on my currently running
> kernel at least) which is 744K. As I'm running with THP enabled,
> the min watermark is 25852K so with 34 of more CPUs, there is a
> risk that a zone would be fully depleted due to lack of watermark
> checking. Bit unlikely that 34 CPUs would be on one node but the risk
> is there. Without THP, the min watermark would have been something like
> 32K where it would be much easier to accidentally consume all memory.
>
> Yes, moving the watermark checks to the slow path would be faster
> but under some conditions, the system will lock up.

Well the fastpath would simply grab a page if it's on the list. If the list
is empty then we would be checking the watermarks and extracting pages from
the buddylists. The pages in the per cpu lists would not be accounted for
during reclaim. Counters would reflect the buddy allocator pages available.
Reclaim flushes the per cpu pages so the buddy allocator pages would be
replenished.



^ permalink raw reply	[flat|nested] 34+ messages in thread
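[Christoph's proposal — fast path hands out pages from a per-cpu list with no watermark check, watermarks consulted only when the list is refilled from the buddy allocator — can be sketched as below. The struct, batch size, and function names are invented for illustration; real per-cpu page lists are considerably more involved.]

```c
#include <assert.h>

/* Toy model: watermark checks only on per-cpu list refill. */
#define PCP_BATCH 4

struct zone_model {
	int buddy_free;		/* pages still in the buddy allocator */
	int min_watermark;
	int pcp_count;		/* pages sitting in the per-cpu list */
};

static int refill_pcp(struct zone_model *z)
{
	/* The watermark check happens only here, on refill. */
	if (z->buddy_free - PCP_BATCH < z->min_watermark)
		return 0;
	z->buddy_free -= PCP_BATCH;
	z->pcp_count += PCP_BATCH;
	return 1;
}

static int alloc_page(struct zone_model *z)
{
	if (z->pcp_count == 0 && !refill_pcp(z))
		return 0;	/* refill refused: watermark would break */
	z->pcp_count--;		/* fast path: no watermark check at all */
	return 1;
}
```

The risk Mel raises maps directly onto this sketch: pages already sitting in `pcp_count` were carved out of `buddy_free` without any later accounting, so with enough CPUs those lists can hold more than the reserve.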

* Re: [PATCH 1/2] mm: page allocator: Initialise ZLC for first zone eligible for zone_reclaim
  2011-07-20 19:28                     ` Christoph Lameter
@ 2011-07-20 19:52                       ` Christoph Lameter
  -1 siblings, 0 replies; 34+ messages in thread
From: Christoph Lameter @ 2011-07-20 19:52 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Minchan Kim, KOSAKI Motohiro, linux-mm, linux-kernel

The existing way of deciding if watermarks have been met looks broken to
me.

There are two pools of pages: One is the pages available from the buddy
lists and another the pages in the per cpu lists.

zone_watermark_ok() only checks those in the buddy lists
(NR_FREE_PAGES is not updated when we get a page from the per cpu lists).

And we do check zone_watermark_ok() before even attempting to allocate
pages that may be available from the per cpu lists?

So the allocator may pass on a zone and/or go into reclaim despite the
availability of pages on per cpu lists. The more pages one puts into the
per cpu lists the higher the chance of an OOM. ... Ok that is not true
since we flush the per cpu pages and get them back into the buddy lists
before that happens.



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] mm: page allocator: Initialise ZLC for first zone eligible for zone_reclaim
  2011-07-20 19:52                       ` Christoph Lameter
@ 2011-07-20 21:17                         ` Christoph Lameter
  -1 siblings, 0 replies; 34+ messages in thread
From: Christoph Lameter @ 2011-07-20 21:17 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Minchan Kim, KOSAKI Motohiro, linux-mm, linux-kernel

Hmmm... Maybe we can bypass the checks?

Subject: [page allocator] Do not check watermarks if there is a page available on the per cpu freelists

One should be able to grab a page from the per cpu freelists if available.
The pages on the per cpu freelists are not accounted for in VM statistics
so getting a page from there has no impact on reclaim.

Check for this condition in get_page_from_freelist and short circuit
to the call to buffered_rmqueue if so.

Note that there is a race here. We may deplete the reserve pools by
one page if either the process is rescheduled on a different processor
or another process grabs the last page from the per cpu freelist.

Signed-off-by: Christoph Lameter <cl@linux.com>

---
 mm/page_alloc.c |   10 ++++++++++
 1 file changed, 10 insertions(+)

Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c	2011-07-20 15:27:20.544825852 -0500
+++ linux-2.6/mm/page_alloc.c	2011-07-20 15:30:05.314824797 -0500
@@ -1666,6 +1666,16 @@ zonelist_scan:
 			!cpuset_zone_allowed_softwall(zone, gfp_mask))
 				goto try_next_zone;

+		/*
+		 * Short circuit allocation if we have a usable object on
+		 * the percpu freelist. Note that this can only be an
+		 * optimization since there is no guarantee that we will
+		 * be executing on the same cpu. Another process could also
+		 * be scheduled and take the available page from us.
+		 */
+		if (order == 0 && this_cpu_read(zone->pageset->pcp.count))
+			goto try_this_zone;
+
 		BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
 		if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
 			unsigned long mark;

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] mm: page allocator: Initialise ZLC for first zone eligible for zone_reclaim
  2011-07-20 21:17                         ` Christoph Lameter
@ 2011-07-20 22:48                           ` Mel Gorman
  -1 siblings, 0 replies; 34+ messages in thread
From: Mel Gorman @ 2011-07-20 22:48 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Minchan Kim, KOSAKI Motohiro, linux-mm, linux-kernel

On Wed, Jul 20, 2011 at 04:17:41PM -0500, Christoph Lameter wrote:
> Hmmm... Maybe we can bypass the checks?
> 

Maybe we should not.

Watermarks should not just be ignored. They prevent the system
deadlocking due to an inability to allocate a page needed to free more
memory. This patch allows allocations that are not high priority
or atomic to succeed when the buddy lists are at the min watermark
and would normally be throttled. At a minimum, this patch increases
the risk of locking up due to memory exhaustion. For example,
a GFP_ATOMIC allocation can refill the per-cpu list with pages
that are then consumed by GFP_KERNEL allocations; the next GFP_ATOMIC
allocation refills the list again, it gets consumed, and so on. It's
even worse if PF_MEMALLOC allocations are refilling the lists, as they
ignore watermarks. If this is happening on enough CPUs, it will cause trouble.

At the very least, the performance benefit of such a change should
be illustrated. Even if it's faster (and I'd expect it to be,
watermark checks particularly at low memory are expensive), it may
just mean the system occasionally runs very fast into a wall. Hence,
the patch should be accompanied with tests showing that even under
very high stress for a long period of time that it does not lock up
and the changelog should include a *very* convincing description
on why PF_MEMALLOC refilling the per-cpu lists to be consumed by
low-priority users is not a problem.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] mm: page allocator: Initialise ZLC for first zone eligible for zone_reclaim
  2011-07-20 22:48                           ` Mel Gorman
@ 2011-07-21 15:24                             ` Christoph Lameter
  -1 siblings, 0 replies; 34+ messages in thread
From: Christoph Lameter @ 2011-07-21 15:24 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Minchan Kim, KOSAKI Motohiro, linux-mm, linux-kernel

On Wed, 20 Jul 2011, Mel Gorman wrote:

> On Wed, Jul 20, 2011 at 04:17:41PM -0500, Christoph Lameter wrote:
> > Hmmm... Maybe we can bypass the checks?
> >
>
> Maybe we should not.
>
> Watermarks should not just be ignored. They prevent the system
> deadlocking due to an inability to allocate a page needed to free more
> memory. This patch allows allocations that are not high priority
> or atomic to succeed when the buddy lists are at the min watermark
> and would normally be throttled. At a minimum, this patch increases
> the risk of locking up due to memory exhaustion. For example,
> a GFP_ATOMIC allocation can refill the per-cpu list with pages
> that are then consumed by GFP_KERNEL allocations; the next GFP_ATOMIC
> allocation refills the list again, it gets consumed, and so on. It's
> even worse if PF_MEMALLOC allocations are refilling the lists, as they
> ignore watermarks. If this is happening on enough CPUs, it will cause trouble.

Hmmm... True. This allocation complexity prevents effective use of caches.

> At the very least, the performance benefit of such a change should
> be illustrated. Even if it's faster (and I'd expect it to be,
> watermark checks particularly at low memory are expensive), it may
> just mean the system occasionally runs very fast into a wall. Hence,
> the patch should be accompanied with tests showing that even under
> very high stress for a long period of time that it does not lock up
> and the changelog should include a *very* convincing description
> on why PF_MEMALLOC refilling the per-cpu lists to be consumed by
> low-priority users is not a problem.

The performance of the page allocator is extremely bad at this point and
it is so because of all these checks in the critical paths. There have
been numerous ways that subsystems worked around this in the past and I
would think that there is no question that removing expensive checks from
the fastpath improves performance.

Maybe the only solution is to build a consistent second layer of
caching around the page allocator that is usable by various subsystems?

SLAB has in the past provided such a caching layer. The problem is that
people are now trying to build similar complexity into the fast path of
those allocators as well (e.g. the NFS swap patch, with its ways of
reserving objects to fix the issue you mentioned above of objects being
taken for the wrong reasons). We need some solution that allows the
implementation of fast object allocation, and that means reducing the
complexity of what is going on during page alloc and free.



^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2011-07-21 15:24 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-07-15 15:08 [PATCH 0/2] Reduce frequency of stalls due to zone_reclaim() on NUMA v2r1 Mel Gorman
2011-07-15 15:08 ` Mel Gorman
2011-07-15 15:08 ` [PATCH 1/2] mm: page allocator: Initialise ZLC for first zone eligible for zone_reclaim Mel Gorman
2011-07-15 15:08   ` Mel Gorman
2011-07-18 14:56   ` Christoph Lameter
2011-07-18 14:56     ` Christoph Lameter
2011-07-18 16:05     ` Mel Gorman
2011-07-18 16:05       ` Mel Gorman
2011-07-18 17:20       ` Christoph Lameter
2011-07-18 17:20         ` Christoph Lameter
2011-07-18 21:13         ` Mel Gorman
2011-07-18 21:13           ` Mel Gorman
2011-07-18 21:54           ` Christoph Lameter
2011-07-18 21:54             ` Christoph Lameter
2011-07-19 14:01             ` Christoph Lameter
2011-07-19 14:01               ` Christoph Lameter
2011-07-20 18:08               ` Christoph Lameter
2011-07-20 18:08                 ` Christoph Lameter
2011-07-20 19:18                 ` Mel Gorman
2011-07-20 19:18                   ` Mel Gorman
2011-07-20 19:28                   ` Christoph Lameter
2011-07-20 19:28                     ` Christoph Lameter
2011-07-20 19:52                     ` Christoph Lameter
2011-07-20 19:52                       ` Christoph Lameter
2011-07-20 21:17                       ` Christoph Lameter
2011-07-20 21:17                         ` Christoph Lameter
2011-07-20 22:48                         ` Mel Gorman
2011-07-20 22:48                           ` Mel Gorman
2011-07-21 15:24                           ` Christoph Lameter
2011-07-21 15:24                             ` Christoph Lameter
2011-07-15 15:09 ` [PATCH 2/2] mm: page allocator: Reconsider zones for allocation after direct reclaim Mel Gorman
2011-07-15 15:09   ` Mel Gorman
2011-07-19 11:46   ` [PATCH] mm: page allocator: Reconsider zones for allocation after direct reclaim fix Mel Gorman
2011-07-19 11:46     ` Mel Gorman
