* [PATCH 0/10] Reduce system disruption due to kswapd V2
From: Mel Gorman @ 2013-04-09 11:06 UTC
  To: Andrew Morton
  Cc: Jiri Slaby, Valdis Kletnieks, Rik van Riel, Zlatko Calusic,
	Johannes Weiner, dormando, Satoru Moriya, Michal Hocko, Linux-MM,
	LKML, Mel Gorman

Posting V2 of this series got delayed due to trying to pin down an unrelated
regression in 3.9-rc where interactive performance is shot to hell. That
problem still has not been identified as it's resisting attempts to be
reproducible by a script for the purposes of bisection.

For those that looked at V1, the most important difference in this version
is how patch 2 preserves the proportional scanning of anon/file LRUs.

The series is against 3.9-rc6.

Changelog since V1
o Rename ZONE_DIRTY to ZONE_TAIL_LRU_DIRTY			(andi)
o Reformat comment in shrink_page_list				(andi)
o Clarify some comments						(dhillf)
o Rework how the proportional scanning is preserved
o Add PageReclaim check before kswapd starts writeback
o Reset sc.nr_reclaimed on every full zone scan

Kswapd and page reclaim behaviour has been screwy in one way or the other
for a long time. Very broadly speaking it worked in the far past because
machines were limited in memory so it did not have that many pages to scan
and it stalled in congestion_wait() frequently to prevent it going completely
nuts. In recent times it has behaved very unsatisfactorily with some of
the problems compounded by the removal of stall logic and the introduction
of transparent hugepage support with high-order reclaims.

There are many variations of bugs that are rooted in this area. One example
is reports of large copy operations or backups causing the machine to
grind to a halt or applications to be pushed to swap. Sometimes in low memory
situations a large percentage of memory suddenly gets reclaimed. In other
cases an application starts and kswapd hits 100% CPU usage for prolonged
periods of time and so on. There is now talk of introducing features like
an extra free kbytes tunable to work around aspects of the problem instead
of trying to deal with it. It's compounded by the problem that it can be
very workload and machine specific.

This series aims at addressing some of the worst of these problems without
attempting to fundamentally alter how page reclaim works.

Patches 1-2 limit the number of pages kswapd reclaims while still obeying
	the anon/file proportion of the LRUs it should be scanning.

Patches 3-4 control how and when kswapd raises its scanning priority and
	delete the scanning restart logic which is tricky to follow.

Patch 5 notes that it is too easy for kswapd to reach priority 0 when
	scanning and then reclaim the world. Down with that sort of thing.

Patch 6 notes that kswapd starts writeback based on scanning priority which
	is not necessarily related to dirty pages. It will have kswapd
	writeback pages if a number of unqueued dirty pages have been
	recently encountered at the tail of the LRU.

Patch 7 notes that sometimes kswapd should stall waiting on IO to complete
	to reduce LRU churn and the likelihood that it'll reclaim young
	clean pages or push applications to swap. It will cause kswapd
	to block on IO if it detects that pages being reclaimed under
	writeback are recycling through the LRU before the IO completes.

Patch 8 shrinks slab just once per priority scanned or if a zone is otherwise
	unreclaimable to avoid hammering slab when kswapd has to skip a
	large number of pages.

Patches 9-10 are cosmetic but balance_pgdat() might be easier to follow.

This was tested using memcached+memcachetest while some background IO
was in progress as implemented by the parallel IO tests in MM
Tests. memcachetest benchmarks how many operations/second memcached can
service and it is run multiple times. It starts with no background IO and
then re-runs the test with larger amounts of IO in the background to roughly
simulate a large copy in progress.  The expectation is that the IO should
have little or no impact on memcachetest which is running entirely in memory.

                                         3.9.0-rc6                   3.9.0-rc6
                                           vanilla           lessdisrupt-v2r11
Ops memcachetest-0M             11106.00 (  0.00%)          10997.00 ( -0.98%)
Ops memcachetest-749M           10960.00 (  0.00%)          11032.00 (  0.66%)
Ops memcachetest-2498M           2588.00 (  0.00%)          10948.00 (323.03%)
Ops memcachetest-4246M           2401.00 (  0.00%)          10960.00 (356.48%)
Ops io-duration-0M                  0.00 (  0.00%)              0.00 (  0.00%)
Ops io-duration-749M               15.00 (  0.00%)              8.00 ( 46.67%)
Ops io-duration-2498M             112.00 (  0.00%)             25.00 ( 77.68%)
Ops io-duration-4246M             170.00 (  0.00%)             45.00 ( 73.53%)
Ops swaptotal-0M                    0.00 (  0.00%)              0.00 (  0.00%)
Ops swaptotal-749M             161678.00 (  0.00%)             16.00 ( 99.99%)
Ops swaptotal-2498M            471903.00 (  0.00%)            192.00 ( 99.96%)
Ops swaptotal-4246M            444010.00 (  0.00%)           1323.00 ( 99.70%)
Ops swapin-0M                       0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-749M                   789.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-2498M               196496.00 (  0.00%)            192.00 ( 99.90%)
Ops swapin-4246M               168269.00 (  0.00%)            154.00 ( 99.91%)
Ops minorfaults-0M            1596126.00 (  0.00%)        1521332.00 (  4.69%)
Ops minorfaults-749M          1766556.00 (  0.00%)        1596350.00 (  9.63%)
Ops minorfaults-2498M         1661445.00 (  0.00%)        1598762.00 (  3.77%)
Ops minorfaults-4246M         1628375.00 (  0.00%)        1597624.00 (  1.89%)
Ops majorfaults-0M                  9.00 (  0.00%)              0.00 (  0.00%)
Ops majorfaults-749M              154.00 (  0.00%)            101.00 ( 34.42%)
Ops majorfaults-2498M           27214.00 (  0.00%)            165.00 ( 99.39%)
Ops majorfaults-4246M           23229.00 (  0.00%)            114.00 ( 99.51%)

Note how the vanilla kernel's performance collapses when there is enough IO
taking place in the background. This drop in performance is part of what users
complain of when they start backups. Note how the swapin and major fault
figures indicate that processes were being pushed to swap prematurely. With
the series applied, there is no noticeable performance drop and while there
is still some swap activity, it's tiny.
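
As a reading aid for the percentages in parentheses in the table above, here
is a minimal sketch of how they appear to be derived. The sign convention
(positive always means "better than vanilla") and the treatment of zero values
are assumptions inferred from the figures, not taken from the MM Tests source,
and pct_gain() is a hypothetical helper name.

```c
/*
 * Percentage gain of the patched kernel relative to vanilla.
 * Assumption: rows where either value is zero are reported as 0.00%,
 * which is what the table shows (e.g. swapin-749M, majorfaults-0M).
 */
double pct_gain(double vanilla, double patched, int higher_is_better)
{
	if (vanilla == 0.0 || patched == 0.0)
		return 0.0;
	if (higher_is_better)
		return (patched - vanilla) * 100.0 / vanilla;
	/* For durations, swap and fault counts a reduction is the gain. */
	return (vanilla - patched) * 100.0 / vanilla;
}
```

For example, memcachetest-2498M is a higher-is-better row (2588 vs 10948
operations/second) while io-duration-749M is a lower-is-better row (15 vs 8).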

                             3.9.0-rc6   3.9.0-rc6
                               vanilla lessdisrupt-v2r11
Page Ins                       9094288      346092
Page Outs                     62897388    47599884
Swap Ins                       2243749       19389
Swap Outs                      3953966      142258
Direct pages scanned                 0     2262897
Kswapd pages scanned          55530838    75725437
Kswapd pages reclaimed         6682620     1814689
Direct pages reclaimed               0     2187167
Kswapd efficiency                  12%          2%
Kswapd velocity              10537.501   14377.501
Direct efficiency                 100%         96%
Direct velocity                  0.000     429.642
Percentage direct scans             0%          2%
Page writes by reclaim        10835163    72419297
Page writes file               6881197    72277039
Page writes anon               3953966      142258
Page reclaim immediate           11463        8199
Page rescued immediate               0           0
Slabs scanned                    38144       30592
Direct inode steals                  0           0
Kswapd inode steals              11383         791
Kswapd skipped wait                  0           0
THP fault alloc                     10         111
THP collapse alloc                2782        1779
THP splits                          10          27
THP fault fallback                   0           5
THP collapse fail                    0          21
Compaction stalls                    0          89
Compaction success                   0          53
Compaction failures                  0          36
Page migrate success                 0       37062
Page migrate failure                 0           0
Compaction pages isolated            0       83481
Compaction migrate scanned           0       80830
Compaction free scanned              0     2660824
Compaction cost                      0          40
NUMA PTE updates                     0           0
NUMA hint faults                     0           0
NUMA hint local faults               0           0
NUMA pages migrated                  0           0
AutoNUMA cost                        0           0
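
The derived rows in the table above (efficiency and percentage direct scans)
can be reproduced from the raw counters. The sketch below assumes that
efficiency is reclaimed/scanned, that results are truncated to whole percent,
and that an idle reclaimer is reported as 100% efficient; those assumptions
match the figures shown but are not confirmed by the reporting tool's source.
The function names are hypothetical.

```c
/* Share of scanned pages that were actually reclaimed, in whole percent. */
unsigned long long efficiency_pct(unsigned long long reclaimed,
				  unsigned long long scanned)
{
	if (scanned == 0)
		return 100;	/* matches "Direct efficiency 100%" on vanilla */
	return reclaimed * 100 / scanned;
}

/* Share of all scanning that was done by direct reclaim, in whole percent. */
unsigned long long direct_scan_pct(unsigned long long direct_scanned,
				   unsigned long long kswapd_scanned)
{
	unsigned long long total = direct_scanned + kswapd_scanned;

	if (total == 0)
		return 0;
	return direct_scanned * 100 / total;
}
```

With the series applied, kswapd scanned 75725437 pages to reclaim 1814689,
which is where the drop in kswapd efficiency comes from.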

Note that while there is no noticeable performance drop and swap activity is
massively reduced there are processes that direct reclaim as a consequence
of the series due to kswapd not reclaiming the world. ftrace was not enabled
for this particular test to avoid disruption but on a similar test with
ftrace I found that the vast bulk of the direct reclaims were in the dd
processes. The top direct reclaimers were:

     11 ps-13204
     12 top-13198
     15 memcachetest-11712
     20 gzip-3126
     67 tclsh-3124
     80 memcachetest-12924
    191 flush-8:0-292
    338 tee-3125
   2184 dd-12135
  10751 dd-13124

While processes did stall, it was mostly the "correct" processes that
stalled.

There is also still a risk that kswapd not reclaiming the world may mean
that it stays awake balancing zones, does not stall on the appropriate
events and continually scans pages it cannot reclaim consuming CPU. This
will be visible as continued high CPU usage but in my own tests I only
saw a single spike lasting less than a second and I did not observe any
problems related to reclaim while running the series on my desktop.

 include/linux/mmzone.h |  17 ++
 mm/vmscan.c            | 449 ++++++++++++++++++++++++++++++-------------------
 2 files changed, 293 insertions(+), 173 deletions(-)

-- 
1.8.1.4



* [PATCH 01/10] mm: vmscan: Limit the number of pages kswapd reclaims at each priority
From: Mel Gorman @ 2013-04-09 11:06 UTC
  To: Andrew Morton
  Cc: Jiri Slaby, Valdis Kletnieks, Rik van Riel, Zlatko Calusic,
	Johannes Weiner, dormando, Satoru Moriya, Michal Hocko, Linux-MM,
	LKML, Mel Gorman

The number of pages kswapd can reclaim is bound by the number of pages it
scans which is related to the size of the zone and the scanning priority. In
many cases the priority remains low because it's reset every SWAP_CLUSTER_MAX
reclaimed pages but in the event kswapd scans a large number of pages it
cannot reclaim, it will raise the priority and potentially discard a large
percentage of the zone as sc->nr_to_reclaim is ULONG_MAX. The user-visible
effect is a reclaim "spike" where a large percentage of memory is suddenly
freed. It would be bad enough if this was just unused memory but because
of how anon/file pages are balanced it is possible that applications get
pushed to swap unnecessarily.

This patch limits the number of pages kswapd will reclaim to the high
watermark. Reclaim will still overshoot due to it not being a hard limit as
shrink_lruvec() will ignore the sc.nr_to_reclaim at DEF_PRIORITY but it
prevents kswapd reclaiming the world at higher priorities. The number of
pages it reclaims is not adjusted for high-order allocations as kswapd will
reclaim excessively if it is to balance zones for high-order allocations.
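
A userspace sketch of the two quantities discussed above may help. It is a
simplification under stated assumptions: scan_window() models the scan size as
the LRU size shifted right by the priority, ignoring the other factors the
kernel takes into account, and both function names are hypothetical. Only
SWAP_CLUSTER_MAX, DEF_PRIORITY and the max() clamp correspond to the kernel
and to this patch.

```c
#define SWAP_CLUSTER_MAX 32UL	/* kernel reclaim batch size */
#define DEF_PRIORITY     12	/* kernel starting scan priority */

/*
 * Simplified scan window for one LRU list. A numerically lower priority
 * is a "raised" priority and scans a larger share of the list.
 */
unsigned long scan_window(unsigned long lru_pages, int priority)
{
	return lru_pages >> priority;
}

/*
 * Reclaim target set by kswapd_shrink_zone() in this patch. Previously
 * the target was ULONG_MAX, so nothing bounded reclaim at priority 0.
 */
unsigned long kswapd_reclaim_target(unsigned long high_wmark)
{
	return high_wmark > SWAP_CLUSTER_MAX ? high_wmark : SWAP_CLUSTER_MAX;
}
```

For a list of 1048576 pages, scan_window() covers 256 pages at DEF_PRIORITY
and the entire list at priority 0, which is the case where an unbounded
target turns into a reclaim spike.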

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/vmscan.c | 53 +++++++++++++++++++++++++++++------------------------
 1 file changed, 29 insertions(+), 24 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 88c5fed..4835a7a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2593,6 +2593,32 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
 }
 
 /*
+ * kswapd shrinks the zone by the number of pages required to reach
+ * the high watermark.
+ */
+static void kswapd_shrink_zone(struct zone *zone,
+			       struct scan_control *sc,
+			       unsigned long lru_pages)
+{
+	unsigned long nr_slab;
+	struct reclaim_state *reclaim_state = current->reclaim_state;
+	struct shrink_control shrink = {
+		.gfp_mask = sc->gfp_mask,
+	};
+
+	/* Reclaim above the high watermark. */
+	sc->nr_to_reclaim = max(SWAP_CLUSTER_MAX, high_wmark_pages(zone));
+	shrink_zone(zone, sc);
+
+	reclaim_state->reclaimed_slab = 0;
+	nr_slab = shrink_slab(&shrink, sc->nr_scanned, lru_pages);
+	sc->nr_reclaimed += reclaim_state->reclaimed_slab;
+
+	if (nr_slab == 0 && !zone_reclaimable(zone))
+		zone->all_unreclaimable = 1;
+}
+
+/*
  * For kswapd, balance_pgdat() will work across all this node's zones until
  * they are all at high_wmark_pages(zone).
  *
@@ -2619,27 +2645,16 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 	bool pgdat_is_balanced = false;
 	int i;
 	int end_zone = 0;	/* Inclusive.  0 = ZONE_DMA */
-	unsigned long total_scanned;
-	struct reclaim_state *reclaim_state = current->reclaim_state;
 	unsigned long nr_soft_reclaimed;
 	unsigned long nr_soft_scanned;
 	struct scan_control sc = {
 		.gfp_mask = GFP_KERNEL,
 		.may_unmap = 1,
 		.may_swap = 1,
-		/*
-		 * kswapd doesn't want to be bailed out while reclaim. because
-		 * we want to put equal scanning pressure on each zone.
-		 */
-		.nr_to_reclaim = ULONG_MAX,
 		.order = order,
 		.target_mem_cgroup = NULL,
 	};
-	struct shrink_control shrink = {
-		.gfp_mask = sc.gfp_mask,
-	};
 loop_again:
-	total_scanned = 0;
 	sc.priority = DEF_PRIORITY;
 	sc.nr_reclaimed = 0;
 	sc.may_writepage = !laptop_mode;
@@ -2710,7 +2725,7 @@ loop_again:
 		 */
 		for (i = 0; i <= end_zone; i++) {
 			struct zone *zone = pgdat->node_zones + i;
-			int nr_slab, testorder;
+			int testorder;
 			unsigned long balance_gap;
 
 			if (!populated_zone(zone))
@@ -2730,7 +2745,6 @@ loop_again:
 							order, sc.gfp_mask,
 							&nr_soft_scanned);
 			sc.nr_reclaimed += nr_soft_reclaimed;
-			total_scanned += nr_soft_scanned;
 
 			/*
 			 * We put equal pressure on every zone, unless
@@ -2759,17 +2773,8 @@ loop_again:
 
 			if ((buffer_heads_over_limit && is_highmem_idx(i)) ||
 			    !zone_balanced(zone, testorder,
-					   balance_gap, end_zone)) {
-				shrink_zone(zone, &sc);
-
-				reclaim_state->reclaimed_slab = 0;
-				nr_slab = shrink_slab(&shrink, sc.nr_scanned, lru_pages);
-				sc.nr_reclaimed += reclaim_state->reclaimed_slab;
-				total_scanned += sc.nr_scanned;
-
-				if (nr_slab == 0 && !zone_reclaimable(zone))
-					zone->all_unreclaimable = 1;
-			}
+					   balance_gap, end_zone))
+				kswapd_shrink_zone(zone, &sc, lru_pages);
 
 			/*
 			 * If we're getting trouble reclaiming, start doing
-- 
1.8.1.4



* [PATCH 02/10] mm: vmscan: Obey proportional scanning requirements for kswapd
From: Mel Gorman @ 2013-04-09 11:06 UTC
  To: Andrew Morton
  Cc: Jiri Slaby, Valdis Kletnieks, Rik van Riel, Zlatko Calusic,
	Johannes Weiner, dormando, Satoru Moriya, Michal Hocko, Linux-MM,
	LKML, Mel Gorman

Simplistically, the anon and file LRU lists are scanned proportionally
depending on the value of vm.swappiness although there are other factors
taken into account by get_scan_count().  The patch "mm: vmscan: Limit
the number of pages kswapd reclaims" limits the number of pages kswapd
reclaims but it breaks this proportional scanning and may evenly shrink
anon/file LRUs regardless of vm.swappiness.

This patch preserves the proportional scanning and reclaim. It does mean
that kswapd will reclaim more than requested but the number of pages will
be related to the high watermark.
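
To illustrate the problem described above (not the fix in this patch), the
following userspace sketch scans two lists round-robin in SWAP_CLUSTER_MAX
batches and stops as soon as a reclaim goal is met. It assumes every scanned
page is reclaimed and models only one anon and one file list; struct scanned
and scan_until_goal() are hypothetical names.

```c
#define SWAP_CLUSTER_MAX 32UL

struct scanned {
	unsigned long anon;
	unsigned long file;
};

/* Size of the next batch taken from a list with 'remaining' pages to scan. */
unsigned long next_batch(unsigned long remaining)
{
	return remaining < SWAP_CLUSTER_MAX ? remaining : SWAP_CLUSTER_MAX;
}

/*
 * Scan both lists in equal-sized batches until 'goal' pages are reclaimed
 * or both scan targets are exhausted. The targets stand in for what
 * get_scan_count() would request.
 */
struct scanned scan_until_goal(unsigned long anon_target,
			       unsigned long file_target,
			       unsigned long goal)
{
	struct scanned done = { 0, 0 };

	while ((anon_target || file_target) && done.anon + done.file < goal) {
		unsigned long n = next_batch(anon_target);

		done.anon += n;
		anon_target -= n;

		n = next_batch(file_target);
		done.file += n;
		file_target -= n;
	}
	return done;
}
```

With targets of 100 anon and 1000 file pages, an unbounded goal scans them
1:10 as requested, but a goal of 64 stops after one round with 32 pages taken
from each list, an even split regardless of the requested proportion.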

[mhocko@suse.cz: Correct proportional reclaim for memcg and simplify]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
---
 mm/vmscan.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 46 insertions(+), 8 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4835a7a..0742c45 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1825,13 +1825,21 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 	enum lru_list lru;
 	unsigned long nr_reclaimed = 0;
 	unsigned long nr_to_reclaim = sc->nr_to_reclaim;
+	unsigned long nr_anon_scantarget, nr_file_scantarget;
 	struct blk_plug plug;
+	bool scan_adjusted = false;
 
 	get_scan_count(lruvec, sc, nr);
 
+	/* Record the original scan target for proportional adjustments later */
+	nr_file_scantarget = nr[LRU_INACTIVE_FILE] + nr[LRU_ACTIVE_FILE] + 1;
+	nr_anon_scantarget = nr[LRU_INACTIVE_ANON] + nr[LRU_ACTIVE_ANON] + 1;
+
 	blk_start_plug(&plug);
 	while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
 					nr[LRU_INACTIVE_FILE]) {
+		unsigned long nr_anon, nr_file, percentage;
+
 		for_each_evictable_lru(lru) {
 			if (nr[lru]) {
 				nr_to_scan = min(nr[lru], SWAP_CLUSTER_MAX);
@@ -1841,17 +1849,47 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 							    lruvec, sc);
 			}
 		}
+
+		if (nr_reclaimed < nr_to_reclaim || scan_adjusted)
+			continue;
+
 		/*
-		 * On large memory systems, scan >> priority can become
-		 * really large. This is fine for the starting priority;
-		 * we want to put equal scanning pressure on each zone.
-		 * However, if the VM has a harder time of freeing pages,
-		 * with multiple processes reclaiming pages, the total
-		 * freeing target can get unreasonably large.
+		 * For global direct reclaim, reclaim only the number of pages
+		 * requested. Less care is taken to scan proportionally as it
+		 * is more important to minimise direct reclaim stall latency
+		 * than it is to properly age the LRU lists.
 		 */
-		if (nr_reclaimed >= nr_to_reclaim &&
-		    sc->priority < DEF_PRIORITY)
+		if (global_reclaim(sc) && !current_is_kswapd())
 			break;
+
+		/*
+		 * For kswapd and memcg, reclaim at least the number of pages
+		 * requested. Ensure that the anon and file LRUs shrink
+		 * proportionally to what was requested by get_scan_count().
+		 * We stop reclaiming one LRU and reduce the amount of scanning
+		 * on the other in proportion to the original scan target.
+		 */
+		nr_file = nr[LRU_INACTIVE_FILE] + nr[LRU_ACTIVE_FILE];
+		nr_anon = nr[LRU_INACTIVE_ANON] + nr[LRU_ACTIVE_ANON];
+
+		if (nr_file > nr_anon) {
+			lru = LRU_BASE;
+			percentage = nr_anon * 100 / nr_anon_scantarget;
+		} else {
+			lru = LRU_FILE;
+			percentage = nr_file * 100 / nr_file_scantarget;
+		}
+
+		/* Stop scanning the smaller of the LRUs */
+		nr[lru] = 0;
+		nr[lru + LRU_ACTIVE] = 0;
+
+		/* Reduce scanning of the other LRU proportionally */
+		lru = (lru == LRU_FILE) ? LRU_BASE : LRU_FILE;
+		nr[lru] = nr[lru] * percentage / 100;
+		nr[lru + LRU_ACTIVE] = nr[lru + LRU_ACTIVE] * percentage / 100;
+
+		scan_adjusted = true;
 	}
 	blk_finish_plug(&plug);
 	sc->nr_reclaimed += nr_reclaimed;
-- 
1.8.1.4
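
For readers following the arithmetic in the hunk above, here is a minimal userspace sketch of the rescaling step. It is not kernel code: the enum, the struct and both function names are hypothetical stand-ins, and only the formulas (the +1 bias on the recorded targets, the integer percentage, zeroing the smaller LRU type) are taken from the diff.

```c
#include <assert.h>

/* Hypothetical userspace stand-ins for the kernel's evictable LRU indexes. */
enum lru_model {
	INACTIVE_ANON,	/* plays the role of LRU_BASE */
	ACTIVE_ANON,
	INACTIVE_FILE,	/* plays the role of LRU_FILE */
	ACTIVE_FILE,
	NR_LRU_MODEL
};

/* Original scan targets, biased by one as in the patch so they are never 0. */
struct scan_targets {
	unsigned long anon;
	unsigned long file;
};

/* Record the targets right after get_scan_count() would have filled nr[]. */
static struct scan_targets record_targets(const unsigned long nr[NR_LRU_MODEL])
{
	struct scan_targets t;

	t.anon = nr[INACTIVE_ANON] + nr[ACTIVE_ANON] + 1;
	t.file = nr[INACTIVE_FILE] + nr[ACTIVE_FILE] + 1;
	return t;
}

/*
 * Once enough pages were reclaimed: stop scanning whichever type has less
 * work left and scale the other type by the fraction of the stopped type's
 * original target that was still outstanding.
 */
static void rescale_remaining(unsigned long nr[NR_LRU_MODEL],
			      struct scan_targets t)
{
	unsigned long nr_anon = nr[INACTIVE_ANON] + nr[ACTIVE_ANON];
	unsigned long nr_file = nr[INACTIVE_FILE] + nr[ACTIVE_FILE];
	unsigned long percentage;
	int stop, keep;

	assert(t.anon > 0 && t.file > 0);	/* guaranteed by the +1 bias */

	if (nr_file > nr_anon) {
		stop = INACTIVE_ANON;
		keep = INACTIVE_FILE;
		percentage = nr_anon * 100 / t.anon;
	} else {
		stop = INACTIVE_FILE;
		keep = INACTIVE_ANON;
		percentage = nr_file * 100 / t.file;
	}

	/* Inactive and active lists of one type sit next to each other. */
	nr[stop] = 0;
	nr[stop + 1] = 0;
	nr[keep] = nr[keep] * percentage / 100;
	nr[keep + 1] = nr[keep + 1] * percentage / 100;
}
```

With targets of 100/100 anon and 400/400 file and one 32-page batch taken from each list, the sketch stops anon scanning and trims each file list from 368 to 246 pages, since 136 * 100 / 201 rounds down to 67 percent.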


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 03/10] mm: vmscan: Flatten kswapd priority loop
  2013-04-09 11:06 ` Mel Gorman
@ 2013-04-09 11:06   ` Mel Gorman
  -1 siblings, 0 replies; 83+ messages in thread
From: Mel Gorman @ 2013-04-09 11:06 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jiri Slaby, Valdis Kletnieks, Rik van Riel, Zlatko Calusic,
	Johannes Weiner, dormando, Satoru Moriya, Michal Hocko, Linux-MM,
	LKML, Mel Gorman

kswapd stops raising the scanning priority when at least SWAP_CLUSTER_MAX
pages have been reclaimed or the pgdat is considered balanced. It then
rechecks if it needs to restart at DEF_PRIORITY and whether high-order
reclaim needs to be reset. This is not wrong per se but it is confusing
to follow and forcing kswapd to stay at DEF_PRIORITY may require several
restarts before it has scanned enough pages to meet the high watermark even
at 100% efficiency. This patch irons out the logic a bit by controlling
when priority is raised and removing the "goto loop_again".

This patch has kswapd raise the scanning priority until it is scanning
enough pages that it could meet the high watermark in one shrink of the
LRU lists if it is able to reclaim at 100% efficiency. It will not raise
the scanning priority higher unless it is failing to reclaim any pages.

To avoid infinite looping for high-order allocation requests, kswapd stops
reclaiming at the requested order once it has reclaimed at least twice the
number of pages in the allocation request and rechecks at order-0 instead.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/vmscan.c | 85 +++++++++++++++++++++++++++++--------------------------------
 1 file changed, 40 insertions(+), 45 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0742c45..78268ca 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2633,8 +2633,12 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
 /*
  * kswapd shrinks the zone by the number of pages required to reach
  * the high watermark.
+ *
+ * Returns true if kswapd scanned at least the requested number of pages to
+ * reclaim. This is used to determine if the scanning priority needs to be
+ * raised.
  */
-static void kswapd_shrink_zone(struct zone *zone,
+static bool kswapd_shrink_zone(struct zone *zone,
 			       struct scan_control *sc,
 			       unsigned long lru_pages)
 {
@@ -2654,6 +2658,8 @@ static void kswapd_shrink_zone(struct zone *zone,
 
 	if (nr_slab == 0 && !zone_reclaimable(zone))
 		zone->all_unreclaimable = 1;
+
+	return sc->nr_scanned >= sc->nr_to_reclaim;
 }
 
 /*
@@ -2680,26 +2686,25 @@ static void kswapd_shrink_zone(struct zone *zone,
 static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 							int *classzone_idx)
 {
-	bool pgdat_is_balanced = false;
 	int i;
 	int end_zone = 0;	/* Inclusive.  0 = ZONE_DMA */
 	unsigned long nr_soft_reclaimed;
 	unsigned long nr_soft_scanned;
 	struct scan_control sc = {
 		.gfp_mask = GFP_KERNEL,
+		.priority = DEF_PRIORITY,
 		.may_unmap = 1,
 		.may_swap = 1,
+		.may_writepage = !laptop_mode,
 		.order = order,
 		.target_mem_cgroup = NULL,
 	};
-loop_again:
-	sc.priority = DEF_PRIORITY;
-	sc.nr_reclaimed = 0;
-	sc.may_writepage = !laptop_mode;
 	count_vm_event(PAGEOUTRUN);
 
 	do {
 		unsigned long lru_pages = 0;
+		unsigned long nr_reclaimed = sc.nr_reclaimed = 0;
+		bool raise_priority = true;
 
 		/*
 		 * Scan in the highmem->dma direction for the highest
@@ -2741,10 +2746,8 @@ loop_again:
 			}
 		}
 
-		if (i < 0) {
-			pgdat_is_balanced = true;
+		if (i < 0)
 			goto out;
-		}
 
 		for (i = 0; i <= end_zone; i++) {
 			struct zone *zone = pgdat->node_zones + i;
@@ -2811,8 +2814,16 @@ loop_again:
 
 			if ((buffer_heads_over_limit && is_highmem_idx(i)) ||
 			    !zone_balanced(zone, testorder,
-					   balance_gap, end_zone))
-				kswapd_shrink_zone(zone, &sc, lru_pages);
+					   balance_gap, end_zone)) {
+				/*
+				 * There should be no need to raise the
+				 * scanning priority if enough pages are
+				 * already being scanned that high
+				 * watermark would be met at 100% efficiency.
+				 */
+				if (kswapd_shrink_zone(zone, &sc, lru_pages))
+					raise_priority = false;
+			}
 
 			/*
 			 * If we're getting trouble reclaiming, start doing
@@ -2847,46 +2858,29 @@ loop_again:
 				pfmemalloc_watermark_ok(pgdat))
 			wake_up(&pgdat->pfmemalloc_wait);
 
-		if (pgdat_balanced(pgdat, order, *classzone_idx)) {
-			pgdat_is_balanced = true;
-			break;		/* kswapd: all done */
-		}
-
 		/*
-		 * We do this so kswapd doesn't build up large priorities for
-		 * example when it is freeing in parallel with allocators. It
-		 * matches the direct reclaim path behaviour in terms of impact
-		 * on zone->*_priority.
+		 * Fragmentation may mean that the system cannot be rebalanced
+		 * for high-order allocations in all zones. If twice the
+		 * allocation size has been reclaimed and the zones are still
+		 * not balanced then recheck the watermarks at order-0 to
+		 * prevent kswapd reclaiming excessively. Assume that a process
+		 * requesting a high-order page can direct reclaim/compact.
 		 */
-		if (sc.nr_reclaimed >= SWAP_CLUSTER_MAX)
-			break;
-	} while (--sc.priority >= 0);
-
-out:
-	if (!pgdat_is_balanced) {
-		cond_resched();
+		if (order && sc.nr_reclaimed >= 2UL << order)
+			order = sc.order = 0;
 
-		try_to_freeze();
+		/* Check if kswapd should be suspending */
+		if (try_to_freeze() || kthread_should_stop())
+			break;
 
 		/*
-		 * Fragmentation may mean that the system cannot be
-		 * rebalanced for high-order allocations in all zones.
-		 * At this point, if nr_reclaimed < SWAP_CLUSTER_MAX,
-		 * it means the zones have been fully scanned and are still
-		 * not balanced. For high-order allocations, there is
-		 * little point trying all over again as kswapd may
-		 * infinite loop.
-		 *
-		 * Instead, recheck all watermarks at order-0 as they
-		 * are the most important. If watermarks are ok, kswapd will go
-		 * back to sleep. High-order users can still perform direct
-		 * reclaim if they wish.
+		 * Raise priority if scanning rate is too low or there was no
+		 * progress in reclaiming pages
 		 */
-		if (sc.nr_reclaimed < SWAP_CLUSTER_MAX)
-			order = sc.order = 0;
-
-		goto loop_again;
-	}
+		if (raise_priority || sc.nr_reclaimed - nr_reclaimed == 0)
+			sc.priority--;
+	} while (sc.priority >= 0 &&
+		 !pgdat_balanced(pgdat, order, *classzone_idx));
 
 	/*
 	 * If kswapd was reclaiming at a higher order, it has the option of
@@ -2915,6 +2909,7 @@ out:
 			compact_pgdat(pgdat, order);
 	}
 
+out:
 	/*
 	 * Return the order we were reclaiming at so prepare_kswapd_sleep()
 	 * makes a decision on the order we were last reclaiming at. However,
-- 
1.8.1.4
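
The two decisions made at the bottom of the flattened loop can be modelled in isolation. The sketch below is a userspace illustration only: next_priority() and effective_order() are hypothetical helpers that do not exist in mm/vmscan.c, and each pass is reduced to three counters. The conditions themselves mirror the diff, and DEF_PRIORITY is 12 as in the kernel.

```c
#include <assert.h>
#include <stdbool.h>

#define DEF_PRIORITY 12	/* kernel's starting scan priority */

/*
 * One pass of the loop: the priority number is decremented (scanning gets
 * more aggressive) if the pass scanned fewer pages than it wanted to
 * reclaim, or if it reclaimed nothing at all.
 */
static int next_priority(int priority, unsigned long nr_scanned,
			 unsigned long nr_to_reclaim,
			 unsigned long nr_reclaimed)
{
	bool raise_priority = nr_scanned < nr_to_reclaim;

	if (raise_priority || nr_reclaimed == 0)
		priority--;
	return priority;
}

/*
 * High-order fallback: once twice the allocation size has been reclaimed,
 * continue at order-0 so kswapd does not loop on a fragmented node.
 */
static int effective_order(int order, unsigned long nr_reclaimed)
{
	assert(order >= 0 && order < 31);	/* keeps the shift well defined */

	if (order && nr_reclaimed >= (2UL << order))
		return 0;
	return order;
}
```

For an order-3 request the threshold is 2UL << 3, or 16 pages, so effective_order(3, 16) drops to 0 while effective_order(3, 15) stays at 3.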


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 04/10] mm: vmscan: Decide whether to compact the pgdat based on reclaim progress
  2013-04-09 11:06 ` Mel Gorman
@ 2013-04-09 11:06   ` Mel Gorman
  -1 siblings, 0 replies; 83+ messages in thread
From: Mel Gorman @ 2013-04-09 11:06 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jiri Slaby, Valdis Kletnieks, Rik van Riel, Zlatko Calusic,
	Johannes Weiner, dormando, Satoru Moriya, Michal Hocko, Linux-MM,
	LKML, Mel Gorman

In the past, kswapd decided whether to compact memory after the pgdat
was considered balanced. This more or less worked but it is late to make
such a decision and does not fit well now that kswapd decides whether to
exit the zone scanning loop depending on reclaim progress.

This patch will compact a pgdat if at least the requested number of pages
were reclaimed from unbalanced zones for a given priority. If any zone is
currently balanced, kswapd will not call compaction as it is expected the
necessary pages are already available.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/vmscan.c | 60 ++++++++++++++++++++++++++++++------------------------------
 1 file changed, 30 insertions(+), 30 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 78268ca..a9e68b4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2640,7 +2640,8 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
  */
 static bool kswapd_shrink_zone(struct zone *zone,
 			       struct scan_control *sc,
-			       unsigned long lru_pages)
+			       unsigned long lru_pages,
+			       unsigned long *nr_attempted)
 {
 	unsigned long nr_slab;
 	struct reclaim_state *reclaim_state = current->reclaim_state;
@@ -2656,6 +2657,9 @@ static bool kswapd_shrink_zone(struct zone *zone,
 	nr_slab = shrink_slab(&shrink, sc->nr_scanned, lru_pages);
 	sc->nr_reclaimed += reclaim_state->reclaimed_slab;
 
+	/* Account for the number of pages attempted to reclaim */
+	*nr_attempted += sc->nr_to_reclaim;
+
 	if (nr_slab == 0 && !zone_reclaimable(zone))
 		zone->all_unreclaimable = 1;
 
@@ -2703,8 +2707,11 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 
 	do {
 		unsigned long lru_pages = 0;
+		unsigned long nr_attempted = 0;
 		unsigned long nr_reclaimed = sc.nr_reclaimed = 0;
+		unsigned long this_reclaimed;
 		bool raise_priority = true;
+		bool pgdat_needs_compaction = (order > 0);
 
 		/*
 		 * Scan in the highmem->dma direction for the highest
@@ -2752,7 +2759,21 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 		for (i = 0; i <= end_zone; i++) {
 			struct zone *zone = pgdat->node_zones + i;
 
+			if (!populated_zone(zone))
+				continue;
+
 			lru_pages += zone_reclaimable_pages(zone);
+
+			/*
+			 * If any zone is currently balanced then kswapd will
+			 * not call compaction as it is expected that the
+			 * necessary pages are already available.
+			 */
+			if (pgdat_needs_compaction &&
+					zone_watermark_ok(zone, order,
+						low_wmark_pages(zone),
+						*classzone_idx, 0))
+				pgdat_needs_compaction = false;
 		}
 
 		/*
@@ -2821,7 +2842,8 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 				 * already being scanned that high
 				 * watermark would be met at 100% efficiency.
 				 */
-				if (kswapd_shrink_zone(zone, &sc, lru_pages))
+				if (kswapd_shrink_zone(zone, &sc, lru_pages,
+						       &nr_attempted))
 					raise_priority = false;
 			}
 
@@ -2873,42 +2895,20 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 		if (try_to_freeze() || kthread_should_stop())
 			break;
 
+		/* Compact if necessary and kswapd is reclaiming efficiently */
+		this_reclaimed = sc.nr_reclaimed - nr_reclaimed;
+		if (pgdat_needs_compaction && this_reclaimed > nr_attempted)
+			compact_pgdat(pgdat, order);
+
 		/*
 		 * Raise priority if scanning rate is too low or there was no
 		 * progress in reclaiming pages
 		 */
-		if (raise_priority || sc.nr_reclaimed - nr_reclaimed == 0)
+		if (raise_priority || !this_reclaimed)
 			sc.priority--;
 	} while (sc.priority >= 0 &&
 		 !pgdat_balanced(pgdat, order, *classzone_idx));
 
-	/*
-	 * If kswapd was reclaiming at a higher order, it has the option of
-	 * sleeping without all zones being balanced. Before it does, it must
-	 * ensure that the watermarks for order-0 on *all* zones are met and
-	 * that the congestion flags are cleared. The congestion flag must
-	 * be cleared as kswapd is the only mechanism that clears the flag
-	 * and it is potentially going to sleep here.
-	 */
-	if (order) {
-		int zones_need_compaction = 1;
-
-		for (i = 0; i <= end_zone; i++) {
-			struct zone *zone = pgdat->node_zones + i;
-
-			if (!populated_zone(zone))
-				continue;
-
-			/* Check if the memory needs to be defragmented. */
-			if (zone_watermark_ok(zone, order,
-				    low_wmark_pages(zone), *classzone_idx, 0))
-				zones_need_compaction = 0;
-		}
-
-		if (zones_need_compaction)
-			compact_pgdat(pgdat, order);
-	}
-
 out:
 	/*
 	 * Return the order we were reclaiming at so prepare_kswapd_sleep()
-- 
1.8.1.4
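
As a reading aid, the compaction decision in this patch reduces to two predicates. The userspace sketch below uses hypothetical names and a plain bool array in place of per-zone zone_watermark_ok() results, so it is an assumption-laden model rather than the kernel's code. Note that it follows the diff, which compares with a strict this_reclaimed > nr_attempted, whereas the changelog text says "at least".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Compaction is only a candidate for high-order requests, and only while
 * no populated zone already meets its low watermark at that order.
 * zone_ok[i] stands in for the zone_watermark_ok() result of zone i.
 */
static bool model_needs_compaction(int order, const bool *zone_ok,
				   size_t nr_zones)
{
	size_t i;

	assert(order >= 0);

	if (order == 0)
		return false;

	for (i = 0; i < nr_zones; i++)
		if (zone_ok[i])
			return false;	/* pages are expected to be available */
	return true;
}

/* Compact only when the pass reclaimed more than it set out to reclaim. */
static bool model_should_compact(bool needs_compaction,
				 unsigned long this_reclaimed,
				 unsigned long nr_attempted)
{
	return needs_compaction && this_reclaimed > nr_attempted;
}
```

A caller would evaluate model_needs_compaction() once per pass before shrinking zones and model_should_compact() after the pass, which matches where the diff sets pgdat_needs_compaction and where it calls compact_pgdat().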


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 05/10] mm: vmscan: Do not allow kswapd to scan at maximum priority
  2013-04-09 11:06 ` Mel Gorman
@ 2013-04-09 11:07   ` Mel Gorman
  -1 siblings, 0 replies; 83+ messages in thread
From: Mel Gorman @ 2013-04-09 11:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jiri Slaby, Valdis Kletnieks, Rik van Riel, Zlatko Calusic,
	Johannes Weiner, dormando, Satoru Moriya, Michal Hocko, Linux-MM,
	LKML, Mel Gorman

Page reclaim at priority 0 will scan the entire LRU as priority 0 is
considered to be a near OOM condition. Kswapd can reach priority 0 quite
easily if it is encountering a large number of pages it cannot reclaim
such as pages under writeback. When this happens, kswapd reclaims very
aggressively even though there may be no real risk of allocation failure
or OOM.

This patch prevents kswapd reaching priority 0 and trying to reclaim
the world. Direct reclaimers will still reach priority 0 in the event
of an OOM situation.
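
As a rough illustration of why the loop floor matters, the sketch below
models the scan target as lru_size >> priority and walks the priority
down for a reclaimer that never makes progress. It is a user-space
model, not kernel code: scan_target() and lowest_priority_scanned() are
hypothetical names, and the only detail taken from mm/vmscan.c is
DEF_PRIORITY being 12.

```c
#include <assert.h>

#define DEF_PRIORITY 12	/* default scan priority, as in mm/vmscan.c */

/*
 * Model of the per-pass scan target: reclaim scans roughly
 * lru_size >> priority pages, so priority 0 means the whole list.
 */
static unsigned long scan_target(unsigned long lru_size, int priority)
{
	assert(priority >= 0 && priority <= DEF_PRIORITY);
	return lru_size >> priority;
}

/*
 * Hypothetical helper: the lowest priority a reclaimer that never makes
 * progress ends up scanning at, given the floor used in its loop
 * condition (0 for kswapd before this patch, 1 after it).
 */
static int lowest_priority_scanned(int floor)
{
	int priority = DEF_PRIORITY;
	int lowest;

	do {
		lowest = priority;	/* a pass runs at this priority */
		priority--;		/* no progress: raise pressure */
	} while (priority >= floor);

	return lowest;
}
```

Under this model a floor of 1 still lets kswapd scan half of the LRU in
a single pass, but never the entire list.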

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
---
 mm/vmscan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index a9e68b4..3d8b80a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2906,7 +2906,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 		 */
 		if (raise_priority || !this_reclaimed)
 			sc.priority--;
-	} while (sc.priority >= 0 &&
+	} while (sc.priority >= 1 &&
 		 !pgdat_balanced(pgdat, order, *classzone_idx));
 
 out:
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 06/10] mm: vmscan: Have kswapd writeback pages based on dirty pages encountered, not priority
  2013-04-09 11:06 ` Mel Gorman
@ 2013-04-09 11:07   ` Mel Gorman
  -1 siblings, 0 replies; 83+ messages in thread
From: Mel Gorman @ 2013-04-09 11:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jiri Slaby, Valdis Kletnieks, Rik van Riel, Zlatko Calusic,
	Johannes Weiner, dormando, Satoru Moriya, Michal Hocko, Linux-MM,
	LKML, Mel Gorman

Currently kswapd queues dirty pages for writeback if scanning at an elevated
priority but the priority kswapd scans at is not related to the number
of unqueued dirty pages encountered.  Since commit "mm: vmscan: Flatten kswapd
priority loop", the priority is related to the size of the LRU and the
zone watermark which is no indication as to whether kswapd should write
pages or not.

This patch tracks if an excessive number of unqueued dirty pages are being
encountered at the end of the LRU.  If so, it indicates that dirty pages
are being recycled before flusher threads can clean them and flags the
zone so that kswapd will start writing pages until the zone is balanced.
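
The flagging rule can be sketched in user space as below. This is a
simplified model under stated assumptions: struct zone_model and the
three helpers are hypothetical stand-ins for the zone flag bit and the
call sites in the diff, and the global_reclaim() check is left out.
The threshold expression is the one the patch adds to
shrink_inactive_list().

```c
#include <assert.h>
#include <stdbool.h>

#define DEF_PRIORITY 12

/* Stand-in for the zone flag word; only the bit used here is modelled. */
struct zone_model {
	bool tail_lru_dirty;	/* models ZONE_TAIL_LRU_DIRTY */
};

/*
 * Flag the zone when the unqueued dirty pages in the isolated batch
 * reach nr_taken >> (DEF_PRIORITY - priority).
 */
static void note_unqueued_dirty(struct zone_model *zone,
				unsigned long nr_unqueued_dirty,
				unsigned long nr_taken, int priority)
{
	assert(priority >= 0 && priority <= DEF_PRIORITY);
	if (nr_unqueued_dirty &&
	    nr_unqueued_dirty >= (nr_taken >> (DEF_PRIORITY - priority)))
		zone->tail_lru_dirty = true;
}

/* kswapd may write a dirty file page only while the zone is flagged. */
static bool kswapd_may_write_file_page(const struct zone_model *zone)
{
	return zone->tail_lru_dirty;
}

/* Models balance_pgdat() clearing the flag once the zone is balanced. */
static void zone_balanced(struct zone_model *zone)
{
	zone->tail_lru_dirty = false;
}
```

At DEF_PRIORITY the whole isolated batch has to be unqueued dirty pages
before the flag is set; each step down in priority halves that
threshold.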

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mmzone.h |  9 +++++++++
 mm/vmscan.c            | 31 +++++++++++++++++++++++++------
 2 files changed, 34 insertions(+), 6 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index c74092e..ecf0c7d 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -495,6 +495,10 @@ typedef enum {
 	ZONE_CONGESTED,			/* zone has many dirty pages backed by
 					 * a congested BDI
 					 */
+	ZONE_TAIL_LRU_DIRTY,		/* reclaim scanning has recently found
+					 * many dirty file pages at the tail
+					 * of the LRU.
+					 */
 } zone_flags_t;
 
 static inline void zone_set_flag(struct zone *zone, zone_flags_t flag)
@@ -517,6 +521,11 @@ static inline int zone_is_reclaim_congested(const struct zone *zone)
 	return test_bit(ZONE_CONGESTED, &zone->flags);
 }
 
+static inline int zone_is_reclaim_dirty(const struct zone *zone)
+{
+	return test_bit(ZONE_TAIL_LRU_DIRTY, &zone->flags);
+}
+
 static inline int zone_is_reclaim_locked(const struct zone *zone)
 {
 	return test_bit(ZONE_RECLAIM_LOCKED, &zone->flags);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3d8b80a..53d5006 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -675,13 +675,14 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 				      struct zone *zone,
 				      struct scan_control *sc,
 				      enum ttu_flags ttu_flags,
-				      unsigned long *ret_nr_dirty,
+				      unsigned long *ret_nr_unqueued_dirty,
 				      unsigned long *ret_nr_writeback,
 				      bool force_reclaim)
 {
 	LIST_HEAD(ret_pages);
 	LIST_HEAD(free_pages);
 	int pgactivate = 0;
+	unsigned long nr_unqueued_dirty = 0;
 	unsigned long nr_dirty = 0;
 	unsigned long nr_congested = 0;
 	unsigned long nr_reclaimed = 0;
@@ -807,14 +808,17 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		if (PageDirty(page)) {
 			nr_dirty++;
 
+			if (!PageWriteback(page))
+				nr_unqueued_dirty++;
+
 			/*
 			 * Only kswapd can writeback filesystem pages to
-			 * avoid risk of stack overflow but do not writeback
-			 * unless under significant pressure.
+			 * avoid risk of stack overflow but only writeback
+			 * if many dirty pages have been encountered.
 			 */
 			if (page_is_file_cache(page) &&
 					(!current_is_kswapd() ||
-					 sc->priority >= DEF_PRIORITY - 2)) {
+					 !zone_is_reclaim_dirty(zone))) {
 				/*
 				 * Immediately reclaim when written back.
 				 * Similar in principal to deactivate_page()
@@ -959,7 +963,7 @@ keep:
 	list_splice(&ret_pages, page_list);
 	count_vm_events(PGACTIVATE, pgactivate);
 	mem_cgroup_uncharge_end();
-	*ret_nr_dirty += nr_dirty;
+	*ret_nr_unqueued_dirty += nr_unqueued_dirty;
 	*ret_nr_writeback += nr_writeback;
 	return nr_reclaimed;
 }
@@ -1372,6 +1376,15 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 			(nr_taken >> (DEF_PRIORITY - sc->priority)))
 		wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
 
+	/*
+	 * Similarly, if many dirty pages are encountered that are not
+	 * currently being written then flag that kswapd should start
+	 * writing back pages.
+	 */
+	if (global_reclaim(sc) && nr_dirty &&
+			nr_dirty >= (nr_taken >> (DEF_PRIORITY - sc->priority)))
+		zone_set_flag(zone, ZONE_TAIL_LRU_DIRTY);
+
 	trace_mm_vmscan_lru_shrink_inactive(zone->zone_pgdat->node_id,
 		zone_idx(zone),
 		nr_scanned, nr_reclaimed,
@@ -2748,8 +2761,12 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 				end_zone = i;
 				break;
 			} else {
-				/* If balanced, clear the congested flag */
+				/*
+				 * If balanced, clear the dirty and congested
+				 * flags
+				 */
 				zone_clear_flag(zone, ZONE_CONGESTED);
+				zone_clear_flag(zone, ZONE_TAIL_LRU_DIRTY);
 			}
 		}
 
@@ -2867,8 +2884,10 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 				 * possible there are dirty pages backed by
 				 * congested BDIs but as pressure is relieved,
 				 * speculatively avoid congestion waits
+				 * or writing pages from kswapd context.
 				 */
 				zone_clear_flag(zone, ZONE_CONGESTED);
+				zone_clear_flag(zone, ZONE_TAIL_LRU_DIRTY);
 		}
 
 		/*
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 07/10] mm: vmscan: Block kswapd if it is encountering pages under writeback
  2013-04-09 11:06 ` Mel Gorman
@ 2013-04-09 11:07   ` Mel Gorman
  -1 siblings, 0 replies; 83+ messages in thread
From: Mel Gorman @ 2013-04-09 11:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jiri Slaby, Valdis Kletnieks, Rik van Riel, Zlatko Calusic,
	Johannes Weiner, dormando, Satoru Moriya, Michal Hocko, Linux-MM,
	LKML, Mel Gorman

Historically, kswapd used to congestion_wait() at higher priorities if it
was not making forward progress. This made no sense as the failure to make
progress could be completely independent of IO. It was later replaced by
wait_iff_congested() and removed entirely by commit 258401a6 (mm: don't
wait on congested zones in balance_pgdat()) as it was duplicating logic
in shrink_inactive_list().

This is problematic. If kswapd encounters many pages under writeback and
it continues to scan until it reaches the high watermark then it will
quickly skip over the pages under writeback and reclaim clean young
pages or push applications out to swap.

The use of wait_iff_congested() is not suited to kswapd as it will only
stall if the underlying BDI is really congested or a direct reclaimer was
unable to write to the underlying BDI. kswapd bypasses the BDI congestion
as it sets PF_SWAPWRITE but even if this was taken into account then it
would cause direct reclaimers to stall on writeback which is not desirable.

This patch sets a ZONE_WRITEBACK flag if direct reclaim or kswapd is
encountering too many pages under writeback. If this flag is set and
kswapd encounters a PageReclaim page under writeback then it'll assume
that the LRU lists are being recycled too quickly before IO can complete
and block waiting for some IO to complete.
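
The resulting decision for a page found under writeback can be
summarised with the following user-space sketch. It only restates the
branch conditions from the diff below; struct writeback_ctx, enum
writeback_action and classify_writeback_page() are hypothetical names
invented for illustration, and each field stands in for the kernel
predicate named in its comment.

```c
#include <assert.h>
#include <stdbool.h>

enum writeback_action {
	WB_WAIT_THEN_RECHECK,	/* case 1: wait, then clear ZONE_WRITEBACK */
	WB_MARK_AND_SKIP,	/* case 2: SetPageReclaim() and keep scanning */
	WB_WAIT,		/* case 3: memcg reclaim waits for the IO */
};

/* Hypothetical bundle of the predicates the real code tests in place. */
struct writeback_ctx {
	bool is_kswapd;		/* current_is_kswapd() */
	bool page_reclaim;	/* PageReclaim(page) */
	bool zone_writeback;	/* zone_is_reclaim_writeback(zone) */
	bool global_reclaim;	/* global_reclaim(sc) */
	bool gfp_io;		/* sc->gfp_mask & __GFP_IO */
};

static enum writeback_action
classify_writeback_page(const struct writeback_ctx *c)
{
	assert(c);

	if (c->is_kswapd && c->page_reclaim && c->zone_writeback)
		return WB_WAIT_THEN_RECHECK;
	if (c->global_reclaim || !c->page_reclaim || !c->gfp_io)
		return WB_MARK_AND_SKIP;
	return WB_WAIT;
}
```

Read this way, the final branch is only reachable for memcg reclaim with
__GFP_IO set on a page that already has PageReclaim set.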

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
---
 include/linux/mmzone.h |  8 ++++++
 mm/vmscan.c            | 78 ++++++++++++++++++++++++++++++++++++--------------
 2 files changed, 64 insertions(+), 22 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index ecf0c7d..264e203 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -499,6 +499,9 @@ typedef enum {
 					 * many dirty file pages at the tail
 					 * of the LRU.
 					 */
+	ZONE_WRITEBACK,			/* reclaim scanning has recently found
+					 * many pages under writeback
+					 */
 } zone_flags_t;
 
 static inline void zone_set_flag(struct zone *zone, zone_flags_t flag)
@@ -526,6 +529,11 @@ static inline int zone_is_reclaim_dirty(const struct zone *zone)
 	return test_bit(ZONE_TAIL_LRU_DIRTY, &zone->flags);
 }
 
+static inline int zone_is_reclaim_writeback(const struct zone *zone)
+{
+	return test_bit(ZONE_WRITEBACK, &zone->flags);
+}
+
 static inline int zone_is_reclaim_locked(const struct zone *zone)
 {
 	return test_bit(ZONE_RECLAIM_LOCKED, &zone->flags);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 53d5006..9fa72f7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -723,25 +723,51 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
 			(PageSwapCache(page) && (sc->gfp_mask & __GFP_IO));
 
+		/*
+		 * If a page at the tail of the LRU is under writeback, there
+		 * are three cases to consider.
+		 *
+		 * 1) If reclaim is encountering an excessive number of pages
+		 *    under writeback and this page is both under writeback and
+		 *    PageReclaim then it indicates that pages are being queued
+		 *    for IO but are being recycled through the LRU before the
+		 *    IO can complete. In this case, wait on the IO to complete
+		 *    and then clear the ZONE_WRITEBACK flag to recheck if the
+		 *    condition exists.
+		 *
+		 * 2) Global reclaim encounters a page, memcg encounters a
+		 *    page that is not marked for immediate reclaim or
+		 *    the caller does not have __GFP_IO. In this case mark
+		 *    the page for immediate reclaim and continue scanning.
+		 *
+		 *    __GFP_IO is checked  because a loop driver thread might
+		 *    enter reclaim, and deadlock if it waits on a page for
+		 *    which it is needed to do the write (loop masks off
+		 *    __GFP_IO|__GFP_FS for this reason); but more thought
+		 *    would probably show more reasons.
+		 *
+		 *    Don't require __GFP_FS, since we're not going into the
+		 *    FS, just waiting on its writeback completion. Worryingly,
+		 *    ext4 gfs2 and xfs allocate pages with
+		 *    grab_cache_page_write_begin(,,AOP_FLAG_NOFS), so testing
+		 *    may_enter_fs here is liable to OOM on them.
+		 *
+		 * 3) memcg encounters a page that is already marked
+		 *    PageReclaim. memcg does not have any dirty pages
+		 *    throttling so we could easily OOM just because too many
+		 *    pages are in writeback and there is nothing else to
+		 *    reclaim. Wait for the writeback to complete.
+		 */
 		if (PageWriteback(page)) {
-			/*
-			 * memcg doesn't have any dirty pages throttling so we
-			 * could easily OOM just because too many pages are in
-			 * writeback and there is nothing else to reclaim.
-			 *
-			 * Check __GFP_IO, certainly because a loop driver
-			 * thread might enter reclaim, and deadlock if it waits
-			 * on a page for which it is needed to do the write
-			 * (loop masks off __GFP_IO|__GFP_FS for this reason);
-			 * but more thought would probably show more reasons.
-			 *
-			 * Don't require __GFP_FS, since we're not going into
-			 * the FS, just waiting on its writeback completion.
-			 * Worryingly, ext4 gfs2 and xfs allocate pages with
-			 * grab_cache_page_write_begin(,,AOP_FLAG_NOFS), so
-			 * testing may_enter_fs here is liable to OOM on them.
-			 */
-			if (global_reclaim(sc) ||
+			/* Case 1 above */
+			if (current_is_kswapd() &&
+			    PageReclaim(page) &&
+			    zone_is_reclaim_writeback(zone)) {
+				wait_on_page_writeback(page);
+				zone_clear_flag(zone, ZONE_WRITEBACK);
+
+			/* Case 2 above */
+			} else if (global_reclaim(sc) ||
 			    !PageReclaim(page) || !(sc->gfp_mask & __GFP_IO)) {
 				/*
 				 * This is slightly racy - end_page_writeback()
@@ -756,9 +782,13 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 				 */
 				SetPageReclaim(page);
 				nr_writeback++;
+
 				goto keep_locked;
+
+			/* Case 3 above */
+			} else {
+				wait_on_page_writeback(page);
 			}
-			wait_on_page_writeback(page);
 		}
 
 		if (!force_reclaim)
@@ -1373,8 +1403,10 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	 *                     isolated page is PageWriteback
 	 */
 	if (nr_writeback && nr_writeback >=
-			(nr_taken >> (DEF_PRIORITY - sc->priority)))
+			(nr_taken >> (DEF_PRIORITY - sc->priority))) {
 		wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
+		zone_set_flag(zone, ZONE_WRITEBACK);
+	}
 
 	/*
 	 * Similarly, if many dirty pages are encountered that are not
@@ -2648,8 +2680,8 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
  * the high watermark.
  *
  * Returns true if kswapd scanned at least the requested number of pages to
- * reclaim. This is used to determine if the scanning priority needs to be
- * raised.
+ * reclaim or if the lack of progress was due to pages under writeback.
+ * This is used to determine if the scanning priority needs to be raised.
  */
 static bool kswapd_shrink_zone(struct zone *zone,
 			       struct scan_control *sc,
@@ -2676,6 +2708,8 @@ static bool kswapd_shrink_zone(struct zone *zone,
 	if (nr_slab == 0 && !zone_reclaimable(zone))
 		zone->all_unreclaimable = 1;
 
+	zone_clear_flag(zone, ZONE_WRITEBACK);
+
 	return sc->nr_scanned >= sc->nr_to_reclaim;
 }
 
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 08/10] mm: vmscan: Have kswapd shrink slab only once per priority
  2013-04-09 11:06 ` Mel Gorman
@ 2013-04-09 11:07   ` Mel Gorman
  -1 siblings, 0 replies; 83+ messages in thread
From: Mel Gorman @ 2013-04-09 11:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jiri Slaby, Valdis Kletnieks, Rik van Riel, Zlatko Calusic,
	Johannes Weiner, dormando, Satoru Moriya, Michal Hocko, Linux-MM,
	LKML, Mel Gorman

If kswapd fails to make progress but continues to shrink slab then it'll
either discard all of slab or consume CPU uselessly scanning shrinkers.
This patch causes kswapd to only call the shrinkers once per priority.
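
A minimal user-space model of the "once per priority" rule is sketched
below. count_slab_shrinks() is a hypothetical helper, not a kernel
function; it assumes each pass either raises the priority or does not,
and it ignores the !zone_reclaimable() case in which the patch still
calls the shrinkers.

```c
#include <assert.h>
#include <stdbool.h>

#define DEF_PRIORITY 12

/*
 * raise_priority[i] is true when pass i ends with the priority being
 * decremented. Returns how many passes would call the shrinkers.
 */
static int count_slab_shrinks(const bool *raise_priority, int nr_passes)
{
	int priority = DEF_PRIORITY;
	bool shrinking_slab = true;
	int shrinks = 0;

	assert(nr_passes >= 0);

	for (int i = 0; i < nr_passes && priority >= 1; i++) {
		if (shrinking_slab)
			shrinks++;		/* shrink_slab() would run */

		shrinking_slab = false;		/* only once per priority */

		if (raise_priority[i]) {
			priority--;
			shrinking_slab = true;	/* new priority: allow again */
		}
	}

	return shrinks;
}
```

With this model, any number of passes that stay at one priority cost a
single round of shrinker calls, and another round is only paid after
the priority changes.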

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Rik van Riel <riel@redhat.com>
---
 mm/vmscan.c | 28 +++++++++++++++++++++-------
 1 file changed, 21 insertions(+), 7 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9fa72f7..c929d1e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2686,9 +2686,10 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
 static bool kswapd_shrink_zone(struct zone *zone,
 			       struct scan_control *sc,
 			       unsigned long lru_pages,
+			       bool shrinking_slab,
 			       unsigned long *nr_attempted)
 {
-	unsigned long nr_slab;
+	unsigned long nr_slab = 0;
 	struct reclaim_state *reclaim_state = current->reclaim_state;
 	struct shrink_control shrink = {
 		.gfp_mask = sc->gfp_mask,
@@ -2698,9 +2699,15 @@ static bool kswapd_shrink_zone(struct zone *zone,
 	sc->nr_to_reclaim = max(SWAP_CLUSTER_MAX, high_wmark_pages(zone));
 	shrink_zone(zone, sc);
 
-	reclaim_state->reclaimed_slab = 0;
-	nr_slab = shrink_slab(&shrink, sc->nr_scanned, lru_pages);
-	sc->nr_reclaimed += reclaim_state->reclaimed_slab;
+	/*
+	 * Slabs are shrunk for each zone once per priority or if the zone
+	 * being balanced is otherwise unreclaimable
+	 */
+	if (shrinking_slab || !zone_reclaimable(zone)) {
+		reclaim_state->reclaimed_slab = 0;
+		nr_slab = shrink_slab(&shrink, sc->nr_scanned, lru_pages);
+		sc->nr_reclaimed += reclaim_state->reclaimed_slab;
+	}
 
 	/* Account for the number of pages attempted to reclaim */
 	*nr_attempted += sc->nr_to_reclaim;
@@ -2741,6 +2748,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 	int end_zone = 0;	/* Inclusive.  0 = ZONE_DMA */
 	unsigned long nr_soft_reclaimed;
 	unsigned long nr_soft_scanned;
+	bool shrinking_slab = true;
 	struct scan_control sc = {
 		.gfp_mask = GFP_KERNEL,
 		.priority = DEF_PRIORITY,
@@ -2893,8 +2901,9 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 				 * already being scanned that high
 				 * watermark would be met at 100% efficiency.
 				 */
-				if (kswapd_shrink_zone(zone, &sc, lru_pages,
-						       &nr_attempted))
+				if (kswapd_shrink_zone(zone, &sc,
+						lru_pages, shrinking_slab,
+						&nr_attempted))
 					raise_priority = false;
 			}
 
@@ -2933,6 +2942,9 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 				pfmemalloc_watermark_ok(pgdat))
 			wake_up(&pgdat->pfmemalloc_wait);
 
+		/* Only shrink slab once per priority */
+		shrinking_slab = false;
+
 		/*
 		 * Fragmentation may mean that the system cannot be rebalanced
 		 * for high-order allocations in all zones. If twice the
@@ -2957,8 +2969,10 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 		 * Raise priority if scanning rate is too low or there was no
 		 * progress in reclaiming pages
 		 */
-		if (raise_priority || !this_reclaimed)
+		if (raise_priority || !this_reclaimed) {
 			sc.priority--;
+			shrinking_slab = true;
+		}
 	} while (sc.priority >= 1 &&
 		 !pgdat_balanced(pgdat, order, *classzone_idx));
 
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 09/10] mm: vmscan: Check if kswapd should writepage once per pgdat scan
  2013-04-09 11:06 ` Mel Gorman
@ 2013-04-09 11:07   ` Mel Gorman
  -1 siblings, 0 replies; 83+ messages in thread
From: Mel Gorman @ 2013-04-09 11:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jiri Slaby, Valdis Kletnieks, Rik van Riel, Zlatko Calusic,
	Johannes Weiner, dormando, Satoru Moriya, Michal Hocko, Linux-MM,
	LKML, Mel Gorman

Currently kswapd checks if it should start writepage as it shrinks
each zone without taking into consideration if the zone is balanced or
not. This is not wrong as such but it does not make much sense either.
This patch checks once per pgdat scan if kswapd should be writing pages.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Rik van Riel <riel@redhat.com>
---
 mm/vmscan.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index c929d1e..6cd6435 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2836,6 +2836,13 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 		}
 
 		/*
+		 * If we're getting trouble reclaiming, start doing writepage
+		 * even in laptop mode.
+		 */
+		if (sc.priority < DEF_PRIORITY - 2)
+			sc.may_writepage = 1;
+
+		/*
 		 * Now scan the zone in the dma->highmem direction, stopping
 		 * at the last zone which needs scanning.
 		 *
@@ -2907,13 +2914,6 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 					raise_priority = false;
 			}
 
-			/*
-			 * If we're getting trouble reclaiming, start doing
-			 * writepage even in laptop mode.
-			 */
-			if (sc.priority < DEF_PRIORITY - 2)
-				sc.may_writepage = 1;
-
 			if (zone->all_unreclaimable) {
 				if (end_zone && end_zone == i)
 					end_zone--;
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 10/10] mm: vmscan: Move logic from balance_pgdat() to kswapd_shrink_zone()
  2013-04-09 11:06 ` Mel Gorman
@ 2013-04-09 11:07   ` Mel Gorman
  -1 siblings, 0 replies; 83+ messages in thread
From: Mel Gorman @ 2013-04-09 11:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jiri Slaby, Valdis Kletnieks, Rik van Riel, Zlatko Calusic,
	Johannes Weiner, dormando, Satoru Moriya, Michal Hocko, Linux-MM,
	LKML, Mel Gorman

balance_pgdat() is very long and some of the logic can and should
be internal to kswapd_shrink_zone(). Move it so the flow of
balance_pgdat() is marginally easier to follow.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/vmscan.c | 112 +++++++++++++++++++++++++++++-------------------------------
 1 file changed, 55 insertions(+), 57 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6cd6435..00024d8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2684,19 +2684,54 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
  * This is used to determine if the scanning priority needs to be raised.
  */
 static bool kswapd_shrink_zone(struct zone *zone,
+			       int classzone_idx,
 			       struct scan_control *sc,
 			       unsigned long lru_pages,
 			       bool shrinking_slab,
 			       unsigned long *nr_attempted)
 {
+	int testorder = sc->order;
 	unsigned long nr_slab = 0;
+	unsigned long balance_gap;
 	struct reclaim_state *reclaim_state = current->reclaim_state;
 	struct shrink_control shrink = {
 		.gfp_mask = sc->gfp_mask,
 	};
+	bool lowmem_pressure;
 
 	/* Reclaim above the high watermark. */
 	sc->nr_to_reclaim = max(SWAP_CLUSTER_MAX, high_wmark_pages(zone));
+
+	/*
+	 * Kswapd reclaims only single pages with compaction enabled. Trying
+	 * too hard to reclaim until contiguous free pages have become
+	 * available can hurt performance by evicting too much useful data
+	 * from memory. Do not reclaim more than needed for compaction.
+	 */
+	if (IS_ENABLED(CONFIG_COMPACTION) && sc->order &&
+			compaction_suitable(zone, sc->order) !=
+				COMPACT_SKIPPED)
+		testorder = 0;
+
+	/*
+	 * We put equal pressure on every zone, unless one zone has way too
+	 * many pages free already. The "too many pages" is defined as the
+	 * high wmark plus a "gap" where the gap is either the low
+	 * watermark or 1% of the zone, whichever is smaller.
+	 */
+	balance_gap = min(low_wmark_pages(zone),
+		(zone->managed_pages + KSWAPD_ZONE_BALANCE_GAP_RATIO-1) /
+		KSWAPD_ZONE_BALANCE_GAP_RATIO);
+
+	/*
+	 * If there is no low memory pressure or the zone is balanced then no
+	 * reclaim is necessary
+	 */
+	lowmem_pressure = (buffer_heads_over_limit && is_highmem(zone));
+	if (!lowmem_pressure && zone_balanced(zone, testorder,
+						balance_gap, classzone_idx))
+		return true;
+
 	shrink_zone(zone, sc);
 
 	/*
@@ -2717,6 +2752,18 @@ static bool kswapd_shrink_zone(struct zone *zone,
 
 	zone_clear_flag(zone, ZONE_WRITEBACK);
 
+	/*
+	 * If a zone reaches its high watermark, consider it to be no longer
+	 * congested. It's possible there are dirty pages backed by congested
+	 * BDIs but as pressure is relieved, speculatively avoid congestion
+	 * waits.
+	 */
+	if (!zone->all_unreclaimable &&
+	    zone_balanced(zone, testorder, 0, classzone_idx)) {
+		zone_clear_flag(zone, ZONE_CONGESTED);
+		zone_clear_flag(zone, ZONE_TAIL_LRU_DIRTY);
+	}
+
 	return sc->nr_scanned >= sc->nr_to_reclaim;
 }
 
@@ -2853,8 +2900,6 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 		 */
 		for (i = 0; i <= end_zone; i++) {
 			struct zone *zone = pgdat->node_zones + i;
-			int testorder;
-			unsigned long balance_gap;
 
 			if (!populated_zone(zone))
 				continue;
@@ -2875,62 +2920,15 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 			sc.nr_reclaimed += nr_soft_reclaimed;
 
 			/*
-			 * We put equal pressure on every zone, unless
-			 * one zone has way too many pages free
-			 * already. The "too many pages" is defined
-			 * as the high wmark plus a "gap" where the
-			 * gap is either the low watermark or 1%
-			 * of the zone, whichever is smaller.
-			 */
-			balance_gap = min(low_wmark_pages(zone),
-				(zone->managed_pages +
-					KSWAPD_ZONE_BALANCE_GAP_RATIO-1) /
-				KSWAPD_ZONE_BALANCE_GAP_RATIO);
-			/*
-			 * Kswapd reclaims only single pages with compaction
-			 * enabled. Trying too hard to reclaim until contiguous
-			 * free pages have become available can hurt performance
-			 * by evicting too much useful data from memory.
-			 * Do not reclaim more than needed for compaction.
+			 * There should be no need to raise the scanning
+			 * priority if enough pages are already being scanned
+			 * that the high watermark would be met at 100%
+			 * efficiency.
 			 */
-			testorder = order;
-			if (IS_ENABLED(CONFIG_COMPACTION) && order &&
-					compaction_suitable(zone, order) !=
-						COMPACT_SKIPPED)
-				testorder = 0;
-
-			if ((buffer_heads_over_limit && is_highmem_idx(i)) ||
-			    !zone_balanced(zone, testorder,
-					   balance_gap, end_zone)) {
-				/*
-				 * There should be no need to raise the
-				 * scanning priority if enough pages are
-				 * already being scanned that high
-				 * watermark would be met at 100% efficiency.
-				 */
-				if (kswapd_shrink_zone(zone, &sc,
-						lru_pages, shrinking_slab,
-						&nr_attempted))
-					raise_priority = false;
-			}
-
-			if (zone->all_unreclaimable) {
-				if (end_zone && end_zone == i)
-					end_zone--;
-				continue;
-			}
-
-			if (zone_balanced(zone, testorder, 0, end_zone))
-				/*
-				 * If a zone reaches its high watermark,
-				 * consider it to be no longer congested. It's
-				 * possible there are dirty pages backed by
-				 * congested BDIs but as pressure is relieved,
-				 * speculatively avoid congestion waits
-				 * or writing pages from kswapd context.
-				 */
-				zone_clear_flag(zone, ZONE_CONGESTED);
-				zone_clear_flag(zone, ZONE_TAIL_LRU_DIRTY);
+			if (kswapd_shrink_zone(zone, end_zone, &sc,
+					lru_pages, shrinking_slab,
+					&nr_attempted))
+				raise_priority = false;
 		}
 
 		/*
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* Re: [PATCH 01/10] mm: vmscan: Limit the number of pages kswapd reclaims at each priority
  2013-04-09 11:06   ` Mel Gorman
@ 2013-04-09 13:27     ` Michal Hocko
  -1 siblings, 0 replies; 83+ messages in thread
From: Michal Hocko @ 2013-04-09 13:27 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Jiri Slaby, Valdis Kletnieks, Rik van Riel,
	Zlatko Calusic, Johannes Weiner, dormando, Satoru Moriya,
	Linux-MM, LKML

On Tue 09-04-13 12:06:56, Mel Gorman wrote:
> The number of pages kswapd can reclaim is bound by the number of pages it
> scans which is related to the size of the zone and the scanning priority. In
> many cases the priority remains low because it's reset every SWAP_CLUSTER_MAX
> reclaimed pages but in the event kswapd scans a large number of pages it
> cannot reclaim, it will raise the priority and potentially discard a large
> percentage of the zone as sc->nr_to_reclaim is ULONG_MAX. The user-visible
> effect is a reclaim "spike" where a large percentage of memory is suddenly
> freed. It would be bad enough if this was just unused memory but because
> of how anon/file pages are balanced it is possible that applications get
> pushed to swap unnecessarily.
> 
> This patch limits the number of pages kswapd will reclaim to the high
> watermark. Reclaim will still overshoot due to it not being a hard limit as
> shrink_lruvec() will ignore the sc.nr_to_reclaim at DEF_PRIORITY but it
> prevents kswapd reclaiming the world at higher priorities. The number of
> pages it reclaims is not adjusted for high-order allocations as kswapd will
> reclaim excessively if it is to balance zones for high-order allocations.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> Reviewed-by: Rik van Riel <riel@redhat.com>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Reviewed-by: Michal Hocko <mhocko@suse.cz>

> ---
>  mm/vmscan.c | 53 +++++++++++++++++++++++++++++------------------------
>  1 file changed, 29 insertions(+), 24 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 88c5fed..4835a7a 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2593,6 +2593,32 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
>  }
>  
>  /*
> + * kswapd shrinks the zone by the number of pages required to reach
> + * the high watermark.
> + */
> +static void kswapd_shrink_zone(struct zone *zone,
> +			       struct scan_control *sc,
> +			       unsigned long lru_pages)
> +{
> +	unsigned long nr_slab;
> +	struct reclaim_state *reclaim_state = current->reclaim_state;
> +	struct shrink_control shrink = {
> +		.gfp_mask = sc->gfp_mask,
> +	};
> +
> +	/* Reclaim above the high watermark. */
> +	sc->nr_to_reclaim = max(SWAP_CLUSTER_MAX, high_wmark_pages(zone));
> +	shrink_zone(zone, sc);
> +
> +	reclaim_state->reclaimed_slab = 0;
> +	nr_slab = shrink_slab(&shrink, sc->nr_scanned, lru_pages);
> +	sc->nr_reclaimed += reclaim_state->reclaimed_slab;
> +
> +	if (nr_slab == 0 && !zone_reclaimable(zone))
> +		zone->all_unreclaimable = 1;
> +}
> +
> +/*
>   * For kswapd, balance_pgdat() will work across all this node's zones until
>   * they are all at high_wmark_pages(zone).
>   *
> @@ -2619,27 +2645,16 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
>  	bool pgdat_is_balanced = false;
>  	int i;
>  	int end_zone = 0;	/* Inclusive.  0 = ZONE_DMA */
> -	unsigned long total_scanned;
> -	struct reclaim_state *reclaim_state = current->reclaim_state;
>  	unsigned long nr_soft_reclaimed;
>  	unsigned long nr_soft_scanned;
>  	struct scan_control sc = {
>  		.gfp_mask = GFP_KERNEL,
>  		.may_unmap = 1,
>  		.may_swap = 1,
> -		/*
> -		 * kswapd doesn't want to be bailed out while reclaim. because
> -		 * we want to put equal scanning pressure on each zone.
> -		 */
> -		.nr_to_reclaim = ULONG_MAX,
>  		.order = order,
>  		.target_mem_cgroup = NULL,
>  	};
> -	struct shrink_control shrink = {
> -		.gfp_mask = sc.gfp_mask,
> -	};
>  loop_again:
> -	total_scanned = 0;
>  	sc.priority = DEF_PRIORITY;
>  	sc.nr_reclaimed = 0;
>  	sc.may_writepage = !laptop_mode;
> @@ -2710,7 +2725,7 @@ loop_again:
>  		 */
>  		for (i = 0; i <= end_zone; i++) {
>  			struct zone *zone = pgdat->node_zones + i;
> -			int nr_slab, testorder;
> +			int testorder;
>  			unsigned long balance_gap;
>  
>  			if (!populated_zone(zone))
> @@ -2730,7 +2745,6 @@ loop_again:
>  							order, sc.gfp_mask,
>  							&nr_soft_scanned);
>  			sc.nr_reclaimed += nr_soft_reclaimed;
> -			total_scanned += nr_soft_scanned;
>  
>  			/*
>  			 * We put equal pressure on every zone, unless
> @@ -2759,17 +2773,8 @@ loop_again:
>  
>  			if ((buffer_heads_over_limit && is_highmem_idx(i)) ||
>  			    !zone_balanced(zone, testorder,
> -					   balance_gap, end_zone)) {
> -				shrink_zone(zone, &sc);
> -
> -				reclaim_state->reclaimed_slab = 0;
> -				nr_slab = shrink_slab(&shrink, sc.nr_scanned, lru_pages);
> -				sc.nr_reclaimed += reclaim_state->reclaimed_slab;
> -				total_scanned += sc.nr_scanned;
> -
> -				if (nr_slab == 0 && !zone_reclaimable(zone))
> -					zone->all_unreclaimable = 1;
> -			}
> +					   balance_gap, end_zone))
> +				kswapd_shrink_zone(zone, &sc, lru_pages);
>  
>  			/*
>  			 * If we're getting trouble reclaiming, start doing
> -- 
> 1.8.1.4
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/10] Reduce system disruption due to kswapd V2
  2013-04-09 11:06 ` Mel Gorman
@ 2013-04-09 17:27   ` Christoph Lameter
  -1 siblings, 0 replies; 83+ messages in thread
From: Christoph Lameter @ 2013-04-09 17:27 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Jiri Slaby, Valdis Kletnieks, Rik van Riel,
	Zlatko Calusic, Johannes Weiner, dormando, Satoru Moriya,
	Michal Hocko, Linux-MM, LKML

One additional measure that may be useful is to make kswapd prefer one
specific processor on a socket. Two benefits arise from that:

1. Better use of cpu caches and therefore higher speed, less
serialization.

2. Reduction of the disturbances to one processor.



^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 01/10] mm: vmscan: Limit the number of pages kswapd reclaims at each priority
  2013-04-09 11:06   ` Mel Gorman
@ 2013-04-10  6:47     ` Kamezawa Hiroyuki
  -1 siblings, 0 replies; 83+ messages in thread
From: Kamezawa Hiroyuki @ 2013-04-10  6:47 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Jiri Slaby, Valdis Kletnieks, Rik van Riel,
	Zlatko Calusic, Johannes Weiner, dormando, Satoru Moriya,
	Michal Hocko, Linux-MM, LKML

(2013/04/09 20:06), Mel Gorman wrote:
> The number of pages kswapd can reclaim is bound by the number of pages it
> scans which is related to the size of the zone and the scanning priority. In
> many cases the priority remains low because it's reset every SWAP_CLUSTER_MAX
> reclaimed pages but in the event kswapd scans a large number of pages it
> cannot reclaim, it will raise the priority and potentially discard a large
> percentage of the zone as sc->nr_to_reclaim is ULONG_MAX. The user-visible
> effect is a reclaim "spike" where a large percentage of memory is suddenly
> freed. It would be bad enough if this was just unused memory but because
> of how anon/file pages are balanced it is possible that applications get
> pushed to swap unnecessarily.
> 
> This patch limits the number of pages kswapd will reclaim to the high
> watermark. Reclaim will still overshoot due to it not being a hard limit as
> shrink_lruvec() will ignore the sc.nr_to_reclaim at DEF_PRIORITY but it
> prevents kswapd reclaiming the world at higher priorities. The number of
> pages it reclaims is not adjusted for high-order allocations as kswapd will
> reclaim excessively if it is to balance zones for high-order allocations.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> Reviewed-by: Rik van Riel <riel@redhat.com>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
>

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
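
For readers following along, the changelog above can be illustrated with a
small userspace sketch. It assumes the per-pass scan window is roughly the
LRU size shifted right by the scanning priority, and that the patch caps the
reclaim target at the larger of SWAP_CLUSTER_MAX and the zone's high
watermark. The helper names scan_window() and kswapd_reclaim_target() are
hypothetical and do not appear in the patch.

```c
#include <assert.h>

#define DEF_PRIORITY      12   /* starting scan priority in the kernel */
#define SWAP_CLUSTER_MAX  32UL /* reclaim batch size */

/*
 * Hypothetical helper: pages considered in one pass, i.e. the LRU size
 * scaled down by the priority. A numerically lower priority scans more.
 */
static unsigned long scan_window(unsigned long lru_pages, int priority)
{
	assert(priority >= 0 && priority <= DEF_PRIORITY);
	return lru_pages >> priority;
}

/*
 * Hypothetical helper: assumed reclaim target after the patch. It is the
 * high watermark, but never less than one batch. Before the patch the
 * target was ULONG_MAX, so nothing bounded reclaim at higher priorities.
 */
static unsigned long kswapd_reclaim_target(unsigned long high_wmark)
{
	return high_wmark > SWAP_CLUSTER_MAX ? high_wmark : SWAP_CLUSTER_MAX;
}
```

For an illustrative zone of 1048576 pages, scan_window() gives 256 pages at
DEF_PRIORITY and the whole zone at priority 0, which is where an unbounded
target would allow the reclaim "spike" described in the changelog.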




^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 02/10] mm: vmscan: Obey proportional scanning requirements for kswapd
  2013-04-09 11:06   ` Mel Gorman
@ 2013-04-10  7:16     ` Kamezawa Hiroyuki
  -1 siblings, 0 replies; 83+ messages in thread
From: Kamezawa Hiroyuki @ 2013-04-10  7:16 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Jiri Slaby, Valdis Kletnieks, Rik van Riel,
	Zlatko Calusic, Johannes Weiner, dormando, Satoru Moriya,
	Michal Hocko, Linux-MM, LKML

(2013/04/09 20:06), Mel Gorman wrote:
> Simplistically, the anon and file LRU lists are scanned proportionally
> depending on the value of vm.swappiness although there are other factors
> taken into account by get_scan_count().  The patch "mm: vmscan: Limit
> the number of pages kswapd reclaims" limits the number of pages kswapd
> reclaims but it breaks this proportional scanning and may evenly shrink
> anon/file LRUs regardless of vm.swappiness.
> 
> This patch preserves the proportional scanning and reclaim. It does mean
> that kswapd will reclaim more than requested but the number of pages will
> be related to the high watermark.
> 
> [mhocko@suse.cz: Correct proportional reclaim for memcg and simplify]
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> Acked-by: Rik van Riel <riel@redhat.com>
> ---
>   mm/vmscan.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++--------
>   1 file changed, 46 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 4835a7a..0742c45 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1825,13 +1825,21 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>   	enum lru_list lru;
>   	unsigned long nr_reclaimed = 0;
>   	unsigned long nr_to_reclaim = sc->nr_to_reclaim;
> +	unsigned long nr_anon_scantarget, nr_file_scantarget;
>   	struct blk_plug plug;
> +	bool scan_adjusted = false;
>   
>   	get_scan_count(lruvec, sc, nr);
>   
> +	/* Record the original scan target for proportional adjustments later */
> +	nr_file_scantarget = nr[LRU_INACTIVE_FILE] + nr[LRU_ACTIVE_FILE] + 1;
> +	nr_anon_scantarget = nr[LRU_INACTIVE_ANON] + nr[LRU_ACTIVE_ANON] + 1;
> +

I'm sorry, I could not follow the calculation here.

Assume that at this point
        nr_file_scantarget = 100
        nr_anon_scantarget = 100.


>   	blk_start_plug(&plug);
>   	while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
>   					nr[LRU_INACTIVE_FILE]) {
> +		unsigned long nr_anon, nr_file, percentage;
> +
>   		for_each_evictable_lru(lru) {
>   			if (nr[lru]) {
>   				nr_to_scan = min(nr[lru], SWAP_CLUSTER_MAX);
> @@ -1841,17 +1849,47 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>   							    lruvec, sc);
>   			}
>   		}
> +
> +		if (nr_reclaimed < nr_to_reclaim || scan_adjusted)
> +			continue;
> +
>   		/*
> -		 * On large memory systems, scan >> priority can become
> -		 * really large. This is fine for the starting priority;
> -		 * we want to put equal scanning pressure on each zone.
> -		 * However, if the VM has a harder time of freeing pages,
> -		 * with multiple processes reclaiming pages, the total
> -		 * freeing target can get unreasonably large.
> +		 * For global direct reclaim, reclaim only the number of pages
> +		 * requested. Less care is taken to scan proportionally as it
> +		 * is more important to minimise direct reclaim stall latency
> +		 * than it is to properly age the LRU lists.
>   		 */
> -		if (nr_reclaimed >= nr_to_reclaim &&
> -		    sc->priority < DEF_PRIORITY)
> +		if (global_reclaim(sc) && !current_is_kswapd())
>   			break;
> +
> +		/*
> +		 * For kswapd and memcg, reclaim at least the number of pages
> +		 * requested. Ensure that the anon and file LRUs shrink
> +		 * proportionally what was requested by get_scan_count(). We
> +		 * stop reclaiming one LRU and reduce the amount scanning
> +		 * proportional to the original scan target.
> +		 */
> +		nr_file = nr[LRU_INACTIVE_FILE] + nr[LRU_ACTIVE_FILE];
> +		nr_anon = nr[LRU_INACTIVE_ANON] + nr[LRU_ACTIVE_ANON];
> +
Then suppose nr_file = 80 and nr_anon = 70 at this point.


> +		if (nr_file > nr_anon) {
> +			lru = LRU_BASE;
> +			percentage = nr_anon * 100 / nr_anon_scantarget;
> +		} else {
> +			lru = LRU_FILE;
> +			percentage = nr_file * 100 / nr_file_scantarget;
> +		}

The percentage will be 70 (nr_anon * 100 / nr_anon_scantarget).

> +
> +		/* Stop scanning the smaller of the LRU */
> +		nr[lru] = 0;
> +		nr[lru + LRU_ACTIVE] = 0;
> +
This stops the anon scan.

> +		/* Reduce scanning of the other LRU proportionally */
> +		lru = (lru == LRU_FILE) ? LRU_BASE : LRU_FILE;
> +		nr[lru] = nr[lru] * percentage / 100;
> +		nr[lru + LRU_ACTIVE] = nr[lru + LRU_ACTIVE] * percentage / 100;
> +

Finally, in the next iteration,

              nr[file] = 80 * 0.7 = 56.

After the loop, the anon scan is 30 pages and the file scan is 76 (20 + 56)
pages.

I think the calculation here should be

   nr[lru] = nr[lru] - nr_lru_scantarget * percentage / 100

   Here, 80 - 70 = 10 more pages to scan, so the totals (30 and 30) stay
   proportional.

Am I misunderstanding?

Thanks,
-Kame
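
The arithmetic in this review can be checked with a minimal userspace
sketch. It is not kernel code: the function names are hypothetical, the
"+ 1" in the scan targets is ignored as in the example above, and the
"suggested" variant is only one reading of the formula proposed in the
review.

```c
#include <assert.h>

/* Remaining scan count for the larger LRU as computed by the posted patch. */
static unsigned long remaining_as_posted(unsigned long nr_left,
					 unsigned long percentage)
{
	return nr_left * percentage / 100;
}

/*
 * Remaining scan count under the reviewer's reading: keep scanning the
 * larger LRU only until it has been scanned by the same fraction of its
 * original target as the LRU that was stopped.
 */
static unsigned long remaining_as_suggested(unsigned long scantarget,
					    unsigned long nr_left,
					    unsigned long percentage)
{
	unsigned long keep;

	assert(percentage <= 100);
	keep = scantarget * percentage / 100;
	return nr_left > keep ? nr_left - keep : 0;
}

/* Total pages scanned from an LRU once the remaining count is used up. */
static unsigned long total_scanned(unsigned long scantarget,
				   unsigned long nr_left,
				   unsigned long remaining)
{
	assert(nr_left <= scantarget);
	return (scantarget - nr_left) + remaining;
}
```

With scan targets of 100, nr_file = 80 and percentage = 70, the posted
calculation leaves 56 pages to scan (76 in total for file against 30 for
anon), while the suggested calculation leaves 10 (30 in total for each).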



^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 03/10] mm: vmscan: Flatten kswapd priority loop
  2013-04-09 11:06   ` Mel Gorman
@ 2013-04-10  7:47     ` Kamezawa Hiroyuki
  -1 siblings, 0 replies; 83+ messages in thread
From: Kamezawa Hiroyuki @ 2013-04-10  7:47 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Jiri Slaby, Valdis Kletnieks, Rik van Riel,
	Zlatko Calusic, Johannes Weiner, dormando, Satoru Moriya,
	Michal Hocko, Linux-MM, LKML

(2013/04/09 20:06), Mel Gorman wrote:
> kswapd stops raising the scanning priority when at least SWAP_CLUSTER_MAX
> pages have been reclaimed or the pgdat is considered balanced. It then
> rechecks if it needs to restart at DEF_PRIORITY and whether high-order
> reclaim needs to be reset. This is not wrong per-se but it is confusing
> to follow and forcing kswapd to stay at DEF_PRIORITY may require several
> restarts before it has scanned enough pages to meet the high watermark even
> at 100% efficiency. This patch irons out the logic a bit by controlling
> when priority is raised and removing the "goto loop_again".
> 
> This patch has kswapd raise the scanning priority until it is scanning
> enough pages that it could meet the high watermark in one shrink of the
> LRU lists if it is able to reclaim at 100% efficiency. It will not raise
> the scanning priority higher unless it is failing to reclaim any pages.
> 
> To avoid infinite looping for high-order allocation requests kswapd will
> not reclaim for high-order allocations when it has reclaimed at least
> twice the number of pages as the allocation request.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> ---
>   mm/vmscan.c | 85 +++++++++++++++++++++++++++++--------------------------------
>   1 file changed, 40 insertions(+), 45 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 0742c45..78268ca 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2633,8 +2633,12 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
>   /*
>    * kswapd shrinks the zone by the number of pages required to reach
>    * the high watermark.
> + *
> + * Returns true if kswapd scanned at least the requested number of pages to
> + * reclaim. This is used to determine if the scanning priority needs to be
> + * raised.
>    */
> -static void kswapd_shrink_zone(struct zone *zone,
> +static bool kswapd_shrink_zone(struct zone *zone,
>   			       struct scan_control *sc,
>   			       unsigned long lru_pages)
>   {
> @@ -2654,6 +2658,8 @@ static void kswapd_shrink_zone(struct zone *zone,
>   
>   	if (nr_slab == 0 && !zone_reclaimable(zone))
>   		zone->all_unreclaimable = 1;
> +
> +	return sc->nr_scanned >= sc->nr_to_reclaim;
>   }
>   
>   /*
> @@ -2680,26 +2686,25 @@ static void kswapd_shrink_zone(struct zone *zone,
>   static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
>   							int *classzone_idx)
>   {
> -	bool pgdat_is_balanced = false;
>   	int i;
>   	int end_zone = 0;	/* Inclusive.  0 = ZONE_DMA */
>   	unsigned long nr_soft_reclaimed;
>   	unsigned long nr_soft_scanned;
>   	struct scan_control sc = {
>   		.gfp_mask = GFP_KERNEL,
> +		.priority = DEF_PRIORITY,
>   		.may_unmap = 1,
>   		.may_swap = 1,
> +		.may_writepage = !laptop_mode,
>   		.order = order,
>   		.target_mem_cgroup = NULL,
>   	};
> -loop_again:
> -	sc.priority = DEF_PRIORITY;
> -	sc.nr_reclaimed = 0;
> -	sc.may_writepage = !laptop_mode;
>   	count_vm_event(PAGEOUTRUN);
>   
>   	do {
>   		unsigned long lru_pages = 0;
> +		unsigned long nr_reclaimed = sc.nr_reclaimed = 0;
> +		bool raise_priority = true;
>   
>   		/*
>   		 * Scan in the highmem->dma direction for the highest
> @@ -2741,10 +2746,8 @@ loop_again:
>   			}
>   		}
>   
> -		if (i < 0) {
> -			pgdat_is_balanced = true;
> +		if (i < 0)
>   			goto out;
> -		}
>   
>   		for (i = 0; i <= end_zone; i++) {
>   			struct zone *zone = pgdat->node_zones + i;
> @@ -2811,8 +2814,16 @@ loop_again:
>   
>   			if ((buffer_heads_over_limit && is_highmem_idx(i)) ||
>   			    !zone_balanced(zone, testorder,
> -					   balance_gap, end_zone))
> -				kswapd_shrink_zone(zone, &sc, lru_pages);
> +					   balance_gap, end_zone)) {
> +				/*
> +				 * There should be no need to raise the
> +				 * scanning priority if enough pages are
> +				 * already being scanned that high
> +				 * watermark would be met at 100% efficiency.
> +				 */
> +				if (kswapd_shrink_zone(zone, &sc, lru_pages))
> +					raise_priority = false;

So the priority will be raised until enough pages are scanned to cover the
"high" watermark, and it will not be raised beyond that as long as some
pages are reclaimed?

Thanks,
-Kame


> +			}
>   
>   			/*
>   			 * If we're getting trouble reclaiming, start doing
> @@ -2847,46 +2858,29 @@ loop_again:
>   				pfmemalloc_watermark_ok(pgdat))
>   			wake_up(&pgdat->pfmemalloc_wait);
>   
> -		if (pgdat_balanced(pgdat, order, *classzone_idx)) {
> -			pgdat_is_balanced = true;
> -			break;		/* kswapd: all done */
> -		}
> -
>   		/*
> -		 * We do this so kswapd doesn't build up large priorities for
> -		 * example when it is freeing in parallel with allocators. It
> -		 * matches the direct reclaim path behaviour in terms of impact
> -		 * on zone->*_priority.
> +		 * Fragmentation may mean that the system cannot be rebalanced
> +		 * for high-order allocations in all zones. If twice the
> +		 * allocation size has been reclaimed and the zones are still
> +		 * not balanced then recheck the watermarks at order-0 to
> +		 * prevent kswapd reclaiming excessively. Assume that a
> +		 * process requested a high-order can direct reclaim/compact.
>   		 */
> -		if (sc.nr_reclaimed >= SWAP_CLUSTER_MAX)
> -			break;
> -	} while (--sc.priority >= 0);
> -
> -out:
> -	if (!pgdat_is_balanced) {
> -		cond_resched();
> +		if (order && sc.nr_reclaimed >= 2UL << order)
> +			order = sc.order = 0;
>   
> -		try_to_freeze();
> +		/* Check if kswapd should be suspending */
> +		if (try_to_freeze() || kthread_should_stop())
> +			break;
>   
>   		/*
> -		 * Fragmentation may mean that the system cannot be
> -		 * rebalanced for high-order allocations in all zones.
> -		 * At this point, if nr_reclaimed < SWAP_CLUSTER_MAX,
> -		 * it means the zones have been fully scanned and are still
> -		 * not balanced. For high-order allocations, there is
> -		 * little point trying all over again as kswapd may
> -		 * infinite loop.
> -		 *
> -		 * Instead, recheck all watermarks at order-0 as they
> -		 * are the most important. If watermarks are ok, kswapd will go
> -		 * back to sleep. High-order users can still perform direct
> -		 * reclaim if they wish.
> +		 * Raise priority if scanning rate is too low or there was no
> +		 * progress in reclaiming pages
>   		 */
> -		if (sc.nr_reclaimed < SWAP_CLUSTER_MAX)
> -			order = sc.order = 0;
> -
> -		goto loop_again;
> -	}
> +		if (raise_priority || sc.nr_reclaimed - nr_reclaimed == 0)
> +			sc.priority--;
> +	} while (sc.priority >= 0 &&
> +		 !pgdat_balanced(pgdat, order, *classzone_idx));
>   
>   	/*
>   	 * If kswapd was reclaiming at a higher order, it has the option of
> @@ -2915,6 +2909,7 @@ out:
>   			compact_pgdat(pgdat, order);
>   	}
>   
> +out:
>   	/*
>   	 * Return the order we were reclaiming at so prepare_kswapd_sleep()
>   	 * makes a decision on the order we were last reclaiming at. However,
> 
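
The loop control in the quoted diff can be summarised with a small
userspace sketch. It only restates the conditions visible in the patch
(the return value of kswapd_shrink_zone(), the raise_priority flag and the
2UL << order check); the function names are hypothetical and the kernel
state is reduced to plain arguments.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Mirrors the return value added to kswapd_shrink_zone(): true when the
 * pass scanned at least as many pages as it was asked to reclaim.
 */
static bool scanned_enough(unsigned long nr_scanned,
			   unsigned long nr_to_reclaim)
{
	return nr_scanned >= nr_to_reclaim;
}

/*
 * Priority is raised when no zone scanned enough pages in this pass, or
 * when the pass reclaimed nothing at all.
 */
static bool should_raise_priority(bool any_zone_scanned_enough,
				  unsigned long reclaimed_this_pass)
{
	bool raise_priority = !any_zone_scanned_enough;

	return raise_priority || reclaimed_this_pass == 0;
}

/* High-order reclaim falls back to order-0 after 2 << order pages. */
static int next_order(int order, unsigned long nr_reclaimed)
{
	assert(order >= 0);
	if (order && nr_reclaimed >= (2UL << order))
		return 0;
	return order;
}
```

Read this way, the priority stops rising once a pass scans enough pages to
cover the reclaim target and makes some progress, which appears to match
the question raised in the review. For an order-9 request the fallback to
order-0 happens after 1024 reclaimed pages.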



^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 04/10] mm: vmscan: Decide whether to compact the pgdat based on reclaim progress
  2013-04-09 11:06   ` Mel Gorman
@ 2013-04-10  8:05     ` Kamezawa Hiroyuki
  -1 siblings, 0 replies; 83+ messages in thread
From: Kamezawa Hiroyuki @ 2013-04-10  8:05 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Jiri Slaby, Valdis Kletnieks, Rik van Riel,
	Zlatko Calusic, Johannes Weiner, dormando, Satoru Moriya,
	Michal Hocko, Linux-MM, LKML

(2013/04/09 20:06), Mel Gorman wrote:
> In the past, kswapd made a decision on whether to compact memory after the
> pgdat was considered balanced. This more or less worked but it is late to
> make such a decision and does not fit well now that kswapd makes a decision
> whether to exit the zone scanning loop depending on reclaim progress.
> 
> This patch will compact a pgdat if at least the requested number of pages
> were reclaimed from unbalanced zones for a given priority. If any zone is
> currently balanced, kswapd will not call compaction as it is expected the
> necessary pages are already available.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

I like this way.

> ---
>   mm/vmscan.c | 60 ++++++++++++++++++++++++++++++------------------------------
>   1 file changed, 30 insertions(+), 30 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 78268ca..a9e68b4 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2640,7 +2640,8 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
>    */
>   static bool kswapd_shrink_zone(struct zone *zone,
>   			       struct scan_control *sc,
> -			       unsigned long lru_pages)
> +			       unsigned long lru_pages,
> +			       unsigned long *nr_attempted)
>   {
>   	unsigned long nr_slab;
>   	struct reclaim_state *reclaim_state = current->reclaim_state;
> @@ -2656,6 +2657,9 @@ static bool kswapd_shrink_zone(struct zone *zone,
>   	nr_slab = shrink_slab(&shrink, sc->nr_scanned, lru_pages);
>   	sc->nr_reclaimed += reclaim_state->reclaimed_slab;
>   
> +	/* Account for the number of pages attempted to reclaim */
> +	*nr_attempted += sc->nr_to_reclaim;
> +
>   	if (nr_slab == 0 && !zone_reclaimable(zone))
>   		zone->all_unreclaimable = 1;
>   
> @@ -2703,8 +2707,11 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
>   
>   	do {
>   		unsigned long lru_pages = 0;
> +		unsigned long nr_attempted = 0;
>   		unsigned long nr_reclaimed = sc.nr_reclaimed = 0;
> +		unsigned long this_reclaimed;
>   		bool raise_priority = true;
> +		bool pgdat_needs_compaction = (order > 0);
>   
>   		/*
>   		 * Scan in the highmem->dma direction for the highest
> @@ -2752,7 +2759,21 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
>   		for (i = 0; i <= end_zone; i++) {
>   			struct zone *zone = pgdat->node_zones + i;
>   
> +			if (!populated_zone(zone))
> +				continue;
> +
>   			lru_pages += zone_reclaimable_pages(zone);
> +
> +			/*
> +			 * If any zone is currently balanced then kswapd will
> +			 * not call compaction as it is expected that the
> +			 * necessary pages are already available.
> +			 */
> +			if (pgdat_needs_compaction &&
> +					zone_watermark_ok(zone, order,
> +						low_wmark_pages(zone),
> +						*classzone_idx, 0))
> +				pgdat_needs_compaction = false;
>   		}
>   
>   		/*
> @@ -2821,7 +2842,8 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
>   				 * already being scanned that high
>   				 * watermark would be met at 100% efficiency.
>   				 */
> -				if (kswapd_shrink_zone(zone, &sc, lru_pages))
> +				if (kswapd_shrink_zone(zone, &sc, lru_pages,
> +						       &nr_attempted))
>   					raise_priority = false;
>   			}
>   
> @@ -2873,42 +2895,20 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
>   		if (try_to_freeze() || kthread_should_stop())
>   			break;
>   
> +		/* Compact if necessary and kswapd is reclaiming efficiently */
> +		this_reclaimed = sc.nr_reclaimed - nr_reclaimed;
> +		if (pgdat_needs_compaction && this_reclaimed > nr_attempted)
> +			compact_pgdat(pgdat, order);
> +

What does "this_reclaimed" mean?
"The total amount of reclaimed memory - reclaimed memory at this iteration"?

And does this_reclaimed > nr_attempted mean kswapd is efficient?
What does "efficient" mean here?

Thanks,
-Kame

>   		/*
>   		 * Raise priority if scanning rate is too low or there was no
>   		 * progress in reclaiming pages
>   		 */
> -		if (raise_priority || sc.nr_reclaimed - nr_reclaimed == 0)
> +		if (raise_priority || !this_reclaimed)
>   			sc.priority--;
>   	} while (sc.priority >= 0 &&
>   		 !pgdat_balanced(pgdat, order, *classzone_idx));
>   
> -	/*
> -	 * If kswapd was reclaiming at a higher order, it has the option of
> -	 * sleeping without all zones being balanced. Before it does, it must
> -	 * ensure that the watermarks for order-0 on *all* zones are met and
> -	 * that the congestion flags are cleared. The congestion flag must
> -	 * be cleared as kswapd is the only mechanism that clears the flag
> -	 * and it is potentially going to sleep here.
> -	 */
> -	if (order) {
> -		int zones_need_compaction = 1;
> -
> -		for (i = 0; i <= end_zone; i++) {
> -			struct zone *zone = pgdat->node_zones + i;
> -
> -			if (!populated_zone(zone))
> -				continue;
> -
> -			/* Check if the memory needs to be defragmented. */
> -			if (zone_watermark_ok(zone, order,
> -				    low_wmark_pages(zone), *classzone_idx, 0))
> -				zones_need_compaction = 0;
> -		}
> -
> -		if (zones_need_compaction)
> -			compact_pgdat(pgdat, order);
> -	}
> -
>   out:
>   	/*
>   	 * Return the order we were reclaiming at so prepare_kswapd_sleep()
> 



^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 03/10] mm: vmscan: Flatten kswapd priority loop
  2013-04-10  7:47     ` Kamezawa Hiroyuki
@ 2013-04-10 13:29       ` Mel Gorman
  -1 siblings, 0 replies; 83+ messages in thread
From: Mel Gorman @ 2013-04-10 13:29 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: Andrew Morton, Jiri Slaby, Valdis Kletnieks, Rik van Riel,
	Zlatko Calusic, Johannes Weiner, dormando, Satoru Moriya,
	Michal Hocko, Linux-MM, LKML

On Wed, Apr 10, 2013 at 04:47:31PM +0900, Kamezawa Hiroyuki wrote:
> > @@ -2811,8 +2814,16 @@ loop_again:
> >   
> >   			if ((buffer_heads_over_limit && is_highmem_idx(i)) ||
> >   			    !zone_balanced(zone, testorder,
> > -					   balance_gap, end_zone))
> > -				kswapd_shrink_zone(zone, &sc, lru_pages);
> > +					   balance_gap, end_zone)) {
> > +				/*
> > +				 * There should be no need to raise the
> > +				 * scanning priority if enough pages are
> > +				 * already being scanned that high
> > +				 * watermark would be met at 100% efficiency.
> > +				 */
> > +				if (kswapd_shrink_zone(zone, &sc, lru_pages))
> > +					raise_priority = false;
> 
> priority will be raised up enough to scan the amount of "high" watermark
> and will not get larger than that if some pages are reclaimed ?
> 

Yes.
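
The exchange above can be illustrated with a toy model. This is only a
sketch, not kernel code: struct toy_zone and both helpers are hypothetical
names, and scanning "lru_pages >> priority" pages per pass is a
simplification of what get_scan_count() computes. It also assumes every
pass reclaims at least one page, so the only remaining reason to raise
priority is that too few pages are scanned to cover the high watermark.

```c
#include <assert.h>
#include <stdbool.h>

#define DEF_PRIORITY 12	/* starting priority used by mm/vmscan.c */

/* Hypothetical stand-in for the per-zone state this model needs. */
struct toy_zone {
	unsigned long lru_pages;	/* reclaimable pages on the LRU lists */
	unsigned long high_wmark;	/* pages kswapd is asked to reclaim */
};

/*
 * True when the pages scanned at this priority would meet the high
 * watermark if every scanned page were reclaimed (100% efficiency).
 */
static bool toy_scans_enough(const struct toy_zone *z, int priority)
{
	assert(priority >= 0 && priority <= DEF_PRIORITY);
	return (z->lru_pages >> priority) >= z->high_wmark;
}

/*
 * Priority at which the model stops raising priority, assuming some
 * pages are reclaimed on every pass. A lower number means more
 * aggressive scanning; the model never goes below 0.
 */
static int toy_settled_priority(const struct toy_zone *z)
{
	int priority = DEF_PRIORITY;

	while (priority > 0 && !toy_scans_enough(z, priority))
		priority--;
	return priority;
}
```

With 1 << 20 LRU pages and a high watermark of 1024 pages, the model
settles at priority 10, where 1024 pages are scanned per pass, and goes no
further as long as reclaim keeps making progress.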

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 04/10] mm: vmscan: Decide whether to compact the pgdat based on reclaim progress
  2013-04-10  8:05     ` Kamezawa Hiroyuki
@ 2013-04-10 13:57       ` Mel Gorman
  -1 siblings, 0 replies; 83+ messages in thread
From: Mel Gorman @ 2013-04-10 13:57 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: Andrew Morton, Jiri Slaby, Valdis Kletnieks, Rik van Riel,
	Zlatko Calusic, Johannes Weiner, dormando, Satoru Moriya,
	Michal Hocko, Linux-MM, LKML

On Wed, Apr 10, 2013 at 05:05:14PM +0900, Kamezawa Hiroyuki wrote:
> (2013/04/09 20:06), Mel Gorman wrote:
> > In the past, kswapd makes a decision on whether to compact memory after the
> > pgdat was considered balanced. This more or less worked but it is late to
> > make such a decision and does not fit well now that kswapd makes a decision
> > whether to exit the zone scanning loop depending on reclaim progress.
> > 
> > This patch will compact a pgdat if at least the requested number of pages
> > were reclaimed from unbalanced zones for a given priority. If any zone is
> > currently balanced, kswapd will not call compaction as it is expected the
> > necessary pages are already available.
> > 
> > Signed-off-by: Mel Gorman <mgorman@suse.de>
> 
> I like this way.
> 

Thanks.

> > <SNIP>
> > @@ -2873,42 +2895,20 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
> >   		if (try_to_freeze() || kthread_should_stop())
> >   			break;
> >   
> > +		/* Compact if necessary and kswapd is reclaiming efficiently */
> > +		this_reclaimed = sc.nr_reclaimed - nr_reclaimed;
> > +		if (pgdat_needs_compaction && this_reclaimed > nr_attempted)
> > +			compact_pgdat(pgdat, order);
> > +
> 
> What does "this_reclaimed" mean ?   
> "the total amount of reclaimed memory - reclaimed memory at this iteration" ?
> 

It's meant to be "reclaimed memory at this iteration" but I made a merge
error when I decided to reset sc.nr_reclaimed to 0 on every loop in the patch
"mm: vmscan: Flatten kswapd priority loop". Once I did that, nr_reclaimed
became redundant and should have been removed. I've done that now.

> And this_reclaimed > nr_attempted means kswapd is efficient ?
> What "efficient" means here ?
> 

Reclaim efficiency is normally the ratio between pages scanned and pages
reclaimed. Ideally, every page scanned is reclaimed. In this case, being
efficient means that we reclaimed at least the number of pages requested,
sc->nr_to_reclaim, which in the case of kswapd is the high watermark. I
changed the comment to

                /*
                 * Compact if necessary and kswapd is reclaiming at least the
                 * high watermark number of pages as requested
                 */
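
The decision described above can be modelled in isolation. This is a
minimal sketch under stated assumptions, not the patch itself: struct
toy_pass and toy_should_compact() are hypothetical names, and the
per-zone zone_watermark_ok() checks are collapsed into a single flag. The
comparison mirrors the quoted hunk, which uses a strict ">" even though
the description says "at least".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical summary of one pass over the zones of a node. */
struct toy_pass {
	int order;			/* allocation order kswapd was woken for */
	bool any_zone_meets_wmark;	/* some zone passed the order-N watermark check */
	unsigned long nr_attempted;	/* sum of sc->nr_to_reclaim over shrunk zones */
	unsigned long nr_reclaimed;	/* pages reclaimed during this pass */
};

/* Compact only for high-order requests that no zone can satisfy yet. */
static bool toy_should_compact(const struct toy_pass *p)
{
	bool pgdat_needs_compaction;

	assert(p->order >= 0);
	pgdat_needs_compaction = p->order > 0 && !p->any_zone_meets_wmark;

	/* Strict comparison, as in the quoted diff. */
	return pgdat_needs_compaction && p->nr_reclaimed > p->nr_attempted;
}
```

In this model an order-0 wakeup never compacts, and a high-order wakeup
compacts only when no zone already meets the watermark and the pass
reclaimed more than it was asked to.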

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 02/10] mm: vmscan: Obey proportional scanning requirements for kswapd
  2013-04-10  7:16     ` Kamezawa Hiroyuki
@ 2013-04-10 14:08       ` Mel Gorman
  -1 siblings, 0 replies; 83+ messages in thread
From: Mel Gorman @ 2013-04-10 14:08 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: Andrew Morton, Jiri Slaby, Valdis Kletnieks, Rik van Riel,
	Zlatko Calusic, Johannes Weiner, dormando, Satoru Moriya,
	Michal Hocko, Linux-MM, LKML

On Wed, Apr 10, 2013 at 04:16:47PM +0900, Kamezawa Hiroyuki wrote:
> (2013/04/09 20:06), Mel Gorman wrote:
> > Simplistically, the anon and file LRU lists are scanned proportionally
> > depending on the value of vm.swappiness although there are other factors
> > taken into account by get_scan_count().  The patch "mm: vmscan: Limit
> > the number of pages kswapd reclaims" limits the number of pages kswapd
> > reclaims but it breaks this proportional scanning and may evenly shrink
> > anon/file LRUs regardless of vm.swappiness.
> > 
> > This patch preserves the proportional scanning and reclaim. It does mean
> > that kswapd will reclaim more than requested but the number of pages will
> > be related to the high watermark.
> > 
> > [mhocko@suse.cz: Correct proportional reclaim for memcg and simplify]
> > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > Acked-by: Rik van Riel <riel@redhat.com>
> > ---
> >   mm/vmscan.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++--------
> >   1 file changed, 46 insertions(+), 8 deletions(-)
> > 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 4835a7a..0742c45 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1825,13 +1825,21 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
> >   	enum lru_list lru;
> >   	unsigned long nr_reclaimed = 0;
> >   	unsigned long nr_to_reclaim = sc->nr_to_reclaim;
> > +	unsigned long nr_anon_scantarget, nr_file_scantarget;
> >   	struct blk_plug plug;
> > +	bool scan_adjusted = false;
> >   
> >   	get_scan_count(lruvec, sc, nr);
> >   
> > +	/* Record the original scan target for proportional adjustments later */
> > +	nr_file_scantarget = nr[LRU_INACTIVE_FILE] + nr[LRU_ACTIVE_FILE] + 1;
> > +	nr_anon_scantarget = nr[LRU_INACTIVE_ANON] + nr[LRU_ACTIVE_ANON] + 1;
> > +
> 
> I'm sorry I couldn't understand the calc...
> 
> Assume here
>         nr_file_scantarget = 100
>         nr_anon_file_target = 100.
> 

I think you might have meant nr_anon_scantarget here instead of
nr_anon_file_target.

> 
> >   	blk_start_plug(&plug);
> >   	while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
> >   					nr[LRU_INACTIVE_FILE]) {
> > +		unsigned long nr_anon, nr_file, percentage;
> > +
> >   		for_each_evictable_lru(lru) {
> >   			if (nr[lru]) {
> >   				nr_to_scan = min(nr[lru], SWAP_CLUSTER_MAX);
> > @@ -1841,17 +1849,47 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
> >   							    lruvec, sc);
> >   			}
> >   		}
> > +
> > +		if (nr_reclaimed < nr_to_reclaim || scan_adjusted)
> > +			continue;
> > +
> >   		/*
> > -		 * On large memory systems, scan >> priority can become
> > -		 * really large. This is fine for the starting priority;
> > -		 * we want to put equal scanning pressure on each zone.
> > -		 * However, if the VM has a harder time of freeing pages,
> > -		 * with multiple processes reclaiming pages, the total
> > -		 * freeing target can get unreasonably large.
> > +		 * For global direct reclaim, reclaim only the number of pages
> > +		 * requested. Less care is taken to scan proportionally as it
> > +		 * is more important to minimise direct reclaim stall latency
> > +		 * than it is to properly age the LRU lists.
> >   		 */
> > -		if (nr_reclaimed >= nr_to_reclaim &&
> > -		    sc->priority < DEF_PRIORITY)
> > +		if (global_reclaim(sc) && !current_is_kswapd())
> >   			break;
> > +
> > +		/*
> > +		 * For kswapd and memcg, reclaim at least the number of pages
> > +		 * requested. Ensure that the anon and file LRUs shrink
> > +		 * proportionally what was requested by get_scan_count(). We
> > +		 * stop reclaiming one LRU and reduce the amount scanning
> > +		 * proportional to the original scan target.
> > +		 */
> > +		nr_file = nr[LRU_INACTIVE_FILE] + nr[LRU_ACTIVE_FILE];
> > +		nr_anon = nr[LRU_INACTIVE_ANON] + nr[LRU_ACTIVE_ANON];
> > +
>
> Then, nr_file = 80, nr_anon=70.
> 

As we scan evenly in SWAP_CLUSTER_MAX groups of pages, this wouldn't happen
but for the purposes of discussion, let's assume it did.

> 
> > +		if (nr_file > nr_anon) {
> > +			lru = LRU_BASE;
> > +			percentage = nr_anon * 100 / nr_anon_scantarget;
> > +		} else {
> > +			lru = LRU_FILE;
> > +			percentage = nr_file * 100 / nr_file_scantarget;
> > +		}
> 
> the percentage will be 70.
> 

Yes.

> > +
> > +		/* Stop scanning the smaller of the LRU */
> > +		nr[lru] = 0;
> > +		nr[lru + LRU_ACTIVE] = 0;
> > +
>
> this will stop anon scan.
> 

Yes.

> > +		/* Reduce scanning of the other LRU proportionally */
> > +		lru = (lru == LRU_FILE) ? LRU_BASE : LRU_FILE;
> > +		nr[lru] = nr[lru] * percentage / 100;;
> > +		nr[lru + LRU_ACTIVE] = nr[lru + LRU_ACTIVE] * percentage / 100;
> > +
> 
> finally, in the next iteration,
> 
>               nr[file] = 80 * 0.7 = 56.
>              
> After loop, anon-scan is 30 pages , file-scan is 76(20+56) pages..
> 

Well spotted, this would indeed reclaim too many pages from the other
LRU. I wanted to avoid recording the original scan targets as it's an
extra 40 bytes on the stack but it's unavoidable.

> I think the calc here should be
> 
>    nr[lru] = nr_lru_scantarget * percentage / 100 - nr[lru]
> 
>    Here, 80-70=10 more pages to scan..should be proportional.
> 

nr[lru] at the end there is the number of pages remaining to be scanned,
not the number of pages scanned already. Did you mean something like this?

nr[lru] = scantarget[lru] * percentage / 100 - (scantarget[lru] - nr[lru])

With care taken to ensure we do not underflow? Something like

        unsigned long nr[NR_LRU_LISTS];
        unsigned long targets[NR_LRU_LISTS];

...

	memcpy(targets, nr, sizeof(nr));

...

        nr[lru] = targets[lru] * percentage / 100;
        nr[lru] -= min(nr[lru], (targets[lru] - nr[lru]));

        lru += LRU_ACTIVE;
        nr[lru] = targets[lru] * percentage / 100;
        nr[lru] -= min(nr[lru], (targets[lru] - nr[lru]));

?
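
To check the arithmetic against the example numbers in the review, here is
a small standalone sketch. It is one reading of the intent, not code
posted in this thread: toy_rescale() and min_ul() are hypothetical names.
It differs from the sketch above in two ways, both assumptions: it scales
by the fraction already scanned (100 - percentage) so the result matches
the "10 more pages" in the review, and it saves the already-scanned count
before the remaining count is overwritten.

```c
#include <assert.h>

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/*
 * target:    original scan target of the list that keeps scanning
 * remaining: pages still left to scan on that list
 * pct_left:  percentage of the stopped list's target left unscanned
 *
 * Returns the new remaining count so that both lists end up scanned
 * to the same fraction of their original targets.
 */
static unsigned long toy_rescale(unsigned long target,
				 unsigned long remaining,
				 unsigned long pct_left)
{
	unsigned long nr_scanned, goal;

	assert(remaining <= target && pct_left <= 100);

	nr_scanned = target - remaining;	/* capture before overwriting */
	goal = target * (100 - pct_left) / 100;	/* total pages to scan */

	/* Clamp so an over-scanned list cannot underflow. */
	return goal - min_ul(goal, nr_scanned);
}
```

With the numbers from the review (target 100, 80 pages remaining on the
file list, 70% of the anon target left), toy_rescale(100, 80, 70) gives
10. If the file list had already scanned past the goal, the result clamps
to 0 instead of wrapping.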

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/10] Reduce system disruption due to kswapd V2
  2013-04-09 17:27   ` Christoph Lameter
@ 2013-04-10 14:14     ` Mel Gorman
  -1 siblings, 0 replies; 83+ messages in thread
From: Mel Gorman @ 2013-04-10 14:14 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Jiri Slaby, Valdis Kletnieks, Rik van Riel,
	Zlatko Calusic, Johannes Weiner, dormando, Satoru Moriya,
	Michal Hocko, Linux-MM, LKML

On Tue, Apr 09, 2013 at 05:27:18PM +0000, Christoph Lameter wrote:
> One additional measure that may be useful is to make kswapd prefer one
> specific processor on a socket. Two benefits arise from that:
> 
> 1. Better use of cpu caches and therefore higher speed, less
> serialization.
> 

Considering the volume of pages that kswapd can scan when it's active
I would expect that it trashes its cache anyway. The L1 cache would be
flushed after scanning struct pages for just a few MB of memory.
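
As a rough back-of-the-envelope check of the "few MB" figure, assume a 32 KiB
L1 data cache, a 64-byte struct page and 4 KiB pages. All three numbers are
assumptions that vary by CPU and kernel configuration, and the helper name is
hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: how much memory (in bytes) is described by the
 * struct pages that fit in a cache of cache_bytes.
 */
static size_t memory_covered_by_cache(size_t cache_bytes,
                                      size_t struct_page_bytes,
                                      size_t page_bytes)
{
        assert(struct_page_bytes > 0);
        /* Number of struct pages that fit, times the memory each describes */
        return cache_bytes / struct_page_bytes * page_bytes;
}
```

Under those assumptions, memory_covered_by_cache(32 * 1024, 64, 4096) comes
to 2 MiB, so scanning the struct pages for about 2 MiB of memory is enough to
cycle the whole L1.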

> 2. Reduction of the disturbances to one processor.
> 

I've never checked it but I would have expected kswapd to stay on the
same processor for significant periods of time. Have you experienced
problems where kswapd bounces around on CPUs within a node causing
workload disruption?

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/10] Reduce system disruption due to kswapd V2
  2013-04-10 14:14     ` Mel Gorman
@ 2013-04-10 22:28       ` dormando
  -1 siblings, 0 replies; 83+ messages in thread
From: dormando @ 2013-04-10 22:28 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Christoph Lameter, Andrew Morton, Jiri Slaby, Valdis Kletnieks,
	Rik van Riel, Zlatko Calusic, Johannes Weiner, Satoru Moriya,
	Michal Hocko, Linux-MM, LKML

> On Tue, Apr 09, 2013 at 05:27:18PM +0000, Christoph Lameter wrote:
> > One additional measure that may be useful is to make kswapd prefer one
> > specific processor on a socket. Two benefits arise from that:
> >
> > 1. Better use of cpu caches and therefore higher speed, less
> > serialization.
> >
>
> Considering the volume of pages that kswapd can scan when it's active
> I would expect that it trashes its cache anyway. The L1 cache would be
> flushed after scanning struct pages for just a few MB of memory.
>
> > 2. Reduction of the disturbances to one processor.
> >
>
> I've never checked it but I would have expected kswapd to stay on the
> same processor for significant periods of time. Have you experienced
> problems where kswapd bounces around on CPUs within a node causing
> workload disruption?

When kswapd shares the same CPU as our main process it causes a measurable
drop in response time (graphs show tiny spikes at the same time memory is
freed). Would be nice to be able to ensure it runs on a different core
than our latency sensitive processes at least. We can pin processes to
subsets of cores but I don't think there's a way to keep kswapd from
waking up on any of them?

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/10] Reduce system disruption due to kswapd V2
  2013-04-10 22:28       ` dormando
@ 2013-04-10 23:46         ` KOSAKI Motohiro
  -1 siblings, 0 replies; 83+ messages in thread
From: KOSAKI Motohiro @ 2013-04-10 23:46 UTC (permalink / raw)
  To: dormando
  Cc: Mel Gorman, Christoph Lameter, Andrew Morton, Jiri Slaby,
	Valdis Kletnieks, Rik van Riel, Zlatko Calusic, Johannes Weiner,
	Satoru Moriya, Michal Hocko, Linux-MM, LKML, kosaki.motohiro

>> I've never checked it but I would have expected kswapd to stay on the
>> same processor for significant periods of time. Have you experienced
>> problems where kswapd bounces around on CPUs within a node causing
>> workload disruption?
> 
> When kswapd shares the same CPU as our main process it causes a measurable
> drop in response time (graphs show tiny spikes at the same time memory is
> freed). Would be nice to be able to ensure it runs on a different core
> than our latency sensitive processes at least. We can pin processes to
> subsets of cores but I don't think there's a way to keep kswapd from
> waking up on any of them?

You are only talking about an extreme corner case and are not considering the
other side. When number-of-nodes > number-of-cpus, we have no way to avoid CPU
sharing.

Moreover, this is not a kswapd-specific issue; every kernel thread causes the
same latency problem. So this issue should be solved in a more generic layer.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 02/10] mm: vmscan: Obey proportional scanning requirements for kswapd
  2013-04-10 14:08       ` Mel Gorman
@ 2013-04-11  0:14         ` Kamezawa Hiroyuki
  -1 siblings, 0 replies; 83+ messages in thread
From: Kamezawa Hiroyuki @ 2013-04-11  0:14 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Jiri Slaby, Valdis Kletnieks, Rik van Riel,
	Zlatko Calusic, Johannes Weiner, dormando, Satoru Moriya,
	Michal Hocko, Linux-MM, LKML

(2013/04/10 23:08), Mel Gorman wrote:
> On Wed, Apr 10, 2013 at 04:16:47PM +0900, Kamezawa Hiroyuki wrote:
>> (2013/04/09 20:06), Mel Gorman wrote:
>>> Simplistically, the anon and file LRU lists are scanned proportionally
>>> depending on the value of vm.swappiness although there are other factors
>>> taken into account by get_scan_count().  The patch "mm: vmscan: Limit
>>> the number of pages kswapd reclaims" limits the number of pages kswapd
>>> reclaims but it breaks this proportional scanning and may evenly shrink
>>> anon/file LRUs regardless of vm.swappiness.
>>>
>>> This patch preserves the proportional scanning and reclaim. It does mean
>>> that kswapd will reclaim more than requested but the number of pages will
>>> be related to the high watermark.
>>>
>>> [mhocko@suse.cz: Correct proportional reclaim for memcg and simplify]
>>> Signed-off-by: Mel Gorman <mgorman@suse.de>
>>> Acked-by: Rik van Riel <riel@redhat.com>
>>> ---
>>>    mm/vmscan.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++--------
>>>    1 file changed, 46 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>> index 4835a7a..0742c45 100644
>>> --- a/mm/vmscan.c
>>> +++ b/mm/vmscan.c
>>> @@ -1825,13 +1825,21 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>>>    	enum lru_list lru;
>>>    	unsigned long nr_reclaimed = 0;
>>>    	unsigned long nr_to_reclaim = sc->nr_to_reclaim;
>>> +	unsigned long nr_anon_scantarget, nr_file_scantarget;
>>>    	struct blk_plug plug;
>>> +	bool scan_adjusted = false;
>>>
>>>    	get_scan_count(lruvec, sc, nr);
>>>
>>> +	/* Record the original scan target for proportional adjustments later */
>>> +	nr_file_scantarget = nr[LRU_INACTIVE_FILE] + nr[LRU_ACTIVE_FILE] + 1;
>>> +	nr_anon_scantarget = nr[LRU_INACTIVE_ANON] + nr[LRU_ACTIVE_ANON] + 1;
>>> +
>>
>> I'm sorry I couldn't understand the calc...
>>
>> Assume here
>>          nr_file_scantarget = 100
>>          nr_anon_file_target = 100.
>>
>
> I think you might have meant nr_anon_scantarget here instead of
> nr_anon_file_target.
>
>>
>>>    	blk_start_plug(&plug);
>>>    	while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
>>>    					nr[LRU_INACTIVE_FILE]) {
>>> +		unsigned long nr_anon, nr_file, percentage;
>>> +
>>>    		for_each_evictable_lru(lru) {
>>>    			if (nr[lru]) {
>>>    				nr_to_scan = min(nr[lru], SWAP_CLUSTER_MAX);
>>> @@ -1841,17 +1849,47 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>>>    							    lruvec, sc);
>>>    			}
>>>    		}
>>> +
>>> +		if (nr_reclaimed < nr_to_reclaim || scan_adjusted)
>>> +			continue;
>>> +
>>>    		/*
>>> -		 * On large memory systems, scan >> priority can become
>>> -		 * really large. This is fine for the starting priority;
>>> -		 * we want to put equal scanning pressure on each zone.
>>> -		 * However, if the VM has a harder time of freeing pages,
>>> -		 * with multiple processes reclaiming pages, the total
>>> -		 * freeing target can get unreasonably large.
>>> +		 * For global direct reclaim, reclaim only the number of pages
>>> +		 * requested. Less care is taken to scan proportionally as it
>>> +		 * is more important to minimise direct reclaim stall latency
>>> +		 * than it is to properly age the LRU lists.
>>>    		 */
>>> -		if (nr_reclaimed >= nr_to_reclaim &&
>>> -		    sc->priority < DEF_PRIORITY)
>>> +		if (global_reclaim(sc) && !current_is_kswapd())
>>>    			break;
>>> +
>>> +		/*
>>> +		 * For kswapd and memcg, reclaim at least the number of pages
>>> +		 * requested. Ensure that the anon and file LRUs shrink
>>> +		 * proportionally what was requested by get_scan_count(). We
>>> +		 * stop reclaiming one LRU and reduce the amount scanning
>>> +		 * proportional to the original scan target.
>>> +		 */
>>> +		nr_file = nr[LRU_INACTIVE_FILE] + nr[LRU_ACTIVE_FILE];
>>> +		nr_anon = nr[LRU_INACTIVE_ANON] + nr[LRU_ACTIVE_ANON];
>>> +
>>
>> Then, nr_file = 80, nr_anon=70.
>>
>
> As we scan evenly in SCAN_CLUSTER_MAX groups of pages, this wouldn't happen
> but for the purposes of discussions, lets assume it did.
>
>>
>>> +		if (nr_file > nr_anon) {
>>> +			lru = LRU_BASE;
>>> +			percentage = nr_anon * 100 / nr_anon_scantarget;
>>> +		} else {
>>> +			lru = LRU_FILE;
>>> +			percentage = nr_file * 100 / nr_file_scantarget;
>>> +		}
>>
>> the percentage will be 70.
>>
>
> Yes.
>
>>> +
>>> +		/* Stop scanning the smaller of the LRU */
>>> +		nr[lru] = 0;
>>> +		nr[lru + LRU_ACTIVE] = 0;
>>> +
>>
>> this will stop anon scan.
>>
>
> Yes.
>
>>> +		/* Reduce scanning of the other LRU proportionally */
>>> +		lru = (lru == LRU_FILE) ? LRU_BASE : LRU_FILE;
>>> +		nr[lru] = nr[lru] * percentage / 100;;
>>> +		nr[lru + LRU_ACTIVE] = nr[lru + LRU_ACTIVE] * percentage / 100;
>>> +
>>
>> finally, in the next iteration,
>>
>>                nr[file] = 80 * 0.7 = 56.
>>
>> After loop, anon-scan is 30 pages , file-scan is 76(20+56) pages..
>>
>
> Well spotted, this would indeed reclaim too many pages from the other
> LRU. I wanted to avoid recording the original scan targets as it's an
> extra 40 bytes on the stack but it's unavoidable.
>
>> I think the calc here should be
>>
>>     nr[lru] = nr_lru_scantarget * percentage / 100 - nr[lru]
>>
>>     Here, 80-70=10 more pages to scan..should be proportional.
>>
>
> nr[lru] at the end there is pages remaining to be scanned not pages
> scanned already.

yes.

> Did you mean something like this?
>
> nr[lru] = scantarget[lru] * percentage / 100 - (scantarget[lru] - nr[lru])
>

For clarification, this "percentage" means the ratio of the remaining scan target
of the other LRU. So the *scanned* percentage is "100 - percentage", right?

If I understand the changelog correctly, you'd like to keep

    scantarget[anon] : scantarget[file]
    == really_scanned_num[anon] : really_scanned_num[file]

even if we stop scanning in the middle of scantarget. And you introduced "percentage"
to make sure that both scan targets are completed in the same ratio.

So... the other LRU should scan scantarget[x] * (100 - percentage)/100 in total.

nr[lru] = scantarget[lru] * (100 - percentage)/100 - (scantarget[lru] - nr[lru])
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^   ^^^^^^^^^^^^^^^^^^^^^^^^^^^
            proportionally adjusted scan target         pages already scanned

        =  nr[lru] - scantarget[lru] * percentage/100.

This means we avoid scanning the same proportion of pages on this LRU that the
other LRU left unscanned.

> With care taken to ensure we do not underflow?

yes.

Regards,
-Kame


> Something like
>
>          unsigned long nr[NR_LRU_LISTS];
>          unsigned long targets[NR_LRU_LISTS];
>
> ...
>
> 	memcpy(targets, nr, sizeof(nr));
>
> ...
>
>          nr[lru] = targets[lru] * percentage / 100;
>          nr[lru] -= min(nr[lru], (targets[lru] - nr[lru]));
>
>          lru += LRU_ACTIVE;
>          nr[lru] = targets[lru] * percentage / 100;
>          nr[lru] -= min(nr[lru], (targets[lru] - nr[lru]));
>
> ?
>



^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 02/10] mm: vmscan: Obey proportional scanning requirements for kswapd
  2013-04-11  0:14         ` Kamezawa Hiroyuki
@ 2013-04-11  9:09           ` Mel Gorman
  -1 siblings, 0 replies; 83+ messages in thread
From: Mel Gorman @ 2013-04-11  9:09 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: Andrew Morton, Jiri Slaby, Valdis Kletnieks, Rik van Riel,
	Zlatko Calusic, Johannes Weiner, dormando, Satoru Moriya,
	Michal Hocko, Linux-MM, LKML

On Thu, Apr 11, 2013 at 09:14:19AM +0900, Kamezawa Hiroyuki wrote:
> >
> >nr[lru] at the end there is pages remaining to be scanned not pages
> >scanned already.
> 
> yes.
> 
> >Did you mean something like this?
> >
> >nr[lru] = scantarget[lru] * percentage / 100 - (scantarget[lru] - nr[lru])
> >
> 
> For clarification, this "percentage" means the ratio of remaining scan target of
> another LRU. So, *scanned* percentage is "100 - percentage", right ?
> 

Yes, correct.

> If I understand the changelog correctly, you'd like to keep
> 
>    scantarget[anon] : scantarget[file]
>    == really_scanned_num[anon] : really_scanned_num[file]
> 

Yes.

> even if we stop scanning in the middle of scantarget. And you introduced "percentage"
> to make sure that both scantarget should be done in the same ratio.
> 

Yes.

> So...another lru should scan  scantarget[x] * (100 - percentage)/100 in total.
> 
> nr[lru] = scantarget[lru] * (100 - percentage)/100 - (scantarget[lru] - nr[lru])
>           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^    ^^^^^^^^^^^^^^^^^^^^^^^^^
>              proportionally adjusted scan target        already scanned num
> 
>        =  nr[lru] - scantarget[lru] * percentage/100.
> 

Yes, you are completely correct. This preserves the original ratio of
anon:file scanning properly.
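
A minimal sketch of that adjustment, including the underflow clamp raised
earlier in the thread. The function and parameter names are hypothetical and
this is not taken from any posted patch; it only restates the formula above.

```c
#include <assert.h>

static unsigned long min_ul(unsigned long a, unsigned long b)
{
        return a < b ? a : b;
}

/*
 * target:     original scan target for this LRU list
 * remaining:  pages still to be scanned on this list (nr[lru])
 * percentage: share of the other LRU type's target that was left
 *             unscanned when it was stopped
 *
 * Returns nr[lru] - target * percentage / 100, clamped at zero.
 */
static unsigned long adjusted_remaining(unsigned long target,
                                        unsigned long remaining,
                                        unsigned long percentage)
{
        unsigned long skip;

        assert(remaining <= target && percentage <= 100);
        skip = target * percentage / 100;
        return remaining - min_ul(remaining, skip);
}
```

For the example in this thread (target 100, 80 file pages remaining,
percentage 70) this leaves 10 pages, so file ends at 20 + 10 = 30 pages
scanned, matching the 30 anon pages.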

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/10] Reduce system disruption due to kswapd V2
  2013-04-10 22:28       ` dormando
@ 2013-04-11  9:10         ` Mel Gorman
  -1 siblings, 0 replies; 83+ messages in thread
From: Mel Gorman @ 2013-04-11  9:10 UTC (permalink / raw)
  To: dormando
  Cc: Christoph Lameter, Andrew Morton, Jiri Slaby, Valdis Kletnieks,
	Rik van Riel, Zlatko Calusic, Johannes Weiner, Satoru Moriya,
	Michal Hocko, Linux-MM, LKML

On Wed, Apr 10, 2013 at 03:28:32PM -0700, dormando wrote:
> > On Tue, Apr 09, 2013 at 05:27:18PM +0000, Christoph Lameter wrote:
> > > One additional measure that may be useful is to make kswapd prefer one
> > > specific processor on a socket. Two benefits arise from that:
> > >
> > > 1. Better use of cpu caches and therefore higher speed, less
> > > serialization.
> > >
> >
> > Considering the volume of pages that kswapd can scan when it's active
> > I would expect that it trashes its cache anyway. The L1 cache would be
> > flushed after scanning struct pages for just a few MB of memory.
> >
> > > 2. Reduction of the disturbances to one processor.
> > >
> >
> > I've never checked it but I would have expected kswapd to stay on the
> > same processor for significant periods of time. Have you experienced
> > problems where kswapd bounces around on CPUs within a node causing
> > workload disruption?
> 
> When kswapd shares the same CPU as our main process it causes a measurable
> drop in response time (graphs show tiny spikes at the same time memory is
> freed). Would be nice to be able to ensure it runs on a different core
> than our latency sensitive processes at least. We can pin processes to
> subsets of cores but I don't think there's a way to keep kswapd from
> waking up on any of them?

I've never tried it myself but does the following work?

taskset -p MASK `pidof kswapd`

where MASK is a cpumask describing what CPUs kswapd can run on?
Obviously care should be taken to ensure that you bind kswapd to a CPU
running on the node kswapd cares about.
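
The same binding can be sketched programmatically with sched_setaffinity(2),
the system call that taskset itself uses. Treat this as an illustration under
assumptions: pin_to_cpu is a hypothetical helper, the kswapd threads are
normally named kswapd0, kswapd1 and so on (one per node), so the PID has to
be looked up per node, and whether the kernel accepts the new mask for a
kernel thread has not been verified here.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <sys/types.h>

/*
 * Hypothetical helper: restrict pid to a single CPU. A pid of 0 means
 * the calling thread. Returns 0 on success, -1 with errno set on error.
 */
static int pin_to_cpu(pid_t pid, int cpu)
{
        cpu_set_t mask;

        assert(cpu >= 0 && cpu < CPU_SETSIZE);
        CPU_ZERO(&mask);        /* start from an empty CPU mask */
        CPU_SET(cpu, &mask);    /* allow exactly one CPU */
        return sched_setaffinity(pid, sizeof(mask), &mask);
}
```

Calling pin_to_cpu() with the PID of a kswapd thread and a CPU on that
thread's node would correspond to the taskset invocation above with a
single-bit MASK; changing another task's affinity normally requires root.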

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/10] Reduce system disruption due to kswapd V2
  2013-04-11  9:10         ` Mel Gorman
@ 2013-04-11 20:13           ` Michal Hocko
  -1 siblings, 0 replies; 83+ messages in thread
From: Michal Hocko @ 2013-04-11 20:13 UTC (permalink / raw)
  To: Mel Gorman
  Cc: dormando, Christoph Lameter, Andrew Morton, Jiri Slaby,
	Valdis Kletnieks, Rik van Riel, Zlatko Calusic, Johannes Weiner,
	Satoru Moriya, Linux-MM, LKML

On Thu 11-04-13 10:10:44, Mel Gorman wrote:
> On Wed, Apr 10, 2013 at 03:28:32PM -0700, dormando wrote:
> > > On Tue, Apr 09, 2013 at 05:27:18PM +0000, Christoph Lameter wrote:
> > > > One additional measure that may be useful is to make kswapd prefer one
> > > > specific processor on a socket. Two benefits arise from that:
> > > >
> > > > 1. Better use of cpu caches and therefore higher speed, less
> > > > serialization.
> > > >
> > >
> > > Considering the volume of pages that kswapd can scan when it's active
> > > I would expect that it trashes its cache anyway. The L1 cache would be
> > > flushed after scanning struct pages for just a few MB of memory.
> > >
> > > > 2. Reduction of the disturbances to one processor.
> > > >
> > >
> > > I've never checked it but I would have expected kswapd to stay on the
> > > same processor for significant periods of time. Have you experienced
> > > problems where kswapd bounces around on CPUs within a node causing
> > > workload disruption?
> > 
> > When kswapd shares the same CPU as our main process it causes a measurable
> > drop in response time (graphs show tiny spikes at the same time memory is
> > freed). Would be nice to be able to ensure it runs on a different core
> > than our latency sensitive processes at least. We can pin processes to
> > subsets of cores but I don't think there's a way to keep kswapd from
> > waking up on any of them?
> 
> I've never tried it myself but does the following work?
> 
> taskset -p MASK `pidof kswapd`

I would use pgrep rather than pidof, which seems to need the whole process
name, but yes, this should work as kswapdN is not a PF_THREAD_BOUND kernel
thread.
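
A rough sketch of the same idea without pidof or pgrep, offered as an
illustration rather than a tested recipe. It matches the kswapdN threads
by their comm name under /proc and applies the affinity with Python's
os.sched_setaffinity(). The function names find_kswapd_pids and pin_kswapd
are hypothetical, and the CPU set in the closing comment is only an example.

```python
import os
import re

# kswapd threads are named kswapd<node>, e.g. kswapd0, which is why an
# exact match on "kswapd" finds nothing.
KSWAPD_NAME = re.compile(r"kswapd\d+")


def find_kswapd_pids(proc_root="/proc"):
    """Return the PIDs whose comm is exactly kswapd<N>."""
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                comm = f.read().strip()
        except OSError:
            continue  # the task exited while we were scanning
        if KSWAPD_NAME.fullmatch(comm):
            pids.append(int(entry))
    return sorted(pids)


def pin_kswapd(cpus, proc_root="/proc"):
    """Restrict every kswapd thread found to the given CPUs (needs root)."""
    for pid in find_kswapd_pids(proc_root):
        os.sched_setaffinity(pid, cpus)


# Example, run as root: pin_kswapd({2, 3})
```

On a multi-node machine each kswapdN would want its own CPU set so that it
stays on the node it serves; the sketch applies one set to all of them for
brevity.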

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/10] Reduce system disruption due to kswapd V2
  2013-04-09 11:06 ` Mel Gorman
@ 2013-04-11 20:55   ` Zlatko Calusic
  -1 siblings, 0 replies; 83+ messages in thread
From: Zlatko Calusic @ 2013-04-11 20:55 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Jiri Slaby, Valdis Kletnieks, Rik van Riel,
	Johannes Weiner, dormando, Satoru Moriya, Michal Hocko, Linux-MM,
	LKML

On 09.04.2013 13:06, Mel Gorman wrote:
> Posting V2 of this series got delayed due to trying to pin down an unrelated
> regression in 3.9-rc where interactive performance is shot to hell. That
> problem still has not been identified as it's resisting attempts to be
> reproducible by a script for the purposes of bisection.
>
> For those that looked at V1, the most important difference in this version
> is how patch 2 preserves the proportional scanning of anon/file LRUs.
>
> The series is against 3.9-rc6.
>
> Changelog since V1
> o Rename ZONE_DIRTY to ZONE_TAIL_LRU_DIRTY			(andi)
> o Reformat comment in shrink_page_list				(andi)
> o Clarify some comments						(dhillf)
> o Rework how the proportional scanning is preserved
> o Add PageReclaim check before kswapd starts writeback
> o Reset sc.nr_reclaimed on every full zone scan
>

I believe this is what you had in your tree as kswapd-v2r9 branch? If 
I'm right, then I had this series under test for about 2 weeks on two 
different machines (one server, one desktop). Here's what I've found:

- while the series looks overwhelming, with a lot of intricate changes 
(at least from my POV), it proved completely stable and robust. I had 
ZERO issues with it. I'd encourage everybody to test it, even in
production!

- I've just sent to you and to the linux-mm list a longish report of the
issue I have tracked over the last few months that is unfortunately NOT
solved by this patch series (although at first it looked like it would
be). Occasionally I still see large parts of memory freed for no good
reason, although I explained in the report how it happens. What I still
don't know is the real cause of the heavy imbalance in pagecache
utilization between the DMA32/NORMAL zones. It is seen only on 4GB RAM
machines, but I suppose that is quite a popular configuration these days.

- The only slightly negative thing I observed is that with the patch 
applied kswapd burns 10x - 20x more CPU. So instead of about 15 seconds, 
it has now spent more than 4 minutes on one particular machine with a 
quite steady load (after about 12 days of uptime). Admittedly, that's 
still nothing too alarming, but...

- I like VERY much how you cleaned up the code so it is more readable 
now. I'd like to see it in the Linus tree as soon as possible. Very good 
job there!

Regards,
-- 
Zlatko


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 03/10] mm: vmscan: Flatten kswapd priority loop
  2013-04-09 11:06   ` Mel Gorman
@ 2013-04-12  2:45     ` Rik van Riel
  -1 siblings, 0 replies; 83+ messages in thread
From: Rik van Riel @ 2013-04-12  2:45 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Jiri Slaby, Valdis Kletnieks, Zlatko Calusic,
	Johannes Weiner, dormando, Satoru Moriya, Michal Hocko, Linux-MM,
	LKML

On 04/09/2013 07:06 AM, Mel Gorman wrote:
> kswapd stops raising the scanning priority when at least SWAP_CLUSTER_MAX
> pages have been reclaimed or the pgdat is considered balanced. It then
> rechecks if it needs to restart at DEF_PRIORITY and whether high-order
> reclaim needs to be reset. This is not wrong per-se but it is confusing
> to follow and forcing kswapd to stay at DEF_PRIORITY may require several
> restarts before it has scanned enough pages to meet the high watermark even
> at 100% efficiency. This patch irons out the logic a bit by controlling
> when priority is raised and removing the "goto loop_again".
>
> This patch has kswapd raise the scanning priority until it is scanning
> enough pages that it could meet the high watermark in one shrink of the
> LRU lists if it is able to reclaim at 100% efficiency. It will not raise
> the scanning priority higher unless it is failing to reclaim any pages.
>
> To avoid infinite looping for high-order allocation requests kswapd will
> not reclaim for high-order allocations when it has reclaimed at least
> twice the number of pages as the allocation request.
>
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>

It looks like this patch could lead to near-infinite reclaiming when
higher-order reclaim is being done, but patch 4/10 should fix that...
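
A simplified model of the logic in the changelog above, not the kernel
code. It assumes that one pass scans roughly the LRU size shifted right by
the priority and that DEF_PRIORITY is 12; "raising" the priority therefore
means lowering the number. The function names and the sample figures are
made up for illustration, and the real patch additionally looks at whether
any pages were reclaimed at all.

```python
DEF_PRIORITY = 12  # kswapd's starting scan priority


def scan_target(lru_pages, priority):
    """Pages scanned in one pass: the LRU size shifted right by the priority."""
    return lru_pages >> priority


def priority_needed(lru_pages, pages_to_high_wmark):
    """Raise priority (lower the number) until one pass could close the gap
    to the high watermark at 100% reclaim efficiency."""
    priority = DEF_PRIORITY
    while priority > 0 and scan_target(lru_pages, priority) < pages_to_high_wmark:
        priority -= 1
    return priority


def drop_high_order(nr_reclaimed, order):
    """True once at least twice the requested allocation size was reclaimed,
    at which point kswapd stops reclaiming for the high-order request."""
    return order > 0 and nr_reclaimed >= (2 << order)


# Made-up numbers: a 1,000,000-page LRU that is 5,000 pages short of the
# high watermark, and an order-9 request after 1,024 pages were reclaimed.
print(priority_needed(1_000_000, 5_000))
print(drop_high_order(nr_reclaimed=1024, order=9))
```

The second helper is the cut-off that bounds the high-order case: once it
returns True the request is treated as order-0 from then on.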

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 04/10] mm: vmscan: Decide whether to compact the pgdat based on reclaim progress
  2013-04-09 11:06   ` Mel Gorman
@ 2013-04-12  2:46     ` Rik van Riel
  -1 siblings, 0 replies; 83+ messages in thread
From: Rik van Riel @ 2013-04-12  2:46 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Jiri Slaby, Valdis Kletnieks, Zlatko Calusic,
	Johannes Weiner, dormando, Satoru Moriya, Michal Hocko, Linux-MM,
	LKML

On 04/09/2013 07:06 AM, Mel Gorman wrote:
> In the past, kswapd made a decision on whether to compact memory after the
> pgdat was considered balanced. This more or less worked but it is late to
> make such a decision and does not fit well now that kswapd makes a decision
> whether to exit the zone scanning loop depending on reclaim progress.
>
> This patch will compact a pgdat if at least the requested number of pages
> were reclaimed from unbalanced zones for a given priority. If any zone is
> currently balanced, kswapd will not call compaction as it is expected the
> necessary pages are already available.
>
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>

This has the potential to increase kswapd cpu use, but probably at
the benefit of making reclaim run a little more smoothly. It should
help that compaction is only called when enough pages have been
freed.
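
The decision described in the changelog can be sketched as a single
predicate. This is an illustrative model only: the name should_compact and
its parameters are hypothetical, and the order-0 shortcut is an assumption
based on compaction only mattering for high-order requests.

```python
def should_compact(order, nr_reclaimed, nr_requested, zone_balanced):
    """Decide whether kswapd would compact the node after one pass.

    zone_balanced holds one boolean per zone in the node.
    """
    if order == 0:
        return False  # assumption: order-0 requests never need compaction
    if any(zone_balanced):
        return False  # a balanced zone should already hold the needed pages
    # Only compact when reclaim from the unbalanced zones freed enough.
    return nr_reclaimed >= nr_requested


# Made-up figures: an order-3 request, 64 pages reclaimed against a target
# of 32, and no zone balanced yet.
print(should_compact(3, 64, 32, [False, False]))
```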

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 06/10] mm: vmscan: Have kswapd writeback pages based on dirty pages encountered, not priority
  2013-04-09 11:07   ` Mel Gorman
@ 2013-04-12  2:51     ` Rik van Riel
  -1 siblings, 0 replies; 83+ messages in thread
From: Rik van Riel @ 2013-04-12  2:51 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Jiri Slaby, Valdis Kletnieks, Zlatko Calusic,
	Johannes Weiner, dormando, Satoru Moriya, Michal Hocko, Linux-MM,
	LKML

On 04/09/2013 07:07 AM, Mel Gorman wrote:
> Currently kswapd queues dirty pages for writeback if scanning at an elevated
> priority but the priority kswapd scans at is not related to the number
> of unqueued dirty pages encountered.  Since commit "mm: vmscan: Flatten
> kswapd priority loop", the priority is related to the size of the LRU and
> the zone watermark which is no indication as to whether kswapd should
> write pages or not.
>
> This patch tracks if an excessive number of unqueued dirty pages are being
> encountered at the end of the LRU.  If so, it indicates that dirty pages
> are being recycled before flusher threads can clean them and flags the
> zone so that kswapd will start writing pages until the zone is balanced.
>
> Signed-off-by: Mel Gorman <mgorman@suse.de>

I like your approach of essentially not writing out from
kswapd if we manage to reclaim well at DEF_PRIORITY, and
doing writeout more and more aggressively if we have to
reduce priority.

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 07/10] mm: vmscan: Block kswapd if it is encountering pages under writeback
  2013-04-09 11:07   ` Mel Gorman
@ 2013-04-12  2:54     ` Rik van Riel
  -1 siblings, 0 replies; 83+ messages in thread
From: Rik van Riel @ 2013-04-12  2:54 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Jiri Slaby, Valdis Kletnieks, Zlatko Calusic,
	Johannes Weiner, dormando, Satoru Moriya, Michal Hocko, Linux-MM,
	LKML

On 04/09/2013 07:07 AM, Mel Gorman wrote:
> Historically, kswapd used to congestion_wait() at higher priorities if it
> was not making forward progress. This made no sense as the failure to make
> progress could be completely independent of IO. It was later replaced by
> wait_iff_congested() and removed entirely by commit 258401a6 (mm: don't
> wait on congested zones in balance_pgdat()) as it was duplicating logic
> in shrink_inactive_list().
>
> This is problematic. If kswapd encounters many pages under writeback and
> it continues to scan until it reaches the high watermark then it will
> quickly skip over the pages under writeback and reclaim clean young
> pages or push applications out to swap.
>
> The use of wait_iff_congested() is not suited to kswapd as it will only
> stall if the underlying BDI is really congested or a direct reclaimer was
> unable to write to the underlying BDI. kswapd bypasses the BDI congestion
> as it sets PF_SWAPWRITE but even if this was taken into account then it
> would cause direct reclaimers to stall on writeback which is not desirable.
>
> This patch sets a ZONE_WRITEBACK flag if direct reclaim or kswapd is
> encountering too many pages under writeback. If this flag is set and
> kswapd encounters a PageReclaim page under writeback then it'll assume
> that the LRU lists are being recycled too quickly before IO can complete
> and block waiting for some IO to complete.
>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> Reviewed-by: Michal Hocko <mhocko@suse.cz>

Acked-by: Rik van Riel <riel@redhat.com>


-- 
All rights reversed

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 10/10] mm: vmscan: Move logic from balance_pgdat() to kswapd_shrink_zone()
  2013-04-09 11:07   ` Mel Gorman
@ 2013-04-12  2:56     ` Rik van Riel
  -1 siblings, 0 replies; 83+ messages in thread
From: Rik van Riel @ 2013-04-12  2:56 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Jiri Slaby, Valdis Kletnieks, Zlatko Calusic,
	Johannes Weiner, dormando, Satoru Moriya, Michal Hocko, Linux-MM,
	LKML

On 04/09/2013 07:07 AM, Mel Gorman wrote:
> balance_pgdat() is very long and some of the logic can and should
> be internal to kswapd_shrink_zone(). Move it so the flow of
> balance_pgdat() is marginally easier to follow.
>
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Acked-by: Rik van Riel <riel@redhat.com>


-- 
All rights reversed

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/10] Reduce system disruption due to kswapd V2
  2013-04-11 20:55   ` Zlatko Calusic
@ 2013-04-12 19:40     ` Mel Gorman
  -1 siblings, 0 replies; 83+ messages in thread
From: Mel Gorman @ 2013-04-12 19:40 UTC (permalink / raw)
  To: Zlatko Calusic
  Cc: Andrew Morton, Jiri Slaby, Valdis Kletnieks, Rik van Riel,
	Johannes Weiner, dormando, Satoru Moriya, Michal Hocko, Linux-MM,
	LKML

On Thu, Apr 11, 2013 at 10:55:13PM +0200, Zlatko Calusic wrote:
> On 09.04.2013 13:06, Mel Gorman wrote:
> <SNIP>
>
> - The only slightly negative thing I observed is that with the patch
> applied kswapd burns 10x - 20x more CPU. So instead of about 15
> seconds, it has now spent more than 4 minutes on one particular
> machine with a quite steady load (after about 12 days of uptime).
> Admittedly, that's still nothing too alarming, but...
> 

Would you happen to know what circumstances trigger the higher CPU
usage?

> - I like VERY much how you cleaned up the code so it is more
> readable now. I'd like to see it in the Linus tree as soon as
> possible. Very good job there!
> 

Thanks.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/10] Reduce system disruption due to kswapd V2
  2013-04-12 19:40     ` Mel Gorman
@ 2013-04-12 19:52       ` Mel Gorman
  -1 siblings, 0 replies; 83+ messages in thread
From: Mel Gorman @ 2013-04-12 19:52 UTC (permalink / raw)
  To: Zlatko Calusic
  Cc: Andrew Morton, Jiri Slaby, Valdis Kletnieks, Rik van Riel,
	Johannes Weiner, dormando, Satoru Moriya, Michal Hocko, Linux-MM,
	LKML

On Fri, Apr 12, 2013 at 08:40:04PM +0100, Mel Gorman wrote:
> On Thu, Apr 11, 2013 at 10:55:13PM +0200, Zlatko Calusic wrote:
> > On 09.04.2013 13:06, Mel Gorman wrote:
> > <SNIP>
> >
> > - The only slightly negative thing I observed is that with the patch
> > applied kswapd burns 10x - 20x more CPU. So instead of about 15
> > seconds, it has now spent more than 4 minutes on one particular
> > machine with a quite steady load (after about 12 days of uptime).
> > Admittedly, that's still nothing too alarming, but...
> > 
> 
> Would you happen to know what circumstances trigger the higher CPU
> usage?
> 

There is also a slight possibility it has been fixed in V3 by the
proportional scanning changes. In my own parallelio tests I got the
following kswapd CPU times from top.

3.9.0-rc6-vanilla           0:05.21
3.9.0-rc6-lessdisrupt-v2r11 0:07.44
3.9.0-rc6-lessdisrupt-v3r6  0:03.21

In v2, I did see slightly higher CPU usage but it was reduced in v3. For
a general set of page reclaim tests I got

3.9.0-rc6-vanilla-micro     3:09.51
3.9.0-rc6-lessdisrupt-v2r11 2:57.78
3.9.0-rc6-lessdisrupt-v3r6  1:10.05

In that case, v2 was comparable so unfortunately I was never seeing the
10-20x more CPU that you got.
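
To put the two tables on a common scale, here is a minimal sketch that
converts the top-style M:SS.ss values above into seconds and reports each
kernel's change relative to the vanilla baseline. The helper names and the
shortened kernel labels are illustrative only; they are not part of any
tool used in the tests.

```python
def to_seconds(cpu_time):
    """Convert a top-style 'M:SS.ss' CPU time string into seconds."""
    minutes, seconds = cpu_time.split(":")
    return int(minutes) * 60 + float(seconds)


def change_vs_baseline(times, baseline):
    """Percentage change of every entry relative to the baseline entry."""
    base = to_seconds(times[baseline])
    return {
        name: 100.0 * (to_seconds(value) - base) / base
        for name, value in times.items()
        if name != baseline
    }


# Values copied from the two tables above; labels shortened for readability.
PARALLELIO = {"vanilla": "0:05.21", "v2r11": "0:07.44", "v3r6": "0:03.21"}
RECLAIM = {"vanilla": "3:09.51", "v2r11": "2:57.78", "v3r6": "1:10.05"}

if __name__ == "__main__":
    for label, times in (("parallelio", PARALLELIO), ("reclaim", RECLAIM)):
        for name, pct in change_vs_baseline(times, "vanilla").items():
            print(f"{label:10s} {name:6s} {pct:+6.1f}%")
```

With these inputs, v2 comes out roughly 43% above vanilla for parallelio and
about 6% below it for the reclaim tests, while v3 is below vanilla in both
(about 38% and 63%), which is a long way from a 10-20x increase.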

Thanks.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/10] Reduce system disruption due to kswapd V2
  2013-04-12 19:40     ` Mel Gorman
@ 2013-04-12 20:07       ` Zlatko Calusic
  -1 siblings, 0 replies; 83+ messages in thread
From: Zlatko Calusic @ 2013-04-12 20:07 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Jiri Slaby, Valdis Kletnieks, Rik van Riel,
	Johannes Weiner, dormando, Satoru Moriya, Michal Hocko, Linux-MM,
	LKML

On 12.04.2013 21:40, Mel Gorman wrote:
> On Thu, Apr 11, 2013 at 10:55:13PM +0200, Zlatko Calusic wrote:
>> On 09.04.2013 13:06, Mel Gorman wrote:
>> <SNIP>
>>
>> - The only slightly negative thing I observed is that with the patch
>> applied kswapd burns 10x - 20x more CPU. So instead of about 15
>> seconds, it has now spent more than 4 minutes on one particular
>> machine with a quite steady load (after about 12 days of uptime).
>> Admittedly, that's still nothing too alarming, but...
>>
>
> Would you happen to know what circumstances trigger the higher CPU
> usage?
>

Really nothing special. The server is lightly loaded, but it does enough 
reading from the disk so that pagecache is mostly populated and page 
reclaiming is active. So, kswapd is no doubt using CPU time gradually, 
nothing extraordinary.

When I sent my reply yesterday, the server uptime was 12 days, and 
kswapd had accumulated 4:28 CPU time. Now, approx 24 hours later (13 
days uptime):

root        23  0.0  0.0      0     0 ?        S    Mar30   4:52 [kswapd0]

I will apply your v3 series soon and see if there's any improvement wrt 
CPU usage, although as I said I don't see that as a big issue. It's 
still only 0.013% of available CPU resources (dual core CPU).
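
The 0.013% figure can be reproduced with a short calculation. This is a
sketch that assumes the inputs are the ones quoted above: 4:52 of kswapd CPU
time, roughly 13 days of uptime, and two cores. The function name is
illustrative.

```python
def cpu_share_percent(cpu_seconds, uptime_days, cores):
    """CPU time used as a percentage of all CPU time available since boot."""
    available_seconds = uptime_days * 24 * 3600 * cores
    return 100.0 * cpu_seconds / available_seconds


# 4:52 of accumulated kswapd0 CPU time, taken from the ps line above.
KSWAPD_CPU_SECONDS = 4 * 60 + 52

if __name__ == "__main__":
    share = cpu_share_percent(KSWAPD_CPU_SECONDS, uptime_days=13, cores=2)
    print(f"kswapd share of available CPU: {share:.3f}%")
```

The uptime is only known to the nearest day here, so the result should be
read as an approximation.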

-- 
Zlatko


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/10] Reduce system disruption due to kswapd V2
  2013-04-12 20:07       ` Zlatko Calusic
@ 2013-04-12 20:41         ` Mel Gorman
  -1 siblings, 0 replies; 83+ messages in thread
From: Mel Gorman @ 2013-04-12 20:41 UTC (permalink / raw)
  To: Zlatko Calusic
  Cc: Andrew Morton, Jiri Slaby, Valdis Kletnieks, Rik van Riel,
	Johannes Weiner, dormando, Satoru Moriya, Michal Hocko, Linux-MM,
	LKML

On Fri, Apr 12, 2013 at 10:07:54PM +0200, Zlatko Calusic wrote:
> On 12.04.2013 21:40, Mel Gorman wrote:
> >On Thu, Apr 11, 2013 at 10:55:13PM +0200, Zlatko Calusic wrote:
> >>On 09.04.2013 13:06, Mel Gorman wrote:
> >><SNIP>
> >>
> >>- The only slightly negative thing I observed is that with the patch
> >>applied kswapd burns 10x - 20x more CPU. So instead of about 15
> >>seconds, it has now spent more than 4 minutes on one particular
> >>machine with a quite steady load (after about 12 days of uptime).
> >>Admittedly, that's still nothing too alarming, but...
> >>
> >
> >Would you happen to know what circumstances trigger the higher CPU
> >usage?
> >
> 
> Really nothing special. The server is lightly loaded, but it does
> enough reading from the disk so that pagecache is mostly populated
> and page reclaiming is active. So, kswapd is no doubt using CPU time
> gradually, nothing extraordinary.
> 
> When I sent my reply yesterday, the server uptime was 12 days, and
> kswapd had accumulated 4:28 CPU time. Now, approx 24 hours later (13
> days uptime):
> 
> root        23  0.0  0.0      0     0 ?        S    Mar30   4:52 [kswapd0]
> 

Ok, that's not too crazy.

> I will apply your v3 series soon and see if there's any improvement
> wrt CPU usage, although as I said I don't see that as a big issue.
> It's still only 0.013% of available CPU resources (dual core CPU).
> 

Excellent, thanks very much for testing and reporting back. I read your
mail on the zone balancing and FWIW I would not have expected this series
to have any impact on it. I do not have a good theory yet as to what the
problem is but I'll give it some thought and see what I come up with. I'll
be at LSF/MM next week so it might take me a while.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/10] Reduce system disruption due to kswapd V2
  2013-04-12 20:41         ` Mel Gorman
@ 2013-04-12 21:14           ` Zlatko Calusic
  -1 siblings, 0 replies; 83+ messages in thread
From: Zlatko Calusic @ 2013-04-12 21:14 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Jiri Slaby, Valdis Kletnieks, Rik van Riel,
	Johannes Weiner, dormando, Michal Hocko, Linux-MM, LKML

On 12.04.2013 22:41, Mel Gorman wrote:
> On Fri, Apr 12, 2013 at 10:07:54PM +0200, Zlatko Calusic wrote:
>> On 12.04.2013 21:40, Mel Gorman wrote:
>>> On Thu, Apr 11, 2013 at 10:55:13PM +0200, Zlatko Calusic wrote:
>>>> On 09.04.2013 13:06, Mel Gorman wrote:
>>>> <SNIP>
>>>>
>>>> - The only slightly negative thing I observed is that with the patch
>>>> applied kswapd burns 10x - 20x more CPU. So instead of about 15
>>>> seconds, it has now spent more than 4 minutes on one particular
>>>> machine with a quite steady load (after about 12 days of uptime).
>>>> Admittedly, that's still nothing too alarming, but...
>>>>
>>>
>>> Would you happen to know what circumstances trigger the higher CPU
>>> usage?
>>>
>>
>> Really nothing special. The server is lightly loaded, but it does
>> enough reading from the disk so that pagecache is mostly populated
>> and page reclaiming is active. So, kswapd is no doubt using CPU time
>> gradually, nothing extraordinary.
>>
>> When I sent my reply yesterday, the server uptime was 12 days, and
>> kswapd had accumulated 4:28 CPU time. Now, approx 24 hours later (13
>> days uptime):
>>
>> root        23  0.0  0.0      0     0 ?        S    Mar30   4:52 [kswapd0]
>>
>
> Ok, that's not too crazy.
>

Certainly.

>> I will apply your v3 series soon and see if there's any improvement
>> wrt CPU usage, although as I said I don't see that as a big issue.
>> It's still only 0.013% of available CPU resources (dual core CPU).
>>
>
> Excellent, thanks very much for testing and reporting back.

The pleasure is all mine. I really admire your work.

> I read your
> mail on the zone balancing and FWIW I would not have expected this series
> to have any impact on it.

Good to know. At first I thought that your changes to the anon/file 
balance might make a difference, but evidently not.

> I do not have a good theory yet as to what the
> problem is but I'll give it some thought and see what I come up with. I'll
> be at LSF/MM next week so it might take me a while.
>

Yeah, that's definitely not something to be solved quickly, let it wait 
until you have more time, and I'll also continue to test various things 
after a slight break.

It's a quite subtle issue, although the solution will probably be simple 
and obvious. But, I also think it'll take a lot of time to find it. I 
tried to develop an artificial test case to speed up debugging, but 
failed horribly. It seems that the issue can be seen only on real workloads.

-- 
Zlatko


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/10] Reduce system disruption due to kswapd V2
  2013-04-12 20:07       ` Zlatko Calusic
  (?)
  (?)
@ 2013-04-22  6:37       ` Zlatko Calusic
  2013-04-22  6:43           ` Simon Jeons
  -1 siblings, 1 reply; 83+ messages in thread
From: Zlatko Calusic @ 2013-04-22  6:37 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Jiri Slaby, Valdis Kletnieks, Rik van Riel,
	Johannes Weiner, dormando, Satoru Moriya, Michal Hocko, Linux-MM,
	LKML

[-- Attachment #1: Type: text/plain, Size: 1663 bytes --]

On 12.04.2013 22:07, Zlatko Calusic wrote:
> On 12.04.2013 21:40, Mel Gorman wrote:
>> On Thu, Apr 11, 2013 at 10:55:13PM +0200, Zlatko Calusic wrote:
>>> On 09.04.2013 13:06, Mel Gorman wrote:
>>> <SNIP>
>>>
>>> - The only slightly negative thing I observed is that with the patch
>>> applied kswapd burns 10x - 20x more CPU. So instead of about 15
>>> seconds, it has now spent more than 4 minutes on one particular
>>> machine with a quite steady load (after about 12 days of uptime).
>>> Admittedly, that's still nothing too alarming, but...
>>>
>>
>> Would you happen to know what circumstances trigger the higher CPU
>> usage?
>>
>
> Really nothing special. The server is lightly loaded, but it does enough
> reading from the disk so that pagecache is mostly populated and page
> reclaiming is active. So, kswapd is no doubt using CPU time gradually,
> nothing extraordinary.
>
> When I sent my reply yesterday, the server uptime was 12 days, and
> kswapd had accumulated 4:28 CPU time. Now, approx 24 hours later (13
> days uptime):
>
> root        23  0.0  0.0      0     0 ?        S    Mar30   4:52 [kswapd0]
>
> I will apply your v3 series soon and see if there's any improvement wrt
> CPU usage, although as I said I don't see that as a big issue. It's
> still only 0.013% of available CPU resources (dual core CPU).
>

JFTR, v3 kswapd uses about 15% more CPU time than v2. 2:50 kswapd CPU 
time after 6 days 14h uptime.
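
For reference, the "about 15%" comparison is consistent with normalising
both measurements to CPU seconds per day of uptime. The sketch below assumes
the v2 numbers are the ones reported earlier in this thread (4:52 after
about 13 days) and that the v3 numbers are 2:50 after 6 days 14 hours; the
function name is illustrative.

```python
def cpu_seconds_per_day(cpu_seconds, uptime_days):
    """Average kswapd CPU seconds consumed per day of uptime."""
    return cpu_seconds / uptime_days


# v2: 4:52 of CPU time after roughly 13 days of uptime.
V2_RATE = cpu_seconds_per_day(4 * 60 + 52, 13.0)
# v3: 2:50 of CPU time after 6 days and 14 hours of uptime.
V3_RATE = cpu_seconds_per_day(2 * 60 + 50, 6 + 14 / 24)

INCREASE_PERCENT = 100.0 * (V3_RATE / V2_RATE - 1.0)

if __name__ == "__main__":
    print(f"v2: {V2_RATE:.1f} s/day, v3: {V3_RATE:.1f} s/day")
    print(f"v3 relative to v2: {INCREASE_PERCENT:+.1f}%")
```

Because the v2 uptime is only known to the nearest day, the percentage is
approximate.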

And find attached another debugging graph that shows how ANON pages are 
privileged in the ZONE_NORMAL on a 4GB machine. Take notice that the 
number of pages in the ZONE_DMA32 is scaled (/5) to fit the graph nicely.

-- 
Zlatko

[-- Attachment #2: memdebug-daily.png --]
[-- Type: image/png, Size: 17239 bytes --]

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/10] Reduce system disruption due to kswapd V2
  2013-04-22  6:37       ` Zlatko Calusic
@ 2013-04-22  6:43           ` Simon Jeons
  0 siblings, 0 replies; 83+ messages in thread
From: Simon Jeons @ 2013-04-22  6:43 UTC (permalink / raw)
  To: Zlatko Calusic
  Cc: Mel Gorman, Andrew Morton, Jiri Slaby, Valdis Kletnieks,
	Rik van Riel, Johannes Weiner, dormando, Satoru Moriya,
	Michal Hocko, Linux-MM, LKML

Hi Zlatko,
On 04/22/2013 02:37 PM, Zlatko Calusic wrote:
> On 12.04.2013 22:07, Zlatko Calusic wrote:
>> On 12.04.2013 21:40, Mel Gorman wrote:
>>> On Thu, Apr 11, 2013 at 10:55:13PM +0200, Zlatko Calusic wrote:
>>>> On 09.04.2013 13:06, Mel Gorman wrote:
>>>> <SNIP>
>>>>
>>>> - The only slightly negative thing I observed is that with the patch
>>>> applied kswapd burns 10x - 20x more CPU. So instead of about 15
>>>> seconds, it has now spent more than 4 minutes on one particular
>>>> machine with a quite steady load (after about 12 days of uptime).
>>>> Admittedly, that's still nothing too alarming, but...
>>>>
>>>
>>> Would you happen to know what circumstances trigger the higher CPU
>>> usage?
>>>
>>
>> Really nothing special. The server is lightly loaded, but it does enough
>> reading from the disk so that pagecache is mostly populated and page
>> reclaiming is active. So, kswapd is no doubt using CPU time gradually,
>> nothing extraordinary.
>>
>> When I sent my reply yesterday, the server uptime was 12 days, and
>> kswapd had accumulated 4:28 CPU time. Now, approx 24 hours later (13
>> days uptime):
>>
>> root        23  0.0  0.0      0     0 ?        S    Mar30   4:52 [kswapd0]
>>
>> I will apply your v3 series soon and see if there's any improvement wrt
>> CPU usage, although as I said I don't see that as a big issue. It's
>> still only 0.013% of available CPU resources (dual core CPU).
>>
>
> JFTR, v3 kswapd uses about 15% more CPU time than v2. 2:50 kswapd CPU 
> time after 6 days 14h uptime.
>
> And find attached another debugging graph that shows how ANON pages 
> are privileged in the ZONE_NORMAL on a 4GB machine. Take notice that 
> the number of pages in the ZONE_DMA32 is scaled (/5) to fit the graph 
> nicely.
>

Could you tell me how you draw this picture?


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/10] Reduce system disruption due to kswapd V2
  2013-04-22  6:43           ` Simon Jeons
@ 2013-04-22  6:54             ` Zlatko Calusic
  -1 siblings, 0 replies; 83+ messages in thread
From: Zlatko Calusic @ 2013-04-22  6:54 UTC (permalink / raw)
  To: Simon Jeons
  Cc: Mel Gorman, Andrew Morton, Jiri Slaby, Valdis Kletnieks,
	Rik van Riel, Johannes Weiner, dormando, Satoru Moriya,
	Michal Hocko, Linux-MM, LKML

On 22.04.2013 08:43, Simon Jeons wrote:
> Hi Zlatko,
> On 04/22/2013 02:37 PM, Zlatko Calusic wrote:
>> On 12.04.2013 22:07, Zlatko Calusic wrote:
>>> On 12.04.2013 21:40, Mel Gorman wrote:
>>>> On Thu, Apr 11, 2013 at 10:55:13PM +0200, Zlatko Calusic wrote:
>>>>> On 09.04.2013 13:06, Mel Gorman wrote:
>>>>> <SNIP>
>>>>>
>>>>> - The only slightly negative thing I observed is that with the patch
>>>>> applied kswapd burns 10x - 20x more CPU. So instead of about 15
>>>>> seconds, it has now spent more than 4 minutes on one particular
>>>>> machine with a quite steady load (after about 12 days of uptime).
>>>>> Admittedly, that's still nothing too alarming, but...
>>>>>
>>>>
>>>> Would you happen to know what circumstances trigger the higher CPU
>>>> usage?
>>>>
>>>
>>> Really nothing special. The server is lightly loaded, but it does enough
>>> reading from the disk so that pagecache is mostly populated and page
>>> reclaiming is active. So, kswapd is no doubt using CPU time gradually,
>>> nothing extraordinary.
>>>
>>> When I sent my reply yesterday, the server uptime was 12 days, and
>>> kswapd had accumulated 4:28 CPU time. Now, approx 24 hours later (13
>>> days uptime):
>>>
>>> root        23  0.0  0.0      0     0 ?        S    Mar30   4:52 [kswapd0]
>>>
>>> I will apply your v3 series soon and see if there's any improvement wrt
>>> CPU usage, although as I said I don't see that as a big issue. It's
>>> still only 0.013% of available CPU resources (dual core CPU).
>>>
>>
>> JFTR, v3 kswapd uses about 15% more CPU time than v2. 2:50 kswapd CPU
>> time after 6 days 14h uptime.
>>
>> And find attached another debugging graph that shows how ANON pages
>> are privileged in the ZONE_NORMAL on a 4GB machine. Take notice that
>> the number of pages in the ZONE_DMA32 is scaled (/5) to fit the graph
>> nicely.
>>
>
> Could you tell me how you draw this picture?
>

It's a home made server monitoring system. I just added the code needed 
to graph the size of active + inactive LRU lists, per zone and per type. 
Check out http://oss.oetiker.ch/rrdtool/
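
The monitoring code itself is not shown in this thread. As a rough
illustration of the data collection step only, the sketch below sums the
active and inactive LRU counters per zone and per type from text in the
style of /proc/zoneinfo. It assumes the counters are named
nr_active_anon, nr_inactive_anon, nr_active_file and nr_inactive_file;
the field names and layout may differ between kernel versions, and the
sample values are made up.

```python
import re
from collections import defaultdict

ZONE_HEADER = re.compile(r"^Node\s+(\d+),\s+zone\s+(\S+)")
LRU_FIELD = re.compile(r"^\s*nr_(inactive|active)_(anon|file)\s+(\d+)")


def lru_sizes(zoneinfo_text):
    """Return {(node, zone): {"anon": pages, "file": pages}}.

    Each value is the sum of the active and inactive list sizes.
    """
    sizes = defaultdict(lambda: {"anon": 0, "file": 0})
    current = None
    for line in zoneinfo_text.splitlines():
        header = ZONE_HEADER.match(line)
        if header:
            current = (int(header.group(1)), header.group(2))
            continue
        field = LRU_FIELD.match(line)
        if field and current is not None:
            sizes[current][field.group(2)] += int(field.group(3))
    return dict(sizes)


# Hypothetical sample input; on a real system the text would be read
# from /proc/zoneinfo instead.
SAMPLE = """\
Node 0, zone    DMA32
  pages free     12000
    nr_inactive_anon 1000
    nr_active_anon 2000
    nr_inactive_file 300000
    nr_active_file 200000
Node 0, zone   Normal
  pages free     5000
    nr_inactive_anon 40000
    nr_active_anon 60000
    nr_inactive_file 20000
    nr_active_file 10000
"""

if __name__ == "__main__":
    for (node, zone), pages in sorted(lru_sizes(SAMPLE).items()):
        print(f"node {node} {zone:8s} anon={pages['anon']} file={pages['file']}")
```

Values sampled this way at a fixed interval could then be stored and
graphed with RRDtool.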

-- 
Zlatko


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/10] Reduce system disruption due to kswapd V2
  2013-04-22  6:54             ` Zlatko Calusic
@ 2013-04-22  7:12               ` Simon Jeons
  -1 siblings, 0 replies; 83+ messages in thread
From: Simon Jeons @ 2013-04-22  7:12 UTC (permalink / raw)
  To: Zlatko Calusic
  Cc: Mel Gorman, Andrew Morton, Jiri Slaby, Valdis Kletnieks,
	Rik van Riel, Johannes Weiner, dormando, Satoru Moriya,
	Michal Hocko, Linux-MM, LKML

Hi Zlatko,
On 04/22/2013 02:54 PM, Zlatko Calusic wrote:
> On 22.04.2013 08:43, Simon Jeons wrote:
>> Hi Zlatko,
>> On 04/22/2013 02:37 PM, Zlatko Calusic wrote:
>>> On 12.04.2013 22:07, Zlatko Calusic wrote:
>>>> On 12.04.2013 21:40, Mel Gorman wrote:
>>>>> On Thu, Apr 11, 2013 at 10:55:13PM +0200, Zlatko Calusic wrote:
>>>>>> On 09.04.2013 13:06, Mel Gorman wrote:
>>>>>> <SNIP>
>>>>>>
>>>>>> - The only slightly negative thing I observed is that with the patch
>>>>>> applied kswapd burns 10x - 20x more CPU. So instead of about 15
>>>>>> seconds, it has now spent more than 4 minutes on one particular
>>>>>> machine with a quite steady load (after about 12 days of uptime).
>>>>>> Admittedly, that's still nothing too alarming, but...
>>>>>>
>>>>>
>>>>> Would you happen to know what circumstances trigger the higher CPU
>>>>> usage?
>>>>>
>>>>
>>>> Really nothing special. The server is lightly loaded, but it does enough
>>>> reading from the disk so that pagecache is mostly populated and page
>>>> reclaiming is active. So, kswapd is no doubt using CPU time gradually,
>>>> nothing extraordinary.
>>>>
>>>> When I sent my reply yesterday, the server uptime was 12 days, and
>>>> kswapd had accumulated 4:28 CPU time. Now, approx 24 hours later (13
>>>> days uptime):
>>>>
>>>> root        23  0.0  0.0      0     0 ?        S    Mar30   4:52 [kswapd0]
>>>>
>>>> I will apply your v3 series soon and see if there's any improvement wrt
>>>> CPU usage, although as I said I don't see that as a big issue. It's
>>>> still only 0.013% of available CPU resources (dual core CPU).
>>>>
>>>
>>> JFTR, v3 kswapd uses about 15% more CPU time than v2. 2:50 kswapd CPU
>>> time after 6 days 14h uptime.
>>>
>>> And find attached another debugging graph that shows how ANON pages
>>> are privileged in the ZONE_NORMAL on a 4GB machine. Take notice that
>>> the number of pages in the ZONE_DMA32 is scaled (/5) to fit the graph
>>> nicely.
>>>
>>
>> Could you tell me how you draw this picture?
>>
>
> It's a home made server monitoring system. I just added the code 
> needed to graph the size of active + inactive LRU lists, per zone and 
> per type. Check out http://oss.oetiker.ch/rrdtool/

Thanks Zlatko, I installed it successfully. Could you tell me which options
you use?



^ permalink raw reply	[flat|nested] 83+ messages in thread


Thread overview: 83+ messages
2013-04-09 11:06 [PATCH 0/10] Reduce system disruption due to kswapd V2 Mel Gorman
2013-04-09 11:06 ` Mel Gorman
2013-04-09 11:06 ` [PATCH 01/10] mm: vmscan: Limit the number of pages kswapd reclaims at each priority Mel Gorman
2013-04-09 11:06   ` Mel Gorman
2013-04-09 13:27   ` Michal Hocko
2013-04-09 13:27     ` Michal Hocko
2013-04-10  6:47   ` Kamezawa Hiroyuki
2013-04-10  6:47     ` Kamezawa Hiroyuki
2013-04-09 11:06 ` [PATCH 02/10] mm: vmscan: Obey proportional scanning requirements for kswapd Mel Gorman
2013-04-09 11:06   ` Mel Gorman
2013-04-10  7:16   ` Kamezawa Hiroyuki
2013-04-10  7:16     ` Kamezawa Hiroyuki
2013-04-10 14:08     ` Mel Gorman
2013-04-10 14:08       ` Mel Gorman
2013-04-11  0:14       ` Kamezawa Hiroyuki
2013-04-11  0:14         ` Kamezawa Hiroyuki
2013-04-11  9:09         ` Mel Gorman
2013-04-11  9:09           ` Mel Gorman
2013-04-09 11:06 ` [PATCH 03/10] mm: vmscan: Flatten kswapd priority loop Mel Gorman
2013-04-09 11:06   ` Mel Gorman
2013-04-10  7:47   ` Kamezawa Hiroyuki
2013-04-10  7:47     ` Kamezawa Hiroyuki
2013-04-10 13:29     ` Mel Gorman
2013-04-10 13:29       ` Mel Gorman
2013-04-12  2:45   ` Rik van Riel
2013-04-12  2:45     ` Rik van Riel
2013-04-09 11:06 ` [PATCH 04/10] mm: vmscan: Decide whether to compact the pgdat based on reclaim progress Mel Gorman
2013-04-09 11:06   ` Mel Gorman
2013-04-10  8:05   ` Kamezawa Hiroyuki
2013-04-10  8:05     ` Kamezawa Hiroyuki
2013-04-10 13:57     ` Mel Gorman
2013-04-10 13:57       ` Mel Gorman
2013-04-12  2:46   ` Rik van Riel
2013-04-12  2:46     ` Rik van Riel
2013-04-09 11:07 ` [PATCH 05/10] mm: vmscan: Do not allow kswapd to scan at maximum priority Mel Gorman
2013-04-09 11:07   ` Mel Gorman
2013-04-09 11:07 ` [PATCH 06/10] mm: vmscan: Have kswapd writeback pages based on dirty pages encountered, not priority Mel Gorman
2013-04-09 11:07   ` Mel Gorman
2013-04-12  2:51   ` Rik van Riel
2013-04-12  2:51     ` Rik van Riel
2013-04-09 11:07 ` [PATCH 07/10] mm: vmscan: Block kswapd if it is encountering pages under writeback Mel Gorman
2013-04-09 11:07   ` Mel Gorman
2013-04-12  2:54   ` Rik van Riel
2013-04-12  2:54     ` Rik van Riel
2013-04-09 11:07 ` [PATCH 08/10] mm: vmscan: Have kswapd shrink slab only once per priority Mel Gorman
2013-04-09 11:07   ` Mel Gorman
2013-04-09 11:07 ` [PATCH 09/10] mm: vmscan: Check if kswapd should writepage once per pgdat scan Mel Gorman
2013-04-09 11:07   ` Mel Gorman
2013-04-09 11:07 ` [PATCH 10/10] mm: vmscan: Move logic from balance_pgdat() to kswapd_shrink_zone() Mel Gorman
2013-04-09 11:07   ` Mel Gorman
2013-04-12  2:56   ` Rik van Riel
2013-04-12  2:56     ` Rik van Riel
2013-04-09 17:27 ` [PATCH 0/10] Reduce system disruption due to kswapd V2 Christoph Lameter
2013-04-09 17:27   ` Christoph Lameter
2013-04-10 14:14   ` Mel Gorman
2013-04-10 14:14     ` Mel Gorman
2013-04-10 22:28     ` dormando
2013-04-10 22:28       ` dormando
2013-04-10 23:46       ` KOSAKI Motohiro
2013-04-10 23:46         ` KOSAKI Motohiro
2013-04-11  9:10       ` Mel Gorman
2013-04-11  9:10         ` Mel Gorman
2013-04-11 20:13         ` Michal Hocko
2013-04-11 20:13           ` Michal Hocko
2013-04-11 20:55 ` Zlatko Calusic
2013-04-11 20:55   ` Zlatko Calusic
2013-04-12 19:40   ` Mel Gorman
2013-04-12 19:40     ` Mel Gorman
2013-04-12 19:52     ` Mel Gorman
2013-04-12 19:52       ` Mel Gorman
2013-04-12 20:07     ` Zlatko Calusic
2013-04-12 20:07       ` Zlatko Calusic
2013-04-12 20:41       ` Mel Gorman
2013-04-12 20:41         ` Mel Gorman
2013-04-12 21:14         ` Zlatko Calusic
2013-04-12 21:14           ` Zlatko Calusic
2013-04-22  6:37       ` Zlatko Calusic
2013-04-22  6:43         ` Simon Jeons
2013-04-22  6:43           ` Simon Jeons
2013-04-22  6:54           ` Zlatko Calusic
2013-04-22  6:54             ` Zlatko Calusic
2013-04-22  7:12             ` Simon Jeons
2013-04-22  7:12               ` Simon Jeons
