linux-kernel.vger.kernel.org archive mirror
* [RFC 0/5] Consider higher small zone and mmaped-pages stream
@ 2012-08-22  7:15 Minchan Kim
  2012-08-22  7:15 ` [PATCH 1/5] vmscan: Fix obsolete comment of balance_pgdat Minchan Kim
                   ` (4 more replies)
  0 siblings, 5 replies; 8+ messages in thread
From: Minchan Kim @ 2012-08-22  7:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Rik van Riel, Johannes Weiner, linux-mm,
	linux-kernel, Minchan Kim

This patchset solves two problems.

1. small higher memory zone - [2] and [3]
2. mmaped-pages stream reclaim efficiency - [5]

[1] and [4] are minor fixes which aren't related to
this series, so they could be applied separately.

I wrote down each problem in the corresponding patch description.
Please look at each patch.

The test environment is as follows:

1. Intel(R) Core(TM)2 Duo CPU
2. 2G RAM and 400M movable zone
3. Test program:
   Hannes's mapped-file-stream.c with 78 processes per 1G,
   executed 10 times (a rough sketch of such a workload is below).
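
For reference, a minimal sketch of what such a mapped-file streaming
workload might look like; this is only an assumption about the test's
behaviour, not Hannes's actual mapped-file-stream.c:

/* Hypothetical mapped-file streaming workload (illustrative only). */
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	struct stat st;
	unsigned char *p, sum = 0;
	long i;
	int fd;

	if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0)
		return 1;
	if (fstat(fd, &st) < 0)
		return 1;

	p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
	if (p == MAP_FAILED)
		return 1;

	/* Touch every page exactly once (assuming 4K pages): a pure stream. */
	for (i = 0; i < st.st_size; i += 4096)
		sum += p[i];

	munmap(p, st.st_size);
	close(fd);
	return sum;
}

Dozens of such processes streaming through files larger than RAM fault
every page in through a mapping and never touch it again, which is the
referenced-pte pattern that patch [5] targets.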

Thanks.

Minchan Kim (5):
  [1] vmscan: Fix obsolete comment of balance_pgdat
  [2] vmscan: sleep only if backingdev is congested
  [3] vmscan: prevent excessive pageout of kswapd
  [4] vmscan: get rid of unnecessary nr_dirty ret variable
  [5] vmscan: accelerate to reclaim mapped-pages stream

 include/linux/mmzone.h |   23 +++++++++++++++
 mm/vmscan.c            |   77 ++++++++++++++++++++++++++++++++++++++----------
 2 files changed, 85 insertions(+), 15 deletions(-)

-- 
1.7.9.5


^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH 1/5] vmscan: Fix obsolete comment of balance_pgdat
  2012-08-22  7:15 [RFC 0/5] Consider higher small zone and mmaped-pages stream Minchan Kim
@ 2012-08-22  7:15 ` Minchan Kim
  2012-08-23 17:37   ` Rik van Riel
  2012-08-22  7:15 ` [PATCH 2/5] vmscan: sleep only if backingdev is congested Minchan Kim
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 8+ messages in thread
From: Minchan Kim @ 2012-08-22  7:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Rik van Riel, Johannes Weiner, linux-mm,
	linux-kernel, Minchan Kim, Nick Piggin

This patch corrects comments made obsolete by [1] and [2].

[1] 7ac6218, kswapd lockup fix
[2] 32a4330, mm: prevent kswapd from freeing excessive amounts of lowmem

Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Nick Piggin <npiggin@kernel.dk>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/vmscan.c |   15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8d01243..f015d92 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2472,16 +2472,17 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
  * This can happen if the pages are all mlocked, or if they are all used by
  * device drivers (say, ZONE_DMA).  Or if they are all in use by hugetlb.
  * What we do is to detect the case where all pages in the zone have been
- * scanned twice and there has been zero successful reclaim.  Mark the zone as
- * dead and from now on, only perform a short scan.  Basically we're polling
- * the zone for when the problem goes away.
+ * scanned more than 6 times the number of reclaimable pages and there has
+ * been zero successful reclaim.  Mark the zone as dead and from now on,
+ * only perform a short scan. Basically we're polling the zone for when
+ * the problem goes away.
  *
  * kswapd scans the zones in the highmem->normal->dma direction.  It skips
  * zones which have free_pages > high_wmark_pages(zone), but once a zone is
- * found to have free_pages <= high_wmark_pages(zone), we scan that zone and the
- * lower zones regardless of the number of free pages in the lower zones. This
- * interoperates with the page allocator fallback scheme to ensure that aging
- * of pages is balanced across the zones.
+ * found to have free_pages <= high_wmark_pages(zone), we scan that zone and
+ * lower zones which don't have too many pages free. This interoperates with
+ * the page allocator fallback scheme to ensure that aging of pages is balanced
+ * across the zones.
  */
 static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 							int *classzone_idx)
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH 2/5] vmscan: sleep only if backingdev is congested
  2012-08-22  7:15 [RFC 0/5] Consider higher small zone and mmaped-pages stream Minchan Kim
  2012-08-22  7:15 ` [PATCH 1/5] vmscan: Fix obsolete comment of balance_pgdat Minchan Kim
@ 2012-08-22  7:15 ` Minchan Kim
  2012-08-25 23:02   ` Rik van Riel
  2012-08-22  7:15 ` [PATCH 3/5] vmscan: prevent excessive pageout of kswapd Minchan Kim
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 8+ messages in thread
From: Minchan Kim @ 2012-08-22  7:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Rik van Riel, Johannes Weiner, linux-mm,
	linux-kernel, Minchan Kim

In a small high zone (e.g. a 40M movable zone), the reclaim priority
can be raised easily, so the congestion_wait in balance_pgdat can make
kswapd sleep unnecessarily and processes end up entering the direct
reclaim path, which means longer latency for those processes.

This patch replaces congestion_wait with wait_iff_congested so kswapd
sleeps only if the backing device really is congested.

==DRIVER                      mapped-file-stream            mapped-file-stream(0.00,    -nan%)
Name                          mapped-file-stream            mapped-file-stream(0.00,    -nan%)
Elapsed                       676                           663       (-13.00,  -1.92%)
nr_vmscan_write               91                            1341      (1250.00, 1373.63%)
nr_vmscan_immediate_reclaim   0                             0         (0.00,    0.00%)
pgpgin                        29932                         21668     (-8264.00,-27.61%)
pgpgout                       3652                          8392      (4740.00, 129.79%)
pswpin                        0                             22        (22.00,   0.00%)
pswpout                       91                            1341      (1250.00, 1373.63%)
pgactivate                    15686                         16217     (531.00,  3.39%)
pgdeactivate                  14171                         15431     (1260.00, 8.89%)
pgfault                       204523237                     204524355 (1118.00, 0.00%)
pgmajfault                    204472586                     204472528 (-58.00,  -0.00%)
pgsteal_kswapd_dma            149066                        466676    (317610.00,213.07%)
pgsteal_kswapd_normal         56219654                      49663877  (-6555777.00,-11.66%)
pgsteal_kswapd_high           92860817                      138182330 (45321513.00,48.81%)
pgsteal_kswapd_movable        1211389                       4236726   (3025337.00,249.74%)
pgsteal_direct_dma            35808                         9306      (-26502.00,-74.01%)
pgsteal_direct_normal         21270282                      123835    (-21146447.00,-99.42%)
pgsteal_direct_high           21051650                      274887    (-20776763.00,-98.69%)
pgsteal_direct_movable        250572                        38011     (-212561.00,-84.83%)
pgscan_kswapd_dma             325126                        947813    (622687.00,191.52%)
pgscan_kswapd_normal          111171753                     97902722  (-13269031.00,-11.94%)
pgscan_kswapd_high            178149789                     274337809 (96188020.00,53.99%)
pgscan_kswapd_movable         2537926                       8496474   (5958548.00,234.78%)
pgscan_direct_dma             56919                         22855     (-34064.00,-59.85%)
pgscan_direct_normal          45698152                      3604954   (-42093198.00,-92.11%)
pgscan_direct_high            51326549                      4504909   (-46821640.00,-91.22%)
pgscan_direct_movable         433830                        105418    (-328412.00,-75.70%)
pgscan_direct_throttle        0                             0         (0.00,    0.00%)
pginodesteal                  6721                          11111     (4390.00, 65.32%)
slabs_scanned                 57344                         56320     (-1024.00,-1.79%)
kswapd_inodesteal             36327                         31121     (-5206.00,-14.33%)
kswapd_low_wmark_hit_quickly  533                           4607      (4074.00, 764.35%)
kswapd_high_wmark_hit_quickly 39                            432       (393.00,  1007.69%)
kswapd_skip_congestion_wait   71505                         10254     (-61251.00,-85.66%)
pageoutrun                    2406110                       2879697   (473587.00,19.68%)
allocstall                    696424                        8222      (-688202.00,-98.82%)
pgrotated                     91                            1341      (1250.00, 1373.63%)
kswapd_totalscan              292184594                     381684818 (89500224.00,30.63%)
kswapd_totalsteal             150440926                     192549609 (42108683.00,27.99%)
Kswapd_efficiency             51.00                         50.00     (-1.00,   -1.96%)
direct_totalscan              97515450                      8238136   (-89277314.00,-91.55%)
direct_totalsteal             42608312                      446039    (-42162273.00,-98.95%)
direct_efficiency             43.00                         5.00      (-38.00,  -88.37%)
reclaim_velocity              576479.35                     588119.08 (11639.73,2.02%)

Elapsed time of the test program is reduced by 13 seconds.
As I expected, kswapd's scanning and reclaim increased by about 30%,
but kswapd's efficiency is still good. allocstall dropped by about 98%,
which I think is the most important factor in reducing the elapsed
time of the test program.

Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/vmscan.c |   12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index f015d92..d1ebe69 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2705,8 +2705,16 @@ loop_again:
 		if (total_scanned && (sc.priority < DEF_PRIORITY - 2)) {
 			if (has_under_min_watermark_zone)
 				count_vm_event(KSWAPD_SKIP_CONGESTION_WAIT);
-			else
-				congestion_wait(BLK_RW_ASYNC, HZ/10);
+			else {
+				for (i = 0; i <= end_zone; i++) {
+					struct zone *zone = pgdat->node_zones
+								+ i;
+					if (!populated_zone(zone))
+						continue;
+					wait_iff_congested(zone, BLK_RW_ASYNC,
+								HZ/10);
+				}
+			}
 		}
 
 		/*
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH 3/5] vmscan: prevent excessive pageout of kswapd
  2012-08-22  7:15 [RFC 0/5] Consider higher small zone and mmaped-pages stream Minchan Kim
  2012-08-22  7:15 ` [PATCH 1/5] vmscan: Fix obsolete comment of balance_pgdat Minchan Kim
  2012-08-22  7:15 ` [PATCH 2/5] vmscan: sleep only if backingdev is congested Minchan Kim
@ 2012-08-22  7:15 ` Minchan Kim
  2012-08-22  7:15 ` [PATCH 4/5] vmscan: get rid of unnecessary nr_dirty ret variable Minchan Kim
  2012-08-22  7:15 ` [PATCH 5/5] vmscan: accelerate to reclaim mapped-pages stream Minchan Kim
  4 siblings, 0 replies; 8+ messages in thread
From: Minchan Kim @ 2012-08-22  7:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Rik van Riel, Johannes Weiner, linux-mm,
	linux-kernel, Minchan Kim

If a higher zone is very small, the priority can be raised easily
while the lower zones still have plenty of free pages. When one of the
lower zones doesn't meet its high watermark, that zone tries to reclaim
pages with the high priority driven up by the small higher zone.
It ends up reclaiming excessive pages. I saw 8~16M paged out
in my KVM test although we needed just a few KBytes.

This patch temporarily damps the priority to the average of the
current and previous reclaim priorities; if we still can't reclaim
enough pages at that priority, the next pass uses the big jumped-up
priority directly.
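
For illustration, a minimal standalone sketch of the damping step; the
condition and the average mirror the hunk below, while the example
numbers are made up:

/* Standalone sketch of the per-zone priority damping (illustrative only). */
#define DEF_PRIORITY	12

int damp_priority(int prev_priority, int priority)
{
	/*
	 * Only damp when the priority jumped by more than one step since
	 * this zone was last balanced; otherwise keep it as it is.
	 */
	if ((prev_priority - priority) > 1)
		return (prev_priority + priority) >> 1;
	return priority;
}

/*
 * Example: a zone last balanced at DEF_PRIORITY (12) while a tiny higher
 * zone drags sc.priority down to 4 is shrunk at (12 + 4) >> 1 = 8, so it
 * scans about 1/256 of its LRU on this pass instead of 1/16, since the
 * scan target is the LRU size shifted right by the priority.
 */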

==DRIVER                      mapped-file-stream            mapped-file-stream(0.00,    -nan%)
Name                          mapped-file-stream            mapped-file-stream(0.00,    -nan%)
Elapsed                       663                           665       (2.00,    0.30%)
nr_vmscan_write               1341                          849       (-492.00, -36.69%)
nr_vmscan_immediate_reclaim   0                             8         (8.00,    0.00%)
pgpgin                        21668                         30280     (8612.00, 39.75%)
pgpgout                       8392                          6396      (-1996.00,-23.78%)
pswpin                        22                            8         (-14.00,  -63.64%)
pswpout                       1341                          849       (-492.00, -36.69%)
pgactivate                    16217                         15959     (-258.00, -1.59%)
pgdeactivate                  15431                         15303     (-128.00, -0.83%)
pgfault                       204524355                     204524410 (55.00,   0.00%)
pgmajfault                    204472528                     204472602 (74.00,   0.00%)
pgsteal_kswapd_dma            466676                        475265    (8589.00, 1.84%)
pgsteal_kswapd_normal         49663877                      51289479  (1625602.00,3.27%)
pgsteal_kswapd_high           138182330                     135817904 (-2364426.00,-1.71%)
pgsteal_kswapd_movable        4236726                       4380123   (143397.00,3.38%)
pgsteal_direct_dma            9306                          11910     (2604.00, 27.98%)
pgsteal_direct_normal         123835                        165012    (41177.00,33.25%)
pgsteal_direct_high           274887                        309271    (34384.00,12.51%)
pgsteal_direct_movable        38011                         45638     (7627.00, 20.07%)
pgscan_kswapd_dma             947813                        972089    (24276.00,2.56%)
pgscan_kswapd_normal          97902722                      100850050 (2947328.00,3.01%)
pgscan_kswapd_high            274337809                     269039236 (-5298573.00,-1.93%)
pgscan_kswapd_movable         8496474                       8774392   (277918.00,3.27%)
pgscan_direct_dma             22855                         26410     (3555.00, 15.55%)
pgscan_direct_normal          3604954                       4186439   (581485.00,16.13%)
pgscan_direct_high            4504909                       5132110   (627201.00,13.92%)
pgscan_direct_movable         105418                        122790    (17372.00,16.48%)
pgscan_direct_throttle        0                             0         (0.00,    0.00%)
pginodesteal                  11111                         6836      (-4275.00,-38.48%)
slabs_scanned                 56320                         56320     (0.00,    0.00%)
kswapd_inodesteal             31121                         35904     (4783.00, 15.37%)
kswapd_low_wmark_hit_quickly  4607                          5193      (586.00,  12.72%)
kswapd_high_wmark_hit_quickly 432                           421       (-11.00,  -2.55%)
kswapd_skip_congestion_wait   10254                         12375     (2121.00, 20.68%)
pageoutrun                    2879697                       3071912   (192215.00,6.67%)
allocstall                    8222                          9727      (1505.00, 18.30%)
pgrotated                     1341                          850       (-491.00, -36.61%)
kswapd_totalscan              381684818                     379635767 (-2049051.00,-0.54%)
kswapd_totalsteal             192549609                     191962771 (-586838.00,-0.30%)
Kswapd_efficiency             50.00                         50.00     (0.00,    0.00%)
direct_totalscan              8238136                       9467749   (1229613.00,14.93%)
direct_totalsteal             446039                        531831    (85792.00,19.23%)
direct_efficiency             5.00                          5.00      (0.00,    0.00%)
reclaim_velocity              588119.08                     585118.06 (-3001.02,-0.51%)

Elapsed time of the test program is slightly higher than with the
previous patch [2/5], but the number of reclaimed pages is much lower.

before-patch: 192995648  after-patch: 192494602 diff: 501046 (about 2G)

Since kswapd now reclaims fewer pages per turn than the old behavior,
kswapd's pageoutrun goes up and allocstall also goes up by about 18%.
Yes, that's not good for this workload, but the old behavior worked
well only by *luck*: it reclaimed far more pages than necessary, so we
could avoid the frequent reclaim path. The downside of that is that it
can evict part of the working set, and I believe this patch prevents
that problem without a big downside of its own.

Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/vmscan.c |   24 +++++++++++++++++++++++-
 1 file changed, 23 insertions(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index d1ebe69..0e2550c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2492,6 +2492,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 	int i;
 	int end_zone = 0;	/* Inclusive.  0 = ZONE_DMA */
 	unsigned long total_scanned;
+	int prev_priority[MAX_NR_ZONES];
 	struct reclaim_state *reclaim_state = current->reclaim_state;
 	unsigned long nr_soft_reclaimed;
 	unsigned long nr_soft_scanned;
@@ -2513,6 +2514,8 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 loop_again:
 	total_scanned = 0;
 	sc.priority = DEF_PRIORITY;
+	for (i = 0; i < MAX_NR_ZONES; i++)
+		prev_priority[i] = DEF_PRIORITY;
 	sc.nr_reclaimed = 0;
 	sc.may_writepage = !laptop_mode;
 	count_vm_event(PAGEOUTRUN);
@@ -2635,6 +2638,21 @@ loop_again:
 				    !zone_watermark_ok_safe(zone, testorder,
 					high_wmark_pages(zone) + balance_gap,
 					end_zone, 0)) {
+				/*
+				 * If a higher zone is very small, the
+				 * priority can be raised easily while lower
+				 * zones still have plenty of free pages.
+				 * When a lower zone doesn't meet its high
+				 * watermark, it would reclaim with the high
+				 * priority driven up by that small higher
+				 * zone and end up reclaiming excessive pages.
+				 * Let's damp the priority temporarily.
+				 */
+				int tmp_priority = sc.priority;
+				if ((prev_priority[i] - sc.priority) > 1)
+					sc.priority = (prev_priority[i] +
+							sc.priority) >> 1;
+
 				shrink_zone(zone, &sc);
 
 				reclaim_state->reclaimed_slab = 0;
@@ -2644,7 +2662,11 @@ loop_again:
 
 				if (nr_slab == 0 && !zone_reclaimable(zone))
 					zone->all_unreclaimable = 1;
-			}
+
+				prev_priority[i] = tmp_priority;
+				sc.priority = tmp_priority;
+			} else
+				prev_priority[i] = DEF_PRIORITY;
 
 			/*
 			 * If we've done a decent amount of scanning and
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH 4/5] vmscan: get rid of unnecessary nr_dirty ret variable
  2012-08-22  7:15 [RFC 0/5] Consider higher small zone and mmaped-pages stream Minchan Kim
                   ` (2 preceding siblings ...)
  2012-08-22  7:15 ` [PATCH 3/5] vmscan: prevent excessive pageout of kswapd Minchan Kim
@ 2012-08-22  7:15 ` Minchan Kim
  2012-08-22  7:15 ` [PATCH 5/5] vmscan: accelerate to reclaim mapped-pages stream Minchan Kim
  4 siblings, 0 replies; 8+ messages in thread
From: Minchan Kim @ 2012-08-22  7:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Rik van Riel, Johannes Weiner, linux-mm,
	linux-kernel, Minchan Kim

Nobody uses nr_dirty any more, so remove it.

Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/vmscan.c |    6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0e2550c..1a66680 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -674,7 +674,6 @@ static enum page_references page_check_references(struct page *page,
 static unsigned long shrink_page_list(struct list_head *page_list,
 				      struct zone *zone,
 				      struct scan_control *sc,
-				      unsigned long *ret_nr_dirty,
 				      unsigned long *ret_nr_writeback)
 {
 	LIST_HEAD(ret_pages);
@@ -955,7 +954,6 @@ keep:
 	list_splice(&ret_pages, page_list);
 	count_vm_events(PGACTIVATE, pgactivate);
 	mem_cgroup_uncharge_end();
-	*ret_nr_dirty += nr_dirty;
 	*ret_nr_writeback += nr_writeback;
 	return nr_reclaimed;
 }
@@ -1236,7 +1234,6 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	unsigned long nr_scanned;
 	unsigned long nr_reclaimed = 0;
 	unsigned long nr_taken;
-	unsigned long nr_dirty = 0;
 	unsigned long nr_writeback = 0;
 	isolate_mode_t isolate_mode = 0;
 	int file = is_file_lru(lru);
@@ -1278,8 +1275,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	if (nr_taken == 0)
 		return 0;
 
-	nr_reclaimed = shrink_page_list(&page_list, zone, sc,
-						&nr_dirty, &nr_writeback);
+	nr_reclaimed = shrink_page_list(&page_list, zone, sc, &nr_writeback);
 
 	spin_lock_irq(&zone->lru_lock);
 
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH 5/5] vmscan: accelerate to reclaim mapped-pages stream
  2012-08-22  7:15 [RFC 0/5] Consider higher small zone and mmaped-pages stream Minchan Kim
                   ` (3 preceding siblings ...)
  2012-08-22  7:15 ` [PATCH 4/5] vmscan: get rid of unnecessary nr_dirty ret variable Minchan Kim
@ 2012-08-22  7:15 ` Minchan Kim
  4 siblings, 0 replies; 8+ messages in thread
From: Minchan Kim @ 2012-08-22  7:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Rik van Riel, Johannes Weiner, linux-mm,
	linux-kernel, Minchan Kim

Normally, mmapped data pages get one more round in the LRU than other
pages because they are born with a referenced pte, so we can keep the
mapped working set in memory.

But that becomes a problem when there is a huge stream of mmapped pages.
The VM burns CPU rotating them in the LRU, kswapd's efficiency drops,
and processes start to enter the direct reclaim path. That's not
desirable.

This patch tries to detect a mmapped-pages stream.
If the VM sees more than 80% mmapped pages in a reclaim chunk (32 pages),
it treats that as a symptom of a mmapped-pages stream and monitors
consecutive reclaim chunks. If the VM finds 1M of mmapped pages over
consecutive reclaim chunks, it concludes this is a mmapped-pages stream
and starts to reclaim those pages without rotation.
If the VM sees less than 80% mmapped pages in a reclaim chunk during
the consecutive reclaim, it backs off instantly.
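
For illustration, a rough standalone model of the detector described
above; the names mirror the patch below, the 4KB page size used in the
1M arithmetic is an assumption, and per-zone locking is ignored:

/* Illustrative standalone model of the mapped-pages stream detector. */
#define SWAP_CLUSTER_MAX	32	/* pages isolated per reclaim chunk */
#define MP_DETECT_MAX_SHIFT	8	/* chunks before we call it a stream */

struct mp_detector {
	int stream_detect_shift;
	int force_reclaim;	/* reclaim mapped pages without rotation */
};

void mp_update(struct mp_detector *mp,
	       unsigned long nr_referenced, unsigned long nr_taken)
{
	/* >= 80% of the chunk was mapped and referenced: stream symptom. */
	if (nr_referenced >= nr_taken * 4 / 5) {
		if (mp->stream_detect_shift < MP_DETECT_MAX_SHIFT)
			mp->stream_detect_shift++;
		if (mp->stream_detect_shift == MP_DETECT_MAX_SHIFT)
			mp->force_reclaim = 1;
	} else {
		/* Any non-streaming chunk backs the detector off instantly. */
		mp->stream_detect_shift = 0;
		mp->force_reclaim = 0;
	}
}

/*
 * MP_DETECT_MAX_SHIFT consecutive chunks * SWAP_CLUSTER_MAX pages * 4KB
 * per page = 8 * 32 * 4KB = 1MB of mapped pages, the threshold mentioned
 * above.
 */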

==DRIVER                      mapped-file-stream            mapped-file-stream(0.00,    -nan%)
Name                          mapped-file-stream            mapped-file-stream(0.00,    -nan%)
Elapsed                       665                           615       (-50.00,  -7.52%)
nr_vmscan_write               849                           62        (-787.00, -92.70%)
nr_vmscan_immediate_reclaim   8                             5         (-3.00,   -37.50%)
pgpgin                        30280                         27096     (-3184.00,-10.52%)
pgpgout                       6396                          2680      (-3716.00,-58.10%)
pswpin                        8                             0         (-8.00,   -100.00%)
pswpout                       849                           18        (-831.00, -97.88%)
pgactivate                    15959                         15585     (-374.00, -2.34%)
pgdeactivate                  15303                         13896     (-1407.00,-9.19%)
pgfault                       204524410                     204524092 (-318.00, -0.00%)
pgmajfault                    204472602                     204472572 (-30.00,  -0.00%)
pgsteal_kswapd_dma            475265                        892600    (417335.00,87.81%)
pgsteal_kswapd_normal         51289479                      44560409  (-6729070.00,-13.12%)
pgsteal_kswapd_high           135817904                     142316673 (6498769.00,4.78%)
pgsteal_kswapd_movable        4380123                       4793399   (413276.00,9.44%)
pgsteal_direct_dma            11910                         0         (-11910.00,-100.00%)
pgsteal_direct_normal         165012                        1322      (-163690.00,-99.20%)
pgsteal_direct_high           309271                        40        (-309231.00,-99.99%)
pgsteal_direct_movable        45638                         0         (-45638.00,-100.00%)
pgscan_kswapd_dma             972089                        893162    (-78927.00,-8.12%)
pgscan_kswapd_normal          100850050                     44609130  (-56240920.00,-55.77%)
pgscan_kswapd_high            269039236                     142394025 (-126645211.00,-47.07%)
pgscan_kswapd_movable         8774392                       4798082   (-3976310.00,-45.32%)
pgscan_direct_dma             26410                         0         (-26410.00,-100.00%)
pgscan_direct_normal          4186439                       1322      (-4185117.00,-99.97%)
pgscan_direct_high            5132110                       1161      (-5130949.00,-99.98%)
pgscan_direct_movable         122790                        0         (-122790.00,-100.00%)
pgscan_direct_throttle        0                             0         (0.00,    0.00%)
pginodesteal                  6836                          0         (-6836.00,-100.00%)
slabs_scanned                 56320                         52224     (-4096.00,-7.27%)
kswapd_inodesteal             35904                         41679     (5775.00, 16.08%)
kswapd_low_wmark_hit_quickly  5193                          7587      (2394.00, 46.10%)
kswapd_high_wmark_hit_quickly 421                           463       (42.00,   9.98%)
kswapd_skip_congestion_wait   12375                         23        (-12352.00,-99.81%)
pageoutrun                    3071912                       3202200   (130288.00,4.24%)
allocstall                    9727                          32        (-9695.00,-99.67%)
pgrotated                     850                           18        (-832.00, -97.88%)
kswapd_totalscan              379635767                     192694399 (-186941368.00,-49.24%)
kswapd_totalsteal             191962771                     192563081 (600310.00,0.31%)
Kswapd_efficiency             50.00                         99.00     (49.00,   98.00%)
direct_totalscan              9467749                       2483      (-9465266.00,-99.97%)
direct_totalsteal             531831                        1362      (-530469.00,-99.74%)
direct_efficiency             5.00                          54.00     (49.00,   980.00%)
reclaim_velocity              585118.06                     313328.26 (-271789.80,-46.45%)

Elapsed time of the test program is reduced by 50 seconds. Of course,
the number of scanned pages drops hugely, so the efficiency of both
kswapd and direct reclaim is greatly enhanced.
I think this patch can help a lot with a mmapped-file stream while
not hurting other workloads, thanks to the instant backoff.

Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 include/linux/mmzone.h |   23 +++++++++++++++++++++++
 mm/vmscan.c            |   24 ++++++++++++++++++++++--
 2 files changed, 45 insertions(+), 2 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 2daa54f..190376e 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -325,6 +325,28 @@ enum zone_type {
 #error ZONES_SHIFT -- too many zones configured adjust calculation
 #endif
 
+/*
+ * The VM tries to detect a mp (mapped-pages) stream so it can be reclaimed
+ * without rotation. That reduces CPU burning and enhances kswapd
+ * efficiency.
+ */
+struct mp_detector {
+	bool force_reclaim;
+	int stream_detect_shift;
+};
+
+/*
+ * If we detect SWAP_CLUSTER_MAX * MP_DETECT_MAX_SHIFT (i.e. 1M) of
+ * mapped pages during consecutive reclaim passes, we consider it a
+ * mapped-pages stream.
+ */
+#define MP_DETECT_MAX_SHIFT	8	/* 1 is SWAP_CLUSTER_MAX pages */
+/*
+ * If more than 80% of the pages in a reclaim chunk are mapped, we
+ * consider it a symptom of a mapped-pages stream.
+ */
+#define MP_STREAM_RATIO(pages)	((pages) * 4 / 5)
+
 struct zone {
 	/* Fields commonly accessed by the page allocator */
 
@@ -422,6 +444,7 @@ struct zone {
 	 */
 	unsigned int inactive_ratio;
 
+	struct mp_detector mp;
 
 	ZONE_PADDING(_pad2_)
 	/* Rarely used or read-mostly fields */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 1a66680..e215e98 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -674,12 +674,14 @@ static enum page_references page_check_references(struct page *page,
 static unsigned long shrink_page_list(struct list_head *page_list,
 				      struct zone *zone,
 				      struct scan_control *sc,
+				      unsigned long *ret_nr_referenced_ptes,
 				      unsigned long *ret_nr_writeback)
 {
 	LIST_HEAD(ret_pages);
 	LIST_HEAD(free_pages);
 	int pgactivate = 0;
 	unsigned long nr_dirty = 0;
+	unsigned long nr_referenced_ptes = 0;
 	unsigned long nr_congested = 0;
 	unsigned long nr_reclaimed = 0;
 	unsigned long nr_writeback = 0;
@@ -762,12 +764,15 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		case PAGEREF_ACTIVATE:
 			goto activate_locked;
 		case PAGEREF_KEEP:
+			nr_referenced_ptes++;
+			if (zone->mp.force_reclaim)
+				goto free_mapped_page;
 			goto keep_locked;
 		case PAGEREF_RECLAIM:
 		case PAGEREF_RECLAIM_CLEAN:
 			; /* try to reclaim the page below */
 		}
-
+free_mapped_page:
 		/*
 		 * Anonymous process memory has backing store?
 		 * Try to allocate it some swap space here.
@@ -954,6 +959,7 @@ keep:
 	list_splice(&ret_pages, page_list);
 	count_vm_events(PGACTIVATE, pgactivate);
 	mem_cgroup_uncharge_end();
+	*ret_nr_referenced_ptes = nr_referenced_ptes;
 	*ret_nr_writeback += nr_writeback;
 	return nr_reclaimed;
 }
@@ -1234,6 +1240,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	unsigned long nr_scanned;
 	unsigned long nr_reclaimed = 0;
 	unsigned long nr_taken;
+	unsigned long nr_referenced_ptes = 0;
 	unsigned long nr_writeback = 0;
 	isolate_mode_t isolate_mode = 0;
 	int file = is_file_lru(lru);
@@ -1275,7 +1282,8 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	if (nr_taken == 0)
 		return 0;
 
-	nr_reclaimed = shrink_page_list(&page_list, zone, sc, &nr_writeback);
+	nr_reclaimed = shrink_page_list(&page_list, zone, sc,
+				&nr_referenced_ptes, &nr_writeback);
 
 	spin_lock_irq(&zone->lru_lock);
 
@@ -1325,6 +1333,18 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 			(nr_taken >> (DEF_PRIORITY - sc->priority)))
 		wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
 
+
+	if (nr_referenced_ptes >= MP_STREAM_RATIO(nr_taken)) {
+		int shift = zone->mp.stream_detect_shift;
+		shift = min(++shift, MP_DETECT_MAX_SHIFT);
+		if (shift == MP_DETECT_MAX_SHIFT)
+			zone->mp.force_reclaim = true;
+		zone->mp.stream_detect_shift = shift;
+	} else {
+		zone->mp.stream_detect_shift = 0;
+		zone->mp.force_reclaim = false;
+	}
+
 	trace_mm_vmscan_lru_shrink_inactive(zone->zone_pgdat->node_id,
 		zone_idx(zone),
 		nr_scanned, nr_reclaimed,
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH 1/5] vmscan: Fix obsolete comment of balance_pgdat
  2012-08-22  7:15 ` [PATCH 1/5] vmscan: Fix obsolete comment of balance_pgdat Minchan Kim
@ 2012-08-23 17:37   ` Rik van Riel
  0 siblings, 0 replies; 8+ messages in thread
From: Rik van Riel @ 2012-08-23 17:37 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, Mel Gorman, Johannes Weiner, linux-mm,
	linux-kernel, Nick Piggin

On 08/22/2012 03:15 AM, Minchan Kim wrote:
> This patch corrects comments made obsolete by [1] and [2].
>
> [1] 7ac6218, kswapd lockup fix
> [2] 32a4330, mm: prevent kswapd from freeing excessive amounts of lowmem
>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Nick Piggin <npiggin@kernel.dk>
> Signed-off-by: Minchan Kim <minchan@kernel.org>

Acked-by: Rik van Riel <riel@redhat.com>

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 2/5] vmscan: sleep only if backingdev is congested
  2012-08-22  7:15 ` [PATCH 2/5] vmscan: sleep only if backingdev is congested Minchan Kim
@ 2012-08-25 23:02   ` Rik van Riel
  0 siblings, 0 replies; 8+ messages in thread
From: Rik van Riel @ 2012-08-25 23:02 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, Mel Gorman, Johannes Weiner, linux-mm, linux-kernel

On 08/22/2012 03:15 AM, Minchan Kim wrote:

> +++ b/mm/vmscan.c
> @@ -2705,8 +2705,16 @@ loop_again:
>   		if (total_scanned && (sc.priority < DEF_PRIORITY - 2)) {
>   			if (has_under_min_watermark_zone)
>   				count_vm_event(KSWAPD_SKIP_CONGESTION_WAIT);
> -			else
> -				congestion_wait(BLK_RW_ASYNC, HZ/10);
> +			else {
> +				for (i = 0; i <= end_zone; i++) {
> +					struct zone *zone = pgdat->node_zones
> +								+ i;
> +					if (!populated_zone(zone))
> +						continue;
> +					wait_iff_congested(zone, BLK_RW_ASYNC,
> +								HZ/10);
> +				}
> +			}
>   		}

Do we really want to wait on every zone?

That could increase the sleep time by a factor of 3.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2012-08-25 23:03 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-08-22  7:15 [RFC 0/5] Consider higher small zone and mmaped-pages stream Minchan Kim
2012-08-22  7:15 ` [PATCH 1/5] vmscan: Fix obsolete comment of balance_pgdat Minchan Kim
2012-08-23 17:37   ` Rik van Riel
2012-08-22  7:15 ` [PATCH 2/5] vmscan: sleep only if backingdev is congested Minchan Kim
2012-08-25 23:02   ` Rik van Riel
2012-08-22  7:15 ` [PATCH 3/5] vmscan: prevent excessive pageout of kswapd Minchan Kim
2012-08-22  7:15 ` [PATCH 4/5] vmscan: get rid of unnecessary nr_dirty ret variable Minchan Kim
2012-08-22  7:15 ` [PATCH 5/5] vmscan: accelerate to reclaim mapped-pages stream Minchan Kim
